Word embeddings CLARIN.SI-embed.sl 2.0

Dataset

CLARIN.SI-embed.sl contains word embeddings induced from a large collection of Slovene texts composed of existing Slovene corpora, including GigaFida, Janes, KAS, slWaC, and MaCoCu-sl. The embeddings were trained with the fastText skip-gram model on 5,791,405,942 tokens of running text and cover 3,471,054 lowercased surface forms.

This version differs from the previous one in that the original training data was expanded with the MaCoCu-sl web crawl corpus (http://hdl.handle.net/11356/1517).
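
The following is a minimal sketch of how embeddings of this kind might be read and queried. This record does not describe the file layout, so the sketch assumes the usual fastText text format: a header line with the vocabulary size and the vector dimension, followed by one line per word with space-separated values. The function names (load_vec, lookup, cosine) and the tiny in-memory sample are hypothetical and serve only as illustration. Because the surface forms are lowercased, the lookup lowercases the query first.

```python
import io
import math


def load_vec(lines, limit=None):
    """Parse vectors in the assumed fastText text format.

    The first line is 'vocabulary_size dimension'; each following line is
    'word v1 v2 ... vN'. Returns (dimension, {word: [floats]}).
    """
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])
    vectors = {}
    for line in it:
        if limit is not None and len(vectors) >= limit:
            break  # stop early to keep memory use small
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip rows that do not match the declared dimension
        vectors[word] = [float(v) for v in values]
    return dim, vectors


def lookup(vectors, word):
    """Return the vector for a word, lowercasing it to match the vocabulary."""
    return vectors.get(word.lower())


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Tiny made-up sample standing in for the real file (values are not real data).
sample = io.StringIO("3 2\nmesto 1.0 0.0\nvas 0.8 0.6\nin 0.0 1.0\n")
dim, vectors = load_vec(sample)
similarity = cosine(lookup(vectors, "Mesto"), lookup(vectors, "vas"))
```

For the downloaded file, a file handle opened with encoding="utf-8" could be passed to load_vec in place of the sample. With roughly 3.5 million surface forms, setting limit or using a dedicated library may be preferable to holding every vector in a Python dictionary.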

Identifier
PID	http://hdl.handle.net/11356/1791
Related Identifier	http://hdl.handle.net/11356/1204
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1791

Provenance
Creator	Terčon, Luka; Ljubešić, Nikola; Erjavec, Tomaž
Publisher	Jožef Stefan Institute
Publication Year	2023
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	lexicalConceptualResource
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline	Linguistics