Word embeddings CLARIN.SI-embed.mk 2.0

CLARIN.SI-embed.mk contains word embeddings induced from a large collection of Macedonian texts crawled from the .mk top-level domain. The embeddings were trained with the fastText skip-gram model on 933,231,582 tokens of running text and cover 986,670 lowercased surface forms.

This version differs from the previous version of the embeddings in that it was trained on the original dataset expanded with the MaCoCu-mk web crawl corpus (http://hdl.handle.net/11356/1512).
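
A minimal sketch of how such embeddings could be read and compared. It assumes the vectors are distributed in the plain-text layout that fastText commonly writes (a header line with the vocabulary size and dimension, then one surface form and its values per line); this record does not state the layout, so check the downloaded files first. The function names `load_vec` and `cosine` and the sample words and values are illustrative only, not part of the resource.

```python
import io
import math


def load_vec(lines, limit=None):
    """Parse embeddings from an iterable of text lines.

    Assumed layout: first line "<vocab_size> <dim>", then
    "<word> <v1> ... <v_dim>" on each following line.
    """
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])
    vectors = {}
    for line in it:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip rows that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Tiny in-memory stand-in for the real file; the values are made up.
sample = io.StringIO(
    "3 2\n"
    "македонија 1.0 0.0\n"
    "скопје 0.8 0.6\n"
    "и 0.0 1.0\n"
)
dim, vecs = load_vec(sample)
similarity = cosine(vecs["македонија"], vecs["скопје"])
```

For the actual resource, the same function could be given a file object opened with `encoding="utf-8"`. With close to a million surface forms, passing a `limit` keeps memory use manageable for a quick inspection.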

Identifier
PID http://hdl.handle.net/11356/1788
Related Identifier http://hdl.handle.net/11356/1359
Related Identifier https://www.clarin.si/info/k-centre/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1788
Provenance
Creator Terčon, Luka; Ljubešić, Nikola
Publisher Jožef Stefan Institute
Publication Year 2023
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Macedonian
Resource Type toolService
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline Linguistics