Multilingual Culture-Independent Word Analogy Datasets

The word analogy task evaluates word embeddings using analogous word pairs (e.g. the relation "Paris - France" should be equivalent to "Rome - Italy", and "son - daughter" should be equivalent to "brother - sister"). The dataset was inspired by Mikolov's English analogy test set (http://download.tensorflow.org/data/questions-words.txt). It was first written for Slovenian and then partly translated and partly created from scratch for the other languages (Croatian, Finnish, Estonian, Swedish, Latvian, Lithuanian, Russian and English).
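To make the task concrete, the following minimal sketch answers one analogy question with the vector-offset method (often called 3CosAdd) known from Mikolov's work. This record does not state which scoring method was used with the dataset, so treat the method as an assumption. The four-dimensional vectors and the names `TOY_EMBEDDINGS`, `cosine` and `solve_analogy` are invented for illustration and do not come from the dataset.

```python
import math

# Toy vectors invented for illustration only (not real embeddings).
# Dimensions: [France, Italy, Germany, is_capital]
TOY_EMBEDDINGS = {
    "France":  [1.0, 0.0, 0.0, 0.0],
    "Paris":   [1.0, 0.0, 0.0, 1.0],
    "Italy":   [0.0, 1.0, 0.0, 0.0],
    "Rome":    [0.0, 1.0, 0.0, 1.0],
    "Germany": [0.0, 0.0, 1.0, 0.0],
    "Berlin":  [0.0, 0.0, 1.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, embeddings):
    """Return the word d for which "a - b" is most like "c - d".

    The target vector is b - a + c; the three question words are
    excluded from the candidates, as is usual for this method.
    """
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# "Paris - France" should be equivalent to "Rome - ?"
print(solve_analogy("Paris", "France", "Rome", TOY_EMBEDDINGS))  # prints Italy
```

With real embeddings, an entry would typically count as correct when the returned word equals the fourth word of the entry.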

The analogy dataset is composed of fifteen categories: five semantic and ten syntactic. Each dataset has about 19,000 entries.
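As a rough illustration of how entries grouped into categories might be read, the sketch below parses text laid out like Mikolov's questions-words.txt, where a line starting with ":" names a category and every other line holds four whitespace-separated words. That these datasets use the same layout is an assumption, not something this record confirms. The function name `parse_analogy_text` and the sample text are hypothetical.

```python
from collections import defaultdict


def parse_analogy_text(text):
    """Group four-word analogy entries by category.

    Assumes the questions-words.txt layout: ": category" header lines
    followed by lines of exactly four whitespace-separated words.
    """
    categories = defaultdict(list)
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            continue
        words = line.split()
        # Skip entries that appear before any header or are malformed.
        if current is None or len(words) != 4:
            continue
        categories[current].append(tuple(words))
    return dict(categories)


# Hypothetical sample in the assumed layout.
SAMPLE = """\
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Berlin Germany
: family
boy girl brother sister
"""

parsed = parse_analogy_text(SAMPLE)
for name, entries in parsed.items():
    print(name, len(entries))
```

Counting the entries per category in this way is a quick check that a downloaded file matches the fifteen categories described above.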

In addition to nine monolingual datasets (one for each language), we also composed 72 cross-lingual datasets (one for each ordered language pair), where one half of the entry (one analogy pair, e.g. "mother-father") is in one language and the other half of the entry (e.g. "sister-brother") is in another language.
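The count of 72 follows from taking every ordered pair of the nine languages (9 × 8 = 72). The sketch below shows that arithmetic and one possible way of composing a cross-lingual entry. It assumes each entry is a tuple of four words and that the two monolingual entries express the same analogy; neither assumption is confirmed by this record. The helper `make_cross_lingual_entry` is hypothetical.

```python
from itertools import permutations

LANGUAGES = [
    "Slovenian", "Croatian", "Finnish", "Estonian", "Swedish",
    "Latvian", "Lithuanian", "Russian", "English",
]

# Ordered pairs: (Slovenian, English) and (English, Slovenian) both count.
LANGUAGE_PAIRS = list(permutations(LANGUAGES, 2))


def make_cross_lingual_entry(entry_a, entry_b):
    """Take the first word pair from entry_a and the second from entry_b.

    Assumes both entries are 4-tuples expressing the same analogy
    in two different languages.
    """
    return entry_a[:2] + entry_b[2:]


sl_entry = ("mati", "oče", "sestra", "brat")            # Slovenian
en_entry = ("mother", "father", "sister", "brother")    # English

print(len(LANGUAGE_PAIRS))                           # prints 72
print(make_cross_lingual_entry(sl_entry, en_entry))  # ('mati', 'oče', 'sister', 'brother')
```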

Identifier
PID http://hdl.handle.net/11356/1261
Related Identifier https://arxiv.org/abs/1911.10038
Related Identifier http://embeddia.eu
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1261
Provenance
Creator Ulčar, Matej; Vaik, Kristiina; Lindström, Jessica; Linde, Dace; Dailidėnaitė, Milda; Šumakov, Andrei
Publisher Faculty of Computer and Information Science, University of Ljubljana
Publication Year 2019
Funding Reference info:eu-repo/grantAgreement/EC/H2020/825153
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene; Croatian; English; Finnish; Estonian; Latvian; Lithuanian; Swedish; Russian
Resource Type lexicalConceptualResource
Format application/zip; text/plain; text/plain; charset=utf-8; downloadable_files_count: 3
Discipline Linguistics