Dataset of Slovene idiomatic expressions SloIE


SloIE is a manually labelled dataset of Slovene idiomatic expressions. It contains 29,400 sentences covering 75 different expressions, each of which can occur with either a literal or an idiomatic meaning, and every token is manually annotated. The expressions were selected from the Slovene Lexical Database (http://hdl.handle.net/11356/1030); only expressions that can occur with both a literal and an idiomatic meaning were included. The sentences were extracted from the Gigafida corpus.

For each sentence, the file first contains the text of the sentence prefixed by #. This is followed by a line of numbers giving the positions of the tokens that belong to the expression. The numbers are ordered according to the dictionary form of the expression (e.g., the first number gives the position at which the first word of the dictionary form occurs), so they also record the word order for expressions where the word order is flexible. Each token is labelled with 'DA' ("yes"), indicating a token in an expression that has an idiomatic meaning, 'NE' ("no"), indicating a token in an expression that has a literal meaning, or '*', indicating a token outside the expression. Additionally, 'NEJASEN ZGLED' ("unclear example") marks tokens for which the annotators could not determine the meaning from the example sentence. Each token is also tagged with the dictionary form of the expression that is present in the sentence.
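The layout described above can be read with a short script. The following Python sketch is only an illustration under stated assumptions, not the authors' tooling: it assumes that every record starts with a line beginning with #, that the next non-empty line holds the position numbers, and that all further lines up to the next # line are per-token annotation lines. The exact column layout of the token lines is not specified here, so they are kept as raw strings. The names Record, read_records and SAMPLE are hypothetical, and the sample text is invented placeholder data, not taken from the dataset. A token line that itself begins with # would be misread as a new sentence.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class Record:
    sentence: str                      # sentence text without the leading '#'
    positions: List[int]               # token positions, in dictionary-form order
    token_lines: List[str] = field(default_factory=list)  # raw annotation lines


def read_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group lines into records (assumed layout: '#' line, numbers, token lines)."""
    current: Optional[Record] = None
    expect_positions = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines carry no information
        if line.startswith("#"):
            if current is not None:
                yield current
            current = Record(sentence=line[1:].strip(), positions=[])
            expect_positions = True
        elif current is None:
            continue  # ignore anything before the first sentence line
        elif expect_positions:
            # Accept any separator between the numbers (spaces, tabs, commas).
            current.positions = [int(n) for n in re.findall(r"\d+", line)]
            expect_positions = False
        else:
            current.token_lines.append(line)
    if current is not None:
        yield current


# Invented placeholder input; the tab-separated token columns are an assumption.
SAMPLE = (
    "# tok1 tok2 tok3\n"
    "2 3\n"
    "tok1\t*\t*\n"
    "tok2\tDA\texpr\n"
    "tok3\tDA\texpr\n"
    "\n"
    "# tok4 tok5\n"
    "2 1\n"
    "tok4\tNE\texpr\n"
    "tok5\tNE\texpr\n"
)

if __name__ == "__main__":
    for record in read_records(SAMPLE.splitlines()):
        print(record.sentence, record.positions, len(record.token_lines))
```

In the second placeholder record the positions are listed as 2 1, which shows how the ordering by dictionary form can differ from the surface word order. For the real file, the same function could be applied to an open file handle (opened with encoding="utf-8"), after checking the actual token-line layout and whether positions are counted from 0 or 1.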

Key reference: Škvorc, Tadej, Polona Gantar, and Marko Robnik-Šikonja. "MICE: Mining Idioms with Contextual Embeddings." arXiv preprint arXiv:2008.05759 (2020).

Identifier
PID http://hdl.handle.net/11356/1335
Related Identifier https://arxiv.org/abs/2008.05759
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1335
Provenance
Creator Škvorc, Tadej; Gantar, Polona; Robnik-Šikonja, Marko
Publisher Faculty of Computer and Information Science, University of Ljubljana
Publication Year 2020
Funding Reference info:eu-repo/grantAgreement/EC/H2020/825153
Rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); https://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics