CorefUD conversion of Slovene coreference resolution corpus coref149

This corpus is the CorefUD conversion of coref149, a corpus for coreference resolution in Slovene (http://hdl.handle.net/11356/1182). It contains 149 documents annotated with coreference information. Coreference in Universal Dependencies (CorefUD) is an initiative to collect coreference corpora in various languages and harmonize them to a common annotation scheme and data format (CoNLL-U). The coreference information is stored in the MISC column: the start and end of each coreference mention are marked with the "Entity=" attribute. For example, "Entity=(e0" marks the start of a mention of the entity e0 at the current token, while "Entity=e0)" marks the end of a mention of the entity e0 at the current token. For full details on the format, please see http://hdl.handle.net/11234/1-5478. To ensure compliance with the CoNLL-U format, the corpus was automatically annotated with trankit v1.1.2 to obtain lemmas, part-of-speech tags (UPOS; XPOS following MULTEXT-East V6), features, and dependencies (head, dependency relation). To enable integration into the SloBENCH evaluation framework (https://slobench.cjvt.si/), we release the labeled training set (100 documents) and the unlabeled test set (49 documents) in the CorefUD format. Please note that the test set labels are available in the original coref149 corpus; they are omitted here to deter their misuse. Compared with the original coref149 corpus, this release contains the same texts and coreference information in a different (more universal) format.
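The "Entity=" marking described above can be read back into mention spans with a few lines of Python. The following is a minimal sketch, not the official CorefUD tooling: the function names `entity_events` and `mention_spans` and the five-token sample are made up for illustration. It assumes that the entity ID is the part of each bracketed piece before the first hyphen, that IDs contain no hyphens, and that a one-token mention is written as "(e0)"; the format documentation linked above remains the authority on these details.

```python
import re

# One bracketed piece of an Entity value: "(e0)", "(e0", or "e0)".
PIECE = re.compile(r"\(([^()]+)\)|\(([^()]+)|([^()]+)\)")


def entity_events(misc):
    """Yield (kind, entity_id) pairs found in one MISC field."""
    for attr in misc.split("|"):
        if not attr.startswith("Entity="):
            continue
        for single, opened, closed in PIECE.findall(attr[len("Entity="):]):
            piece = single or opened or closed
            # Assumption: the entity ID precedes the first hyphen, if any.
            eid = piece.split("-")[0]
            kind = "single" if single else ("open" if opened else "close")
            yield kind, eid


def mention_spans(conllu_text):
    """Return (entity_id, start, end) triples using a running token index."""
    spans = []
    open_starts = {}  # entity id -> stack of start positions
    position = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # sentence boundaries and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        position += 1
        for kind, eid in entity_events(cols[9]):
            if kind == "single":
                spans.append((eid, position, position))
            elif kind == "open":
                open_starts.setdefault(eid, []).append(position)
            else:
                spans.append((eid, open_starts[eid].pop(), position))
    return spans


# Hypothetical sample, not taken from the corpus.
rows = [
    ["1", "Janez", "_", "_", "_", "_", "_", "_", "_", "Entity=(e0)"],
    ["2", "je", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["3", "videl", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["4", "svojega", "_", "_", "_", "_", "_", "_", "_", "Entity=(e1(e0)"],
    ["5", "brata", "_", "_", "_", "_", "_", "_", "_", "Entity=e1)"],
]
sample = "\n".join("\t".join(row) for row in rows)
spans = mention_spans(sample)
print(spans)
```

Positions here are counted across the whole input rather than per sentence, which keeps the sketch short; a real reader would likely also track sentence and document boundaries.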

Identifier
PID http://hdl.handle.net/11356/1989
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1989
Provenance
Creator Klemen, Matej; Žitnik, Slavko
Publisher Faculty of Computer and Information Science, University of Ljubljana
Publication Year 2024
Rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); https://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/octet-stream; text/plain; downloadable_files_count: 3
Discipline Linguistics