Croatian linguistic training corpus hr500k 2.0

The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels.

The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, https://nl.ijs.si/ME/V6/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, (3) the Janes annotation guidelines for named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, (4) the PARSEME guidelines for annotating multi-word expressions, https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3/, and (5) the semantic role labelling annotation protocol for Slovenian and Croatian, https://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Gantar-et-al_Towards-Semantic-Role-Labeling-in-Slovene-and-Croatian.pdf.

Unlike the previous version of the dataset, this version is encoded in the CoNLL-U Plus (conllup) format, as are other linguistic training datasets for Croatian and Serbian. The PARSEME multi-word expression annotation layer has also been added, together with numerous label corrections on all available annotation levels.
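
As a rough illustration of how a conllup file can be read, the sketch below parses CoNLL-U Plus text into sentences, taking the column names from the "# global.columns" header line that the format places at the top of the file. The sample data and the EXAMPLE:NE column name are made up for this example and are not taken from hr500k; the actual columns should be read from the header of the distributed files.

```python
from typing import Dict, Iterable, Iterator, List


def read_conllup(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dicts."""
    columns = None
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # The header declares the column layout for the whole file.
            columns = line.split("=", 1)[1].split()
            continue
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        if columns is None:
            raise ValueError("missing '# global.columns' header")
        sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


# Hypothetical sample; the column set and annotations are illustrative only.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS EXAMPLE:NE\n"
    "# sent_id = example-1\n"
    "1\tAna\tAna\tPROPN\tB-PER\n"
    "2\tspava\tspavati\tVERB\tO\n"
    "3\t.\t.\tPUNCT\tO\n"
    "\n"
)

sentences = list(read_conllup(SAMPLE.splitlines()))
print(len(sentences), sentences[0][1]["LEMMA"])
```

For a real file, the same function can be given an open file handle, for example read_conllup(open(path, encoding="utf-8")), where path is whichever conllup file was downloaded.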

The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade (https://reldi.spur.uzh.ch).

Identifier
PID http://hdl.handle.net/11356/1792
Related Identifier http://www.lrec-conf.org/proceedings/lrec2016/summaries/340.html
Related Identifier http://hdl.handle.net/11356/1183
Related Identifier https://github.com/reldi-data/hr500k
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1792
Provenance
Creator Ljubešić, Nikola; Samardžić, Tanja
Publisher Jožef Stefan Institute
Publication Year 2023
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); PUB; https://creativecommons.org/licenses/by-sa/4.0/
OpenAccess true
Contact info(at)clarin.si
Representation
Language Croatian
Resource Type corpus
Format text/plain; charset=utf-8; application/octet-stream; application/gzip; downloadable_files_count: 7
Discipline Linguistics