Training corpus hr500k 1.0

The hr500k training corpus contains about 500,000 tokens, manually annotated for tokenisation, sentence segmentation, morphosyntactic tags, lemmas and named entities. About half of the corpus is also manually annotated with syntactic dependencies, and about a fifth with semantic role labels.

The annotations (and other aspects) of the hr500k corpus are documented in the teiHeader and back elements of the TEI-encoded corpus. In short, the annotations follow (1) the MULTEXT-East V5 morphosyntactic specifications for Croatian, https://nl.ijs.si/ME/V5/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, and (3) the Janes annotation guidelines for named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, while (4) the annotation guidelines for semantic role labelling are currently in the publication process.

Identifier
PID	http://hdl.handle.net/11356/1183
Related Identifier	http://www.lrec-conf.org/proceedings/lrec2016/summaries/340.html
Related Identifier	http://hdl.handle.net/11356/1792
Related Identifier	https://github.com/nljubesi/hr500k
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1183
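
The Metadata Access link above is an OAI-PMH GetRecord request. As a minimal sketch, the same URL can be built from its three query parameters, and the Dublin Core fields of a response can be collected with the standard library. The helper names get_record_url and dublin_core_fields are hypothetical, and the parsing assumes only that the response carries elements in the standard Dublin Core namespace.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint taken from the Metadata Access field of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
# Standard Dublin Core element namespace, in ElementTree's "{uri}" notation.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def get_record_url(identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL (hypothetical helper)."""
    query = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # Keep ':' and '/' unescaped so the URL matches the form shown in the record.
    return OAI_ENDPOINT + "?" + urlencode(query, safe=":/")


def dublin_core_fields(xml_text):
    """Collect the text of all Dublin Core elements, grouped by field name."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root.iter():
        if element.tag.startswith(DC_NS) and element.text:
            name = element.tag[len(DC_NS):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields


url = get_record_url("oai:www.clarin.si:11356/1183")
```

The XML text to pass to dublin_core_fields could be obtained, for instance, with urllib.request.urlopen(url).read(); the exact layout of the repository's response is not shown in this record, so the parser deliberately makes no assumption about nesting.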

Provenance
Creator	Ljubešić, Nikola; Agić, Željko; Klubička, Filip; Batanović, Vuk; Erjavec, Tomaž
Publisher	Jožef Stefan Institute
Publication Year	2018
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Croatian
Resource Type	corpus
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 3
Discipline	Linguistics