CMC training corpus Janes-Norm 1.0

Dataset

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is intended as a gold-standard training and testing dataset for tokenisation, sentence segmentation and word normalisation of non-standard Slovene. The corpus is also automatically annotated with morphosyntactic descriptions and lemmas. Because the manual annotation was done carefully, the corpus is also suitable for detailed linguistic explorations that require highly accurate and reliable annotations.

The corpus is further described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf

A related corpus, Janes-Tag, is also available; cf. http://hdl.handle.net/11356/1079.

Identifier
PID	http://hdl.handle.net/11356/1080
Related Identifier	http://hdl.handle.net/11356/1084
Related Identifier	https://nl.ijs.si/janes/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1080
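
The Metadata Access URL above is an OAI-PMH GetRecord request. As a minimal sketch of how such a request is composed, the following Python snippet rebuilds that URL from its parts. The helper name `get_record_url` and the constants are illustrative assumptions, not part of the repository's tooling; the endpoint, metadata prefix and identifier are copied from this record.

```python
from urllib.parse import urlencode

# Values copied from the Identifier fields of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
HANDLE = "11356/1080"


def get_record_url(handle: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL for a handle (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # safe=":/" leaves ':' and '/' unescaped so the result matches
    # the URL as it is written in the record.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = get_record_url(HANDLE)
print(url)
```

The resulting string could then be passed to any HTTP client to retrieve the Dublin Core record, assuming the endpoint is still served at this address.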

Provenance
Creator	Erjavec, Tomaž; Fišer, Darja; Čibej, Jaka; Arhar Holdt, Špela
Publisher	Jožef Stefan Institute
Publication Year	2016
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	application/pdf; application/zip; text/plain; charset=utf-8; downloadable_files_count: 4
Discipline	Linguistics