Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs, forums and news comments.
The corpus is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, and word normalisation of non-standard Slovene.
The corpus is composed of three parts. The first is Janes-Norm 1.2 proper (5,000 texts and 93,000 words, texts up to 2016), which has automatically assigned lemmas and morphosyntactic tags. The other two parts together constitute the complete Janes-Tag 3.0 corpus (http://hdl.handle.net/11356/1732), which has manually annotated morphosyntactic tags, lemmas and named entities (15,000 texts and 20,000 words): the older Janes-Tag 2.1 (texts up to 2016) and the newer Janes-RSDO (tweets only, texts up to 2022). The texts in Janes-Norm and Janes-Tag 2.1 (but not those in Janes-RSDO) are classified according to their estimated technical (T1-T3) and linguistic (L1-L3) standardness.
The data is available in the source TEI encoding and in the derived CoNLL-U format. All three parts contain lemmas and JOS/MULTEXT-East morphosyntactic descriptions, while Janes-Tag and Janes-RSDO also contain Universal Dependencies morphological features, and Janes-Tag also contains named entity annotations.
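As a rough illustration of working with the derived CoNLL-U files, the following minimal Python sketch reads the ten standard tab-separated CoNLL-U columns and groups tokens into sentences. The sample sentence, its annotations and the function name read_conllu are invented for illustration and are not taken from the corpus. In particular, how the corpus stores normalised forms, standardness levels or named entities in CoNLL-U is not shown here, so the XPOS, FEATS and MISC columns are left as "_" in the sample.

```python
from typing import Dict, Iterator, List

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts (hypothetical helper).

    Comment lines (starting with '#') are skipped, and a blank line
    ends the current sentence. Multiword-token range lines such as
    '1-2' are kept as ordinary rows.
    """
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:  # file may not end with a blank line
        yield sentence


# Invented sample, NOT taken from the corpus; unknown fields are "_".
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = Janez bere knjigo .\n"
    "1\tJanez\tJanez\tPROPN\t_\t_\t_\t_\t_\t_\n"
    "2\tbere\tbrati\tVERB\t_\t_\t_\t_\t_\t_\n"
    "3\tknjigo\tknjiga\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE):
        for tok in sent:
            print(tok["form"], tok["lemma"], tok["upos"], sep="\t")
```

For real use, the text of a distributed CoNLL-U file would be passed in place of SAMPLE; an established CoNLL-U library may be preferable to this hand-written reader.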
Compared to the previous version, this one corrects some capitalisation errors in normalised words of Janes-Norm, updates the encoding, and adds Janes-RSDO.
The first version of this corpus is described in:
FIŠER, Darja, LJUBEŠIĆ, Nikola, ERJAVEC, Tomaž. 2018. The Janes project: language resources and tools for Slovene user generated content. Language Resources & Evaluation. https://rdcu.be/7RX4