Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blogs, forums and news comments.
The corpus is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity annotation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations.
The corpus is composed of two parts: the older (texts up to 2016) and smaller (65,000 words) Janes-Tag 2.1, and the newer (2022), tweet-only Janes-RSDO (125,000 words). Only the Janes-Tag 2.1 part is annotated with named entities and with a classification of the texts according to their estimated technical (T1-T3) and linguistic (L1-L3) standardness.
The data is available in the source TEI encoding and in derived CoNLL-U format. Both contain JOS/MULTEXT-East morphosyntactic descriptions as well as Universal Dependencies morphological features.
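As a rough illustration of how the derived CoNLL-U files might be read, the following minimal Python sketch parses the ten standard CoNLL-U columns and splits the morphological features into a dictionary. It assumes that the JOS/MULTEXT-East MSD is stored in the XPOS column and the Universal Dependencies features in the FEATS column, which is the usual convention but is not stated above. The function names and the two-word sample sentence are invented for the example and are not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, List

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_feats(value: str) -> Dict[str, str]:
    """Turn 'Number=Sing|Person=3' into a dict; '_' means no features."""
    if not value or value == "_":
        return {}
    return dict(pair.split("=", 1) for pair in value.split("|"))


def read_conllu(lines: Iterable[str]) -> Iterator[List[dict]]:
    """Yield sentences as lists of token dicts (hypothetical helper)."""
    sentence: List[dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():          # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # sentence-level comment/metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            continue                  # skip lines that are not token rows
        token = dict(zip(CONLLU_FIELDS, cols))
        token["feats"] = parse_feats(token["feats"])
        sentence.append(token)
    if sentence:                      # file may lack a trailing blank line
        yield sentence


# Invented sample, NOT from the corpus; MSD assumed to be in XPOS.
SAMPLE = (
    "# text = Danes dežuje.\n"
    "1\tDanes\tdanes\tADV\tRgp\tDegree=Pos\t_\t_\t_\t_\n"
    "2\tdežuje\tdeževati\tVERB\tVmpr3s\t"
    "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t_\t_\t_\t_\n"
    "3\t.\t.\tPUNCT\tZ\t_\t_\t_\t_\t_\n"
)

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE.splitlines()):
        for tok in sent:
            print(tok["form"], tok["lemma"], tok["xpos"], tok["feats"])
```

To process a real file, the same function could be given an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to one of the distributed CoNLL-U files.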
Compared to the previous version, this one corrects some errors, updates the encoding, and adds Janes-RSDO.
The first version of this corpus is described in:
FIŠER, Darja, LJUBEŠIĆ, Nikola, ERJAVEC, Tomaž. 2020. The Janes project: language resources and tools for Slovene user generated content. Language Resources and Evaluation. https://doi.org/10.1007/s10579-018-9425-z
Note that a related corpus, Janes-Norm 3.0 (http://hdl.handle.net/11356/1733), is also available. It contains Janes-Tag 3.0 and an additional subcorpus with manually checked sentences, tokens and normalised words but only automatically assigned lemmas and MULTEXT-East MSDs.