CMC training corpus Janes-Tag 3.0

Dataset

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blogs, forums and news comments.

The corpus is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity annotation of non-standard Slovene. Because the corpus has been carefully annotated by hand, it is also suitable for detailed linguistic studies that require highly accurate and reliable annotations.

The corpus is composed of two parts: the older and smaller Janes-Tag 2.1 (texts up to 2016, 65,000 words) and the newer, tweet-only Janes-RSDO (2022, 125,000 words). Only the Janes-Tag 2.1 part is annotated with named entities and with a classification of the texts according to their estimated technical (T1-T3) and linguistic (L1-L3) standardness.

The data is available in the source TEI encoding and in a derived CoNLL-U format. Both encodings contain JOS/MULTEXT-East morphosyntactic descriptions as well as Universal Dependencies morphological features.
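As an illustration of how the CoNLL-U files might be consumed, the following is a minimal sketch of a reader for the standard ten-column CoNLL-U layout. It assumes, as is the usual convention, that the MULTEXT-East MSD is stored in the XPOS column and the Universal Dependencies features in the FEATS column; the column layout of the distributed files should be checked before relying on this. The function names and the three-token sample sentence are invented for the example and are not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def read_conllu(lines: Iterable[str]) -> Iterator[List[dict]]:
    """Yield one sentence at a time as a list of token dicts."""
    sentence: List[dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (metadata) are skipped in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 tab-separated columns: {line!r}")
        token = dict(zip(CONLLU_FIELDS, cols))
        token["feats"] = parse_feats(token["feats"])
        sentence.append(token)
    if sentence:
        yield sentence


# Hypothetical sample, NOT taken from the corpus; the XPOS column is
# assumed to hold the MULTEXT-East MSD.
SAMPLE = (
    "# text = To je test\n"
    "1\tTo\tta\tPRON\tPd-nsn\tCase=Nom|Gender=Neut|Number=Sing\t_\t_\t_\t_\n"
    "2\tje\tbiti\tAUX\tVa-r3s-n\tNumber=Sing|Person=3\t_\t_\t_\t_\n"
    "3\ttest\ttest\tNOUN\tNcmsn\tCase=Nom|Gender=Masc|Number=Sing\t_\t_\t_\t_\n"
    "\n"
)

for sent in read_conllu(SAMPLE.splitlines()):
    for tok in sent:
        print(tok["form"], tok["lemma"], tok["xpos"], tok["feats"])
```

To read an actual file, an open file handle can be passed in place of the sample lines, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to one of the unpacked CoNLL-U files.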

Compared to the previous version, this one corrects some errors, updates the encoding, and adds Janes-RSDO.

The first version of this corpus is described in:

FIŠER, Darja, LJUBEŠIĆ, Nikola, ERJAVEC, Tomaž. 2020. The Janes project: language resources and tools for Slovene user generated content. Language Resources and Evaluation. https://doi.org/10.1007/s10579-018-9425-z

Note that a related corpus, Janes-Norm 3.0 (http://hdl.handle.net/11356/1733), is also available. It contains Janes-Tag 3.0 and an additional subcorpus with manually checked sentences, tokens and normalised words but only automatically assigned lemmas and MULTEXT-East MSDs.

Identifier
PID	http://hdl.handle.net/11356/1732
Related Identifier	https://nl.ijs.si/janes/viri/rocno-oznaceni-korpusi/#Janes-Tag
Related Identifier	https://doi.org/10.1007/s10579-018-9425-z
Related Identifier	http://hdl.handle.net/11356/1238
Related Identifier	https://nl.ijs.si/janes/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1732

Provenance
Creator	Lenardič, Jakob; Čibej, Jaka; Arhar Holdt, Špela; Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola; Zupan, Katja; Dobrovoljc, Kaja
Publisher	Jožef Stefan Institute
Publication Year	2022
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline	Linguistics