CMC training corpus Janes-Norm 3.0

Dataset

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs, forums and news comments.

The corpus is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, and word normalisation of non-standard Slovene.

The corpus is composed of three parts. One is Janes-Norm 1.2 proper (5,000 texts and 93,000 words, texts up to 2016), which has automatically assigned lemmas and morphosyntactic tags. The other two parts constitute the complete Janes-Tag 3.0 corpus (http://hdl.handle.net/11356/1732), which has manually annotated morphosyntactic tags, lemmas and named entities (15,000 texts and roughly 187,000 words, i.e. the remainder of the corpus totals given above). The two parts of Janes-Tag 3.0 are the older Janes-Tag 2.1 (texts up to 2016) and the newer Janes-RSDO (tweets only, texts up to 2022). The texts in Janes-Norm and Janes-Tag 2.1 (but not in Janes-RSDO) are classified according to their estimated technical (T1-T3) and linguistic (L1-L3) standardness.

The data is available in the source TEI encoding and in a derived CoNLL-U format. All three parts contain lemmas and JOS/MULTEXT-East morphosyntactic descriptions. Janes-Tag and Janes-RSDO also contain Universal Dependencies morphological features, and Janes-Tag additionally contains named entity annotations.
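
As an illustration of how the derived CoNLL-U files might be read, the following minimal Python sketch splits each token line into the ten standard CoNLL-U columns. It assumes (this is not confirmed by this record) that the JOS/MULTEXT-East description is stored in the XPOS column and the Universal Dependencies features in FEATS. The function name read_conllu and the two sample token lines are invented for the example and are not taken from the corpus.

```python
from typing import Dict, Iterator, List

# The ten column names defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # Handle input that does not end with a blank line.
        yield sentence


# Invented sample, NOT taken from the corpus; "_" marks an empty field.
SAMPLE = (
    "# text = Danes dežuje\n"
    "1\tDanes\tdanes\tADV\tRgp\tDegree=Pos\t_\t_\t_\t_\n"
    "2\tdežuje\tdeževati\tVERB\tVmpr3s\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
for token in sentences[0]:
    # Assumption: XPOS holds the JOS/MULTEXT-East description.
    print(token["form"], token["lemma"], token["xpos"])
```

For a real file, the same function could be applied to the contents of a file opened with encoding="utf-8". The actual file names inside the downloadable archives are not stated in this record.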

Compared to the previous version, this one corrects some capitalisation errors in normalised words of Janes-Norm, updates the encoding, and adds Janes-RSDO.

The first version of this corpus is described in:

FIŠER, Darja, LJUBEŠIĆ, Nikola, ERJAVEC, Tomaž. 2018. The Janes project: language resources and tools for Slovene user generated content. Language Resources & Evaluation. https://rdcu.be/7RX4

Identifier
PID	http://hdl.handle.net/11356/1733
Related Identifier	https://nl.ijs.si/janes/viri/rocno-oznaceni-korpusi/#Janes-Norm
Related Identifier	https://doi.org/10.1007/s10579-018-9425-z
Related Identifier	http://hdl.handle.net/11356/1084
Related Identifier	https://nl.ijs.si/janes/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1733

Provenance
Creator	Lenardič, Jakob; Čibej, Jaka; Arhar Holdt, Špela; Erjavec, Tomaž; Fišer, Darja
Publisher	Jožef Stefan Institute
Publication Year	2022
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline	Linguistics