Corpus of Slovenian texts for pedagogical purposes ccMAKS 1.0

Dataset

MAKS (MlAdinski KorpuS, i.e. the Youth Corpus) includes texts from literature, newspapers, and, to a lesser extent, the web. The corpus was designed for the needs of the e-learning environment "Slovenščina na dlani", where it served as a source of grammar and spelling exercises. The texts were therefore selected to be as stylistically neutral as possible, proofread, and thematically interesting to the learner population. Some texts originate from the Slovenian reference corpus Gigafida, while many others (primarily literary) were newly gathered.

The corpus as a whole is available in the CLARIN.SI concordancers, while the openly available ccMAKS dataset includes 10% of the texts, sampled in accordance with the authorship agreements. In the project "Empirical foundations for digitally-supported development of writing skills", the corpus was linguistically annotated with the CLASSLA v1.1.1 pipeline (https://github.com/clarinsi/classla/) at the levels of tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). The aim is to provide comparably annotated, pedagogically relevant corpora that can be used for different tasks in language didactics and NLP.

The corpus is available in CoNLL-U and vertical formats. The CoNLL-U format contains one document per file, with the text metadata provided separately as a TSV file, whereas the vertical format concatenates all documents into one large file. The registry file ccmaks.regi for the vertical format is compatible with the LIST 1.2 corpus extraction tool (http://hdl.handle.net/11356/1276), and the ccmaks.noske.regi file is needed for Sketch Engine-type concordancers.
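
As an illustration of how the CoNLL-U files might be read, the following is a minimal sketch using only the Python standard library. It assumes the standard ten tab-separated CoNLL-U columns, comment lines starting with "#", and blank lines between sentences. The function name read_conllu and the sample sentence are hypothetical and are not taken from the corpus; columns not needed for the illustration are left as "_" placeholders rather than filled with guessed tags.

```python
from typing import Dict, Iterable, Iterator, List

# Standard CoNLL-U column names, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. sent_id, text) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        # Emit the last sentence if the input lacks a trailing blank line.
        yield sentence


# Invented sample, NOT taken from ccMAKS; unused columns are "_" placeholders.
SAMPLE = "\n".join([
    "# sent_id = example.1",
    "# text = Ana bere knjigo.",
    "\t".join(["1", "Ana", "Ana", "PROPN", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["2", "bere", "brati", "VERB", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["3", "knjigo", "knjiga", "NOUN", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "_", "_", "_", "_"]),
    "",
])

sentences = list(read_conllu(SAMPLE.splitlines()))
lemmas = [token["lemma"] for token in sentences[0]]
```

For a file from the dataset, the same function could be applied to an open file handle (opened with encoding="utf-8") instead of the sample string, since it only iterates over lines.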

Identifier
PID	http://hdl.handle.net/11356/1692
Related Identifier	http://projekt.slo-na-dlani.si/en/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1692

Provenance
Creator	Verdonik, Darinka; Majninger, Sandi; Dobrovoljc, Kaja; Antloga, Špela; Zögling Markuš, Aleksandra; Voršič, Ines; Zemljak Jontes, Melita; Koletnik, Mihaela; Valh Lopert, Alenka; Šek Martük, Polonca; Kosem, Iztok; Majhenič, Simona; Ferme, Marko; Žagar, Aleš; Arhar Holdt, Špela
Publisher	Centre for Language Resources and Technologies, University of Ljubljana
Publication Year	2022
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); PUB; https://creativecommons.org/licenses/by-sa/4.0/
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; application/octet-stream; downloadable_files_count: 3
Discipline	Linguistics