Corpus of Slovenian texts for pedagogical purposes ccMAKS 1.0

MAKS (MlAdinski KorpuS, i.e. the Youth Corpus) includes texts from literature, newspapers, and, to a lesser extent, the web. The corpus was designed for the needs of the e-learning environment "Slovenščina na dlani", where it served as a source of grammar and spelling exercises. The texts were therefore selected to be proofread, as stylistically neutral as possible, and thematically interesting to the learner population. Some texts originate from the Slovenian reference corpus Gigafida, while many others (primarily literary) were newly gathered.

The corpus as a whole is available in the CLARIN.SI concordancers, while the openly available ccMAKS dataset includes 10% of the texts, sampled in accordance with the authorship agreements. In the project "Empirical foundations for digitally-supported development of writing skills", the corpus was linguistically annotated with the CLASSLA v1.1.1 pipeline (https://github.com/clarinsi/classla/). The annotation covers tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). The aim is to provide comparably annotated, pedagogically relevant corpora that can be used for different tasks in language didactics and NLP.

The corpus is available in the CoNLL-U and vertical formats. The CoNLL-U format contains one document per file, with the text metadata provided separately as a TSV file, while the vertical format contains all documents concatenated in one large file. The registry file ccmaks.regi for the vertical format is compatible with the LIST 1.2 corpus extraction tool (http://hdl.handle.net/11356/1276), and the ccmaks.noske.regi file is needed for SketchEngine-type concordancers.
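
As a rough illustration of how the CoNLL-U files could be read, the following minimal sketch parses CoNLL-U text using only the Python standard library. It relies on the general CoNLL-U conventions (ten tab-separated columns per token, comment lines starting with "#", sentences separated by blank lines). The function name parse_conllu and the lowercase field names are hypothetical labels chosen here, and the sample sentences are invented for illustration rather than taken from ccMAKS; columns whose corpus values are not shown in this record are left as "_" placeholders.

```python
# Minimal CoNLL-U reader (a sketch, not the corpus authors' tooling).
# Field names below are our own labels for the ten standard CoNLL-U columns.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences = []
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    if tokens:
        # The last sentence may not be followed by a blank line.
        sentences.append(tokens)
    return sentences


# Invented sample, NOT taken from ccMAKS; unknown annotations are left as "_".
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = To je primer.\n"
    "1\tTo\tta\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tje\tbiti\t_\t_\t_\t_\t_\t_\t_\n"
    "3\tprimer\tprimer\t_\t_\t_\t_\t_\t_\tSpaceAfter=No\n"
    "4\t.\t.\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = example.2\n"
    "# text = Hvala.\n"
    "1\tHvala\thvala\t_\t_\t_\t_\t_\t_\tSpaceAfter=No\n"
    "2\t.\t.\t_\t_\t_\t_\t_\t_\t_\n"
)

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        print(" ".join(token["form"] for token in sentence))
```

To process a real file, the same function could be applied to the contents of one CoNLL-U document read with UTF-8 encoding. The sketch treats every non-comment row as an ordinary token, so any multiword-token ranges or empty nodes would need extra handling.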

Identifier
PID http://hdl.handle.net/11356/1692
Related Identifier http://projekt.slo-na-dlani.si/en/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1692
Provenance
Creator Verdonik, Darinka; Majninger, Sandi; Dobrovoljc, Kaja; Antloga, Špela; Zögling Markuš, Aleksandra; Voršič, Ines; Zemljak Jontes, Melita; Koletnik, Mihaela; Valh Lopert, Alenka; Šek Martük, Polonca; Kosem, Iztok; Majhenič, Simona; Ferme, Marko; Žagar, Aleš; Arhar Holdt, Špela
Publisher Centre for Language Resources and Technologies, University of Ljubljana
Publication Year 2022
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); PUB; https://creativecommons.org/licenses/by-sa/4.0/
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; application/octet-stream; downloadable_files_count: 3
Discipline Linguistics