Corpus of Slovenian textbooks ccUčbeniki 1.0

ccUčbeniki includes 32 openly available textbooks for Slovenian primary and secondary education, published by the Slovenian National Education Institute in 2014-2015. The textbooks, prepared by various authors, cover different subjects, as documented in the ccucbeniki-metadata file.

The corpus was linguistically annotated with the CLASSLA v1.1.1 pipeline (https://github.com/clarinsi/classla/) at the following levels: tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). The aim is to provide comparably annotated, pedagogically relevant corpora that can be used for different tasks in language didactics and NLP.

The corpus is available in CoNLL-U and vertical formats. The CoNLL-U format contains one document per file, with the text metadata provided separately as a TSV file; the vertical format contains all documents concatenated in one large file. The registry file ccucbeniki.regi for the vertical format is compatible with the LIST 1.2 corpus extraction tool (http://hdl.handle.net/11356/1276).
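
As a rough illustration of how the CoNLL-U files could be processed, the following Python sketch reads sentences and counts lemmas using only the standard library. It assumes the standard ten-column CoNLL-U layout; the sample sentence and its annotation values are invented for the example and are not taken from the corpus, and the names read_conllu and SAMPLE are hypothetical. The MSD tags would normally be expected in the XPOS column, but this should be checked against the actual files.

```python
import io
from collections import Counter

# Column names of the standard CoNLL-U format (ten tab-separated fields).
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines):
    """Yield sentences as lists of token dicts from an iterable of CoNLL-U lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment lines (sentence or document metadata) are skipped here.
            continue
        else:
            cols = line.split("\t")
            # Keep only well-formed rows with a plain integer ID
            # (this skips multiword-token ranges and empty nodes).
            if len(cols) == len(FIELDS) and cols[0].isdigit():
                sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Invented sample, NOT corpus data; unknown fields are left as "_".
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = To je knjiga.\n"
    "1\tTo\tta\tDET\t_\t_\t_\t_\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t_\t_\t_\t_\n"
    "3\tknjiga\tknjiga\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(io.StringIO(SAMPLE)))

# Lemma frequencies, ignoring punctuation.
lemma_counts = Counter(
    tok["lemma"]
    for sent in sentences
    for tok in sent
    if tok["upos"] != "PUNCT"
)
print(lemma_counts.most_common())
```

To apply the same function to a downloaded file, an open file handle can be passed instead of the in-memory sample, for instance read_conllu(open(path, encoding="utf-8")), where path points to one of the per-document CoNLL-U files.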

Identifier
PID http://hdl.handle.net/11356/1693
Related Identifier https://www.cjvt.si/prop/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1693
Provenance
Creator Kosem, Iztok; Pori, Eva; Žagar, Aleš; Arhar Holdt, Špela
Publisher Centre for Language Resources and Technologies, University of Ljubljana
Publication Year 2022
Rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); PUB; https://creativecommons.org/licenses/by-nc-sa/4.0/
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline Linguistics