ccUčbeniki includes 32 openly available textbooks for Slovenian primary and secondary education, published by the Slovenian National Education Institute in 2014-2015. The textbooks, prepared by various authors, cover different subjects, as documented in the ccucbeniki-metadata file.
The corpus was linguistically annotated with the CLASSLA v1.1.1 pipeline (https://github.com/clarinsi/classla/) at the following levels: tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). The aim is to provide comparably annotated, pedagogically relevant corpora that can be used for different tasks in language didactics and NLP.
The corpus is available in CoNLL-U and vertical formats. The CoNLL-U format stores one document per file, with the text metadata provided separately as a TSV file, while the vertical format concatenates all documents into one large file. The registry file ccucbeniki.regi for the vertical format is compatible with the LIST 1.2 corpus extraction tool (http://hdl.handle.net/11356/1276).
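
As a rough illustration of how a CoNLL-U file such as those described above might be read, the following minimal Python sketch parses the standard ten-column CoNLL-U layout using only the standard library. It is a sketch under stated assumptions, not part of the corpus distribution: the function name read_conllu and the SAMPLE sentence are hypothetical, the sample is invented rather than taken from the corpus, and its syntactic columns are left as "_" placeholders. The sketch also assumes that the MSD tag sits in the XPOS column, which should be checked against the actual files.

```python
from typing import Dict, Iterable, Iterator, List

# Standard CoNLL-U column names, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per sentence with its comments and token rows.

    Hypothetical helper for illustration; not shipped with the corpus.
    """
    comments: Dict[str, str] = {}
    tokens: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield {"comments": comments, "tokens": tokens}
            comments, tokens = {}, []
            continue
        if line.startswith("#"):
            # Comment lines of the form "# key = value" carry metadata.
            key, sep, value = line[1:].partition("=")
            if sep:
                comments[key.strip()] = value.strip()
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 tab-separated columns: {line!r}")
        tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    # Emit the last sentence if the input does not end with a blank line.
    if tokens:
        yield {"comments": comments, "tokens": tokens}


# Invented sample (NOT from the corpus); syntax columns are placeholders.
SAMPLE = "\n".join([
    "# sent_id = s1",
    "# text = Voda vre.",
    "\t".join(["1", "Voda", "voda", "NOUN", "Ncfsn", "_", "_", "_", "_", "_"]),
    "\t".join(["2", "vre", "vreti", "VERB", "Vmpr3s", "_", "_", "_", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "Z", "_", "_", "_", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE.splitlines()):
        print(sentence["comments"].get("text"))
        for tok in sentence["tokens"]:
            print(tok["form"], tok["lemma"], tok["xpos"], sep="\t")
```

To apply the same function to a real file, pass an open file handle (for example, one opened with encoding="utf-8") instead of SAMPLE.splitlines(). The generator processes the input line by line, so memory use stays bounded by the size of a single sentence.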