JezKor is a collection of linguistic scientific writing in the Slovenian language. It consists of 43 monographs published between 2009 and 2022 by Fran Ramovš institute of Slovenian language and Založba ZRC, 267 papers published in the journal "Jezikoslovni zapiski" and 28 papers published in the journal "Slovenski jezik". Note that the texts were obtained directly from PDFs, so they contain various types of noise.
The corpus is linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla) on the levels lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-spech and morphological features, and named entities. It is distributed in CoNLL-U and vertical file format, one file for each text. Text metadata consists of the author(s), title and year of publication.