Corpus of Slovene linguistic scientific writing JezKor

JezKor is a collection of linguistic scientific writing in the Slovenian language. It consists of 43 monographs published between 2009 and 2022 by the Fran Ramovš Institute of the Slovenian Language and Založba ZRC, 267 papers published in the journal "Jezikoslovni zapiski", and 28 papers published in the journal "Slovenski jezik". The texts were extracted directly from PDFs, so they contain various types of noise.

The corpus is linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla) at the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-speech tags and morphological features, and named entities. It is distributed in the CoNLL-U and vertical file formats, with one file per text. The metadata for each text consists of the author(s), title and year of publication.
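As an illustration of how the CoNLL-U files might be read, the following minimal Python sketch parses the ten standard tab-separated CoNLL-U columns without any external library. The function name `read_conllu` and the sample sentence are hypothetical and are not taken from the corpus; placing named-entity labels in the MISC column as `NER=...` is an assumption about the pipeline output and should be checked against the actual files.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, in their fixed order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines (e.g. sent_id, text) are skipped in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            # Skip malformed lines, which may occur in PDF-derived text.
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Invented sample, NOT from JezKor; the NER=... values in MISC are an assumption.
SAMPLE = "\n".join([
    "# sent_id = 1",
    "# text = Korpus je velik.",
    "1\tKorpus\tkorpus\tNOUN\tNcmsn\tCase=Nom|Gender=Masc|Number=Sing\t3\tnsubj\t_\tNER=O",
    "2\tje\tbiti\tAUX\tVa-r3s-n\tMood=Ind|Number=Sing|Person=3\t3\tcop\t_\tNER=O",
    "3\tvelik\tvelik\tADJ\tAgpmsnn\tCase=Nom|Gender=Masc|Number=Sing\t0\troot\t_\tNER=O|SpaceAfter=No",
    "4\t.\t.\tPUNCT\tZ\t_\t3\tpunct\t_\tNER=O",
    "",
])

for sent in read_conllu(SAMPLE.splitlines()):
    # Print (form, lemma, UPOS, MULTEXT-East tag) for each token.
    for tok in sent:
        print(tok["form"], tok["lemma"], tok["upos"], tok["xpos"])
```

To process a real file, the same function could be applied to an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to one of the per-text CoNLL-U files in the downloaded archive.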

Identifier
PID http://hdl.handle.net/11356/1755
Related Identifier https://rsdo.slovenscina.eu/terminoloski-portal
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1755
Provenance
Creator Atelšek, Simon; Nemec, Karmen; Jemec Tomazin, Mateja
Publisher ZRC SAZU
Publication Year 2023
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline Linguistics