Corpus of scientific texts from the Open Science Slovenia portal OSS 1.0

OSS is a large collection of scientific writing in the Slovenian language gathered from the Open Science Slovenia portal (https://openscience.si). It consists of over 150 thousand monographs, articles, diploma, master's and doctoral theses, advanced textbooks, reviews, etc., mostly published between 2000 and 2022 by Slovenian universities, research institutions, etc. The texts are accompanied by metadata, i.e. author, supervisor (for theses), year of publication, publisher (mostly faculties of the various universities), type of publication (according to the SICRIS classification), keywords, and CERIF and UDC codes. The texts were extracted directly from PDFs, so they can contain various types of character noise. The texts are linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla) at the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-speech tags and morphological features, and named entities. The corpus is distributed in the CoNLL-U and vertical file formats, with one file for each text. The text metadata is given as a TSV file.
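
As an illustration of how the CoNLL-U files might be read, the following is a minimal Python sketch that relies only on the standard ten tab-separated CoNLL-U columns. The function name `read_conllu` and the three-token sample are hypothetical: the sample is invented for illustration and not taken from the corpus, and its tags are only meant to look plausible. Multiword-token and empty-node lines, which CoNLL-U permits, are kept as ordinary rows here.

```python
import io
from typing import Iterator

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(stream) -> Iterator[list[dict[str, str]]]:
    """Yield one sentence at a time as a list of token dicts.

    Hypothetical helper: comment lines (starting with '#') are skipped,
    and a blank line ends the current sentence.
    """
    sentence: list[dict[str, str]] = []
    for raw in stream:
        line = raw.rstrip("\r\n")
        if not line:
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            continue
        else:
            cols = line.split("\t")
            # Ignore malformed lines, e.g. residue of PDF character noise.
            if len(cols) == len(CONLLU_FIELDS):
                sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


# Invented sample, NOT taken from the corpus; tags are illustrative only.
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = To deluje.\n"
    "1\tTo\tta\tPRON\tPd-nsn\t_\t2\tnsubj\t_\t_\n"
    "2\tdeluje\tdelovati\tVERB\tVmpr3s\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\tZ\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sent in read_conllu(io.StringIO(SAMPLE)):
        for tok in sent:
            # form, lemma, UD part-of-speech, MULTEXT-East MSD
            print(tok["form"], tok["lemma"], tok["upos"], tok["xpos"])
```

To process a real corpus file, the `io.StringIO(SAMPLE)` argument would be replaced by a file opened with `encoding="utf-8"`. Generators are used so that large files can be read one sentence at a time rather than loaded whole.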

Note that similar, but older and smaller, corpora also exist: KAS 2.0 (http://hdl.handle.net/11356/1448) and KAS 1.0 (http://hdl.handle.net/11356/1244). These contain only theses and only up to 2018, but are cleaner and have more metadata. The repository also archives a number of KAS-derived datasets; please search for "KAS" to find them.

Identifier
PID http://hdl.handle.net/11356/1774
Related Identifier https://openscience.si/
Related Identifier https://rsdo.slovenscina.eu/terminoloski-portal
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1774
Provenance
Creator Žagar, Kristjan; Ferme, Marko; Ojsteršek, Milan; Jemec Tomazin, Mateja; Erjavec, Tomaž
Publisher Faculty of Electrical Engineering and Computer Science, University of Maribor
Publication Year 2023
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 10
Discipline Linguistics