Corpus of Academic Slovene (MSc/MA theses) KAS-mag 1.0

The KAS-mag corpus of Slovene MSc/MA theses consists of almost 16,000 texts (1.36 million pages or 500 million tokens) written between 2000 and 2018 and gathered from the digital libraries of Slovene higher education institutions via the Slovene Open Science portal (http://openscience.si).

Each thesis is accompanied by extensive metadata, and the corpus contains only its textual body, i.e. without the front and back matter. The body is divided into pages, the pages into paragraphs, and the paragraphs into sentences. The tokens of each sentence are morphosyntactically annotated, words are lemmatised, and English-Slovene pairs of term candidates are marked up and linked.

The corpus is distributed in the canonical TEI encoding, in the so-called vertical format used by the (no)Sketch Engine and CWB concordancers, and as plain text files. Each distribution format also contains a file with the thesis metadata.
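
As an illustration of how the vertical format is typically consumed, the following is a minimal Python sketch that groups token lines into sentences. It assumes the common convention of one token per line with tab-separated attributes, and structural elements such as `<s>` on lines of their own. The column order (word, lemma, tag), the function name `read_vertical`, and the sample data with its placeholder tags are all assumptions made for this example; the actual attribute layout should be checked against the corpus documentation.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str, str]  # (word, lemma, tag): ASSUMED column order


def read_vertical(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per <s>...</s> block of a vertical file."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # A line with no tab that looks like <...> is treated as structure.
        if "\t" not in line and line.startswith("<") and line.endswith(">"):
            if line == "</s>":
                if sentence:
                    yield sentence
                sentence = []
            elif line == "<s>" or line.startswith("<s "):
                sentence = []
            continue
        # Token line: pad so that missing columns become empty strings.
        cols = line.split("\t")
        word, lemma, tag = (cols + ["", ""])[:3]
        sentence.append((word, lemma, tag))


# Hypothetical sample, not taken from the corpus; tags are placeholders.
SAMPLE = """<text id="example">
<p>
<s>
To\tta\tTAG1
je\tbiti\tTAG2
stavek\tstavek\tTAG3
.\t.\tTAG4
</s>
</p>
</text>
"""

sentences = list(read_vertical(SAMPLE.splitlines()))
for sent in sentences:
    print(" ".join(word for word, _, _ in sent))
```

For the real distribution, the same generator could be fed an open file handle (for example one returned by `gzip.open(path, "rt", encoding="utf-8")`), so that the corpus is streamed rather than loaded into memory.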

This repository entry contains the corpus of MSc/MA theses only; separate entries are available that contain PhD theses (KAS-dr: http://hdl.handle.net/11356/1265), BSc/BA theses (KAS-dipl: http://hdl.handle.net/11356/1267) and the complete KAS corpus with all three (KAS: http://hdl.handle.net/11356/1244).

Identifier
PID http://hdl.handle.net/11356/1266
Related Identifier https://rdcu.be/b7GrB
Related Identifier http://nl.ijs.si/kas/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1266
Provenance
Creator Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola; Ferme, Marko; Borovič, Mladen; Boškovič, Borko; Ojsteršek, Milan; Hrovat, Goran
Publisher Jožef Stefan Institute; Faculty of Electrical Engineering and Computer Science, University of Maribor
Publication Year 2019
Rights CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0; https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0; ACA
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/gzip; downloadable_files_count: 3
Discipline Linguistics
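
The Metadata Access URL listed above has the shape of an OAI-PMH `GetRecord` request. As a small sketch, the same URL can be assembled from its parts in Python; the function name `oai_get_record_url` is hypothetical, and the base URL and identifier are copied from this record. Fetching and parsing the response is left out.

```python
from urllib.parse import urlencode

# Base URL and identifier as given in the Metadata Access field of this record.
OAI_BASE = "http://www.clarin.si/repository/oai/request"
RECORD_ID = "oai:www.clarin.si:11356/1266"


def oai_get_record_url(identifier: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL (hypothetical helper for illustration)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        },
        safe=":/",  # keep ':' and '/' unescaped, as in the URL shown above
    )
    return f"{OAI_BASE}?{query}"


url = oai_get_record_url(RECORD_ID)
print(url)
```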