Developmental corpus Šolar 3.0

Dataset

The Developmental corpus Šolar consists of 5,485 texts written by students in Slovenian secondary schools (aged 15-19) and pupils in grades 7-9 of primary school (aged 13-15), with a small percentage also from grade 6. For each text, the corpus provides the school type (primary or secondary), subject, level (grade or year), type of text, region, and date of production. School essays form the majority of the corpus; the other material consists of texts created during lessons, such as text recapitulations, descriptions, and examples of formal applications.

Part of the corpus (2,094 texts) is annotated with teachers' corrections using a system of labels described in the attached document (in Slovenian). Teacher corrections were part of the original files and reflect real classroom situations of essay marking. Corrections were then inserted into texts by annotators and subsequently categorized.

The corpus was linguistically annotated with the CLASSLA v1.1.1 pipeline (https://github.com/clarinsi/classla/) at the levels of tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD-tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).

The corpus is available in TEI format, in which the original and corrected versions of the texts are encoded separately and links carrying error labels give the relations between the two. The corpus is also available in the CoNLL-U and JSON formats, and as vertical files for use with Sketch Engine-type concordancers.
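As a rough illustration of working with the CoNLL-U release, the sketch below reads sentences from CoNLL-U text using only the Python standard library. The column names follow the general CoNLL-U specification; the two-sentence sample is invented for this example and is not taken from the corpus, and most of its columns are left as "_" placeholders. The function name read_conllu is hypothetical, and the actual corpus files may carry additional comment lines or column values not shown here.

```python
from typing import Dict, Iterator

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Invented sample (NOT from the corpus); unused columns are "_" placeholders.
SAMPLE = "\n".join([
    "# sent_id = demo.1",
    "# text = Pišem spis.",
    "1\tPišem\tpisati\tVERB\t_\t_\t_\t_\t_\t_",
    "2\tspis\tspis\tNOUN\t_\t_\t_\t_\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_",
    "",
    "# sent_id = demo.2",
    "# text = Berem.",
    "1\tBerem\tbrati\tVERB\t_\t_\t_\t_\t_\tSpaceAfter=No",
    "2\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_",
    "",
])


def read_conllu(text: str) -> Iterator[Dict]:
    """Yield one dict per sentence with its comments and token rows."""
    comments, tokens = {}, []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield {"comments": comments, "tokens": tokens}
            comments, tokens = {}, []
        elif line.startswith("#"):
            # Comment lines have the form "# key = value".
            key, _, value = line[1:].partition("=")
            comments[key.strip()] = value.strip()
        else:
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:
        # Handle a final sentence that is not followed by a blank line.
        yield {"comments": comments, "tokens": tokens}


sentences = list(read_conllu(SAMPLE))
lemmas = [[tok["lemma"] for tok in sent["tokens"]] for sent in sentences]
```

To process a real file, the same function could be called on the contents of one of the unpacked CoNLL-U files, read with UTF-8 encoding.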

The previous version 2.0 was also available in two separate versions: Šolar Clear 2.0 (http://hdl.handle.net/11356/1219), with the students' texts without teacher corrections, and Šolar Error (http://hdl.handle.net/11356/1231), with only those sentences that have teacher corrections. Compared with version 2.0, the current version has a different encoding, its error annotations were manually edited in approximately 350 texts, and its linguistic annotation was performed with a better tool.

Identifier
PID	http://hdl.handle.net/11356/1589
Related Identifier	http://hdl.handle.net/11356/1214
Related Identifier	http://hdl.handle.net/11356/1231
Related Identifier	https://rsdo.slovenscina.eu/jezikovni-viri
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1589
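The Metadata Access address above is an OAI-PMH GetRecord request. The following minimal sketch shows how such a request URL is composed from its parts, using only the Python standard library. The helper name get_record_url is hypothetical, and the assumption that other records use the same "oai:www.clarin.si:" identifier prefix is inferred from this one URL rather than from repository documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base address taken from the Metadata Access field of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"


def get_record_url(handle: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL for a handle such as '11356/1589'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        # Assumed identifier scheme, inferred from the URL in this record.
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the listed URL.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = get_record_url("11356/1589")
# Split the URL back into its query parameters to inspect them.
query = parse_qs(urlsplit(url).query)
```

The resulting address can be opened in a browser or fetched with any HTTP client; the response is an XML document containing the Dublin Core metadata of the record.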

Provenance
Creator	Arhar Holdt, Špela; Rozman, Tadeja; Stritar Kučuk, Mojca; Krek, Simon; Krapš Vodopivec, Irena; Stabej, Marko; Pori, Eva; Goli, Teja; Lavrič, Polona; Laskowski, Cyprian; Kocjančič, Polonca; Klemenc, Bojan; Krsnik, Luka; Kosem, Iztok
Publisher	Centre for Language Resources and Technologies, University of Ljubljana
Publication Year	2022
Rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); https://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; application/pdf; downloadable_files_count: 4
Discipline	Linguistics