Training corpus ssj500k 1.3

Dataset

The ssj500k training corpus is based on two training corpora built within the JOS project (https://nl.ijs.si/jos/). It combines the jos100k corpus with additional material from the jos1M corpus, forming a training corpus of 500,000 words that is manually checked and annotated on the levels of tokenization, segmentation, morphosyntactic tagging, syntactic dependency parsing and named entities. The corpus uses the JOS morphosyntactic tagset with 1,902 tags and a dependency scheme with 10 labels. The part of the corpus annotated with dependency relations contains 11,411 sentences; named entities are annotated in the original jos100k part of the corpus.

Identifier
PID	http://hdl.handle.net/11356/1029
Related Identifier	http://hdl.handle.net/11356/1052
Related Identifier	http://eng.slovenscina.eu/tehnologije/ucni-korpus
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1029

Provenance
Creator	Krek, Simon; Erjavec, Tomaž; Dobrovoljc, Kaja; Može, Sara; Ledinek, Nina; Holz, Nanika
Publisher	Centre for Language Resources and Technologies, University of Ljubljana
Publication Year	2013
Rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); PUB; https://creativecommons.org/licenses/by-nc-sa/4.0/
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 3
Discipline	Linguistics