Training corpus ssj500k 1.3

PID

The ssj500k training corpus is based on two training corpora built within the JOS project (https://nl.ijs.si/jos/). It contains the jos100k corpus and additional material from the jos1M corpus forming a training corpus with 500,000 words, manually checked and annotated on the levels of tokenization, segmentation, morphosyntactic tagging, syntactic dependency parsing and named entities. The ssj500k corpus uses the JOS morphosyntactic tagset with 1,902 tags and dependencies with 10 labels. The part of the corpus annotated with dependency relations contains 11,411 sentences, named entities are annotated in the original jos100k part of the corpus.

Identifier
PID http://hdl.handle.net/11356/1029
Related Identifier http://hdl.handle.net/11356/1052
Related Identifier http://eng.slovenscina.eu/tehnologije/ucni-korpus
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1029
Provenance
Creator Krek, Simon; Erjavec, Tomaž; Dobrovoljc, Kaja; Može, Sara; Ledinek, Nina; Holz, Nanika
Publisher Centre for Language Resources and Technologies, University of Ljubljana
Publication Year 2013
Rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); PUB; https://creativecommons.org/licenses/by-nc-sa/4.0/
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 3
Discipline Linguistics