Training corpus jos1M 1.2

The jos1M corpus contains 1 million words of paragraphs sampled from the Gigafida corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated with morphosyntactic descriptions (MSDs) and lemmas, and about one fourth of the more problematic annotations have been hand-validated.

The morphosyntactic descriptions are given both in the JOS/MULTEXT-East framework (http://nl.ijs.si/ME/V6/msd/) and in the framework of Universal Dependencies for Slovene (https://universaldependencies.org/treebanks/sl_ssj/index.html).

The corpus is available in source TEI XML, with the MSDs in English or Slovene, and in two derived formats: the vertical format used by the CQP and (no)Sketch Engine concordancers, and CONLL-U, used by Universal Dependencies. Note that the corpus does not contain syntactic dependencies.
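
As a rough illustration of working with the CONLL-U variant, the sketch below reads the standard ten tab-separated columns and keeps only the word-level annotation (form, lemma, UD tag, MSD, features). The sample sentence and its tags are invented for illustration and are not copied from the corpus; the names `Token`, `read_conllu` and `SAMPLE` are hypothetical. The dependency columns are assumed to hold `_`, since the corpus carries no syntactic dependencies.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    upos: str   # Universal Dependencies part-of-speech tag
    xpos: str   # language-specific tag (here: a JOS/MULTEXT-East style MSD)
    feats: str  # UD morphological features, "_" if empty


# Invented sample in the ten-column CoNLL-U layout (not taken from jos1M).
SAMPLE = (
    "# text = To je test.\n"
    "1\tTo\tta\tDET\tPd-nsn\tCase=Nom|Gender=Neut|Number=Sing|PronType=Dem\t_\t_\t_\t_\n"
    "2\tje\tbiti\tAUX\tVa-r3s-n\tMood=Ind|Number=Sing|Person=3|Tense=Pres\t_\t_\t_\t_\n"
    "3\ttest\ttest\tNOUN\tNcmsn\tCase=Nom|Gender=Masc|Number=Sing\t_\t_\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\tZ\t_\t_\t_\t_\t_\n"
    "\n"
)


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-U formatted lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges and empty nodes.
            continue
        sentence.append(Token(cols[1], cols[2], cols[3], cols[4], cols[5]))
    if sentence:
        yield sentence


sentences = list(read_conllu(SAMPLE.splitlines()))
pairs = [(tok.form, tok.xpos) for tok in sentences[0]]
```

For a real file, the same function could be fed an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points at one of the CONLL-U files in the download.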

The texts or paragraphs of the jos1M corpus overlap with those of the ssj500k annotated corpus (http://hdl.handle.net/11356/1210), but the latter has been fully manually annotated and has had its tokenisation and sentence segmentation corrected. The texts and paragraphs in the jos1M corpus are marked if they are also included in ssj500k, and the CONLL-U is split into the part that is included in ssj500k and the part that is not. The part not included in ssj500k can serve as an additional training set, alongside ssj500k, for morphosyntactic tagging and lemmatisation.

Identifier
PID http://hdl.handle.net/11356/1213
Related Identifier http://www.lrec-conf.org/proceedings/lrec2010/summaries/139.html
Related Identifier http://hdl.handle.net/11356/1037
Related Identifier http://nl.ijs.si/jos/jos1M-en.html
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1213
Provenance
Creator Erjavec, Tomaž; Krek, Simon; Dobrovoljc, Kaja
Publisher Jožef Stefan Institute
Publication Year 2019
Rights Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0); https://creativecommons.org/licenses/by-nc/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 4
Discipline Linguistics
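
The Metadata Access entry above is an OAI-PMH `GetRecord` request. As a minimal sketch, the same URL can be assembled from its parts using only the Python standard library; the function name `get_record_url` is hypothetical, and the base URL and identifier prefix are simply taken from that entry.

```python
from urllib.parse import urlencode

# Base URL and identifier prefix as they appear in the Metadata Access entry.
OAI_BASE = "http://www.clarin.si/repository/oai/request"
OAI_PREFIX = "oai:www.clarin.si:"


def get_record_url(handle_suffix: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL for a handle suffix such as '11356/1213'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": OAI_PREFIX + handle_suffix,
    }
    # Keep ':' and '/' unescaped so the result matches the listed URL.
    return OAI_BASE + "?" + urlencode(params, safe=":/")


url = get_record_url("11356/1213")
```

Fetching the URL, for instance with `urllib.request.urlopen(url)`, would be expected to return the Dublin Core record as XML, although the availability of the endpoint is not guaranteed here.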