Corpus of term-annotated texts RSDO5 1.0

The RSDO5 corpus was compiled to serve as a training set for automatic term identification. It consists of 12 texts with 250,000 words and almost 38,000 manually annotated terms. The texts were published between 2000 and 2019 and comprise PhD theses (3), a scientific book based on a PhD thesis (1), graduate-level textbooks (4), and journal articles (4). They belong to the fields of biomechanics (3), linguistics (3), chemistry (3), and veterinary science (3).

In addition to the manually annotated terms, the corpus was automatically annotated with Universal Dependencies annotations, i.e. tokenisation, sentence segmentation, lemmatisation, morphological features, and dependency syntax.
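
The annotation layers listed above can be illustrated with a minimal sketch. It assumes the annotations are read from CoNLL-U, the standard tab-separated Universal Dependencies format; the record itself does not state the file layout, so this is an assumption. The sample sentence is hand-made and is not taken from RSDO5, the function name parse_conllu is hypothetical, and the sketch does not cover how the manual term annotations are encoded.

```python
# Hand-made CoNLL-U fragment (assumption: NOT text from the RSDO5 corpus).
# Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
SAMPLE = """\
# sent_id = example.1
# text = Korpus vsebuje termine.
1\tKorpus\tkorpus\tNOUN\t_\tCase=Nom|Gender=Masc|Number=Sing\t2\tnsubj\t_\t_
2\tvsebuje\tvsebovati\tVERB\t_\tNumber=Sing|Person=3|Tense=Pres\t0\troot\t_\t_
3\ttermine\ttermin\tNOUN\t_\tCase=Acc|Gender=Masc|Number=Plur\t2\tobj\t_\tSpaceAfter=No
4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line marks a sentence boundary (sentence segmentation).
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as sent_id and text
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip malformed lines, multiword ranges, empty nodes
        feats = {}
        if cols[5] != "_":
            feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
        current.append({
            "id": int(cols[0]),       # tokenisation
            "form": cols[1],
            "lemma": cols[2],         # lemmatisation
            "upos": cols[3],
            "feats": feats,           # morphological features
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],        # dependency syntax
        })
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], token["head"], token["deprel"])
```

Each token dictionary carries one field per annotation layer, so a term-identification model could read lemmas, features, and dependency heads from the same structure.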

Identifier
PID http://hdl.handle.net/11356/1400
Related Identifier http://hdl.handle.net/11356/1470
Related Identifier https://rsdo.slovenscina.eu/en/terminology-portal
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1400
Provenance
Creator Jemec Tomazin, Mateja; Trojar, Mitja; Žagar, Mojca; Atelšek, Simon; Fajfar, Tanja; Erjavec, Tomaž
Publisher ZRC SAZU
Publication Year 2021
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 4
Discipline Linguistics
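
The Metadata Access URL in this record is an OAI-PMH GetRecord request that asks for Dublin Core (oai_dc) metadata. The following is a minimal sketch of how such a request could be built and its response read in Python. The base URL, verb, metadata prefix, and identifier are taken from the record; the function names and the SAMPLE_RESPONSE document are hypothetical, and the sample is a hand-written, trimmed illustration of an OAI-PMH response, not the repository's actual output.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE_URL = "http://www.clarin.si/repository/oai/request"  # from the record
DC_NS = "http://purl.org/dc/elements/1.1/"                # Dublin Core namespace


def build_get_record_url(identifier, metadata_prefix="oai_dc", base_url=BASE_URL):
    """Build an OAI-PMH GetRecord URL for one record identifier."""
    query = urlencode(
        {"verb": "GetRecord", "metadataPrefix": metadata_prefix, "identifier": identifier},
        safe=":/",  # keep ':' and '/' readable, as in the record's URL
    )
    return f"{base_url}?{query}"


def fetch_record(url, timeout=30):
    """Download the response body as text (needs network access; not called here)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def dc_fields(xml_text):
    """Collect all Dublin Core elements of a response as {name: [values]}."""
    fields = {}
    for element in ET.fromstring(xml_text).iter():
        if element.tag.startswith("{" + DC_NS + "}"):
            name = element.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields


# Hand-written sample (assumption): real responses contain more elements.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Corpus of term-annotated texts RSDO5 1.0</dc:title>
      <dc:publisher>ZRC SAZU</dc:publisher>
      <dc:identifier>http://hdl.handle.net/11356/1400</dc:identifier>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>
"""

url = build_get_record_url("oai:www.clarin.si:11356/1400")
print(url)
print(dc_fields(SAMPLE_RESPONSE))
```

To query the live repository, the sample could be replaced with dc_fields(fetch_record(url)); the fields returned depend on what the server exposes.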