- Q-CAT Corpus Annotation Tool 1.1
  The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these...
- Post-edited and error annotated machine translation corpus PE²rr 1.0
  The PE²rr corpus contains source language texts from different domains along with their automatically generated translations into several morphologically rich languages, their...
- Training corpus SETimes.SR 1.0
  The SETimes.SR training corpus contains 86 726 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic...
- Slovenian Twitter hate speech dataset IMSyPP-sl
  A hand-labeled training (50,000 tweets labeled twice) and evaluation set (10,000 tweets labeled twice) for hate speech on Slovenian Twitter. The data files contain tweet IDs,...
- Training corpus ssj500k 2.3
  The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....
- Slovene Web genre identification corpus GINCO 1.0
  The Slovene Web genre identification corpus GINCO 1.0 contains web texts, manually annotated with genre, from two Slovene web corpora, the slWaC 2.0 corpus, crawled in 2014, and...
- CMC training corpus Janes-Norm 1.1
  Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...
- Croatian Twitter training corpus ReLDI-NormTag-hr 1.1
  ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...
- Corpus of term-annotated texts RSDO5 1.1
  The RSDO5 corpus was compiled in order to serve as a training set for automatic term identification. It consists of 12 texts with 250,000 words and almost 38,000 manually...
- Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0
  ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
- List of formulaic sequences in standard written Slovenian
  This document contains 1,891 formulaic sequences in standard written Slovenian, i.e. frequently recurring strings of two to five words, manually annotated for syntactic...
- Choice of plausible alternatives dataset in Macedonian COPA-MK
  The COPA-MK dataset (Choice of plausible alternatives in Macedonian) is a translation of the English COPA dataset (https://people.ict.usc.edu/~gordon/copa.html) by following the...
- Q-CAT Corpus Annotation Tool 1.3
  The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these...
- Training corpus SUK 1.0
  The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with...
- English-Slovenian text genre dataset X-GENRE
  The X-GENRE dataset comprises almost 3,000 web texts in English and Slovenian, manually-annotated with genre labels. The dataset allows for automated genre identification and...
- Macedonian linguistic training corpus SETimes.MK 0.1
  The SETimes.MK corpus is a sample of 570 sentences from the now unavailable setimes.com website of news articles on topics of South-Eastern Europe. The sentences were manually...
- Annotated Corpus of Pre-Standardized Balkan Slavic Literature 1.1
  The corpus contains 23 linguistically annotated samples of "damaskini" and other Balkan Slavic manuscripts and print editions from the 15th-19th century, together with over 50...
- List of formulaic sequences in spoken Slovenian
  This document contains 2,374 formulaic sequences in spoken Slovenian, i.e. frequently recurring strings of two to five words, manually annotated for syntactic structure,...
- Terminology identification dataset KAS-term 1.0
  The dataset contains 22,950 term candidates extracted from 15 Slovenian PhD theses. The term candidates are of length 1 to 4, extracted via morphosyntactic patterns and the...
- CMC training corpus Janes-Tag 1.2
  Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...