-
Tweet code-switching corpus Janes-Preklop 1.0
Janes-Preklop is a corpus of Slovene tweets that is manually annotated for code-switching (the use of words from two or more languages within one sentence or utterance),... -
Sentiment Annotated Dataset of Croatian News
We present a collection of sentiment annotations for news articles (article links) in Croatian language. A set of 2025 news articles was gathered from 24sata, one of the leading... -
Training corpus hr500k 1.0
The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and... -
CMC training corpus Janes-Norm 3.0
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs,... -
CMC training corpus Janes-Tag 2.1
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence... -
Training corpus SUK 1.1
The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with... -
MULTEXT-East "1984" annotated corpus 4.0
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original... -
Choice of plausible alternatives dataset in Serbian COPA-SR
The COPA-SR dataset (Choice of plausible alternatives in Serbian) is a translation of the English COPA dataset (https://people.ict.usc.edu/~gordon/copa.html) by following the... -
Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.1
ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
Training corpus jos1M 1.2
The jos1M corpus contains 1 million words of sampled paragraphs from the Gigafida corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This... -
Q-CAT Corpus Annotation Tool 1.4
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The... -
xLiMe Twitter Corpus XTC 1.0.1
The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,... -
Dataset of Slovene idiomatic expressions SloIE
SloIE is a manually labelled dataset of Slovene idiomatic expressions. It contains 29,400 sentences with 75 different expressions that can occur with either a literal or an... -
Corpus of comma placement Vejica 1.3
A collection of sentences demonstrating and correcting comma usage. The sentences come from five sources: - KUST: a Slovene learner corpus,... -
Training corpus ssj500k 1.4
The ssj500k training corpus contains 500,000 words, manually annotated on the levels of tokenization, sentence segmentation, morphosyntactic tagging, lemmatisation, named... -
Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.1
ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
Q-CAT Corpus Annotation Tool 1.5
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The... -
Croatian Twitter training corpus ReLDI-NormTag-hr 1.0
ReLDI-NormTag-hr 1.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word... -
Corpus of term-annotated texts RSDO5 1.0
The RSDO5 corpus was compiled in order to serve as a training set for automatic term identification. It consists of 12 texts with 250,000 words and almost 38,000 manually... -
CMC shortening corpus Janes-Kratko 1.0
Janes-Kratko is a corpus of Slovene tweets manually annotated with shortening phenomena according to the supplied typology covering different types of spelling, lexical and...