- CMC training corpus Janes-Tag 2.1
  Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
- Blog post and comment corpus Janes-Blog 1.0
  Janes-Blog is an annotated corpus of Slovene blogs from websites rtvslo.si and publishwall.si from the period 2006-10 to 2016-01. The corpus is structured into individual texts...
- Brexit stance annotated tweets
  The corpus contains over 4.5 million tweets (tweet IDs) automatically labeled by a machine learning program with stance regarding Brexit: Positive (supporting Brexit), Negative...
- The CLASSLA-StanfordNLP model for named entity recognition of non-standard Se...
  This model for named entity recognition of non-standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the...
- Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.1
  ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
- xLiMe Twitter Corpus XTC 1.0.1
  The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,...
- The CLASSLA-StanfordNLP model for lemmatisation of non-standard Serbian 1.0
  The model for lemmatisation of non-standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR...
- Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.1
  ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
- Annotated corpus of Croatian language-related news comments MetaLangNEWS-COMM...
  A comprehensive corpus of user comments on online news articles on the topic of language from major Croatian daily newspapers and news portals, published in the five-year period...
- The CLASSLA-StanfordNLP model for lemmatisation of non-standard Slovenian 1.0
  The model for lemmatisation of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k...
- Croatian Twitter training corpus ReLDI-NormTag-hr 1.0
  ReLDI-NormTag-hr 1.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...
- CMC shortening corpus Janes-Kratko 1.0
  Janes-Kratko is a corpus of Slovene tweets manually annotated with shortening phenomena according to the supplied typology covering different types of spelling, lexical and...
- Corpus of Bosnia and Herzegovina language-related news comments MetaLangNEWS-...
  A comprehensive corpus of user comments on online news articles on the topic of language from major daily newspapers and news portals in Bosnia and Herzegovina, published in the...
- Annotated corpus of Serbian language-related news comments MetaLangNEWS-COMME...
  A comprehensive corpus of user comments on online news articles on the topic of language from major Serbian daily newspapers and news portals, published in the five-year period...
- Serbian Twitter training corpus ReLDI-NormTag-sr 1.1
  ReLDI-NormTag-sr 1.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...
- CMC training corpus Janes-Norm 1.2
  Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...
- The CLASSLA-StanfordNLP model for lemmatisation of non-standard Croatian 1.0
  The model for lemmatisation of non-standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k...
- The CLASSLA-Stanza model for lemmatisation of non-standard Slovenian 2.1
  This model for lemmatisation of non-standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus...
- CMC training corpus Janes-Tag 1.1
  Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
- CMC training corpus Janes-Tag 3.0
  Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blogs,...