- The CLASSLA-StanfordNLP model for lemmatisation of non-standard Croatian 1.1
The model for lemmatisation of non-standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k...
- The CLASSLA-StanfordNLP model for morphosyntactic annotation of non-standard ...
This model for morphosyntactic annotation of non-standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on...
- Dataset and baseline model of moderated content FRENK-STYRIA-24sata 1.0
FRENK-STYRIA-24sata is a dataset of moderated newspaper comments from the website 24sata.hr with metadata on the time of publishing, user identifier, thread identifier and...
- Twitter corpus Janes-Tweet 1.0
Janes-Tweet is an annotated corpus of almost 10 million tweets posted from 2013-06 to 2017-06 by approx. 9,000 users that tweet mostly in Slovene. The corpus is structured into...
- CMC training corpus Janes-Norm 1.1
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...
- Croatian Twitter training corpus ReLDI-NormTag-hr 1.1
ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...
- News comment corpus Janes-News 1.0
Janes-News is an annotated corpus of comments on online news articles from websites rtvslo.si, mladina.si, and reporter.si from the period 2007-03 to 2015-01. The corpus is...
- Forum corpus Janes-Forum 1.0
Janes-Forum is an annotated corpus of Slovene forums from websites med.over.net, avtomobilizem.com, and kvarkadabra.net from the period 2001-02 to 2015-01. The corpus is...
- Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0
ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
- The CLASSLA-StanfordNLP model for morphosyntactic annotation of non-standard ...
This model for morphosyntactic annotation of non-standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on...
- The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Slove...
This model for morphosyntactic annotation of non-standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training...
- The CLASSLA-StanfordNLP model for lemmatisation of non-standard Slovenian 1.1
The model for lemmatisation of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k...
- Dataset and baseline model of moderated content FRENK-MMC-RTV 1.0
FRENK-MMC-RTV is a dataset of moderated newspaper comments from the website rtvslo.si with metadata on the time of publishing, user identifier, thread identifier and whether the...
- The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Croat...
This model for morphosyntactic annotation of non-standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k...
- CMC training corpus Janes-Tag 1.2
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
- Tweet code-switching corpus Janes-Preklop 1.0
Janes-Preklop is a corpus of Slovene tweets that is manually annotated for code-switching (the use of words from two or more languages within one sentence or utterance),...
- The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Serbi...
This model for morphosyntactic annotation of non-standard Serbian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SETimes.SR...
- The CLASSLA-StanfordNLP model for morphosyntactic annotation of non-standard ...
This model for morphosyntactic annotation of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on...
- CMC training corpus Janes-Norm 3.0
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs,...
- Corpus of Montenegrin language-related news comments MetaLangNEWS-COMMENTS-Me
A comprehensive corpus of user comments on online news articles on the topic of language from major Montenegrin daily newspapers and news portals, published in the five-year...