Dataset - B2FIND

The CLASSLA-StanfordNLP model for lemmatisation of non-standard Croatian 1.1

The model for lemmatisation of non-standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k...

The CLASSLA-StanfordNLP model for morphosyntactic annotation of non-standard ...

This model for morphosyntactic annotation of non-standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on...

Dataset and baseline model of moderated content FRENK-STYRIA-24sata 1.0

FRENK-STYRIA-24sata is a dataset of moderated newspaper comments from the website 24sata.hr with metadata on the time of publishing, user identifier, thread identifier and...

Twitter corpus Janes-Tweet 1.0

Janes-Tweet is an annotated corpus of almost 10 million tweets posted from 2013-06 to 2017-06 by approximately 9,000 users who tweet mostly in Slovene. The corpus is structured into...

CMC training corpus Janes-Norm 1.1

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...

Croatian Twitter training corpus ReLDI-NormTag-hr 1.1

ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...

News comment corpus Janes-News 1.0

Janes-News is an annotated corpus of comments on online news articles from the websites rtvslo.si, mladina.si, and reporter.si, covering the period 2007-03 to 2015-01. The corpus is...

Forum corpus Janes-Forum 1.0

Janes-Forum is an annotated corpus of Slovene forums from the websites med.over.net, avtomobilizem.com, and kvarkadabra.net, covering the period 2001-02 to 2015-01. The corpus is...

Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0

ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

The CLASSLA-StanfordNLP model for morphosyntactic annotation of non-standard ...

This model for morphosyntactic annotation of non-standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on...

The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Slove...

This model for morphosyntactic annotation of non-standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training...

The CLASSLA-StanfordNLP model for lemmatisation of non-standard Slovenian 1.1

The model for lemmatisation of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k...

Dataset and baseline model of moderated content FRENK-MMC-RTV 1.0

FRENK-MMC-RTV is a dataset of moderated newspaper comments from the website rtvslo.si with metadata on the time of publishing, user identifier, thread identifier and whether the...

The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Croat...

This model for morphosyntactic annotation of non-standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k...

CMC training corpus Janes-Tag 1.2

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Tweet code-switching corpus Janes-Preklop 1.0

Janes-Preklop is a corpus of Slovene tweets that is manually annotated for code-switching (the use of words from two or more languages within one sentence or utterance),...

The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Serbi...

This model for morphosyntactic annotation of non-standard Serbian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SETimes.SR...

The CLASSLA-StanfordNLP model for morphosyntactic annotation of non-standard ...

This model for morphosyntactic annotation of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on...

CMC training corpus Janes-Norm 3.0

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs,...

Corpus of Montenegrin language-related news comments MetaLangNEWS-COMMENTS-Me

A comprehensive corpus of user comments on online news articles on the topic of language from major Montenegrin daily newspapers and news portals, published in the five-year...
