CLARIN - Repositories

Tweet code-switching corpus Janes-Preklop 1.0

Janes-Preklop is a corpus of Slovene tweets that is manually annotated for code-switching (the use of words from two or more languages within one sentence or utterance),...

Twitter corpus Janes-Tweet 1.0

Janes-Tweet is an annotated corpus of almost 10 million tweets posted from 2013-06 to 2017-06 by approx. 9,000 users that tweet mostly in Slovene. The corpus is structured into...

News comment corpus Janes-News 1.0

Janes-News is an annotated corpus of comments on online news articles from websites rtvslo.si, mladina.si, and reporter.si from the period 2007-03 to 2015-01. The corpus is...

Forum corpus Janes-Forum 1.0

Janes-Forum is an annotated corpus of Slovene forums from websites med.over.net, avtomobilizem.com, and kvarkadabra.net from the period 2001-02 to 2015-01. The corpus is...

Blog post and comment corpus Janes-Blog 1.0

Janes-Blog is an annotated corpus of Slovene blogs from websites rtvslo.si and publishwall.si from the period 2006-10 to 2016-01. The corpus is structured into individual texts...

Wikipedia talk corpus Janes-Wiki 1.0

Janes-Wiki is an annotated corpus of discussion pages from the Slovene Wikipedia from the period 2003-08 to 2017-06. The corpus contains page and user talks and is structured...

CMC training corpus Janes-Tag 2.0

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Croatian Twitter training corpus ReLDI-NormTag-hr 1.1

ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...

Serbian Twitter training corpus ReLDI-NormTag-sr 1.1

ReLDI-NormTag-sr 1.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...

Tweet comma corpus Janes-Vejica 1.0

Janes-Vejica is a corpus of Slovene tweets where commas are annotated with the reason for their (in)correct use, according to the supplied typology. The corpus was sampled from...

CMC training corpus Janes-Tag 1.2

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

CMC training corpus Janes-Norm 1.2

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...

CMC shortening corpus Janes-Kratko 1.0

Janes-Kratko is a corpus of Slovene tweets manually annotated with shortening phenomena according to the supplied typology covering different types of spelling, lexical and...

CMC training corpus Janes-Syn 1.0

Janes-Syn is a syntactically annotated corpus of Slovene tweets and is meant as a gold-standard training and testing dataset for syntactic annotation of Slovene...

Serbian web corpus srWaC 1.1

The Serbian web corpus srWaC was built by crawling the .rs top-level domain in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration,...

Bosnian web corpus bsWaC 1.1

The Bosnian web corpus bsWaC was built by crawling the .ba top-level domain in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration,...

Croatian web corpus hrWaC 2.1

The Croatian web corpus hrWaC was built by crawling the .hr top-level domain in 2011 and again in 2014. The corpus was near-deduplicated on paragraph level, normalised via...

Japanese web corpus with difficulty levels jpWaC-L 1.0

The corpus contains over 300 million words, with annotations of words and sentences describing their difficulty levels. Words are assigned levels of difficulty according to the...

Learners' corpus Šolar 1.0

Šolar consists of 2,703 texts written by students in Slovene secondary schools (age 15-19) and pupils in the 7th-9th grade of primary school (13-15), with a small percentage...

Digital library and corpus of historical Slovene IMP 1.1

The IMP digital library contains historical Slovene books and other publications, together 658 texts with over 45,000 pages from the period 1584-1919. Each text contains...

4,412 datasets found