CLARIN - Repositories

Reference corpus of historical Slovene goo300k 1.2

goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584-1899. Each text...

Czech RST Discourse Treebank 1.0

The Czech RST Discourse Treebank 1.0 (CzRST-DT 1.0) is a dataset of 54 Czech journalistic texts manually annotated using Rhetorical Structure Theory (RST). Each text...

Pan-Latin Geothermal Energy Lexicon

The Pan-Latin Geothermal Energy Lexicon (Lessico panlatino dell’energia geotermica), developed within the Realiter network, contains the basic terms related to geothermal energy...

The CLASSLA-Stanza model for UD dependency parsing of standard Bulgarian 2.1

The model for UD dependency parsing of standard Bulgarian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the UD-parsed portion of...

The CLASSLA-Stanza model for lemmatisation of standard Bulgarian 2.1

The model for lemmatisation of standard Bulgarian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the BulTreeBank training corpus...

The CLASSLA-Stanza model for morphosyntactic annotation of standard Bulgarian...

This model for morphosyntactic annotation of standard Bulgarian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the BulTreeBank...

The CLASSLA-Stanza model for lemmatisation of standard Macedonian 2.1

The model for lemmatisation of standard Macedonian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the 1984 training...

The CLASSLA-Stanza model for morphosyntactic annotation of standard Macedonia...

This model for morphosyntactic annotation of standard Macedonian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the 1984 training...

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Bulg...

This model for morphosyntactic annotation of standard Bulgarian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the...

The CLASSLA-StanfordNLP model for lemmatisation of standard Macedonian 1.0

The model for lemmatisation of standard Macedonian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the 1984 training...

The CLASSLA-StanfordNLP model for lemmatisation of standard Bulgarian 1.1

The model for lemmatisation of standard Bulgarian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the BulTreeBank...

The CLASSLA-StanfordNLP model for UD dependency parsing of standard Bulgarian...

The model for UD dependency parsing of standard Bulgarian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the...

GECCC Grammar Error Correction Corpus for Czech (2022-09-28)

Grammar Error Correction Corpus for Czech (GECCC) consists of 83,058 sentences and covers four diverse domains, including essays written by native students, informal website...

GECCC Grammar Error Correction Corpus for Czech

Grammar Error Correction Corpus for Czech (GECCC) consists of 83,058 sentences and covers four diverse domains, including essays written by native students, informal website...

Q-CAT Corpus Annotation Tool 1.5

Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora that also supports advanced queries over those annotations. The...

Greek web corpus MaCoCu-el 1.0

The Greek web corpus MaCoCu-el 1.0 was built by crawling the ".gr", ".ελ", ".cy" and ".eu" internet top-level domains in 2023, extending the crawl dynamically to other domains...

Ukrainian web corpus MaCoCu-uk 1.0

The Ukrainian web corpus MaCoCu-uk 1.0 was built by crawling the ".ua" and ".укр" internet top-level domains in 2022, extending the crawl dynamically to other domains as well....

Catalan web corpus MaCoCu-ca 1.0

The Catalan web corpus MaCoCu-ca 1.0 was built by crawling the ".cat", ".es", ".ad", ".fr", ".it" and ".eu" internet top-level domains in 2022, extending the crawl dynamically...

Icelandic web corpus MaCoCu-is 2.0

The Icelandic web corpus MaCoCu-is 2.0 was built by crawling the ".is" internet top-level domain in 2021 and 2023, extending the crawl dynamically to other domains as well. The...

Croatian linguistic training corpus hr500k 2.0

The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and...

4,412 datasets found