CLARIN - Repositories

Terminological multiword expressions lexicon

The Terminological Multiword Expressions Lexicon contains multiword terms extracted from various terminological sources. The entries were lemmatized and tagged according to the...

ASR database ARTUR 0.1 (transcriptions)

ARTUR is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,035 hours of speech, although only 840...

ASR database ARTUR 0.1 (audio)

ARTUR is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,035 hours of speech, although only 840...

Corpus of textbooks for learning Slovenian as L2 KUUS 1.0

The KUUS corpus comprises 17 textbooks for Slovenian as a second and foreign language published between 2002 and 2022 at the Centre for Slovene as a Second and Foreign Language...

Corpus of Slovenian textbooks ccUčbeniki 1.0

ccUčbeniki includes 32 openly available textbooks for Slovenian primary and secondary education, published by the Slovenian National Education Institute in 2014-2015. The...

Corpus of Slovenian texts for pedagogical purposes ccMAKS 1.0

MAKS (MlAdinski KorpuS, i.e. the Youth Corpus) includes texts from literature, newspapers, and, to a lesser extent, the web. The corpus was designed for the needs of the...

Q-CAT Corpus Annotation Tool 1.4

The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The...

Icelandic web corpus MaCoCu-is 1.0

The Icelandic web corpus MaCoCu-is 1.0 was built by crawling the ".is" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler...

Q-CAT Corpus Annotation Tool 1.3

The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these...

Q-CAT Corpus Annotation Tool 1.2

The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these...

Q-CAT Corpus Annotation Tool 1.1

The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these...

Q-CAT Corpus Annotation Tool 1.0

The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these...

Error-annotated developmental corpus Šolar 2.0 Error

The corpus contains 2094 texts from the corpus Šolar 2.0 (http://hdl.handle.net/11356/1214), i.e. only those in which error annotations can be found. For each text, the...

Developmental corpus (without language corrections) Šolar 2.0 Clear

Šolar 2.0 Clear is an adapted version of the Šolar 2.0 corpus, cf. http://hdl.handle.net/11356/1214. The Šolar 2.0 Clear corpus consists of texts written by students in Slovene...

Corpus of comma placement Vejica 1.3

A collection of sentences demonstrating and correcting comma usage. The sentences come from five sources: KUST (a Slovene learner corpus),...

Automatically constructed multiword lexicon srMWELex v0.5

The srMWELex lexicon is an automatically constructed lexicon of Serbian multiword expression candidates (mostly collocations) from the parsed srWaC 1.0 corpus by using the...

Slovenian parliamentary corpus SlovParl 2.0

The SlovParl corpus contains minutes of the Assembly of the Republic of Slovenia for the legislative period 1990-1992, i.e. it covers the period before, during, and after...

Training corpus ssj500k 2.0

The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....

Developmental corpus of Slovene (without language corrections) Šolar-Clear

Šolar-Clear is an adapted version of the Šolar 1.0 corpus, cf. http://hdl.handle.net/11356/1036. The Šolar(-Clear) corpus consists of texts written by students in Slovene...

xLiMe Twitter Corpus XTC 1.0.1

The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,...

4,412 datasets found