-
Terminological multiword expressions lexicon
The Terminological Multiword Expressions Lexicon contains multiword terms extracted from various terminological sources. The entries were lemmatized and tagged according to the... -
ASR database ARTUR 0.1 (transcriptions)
ARTUR is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,035 hours of speech, although only 840... -
ASR database ARTUR 0.1 (audio)
ARTUR is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,035 hours of speech, although only 840... -
Corpus of textbooks for learning Slovenian as L2 KUUS 1.0
The KUUS corpus comprises 17 textbooks for Slovenian as a second and foreign language published between 2002 and 2022 at the Centre for Slovene as a Second and Foreign Language... -
Corpus of Slovenian textbooks ccUčbeniki 1.0
ccUčbeniki includes 32 openly available textbooks for Slovenian primary and secondary education, published by the Slovenian National Education Institute in 2014-2015. The... -
Corpus of Slovenian texts for pedagogical purposes ccMAKS 1.0
MAKS (MlAdinski KorpuS, i.e. the Youth Corpus) includes texts from literature, newspapers, and, to a lesser extent, the web. The corpus was designed for the needs of the... -
Q-CAT Corpus Annotation Tool 1.4
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The... -
Icelandic web corpus MaCoCu-is 1.0
The Icelandic web corpus MaCoCu-is 1.0 was built by crawling the ".is" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler... -
Q-CAT Corpus Annotation Tool 1.3
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these... -
Q-CAT Corpus Annotation Tool 1.2
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these... -
Q-CAT Corpus Annotation Tool 1.1
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these... -
Q-CAT Corpus Annotation Tool 1.0
The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these... -
Error-annotated developmental corpus Šolar 2.0 Error
The corpus contains 2094 texts from the corpus Šolar 2.0 (http://hdl.handle.net/11356/1214), i.e. only those in which error annotations can be found. For each text, the... -
Developmental corpus (without language corrections) Šolar 2.0 Clear
Šolar 2.0 Clear is an adapted version of the Šolar 2.0 corpus, cf. http://hdl.handle.net/11356/1214. The Šolar 2.0 Clear corpus consists of texts written by students in Slovene... -
Corpus of comma placement Vejica 1.3
A collection of sentences demonstrating and correcting comma usage. The sentences come from five sources: - KUST: a Slovene learner corpus,... -
Automatically constructed multiword lexicon srMWELex v0.5
The srMWELex lexicon is an automatically constructed lexicon of Serbian multiword expression candidates (mostly collocations) from the parsed srWaC 1.0 corpus by using the... -
Slovenian parliamentary corpus SlovParl 2.0
The SlovParl corpus contains minutes of the Assembly of the Republic of Slovenia for the legislative period 1990-1992, i.e. it covers the period before, during, and after... -
Training corpus ssj500k 2.0
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation.... -
Developmental corpus of Slovene (without language corrections) Šolar-Clear
Šolar-Clear is an adapted version of the Šolar 1.0 corpus, cf. http://hdl.handle.net/11356/1036. The Šolar(-Clear) corpus consists of texts written by students in Slovene... -
xLiMe Twitter Corpus XTC 1.0.1
The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,...