CLARIN - Repositories

Inflectional lexicon srLex 1.2

srLex is a large inflectional lexicon of the Serbian language where each entry consists of a (wordform, lemma, MSD, frequency, per-million frequency) 5-tuple. The (wordform, lemma,...

Inflectional lexicon hrLex 1.2

hrLex is a large inflectional lexicon of the Croatian language where each entry consists of a (wordform, lemma, MSD, frequency, per-million frequency) 5-tuple. The (wordform, lemma,...

Dataset of normalised Slovene text KonvNormSl 1.0

Data used in the experiments described in: Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content....

Inflectional lexicon hrLex 1.1

hrLex is a large inflectional lexicon of the Croatian language where each entry consists of a (wordform, lemma, MSD, frequency, per-million frequency) 5-tuple. The (wordform, lemma,...

Corpus of comma placement Vejica 1.0

A collection of sentences demonstrating and correcting comma usage. The sentences come from four sources: - KUST: a Slovene learner corpus,...

MULTEXT-East free lexicons 4.0

The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of...

Training corpus jos1M 1.1

The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This...

Lexicon of historical Slovene imp25k 1.1

The imp25k lexicon of historical Slovene was created automatically from the goo300k and foo3M annotated corpora and contains attested and manually verified word forms and their...

Training corpus ssj500k 1.3

The ssj500k training corpus is based on two training corpora built within the JOS project (https://nl.ijs.si/jos/). It contains the jos100k corpus and additional material from...

Corpus OVER

Many studies in cognitive linguistics have analysed the semantics of 'over', notably the semantics associated with 'over' as a preposition. Most of them generally conclude that...

AlbMoRe Movie Reviews in Albanian

AlbMoRe is a sentiment analysis corpus of movie reviews in Albanian, consisting of 800 records in CSV format. Each record includes a text review retrieved from IMDb and...

Optimal Reference Translations from English to Czech

This corpus contains annotations of translation quality from English to Czech in seven categories at both the segment and the document level. There are 20 documents in total, each with...

DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking

We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains...

SYN v4: large corpus of written Czech

Corpus of contemporary written (printed) Czech sized 3.6 GW (i.e. 4.3 billion tokens). It covers mostly the period of 1990–2014 and it is a traditional corpus (as opposed to the...

HamleDT 2.0

HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a...

Poeti d’Italia in lingua latina

The Italian Poetry in Latin Project was conceived with the aim of locating, assessing, collating and computerizing Latin poems produced in Italy or in Italian cultural...

Musisque Deoque (MQDQ)

Musisque Deoque, the whole corpus of the Latin poets, from the beginnings to the end of the 7th century, was established at the end of 2005 with the main goal of creating a...

MT@BZ annotation guidelines v1.0

The MT@BZ annotation guidelines are guidelines for assessing the quality of legal Italian-German machine translation. In particular, they cover the South Tyrolean German variety. They...

MT@BZ translation corpus v1.0

MT@BZ is a translation corpus that consists of 52 decrees published by the Autonomous Province of Bolzano (South Tyrol) aligned with their machine-translated versions. More...

ACTER (Annotated Corpora for Term Extraction Research) v1.5

ACTER (Annotated Corpora for Term Extraction Research) is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains...

4,412 datasets found