CLARIN - Repositories

SentiLex-PT 02

SentiLex-PT is a sentiment lexicon for Portuguese, made up of 7,014 lemmas and 82,347 inflected forms. In detail, the lexicon describes: 4,779 (16,863) adjectives, 1,081...

SemFi: Finnish Semantics with Syntactic Relations

Context: This dataset is covered in detail in the following publication: Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost...

Movie Title Puns

Context: The data is based on the following paper on pun generation: Hämäläinen, M., & Alnajjar, K. (2019). Modelling the Socialization of Creative Agents in a...

Finnish Dialect Normalization Model

This is an OpenNMT-py model for normalizing spoken Finnish text into written Finnish. For usage, please see https://github.com/mikahama/murre/. This model has been produced in...

Exploring genealogical blends_Online Corpus

The online corpus supplement to the paper "Exploring genealogical blends: the Surinamese Creole Cluster and the Virgin Islands Dutch Creole Cluster", published in the CLARIN...

s.morfcorpus.6ec19594.20131227-2309

WMT 2013 Crawled News monolingual corpus, Czech, segmented by Morfessor

El mejor conjunto de datos para identificación del sarcasmo (The Best Dataset for Sarcasm Identification)

This corpus contains all the utterances from two episodes of South Park (Latin American dub) and two episodes of Archer (Spain dub). Each utterance has been annotated...

SemKpv - Semantic Database for Komi-Zyrian

This SQLite database contains Komi-Zyrian lemmas and their frequencies in a large corpus. The lemmas are linked to each other based on the syntactic relations they have had in the...

Celebrities and Famous People, and their Properties

Context: This dataset is based on the work presented in the following publication. Please cite it if you use the data in an academic publication: Alnajjar, K., Hämäläinen, M.,...

NoticIA

We present NoticIA, a dataset consisting of 850 Spanish news articles featuring prominent clickbait headlines, each paired with high-quality, single-sentence generative...

UralicNLP - The NLP library for Uralic languages

UralicNLP is a natural language processing library aimed mainly at Uralic languages. UralicNLP can produce morphological analysis, generate morphological forms, lemmatize...

Skolt Sami - North Sami Cognates

A human-curated list of Skolt Sami (sms) - North Sami (sme) cognates found with an automatic method described in: Hämäläinen, M., & Rueter, J. (2019). Finding Sami Cognates...

Laburpen corpusa The Basque Summaries Corpus

School summaries obtained from Unai Atutxa's thesis (Atutxa, 2022) are available under the CC BY-NC 4.0 license. A total of 1676 extractions and abstractions have been...

SemMdf - Semantic Database for Moksha

This SQLite database contains Moksha lemmas and their frequencies in a large corpus. The lemmas are linked to each other based on the syntactic relations they have had in the...

Psycholinguistic Experiment Video

This is a video recording that is being used in psycholinguistic experiments.

CATUC: Corpus académico de textos universitarios en castellano

This research was conducted on a corpus of texts produced by first-year undergraduate students at the University of the Basque Country (UPV/EHU). The corpus is called CATUC:...

Model of English OCR Post-Correction

This is an OpenNMT-py model for OCR post-correction in English. For usage, see https://github.com/mikahama/natas. This is part of the following publication: Mika Hämäläinen, and...

Prague Dependency Treebank 2.0 Sample Data

This is a small sample dataset from PDT 2.0. As such, it can be released under a very permissive CC-BY license.

FinMeter - Tools for assessing Finnish poetry

FinMeter is a library for analyzing poetry in Finnish. It handles typical rhyming devices such as alliteration, assonance, and consonance, as well as Japanese meters and Kalevala meter. It can also...

Finnish Words and their Concreteness Values

Context: This data has been produced for poem generation in Finnish. If you use this dataset in your publication, please cite: Hämäläinen, M., & Alnajjar, K. (2019). Let’s...

4,731 datasets found