-
SentiLex-PT 02
SentiLex-PT is a sentiment lexicon for Portuguese, made up of 7,014 lemmas and 82,347 inflected forms. In detail, the lexicon describes: 4,779 (16,863) adjectives, 1,081... -
SemFi: Finnish Semantics with Syntactic Relations
Context: This dataset is covered in detail in the following publication: Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost... -
Movie Title Puns
Context: The data is based on the following paper on pun generation: Hämäläinen, M., & Alnajjar, K. (2019). Modelling the Socialization of Creative Agents in a... -
Finnish Dialect Normalization Model
This is an OpenNMT-py model for normalizing spoken Finnish text into written Finnish. For usage, please see https://github.com/mikahama/murre/. This model has been produced in... -
Exploring genealogical blends_Online Corpus
The online corpus supplement to the paper "Exploring genealogical blends: the Surinamese Creole Cluster and the Virgin Islands Dutch Creole Cluster", published in the CLARIN... -
s.morfcorpus.6ec19594.20131227-2309
WMT 2013 Crawled News monolingual corpus, Czech, segmented by Morfessor -
El mejor conjunto de datos para identificación del sarcasmo
This corpus contains all the utterances from two episodes of South Park (Latin American dub) and two episodes of Archer (dub for Spain). Each utterance has been annotated... -
SemKpv - Semantic Database for Komi-Zyrian
This SQLite database contains Komi-Zyrian lemmas and their frequencies in a big corpus. The lemmas are linked to each other based on the syntactic relations they have had in the... -
Celebrities and Famous People, and their Properties
Context: This dataset is based on the work presented in the following publication; please cite it if you use the data in an academic publication: Alnajjar, K., Hämäläinen, M.,... -
NoticIA
We present NoticIA, a dataset consisting of 850 Spanish news articles featuring prominent clickbait headlines, each paired with high-quality, single-sentence generative... -
UralicNLP - The NLP library for Uralic languages
UralicNLP is a natural language processing library targeted mainly for Uralic languages. UralicNLP can produce morphological analysis, generate morphological forms, lemmatize... -
Skolt Sami - North Sami Cognates
A human-curated list of Skolt Sami (sms) - North Sami (sme) cognates found with an automatic method described in: Hämäläinen, M., & Rueter, J. (2019). Finding Sami Cognates... -
Laburpen corpusa The Basque Summaries Corpus
School summaries obtained from Unai Atutxa's thesis (Atutxa, 2022) are available under the CC BY-NC 4.0 license. A total of 1676 extractions and abstractions have been... -
SemMdf - Semantic Database for Moksha
This SQLite database contains Moksha lemmas and their frequencies in a big corpus. The lemmas are linked to each other based on the syntactic relations they have had in the... -
Psycholinguistic Experiment Video
This is a video recording that is being used in psycholinguistic experiments. -
CATUC: Corpus académico de textos universitarios en castellano
This research was conducted on a corpus of texts produced by first-year undergraduate students at the University of the Basque Country (UPV/EHU). The corpus is called CATUC:... -
Model of English OCR Post-Correction
This is an OpenNMT-py model for OCR post-correction in English. For usage, see: https://github.com/mikahama/natas. This is a part of the following publication: Mika Hämäläinen, and... -
Prague Dependency Treebank 2.0 Sample Data
This is a small sample dataset from PDT 2.0. As such it can be released under a very permissive CC-BY license. -
FinMeter - Tools for assessing Finnish poetry
FinMeter is a library for analyzing poetry in Finnish. It handles typical rhyming such as alliteration, assonance and consonance, Japanese meters and Kalevala meter. It can also... -
Finnish Words and their Concreteness Values
Context: This data has been produced for poem generation in Finnish. If you use this dataset in your publication, please cite: Hämäläinen, M., & Alnajjar, K. (2019). Let’s...