Dataset - B2FIND

SentiLex-PT 02

SentiLex-PT is a sentiment lexicon for Portuguese, made up of 7,014 lemmas and 82,347 inflected forms. In detail, the lexicon describes: 4,779 (16,863) adjectives, 1,081...

Celebrities and Famous People, and their Properties

Context: This dataset is based on the work presented in the following publication. Please cite it if you use the data in an academic publication: Alnajjar, K., Hämäläinen, M.,...

Prague Dependency Treebank 2.0 Sample Data

This is a small sample dataset from PDT 2.0. As such it can be released under a very permissive CC-BY license.

CLIN26-Bracmat-poster.pdf

Linguistic and algebraic expressions can be analysed with similar pattern matching (PM) methods, suggesting a trove of useful methods for Natural Language Processing (NLP). For...

SemFi: Finnish Semantics with Syntactic Relations

Context: This dataset is covered in detail in the following publication: Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost...

HABE-IXA euskarazko idazmen proben corpusa / HABE-IXA Basque written test corpus

This corpus contains essays written in official HABE exams for assessing students' knowledge of the Basque language. We have collected 120 essays in each of the B1, B2, C1 and...

Exploring genealogical blends_Online Corpus

The online corpus supplement to the paper "Exploring genealogical blends: the Surinamese Creole Cluster and the Virgin Islands Dutch Creole Cluster", published in the CLARIN...

Syntax Maker - The NLG tool for Finnish

Syntax Maker is a natural language generation tool for automatically generating syntactically correct sentences in Finnish. The tool is especially useful in the case of...

Interaction and dialogue with large-scale textual data: Parliamentary speeche...

Prof. Dr. Andreas Blätte's keynote talk at the CLARIN Annual Conference 2015. Additional material, including the presented 3D visualisations, is available via...

Syntactically annotated Czech legal texts

Two legal texts, manually annotated for syntax according to the Prague Dependency Treebank framework. Dependency trees are presented as images. The annotation editor TrEd was...

SemMdf - Semantic Database for Moksha

This SQLite database contains Moksha lemmas and their frequencies in a large corpus. The lemmas are linked to each other based on the syntactic relations they have had in the...

Model of English OCR Post-Correction

This is an OpenNMT-py model for OCR post-correction in English. Usage: see https://github.com/mikahama/natas. This is part of the following publication: Mika Hämäläinen, and...

Sign Language Interaction

This is a sign language interaction recording made for scientific purposes.

Movie Title Puns

Context: The data is based on the following paper on pun generation: Hämäläinen, M., & Alnajjar, K. (2019). Modelling the Socialization of Creative Agents in a...

Haur Hezkuntzako ipuin-bilduma

A collection of stories used in the association of ikastolas of the Basque Country

Murre - Normalize non-standard Finnish and dialectalize standard Finnish

A Python library for normalizing dialectal Finnish and dialectalizing standard Finnish. Normalization: Niko Partanen, Mika Hämäläinen, and Khalid Alnajjar. 2019. Dialect Text...

El mejor conjunto de datos para identificación del sarcasmo

This corpus contains all utterances from two episodes of South Park (Latin American Spanish dub) and two episodes of Archer (European Spanish dub). Each utterance has been annotated...

Wikipedia paths

Wikipedia category embedding starting at the top category Biology for English, French and Czech. English data are not complete.

SemKpv - Semantic Database for Komi-Zyrian

This SQLite database contains Komi-Zyrian lemmas and their frequencies in a large corpus. The lemmas are linked to each other based on the syntactic relations they have had in the...

HD graduondokoa (Magia argibideak)

A set of instructions for performing magic tricks

52 datasets found