Dataset - B2FIND

CoNLL-based Extended Czech Named Entity Corpus 1.0

This is the Czech Named Entity Corpus 1.0 transformed into the CoNLL format. The original corpus can be downloaded from: http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C. The...

VALLEX 2.5

The Valency Lexicon of Czech Verbs, Version 2.5 (VALLEX 2.5), is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description...

RobeCzech Base

RobeCzech is a monolingual RoBERTa language representation model trained on Czech data. RoBERTa is a robustly optimized Transformer-based pretraining approach. We show that...

Processing of intraclausal garden-path structures in Czech

Experimental materials, data and R scripts used in the paper "Garden-path sentences and the diversity of their (mis)representations" (Ceháková - Chromý, 2023).

MorfFlex CZ 2.0

MorfFlex CZ 2.0 is the Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. MorfFlex is a flat list of...

CWC2011

Web corpus of Czech, created in 2011. Contains newspapers and magazines, discussions, and blogs. See http://www.lrec-conf.org/proceedings/lrec2012/summaries/120.html for details.

Czech Models (CNEC) for NameTag

Czech models for NameTag, providing recognition of named entities. The models are trained on Czech Named Entity Corpus 2.0 and 1.1.

SQAD v2

Simple question answering database (SQAD) created from Czech Wikipedia. Each record of SQAD consists of four files (in vertical form provided with lemmatization and POS tagging)...

FERNET-C5

FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5) data - a Czech variant of the English C4...

CoCzeFLA Chroma 2023.04

A new version of the previously published corpus Chroma. Version 2023.04 includes six children. Two transcripts (Julie20221, Klara30424) were removed since they did not meet...

Diffusion of phonetic updates within phonological neighborhoods, ELOPE, Data

Phonological neighborhood density is known to influence lexical access, speech production as well as perception processes. Lexical competition is thought to be the central...

NomVallex 2.0

NomVallex 2.0 is a manually annotated valency lexicon of Czech nouns and adjectives, created in the theoretical framework of the Functional Generative Description and based on...

ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcri...

ORTOFON v1 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole...

Czech Text Document Corpus v 2.0

Czech Text Document Corpus v 2.0 is a collection of text documents for automatic document classification in the Czech language. It is composed of the text...

MorfFlex CZ

Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for...

Large Corpus of Czech Parliament Plenary Hearings

We present a large corpus of Czech parliament plenary sessions. The corpus consists of approximately 444 hours of speech data and corresponding text transcriptions. The whole...

MorfFlex CZ 160310

Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for...

Czech Models (MorfFlex CZ 2.0 + PDT-C 1.0) for MorphoDiTa 220710

Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ 2.0,...

Czech Named Entity Corpus 1.0

The presented Czech Named Entity Corpus 1.0 is the first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a...

Czech Models (MorfFlex CZ 161115 + PDT 3.0) for MorphoDiTa 161115

Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ...

41 datasets found