Dataset - B2FIND

Estonian Podcast Corpus (Eesti taskuhäälingukorpus)

The corpus consists of Estonian podcast episodes and their transcriptions. It contains a total of 10,633 episodes from 184 different podcasts, with a total duration of 10,918 hours, which is...

Estonian Teen Language Corpus

Estonian Teen Language Corpus (Eesti teismeliste keele korpus) is a corpus representing spoken and written language data, collected from Estonian teenagers (ages 9-18) between...

Estonian Public Broadcasting Radio Programme Corpus (Eesti Rahvusringhäälingu raadiosaadete korpus)

The corpus consists of ERR radio programmes and their transcriptions. It contains 53,000 radio programmes with a total duration of 16 thousand hours, recorded between 1930 and 2022....

Phonetic Corpus of Estonian Spontaneous Speech v1.3

The Phonetic Corpus of Estonian Spontaneous Speech consists of recordings that have been annotated on different linguistic tiers including words and segments and their...

Cyfry

A small spoken digits corpus in Polish. Contains 488 recordings of 25 speakers reading 20 digits (0-9) each. Amounts to around 76 minutes of recordings. Split into train (~72%),...

EU Parliament Speech corpus

A collection of 1040 EU Parliament speeches with transcriptions and annotations. Includes original speeches and PL/EN translations.

Clarin-PL Studio Corpus (EMU)

Polish speech corpus of read speech recorded in a studio. Contains many speakers, each reading a few dozen different sentences and a list of words with rare phonemes. Useful for...

Speech tools plugin for Annotation Pro

This resource describes the Annotation Pro plugin containing various tools for automatic processing of speech data. The initial tool provides only a speech aligner, but more are...

Clarin-PL Mobile Corpus (EMU)

Polish speech corpus of read speech recorded over the phone. Contains many speakers, each reading a few dozen different sentences and a list of words with rare phonemes. Useful...

Business English learner speech corpus SAPS

SAPS is a specialized speech corpus which contains business meeting simulations in English between undergraduate students of Languages for Business and Economics at the School...

Clarin-PL Studio Corpus (EMU;updated phonetics)

Polish speech corpus of read speech recorded in a studio. Contains many speakers, each reading a few dozen different sentences and a list of words with rare phonemes. Useful for...

STAZKA – Speech recordings from vehicles

The database contains two sets of recordings, both made in moving or stationary vehicles (passenger cars or trucks). All data were recorded within the project...

Czech Senior COMPANION Expressive Speech Corpus

The corpus contains Czech expressive speech recorded by a professional female speaker using a scenario-based approach. The scenario was created on the basis of previously recorded...

Vystadial 2013 – English data

Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems....

Phonetic Corpus of Estonian Spontaneous Speech (online search engine)

Studio recordings of spontaneous Estonian segmented phonetically on word, sound, and other linguistic levels. Current size about 22 hours of speech, 155 000 words. Online search...

A Small Dataset for English-to-Czech Speech Translation in the Travel Domain

This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate). The "part1" and "part2" corpora...

Spoken corpus of Karel Makoň (2020-11-16)

Talks given by Karel Makoň to his friends from the late 1960s through the early 1990s. The topic is mostly Christian mysticism.

English TTS speech corpus of air traffic (pilot) messages - German accent

The corpus contains recordings of a male native German speaker speaking English. The sentences read by the speaker originate in the domain of air traffic control...

Balaxan Corpus of Kurmanji

Balaxan is the first speech corpus of Kurmanji Kurdish, with 58 utterances by speakers of Kurmanji. Utterances are divided into 4 categories based on their sentence structures:...

Czech Malach Cross-lingual Speech Retrieval Test Collection

The package contains Czech recordings from the Visual History Archive, which consists of interviews with Holocaust survivors. The archive consists of audio recordings, four...

32 datasets found