Estonian Teen Language Corpus

The Estonian Teen Language Corpus (Eesti teismeliste keele korpus) contains spoken and written language data collected from Estonian teenagers (ages 9-18) between 2019 and 2023. The corpus consists of four types of files. Spoken language data is provided as .eaf and .tsv files (spoken_eaf.zip, spoken_tsv.zip), which contain transcriptions of recordings of teenagers' spontaneous speech; in each recording, one participant recorded a conversation between themselves and one or more other people. Transcriptions are annotated on several linguistic tiers, including words, morphology, language, etc. (see teke_spoken_metadata.txt). Version 1.0 of the corpus contains transcriptions of 116 conversations, most of them around one hour long. The corpus can be used to address a range of linguistic research questions and to train language technology applications (e.g. speech recognition, dialogue systems).
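As an illustration of how the tiered .eaf transcriptions might be inspected, the following is a minimal Python sketch that lists the annotation values on each tier of an ELAN (.eaf) file. It relies only on the general EAF XML layout (TIER elements containing ANNOTATION_VALUE elements). The function name read_eaf_tiers, the tier IDs "words" and "language", and the sample content are assumptions made for the example; the actual tier names are documented in teke_spoken_metadata.txt.

```python
import io
import xml.etree.ElementTree as ET


def read_eaf_tiers(source):
    """Return a dict mapping each tier ID to its list of annotation values.

    `source` may be a file path or a binary file object containing EAF XML.
    """
    root = ET.parse(source).getroot()
    tiers = {}
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        # Both time-aligned and reference annotations store their text
        # in an ANNOTATION_VALUE child element.
        tiers[tier_id] = [v.text or "" for v in tier.iter("ANNOTATION_VALUE")]
    return tiers


# Tiny hypothetical EAF document; tier IDs and values are invented.
SAMPLE_EAF = b"""<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1200"/>
  </TIME_ORDER>
  <TIER TIER_ID="words" LINGUISTIC_TYPE_REF="default">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>tere</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="language" LINGUISTIC_TYPE_REF="default" PARENT_REF="words">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>et</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

tiers = read_eaf_tiers(io.BytesIO(SAMPLE_EAF))
for tier_id, values in tiers.items():
    print(tier_id, values)
```

For a real file, the same function could be called with a path to an .eaf file extracted from spoken_eaf.zip.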

Written language data consists of online chats between two teenagers (ages 10-17). Chats are provided as .tsv and .html files (chat_tsv.zip, chat_html.zip). Version 1.0 of the corpus includes 110 chats. Annotation includes language tags and abbreviations. All personal information has been anonymised.
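For readers who obtain access, the .tsv archives could be read without unpacking them first. The sketch below iterates over every .tsv member of a zip archive and yields each row as a dictionary keyed by the header line. It assumes the files are UTF-8 encoded and have a header row; the function name iter_tsv_rows and the column names "speaker" and "text" in the demonstration are hypothetical and not taken from the corpus documentation.

```python
import csv
import io
import zipfile


def iter_tsv_rows(zip_source):
    """Yield (member_name, row_dict) for every row of every .tsv in a zip.

    `zip_source` may be a path (e.g. to chat_tsv.zip) or a binary file object.
    Assumes UTF-8 text with a header line in each file.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".tsv"):
                continue  # skip non-TSV members
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                # QUOTE_NONE keeps quotation marks in chat text literal.
                reader = csv.DictReader(
                    text, delimiter="\t", quoting=csv.QUOTE_NONE
                )
                for row in reader:
                    yield name, row


# Demonstration with an in-memory archive; file name and columns are invented.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("chat_001.tsv", "speaker\ttext\nA\ttere\nB\tno tere\n")
    zf.writestr("readme.txt", "not a table")
demo.seek(0)

rows = list(iter_tsv_rows(demo))
for member, row in rows:
    print(member, row)
```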

The Estonian Teen Language Corpus is the product of several consecutive projects, which are described further at https://teismelistekeel.ee/.

To access the corpus, please write to Virve Vihman (virve.vihman@ut.ee).

Identifier
DOI https://datadoi.ee/handle/33/596
Related Identifier https://doi.org/10.1515/lingvan-2021-0152
Metadata Access https://datadoi.ee/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:datadoi.ee:33/596
Provenance
Creator Vihman, Virve-Anneli; Pilvik, Maarja-Liisa; Mandel, Aive; Kängsepp, Annika; Aigro, Mari; Koreinik, Kadri; Praakli, Kristiina; Lindström, Liina
Publisher Institute of Estonian and General Linguistics, University of Tartu
Publication Year 2024
Rights info:eu-repo/semantics/restrictedAccess; http://creativecommons.org/licenses/by-nc-nd/4.0/
OpenAccess false
Contact Institute of Estonian and General Linguistics, University of Tartu
Representation
Language Estonian
Resource Type info:eu-repo/semantics/dataset
Format TSV; HTML; EAF; TXT; CSV; text/plain; application/zip; text/csv
Discipline Other