Corpus for training and evaluating diacritics restoration systems

PID

Corpus of texts in 12 languages. For each language, we provide one training, one development and one testing set acquired from Wikipedia articles. Moreover, each language dataset contains (substantially larger) training set collected from (general) Web texts. All sets, except for Wikipedia and Web training sets that can contain similar sentences, are disjoint. Data are segmented into sentences which are further word tokenized.

All data in the corpus contain diacritics. To strip diacritics from them, use Python script diacritization_stripping.py contained within attached stripping_diacritics.zip. This script has two modes. We generally recommend using method called uninames, which for some languages behaves better.

The code for training recurrent neural-network based model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration.

Identifier
PID http://hdl.handle.net/11234/1-2607
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-2607
Provenance
Creator Náplava, Jakub; Straka, Milan; Hajič, Jan; Straňák, Pavel
Publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year 2018
Rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); http://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess true
Contact lindat-help(at)ufal.mff.cuni.cz
Representation
Language Czech; Vietnamese; Romanian; Moldavian; Moldovan; Polish; Slovak; Spanish; Castilian; Croatian; Irish; Latvian; Hungarian; French; Turkish
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 13
Discipline Linguistics