-
FAUST cs-en 0.5
This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308).... -
Czech and English abstracts of ÚFAL papers
This is a document-aligned parallel corpus of English and Czech abstracts of scientific papers published by authors from the Institute of Formal and Applied Linguistics, Charles... -
Synthetic part of CzEng 2.0
CzEng is a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). While the full CzEng 2.0 is freely available for... -
IDENTICv1.0
IDENTIC is an Indonesian-English parallel corpus for research purposes. The corpus is bilingual, pairing Indonesian with English. The aim of this work is to build and provide... -
Hunglish Corpus
Bilingual written general; 2 million sentences -
EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)
EnTam is a sentence-aligned English-Tamil bilingual corpus that we have collected from publicly available websites for NLP research involving Tamil. The standard set... -
CsEnVi Pairwise Parallel Corpora
CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources:... -
LongEval Test Collection
The collection consists of queries and documents provided by the Qwant search engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on... -
Czech-Slovak Parallel Corpus
Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] –... -
IDENTICv1.0-raw
Raw Text -
WMT 13 Test Set
We provide the Vietnamese version of the multilingual test set from the WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness,... -
Multilingual corpus of juridical texts
International conventions and treaties arranged as a parallel corpus aligned at the paragraph level -
OdiEnCorp 2.0
Data: We have collected English-Odia parallel data for NLP research on the Odia language. The data for the parallel corpus was extracted from existing parallel... -
English-Urdu Religious Parallel Corpus
The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with... -
ParCorFull: A Parallel Corpus Annotated with Full Coreference
ParCorFull is a parallel corpus annotated with full coreference chains that has been created to address an important problem that machine translation and other multilingual... -
HindEnCorp 0.5
HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was... -
Czech-English Parallel Corpus 1.0 (CzEng 1.0)
CzEng 1.0 is the fourth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL) freely available for... -
English-Slovak Parallel Corpus
English-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] –... -
Czech and English abstracts of ÚFAL papers (2022-11-11)
This is a parallel corpus of Czech and mostly English abstracts of scientific papers and presentations published by authors from the Institute of Formal and Applied Linguistics,... -
Prague Czech-English Dependency Treebank 2.0
Texts: The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed...