Dataset - B2FIND

FAUST cs-en 0.5

This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308)....

Czech and English abstracts of ÚFAL papers

This is a document-aligned parallel corpus of English and Czech abstracts of scientific papers published by authors from the Institute of Formal and Applied Linguistics, Charles...

Synthetic part of CzEng 2.0

CzEng is a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). While the full CzEng 2.0 is freely available for...

IDENTICv1.0

IDENTIC is an Indonesian-English parallel corpus for research purposes. The corpus is a bilingual corpus paired with English. The aim of this work is to build and provide...

Hunglish Corpus

Billingual written general; 2 million sentences

EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)

EnTam is a sentence aligned English-Tamil bilingual corpus from some of the publicly available websites that we have collected for NLP research involving Tamil. The standard set...

CsEnVi Pairwise Parallel Corpora

CsEnVi Pairwise Parallel Corpora consist of Vietnamese-Czech parallel corpus and Vietnamese-English parallel corpus. The corpora were assembled from the following sources:...

LongEval Test Collection

The collection consists of queries and documents provided by the Qwant search Engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on...

Czech-Slovak Parallel Corpus

Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] –...

IDENTICv1.0-raw

Raw Text

WMT 13 Test Set

We provide the Vietnamese version of the multi-lingual test set from WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness,...

Multilingual corpus of juridical texts

International conventions and treaties arranged as a paralell corpus aligned on paragraph level

OdiEnCorp 2.0

Data We have collected English-Odia parallel data for the purposes of NLP research of the Odia language. The data for the parallel corpus was extracted from existing parallel...

English-Urdu Religious Parallel Corpus

English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu language with sentence alignments. The corpus can be used for experiments with...

ParCorFull: A Parallel Corpus Annotated with Full Coreference

ParCorFull is a parallel corpus annotated with full coreference chains that has been created to address an important problem that machine translation and other multilingual...

HindEnCorp 0.5

HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was...

Czech-English Parallel Corpus 1.0 (CzEng 1.0)

CzEng 1.0 is the fourth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL) freely available for...

English-Slovak Parallel Corpus

English-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] –...

Czech and English abstracts of ÚFAL papers (2022-11-11)

This is a parallel corpus of Czech and mostly English abstracts of scientific papers and presentations published by authors from the Institute of Formal and Applied Linguistics,...

Prague Czech-English Dependency Treebank 2.0

Texts The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed...

85 datasets found