-
IDENTICv1.0-raw
Raw Text -
Additional German-Czech reference translations of the WMT'11 test set
Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. Original segmentation... -
ParCorFull: A Parallel Corpus Annotated with Full Coreference
ParCorFull is a parallel corpus annotated with full coreference chains that has been created to address an important problem that machine translation and other multilingual... -
Czech-English Parallel Corpus 1.0 (CzEng 1.0)
CzEng 1.0 is the fourth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL) freely available for... -
Synthetic part of CzEng 2.0
CzEng is a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). While the full CzEng 2.0 is freely available for... -
Czech-English Manual Word Alignment
Corpus of manually aligned Czech-English parallel sentences. It comprises 2500 parallel sentences from 7 different sources. -
FAUST 0.5
Syntactic (including deep-syntactic - tectogrammatical) annotation of user-generated noisy sentences. The annotation was made on Czech-English and English-Czech Faust Dev/Test... -
English-Slovak Parallel Corpus
English-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] –... -
Prague Czech-English Dependency Treebank 2.0
The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed... -
KonText Web Demo
An interactive web demo for querying selected ÚFAL and LINDAT corpora. LINDAT/CLARIN KonText is a fork of ÚČNK KonText (https://github.com/czcorpus/kontext, maintained by Tomáš... -
English-Urdu Religious Parallel Corpus
The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with... -
Multilingual corpus of juridical texts
International conventions and treaties arranged as a parallel corpus aligned at the paragraph level -
OdiEnCorp 2.0
We have collected English-Odia parallel data for the purposes of NLP research of the Odia language. The data for the parallel corpus was extracted from existing parallel... -
LongEval Test Collection
The collection consists of queries and documents provided by the Qwant search engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on... -
LongEval Train Collection
The collection consists of queries and documents provided by the Qwant search engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on... -
EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)
EnTam is a sentence-aligned English-Tamil bilingual corpus that we collected from some publicly available websites for NLP research involving Tamil. The standard set... -
Czech and English abstracts of ÚFAL papers (2022-11-11)
This is a parallel corpus of Czech and mostly English abstracts of scientific papers and presentations published by authors from the Institute of Formal and Applied Linguistics,... -
IDENTICv1.0
IDENTIC is an Indonesian-English parallel corpus for research purposes. The corpus pairs Indonesian text with English. The aim of this work is to build and provide... -
HindEnCorp 0.5
HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was... -
Covert translation: Business Communication (new)
Translation corpora of original texts with translations and comparable texts from the genre of external business communication. Translation and comparable corpus with authentic...