85 datasets found

Keywords: parallel corpus

  • Macedonian-English parallel corpus MaCoCu-mk-en 2.0

    The Macedonian-English parallel corpus MaCoCu-mk-en 2.0 was built by crawling the “.mk” and “.мкд” internet top-level domains in 2021, extending the crawl dynamically to other...
  • Maltese-English parallel corpus MaCoCu-mt-en 2.0

    The Maltese-English parallel corpus MaCoCu-mt-en 2.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well....
  • Montenegrin-English parallel corpus MaCoCu-cnr-en 1.0

    The Montenegrin-English parallel corpus MaCoCu-cnr-en 1.0 was built by crawling the “.me” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other...
  • Slovene-English parallel corpus MaCoCu-sl-en 2.0

    The Slovene-English parallel corpus MaCoCu-sl-en 2.0 was built by crawling the “.si” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains...
  • Catalan-English parallel corpus MaCoCu-ca-en 1.0

    The Catalan-English parallel corpus MaCoCu-ca-en 1.0 was built by crawling the ".cat", ".es", ".ad", ".fr", ".it" and ".eu" internet top-level domains in 2022, extending the...
  • TED-ELH Parallel Corpus (ELEXIS)

    The corpus contains parallel, aligned scripts of TED Talks in English, Lithuanian, and Hebrew. It contains spoken language data. See also: http://hdl.handle.net/20.500.11821/34
  • Finnish-English parallel corpus fienWaC 1.0

    The fienWaC corpus version 1.0 consists of parallel Finnish-English texts crawled from the .fi top-level domain for Finland. The corpus was built with Spidextor...
  • Turkish-English parallel corpus MaCoCu-tr-en 1.0

    The Turkish-English parallel corpus MaCoCu-tr-en 1.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other...
  • Icelandic-English parallel corpus MaCoCu-is-en 2.0

    The Icelandic-English parallel corpus MaCoCu-is-en 2.0 was built by crawling the “.is” internet top-level domain in 2021, extending the crawl dynamically to other domains as...
  • CzEng 0.7

    CzEng 0.7 is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague. The corpus contains no manual...
  • ParaCrawl Corpus version 1.0

    The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of...
  • Hunglish Corpus

    Bilingual written general; 2 million sentences
  • FAUST cs-en 0.5

    This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308)....
  • Czech and English abstracts of ÚFAL papers

    This is a document-aligned parallel corpus of English and Czech abstracts of scientific papers published by authors from the Institute of Formal and Applied Linguistics, Charles...
  • CsEnVi Pairwise Parallel Corpora

    CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources:...
  • UFAL Parallel Corpus of North Levantine 1.0

    This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the...
  • Czech-Slovak Parallel Corpus

    Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] –...
  • Hindi Visual Genome 1.0

    Hindi Visual Genome 1.0 is a multimodal dataset consisting of text and images suitable for the English-to-Hindi multimodal machine translation task and multimodal research. We...
  • PAWS

    PAWS is a multi-lingual parallel treebank with coreference annotation. It consists of English texts from the Wall Street Journal translated into Czech, Russian and Polish. In...
  • WMT 13 Test Set

    We provide the Vietnamese version of the multi-lingual test set from the WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness,...
You can also access this registry using the API (see API Docs).