Dataset - B2FIND

Parallel corpus EN-SL RSDO4 1.0

The RSDO4 parallel corpus of English-Slovene and Slovene-English translation pairs was collected as part of work package 4 of the Slovene in the Digital Environment project. It...

Parallel Corpus (EN-LT-DA) of General Data Protection Regulation (ELEXIS)

Trilingual parallel corpus on the General Data Protection Regulation. The size of the corpus is 54,468 words in English, 42,566 words in Lithuanian, and 47,740 words in Danish.

DSI-enriched ParaCrawl 9 en-nl corpus

This is a derivative work based on ParaCrawl release 9 English-Dutch (https://paracrawl.eu/). This version of the corpus includes a set of probabilities corresponding to the...

Tourism English-Croatian Parallel Corpus 2.0

Sentence-aligned parallel corpus built by automatically crawling 25 websites from the tourism domain.

MULTEXT-East "1984" annotated corpus 4.0

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel, sentence-aligned corpus contains the novel in the English original...

English-Montenegrin parallel corpus of subtitles Opus-MontenegrinSubs 1.0

This corpus contains parallel English-Montenegrin subtitles collected as part of linguistic and translatological research conducted by Petar Božović for his PhD thesis...

Parallel Corpus (EN-LT) of EUR-Lex Documents That Include Terms with the Adje...

Bilingual parallel corpus of the EU English documents containing terms with the adjective 'green' and their Lithuanian translations. The size of the corpus is 4,447,683 words in...

Slovene-English parallel corpus MaCoCu-sl-en 1.0

The Slovene-English parallel corpus MaCoCu-sl-en 1.0 was built by crawling the ".si" internet top-level domain in 2021, extending the crawl dynamically to other domains as well....

Parallel Corpus (EN-FR-LT) of EU Financial Documents (ELEXIS)

The parallel corpus comprises 154 EU legislative documents (English documents and their translations into French and Lithuanian) related to various financial issues and...

Ukrainian-English parallel corpus MaCoCu-uk-en 1.0

The Ukrainian-English parallel corpus MaCoCu-uk-en 1.0 was built by crawling the ".ua" and ".укр" internet top-level domains in 2022, extending the crawl dynamically to other...

Maltese-English parallel corpus MaCoCu-mt-en 1.0

The Maltese-English parallel corpus MaCoCu-mt-en 1.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well....

Croatian-English parallel corpus hrenWaC 2.0

The hrenWaC corpus version 2.0 consists of parallel Croatian-English texts crawled from the .hr top-level domain for Croatia. The corpus was built with Spidextor...

Turkish-English parallel corpus MaCoCu-tr-en 2.0

The Turkish-English parallel corpus MaCoCu-tr-en 2.0 was built by crawling the “.tr” and “.cy” internet top-level domains in 2021, extending the crawl dynamically to other...

MULTEXT-East "1984" document corpus 4.0

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel, sentence-aligned corpus contains the novel in the English original...

Serbian-English parallel corpus MaCoCu-sr-en 1.0

The Serbian-English parallel corpus MaCoCu-sr-en 1.0 was built by crawling the “.rs” and “.срб” internet top-level domains in 2021 and 2022, extending the crawl dynamically to...

DSI-enriched ParaCrawl 9 en-es corpus

This is a derivative work based on ParaCrawl release 9 English-Spanish (https://paracrawl.eu/). This version of the corpus includes a set of probabilities corresponding to the...

Parallel sense-annotated corpus ELEXIS-WSD 1.0

ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10...

JRC EU DGT Translation Memory Parsebank DGT-UD 1.0

DGT-UD is a 2-billion-word, 23-language parallel, syntactically parsed corpus, which consists of the JRC DGT translation memory of European law, automatically annotated with...

Croatian-English parallel corpus MaCoCu-hr-en 1.0

The Croatian-English parallel corpus MaCoCu-hr-en 1.0 was built by crawling the ".hr" internet top-level domain in 2021, extending the crawl dynamically to other domains as...

Post-edited and error annotated machine translation corpus PE²rr 1.0

The PE²rr corpus contains source language texts from different domains along with their automatically generated translations into several morphologically rich languages, their...

85 datasets found