Dataset - B2FIND

Catalan-English parallel corpus MaCoCu-ca-en 1.0

The Catalan-English parallel corpus MaCoCu-ca-en 1.0 was built by crawling the ".cat", ".es", ".ad", ".fr", ".it" and ".eu" internet top-level domains in 2022, extending the...

Slovene-English parallel corpus MaCoCu-sl-en 2.0

The Slovene-English parallel corpus MaCoCu-sl-en 2.0 was built by crawling the “.si” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains...

Montenegrin-English parallel corpus MaCoCu-cnr-en 1.0

The Montenegrin-English parallel corpus MaCoCu-cnr-en 1.0 was built by crawling the “.me” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other...

Maltese-English parallel corpus MaCoCu-mt-en 2.0

The Maltese-English parallel corpus MaCoCu-mt-en 2.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well....

Macedonian-English parallel corpus MaCoCu-mk-en 2.0

The Macedonian-English parallel corpus MaCoCu-mk-en 2.0 was built by crawling the “.mk” and “.мкд” internet top-level domains in 2021, extending the crawl dynamically to other...

Bosnian-English parallel corpus MaCoCu-bs-en 1.0

The Bosnian-English parallel corpus MaCoCu-bs-en 1.0 was built by crawling the “.ba” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains...

Bulgarian-English parallel corpus MaCoCu-bg-en 1.0

The Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 was built by crawling the ".bg" and ".бг" internet top-level domains in 2021, extending the crawl dynamically to other...

Slovene-English parallel corpus slenWaC 1.0

The slenWaC corpus version 1.0 consists of parallel Slovene-English texts crawled from the .si top-level domain for Slovenia. The corpus was built with Spidextor...

Bulgarian-English parallel corpus MaCoCu-bg-en 2.0

The Bulgarian-English parallel corpus MaCoCu-bg-en 2.0 was built by crawling the “.bg” and “.бг” internet top-level domains in 2021, extending the crawl dynamically to other...

Parallel corpus of idiomatic text ParaDiom 1.0

ParaDiom is a parallel corpus with sentences sampled from existing corpora. The corpus contains 1,000 Slovene sentences with their English translation and 1,000 English...

Macedonian-English parallel corpus MaCoCu-mk-en 1.0

The Macedonian-English parallel corpus MaCoCu-mk-en 1.0 was built by crawling the ".mk" and ".мкд" internet top-level domains in 2021, extending the crawl dynamically to other...

Greek-English parallel corpus MaCoCu-el-en 1.0

The Greek-English parallel corpus MaCoCu-el-en 1.0 was built by crawling the ".gr", ".ελ", ".cy" and ".eu" internet top-level domains in 2023, extending the crawl dynamically to...

Croatian-English parallel corpus MaCoCu-hr-en 2.0

The Croatian-English parallel corpus MaCoCu-hr-en 2.0 was built by crawling the “.hr” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other...

Serbian-English parallel corpus srenWaC 1.0

The srenWaC corpus consists of sentence-aligned parallel Serbian-English texts crawled from the .rs top-level domain for Serbia. The corpus was built with Spidextor...

Parallel corpus EN-SL RSDO4 2.0

The RSDO4 parallel corpus of English-Slovene and Slovene-English translation pairs was collected as part of work package 4 of the Slovene in the Digital Environment project. It...

Albanian-English parallel corpus MaCoCu-sq-en 1.0

The Albanian-English parallel corpus MaCoCu-sq-en 1.0 was built by crawling the “.al” internet top-level domain in 2022, extending the crawl dynamically to other domains as...

Parallel Corpus (EN-LT-FR) of EUR-Lex Document Extracts That Include Terms wi...

Trilingual parallel corpus of EUR-Lex Document Extracts that include terms with colour names (black, white and grey). The size of the corpus is 23,198 words in English, 19,262...

Parallel sense-annotated corpus ELEXIS-WSD 1.1

ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.1 contains sentences for 10...

Icelandic-English parallel corpus MaCoCu-is-en 1.0

The Icelandic-English parallel corpus MaCoCu-is-en 1.0 was built by crawling the ".is" internet top-level domain in 2021, extending the crawl dynamically to other domains as...

Bilingual Corpus of Underground Mining (ELEXIS)

PodzemniRadovi-sr-en: a bilingual aligned corpus of papers from the field of mining. Undeground-mining-sr-en: bilingual texts from the Underground Mining Engineering journal (55...

85 datasets found