-
Ukrainian-English parallel corpus MaCoCu-uk-en 1.0
The Ukrainian-English parallel corpus MaCoCu-uk-en 1.0 was built by crawling the ".ua" and ".укр" internet top-level domains in 2022, extending the crawl dynamically to other... -
Catalan-English parallel corpus MaCoCu-ca-en 1.0
The Catalan-English parallel corpus MaCoCu-ca-en 1.0 was built by crawling the ".cat", ".es", ".ad", ".fr", ".it" and ".eu" internet top-level domains in 2022, extending the... -
Greek-English parallel corpus MaCoCu-el-en 1.0
The Greek-English parallel corpus MaCoCu-el-en 1.0 was built by crawling the ".gr", ".ελ", ".cy" and ".eu" internet top-level domains in 2023, extending the crawl dynamically to... -
Serbian Web Corpus PDRS 1.0
PDRS 1.0 is a web corpus based on crawling the .rs domain. Crawling was done in September and October 2022 with BootCat. As search terms, approx. 2,800 word forms with a... -
Montenegrin web corpus meWaC 1.0
The Montenegrin web corpus meWaC was built by crawling the .me top-level domain in 2019. The corpus was near-deduplicated on paragraph level, normalised via transliteration into... -
Serbian web corpus srWaC 1.1
The Serbian web corpus srWaC was built by crawling the .rs top-level domain in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration,... -
Bosnian web corpus bsWaC 1.1
The Bosnian web corpus bsWaC was built by crawling the .ba top-level domain in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration,... -
Croatian web corpus hrWaC 2.1
The Croatian web corpus hrWaC was built by crawling the .hr top-level domain in 2011 and again in 2014. The corpus was near-deduplicated on paragraph level, normalised via... -
Greek web corpus MaCoCu-el 1.0
The Greek web corpus MaCoCu-el 1.0 was built by crawling the ".gr", ".ελ", ".cy" and ".eu" internet top-level domains in 2023, extending the crawl dynamically to other domains... -
Ukrainian web corpus MaCoCu-uk 1.0
The Ukrainian web corpus MaCoCu-uk 1.0 was built by crawling the ".ua" and ".укр" internet top-level domains in 2022, extending the crawl dynamically to other domains as well.... -
Catalan web corpus MaCoCu-ca 1.0
The Catalan web corpus MaCoCu-ca 1.0 was built by crawling the ".cat", ".es", ".ad", ".fr", ".it" and ".eu" internet top-level domains in 2022, extending the crawl dynamically... -
Icelandic web corpus MaCoCu-is 2.0
The Icelandic web corpus MaCoCu-is 2.0 was built by crawling the ".is" internet top-level domain in 2021 and 2023, extending the crawl dynamically to other domains as well. The... -
Icelandic web corpus MaCoCu-is 1.0
The Icelandic web corpus MaCoCu-is 1.0 was built by crawling the ".is" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler... -
Turkish web corpus MaCoCu-tr 1.0
The Turkish web corpus MaCoCu-tr 1.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The... -
Slovene-English parallel corpus MaCoCu-sl-en 1.0
The Slovene-English parallel corpus MaCoCu-sl-en 1.0 was built by crawling the ".si" internet top-level domain in 2021, extending the crawl dynamically to other domains as well.... -
Macedonian web corpus MaCoCu-mk 1.0
The Macedonian web corpus MaCoCu-mk 1.0 was built by crawling the ".mk" and ".мкд" internet top-level domains in 2021, extending the crawl dynamically to other domains as well.... -
Maltese web corpus MaCoCu-mt 2.0
The Maltese web corpus MaCoCu-mt 2.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler is... -
Maltese-English parallel corpus MaCoCu-mt-en 1.0
The Maltese-English parallel corpus MaCoCu-mt-en 1.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well.... -
Croatian-English parallel corpus hrenWaC 2.0
The hrenWaC corpus version 2.0 consists of parallel Croatian-English texts crawled from the .hr top-level domain for Croatia. The corpus was built with Spidextor... -
Serbian web corpus MaCoCu-sr 1.0
The Serbian web corpus MaCoCu-sr 1.0 was built by crawling the ".rs" and ".срб" internet top-level domains in 2021 and 2022, extending the crawl dynamically to other domains as...