62 datasets found

Keywords: web corpus

  • Ukrainian-English parallel corpus MaCoCu-uk-en 1.0

    The Ukrainian-English parallel corpus MaCoCu-uk-en 1.0 was built by crawling the ".ua" and ".укр" internet top-level domains in 2022, extending the crawl dynamically to other...
  • Catalan-English parallel corpus MaCoCu-ca-en 1.0

    The Catalan-English parallel corpus MaCoCu-ca-en 1.0 was built by crawling the ".cat", ".es", ".ad", ".fr", ".it" and ".eu" internet top-level domains in 2022, extending the...
  • Greek-English parallel corpus MaCoCu-el-en 1.0

    The Greek-English parallel corpus MaCoCu-el-en 1.0 was built by crawling the ".gr", ".ελ", ".cy" and ".eu" internet top-level domains in 2023, extending the crawl dynamically to...
  • Serbian Web Corpus PDRS 1.0

    PDRS 1.0 is a web corpus based on crawling the .rs domain. Crawling was done in September and October 2022 with BootCat. As search terms, approx. 2,800 word forms with a...
  • Montenegrin web corpus meWaC 1.0

    The Montenegrin web corpus meWaC was built by crawling the .me top-level domain in 2019. The corpus was near-deduplicated on paragraph level, normalised via transliteration into...
  • Serbian web corpus srWaC 1.1

    The Serbian web corpus srWaC was built by crawling the .rs top-level domain in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration,...
  • Bosnian web corpus bsWaC 1.1

    The Bosnian web corpus bsWaC was built by crawling the .ba top-level domain in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration,...
  • Croatian web corpus hrWaC 2.1

    The Croatian web corpus hrWaC was built by crawling the .hr top-level domain in 2011 and again in 2014. The corpus was near-deduplicated on paragraph level, normalised via...
  • Greek web corpus MaCoCu-el 1.0

    The Greek web corpus MaCoCu-el 1.0 was built by crawling the ".gr", ".ελ", ".cy" and ".eu" internet top-level domains in 2023, extending the crawl dynamically to other domains...
  • Ukrainian web corpus MaCoCu-uk 1.0

    The Ukrainian web corpus MaCoCu-uk 1.0 was built by crawling the ".ua" and ".укр" internet top-level domains in 2022, extending the crawl dynamically to other domains as well....
  • Catalan web corpus MaCoCu-ca 1.0

    The Catalan web corpus MaCoCu-ca 1.0 was built by crawling the ".cat", ".es", ".ad", ".fr", ".it" and ".eu" internet top-level domains in 2022, extending the crawl dynamically...
  • Icelandic web corpus MaCoCu-is 2.0

    The Icelandic web corpus MaCoCu-is 2.0 was built by crawling the ".is" internet top-level domain in 2021 and 2023, extending the crawl dynamically to other domains as well. The...
  • Icelandic web corpus MaCoCu-is 1.0

    The Icelandic web corpus MaCoCu-is 1.0 was built by crawling the ".is" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler...
  • Turkish web corpus MaCoCu-tr 1.0

    The Turkish web corpus MaCoCu-tr 1.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The...
  • Slovene-English parallel corpus MaCoCu-sl-en 1.0

    The Slovene-English parallel corpus MaCoCu-sl-en 1.0 was built by crawling the ".si" internet top-level domain in 2021, extending the crawl dynamically to other domains as well....
  • Macedonian web corpus MaCoCu-mk 1.0

    The Macedonian web corpus MaCoCu-mk 1.0 was built by crawling the ".mk" and ".мкд" internet top-level domains in 2021, extending the crawl dynamically to other domains as well....
  • Maltese web corpus MaCoCu-mt 2.0

    The Maltese web corpus MaCoCu-mt 2.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler is...
  • Maltese-English parallel corpus MaCoCu-mt-en 1.0

    The Maltese-English parallel corpus MaCoCu-mt-en 1.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well....
  • Croatian-English parallel corpus hrenWaC 2.0

    The hrenWaC corpus version 2.0 consists of parallel Croatian-English texts crawled from the .hr top-level domain for Croatia. The corpus was built with Spidextor...
  • Serbian web corpus MaCoCu-sr 1.0

    The Serbian web corpus MaCoCu-sr 1.0 was built by crawling the ".rs" and ".срб" internet top-level domains in 2021 and 2022, extending the crawl dynamically to other domains as...
You can also access this registry using the API (see API Docs).