65 datasets found

Keywords: web corpus

  • Croatian-English parallel corpus MaCoCu-hr-en 1.0

    The Croatian-English parallel corpus MaCoCu-hr-en 1.0 was built by crawling the ".hr" internet top-level domain in 2021, extending the crawl dynamically to other domains as...
  • Bosnian web corpus MaCoCu-bs 1.0

    The Bosnian web corpus MaCoCu-bs 1.0 was built by crawling the ".ba" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well. The...
  • Croatian web corpus MaCoCu-hr 2.0

    The Croatian web corpus MaCoCu-hr 2.0 was built by crawling the ".hr" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well. The...
  • Slovene Web genre identification corpus GINCO 1.0

    The Slovene Web genre identification corpus GINCO 1.0 contains web texts, manually annotated with genre, from two Slovene web corpora, the slWaC 2.0 corpus, crawled in 2014, and...
  • DSI-enriched ParaCrawl 9 en-es corpus

    This is a derivative work based on Paracrawl release 9 English-Spanish (https://paracrawl.eu/). This version of the corpus includes a set of probabilities corresponding to the...
  • Maltese web corpus MaCoCu-mt 1.0

    The Maltese web corpus MaCoCu-mt 1.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler is...
  • Maltese web corpus MaCoCu-mt 2.0

    The Maltese web corpus MaCoCu-mt 2.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler is...
  • Serbian-English parallel corpus MaCoCu-sr-en 1.0

    The Serbian-English parallel corpus MaCoCu-sr-en 1.0 was built by crawling the ".rs" and ".срб" internet top-level domains in 2021 and 2022, extending the crawl dynamically to...
  • Croatian web corpus CLASSLA-web.hr 1.0

    The Croatian web corpus CLASSLA-web.hr 1.0 is based on the MaCoCu-hr 2.0 web corpus crawl (http://hdl.handle.net/11356/1806), which was additionally cleaned and enriched with...
  • Macedonian web corpus MaCoCu-mk 2.0

    The Macedonian web corpus MaCoCu-mk 2.0 was built by crawling the ".mk" and ".мкд" internet top-level domains in 2021, extending the crawl dynamically to other domains as well....
  • Turkish-English parallel corpus MaCoCu-tr-en 2.0

    The Turkish-English parallel corpus MaCoCu-tr-en 2.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other...
  • Croatian-English parallel corpus hrenWaC 2.0

    The hrenWaC corpus version 2.0 consists of parallel Croatian-English texts crawled from the .hr top-level domain for Croatia. The corpus was built with Spidextor...
  • Maltese-English parallel corpus MaCoCu-mt-en 1.0

    The Maltese-English parallel corpus MaCoCu-mt-en 1.0 was built by crawling the ".mt" internet top-level domain in 2021, extending the crawl dynamically to other domains as well....
  • Turkish web corpus MaCoCu-tr 2.0

    The Turkish web corpus MaCoCu-tr 2.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The...
  • Serbian web corpus CLASSLA-web.sr 1.0

    The Serbian web corpus CLASSLA-web.sr 1.0 is based on the MaCoCu-sr 1.0 web corpus crawl (http://hdl.handle.net/11356/1807), which was additionally cleaned and enriched with...
  • Greek web corpus MaCoCu-el 1.0

    The Greek web corpus MaCoCu-el 1.0 was built by crawling the ".gr", ".ελ", ".cy" and ".eu" internet top-level domains in 2023, extending the crawl dynamically to other domains...
  • Ukrainian-English parallel corpus MaCoCu-uk-en 1.0

    The Ukrainian-English parallel corpus MaCoCu-uk-en 1.0 was built by crawling the ".ua" and ".укр" internet top-level domains in 2022, extending the crawl dynamically to other...
  • Slovene-English parallel corpus MaCoCu-sl-en 1.0

    The Slovene-English parallel corpus MaCoCu-sl-en 1.0 was built by crawling the ".si" internet top-level domain in 2021, extending the crawl dynamically to other domains as well....
  • Bosnian web corpus bsWaC 1.1

    The Bosnian web corpus bsWaC was built by crawling the .ba top-level domain in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration,...
  • Serbian web corpus MaCoCu-sr 1.0

    The Serbian web corpus MaCoCu-sr 1.0 was built by crawling the ".rs" and ".срб" internet top-level domains in 2021 and 2022, extending the crawl dynamically to other domains as...
You can also access this registry using the API (see API Docs).