27 datasets found

Keywords: word normalisation

Filter Results
  • Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

    ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
  • Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0

    ReLDI-NormTagNER-sr 2.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
  • CMC training corpus Janes-Tag 1.0

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
  • Serbian Twitter training corpus ReLDI-NormTag-sr 1.0

    ReLDI-NormTag-sr 1.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...
  • CMC training corpus Janes-Norm 1.0

    Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...
  • Dataset of normalised Slovene text KonvNormSl 1.0

    Data used in the experiments described in: Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content....
  • Wikipedia talk corpus Janes-Wiki 1.0

    Janes-Wiki is an annotated corpus of discussion pages from the Slovene Wikipedia from the period 2003-08 to 2017-06. The corpus contains page and user talks and is structured...
  • CMC training corpus Janes-Tag 2.0

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
  • Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0

    ReLDI-NormTagNER-hr 2.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
  • CMC training corpus Janes-Tag 3.0

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blogs,...
  • CMC training corpus Janes-Tag 1.1

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
  • CMC training corpus Janes-Norm 1.2

    Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...
  • Serbian Twitter training corpus ReLDI-NormTag-sr 1.1

    ReLDI-NormTag-sr 1.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...
  • Croatian Twitter training corpus ReLDI-NormTag-hr 1.0

    ReLDI-NormTag-hr 1.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...
  • Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.1

    ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
  • Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.1

    ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
  • Blog post and comment corpus Janes-Blog 1.0

    Janes-Blog is an annotated corpus of Slovene blogs from websites rtvslo.si and publishwall.si from the period 2006-10 to 2016-01. The corpus is structured into individual texts...
  • CMC training corpus Janes-Tag 2.1

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
  • CMC training corpus Janes-Norm 3.0

    Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs,...
  • CMC training corpus Janes-Tag 1.2

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
You can also access this registry using the API (see API Docs).