73 datasets found

Keywords: manual annotation

  • CMC training corpus Janes-Tag 2.0

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
  • Training corpus ssj500k 1.3

    The ssj500k training corpus is based on two training corpora built within the JOS project (https://nl.ijs.si/jos/). It contains the jos100k corpus and additional material from...
  • Dataset of normalised Slovene text KonvNormSl 1.0

    Data used in the experiments described in: Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content....
  • CMC training corpus Janes-Norm 1.0

    Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...
  • CMC training corpus Janes-Tag 1.0

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
  • Corpus of comma placement Vejica 1.0

    A collection of sentences demonstrating and correcting comma usage. The sentences come from four sources: - KUST: a Slovene learner corpus,...
  • Serbian Twitter training corpus ReLDI-NormTag-sr 1.0

    ReLDI-NormTag-sr 1.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...
  • Croatian linguistic training corpus hr500k 2.0

    The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and...
  • Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0

    ReLDI-NormTagNER-sr 2.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
  • Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

    ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
  • Reference corpus of historical Slovene goo300k 1.2

    goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584–1899. Each text...
  • KrdWrd CANOLA Corpus 1.1

    The CANOLA Corpus is a visually annotated English web corpus for training classification engines to remove boilerplate on unseen web pages. It was harvested, annotated and...
  • KrdWrd CANOLA Corpus 1.0

    The CANOLA Corpus is a visually annotated English web corpus for training classification engines to remove boilerplate on unseen web pages. It was harvested, annotated and...
You can also access this registry using the API (see API Docs).