50 datasets found

Keywords: named entities

  • CMC training corpus Janes-Tag 2.1

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
  • Training corpus SUK 1.1

    The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with...
  • Blog post and comment corpus Janes-Blog 1.0

    Janes-Blog is an annotated corpus of Slovene blogs from the websites rtvslo.si and publishwall.si, covering the period 2006-10 to 2016-01. The corpus is structured into individual texts...
  • ReLDI token+tag+lemma+NER web service for WebLicht

    WebLicht (https://weblicht.sfs.uni-tuebingen.de/) registry entry for a web service comprising tokenisation, PoS tagging and Named Entity Recognition. Tool source files are...
  • Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.1

    ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
  • xLiMe Twitter Corpus XTC 1.0.1

    The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,...
  • Training corpus ssj500k 1.4

    The ssj500k training corpus contains 500,000 words, manually annotated on the levels of tokenization, sentence segmentation, morphosyntactic tagging, lemmatisation, named...
  • Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.1

    ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
  • Training corpus ssj500k 2.2

    The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....
  • Training corpus ssj500k 2.1

    The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....
  • Training corpus ssj500k 2.0

    The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....
  • Annotated sample of the Slovenian Biographical Lexicon SBL-51abbr 1.0

    This dataset consists of 51 randomly selected entries from the Slovenian Biographical Lexicon (1925–1991). The text of each entry has been manually tokenised and sentence...
  • CMC training corpus Janes-Tag 3.0

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blogs,...
  • Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0

    ReLDI-NormTagNER-hr 2.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
  • Slovenian parliamentary corpus ParlaMeter-sl 1.0

    The ParlaMeter-sl corpus contains minutes of the National Assembly of the Republic of Slovenia and currently covers its VIIth mandate (2014-08-01 to 2018-06-22). The corpus...
  • Serbian linguistic training corpus SETimes.SR 2.0

    The SETimes.SR training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation,...
  • CMC training corpus Janes-Tag 2.0

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
  • Training corpus ssj500k 1.3

    The ssj500k training corpus is based on two training corpora built within the JOS project (https://nl.ijs.si/jos/). It contains the jos100k corpus and additional material from...
  • Wikipedia talk corpus Janes-Wiki 1.0

    Janes-Wiki is an annotated corpus of discussion pages from the Slovene Wikipedia from the period 2003-08 to 2017-06. The corpus contains page and user talks and is structured...
  • Croatian linguistic training corpus hr500k 2.0

    The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and...
You can also access this registry using the API (see API Docs).