15 datasets found

Keywords: dependency treebank

Filter Results
  • Training corpus SETimes.SR 1.0

    The SETimes.SR training corpus contains 86 726 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic...
  • Training corpus ssj500k 2.3

    The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....
  • Training corpus SUK 1.0

    The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with...
  • Annotated Corpus of Pre-Standardized Balkan Slavic Literature 1.1

    The corpus contains 23 linguistically annotated samples of "damaskini" and other Balkan Slavic manuscripts and print editions from the 15th-19th century, together with over 50...
  • Training corpus hr500k 1.0

    The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and...
  • Training corpus SUK 1.1

    The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with...
  • Training corpus ssj500k 1.4

    The ssj500k training corpus contains 500,000 words, manually annotated on the levels of tokenization, sentence segmentation, morphosyntactic tagging, lemmatisation, named...
  • Training corpus ssj500k 2.2

    The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....
  • Annotated Corpus of Pre-Standardized Balkan Slavic Literature

    The corpus contains 15 linguistically annotated samples of "damaskini" and other Balkan Slavic manuscripts and print editions from the 16th-19th century, together with over 30...
  • Training corpus ssj500k 2.1

    The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....
  • Training corpus ssj500k 2.0

    The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....
  • CMC training corpus Janes-Syn 1.0

    Janes-Syn is a syntactically annotated corpus of Slovene tweets and is meant as a gold-standard training and testing dataset for syntactic annotation of Slovene...
  • Serbian linguistic training corpus SETimes.SR 2.0

    The SETimes.SR training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation,...
  • Training corpus ssj500k 1.3

    The ssj500k training corpus is based on two training corpora built within the JOS project (https://nl.ijs.si/jos/). It contains the jos100k corpus and additional material from...
  • Croatian linguistic training corpus hr500k 2.0

    The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and...
You can also access this registry using the API (see API Docs).