Dataset - B2FIND

Character-level part-of-speech tagger of Slovene language

Part-of-speech tagger for Slovene language implemented using convolutional and LSTM neural networks. Tagger uses character-level representation of sentences. The tagger has been...

xLiMe Twitter Corpus XTC 1.0.1

The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,...

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.1

ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

Training corpus jos1M 1.2

The jos1M corpus contains 1 million words of sampled paragraphs from the Gigafida corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This...

MULTEXT-East "1984" annotated corpus 4.0

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original...

Training corpus SUK 1.1

The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with...

CMC training corpus Janes-Tag 2.1

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Training corpus hr500k 1.0

The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and...

The CLASSLA-StanfordNLP model for morphosyntactic annotation of non-standard ...

This model for morphosyntactic annotation of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on...

The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Serbi...

This model for morphosyntactic annotation of non-standard Serbian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SETimes.SR...

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Slov...

This model for morphosyntactic annotation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the...

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Croa...

The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the...

Word embeddings CLARIN.SI-embed.hr 1.0

CLARIN.SI-embed.hr contains word embeddings induced from a large collection of Croatian texts composed of the Croatian web corpus hrWaC and a 400-million-token-heavy collection...

Annotated Corpus of Pre-Standardized Balkan Slavic Literature 1.1

The corpus contains 23 linguistically annotated samples of "damaskini" and other Balkan Slavic manuscripts and print editions from the 15th-19th century, together with over 50...

The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Croat...

This model for morphosyntactic annotation of non-standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k...

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Serb...

The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the...

The Trankit model for linguistic processing of standard Slovenian

This is a retrained Slovenian standard model for Trankit v1.1.1 library (https://pypi.org/project/trankit/). It is able to predict sentence segmentation, tokenization,...

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Mace...

This model for morphosyntactic annotation of standard Macedonian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the...

The CLASSLA-Stanza model for morphosyntactic annotation of standard Serbian 2.1

The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SETimes.SR training...

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Slov...

This model for morphosyntactic annotation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the...

64 datasets found