-
Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.1
ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
Training corpus jos1M 1.2
The jos1M corpus contains 1 million words of sampled paragraphs from the Gigafida corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This... -
xLiMe Twitter Corpus XTC 1.0.1
The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,... -
Character-level part-of-speech tagger of Slovene language
Part-of-speech tagger for Slovene language implemented using convolutional and LSTM neural networks. Tagger uses character-level representation of sentences. The tagger has been... -
Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.1
ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
The Trankit model for linguistic processing of spoken and written Slovenian 1.1
This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation... -
The CLASSLA-Stanza model for morphosyntactic annotation of standard Croatian 2.1
The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k training... -
Training corpus ssj500k 2.2
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation.... -
Annotated Corpus of Pre-Standardized Balkan Slavic Literature
The corpus contains 15 linguistically annotated samples of "damaskini" and other Balkan Slavic manuscripts and print editions from the 16th-19th century, together with over 30... -
The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Croa...
The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the... -
The CLASSLA-Stanza model for morphosyntactic annotation of standard Macedonia...
This model for morphosyntactic annotation of standard Macedonian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the 1984 training... -
MULTEXT-East non-commercial lexicons 4.0
The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of... -
Word embeddings CLARIN.SI-embed.sr 1.0
CLARIN.SI-embed.sr contains word embeddings induced from the srWaC web corpus. The embeddings are based on the skip-gram model of fastText trained on 554,606,544 tokens of... -
The CLASSLA-Stanza model for morphosyntactic annotation of standard Bulgarian...
This model for morphosyntactic annotation of standard Bulgarian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the BulTreeBank... -
The CLASSLA-Stanza model for morphosyntactic annotation of standard Slovenian...
This model for morphosyntactic annotation of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training... -
CMC training corpus Janes-Tag 3.0
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blogs,... -
MULTEXT-East free lexicons 4.0
The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of... -
Serbian linguistic training corpus SETimes.SR 2.0
The SETimes.SR training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation,... -
The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Serb...
The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the... -
Word embeddings CLARIN.SI-embed.sl 1.0
CLARIN.SI-embed.sl contains word embeddings induced from a large collection of Slovene texts composed of existing corpora of Slovene, e.g GigaFida, Janes, KAS, slWaC etc. The...