-
The CLASSLA-StanfordNLP model for lemmatisation of non-standard Serbian 1.1
The model for lemmatisation of non-standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR... -
The CLASSLA-Stanza model for lemmatisation of non-standard Serbian 2.1
The model for lemmatisation of non-standard Serbian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SETimes.SR training corpus... -
Training corpus jos1M 1.1
The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This... -
The CLASSLA-Stanza model for lemmatisation of standard Macedonian 2.1
The model for lemmatisation of standard Macedonian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the 1984 training... -
Word embeddings CLARIN.SI-embed.sl 1.0
CLARIN.SI-embed.sl contains word embeddings induced from a large collection of Slovene texts composed of existing corpora of Slovene, e.g. GigaFida, Janes, KAS, slWaC etc. The... -
CMC training corpus Janes-Tag 2.0
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence... -
ReLDI tag+lemma web service for WebLicht
WebLicht (https://weblicht.sfs.uni-tuebingen.de/) registry entry for a web service comprising tokenisation, PoS tagging, and lemmatisation. -
The CLASSLA-StanfordNLP model for lemmatisation of standard Croatian 1.1
The model for lemmatisation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training... -
Spoken Torlak dialect corpus 1.0 (transcription)
The Torlak corpus represents a spoken variety of the endangered Torlak dialect from the Timok area in Southeast Serbia. It comprises transcripts of interviews with the local... -
Morphological lexicon Sloleks 1.2
Sloleks is the reference morphological lexicon for the Slovenian language, developed to be used in NLP applications and language manuals. Encoded in LMF XML, the lexicon contains... -
The CLASSLA-Stanza model for lemmatisation of standard Croatian 2.1
The model for lemmatisation of standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k training corpus... -
CMC training corpus Janes-Tag 1.0
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence... -
Serbian Twitter training corpus ReLDI-NormTag-sr 1.0
ReLDI-NormTag-sr 1.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word... -
Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0
ReLDI-NormTagNER-sr 2.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0
ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
Reference corpus of historical Slovene goo300k 1.2
goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584-1899. Each text... -
Synergies between transcription and lexical database building: The case of Ge...
Building a lemmatised corpus of German Sign Language (DGS) using iLex, a relational database and annotation tool; consistent token-type matching (lemmatisation) and quality... -
How Much Top-Down and Bottom-Up do We Need to Build a Lemmatized Corpus?
Building a lemmatised corpus of German Sign Language (DGS) using iLex; lemmatisation as top-down and lexicon building as bottom-up process; lemma revision -
Transkriptionskonventionen im Vergleich
Synopsis of transcription conventions used in six international sign language research projects including annotation tool and tiers in transcripts, divided into conventional... -
Die Erstellung von Fachgebärdenlexika am Institut für Deutsche Gebärdensprach...
Detailed description of how six corpus-based LSP dictionaries German – German Sign Language (DGS) were produced including elicitation methods, annotation and...