Dataset - B2FIND

Training corpus ssj500k 1.4

The ssj500k training corpus contains 500,000 words, manually annotated on the levels of tokenization, sentence segmentation, morphosyntactic tagging, lemmatisation, named...

Annotated Corpus of Pre-Standardized Balkan Slavic Literature

The corpus contains 15 linguistically annotated samples of "damaskini" and other Balkan Slavic manuscripts and print editions from the 16th-19th century, together with over 30...

CMC training corpus Janes-Norm 1.1

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...

Annotated sample of the Slovenian Biographical Lexicon SBL-51abbr 1.0

This dataset consists of 51 randomly selected entries from the Slovenian Biographical Lexicon (1925–1991). The text of each entry has been manually tokenised and sentence...

CMC training corpus Janes-Tag 1.0

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Post-edited and error annotated machine translation corpus PE²rr 1.0

The PE²rr corpus contains source language texts from different domains along with their automatically generated translations into several morphologically rich languages, their...

KrdWrd CANOLA Corpus 1.0

The CANOLA Corpus is a visually annotated English web corpus for training classification engines to remove boilerplate from unseen web pages. It was harvested, annotated and...

KrdWrd CANOLA Corpus 1.1

The CANOLA Corpus is a visually annotated English web corpus for training classification engines to remove boilerplate from unseen web pages. It was harvested, annotated and...

68 datasets found