Dataset - B2FIND

Training corpus SUK 1.0

The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with...

Corpus of Academic Slovene (BSc/BA theses) KAS-dipl 1.0

The KAS-dipl corpus of Slovene BSc/BA theses consists of almost 65,000 texts (3,5 million pages or 1,1 billion tokens) written 2000 - 2018 and gathered from the digital...

English-Slovene term candidates KAS-biterm 1.0

KAS-biterm is an automatically generated glossary of English terms with their translations into Slovene. The pairs, possibly with their English and Slovene acronyms, were...

Parliamentary corpus of first Yugoslavia (1919-1939) yu1Parl 1.0

The corpus contains meeting proceedings of the National Representation of the Kingdom of Yugoslavia from 1919 to 1939 (Zbirka stenografskih beležk, zapisnikov sej...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint-en.ana 4.1 is the English machine translation of the ParlaMint.ana 4.1 (http://hdl.handle.net/11356/1911) set of corpora of parliamentary debates across Europe. The...

Digital library and corpus of historical Slovene IMP 1.1

The IMP digital library contains historical Slovene books and other publications, together 658 texts with over 45,000 pages from the period 1584-1919. Each text contains...

Slovenian parliamentary corpus (1990-2022) siParl 4.0

The siParl 4.0 corpus contains minutes of the Assembly of the Republic of Slovenia for 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of...

Corpus of Slovenian school texts SBSJ 1.0

Corpus of Slovenian school texts is a lemmatized and POS-tagged specialized corpus, which includes 428 short school texts written primarily by primary-school students from 1st...

CMC training corpus Janes-Tag 1.2

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Tweet code-switching corpus Janes-Preklop 1.0

Janes-Preklop is a corpus of Slovene tweets that is manually annotated for code-switching (the use of words from two or more languages within one sentence or utterance),...

Spoken corpus Gos VideoLectures 4.0 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

Spoken corpus Gos VideoLectures 4.2 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. It can be used for training...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint-en 3.0 comprises linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 3.0 (http://hdl.handle.net/11356/1488) which were...

Slovenian parliamentary corpus (1990-2018) siParl 2.0

The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of...

Training corpus hr500k 1.0

The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and...

CMC training corpus Janes-Norm 3.0

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs,...

Slovenian parliamentary corpus (1990-2022) siParl 3.0

The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of...

Corpus of academic Slovene KAS 2.0

The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1,5 billion tokens)...

CMC training corpus Janes-Tag 2.1

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

130 datasets found