Dataset - B2FIND

Heinrich Wölfflin – Gesammelte Werke (HWGW) Digital Edition Dataset

Heinrich Wölfflin – Gesammelte Werke (HWGW) Digital Edition datasets: TEI/XML data for the online edition at https://hwgw.humanitiesconnect.pub

Corpus of 1968 Slovenian literature Maj68 2.0

Maj68 corpus contains 1,521 texts by 198 known authors published between 1964 and 1972 in the periodicals "Tribuna", "Problemi" and "Problemi. Literatura." The texts contain...

Training corpus SETimes.SR 1.0

The SETimes.SR training corpus contains 86,726 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic...

Spoken corpus Gos 1.1

Gos is a corpus of spoken Slovene that includes the transcripts of approximately 120 hours of speech recorded in various situations: radio and TV shows, school lessons and...

Japanese web corpus with difficulty levels jpWaC-L 1.0

The corpus contains over 300 million words, with annotations of words and sentences describing their difficulty levels. Words are assigned levels of difficulty according to the...

Training corpus ssj500k 2.3

The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....

Collection of Slovenian paremiological units Pregovori 1.1

This corpus collects and annotates the extensive and highly valuable diachronic collection of 37,390 Slovenian proverbs, 50 years and more in the making at the ZRC SAZU...

Multilingual comparable corpora of parliamentary debates ParlaMint 3.0

ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora...

Spoken corpus Gos 1.0

GOS is a corpus of spoken Slovene that includes the transcripts of approximately 120 hours of speech recorded in various situations: radio and TV shows, school lessons and...

CMC training corpus Janes-Norm 1.1

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...

Croatian Twitter training corpus ReLDI-NormTag-hr 1.1

ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20...

Corpus of term-annotated texts RSDO5 1.1

The RSDO5 corpus was compiled in order to serve as a training set for automatic term identification. It consists of 12 texts with 250,000 words and almost 38,000 manually...

News comment corpus Janes-News 1.0

Janes-News is an annotated corpus of comments on online news articles from websites rtvslo.si, mladina.si, and reporter.si from the period 2007-03 to 2015-01. The corpus is...

Forum corpus Janes-Forum 1.0

Janes-Forum is an annotated corpus of Slovene forums from websites med.over.net, avtomobilizem.com, and kvarkadabra.net from the period 2001-02 to 2015-01. The corpus is...

ŠUSS archive of questions and answers about the Slovenian language (1998-2010)

This corpus contains the Q&A archive of the ŠUSS language consultancy service. The ŠUSS internet forum was active 1998-2010. Questions posted by users were answered by a...

Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0

ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

MULTEXT-East "1984" document corpus 4.0

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original...

Spoken corpus Gos VideoLectures 2.0 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

Croatian parliamentary corpus ParlaMeter-hr 1.0

The ParlaMeter-hr corpus contains minutes of the National Assembly of the Republic of Croatia and currently covers its VIth mandate (2016-11-15 - 2018-11-21). The corpus...

130 datasets found