Dataset - B2FIND

Monitor corpus of Slovene Trendi 2023-02

The Trendi corpus is a monitor corpus of Slovene. It contains news from 107 different media websites, published by 72 different publishers. Trendi 2023-02 covers the period from...

Corpus of Croatian news portals ENGRI (2014-2018)

The corpus consists of texts collected from the most popular (based on the Reuters Institute Digital News Report for 2018, retrieved from http://www.digitalnewsreport.org in...

Frequency lists of word-level n-grams from the Trendi corpus 2020

Frequency lists of word-level n-grams (or word sets) were extracted from the Trendi Monitor Corpus of Slovene (version 2022-05: http://hdl.handle.net/11356/1590) using the LIST...

Annotated corpus of Slovenian language-related news articles MetaLangNEWS-Sl

A comprehensive corpus of news articles on the topic of language, published in major Slovenian daily newspapers and news portals in the five-year period of January 1, 2015 -...

Monitor corpus of Slovene Trendi 2022-05

The Trendi corpus is a monitor corpus of Slovene. It contains news from 107 different media websites, published by 48 different publishers. Trendi 2022-05 covers the period from...

Annotated corpus of Macedonian language-related news articles MetaLangNEWS-Mk

A comprehensive corpus of news articles on the topic of language, published in major Macedonian daily newspapers and news portals in the five-year period of January 1, 2015 -...

Manually sentiment annotated Slovenian news corpus SentiNews 1.0

Between 2 and 6 annotators independently sentiment annotated a stratified random sample of 10,427 documents from the Slovenian news portals 24ur, Dnevnik, Finance, Rtvslo, and...

Keyword extraction datasets for Croatian, Estonian, Latvian and Russian 1.0

EACL Hackashop Keyword Challenge Datasets. In this repository you can find IDs of articles used for the keyword extraction challenge at EACL Hackashop on News Media Content...

Frequency lists of word-level n-grams from the Trendi corpus 2021

Frequency lists of word-level n-grams (or word sets) were extracted from the Trendi Monitor Corpus of Slovene (version 2022-05: http://hdl.handle.net/11356/1590) using the LIST...

Annotated corpus of Croatian language-related news articles MetaLangNEWS-Hr

A comprehensive corpus of news articles on the topic of language, published in major Croatian daily newspapers and news portals in the five-year period of January 1, 2015 -...

Latvian Delfi article archive (in Latvian and Russian) 1.0

This dataset is an archive of articles from the Delfi news site from 2015-2019, containing over 180,000 articles (c. 50% in Latvian and 50% in Russian). Keywords...

Ekspress news article archive (in Estonian and Russian) 1.0

The dataset is an archive of articles from the Ekspress Meedia news site from 2009-2019, containing over 1.4M articles, mostly in Estonian (1,115,120 articles) with...

Automatically sentiment annotated Slovenian news corpus AutoSentiNews 1.0

The corpus contains 256,567 documents from the Slovenian news portals 24ur, Dnevnik, Finance, Rtvslo, and Žurnal24. These portals contain political, business, economic and...

Corpus of Montenegrin language-related news articles MetaLangNEWS-Me

A comprehensive corpus of news articles on the topic of language, published in major Montenegrin daily newspapers and news portals in the five-year period of January 1, 2015 -...

24sata news article archive 1.0

The 24sata news portal consists of a portal with daily news and several smaller portals covering news from specific topics, such as automotive news, health, culinary content,...

Corpus of Bosnia and Herzegovina language-related news articles MetaLangNEWS-Bs

A comprehensive corpus of news articles on the topic of language, published in major daily newspapers and news portals in Bosnia and Herzegovina in the five-year period of...

Monitor corpus of Slovene Trendi 2022-10

The Trendi corpus is a monitor corpus of Slovene. It contains news from 106 different media websites, published by 48 different publishers. Trendi 2022-10 covers the period from...

Frequency lists of word-level n-grams from the Trendi corpus 2019

Frequency lists of word-level n-grams (or word sets) were extracted from the Trendi Monitor Corpus of Slovene (version 2022-05: http://hdl.handle.net/11356/1590) using the LIST...

Sentiment Annotated Dataset of Croatian News

We present a collection of sentiment annotations for news articles (article links) in Croatian. A set of 2025 news articles was gathered from 24sata, one of the leading...

Slovenian keyword extraction dataset from SentiNews 1.0

The dataset consists of 7514 Slovenian news articles from the SentiNews 1.0 corpus by Bučar et al. 2017 (http://hdl.handle.net/11356/1110) for which article keywords were available....

22 datasets found