Dataset - B2FIND

Reference List of Slovene Frequent Common Words

The reference list of Slovene most frequent common words was prepared by selecting vocabulary at the intersection of the most frequent 10,000 lemmas of four Slovene text...

A Resource for Evaluating Graded Word Similarity in Context: CoSimLex

The dataset contains human similarity ratings for pairs of words. The annotators were presented with contexts that contained both of the words in the pair and the dataset...

Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0

The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as...

CroSloEngual BERT

Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. State-of-the-art tool representing...

ELMo embeddings model, Slovenian

ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on the entire Gigafida 2.0 corpus...

SimLex-999 Slovenian translation SimLex-999-sl 1.0

The resource contains the English SimLex-999 word pairs (Hill et al. 2015) and their Slovene translations. In the translation process, the word pairs were first translated by two translators...

Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0

The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as...

Latvian user comment dataset 1.0

The dataset is an archive of reader comments from the Delfi news site from 2014-2019, containing approximately 12M comments, mostly in the Latvian language, with some in...

Dataset of Slovene idiomatic expressions SloIE

SloIE is a manually labelled dataset of Slovene idiomatic expressions. It contains 29,400 sentences with 75 different expressions that can occur with either a literal or an...

Keyword extraction datasets for Croatian, Estonian, Latvian and Russian 1.0

EACL Hackashop Keyword Challenge Datasets. In this repository you can find the IDs of articles used for the keyword extraction challenge at EACL Hackashop on News Media Content...

24sata news comment dataset 1.0

A dataset of user comments provided for research purposes for EMBEDDIA, a Horizon 2020 project, extracted from the database of user comments from the 24sata.hr news...

Summarization datasets from the KAS corpus KAS-Sum 1.0

Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.net/11356/1448) and the abstracts from the KAS-Abs 2.0 corpus...

Corpus of academic Slovene KAS 2.0

The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.5 billion tokens)...

24sata news article archive 1.0

The 24sata news portal consists of a portal with daily news and several smaller portals covering news on specific topics, such as automotive news, health, culinary content,...

Sentiment Annotated Dataset of Croatian News

We present a collection of sentiment annotations for news articles (article links) in the Croatian language. A set of 2025 news articles was gathered from 24sata, one of the leading...

Ekspress news article archive (in Estonian and Russian) 1.0

The dataset is an archive of articles from the Ekspress Meedia news site from 2009-2019, containing over 1.4M articles, mostly in the Estonian language (1,115,120 articles) with...

Slovenian keyword extraction dataset from SentiNews 1.0

The dataset consists of 7514 Slovenian news articles from the SentiNews 1.0 corpus by Bučar et al. 2017 (http://hdl.handle.net/11356/1110) for which article keywords were available....

CroSloEngual BERT 1.1

Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. State-of-the-art tool representing...

Latvian Delfi article archive (in Latvian and Russian) 1.0

This dataset is an archive of articles from the Delfi news site from 2015-2019, containing over 180,000 articles (c. 50% in Latvian and 50% in the Russian language). Keywords...

ELMo embeddings models for seven languages

ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on large monolingual corpora for 7 languages: Slovenian, Croatian,...

26 datasets found