-
Emoji Sentiment Ranking 1.0
A lexicon of 751 emoji characters with automatically assigned sentiment. The sentiment is computed from 70,000 tweets, labeled by 83 human annotators in 13 European languages.... -
English-Montenegrin parallel corpus of subtitles Opus-MontenegrinSubs 1.0
This corpus contains parallel English-Montenegrin subtitles collected as part of linguistic and translatological research conducted by Petar Božović for his PhD thesis... -
MULTEXT-East "1984" annotated corpus 4.0
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel, sentence-aligned corpus contains the novel in the English original... -
Tourism English-Croatian Parallel Corpus 2.0
Sentence-aligned parallel corpus built by automatically crawling 25 websites from the tourism domain. -
DSI-enriched ParaCrawl 9 en-nl corpus
This is a derivative work based on Paracrawl release 9 English-Dutch (https://paracrawl.eu/). This version of the corpus includes a set of probabilities corresponding to the... -
xLiMe Twitter Corpus XTC 1.0.1
The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,... -
Parallel sense-annotated corpus ELEXIS-WSD 1.1
ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.1 contains sentences for 10... -
Icelandic-English parallel corpus MaCoCu-is-en 1.0
The Icelandic-English parallel corpus MaCoCu-is-en 1.0 was built by crawling the ".is" internet top-level domain in 2021, extending the crawl dynamically to other domains as... -
Japanese-Slovene learner's dictionary jaSlo 3.1
The jaSlo dictionary is primarily intended for Slovene students learning Japanese. For each entry, it contains the Japanese headword (kanji, hiragana or katakana, and romaji),... -
Twitter sentiment for 15 European languages
The dataset contains over 1.6 million tweets (tweet IDs), labeled with sentiment by human annotators. There are 15 Twitter corpora for the corresponding 15 European languages.... -
Albanian-English parallel corpus MaCoCu-sq-en 1.0
The Albanian-English parallel corpus MaCoCu-sq-en 1.0 was built by crawling the “.al” internet top-level domain in 2022, extending the crawl dynamically to other domains as... -
Serbian-English parallel corpus srenWaC 1.0
The srenWaC corpus consists of sentence-aligned parallel Serbian-English texts crawled from the .rs top-level domain for Serbia. The corpus was built with Spidextor... -
Croatian-English parallel corpus MaCoCu-hr-en 2.0
The Croatian-English parallel corpus MaCoCu-hr-en 2.0 was built by crawling the “.hr” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other... -
Greek-English parallel corpus MaCoCu-el-en 1.0
The Greek-English parallel corpus MaCoCu-el-en 1.0 was built by crawling the ".gr", ".ελ", ".cy" and ".eu" internet top-level domains in 2023, extending the crawl dynamically to... -
Macedonian-English parallel corpus MaCoCu-mk-en 1.0
The Macedonian-English parallel corpus MaCoCu-mk-en 1.0 was built by crawling the ".mk" and ".мкд" internet top-level domains in 2021, extending the crawl dynamically to other... -
Bulgarian-English parallel corpus MaCoCu-bg-en 2.0
The Bulgarian-English parallel corpus MaCoCu-bg-en 2.0 was built by crawling the “.bg” and “.бг” internet top-level domains in 2021, extending the crawl dynamically to other... -
Slovene-English parallel corpus slenWaC 1.0
The slenWaC corpus version 1.0 consists of parallel Slovene-English texts crawled from the .si top-level domain for Slovenia. The corpus was built with Spidextor... -
MULTEXT-East non-commercial lexicons 4.0
The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of... -
Bulgarian-English parallel corpus MaCoCu-bg-en 1.0
The Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 was built by crawling the ".bg" and ".бг" internet top-level domains in 2021, extending the crawl dynamically to other... -
Bosnian-English parallel corpus MaCoCu-bs-en 1.0
The Bosnian-English parallel corpus MaCoCu-bs-en 1.0 was built by crawling the “.ba” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains...