Twitter sentiment for 15 European languages

Dataset

The dataset contains over 1.6 million tweets (given as tweet IDs), each labeled with sentiment by human annotators. It comprises 15 Twitter corpora, one for each of 15 European languages. The data can be used to train and evaluate Twitter sentiment classifiers, to compute annotator agreement, or to study differences in language usage on Twitter.
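
As a rough illustration of the annotator-agreement use case, the sketch below computes observed agreement and Cohen's kappa over tweets that were labeled twice. The record layout (tweet ID, label, annotator ID), the label names, and the toy data are all assumptions made for illustration; the actual layout of the corpus files should be checked after download. Kappa is used here only because it is compact, and it is not necessarily the measure used in the papers cited in this record.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (tweet_id, label, annotator_id).
# The real corpus files may use a different layout and label set.
records = [
    ("1", "Positive", "a1"), ("1", "Positive", "a2"),
    ("2", "Negative", "a1"), ("2", "Neutral", "a2"),
    ("3", "Neutral", "a1"), ("3", "Neutral", "a2"),
    ("4", "Negative", "a1"), ("4", "Negative", "a2"),
    ("5", "Positive", "a1"),  # labeled only once, so it is skipped below
]


def paired_labels(rows):
    """Return (first_label, second_label) for every tweet labeled exactly twice."""
    by_tweet = defaultdict(list)
    for tweet_id, label, _annotator in rows:
        by_tweet[tweet_id].append(label)
    return [tuple(labels) for labels in by_tweet.values() if len(labels) == 2]


def cohen_kappa(pairs):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(pairs)
    if n == 0:
        raise ValueError("no doubly labeled tweets")
    # Observed agreement: share of pairs with identical labels.
    p_o = sum(a == b for a, b in pairs) / n
    # Chance agreement from the marginal label counts of each position.
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    p_e = sum(first[c] * second[c] for c in first) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: only one label was ever used
    return (p_o - p_e) / (1 - p_e)


pairs = paired_labels(records)
observed = sum(a == b for a, b in pairs) / len(pairs)
kappa = cohen_kappa(pairs)
print(f"pairs={len(pairs)} observed={observed:.2f} kappa={kappa:.2f}")
```

Treating the two labels of a tweet as "first" and "second" rating is a simplification; with many annotators, a measure designed for multiple raters may be more appropriate.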

The data analysis is described in the following papers:

I. Mozetič, M. Grčar, J. Smailović. Multilingual Twitter sentiment classification: The role of human annotators, PLoS ONE 11(5): e0155036, doi: 10.1371/journal.pone.0155036, 2016. (http://dx.doi.org/10.1371/journal.pone.0155036)

I. Mozetič, L. Torgo, V. Cerqueira, J. Smailović. How to evaluate sentiment classifiers for Twitter time-ordered data?, PLoS ONE 13(3): e0194317, doi: 10.1371/journal.pone.0194317, 2018. (https://dx.doi.org/10.1371/journal.pone.0194317)

Identifier
PID	http://hdl.handle.net/11356/1054
Related Identifier	https://dx.doi.org/10.1371/journal.pone.0155036
Related Identifier	https://dx.doi.org/10.1371/journal.pone.0194317
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1054
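
The Metadata Access URL above is an OAI-PMH GetRecord request for Dublin Core (oai_dc) metadata. A minimal sketch of how such a URL can be assembled, and how a response could be read with the Python standard library, is given below. The helper names are hypothetical, and the sample XML is a hand-written, abbreviated stand-in for a response, not actual repository output.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "http://www.clarin.si/repository/oai/request"
DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace


def get_record_url(identifier, prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL (hypothetical helper)."""
    query = urlencode(
        {"verb": "GetRecord", "metadataPrefix": prefix, "identifier": identifier},
        safe=":/",  # keep the OAI identifier readable instead of percent-encoded
    )
    return f"{BASE_URL}?{query}"


def dc_fields(xml_text):
    """Collect Dublin Core elements into a {name: [values]} dictionary."""
    fields = {}
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag.startswith("{" + DC_NS + "}") and elem.text:
            name = elem.tag.split("}", 1)[1]
            fields.setdefault(name, []).append(elem.text.strip())
    return fields


# Assumed, abbreviated response used only to exercise the parser offline.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Twitter sentiment for 15 European languages</dc:title>
      <dc:identifier>http://hdl.handle.net/11356/1054</dc:identifier>
      <dc:publisher>Jožef Stefan Institute</dc:publisher>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""

url = get_record_url("oai:www.clarin.si:11356/1054")
fields = dc_fields(SAMPLE)
print(url)
print(fields["title"][0])
```

To read the live record, the string returned by `get_record_url` could be fetched with any HTTP client and the response body passed to `dc_fields`; the fields actually returned by the repository may differ from this sample.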

Provenance
Creator	Mozetič, Igor; Grčar, Miha; Smailović, Jasmina
Publisher	Jožef Stefan Institute
Publication Year	2016
Funding Reference	info:eu-repo/grantAgreement/EC/FP7/610704; info:eu-repo/grantAgreement/EC/FP7/317532; info:eu-repo/grantAgreement/EC/H2020/640772
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Albanian; Bosnian; Bulgarian; Croatian; English; German; Hungarian; Polish; Portuguese; Serbian; Russian; Slovak; Slovenian (Slovene); Spanish (Castilian); Swedish
Resource Type	corpus
Format	text/plain; application/octet-stream; text/plain; charset=utf-8; downloadable_files_count: 16
Discipline	Linguistics