Slovenian Twitter dataset 2018-2020 1.0

Dataset

The dataset covers Slovenian-language tweets posted on Twitter from 2018 to 2020. It consists of tweet IDs, retweet IDs, pseudo-anonymized user IDs, publication dates, and hate speech labels (acceptable, inappropriate, offensive, violent) assigned automatically using the model published at https://huggingface.co/IMSyPP/hate_speech_slo.

The dataset is the basis for the following two papers:
- "Retweet communities reveal the main source of hate speech": https://arxiv.org/pdf/2105.14898.pdf
- "Community evolution in retweet networks": https://arxiv.org/pdf/2105.06214.pdf

Identifier
PID	http://hdl.handle.net/11356/1423
Related Identifier	https://arxiv.org/pdf/2105.14898.pdf
Related Identifier	https://arxiv.org/pdf/2105.06214.pdf
Related Identifier	http://imsypp.ijs.si
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1423
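
The Metadata Access link above is an OAI-PMH GetRecord request that returns the record as Dublin Core (oai_dc) XML. The following is a minimal sketch of how such a request URL can be built and how the Dublin Core fields can be read from a response. The helper names `build_get_record_url` and `parse_dublin_core` are hypothetical, and `SAMPLE_RESPONSE` is a hand-written illustration of the general OAI-PMH response shape, not the repository's actual output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint taken from the Metadata Access field of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
# Standard Dublin Core element namespace, in ElementTree's "{uri}" form.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_get_record_url(identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        },
        safe=":/",  # keep ':' and '/' readable, as in the record's own link
    )
    return f"{OAI_ENDPOINT}?{query}"


def parse_dublin_core(xml_text):
    """Collect all non-empty Dublin Core elements into {name: [values]}."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root.iter():
        if element.tag.startswith(DC_NS) and element.text and element.text.strip():
            name = element.tag[len(DC_NS):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields


# Hand-written sample for illustration only; the real response has more fields.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Slovenian Twitter dataset 2018-2020 1.0</dc:title>
          <dc:identifier>http://hdl.handle.net/11356/1423</dc:identifier>
          <dc:language>Slovenian</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""

url = build_get_record_url("oai:www.clarin.si:11356/1423")
record = parse_dublin_core(SAMPLE_RESPONSE)
print(url)
print(record["title"][0])
```

To work with the live record, the XML text could be obtained by passing the built URL to `urllib.request.urlopen` and decoding the response body before handing it to `parse_dublin_core`.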

Provenance
Creator	Evkoski, Bojan; Pelicon, Andraž; Mozetič, Igor; Ljubešić, Nikola; Kralj Novak, Petra
Publisher	Jožef Stefan Institute
Publication Year	2021
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics