Dataset and baseline model of moderated content FRENK-STYRIA-24sata 1.0

FRENK-STYRIA-24sata is a dataset of moderated newspaper comments from the website 24sata.hr, with metadata on the time of publishing, the user identifier, the thread identifier, and whether the comment was deleted by the moderators. The full text of each comment is encrypted with a character-replacement method so that the comments are not readable by humans. Basic punctuation is left unencrypted to enable tokenization. The dataset is mainly intended for experiments on automating comment moderation. For real-world use, a fastText classification model trained on the non-encrypted data is also made available.
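The record does not document the exact replacement scheme, so the following is only a minimal sketch of how a character-replacement encoding that leaves punctuation intact might work. The alphabet, the seed, the lowercasing step, and the function names build_substitution_table, obscure and tokenize are all assumptions made for illustration and are not taken from the dataset.

```python
import random
import re
import string

# Assumed alphabet: lowercase Latin letters, Croatian diacritics and digits.
# The alphabet actually used for the dataset is not documented in this record.
ALPHABET = string.ascii_lowercase + "čćđšž" + string.digits


def build_substitution_table(alphabet, seed):
    """Map every character of `alphabet` to a character of a shuffled copy."""
    rng = random.Random(seed)
    shuffled = list(alphabet)
    rng.shuffle(shuffled)
    return dict(zip(alphabet, shuffled))


def obscure(text, table):
    """Replace characters found in `table`; anything else passes through.

    Punctuation and whitespace are not in the table, so they stay unchanged.
    Lowercasing is an assumption made here to keep the table small.
    """
    return "".join(table.get(ch, ch) for ch in text.lower())


def tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


table = build_substitution_table(ALPHABET, seed=24)
original = "Ovo je komentar, zar ne?"
masked = obscure(original, table)

print(masked)
print(tokenize(masked))
```

Because the mapping is applied consistently, identical words always map to identical masked strings and token boundaries are preserved, which is presumably what allows classification experiments to be run on the encrypted text.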

Identifier
PID http://hdl.handle.net/11356/1202
Related Identifier https://drive.google.com/file/d/13m7PFn49_tnEfFjcbqk8cugG4ZTy2A5I/view
Related Identifier http://nl.ijs.si/frenk/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1202
Provenance
Creator Ljubešić, Nikola; Erjavec, Tomaž; Fišer, Darja
Publisher Jožef Stefan Institute
Publication Year 2018
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Croatian
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 2
Discipline Linguistics