24sata news comment dataset 1.0


This dataset of user comments was extracted from the comment database of the 24sata.hr news portal and is provided for research purposes for EMBEDDIA, a Horizon 2020 project. 24sata.hr is the largest-circulation daily newspaper in Croatia, reaching on average 2 million readers daily. The dataset includes comment metadata: a link to the relevant article, the anonymized ID of the comment author, and a timestamp. Comments that were blocked by human moderators are labelled as such.

Description of the Dataset.

The 24sata dataset consists of 11 columns and 21548192 rows. Each row represents one user comment on the 24sata news portal. Comments are added by registered users below the published news article.

Columns:
'comment_id' - The internal id of the comment. Unique for each row.
'user_id' - The internal id of the user writing the comment. Unique for each user. '0' for all blocked comments.
'content' - The content (text) of the user comment.
'site' - The site the comment came from.
'reply_to_id' - The 'comment_id' of the parent comment, if this comment was intended as a reply.
'created_date' - The date the comment was created.
'last_change' - The date the comment was last edited.
'article_id' - A public id of the article where this comment was posted. The article itself can be accessed by appending article_id to the site. So an article with article_id 614684 and site 'www.24sata.hr' can be found on 'www.24sata.hr/a-614684' (note the added 'a-' before the article_id).
'infringed_on_rule' - If the user has infringed on rules with this comment, the id of the rule is given. The description of the rules is given below.
'like_counts' - The number of times other users have voted in favour of this comment, similar to the Like button.
'dislike_counts' - The number of times other users have voted against this comment, the opposite of the Like button.
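
The article address rule described for 'article_id' can be sketched in Python as below. This is only an illustration: the function name article_url and the optional scheme parameter are assumptions made for this example and are not part of the dataset or its documentation.

```python
def article_url(site: str, article_id, scheme: str = "") -> str:
    """Build an article address from the 'site' and 'article_id' columns.

    Follows the rule from the column description: append 'a-' plus the
    article_id to the site. The 'scheme' argument is a hypothetical
    convenience (not in the dataset description) for prefixing e.g. 'https'.
    """
    # Tolerate stray whitespace or a trailing slash in the site value.
    site = site.strip().rstrip("/")
    url = f"{site}/a-{article_id}"
    return f"{scheme}://{url}" if scheme else url


# The example from the column description:
print(article_url("www.24sata.hr", 614684))  # www.24sata.hr/a-614684
```

Without the scheme argument the result matches the form given in the description; a scheme would only be needed if the address is passed to a tool that requires a full URL.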

PID http://hdl.handle.net/11356/1399
Related Identifier https://doi.org/10.21248/jlcl.34.2020.224
Related Identifier https://www.aclweb.org/anthology/2021.hackashop-1.14.pdf
Related Identifier http://embeddia.eu/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1399
Creator Shekhar, Ravi; Pranjic, Marko; Pollak, Senja; Pelicon, Andraž; Purver, Matthew
Publisher Styria Media Group
Publication Year 2021
Funding Reference info:eu-repo/grantAgreement/EC/H2020/825153
Rights Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); https://creativecommons.org/licenses/by-nc-nd/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Language Croatian
Resource Type corpus
Format text/plain; charset=utf-8; application/octet-stream; application/zip; downloadable_files_count: 3
Discipline Linguistics