Latvian user comment dataset 1.0

Dataset

The dataset is an archive of reader comments posted on the Delfi news site between 2014 and 2019. It contains approximately 12.4 million comments, mostly in Latvian, with some in Russian.

Description of the Datasets

There are 6 CSV files:
* lv-comments-2014.csv contains 2 753 655 comments from year 2014
* lv-comments-2015.csv contains 2 221 122 comments from year 2015
* lv-comments-2016.csv contains 1 897 669 comments from year 2016
* lv-comments-2017.csv contains 1 896 083 comments from year 2017
* lv-comments-2018.csv contains 2 222 051 comments from year 2018
* lv-comments-2019.csv contains 1 421 883 comments from year 2019

In total: 12 412 463 comments
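
The stated total can be checked against the per-file counts. The following is a minimal sketch in Python; the counts are copied from this description, and the variable names are only illustrative.

```python
# Per-file comment counts as given in this description.
comments_per_file = {
    "lv-comments-2014.csv": 2_753_655,
    "lv-comments-2015.csv": 2_221_122,
    "lv-comments-2016.csv": 1_897_669,
    "lv-comments-2017.csv": 1_896_083,
    "lv-comments-2018.csv": 2_222_051,
    "lv-comments-2019.csv": 1_421_883,
}

# Sum the yearly counts to reproduce the overall figure.
total = sum(comments_per_file.values())
print(total)  # 12412463
```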

Columns:
* comment_id (string) - the ID of the written comment
* article_id (string) - the ID of the article for which the comment was written
* created_time (string) - the time and date of the comment
* subject (string) - the title of the comment
* reply_to_comment_id (string) - the parent comment's ID
* content (string) - the comment itself
* is_anonymous (string) -
    * 1 if the comment was published anonymously
    * 0 if the comment was published by a registered user
* is_enabled (string) -
    * 1 if the comment was published (online)
    * 0 if it wasn’t published
    * Questionable field: not all comments have been manually moderated
    * No additional information from the moderators
* channel_language (string) - the language of the channel
    * 'nat' for Latvian
    * 'rus' for Russian
* create_user_id (string) - the user ID of the commenter
* modereted_by (string) - the ID of the moderator
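
Because every column is stored as a string, the 0/1 flags need converting before they can be used for filtering. The sketch below shows one way to do this with Python's standard `csv` module. It rests on several assumptions that this description does not confirm: that each file has a header row with exactly the column names listed above, that fields are comma-separated, and that an empty `reply_to_comment_id` marks a top-level comment. The helper names `parse_row` and `read_comments` are hypothetical, and the sample rows (including the timestamp format) are invented for illustration, not taken from the dataset.

```python
import csv
import io

# Column names as listed above ("modereted_by" is spelled as in the description).
COLUMNS = [
    "comment_id", "article_id", "created_time", "subject",
    "reply_to_comment_id", "content", "is_anonymous", "is_enabled",
    "channel_language", "create_user_id", "modereted_by",
]


def parse_row(row):
    """Convert the string flags of one CSV row into Python booleans."""
    parsed = dict(row)
    parsed["is_anonymous"] = row["is_anonymous"] == "1"
    parsed["is_enabled"] = row["is_enabled"] == "1"
    # Assumption: an empty reply_to_comment_id marks a top-level comment.
    parsed["reply_to_comment_id"] = row["reply_to_comment_id"] or None
    return parsed


def read_comments(handle):
    """Yield parsed rows from an open CSV file with a header row (assumed)."""
    for row in csv.DictReader(handle):
        yield parse_row(row)


# Invented sample data, NOT taken from the dataset; it only mimics the layout.
sample = io.StringIO(
    ",".join(COLUMNS) + "\n"
    "c1,a1,2014-01-01 10:00:00,Title,,First comment,1,1,nat,,\n"
    "c2,a1,2014-01-01 10:05:00,Re: Title,c1,A reply,0,0,rus,u42,m7\n"
)

rows = list(read_comments(sample))
published = [r for r in rows if r["is_enabled"]]
print(len(rows), len(published))
```

With a real file, the in-memory sample would be replaced by something like `open("lv-comments-2014.csv", encoding="utf-8", newline="")`. Reading row by row through a generator keeps memory use low, which matters for files with around two million comments each.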

Identifier
PID	http://hdl.handle.net/11356/1407
Related Identifier	https://www.aclweb.org/anthology/2021.hackashop-1.14.pdf
Related Identifier	http://embeddia.eu/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1407

Provenance
Creator	Shekhar, Ravi; Purver, Matthew; Pollak, Senja; Pelicon, Andraž; Krustok, Ivar
Publisher	Ekspress Meedia Group
Publication Year	2021
Funding Reference	info:eu-repo/grantAgreement/EC/H2020/825153
Rights	Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); https://creativecommons.org/licenses/by-nc-nd/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Latvian; Russian
Resource Type	corpus
Format	application/octet-stream; text/csv; text/plain; charset=utf-8; downloadable_files_count: 7
Discipline	Linguistics