The dataset is an archive of reader comments posted on the Delfi news site between 2014 and 2019. It contains approximately 12.4 million comments, mostly in Latvian, with some in Russian.
Description of the Datasets
There are 6 CSV files:
* lv-comments-2014.csv
contains 2 753 655 comments from year 2014
* lv-comments-2015.csv
contains 2 221 122 comments from year 2015
* lv-comments-2016.csv
contains 1 897 669 comments from year 2016
* lv-comments-2017.csv
contains 1 896 083 comments from year 2017
* lv-comments-2018.csv
contains 2 222 051 comments from year 2018
* lv-comments-2019.csv
contains 1 421 883 comments from year 2019
In total: 12 412 463 comments
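As a quick sanity check, the per-file counts listed above can be added up to confirm the stated total. This is a minimal sketch that uses only the numbers from the list; the `counts` dictionary is an illustrative name, not part of the dataset.

```python
# Per-file comment counts, copied from the list above.
counts = {
    "lv-comments-2014.csv": 2_753_655,
    "lv-comments-2015.csv": 2_221_122,
    "lv-comments-2016.csv": 1_897_669,
    "lv-comments-2017.csv": 1_896_083,
    "lv-comments-2018.csv": 2_222_051,
    "lv-comments-2019.csv": 1_421_883,
}

# Sum over all six yearly files.
total = sum(counts.values())
print(total)  # 12412463
```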
Columns:
* comment_id
(string) - the ID of the written comment
* article_id
(string) - the ID of the article for which the comment was written
* created_time
(string) - the time and date of the comment
* subject
(string) - the title of the comment
* reply_to_comment_id
(string) - the parent comment's ID
* content
(string) - the comment itself
* is_anonymous
(string) - whether the comment was published anonymously:
* 1 if the comment was published anonymously
* 0 if the comment was published by a registered user
* is_enabled
(string) - whether the comment was published:
* 1 if the comment was published (online)
* 0 if it wasn’t published
* Treat this field with caution: not all comments have been manually moderated
* The moderators provided no additional information
* channel_language
(string) - the language of the channel
* 'nat' for Latvian
* 'rus' for Russian
* create_user_id
(string) - the user ID of the commenter
* modereted_by
(string) - the ID of the moderator
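Since every column is stored as a string, a reader will usually want to convert the 0/1 flags and the optional parent ID before filtering. The sketch below shows one way to do that with Python's standard `csv` module. It assumes the files are comma-delimited with a header row that matches the column names above, and that a missing `reply_to_comment_id` is an empty string; none of this is confirmed by the description. The two sample rows, including the `created_time` format, are invented for illustration, and `parse_row` and `read_comments` are hypothetical helper names.

```python
import csv
import io

# Hypothetical two-row sample; every value here is invented for illustration.
SAMPLE = (
    "comment_id,article_id,created_time,subject,reply_to_comment_id,content,"
    "is_anonymous,is_enabled,channel_language,create_user_id,modereted_by\n"
    "c1,a1,2014-01-01 10:00:00,Title,,First comment,1,1,nat,,\n"
    "c2,a1,2014-01-01 10:05:00,Re,c1,A reply,0,0,rus,u7,m3\n"
)


def parse_row(row):
    """Convert the '0'/'1' flags to booleans and an empty parent ID to None."""
    return {
        **row,
        "is_anonymous": row["is_anonymous"] == "1",
        "is_enabled": row["is_enabled"] == "1",
        # Assumption: top-level comments carry an empty reply_to_comment_id.
        "reply_to_comment_id": row["reply_to_comment_id"] or None,
    }


def read_comments(handle):
    """Read all rows from an open CSV handle into a list of dicts."""
    return [parse_row(row) for row in csv.DictReader(handle)]


comments = read_comments(io.StringIO(SAMPLE))

# Published comments from the Latvian channel.
published_latvian = [
    c for c in comments if c["is_enabled"] and c["channel_language"] == "nat"
]

# Comments that reply to another comment.
replies = [c for c in comments if c["reply_to_comment_id"] is not None]
```

For a real file, the same `read_comments` function could be given a handle from `open("lv-comments-2014.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is also an assumption. With files of this size it may be preferable to iterate over `csv.DictReader` row by row rather than build a full list in memory.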