Tweets used to explore causes of self-reported foodborne illnesses on social media 2017

Dataset

Data collected from the Twitter social media platform (10 Nov 2017 - 18 Dec 2017) to explore causes of self-reported foodborne illnesses on social media from posts originating in Scotland, UK. The dataset contains Tweet IDs and the keywords used to search for Tweets via programmatic access to the public Twitter API. The archive also includes keywords that were used to cluster retrieved Tweets into smaller groups of messages mentioning specific keywords. These include lists of keywords describing ingredients, foods and drinks, cooking techniques, and domestic implements. Additional keywords relating to food and places associated with food (e.g. restaurants) were generated using an automated machine learning tool based on a set of seed keywords. The final set of clustering keywords is a list of names of food businesses located in Glasgow, UK.

Social media and other forms of online content have enormous potential as a way to understand people's opinions and attitudes, and as a means to observe emerging phenomena such as disease outbreaks. How might policy makers use such new forms of data to better assess existing policies and help formulate new ones? This one-year demonstrator project is a partnership between computer science academics at the University of Aberdeen and officers from Food Standards Scotland which aims to answer this question. Food Standards Scotland is the public-sector food body for Scotland, created by the Food (Scotland) Act 2015. It regularly provides policy guidance to ministers in areas such as food hygiene monitoring and reporting, food-related health risks, and food fraud.
The project will develop a software tool (the Food Sentiment Observatory) that will be used to explore the role of data from sources such as Twitter, Facebook, and TripAdvisor in three policy areas selected by Food Standards Scotland:
- attitudes to the differing food hygiene information systems used in Scotland and the other UK nations;
- study of a historical E. coli outbreak to understand the effectiveness of monitoring and decision-making protocols;
- understanding the potential role of social media data in responding to new and emerging forms of food fraud.
The Observatory will integrate a number of existing software tools (developed in our recent research) to allow us to mine large volumes of data to identify important textual signals, extract opinions held by individuals or groups, and, crucially, document these data processing operations to aid the transparency of policy decision-making. Given the amount of noise in user-generated online content (such as fake restaurant reviews), we intend to investigate methods for extracting meaningful and reliable knowledge to better support policy making.
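
The keyword-based clustering of retrieved Tweets mentioned in the dataset description could be approximated as in the minimal sketch below. It is not the Observatory's actual implementation: the function name `cluster_by_keyword`, the whole-word case-insensitive matching rule, and the sample Tweet texts are all assumptions made for illustration. It also assumes the Tweet text is available, whereas the archive itself contains Tweet IDs only.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_by_keyword(
    tweets: Iterable[Tuple[str, str]], keywords: Iterable[str]
) -> Dict[str, List[str]]:
    """Group Tweet IDs by each clustering keyword found in the Tweet text.

    `tweets` is an iterable of (tweet_id, text) pairs. Matching is
    case-insensitive and on whole words (an assumption, not the
    documented behaviour of the original tool).
    """
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    clusters: Dict[str, List[str]] = defaultdict(list)
    for tweet_id, text in tweets:
        for kw, pattern in patterns.items():
            if pattern.search(text):
                # A Tweet may belong to several clusters at once.
                clusters[kw].append(tweet_id)
    return dict(clusters)


# Hypothetical sample data; the keywords are taken from the seed word list.
sample_tweets = [
    ("1", "Bad chicken burger last night"),
    ("2", "That salad made me ill"),
    ("3", "Nothing to see"),
]
clusters = cluster_by_keyword(sample_tweets, ["chicken", "burger", "salad"])
```

A Tweet that mentions several keywords appears in several groups, which matches the description of clusters as "messages containing mentions of specific keywords" rather than a strict partition.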

The search for relevant data content was performed using a custom-built data collection module within the Observatory platform. A public API provided by Twitter was used to gather all social media messages (Tweets) matching a specific set of keywords. Each line in the sickness-keywords.txt file contains a search keyword/phrase used to retrieve matching Tweets, which had to include at least one of the search keywords/phrases. The search string used by the API was therefore constructed as follows: keyword1 OR keyword2 OR keyword3 OR ...

The Twitter API allows historical searches to be restricted to Tweets associated with a specific location; however, the location can only be specified as a radius around a given latitude and longitude point. We used Twitter's geo-restricted search by defining a Lat/Long point and a radius (in kilometres). To cover the major areas of Scotland we used the following three geo-restrictions:
- Latitude = 57.502053, Longitude = -4.954833, Radius = 220 km
- Latitude = 55.837799, Longitude = -3.221740, Radius = 70 km
- Latitude = 55.475221, Longitude = -4.369812, Radius = 90 km

Clustering keywords describing ingredients, foods and drinks, cooking techniques and domestic implements were extracted from DBpedia. Clustering keywords describing foods and places to eat were generated using a machine learning tool (see Related Resources) utilising the Word2vec approach (the English Google News Negative 300 model was used in this case). The following seed words were used to generate keywords referring to high-risk food: shellfish, meat, cheese, pate, egg, barbecue, salad, fish, milk, chicken, burger, lettuce, rice, food. The following seed words were used to generate keywords referring to places to eat: takeaway, restaurant, cafe, bistro, kitchen, eatery, hotel, pub, bakery, shop. An open dataset containing food hygiene rating data for Scotland in 2017 (see Related Resources) was used to extract the names of businesses based in Glasgow, UK.
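
The search string and geo-restrictions described above could be assembled roughly as in the following sketch. It is an illustration under stated assumptions, not the Observatory's code: the function names are hypothetical, the example keywords are invented rather than taken from sickness-keywords.txt, quoting of multi-word phrases is assumed, and the `latitude,longitude,radius` string follows the `geocode` parameter format of Twitter's standard search API, which the record does not name explicitly.

```python
from typing import Iterable, List, Tuple

# The three geo-restrictions listed in the record: (latitude, longitude, radius in km).
GEO_RESTRICTIONS: List[Tuple[float, float, int]] = [
    (57.502053, -4.954833, 220),
    (55.837799, -3.221740, 70),
    (55.475221, -4.369812, 90),
]


def build_search_query(keywords: Iterable[str]) -> str:
    """Join keywords/phrases as 'keyword1 OR keyword2 OR ...'.

    Blank lines are skipped. Multi-word phrases are wrapped in double
    quotes so they are matched as phrases (an assumption).
    """
    terms = []
    for kw in keywords:
        kw = kw.strip()
        if not kw:
            continue
        terms.append(f'"{kw}"' if " " in kw else kw)
    return " OR ".join(terms)


def geocode_param(lat: float, lon: float, radius_km: int) -> str:
    """Format a point and radius as 'lat,lon,<radius>km'."""
    return f"{lat:.6f},{lon:.6f},{radius_km}km"


# Hypothetical keywords standing in for the lines of sickness-keywords.txt.
query = build_search_query(["food poisoning", "vomiting", " "])
geocodes = [geocode_param(*g) for g in GEO_RESTRICTIONS]
```

Because the API accepts only one point and radius per request, the same query would be issued once per geo-restriction, and Tweets returned by overlapping circles would need de-duplicating by Tweet ID.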

Identifier
DOI	https://doi.org/10.5255/UKDA-SN-853375
Metadata Access	https://datacatalogue.cessda.eu/oai-pmh/v0/oai?verb=GetRecord&metadataPrefix=oai_ddi25&identifier=a8c9871c504bb30ae675b3dc764d5e264e8ccbb0dcf86a3715478acfadeaf3d0

Provenance
Creator	Edwards, P, University of Aberdeen; Markovic, M, University of Aberdeen; Petrunova, N, University of Aberdeen; Lin, C, University of Aberdeen; Corsar, D, University of Aberdeen
Publisher	UK Data Service
Publication Year	2018
Funding Reference	Economic and Social Research Council
Rights	Peter Edwards, University of Aberdeen; The Data Collection is available to any user without the requirement for registration for download/access.
OpenAccess	true

Representation
Language	English
Resource Type	Text
Discipline	Social Sciences
Spatial Coverage	Scotland; United Kingdom