These datasets contain unstructured data (articles) from news items detected by an event-based surveillance system, PADI-Web, between 2022 and 2023.
The collected articles were manually annotated for relevance to epidemic intelligence with the help of two epidemiologists.
The extracted data include relevant articles (with two possible labels: epidemiological events or general information) and irrelevant information regarding three different diseases: Avian Influenza (AI), African Swine Fever (ASF) and West Nile Virus disease (WNV).
This database is extensive as it covers different types of diseases (zoonotic, cross-border and vector-borne) and can be used to train or evaluate classification approaches that automatically identify written text about events of these diseases and classify it by relevance.
The structure of the dataset is as follows:
Alert_id: Article identifier. Note that each article has a unique ID; if an article reports multiple events, it is duplicated and each line represents one event.
Title: Article's title given by the news outlet.
hsource: URL of the news outlet reporting the article.
Source: Name of the news outlet reporting the article.
url: URL of the article reporting the considered event. Note that multiple articles can report the same event.
Issue_date: Date of the article publication.
Country: Name of the country where the event happened.
Place_name: Name of the administration, city or district where the event happened. If none of these is mentioned in the text, the country's name is reported.
Administrative_division: The administrative level at which the information is reported (country, department, city...).
Disease_name: Name of the disease that is reported in the article.
Species_name: Name of the affected host that is reported in the article.
Manualclass: Manual classification (Relevant or Irrelevant).
Lat: Latitude coordinate of Place_name.
Lon: Longitude coordinate of Place_name.
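
The column layout above can be checked and explored with a short script. The following is a minimal sketch in Python, not part of the dataset itself. It assumes the data are distributed as a comma-separated file with a header row (the file format and delimiter are not stated here, so adjust `delimiter` as needed). The helper names `load_rows`, `events_per_article` and `label_counts` are hypothetical, and the rows in `SAMPLE` are made-up placeholders (including the date format and the cell values), used only so that the example runs on its own.

```python
import csv
import io
from collections import Counter, defaultdict

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = [
    "Alert_id", "Title", "hsource", "Source", "url", "Issue_date",
    "Country", "Place_name", "Administrative_division", "Disease_name",
    "Species_name", "Manualclass", "Lat", "Lon",
]


def load_rows(handle, delimiter=","):
    """Read all rows from an open text file and check the expected columns."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def events_per_article(rows):
    """Group rows by Alert_id: an article reporting several events spans several rows."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Alert_id"]].append(row)
    return grouped


def label_counts(rows):
    """Count articles (not rows) per Manualclass label, using each article's first row."""
    first_label = {}
    for row in rows:
        first_label.setdefault(row["Alert_id"], row["Manualclass"].strip())
    return Counter(first_label.values())


# Made-up placeholder rows (NOT taken from the dataset), for illustration only.
SAMPLE = "\n".join([
    ",".join(EXPECTED_COLUMNS),
    "1,Example title A,https://example.org,Example Outlet,https://example.org/a,"
    "2022-01-01,France,Paris,city,Avian Influenza,duck,Relevant,48.85,2.35",
    "1,Example title A,https://example.org,Example Outlet,https://example.org/a,"
    "2022-01-01,France,Lyon,city,Avian Influenza,duck,Relevant,45.76,4.84",
    "2,Example title B,https://example.org,Example Outlet,https://example.org/b,"
    "2023-02-02,Italy,Italy,country,West Nile Virus,horse,Irrelevant,41.87,12.57",
])

if __name__ == "__main__":
    rows = load_rows(io.StringIO(SAMPLE))
    for alert_id, events in events_per_article(rows).items():
        print(alert_id, len(events), "event row(s)")
    print(label_counts(rows))
```

To use it on the real data, replace `io.StringIO(SAMPLE)` with an opened file, for example `open(path, newline="", encoding="utf-8")`, where `path` points to your local copy. Counting labels per article rather than per row avoids over-weighting articles that were duplicated because they report several events.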