Avian Influenza events from different digital surveillance tools

This dataverse contains all the input files needed to extract a normalized epidemiological event dataset from raw event data and then to evaluate the resulting normalized dataset. Our raw event data correspond to Avian Influenza events affecting bird species from 2019 to 2021, collected by three sources: PADI-web, ProMED and EMPRES-i. In our context, we define an epidemiological event as the detection of the virus at a specific date and time and in a specific location.

On the one hand, Indicator-Based Surveillance (IBS) refers to structured data collected through official routine surveillance systems. EMPRES-i is one example of such a system. We use the EMPRES-i data as ground truth in our evaluation. On the other hand, Event-Based Surveillance (EBS) refers to unstructured data gathered from intelligence sources of any nature, which can be either official (e.g. veterinary reports) or unofficial (e.g. news articles). Moreover, existing EBS tools fall into three categories: 1) moderated (i.e. human-curated), 2) partially moderated and 3) fully automated. PADI-web is fully automated and relies only on news articles, whereas ProMED is a human-curated system that relies on a worldwide network of experts who detect epidemiological information from official and unofficial sources.

The datasets in this repository are used in our work to evaluate and to compare a set of EBS tools of different nature in order to identify their strengths and weaknesses in Epidemic Intelligence. We invite interested readers to read our paper: N. Arınık & R. Interdonato & M. Roche & M. Teisseire, "An Evaluation Framework for Comparing Epidemic Intelligence Systems," in IEEE Access, vol. 11, pp. 31880-31901, 2023, doi: 10.1109/ACCESS.2023.3262462.

Briefly, our work evaluates a given set of EBS tools in terms of four aspects: 1) spatial analysis (how the events are geographically distributed), 2) temporal analysis (how the events are temporally distributed), 3) thematic entity analysis (what thematic entities are extracted from the events and how they are related to spatio-temporal analysis) and 4) news outlet analysis (what news sources play a key role in epidemiological information dissemination). For each aspect, we also propose an appropriate visualization for end-users.
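As a rough illustration of the first two aspects, the sketch below counts events per country and per calendar week. It is not the evaluation code from the paper or from the compebs repository: the function names are hypothetical, the sample rows are made up for the example, and only the column names (loc_country_code, year, week_no) are taken from the event dataset structure described on this page.

```python
from collections import Counter


def spatial_distribution(events):
    """Count events per country code (column `loc_country_code`)."""
    return Counter(event["loc_country_code"] for event in events)


def temporal_distribution(events):
    """Count events per (year, week_no) pair."""
    return Counter((int(event["year"]), int(event["week_no"])) for event in events)


# Hypothetical rows, invented for illustration only (not taken from the dataset).
sample_events = [
    {"id": "1", "loc_country_code": "FR", "year": "2021", "week_no": "3"},
    {"id": "2", "loc_country_code": "FR", "year": "2021", "week_no": "3"},
    {"id": "3", "loc_country_code": "DE", "year": "2020", "week_no": "50"},
]

by_country = spatial_distribution(sample_events)
by_week = temporal_distribution(sample_events)

print(by_country.most_common())
print(sorted(by_week.items()))
```

Computing the same counters for each EBS tool and for EMPRES-i would give a first, coarse view of where and when the sources agree or diverge.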

Our code for obtaining a normalized event dataset from raw event data is publicly available at https://github.com/arinik9/epidnews2event (it uses the files "raw_event_data.zip" and "eval_event_data.zip").

Our code for evaluating normalized event datasets is publicly available at https://github.com/arinik9/compebs (it uses the files "normalized_events.zip" and "eval_event_matching.zip").

Note that the structure of an event dataset (e.g. "normalized_events/events/padiweb/events.csv" in "normalized_events.zip") is as follows:

id: Event identifier.
article_id: Article/report identifiers of a given event in the considered EBS/IBS system. Note that multiple articles can report the same event.
url: URLs of the news articles reporting the considered event. Note that multiple articles can report the same event. Available only for PADI-web and ProMED.
source: Name of the news outlet reporting the considered news article.
geonames_id: GeoNames identifier for the spatial entity.
geoname_json: Raw GeoNames geocoding result for the spatial entity.
loc_name: Place name of the spatial entity.
loc_country_code: Associated country code for the spatial entity.
continent: Associated continent information for the spatial entity.
lat: Latitude of the spatial entity.
lng: Longitude of the spatial entity.
hierarchy_data: All GeoNames identifiers higher up in the hierarchy of the spatial entity.
published_at: Article/report publication date.
disease: Disease information with associated hierarchy.
host: Host information with associated hierarchy.
day_no: Day value of the publication date.
week_no: Week value of the publication date.
biweek_no: Bi-week value of the publication date.
month_no: Month value of the publication date.
year: Year value of the publication date.
season: Season value of the publication date.
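To make the structure above concrete, here is a minimal sketch of reading one event file straight from the archive with the Python standard library. Several things are assumptions rather than facts about the published files: the helper name read_events is hypothetical, the member path assumes the archive keeps the "normalized_events/" top-level folder, and the delimiter and encoding (comma, UTF-8) may need adjusting. The demo archive is built in memory with invented rows so that the example runs without downloading anything.

```python
import csv
import io
import zipfile

# Path quoted in the description above; whether the archive really keeps the
# top-level "normalized_events/" folder is an assumption.
MEMBER = "normalized_events/events/padiweb/events.csv"


def read_events(zip_source, member=MEMBER, delimiter=","):
    """Return the rows of one events file as a list of dicts.

    `zip_source` may be a path to "normalized_events.zip" or a file-like object.
    The default delimiter is an assumption; pass another one if needed.
    """
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text, delimiter=delimiter))


def build_demo_archive():
    """Build a tiny in-memory archive with invented rows (illustration only)."""
    rows = (
        "id,loc_name,loc_country_code,lat,lng,published_at\n"
        "1,Toulouse,FR,43.6,1.44,2021-01-15\n"
        "2,Berlin,DE,52.52,13.4,2020-12-10\n"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(MEMBER, rows)
    buffer.seek(0)
    return buffer


events = read_events(build_demo_archive())
for event in events:
    # Every value is read as a string; convert coordinates explicitly.
    print(event["id"], event["loc_name"], float(event["lat"]), float(event["lng"]))
```

With the real archive, the call would look like read_events("normalized_events.zip"), possibly with a different delimiter argument.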

Identifier
DOI https://doi.org/10.57745/Y3XROX
Metadata Access https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/Y3XROX
Provenance
Creator Arınık Nejat; Interdonato Roberto; Roche Mathieu; Teisseire Maguelonne
Publisher Recherche Data Gouv
Contributor TEISSEIRE, Maguelonne
Publication Year 2023
Rights etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess true
Contact TEISSEIRE, Maguelonne (INRAE)
Representation
Resource Type Text; Dataset
Format application/zip
Size 2257185; 79904; 1124160; 34758683
Version 2.0
Discipline Computer Science; Life Sciences; Medicine