MilkOligoCorpus, a rich semantic annotated resource for milk oligosaccharide complex information extraction

Dataset

The MilkOligoCorpus is a dataset of 30 PubMed abstracts and full-text extracts from scientific articles on the composition of milk oligosaccharides in mammalian species, manually annotated for training and evaluating information extraction tools. The corpus is designed to support the development and assessment of tools for named entity recognition, entity linking and relation extraction that capture the variability of milk oligosaccharide profiles. Named entity linking is essential for integrating information from diverse sources: it maps entity mentions to standard categories and associates them with unique identifiers. Along with the corpus annotation, we therefore developed four semantic resources to address the absence of existing ontologies for several entities: (i) the Female parity thesaurus, (ii) the sample thesaurus, (iii) the MO methods thesaurus, and (iv) the Oligo type thesaurus available at https://doi.org/10.57745/RA5DAC. An annotation schema was also developed that identifies the entities of interest and establishes relations between them. This annotation schema serves as the foundation for the manual annotations, along with the guidelines, a 66-page document that gives instructions on how to perform the annotations, available in the repository Z. This archive includes: (i) the HoloOligo corpus dataset, (ii) the list of the documents annotated in the HoloOligo corpus, (iii) the three thesauri required for the manual annotation, which are not available elsewhere, and (iv) the annotation schema. An article detailing the development of the annotation schema and the creation of the gold standard corpus will be submitted to PLOS One.
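
As a rough illustration of what entity linking against a thesaurus involves, the sketch below maps a text mention to a concept identifier by normalized lookup over labels and synonyms. It is only a minimal sketch under stated assumptions: the identifiers (`SAMPLE:0001`, ...), the labels, the synonyms and the function names are all hypothetical and are not taken from the MilkOligoCorpus thesauri or annotation tooling.

```python
from typing import Dict, List, Optional

# Hypothetical thesaurus excerpt: these identifiers, labels and synonyms are
# invented for illustration and do not come from the actual thesauri.
THESAURUS: Dict[str, Dict[str, object]] = {
    "SAMPLE:0001": {"label": "colostrum", "synonyms": ["first milk"]},
    "SAMPLE:0002": {"label": "mature milk", "synonyms": ["milk"]},
}


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def build_index(thesaurus: Dict[str, Dict[str, object]]) -> Dict[str, str]:
    """Map every normalized label and synonym to its concept identifier."""
    index: Dict[str, str] = {}
    for concept_id, entry in thesaurus.items():
        names: List[str] = [entry["label"], *entry["synonyms"]]
        for name in names:
            index[normalize(name)] = concept_id
    return index


def link(mention: str, index: Dict[str, str]) -> Optional[str]:
    """Return the identifier for a mention, or None if it is not in the index."""
    return index.get(normalize(mention))


INDEX = build_index(THESAURUS)
```

With this toy index, `link("Colostrum", INDEX)` and `link("first  milk", INDEX)` both resolve to the same hypothetical identifier, while an unknown mention returns `None`. Exact-match lookup is the simplest possible strategy; real entity linking tools typically add lemmatization, fuzzy matching or context-based disambiguation.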

Identifier
DOI	https://doi.org/10.57745/LFXGFO
Metadata Access	https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/LFXGFO
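
The Metadata Access link above is an OAI-PMH `GetRecord` request. The sketch below shows how such a URL can be assembled from its parts using only the Python standard library; the endpoint, the `oai_datacite` metadata prefix and the DOI are taken from the link itself, whereas the function name `oai_getrecord_url` is a hypothetical helper introduced for illustration. The code only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Endpoint and parameter values copied from the Metadata Access link.
OAI_ENDPOINT = "https://entrepot.recherche.data.gouv.fr/oai"


def oai_getrecord_url(doi: str, metadata_prefix: str = "oai_datacite") -> str:
    """Build an OAI-PMH GetRecord URL for a DOI (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": f"doi:{doi}",
        },
        safe=":/",  # keep ':' and '/' unescaped, as in the original link
    )
    return f"{OAI_ENDPOINT}?{query}"


url = oai_getrecord_url("10.57745/LFXGFO")
```

The resulting `url` can then be passed to any HTTP client to retrieve the XML metadata record.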

Provenance
Creator	Combes, Sylvie ; Rumeau, Mathilde ; Nedellec, Claire ; Deleger, Louise ; Bossy, Robert ; Loux, Valentin (ORCID: 0000-0002-8268-915X); Ba, Mouhamadou ; Courtin, Marine ; Knudsen, Christelle
Publisher	Recherche Data Gouv
Contributor	Combes, Sylvie; Entrepôt-Catalogue Recherche Data Gouv
Publication Year	2025
Funding Reference	Agence nationale de la recherche ANR-21-CE20-0045
Rights	etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess	true
Contact	Combes, Sylvie (INRAE)

Representation
Resource Type	Dataset
Format	image/png; text/tab-separated-values; application/zip
Size	74772; 139; 249488; 171; 3436; 4810; 96; 105
Version	1.0
Discipline	Agriculture, Forestry, Horticulture; Computer Science; Agricultural Sciences; Agriculture, Forestry, Horticulture, Aquaculture; Agriculture, Forestry, Horticulture, Aquaculture and Veterinary Medicine; Life Sciences