MilkOligoCorpus, a rich semantic annotated resource for milk oligosaccharide complex information extraction

The MilkOligoCorpus is a dataset of 30 PubMed abstracts and full-text extracts from scientific articles on the composition of milk oligosaccharides in mammalian species, manually annotated for training and evaluating information extraction tools. The corpus is designed to support the development and assessment of tools for named entity recognition, entity linking and relation extraction that capture the variability of milk oligosaccharide profiles. Named entity linking is essential for integrating information from diverse sources: it maps entity mentions to standard categories and associates them with unique identifiers. Along with the corpus annotation, we therefore developed four semantic resources to address the absence of existing ontologies for several entities: (i) the Female parity thesaurus, (ii) the sample thesaurus, (iii) the MO methods thesaurus and (iv) the Oligo type thesaurus available at https://doi.org/10.57745/RA5DAC. An annotation schema was also developed that identifies the entities of interest and establishes relations between them. This schema, together with the annotation guidelines (a 66-page document giving instructions on how to perform the annotations, available in the repository Z), serves as the foundation for the manual annotations. This archive includes: (i) the HoloOligo corpus dataset, (ii) the list of the documents annotated in the HoloOligo corpus, (iii) the three thesauri required for the manual annotation, which are not available elsewhere, and (iv) the annotation schema. An article detailing the development of the annotation schema and the creation of the gold standard corpus will be submitted to PLOS One.

Identifier
DOI https://doi.org/10.57745/LFXGFO
Metadata Access https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/LFXGFO
Provenance
Creator Combes, Sylvie; Rumeau, Mathilde; Nedellec, Claire; Deleger, Louise; Bossy, Robert; Loux, Valentin (ORCID: 0000-0002-8268-915X); Ba, Mouhamadou; Courtin, Marine; Knudsen, Christelle
Publisher Recherche Data Gouv
Contributor Combes, Sylvie; Entrepôt-Catalogue Recherche Data Gouv
Publication Year 2025
Funding Reference Agence nationale de la recherche ANR-21-CE20-0045
Rights etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess true
Contact Combes, Sylvie (INRAE)
Representation
Resource Type Dataset
Format image/png; text/tab-separated-values; application/zip
Size 74772; 139; 249488; 171; 3436; 4810; 96; 105
Version 1.0
Discipline Agriculture, Forestry, Horticulture; Computer Science; Agricultural Sciences; Agriculture, Forestry, Horticulture, Aquaculture; Agriculture, Forestry, Horticulture, Aquaculture and Veterinary Medicine; Life Sciences
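
The Metadata Access entry in this record follows the usual OAI-PMH GetRecord pattern: an endpoint, a verb, a metadata prefix and an identifier derived from the DOI. The sketch below shows how such a URL could be assembled for this dataset. The endpoint, prefix and DOI are taken from the record itself; the helper name oai_getrecord_url is hypothetical and is not part of any repository tooling.

```python
from urllib.parse import urlencode

# Endpoint copied from the "Metadata Access" entry of this record.
OAI_ENDPOINT = "https://entrepot.recherche.data.gouv.fr/oai"


def oai_getrecord_url(doi: str, metadata_prefix: str = "oai_datacite") -> str:
    """Build an OAI-PMH GetRecord URL for a dataset DOI (hypothetical helper)."""
    query = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"doi:{doi}",
    }
    # Leave ':' and '/' unescaped so the result has the same form as the
    # URL listed in the record.
    return f"{OAI_ENDPOINT}?{urlencode(query, safe=':/')}"


if __name__ == "__main__":
    # Prints the metadata URL for the MilkOligoCorpus record.
    print(oai_getrecord_url("10.57745/LFXGFO"))
```

The function only builds the URL and performs no network request; fetching and parsing the returned DataCite XML is left to whatever HTTP client the reader prefers.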