The MilkOligoCorpus is a dataset of 30 PubMed abstracts and full-text extracts from scientific articles on the composition of milk oligosaccharides in mammalian species, manually annotated for training and evaluating information extraction tools. The corpus is designed to support the development and assessment of tools for named entity recognition, entity linking, and relation extraction, with the aim of capturing the variability of milk oligosaccharide profiles.
Named entity linking is essential for integrating information from diverse sources: it maps entity mentions to standard categories and associates them with unique identifiers. Because no ontology existed for several entity types, we developed four semantic resources alongside the corpus annotation: (i) the Female parity thesaurus, (ii) the sample thesaurus, (iii) the MO methods thesaurus, and (iv) the Oligo type thesaurus available at https://doi.org/10.57745/RA5DAC.
An annotation schema was also developed to identify the entities of interest and establish the relations between them. This schema serves as the foundation for the manual annotations, along with the guidelines, a 66-page document giving instructions on how to perform the annotations, available in the repository Z.
This archive includes: (i) the HoloOligo corpus dataset, (ii) the list of documents annotated in the HoloOligo corpus, (iii) the three thesauri required for the manual annotation, which are not available elsewhere, and (iv) the annotation schema.
An article detailing the development of the annotation schema and the creation of the gold standard corpus will be submitted to PLOS ONE.