Mapping onto EUDAT-B2FIND Metadata Schema

The provided metadata must be mapped to the B2FIND schema in a meaningful way. Currently this is done in close cooperation between the data provider and the B2FIND team: the mapping is discussed iteratively until a suitable solution is reached for each case.

Specification of community metadata

The implementation of the mapping, as described in the following subsection, is based on a detailed specification and documentation of the community-specific metadata. We have designed a spreadsheet template for gathering the required data. The Excel template can be requested via the support form or by email, or downloaded from Google Drive as Community-B2FIND_template.xlsx. This template is divided into several parts, each in its own tab:

  • General Information: In this tab, data providers should provide information about the contact persons and the community.
  • Metadata Specification: Please give us more detailed information about the specific metadata formats, schemas and structure used.
  • Harvesting: Here the harvesting endpoints (e.g. OAI URLs) should be provided, as well as the protocols and APIs used and any subsets, if available.
  • Mapping: In this table, the mapping of the community properties to the B2FIND schema and coverage information should be laid out. This is iteratively discussed and developed with the data provider during the initial intake process.
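
As an illustration of what the Harvesting tab describes, the following minimal sketch builds OAI-PMH request URLs from an endpoint, a metadata prefix and an optional subset. The verbs and argument names (ListRecords, GetRecord, metadataPrefix, set, identifier) come from the OAI-PMH protocol; the endpoint URL, the set name, the record identifier and the helper oai_request_url are hypothetical and only serve the example.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **params):
    """Build an OAI-PMH request URL from an endpoint and its arguments."""
    query = {"verb": verb}
    # Skip arguments that were not supplied (e.g. no subset is defined).
    query.update({key: value for key, value in params.items() if value is not None})
    return f"{base_url}?{urlencode(query)}"


# Hypothetical endpoint, subset and record identifier, for illustration only.
endpoint = "https://repository.example.org/oai"

# Harvest all records of one subset in Dublin Core format.
list_url = oai_request_url(
    endpoint, "ListRecords", metadataPrefix="oai_dc", set="climate"
)

# Fetch a single record by its OAI identifier.
record_url = oai_request_url(
    endpoint,
    "GetRecord",
    metadataPrefix="oai_dc",
    identifier="oai:repository.example.org:1234",
)

print(list_url)
print(record_url)
```

Stating the endpoint, the metadata prefix and the subsets explicitly in the template means that requests of this kind can be assembled without further queries to the data provider.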

Homogenisation and Semantic Mapping

To transform the harvested raw metadata records into datasets that can be uploaded to the B2FIND catalogue, indexed, and displayed in the B2FIND portal, the following processing steps must be carried out:

  1. Select entries from the XML records, based on XPath rules that depend on the community-specific metadata formats (see providing metadata).
  2. Parse the selected values and assign them to the keys specified in the XPath rules, i.e. the fields of the B2FIND schema.
  3. Store the resulting key-value pairs in JSON dictionaries.
  4. Check and validate these JSON records before uploading to the B2FIND repository.

This mapping procedure needs to be adapted and extended regularly as the requirements of the communities change.
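
The first three steps can be sketched as follows. This is a minimal illustration using Python's standard library, not the B2FIND implementation: the rule set XPATH_RULES, the choice of list-valued fields, the function map_record and the sample Dublin Core record are all assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core elements used in the sample record.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical rule set: B2FIND field -> XPath expression.
XPATH_RULES = {
    "Title": ".//dc:title",
    "Description": ".//dc:description",
    "Tags": ".//dc:subject",
    "Creator": ".//dc:creator",
    "Source": ".//dc:identifier",
}

# Fields assumed to hold lists; all others keep the first match only.
LIST_FIELDS = {"Tags", "Creator"}


def map_record(xml_text):
    """Steps 1-3: select values by XPath and collect them as key-value pairs."""
    root = ET.fromstring(xml_text)
    record = {}
    for field, xpath in XPATH_RULES.items():
        values = [
            element.text.strip()
            for element in root.findall(xpath, NS)
            if element.text and element.text.strip()
        ]
        if not values:
            continue  # nothing selected for this field
        record[field] = values if field in LIST_FIELDS else values[0]
    return record


# Hypothetical harvested record, for illustration only.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sea surface temperature, North Atlantic</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:subject>oceanography</dc:subject>
  <dc:subject>temperature</dc:subject>
  <dc:identifier>https://repository.example.org/record/1234</dc:identifier>
</oai_dc:dc>"""

record = map_record(SAMPLE)
json_text = json.dumps(record, indent=2, ensure_ascii=False)
print(json_text)
```

Keeping the rules in a plain dictionary separates the community-specific part (the XPath expressions) from the generic selection logic, so a new community format only requires a new rule set.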

EUDAT-B2FIND Metadata Schema

To provide a unified search space, B2FIND has established a common, interdisciplinary metadata schema. This schema is based on the DataCite Metadata Schema 4.1 and is therefore also compatible with the guidelines of other e-infrastructures such as OpenAIRE, whose schemas are likewise based on the DataCite schema.

The B2FIND Metadata Schema 1.0 is the current version and was released on August 12, 2017. The associated XSD file can be downloaded from b2find_schema_0.1.xsd.

Currently the schema comprises 19 fields or facets as listed in the following table with their description, allowed values and references to the associated properties in the DataCite Metadata Schema 4.1.

| Metadata Type | B2FIND Name | Description | Allowed values | DataCite 4.0 reference | Obligation | Occurrence | Comments and Issues |
| --- | --- | --- | --- | --- | --- | --- | --- |
| General Information | Title | A name or a title by which a resource is known | Textual | 3. Title | Mandatory | 1 | Coding must be UTF-8 (unicode) |
| General Information | Description | An additional information describing the content of the resource. Could be an abstract, a summary or a Table of Content. | Textual | 17. Description | Recommended | 0-1 | Coding should be UTF-8 (unicode) |
| General Information | Tags | A subject, keyword, classification code, or key phrase describing the content. | List of strings, filter out 'non nouns' by using 'stop words' | 6. Subject | Optional | 1 | Try to use keyword thesauri from communities |
| Identifier | DOI | A persistent, citable identifier (registered at DataCite) that uniquely identifies a resource. | Must be resolvable URL, registered at DataCite as DOI | 1. Identifier, 1.1. identifierType = DOI | Mandatory (at least one resource identifier is mandatory) | 1-3 | |
| Identifier | PID | A persistent identifier (implemented as a handle in a Handleserver) that uniquely identifies a resource. | Must be resolvable URL and registered at a handle server | 1. Identifier | see DOI | see DOI | |
| Identifier | Source | An identifier (URL) that uniquely identifies a resource. | Should be resolvable URL | 1. Identifier | see DOI | see DOI | |
| Identifier | RelatedIdentifier | A link to related resources or supplements. | Should be resolvable URL | 12. relatedIdentifier | Optional | 0-n | |
| Identifier | MetaDataAccess | Link to the original harvested metadata record (GetRecord request) | Should be resolvable URL | N/A | Recommended | 0-1 | |
| Provenance | Creator | The main researchers involved in producing the data, or the authors of the publication, in priority order. | List of names | 2. Creator | Recommended | 0-n | |
| Provenance | Publisher | The name of the entity that holds, archives, publishes, prints, distributes, releases, issues, or produces the resource. This property will be used to formulate the citation, so consider the prominence of the role. | List of names | 4. Publisher | Recommended | 0-n | |
| Provenance | Contributor | The institution or person responsible for collecting, managing, distributing, or otherwise contributing to the development of the resource. | List of names | 7. Contributor | Optional | 0-n | |
| Provenance | PublicationYear | The year when the data was or will be made publicly available. | UTC Year format (YYYY) | 5. PublicationYear | Recommended | 0-1 | |
| Provenance | Rights | Any rights information for this resource. | Textual | 16. Rights | Optional | 0-1 | |
| Provenance | OpenAccess | Is the dataset openly accessible or not. | Boolean | | Optional | 0-1 | |
| Provenance | Contact | Any contact information for this resource. | List of Names | [ may be 7. Contributor] | Optional | 0-n | |
| Representation | Language | The primary language of the resource. | Allowed values are taken from ISO 639-1 language codes. | 9. Language | Optional | 0-1 | Examples: English, German, French |
| Representation | ResourceType | A description of the resource | Textual | 10. ResourceType | Recommended | 0-1 | |
| Representation | Format | Technical format of the resource | Textual | 14. Format | Optional | 0-1 | |
| Coverage | Discipline | The scientific disciplines linked with the resource. | Controlled vocabulary, see b2find_disciplines.json | N/A [ sometimes information in 6. Subject ] | Recommended | 0-n | |
| Coverage | Spatial Coverage | A geolocation where the research data was gathered or/and about which the data is focused and related to. Content of this category is displayed in plain text. If a longitude/latitude information is given it will be displayed at the map. | Textual geo spatial description (Spatial region or named place (geonames)) and if longitude/latitude information is given displayed at the map. | 18. Geolocation | Optional | 0-1 | |
| Coverage | Temporal Coverage | Period of time the research data itself is related to. Could be a date format or plain text. | Date-time representation | 8. Date / [8.1 dateType = Collected?] | Optional | 0-1 | Not really provided by DataCite in the sense of coverage |
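
A few of the constraints in this table lend themselves to automatic checks on a mapped record. The sketch below covers only four of them (mandatory Title, at least one of DOI, PID or Source, PublicationYear in YYYY form, boolean OpenAccess). The function validate_record and the sample records are hypothetical and do not reproduce the checks actually performed by B2FIND.

```python
import re

# At least one of these resource identifiers must be present.
IDENTIFIER_FIELDS = ("DOI", "PID", "Source")


def validate_record(record):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []

    title = record.get("Title")
    if not isinstance(title, str) or not title.strip():
        problems.append("Title is mandatory")

    if not any(record.get(field) for field in IDENTIFIER_FIELDS):
        problems.append("at least one of DOI, PID or Source is mandatory")

    year = record.get("PublicationYear")
    if year is not None and not re.fullmatch(r"[0-9]{4}", str(year)):
        problems.append("PublicationYear must have the form YYYY")

    open_access = record.get("OpenAccess")
    if open_access is not None and not isinstance(open_access, bool):
        problems.append("OpenAccess must be a boolean")

    return problems


# Hypothetical records, for illustration only.
complete = {
    "Title": "Sea surface temperature, North Atlantic",
    "Source": "https://repository.example.org/record/1234",
    "PublicationYear": "2017",
    "OpenAccess": True,
}
incomplete = {"Description": "Record without title", "PublicationYear": "17"}

print(validate_record(complete))
print(validate_record(incomplete))
```

Returning all problems at once, rather than stopping at the first, makes it easier to report back to the data provider what needs to be corrected in a record.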

Concordance with other Standards

As noted above, the EUDAT-B2FIND schema is compatible with other widely used standards. The following table shows its compatibility with the core schema of EUDAT-B2SHARE and with the open access initiative OpenAIRE, using the DataCite schema as the reference. The obligation is specified for each field, where M stands for mandatory, R for recommended and O for optional; MA, which appears in the OpenAIRE column, stands for mandatory if applicable.

| DataCite # | DataCite 4.1 | B2FIND | B2SHARE | OpenAIRE | Comments and Issues |
| --- | --- | --- | --- | --- | --- |
| 1 | Identifier(M) (+ 1.1. identifierType=[DOI]) | [Source(URL) / DOI / PID] (M) | PID(M),DOI,URL | Identifier(M) (+ 1.1. identifierType=[DOI , ...]) | While for B2SHARE always a PID is provided, B2FIND requires at least an URL linked to the underlying data resource |
| 2 | Creator(M) | Creator(R) | Creator(R) | Creator(M) | |
| 3 | Title(M) | Title(M) | Title(M) | Title(M) | |
| 4 | Publisher(M) | Publisher(R) | Publisher(R) | Publisher(M) | |
| 5 | PublicationYear(M) | PublicationYear(O) | PublicationYear(O) | PublicationYear(M) | |
| 6 | Subject(R) | Tags and Discipline(R) | Keywords and Discipline(R) | Subject(O) | |
| 7 | Contributor | [ --> Contact] | Contributors | Contributor (MA/O) | |
| 8 | Date | [ --> Temporal Coverage] | | | The DataCite definition is here very vague (*Different dates relevant to the work*). For B2FIND we have here *PublicationYear*, i.e. the year the dataset is published, and *TemporalCoverage*, i.e. the interval in time the data covers, with a powerful 'Filter by time' associated. |
| 9 | Language(O) | Language(O) | Language(O) | Language(R) | |
| 10 | ResourceType(M) | ResourceType(R) | ResourceType(R) | ResourceType(R) | |
| 11 | AlternateIdentifier(O) | N/A | Alternate Identifiers(O) | AlternateIdentifier(O) | |
| 12 | RelatedIdentifier(R) | N/A | N/A | RelatedIdentifier(MA) | |
| 13 | Size | N/A | Size per data object (file) | Size(O) | |
| 14 | Format | Format | | Format(O) | |
| 15 | Version | N/A | [ --> checksum] | Version(O) | |
| 16 | Rights(O) | | | Rights(MA) | |
| 17 | Description | Description | | Description(MA) | |
| 18 | GeoLocation(R) | SpatialCoverage(O) | | GeoLocation(O) | In B2FIND *SpatialCoverage*, i.e. the geo spatial coverage, is associated with a 'Filter by location' interface. |
| 19 | FundingReference | N/A | N/A | N/A | |