RMQS1 16S bioinformatic config files and control sample data

RMQS: The French Soil Quality Monitoring Network (RMQS) is a national program for the assessment and long-term monitoring of the quality of French soils. The network is based on the monitoring of 2240 sites representative of French soils and their land use. These sites are spread over the whole French territory (metropolitan and overseas) along a systematic square grid of 16 km x 16 km cells. The network covers a broad spectrum of climatic, soil and land-use conditions (croplands, permanent grasslands, woodlands, orchards and vineyards, natural or scarcely anthropogenic land and urban parkland). The first sampling campaign in metropolitan France took place from 2000 to 2009.

Dataset: This dataset contains the config files used to run the bioinformatic pipeline and the control sample data that were not published before. Reference environmental DNA samples, named “G4” in internal laboratory processes, were added to each molecular analysis. They were used for technical validation, but not necessarily published alongside the datasets. The taxonomy and OTU abundance files for these control samples were built in the same way as the taxonomy and abundance files of the main dataset. As these internal control samples were clustered against the RMQS dataset in an open-reference fashion, they contain new OTUs (noted as “OUT”) corresponding to sequences that did not match any of the 188,030 RMQS reference sequences. The sample bank association file links each sample to its sequencing library. The G4 metadata file links each G4 to its library, molecular tag and sequence repository information.

File structure:

Taxonomy files rmqs1_control_taxonomy_<rank>:

Taxonomy is split across five files, with one line per site and one column per taxon. Each line sums to 10,000 (the rarefaction threshold). Three supplementary columns are present:

Unknown: not matching any reference.
Unclassified: missing taxa between genus and phylum.
Environmental: matched to a sample from an environmental study, generally with only a phylum name.
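The row-sum property stated above can be checked with a short script. The following is a minimal sketch, assuming each taxonomy file is tab-separated, has a header row, holds the site identifier in its first column and stores integer counts; the function name and the demo table are hypothetical and do not come from the dataset.

```python
import csv
import io

RAREFACTION_DEPTH = 10_000  # the "10k" threshold stated in the description


def check_row_sums(handle, delimiter="\t", depth=RAREFACTION_DEPTH):
    """Return {site: total} for every row whose counts do not sum to `depth`."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader)  # skip the header row (assumed to be present)
    bad = {}
    for row in reader:
        # Assumption: first column is the site id, all others are integer counts.
        site, counts = row[0], [int(value) for value in row[1:]]
        total = sum(counts)
        if total != depth:
            bad[site] = total
    return bad


# Made-up table standing in for one rmqs1_control_taxonomy_<rank> file.
demo = io.StringIO(
    "site\tProteobacteria\tActinobacteria\tUnknown\tUnclassified\tEnvironmental\n"
    "G4_a\t6000\t3000\t500\t300\t200\n"
    "G4_b\t6000\t3000\t500\t300\t100\n"
)
mismatches = check_row_sums(demo)
print(mismatches)  # {'G4_b': 9900}
```

To run it on a real file, pass an open file handle instead of the in-memory demo table.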

rmqs1_16S_otu_abundance.tsv:

OTU abundance per site (one column per OTU: “DB” + number for OTUs from the RMQS reference set, “OUT” for OTUs not matching any “DB” one). Each line sums to 10,000 (the rarefaction threshold).
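Because the OTU origin is encoded in the column name, the share of reads falling into new (“OUT”) OTUs can be derived directly from the abundance table. This is a minimal sketch, assuming a tab-separated file with a header row, the sample identifier in the first column and column names that start with “DB” or “OUT”; the function name and demo values are hypothetical.

```python
import csv
import io


def out_fraction_per_sample(handle, delimiter="\t"):
    """Return {sample: fraction of reads in OTUs whose name starts with 'OUT'}."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    id_column = reader.fieldnames[0]          # assumed: first column is the sample id
    otu_columns = reader.fieldnames[1:]
    out_columns = [name for name in otu_columns if name.startswith("OUT")]
    fractions = {}
    for row in reader:
        total = sum(int(row[name]) for name in otu_columns)
        novel = sum(int(row[name]) for name in out_columns)
        fractions[row[id_column]] = novel / total if total else 0.0
    return fractions


# Made-up abundance table; real files have many more OTU columns.
demo = io.StringIO(
    "sample\tDB1\tDB2\tOUT1\tOUT2\n"
    "G4_a\t7000\t2500\t400\t100\n"
    "G4_b\t9000\t1000\t0\t0\n"
)
fractions = out_fraction_per_sample(demo)
print(fractions)
```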

rmqs1_16S_bank_association.tsv:

Two-column file giving the bank name for each sample.

rmqs1_16S_bank_metadata.tsv:

library_name: library name used in labs
study_accession, sample_accession, experiment_accession, run_accession: SRA EBI identifier
library_name_genoscope: library name used in the Genoscope sequence center
MID: multiplex identifier sequence
run_alias: Genoscope internal alias
ftp_link: FTP link to download library
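The two bank files can be combined to find the download link for a given sample. The sketch below rests on several assumptions that the description does not confirm: the association file is tab-separated, has no header row and lists the sample first and the bank second, and the bank name matches the library_name column of the metadata file. The function name, accessions and links in the demo are placeholders.

```python
import csv
import io


def links_by_sample(association_handle, metadata_handle):
    """Map each sample to (run_accession, ftp_link), or None if its bank is unknown."""
    metadata = {
        row["library_name"]: row
        for row in csv.DictReader(metadata_handle, delimiter="\t")
    }
    result = {}
    # Assumption: two columns, sample then bank, no header row.
    for sample, bank in csv.reader(association_handle, delimiter="\t"):
        record = metadata.get(bank)
        result[sample] = (record["run_accession"], record["ftp_link"]) if record else None
    return result


# Placeholder content; accessions and links are not real identifiers.
association = io.StringIO("G4_a\tlib_A\nG4_b\tlib_A\nG4_c\tlib_Z\n")
metadata = io.StringIO(
    "library_name\trun_accession\tftp_link\n"
    "lib_A\tRUN_A\tftp://example.org/lib_A.fastq.gz\n"
)
links = links_by_sample(association, metadata)
print(links["G4_a"])  # ('RUN_A', 'ftp://example.org/lib_A.fastq.gz')
```

Returning None for an unmatched bank makes the gaps visible without raising an error.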

Input_G4.txt:

Tabulated file containing the parameters and the bioinformatic steps performed by the BIOCOM-PIPE pipeline to extract, process and analyze controls from the raw libraries detailed in rmqs1_16S_bank_metadata.tsv.

project_G4.tab:

Comma-separated file containing the information needed to generate the Input.txt file with the BIOCOM-PIPE pipeline for controls only:

PROJECT: Project name chosen by the user
LIBRARY_NAME: Library name chosen by the user
LIBRARY_NAME_RECEIVED: Library name chosen by the sequencing partner and used by BIOCOM-PIPE
SAMPLE_NAME: Sample name chosen by the user
MID_F: MID name or MID sequence associated to the Forward primer
MID_R: MID name or MID sequence associated to the Reverse primer
TARGET: Target gene (16S, 18S, or 23S)
PRIMER_F: Forward primer name used for amplification
PRIMER_R: Reverse primer name used for amplification
SEQUENCE_PRIMER_F: Forward primer sequence used for amplification
SEQUENCE_PRIMER_R: Reverse primer sequence used for amplification
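A project file with these fields can be sanity-checked before it is handed to the pipeline. This is a minimal sketch, assuming the file has a header row that uses exactly the field names listed above; the validation function is hypothetical and is not part of BIOCOM-PIPE, and the demo rows use made-up names and sequences.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "PROJECT", "LIBRARY_NAME", "LIBRARY_NAME_RECEIVED", "SAMPLE_NAME",
    "MID_F", "MID_R", "TARGET", "PRIMER_F", "PRIMER_R",
    "SEQUENCE_PRIMER_F", "SEQUENCE_PRIMER_R",
]
ALLOWED_TARGETS = {"16S", "18S", "23S"}  # target genes listed in the description


def validate_project_table(handle):
    """Return a list of (line_number, message) for rows with an unexpected TARGET."""
    reader = csv.DictReader(handle)  # comma-separated, header row assumed
    missing = [name for name in EXPECTED_COLUMNS if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    problems = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        if row["TARGET"] not in ALLOWED_TARGETS:
            problems.append((line_number, f"unexpected TARGET {row['TARGET']!r}"))
    return problems


# Made-up rows; none of these values come from the dataset.
rows = [
    EXPECTED_COLUMNS,
    ["demo", "lib_A", "lib_A_received", "G4_a", "MID1", "MID2", "16S",
     "fwd_primer", "rev_primer", "ACGT", "TGCA"],
    ["demo", "lib_A", "lib_A_received", "G4_b", "MID3", "MID4", "ITS",
     "fwd_primer", "rev_primer", "ACGT", "TGCA"],
]
demo = io.StringIO("\n".join(",".join(row) for row in rows) + "\n")
problems = validate_project_table(demo)
print(problems)  # [(3, "unexpected TARGET 'ITS'")]
```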

Input_GLOBAL.txt:

Tabulated file containing the parameters and the bioinformatic steps performed by the BIOCOM-PIPE pipeline to extract, process and analyze controls and samples from the raw libraries detailed in rmqs1_16S_bank_metadata.tsv.

project_GLOBAL.tab:

Comma-separated file containing the information needed to generate the Input.txt file for controls and samples with the BIOCOM-PIPE pipeline.

Details:

Data from three libraries (58, 59 and 69) were re-sequenced and are not detailed in the files. Some samples are present in several libraries; for these, only the copy with the highest number of sequences was kept.
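The rule for samples present in several libraries can be illustrated as follows. This is a minimal sketch and not the authors' actual procedure: the record layout (sample, library, number of sequences), the function name and the tie-breaking choice (the first library seen is kept) are all assumptions, and the values are made up.

```python
def keep_deepest(records):
    """For each sample, keep the (library, n_sequences) pair with the most sequences."""
    best = {}
    for sample, library, n_sequences in records:
        # Strict comparison: on a tie, the first library encountered is kept.
        if sample not in best or n_sequences > best[sample][1]:
            best[sample] = (library, n_sequences)
    return best


# Made-up records: site_1 was sequenced in two libraries.
records = [
    ("site_1", "lib_10", 12000),
    ("site_1", "lib_11", 15000),
    ("site_2", "lib_10", 11000),
]
selected = keep_deepest(records)
print(selected)  # {'site_1': ('lib_11', 15000), 'site_2': ('lib_10', 11000)}
```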

Identifier
DOI	https://doi.org/10.57745/XBFOJP
Related Identifier	IsCitedBy https://doi.org/10.1371/journal.pone.0186766
Metadata Access	https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/XBFOJP

Provenance
Creator	Terrat, Sébastien ; Dequiedt, Samuel
Publisher	Recherche Data Gouv
Contributor	Cottin, Aurélien
Publication Year	2023
Funding Reference	French National Research Agency (ANR) ANR-10-INBS-09-08 ; French National Research Agency (ANR) ANR-11-INBS-0001 ; French Agency for Ecological Transition (ADEME) ; France Génomique
Rights	etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess	true
Contact	Cottin, Aurélien (INRAE)

Representation
Resource Type	Dataset
Format	text/plain; text/tab-separated-values; application/gzip
Size	10413; 143493; 8814; 266094; 362535; 33093; 117004; 522347; 80032; 16460; 32344; 13212
Version	3.0
Discipline	Agriculture, Forestry, Horticulture; Geosciences; Agricultural Sciences; Agriculture, Forestry, Horticulture, Aquaculture; Agriculture, Forestry, Horticulture, Aquaculture and Veterinary Medicine; Biology; Biospheric Sciences; Earth and Environmental Science; Ecology; Environmental Research; Life Sciences; Natural Sciences