MIMIC2: Murine Intestinal Microbiota Integrated Catalog v2


Dataset overview

The MIMIC2 dataset provides:

a non-redundant, high-quality catalog of 5.0 million genes
6,967 Metagenome-Assembled Genomes (MAGs)
1,252 Metagenomic Species Pangenomes (MSPs)

This dataset can be used to analyze shotgun sequencing data of the murine gut microbiota.

How to use this dataset

Create a gene abundance table by aligning the reads of each sample against the catalog, for example with Meteor or NGLess. Then normalize the raw counts by gene length.
Taxonomic profiling: estimate the abundance of each species as the average abundance of its first 100 core genes. To reduce the false positive rate, consider a species present only if at least 10 of these 100 marker genes are detected.
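The normalization and profiling steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Meteor or NGLess implementation: the function names, the dictionary-based inputs and the choice to average over all 100 markers (undetected ones counting as zero) are assumptions made for the example.

```python
from statistics import mean


def normalize_by_length(raw_counts, gene_lengths):
    """Divide each raw read count by the length (in bp) of its gene."""
    return {gene: raw_counts[gene] / gene_lengths[gene] for gene in raw_counts}


def species_abundance(gene_abundance, core_genes, n_markers=100, min_detected=10):
    """Estimate the abundance of one species in one sample.

    gene_abundance: dict mapping gene id to length-normalized abundance.
    core_genes: ordered list of core gene ids of the species (assumed input).
    Returns 0.0 when fewer than min_detected marker genes are detected.
    """
    markers = core_genes[:n_markers]
    values = [gene_abundance.get(gene, 0.0) for gene in markers]
    detected = sum(1 for value in values if value > 0)
    if not values or detected < min_detected:
        return 0.0
    # Assumption: undetected markers contribute zeros to the average.
    return mean(values)
```

For instance, a species with 10 of its 100 marker genes detected at a normalized abundance of 1.0 would be reported at 0.1, whereas the same species with only 9 detected markers would be reported as absent.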

Methods

Data sources

The MIMIC2 dataset was constructed from two data sources:

Source 1: the Mouse Gastrointestinal Bacterial Catalogue (MGBC), a compilation of 276 genomes from cultured isolates and 45,218 metagenome-assembled genomes (MAGs) from 1,960 publicly available mouse metagenomes
Source 2: 68 samples from Messaoudene et al. (PRJNA783624) and 85 deeply sequenced samples from bioproject CNP0000619 published by Xiao et al.

Metagenomic assembly

De novo metagenomic assembly was performed on the 153 samples from Source 2. First, sequencing adapter removal and read trimming were performed with fastp. Reads that mapped to the host genome (GCF_000001635.27) with bowtie2 were removed with samtools. Finally, metagenomic assembly was performed with metaSPAdes. Contigs shorter than 1,500 bp were removed.

MAGs recovery

Reads of each sample from Source 2 were aligned to their respective assembly with bowtie2, and the alignments were written to sorted, indexed BAM files with samtools. Then, contig coverage was computed in each sample with jgi_summarize_bam_contig_depths. MAGs were generated with MetaBAT 2 and their quality was assessed with CheckM. MAGs with completeness 5% or N50 < 8 kb were discarded.

Non-redundant gene catalog

Genes were predicted on all contigs from Source 2 with Prodigal (parameters: -m -p meta). Likewise, genes were predicted on all genomes from Source 1 (MGBC) with Prodigal (parameters: -m -p single). Genes from the two data sources were pooled, and those shorter than 90 bp or incomplete were discarded. Finally, genes were clustered with cd-hit-est (parameters: -c 0.95 -aS 0.90 -G 0 -d 0 -M 0 -T 0), with the genes from the longest contigs chosen as cluster representatives.
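The gene filtering step that precedes clustering could look roughly like the Python sketch below. It assumes that "incomplete" corresponds to a Prodigal `partial` flag other than `00` in the FASTA header of the predicted genes; the function names are hypothetical and this is not the authors' actual script.

```python
import re

# Prodigal writes a "partial=XY" field in each gene header;
# "00" means both gene boundaries were found (complete gene).
PARTIAL_RE = re.compile(r"partial=(\d\d)")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_gene(header, sequence, min_length=90):
    """Keep complete genes that are at least min_length bp long."""
    match = PARTIAL_RE.search(header)
    is_complete = match is not None and match.group(1) == "00"
    return is_complete and len(sequence) >= min_length


def filter_genes(fasta_text, min_length=90):
    """Return the (header, sequence) pairs that pass both filters."""
    return [
        (header, sequence)
        for header, sequence in parse_fasta(fasta_text)
        if keep_gene(header, sequence, min_length)
    ]
```

The retained sequences would then be written back to a FASTA file and passed to cd-hit-est with the parameters listed above.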

MSPs recovery

Samples from 19 cohorts (see below) were aligned against the non-redundant gene catalog with the Meteor software suite to produce a raw gene abundance table (5M genes quantified in 1,374 samples). Then, co-abundant genes were binned into 1,252 Metagenomic Species Pangenomes (MSPs, i.e. clusters of > 500 co-abundant genes that likely belong to the same microbial species) using MSPminer.

The 19 cohorts used to recover the MSPs are:

PRJNA783624
CNP0000619
PRJEB15095
PRJEB22007
PRJEB22710
PRJEB31298
PRJEB32790
PRJEB32890
PRJEB3374
PRJEB36943
PRJEB44286
PRJEB7759
PRJNA293255
PRJNA390686
PRJNA397886
PRJNA515074
PRJNA540893
PRJNA549182
PRJEB40719

MSPs taxonomic annotation

Representative genomes of the MMGC collection were annotated with GTDB-Tk based on GTDB r202. Then, the taxonomic annotation of MMGC genomes was propagated to the corresponding MSPs.

For the MSPs without any corresponding MAG, taxonomic annotation was performed by alignment of all core and accessory genes against representative genomes of the GTDB database (release r202) using blastn (version 2.7.1, task = megablast, word_size = 16). A species-level assignment was given if > 50% of the genes matched the representative genome of a given species, with a mean nucleotide identity ≥ 95% and mean gene length coverage ≥ 90%. The remaining MSPs were assigned to a higher taxonomic level (genus to superkingdom), if more than 50% of their genes had the same annotation.
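The decision rule described above can be illustrated with a short Python sketch. The input layout (one best blastn hit per gene, stored as a dictionary with `species`, `identity`, `coverage` and `lineage` keys) and the function name are assumptions made for the example; the actual annotation pipeline may differ in how hits are selected and summarized.

```python
from collections import Counter, defaultdict
from statistics import mean

# Higher ranks, tried from the most to the least specific.
RANKS = ["genus", "family", "order", "class", "phylum", "superkingdom"]


def assign_taxonomy(hits, n_genes):
    """Return (rank, name) for one MSP, or (None, None) if unassigned.

    hits: one dict per gene with a hit, with keys 'species', 'identity' (%),
          'coverage' (% of gene length) and 'lineage' (dict rank -> name).
    n_genes: total number of core and accessory genes in the MSP.
    """
    if n_genes <= 0:
        return (None, None)

    # Species level: > 50% of genes hit one species with high mean
    # identity and coverage. At most one species can exceed 50%.
    by_species = defaultdict(list)
    for hit in hits:
        by_species[hit["species"]].append(hit)
    for species, group in by_species.items():
        if (
            len(group) / n_genes > 0.5
            and mean(h["identity"] for h in group) >= 95
            and mean(h["coverage"] for h in group) >= 90
        ):
            return ("species", species)

    # Otherwise: lowest rank at which > 50% of genes agree.
    for rank in RANKS:
        counts = Counter(h["lineage"][rank] for h in hits if rank in h["lineage"])
        if counts:
            name, count = counts.most_common(1)[0]
            if count / n_genes > 0.5:
                return (rank, name)
    return (None, None)
```

With this sketch, an MSP of 10 genes whose hits are spread over three species of the same genus (two genes each) would fall back to a genus-level assignment, since no single species gathers more than 50% of the genes.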

Construction of the phylogenetic tree

39 universal phylogenetic marker genes were extracted from the 1,252 MSPs (or the corresponding MAGs if available) with fetchMGs. Then, each marker was aligned separately with MUSCLE. The resulting alignments were merged and trimmed with trimAl (parameters: -automated1). Finally, the phylogenetic tree was computed with FastTreeMP (parameters: -gamma -pseudo -spr -mlacc 3 -slownni).

Identifier
DOI	https://doi.org/10.15454/L11MXM
Metadata Access	https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.15454/L11MXM

Provenance
Creator	PLAZA ONATE, Florian; GITTON-QUENT, Oscar; ALMEIDA, Mathieu; LE CHATELIER, Emmanuelle
Publisher	Recherche Data Gouv
Contributor	PLAZA ONATE, Florian
Publication Year	2021
Rights	etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess	true
Contact	PLAZA ONATE, Florian (INRAE)

Representation
Resource Type	Dataset
Format	application/x-xz; text/tab-separated-values; application/octet-stream; application/x-gzip
Size	5069746568; 331570; 39546; 170004936; 324241; 1038657938; 1516575251
Version	5.1
Discipline	Life Sciences