Dataset overview
The MIMIC2 dataset provides:
a non-redundant high-quality catalog of 5.0 million genes
6,967 Metagenome-Assembled Genomes (MAGs)
1,252 Metagenomic Species Pangenomes (MSPs)
This dataset can be used to analyze shotgun sequencing data of the murine gut microbiota.
How to use this dataset
Create a gene abundance table by aligning reads from each sample against the catalog. For this purpose, you can use Meteor or NGLess. Then, normalize raw counts by gene length.
Taxonomic profiling: the abundance of each species can be estimated as the average abundance of its first 100 core genes. To reduce the false positive rate, consider a species present only if at least 10 of these 100 marker genes are detected.
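The two steps above can be sketched in Python as follows. This is a minimal illustration, not the Meteor or NGLess implementation: the function names and the dictionary-based inputs are hypothetical, and two rules are assumptions made for the example, namely that a gene counts as detected when its abundance is greater than zero and that a species failing the 10/100 rule gets an abundance of zero.

```python
from statistics import mean

def normalize_by_length(raw_counts, gene_lengths):
    """Divide each raw read count by the length (bp) of its gene."""
    return {gene: count / gene_lengths[gene] for gene, count in raw_counts.items()}

def species_abundance(abundance, core_genes, n_markers=100, min_detected=10):
    """Estimate a species abundance from its first `n_markers` core genes.

    `core_genes` is assumed to be ordered as in the MSP definition.
    Genes missing from `abundance` are treated as having abundance 0.
    """
    markers = core_genes[:n_markers]
    values = [abundance.get(gene, 0.0) for gene in markers]
    detected = sum(1 for value in values if value > 0)  # assumption: detected = abundance > 0
    if detected < min_detected:
        return 0.0  # assumption: species reported as absent
    return mean(values)

# Toy example with 3 marker genes instead of 100 (hypothetical values).
raw = {"g1": 300, "g2": 0, "g3": 150}
lengths = {"g1": 1500, "g2": 900, "g3": 750}
normalized = normalize_by_length(raw, lengths)
print(species_abundance(normalized, ["g1", "g2", "g3"], n_markers=3, min_detected=2))
```

Averaging over all marker genes, including undetected ones, follows the wording "average abundance of its first 100 core genes"; other summaries (for example a median) would need to be justified separately.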
Methods
Data sources
The MIMIC2 dataset was constructed using two different data sources:
Source 1: the Mouse Gastrointestinal Bacterial Catalogue (MGBC), a compilation of 276 genomes from cultured isolates and 45,218 metagenome-assembled genomes (MAGs) from 1,960 publicly available mouse metagenomes
Source 2: 68 samples from Messaoudene et al. (PRJNA783624) and 85 deeply sequenced samples from bioproject CNP0000619 published by Xiao et al.
Metagenomic assembly
De novo metagenomic assembly was performed on the 153 samples from data Source 2.
First, sequencing adapter removal and read trimming were performed with fastp. Reads that mapped to the host genome (GCF_000001635.27) with bowtie2 were removed with samtools. Finally, metagenomic assembly was performed with metaSPAdes, and contigs shorter than 1,500 bp were removed.
MAGs recovery
Reads from each sample of data Source 2 were aligned to their respective assembly with bowtie2, and the alignments were stored as sorted, indexed BAM files with samtools. Then, contig coverage was computed in each sample with jgi_summarize_bam_contig_depths. MAGs were generated with MetaBAT 2, and MAG quality was assessed with CheckM. MAGs with completeness 5% or N50 < 8Kb were discarded.
Non-redundant gene catalog
Genes were predicted on all contigs from data Source 2 with Prodigal (parameters: -m -p meta). Likewise, genes were predicted on all genomes from data Source 1 (MGBC) with Prodigal (parameters: -m -p single).
Genes from the two data sources were pooled, and those that were shorter than 90 bp or incomplete were discarded. Finally, genes were clustered with cd-hit-est (parameters: -c 0.95 -aS 0.90 -G 0 -d 0 -M 0 -T 0), with genes from the longest contigs chosen as cluster representatives.
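The length and completeness filter can be illustrated as below. The sketch assumes that completeness is read from the `partial=` field that Prodigal writes in its FASTA headers, where `partial=00` means that both ends of the gene are complete. The function name and the example headers are hypothetical, and the original pipeline may have applied the filter differently.

```python
import re

MIN_LENGTH_BP = 90  # genes shorter than this are discarded

def keep_gene(header, sequence, min_length=MIN_LENGTH_BP):
    """Return True for complete genes of at least `min_length` bp.

    `header` is a Prodigal-style FASTA header containing `partial=XY`.
    Headers without a `partial=` field are treated as incomplete.
    """
    match = re.search(r"partial=(\d\d)", header)
    is_complete = match is not None and match.group(1) == "00"
    return is_complete and len(sequence) >= min_length

# Hypothetical records: (header, nucleotide sequence)
records = [
    ("contig_1_1 # 1 # 120 # 1 # ID=1_1;partial=00;start_type=ATG", "A" * 120),
    ("contig_1_2 # 200 # 319 # 1 # ID=1_2;partial=10;start_type=Edge", "A" * 120),
    ("contig_2_1 # 5 # 64 # -1 # ID=2_1;partial=00;start_type=ATG", "A" * 60),
]
kept = [header.split()[0] for header, seq in records if keep_gene(header, seq)]
print(kept)
```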
MSPs recovery
Samples from 19 cohorts (see below) were aligned against the non-redundant gene catalog with the Meteor software suite to produce a raw gene abundance table (5.0 million genes quantified in 1,374 samples).
Then, co-abundant genes were binned into 1,252 Metagenomic Species Pangenomes (MSPs, i.e. clusters of > 500 co-abundant genes that likely belong to the same microbial species) using MSPminer.
The 19 cohorts used to recover the MSPs are:
PRJNA783624
CNP0000619
PRJEB15095
PRJEB22007
PRJEB22710
PRJEB31298
PRJEB32790
PRJEB32890
PRJEB3374
PRJEB36943
PRJEB44286
PRJEB7759
PRJNA293255
PRJNA390686
PRJNA397886
PRJNA515074
PRJNA540893
PRJNA549182
PRJEB40719
MSPs taxonomic annotation
Representative genomes of the MMGC collection were annotated with GTDB-Tk based on GTDB r202. Then, taxonomic annotation of MMGC genomes was propagated to the corresponding MSPs.
For the MSPs without any corresponding MAG, taxonomic annotation was performed by alignment of all core and accessory genes against representative genomes of the GTDB database (release r202) using blastn (version 2.7.1, task = megablast, word_size = 16). A species-level assignment was given if > 50% of the genes matched the representative genome of a given species, with a mean nucleotide identity ≥ 95% and mean gene length coverage ≥ 90%. The remaining MSPs were assigned to a higher taxonomic level (genus to superkingdom), if more than 50% of their genes had the same annotation.
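The species-level rule described above can be written as a small decision function. This is a sketch under stated assumptions, not the code used to build the dataset: the input format (one best hit per gene, already parsed from the blastn output) and the function name are hypothetical, and the mean identity and coverage are assumed to be computed over the genes that matched the candidate species.

```python
from collections import defaultdict
from statistics import mean

def assign_species(hits, n_genes, min_fraction=0.5, min_identity=95.0, min_coverage=90.0):
    """Return the species satisfying the assignment rule, or None.

    `hits` is a list of (gene_id, species, identity_pct, coverage_pct)
    tuples with at most one hit per gene (an assumption of this sketch).
    `n_genes` is the total number of core and accessory genes in the MSP.
    """
    by_species = defaultdict(list)
    for _gene_id, species, identity, coverage in hits:
        by_species[species].append((identity, coverage))

    for species, values in by_species.items():
        fraction = len(values) / n_genes
        mean_identity = mean(identity for identity, _ in values)
        mean_coverage = mean(coverage for _, coverage in values)
        if (fraction > min_fraction
                and mean_identity >= min_identity
                and mean_coverage >= min_coverage):
            return species
    return None  # fall back to a higher taxonomic level

# Hypothetical MSP with 4 genes, 3 of which match the same species.
example_hits = [
    ("gene_1", "species A", 98.0, 95.0),
    ("gene_2", "species A", 97.0, 100.0),
    ("gene_3", "species A", 96.0, 92.0),
    ("gene_4", "species B", 99.0, 99.0),
]
print(assign_species(example_hits, n_genes=4))
```

Because the fraction threshold is strictly greater than 50% and each gene contributes at most one hit, at most one species can satisfy the rule for a given MSP.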
Construction of the phylogenetic tree
A set of 39 universal phylogenetic marker genes was extracted from the 1,252 MSPs (or the corresponding MAGs if available) with fetchMGs. Then, the markers were aligned separately with MUSCLE. The resulting alignments were merged and trimmed with trimAl (parameters: -automated1). Finally, the phylogenetic tree was computed with FastTreeMP (parameters: -gamma -pseudo -spr -mlacc 3 -slownni).
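Per-marker alignments are commonly merged by concatenating them taxon by taxon and padding missing markers with gaps. The source does not state which tool or convention was used for this step, so the sketch below is an assumption; the function name and the dictionary-based input are hypothetical.

```python
def concatenate_alignments(alignments):
    """Concatenate per-marker alignments into a single alignment.

    `alignments` is a list of dicts mapping a taxon name to its aligned
    sequence. Within one dict, all sequences must have the same length.
    Taxa missing from a marker are padded with gap characters.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    merged = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in an alignment must share one length")
        width = lengths.pop()
        for taxon in taxa:
            merged[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in merged.items()}

# Hypothetical example: two markers, the second one missing for msp_2.
marker_alignments = [
    {"msp_1": "AC-G", "msp_2": "ACTG"},
    {"msp_1": "TT"},
]
print(concatenate_alignments(marker_alignments))
```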