SARS-CoV-2 GISAID isolates (2020-05-24) genotyping VCF by mutation

VCF file containing filtered mutated sites in SARS-CoV-2 genomes obtained from GISAID EpiCoV, separated by individual mutations. The columns correspond to viral genome accession ID, nucleotide position in the genome, mutation ID (left blank in all rows), reference nucleotide, identified mutation, quality, filter, and information columns (all left blank), format (GT in all rows), a column corresponding to the reference genome (all 0, referring to the reference nucleotide column), and columns corresponding to the isolate genomes. Each row identifies, for the position in the POS column, whether an isolate carries the non-mutant nucleotide (0) or the mutant indicated in the identified mutation column (1). The file is tab delimited, with 17345 rows including the header row with the column names, and 19665 columns.
The file was generated to test the hypothesis that two of the most common mutations in the SARS-CoV-2 genome, 14408 C > T and 23403 A > G, significantly affect the mutation density of the virus over time, and to test whether they affect the synonymous and nonsynonymous mutation densities differently. We found that the densities of nonsynonymous and synonymous mutations differ significantly between the early and late periods and between WT (wildtype for both nucleotides of interest) and MT (mutant for both nucleotides of interest) samples, with nonsynonymous mutations in particular showing a higher increase in density in the late period in MT samples.
These results were obtained by identifying the earliest co-occurrence of the two mutations in the two countries with the highest number of mutations, separating the isolates from these countries that were sequenced after the earliest co-occurrence date into two time groups, early and late, sorting those that fit the two phenotypes into two categories, WT and MT, and classifying all known mutations as synonymous or nonsynonymous. The relationships between these categories, along with the density of synonymous and nonsynonymous SNVs across the genome, per gene locus, and in the RdRp coding region, were analysed across time.

THIS DATASET IS ARCHIVED AT DANS/EASY, BUT NOT ACCESSIBLE HERE. TO VIEW A LIST OF FILES AND ACCESS THE FILES IN THIS DATASET CLICK ON THE DOI-LINK ABOVE

Identifier
DOI https://doi.org/10.17632/jv87xwj7fv.1
PID https://nbn-resolving.org/urn:nbn:nl:ui:13-4h-ild4
Metadata Access https://easy.dans.knaw.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:easy.dans.knaw.nl:easy-dataset:264411
Provenance
Creator Eskier, D
Publisher Data Archiving and Networked Services (DANS)
Contributor Doğa Eskier
Publication Year 2020
Rights info:eu-repo/semantics/openAccess; License: http://creativecommons.org/licenses/by/4.0
OpenAccess true
Representation
Resource Type Dataset
Discipline Other