Unifying Antimicrobial Peptide Datasets for Robust Deep Learning-Based Classification

Leguminous crops are vital to sustainable agriculture due to their ability to fix atmospheric nitrogen, improving soil fertility and reducing the need for synthetic fertilizers. Additionally, they are an excellent source of protein for both human consumption and animal feed. AntiMicrobial Peptides (AMPs), found in various leguminous seeds, exhibit broad-spectrum antimicrobial activity through diverse mechanisms, including interaction with microbial cell membranes and interference with cellular processes, making them valuable for enhancing crop resilience and food safety. In the field of plant sciences, computational biology methods have been instrumental in the discovery and optimization of AMPs. These methods enable rapid exploration of sequence space and the prediction of AMPs using deep learning technologies. Optimizing AMP annotations through computational design offers a strategic approach to enhance efficacy and minimize potential side effects, providing a viable alternative to conventional antimicrobial agents. However, the presence of overlapping sequences across multiple databases poses a challenge for creating a reliable dataset for AMP prediction. To address this, we conducted a comprehensive analysis of sequence redundancy across various AMP databases. These databases encompass a wide range of AMPs from different sources and with specific functions, including both naturally occurring and artificially synthesized AMPs. Our analysis revealed significant overlap, underscoring the need for a non-redundant AMP sequence database. We present the development of a new database that consolidates unique AMP sequences derived from leguminous seeds, aiming to create a more refined dataset for the binary classification and prediction of plant-derived AMPs. This database will support the advancement of sustainable agricultural practices by enhancing the use of plant-based AMPs in agroecology, contributing to improved crop protection and food security.

R, 4.1.0

Python, 3.9

Dover analyzer (https://doi.org/10.1093/bioinformatics/btv180) was used to remove redundant AMPs.

The non-AMP dataset underwent hierarchical redundancy removal using CD-HIT (https://www.bioinformatics.org/cd-hit/).

During preprocessing, all characters other than the 20 proteinogenic amino acids of the standard genetic code (https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:83813) were removed from the AMP and non-AMP small peptides.

Finally, peptides present in both the AMP and non-AMP datasets were deleted from the non-AMP dataset only.
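
The last two preprocessing steps can be illustrated with a minimal Python sketch. It is not the authors' script: the function names (`clean_sequence`, `drop_shared`) and the toy peptides are hypothetical, and upper-casing the input before filtering is an assumption. Redundancy removal with Dover analyzer and CD-HIT is not reproduced here.

```python
# Hypothetical helpers; the original preprocessing scripts are not part of this record.
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")  # the 20 proteinogenic amino acids


def clean_sequence(seq):
    """Upper-case a peptide (assumption) and drop every non-standard character."""
    return "".join(ch for ch in seq.upper() if ch in STANDARD_AA)


def drop_shared(amps, non_amps):
    """Remove peptides that also occur in the AMP set, from the non-AMP list only."""
    amp_set = set(amps)
    return [s for s in non_amps if s not in amp_set]


# Toy sequences for illustration only.
amps = [clean_sequence(s) for s in ["GLFDIVKK*", "kwklfkki"]]
non_amps = [clean_sequence(s) for s in ["MKT-AYIAK", "KWKLFKKI", "MSX LLVA"]]
non_amps = drop_shared(amps, non_amps)

print(amps)      # ['GLFDIVKK', 'KWKLFKKI']
print(non_amps)  # ['MKTAYIAK', 'MSLLVA']
```

The AMP list is left untouched by `drop_shared`, which mirrors the statement that shared peptides are deleted on the non-AMP side only.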

Each AMP source database is described below by its name, scope (specific or general) and target, availability, last update date, and total number of entries.

Peptaibol is a specific database (DOI: 10.1093/nar/gkh077) collecting peptaibol-family AMPs, with 316 entries in total. Last updated in 2004; currently unavailable.

Defensins is a specific database (DOI: 10.1093/nar/gkl866) collecting defensin-family AMPs, with 536 entries in total. Last updated in 2007; currently unavailable.

AVPdb is a specific database (DOI: 10.1093/nar/gkt1191) collecting antiviral peptides, with 2,059 entries in total. Last updated in 2013; currently available.

MilkAMP is a specific database (DOI: 10.1007/s13594-013-0153-2) collecting AMPs of dairy origin, with 385 entries in total. Last updated in 2013; currently available.

AMSDb is a specific database (DOI: 10.2174/1381612023395475) collecting eukaryotic AMPs, with 893 entries in total. Last updated on 01/11/2004; currently unavailable.

AMPer is a specific database (DOI: 10.1093/bioinformatics/btm068) collecting eukaryotic AMPs, with 988 entries in total. Last updated on 01/02/2007; currently available.

PenBase is a specific database (DOI: 10.1016/j.dci.2005.04.003) collecting penaeidin-family AMPs, with 28 entries in total. Last updated on 01/07/2008; currently unavailable.

RAPD is a specific database (DOI: 10.1111/j.1574-6968.2008.01357.x) collecting recombinantly produced AMPs, with 179 entries in total. Last updated on 01/03/2010; currently unavailable.

DAMPD is a general database (DOI: 10.1093/nar/gkr1063) collecting general AMPs, with 1,232 entries in total. Last updated on 01/09/2011; currently unavailable.

PhytAMP is a specific database (DOI: 10.1093/nar/gkn655) collecting AMPs from plants, with 273 entries in total. Last updated on 01/01/2012; currently available.

DADP is a specific database (DOI: 10.1093/bioinformatics/bts141) collecting anuran defense peptides, with 2,571 entries in total. Last updated on 01/03/2012; currently unavailable.

Bagel I is a specific database (DOI: 10.1093/nar/gkq365) collecting bacteriocins, with 158 entries in total. Last updated on 01/01/2013; currently unavailable.

Bagel II is a specific database (DOI: 10.1093/nar/gkq365) collecting bacteriocins, with 228 entries in total. Last updated on 01/01/2013; currently unavailable.

Bagel III is a specific database (DOI: 10.1093/nar/gkq365) collecting bacteriocins, with 93 entries in total. Last updated on 01/01/2013; currently unavailable.

LAMP Experimental is a general database (DOI: 10.1371/journal.pone.0066557) collecting general AMPs, with 3,191 entries in total. Last updated on 01/03/2013; currently available.

LAMP Patent is a specific database (DOI: 10.1371/journal.pone.0066557) collecting patented AMPs, with 1,491 entries in total. Last updated on 01/03/2013; currently available.

YADAMP is a general database (DOI: 10.1016/j.ijantimicag.2011.12.003) collecting general AMPs, with 2,525 entries in total. Last updated on 01/03/2013; currently unavailable.

Bactibase is a specific database (DOI: 10.1186/1471-2180-10-22) collecting bacteriocins, with 227 entries in total. Last updated on 01/10/2014; currently available.

BaAMPs is a specific database (DOI: 10.1186/1471-2180-10-22) collecting AMPs active against microbial biofilms, with 221 entries in total. Last updated on 13/09/2019; currently unavailable.

dbAMP is a general database (DOI: 10.1093/nar/gkab1080) collecting general AMPs, with 18,345 entries in total. Last updated on 01/06/2021; currently available.

APD3 is a general database (DOI: 10.1093/nar/gkv1278) collecting general AMPs, with 3,273 entries in total. Last updated on 01/08/2021; currently available.

CAMP Patent is a specific database (DOI: 10.1093/nar/gkv1051) collecting patented AMPs, with 1,716 entries in total. Last updated on 01/11/2013; currently unavailable.

CAMP Structure is a general database (DOI: 10.1093/nar/gkv1051) collecting 3D structures of AMPs, with 682 entries in total. Last updated on 01/11/2013; currently unavailable.

CAMP Validated is a general database (DOI: 10.1093/nar/gkv1051) collecting general AMPs, with 2,602 entries in total. Last updated on 01/11/2013; currently unavailable.

DRAMP General is a general database (DOI: 10.1093/nar/gkab651) collecting general AMPs, with 6,034 entries in total. Last updated on 04/07/2023; currently available.

DRAMP Patent is a specific database (DOI: 10.1093/nar/gkab651) collecting patented AMPs, with 16,11 entries in total. Last updated on 04/07/2023; currently available.

DRAMP Clinical is a specific database (DOI: 10.1093/nar/gkab651) collecting clinical AMPs, with 40 entries in total. Last updated on 04/07/2023; currently available.

DRAMP Specific is a specific database (DOI: 10.1093/nar/gkab651) collecting specific AMPs, with 6,097 entries in total. Last updated on 04/07/2023; currently available.

All AMP sequences: 44,099 entries resulting from the concatenation and filtering of the 28 AMP databases.

AMP test set: 8,876 entries from 11 selected validated databases (DRAMP_Clinical, PhytAMP, Defensins, CAMP_Structure, DAMPD, LAMP_Patent, CAMP_Patent, YADAMP, DADP, LAMP_Experimental, and APD).

AMP training set: 35,229 entries from 17 selected validated databases (PenBase, Bagel_III, Bagel_I, RAPD, BaAMPs, Bactibase, Bagel_II, Peptaibol, MilkAMP, AMSDb, AMPer, AVPdb, DRAMP_General, DRAMP_Specific, DRAMP_Patent, CAMP_Validated, and dbAMP).

All non-AMP sequences of the Fabaceae dataset: 59,606 entries.

Non-AMP test set of the Fabaceae dataset, obtained by equidistant sampling: 9,355 entries.

Non-AMP training set of the Fabaceae dataset, obtained by equidistant sampling: 37,420 entries.

All non-AMP sequences of the viri dataset: 98,357 entries.

Non-AMP test set of the viri dataset, obtained by equidistant sampling: 19,672 entries.

Non-AMP training set of the viri dataset, obtained by equidistant sampling: 78,685 entries.
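
The record does not spell out how the equidistant sampling was performed. As a rough illustration only, the sketch below assumes that every fifth sequence, starting with the first, goes to the test set and the rest to the training set. The function name `equidistant_split` and the `step` parameter are hypothetical, and any ordering of the sequences before sampling is not specified in the source. Under this assumption, 98,357 input sequences give 19,672 test and 78,685 training sequences, which is consistent with the viri counts listed above.

```python
def equidistant_split(seqs, step=5):
    """Put every `step`-th sequence (starting with the first) in the test set.

    The step of 5 is an assumption, not a documented parameter of the dataset.
    """
    test = seqs[::step]
    train = [s for i, s in enumerate(seqs) if i % step != 0]
    return train, test


# Toy example with placeholder identifiers rather than real peptides.
seqs = [f"seq{i}" for i in range(12)]
train, test = equidistant_split(seqs)

print(test)        # ['seq0', 'seq5', 'seq10']
print(len(train))  # 9
```

Taking evenly spaced items keeps the test set spread across the whole input list instead of concentrating it at one end.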

@misc {shuang_peng_2024, author = { {shuang peng} }, title = { amp_dataset_viri (Revision 62d8d6a) }, year = 2024, url = { https://huggingface.co/datasets/ps29/amp_dataset_viri }, doi = { 10.57967/hf/2249 }, publisher = { Hugging Face } }

@misc {shuang_peng_2024, author = { {shuang peng} }, title = { amp_dataset_faba (Revision 5821c0d) }, year = 2024, url = { https://huggingface.co/datasets/ps29/amp_dataset_faba }, doi = { 10.57967/hf/2250 }, publisher = { Hugging Face } }

Identifier
DOI https://doi.org/10.57745/NZ0IRX
Metadata Access https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/NZ0IRX
Provenance
Creator Peng, Shuang; Rajjou, Loïc
Publisher Recherche Data Gouv
Contributor Rajjou, Loïc; Peng, Shuang; Entrepôt-Catalogue Recherche Data Gouv
Publication Year 2024
Funding Reference Agence nationale de la recherche ANR-17-EURE-0007
Rights etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess true
Contact Rajjou, Loïc (INRAE)
Representation
Resource Type Dataset
Format text/plain
Size 1653611; 456177; 1187451; 3020985; 621552; 2492983; 7789247; 1597242; 6388719
Version 1.0
Discipline Agriculture, Forestry, Horticulture; Computer Science; Life Sciences; Agricultural Sciences; Agriculture, Forestry, Horticulture, Aquaculture; Agriculture, Forestry, Horticulture, Aquaculture and Veterinary Medicine; Biology; Medicine
Spatial Coverage Versailles