Replication data for: Kinetic solubility: experimental and machine-learning modeling perspectives

Kinetic aqueous or buffer solubility is an important parameter for assessing the suitability of compounds for high-throughput assays in early drug discovery, while thermodynamic solubility is reserved for later stages of drug discovery and development. Kinetic solubility is also considered to have low inter-laboratory reproducibility because of its sensitivity to protocol parameters. Presumably, this is why little effort has been put into building QSPR models for kinetic solubility compared with thermodynamic aqueous solubility.

Here, we investigate the reproducibility and modelability of kinetic solubility assays. We first analyzed the relationship between kinetic and thermodynamic solubility data, and then examined the consistency of data from different kinetic assays. We report differences between kinetic and thermodynamic solubility data that are consistent with those reported by others and, contrary to general expectations, good agreement between data from different kinetic solubility campaigns. The latter is confirmed by the high-performing QSPR models obtained by training on merged kinetic solubility datasets. This encourages the building of predictive models for kinetic solubility. The kinetic solubility QSPR model developed in this study is freely accessible through the Predictor web service of the Laboratory of Chemoinformatics (https://chematlas.chimie.unistra.fr/cgi-bin/predictor2.cgi).

---------------

PICT: The dataset was provided by the Plateforme Intégrée de Criblage de Toulouse (PICT) screening platform. It consists of kinetic solubility measurements for 939 fragments (small organic molecules). The measurements were performed in PBS buffer solution (pH 7.2, with 1% DMSO from the stock solution) using NMR for detection. Accounting for uncertainties in sample preparation and detection, experts recommend interpreting a fragment of this dataset as “Insoluble” if the reported concentration is < 830 μM and “Soluble” if it is > 880 μM. In between, the solubility label is undecided. Other curation steps included the removal of data points reporting a concentration greater than the nominal sample concentration (1 mM) or greater than the concentration in the stock solution, either of which indicates an error. After curation and removal of 46 confirmed outliers and suspicious data points, the total number of compounds in the dataset was 606 (513 “Soluble” and 93 “Insoluble”).
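The three-way labeling described for PICT can be sketched as follows. This is only an illustration, not the authors' curation script: the function name is hypothetical, the 880 μM cut-off is the one quoted above, and the 830 μM lower cut-off is an assumption taken from the “Kinetic solubility class (830 uM threshold)” field name.

```python
from typing import Optional

LOWER_UM = 830.0     # assumed lower cut-off (from the "830 uM threshold" field name)
UPPER_UM = 880.0     # upper cut-off quoted in the PICT description
NOMINAL_UM = 1000.0  # nominal sample concentration (1 mM)


def pict_label(conc_um: float) -> Optional[str]:
    """Label one PICT measurement (hypothetical helper).

    Returns None for a point above the nominal concentration,
    which the description treats as an error to be removed.
    """
    if conc_um > NOMINAL_UM:
        return None
    if conc_um < LOWER_UM:
        return "Insoluble"
    if conc_um > UPPER_UM:
        return "Soluble"
    return "Undecided"  # between the two cut-offs
```

Points labeled "Undecided" would be left out of a binary classification set, which may explain why the curated dataset is much smaller than the 939 measured fragments.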

Prestwick: This dataset originates from the former Prestwick Chemicals company. Kinetic solubility was measured for 1049 fragments in a buffer solution (pH 7.4) using static light scattering (SLS). Compounds are categorized as “Soluble” or “Insoluble” at 1 mM in PBS (with 1% DMSO from the stock solution). Data curation involved the removal of identical duplicate measurements, as well as of molecules found soluble at higher concentrations (5 mM and/or 10 mM) but not at 1 mM, which implies an error. The curated dataset consists of 989 compounds (900 “Soluble” and 89 “Insoluble”).
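The consistency rule used for Prestwick (soluble at 5 mM and/or 10 mM but not at 1 mM implies an error) can be sketched as a small check. The function name is hypothetical; the dictionary keys reuse the field names listed further down in this description, and it is an assumption that the Prestwick file carries exactly these fields.

```python
from typing import Dict


def is_inconsistent(record: Dict[str, str]) -> bool:
    """Flag a record reported soluble at 5 or 10 mM but insoluble at 1 mM."""
    insoluble_at_1 = record.get("Solubility at 1 mM in PBS") == "NO"
    soluble_higher = any(
        record.get(field) == "YES"
        for field in ("Solubility at 5 mM in PBS", "Solubility at 10 mM in PBS")
    )
    return insoluble_at_1 and soluble_higher
```

A record with no 1 mM flag at all is not flagged here, since the rule only speaks about measured contradictions.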

Life Chemicals: The Life Chemicals company provided kinetic solubility data for one of its fragment libraries (https://lifechemicals.com/fragment-libraries/soluble-fragment-library). The solubility of 11457 fragments was determined visually, based on the scattering observed in solutions at 1 mM concentration in PBS (pH 7.4) with 0.5% DMSO. After removal of data points with no kinetic solubility, the curated dataset consists of 9276 “Soluble” molecules.

MLSMR: The Molecular Libraries Small Molecule Repository (MLSMR - https://pubchem.ncbi.nlm.nih.gov/bioassay/1996) is a collection of small molecules compiled under the initiative of the National Institutes of Health (NIH) and screened by the Sanford-Burnham Center for Chemical Genomics (SBCCG). To our knowledge, MLSMR is the largest kinetic solubility dataset available in PubChem; it is composed of 57824 data points measured in PBS (pH 7.4) using quantitative chemiluminescent nitrogen detection (CLND). Although 0.2 mM was reported as the nominal concentration of a sample, a large fraction of the reported concentrations (about 31% of the dataset) lies in the range (0.15; 0.151] mM. Based on this observation, we assumed 0.15 mM to be the actual nominal sample concentration and removed data points whose reported concentration was greater than or equal to 0.15 mM (13262 data points). Additionally, data curation included the merging of duplicate molecules, taking the median of their solubility values. The resulting curated dataset contained 44510 nitrogen-containing compounds that are insoluble at 0.15 mM and are therefore labeled “Insoluble” at 1 mM.
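The two MLSMR curation steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name and the `(molecule_id, concentration_mM)` input layout are hypothetical, and applying the concentration cut before the duplicate merge is an assumed order that the description does not spell out.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple

ASSUMED_NOMINAL_MM = 0.15  # nominal concentration assumed in the description


def curate_mlsmr(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Drop points at or above the assumed nominal concentration,
    then merge duplicate molecules by the median of their values."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for mol_id, conc_mm in records:
        if conc_mm >= ASSUMED_NOMINAL_MM:
            continue  # at the ceiling: not informative about solubility
        grouped[mol_id].append(conc_mm)
    return {mol_id: median(values) for mol_id, values in grouped.items()}
```

Every compound that survives has a measured value below 0.15 mM, which is why all of them can be labeled “Insoluble” at the 1 mM threshold.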

Boehringer: Boehringer Ingelheim Pharma GmbH & Co. shared a dataset of 789 kinetic solubility measurements (doi: 10.1002/cmdc.200900205) performed in PBS (pH 7.4) using nephelometry. Data points with reported precipitate formation in the DMSO stock solution and those for which the solubility value was only bounded (relation denoted as “>”) were removed. The curated dataset contained 605 compounds, all of which are “Insoluble” at 1 mM. This dataset was used for QSPR modelling. The full dataset (789 data points) was used to discuss the alignment of solubility values between different kinetic solubility assays.
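The Boehringer filter can be expressed as a simple predicate over one record. The function name is hypothetical, and the keys “Relation” and “Precipitate in DMSO stock” are taken from the field list in this description; it is an assumption that they are the fields used for this filter in the Boehringer file.

```python
from typing import Dict


def keep_for_modelling(record: Dict[str, str]) -> bool:
    """Keep a record unless its value is only a lower bound (">")
    or a precipitate was reported in the DMSO stock solution."""
    bounded = record.get("Relation", "").strip() == ">"
    precipitate = record.get("Precipitate in DMSO stock", "").strip() == "Y"
    return not (bounded or precipitate)
```

Applied to the full set of 789 measurements, a filter of this kind would separate the modelling subset from the points kept only for the assay comparison.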

CNE1/CNE2: The Chimiothèque Nationale Essentielle (CNE) is a representative collection of physical samples of pure compounds from a larger chemical library of biologically relevant substances and natural extracts called the Chimiothèque Nationale (https://chembiofrance.cn.cnrs.fr/fr/composante/chimiotheque). CNE1 refers to the first generation of this representative collection, comprising 640 compounds, most of which have been depleted. CNE2 is the currently available new representative collection of 1040 compounds. The aqueous solubility of both collections has been measured by the “Plateforme de Chimie Biologique Intégrative de Strasbourg” (PCBIS) screening platform. PCBIS measured thermodynamic solubility for the CNE1 collection, whereas the CNE2 collection was screened for kinetic solubility. Thermodynamic solubility was measured using the shake-flask method, whereas kinetic solubility was measured using the HPLC-UV method at 200 μM nominal concentration. The data curation process was identical to that of Oprisiu (https://www.theses.fr/2012STRAF059). Insoluble compounds whose solubility was lower than the limit of detection have been excluded from the discussion. In addition, for CNE2, the following data points were removed:

entries with reported concentration > 210 μM, implying an experimental error; measurements with signs of impurity (multiple peaks in chromatogram); compounds with observed precipitation in stock solutions.

CNE1 contains 282 compounds, and the curation step yielded 525 compounds in CNE2, all of which are insoluble based on the 1 mM threshold. The CNE1 and CNE2 datasets were used to analyze differences between thermodynamic and kinetic solubility assay types, whereas the latter was also used for QSPR model training.

---------------

All datasets are provided in the MDL SDF V2000 molecular structure format. This format is a standard that can be interpreted by most chemistry software, such as ChemAxon or DataWarrior.

Each file contains the chemical structures and one or more of the following fields (see the description of each file).

ID: Identifier of the molecule (integer)

Relation: If only a bound for a numeric value has been reported (symbol - ">", "<")

Source: Which file the compound originates from (nominal - "CN1", "CN2", "Life Chemicals", "Prestwick Chemicals", "MLSMR", "Boehringer", "PICT")

Reason: A note explaining why a data point has been discarded (text)

Kinetic solubility (uM): Kinetic solubility in μM in PBS buffer (numeric - max value=1000)

Kinetic solubility in PBS (uM): Kinetic solubility in μM in PBS buffer (numeric - max value=1000)

Kinetic solubility in PBS (ug/mL): Kinetic solubility in μg/mL in PBS buffer (numeric)

Solubility at 1 mM in PBS: Kinetic solubility at 1 mM nominal concentration in PBS buffer (nominal - soluble: "YES"; insoluble: "NO")

Solubility at 5 mM in PBS: Kinetic solubility at 5 mM nominal concentration in PBS buffer (nominal - soluble: "YES"; insoluble: "NO")

Solubility at 10 mM in PBS: Kinetic solubility at 10 mM nominal concentration in PBS buffer (nominal - soluble: "YES"; insoluble: "NO")

Solubility at 50 mM in PBS: Kinetic solubility at 50 mM nominal concentration in PBS buffer (nominal - soluble: "YES"; insoluble: "NO")

Solubility at 100 mM in PBS: Kinetic solubility at 100 mM nominal concentration in PBS buffer (nominal - soluble: "YES"; insoluble: "NO")

Kinetic solubility class (1 mM threshold): Insoluble if the compound is detected at a concentration less than 1 mM. (nominal - insoluble: "0"; soluble: "1")

Kinetic solubility class (830 uM threshold): Soluble if the compound is detected at a concentration larger than 830 μM. (nominal - insoluble: "0"; soluble: "1")

Thermodynamic solubility (uM): Thermodynamic aqueous solubility in μM (numeric)

Thermodynamic solubility (logS (M)): log10 of the thermodynamic solubility in M units (numeric)

Solubility in DMSO (uM): Solubility in DMSO, for the stock solution, in μM (numeric)

Solubility and nominal conc absolute difference: Absolute difference between the detected concentration and the nominal concentration of the solute, used to quantify an anomaly (numeric)

Precipitate in DMSO stock: Indication of an anomaly in the stock solution (unary - anomaly: "Y")
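For readers without a chemistry toolkit at hand, the data fields listed above can be read from an SDF file with the Python standard library alone. The sketch below is a minimal reader that ignores the structure block and only collects the `> <Field>` data items of each record; the function names and the sample record are hypothetical, and a dedicated toolkit is the better choice whenever the chemical structures themselves are needed.

```python
import math
import re
from typing import Dict, Iterable, Iterator, List, Optional

# A data item header looks like "> <Field name>"; the name sits in angle brackets.
FIELD_HEADER = re.compile(r"^>[^<]*<([^>]+)>")


def iter_sdf_fields(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {field name: value} dict per SDF record."""
    fields: Dict[str, str] = {}
    name: Optional[str] = None
    values: List[str] = []
    in_data = False  # True once the structure block ("M  END") is behind us
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$$"):  # record delimiter
            if name is not None:
                fields[name] = "\n".join(values)
            yield fields
            fields, name, values, in_data = {}, None, [], False
        elif line.startswith("M  END"):
            in_data = True
        elif in_data:
            match = FIELD_HEADER.match(line)
            if match:
                if name is not None:
                    fields[name] = "\n".join(values)
                name, values = match.group(1), []
            elif name is not None:
                if line.strip() == "":  # a blank line closes the data item
                    fields[name] = "\n".join(values)
                    name = None
                else:
                    values.append(line)


def um_to_logs(conc_um: float) -> float:
    """Convert a concentration in uM to log10 of the concentration in M."""
    return math.log10(conc_um) - 6.0


# Hypothetical one-record sample, for illustration only.
sample = """\
mol1
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ID>
1

> <Kinetic solubility (uM)>
250

$$$$
"""
records = list(iter_sdf_fields(sample.splitlines()))
```

With a real file, `iter_sdf_fields(open(path, encoding="utf-8"))` would stream the records one by one, which matters for the larger files in this deposit.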

Identifier
DOI https://doi.org/10.57745/ZWS0WC
Metadata Access https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/ZWS0WC
Provenance
Creator Baybekov, Shamkhal; Llompart, Pierre; Marcou, Gilles; Gizzi, Patrick; Galzi, Jean-Luc; Ramos, Pascal; Saurel, Olivier; Bourban, Claire; Minoletti, Claire; Varnek, Alexandre (ORCID: 0000-0003-1886-925X)
Publisher Recherche Data Gouv
Contributor Marcou, Gilles; Université de Strasbourg; Centre national de la recherche scientifique; Entrepôt-Catalogue Recherche Data Gouv
Publication Year 2023
Rights etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess true
Contact Marcou, Gilles (UMR7140 CNRS, University of Strasbourg)
Representation
Resource Type Dataset
Format application/octet-stream; text/plain
Size 1716883; 519947; 578879; 709579; 1214732; 1332562; 15680472; 4166966; 118707981; 597827; 16226572; 106190591; 100130621; 30127191; 914009; 610066; 1604433; 99457; 9560
Version 1.1
Discipline Chemistry; Natural Sciences