This dataset transforms the many generic molecules in the ChEBI ontology—those whose structures contain undefined R‑groups—into fully specified molecular instances.
Its purpose is to let cheminformaticians, enzymologists and AI/ML developers treat R‑group–bearing ChEBI entries as ordinary molecules, so they can be indexed, searched and used to augment training sets for tasks such as reaction prediction, bio‑isosteric replacement and retro‑biosynthetic pathway design.
In form, the resource is a gzip‑compressed CSV file produced by a three‑stage RDKit‑based pipeline that:
1. Extracts from the Rhea reaction database (release 134) every ChEBI SMILES that contains at least one R‑group.
2. Finds real PubChem compounds whose heavy‑atom core matches the ChEBI scaffold, allowing only the R‑group position to vary.
3. Filters matches so that the final list comprises molecules differing from the template only at the R‑group site, and records their PubChem CIDs for traceability.
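The first stage above can be sketched in plain Python. This is only an illustration, not the pipeline's actual RDKit code: it assumes R‑groups are written as the SMILES wildcard atom `*`, it works on raw strings without canonicalising them, and the function names and identifiers in the usage note are hypothetical.

```python
def count_rgroups(smiles: str) -> int:
    """Count wildcard atoms ('*') in a SMILES string.

    Assumption: every R-group is encoded as '*', bare or bracketed
    (for example '*' or '[*]').
    """
    return smiles.count("*")


def select_generic(smiles_by_id: dict[str, str]) -> dict[str, list[str]]:
    """Keep entries with at least one R-group, grouping IDs by SMILES.

    The result maps each generic SMILES to the list of IDs that share it.
    Strings are compared as given; no canonicalisation is attempted here.
    """
    grouped: dict[str, list[str]] = {}
    for entry_id, smiles in smiles_by_id.items():
        if count_rgroups(smiles) >= 1:
            grouped.setdefault(smiles, []).append(entry_id)
    return grouped
```

With made‑up identifiers, `select_generic({"ID_A": "*C(=O)O", "ID_B": "*C(=O)O", "ID_C": "CC(=O)O"})` keeps the two generic entries under a single SMILES key and drops the fully specified one.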
Each record therefore links a generic ChEBI structure to the enumerated set of concrete PubChem structures that realise it, along with molecular weight, heavy‑atom count and bookkeeping fields that distinguish “exact core” matches from “core + extra substituent” matches.
The dataset covers all R‑group–containing entries in Rhea/ChEBI that survive the atomic filters (at least 6 heavy atoms, and only atoms found in living organisms). It has 12,709 rows and eight columns, which hold the canonical SMILES, the list of ChEBI IDs sharing that SMILES, computed properties, matched PubChem SMILES/CIDs with and without extra substituents, and provenance metadata.
By expanding more than a thousand otherwise unusable generic templates into over ten thousand explicit molecules, the dataset bridges a long‑standing gap between curated biochemical ontologies and large‑scale public compound repositories. It enables systematic benchmarking, data augmentation and method development wherever R‑groups once forced researchers to discard valuable reaction data.
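A quick sanity check of the file's shape can be done with the Python standard library alone. The sketch below is an assumption‑laden illustration: the function name is hypothetical, the file is assumed to be UTF‑8 with a single header row, and the raised field size limit only guards against the possibility that cells holding lists of SMILES are very long.

```python
import csv
import gzip


def summarize_csv_gz(path: str) -> tuple[list[str], int]:
    """Return (column names, number of data rows) of a gzip-compressed CSV."""
    # Assumption: some cells may hold long lists, so raise the default
    # per-field limit of the csv module before reading.
    csv.field_size_limit(10**9)
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)             # first line holds the column names
        n_rows = sum(1 for _ in reader)   # remaining lines are data records
    return header, n_rows
```

Called on the released file (the local path is up to the reader), this should report eight column names and 12,709 data rows if the file matches the description above.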
chebirgroup, 1.0.0