Fiehn Lab - Substructure Mining

Substructure analysis of a given set of molecules is broadly important in metabolomics, cheminformatics, QSAR/QSPR and drug research. It can be used to assess the diversity of natural product or drug libraries, either for classification or for later use in SAR analysis. Several groups have tried to analyze molecular databases at a higher level of abstraction (natural compounds, general organic compounds). Such approaches are possible but generally of limited practical use, because the outcome depends on the chosen level of abstraction and on the aim of the work. Furthermore, if the databases or tools used are neither freely nor commercially available, the results have little wider impact.

Two common problems in the analysis of large molecule datasets are memory limitations and insufficient use of multiprocessor systems, which are now widely available. Many newer tools aim to process larger datasets (such as PubChem, with 20 million compounds).

Approaches

Maximum common substructure (MCS) analysis
SMARTS or substructure matching
Chemical graph mining
Fancy mining
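
To make the graph mining idea concrete, here is a minimal sketch in plain Python. It is not the algorithm of any tool named on this page. It assumes a toy molecule representation (a list of element symbols plus a list of bonded atom index pairs, hydrogens omitted, bond orders ignored), and it only enumerates linear atom paths rather than general subgraphs. The function names path_fragments and frequent_fragments are hypothetical and chosen for illustration.

```python
from collections import Counter


def path_fragments(atoms, bonds, max_atoms=3):
    """Return the set of linear fragments (atom-label paths) in one molecule.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs into atoms (assumed toy format)
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    found = set()

    def extend(path):
        labels = tuple(atoms[i] for i in path)
        # A path read forwards or backwards is the same fragment,
        # so store only the lexicographically smaller reading.
        found.add(min(labels, labels[::-1]))
        if len(path) == max_atoms:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:  # simple paths only, no atom revisited
                extend(path + [nxt])

    for start in neighbors:
        extend([start])
    return found


def frequent_fragments(molecules, min_support=2, max_atoms=3):
    """Count in how many molecules each fragment occurs (its support)
    and keep the fragments that reach min_support."""
    support = Counter()
    for atoms, bonds in molecules:
        # Each molecule contributes at most once per fragment.
        support.update(path_fragments(atoms, bonds, max_atoms))
    return {frag: n for frag, n in support.items() if n >= min_support}


# Toy heavy-atom graphs (illustrative data, not taken from any database).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
acetic_acid = (["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
methylamine = (["C", "N"], [(0, 1)])

result = frequent_fragments([ethanol, acetic_acid, methylamine], min_support=2)
for fragment, count in sorted(result.items()):
    print("-".join(fragment), count)
```

With these three toy molecules, the C-C-O path is kept because it occurs in both ethanol and acetic acid, while O-C-O and C-N each occur in only one molecule and are dropped. A real miner would also need bond orders, rings and branched fragments, none of which are handled here.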

Tools

Substructure Mining: MOSS (Christian Borgelt) and MOFA (Uni Konstanz)
MCS analysis: LIBMCS (ChemAxon); MCSS (OpenEye); CncMCS (ChemNavigator)
SMARTS analysis: CDK Descriptor GUI
Frequent Subgraph Mining: ParMol (Uni Erlangen)
Graph and chemical mining: PAFI/AFGen (Karypis Lab UMN)

Examples