
Structure Elucidation of Small Molecules

Edited by Tobias Kind (June 2014)

If you are interested in structure elucidation of small compounds (less than 2000 Da, excluding peptides) this list may be helpful to you. It will point you to some places where scientists usually lurk during their work time. I will focus on mass spectrometry techniques, because NMR can solve pretty much every structure. This is not the case for MS; hence it is more challenging :-) and of course we are a lab focused on mass spectrometry (MS), cheminformatics and metabolomics.


Massive structure elucidation with MS, which means you can assign a structure to 80% of all peaks in each sample, is far from becoming reality. It takes a student a little less than a day to find out that most textbook examples are just a mockup. Of course every analytical and chemometric technique (UV, IR, NMR, MS, crystal structures, mp, bp) should be used in a comprehensive approach; but in reality people are very limited in both technical resources and time. Massive structure elucidation also requires a radical acceptance of new technical advances in mass spectrometry and, even more important, new concepts for software development and data handling.

If you feel that there are links missing please contact me. This is a sorted non-comprehensive list of software and databases.

 



Mass Spectrometry in general


MS vendor software is getting better each year because otherwise the benefits of the new mass spectrometry hardware are totally lost. Additionally users are no longer satisfied with simple quantification and library search but request complex software solutions including peak picking, data alignment, automatic identification, sample handling and more. Protein and peptide based mass spectrometry solutions brought fresh air into the market of MS software. However these solutions cannot simply be transformed into software for small molecule research. Additionally the huge proteomics community developed a broad base of free and open source mass spectrometry tools. Especially the many incompatible vendor formats required a better solution, which gave the netCDF and mzXML file formats a huge boost. New computer hardware developments and faster multicore CPUs let software developers recognize that the free lunch is over, and more and more mass spectrometry software is now multi-core and multi-processor capable.


Programs


Databases

  • SetupX - LIMS system storing meta-data and annotated GC-MS results from Fiehn Laboratory
  • Metlin DB - from Scripps
  • MassBank DB - from the Japanese metabolomics initiative


Literature and Concepts




GC-MS

For GC/MS data evaluation the free AMDIS software together with the free NIST MS Search program and the commercial and curated NIST05 database is a good start. The NIST Retention Index DB, which is included in the NIST05, is a valuable source of retention indices. For advanced studies in structure elucidation the NIST mass spectral interpreter and the NIST substructure identification routine in the NIST MS Search software can be recommended. For integration and quantification commercial vendor software is recommended.

Programs

  • Collection from amdis.net - general selection of tools for GC-MS
  • Mass Spectrum Interpreter - a free tool from NIST for MS fragment annotation and the NIST MS Search
  • AMDIS Deconvolution - GC-MS peak picking and spectrum investigation
  • MOLGEN-MS - structure elucidation of electron impact mass spectra
  • LECO ChromaTOF - software that comes with the purchase of the GC-TOF instrument
  • AnalyzerPro - from SpectralWorks, a multi-vendor deconvolution software
  • Ion Signature Technology - Multi vendor deconvolution software
  • Thermo Finnigan - GC/MS software updates direct service link
  • Alignment of GC-MS data - used for biomarker identification see separate section
  • MassFinder - targeted GC-MS software with mass spectra and retention index search
  • MET-IDEA - metabolomics ion-based data extraction algorithm using AMDIS
  • WSEARCH - multipurpose GC-MS tool including integration, RI, library search (free and pro version)
  • Tagfinder - For alignment of large GC-MS-based metabolite profiling experiments (free version)
  • GCImage - for GCxGC visualization and analysis (ZOEX Corp)
  • OpenChrom - Open source tool for GC-MS and LC-MS chromatogram handling, peak integration
  • ARISTO - Automatic Reduction of Ion Spectra to Ontology - classification of EI MS spectra


Databases

  • NIST05, Wiley, Palisade electron impact mass spectral databases + retention indices (NIST05)
  • GOLM DB  and GMD - spectral collection of TMS spectra
  • Pherobase - collection of Kovats retention indices
  • SetupX - Open Repository of GC-MS experiments from Fiehnlab, annotated and assigned metabolomic profiling raw data from 1600 public GC-MS samples can be downloaded [content-1] [species]
  • MassBase - Download for 2500 GC-MS samples and standards from KAZUSA
  • MeltDB - an online repository and alignment program for LC-MS and GC-MS data
  • FiehnLib - GC-MS database from primary metabolites for metabolic profiling experiments
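
Retention indices such as those collected in the NIST05 and Pherobase databases are derived from the retention times of an n-alkane ladder measured under the same conditions as the analyte. The following is a minimal sketch of that calculation, assuming temperature-programmed GC and the linear (van den Dool and Kratz) interpolation between the two bracketing alkanes; for isothermal runs the classic Kovats formula uses logarithms of adjusted retention times instead. The function name and the example retention times are hypothetical and not taken from any of the listed tools.

```python
def linear_retention_index(rt, alkanes):
    """Linear retention index for temperature-programmed GC.

    rt      -- retention time of the analyte
    alkanes -- dict mapping n-alkane carbon number to its retention time
               (same units as rt), measured with the same GC method
    """
    # Sort the alkane ladder by retention time.
    ladder = sorted(alkanes.items(), key=lambda item: item[1])
    # Find the two alkanes that bracket the analyte and interpolate linearly.
    for (n_lo, t_lo), (n_hi, t_hi) in zip(ladder, ladder[1:]):
        if t_lo <= rt <= t_hi:
            fraction = (rt - t_lo) / (t_hi - t_lo)
            return 100.0 * (n_lo + (n_hi - n_lo) * fraction)
    raise ValueError("retention time lies outside the alkane ladder")


# Hypothetical alkane ladder (carbon number -> retention time in minutes).
example_ladder = {10: 5.0, 11: 6.0, 12: 7.5}
example_ri = linear_retention_index(6.75, example_ladder)
```

With the hypothetical ladder above, a peak eluting halfway between C11 and C12 receives an index of 1150. The sketch refuses to extrapolate beyond the ladder, since indices outside the calibrated range are unreliable.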

Literature and Concepts




LC-MS

For simple LC-MS data handling the vendor specific software is usually sufficient. For structure elucidation and mass spectral identification HighChem Mass Frontier and ACD/MS Manager are required. Both software packages offer unique solutions like peak detection, peak picking and mass spectral deconvolution plus mass spectral tree search, mass spectral fragmentation prediction and adduct detection that no other mass spectrometry software can currently offer. For LC-MS data handling it is important that the software performs peak detection not only on the chromatographic data but on truly mass spectral deconvoluted and cleaned peaks, similar to ChromaTOF LC. These cleaned peaks (which have an annotated peak purity and signal-to-noise ratio) are later submitted to database search or further interpretation.


Programs

  • MassFrontier - from HighChem a standard tool for LC/MS handling
  • ACD/MS Manager - from ACDLabs a standard tool for LC/MS data handling
  • MassWorks - from Cerno Bioscience a software for increasing mass accuracy and
    peak picking and alignment for single quads and triple-quads
  • LIMSA, SECD, for lipid data and emass and qmass for isotopic distributions
  • Seven Golden Rules for elemental compositions and molecular formula determinations
  • Alignment of LC-MS data - (compare multiple LC-MS runs) separate section
  • Adduct ion detection - (detect [M+H], [M+Na] ions) see separate section
  • Mass++ - Multiple format (ABI, Waters, Thermo) plug-in style software for mass spectrometers
  • Sirius2 (Starburst) - MS/MS interpretation and molecular formula calculation (Uni Jena)
  • MetFrag - In silico fragmentation for computer assisted identification of molecules [MetWare]
  • MetFusion - An online service for rapid compound identification
  • mMass  - Open Source Mass Spectrometry Tool (reads mzData and mzXML)
  • mzMatch - XCMS alignment for peakML data and metabolomics data processing
  • OpenMS - An open-source framework for mass spectrometry (peak picking, database connections)
  • NIST MS Search - publicly available MS/MS search and LIB2NIST and MSPepsearch (for command line)
  • MAVEN - chromatographic aligner, peak-feature detector, isotope and adduct calculator
  • MOLFIND - multi-threaded pipeline for compound annotation and identification
  • IDEOM - An Excel interface for analysis of LC-MS based metabolomics data 
  • MetAssign - probabilistic annotation of liquid chromatography data
  • CFM-ID - webserver for spectral prediction, peak assignments and compound identification
  • MAGMA - an online tool for compound identification and metabolite annotation
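
Several of the tools listed above detect adduct ions such as [M+H] and [M+Na]. The underlying arithmetic can be sketched as follows: each singly charged adduct shifts the neutral monoisotopic mass by a fixed amount, and a measured m/z is matched within a ppm tolerance. This is only an illustration and not the algorithm of any listed program; the function names, the short adduct table and the 5 ppm default tolerance are assumptions chosen for the example.

```python
# Mass shifts (Da) of common singly charged adducts relative to the neutral
# monoisotopic mass M. The electron mass is already accounted for.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
    "[M-H]-": -1.007276,
    "[M+Cl]-": 34.969402,
}


def adduct_mz(neutral_mass):
    """Return the theoretical m/z of each adduct for a neutral monoisotopic mass."""
    return {name: neutral_mass + shift for name, shift in ADDUCT_SHIFTS.items()}


def ppm_error(measured_mz, theoretical_mz):
    """Relative mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def annotate(measured_mz, neutral_mass, tol_ppm=5.0):
    """List the adducts of neutral_mass that explain measured_mz within tol_ppm."""
    return [
        name
        for name, mz in adduct_mz(neutral_mass).items()
        if abs(ppm_error(measured_mz, mz)) <= tol_ppm
    ]


# Example: glucose, C6H12O6, monoisotopic mass 180.06339 Da.
glucose_adducts = adduct_mz(180.06339)
```

For glucose the sodium adduct is expected at about m/z 203.0526, so `annotate(203.0526, 180.06339)` returns only `[M+Na]+`. Real adduct detection additionally has to handle multiply charged ions, dimers and in-source fragments, which this sketch ignores.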

Databases

  • NIST05 -  MS/MS database (new NIST10 available)
  • Wiley DB - mostly EI but also ESI and APCI spectra
  • MassFrontier - spectral tree MS/MS + CID collection (included in MassFrontier)
  • ChemicalSoft drug spectra - MS/MS and CID for ABI mass spectrometers [list]
  • HMDB - Human Metabolome DB with multiple search interfaces (adducts, accurate masses)
  • MZedDB - Tools for the annotation of High Resolution MS metabolomics data (adducts, formulas, DBs)
  • MassBank - Large curated repository of MS1 and MS/MS spectra (ESI, MALDI, APCI)
  • LipidBlast - the largest freely available in-silico MS/MS spectral database for lipid identification

Literature and Concepts

 



NMR

Nuclear Magnetic Resonance (NMR) is the most important technique for structure elucidation. Except for some rare cases it is not possible to interpret complex molecular structures with GC-MS, LC-MS, FTIR (Fourier Transform Infrared) or UV alone. All these techniques can give promising hints, but still NMR is the magic gatekeeper. NMR for metabolomic profiling has the advantage of being very reproducible. To achieve the same in LC-MS or GC-MS, extreme calibration and quality checking efforts are needed. For example, 25% of all GC-MS profiling runs in our lab are quality check mixtures to monitor injectors, columns and detectors. On the other hand NMR used for mixture analysis has poor resolving power for analyzing thousands of compounds in one single sample. The most severe problem is its lower sensitivity compared to mass spectrometry. Depending on the sample technique mass spectrometry can be 1000-fold to million-fold more sensitive than NMR. Considering the different strengths and weaknesses of both techniques it must be concluded that they have to be used as complementary techniques for metabolomics.



Programs

Databases

  • NMRShiftDB - largest open NMR data spectral collection (25k experimental spectra)
  • CSearchlite - free database of CNMR shift predictions (23 million unique structures from PubChem)
  • MMCD -  Madison Metabolomics Consortium Database (MMCD) - largest collection of experimental 1H, 13C and 2D NMR raw spectra [ftp]

Literature and Concepts

  • 2D-NMR - A chemist's guide, ISBN 0471187070
  • Progress in NMR - Computer-assisted structure verification and elucidation tools in NMR-based structure elucidation by M.E. Elyashberg, A.J. Williams and G.E. Martin

 



Cheminformatics and Chemometrics tools

It is pretty clear that successful structure elucidation of small molecules can only be performed with a rich set of cheminformatics and statistics tools. This is due to the fact that molecule information and molecule spectra and properties are equally important. Statistical tools are needed to develop new methods for structure elucidation using mass spectral data and molecule properties. Additionally molecular properties are used to apply more stringent filters during the structure elucidation process.


Programs

  • JChem Suite - a cheminformatics suite from ChemAxon and important tool for small molecule handling
  • CDK - Chemistry Development Kit, a Java-based tool for small molecule handling
  • OEChem - cheminformatics suite for several platforms
  • Statistica Dataminer - the Swiss Army Knife for desktop statistical computing
  • MEV - Statistical tool for multivariate and univariate analysis (TIGR Institute/JG Venter)
  • R - the mathematical and statistical language - lack of a good GUI is a major obstacle
  • YALE - Datamining and Statistics
  • Dragon - for chemical descriptor calculation (see also CDK and Joelib)
  • MOLGEN-QSPR - for chemical structure and descriptor generation (see also Dragon and CDK)
  • CC Program List - from the book:  Introduction to Computational Chemistry

Databases

  • EPI Suite - contains thousands of boiling points, pKa, logP values and more (largest free collection)
  • ChemBioGrid - a collection of chemistry and biology related databases

Literature and Concepts

 



Structure and Property Databases

Small molecule structure databases and databases collecting physico-chemical properties are the backbone of modern structure elucidation. Data from modern sciences like genetics and proteomics is usually collected in open data repositories like GenBank. In chemistry molecules and their properties are usually published in papers first and later specialized database services (like CAS and Beilstein) collect this publisher copyrighted information and sell it back as service to the scientist.

The positive side is that data from these services is curated and of high quality. Having access to these subscription databases frees chemists from publishing any metadata regarding molecules, molecule reactions or molecular properties or spectra in an electronically accessible format and allows them to focus on their own work instead of curating databases. However this very comfortable heritage (since 1881) increasingly develops into a very serious handicap, because complete access to all information cannot be granted. That limits many synergistic approaches which require the investigation of the whole set of known small molecules, their properties and spectra at once. This is a sad fact in the electronic age.

To overcome this severe problem open-access databases like PubChem were established. In a future step all molecule data and molecule property data must be submitted electronically to open-access data repositories together with every publication to allow commercial and non-commercial exploitation and exploration of this valuable data (a very common practice in genomics and proteomics).

Free databases (academic use)

  • PubChem - the most important small molecule database (compounds, substances, assays)
  • Chemical Lookup Service - the largest collection (30 million) of small molecules
  • KEGG database - the metabolite and pathway database
  • Pathguide - pathway database collection
  • Chemspider - search service for small molecules and property data (with programmers API)
  • Flavonoid DB - Arita lab via metabolome.jp
  • eMolecules - small molecule and supplier database
  • KNApSAcK - Species-Metabolite Relationship Database from NAIST (Kanaya Lab) and RIKEN
  • Metab2Mesh - links MeSH headings with metabolites including PubMed articles [example]
  • CTD - The Comparative Toxicogenomics Database (disease - metabolite relationships)



Subscription databases

  • CAS Scifinder - the largest curated collection of chemical compounds and chemical information
  • Beilstein - the organic compound database and organic compound property DB
  • Jubilant - interaction of proteins and small molecules
  • MDL Discoverygate - compilation of several small molecule databases
  • DNP - Dictionary of Natural Products
  • ChemNavigator - database of 52 million (commercially) available compounds (iResearch Library)


Literature and Concepts




Computer Hardware

If you want to perform advanced studies in structure elucidation the selection of proper hardware is a serious issue, because LC-MS and GC-MS and GCxGC systems generate plenty of data. Do not trust people who want to sell you an office computer. For MS data handling you need the fastest computer hardware your money (minimum $1000-$5000) can buy. Also computers delivered with MS hardware are usually not recommended.

Operating System

The platform selection is very easy: it is Windows 7 or 8 (64-bit). The reason is that 95% of the mass spectrometry software is Windows dependent. For Linux applications use virtualization software like VMware or Microsoft Virtual PC. Use any Linux distribution like Ubuntu.

For 32-bit Windows the maximum available RAM is 2.88 GByte (not 4 GByte, even if you stick 8 GByte in).
For 64-bit Windows 7 (Professional and higher editions) the maximum is currently 192 GByte. The number of supported physical CPUs is usually two; that means with dual-core or quad-core CPUs you can have up to 8 CPU cores running.

CPU
The GHz numbers on CPUs are only a general hint. The minimum number of cores or CPUs for MS applications is two. Four cores are an optimum because most MS software (the computational routines) is currently only single-threaded. For data processing, Intel Core-i5 and Core-i7 CPUs (quad core, speed >2.4 GHz) are recommended. The PassMark score (see below) should be higher than 8,000.

Memory
32-bit systems are not recommended anymore, unless vendor required. For 64-bit systems 8 to 32 GByte is the preferred configuration. Surplus memory is needed for using a RAMDISK. Most MS software is only programmed for 32-bit (except Java applications).

DISK
The selection of hard disks is a three-tiered approach. Disk selection is a generally underestimated issue. However on-the-fly LC-MS and GC-MS deconvolution needs the fastest disk performance possible. Do not trust people who want to install one single hard drive into your computer. RAID 10 and RAID 6 are the magic keywords.

The first part is to provide enough backup space and a way to back up the backup data. LC-MS and GC-MS files can easily reach hundreds of GBytes or several TBytes. A small external NAS (minimum 4 TByte to 8 TByte) or a connection to a computer center is preferred. For backup the ACRONIS software suite or similar software performs very well. An incremental backup of 200 GByte usually takes 10-20 minutes. Usually 5-10 TByte of storage space is recommended, with either a 1 Gbit/s or better a 10 Gbit/s network connection.
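
To see why the network connection matters for the storage sizes mentioned above, the ideal transfer time of a full backup can be estimated with simple arithmetic. The sketch below is only a back-of-the-envelope estimate under stated assumptions: decimal units (1 GByte = 8 Gbit), a link that is fully available, and a hypothetical `efficiency` factor for protocol overhead that defaults to 1.0. Real backups will be slower because of disk speed and file system overhead.

```python
def transfer_time_hours(size_gbyte, link_gbit_per_s, efficiency=1.0):
    """Ideal time in hours to move size_gbyte over a network link.

    size_gbyte      -- amount of data in GByte (decimal, 1 GByte = 8 Gbit)
    link_gbit_per_s -- nominal link speed in Gbit/s (for example 1 or 10)
    efficiency      -- assumed fraction of the nominal speed actually reached
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = size_gbyte * 8.0 / (link_gbit_per_s * efficiency)
    return seconds / 3600.0


# A full copy of 5 TByte (5000 GByte) over the two link speeds named above.
hours_1g = transfer_time_hours(5000, 1)
hours_10g = transfer_time_hours(5000, 10)
```

Under these idealized assumptions a full 5 TByte copy needs roughly 11 hours on a 1 Gbit/s link and a little over 1 hour on a 10 Gbit/s link, which is why full backups over the slower link are best scheduled overnight and incremental backups are used during the day.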

The second part is the internal hard disk array. A professional system usually uses SSDs in a RAID 10 or RAID 6 configuration for data processing and a storage RAID system for large file storage. If price is a problem, use a 500 GByte SSD for OS and program installation and a 2-4 TByte disk for data file storage.

A RAM disk is recommended for extreme performance when RAID performance is not sufficient. RAM disks operate purely in memory, and a 64-bit system can provide several GByte of space. The advantage is that access times are extremely low, so thousands of files can be read without delay, and transfer rates can reach 10,000 MByte/second. As a pure software solution the RAMDISK Enterprise or the free SoftPerfect RAMDISK or OSFMount Ramdisk can be used.

Example Laptop for Metabolomics and ChemInformatics (2013)

For heavy and fast data processing the Laptop CPU should reach around 8000 Passmark points, everything above 8000 is OK.
http://www.cpubenchmark.net/high_end_cpus.html

* Minimum is > 2.4 Ghz for Intel Core CPUs (Example Intel Core i7-4700MQ @ 2.40GHz)
* Minimum is Core i5 better core i7 (Virtualization + quad core)
* RAM minimum is 8 GByte better 16 GByte
* HDD is 1TByte but not recommended, better 500-1000 GB SSD
* WIN7 or WIN8 OK, if MAC use BootCamp or Virtual Machine
* Important is screen resolution: minimum 1600x900 and 15"; if you buy a lower vertical resolution you will not see the many software windows and chromatograms in full, hence 1920x1080 resolution is better, everything above will cost much more.

* Lifetime is estimated 4 years for heavy data processing
* Price for such a Laptop in 2013 is minimum $800 (without SSD) or $1200 (with SSD)

Example Workstation (2013)

Low budget workstation (Price <$2500): Single CPU 
For heavy and fast data processing the Workstation CPU should reach at least 10,000 in Passmark points, everything above 8000 is OK. http://www.cpubenchmark.net/high_end_cpus.html

High budget workstation (Price <$10,000): Dual CPU setup
For fastest multi-core processing the Workstation should at least reach 16,000 Passmark points, better 20,000. http://www.cpubenchmark.net/multi_cpu.html

The deal for workstations is to find a good mix of high single-core turbo speed and high multi-core speed. Hence no AMD system is recommended, because single-threaded speed is below standard (multi CPU is OK). The most crucial setup is to use SSDs for processing and only SSDs, best in RAID1 (mirror), RAID10 (4 SSDs; one can always fail, two only if they sit in different mirror pairs) or RAID6 (4 SSDs, any 2 can fail). If spinning hard disks are used for processing they will slow down the whole system and performance will be sub-par.
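
The trade-off between the RAID levels named above can be made concrete by comparing usable capacity and fault tolerance. The following is a minimal sketch assuming identical disks, two-way mirrors for RAID10 and standard dual-parity RAID6; the function and field names are hypothetical and it does not model performance or rebuild behavior.

```python
from typing import NamedTuple


class RaidSummary(NamedTuple):
    usable_tb: float          # usable capacity in TByte
    guaranteed_failures: int  # disk failures survived in the worst case
    best_case_failures: int   # disk failures survived in the luckiest case


def raid_summary(level: str, n_disks: int, disk_tb: float) -> RaidSummary:
    """Usable capacity and fault tolerance for an array of identical disks."""
    if level == "RAID1":
        # n-way mirror: capacity of one disk, all but one disk may fail.
        if n_disks < 2:
            raise ValueError("RAID1 needs at least 2 disks")
        return RaidSummary(disk_tb, n_disks - 1, n_disks - 1)
    if level == "RAID10":
        # Stripe over two-way mirror pairs: half the raw capacity. Losing
        # both disks of one pair is fatal, so only 1 failure is guaranteed.
        if n_disks < 4 or n_disks % 2:
            raise ValueError("RAID10 needs an even number of disks, at least 4")
        pairs = n_disks // 2
        return RaidSummary(pairs * disk_tb, 1, pairs)
    if level == "RAID6":
        # Dual parity: two disks' worth of capacity is used for redundancy.
        if n_disks < 4:
            raise ValueError("RAID6 needs at least 4 disks")
        return RaidSummary((n_disks - 2) * disk_tb, 2, 2)
    raise ValueError(f"unsupported RAID level: {level}")


# Four 1 TByte SSDs in the two configurations discussed above.
raid10 = raid_summary("RAID10", 4, 1.0)
raid6 = raid_summary("RAID6", 4, 1.0)
```

With four 1 TByte SSDs both RAID10 and RAID6 yield 2 TByte of usable space. RAID6 survives any two failures, while RAID10 is only guaranteed to survive one but typically offers faster writes, which is the reason it is often preferred for processing volumes.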

* Minimum CPU is Core-i7 or Xeon E5 (>2.6 GHz, better >3.0 GHz) Example Intel Core i7-4930K or Dual Xeon E5 2687W
* SSD 2x 512 GB or 4x1TB and a disk based 4x2TByte RAID
* RAM minimum is 32 GByte, better 128 Gbyte
* Win7 or WIN8 (64-bit) OK
* Screens Dual 24" and 1920x1200 resolution minimum
* GPU for CUDA and OPENCL enabled acceleration ad libitum

* Lifetime is estimated 4 years single CPU and 6 years for Dual CPU setup and heavy data processing 
* Price for such a workstation in 2013 is minimum $2400 (SSD) or $4400 (with 4 SSDs)