If you are interested in structure elucidation of small compounds (excluding peptides, less than 2000 Da) this list may be helpful to you. It will point you to some places where scientists usually lurk around during their worktime. I will focus on mass spectrometry techniques, because NMR can solve pretty much every structure. This is not the case for MS; hence it is more challenging :-) and of course we are a mass spectrometry (MS), cheminformatics and metabolomics focused lab.


Massive structure elucidation with MS (meaning you can assign a structure to 80% of all peaks in each sample) is far from becoming reality. It takes a student little less than a day to find out that most textbook examples are just a mockup. Of course every analytical and chemometric technique (UV, IR, NMR, MS, crystal structures, melting and boiling points) should be used in a comprehensive approach; but in reality people are very limited, both technically and time-wise. Massive structure elucidation also requires a radical acceptance of new mass spectrometry hardware advances and, even more importantly, of new concepts for software development and data handling.

This is a sorted, non-comprehensive list of software and databases. If you feel that links are missing, please contact me.

 

Mass Spectrometry in general

MS vendor software is getting better each year, because otherwise the benefits of the new mass spectrometry hardware would be totally lost. Additionally, users are not satisfied with simple quantification and library search anymore but request complex software solutions including peak picking, data alignment, automatic identification, sample handling and more. Protein and peptide based mass spectrometry solutions brought fresh air into the market of MS software. However, these solutions cannot simply be transformed into software for small molecule research. Additionally, the huge proteomics community developed a broad base of free and open source mass spectrometry tools. The many incompatible vendor formats in particular required a better solution, which gave the netCDF and mzXML file formats a huge boost. New computer hardware developments and faster multicore CPUs made software developers recognize that the free lunch is over, and more and more mass spectrometry software is now multi-core and multi-processor capable.
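To illustrate why multi-core capability matters, here is a minimal sketch of farming per-spectrum work out across CPU cores with Python's standard library. The `centroid_spectrum` routine is a toy stand-in (an assumption, not any vendor's algorithm) for a real, CPU-heavy processing step:

```python
from multiprocessing import Pool

def centroid_spectrum(mz_intensity_pairs):
    """Toy stand-in for a CPU-heavy spectrum-processing step:
    keep only local intensity maxima (a crude centroiding pass)."""
    peaks = []
    for i in range(1, len(mz_intensity_pairs) - 1):
        mz, inten = mz_intensity_pairs[i]
        if inten > mz_intensity_pairs[i - 1][1] and inten > mz_intensity_pairs[i + 1][1]:
            peaks.append((mz, inten))
    return peaks

def process_all(spectra, workers=4):
    """Fan the per-spectrum work out to a pool of worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(centroid_spectrum, spectra)

if __name__ == "__main__":
    spectrum = [(100.0, 1.0), (100.1, 5.0), (100.2, 2.0), (100.3, 8.0), (100.4, 1.0)]
    print(process_all([spectrum] * 3, workers=2))
```

Because each spectrum is independent, this kind of work parallelizes almost perfectly, which is exactly what modern multi-core MS software exploits.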


Programs


Databases


Literature and Concepts

 

GC-MS

For GC/MS data evaluation the free AMDIS software together with the free NIST MS Search program and the commercial, curated NIST05 database is a good starting point. The NIST Retention Index DB, which is included in NIST05, is a valuable source of retention indices. For advanced structure elucidation studies the NIST Mass Spectrum Interpreter and the substructure identification routine in the NIST MS Search software can be recommended. For integration and quantification commercial vendor software is recommended.
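The retention indices mentioned above are computed against an n-alkane ladder. A minimal sketch of the linear (van den Dool and Kratz) retention index used for temperature-programmed runs; the alkane retention times below are purely hypothetical:

```python
def linear_retention_index(rt, alkane_rts):
    """Linear (van den Dool & Kratz) retention index for a
    temperature-programmed GC run.

    rt         -- retention time of the unknown peak (minutes)
    alkane_rts -- {carbon_number: retention_time} for the n-alkane ladder
    """
    carbons = sorted(alkane_rts)
    for lo, hi in zip(carbons, carbons[1:]):
        t_lo, t_hi = alkane_rts[lo], alkane_rts[hi]
        if t_lo <= rt <= t_hi:
            # interpolate linearly between the bracketing alkanes
            return 100 * (lo + (hi - lo) * (rt - t_lo) / (t_hi - t_lo))
    raise ValueError("retention time outside the alkane ladder")

# Hypothetical ladder (C10, C12, C14); an unknown eluting between C12 and C14
alkanes = {10: 5.0, 12: 8.0, 14: 11.0}
print(linear_retention_index(9.5, alkanes))  # 1300.0
```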

Programs

  • Collection from amdis.net - general selection of tools for GC-MS
  • Mass Spectrum Interpreter - a free tool from NIST for MS fragment annotation and the NIST MS Search
  • AMDIS Deconvolution - GC-MS peak picking and spectrum investigation
  • MOLGEN-MS - structure elucidation of electron impact mass spectra
  • LECO ChromaTOF - deconvolution software (you have to purchase a LECO GC-TOF to obtain it)
  • AnalyzerPro - from SpectralWorks, a multi-vendor deconvolution software
  • Ion Signature Technology - Multi vendor deconvolution software
  • Thermo Finnigan - direct service link to GC/MS software updates
  • Alignment of GC-MS data - used for biomarker identification see separate section
  • MassFinder - targeted GC-MS software with mass spectra and retention index search
  • MET-IDEA - metabolomics ion-based data extraction algorithm using AMDIS
  • WSEARCH - multipurpose GC-MS tool including integration, RI, library search (free and pro version)
  • Tagfinder - For alignment of large GC-MS-based metabolite profiling experiments (free version)
  • GCImage - for GCxGC visualization and analysis (ZOEX Corp)
  • OpenChrom - Open source tool for GC-MS and LC-MS chromatogram handling, peak integration
  • ARISTO - Automatic Reduction of Ion Spectra to Ontology - classification of EI MS spectra

Databases

  • NIST05, Wiley, Palisade electron impact mass spectral databases + retention indices (NIST05)
  • GOLM DB and GMD - spectral collection of TMS spectra
  • Pherobase - collection of Kovats retention indices
  • MassBase - Download for 2500 GC-MS samples and standards from KAZUSA
  • MeltDB - an online repository and alignment program for LC-MS and GC-MS data
  • FiehnLib - GC-MS database from primary metabolites for metabolic profiling experiments

Literature and Concepts

 

LC-MS

For simple LC-MS data handling the vendor specific software is usually sufficient. For structure elucidation and mass spectral identification, HighChem Mass Frontier and ACD/MS Manager are required. Both software packages offer unique solutions (peak detection, peak picking and mass spectral deconvolution, plus mass spectral tree search, mass spectral fragmentation prediction and adduct detection) that no other mass spectrometry software currently can offer. For LC-MS data handling it is important that the software performs peak detection not only on the chromatographic data but on truly deconvoluted and cleaned mass spectral peaks, similar to ChromaTOF LC. These cleaned peaks (which carry an annotated peak purity and signal-to-noise ratio) are later submitted to database search or further interpretation.
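The signal-to-noise annotation mentioned above can be illustrated with a deliberately crude sketch; real deconvolution software estimates noise far more carefully, so treat this as a toy model only:

```python
import statistics

def signal_to_noise(trace, baseline_points=5):
    """Crude signal-to-noise estimate for an extracted-ion trace:
    apex intensity divided by the standard deviation of the first
    `baseline_points` values, which are assumed to be pure baseline."""
    noise = statistics.stdev(trace[:baseline_points])
    return max(trace) / noise if noise else float("inf")

# Hypothetical trace: flat-ish baseline followed by a chromatographic peak
trace = [10, 11, 9, 10, 10, 50, 400, 900, 400, 50, 10]
print(round(signal_to_noise(trace)))  # 1273
```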


Programs

  • MassFrontier - from HighChem a standard tool for LC/MS handling
  • ACD/MS Manager - from ACDLabs a standard tool for LC/MS data handling
  • MassWorks - from Cerno Bioscience, software for increasing mass accuracy plus
    peak picking and alignment for single quads and triple quads
  • LIMSA, SECD, for lipid data and emass and qmass for isotopic distributions
  • Seven Golden Rules for elemental compositions and molecular formula determinations
  • Alignment of LC-MS data - (compare multiple LC-MS runs) separate section
  • Adduction ion detection - (detect [M+H], [M+Na] ions) see separate section
  • Mass++ - Multiple format (ABI, Waters, Thermo) plug-in style software for mass spectrometers
  • Sirius2 (Starburst) - MS/MS interpretation and molecular formula calculation (Uni Jena)
  • MetFrag - In silico fragmentation for computer assisted identification of molecules [MetWare]
  • MetFusion - An online service for rapid compound identification
  • mMass - Open Source Mass Spectrometry Tool (reads mzData and mzXML)
  • mzMatch - XCMS alignment for peakML data and metabolomics data processing
  • OpenMS - An open-source framework for mass spectrometry (peak picking, database connections)
  • NIST MS Search - publicly available MS/MS search, plus LIB2NIST and MSPepsearch (for the command line)
  • MAVEN - chromatographic aligner, peak-feature detector, isotope and adduct calculator
  • MOLFIND - multi-threaded pipeline for compound annotation and identification
  • IDEOM - An Excel interface for analysis of LC-MS based metabolomics data
  • MetAssign - probabilistic annotation of liquid chromatography data
  • CFM-ID - webserver for spectral prediction, peak assignments and compound identification
  • MAGMA - an online tool for compound identification and metabolite annotation
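Adduct detection, as offered by several of the tools above, boils down to simple mass arithmetic on the neutral monoisotopic mass. A minimal sketch using standard adduct mass shifts for singly charged ions (glucose serves as the example neutral):

```python
# Monoisotopic adduct mass shifts in Da (standard adduct-table values)
ADDUCT_SHIFTS = {
    "[M+H]+":  1.007276,
    "[M+Na]+": 22.989218,
    "[M+K]+":  38.963158,
    "[M-H]-": -1.007276,
}

def adduct_mz(neutral_mass, adduct):
    """m/z of a singly charged adduct ion for a given neutral monoisotopic mass."""
    return neutral_mass + ADDUCT_SHIFTS[adduct]

# Glucose, C6H12O6, monoisotopic mass 180.06339 Da
print(round(adduct_mz(180.06339, "[M+H]+"), 4))   # 181.0707
print(round(adduct_mz(180.06339, "[M+Na]+"), 4))  # 203.0526
```

Matching several such adduct hypotheses against co-eluting peaks is essentially how [M+H]+ / [M+Na]+ pairs are detected.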

Databases

  • NIST05 - MS/MS database (new NIST10 available)
  • Wiley DB - mostly EI but also ESI and APCI spectra
  • MassFrontier - spectral tree MS/MS + CID collection (included in MassFrontier)
  • ChemicalSoft drug spectra - MS/MS and CID for ABI mass spectrometers [list]
  • HMDB - Human Metabolome DB with multiple search interfaces (adducts, accurate masses)
  • MZedDB - Tools for the annotation of High Resolution MS metabolomics data (adducts, formulas, DBs)
  • MassBank - Large curated repository of MS1 and MS/MS spectra (ESI, MALDI, APCI)
  • LipidBlast - the largest freely available in-silico MS/MS spectral database for lipid identification

Literature and Concepts

 

NMR

Nuclear Magnetic Resonance (NMR) is the most important technique for structure elucidation. Except in some rare cases it is not possible to interpret complex molecular structures with GC-MS, LC-MS, FTIR (Fourier Transform Infrared) or UV alone. All these techniques can give promising hints, but NMR is still the magic gatekeeper. NMR for metabolomic profiling has the advantage of being very reproducible. To achieve the same in LC-MS or GC-MS, extreme calibration and quality checking efforts have to be performed; for example, 25% of all GC-MS profiling runs in our lab are quality check mixtures to monitor injectors, columns and detectors. On the other hand, NMR used for mixture analysis has poor resolving power for analyzing thousands of compounds in one single sample. The most severe problem is sensitivity compared to mass spectrometry: depending on the sampling technique, mass spectrometry can be 1000-fold to million-fold more sensitive than NMR. Considering the different strengths and weaknesses of both techniques, it must be concluded that they have to be used as complementary techniques for metabolomics.

Programs

Databases

  • NMRShiftDB - largest open NMR data spectral collection (25k experimental spectra)
  • CSearchlite - free database of CNMR shift predictions (23 million unique structures from PubChem)
  • MMCD - Madison Metabolomics Consortium Database (MMCD) - largest collection of experimental 1H, 13C and 2D NMR raw spectra [ftp]

Literature and Concepts

  • 2D-NMR - A Chemist's Guide, ISBN 0471187070
  • Progress in NMR - Computer-assisted structure verification and elucidation tools in NMR-based structure elucidation by M.E. Elyashberg, A.J. Williams and G.E. Martin
 

Cheminformatics and Chemometrics tools

It is pretty clear that successful structure elucidation of small molecules can only be performed with a rich set of cheminformatics and statistics tools, because molecule structures and molecule spectra and properties are equally important. Statistical tools are needed to develop new methods for structure elucidation using mass spectral data and molecular properties. Additionally, molecular properties are used to apply more stringent filters during the structure elucidation process.
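A property filter can be as simple as a monoisotopic mass window over candidate formulas. A minimal sketch (the `mass_window_filter` helper and the dictionary-based formula representation are illustrative assumptions, not any particular tool's API):

```python
# Monoisotopic masses of common biological elements (Da)
MONOISOTOPIC = {"C": 12.0, "H": 1.007825, "N": 14.003074,
                "O": 15.994915, "P": 30.973762, "S": 31.972071}

def formula_mass(formula):
    """Monoisotopic mass of a formula given as {element: count}."""
    return sum(MONOISOTOPIC[el] * n for el, n in formula.items())

def mass_window_filter(candidates, target_mass, ppm=5.0):
    """Keep candidate formulas whose monoisotopic mass lies within
    +/- ppm of the target mass: a simple, stringent property filter."""
    tol = target_mass * ppm / 1e6
    return [f for f in candidates if abs(formula_mass(f) - target_mass) <= tol]

caffeine = {"C": 8, "H": 10, "N": 4, "O": 2}
glucose = {"C": 6, "H": 12, "O": 6}
print(mass_window_filter([caffeine, glucose], 194.08038, ppm=5.0))
```

With a 5 ppm window around 194.08038 Da, only the caffeine formula survives the filter.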


Programs

  • JChem Suite - a cheminformatics suite from ChemAxon and important tool for small molecule handling
  • CDK - Chemistry Development Kit, a Java-based toolkit for small molecule handling
  • OEChem - cheminformatics suite for several platforms
  • Statistica Dataminer - the Swiss Army Knife for desktop statistical computing
  • MEV - Statistical tool for multivariate and univariate analysis (TIGR Institute/JG Venter)
  • R - the mathematical and statistical language; the lack of a good GUI is a major obstacle
  • YALE - Datamining and Statistics
  • Dragon - for chemical descriptor calculation (see also CDK and Joelib)
  • MOLGEN-QSPR - for chemical structure and descriptor generation (see also Dragon and CDK)
  • CC Program List - from the book: Introduction to Computational Chemistry

Databases

  • EPI Suite - containing thousands of boiling points, pKa, logP values and more (largest free collection)
  • ChemBioGrid - a collection of chemistry and biology related databases

Literature and Concepts

 

Structure and Property Databases

Small molecule structure databases and databases collecting physico-chemical properties are the backbone of modern structure elucidation. Data from modern sciences like genetics and proteomics is usually collected in open data repositories like GenBank. In chemistry, molecules and their properties are usually published in papers first, and specialized database services (like CAS and Beilstein) later collect this publisher-copyrighted information and sell it back as a service to the scientists.

The positive side is that data from these services is curated and of high quality. Having access to these subscription databases frees chemists from publishing any metadata regarding molecules, reactions, molecular properties or spectra in an electronically accessible format, and allows them to focus on their own work instead of curating databases. However, this very comfortable heritage (since 1881) increasingly develops into a very serious handicap, because complete access to all information cannot be granted. That limits many synergistic approaches which require investigating the whole set of known small molecules, their properties and spectra at once. This is a sad fact in the electronic age.

To overcome this severe problem, open-access databases like PubChem were established. In a future step, all molecule and molecular property data must be submitted electronically to open access data repositories together with every publication, to allow commercial and non-commercial exploitation and exploration of this valuable data (a common practice in genomics and proteomics).


Free databases (academic use)

  • PubChem - the most important small molecule database (compounds, substances, assays)
  • Chemical Lookup Service - the largest collection (30 million) of small molecules
  • KEGG database - the metabolite and pathway database
  • Pathguide - pathway database collection
  • Chemspider - search service for small molecules and property data (with programmers API)
  • Flavonoid DB - Arita lab via metabolome.jp
  • eMolecules - small molecule and supplier database
  • KNApSAcK - Species-Metabolite Relationship Database from NAIST (Kanaya Lab) and RIKEN
  • Metab2Mesh - links MeSH headings with metabolites including PubMed articles [example]
  • CTD - The Comparative Toxicogenomics Database (disease - metabolite relationships)

Subscription databases

  • CAS Scifinder - the largest curated collection of chemical compounds and chemical information
  • Beilstein - the organic compound database and organic compound property DB
  • Jubilant - interaction of proteins and small molecules
  • MDL Discoverygate - compilation of several small molecule databases
  • DNP - Dictionary of Natural Products
  • ChemNavigator - database of 52 million (commercially) available compounds (iResearch Library)

Literature and Concepts

 

Computer Hardware

If you want to perform advanced studies in structure elucidation, the selection of proper hardware is a serious issue, because LC-MS, GC-MS and GCxGC systems generate plenty of data. Do not trust people who want to sell you an office computer. For MS data handling you need the fastest computer hardware your money (minimum $1000-$5000) can buy. The computers delivered with MS hardware are usually also not recommended.


Operating System

The platform selection is very easy: it is Windows 7 or 8 (64-bit), because 95% of all mass spectrometry software is Windows dependent. For LINUX applications use a virtual machine such as VMWARE or Microsoft Virtual PC, with any LINUX distribution such as UBUNTU. For Windows 32-bit the maximum usable RAM is about 2.88 GByte (not 4 GByte, even if you stick 8 GByte in). For Windows 64-bit the maximum is currently 192 GByte. The number of CPU sockets is usually two; that means with dual-core or quad-core CPUs you can have up to 8 CPU cores running.


CPU

The GHz numbers on CPUs are only a general hint. The minimum number of cores for MS applications is two; four cores are the optimum, because most MS software (the computational routines) is currently only single-threaded. For data processing use an Intel Core-i5 or Core-i7 (quad core, speed >2.4 GHz). The PassMark score (see below) should be higher than 8,000.


Memory

32-bit systems are not recommended anymore, unless required by a vendor. For 64-bit systems 8 to 32 GByte is the preferred configuration. Surplus memory is needed for using a RAMDISK. Note that most MS software is still programmed for 32-bit (except JAVA applications).


DISK

The selection of hard disks is a three-tiered approach and a generally underestimated issue: on-the-fly LC-MS and GC-MS deconvolution needs the fastest disk performance possible. Do not trust people who want to install one single hard drive into your computer; RAID 10 and RAID 6 are the magic keywords.

The first part is to provide enough backup space and a way to back up the backup data. LC-MS and GC-MS files can easily reach hundreds of GBytes or TBytes. A small external NAS (minimum 4 to 8 TByte) or a connection to a computer center is preferred. For backup the ACRONIS software suite or similar software performs very well; an incremental backup of 200 GByte usually takes 10-20 minutes. Usually 5-10 TByte of storage is recommended, with either a 1 Gbit/s or, better, a 10 Gbit/s network connection.

The second part is the internal hard disk array. A professional system usually uses SSDs in RAID 10 or RAID 6 configuration for data processing and a storage RAID system for large files. If price is a problem, use a 500 GByte SSD for the OS and program installation and 2-4 TByte disks for data file storage.

A RAM disk is recommended for extreme performance when RAID performance is not sufficient. RAM disks operate entirely in memory, and a 64-bit system can provide several GByte of space. The advantage is that access times are extremely low, so thousands of files can be read without delay, and transfer rates can reach 10,000 MByte/second. As a pure software solution the RAMDISK Enterprise, the free SoftPerfect RAMDISK or the OSFMount RAM disk can be used.
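The difference between a single disk, an SSD RAID and a RAM disk is easy to measure yourself. A rough sequential-write micro-benchmark sketch; note that OS write caching heavily influences the numbers, so treat the result as indicative only:

```python
import os
import tempfile
import time

def write_throughput_mb_s(path, size_mb=64):
    """Rough sequential-write benchmark: write `size_mb` MByte to `path`,
    fsync, and report MByte/second. Point `path` at a RAID volume,
    a single disk or a RAM disk to compare them."""
    chunk = b"\0" * (1024 * 1024)  # 1 MByte of zeros
    start = time.perf_counter()
    with open(path, "wb") as fh:
        for _ in range(size_mb):
            fh.write(chunk)
        fh.flush()
        os.fsync(fh.fileno())  # force the data to stable storage
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mb / elapsed

if __name__ == "__main__":
    target = os.path.join(tempfile.gettempdir(), "mb_bench.tmp")
    print(f"{write_throughput_mb_s(target):.0f} MByte/s")
```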


Example Laptop for Metabolomics and ChemInformatics (2013)

For heavy and fast data processing the laptop CPU should reach around 8,000 PassMark points; everything above 8,000 is OK.
http://www.cpubenchmark.net/high_end_cpus.html

* Minimum is >2.4 GHz for Intel Core CPUs (example: Intel Core i7-4700MQ @ 2.40GHz)
* Minimum is Core i5, better Core i7 (virtualization + quad core)
* RAM minimum is 8 GByte, better 16 GByte
* HDD with 1 TByte is possible but not recommended; better a 500-1000 GB SSD
* WIN7 or WIN8 OK; if MAC, use Boot Camp or a virtual machine
* Important is the screen resolution: minimum 1600x900 at 15". If you buy a lower
vertical resolution you will not see many software windows and chromatograms in full;
hence 1920x1080 is better, and everything above will cost much more.
* Lifetime is estimated 4 years for heavy data processing
* Price for such a Laptop in 2013 is minimum $800 (without SSD) or $1200 (with SSD)


Example Workstation (2013)

Low budget workstation (Price <$2500): Single CPU
For heavy and fast data processing the workstation CPU should reach at least 10,000 PassMark points. http://www.cpubenchmark.net/high_end_cpus.html

High budget workstation (Price <$10,000): Dual CPU setup
For fastest multi-core processing the Workstation should at least reach 16,000 Passmark points, better 20,000. http://www.cpubenchmark.net/multi_cpu.html

The goal for workstations is to find a good mix of high single-core turbo speed and high multi-core speed. Hence no AMD system is recommended, because single-threaded speed is below standard (multi-CPU performance is OK). The most crucial point of the setup is to use SSDs for processing, and only SSDs: best in RAID 1 (mirror), RAID 10 (4 SSDs, 2 can fail) or RAID 6 (4 SSDs, 2 can fail). If spinning disks are used instead, they will slow down the whole system and performance will be sub-par.

* Minimum CPU is a Core-i7 or Xeon E5 (>2.6 GHz, better >3.0 GHz); examples: Intel Core i7-4930K or dual Xeon E5-2687W
* SSD 2x 512 GB or 4x 1 TB, plus a disk-based 4x 2 TByte RAID
* RAM minimum is 32 GByte, better 128 GByte
* Win7 or WIN8 (64-bit) OK
* Screens: dual 24" with 1920x1200 resolution minimum
* GPU for CUDA and OPENCL enabled acceleration ad libitum
* Lifetime is estimated 4 years single CPU and 6 years for Dual CPU setup and heavy data processing
* Price for such a workstation in 2013 is minimum $2400 (SSD) or $4400 (with 4 SSDs)