Structure Elucidation of Small Molecules
If you are interested in structure elucidation of small molecules (less than 2000 Da, excluding peptides), this list may be helpful to you. It will point you to some places where scientists usually lurk around during their worktime. I will focus on mass spectrometry techniques, because NMR can solve pretty much every structure. This is not the case for MS; hence it is more challenging :-) and of course we are a mass spectrometry (MS), cheminformatics and metabolomics focused lab.
Massive structure elucidation with MS, meaning that you can assign a structure to 80% of all peaks in each sample, is far from becoming reality. It takes a student little less than a day to find out that most textbook examples are just mock-ups. Of course every analytical and chemometric technique (UV, IR, NMR, MS, crystal structures, melting point, boiling point) should be used in a comprehensive approach; but in reality people are very limited in both equipment and time. Massive structure elucidation also requires a radical acceptance of new technical advances in mass spectrometry and, even more important, new concepts for software development and data handling.
If you feel that there are links missing please contact me. This is a sorted non-comprehensive list of software and databases.
Mass Spectrometry in general
MS vendor software is getting better each year, because otherwise the benefits of new mass spectrometry hardware would be totally lost. Additionally, users are no longer satisfied with simple quantification and library search but request complex software solutions including peak picking, data alignment, automatic identification, sample handling and more. Protein and peptide based mass spectrometry solutions brought fresh air into the market of MS software. However, these solutions cannot simply be transformed into software for small molecule research. Additionally, the huge proteomics community developed a broad base of free and open source mass spectrometry tools. In particular, the many incompatible vendor formats required a better solution, which gave the netCDF and mzXML file formats a huge boost. New computer hardware developments and faster multicore CPUs let software developers recognize that the free lunch is over, and more and more mass spectrometry software is now multi-core and multi-processor capable.
Programs
- PNNL tools - a source of free tools for MS and MS/MS of peptides and small molecules
- Sashimi - if you need converters for mzXML from Bruker, Thermo, ABI, JEOL, Waters
- MSQuant - if you need MGF and DTA files from MS/MS data
- MS-utils - collection of free mass spectrometry tools
- SPC Proteomics Tools - collection of tools around the Trans-Proteomic Pipeline (TPP)
- NHLBI Spectral Tools - file converters, interpretation software for peptides - [ISO CD]
- Spectral Alignment and Peak Alignment - these tools are used for comparison of multiple datasets
- Agilent Software Download - MS tools, MassHunter, GeneSpring updates and MS libraries
- Thermo Finnigan Downloads - Updates for XCalibur and other tools (see also Thermo MS Software)
- Waters MS Software - MassLynx and Inspector updates and information
- Bruker MS Software - Compass and Metabolic Profiler (MetaboliteTools) and ProfileAnalysis
- Applied Biosystems - Analyst and Analyst QS and Markerview
- Varian - MS Workstation Software
- JEOL - AccuTOF software
- PerkinElmer - Turbomass software for GC/MS line
- Shimadzu - Class software and LCMSSolution and GCMSSolution and Compound Composer
- Hitachi - BA Software (Information Based Acquisition) for Linear Trap-TOF-MS
- SetupX - LIMS system storing meta-data and annotated GC-MS results from Fiehn Laboratory
- Metlin DB - from Scripps
- MassBank DB - from the Japanese metabolomics initiative
- Base Peak Magazin - from Wiley with tools and literature
- Pittcon 2006 review - technology review from David Sparkman
- Mass Spectrometry A Textbook - covering basic principles of mass spectrometry
- Lipid Analysis - short primer covering most databases and useful tools
- Data processing tools - for mass spectrometry-based metabolomics (Katajamaa, Oresic)
GC-MS
For GC/MS data evaluation the free AMDIS software together with the free NIST MS Search program and the commercial, curated NIST05 database is a good start. The NIST Retention Index DB, which is included in NIST05, is a valuable source of retention indices. For advanced structure elucidation, the NIST Mass Spectrum Interpreter and the substructure identification routine in the NIST MS Search software can be recommended. For integration and quantification, commercial vendor software is recommended.
- Collection from amdis.net - general selection of tools for GC-MS
- Mass Spectrum Interpreter - a free tool from NIST for MS fragment annotation and the NIST MS Search
- AMDIS Deconvolution - GC-MS peak picking and spectrum investigation
- MOLGEN-MS - structure elucidation of electron impact mass spectra
- LECO ChromaTOF - software, available only if you purchase the GCTOF
- Ion Signature Technology - Multi vendor deconvolution software
- Thermo Finnigan - GC/MS software updates direct service link
- Alignment of GC-MS data - used for biomarker identification see separate section
- MassFinder - targeted GC-MS software with mass spectra and retention index search
- MET-IDEA - metabolomics ion-based data extraction algorithm using AMDIS
- WSEARCH - multipurpose GC-MS tool including integration, RI, library search (free and pro version)
Databases
- NIST05, Wiley, Palisade electron impact mass spectral databases + retention indices (NIST05)
- GOLM DB spectral collection of TMS spectra
- Pherobase - collection of Kovats retention indices
- SetupX - Open Repository of GC-MS experiments from Fiehnlab, annotated and assigned metabolomic profiling raw data from 1600 public GC-MS samples can be downloaded [content-1] [species]
- MassBase - Download for 2500 GC-MS samples and standards from KAZUSA
- Mass Spectrometry A Textbook - a compact source of MS knowledge
- Interpretation of Mass Spectra - a standard handbook of mass spectrometry by McLafferty / Turecek
LC-MS
Programs
- MassFrontier - from HighChem a standard tool for LC/MS handling
- ACD/MS Manager - from ACDLabs a standard tool for LC/MS data handling
- MassWorks - from Cerno Bioscience a software for increasing mass accuracy and peak picking and alignment for single quads and triple-quads
- LIMSA, SECD, for lipid data and emass and qmass for isotopic distributions
- Seven Golden Rules for elemental compositions and molecular formula determinations
- Alignment of LC-MS data - (compare multiple LC-MS runs) separate section
- Adduct ion detection - (detect [M+H], [M+Na] ions) see separate section
- NIST05 - MS/MS database
- Wiley DB - mostly EI but also ESI and APCI spectra
- MassFrontier - spectral tree MS/MS + CID collection (included in MassFrontier)
- ChemicalSoft drug spectra - MS/MS and CID for ABI mass spectrometers [list]
- Basics of LC-MS - A free primer from Agilent (5988-2045EN)
- Counterfeit drugs - A story using different mass spectrometry approaches to identify unknowns
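The adduct ion detection entry in the list above refers to [M+H] and [M+Na] ions. As a minimal sketch of the arithmetic behind such tools (not the method of any program listed here), the expected m/z of a singly charged adduct is the neutral monoisotopic mass plus the mass of the charge carrier, and two peaks that differ by the Na minus H spacing are candidates for the same compound. The function names, the 0.005 Da tolerance and the example peak list are assumptions chosen for illustration.

```python
# Monoisotopic atomic masses in Da, rounded to six decimals.
# The charge carrier is the atom minus one electron.
ELECTRON = 0.000549
ADDUCT_SHIFT = {
    "[M+H]+": 1.007825 - ELECTRON,
    "[M+Na]+": 22.989769 - ELECTRON,
    "[M+K]+": 38.963707 - ELECTRON,
}


def adduct_mz(neutral_mass):
    """Return the expected m/z of each singly charged adduct."""
    return {name: neutral_mass + shift for name, shift in ADDUCT_SHIFT.items()}


def find_adduct_pairs(peaks, tol=0.005):
    """Find peak pairs whose m/z spacing matches [M+Na]+ minus [M+H]+.

    tol is an assumed absolute tolerance in Da, not a recommended value.
    """
    spacing = ADDUCT_SHIFT["[M+Na]+"] - ADDUCT_SHIFT["[M+H]+"]
    ordered = sorted(peaks)
    pairs = []
    for i, low in enumerate(ordered):
        for high in ordered[i + 1:]:
            if abs((high - low) - spacing) <= tol:
                pairs.append((low, high))
    return pairs


glucose = 180.06339  # monoisotopic mass of C6H12O6
expected = adduct_mz(glucose)

# Hypothetical peak list: two glucose adducts plus one unrelated peak.
pairs = find_adduct_pairs([181.0707, 203.0526, 250.0000])
```

Real adduct detection also has to consider negative mode, multiple charges, neutral losses and co-elution in retention time; the pair search here only illustrates the mass-difference idea.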
NMR
Nuclear Magnetic Resonance (NMR) is the most important technique for structure elucidation. Except in some rare cases it is not possible to elucidate complex molecular structures with GC-MS, LC-MS, FTIR (Fourier Transform Infrared) or UV alone. All these techniques can give promising hints, but NMR is still the magic gatekeeper. NMR for metabolomic profiling has the advantage of being very reproducible. To achieve the same in LC-MS or GC-MS, extreme calibration and quality checking efforts are required. For example, 25% of all GC-MS profiling runs in our lab are quality check mixtures to monitor injectors, columns and detectors. On the other hand, NMR used for mixture analysis has poor resolving power for analyzing thousands of compounds in a single sample. The most severe problem is its low sensitivity compared to mass spectrometry. Depending on the technique used, mass spectrometry can be 1000-fold to a million-fold more sensitive than NMR. Considering the different strengths and weaknesses of both techniques, it must be concluded that they have to be used as complementary techniques in metabolomics.
Programs
- free NMR shift prediction from NMRShiftDB
- NMR prediction - NMRPredict software from Modgraph
- SENECA - package for Computer Assisted Structure Elucidation (CASE) ported into the CDK
- StrucEluc - from ACDLabs, currently the best performer, check out the challenge
- NMR tools - from ScienceSoft (Varian package)
- Assemble - from Upstream Solutions (CH)
- NMR alignment - see collection of peak alignment programs
- NMR Information Server Software - collection of tools from SpinCore
- MestreLab - MestReC, MestReNova, NMRPredict and MSpin
- Chenomx - Chenomx NMR Suite 5.0
- LSD - free software for automated structure elucidation from 2D NMR data
- NMRShiftDB - largest open NMR data spectral collection (25k experimental spectra)
- CSearchlite - free database of CNMR shift predictions (23 million unique structures from PubChem)
- MMCD - Madison Metabolomics Consortium Database - largest collection of experimental 1H, 13C and 2D NMR raw spectra [ftp]
- 2D-NMR - A chemist's guide, ISBN 0471187070
- Progress in NMR - Computer-assisted structure verification and elucidation tools in NMR-based structure elucidation by M.E. Elyashberg, A.J. Williams and G.E. Martin
Cheminformatics and Chemometrics tools
It is pretty clear that successful structure elucidation of small molecules can only be performed with a rich set of cheminformatics and statistics tools. This is due to the fact that molecule information and molecule spectra and properties are equally important. Statistical tools are needed to develop new methods for structure elucidation using mass spectral data and molecule properties. Additionally molecular properties are used to apply more stringent filters during the structure elucidation process.
Programs
- JChem Suite - a cheminformatics suite from ChemAxon and important tool for small molecule handling
- CDK - Chemistry Development Kit, a Java-based tool for small molecule handling
- OEChem - cheminformatics suite for several platforms
- Statistica Dataminer - the Swiss Army Knife for desktop statistical computing
- MEV - Statistical tool for multivariate and univariate analysis (TIGR/J. Craig Venter Institute)
- R - the mathematical and statistical language - lack of a good GUI is a major obstacle
- YALE - Datamining and Statistics
- Dragon - for chemical descriptor calculation (see also CDK and Joelib)
- MOLGEN-QSPR - for chemical structure and descriptor generation (see also Dragon and CDK)
- CC Program List - from the book: Introduction to Computational Chemistry
- EPI Suite - containing thousands of boiling points, pKa, logP values and more (largest free collection)
- ChemBioGrid - a collection of chemistry and biology related databases
- Cheminformatics Textbook - A must-have reference, ISBN 3-527-30681-1 - check out the 4 volume set
- Encyclopedia of Computational Chemistry - must have reference for CC - ISBN 0-471-96588-X
- Molecular descriptors - collection of books and software for QSAR and QSPR
- CCL net - a very active computational chemistry discussion group
Structure and Property Databases
Small molecule structure databases and databases collecting physico-chemical properties are the backbone of modern structure elucidation. Data from modern sciences like genetics and proteomics is usually collected in open data repositories like GenBank. In chemistry, molecules and their properties are usually published in papers first, and later specialized database services (like CAS and Beilstein) collect this publisher-copyrighted information and sell it back as a service to the scientist.
The positive side is that data from these services is curated and of high quality. Having access to these subscription databases frees chemists from publishing any metadata regarding molecules, reactions, molecular properties or spectra in an electronically accessible format and allows them to focus on their own work instead of curating databases. However, this very comfortable heritage (since 1881) is increasingly developing into a very serious handicap, because complete access to all information cannot be granted. That limits many synergistic approaches which require the investigation of the whole set of known small molecules, their properties and spectra at once. This is a sad fact in the electronic age.
To overcome this severe problem, open-access databases like PubChem were established. In a future step all molecule data and molecule property data must be submitted electronically to open access data repositories together with every publication, to allow commercial and non-commercial exploitation and exploration of this valuable data (already common practice in genomics and proteomics).
- PubChem - the most important small molecule database (compounds, substances, assays)
- Chemical Lookup Service - the largest collection (30 million) of small molecules
- KEGG database - the metabolite and pathway database
- Pathguide - pathway database collection
- Chemspider - search service for small molecules and property data (with programmers API)
- Flavonoid DB - Arita lab via metabolome.jp
- eMolecules - small molecule and supplier database
- KNApSAcK - Species-Metabolite Relationship Database from NAIST and RIKEN
Subscription databases
- CAS Scifinder - the largest curated collection of chemical compounds and chemical information
- Beilstein - the organic compound database and organic compound property DB
- Jubilant - interaction of proteins and small molecules
- MDL Discoverygate - compilation of several small molecule databases
- DNP - Dictionary of Natural Products
- ChemNavigator - database of 52 million (commercially) available compounds (iResearch Library)
Literature and Concepts
- Thirty-Two Free Chemistry Databases - a selection of small molecule databases
Computer Hardware
For Windows 32-bit the maximum available RAM is 2.88 GByte (not 4 GByte, even if you stick 8 GByte in).
For the Windows 64-bit version the maximum is currently 128 GByte. The number of CPUs is usually two, which means that with dual-core or quad-core CPUs you can have up to 8 CPU cores running.
CPU
The GHz numbers on CPUs are only a general hint (avoid old Intel Netburst technology). The minimum number of cores or CPUs for MS applications is two. Four cores are an optimum, because most MS software is currently only single-threaded (the computational routines).
AMD: The CPU should be at minimum a Dual Opteron 2.6 GHz or Athlon 64 X2 5000 (both are one year old).
INTEL: Minimum Intel Core 2 with minimum 2.0 GHz or Intel Quad core.
Memory
On a 32-bit system you need 4 GByte for maximum memory performance; however, one process can only use 2 GByte (2.88 GByte with PAE). On a 64-bit system, stick in whatever your money allows. 8 to 32 GByte are the preferred configuration. Surplus memory is needed for using a RAMDISK. Most MS software is only programmed for 32-bit (except Java applications), so the step to 64-bit will take a while. However, most MS companies now recognize the importance of well-performing software, so there will be changes in the near future.
DISK
The selection of hard disks is a three-tiered approach. Disk selection is a generally underestimated issue. However, on-the-fly LC-MS and GC-MS deconvolution needs the fastest disk performance possible. Do not trust people who want to install one single hard drive into your computer. RAID 5 and RAID 6 are the magic keywords.
The first part is to provide enough backup space and a way to back up the backup data. LC-MS and GC-MS files can easily reach hundreds of GBytes. A small external NAS (minimum 1000 GByte or 1 TByte) or a connection to a computer center is preferred. For backup, the ACRONIS software suite or similar software performs very well. An incremental backup of 200 GByte usually takes 10-20 minutes. Usually 5-10 TByte of storage space is recommended, with either a 1 Gbit/s or, better, a 10 Gbit/s network connection.
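To put the network recommendation into numbers, here is a small back-of-the-envelope sketch. It computes the ideal time to move a given data volume over a link of a given nominal speed, assuming decimal units (1 GByte = 8 Gbit) and ignoring protocol overhead and disk limits, so real transfers will be slower. The function name is an assumption made for illustration.

```python
def transfer_minutes(size_gbyte, link_gbit_per_s):
    """Ideal transfer time in minutes, ignoring protocol overhead and disk limits."""
    size_gbit = size_gbyte * 8  # 1 byte = 8 bits
    seconds = size_gbit / link_gbit_per_s
    return seconds / 60


for link in (1, 10):
    print(f"200 GByte over {link} Gbit/s: {transfer_minutes(200, link):.1f} min")
# prints:
# 200 GByte over 1 Gbit/s: 26.7 min
# 200 GByte over 10 Gbit/s: 2.7 min
```

Even under these ideal assumptions a full 200 GByte copy over a 1 Gbit/s link takes close to half an hour, which is one reason to prefer the faster connection when several TByte of raw data have to be moved.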
The second part is the internal hard disk array. A professional system usually uses hard disks with at least 10,000 rpm, like the WD Raptor SATA series. Such a hard disk has an internal speed of around 50-70 MByte/second. Normal hard drives can usually read/write at 30-60 MByte/second. For efficient working a SATA RAID 5 or RAID 6 array is recommended. The new RAID 5 and RAID 6 arrays can read/write at 200 to 400 MByte/second, or up to 800 MByte/second. With RAID 5 one disk can fail and the array can be rebuilt after the disk is replaced; with RAID 6 two drives can fail and the data is still secure. (RAID 0 has no redundancy at all and is therefore not safe; RAID 1 mirrors the data but offers only the capacity of a single disk.) The new ARECA RAID cards cost around 500 dollars, and 4-8 modern SATA hard drives are recommended for use. AMUG recently showed benchmarks with the 8-bay EnhanceBox E8-ML Multilane Infiniband SATA enclosure and the ARECA RAID-6 card (dual Mini-SAS).
A RAM disk is recommended for extreme performance when RAID performance is not sufficient. RAM disks operate entirely in memory, and a 64-bit system can provide several GByte of space. The advantage is that access times are extremely low, so thousands of files can be read without delay, and transfer rates can reach 1000 MByte/second. As a pure software solution, RAMDISK Enterprise can be used.
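The trade-off between capacity and safety for the RAID levels mentioned above can be sketched in a few lines. RAID 0 stripes data without redundancy, RAID 5 spends one disk's worth of space on parity and RAID 6 spends two. The function name and the 500 GByte disk size in the usage lines are assumptions for illustration; actual usable space depends on the controller and file system.

```python
def raid_summary(level, n_disks, disk_gbyte):
    """Return (usable capacity in GByte, number of disk failures tolerated)."""
    if level == 0:  # striping only, no redundancy
        return n_disks * disk_gbyte, 0
    if level == 5:  # one disk's worth of parity
        if n_disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (n_disks - 1) * disk_gbyte, 1
    if level == 6:  # two disks' worth of parity
        if n_disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        return (n_disks - 2) * disk_gbyte, 2
    raise ValueError(f"unsupported RAID level: {level}")


# Hypothetical array: 8 SATA disks of 500 GByte each.
for level in (0, 5, 6):
    usable, tolerated = raid_summary(level, 8, 500)
    print(f"RAID {level}: {usable} GByte usable, survives {tolerated} failed disk(s)")
```

With eight disks the extra safety of RAID 6 costs only one more disk of capacity than RAID 5, which is why larger arrays tend to favor it.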
Last modified 2008-07-22 07:58 PM