Project:
How large is the metabolome? A Critical Analysis of Data Exchange Practices in Chemistry

Project Partner:
Tobias Kind, Martin Scholz, Oliver Fiehn

Picture (public domain): US long grain rice (USDA/Weller)


Results:
Kind T, Scholz M, Fiehn O (2009) How Large Is the Metabolome? A Critical Analysis of Data Exchange Practices in Chemistry. PLoS ONE 4(5): e5440. dx.doi.org/10.1371/journal.pone.0005440. Download article here [DOI] [PDF]

Short Introduction:

Calculating the metabolome size of species by genome-guided reconstruction of metabolic pathways misses all products from orphan genes and from enzymes lacking annotated genes. Hence, metabolomes need to be determined experimentally. Annotations by mass spectrometry would greatly benefit if peer-reviewed public databases could be queried to compile target lists of structures that have already been reported for a given species. We detail current obstacles to compiling such a knowledge base of metabolites.

As an example, results are presented for rice. Two rice (Oryza sativa) subspecies have been fully sequenced, Oryza japonica and Oryza indica. Several major small molecule databases were compared for their listings of known rice metabolites: PubChem, Chemical Abstracts, Beilstein, patent databases, Dictionary of Natural Products, SetupX/BinBase, KNApSAcK DB, and finally those databases that were obtained by computational approaches, i.e. RiceCyc, KEGG, and Reactome. More than 5,000 small molecules were retrieved when searching these databases. Unfortunately, genuine rice metabolites were most often retrieved together with non-metabolite database entries such as pesticides. Overlaps between database compound lists were very difficult to determine because structures were either not encoded in machine-readable format or because compound identifiers were not cross-referenced between databases.

We conclude that present databases are not capable of comprehensively retrieving all known metabolites. Metabolome lists are still mostly restricted to genome-reconstructed pathways. We suggest that providers of (bio)chemical databases enrich their database identifiers with PubChem IDs and InChIKeys to enable cross-database queries. In addition, peer-reviewed journal repositories need to mandate submission of structures and spectra in machine-readable format to allow automated semantic annotation of articles containing chemical structures. Such changes in publication standards and database architectures will enable researchers to compile current knowledge about the metabolome of species, which may extend to derived information such as spectral libraries, organ-specific metabolites, and cross-study comparisons.
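
To illustrate why shared identifiers matter, the following minimal sketch computes overlaps between compound lists once every list is keyed by InChIKey. The database names and list memberships are invented for illustration only; the InChIKeys are copied from the annotation examples further down this page and are treated as opaque strings. It is not the procedure used in the publication.

```python
from itertools import combinations

# Illustrative compound lists: database names and memberships are made up.
# The InChIKeys are copied from the annotation examples on this page.
compound_lists = {
    "database_a": {
        "DQBQWWSFRPLIAX-UHFFFAOYAG",  # 2-acetyl-1-pyrroline
        "HVYWMOMLDIMFJA-DPAQBDIFBB",  # cholesterol
        "ODKSFYDXXFIFQN-UHFFFAOYAT",  # arginine
    },
    "database_b": {
        "HVYWMOMLDIMFJA-DPAQBDIFBB",  # cholesterol
        "ODKSFYDXXFIFQN-UHFFFAOYAT",  # arginine
        "JXSJBGJIGXNWCI-UHFFFAOYAK",  # malathion
    },
    "database_c": {
        "ODKSFYDXXFIFQN-UHFFFAOYAT",  # arginine
    },
}


def pairwise_overlap(lists):
    """Return {(name_a, name_b): shared keys} for every pair of lists."""
    return {
        (a, b): lists[a] & lists[b]
        for a, b in combinations(sorted(lists), 2)
    }


def found_everywhere(lists):
    """Return the keys that occur in every list."""
    return set.intersection(*lists.values())


overlaps = pairwise_overlap(compound_lists)
for (a, b), shared in overlaps.items():
    print(f"{a} / {b}: {len(shared)} shared compound(s)")
print("In all lists:", sorted(found_everywhere(compound_lists)))
```

With a common key the comparison reduces to set intersection. Without it, each pair of databases needs its own name or structure matching step, which is the obstacle described above.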

Pathway and small molecule databases are created from A) publications, B) in-silico data or computational models (ortholog mapping approaches), and C) molecules obtained directly from experimental repositories or databases. Taxonomy data must be associated with each molecule and pathway, or a separate database must be created for each species. Associated compartment data (organ, tissue) should be included, as should the endogenous or exogenous origin of each molecule.
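
The requirements above can be sketched as a record type. This is a hypothetical layout, not the schema of any database named on this page: the class names, field names, and the source values assigned to the two example records are assumptions made for illustration. The compound names, CIDs, and InChIKeys come from the annotation examples further down.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Source(Enum):
    PUBLICATION = "publication"
    IN_SILICO = "in-silico"
    EXPERIMENTAL = "experimental repository"


class Origin(Enum):
    ENDOGENOUS = "endogenous"
    EXOGENOUS = "exogenous"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class MetaboliteRecord:
    """One molecule reported for one species (hypothetical layout)."""
    name: str
    inchikey: str
    species: str                       # taxonomy association
    source: Source                     # how the entry was obtained
    pubchem_cid: Optional[int] = None
    compartment: Optional[str] = None  # organ or tissue, if known
    origin: Origin = Origin.UNKNOWN


# Source values below are illustrative assumptions, not curated facts.
records = [
    MetaboliteRecord("Arginine", "ODKSFYDXXFIFQN-UHFFFAOYAT", "Oryza sativa",
                     Source.IN_SILICO, pubchem_cid=232,
                     origin=Origin.ENDOGENOUS),
    MetaboliteRecord("Malathion", "JXSJBGJIGXNWCI-UHFFFAOYAK", "Oryza sativa",
                     Source.PUBLICATION, pubchem_cid=4004,
                     origin=Origin.EXOGENOUS),
]


def genuine_metabolites(recs, species):
    """Keep only endogenous entries reported for the given species."""
    return [r for r in recs
            if r.species == species and r.origin is Origin.ENDOGENOUS]


print([r.name for r in genuine_metabolites(records, "Oryza sativa")])
```

Storing the origin explicitly is what allows a query to separate genuine metabolites from entries such as pesticides that were merely reported in the same species.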

Molecular structure and spectral data are usually recovered from bitmap images using optical character recognition and optical structure recognition. The associated data loss (hamburger-to-cow algorithm) can be avoided by direct submission of machine-readable structures and machine-readable spectra to large institutional repositories. OCR algorithms are valuable and needed for already published data, but a paradigm change must be initiated for new chemistry, biochemistry and metabolomics publications.

Machine-readable semantic data (the living cow) should not be converted into a hamburger, that is, into bitmap structures and bitmap spectra that are no longer machine readable. Algorithms are later used to convert this hamburger back into a cow, which is error-prone and unnecessary, because the original vectorized structures and high-resolution spectra are available "as is" from publication authors and researchers. (Our critique is not aimed at text OCR algorithms, and cloning cows from well-done hamburgers may be possible in the future.) [Large Picture]


Provided project software:

Due to multiple license restrictions the data here is presented for academic research and peer-review only. By downloading, you agree to use the data either for academic research or peer-review. Download the supplement data and project software here EXCEL 2000 [XLS] or [ZIP]. The file can be opened with Microsoft EXCEL or OpenOffice.

Picture service (for your convenience):

Parts of the software supplement of the publication are published under the Creative Commons (by) license. This license lets others distribute, remix, tweak, and build upon your work, even commercially, as long as they credit you for the original creation. Download PPT and TIFF pictures as [ZIP].

Call for participation and open discussion:

Please comment on this article or discuss barriers, problems, obstacles or projects missing from the article. The focus is not on how many databases exist, but on how such databases can be enriched with experimental, machine-readable data from electronic structure and spectral data submitted directly with publications. The PLOS comment section requires a valid (non-anonymous) login; the comment tab is at the top of the article page, see the graphic below [LINK to comments].


Please comment on the article by creating a PLOS login and writing about your ideas regarding this article.

Links to external software used in the project:

Programs:

  1. TEXTPAD ($$) www.textpad.com
  2. MS EXCEL ($$$) + Visual Basic www.microsoft.com
  3. ChemAxon molconvert (free), cxcalc (academic license), JCHEM full (academic license)
  4. ChemAxon Instant-JChem (free academic version)
  5. EPA EPISuite (free)
  6. Beilstein Crossfire for searching the Beilstein database of organic compounds and properties
  7. Scifinder Scholar for searching the CAS database
  8. InChI and InChIKey software (free)

Databases and Services (updated):

  • The PubChem database (free) - download the whole PubChem DB here: PubChem FTP
  • The Dictionary of Natural Products ($$$$) Web version
  • The KEGG database (free)
  • The peptide DB and metabolome DB (free)
  • The MDL Beilstein database ($$$$$)
  • The CAS database ($$$$$ academic or $$$$$$ commercial)
  • The ChemSpider DB (free) - largest information enhanced DB with mass spectrometry API
  • The RiceCyc DB - Rice Metabolic Pathways: RiceCyc Home
  • The Reactome DB
  • The SetupX - biological experiment database
  • The KNApSAcK DB - Species-Metabolite Relationship Database
  • The SureChem patent database
  • The IBM Patent chemical search
  • The MetaCrop DB - a detailed database of crop plant metabolism
  • The LipidMaps DB - LIPID Metabolites And Pathways Strategy
  • The Dr. Duke's Phytochemical and Ethnobotanical Database
  • The NCBI Taxonomy DB
  • The Oryzabase - integrated rice sciences database
  • The IBM Chemical Patent search (Simple) beta
  • The BatchEntrez service to retrieve compounds from PubChem compound IDs
  • The InChIKey resolver from RSC and ChemSpider
  • Compound annotations from text (Name; PubChem CID; InChIKey):
    2-acetyl-1-pyrroline; CID 522834; DQBQWWSFRPLIAX-UHFFFAOYAG
    Vitamin-A; CID 445354; FPIPGXGPPPQFEQ-OVSJKPMPBW
    Beta-carotene; CID 5280489; OENHQHLEOONYIE-JLTXGRSLBT
    Bisbynin; CID NA; ICHJNTDKHBXTFN-CMZGOGIXBZ [CML] [MOL] [ChemSpider]
    Trans-luteine; CID 5368396; KBPHJBAIARWVSC-DKLMTRRABK
    Cholesterol; CID 5997; HVYWMOMLDIMFJA-DPAQBDIFBB
    Malathion; CID 4004; JXSJBGJIGXNWCI-UHFFFAOYAK
    Chlorpyrifos; CID 2730; SBPBAQFWLVIOKP-UHFFFAOYAG
    Ribosylnicotinamide; CID 439924; JLEBZPBDRKPWTD-ARWKKGFBBE
    Omeprazol; CID 4594; SUBDBMMJDZJVOS-UHFFFAOYAZ
    Rhodopinal; CID 20055178; GOJQFVQXKNNAAY-XQHLYSSHBM
    Tegafur; CID 5386; WFWLQNSHRPWKFK-UHFFFAOYAE
    Arginine; CID 232; ODKSFYDXXFIFQN-UHFFFAOYAT
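
Because the annotations above follow a fixed "Name; PubChem CID; InChIKey" pattern, they can be read by a program directly. The sketch below is one possible parser, assuming that fields are separated by semicolons, that a missing CID is written as "NA", and that bracketed link labels may trail the key. The names Annotation and parse_annotation are hypothetical.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    name: str
    cid: Optional[int]  # None when the page lists "CID NA"
    inchikey: str


def parse_annotation(line: str) -> Annotation:
    """Parse one 'Name; CID <number or NA>; InChIKey [links]' line."""
    name, cid_field, key_field = (part.strip() for part in line.split(";", 2))
    cid_text = cid_field.split()[-1]      # "522834" or "NA"
    cid = int(cid_text) if cid_text.isdigit() else None
    inchikey = key_field.split()[0]       # drop trailing link labels
    return Annotation(name, cid, inchikey)


# Three lines copied from the list above.
lines = [
    "2-acetyl-1-pyrroline; CID 522834; DQBQWWSFRPLIAX-UHFFFAOYAG",
    "Bisbynin; CID NA; ICHJNTDKHBXTFN-CMZGOGIXBZ [CML] [MOL] [ChemSpider]",
    "Arginine; CID 232; ODKSFYDXXFIFQN-UHFFFAOYAT",
]

annotations = [parse_annotation(line) for line in lines]
by_inchikey = {a.inchikey: a for a in annotations}   # lookup table by key
missing_cid = [a.name for a in annotations if a.cid is None]
print(missing_cid)
```

The Bisbynin entry shows why the InChIKey is useful as a second identifier: the compound has no PubChem CID in the list but can still be indexed and cross-referenced by its key.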


Optical Character Recognition and Chemical Structure Recognition:

  1. OSRA - Optical Structure Recognition (NIH) (free, open source)
  2. Kekule - OCR-optical chemical (structure) recognition (NCI)
  3. Clide & Clide Pro - Chemical literature data extraction tool (Univ. Leeds/ SimBioSys/Keymodule)
  4. ChemoCR - Tool for Chemical Compound Reconstruction
  5. ChemReader - Automated extraction of chemical structure information

Text based semantic annotation tools and projects:

  1. Oscar3 - Open Source Chemistry Analysis Routines (open source)
  2. Chem-MANTIS - Nomenclature Transformation Integrated System
  3. Project Prospect - IUPAC, Ontology, CML, InChI enhanced chemical publications
  4. Chemicalize.org - web based annotation service via ChemAxon proxy (name to structure)

Name to chemical structure converters (vice versa):

  1. Autonom - Beilstein Institute
  2. IBM Chemical Annotator - IBM Almaden
  3. Lexichem - OpenEye
  4. Struct <=> Name - CambridgeSoft
  5. Marvin IUPAC Name - ChemAxon
  6. ACDName - Structure to Name and Name to Structure ACDLabs
  7. NameExpert and Nomenclator - Cheminnovation
  8. IUPAC NameIt - BioRad
  9. OPSIN - name to structure converter open source project (OSCAR3)