Fiehn Lab - Databases

Small molecule databases like CAS, Beilstein, PubChem, KEGG and many others are discussed here.

Title	Large-Scale Annotation of Small-Molecule Libraries Using Public Databases. Yingyao Zhou, Bin Zhou, Kaisheng Chen, S. Frank Yan, Frederick J. King, Shumei Jiang, and Elizabeth A. Winzeler
Source	J. Chem. Inf. Model.; 2007, ASAP
DOI	http://dx.doi.org/10.1021/ci700092v
Short Review	Targeted at biomedical and chemical (drug) research, this article calculates the overlap between closed-source chemical databases (CAS, WDI, IDDB3) and open ones (PubChem, KEGG, MeSH). It presents a pipeline for annotating and merging compound data from more than 15 databases. Some astonishing results are obtained: "We assumed that the CAS database was the 'golden standard' for representing all of the current knowledge of small molecules largely because of its extensive comprehensiveness compared to other sources. Unfortunately, as many as 36% of the class-C structures found in the PubChem database currently are not present in the CAS database, indicating that CAS has not taken advantage of this public resource." This is a serious issue, and the CAS Advisory Committee needs to do something about it; otherwise CAS will lose its innovative momentum after 100 years. What to do is clear: "It would be preferable if all 26 million structures in the CAS system were readily accessible via PubChem or other informatic services, in order to prescan in-house compound collections quickly and identify those that have patent, reaction, and/or other literature data in the CAS database." CAS has developed an important database but has failed to innovate in recent years, during the digital revolution of the 21st century. A 21st-century database needs web services, programming APIs, and new program interfaces or rich clients. The article makes the same point: "However, the lack of a programming interface capable of searching thousands of HTS hit candidates against the CAS catalog limits the ability to use this database for hit-to-lead analyses." This explains why some commentators call the CAS database an important tool, but not a platform.
CAS should take the PubChem data (which is freely available) and merge it into the CAS database, and should also provide CAS data to PubChem (as many other companies do) to allow links back from PubChem to the CAS database. "Since the structures in PubChem were not collected with any obviously biased filters, we hypothesize that the chemical space covered by PubChem and the 26 million structures collected by the CAS database to be similar. Our analyses also demonstrate that commercial parties such as CAS could benefit from the incorporation of additional open-access chemical and biological data found in PubChem." This is important because the Chemical Structure Lookup Service (CSLS) already covers 39 million compounds (29 million unique), and IBM is working on a chemical patent search engine for (Markush) structures from patent data. A link to the search service developed by the Genomics Institute of the Novartis Research Foundation is provided here: Batch Compound Annotation Service
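
The kind of overlap figure quoted above (the share of one database's structures that is missing from another) and the merging of compound records across sources can be illustrated with plain set operations. The following is only a minimal sketch and not the pipeline from the article: it assumes each database has already been reduced to a set of canonical structure keys (for example InChIKeys), and the function names and the toy keys K1 to K5 are hypothetical.

```python
from collections import defaultdict


def overlap_report(databases):
    """Map (a, b) to the fraction of a's structure keys that are absent from b.

    `databases` is a dict: database name -> set of canonical structure keys.
    Empty databases are skipped to avoid dividing by zero.
    """
    report = {}
    for a, keys_a in databases.items():
        for b, keys_b in databases.items():
            if a == b or not keys_a:
                continue
            missing = keys_a - keys_b  # structures in a that b does not have
            report[(a, b)] = len(missing) / len(keys_a)
    return report


def merge_annotations(databases):
    """Map each structure key to the sorted list of databases containing it."""
    sources = defaultdict(set)
    for name, keys in databases.items():
        for key in keys:
            sources[key].add(name)
    return {key: sorted(names) for key, names in sources.items()}


# Toy data only; these keys stand in for real structure identifiers.
toy = {
    "PubChem": {"K1", "K2", "K3", "K4"},
    "CAS": {"K1", "K2", "K5"},
    "KEGG": {"K2"},
}

report = overlap_report(toy)
merged = merge_annotations(toy)

print(report[("PubChem", "CAS")])  # 0.5: half of the toy PubChem keys are not in CAS
print(merged["K2"])                # ['CAS', 'KEGG', 'PubChem']
```

With real data, the hard part is producing comparable structure keys in the first place (salt stripping, tautomers, stereochemistry); the set arithmetic itself stays this simple.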

Title
Source
DOI	doi:10.1016/j.jmr.2004.11.028
Short Review

Title
Source
DOI	doi:10.1002/mrc.1517
Short Review

(c) 2016 Fiehn Lab