Chemical database
A chemical database is a database specifically designed to store chemical information. This information includes chemical and crystal structures, spectra, reactions and syntheses, and thermophysical data.
Types of chemical databases
Bioactivity database
Bioactivity databases correlate structures or other chemical information to bioactivity results taken from bioassays in literature, patents, and screening programs.
| Name | Developer | Initial release |
| ScrubChem | Jason Bret Harris | 2016 |
| ChEMBL | EMBL-EBI | 2009 |
| Reaxys DB | Elsevier | 2017 |
| PubChem-BioAssay | NIH | 2004 |
Chemical structures
Chemical structures are traditionally represented using lines indicating chemical bonds between atoms and drawn on paper. While these are ideal visual representations for the chemist, they are unsuitable for computational use, especially for search and storage. Small molecules are usually represented using lists of atoms and their connections. Large molecules such as proteins, however, are more compactly represented using the sequences of their amino acid building blocks. Radioactive isotopes are also represented, which is an important attribute for some applications. Large chemical databases for structures are expected to handle the storage and searching of information on millions of molecules, taking terabytes of physical memory.
Literature database
Chemical literature databases correlate structures or other chemical information to relevant references such as academic papers or patents. This type of database includes STN, SciFinder, and Reaxys. Links to literature are also included in many databases that focus on chemical characterization.
Crystallographic database
Crystallographic databases store X-ray crystal structure data. Common examples include the Protein Data Bank and the Cambridge Structural Database.
NMR spectra database
NMR spectra databases correlate chemical structure with NMR data. These databases often include other characterization data such as FTIR and mass spectrometry.
Reactions database
Most chemical databases store information on stable molecules, but reaction databases also store intermediates and temporarily created unstable molecules. Reaction databases contain information about products, educts, and reaction mechanisms.
A popular example that lists chemical reaction data, among other information, is the Beilstein database.
Thermophysical database
Thermophysical data are information about:
- phase equilibria, including vapor–liquid equilibrium, solubility of gases in liquids, liquids in solids, heats of mixing, vaporization, and fusion
- caloric data like heat capacity, heat of formation and combustion
- transport properties like viscosity and thermal conductivity
Chemical structure representation
- As connection tables / adjacency matrices / lists with additional information on bond and atom attributes, such as:
  - MDL Molfile, PDB, CML
- As a linear string notation based on depth first or breadth first traversal, such as:
  - SMILES/SMARTS, SLN, WLN, InChI
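The two representation styles listed above can be illustrated with a minimal sketch. It assumes a hypothetical, heavily simplified connection table (heavy atoms only, integer bond orders) and a depth-first traversal that emits a SMILES-like string for acyclic molecules. The function names `adjacency` and `linear_string` are illustrative and do not belong to any real file format or toolkit; real notations such as SMILES also handle rings, aromaticity, charges, and stereochemistry, which this sketch ignores.

```python
# Hypothetical connection table for ethanol (CH3-CH2-OH), heavy atoms only.
atoms = ["C", "C", "O"]              # atom list, indexed from 0
bonds = [(0, 1, 1), (1, 2, 1)]       # (atom index, atom index, bond order)


def adjacency(atoms, bonds):
    """Build an adjacency list {atom index: [(neighbour index, bond order), ...]}."""
    adj = {i: [] for i in range(len(atoms))}
    for a, b, order in bonds:
        adj[a].append((b, order))
        adj[b].append((a, order))
    return adj


def linear_string(atoms, bonds, start=0):
    """Depth-first traversal emitting a SMILES-like string (acyclic molecules only)."""
    adj = adjacency(atoms, bonds)
    bond_symbol = {1: "", 2: "=", 3: "#"}
    visited = set()

    def visit(i):
        visited.add(i)
        out = atoms[i]
        branches = [(j, order) for j, order in adj[i] if j not in visited]
        for k, (j, order) in enumerate(branches):
            piece = bond_symbol[order] + visit(j)
            # Every branch except the last one is wrapped in parentheses.
            out += piece if k == len(branches) - 1 else "(" + piece + ")"
        return out

    return visit(start)


print(linear_string(atoms, bonds))   # CCO
```

The same molecule is stored once as a table (suited to graph algorithms) and once as a compact string (suited to storage and text search), which is the trade-off between the two families of formats.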
Search
Substructure
Chemists can search databases using parts of structures, parts of IUPAC names, or constraints on properties. Chemical databases differ from general-purpose databases in their support for substructure search, a method to retrieve chemicals matching a pattern of atoms and bonds that a user specifies. This kind of search is achieved by looking for subgraph isomorphism and is a widely studied application of graph theory.
Query structures may contain bonding patterns such as "single/aromatic" or "any" to provide flexibility. Similarly, a vertex that would be a specific atom in an actual compound may be replaced with an atom list in the query. Cis–trans isomerism at double bonds is catered for by giving a choice of retrieving only the E form, the Z form, or both.
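As a rough illustration of the subgraph matching described above, the following sketch uses plain backtracking. The data layout is an assumption made for this example: target atoms are element symbols, query atoms are sets of allowed elements (an atom list), and a query bond is either a set of allowed bond types or the string "any". Production systems typically add screening steps and more refined matching algorithms; none of that is shown here, and stereochemistry is ignored.

```python
def bond_ok(query_bond, target_bond):
    """A query bond is the string "any" or a set of allowed bond types."""
    return query_bond == "any" or target_bond in query_bond


def find_substructure(q_atoms, q_bonds, t_atoms, t_bonds):
    """Backtracking subgraph search.

    Returns a mapping {query atom index: target atom index} or None.
    q_atoms: list of sets of allowed element symbols (atom lists)
    q_bonds: {(i, j): "any" or set of bond types}
    t_atoms: list of element symbols
    t_bonds: {(i, j): bond type}
    """
    t_lookup = {frozenset(pair): kind for pair, kind in t_bonds.items()}

    def extend(mapping):
        qi = len(mapping)                      # next query atom to place
        if qi == len(q_atoms):
            return dict(enumerate(mapping))
        for ti, element in enumerate(t_atoms):
            if ti in mapping or element not in q_atoms[qi]:
                continue
            compatible = True
            for (a, b), allowed in q_bonds.items():
                # Only check bonds between qi and atoms that are already mapped.
                other = b if a == qi else (a if b == qi else None)
                if other is None or other >= qi:
                    continue
                target_bond = t_lookup.get(frozenset((ti, mapping[other])))
                if target_bond is None or not bond_ok(allowed, target_bond):
                    compatible = False
                    break
            if compatible:
                result = extend(mapping + [ti])
                if result is not None:
                    return result
        return None

    return extend([])


# Target: acetic acid heavy atoms, C0-C1(=O2)-O3.
t_atoms = ["C", "C", "O", "O"]
t_bonds = {(0, 1): "single", (1, 2): "double", (1, 3): "single"}

# Query: a carbon double-bonded to oxygen or sulfur.
q_atoms = [{"C"}, {"O", "S"}]
q_bonds = {(0, 1): {"double"}}

print(find_substructure(q_atoms, q_bonds, t_atoms, t_bonds))  # {0: 1, 1: 2}
```

Replacing the query bond with "any" would still match here, while a query atom list that contains no element present in the target returns None.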
Conformation
Searching by matching the 3D conformation of molecules or by specifying spatial constraints is another feature that is of particular use in drug design. Searches of this kind can be computationally very expensive. Many approximate methods have been proposed, including BCUTS, special function representations, moments of inertia, ray-tracing histograms, maximum distance histograms, and shape multipoles.
Examples
Large databases, such as PubChem and ChemSpider, have graphical interfaces for search. The Chemical Abstracts Service provides tools to search the chemical literature, and Reaxys, supplied by Elsevier, covers both chemicals and reaction information, including that originally held in the Beilstein database. PATENTSCOPE makes chemical patents accessible by substructure, and Wikipedia's articles describing individual chemicals can also be searched that way.
Suppliers of chemicals as synthesis intermediates or for high-throughput screening routinely provide search interfaces. Currently, the largest database that can be freely searched by the public is the ZINC database, which is claimed to contain over 37 billion commercially available molecules.
Descriptors
All properties of molecules beyond their structure can be split into either physico-chemical or pharmacological attributes, also called descriptors. In addition, there are various artificial and more or less standardized naming systems for molecules that supply more or less ambiguous names and synonyms. The IUPAC name is usually a good choice for representing a molecule's structure in a string that is both human-readable and unique, although it becomes unwieldy for larger molecules. Trivial names, on the other hand, abound with homonyms and synonyms and are therefore a bad choice as a defining database key. While physico-chemical descriptors like molecular weight, charge, and solubility can mostly be computed directly from the molecule's structure, pharmacological descriptors can be derived only indirectly, using involved multivariate statistics or experimental results. To save computational effort, all of these descriptors can be stored along with the molecule's representation, and usually are.
Similarity
There is no single definition of molecular similarity; however, the concept may be defined according to the application and is often described as an inverse of a measure of distance in descriptor space. For instance, two molecules might be considered more similar to each other than to others if the difference in their molecular weights is smaller. A variety of other measures could be combined to produce a multivariate distance measure. Distance measures are often classified into Euclidean and non-Euclidean measures depending on whether the triangle inequality holds. Substructure search based on the maximum common subgraph (MCS) is also very common. MCS is also used for screening drug-like compounds by identifying molecules that share a common subgraph.
Chemicals in the databases may be clustered into groups of 'similar' molecules based on similarities. Both hierarchical and non-hierarchical clustering approaches can be applied to chemical entities with multiple attributes. These attributes or molecular properties may be either empirically determined or computationally derived descriptors. One of the most popular clustering approaches is the Jarvis-Patrick algorithm.
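The ideas above can be sketched in a few lines: similarity as an inverse of Euclidean distance in descriptor space, followed by Jarvis-Patrick clustering. In this sketch two items are joined when each appears in the other's list of k nearest neighbours and the two lists share at least k_min members; clusters are the resulting connected groups. The descriptor vectors are made-up numbers, the formula 1 / (1 + d) is only one of many possible ways to turn a distance into a similarity, and real descriptors would normally be scaled before distances are computed.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity(u, v):
    """One possible inverse-distance similarity (an assumption of this sketch)."""
    return 1.0 / (1.0 + euclidean(u, v))


def jarvis_patrick(points, k, k_min):
    """Cluster points: join i and j if they are mutual k-nearest neighbours
    and their neighbour lists share at least k_min members."""
    n = len(points)
    neighbours = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: euclidean(points[i], points[j]))
        neighbours.append(set(order[:k]))

    parent = list(range(n))              # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbours[i] and i in neighbours[j]
            shared = len(neighbours[i] & neighbours[j])
            if mutual and shared >= k_min:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical two-dimensional descriptor vectors for six molecules.
descriptors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(jarvis_patrick(descriptors, k=2, k_min=1))   # [[0, 1, 2], [3, 4, 5]]
```

Raising k_min makes the clustering stricter, since more shared neighbours are required before two molecules are grouped.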
In pharmacologically oriented chemical repositories, similarity is usually defined in terms of the biological effects of compounds that can in turn be semiautomatically inferred from similar combinations of physico-chemical descriptors using QSAR methods.
Registration systems
Database systems for maintaining unique records on chemical compounds are termed registration systems. These are often used for chemical indexing, patent systems and industrial databases.
Registration systems usually enforce uniqueness of the chemical represented in the database through the use of unique representations. By applying rules of precedence for the generation of stringified notations, one can obtain unique ('canonical') string representations such as 'canonical SMILES'. Some registration systems, such as the CAS system, make use of algorithms to generate unique hash codes to achieve the same objective.
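A toy version of the uniqueness check described above might look as follows. It is a sketch under strong assumptions: the canonical string is obtained by brute force (the lexicographically smallest serialization over all atom orderings), which is only feasible for very small molecules, and SHA-256 is used as the hash purely for illustration. Neither choice is claimed to be what CAS or any real registration system does, and the names `canonical_key` and `Registry` are hypothetical.

```python
import hashlib
from itertools import permutations


def canonical_key(atoms, bonds):
    """Smallest serialization over all atom orderings (tiny molecules only)."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):      # perm[old index] = new index
        new_atoms = [None] * n
        for old, new in enumerate(perm):
            new_atoms[new] = atoms[old]
        new_bonds = sorted(
            (min(perm[a], perm[b]), max(perm[a], perm[b]), order)
            for a, b, order in bonds
        )
        text = ",".join(new_atoms) + "|" + ";".join(
            f"{a}-{b}:{order}" for a, b, order in new_bonds
        )
        if best is None or text < best:
            best = text
    return best


class Registry:
    """Keeps one record per distinct structure, keyed by a hash of its canonical form."""

    def __init__(self):
        self._records = {}

    def register(self, name, atoms, bonds):
        """Return (registered name, True if newly added)."""
        digest = hashlib.sha256(canonical_key(atoms, bonds).encode()).hexdigest()
        if digest in self._records:
            return self._records[digest], False
        self._records[digest] = name
        return name, True


registry = Registry()
# Ethanol entered with two different atom orderings, then its isomer dimethyl ether.
print(registry.register("ethanol", ["C", "C", "O"], [(0, 1, 1), (1, 2, 1)]))
print(registry.register("ethyl alcohol", ["O", "C", "C"], [(0, 1, 1), (1, 2, 1)]))
print(registry.register("dimethyl ether", ["C", "O", "C"], [(0, 1, 1), (1, 2, 1)]))
```

The second call is recognized as a duplicate of the first despite the different atom numbering, while the third, an isomer with the same atoms but different connectivity, receives its own record.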
A key difference between a registration system and a simple chemical database is the ability to accurately represent that which is known, unknown, and partially known. For example, a chemical database might store a molecule with stereochemistry unspecified, whereas a chemical registry system requires the registrar to specify whether the stereo configuration is unknown, a specific mixture, or racemic. Each of these would be considered a different record in a chemical registry system.
Registration systems also preprocess molecules to avoid considering trivial differences such as differences in halogen ions in chemicals.
An example is the Chemical Abstracts Service registration system. See also CAS registry number.
List of chemical cartridges
- Accord
- Direct
- JChem
- CambridgeSoft
- Bingo
- Pinpoint
List of chemical registration systems
- ChemReg
- Register
- RegMol
- Compound-Registration
- Ensemble
Web-based
| Name | Developer | Initial release |
| CDD Vault | Collaborative Drug Discovery | 2018 |
| Adroit Repository | Adroit DI | 2023 |
| | Elsevier | 1989 |