InChI keys
as standard global identifiers
in chemistry web services
Russ Hillard
ACS, Salt Lake City
March 2009
Context of this talk
•  We have created a web service
•  That aggregates sources built independently
- Dozens of individual databases
- Containing Molecules and reactions
- Created using non-standardized business rules (wrt chemical representation)
•  Covers large record sets
- 30+ million unique molecules from combined sources
- 5+ million unique reactions from combined sources
•  Requires integration across all sources
-  Based on shared chemical entities
-  Where “entity” means chemical compound(s)
-  And “chemical compound” has unique identifiers
-  Chemical structure: elucidated by scientists
-  Systematic chemical name: derived from structure
-  Graphic representation of structure: assigned at registration
-  Trivial chemical name: assigned to structure
-  Registry number: assigned to structure
-  Key or string: computed from structure
The basic problem . . .
ChemInform
(FIZ Chemie)
BRN3936786
Beilstein
(Elsevier)
BRN3936786
Curr. Chem Reactions
(Thomson)
5693-99-2
71403-94-6
121651-02-3
126720-47-6
stereochem unspecified
relative stereochem
absolute stereochem (2R,3S)
absolute stereochem (2S,3R)
trans-3-phenyloxirane-carboxaldehyde
(2R*,3R*)-2,3-epoxycinnamaldehyde
trans-cinnamaldehyde epoxide
Epoxyzimtaldehyd
(2S,3R)-3-phenyl-oxirane-2-carbaldehyde (Autonom)
•  Don’t always have or know the BRN, CASRN, ChemSpiderID, MFCD# . . .
•  Relationship of Structure:RegNumbers is often 1:many
One solution
•  Define our own set of registration rules
•  Register all structures to one big database
- Normalize structures according to our rules
•  Assign a unique record identifier (URI) to the normalized structures
•  Correlate our URIs to the native sources
•  Use our URIs to correlate records across different databases
•  We have done this but have not exposed the URIs
- Even with modern computers this is resource intensive
- Problem is compounded when data is from different providers
- Does the world really need another “Global Registry Number”?
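The registration approach above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `CompoundRegistry`, the `URI…` identifier format, and the `normalize` callable are all hypothetical, and the stand-in normalizer just tidies a name string, whereas real business rules would operate on chemical structures.

```python
from collections import defaultdict
from itertools import count


class CompoundRegistry:
    """Toy registry: one internal identifier per normalized structure."""

    def __init__(self, normalize):
        self._normalize = normalize       # hypothetical normalization rules
        self._ids = {}                    # normalized form -> internal identifier
        self._sources = defaultdict(set)  # identifier -> {(database, native id)}
        self._counter = count(1)

    def register(self, structure, database, native_id):
        """Register a source record and return the shared internal identifier."""
        key = self._normalize(structure)
        if key not in self._ids:
            self._ids[key] = f"URI{next(self._counter):08d}"
        uri = self._ids[key]
        self._sources[uri].add((database, native_id))
        return uri

    def sources(self, uri):
        """List every native record correlated to one internal identifier."""
        return sorted(self._sources[uri])


# Stand-in normalizer (an assumption): trim and lower-case a name string.
registry = CompoundRegistry(normalize=lambda s: s.strip().lower())
a = registry.register("trans-cinnamaldehyde epoxide", "Database A", "A-001")
b = registry.register(" Trans-Cinnamaldehyde Epoxide", "Database B", "B-417")
print(a == b)               # True: both records share one identifier
print(registry.sources(a))  # cross-database correlation for that identifier
```

The cost noted in the bullets shows up in `normalize`: every incoming record has to pass through it before any correlation is possible.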
As currently implemented this gives:
ChemInform
(FIZ Chemie)
Great for internal correlations:
Reactions
Commercial Availability
Toxicity
Bioactivity
. . . etc
Molecules
Synthetic preparations of
Organic reactions of
Toxicity
. . . Etc
But what about external correlations?
Anything we don’t/can’t index
Commercial data
Proprietary data
Alternative solution
•  Assume structures as registered are correct
- Accept that we cannot always normalize according to our rules
•  Use a derived (calculated) compound identifier
•  Is this possible?
- IUPAC Name
- Wiswesser Line Notation (WLN)
- Molfile and its derivatives
- SEMA Key
- MDL Line Notation
- SMILES
- Chemical Markup Language (CML)
- InChI Name
- InChI Key
-  NEMA key
Will focus on these two options
IUPAC - International Chemical Identifier
The objective of the IUPAC Chemical Identifier Project is to
establish a unique label, the IUPAC Chemical Identifier, which
would be a non-proprietary identifier for chemical substances that
could be used in printed and electronic data sources thus
enabling easier linking of diverse data compilations.
The initial work focused on the development of algorithms for
converting an input organic chemical structure to a unique
(canonical) form. This, in effect, involves the unique numbering of
each atom, with equivalent atoms being assigned identical
numbers. "Serializing" the result to create a string is the final,
straightforward, step in creating an identifier.
From: http://www.iupac.org/web/ins/2000-025-1-800
For this presentation all InChI Keys are generated using:
final standard InChI/InChIKey v. 1.02 software
The Morgan Algorithm
Invented by H. L. Morgan, J. Chem. Doc., 5, 107 (1965)
- Underpins many of the systems in use today
- The basis of CAS Online
Ranks atoms by an extended connectivity value. The atom with the
highest value becomes the first atom in the name, and its neighbors
are then listed in descending order. Ties are resolved using
additional parameters, for example bond order and atomic number.
Does not handle stereochemistry
SEMA developed to handle stereoisomers
- W. T. Wipke and T. M. Dyott, J. Amer. Chem. Soc., 96, 4825 (1974).
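The extended connectivity idea behind the Morgan algorithm can be sketched as follows. This is a simplified textbook version, not the CAS, SEMA, or NEMA implementation: it computes only the connectivity values and picks the starting atom, and it omits the tie-breaking by bond order and atomic number. The function name and the adjacency-dictionary input format are assumptions made for the example.

```python
def extended_connectivity(adjacency):
    """Morgan-style extended connectivity values for a molecular graph.

    adjacency maps each heavy-atom index to a list of neighbouring indices.
    """
    # Start from plain connectivity: the number of neighbours of each atom.
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        # Each atom's new value is the sum of its neighbours' current values.
        updated = {atom: sum(values[n] for n in nbrs)
                   for atom, nbrs in adjacency.items()}
        n_updated = len(set(updated.values()))
        # Stop once another round no longer separates more atoms.
        if n_updated <= n_classes:
            return values
        values, n_classes = updated, n_updated


# n-Pentane heavy-atom skeleton: C0-C1-C2-C3-C4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
values = extended_connectivity(pentane)
first_atom = max(values, key=values.get)
print(values)      # {0: 2, 1: 3, 2: 4, 3: 3, 4: 2}
print(first_atom)  # 2
```

Symmetry-equivalent atoms (the two ends of the chain, and the two atoms next to them) end up with identical values, and the central carbon becomes the starting atom. Nothing in the calculation looks at stereochemistry, which is the gap SEMA was developed to fill.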
NEMA
NEMA produces a unique name and key for a wider range of
structures than SEMA. It extends perception to non-tetrahedral
stereogenic centers, it supports both 2D and 3D stereochemistry
perception, and it does not have an atom limit. It is proprietary to
Symyx, but it is exposed in our products; for example, Symyx
Draw and Symyx Direct generate NEMA keys.
The work of Wipke et al. identified the value of a constitutional key
and a stereo key. This approach has been incorporated into
NEMA.
W. T. Wipke, S. Krishnan, and G. I. Ouchi, J. Chem. Inf. Comput. Sci., 18, 32 (1978)
Tautomers (mobile H atoms)
Different structures
Different systematic names
Presumably exist in equilibrium
InChI Keys are identical
NEMA Keys are different
Both structures are registered to our collection
57531-38-1 assigned to both structures
4(5)-chloro-5(4)-nitroimidazole
5(4)-chloro-4(5)-nitroimidazole
4-chloro-5-nitroimidazole
5-chloro-4-nitroimidazole
4-chloro-5-nitro-1(3)H-imidazole
Tautomers (“mobile hydrogen atoms”)
Different NEMA Keys
Same InChI Key
Mesomers
Mesomers ideally would have the same identifier
Different NEMA Keys
Same InChI Key
Both structures are registered to our collection
Methylene blue
61-73-4
Mesomers?
Same InChI Key
Different NEMA Keys
Same InChI Key
Same NEMA Keys
Stereoisomers
No stereo
Enantiomeric pair
Pure enantiomer
InChI does not distinguish pure enantiomer from racemate
Relative versus absolute stereochemistry
Indistinguishable based on InChI Key
Absolute Stereochemistry
InChI Key = XARGIVYWQPXRTC-DTWKUNHWSA-N
1S
1s
2R
3 unique NEMA Keys
InChI Key = XARGIVYWQPXRTC-DTWKUNHWSA-N
Concern with stereochem goes back to “The basic problem . . .” above: four registry numbers, four levels of stereochemical specification
Typically problematic structures
Definitely the same compound
Same InChI Key
Different NEMA Keys
Typically problematic compounds
Just the tip of the iceberg
Organometallics
Inorganics
Layered structure of InChI Keys
AAAAAAAAAAAAAA-BBBBBBBBCD
AAAAAAAAAAAAAA = skeleton
BBBBBBBB = structural features (mobile hydrogens, isotopes, metal bonds ...)
C = flag, InChI version . . .
D = check character
The ability to break InChI Keys down into classes of related
structures sets them apart
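A minimal sketch of splitting a key into its layers, assuming the 27-character form of the example key quoted earlier in the talk (14 letters, a hyphen, 8 letters followed by a flag letter and a version letter, a hyphen, and one final letter). That form differs slightly from the 25-character layout drawn on the slide, so the field names below, in particular `flag`, `version`, and `protonation`, are illustrative labels rather than terms taken from the talk.

```python
import re
from typing import NamedTuple

# 14 letters, hyphen, 8 letters + flag + version, hyphen, 1 letter.
_KEY = re.compile(r"([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])")


class InChIKeyParts(NamedTuple):
    skeleton: str      # first block: connectivity ("skeleton")
    features: str      # second block: remaining structural features
    flag: str          # standard / non-standard flag letter
    version: str       # InChI version letter
    protonation: str   # final letter


def split_inchikey(key: str) -> InChIKeyParts:
    """Split a 27-character InChIKey into its blocks."""
    match = _KEY.fullmatch(key.strip())
    if match is None:
        raise ValueError(f"not a 27-character InChIKey: {key!r}")
    return InChIKeyParts(*match.groups())


parts = split_inchikey("XARGIVYWQPXRTC-DTWKUNHWSA-N")
print(parts.skeleton)  # XARGIVYWQPXRTC
print(parts.features)  # DTWKUNHW
```

Two keys that share the same `skeleton` block but differ in `features` belong to the same class of related structures in the sense used on this slide.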
InChI key resolution using ChemSpider
Full InChI key search
Partial InChI key search
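The difference between the two search modes can be illustrated with a small in-memory index. This is a sketch only and says nothing about how ChemSpider implements its search: the class `InChIKeyIndex`, its method names, and the record identifiers are hypothetical, and apart from the key quoted earlier in the talk, the keys are synthetic placeholders.

```python
from collections import defaultdict


class InChIKeyIndex:
    """Toy index supporting full and partial (first-block) key lookups."""

    def __init__(self):
        self._by_key = defaultdict(list)       # full key -> record ids
        self._by_skeleton = defaultdict(list)  # first block -> record ids

    def add(self, key, record_id):
        self._by_key[key].append(record_id)
        self._by_skeleton[key.split("-")[0]].append(record_id)

    def full_search(self, key):
        """Records whose key matches exactly."""
        return list(self._by_key.get(key, []))

    def partial_search(self, key_or_block):
        """Records sharing the first block, whatever the remaining layers."""
        block = key_or_block.split("-")[0]
        return list(self._by_skeleton.get(block, []))


index = InChIKeyIndex()
index.add("XARGIVYWQPXRTC-DTWKUNHWSA-N", "rec-1")  # key from the talk
index.add("XARGIVYWQPXRTC-AAAAAAAASA-N", "rec-2")  # placeholder second block
index.add("ZZZZZZZZZZZZZZ-AAAAAAAASA-N", "rec-3")  # placeholder key

print(index.full_search("XARGIVYWQPXRTC-DTWKUNHWSA-N"))  # ['rec-1']
print(index.partial_search("XARGIVYWQPXRTC"))            # ['rec-1', 'rec-2']
```

A partial search on the first block therefore returns the whole family that shares a skeleton, while a full search returns only the exact match.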
There is still plenty to do……
Biologics
Average pipeline contains 22% biologics
Some companies are near 50%
Peptides & modified peptides
Nucleic acid sequences
Generics
Markush structures
Polymers
Repeating monomers
Block copolymers
Cross-linked polymers
So what should go into our web service?
•  Unique chemical structures registered to Compound Index
•  Unique reaction structures registered to Reaction Index
•  Assigned global identifiers as available
- Registry numbers (BRN, CASRN, MFCD#s, PubChemIDs. . .)
•  Computed global identifiers for all compounds
-  InChI strings
-  InChI Keys
-  NEMA Keys
•  Register InChI Keys to ACD and other Symyx databases
•  Let the consumer decide which to use
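One way to picture such a record is sketched below. The class `CompoundRecord`, its field names, and the `pick` method are assumptions made for illustration, not the actual schema of the web service; the only value taken from the talk is the BRN shown on the earlier slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CompoundRecord:
    """Hypothetical record carrying assigned and computed identifiers."""
    structure: str                                   # e.g. a molfile (assumed)
    registry_numbers: Dict[str, List[str]] = field(default_factory=dict)
    inchi: Optional[str] = None
    inchi_key: Optional[str] = None
    nema_key: Optional[str] = None

    def pick(self, preference):
        """Return (name, value) for the first identifier the consumer prefers."""
        computed = {"inchi": self.inchi,
                    "inchi_key": self.inchi_key,
                    "nema_key": self.nema_key}
        for name in preference:
            value = computed.get(name) or self.registry_numbers.get(name)
            if value:
                return name, value
        return None


# A record that so far has only an assigned identifier.
record = CompoundRecord(structure="<molfile>",
                        registry_numbers={"BRN": ["3936786"]})
print(record.pick(["inchi_key", "BRN"]))  # ('BRN', ['3936786'])
```

The service stores every identifier it has, and the consumer supplies the order of preference.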