InChI keys as standard global identifiers in chemistry web services
Russ Hillard
ACS, Salt Lake City, March 2009

Context of this talk
•  We have created a web service
•  That aggregates sources built independently
   - Dozens of individual databases
   - Containing molecules and reactions
   - Created using non-standardized business rules (with respect to chemical representation)
•  Covers large record sets
   - 30+ million unique molecules from combined sources
   - 5+ million unique reactions from combined sources
•  Requires integration across all sources
   - Based on shared chemical entities
   - Where “entity” means chemical compound(s)
   - And “chemical compound” has unique identifiers
   - Chemical structure elucidated by scientists
   - Systematic chemical name derived from structure
   - Graphic representation of structure assigned at registration
   - Trivial chemical name assigned to structure
   - Registry number assigned to structure
   - Key or string computed from structure

The basic problem . . .
[Slide graphic; labels recovered below]
Databases: ChemInform (FIZ Chemie); Beilstein (Elsevier); Curr. Chem Reactions (Thomson)
Registry numbers: BRN3936786; BRN3936786; 5693-99-2; 71403-94-6; 121651-02-3; 126720-47-6
Stereo annotations: stereochem unspecified; relative stereochem; absolute stereochem (2R,3S); absolute stereochem (2S,3R)
Names: trans-3-phenyloxirane-carboxaldehyde; (2R*,3R*)-2,3-epoxycinnamaldehyde; trans-cinnamaldehyde epoxide; Epoxyzimtaldehyd; (2S,3R)-3-phenyl-oxirane-2-carbaldehyde (Autonom)
•  Don’t always have or know the BRN, CASRN, ChemSpiderID, MFCD# . . .
•  Relationship of Structure:RegNumbers is often 1:many

One solution
•  Define our own set of registration rules
•  Register all structures to one big database
   - Normalize structures according to our rules
•  Assign a unique record identifier (URI) to the normalized structures
•  Correlate our URIs to the native sources
•  Use our URIs to correlate records across different databases
•  We have done this but have not exposed the URIs
   - Even with modern computers this is resource intensive
   - Problem is compounded when data is from different providers
   - Does the world really need another “Global Registry Number”?

As currently implemented this gives:
[Slide graphic: ChemInform (FIZ Chemie)]
Great for internal correlations:
   - Reactions: Commercial Availability, Toxicity, Bioactivity . . . etc.
   - Molecules: Synthetic preparations of, Organic reactions of, Toxicity . . . etc.
But what about external correlations?
   - Anything we don’t/can’t index
   - Commercial data
   - Proprietary data

Alternative solution
•  Assume structures as registered are correct
   - Accept that we cannot always normalize according to our rules
•  Use a derived (calculated) compound identifier
•  Is this possible?
   - IUPAC Name
   - Wiswesser Line Notation (WLN)
   - Molfile and its derivatives
   - SEMA Key
   - MDL Line Notation
   - SMILES
   - Chemical Markup Language (CML)
   - InChI Name
   - InChI Key
   - NEMA key
Will focus on these two options: InChI Key and NEMA key

IUPAC - International Chemical Identifier
The objective of the IUPAC Chemical Identifier Project is to establish a unique label, the IUPAC Chemical Identifier, which would be a non-proprietary identifier for chemical substances that could be used in printed and electronic data sources thus enabling easier linking of diverse data compilations. The initial work focused on the development of algorithms for converting an input organic chemical structure to a unique (canonical) form. This, in effect, involves the unique numbering of each atom, with equivalent atoms being assigned identical numbers. "Serializing" the result to create a string is the final, straightforward, step in creating an identifier.
From: http://www.iupac.org/web/ins/2000-025-1-800
For this presentation all InChI Keys are generated using the final standard InChI/InChIKey v. 1.02 software.

The Morgan Algorithm
Invented by H. L. Morgan, J. Chem. Doc., 5, 107 (1965)
   - Underpins many of the systems in use today
   - The basis of CAS Online
Identifies atoms based on an extended connectivity value. The atom with the highest value becomes the first atom in the name, and its neighbors are then listed in descending order; ties are resolved based on additional parameters, for example bond order and atomic number.
Does not handle stereochemistry
SEMA developed to handle stereoisomers
   - W. T. Wipke and T. M. Dyott, J. Amer. Chem. Soc., 96, 4825 (1974)

NEMA
NEMA produces a unique name and key for a wider range of structures than SEMA. It extends perception to non-tetrahedral stereogenic centers, it supports both 2D and 3D stereochemistry perception, and it does not have an atom limit. It is proprietary to Symyx, but it is exposed in our products; for example, Symyx Draw and Symyx Direct generate NEMA keys.
The work of Wipke et al. identified the value of a constitutional key and a stereo key. This approach has been incorporated into NEMA.
W. T. Wipke, S. Krishnan, and G. I. Ouchi, J. Chem. Inf. Comput. Sci., 18, 32 (1978)

Tautomers (mobile H atoms)
   - Different structures
   - Different systematic names
   - Presumably exist in equilibrium
   - InChI Keys are identical
   - NEMA Keys are different
   - Both structures are registered to our collection
   - 57531-38-1 assigned to both structures
Names: 4(5)-chloro-5(4)-nitroimidazole; 5(4)-chloro-4(5)-nitroimidazole; 4-chloro-5-nitroimidazole; 5-chloro-4-nitroimidazole; 4-chloro-5-nitro-1(3)H-imidazole

Tautomers (“mobile hydrogen atoms”)
   - Different NEMA Keys
   - Same InChI Key

Mesomers
Mesomers ideally would have the same identifier
   - Different NEMA Keys
   - Same InChI Key
   - Both structures are registered to our collection
   - Methylene blue, 61-73-4

Mesomers?
   - Same InChI Key, different NEMA Keys
   - Same InChI Key, same NEMA Keys

Stereoisomers
   - No stereo
   - Enantiomeric pair
   - Pure enantiomer
InChI does not distinguish a pure enantiomer from the racemate

Relative versus absolute stereochemistry
Indistinguishable based on InChI Key

Absolute Stereochemistry
InChI Key = XARGIVYWQPXRTC-DTWKUNHWSA-N
Stereo labels shown: 1S 1s 2R
3 unique NEMA Keys
InChI Key = XARGIVYWQPXRTC-DTWKUNHWSA-N

Concern with stereochem goes back to…
[Repeats the graphic and bullets from “The basic problem . . .”]

Typically problematic structures
   - Definitely the same compound
   - Same InChI Key
   - Different NEMA Keys

Typically problematic compounds
Just the tip of the iceberg
   - Organometallics
   - Inorganics

Layered structure of InChI Keys
AAAAAAAAAAAAAA-BBBBBBBBCD
   - AAAAAAAAAAAAAA = skeleton
   - BBBBBBBB = structural features: mobile hydrogens, isotopes, metal bonds ...
   - C = flag, InChI version . . .
   - D = check character
The ability to reconstruct InChI Keys into classes of related structures sets them apart

InChI key resolution using ChemSpider
   - Full InChI key search
   - Partial InChI key search

There is still plenty to do…
•  Biologics
   - Average pipeline contains 22% biologics
   - Some companies are near 50%
   - Peptides & modified peptides
   - Nucleic acid sequences
•  Generics
   - Markush structures
•  Polymers
   - Repeating monomers
   - Block copolymers
   - Cross-linked polymers

So what should go into our web service?
•  Unique chemical structures registered to Compound Index
•  Unique reaction structures registered to Reaction Index
•  Assigned global identifiers as available
   - Registry numbers (BRN, CASRN, MFCD#s, PubChemIDs . . .)
•  Computed global identifiers for all compounds
   - InChI strings
   - InChI Keys
   - NEMA Keys
•  Register InChI Keys to ACD and other Symyx databases
•  Let the consumer decide which to use
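
As an illustration of the extended connectivity idea described on the Morgan Algorithm slide, here is a minimal Python sketch. It is not the CAS implementation or any code from the talk: the function name `extended_connectivity` and the adjacency-dictionary input are assumptions made for the example, and the sketch deliberately ignores tie-breaking by bond order or atomic number as well as stereochemistry.

```python
def extended_connectivity(adjacency):
    """Return an extended connectivity value per atom (hypothetical helper).

    adjacency maps each atom label to a list of its neighbours.
    Start from the number of neighbours, then repeatedly replace each
    value by the sum of the neighbours' values for as long as that
    increases the number of distinct values.
    """
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    classes = len(set(values.values()))
    while True:
        updated = {
            atom: sum(values[n] for n in nbrs)
            for atom, nbrs in adjacency.items()
        }
        updated_classes = len(set(updated.values()))
        if updated_classes <= classes:
            # No further discrimination gained: keep the previous values.
            return values
        values, classes = updated, updated_classes


if __name__ == "__main__":
    # Carbon skeleton of a five-atom chain: 1-2-3-4-5 (hydrogens omitted).
    chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
    ec = extended_connectivity(chain)
    first_atom = max(ec, key=ec.get)  # atom that would be numbered first
    print(ec, first_atom)
```

Atoms that end up with equal values (here the two chain ends) are the ties that the full algorithm resolves with additional parameters.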
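
The layered key and the partial InChI key search mentioned above suggest a simple way to group records into classes of related structures: compare only the first (skeleton) block. The Python sketch below is one way this might look, under stated assumptions. The function names are hypothetical, the 14-character length of the first block is taken from the layout on the slide, and the code splits on hyphens rather than assuming a fixed total length, because the example key in this talk has a third hyphen-separated block. Apart from XARGIVYWQPXRTC-DTWKUNHWSA-N, which comes from the slides, the keys in the demo are placeholders and do not refer to real compounds.

```python
from collections import defaultdict


def split_inchi_key(key):
    """Split an InChI Key into its hyphen-separated blocks (hypothetical helper)."""
    blocks = key.strip().upper().split("-")
    well_formed = (
        len(blocks) >= 2
        and len(blocks[0]) == 14
        and all(block.isalpha() for block in blocks)
    )
    if not well_formed:
        raise ValueError(f"does not look like an InChI Key: {key!r}")
    return blocks


def skeleton_block(key):
    """Return the first block, which encodes the skeleton."""
    return split_inchi_key(key)[0]


def group_by_skeleton(keys):
    """Group keys that share a skeleton block (a partial-key match)."""
    groups = defaultdict(list)
    for key in keys:
        groups[skeleton_block(key)].append(key)
    return dict(groups)


if __name__ == "__main__":
    keys = [
        "XARGIVYWQPXRTC-DTWKUNHWSA-N",  # from the slides
        "XARGIVYWQPXRTC-ZZZZZZZZSA-N",  # placeholder: same skeleton block
        "QQQQQQQQQQQQQQ-ZZZZZZZZSA-N",  # placeholder: different skeleton block
    ]
    for skeleton, members in group_by_skeleton(keys).items():
        print(skeleton, members)
```

A web service could store the skeleton block in its own indexed column so that partial-key lookups across databases become plain equality matches.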
