Gene Normalization - Computational Bioscience Program

Research in the Verspoor Lab
Karin Verspoor, Ph.D.
Faculty, Computational Bioscience Program
University of Colorado School of Medicine
[email protected]
http://compbio.ucdenver.edu/Hunter_lab/Verspoor

Linguistics, Lexicons, and Biomedical Verbs
• I could go on and on and on and on
• But I probably won’t…

Biological Knowledge Discovery
GENE NORMALIZATION

Gene Normalization
• Mapping a gene or protein name to an identifier (e.g. in GenBank)
• Very important task for using extracted information (more useful than just a name)
• Ambiguity
  – with English words (“to”, “dunce”, “wingless”)
  – in naming (1168 genes in Entrez named “p60”)
  – in species (949 species have a gene named “p53”)

Normalization methods
• Heuristic approach can be effective
  – Edit distance is too coarse (some characters matter more than others)
• Some heuristics that appear to help
  – Ignore hyphens, commas, some other interrupting punctuation (but not, e.g., ' )
  – Ignore parenthetical elements
  – Consider translations among Arabic/Roman numerals, and Latin/Greek letters
  – Special words for compound noun phrases: receptor, precursor, mRNA, gene, protein, Greek letter names, etc.
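
The heuristics listed above can be sketched roughly as follows. This is a minimal illustration and not the lab's implementation: the function name `normalize_name`, the small translation tables, and the choice to drop spaces after normalization are all assumptions made for the example.

```python
import re

# Assumed, deliberately small translation tables (illustrative, not from the slides).
GREEK_SYMBOLS = {"α": "a", "β": "b", "γ": "g", "δ": "d", "κ": "k"}
GREEK_NAMES = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d", "kappa": "k"}
ROMAN_NUMERALS = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5",
                  "vi": "6", "vii": "7", "viii": "8", "ix": "9", "x": "10"}


def normalize_name(name: str) -> str:
    """Map surface variants of a gene/protein name to one comparable string."""
    text = name.lower()
    # Ignore parenthetical elements.
    text = re.sub(r"\([^)]*\)", " ", text)
    # Translate Greek symbols to Latin letters.
    for symbol, latin in GREEK_SYMBOLS.items():
        text = text.replace(symbol, latin)
    # Treat hyphens, commas and slashes as separators; apostrophes are kept.
    text = re.sub(r"[-,/]", " ", text)
    # Translate spelled-out Greek letters and stand-alone Roman numerals.
    tokens = [GREEK_NAMES.get(tok) or ROMAN_NUMERALS.get(tok) or tok
              for tok in text.split()]
    # Assumption of this sketch: compare names with spaces removed.
    return "".join(tokens)


print(normalize_name("TGF-β"), normalize_name("TGF beta"))
print(normalize_name("IL-2 receptor"), normalize_name("IL 2 receptor"))
```

Translating single-letter Roman numerals such as "i", "v" or "x" is aggressive and will misfire on some names; a real system would need to restrict where that rule applies.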
Gene Normalization: a species-based approach
• Based on species detection (NCBI Taxonomy terms)
  – Global cues:
    • First (first species mention)
    • Abstract (most frequent species in abstract)
    • Majority (most frequent species in doc)
  – Local cues, close to gene reference:
    • Recency
    • Window (most frequent in window)
  – “mixed” strategy setting confidence
    • “First” >> “Recency” >> “Window” >> “Majority”

Putting it all together: BioCreative II.5 System Architecture
[Figure: pipeline diagram. Labels, in order: Document; Concept Recognition; Tokenization and Sentence splitting; Dictionary-based protein and species recognition; protein candidate sets and species annotations; Gene Normalization; Gene normalization; INT; Coordination analysis; OpenDMAP relation extraction; Interaction pair construction; normalized interaction pairs; filtered protein sets; IPT; OpenDMAP]

UniProt Dictionary Match
• Trie-based data structure
• Protein names and synonyms normalized upon insertion
  – reduces number of variants
  – same form we search for in the text

Gene candidate selection
• Normalized string match against SwissProt names and synonyms
  – lowercase
  – eliminating punctuation (apostrophes, hyphens, and parentheses)
  – converting Greek letters and Roman numerals to a standard form
  – removing spaces
• Left and right token boundary constraints (right constraint relaxed for plurals)

Protein Match example
• Sentence: Affixin/β-parvin is an integrin-linked kinase (ILK)-binding focal adhesion protein highly expressed in skeletal muscle and heart.
• Normalized Sentence: affixinbparvinisanintegrinlinkedkinaseilkbindingfocaladhesionproteinhighlyexpressedinskeletalmuscleandheart
• Match Affixin to affixin (ID: Q9HBI1)
• Match β-parvin to bparvin (ID: Q9HBI1)

Species detection
• Dictionary lookup using UIMA Concept Mapper loaded with NCBI Taxonomy
• Match species and sub-species; traverse is-a hierarchy for sub-species

BC II.5 results

RAW
         TP      FP      FN
         105     1592    147

         P         R         F         AUC
micro    0.06187   0.41667   0.10775   0.05316
macro    0.06817   0.44374   0.11296   0.17806

Homonym/Ortholog
         TP      FP      FN
         127     454     125

         P         R         F         AUC
micro    0.21859   0.50397   0.30492   0.21285
macro    0.28334   0.55928   0.32453   0.39295

KNoGM and KaBOB
• KNoGM: Knowledge-based Normalization of Gene Mentions
• Strategy based on WSD methods from Agirre and Soroa, based on knowledge graphs
• Taking advantage of biological knowledge resources
• KaBOB: Knowledge Base Of Biology
  – Integrated resource across biological databases

Knowledge-based methods in Word Sense Disambiguation
• Disambiguate words based on relations represented in a semantic graph
• Take advantage of connections among word senses and prefer word senses that are semantically connected
• Intuition: Spreading Activation
  – Can perform static analysis of the graph to determine most likely disambiguations based only on the state of connections in the graph
  – More effective: dynamic, consider words in context

UKB: Agirre & Soroa knowledge-based WSD
• Knowledge-based word sense disambiguation method
  – knowledge = WordNet graph
  – algorithm = (personalized) PageRank
• PageRank: ranks vertices in a graph according to their relative structural importance
• Personalized PageRank: bias certain vertices; “activation” from a vertex increases

Knowledge-based methods in Gene Normalization
• Knowledge typically brought to bear based on textual matching of concepts known to be associated with genes
  – Gene ontology concepts
  – Chromosome locations
  – Species names
• KNoGM takes advantage of such knowledge in a broader relational context
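
To make the Personalized PageRank idea concrete, here is a minimal sketch on a toy graph. It is not UKB or KNoGM code: the graph, the node identifiers (`gene:p53_human`, `tax:mouse`, and so on), and the function names are hypothetical, and the plain power iteration stands in for whatever implementation the real systems use. The context concepts found in the text act as the biased ("seed") vertices, and the candidate identifier that accumulates the most rank is preferred.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with teleportation restricted to seed nodes.

    graph: dict mapping each node to a list of neighbours (every node needs
           at least one neighbour in this simplified sketch).
    seeds: set of nodes that receive the teleport probability mass.
    """
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in graph}
        for node, neighbours in graph.items():
            share = damping * rank[node] / len(neighbours)
            for neighbour in neighbours:
                new_rank[neighbour] += share
        rank = new_rank
    return rank


def disambiguate(graph, candidates, context):
    """Pick the candidate identifier with the highest rank given the context."""
    ranks = personalized_pagerank(graph, set(context))
    return max(candidates, key=ranks.get)


# Hypothetical toy knowledge graph: two candidate genes for the mention "p53".
GRAPH = {
    "gene:p53_human": ["tax:human", "go:apoptosis"],
    "gene:p53_mouse": ["tax:mouse", "go:apoptosis"],
    "tax:human": ["gene:p53_human"],
    "tax:mouse": ["gene:p53_mouse"],
    "go:apoptosis": ["gene:p53_human", "gene:p53_mouse"],
}

best = disambiguate(GRAPH,
                    candidates=["gene:p53_human", "gene:p53_mouse"],
                    context=["tax:mouse", "go:apoptosis"])
print(best)
```

Both candidates are connected to the GO concept, so the species mention in the context is what separates them in this toy case.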
KaBOB: Knowledge Base of Biology
• Goal: construction of an integrated, broad-coverage semantic resource of biological knowledge
  – information artifacts
  – abstracted biological knowledge
  – RDF representation using ontological relations
• KaBOB v.0
  – iRefWeb protein interaction data
  – GO annotations
  – Homologene
  – NCBI Taxonomy

From knowledge-based WSD to KNoGM
• knowledge: KaBOB
• dictionary: gene name → gene identifiers
• context: mentions of gene names, GO terms, NCBI Taxonomy terms

KNoGM Training Set 1, BCIII
(TP = true positives, FP = false positives, FN = false negatives)

System                      TP     FP     FN     Precision   Recall   F-score
Default human (baseline)    73     465    534    0.1357      0.1203   0.1275
UKB-5 (iRefWeb)             73     322    534    0.1848      0.1203   0.1457
UKB-50 (iRefWeb)            64     310    543    0.1711      0.1054   0.1305
UKB-5 (KaBOB v.0)           104    468    503    0.1818      0.1713   0.1764
UKB-25 (KaBOB v.0)          115    504    492    0.1858      0.1895   0.1876
UKB-100 (KaBOB v.0)         151    580    456    0.2066      0.2488   0.2257

Biological Knowledge Discovery
PROTEIN ACTIVE SITES

Automated validation of high-throughput predictions
• Collaboration with Mike Wall @ LANL
• Combine structure-based predictions of active sites on proteins with literature-based validation
  – Given a PDB protein structure, and a prediction for residues in that structure that are active (ligand binding sites, catalytic sites, etc.)
  – Search the literature for evidence supporting the prediction

Protein Fold vs. Function
• Many amino acids in a protein are responsible for defining the overall fold
• However, only a small fraction of the residues in a protein are directly responsible for its behavior
• The evolutionary pressures on these residues are different from other residues, and can cause mutations to be correlated with function (Lichtarge)

Functional Residues Are Often Remote in Sequence
• Difficult to identify as motifs

>1AQM:A|PDBID|CHAIN|SEQUENCE
TPTTFVHLFEWNWQDVAQECEQYLGPKGYAAVQVSPPNEHITGSQWWTRYQPVSYELQSRGGNRAQFIDMVNRCSAAGVD
IYVDTLINHMAAGSGTGTAGNSFGNKSFPIYSPQDFHESCTINNSDYGNDRYRVQNCELVGLADLDTASNYVQNTIAAYI
NDLQAIGVKGFRFDASKHVAASDIQSLMAKVNGSPVVFQEVIDQGGEAVGASEYLSTGLVTEFKYSTELGNTFRNGSLAW
LSNFGEGWGFMPSSSAVVFVDNHDNQRGHGGAGNVITFEDGRLYDLANVFMLAYPYGYPKVMSSYDFHGDTDAGGPNVPV
HNNGNLECFASNWKCEHRWSYIAGGVDFRNNTADNWAVTNWWDNTNNQISFGRGSSGHMAINKEDSTLTATVQTDMASGQ
YCNVLKGELSADAKSCSGEVITVNSDGTINLNIGAWDAMAIHKNAKLNTSSAS

α-amylase from Alteromonas haloplanctis
Asp174, Glu200, Asp264

The Same Residues are Often Nearby in 3D Structure
[Figure: structure of 1AQM with Glu200, Asp264 and Asp174 labeled]

Functional Sites
• Types of Functional Sites
  – Catalytic sites
  – Allosteric Sites
  – Ligand-binding sites
  – Protein-protein interaction sites
• Used to define motifs
  – Geometric hashing and other methods (TESS, Thornton lab)
• Targets for Drug Design

DPA Prediction of Functional Sites
[Figure: structure with Glu200, Asp264 and Asp174 labeled; legend: Catalytic Triad, Predicted Residues]

NLP Validation of Protein Active Site Predictions
• Combine structure-based predictions of active sites on proteins with literature-based validation
  – Given a PDB protein structure, and a prediction for residues in that structure that are active (ligand binding sites, catalytic sites, etc.)
  – Search the literature for evidence supporting the prediction

NLP validation: approach
[Figure: workflow diagram. Labels, in order: Protein Data Bank; Protein ID; protein name(s); protein structure; PubMed query; Dynamic Perturbation Analysis; predicted active residues; residue validation or re-ranking; validated active residues; relevant documents; Analysis Pipeline; extracted residues]

NLP validation: NLP analysis
[Figure: workflow diagram. Labels, in order: Catalytic Site Atlas; Amino Acid residue pattern development; Corpus of documents; Binding MOAD database; Analysis pipeline; Tokenization and Sentence splitting; Amino Acid residue recognition; list of residues; Compare to known active residues for the document; P/R/F score; Analyze FP/FNs]

Residue mention detection, examples
• This missense mutation converts a highly conserved glycine (Gly17 of neurophysin) to a valine residue.
• Killer of prune (Kpn) is a mutation in the awd gene which substitutes Ser for Pro at position 97 and causes dominant lethality in individuals that do not have a functional prune gene.
• Residues in both the N-terminal (Arg-66 and Glu-70) and C-terminal (Arg-200, Asp-254, Asp-255, and Asp-276) thirds of the protein are implicated in binding to cells.
• … where cysteines at positions 6, 42, 48, 90 and 393 were replaced by serine.
• Other outliers of possible functional relevance include D18, R23, R59, R390 and A391.

Patterns must handle 3-letter and 1-letter abbreviations; various connectors, mutations, linguistic constructs such as coordination, and other variations in surface forms.
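
As a rough illustration of what such patterns involve, the sketch below recognizes only the two simplest surface forms in the examples above: a 3-letter abbreviation followed by a position (Gly17, Arg-66) and a 1-letter code followed by a position (D18). The names `THREE_LETTER_RE`, `ONE_LETTER_RE` and `find_residues` are hypothetical, and position-first constructions ("Ser for Pro at position 97", "cysteines at positions 6, 42, ...") are deliberately left out.

```python
import re

THREE_LETTER = ("ala|arg|asn|asp|cys|gln|glu|gly|his|ile|leu|lys|met|phe|"
                "pro|ser|thr|trp|tyr|val")
ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe", "G": "Gly",
    "H": "His", "I": "Ile", "K": "Lys", "L": "Leu", "M": "Met", "N": "Asn",
    "P": "Pro", "Q": "Gln", "R": "Arg", "S": "Ser", "T": "Thr", "V": "Val",
    "W": "Trp", "Y": "Tyr",
}

# 3-letter code, optional space or hyphen, then the position: Gly17, Arg-66.
THREE_LETTER_RE = re.compile(r"\b(" + THREE_LETTER + r")[ -]?(\d+)\b", re.IGNORECASE)
# 1-letter code (upper case only) immediately followed by the position: D18.
ONE_LETTER_RE = re.compile(r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)\b")


def find_residues(text):
    """Return (residue, position) pairs in order of appearance."""
    hits = [(m.start(), m.group(1).capitalize(), int(m.group(2)))
            for m in THREE_LETTER_RE.finditer(text)]
    hits += [(m.start(), ONE_TO_THREE[m.group(1)], int(m.group(2)))
             for m in ONE_LETTER_RE.finditer(text)]
    return [(residue, position) for _, residue, position in sorted(hits)]


print(find_residues("Other outliers of possible functional relevance "
                    "include D18, R23, R59, R390 and A391."))
```

The 1-letter pattern in particular will over-generate on tokens such as protein or cell-line names that happen to look like a letter plus digits, so a recall-oriented pattern set like this would be expected to need filtering to reach useful precision.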
Some regular expressions for AA mentions

# Patterns are lower case; presumably meant to be compiled case-insensitively.
# Full names and "-yl" forms ("glutamic acid" listed before "glutamic" so the longer form can match)
AA_long = (r"(alanine|asparagine|aspartic|cysteine|glutamic acid|glutamic|glutamine|glycine|histidine|allo"
           r"|leucine|lysine|methionine|phenylalanine|proline|serine|threonine|tryptophane?"
           r"|tyrosine|valine|arginine|alanyl|arginyl|asparaginyl|aspartyl|cysteinyl|glutaminyl"
           r"|glycyl|glutamyl|histidyl|isoleucyl|leucyl|lysyl|methionyl"
           r"|phenylalanyl|prolyl|seryl|threonyl|tryptophanyl|tyrosyl|isoleucine|valyl)")
AA_short = r"(arg|asn|asp|cys|gln|gly|glu|his|ile|leu|lys|met|phe|pro|ser|thr|trp|tyr|val|asx|glx|xle|xaa|ala|ctt)"
AA_initial = r"(A|C|D|E|F|G|H|I|K|L|M|N|P|Q|R|S|T|V|W|Y)"
# Group the alternation so that anything appended applies to both branches
AA_unbounded = "(?:" + AA_long + "|" + AA_short + ")"
AA_bounded = r"\b" + AA_unbounded + r"\b"
# Position followed by the AA
AA_position_variant1 = r"(\d+)([ \-]+)" + AA_bounded
# AA plus the position (tyr85), with optional parentheses around the position: tyr(85)
AA_position_variant2 = AA_unbounded + r"[ \-]*\(?\d+\)?"
# Connectors between two mentions (tyr85 to ser85, Tyr 85 Ser 85, trp27-gly360)
connection = r"[ \-]?(\-|to|\s|\\)[ \-]?"
grammatical_expressions = r"([ \-]?(to|substitution of|at position|acid)[ \-]?)"
pattern3 = AA_unbounded + r".?\d+" + connection + AA_unbounded + r".?\d+"

Current pattern performance

            residues   Prec       Recall   F1
Corpus 1    3723       0.725933   0.993    0.84
Corpus 2    767        0.741873   1.0      0.85
Corpus 3    303        0.735436   1.0      0.85
Average                0.734415   0.998    0.85

Corpus 1: 61 full-text journal publications derived from Protein Data Bank (PDB) records that have known functional sites
Corpus 2: 7 full-text journal publications; 5 abstracts. Derived from PDB records that are known drug targets.
Corpus 3: 100 journal abstracts; obtained from Nagel et al (2009).
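
For reference, the scores reported above follow the usual definitions of precision, recall and F1. The sketch below shows the comparison step (extracted residues against known active residues) and recomputes F1 from the reported precision and recall. The function names and the example `predicted` set are hypothetical; the `gold` set reuses the three 1AQM catalytic residues named earlier in the deck.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score(predicted, gold):
    """Compare a set of extracted residues against the known active residues."""
    tp = len(predicted & gold)   # extracted and known
    fp = len(predicted - gold)   # extracted but not known
    fn = len(gold - predicted)   # known but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical extraction result for one document about 1AQM.
gold = {("Asp", 174), ("Glu", 200), ("Asp", 264)}
predicted = {("Asp", 174), ("Glu", 200), ("Asp", 264), ("Gly", 17)}
print(score(predicted, gold))

# Recompute F1 for Corpus 1 from the reported precision and recall.
print(round(f1_score(0.725933, 0.993), 2))
```

Recomputing F1 this way from the reported precision and recall gives the tabulated values (0.84, 0.85, 0.85) after rounding to two decimals.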
NLP analysis, refined
[Figure: workflow diagram. Labels, in order: Amino Acid residue pattern development; Catalytic Site Atlas; Corpus of documents; protein-residue association pattern development; Binding MOAD database; Analysis pipeline; Protein recognition; Amino Acid residue recognition; Protein-Residue association (OpenDMAP patterns); list of pairs (protein, residue); Compare to known active residues for a protein linked to the document; P/R/F score; Analyze FP/FNs]

Some initial results of integration
• For 32,195 PDB entries:
  – 26,829 entries map to a PubMed ID
  – 14,851 unique PubMed abstracts processed
  – 23,477 residues identified
• 69% match surface residues on the relevant protein
  – 50% of these match predicted active sites
• 79% of PDB entries have at least one residue identified

Complicating factors
• AA numbering in sequences may not be consistent
  – Different “reference” sequences for the protein
  – Mutant or other variant sequences
• Explicit mentions of mutations
• Namespace ambiguity, possibly

BioNLP
TECHNICAL AND REPRESENTATIONAL ISSUES

NLP validation: infrastructure
• Requires scaling our architecture to process full text publications on a large scale
  – UIMA-AS (Asynchronous Scaleout)
  – Cloud/cluster computing
• Take software engineering seriously
  – Robust, scalable, modular architectures
  – Consider the kinds of knowledge structures we need to be able to represent and manipulate
    • hierarchical controlled vocabularies
    • patterns of expression

Annotation Representation
[Figure: RDF annotation graph for the text “…regulation of transcription of mouse Mapk7…”. Text spans t1–t4 are linked by has_location to annotations a1–a4; annotations are linked by kiao:denotesResource to resources, each with an rdfs:label: GO:0065007 “biological regulation”, GO:0006350 “transcription”, EG:23939 “M. musculus Mapk7”, GO:0045449 “regulation of transcription”. Other labels: rdfs:Resource; rdf:Property; p; (s p o); kiao:ResourceAnnotation; kiao:StatementSetAnnotation]

In a nutshell
• Ontologies and Semantic graph analysis
• Vocabularies and Linguistic knowledge for the biomedical domain
• Text Mining
• Information Extraction
• Addressing the needs of the biological user
• Biological data analysis integrating multiple data sources

Acknowledgements
• Larry Hunter (Lab director)
• Eneko Agirre and Aitor Soroa at EHU (UKB)
• Kevin Livingston (KaBOB)
• Kevin Cohen (NLP)
• Helen Johnson (Linguist)
• (Software engineers)
  – Bill Baumgartner
  – Chris Roeder
  – Tom Christiansen
• Other Lab members:
  – Mike Bada, Hannah Tipney, Yuriy Malenkiy, Lynne Fox
• Mike Wall and Judith Cohn at LANL
• NIH grants
  – R01 LM 010120-01
  – R01 LM 009254
  – R01 LM 008111
  – R01 GM 083649
  – G08 LM 009639
  – T15 LM 009451
• Guillaume Achaz for the gnome image
