BioNLP Tutorial
K. Bretonnel Cohen, Olivier Bodenreider, Lynette Hirschman
PSB 2006, Wailea, Maui, HI

The Biological Data Cycle
[Diagram: Experimental Data; Ontologies; Databases (Genbank, SwissProt); Literature Collections (MEDLINE); Expert Curation]
• Bottleneck: getting knowledge from literature to databases
• Solution: text mining

Model Organism Curation Pipeline
MEDLINE
1. Select papers
2. List genes for curation
3. Curate genes from paper

Double exponential growth in the literature
New entries in Medline with publication date in Jan-Aug 2005: 431,478 (avg. 1775/day)

Examples of BioNLP in action

Application types

Information retrieval: find documents in response to an “information need”
p53
Resistance to apoptosis, increased growth potential, and altered gene expression in cells that survived genotoxic hexavalent chromium exposure. PMID: 16283527

Question-answering: question as input, answer as output
What is BRCA1?
A gene located on the seventeenth chromosome associated with a risk of breast and ovarian cancer (Yu and Sable 2005)

• Summarization
– Input: one or more texts
– Output: single (shorter) text
– Ling et al. (multiple documents)
– Lu et al. (single document)

Information extraction: Information extraction systems find statements about some specified type of relationship in text. Entity identification is a necessary prerequisite to information extraction.

Information retrieval: Information retrieval is classically defined as the location of documents that are relevant to some information need. PubMed is a premier example of a sophisticated biomedical information retrieval system.

Summarization systems benefit from high-performance entity identification and normalization. Other approaches involve information extraction.

Information extraction: relationships between things
Met28 binds to DNA.
BINDING_EVENT
Binder: Met28
Bound: DNA
Related session papers: Lussier (gene/phenotype), Maguitman (protein/family), Chun (gene/disease), Höglund (protein/location), Stoica (protein/function)

Entity identification
HSP60, Hsp-60, heat shock protein 60, Cerberus, wingless, Ken and Barbie, the

Entity normalization: find concepts in text and map them to unique identifiers

A locus has been found, an allele of which causes a modification of some allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of which is dominant to the other and results in increased electrophoretic mobility of affected allozymes. The locus responsible has been mapped to 3-56.7 on the standard genetic map (Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by the modifier locus. Neuraminidase incubations of homogenates altered the electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are not large enough to conclude that esterase 6 is sialylated.

• Perfect entity identification finds 5 mentions; they correspond to just 2 genes:
– FBgn0000592 (esterase 6)
– FBgn0026412 (leucine aminopeptidase)
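The normalization step on the esterase 6 abstract can be sketched as a dictionary lookup. This is only a minimal illustration, not the method of any system in the tutorial: the synonym table below is a small hand-built assumption (a real system would load the full FlyBase synonym lists), and the function name `find_mentions` is hypothetical.

```python
import re

ABSTRACT = (
    "A locus has been found, an allele of which causes a modification of some "
    "allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are "
    "two alleles of this locus, one of which is dominant to the other and "
    "results in increased electrophoretic mobility of affected allozymes. The "
    "locus responsible has been mapped to 3-56.7 on the standard genetic map "
    "(Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only leucine "
    "aminopeptidase is affected by the modifier locus. Neuraminidase "
    "incubations of homogenates altered the electrophoretic mobility of "
    "esterase 6 allozymes, but the mobility differences found are not large "
    "enough to conclude that esterase 6 is sialylated."
)

# Assumed, illustrative synonym table: lower-cased name -> FlyBase gene ID.
SYNONYMS = {
    "esterase 6": "FBgn0000592",
    "est-6": "FBgn0000592",
    "est6": "FBgn0000592",
    "leucine aminopeptidase": "FBgn0026412",
}


def find_mentions(text, synonyms):
    """Return (surface form, gene ID) for every dictionary name found in text."""
    # Try longer names first so a long name is preferred over a shorter one.
    names = sorted(synonyms, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # The lookarounds stop a name from matching inside a longer token.
    pattern = re.compile(r"(?<![\w-])(?:" + alternatives + r")(?![\w-])",
                         re.IGNORECASE)
    return [(m.group(0), synonyms[m.group(0).lower()])
            for m in pattern.finditer(text)]


mentions = find_mentions(ABSTRACT, SYNONYMS)
gene_ids = {gene_id for _, gene_id in mentions}
print(len(mentions), "mentions ->", len(gene_ids), "genes:", sorted(gene_ids))
```

Exact dictionary matching only works here because the abstract uses names that are in the table; the variability and ambiguity of gene names is what makes the general task hard.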
Application types
• Partial list of synonyms for FBgn0000592:
– Esterase 6
– Carboxyl ester hydrolase
– CG6917
– Est6
– Est-D
– Est-5
Related session papers: Chun (gene/disease), Johnson (ontology alignment), Stoica (gene/function), Vlachos (FlyBase mapping)

Biological Nomenclature: “V-SNARE” (A. Morgan)
• V-SNARE = Vesicle SNARE
• SNARE = SNAP Receptor
• SNAP = Soluble NSF Attachment Protein
• NSF = N-Ethylmaleimide-Sensitive Fusion Protein
• N-Ethylmaleimide = Maleic acid N-ethylimide
• Fully expanded: Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor

The Biological Data Cycle
What’s the organizing principle for all of this?

Organizing principles
[Diagram: UMLS at the center, linking SNOMED (clinical repositories), OMIM (genetic knowledge bases), MeSH (biomedical literature), NCBI Taxonomy (model organisms), GO (genome annotations), UWDA (anatomy), and … (other subdomains)]

Ontologies as text mining resources

Neurofibromatosis type 2 (NF2) is often not recognised as a distinct entity from peripheral neurofibromatosis. NF2 is a predominantly intracranial condition whose hallmark is bilateral vestibular schwannomas. NF2 results from a mutation in the gene named merlin, located on chromosome 22. (Uppal, S., and A. P. Coatesworth. “Neurofibromatosis Type 2.” Int J Clin Pract, 57, no. 8, 2003, pp. 698-703.)

• vestibular schwannoma manifestation of neurofibromatosis 2 (Tumor manifestation of Disease)
• neurofibromatosis 2 associated with mutation of merlin (Disease associated with mutation of Gene)
• merlin located on chromosome 22 (Gene located on Chromosome)

What’s the state of the art?

Source           Relation  Entity   DB        Prec   Recall
Craven '99       location  protein  Yeast     92%    21%
Rindflesch '99   binding   UMLS     MEDLINE   79%    72%
Proux '00        interact  gene     Flybase   81%    44%
Friedman '01     pathway   many     Articles  96%    63%
Pustejovsky '02  inhibit   gene     MEDLINE   90%    57%
Bunescu '05      interact  protein  MEDLINE   ~37%   ~50%

• Precision ≈ Specificity; Recall ≈ Sensitivity
• Tasks differ greatly: finding human protein interactions (Bunescu ‘05) may be harder than finding “inhibition” relations (Pustejovsky ‘02)
• Need a CASP-style competitive evaluation

Competitive evaluations:
• KDD Cup (2002)
• TREC Genomics (2003, 2004, 2005)
• BioCreAtIvE (2004)
• BioNLP (2004)

Evaluations mapped onto the curation pipeline (MEDLINE):
1. Select papers: KDD 2002, TREC Genomics 2004
2. List genes for curation: BioCreAtIvE entity identification and entity normalization tasks
3. Curate genes from paper: BioCreAtIvE information extraction task: PDB → Gene Ontology

• Yeast results good: High: 0.93 F
– Smallest vocab
– Short names
– Little ambiguity
• Fly: 0.82 F
– High ambiguity
• Mouse: 0.79 F
– Large vocabulary
– Long names
[Figure: precision (y-axis) vs. recall (x-axis) for the FLY, MOUSE, and YEAST systems, with 0.8 and 0.9 F-measure curves]

F-measure is balanced precision and recall: 2*P*R/(P+R)
Recall: # correctly identified / # possible correct
Precision: # correctly identified / # identified
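The three definitions above can be worked through in a few lines. This is a sketch only: the function names are illustrative, and the F values it prints are computed here from the precision and recall figures in the table of published systems; they are not values reported on the slides.

```python
def precision(n_correct, n_identified):
    """Precision: # correctly identified / # identified."""
    return n_correct / n_identified


def recall(n_correct, n_possible):
    """Recall: # correctly identified / # possible correct."""
    return n_correct / n_possible


def f_measure(p, r):
    """Balanced F-measure: 2*P*R / (P+R); defined as 0 when both are 0."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# (precision, recall) pairs copied from the table of published systems.
REPORTED = {
    "Craven '99": (0.92, 0.21),
    "Rindflesch '99": (0.79, 0.72),
    "Proux '00": (0.81, 0.44),
    "Friedman '01": (0.96, 0.63),
    "Pustejovsky '02": (0.90, 0.57),
}

for source, (p, r) in REPORTED.items():
    print(f"{source:16s} P={p:.2f} R={r:.2f} F={f_measure(p, r):.2f}")
```

Because F is a harmonic mean, a system with very high precision but low recall (the first row) still ends up with a low F-measure.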
user    run  evaluated results  "perfect" predictions  correct protein, "general" GO
user4   1    1048               268 (25.57%)           74 (7.06%)
user5   1    1053               166 (15.76%)           77 (7.31%)
user5   2    1050               166 (15.81%)           90 (8.57%)
user5   3    1050               154 (14.67%)           86 (8.19%)
user7   1    1057               272 (25.73%)           154 (14.57%)
user7   2    1864               43 (2.31%)             40 (2.15%)
user7   3    1703               66 (3.88%)             40 (2.35%)
user9   1    251                125 (49.80%)           13 (5.18%)
user9   2    70                 33 (47.14%)            5 (7.14%)
user9   3    89                 41 (46.07%)            7 (7.87%)
user10  1    45                 36 (80.00%)            3 (6.67%)
user10  2    59                 45 (76.27%)            2 (3.39%)
user10  3    64                 50 (78.12%)            4 (6.25%)
user14  1    1050               303 (28.86%)           69 (6.57%)
user15  1    524                59 (11.26%)            28 (5.34%)
user15  2    998                125 (12.53%)           69 (6.91%)
user17  1    413                83 (20.10%)            19 (4.60%)
user17  2    458                7 (1.53%)              (0.00%)
user20  1    1048               301 (28.72%)           57 (5.44%)
user20  2    1048               280 (26.72%)           60 (5.73%)
user20  3    1050               239 (22.76%)           59 (5.62%)
(Blaschke et al.)

Results by Gene Ontology category (Blaschke et al.):
• Cellular Component: 34.61% (561/1621)
• Molecular Function: 33.00% (933/2827)
• Biological Process: 23.02% (1011/4391)
Cellular component is easier because the task is a relation between “entities”: located_in (protein, cell_component)
Biological process is hardest because it is the most abstract

2.5 types of solutions
• Rule-based
– Patterns
– Grammars
• Statistical/machine learning
– Labelled training data
– Noisy training data
• Hybrid statistical/rule-based
Session papers shown on this slide:
– Johnson (ontology alignment, GO → other OBO)
– Chun (IE, multiple gene → UMLS disease)
– Höglund (information extraction, gene → localiz.)
– Lu (summarization, Entrez Gene → GeneRIFs)
– Ling (summarization, FlyBase)
– Maguitman (info. extract., SWISSPROT → Pfam)
– Lussier (info. extraction, GOA gene → phenotype)
– Vlachos (entity normalization, → FlyBase)
– Vlachos (coreference, FlyBase & Sequence Ont.)
– Stoica (gene → GO code)

Common tools/techniques
• “Stop word” removal: eliminate features that are rarely helpful
– the, a, and…
• (Porter) stemming: convert inflected words to their roots
– promot, mitochondri, cytochrom
• POS: “part of speech”, ≈80 categories

Why text mining is difficult
• Variability
• Pervasive ambiguity at every level of analysis

Variability: many ways to say the same thing
• Met28 binds to DNA
• …binding of Met28 to DNA…
• …Met28 and DNA bind…
• …binding between Met28 and DNA…
• …Met28 is sufficient to bind DNA…
• …DNA bound by Met28…

Variability: intervening material
• …binding of Met28 to DNA…
• …binding under unspecified conditions of Met28 to DNA…
• …binding of this translational variant of Met28 to DNA…
• …binding of Met28 to upstream regions of DNA…
• …binding under unspecified conditions of this translational variant of Met28 to upstream regions of DNA…

Ambiguity at every level of analysis:
• Document segmentation
• Sentence segmentation
• Tokenization
• Part of speech tagging
• Parsing

Sentence segmentation example (Ruan et al. 2002):
Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a component of the Msn pathway for regulating R cell growth targeting. bif displays strong genetic interaction with msn.

Sentence segmenter    F-measure
MaxEnt_1              .40
MaxEnt_2              .67
KeX                   .95
LingPipe              .96
(Baumgartner, in prep.)
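The Bifocal example shows why sentence segmentation is not trivial: the second sentence starts with a lower-case gene name. The sketch below compares two naive splitting rules on it. Both rules, and the second test string about "Ruan et al. 2002", are assumptions made up for illustration; neither rule is one of the segmenters evaluated in the tutorial.

```python
import re

BIF_TEXT = (
    "Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a "
    "component of the Msn pathway for regulating R cell growth targeting. "
    "bif displays strong genetic interaction with msn."
)

# Hypothetical sentence, used only to show the opposite failure mode.
CITATION_TEXT = "See Ruan et al. 2002 for details."

# Rule A: split after . ? ! only when the next word starts with a capital.
CAPITAL_RULE = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")
# Rule B: split after every . ? ! that is followed by whitespace.
ANY_PERIOD_RULE = re.compile(r"(?<=[.?!])\s+")


def split_sentences(text, rule):
    """Split text at every position matched by the given boundary rule."""
    return [segment for segment in rule.split(text.strip()) if segment]


for name, rule in (("capital rule", CAPITAL_RULE),
                   ("any-period rule", ANY_PERIOD_RULE)):
    print(name)
    print("  Bifocal text :", len(split_sentences(BIF_TEXT, rule)), "segment(s)")
    print("  citation text:", len(split_sentences(CITATION_TEXT, rule)), "segment(s)")
```

Rule A misses the boundary before "bif", while rule B finds it but breaks the citation after "al."; fixed rules of this kind trade one error for the other, which is one reason trained or tuned segmenters differ so much in the F-measures listed above.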
Why text mining is difficult

Part-of-speech ambiguity: lead
• 69 tokens in GENIA
– “bare stem” verb: 34
– 3rd person singular present tense verb: 29
– Noun: 3
– Past tense verb: 2
– Past participle: 1

Word sense ambiguity: HUNK
• Human natural killer (cell type)
• HUN kinase (gene/protein)
• Radiological/orthopedic classification scheme
• Piece of something

Abbreviation ambiguity:
NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF 266998:12177002)
NACT:
• neoadjuvant chemotherapy (PMID 8898170)
• N-acetyltransferase (PMID 10725313)
• Na+-coupled citrate transporter (PMID 12177002)

Structural ambiguity in the same sentence (coordination):
• (liver), (testis) and (brain in rat)
• liver, (testis and brain in rat)
• (liver, testis and brain in rat)

Structural ambiguity in the same sentence (attachment):
• shows preference for (citrate over dicarboxylates)
• shows preference (for citrate) (over dicarboxylates)

Coordination scope:
regulation of cell migration and proliferation (PMID …)
! proliferation and regulation of cell migration
! regulation of proliferation and cell migration
regulation of cell migration and regulation of cell proliferation

serine phosphorylation, translocation, and degradation of IRS-1 (PMID 16099428)
! degradation of IRS-1, translocation, and serine phosphorylation
! serine phosphorylation, serine translocation, and serine degradation (of IRS-1)

Most biomedical text mining to date: “ungrounded”
• Drosophila OBP76a is necessary for fruit flies to respond to the aggregation pheromone 11-cis vaccenyl acetate (PMID 15664166)
• lush is completely devoid of evoked activity to the pheromone 11-cis vaccenyl acetate (VA), revealing that this binding protein is absolutely required for activation of pheromone-sensitive chemosensory neurons (PMID 15664171)
Entrez Gene ID: 40136

The next step
• Text mining can be a key tool for linking biological knowledge from the literature to structured data in biological databases…
• …and databases to each other.

Papers in the text mining session
• 5 papers on linkage to ontologies
– Höglund et al.: generating cellular localization annotations
– Lussier et al.: PhenoGO for capture of phenome data
– Stoica and Hearst: functional annotation of proteins
– Johnson et al.: ontology alignments
– Vlachos et al.: ontology for name extraction, anaphora
• 2 papers linking other sets of resources
– Maguitman et al. on “bibliome” to reproduce Pfam classes
– Chun et al. on linking genes and diseases
• 2 papers on summarization, using linked resources
– Lu et al.: automated GeneRIF extraction
– Ling et al.: automated gene summary generation

Acknowledgements
• Alex Morgan for several slides
• Christian Blaschke for data and slides
• Bill Baumgartner for sentence segmenter performance data
• Helen Johnson for data on POS ambiguity in GENIA
• Lu Zhiyong for syntactic ambiguity examples
• Larry Hunter for current PubMed graph

How big is a humuhumunukunukuapua’a?