Using Weakly Labeled Data to
Learn Models for Extracting
Information from Biomedical Text
Mark Craven
Department of Biostatistics & Medical Informatics
Department of Computer Sciences
University of Wisconsin
U.S.A.
[email protected]
www.biostat.wisc.edu/~craven
The Information Extraction Task
Analysis of Yeast PRP20 Mutations and Functional Complementation by the
Human Homologue RCC1, a Protein Involved in the Control of Chromosome
Condensation
Fleischmann M, Clark M, Forrester W, Wickens M, Nishimoto T, Aebi M
Mutations in the PRP20 gene of yeast show a pleiotropic phenotype, in which both mRNA
metabolism and nuclear structure are affected. SRM1 mutants, defective in the same
gene, influence the signal transduction pathway for the pheromone response . . .
By immunofluorescence microscopy the PRP20 protein was localized in the nucleus.
Expression of the RCC1 protein can complement the temperature-sensitive phenotype
of PRP20 mutants, demonstrating the functional similarity of the yeast and
mammalian proteins
protein(PRP20)
subcellular-localization(PRP20, nucleus)
Motivation
• assisting in the construction and updating of
databases
• providing structured summaries for queries
What is known about protein X
(subcellular & tissue localization, associations with
diseases, interactions with drugs, …)?
• assisting scientific discovery by detecting previously
unknown relationships, annotating experimental data
Three Themes in Our
IE Research
1. Using “weakly” labeled training data
2. Representing sentence structure in learned models
3. Combining evidence when making predictions
1. Using “Weakly” Labeled Data
• why use machine learning methods in building
information-extraction systems?
– hand-coding IE systems is expensive, time-consuming
– there is a lot of data that can be leveraged
• where do we get a training set?
– by having someone hand-label data (expensive)
– by coupling tuples in an existing database with
relevant documents (cheap)
“Weakly” Labeled Training Data
• to get positive examples, match DB tuples to
passages of text referencing constants in tuples
[Figure: tuples from the YPD database, e.g. (P1, L1), (P2, L2), (P3, L3), are matched to MEDLINE abstracts whose text mentions both constants of a tuple, e.g. …P1…L1…, …P2…L2…, …L3…P3…]
Weakly Labeled Training Data
• the labeling is weak in that many sentences with
co-occurrences wouldn’t be considered positive
examples if we were hand-labeling them
• consider the sentences associated with the relation
subcellular-localization(VAC8p, vacuole) after weak
labeling
VAC8p is a 64-kD protein found on the vacuole membrane, a site
consistent with its role in vacuole inheritance.
In analogy, VAC8p may link the vacuole to actin during vacuole
partitioning.
In addition to its role in early vacuole inheritance, VAC8p is required
to target aminopeptidase I from the cytoplasm to the vacuole.
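As a concrete illustration, here is a minimal sketch of this co-occurrence labeling step (hypothetical helper name and simple substring matching; not the authors' implementation): a sentence is labeled positive for subcellular-localization(P, L) if it mentions both constants of a database tuple.

```python
import re

def weakly_label(sentences, tuples):
    """Label a sentence positive if it mentions both constants of a DB tuple.

    sentences: sentence strings (e.g. drawn from MEDLINE abstracts)
    tuples: (protein, location) pairs from a database such as YPD
    Returns (positives, negatives).
    """
    positives, negatives = [], []
    for sent in sentences:
        matched = any(
            re.search(re.escape(prot), sent, re.IGNORECASE)
            and re.search(re.escape(loc), sent, re.IGNORECASE)
            for prot, loc in tuples
        )
        (positives if matched else negatives).append(sent)
    return positives, negatives

# e.g. the VAC8p/vacuole tuple from the slide above
pos, neg = weakly_label(
    ["VAC8p is a 64-kD protein found on the vacuole membrane.",
     "Our results suggest that Bed1 is found in the ER."],
    [("VAC8p", "vacuole")],
)
```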
Learning Context Patterns for
Recognizing Protein Names
• We use AutoSlog [Riloff ’96] to find “triggers” that commonly
occur before and after tagged proteins in a training corpus
selections from the training corpus
…gene encoding <p>gamma-glutamyl kinase</p> was…
…recognized genes encoding <p>vimentin</p>, heat…
…found that <p>E2F</p> binds specifically…
…<p>IleRS</p> binds to the acceptor…
…of <p>CPB II</p> binds 1 mol of…
…purified C/<p>EBP</p> binds at the same position…
…which interacts with <p>CD4</p>: both…
…14-3-3tau interacts with <p>protein kinase C mu</p>, a subtype…
encoding [X]         2/4
[X] binds            4/5
interacts with [X]   2/6
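A rough sketch of how such pattern statistics might be tallied (the regex templates and counting scheme below are assumptions for illustration; AutoSlog itself induces patterns from syntactic templates): count how often each trigger fires next to a tagged protein versus how often it fires at all.

```python
import re

# Candidate triggers paired with (fires-next-to-a-tagged-protein, fires-at-all) regexes.
# These regex templates are assumptions for illustration only.
PATTERNS = {
    "encoding [X]":       (r"encoding\s+<p>",       r"encoding\s+\S"),
    "[X] binds":          (r"</p>\s+binds",         r"\S\s+binds"),
    "interacts with [X]": (r"interacts with\s+<p>", r"interacts with\s+\S"),
}

def pattern_stats(corpus, patterns=PATTERNS):
    """Return {pattern: (hits on tagged proteins, total firings)} over a tagged corpus."""
    return {
        name: (len(re.findall(tagged_re, corpus)), len(re.findall(any_re, corpus)))
        for name, (tagged_re, any_re) in patterns.items()
    }

corpus = "...gene encoding <p>gamma-glutamyl kinase</p> was... found that <p>E2F</p> binds specifically..."
for name, (hits, total) in pattern_stats(corpus).items():
    print(f"{name}: {hits}/{total}")
```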
“Weak” Labeling Example
SwissProt dictionary
...
D-AKAP-2
D-amino acid oxidase
D-aspartate oxidase
D-dopachrome tautomerase
…
DAG kinase zeta
DAMOX
DASOX
DAT
DB83 protein
…
PubMed abstract
Two distinct forms of oxidases catalysing the
oxidative deamidation of D-alpha-amino
acids have been identified in human tissues:
<p>D-amino acid oxidase</p> and
<p>D-aspartate oxidase</p>. The
enzymes differ in their electrophoretic
properties, tissue distribution, binding with
flavine adenine dinucleotide, heat stability,
molecular size and possibly in subunit
structure. Neither enzyme exhibits genetic
polymorphism in European populations, but
a rare electrophoretic variant phenotype
(<p>DASOX</p> 2-1) was identified
which suggests that the <p>DASOX</p>
locus is autosomal and independent
of the <p>DAMOX</p>
locus.
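A minimal sketch of this dictionary-based tagging (a hypothetical helper using exact string matching, longest names first; the actual system's matching rules are not described here):

```python
import re

def tag_dictionary_names(abstract, dictionary):
    """Wrap exact dictionary matches in <p>...</p> tags.

    Longer names are matched first so that "D-amino acid oxidase" is tagged
    as one name rather than leaving a shorter substring untagged.
    """
    for name in sorted(dictionary, key=len, reverse=True):
        abstract = re.sub(
            r"(?<![\w>])" + re.escape(name) + r"(?![\w<])",
            lambda m: f"<p>{m.group(0)}</p>",
            abstract,
        )
    return abstract

dictionary = ["D-amino acid oxidase", "D-aspartate oxidase", "DASOX", "DAMOX"]
text = "two oxidases identified: D-amino acid oxidase and D-aspartate oxidase; the DASOX locus is autosomal."
print(tag_dictionary_names(text, dictionary))
```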
Protein Name Extraction
Approach
1. select noun phrases that match AutoSlog patterns (encoding [X], [X] binds, interacts with [X])
   e.g. "Two distinct forms of oxidases catalysing the oxidative deamidation of D-alpha-amino acids have been identified in human tissues: D-amino acid oxidase and ..."
2. classify noun phrases using a naïve Bayes model
3. extract positive classifications, e.g. D-amino acid oxidase
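A minimal sketch of the classification step (the features, smoothing, and class priors below are assumptions for illustration, not the authors' exact model): a naïve Bayes classifier over simple morphological features of a candidate noun phrase.

```python
import math
from collections import defaultdict

def features(phrase):
    """A few simple morphological features of a candidate name (assumed feature set)."""
    return {
        "has_digit": any(c.isdigit() for c in phrase),
        "has_upper": any(c.isupper() for c in phrase),
        "has_hyphen": "-" in phrase,
        "multi_word": " " in phrase,
    }

class NaiveBayesNameClassifier:
    def fit(self, phrases, labels):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        for phrase, y in zip(phrases, labels):
            self.class_counts[y] += 1
            for f, v in features(phrase).items():
                self.feature_counts[y][(f, v)] += 1
        return self

    def predict(self, phrase):
        scores = {}
        for y, n in self.class_counts.items():
            score = math.log(n)  # unnormalized log prior
            for f, v in features(phrase).items():
                # Laplace-smoothed Pr(feature value | class) for binary features
                score += math.log((self.feature_counts[y][(f, v)] + 1) / (n + 2))
            scores[y] = score
        return max(scores, key=scores.get)

clf = NaiveBayesNameClassifier().fit(
    ["D-amino acid oxidase", "DASOX", "the enzymes", "human tissues"],
    ["protein", "protein", "other", "other"],
)
print(clf.predict("D-aspartate oxidase"))  # expected: "protein"
```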
Experimental Evaluation
• hypothesis: we get more accurate models by using
weakly labeled data in addition to manually labeled
data
• models use Autoslog-induced context patterns +
naïve Bayes on morphological/syntax features of
candidate names
• compare predictive accuracy resulting from
– fixed amount of hand-labeled data
– varying amounts of weakly labeled data + hand-labeled data
Extraction Accuracy:
Yapex Data Set
[Precision-recall plot: precision = TP/(TP+FP) vs. recall = TP/(TP+FN), both from 0 to 1. Curves: NB model only; NB + AutoSlog with 0, 90, 2,000, and 25,100 weakly labeled abstracts.]
Extraction Accuracy:
Texas Data Set
[Precision-recall plot: precision = TP/(TP+FP) vs. recall = TP/(TP+FN), both from 0 to 1. Curves: NB model only; NB + AutoSlog with 0, 1,800, 2,000, and 25,100 weakly labeled abstracts.]
2. Representing Sentence
Structure in Learned Models
• hidden Markov models (HMMs) have proven to be
perhaps the best family of methods for learning IE
models
• typically these HMMs have a “flat” structure, and are
able to represent relatively little about grammatical
structure
• how can we provide HMMs with more information about
sentence structure?
Hidden Markov Models:
Example
[State-transition diagram: states q1–q5 between a start and an end state, each annotated with transition probabilities and word-emission probabilities for words such as "the", "protein", and "bed1"; the example evaluates Pr("... the Bed1 protein ..." | ... q1, q4, q2 ...).]
Hidden Markov Models for
Information Extraction
• there are efficient algorithms for doing the following
with HMMs:
– determining the likelihood of a sentence given a
model
– determining the most likely path through a model
for a sentence
– setting the parameters of the model to maximize
the likelihood of a set of sentences
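As an illustration of the second item, here is a textbook log-space Viterbi decoder for a discrete-emission HMM (a generic sketch, not the models described later); the most probable path is what later determines which phrases fall in the labeled extraction states.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for an observation sequence (log-space Viterbi).

    obs: emitted symbols (e.g. words or phrases)
    start_p[s], trans_p[s][t], emit_p[s][o]: the model's probabilities
    """
    def lp(x):  # log probability, guarding against zero
        return math.log(x) if x > 0 else float("-inf")

    V = [{s: lp(start_p[s]) + lp(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: V[t - 1][r] + lp(trans_p[r][s]))
            V[t][s] = V[t - 1][prev] + lp(trans_p[prev][s]) + lp(emit_p[s].get(obs[t], 0.0))
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path, V[-1][last]
```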
Representing Sentences
• we first process sentences by analyzing them with a
shallow parser (Sundance, [Riloff et al., 98])
[Shallow-parse tree for "Our results suggest that protein Bed1 is found in the ER": the sentence splits into two clauses; the clauses into noun, verb, and prepositional phrases; and the phrases into tagged words (adjective, noun, verb, cop, prep, art, unk, c_m).]
Hierarchical HMMs for IE
(Part 1)
[Ray & Craven, IJCAI 01; Skounakis et al., IJCAI 03]
• states have types and emit phrases
• some states have labels (PROTEIN, LOCATION)
• our models have ~25 states at this level
[Phrase-level state diagram: START and END states connected through typed states (NP-SEGMENT, PREP), with one NP-SEGMENT labeled PROTEIN and another labeled LOCATION.]
Hierarchical HMMs for IE
(Part 2)
[Two phrase-level models: a positive model (START and END connected through NP-SEGMENT and PREP states, including the PROTEIN and LOCATION labeled segments) and a null model (START and END connected through unlabeled NP-SEGMENT and PREP states only).]
Hierarchical HMMs for IE
(Part 3)
[Word-level structure nested inside the phrase-level states: an unlabeled NP-SEGMENT expands to START → ALL → END, where ALL holds a word-emission distribution (e.g. Pr(the) = 0.0003, Pr(and) = 0.0002, …, Pr(cell) = 0.0001); the LOCATION-labeled NP-SEGMENT expands to word-level states BEFORE, LOCATION, BETWEEN, and AFTER between START and END.]
Hierarchical HMMs
[Diagram: the hierarchical model emitting ". . . is found in the ER": a VP-SEGMENT emits "is found" through its word-level ALL state, a PP-SEGMENT emits "in", and the LOCATION-labeled NP-SEGMENT emits "the ER" through its word-level states (BEFORE, LOCATION, BETWEEN, AFTER), with "ER" generated by the LOCATION word state.]
Extraction with our HMMs
• extract a relation instance if
– sentence is more probable
under positive model
– Viterbi (most probable)
path goes through special
extraction states
[Diagram: the positive model (START and END connected through NP-SEGMENT and PP-SEGMENT states, including the PROTEIN and LOCATION labeled segments) and the null model (unlabeled NP-SEGMENT and PP-SEGMENT states only).]
Representing More Local Context
• we can have the word-level states represent more about
the local context of each emission
• partition the sentence into overlapping trigrams
"... the/ART Bed1/UNK protein/N is/COP located/V ..."
each position contributes a trigram of word/tag pairs (w_{-1}, p_{-1}), (w_0, p_0), (w_{+1}, p_{+1}) for the previous, current, and next token, so adjacent trigrams overlap
Representing More Local Context
• states emit trigrams t = ⟨w_{-1}, w_0, w_{+1}, p_{-1}, p_0, p_{+1}⟩ with probability:
Pr(t) = Pr(w_{-1}) Pr(w_0) Pr(w_{+1}) Pr(p_{-1}) Pr(p_0) Pr(p_{+1})
• note the independence assumption above: we
compensate for this naïve assumption by using a
discriminative training method [Krogh ’94] to learn
parameters
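A small sketch of this factored emission probability (the per-position distribution layout is an assumption for illustration; computed in log space for stability):

```python
import math

def trigram_log_emission(state, trigram, floor=1e-6):
    """Log-probability of a state emitting ((w-1, p-1), (w0, p0), (w+1, p+1)).

    state["word"][pos] and state["pos"][pos], for pos in (-1, 0, +1), are
    independent distributions over words and part-of-speech tags (assumed layout).
    """
    total = 0.0
    for pos, (word, tag) in zip((-1, 0, 1), trigram):
        total += math.log(state["word"][pos].get(word, floor))
        total += math.log(state["pos"][pos].get(tag, floor))
    return total

# e.g. a toy state emitting the middle trigram of "... the Bed1 protein ..."
state = {
    "word": {-1: {"the": 0.007}, 0: {"bed1": 0.001}, 1: {"protein": 0.02}},
    "pos":  {-1: {"ART": 0.2},   0: {"UNK": 0.1},    1: {"N": 0.3}},
}
print(trigram_log_emission(state, [("the", "ART"), ("bed1", "UNK"), ("protein", "N")]))
```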
Experimental Evaluation
• hypothesis: we get more accurate models by using a
richer representation of sentence structure in HMMs
• compare predictive accuracy of various types of representations (listed from most to least grammatical information):
– hierarchical w/ context features
– hierarchical
– phrases
– tokens w/ part of speech
– tokens
• 5-fold cross validation on 3 data sets
Weakly Labeled Data Sets for
Learning to Extract Relations
• subcellular_localization(PROTEIN, LOCATION)
– YPD database
– 769 positive, 6193 negative sentences
– 939 tuples (402 distinct)
• disorder_association(GENE, DISEASE)
– OMIM database
– 829 positive, 11685 negative sentences
– 852 tuples (143 distinct)
• protein_protein_interaction(PROTEIN, PROTEIN)
– MIPS database
– 5446 positive, 41377 negative sentences
– 8088 tuples (819 distinct)
Extraction Accuracy (YPD)
[Precision-recall plot: precision = TP/(TP+FP) vs. recall = TP/(TP+FN), both from 0 to 1. Curves: Context HHMMs, HHMMs, Phrase HMMs, POS HMMs, Token HMMs.]
Extraction Accuracy (MIPS)
[Precision-recall plot, same axes and curves as above: Context HHMMs, HHMMs, Phrase HMMs, POS HMMs, Token HMMs.]
Extraction Accuracy (OMIM)
[Precision-recall plot, same axes and curves as above: Context HHMMs, HHMMs, Phrase HMMs, POS HMMs, Token HMMs.]
3. Combining Evidence
when Making Predictions
• in processing a large corpus, we are likely to see the
same entities, relations in multiple places
• in making extractions, we should combine evidence
across the different occurrences/contexts in which we see
an entity/relation
Combining Evidence:
Organizing Predictions into Bags
[Table: occurrences grouped into a bag for the name "CAT", each with a predicted and an actual label (the labels are not recoverable here): "CAT is a 64-kD protein…", "…the cat activated the mouse...", "CAT was established to be…", "…were removed from cat brains."]
let n_b be the number of instances in bag b,
    p_b the number of positive predictions, and
    a_b the number of actual positives
Combining Evidence when
Making Predictions
• given a bag of predictions, estimate the probability
that the bag contains at least one actual positive
example:
\Pr(a_b > 0 \mid n_b, p_b) = \frac{\sum_{j=1}^{n_b} \Pr(p_b \mid a_b = j, n_b)\,\Pr(a_b = j \mid n_b)}{\sum_{i=0}^{n_b} \Pr(p_b \mid a_b = i, n_b)\,\Pr(a_b = i \mid n_b)}
Combining Evidence:
Estimating Relevant Probabilities
(the same expression as on the previous slide, with its terms annotated)
• Pr(p_b | a_b = j, n_b): can be modeled with two binomial distributions based on the estimated TP-rate and FP-rate of the model
• Pr(a_b = j | n_b): can do something simple here (e.g. assume uniform priors), or estimate it from data with a few assumptions
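A small sketch of this computation under the binomial model just described (a uniform prior over a_b and fixed TP-rate/FP-rate estimates are assumptions for illustration):

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n independent trials."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_predictions(p_b, a, n_b, tp_rate, fp_rate):
    """Pr(p_b positive predictions | a actual positives among n_b instances):
    sum over how many of the positive predictions are true positives."""
    return sum(
        binom_pmf(k, a, tp_rate) * binom_pmf(p_b - k, n_b - a, fp_rate)
        for k in range(0, min(a, p_b) + 1)
    )

def prob_bag_positive(n_b, p_b, tp_rate=0.7, fp_rate=0.05):
    """Pr(a_b > 0 | n_b, p_b), assuming a uniform prior over a_b in {0, ..., n_b}."""
    prior = 1.0 / (n_b + 1)
    numer = sum(prob_predictions(p_b, j, n_b, tp_rate, fp_rate) * prior for j in range(1, n_b + 1))
    denom = sum(prob_predictions(p_b, i, n_b, tp_rate, fp_rate) * prior for i in range(0, n_b + 1))
    return numer / denom

# e.g. a bag of 4 occurrences, 2 of which were predicted positive
print(prob_bag_positive(n_b=4, p_b=2))
```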
Evidence Combination:
Protein-Protein Interactions
[Precision-recall plot comparing evidence-combination methods: BECIP, Soft-Count, Noisy-OR, WM, Soft-OR, NC.]
Evidence Combination:
Protein Names
[Precision-recall plot comparing the same evidence-combination methods: BECIP, Soft-Count, Noisy-OR, WM, Soft-OR, NC.]
Conclusions
• machine learning methods provide a means for
learning/refining models for information extraction
• learning is inexpensive when unlabeled/weakly labeled
sources can be exploited
– learning context patterns for protein names
– learning HMMs for relation extraction
• we can learn more accurate models by giving HMMs
more information about syntactic structure of
sentences
– hierarchical HMMs
• we can improve the precision of our predictions by
carefully combining evidence across extractions
Acknowledgments
my graduate students
Soumya Ray
Burr Settles
Marios Skounakis
NIH/NLM grant 1R01 LM07050-01
NSF CAREER grant IIS-0093016