Parsing A Bacterial Genome

Mark Craven
Department of Biostatistics & Medical Informatics
University of Wisconsin, U.S.A.
[email protected]
www.biostat.wisc.edu/~craven

The Task
Given: a bacterial genome
Do: use computational methods to predict a "parts list" of regulatory elements

Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training with "weakly" labeled examples
5. refining the structure of a stochastic context free grammar

The Central Dogma of Molecular Biology
[figure only]

Transcription in Bacteria
[figure only]

Operons in Bacteria
[diagram: promoter, genes, and terminator transcribed into a single mRNA]
• operon: sequence of one or more genes transcribed as a unit under some conditions
• promoter: "signal" in DNA indicating where to start transcription
• terminator: "signal" indicating where to stop transcription

The Task Revisited
Given:
– DNA sequence of the E. coli genome
– coordinates of known/predicted genes
– known instances of operons, promoters, terminators
Do:
– learn models from known instances
– predict a complete catalog of operons, promoters, and terminators for the genome

Our Approach: Probabilistic Language Models
1. write down a "grammar" for elements of interest (operons, promoters, terminators, etc.) and relations among them
2. learn probability parameters from known instances of these elements
3. predict new elements by "parsing" uncharacterized DNA sequence

Transformational Grammars
• a transformational grammar characterizes a set of legal strings
• the grammar consists of
– a set of abstract nonterminal symbols: S, C1, C2, C3, C4
– a set of terminal symbols (those that actually appear in strings): a, c, g, t
– a set of productions, e.g. C1 → t C2, C2 → a C3, C2 → g C4, C3 → a, C3 → g, C4 → a

A Grammar for Stop Codons
S → C1
C1 → t C2
C2 → a C3
C2 → g C4
C3 → a
C3 → g
C4 → a
• this grammar can generate the 3 stop codons: taa, tag, tga
• with a grammar we can ask questions like
– what strings are derivable from the grammar?
– can a particular string be derived from the grammar?

The Parse Tree for tag
• under this grammar, tag has the leftmost derivation S ⇒ C1 ⇒ t C2 ⇒ t a C3 ⇒ t a g
[parse tree: S at the root, then C1, then C2, with leaves t, a, g]

A Probabilistic Version of the Grammar
S → C1 (1.0)
C1 → t C2 (1.0)
C2 → a C3 (0.7)
C2 → g C4 (0.3)
C3 → a (0.2)
C3 → g (0.8)
C4 → a (1.0)
• each production has an associated probability
• the probabilities for productions with the same left-hand side sum to 1
• this grammar has a corresponding Markov chain model

A Probabilistic Context Free Grammar for Terminators
START → PREFIX STEM_BOT1 SUFFIX
PREFIX → B B B B B B B B B
SUFFIX → B B B B B B B B B
STEM_BOT1 → tl STEM_BOT2 tr
STEM_BOT2 → tl* STEM_MID tr* | tl* STEM_TOP2 tr*
STEM_MID → tl* STEM_MID tr* | tl* STEM_TOP2 tr*
STEM_TOP2 → tl* STEM_TOP1 tr*
STEM_TOP1 → tl LOOP tr
LOOP → B B LOOP_MID B B
LOOP_MID → B LOOP_MID | ε
B → a | c | g | u
where t = {a, c, g, u} and t* = {a, c, g, u, ε}
[diagram: an example terminator sequence with its prefix (c-u-c-a-a-a-g-g-), base-paired stem, loop, and suffix (-u-u-u-u-u-u-u-u) regions]

Inference with Probabilistic Grammars
• for a given string there may be many parses, but some are more probable than others
• we can do prediction by finding relatively high probability parses
• there are dynamic programming algorithms for finding the most probable parse efficiently
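To make the toy grammar above concrete, here is a minimal sketch (mine, not code from the talk) that exhaustively expands the probabilistic stop-codon grammar and reports the probability of each derivable string. Real parsers use the dynamic programming algorithms the slide mentions; exhaustive expansion is only feasible because this grammar is tiny.

```python
# Minimal illustrative sketch: score strings under the probabilistic
# stop-codon grammar shown above. Productions are (right-hand side, probability);
# uppercase symbols are nonterminals, lowercase symbols are terminals.

GRAMMAR = {
    "S":  [(("C1",), 1.0)],
    "C1": [(("t", "C2"), 1.0)],
    "C2": [(("a", "C3"), 0.7), (("g", "C4"), 0.3)],
    "C3": [(("a",), 0.2), (("g",), 0.8)],
    "C4": [(("a",), 1.0)],
}

def derivations(symbols, prob=1.0):
    """Recursively expand the leftmost nonterminal, yielding
    (terminal string, derivation probability) pairs."""
    for i, sym in enumerate(symbols):
        if sym in GRAMMAR:                       # leftmost nonterminal found
            for rhs, p in GRAMMAR[sym]:
                yield from derivations(symbols[:i] + rhs + symbols[i + 1:], prob * p)
            return
    yield "".join(symbols), prob                 # all symbols are terminals

if __name__ == "__main__":
    for string, p in derivations(("S",)):
        print(f"{string}: probability {p:.2f}")
    # prints: taa 0.14, tag 0.56, tga 0.30
```

Each string here has a single derivation; for an ambiguous grammar one would sum over derivations to get a string's total probability, or take the maximum to get its most probable parse.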
Learning with Probabilistic Grammars
• in this work, we write down the productions by hand, but learn the probability parameters
• to learn the probability parameters, we align sequences of a given class (e.g. terminators) with the relevant part of the grammar
• when there is hidden state (i.e. the correct parse is not known), we use Expectation Maximization (EM) algorithms

Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models [Bockhorst et al., ISMB/Bioinformatics '03]
4. augmenting training with "weakly" labeled examples
5. refining the structure of a stochastic context free grammar

A Model for Transcription Units
[diagram: the transcription-unit model, covering an untranscribed region and a transcribed region; it includes promoter components (-35 box, spacer, -10 box, post-promoter), the transcription start site (TSS), start/internal/last ORFs with intra-ORF and UTR segments, and rho-independent (RIT) and rho-dependent (RDT) terminators with prefix, stem-loop, and suffix parts; components are modeled with SCFGs, position-specific Markov models, and semi-Markov models]

The Components of the Model
• position-specific Markov models represent fixed-length sequence motifs:
  $\Pr(x_{i:j} \mid \theta) = \Pr(x_i \mid \theta_1) \prod_{l=i+1}^{j} \Pr(x_l \mid x_{l-1}, \theta_{l-i+1})$
• semi-Markov models represent variable-length sequences:
  $\Pr(x_{i:j} \mid \theta) = \Pr(L \mid \theta_L)\, \Pr(x_i \mid \theta_0) \prod_{l=i+1}^{j} \Pr(x_l \mid x_{l-1}, \theta_0)$
  where L = j - i + 1 is the length of the subsequence
• stochastic context free grammars (SCFGs) represent variable-length sequences with long-range dependencies
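As an illustration of the two Markov-model components just described, the sketch below scores a DNA subsequence under a position-specific Markov model and under a semi-Markov model with an explicit length distribution. All parameters here (the uniform transition tables, the toy length distribution, the example strings) are made-up placeholders, not the models trained for E. coli.

```python
import math

# Illustrative sketch of the two Markov-model components described above.
# All parameters are placeholders, not trained values.

DNA = "acgt"
UNIFORM = {a: {b: 0.25 for b in DNA} for a in DNA}   # placeholder transition table

def log_prob_position_specific(x, initial, transitions):
    """Fixed-length motif model: `initial` gives Pr(base) at the first position,
    and `transitions` holds one {prev_base: {base: prob}} table per later position."""
    assert len(x) == len(transitions) + 1
    logp = math.log(initial[x[0]])
    for l in range(1, len(x)):
        logp += math.log(transitions[l - 1][x[l - 1]][x[l]])
    return logp

def log_prob_semi_markov(x, length_dist, initial, transition):
    """Variable-length segment model: an explicit length distribution times a
    homogeneous first-order Markov chain over the bases."""
    L = len(x)
    logp = math.log(length_dist.get(L, 1e-9))         # Pr(L | theta_L)
    logp += math.log(initial[x[0]])                   # Pr(x_i | theta_0)
    for l in range(1, L):
        logp += math.log(transition[x[l - 1]][x[l]])  # Pr(x_l | x_{l-1}, theta_0)
    return logp

if __name__ == "__main__":
    motif = "tataat"                                  # example hexamer string only
    init = {a: 0.25 for a in DNA}
    per_position = [UNIFORM] * (len(motif) - 1)       # one table per transition
    print(log_prob_position_specific(motif, init, per_position))

    segment = "acgtacgtacg"
    lengths = {n: 1.0 / 20 for n in range(5, 25)}     # toy length distribution
    print(log_prob_semi_markov(segment, lengths, init, UNIFORM))
```

In the full model these component scores are combined by the parser when it segments the genome into promoters, ORFs, terminators, and the regions between them.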
Gene Expression Data
• in addition to DNA sequence data, we also use expression data to make our parses
• microarrays enable the simultaneous measurement of the transcription levels of thousands of genes
[figure: microarray data matrix with genes/sequence positions as rows and experimental conditions as columns]

Incorporating Expression Data
• our models parse two sequences simultaneously
– the DNA sequence of the genome
– a sequence of expression measurements associated with particular sequence positions
[figure: a DNA sequence aligned with its per-position expression measurements]
• the expression data is useful because it provides information about which subsequences look like they are transcribed together

Predictive Accuracy for Operons
[bar chart: sensitivity, specificity, and precision for models using sequence only, expression only, and sequence + expression]
sensitivity = TP / (TP + FN); specificity = TN / (TN + FP); precision = TP / (TP + FP)

Predictive Accuracy for Promoters
[bar chart: sensitivity, specificity, and precision for models using sequence only, expression only, and sequence + expression]

Predictive Accuracy for Terminators
[bar chart: sensitivity, specificity, and precision for models using sequence only, expression only, and sequence + expression]

Accuracy of Promoter & Terminator Localization
[figure]

Terminator Predictive Accuracy
[ROC curves, true positive rate TP / (TP + FN) vs. false positive rate FP / (FP + TN), comparing the SCFG, the SCFG without training, a complementarity matrix method, and an interpolated Markov model]

Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training data with "weakly" labeled examples [Bockhorst & Craven, ICML '02]
5. refining the structure of a stochastic context free grammar

Key Idea: Weakly Labeled Examples
• regulatory elements are inter-related
– promoters precede operons
– terminators follow operons
– etc.
• relationships such as these can be exploited to augment training sets with "weakly labeled" examples

Inferring "Weakly" Labeled Examples
[diagram: a genome region with genes g1 through g5 on the two strands]
• if we know that an operon ends at g4, then there must be a terminator shortly downstream
• if we know that an operon begins at g2, then there must be a promoter shortly upstream
• we can exploit relations such as this to augment our training sets

Strongly vs. Weakly Labeled Terminator Examples
• strongly labeled terminator: the sub-class (e.g. rho-independent), the end of the stem-loop, and the extent of the terminator are all annotated in the sequence
• weakly labeled terminator: the same sequence is known to contain a terminator, but its sub-class and extent are not annotated
[example sequence: gtccgttccgccactattcactcatgaaaatgagttcagagagccgcaagatttttaattttgcggtttttttgtatttgaattccaccatttctctgttcaatg]

Training the Terminator Models: Strongly Labeled Examples
[diagram: negative examples train the negative model, rho-independent examples train the rho-independent terminator model, and rho-dependent examples train the rho-dependent terminator model]

Training the Terminator Models: Weakly Labeled Examples
[diagram: weakly labeled examples, together with negative examples, are used to train a combined terminator model composed of the negative, rho-independent, and rho-dependent terminator models]

Do Weakly Labeled Terminator Examples Help?
• task: classification of terminators (both sub-classes) in E. coli K-12
• train the SCFG terminator model using:
– S strongly labeled examples and
– W weakly labeled examples
• evaluate using area under ROC curves

Learning Curves using Weakly Labeled Terminators
[plot: area under ROC curve vs. number of strong positive examples (0 to 140), with curves for 250, 25, and 0 weak examples]

Are Weakly Labeled Examples Better than Unlabeled Examples?
• train the SCFG terminator model using:
– S strongly labeled examples and
– U unlabeled examples
• vary S and U to obtain learning curves

Training the Terminator Models: Unlabeled Examples
[diagram: unlabeled examples are used to train a combined model composed of the negative, rho-independent, and rho-dependent terminator models]

Learning Curves: Weak vs. Unlabeled
[plots: area under ROC curve vs. number of strong positive examples, comparing 250, 25, and 0 weakly labeled examples against 250, 25, and 0 unlabeled examples]

Are Weakly Labeled Terminators from Predicted Operons Useful?
• train the operon model with S labeled operons
• predict operons
• generate W weakly labeled terminators from the W most confident predictions
• vary S and W

Learning Curves using Weakly Labeled Terminators
[plot: area under ROC curve vs. number of strong positive examples (0 to 160), with curves for 200, 100, 25, and 0 weak examples]

Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training with "weakly" labeled examples
5. refining the structure of a stochastic context free grammar [Bockhorst & Craven, IJCAI '01]

Learning SCFGs
• given the productions of a grammar, we can learn the probabilities using the Inside-Outside algorithm
• we have developed an algorithm that can add new nonterminals & productions to a grammar during learning
• basic idea:
– identify nonterminals that seem to be "overloaded"
– split these nonterminals into two; allow each to specialize

Refining the Grammar in a SCFG
• there are various "contexts" in which each grammar nonterminal may be used
• consider two contexts for the nonterminal w2: w1 → A w2 U and w1 → C w2 G
• the production probabilities estimated for w2 differ by context:
– in the first context:  w2 → A w3 U (0.4) | C w3 G (0.1) | G w3 C (0.1) | U w3 A (0.4)
– in the second context: w2 → A w3 U (0.1) | C w3 G (0.4) | G w3 C (0.4) | U w3 A (0.1)
• if the probabilities for w2 look very different depending on its context, we add a new nonterminal and specialize w2

Refining the Grammar in a SCFG
• we can compare two probability distributions P and Q using Kullback-Leibler divergence:
  $H(P \,\|\, Q) = \sum_i P(x_i) \log \frac{P(x_i)}{Q(x_i)}$
• here P and Q are the two context-specific production distributions for w2 shown above
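To make the split criterion concrete, here is a small sketch (an illustration, not the paper's implementation) that computes the KL divergence between the two context-specific production distributions for w2 shown above. The split threshold is an arbitrary placeholder, not a value from the work described here.

```python
import math

# Illustrative sketch: KL divergence between two context-specific production
# distributions for a nonterminal, used as a signal that it is "overloaded".

def kl_divergence(p, q):
    """H(P || Q) = sum_i P(x_i) * log(P(x_i) / Q(x_i)), over shared outcomes."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p)

# Production probabilities for w2 in the two contexts shown above.
P = {"A w3 U": 0.4, "C w3 G": 0.1, "G w3 C": 0.1, "U w3 A": 0.4}
Q = {"A w3 U": 0.1, "C w3 G": 0.4, "G w3 C": 0.4, "U w3 A": 0.1}

def should_split(p, q, threshold=0.5):
    """Flag the nonterminal for splitting when its two context-specific
    distributions diverge by more than an (assumed) threshold."""
    return kl_divergence(p, q) > threshold

if __name__ == "__main__":
    print(f"H(P||Q) = {kl_divergence(P, Q):.3f}")   # about 0.83 for these values
    print("split w2?", should_split(P, Q))
```

Note that KL divergence is asymmetric; which direction (or symmetrized combination) to use, and where to set the threshold, are design choices not specified in this transcript.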
Learning Terminator SCFGs
• extracted a grammar from the literature (~120 productions)
• data set consists of 142 known E. coli terminators and 125 sequences that do not contain terminators
• learn parameters using the Inside-Outside algorithm (an EM algorithm)
• consider adding nonterminals guided by three heuristics
– KL divergence
– chi-squared
– random

SCFG Accuracy After Adding 25 New Nonterminals
[ROC curves, true positive rate vs. false positive rate, for grammars refined using the chi-square, KL divergence, and random heuristics, compared with the original grammar]

SCFG Accuracy vs. Nonterminals Added
[plot: area under ROC curve vs. number of additional nonterminals (0 to 25) for the chi-square, KL divergence, and random heuristics]

Conclusions
• summary
– we have developed an approach to predicting transcription units in bacterial genomes
– we have predicted a complete set of transcription units for the E. coli genome
• advantages of the probabilistic grammar approach
– can readily incorporate background knowledge
– can simultaneously get a coherent set of predictions for a set of related elements
– can be easily extended to incorporate other genomic elements
• current directions
– expanding the vocabulary of elements modeled (genes, transcription factor binding sites, etc.)
– handling overlapping elements
– making predictions for multiple related genomes

Acknowledgements
• Craven Lab: Joe Bockhorst, Keith Noto
• David Page, Jude Shavlik
• Blattner Lab: Fred Blattner, Jeremy Glasner, Mingzhu Liu, Yu Qiu
• funding from the National Science Foundation and the National Institutes of Health