Pattern Analysis in Biology
Timothy L. Bailey
Institute for Molecular Bioscience
University of Queensland

Overview
Pattern Analysis: Converting data into knowledge
Objectives of Biological Pattern Analysis
Elements of Pattern Discovery
– Sequence pattern example

Pattern Analysis: Converting data into knowledge
The purpose of pattern analysis is to convert data into knowledge.
Pattern: … order or form discernible in things, actions, ideas, situations, etc. (Oxford English Dictionary)
We analyze patterns to give form and structure to knowledge and to make predictions from data.

Discovery and Search
Pattern discovery involves constructing a model of a biological signal, process or interaction from data.
Pattern search involves looking for data that fits a given model in order to make predictions.

The Components of Pattern Analysis
Pattern analysis starts with three major questions:
The data:
– What kind of data am I looking for patterns in?
The pattern language:
– How will I describe or model the patterns?
The learning algorithm:
– What algorithms exist for searching for patterns in this language?

Data
Pattern analysis is applied to many types of biological data:
Sequence (DNA, RNA, protein)
Structural (protein 2-D and 3-D)
Expression (mRNA levels)
Literature (text)

Pattern Language
Different types of pattern languages are used to represent patterns of different types:
Sequence models: regular expressions, hidden Markov models, stochastic context-free grammars
Structural models: 3-D coordinates
Phylogeny models: trees and cladograms
Network models: boolean networks
General models: linear and non-linear equations, artificial neural networks (ANN), support vector machines (SVM)

Learning algorithms
The process of finding the model that best fits given data, or that optimizes some objective, is often referred to as “learning”.
Optimization algorithms: simulated annealing, genetic algorithms, backpropagation in neural networks
Clustering algorithms: k-means clustering, self-organizing maps
Statistical learning algorithms: expectation maximization (EM), forward-backward, Gibbs sampling
Heuristic search: branch-and-bound, suffix trees, Tabu search, nearest neighbor

Categories of learning algorithms
Learning algorithms fall into two broad categories:
Supervised learning: the “training” data is “labeled” with the features that the model will be used to predict.
– Classification
– Regression
Unsupervised learning: unlabeled training data is used, and clusters or “surprising” patterns are sought.
– Clustering
– Pattern discovery

Objectives of Pattern Analysis
Patterns in biological data can be used to describe and predict, among other things, the properties and relationships of genes, proteins and species. These include:
– Evolution
– Structure
– Function
– Regulation
– Interaction networks

Evolution
How are current species (or genes or proteins) related evolutionarily?
What kind of reorganizations have chromosomes undergone over evolutionary time?
What speciation and gene duplication events have occurred?

Structure
What is the 3-D structure of a particular protein?
(Figure: Buffalo Center of Bioinformatics)

Function
What protein-protein and protein-DNA interactions does a protein or DNA molecule engage in?
Which amino acid residues or DNA bases are involved in the interactions?

Regulation
What transcription factors (proteins etc.) and DNA binding sites are involved in the regulation of the transcription of a particular gene?
How are the signals arranged along the chromosome?
(Figure: Wasserman and Sandelin)

Interaction networks
What enzymes and substrates are involved in a particular metabolic pathway?
What is the network of interactions of genes involved in development?

Elements of Pattern Discovery
Pattern discovery requires:
A pattern language
– This defines what kind of patterns you can find.
(Models are described in the pattern language.)
An objective function
– This defines what makes a pattern “interesting”.
An algorithm
– This defines how to search among the possible patterns to find the “interesting” ones.
Pattern search is generally much simpler: it only requires computing the objective function.

Pattern discovery example: sequence patterns
We will illustrate pattern discovery using sequence pattern examples.
Sequence patterns in protein, DNA and RNA
Sequence pattern languages
Objective functions for sequence patterns
Learning algorithms for sequence patterns

Protein sequence patterns: the “leucine zipper”
Pattern: L-X(6)-L-X(6)-L-X(6)-L
The leucine side chains extending from one alpha helix interact with those from a similar alpha helix of a second polypeptide, facilitating dimerization.

DNA sequence patterns: a protein-coding gene

Patterns in RNA sequences
Human RNA splice junctions sequence matrix
http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html

Higher-order sequence patterns
Cis-regulatory modules often involve clusters of binding sites for one or more transcription factors (e.g., Drosophila EVE gene).
Clusters of 13 or more pattern matches in a window of 700 bp.

Elements of Pattern Discovery
Pattern discovery requires:
A pattern language
– This defines what kind of patterns you can find.
An objective function
– This defines what makes a pattern “interesting”.
An algorithm
– This defines how to search among the possible patterns to find the “interesting” ones.
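
As a minimal sketch of pattern search with a regular-expression pattern language, the leucine zipper pattern L-X(6)-L-X(6)-L-X(6)-L from the earlier slide can be translated into a Python regular expression and scanned against a sequence. The helper names (`prosite_to_regex`, `find_matches`, `has_cluster`) and the toy protein are hypothetical, made up for illustration, and the translator handles only the small subset of pattern syntax that appears in these slides. The cluster check mirrors the "13 or more matches in a window of 700 bp" criterion under the assumption that match start positions are what is counted.

```python
import re

def prosite_to_regex(pattern):
    """Translate a small subset of PROSITE-style syntax into a Python regex.

    Supported elements (separated by '-'): a residue letter, a residue set
    such as [LIV], and the wildcard X, each optionally followed by a repeat
    count (n) or range (m,n).
    """
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"([A-Za-z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = m.groups()
        token = "." if symbol.upper() == "X" else symbol.upper()
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)

def find_matches(pattern, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # Wrapping the regex in a lookahead lets finditer report overlapping matches.
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence)]

def has_cluster(positions, window=700, min_matches=13):
    """True if some window of `window` positions holds >= `min_matches` match starts."""
    positions = sorted(positions)
    for i in range(len(positions) - min_matches + 1):
        if positions[i + min_matches - 1] - positions[i] < window:
            return True
    return False

LEUCINE_ZIPPER = "L-X(6)-L-X(6)-L-X(6)-L"
toy_protein = "MK" + "LAAAAAA" * 3 + "L" + "GG"  # made-up sequence, not real data

print(prosite_to_regex(LEUCINE_ZIPPER))
print(find_matches(LEUCINE_ZIPPER, toy_protein))
```

Here the objective function is the simplest one possible (perfect matches only), so pattern search reduces to running the compiled expression over the data.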
Sequence pattern description languages
Regular expressions
Profiles
Hidden Markov Models (HMMs)
Motif-based HMMs

Regular expressions define sets of sequences that they match
Sp1 binds to DNA via 3 zinc-finger binding domains: C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H
These particular domains recognize Sp1 binding sites: GRGGCRGGW
(Figure: transcription factor Sp1 binding to DNA)

Profiles are more powerful than regular expressions
Regular expressions do not capture the statistics of the variation in sequence patterns; they only tell you which letters are permissible at each position in the pattern.
Profiles capture the frequency of each letter at each position in the pattern, so you can tell how well a potential site matches the pattern (the site’s probability).

Profiles are built from multiple alignments of instances of a pattern
Example: nuclear hormone receptor transcription factor binding site profile derived from experimentally determined sites.
Observed counts can be converted to frequencies by dividing by the number of observed instances.
So profiles are probabilistic models of sequence patterns.
(Figure: counts of the number of times each letter is observed at each position in the pattern.)

Hidden Markov Models
HMMs are statistical models (like profiles) that assign a probability (score) to any (sub-)sequence they are presented with.
HMMs can model whole sequences or domains (e.g., PFAM models of protein domains).
HMMs can also be built from one or more profiles to model groups of interacting patterns (e.g., cis-regulatory modules).

A “motif” HMM
Each box is a “state” and recognizes letters with the probabilities in the vertical rectangles.
This simple HMM is equivalent to a profile.
State  A   C   G   T
1      .7  .1  .1  .1
2      .1  .0  .0  .9
3      .1  .1  .8  .0
4      .0  .9  .0  .1
5      .1  .0  .0  .9

A motif-based HMM for recognizing cis-regulatory modules
This HMM can recognize (sub-)sequences consisting of one or more motifs separated by “gaps” (sequence of unknown function).
(Figure legend: motif states, complemented motifs, non-emitting states, emitting gap states, free transitions. Figure labels: within-cluster gap, +1, -1, +2, -2, between-cluster gap.)

Elements of Pattern Discovery
Pattern discovery requires:
A pattern language
– This defines what kind of patterns you can find.
An objective function
– This defines what makes a pattern “interesting”.
An algorithm
– This defines how to search among the possible patterns to find the “interesting” ones.

Objective functions for Regular Expression Patterns
Possible objective functions are:
– Perfect matches only (no mismatches)
– Allow a given number of mismatches
– Allow a given density of mismatches (or wildcards)
To be interesting, the pattern must occur a certain minimum number of times in the data.

Objective functions for profiles and HMMs
Profile- and HMM-based patterns are usually ranked by statistical or information-theoretic measures:
– Likelihood ratio
– Information content
– Maximum a posteriori probability

Example for motif HMMs: the likelihood ratio
Use the HMM to compute the likelihood of the data: Pr(data | motif)
Use a “background” model to compute the likelihood of the data under the background model: Pr(data | background)
The likelihood ratio is: Pr(data | motif) / Pr(data | background)

Elements of Pattern Discovery
Pattern discovery requires:
A pattern language
– This defines what kind of patterns you can find.
An objective function
– This defines what makes a pattern “interesting”.
An algorithm
– This defines how to search among the possible patterns to find the “interesting” ones.

Motif discovery algorithms
The goal is to find a set of sites (or a motif model) that maximizes the objective function.
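
To make the objective function concrete, the likelihood ratio Pr(data | motif) / Pr(data | background) can be sketched for a single candidate site, using the five-state motif HMM from the earlier slide as a profile. Several things here are assumptions rather than part of the slides: the emission values for state 4 are read as A .0, C .9, G .0, T .1; the background is taken to be uniform (0.25 per base); and the function names and the test sequence are made up for illustration.

```python
import math

# Emission probabilities of the five-state motif HMM (equivalent to a profile).
# State 4 is read as A .0, C .9, G .0, T .1: an assumption about the slide values.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.0, "G": 0.0, "T": 0.9},
    {"A": 0.1, "C": 0.1, "G": 0.8, "T": 0.0},
    {"A": 0.0, "C": 0.9, "G": 0.0, "T": 0.1},
    {"A": 0.1, "C": 0.0, "G": 0.0, "T": 0.9},
]
BACKGROUND = {base: 0.25 for base in "ACGT"}  # assumed uniform background model

def likelihood_ratio(site, motif=MOTIF, background=BACKGROUND):
    """Return Pr(site | motif) / Pr(site | background) for one motif-width site."""
    if len(site) != len(motif):
        raise ValueError("site must have the same width as the motif")
    pr_motif = math.prod(column[base] for column, base in zip(motif, site))
    pr_background = math.prod(background[base] for base in site)
    return pr_motif / pr_background

def best_site(sequence, motif=MOTIF):
    """Scan every window of the sequence; return (start, site, ratio) of the best one."""
    width = len(motif)
    scored = [
        (likelihood_ratio(sequence[i:i + width], motif), i)
        for i in range(len(sequence) - width + 1)
    ]
    ratio, start = max(scored)
    return start, sequence[start:start + width], ratio

print(likelihood_ratio("ATGCT"))   # the most probable site under this profile
print(best_site("GGATGCTCC"))      # made-up sequence containing that site
```

For a whole set of sites, the same ratio would be multiplied over all sites (or its logarithm summed), which is the quantity a discovery algorithm then tries to maximize.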
Motif discovery algorithms for finding sequence motifs mostly use either EM (Expectation Maximization) or Gibbs sampling.
Gibbs sampling is a bit easier to visualize, so the following slides illustrate it via the AlignACE algorithm (by G. M. Church).

AlignACE Example: Input Data Set
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.

AlignACE Example: The Target Motif
(Same seven input sequences as on the Input Data Set slide.)
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
MAP score = 20.37 (maximum)

AlignACE Example: Initial Seeding
(Same seven input sequences as on the Input Data Set slide.)
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
MAP score = -10.0

AlignACE Example: Sampling
(Same seven input sequences as on the Input Data Set slide.)
Add?
Current alignment:
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
How much better is the alignment with this site as opposed to without?
TCTCTCTCCA
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********

AlignACE Example: Continued Sampling
(Same seven input sequences as on the Input Data Set slide.)
Add? Remove.
Current alignment:
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
How much better is the alignment with this site as opposed to without?
ATGAAAAAAT
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********

AlignACE Example: Continued Sampling
(Same seven input sequences as on the Input Data Set slide.)
Add?
Current alignment:
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
How much better is the alignment with this site as opposed to without?
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********

AlignACE Example: Column Sampling
(Same seven input sequences as on the Input Data Set slide.)
Current alignment:
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
How much better is the alignment with this new column structure?
GACATCGAAAC
GCACTTCGGCG
GAGTCATTACA
GTAAATTGTCA
CCACAGTCCGC
TGTGAAGCACA
********* *

AlignACE Example: The Best Motif
(Same seven input sequences as on the Input Data Set slide.)
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
MAP score = 20.37

Conclusion
Pattern analysis is an important tool for making sense of the large amounts of data being generated in biology laboratories.
As pattern-description languages and machine learning algorithms improve, pattern analysis will become increasingly useful.

Learning methods in today’s talks
Supervised methods:
– Using support vector machines for classification and regression
Unsupervised methods:
– Discovering and using sequence patterns
– Using artificial neural networks and genetic algorithms to discover patterns
– The generalized Gibbs sampler
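
The sampling steps walked through in the AlignACE example (seed sites at random, then repeatedly re-sample a site given the rest of the alignment) can be sketched in a much simplified form. This is not AlignACE itself: the sketch assumes exactly one site per sequence, a fixed motif width, a uniform background, no column sampling and no MAP score, and all function names and the toy sequences are made up for illustration.

```python
import math
import random

def profile_without(sequences, starts, width, skip, pseudocount=1.0):
    """Per-column letter frequencies from all current sites except sequence `skip`."""
    counts = [{base: pseudocount for base in "ACGT"} for _ in range(width)]
    for index, (seq, start) in enumerate(zip(sequences, starts)):
        if index == skip:
            continue
        for column, base in enumerate(seq[start:start + width]):
            counts[column][base] += 1
    # Convert counts to frequencies by dividing by the column totals.
    return [
        {base: column[base] / sum(column.values()) for base in "ACGT"}
        for column in counts
    ]

def gibbs_site_sampler(sequences, width, iterations=200, seed=0):
    """Simplified Gibbs sampler: one site per sequence, uniform background."""
    rng = random.Random(seed)
    # Initial seeding: pick a random site in every sequence.
    starts = [rng.randrange(len(seq) - width + 1) for seq in sequences]
    for _ in range(iterations):
        for i, seq in enumerate(sequences):
            # Build the profile from every site except the one being re-sampled.
            profile = profile_without(sequences, starts, width, skip=i)
            weights = []
            for start in range(len(seq) - width + 1):
                site = seq[start:start + width]
                # Likelihood ratio of this candidate site versus background.
                weights.append(
                    math.prod(profile[c][b] / 0.25 for c, b in enumerate(site))
                )
            # Sample the new position in proportion to its weight (not argmax).
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Made-up toy sequences, each containing a planted copy of TGACTCA.
toy_sequences = ["ACGTTGACTCAGG", "TTGACTCATTACG", "GGCATGACTCAAC", "CATGACTCAGGTT"]
starts = gibbs_site_sampler(toy_sequences, width=7)
for seq, start in zip(toy_sequences, starts):
    print(start, seq[start:start + 7])
```

Sampling in proportion to the weights, rather than always taking the best-scoring site, is what lets the procedure move away from a poor random seeding; because it is stochastic, different seeds can end at different alignments.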