* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Motifs and motif prediction methods I - BIDD
Survey
Document related concepts
Cre-Lox recombination wikipedia , lookup
Human genome wikipedia , lookup
Primary transcript wikipedia , lookup
Genetic code wikipedia , lookup
Deoxyribozyme wikipedia , lookup
Non-coding DNA wikipedia , lookup
Genome editing wikipedia , lookup
Point mutation wikipedia , lookup
Metagenomics wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Nucleic acid tertiary structure wikipedia , lookup
Computational phylogenetics wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Helitron (biology) wikipedia , lookup
Smith–Waterman algorithm wikipedia , lookup
Transcript
CZ5226: Advanced Bioinformatics Lecture 4: Motifs and methods for generating motifs Prof. Chen Yu Zong Tel: 6874-6877 Email: [email protected] http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, National University of Singapore What is a motif? • A motif is a sequence pattern that occurs repeatedly in a group of related DNA or RNA or protein or peptide sequences. 2 Types of motifs and what they mean • Motifs in protein sequences – Structure, function, evolution • Motifs in DNA and RNA sequences – Promoters, transcription factor binding sites, splicing signals • Motifs in MHC-binding peptides – Anchor residue positions, TCR recognition residues 3 Motifs in Protein Sequences • The leucine zipper may explain how some eukaryotic gene regulatory proteins work. • L-x(6)-L-x(6)-L-x(6)-L • The leucine side chains extending from one alpha-helix interact with those from a similar alpha helix of a second polypeptide, facilitating dimerization 4 Motifs in DNA Sequences 5 Motifs in DNA Sequences • Promoter regions, e.g. TATA box • Transcription factor binding sites, e.g. Eve in Drosophila: G-G-T-C-C-T-G-G • Cis-Regulatory regions 6 Motifs in RNA sequences Human RNAsplice junctions sequence matrix http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html 7 Motifs in Protein Structures • Protein structure patterns can encode information about protein function. • Structure motifs can be used to improve multiple alignments of protein sequences. 8 Active site recognition EXAMPLE: CATHEPSIN A PEPTIDASE FAMILY S10 EC # 3.4.16.5 3-D representation 3D profile (PROCAT) 9 1ac5 438LTFVSVYNASHMVPFDKS455 1ivy 10 419IAFLTIKGAGHMVPTDKP436 Motifs in MHCBinding Peptide 11 Motifs in MHC Binding Peptides 12 Motifs in MHC Binding Peptides 13 What is the goal and method of motif detection? • Perform local multiple sequence alignment to find consensus sequences and common sequence patterns (motifs) 14 Macromolecular motif recognition 1-D representation: Primary amino acid sequence MIRAAPPPLFLLLLLLLLLVSWASRGEAAPDQDEIQRLPGLAKQPSFRQYSGYLKSSGSKHLHYWFVESQKDPE NSPVVLWLNGGPGCSSLDGLLTEHGPFLVQPDGVTLEYNPYSWNLIANVLYLESPAGVGFSYSDDKFYATNDTE VAQSNFEALQDFFRLFPEYKNNKL... Computational sequence analysis Query secondary databases over the Internet http://www.ebi.ac.uk/interpro/ 15 The Three Elements of Pattern Discovery Pattern discovery requires: • A pattern language – This defines what kind of patterns you can find. • An objective function – This defines what makes a pattern “interesting”. • An algorithm – This defines how to search among the possible patterns to find the “interesting” ones. 16 The Three Elements of Pattern Discovery Pattern discovery requires: • A pattern language – This defines what kind of patterns you can find. • An objective function – This defines what makes a pattern “interesting”. • An algorithm – This defines how to search among the possible patterns to find the “interesting” ones. 17 Pattern Description Languages • Regular expressions • Profiles • Hidden Markov Models (HMMs) – Motif HMMs – Motif-based HMMs 18 Macromolecular motif recognition single motif exact regular expression (PROSITE) full domain alignment profile (PROSITE) residue frequency multiple motifs Hidden Markov Model (Pfam, PROSITE) matrices (PRINTS) 19 Regular Expressions • Regular expressions can be used to describe sequence motifs. • They use a simple syntax to describe patterns. • An example protein pattern: [DENG]-x-[DEN]-x(0,2)-[DENQK]-[LIVFY] 20 Regular expressions contd. Basic rules for regular expressions • Each position is separated by a hyphen “-” • A symbol X is a regular expression matching itself • x means ‘any residue’ • [ ] surround ambiguities - a string [XYZ] matches any of the enclosed symbols • A string [R]* matches any number of strings that match • { } surround forbidden residues • ( ) surround repeat counts Model formation •Restricted to key conserved features in order to reduce the “noise” level •Built by hand in a stepwise fashion from multiple alignments 21 Motif modelling methods Prosite: Regular expressions CARBOXYPEPT_SER_HIS [LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-[SG]-H-x[IVAQ]-P-x(3)-[PSA] Regular expressions represent features by logical combinations of characters. A regular expression defines a sequence pattern to be matched. 22 Regular expressions contd. Regular expressions, such as PROSITE patterns, are matched to primary amino acid sequences using finite state automata. “all-or-none” 23 Profiles • Profiles give weights for each letter. • Example from TRANSFAC: NF-kappab1 G G G G A T Y C C C A 0 0 0 2 16 0 0 0 0 0 C 0 0 0 0 1 0 7 16 18 17 G 18 18 18 16 0 3 1 0 0 1 T 0 0 0 0 1 15 10 2 0 0 24 Profiles • Profiles are usually created by aligning multiple instances of the motif. • Example: nuclear hormone receptor transcription factor binding site. 25 Motif modelling methods Prints: Residue frequency matrices Motif 1 NPESWTNFANMLW NPYSWVNLTNVLW REYSWHQNHHMIY NEGSWISKGDLLF NPYSWTNLTNVVY NEYSWNKMASVVY NDFGWDQESNLIY NENSWNNYANMIY NEYGWDQVSNLLY NPYAWSKVSTMIY NPYSWNGNASIIY NEYAWNKFANVLF NPYSWNRVSNILY NPYSWNLIANVLY NEYRWNKVANVLF LDQPFGTGYSQ VDNPVGAGFSY VDQPVGTGFSL VDQPGGTGFSS IDNPVGTGFSF IDQPTGTGFSV VDQPLGTGYSY IDQPAGTGFSP LESPIGVGFSY LDQPVGSGFSY LDQPVGSGFSY LDQPINTGFSN LDQPIGAGFSY LDAPAGVGFSY LDQPVGAGFSY Motif Motif 2 3 FFQHFPEYQTNDFHIAGESYAGHYIP Motif 4 FFNKFPEYQNRPFYITGESYGGIYVP WVERFPEYKGRDFYIVGESYAGNGLM FLSKFPEYKGRDFWITGESYAGVYIP WFQLYPEFLSNPFYIAGESYAGVYVP FFEAFPHLRSNDFHIAGESYAGHYIP FFRLFPEYKDNKLFLTGESYAGIYIP FLTRFPQFIGRETYLAGESYGGVYVP FFNEFPQYKGNDFYVTGESYGGIYVP WMSRFPQYQYRDFYIVGESYAGHYVP FFRLFPEYKNNKLFLTGESYAGIYIP FFRLFPEYKNNKLFLTGESYAGIYIP WLERFPEYKGREFYITGESYAGHYVP WMSRFPQYRYRDFYIVGESYAGHYVP WFEKFPEHKGNEFYIAGESYAGIYVP LAFTLSNSVGHMAP LQFWWILRAGHMVA LMWAETFQSGHMQP LTYVRVYNSSHMVP LQEVLIRNAGHMVP LTFVSVYNASHMVP LTFARIVEASHMVP LTFSSVYLSGHEIP IDVVTVKGSGHFVP MTFATIKGSGHTAE MTFATIKGGGHTAE FGYLRLYEAGHMVP MTFATVKGSGHTAE ITLISIKGGGHFPA MTFATVKGSGHTAE •a collection of protein “fingerprints” that exploit groups of motifs to build characteristic family signatures •motifs are encoded in ungapped ”raw” sequence format •different scoring methods may be superimposed onto the data, e. .g. BLAST •improved diagnostic reliability •mutual context provided by motif neighbours 26 Motif modelling methods Prosite: Profiles Feature is represented as a matrix with a score for every possible character. Matrix is derived from a sequence alignment, e.g.: F F Y F F L K K P P K E L A I V V F L F V V L I S G G K A S H Q Q E A E C T E A V C L M L I I I L F L L A I V Q G K D Q 27 Profiles contd. Derived matrix: A C D E F G H I K L M N P Q R S T V W Y -18 -22 -35 -27 60 -30 -13 3 -26 14 3 -22 -30 -32 -18 -22 -10 0 9 34 -10 -33 0 15 -30 -20 -12 -27 25 -28 -15 -6 24 5 9 -8 -10 -25 -25 -18 -1 -18 -32 -25 12 -28 -25 21 -25 19 10 -24 -26 -25 -22 -16 -6 22 -18 -1 -8 -18 -33 -26 14 -32 -25 25 -27 27 14 -27 -28 -26 -22 -21 -7 25 -19 1 8 -22 -7 -9 -26 28 -16 -29 -6 -27 -17 1 -14 -9 -10 11 -5 -19 -25 -23 -3 -26 6 23 -29 -14 14 -23 4 -20 -10 8 -10 24 0 2 -8 -26 -27 -12 Alignment positions 3 22 -17 -9 -15 -23 -22 -8 -15 -9 -9 -15 -22 -16 -18 -1 2 6 -34 -19 -10 -24 -34 -24 4 -33 -22 33 -27 33 25 -24 -24 -17 -23 -24 -10 19 -20 0 -2 -19 -31 -23 12 -27 -23 19 -26 26 12 -24 -26 -23 -22 -19 -7 16 -17 0 -8 -7 0 -1 -29 -5 -10 -23 0 -21 -11 -4 -18 7 -4 -4 -11 -16 -28 -18 28 Profiles contd. •Inclusion of all possible information to maximise overall signal of protein/domain i. e., a full representation of features in the aligned sequences •Able to detect distant relationships with only few well conserved residues • •Position-dependent weights/penalties for all 20 amino acids, gaps, insertions •Dynamic programming algorithms for scoring hits 29 Hidden Markov Models (HMM) • HMMs generalize the idea of a profile. • They can model insertions and deletions in the sequence as well as the letters at conserved positions. • Profiles can be seen as simple HMMs. 30 Macromolecular motif recognition Pfam and Prosite: Hidden Markov Models (HMMs) •Feature is represented by a probabilistic model of interconnecting match, delete or insert states •contains statistical information on observed and expected positional variation - “platonic ideal of protein family” Di Ii B Mi E 31 HMM example A possible HMM for the sequence “ACCY” which is represented as a sequence of probabilities. The probability of ACCY is shown as a highlighted path through the model. P that an amino acid occurs in a particular state P of transition state 32 Motif-Based HMMs Motif-based HMMs are sequence models made by combining one or more motif models. Motif HMM: Motifs are modeled as profile HMMs without delete or insert states. Motif HMM M1 M2 M3 M4 M5 33 A Simple Motif-Based HMM • Adding emitting states with self-loops, plus start and end states, turns a motif HMM into a sequence model. • The HMM below models sequences with one occurrence of the motif. Sequence HMM Start Left Flank M1 M2 M3 M4 M5 Right Flank End 34 Motif-Based HMM for Modeling Cis-regulatory Regions With two or more motif models we can make more complicated motif-based HMMs. This sequence model captures motifs on the + and – strand of DNA. It does not capture the order of the motifs. 35 The Three Elements of Pattern Discovery Pattern discovery requires: • A pattern language – This defines what kind of patterns you can find. • An objective function – This defines what makes a pattern “interesting”. • An algorithm – This defines how to search among the possible patterns to find the “interesting” ones. 36 Objective functions for Regular Expression Patterns • Possible objective functions are: – Perfect matches only (no mismatches) – Allow a given number of mismatches – Allow a given density of mismatches (or wildcards). • To be interesting, the motif must occur a certain minimum number of times in the data. 37 Objective functions for profiles and HMMs • Profile- and HMM-based motifs are usually ranked by statistical or informationtheoretical measures: – Likelihood ratio (eg, forward-backward) – Information content (relative entropy) – Maximum a posteriori probability 38 Example for profiles: the likelihood ratio • Use the profile to compute the likelihood of the data: Pr(data | profile) • Use the background model to compute the likelihood of the data under the background model: Pr(data | bkgrnd) • The likelihood is: Pr(data | profile) / Pr(data | bkgrnd) 39 Objective functions for protein structure patterns • Structure motifs are usually evaluated based on the RMS distance – between the pattern and each instance, or, – among all the instances of the pattern. 40 The Three Elements of Pattern Discovery Pattern discovery requires: • A pattern language – This defines what kind of patterns you can find. • An objective function – This defines what makes a pattern “interesting”. • An algorithm – This defines how to search among the possible patterns to find the “interesting” ones. 41 Algorithms for discovering sequence motifs • Regular expression searches enumerate or use seeds. • Profile/HMM algorithms use Gibbs sampling or Expectation Maximization (EM). Forward-Backward is a form of EM. 42 Regular Expression Discovery: a simple algorithm • Look for DNA 16-mers where (up to) one wild card is allowed in the pattern: – E.g., “T-A-C-X-G-T-A-G-G-C-C-T-A-G-T-T” 16 11 5 1 . 5 10 • There are possible patterns—a big number. • Idea: Instead of enumerating the possible patterns and counting, just update the counts of appropriate patterns for each 16-mer that actually occurs in the data. 43 Regular Expression Discovery: a simple algorithm (cont’d) • Run a window of width 16 along the data and, for each 16-mer in the data, e.g. “AGGGTAAAAGCCCCCT”, update the counts of the exact match pattern and each pattern with one wildcard: A-G-G-G-T-A-A-A-A-G-C-C-C-C-C-T, X-G-G-G-T-A-A-A-A-G-C-C-C-C-C-T, A-X-G-G-T-A-A-A-A-G-C-C-C-C-C-T, etc. 44 Profile discovery algorithms • Profile discovery algorithms for finding sequence motifs mostly use either EM (Expectation Maximization) or Gibbs sampling. 45 What is Gibbs sampling? • Stochastic optimization method • Works well with local multiple alignment without gaps (motif searching) • Searches for the statistically most probable motifs by sampling random positions instead of going through entire search space 46 What is the program going to do? 1. Ask user for : a) file containing multiple DNA or protein sequences b) motif width c) how many motifs wanted 2. Calculate the background frequencies of A,C,G,T from all the sequences. [0.34951456310679613, 0.17799352750809061, 0.21035598705501618, 0.23300970873786409] 47 What is the program going to do? 3. Generate random start positions for the motif in each sequence. Example: 10 sequences, 30 bp in length, motif width of 7 start = [2, 6, 9, 14, 5, 7, 20, 20, 6, 22] >> random.uniform(0,ceiling) where ceiling=len(sequence)-width 48 What is the program going to do? 4. Construct position specific score matrix from all sequences except one. Motif Position A C G T 0 1 2 3 4 5 6 0.6 0 0.7 0 0.5 0.1 0.1 0 0.9 0.2 0.2 0 0.3 0 0.3 0 0 0.7 0.1 0 0.6 0 0 0 0 0.3 0.5 0.2 49 What is the program going to do? 5. Score the left-out sequence according to the position specific score matrix: score width k 0 i{A,C,G,T } Pki Pi where Pki probability of seeing base i at position k and Pi probability of seeing base i as background 50 What is the program going to do? Example: Use the position specific matrix and background from before: [A: 0.34951456310679613, C: 0.17799352750809061, G: 0.21035598705501618, T: 0.23300970873786409] Motif Position A C G T 0 1 2 3 4 5 6 0.6 0 0.7 0.2 0.5 0.1 0.1 0 0.9 0.2 0.1 0.3 0.3 0 0.3 0 0 0.7 0.1 0 0.6 0 0 0 0 0.1 0.5 0.2 GATTACA: 0.3 0 0 0 0.5 0.3 0.1 4.81 0.21 0.35 0.23 0.23 0.35 0.18 0.35 51 What is the program going to do? 6. Randomly generate another start position of the motif for that left-out sequence. 7. Score that sequence with its new start position. 8. Compare this new score with its original score. 9. If newscore >= oldscore, then jump to that new start position, else jump to that new start position with probability = newscore oldscore 52 What is the program going to do? 10. Start all over again with this updated start position with another sequence left out Do this many many times! ~ 1000 iterations Gibbs will converge to a stationary distribution of the start positions => a probable alignment of the multiple sequences 53 Gibbs Sampling Algorithm 54 Gibbs Sampling Algorithm (cont’d) 55 Gibbs Sampling Algorithm (cont’d) 56 Gibbs Sampling Example • The following slides illustrate Gibbs sampling to discover a motif in yeast DNA sequences. • This example uses a sequence model that allows multiple sites per sequence. • Columns are sampled as well as sites. 57 The Input Data Set 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae. 58 The Target Motif 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 AAAAGAGTCA AAATGACTCA AAGTGAGTCA AAAAGAGTCA GGATGAGTCA AAATGAGTCA GAATGAGTCA AAAAGAGTCA ********** MAP score = 20.37 (maximum) 59 Initial Seeding 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 TGAAAAATTC GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC ********** MAP score = -10.0 60 Sampling Add? 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 TGAAAAATTC GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC ********** How much better is the alignment with this site as opposed to without? TCTCTCTCCA TGAAAAATTC GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC ********** 61 Continued Sampling Add? 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC ********** How much better is the alignment with this site as opposed to without? TGAAAAATTC GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC ********** Source: G.M. Church 62 Column Sampling 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC ********** How much better is the alignment with this new column structure? GACATCGAAAC GCACTTCGGCG GAGTCATTACA GTAAATTGTCA CCACAGTCCGC TGTGAAGCACA ********* * Source: G.M. Church 63 The Best Motif 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3 AAAAGAGTCA AAATGACTCA AAGTGAGTCA AAAAGAGTCA GGATGAGTCA AAATGAGTCA GAATGAGTCA AAAAGAGTCA ********** MAP score = 20.37 64 What is the program missing? • Doesn’t do re-initializations in the middle to get out of local maxima • Doesn’t optimize the width (you have to specify width explicitly) • Doesn’t do the Bayesian approach – just frequentist (easier for me and for you to understand!) • Doesn’t read in fasta files • Doesn’t do error checking! • And other things that don’t know they are missing yet! 65 Alternative program? • Gibbs Motif Sampler – http://bayesweb.wadsworth.org/gibbs/gib bs.html • AlignAce – http://atlas.med.harvard.edu/cgibin/alignace.pl 66 Protein structure pattern discovery algorithms • To discover structure patterns, some algorithms search the space of structural motifs. • We illustrate another approach (called SPRATT) that searches the space of sequence motifs to find structure motifs. 67 Structure patterns: SPRATT • Describe local spatial neighborhoods as strings. • Use sequence pattern discovery to find patterns. • Check if string similarity reflects structural similarity. 68 SPRATT Example 69 SPRATT Example (cont’d) 70 SPRATT Example: conclusion • Now, use the PRATT sequence pattern discovery algorithm to find any patterns in the Cstrings and N-strings, separately. • For each pattern found, compute the maximum RMSD of the overlaid structures of the pattern instances. • The final objective function is the ratio of the information content of the sequence pattern and the maximum RMSD. 71 Some real-life algorithms • Sequence pattern discovery – – – – – – RSA tools (DNA) MEME (protein, DNA, RNA) Alignace (DNA, RNA) Weeder PRATT (protein, DNA, RNA, other alphabets) Clover • Cis-regulatory regions – MCAST – Comet, Cister, Cluster-Buster – Meta-MEME • Protein structure pattern discovery – SPRATT – CE 72