Protein and DNA Sequence Analysis, Part II
Fritz Roth
BCMP 201, Spring 2008

Summary: Sequence Analysis Lectures
Lectures: Sequence Analysis I, Sequence Analysis II, Case Study
Topics:
- Searching sequence databases
- Aligning a pair of sequences
- Scoring aligned sequences
- Aligning multiple sequences
- Representing and finding sequence patterns

Outline
- Searching sequence databases
  - BLAST
  - BLAST statistics
- Aligning multiple sequences
- Representing and finding sequence patterns

Searching Sequence Databases
An O(nm) database search algorithm (Smith-Waterman) sounds pretty good. But searching a 300 a.a. query against SwissProt could take an hour! Enter… BLAST!

BLAST: The family
[Figure from NCBI website]

BLAST algorithm
Step 0 - Preprocess the sequence database: make a lookup table with the locations of 'words' in all database sequences.
For each query sequence:
Step 1 - Define query words
Step 2 - Locate query words in the database
Step 3 - Ungapped extension
Step 4 - Gapped extension

BLAST Step 1: Define query words
[Figure from NCBI website]

BLAST Step 1: Define query words
For every word in the query, make a neighborhood word list (Word Size = 2, Threshold 8).
Example: Adipokinetic hormone II - Migratory locust
Query: Q L N F S A G W

Query Word   Expanded List
QL           QL,QM,HL,ZL
LN           LN,LB
NF           NF,AF,NY,DF,QF,EF,GF,HF,KF,SF,TF,BF,ZF
FS           FS,FA,FN,FD,FG,FP,FT,FB,YS
SA           None score ≥ 8 (including SA)
AG           AG
GW           GW,AW,RW,NW,DW,QW,EW,HW,IW,KW,MW,PW,SW,TW,VW,BW,ZW,XW

Default word size in BLAST is 11 for DNA and 3 for proteins.
From http://www.psc.edu/biomed/training/tutorials/sequence/db/index.html

BLAST Step 2: Word lookup
- In Step 0 we preprocessed the sequence database, storing the locations of all neighborhood words that could match a query word with a score above threshold.
- In Step 2 we use the lookup table (a hash table) to do O(1) (constant-time) lookup.

BLAST Step 3 variant: Ungapped Extension from Two-Word Hits
- Extend when two words are on the same diagonal within distance A.
- Same sensitivity is achieved with a lower word threshold.

BLAST Step 3: Ungapped Extension
For each single word 'hit' exceeding threshold T:
- Extend an ungapped alignment until the alignment score starts decreasing.
- Ungapped aligned sequences = High Scoring Segment Pair (HSP).

Illustration of the Two-Word Variant
[Figure: word hits at T ≥ 13 (+) and T ≥ 11 (.). From Altschul et al, NAR, 1997]
For each hit…
- Extend the ungapped alignment until the alignment score decreases below the best score S.
- If S is above a threshold, go to Step 4; otherwise discard.

BLAST Step 4: Gapped extension
If the ungapped alignment (HSP) had a good enough score, then, for each hit…
- Extend the alignment in both directions from the center of the HSP using dynamic programming.
- For improved efficiency, neglect alignment paths that score below the max score observed thus far by more than X.

Step 4 (BLAST2 variant): Gapped Extension Using Dynamic Programming
[Figure from Altschul et al, NAR, 1997]

Decrease-Limited Dynamic Programming
- Stop if the score falls by X from the highest score so far.
- Decrease threshold = 2
[Figure: decrease-limited dynamic programming matrix for AGCCTA (columns) against ATGCCATG (rows); only cells within the decrease threshold of the best score so far are filled.]

BLAST: A final alignment
[Figure from Altschul et al, NAR, 1997]

BLAST: E-values
With a random sequence database, the expected number of hits, E, is:
$$E \approx K m n \cdot e^{-\lambda S}$$
Where…
- m is the size of our query
- n is the size of our database
- K and λ are scale parameters
  - They depend on gapped or ungapped alignment (λ and K are lower for gapped alignment)
  - λ also depends on the substitution matrix (BLOSUM62 uses $2\log_2$; some versions of PAM use $10\log_{10}$)

BLAST: P-values
- P-value = Prob(N > 0) = probability of a hit as good or better by chance
- Probability of N = 0 events given E expected events (Poisson): $\text{Prob}(N = 0) = e^{-E}$
- So the P-value is $\text{Prob}(N > 0) = 1 - \text{Prob}(N = 0) = 1 - e^{-E} \approx E$ (for small E)
- E-values and P-values are about the same below .01
From Altschul et al 1994

BLAST: E- and P-values
One lesson: Don't search a bigger database than you need to.
$$E \approx K m n \cdot e^{-\lambda S}$$
Where…
- m is the size of our query
- n is the size of our database

BLAST: Problematic 'hits'
- Truly homologous but uninteresting
  - 'Self-hits'
  - Repetitive elements
  - Vector sequence
- Non-homologous but similar sequence
  - Low complexity sequence
  - Coiled coil regions
  - Membrane-spanning (hydrophobic) sequences

Composition-based statistics
YKIL
 | |
FKVL
- Odds score of Y and F is: $\text{Odds}_{YF} = \dfrac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)}$
- But if your query is mostly "F", then we should adjust the score.
- In terms of Odds score: $\text{Odds}'_{YF} = \text{Odds}_{YF} \cdot q(Y) / q'(Y)$
- In terms of Log-Odds score: $S'_{YF} = S_{YF} + \log[q(Y) / q'(Y)]$

BLAST: Filtering
- Filtering (aka Masking): hiding regions that often give spurious high scores.
- Standard filters for low complexity:
  - SEG (Protein)
  - DUST (DNA)
- Some BLAST interfaces have a coiled-coil region filter, some have a repeat filter.

BLAST: Miscellaneous
- Matches that are more than 50% identical in a 20-40 amino acid region occur frequently by chance.
- Protein sequence comparisons typically double the evolutionary look-back time over DNA sequence comparisons.

Historical Footnote: FASTA
- FASTA (aka Pearson-Lipman) came after Smith-Waterman but before BLAST
- First to use path-restricted dynamic programming
- First to use the rule of two hits on the same diagonal
- Now remembered mostly because of FASTA format sequence files:

>bovin ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDIAIHHPWIRRPFFPFHSPSRLFDQFFGEHLLESDLFPASTSLSPFYLR
PPSFLRAPSWIDTGLSEMRLEKDRFSVNLDVKHFSPEELKVKVLGDVIEV
HGKHEERQDEHGFISREFHRKYRIPADVDPLAITSSLSSDGVLTVNGPRK
QASGPERTIPITREEKPAVTAAPKK
>chick ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDITIHNPLVRRPLFSWLTPSRIFDQIFGEHLQESELLPTSPSLSPFLMR
SPFFRMPSWLETGLSEMRLEKDKFSVNLDVKHFSPEELKVKVLGDMIEIH
GKHEERQDEHGFIAREFSRKYRIPADVDPLTITSSLSLDGVLTVSAPRKQ
SDVPERSIPITREEKPAIAGSQRK

Outline
- Searching sequence databases
- Aligning multiple sequences
  - Global alignment
  - Local alignment
- Representing and finding sequence patterns

Multiple Sequence Alignment
Can we use dynamic programming?
- We can extend Smith-Waterman to align N sequences of length L in $O(L^N)$.
- This becomes prohibitive above 3-4 sequences of length 100.
Approximate method examples
- Global alignment: ClustalW
- Local alignment: Gibbs Motif Sampling (e.g., AlignACE)

ClustalW, a Tree-Based Method for Global Alignment
1. Align and score all pairs of sequences.
2. Build a tree by successively merging sequence pairs (or sequence cluster pairs).
NOTE: similarity between clusters is the average over all sequence pairs between two clusters.
Adapted from (Sternberg, 1996)

ClustalW (continued)
3. Build the multiple sequence alignment successively, starting with the most similar pair.
Adapted from (Sternberg, 1996)

ClustalW: Notes
- Dependence on the initial pairwise alignment (gaps come but they don't go)
- Appropriate only if sequences are similar overall, rather than just in local regions
- For aligning sequences with <30% identity, try the slower, more accurate "T-Coffee"

Outline
- Searching sequence databases
- Aligning multiple sequences
  - Global alignment
  - Local alignment
- Representing and finding sequence patterns

Gibbs Motif Sampling: Input Data Set
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
Seven amino acid biosynthesis genes; 300-600 bp of upstream sequence per gene (Saccharomyces cerevisiae).

Gibbs Motif Sampling: Initial Seeding
[Figure: the same seven upstream sequences (HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3).]
Initial alignment of sites:
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
MAP score = -10.0

Gibbs Motif Sampling: Weighting
[Figure: the same seven upstream sequences and the same alignment of sites, annotated "Add?" and "Remove."]
[Figure: the same seven upstream sequences (HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3).]
Candidate site:
ATGAAAAAAT
Current alignment of sites:
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********

Gibbs Motif Sampling: More Sampling and Convergence on GCN4
Add? What are the odds that this sequence came from the current alignment as opposed to the random model?
- Use this as a weight.
- Add or remove each sequence in a weighted random fashion.
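The weighting step above can be sketched in a few lines of Python. This is a minimal illustration, not the AlignACE implementation: the uniform background of 0.25 per base, the pseudocount of 1.0, and the rule "add with probability odds / (1 + odds)" are all assumptions made for the example, and the function names are hypothetical.

```python
import random

ALPHABET = "ACGT"

def position_probs(sites, pseudocount=1.0):
    """Per-position base probabilities from aligned sites (pseudocount is an assumption)."""
    probs = []
    for column in zip(*sites):
        counts = {b: pseudocount / len(ALPHABET) for b in ALPHABET}
        for base in column:
            counts[base] += 1
        total = sum(counts.values())
        probs.append({b: counts[b] / total for b in ALPHABET})
    return probs

def site_odds(candidate, sites, background=0.25, pseudocount=1.0):
    """Odds that the candidate came from the current alignment vs. a uniform random model."""
    odds = 1.0
    for probs, base in zip(position_probs(sites, pseudocount), candidate):
        odds *= probs[base] / background
    return odds

def sample_add(candidate, sites, rng=random):
    """Weighted random decision: add with probability odds / (1 + odds) (assumed rule)."""
    odds = site_odds(candidate, sites)
    p_add = odds / (1.0 + odds)
    return rng.random() < p_add, p_add

# Seed alignment and candidate site taken from the slides.
seed_sites = ["TGAAAAATTC", "GACATCGAAA", "GCACTTCGGC", "GAGTCATTAC",
              "GTAAATTGTC", "CCACAGTCCG", "TGTGAAGCAC"]
added, p_add = sample_add("ATGAAAAAAT", seed_sites, rng=random.Random(0))
print(added, p_add)
```

Because the decision is random rather than greedy, a poorly scoring site can still occasionally be added, which is what lets the sampler escape a bad initial seeding.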
[Figure: the same seven upstream sequences (HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3).]
Alignment at convergence:
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
MAP score = 20.37

Gibbs Motif Sampling: Protein example
Locations of conserved repeats in bacterial outer membrane proteins (porins)
[Figure from Neuwald, Liu, Lawrence 1995]

Outline
- Searching sequence databases
- Aligning multiple sequences
- Representing and finding sequence patterns
  - Consensus sequences
  - Regular expressions
  - Weight matrices
  - Profile Hidden Markov models

Representing and finding sequence patterns
- Consensus Sequences
- Regular Expressions
- Weight Matrices
- Hidden Markov Models

Pattern Model I: The Consensus Sequence
An Example Multiple Sequence Alignment:
AGCATT
CGCACC
ATCATT
AGCACT
ACAAAT
CCCAAA
GCCAGG

Table of Counts
Position   A  C  G  T
1          4  2  1  0
2          0  3  3  1
3          1  6  0  0
4          7  0  0  0
5          2  2  1  2
6          1  1  1  4

Consensus is the sequence of (possibly degenerate) bases which best represents the aligned bases.

Consensus Sequence: Degenerate Bases
Degenerate DNA Codes:
IUPAC Code   Meaning     Mnemonic
W            (A/T)       Weak
S            (G/C)       Strong
M            (A/C)       aMino
K            (G/T)       Keto
R            (A/G)       puRine
Y            (C/T)       pYrimidine
V            (A/C/G)     Not T or U, V
H            (A/C/T)     Not G, H
D            (A/G/T)     Not C, D
B            (C/G/T)     Not A, B
N            (A/C/G/T)   aNy base

Consensus Sequence: What is It?
For the example alignment and table of counts above, candidate consensus sequences include:
A S C A N T?
M S C A N T?
M S C A H T?
ISSUES:
- There are more than 9 different methods, and they give different consensus sequences for some choices of input! (Day and McMorris, 1992)
- Most papers don't tell you how they reached consensus.

Consensus Sequence: Proteins
Protein Degeneracy Codes:
IUPAC Code   Meaning
B (Asx)      D (Asp) or N (Asn)
Z (Glx)      E (Glu) or Q (Gln)
X            Any amino acid
Example protein consensus sequence for PKG kinase: (R/K)X(S/T)X

Consensus Sequence: Scoring
Scoring your consensus pattern against query sequences:
- Standard: perfect match > imperfect match
- Alternative: perfect match > 1 mismatch > 2 mismatches > …
Pros
- Easy to calculate (no rules!)
- Concise
- Easy to score against a query sequence
Cons
- Rules are ambiguous
- Information is lost

Pattern Model II: The Regular Expression
Regular expressions
- Like consensus, but allow for more complicated rules.
EXAMPLE: CG(AA|TT)GC or CGGC
EXAMPLE: Regular expression that finds most protein kinases
[LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-[LIVCAT]-{PD}-x-[GSTACLIVMFY]-x(5,18)-[LIVMFYWCSTAR]-[AIVP]-[LIVMFAGCKR]-K
[ ] means any of these amino acids; { } means anything but these
Pros
- Allows more complicated patterns than consensus
Cons
- Same disadvantages as consensus sequences

PHI-BLAST: Pattern-Hit Initiated BLAST
[Figure from http://bioweb.pasteur.fr/seqanal/blast/]

Pattern Model III: The weight matrix
- aka Position-Specific Scoring Matrix (PSSM)
Alignment:
ACAA
TCAA
ACAG
AGCT

Count Table
Base   Position 1  2  3  4
A               3  0  3  2
C               0  3  1  0
G               0  1  0  1
T               1  0  0  1

Odds ratio of query sequence ACAC?
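One way to answer the odds-ratio question numerically is sketched below. It builds the probability table from the alignment and divides by the random-model frequencies used in the lecture's odds calculation (A and T at .3, C and G at .2). The function names and the `pseudo_weight` parameter are hypothetical; with `pseudo_weight=0` the raw frequencies are used, and a positive value adds background-proportional pseudocounts.

```python
from collections import Counter

# Random-model base frequencies used in the lecture's odds calculation.
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def weight_matrix(alignment, background=BACKGROUND, pseudo_weight=0.0):
    """Per-position base probabilities; pseudo_weight (hypothetical name) scales pseudocounts."""
    n = len(alignment)
    matrix = []
    for column in zip(*alignment):
        counts = Counter(column)  # missing bases count as 0
        matrix.append({
            base: (counts[base] + pseudo_weight * background[base]) / (n + pseudo_weight)
            for base in background
        })
    return matrix

def odds_ratio(query, matrix, background=BACKGROUND):
    """Product over positions of p_i(base) / q(base)."""
    odds = 1.0
    for probs, base in zip(matrix, query):
        odds *= probs[base] / background[base]
    return odds

alignment = ["ACAA", "TCAA", "ACAG", "AGCT"]
raw = weight_matrix(alignment)
print(odds_ratio("ACAC", raw))  # 0.0, because C never occurs at position 4

smoothed = weight_matrix(alignment, pseudo_weight=1.0)
print(round(odds_ratio("ACAC", smoothed), 1))  # 3.1
```

A single unobserved base drives the whole product to zero, which is the motivation for adding pseudocounts before scoring.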
$$\text{Odds} = \frac{p_1(A)}{q_1(A)} \cdot \frac{p_2(C)}{q_2(C)} \cdot \frac{p_3(A)}{q_3(A)} \cdot \frac{p_4(C)}{q_4(C)} = \frac{.75}{.3} \cdot \frac{.75}{.2} \cdot \frac{.75}{.3} \cdot \frac{0}{.2} = 0$$

Probability Table (p)
Base   Position 1    2    3    4
A               .75  0    .75  .50
C               0    .75  .25  0
G               0    .25  0    .25
T               .25  0    0    .25

Random Probability Table (q)
Base   Position 1   2   3   4
A               .3  .3  .3  .3
C               .2  .2  .2  .2
G               .2  .2  .2  .2
T               .3  .3  .3  .3

PSI-BLAST: Position-Specific Iterated BLAST
[Figure from http://bioweb.pasteur.fr/seqanal/blast/]

Weight matrix: using pseudocounts
Alignment:
ACAA
TCAA
ACAG
AGCT

Count Table
Base   Position 1  2  3  4
A               3  0  3  2
C               0  3  1  0
G               0  1  0  1
T               1  0  0  1

Revised Count Table
Base   Position 1    2    3    4
A               3.3  0.3  3.3  2.3
C               0.2  3.2  1.2  0.2
G               0.2  1.2  0.2  1.2
T               1.3  0.3  0.3  1.3

Probability Table
Base   Position 1    2    3    4
A               .66  .06  .66  .46
C               .04  .64  .24  .04
G               .04  .24  .04  .24
T               .26  .06  .06  .26

The random probability table is unchanged (A .3, C .2, G .2, T .3 at every position).

Odds ratio of query sequence ACAC?
$$\text{Odds} = \frac{p_1(A)}{q_1(A)} \cdot \frac{p_2(C)}{q_2(C)} \cdot \frac{p_3(A)}{q_3(A)} \cdot \frac{p_4(C)}{q_4(C)} = \frac{.66}{.3} \cdot \frac{.64}{.2} \cdot \frac{.66}{.3} \cdot \frac{.04}{.2} = 3.1$$

Weight matrices vs consensus sequences
Alignment:
ACAA
TCAA
ACAG
AGCT
Consensus might be: ACAA, ACAD, or ACAN
Example: Three test sequences ACAA, ACAG, ACAC. Do they match the pattern in the alignment?
- Using consensus ACAA, we only find the exact match ACAA.
- Using consensus ACAD, ACAA and ACAG are tied.
- Using regular expression ACAN, all score equally!
- Using the weight matrix (with pseudocounts), the odds ratios are approximately 24, 19, and 3, respectively.

Sequence Logos
- Randomness can be described in terms of Shannon entropy, or uncertainty.
$$H = -\sum_i p_i \cdot \log_2(p_i)$$
EXAMPLE: The amount of computer memory it takes to store the result of a coin toss.
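As a quick numerical check of the entropy formula, the sketch below evaluates H for a fair coin and for a DNA position with four equally likely bases. The function names are illustrative only; `information_content` assumes the usual sequence-logo convention that information is the maximum entropy for the alphabet minus the observed entropy.

```python
import math

def shannon_entropy(probs):
    """H = -sum(p * log2(p)) in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_content(probs, alphabet_size=4):
    """Reduction in uncertainty relative to a uniform alphabet (assumed logo convention)."""
    return math.log2(alphabet_size) - shannon_entropy(probs)

print(shannon_entropy([0.5, 0.5]))   # 1.0 bit for a fair coin toss
print(shannon_entropy([0.25] * 4))   # 2.0 bits for a uniformly random DNA position
```

A fully conserved position has zero entropy, so its information content is the full 2 bits.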
$$H_{\text{coin toss}} = -\sum_{i \in \{\text{heads},\,\text{tails}\}} 0.5 \cdot \log_2(0.5) = 1 \text{ bit}$$
$$H_{\text{max for DNA pos}} = -\sum_{i \in \{A,C,G,T\}} 0.25 \cdot \log_2(0.25) = 2 \text{ bits}$$

Sequence Logos
$$H = -\sum_i p_i \cdot \log_2(p_i)$$
- If the frequencies of A, C, G, T are 88%, 1%, 1%, 10%, then…
$$H_{\text{pos5}} = -\left(.88 \cdot \log_2(.88) + .01 \cdot \log_2(.01) + .01 \cdot \log_2(.01) + .10 \cdot \log_2(.10)\right) = 0.6 \text{ bits}$$
- Information = Reduction in uncertainty = 2 bits - 0.6 bits = 1.4 bits
- Total stack height shows information
- Relative letter height shows relative frequency

Outline
- Searching sequence databases
- Aligning multiple sequences
- Representing and finding sequence patterns
  - Consensus sequences
  - Regular expressions
  - Weight matrices
  - Profile Hidden Markov models

Pattern Model IV: Hidden Markov Models

Markov Models
- A machine that is always in one of several possible states
- Switches randomly between states
- Emits a symbol from each state
ACCGATAGCTA…

Hidden Markov Models
- Also switches randomly between states
- Also emits a symbol from each state
- Different states can emit the same letter!

Example: The Boss
Given recent "symbols" emitted, what state is the boss in?
State:    + + + + + + + + + + + + + + +
Random#:  0.82 0.23 0.95 0.55 0.95 0.46 0.55 0.90 0.11 0.17 0.22 0.45 0.14 0.39 0.94 0.91 0.67 0.17 0.30 0.76 0.75 0.38 0.94 0.70 0.75 0.52 0.60 0.09 0.92
Emission: Great! Great! Great! Great! Great! Great! Great! Great! No! No! No! No! No! No! Great! Great! No! No! No! No! No! No! Great! No! Great! Great! Great! No! Great!
Random#:  0.95 0.43 0.81 0.40 0.19 0.56 0.53 1.00 0.10 0.41 0.14 0.52 0.56 0.06 0.06 0.83 0.81 0.68 0.20 0.62 0.64 0.73 0.75 0.06 0.73 0.24 0.19 0.95 0.76

Pattern Model: The Profile HMM
Equivalent to a weight matrix… but with insertions and deletions:
[Figure: profile HMM for the alignment below]
Alignment:
ACA---ATG
TCAACTATC
ACAC--AGC
AGC---ATC
ACCG--ATC

Using HMMs
- Find the most probable path (as in the boss's mood example)
- Odds of 'emitting' a given sequence (used like a weight matrix)

Hidden Markov Models: DNA and Protein Examples
- Gene prediction
- Transmembrane segments

Sequence Pattern Resources
DNA
- TRANSFAC: weight matrices for TF binding sites
- JASPAR: weight matrices for TF binding sites
- ESEfinder: weight matrices for splicing enhancer sequences
Protein
- PROSITE: regular expressions
- ProDom: weight matrices (based on PSI-BLAST)
- PFAM: HMMs
- SMART: weight matrices + HMMs, signalling domain-focused
- InterPro: all of the above
- Conserved Domain Database (CDD): SMART+PFAM+COGs

Conserved Domain Database
Remember MJ0577?

Conserved Domain Database: Example

Summary: Sequence Analysis Lectures
Lectures: Sequence Analysis I, Sequence Analysis II, Case Study
Topics:
- Searching sequence databases
- Aligning a pair of sequences
- Scoring aligned sequences
- Aligning multiple sequences
- Representing and finding sequence patterns
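The "find the most probable path" use of an HMM, as in the boss's mood example, can be sketched with the Viterbi algorithm. The slides do not give the boss model's parameters, so every number below (two states named "good" and "bad", the start, transition, and emission probabilities) is a hypothetical choice for illustration; only the sequence of emissions is taken from the example table.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observations (log-space Viterbi)."""
    first = observations[0]
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][first]) for s in states}]
    backpointers = []
    for obs in observations[1:]:
        prev, row, ptr = scores[-1], {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this step.
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            row[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][obs])
            ptr[s] = best
        scores.append(row)
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Hypothetical parameters: moods are "sticky" and bias, but do not fix, the emission.
STATES = ("good", "bad")
START = {"good": 0.5, "bad": 0.5}
TRANS = {"good": {"good": 0.9, "bad": 0.1}, "bad": {"good": 0.1, "bad": 0.9}}
EMIT = {"good": {"Great!": 0.8, "No!": 0.2}, "bad": {"Great!": 0.2, "No!": 0.8}}

# Emission sequence from the example table.
emissions = (["Great!"] * 8 + ["No!"] * 6 + ["Great!"] * 2 + ["No!"] * 6
             + ["Great!", "No!"] + ["Great!"] * 3 + ["No!", "Great!"])
print(viterbi(emissions, STATES, START, TRANS, EMIT))
```

With sticky transitions like these, an isolated "No!" in a run of "Great!" does not flip the inferred state, which is the sense in which the hidden state is inferred from recent symbols rather than from the latest one alone.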