Algorithms for Discovering Patterns in Sequences
Raj Bhatnagar, University of Cincinnati

Outline
• Applications of Sequence Mining
  – Genomic sequences
  – Engineering/scientific data analysis
  – Market basket analysis
• Algorithm Goals
  – Temporal patterns
  – Interesting frequent subsequences
  – Interesting frequent substrings (with mutations)
  – Detection of outlier sequences

Why Mine a Dataset?
Discover patterns:
• Associations
  – A subsequence pattern follows another
  – Helps make a prediction
• Clusters
  – Typical modes of operation
• Temporal dependencies
  – Events related in time
• Spatial dependencies
  – Events related in space

Example Problems
From various domains:
• Engineering
• Scientific
• Genomic
• Business
Energy Consumption (figure)
Gene Expression over Time (figure)

Multivariate Time Series
• Multivariate time series data is a series of multiple attributes observed at equal intervals over a period of time
• Examples: stocks, weather, utility, sales, scientific, sensor-monitoring, and bioinformatics data

Spatio-Temporal Patterns
• Goal: find patterns in the space-time dimensions of phenomena
• Issue: simultaneous handling of the space and time dimensions
• Problem: for phenomena in space-time, how to handle the large complexity of the algorithms

Discover Interesting Subsequences
Pattern: AAAAAAAAGGGGGGG-(10,15)-CTGATTCCAATACAG
actgatAAAAAAAAGGGGGGGggcgtacacattagCTGATTCCAATACAGacgt
aaAAAAAAAAGGGGGGGaaacttttccgaataCTGATTCCAATACAGgatcagt
atgacttAAAAAAAAGGGGGGGtgctctcccgattttcCTGATTCCAATACAGc
aggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAGta
ggaggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAG
The pattern sequence (shown in upper case) may occur with up to k substitutions.

Main Sequence Mining Tasks
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Summarization
• Prediction

Finding Frequent Substrings
Two main approaches:
• Generalized suffix trees (GSTs), in linear time
• Finding longest common substrings, in linear time

Example strings (character positions 0-39 across the concatenation):
abacebc . abcdobd . abaaebd . eoaoobd . abceoad .

The GST generated considering the substrings starting at location #0 of each string is:
(0) <root> - (8) ab - (10) c - (11) dobd - (15) 8 - (35) e0ad - (39) 32 - (18) a - (19) aebd - (23) 16 - (3) cebc - (7) 0 - (24) e0a00bd - (31) 24

Substring s | Length |s| | Position pos | Profile set P | Cardinality |P|
aba | 3 | 0 | {0, 2} | 2
abc | 3 | 0 | {1, 4} | 2
ab  | 2 | 0 | {0, 1, 2, 4} | 4
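The table above records, for each substring s, the profile set P of input strings that contain it. As a minimal Python sketch of that bookkeeping (a naive quadratic enumeration, not the linear-time generalized-suffix-tree construction the slides use; the function name and thresholds are illustrative):

from collections import defaultdict

def substring_profiles(strings, min_len=2, min_support=2):
    # Map each substring of length >= min_len to the set P of indices
    # of the input strings that contain it; keep substrings whose
    # profile set reaches min_support. A generalized suffix tree
    # computes the same information in linear time.
    profiles = defaultdict(set)
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            for j in range(i + min_len, len(s) + 1):
                profiles[s[i:j]].add(idx)
    return {sub: p for sub, p in profiles.items() if len(p) >= min_support}

strings = ["abacebc", "abcdobd", "abaaebd", "eoaoobd", "abceoad"]
for sub, p in sorted(substring_profiles(strings).items()):
    print(sub, sorted(p), len(p))

On the example strings this reproduces, among others, the table's row for ab: P = {0, 1, 2, 4}, |P| = 4.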
Finding Frequent Substrings (continued)
• Recursively generate the GST with each string truncated by removing its first character
Original strings:  abacebc . abcdobd . abaaebd . eoaoobd . abceoad .
Truncated strings: bacebc . bcdobd . baaebd . oaoobd . bceoad .

The GST for the truncated strings:
(0) <root> - (8) b - (10) dobd - (14) 8 - (31) e0ad - (35) 29 - (16) a - (17) aebd - (21) 15 - (3) cebd - (7) 1 - (22) 0a00bd - (28) 24

s  | |s| | pos | Profile set P | |P|
ba | 2 | 1 | {0, 2} | 2
bc | 2 | 1 | {1, 4} | 2
b  | 1 | 1 | {0, 1, 2, 4} | 4

Results after Phase I
• Substrings generated after Phase I (figure)

Phase 2: Subsequence Hypotheses
Result of Phase 1:
string  | size(s) | profiles(G) | size(G)
ABa---  | 3 | {0, 2}    | 2
AB----  | 2 | {0, 1, 2} | 3
--a---  | 1 | {0, 2, 3} | 3
---*bd  | 3 | {1, 3}    | 2
----bd  | 2 | {1, 2, 3} | 3

Result after Phase 2:
subsequence | size(T) | profiles(R) | size(R)
AB---bd | 4 | {1, 2} | 2
--a--bd | 3 | {2, 3} | 2

Merging Profile Sets
• Two profile sets are similar, and will be merged, if
  |P1 ∩ P2| / |P1 ∪ P2| ≥ threshold,
  where P1 and P2 are any two profile sets and the threshold is user-defined

Merged Substrings (figure)

Generalizing Substring Patterns
• Core subsequences can be written in the form of a regular expression (regex)
• Each symbol is considered replaceable by the preceding or following symbol of the alphabet
• E.g., for the substring aba-eb--, the regex is [ab]{1}[abc]{1}[ab]{1}.{1}[def]{1}[abc]{1}.{2}

Generalization of Sequence Patterns
• A hypothesis T specifies a core subsequence shared by a set of profiles R
• The core subsequences and the set R are affected by various factors, such as loss of information
• T is considered the seed of the hypothesis

Cluster #1 from Utility Dataset (figure)
Cluster #2 from Utility Dataset (figure)

What is Achieved
• Discover:
  – Partial-sequence temporal hypotheses
  – The identities of the profiles for each temporal hypothesis

Histone Cluster
ORF     | Process             | Function    | Peak Phase | Order | Cluster Order
YBL002W | chromatin structure | histone H2B | S | 424 | 789
YBL003C | chromatin structure | histone H2A | S | 441 | 790
YBR009C | chromatin structure | histone H4  | S | 417 | 791
YBR010W | chromatin structure | histone H3  | S | 420 | 795
YDR224C | chromatin structure | histone H2B | S | 432 | 794
YDR225W | chromatin structure | histone H2A | S | 437 | 796
YNL030W | chromatin structure | histone H4  | S | 418 | 792
YNL031C | chromatin structure | histone H3  | S | 430 | 793
YPL127C | chromatin structure | histone H1  | S | 448 | 788

Fourier Results (one symbolic string per gene)
YBL002W CCabdccaCBBBabc*BCCCA*xcCBCbbcccACACCBBbaabbb*dcABCaaabbBCxBBBa
YBL003C CCabcdcbBCBBbacaBCCC**cbCCCAbcccaCCCCBA*bcbbaAbcABBb*ab*AxABBAa
YBR009C xC*xddcaCCBBaab*BCCB*axcCCBbccccbCBCBBa*bbxaABbaB*AaA*bxdBaBBBB
YBR010W BCaAccc*CBBAbbc*BCCBAabbCCB*ccdbbCBCCCAAxbcaABbbA*Ab*bbAAxaBBB*
YDR224C BCaaccc*CCB*abbaBBCA*bcbCCB*cccbbCCCCCAAcccaABcaBAAaabbAAxaBBBA
YDR225W CC*bcdcaCCC*abbbBCCBAacbCCC*bccbaCCCBCBAbccbaBcbAA*a*bbAAxaBBBB
YNL030W CCbbdccACCBAabb*CBCBAacbCCBacccbbCCCCB*AbbcaABb*AaAbabbBxBaA*BA
YNL031C CCabcccABCB*abb*BCCAAabbCCB*bccbbBBCBC*Abbbb*BbAAAAbabbBAxaA*B*
YPL127C DabbdcaBBCBAabb*BBBBAacbBCABbcdbaBABBCBBbcbaABccBBBB*bcbABBxAAb

String-Based Results (in the same gene order)
cccbABCBBBBBab
cb*baBBCBBBB*b
cccbABCCCBABAb
bccbaBBBBB*Bab
cccb*BBCBBBBAb
bccbaBCBBBBB*b
cccb*BCCBBBB*b
bcbb*BBCBAA**b
bccbABBCBBBA*a

Regular Expression
^.{4}[cdx]{2}.{3}[BCx]{1}.{3}[abx]{1}.{3}[BCDx].{7}[BCx]{1}.{3}[bcdx]{2}.{5}[BCx]{1}.{33}[BCx]{1}[ABCx]{1}.{5}$

Generalization
Shared pattern: CC-b--c---------BCC-----C---bcc----C-------b----A-----b-----------b-B--B----b
YBL002W CCabdccaCBBBabc*BCCCA*xcCBCbbcccACACCBBbaabbb*dcABCaaabbBCxBBBacccbABCBBBBBab
YBL003C CCabcdcbBCBBbacaBCCC**cbCCCAbcccaCCCCBA*bcbbaAbcABBb*ab*AxABBAacb*baBBCBBBB*b
YDR225W CC*bcdcaCCC*abbbBCCBAacbCCC*bccbaCCCBCBAbccbaBcbAA*a*bbAAxaBBBBbccbaBCBBBBB*b
YNL031C CCabcccABCB*abb*BCCAAabbCCB*bccbbBBCBC*Abbbb*BbAAAAbabbBAxaA*B*bcbb*BBCBAA**b
++x----x+++x---x+++xxx--+++x----x+x+++xx-x--xxxxx+x-x--x++x+x+x--x-x++++++xx
Positions: 5, 6, 10, 14, 18, 26, 30, 31, 37, 71, 72
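The neighbor-class generalization used above can be made concrete in a few lines. A hedged sketch, assuming a totally ordered lowercase alphabet and treating gap positions as wildcards (the slides' exact handling of the mixed-case alphabet and of the '*' and 'x' symbols is not shown):

import re

ALPHABET = "abcdef"  # assumed ordering for the lowercase symbols

def generalize(core, alphabet=ALPHABET):
    # Each symbol may be replaced by its preceding or following symbol
    # in the alphabet (b -> [abc]); non-alphabet positions become
    # wildcards, and runs of .{1} are merged into a single .{n}.
    parts = []
    for ch in core:
        if ch in alphabet:
            i = alphabet.index(ch)
            parts.append("[%s]{1}" % alphabet[max(0, i - 1):i + 2])
        else:
            parts.append(".{1}")
    regex = "".join(parts)
    return re.sub(r"(?:\.\{1\}){2,}",
                  lambda m: ".{%d}" % (len(m.group(0)) // 4), regex)

print(generalize("aba-eb--"))
# -> [ab]{1}[abc]{1}[ab]{1}.{1}[def]{1}[abc]{1}.{2}, as in the slides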
Localized focus of generalization = more genes included.

Example 2 with Yeast Cell Data
Original hypothesis:
ORF     | Process              | Function               | Peak Phase | Order | Cluster Order
YBR009C | chromatin structure  | histone H4             | S  | 417 | 791
YBR010W | chromatin structure  | histone H3             | S  | 420 | 795
YDR224C | chromatin structure  | histone H2B            | S  | 432 | 794
YDR225W | chromatin structure  | histone H2A            | S  | 437 | 796
YLR300W | cell wall biogenesis | exo-beta-1,3-glucanase | G1 | 381 | 756
YNL030W | chromatin structure  | histone H4             | S  | 418 | 792
YNL031C | chromatin structure  | histone H3             | S  | 430 | 793

New Genes Included After Localized Generalization
Not classified as cell-cycle related: YBR106W, YBR118W, YBR189W, YBR206W, YCLX11W, YDL014W, YDL213C, YDR037W, YDR134C, YGL148W, YKL009W, YLR449W, YNL110C, YPR163C

New Regular Expression
^.{46}[bcx]{1}.{1}[ABx]{1}[aA*x]{1}[AB*]{1}[abx]{1}[aA*x]{1}.{1}[abcx]{1}[ABx]{1}.{1}[ABCx]{1}[ab*x]{1}[ABx]{1}.{1}[ABCx]{1}.{15}$

Find Longest Common Substrings

Algorithms for LCS
• Substring: a contiguous run of characters in a string
• Subsequence: obtained by deleting zero or more symbols from a given string
For the string abcdefghia:
• Substrings: cdefg, efgh, abcd
• Subsequences: ade, cefhi, abc, aia

Longest Common Subsequence
• The LCS is a common subsequence of maximal length between two strings
String 1: abcdabcefghijk
String 2: xbcaghaehijk
LCS = bcaehijk; length of the LCS = 8

Finding the LCS
• Brute force has time complexity exponential in the length of the strings
• Dynamic programming finds the LCS in O(mn) time and space
• The length of the LCS alone can be found in O(min(m, n)) space and O(mn) time

Finding the LCS Length
Recursive formulation:
LCS[i, j] = 0                               if i = 0 or j = 0
LCS[i, j] = LCS[i-1, j-1] + 1               if i, j > 0 and ai = bj
LCS[i, j] = max(LCS[i, j-1], LCS[i-1, j])   if i, j > 0 and ai ≠ bj
An iterative (bottom-up) solution is more efficient than the recursive one.

Finding the LCS Length: Algorithm
lcs_length(A, B) {
  // A is a string of length m
  // B is another string of length n, with m >= n
  // L is a table of intermediate values for the dynamic program;
  // A[m] and B[n] hold the end-of-string sentinel '$'
  for (i = m; i >= 0; i--)
    for (j = n; j >= 0; j--) {
      if (A[i] == '$' || B[j] == '$')
        L[i, j] = 0;                        // end of strings
      else if (A[i] == B[j])
        L[i, j] = 1 + L[i+1, j+1];
      else
        L[i, j] = max(L[i+1, j], L[i, j+1]);
    }
  return L[0, 0];
}
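A runnable Python version of the same recurrence, filled bottom-up over suffixes exactly as in the pseudocode, with a traceback added to recover one LCS (the traceback is an addition, not part of the slide's algorithm):

def lcs(a, b):
    # L[i][j] = LCS length of the suffixes a[i:] and b[j:]
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                L[i][j] = 1 + L[i + 1][j + 1]
            else:
                L[i][j] = max(L[i + 1][j], L[i][j + 1])
    # walk the table from (0, 0) to recover one maximal subsequence
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif L[i + 1][j] >= L[i][j + 1]:
            i += 1
        else:
            j += 1
    return L[0][0], "".join(out)

print(lcs("abcdabcefghijk", "xbcaghaehijk"))
# -> (8, 'bcaghijk'); ties may yield a different maximal subsequence
# than the slide's bcaehijk, but the length is always 8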
Sequential Clustering
• Clustering partitions the data into equivalence classes
• The data is input once or a few times
• The resulting classification depends on the input order
• Simple and fast for large data

Basic Sequential Clustering (figure)
Sequential Clustering with Buckets (figure)

Main Sequence Mining Tasks
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Biological sequence problems

Time Series Analysis Tasks
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Prediction

Why Temporal Association Rules?
• They give more information about correlations between frequent patterns
• They carry richer information than knowledge of the frequent patterns alone
• They help build diagnostic and prediction tools

Finding Temporal Associations: Recent Work
• Mannila, "Discovery of Frequent Episodes in Event Sequences" [2], 1997
• Das, "Rule Discovery from Time Series" [1], 1998
• Kam, "Discovering Temporal Patterns for Interval-Based Events" [3], 2000
• Roddick, "Discovering Richer Temporal Association Rules from Interval-Based Data" [4], 2004
• Mörchen, "Discovering Temporal Knowledge in Multivariate Time Series" [5], 2004

Research Issues
• Finding a richer set of temporal relationships {contains, follows, overlaps, meets, equals, ...} than sequence mining provides {follows}
• Robustness of rules: room for noise in patterns
• Understanding temporal relationships at different levels of abstraction
• Efficient algorithms to find patterns with noise and temporal associations

Temporal Association Rule Frameworks
• Kam [3] uses Allen's temporal relations to find rules
• Roddick [4] uses a state-sequence framework similar to Höppner's [6]
• Mörchen [5] is based on a Unification-based Temporal Grammar

Allen's Relationships
(figure reproduced from [3])

What Are A1 Rules?
A1 rel1 A2 rel2 A3 rel3 A4 ... rel(k-1) Ak

Problem
Given a multivariate time series, a minimum support (number of occurrences), and a minimum pattern length, find all temporal association rules (of the A1 form) of size less than k.

The Mining Pipeline (example)
1. Dimensionality reduction, discretization, and symbolic representation of the multivariate time series:
   aaaeffdaaaaaaaaacccaaaaaaaaaaedefggcbabaacfgfc...
   dccbbccdccdeedcdcdeecccdddeeecdeccccddegedbcdc...
   dddcbbcfffeeegffcbcdeeefffffecbbaaaaaacffecbbb...
2. Frequent pattern enumeration on the symbol sequences:
   {aaaaaaaa, bbbaaa, bbaaa, eeef, aaac, eaaa, ...}
   {cdee, dccde, deed, ddeeec, eeec, ccccc, edddd, ...}
   {fff, fffe, aaaaa, fecbb, ddcb, bbbbc, bbbbc, cbbb, ...}
3. Clustering of the frequent patterns:
   {aaaaaaaa} {bbba, bba, ...} {eeef, ...} {aaac, eaaa, ...} ...
   {cdee} {dccde, ...} {deed, ddeeec, eeec, ...} {ccccc, ...} {edddd, ...} ...
   {fff, fffe, ...} {fecbb, ...} {ddcb, ...} {bbbbc, bbbbc, cbbb, ...} ...
4. Temporal association rule discovery over the clusters, e.g.:
   1. {aaaa, aaac, aaae, abaa, ...} followed by {cbb, ccbb, cccbb, ...}
   2. {bbbbb, cbbbbbc, ...} followed by {baaaa, aaaaaaa, aaaaaa, ...} overlaps {aaaa, aaac, aaae, ...}
   3. {dccccdd, ccccdd} contains {aaaaa, aaaa}
5. Summarization and visualization of the temporal rules

Mining in 3 Steps
1. Find all frequent patterns in each dimension, along with all their occurrences
2. Cluster similar patterns to form equivalence classes
3. Find temporal associations between these equivalence classes using an iterative algorithm

Step 1: Finding Frequent Patterns
1. The data in each dimension is quantized to form a string (equal frequency, equal interval, SAX, Persist, etc.) [8]
2. An enhanced suffix tree is constructed for this string using an O(n) algorithm [7]
3. All frequent patterns, along with their locations, are enumerated in linear time by a complete traversal of the tree

Step 1: Enhanced Suffix Tree (figure)
Enhanced Suffix Tree Construction Algorithm (figure)

Step 2: Clustering Frequent Patterns
• The frequent patterns are clustered using string similarity measures
• Any reasonable string similarity measure and clustering algorithm can be chosen
• An LCS-based similarity measure with a sequential clustering algorithm is chosen here for its robustness and efficiency

Step 2: String Similarity Measure
• Similarity measure: Sim(s1, s2) = 2 · LCS(s1, s2) / (|s1| + |s2|)
• Distance measure: Dist(s1, s2) = 1 − Sim(s1, s2)
• LCS(s1, s2) is the length of the longest common subsequence of s1 and s2
• |s| is the length of string s
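A direct Python rendering of the two measures; lcs_len is a space-reduced version of the dynamic program shown earlier:

def lcs_len(s1, s2):
    # length of the longest common subsequence; O(|s1|*|s2|) time,
    # O(|s2|) space (only two rows of the DP table are kept)
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        cur = [0]
        for j, c2 in enumerate(s2):
            cur.append(prev[j] + 1 if c1 == c2 else max(prev[j + 1], cur[-1]))
        prev = cur
    return prev[-1]

def sim(s1, s2):
    # Sim(s1, s2) = 2 * LCS(s1, s2) / (|s1| + |s2|), in [0, 1]
    return 2.0 * lcs_len(s1, s2) / (len(s1) + len(s2))

def dist(s1, s2):
    # Dist(s1, s2) = 1 - Sim(s1, s2)
    return 1.0 - sim(s1, s2)

print(sim("ccccddccccccc", "ccccdcccccccc"))  # ~0.92 for near-identical patterns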
Step 2: Clustering
• A sequential clustering algorithm is used
• The frequent patterns are divided into overlapping buckets based on their lengths
• Clustering is done within each bucket to reduce the clustering complexity
• Finally, a pattern that is a member of two clusters is assigned to the closest one

Step 2: Clustering, an Example
An example cluster with its cluster center:
ccccddccccccc
ccccdcccccccc
cccccddbccccc
ccccdcccccccd
cccccdcddcccc
cccccdcdccccc
ccccddccccccc
cccccddcccccc
dccccdccccccc
ccccccddccccc
cccccdccccccc
ccccccdccdccc
ccccccdcccccc
ccccccdcdcccc
(plot of the clustered profiles omitted)

Step 2: Clustering Algorithm (figure)

Step 3: Temporal Relations
• Only three relations are explored here
• "Overlaps" here also covers meets, starts with, overlaps, and overlapped by
• The temporal relationship concept is extended to interval sequences by including support

Temporal Relations: Examples (figure)
Example of Terminology (figure)
Step 3: Algorithm (figure)
Step 3: Examples (figures)
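A minimal sketch of classifying the relation between two occurrence intervals under this collapsed scheme (closed-interval endpoints and the tie-breaking rules are assumptions here; the slides do not spell them out):

def relation(a, b):
    # Relation of interval b to interval a; intervals are (start, end)
    # pairs with start <= end. Per the slides, 'overlaps' also covers
    # meets, starts with, and overlapped by.
    (s1, e1), (s2, e2) = a, b
    if s1 <= s2 and e2 <= e1:
        return "contains"       # b lies within a
    if s2 <= s1 and e1 <= e2:
        return "contained-in"   # a lies within b
    if e1 < s2:
        return "followed-by"    # b occurs strictly after a
    if e2 < s1:
        return "preceded-by"    # b occurs strictly before a
    return "overlaps"

print(relation((0, 10), (3, 7)))  # contains
print(relation((0, 5), (5, 12)))  # overlaps ('meets' folds into overlaps)
print(relation((0, 3), (6, 9)))   # followed-by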
Summarization
• Inference of higher-level, more general knowledge from lower-level temporal dependencies
• Summarization generalizes the temporal association rules by identifying the time windows in which a rule is applicable
• Measures for summarization are coverage, average length of coverage, and maximum coverage length

Results: Utility Dataset (figure)
Results: EEG Dataset (figure; scalable up to 1 million rows)
Results: Great Lakes Dataset (figure)
Results: Power Consumption Dataset (figures; daily patterns, and weekly patterns after summarization)

Spatio-Temporal Clusters

Temporal Patterns
(figure: observed values along the time axis)
Goal: cluster together profiles with similar behavior
Example domains for temporal profiles:
• Gene expression: each gene's expression level (1000s of genes)
• Utility consumption: each day's/week's utility consumption (daily profiles over multiple years)
• Social data: number of crime incidents per month
Problem: how to define similar behavior? Similar for:
• the complete period
• one interval
• multiple intervals

Spatial Patterns
(figure: profiles scattered in the x-y plane)
Goal: cluster together profiles that are close to each other
Example domains for spatial profiles:
• Social data: number of crime incidents per area
• Scientific data: pollution levels over land
Spatial closeness:
• Euclidean distance
• Chain connectivity

Spatio-Temporal Patterns
• Goal: find patterns in the space-time dimensions of phenomena
• Issue: simultaneous handling of the space and time dimensions
• Problem: for phenomena in space-time, how to handle the large complexity of the algorithms
Examples (figure: stacked layers of energy consumption over x, y, and time t):
• Crime data: each layer represents a month (the t-axis); the x and y axes represent space coordinates
• Brewery data: the t-axis represents weeks; the x and y axes represent average production and average temperature

Clustering Algorithms
Temporal clustering:
• Discrete Fourier Transform [Agrawal]
  – Pros: indexes time sequences for similarity searching
  – Cons: not able to represent subsequence temporal concepts
• Rule discovery from time series [Das 1998]
  – Pros: windows are clustered according to their similarity to determine temporal rules
  – Cons: unable to cluster profiles according to subsequences of similarity
Spatial clustering:
• PAM
  – Pros: works well for small datasets
  – Cons: expensive
• CLARA
  – Pros: works well for large datasets
  – Cons: expensive
• CLARANS
  – Motivated by PAM and CLARA
• BIRCH & CURE
  – Suitable for large datasets; clusters are found with one scan
Spatio-temporal clustering:
• R-Trees
  – Often used for indexing spatio-temporal data
  – Overlapping causes backtracking during search
• MR-Tree & HR-Tree
  – Use overlapping of R-Trees to represent successive states of the database
  – If the number of objects moving from one time instant to another is large, the approach degenerates to independent tree structures, and no paths are common

Generalizing to Multiple Dimensions
• Two different approaches:
  – The phenomenon moves in the same direction, i.e., parallel to the time axis
  – The phenomenon moves in different directions, i.e., N, NW, NE, S, SW, SE, W, E
• Task: classify the phenomena in time, then track them in space
  – Step I: discovery of temporal clusters
  – Step II: finding spatial clusters

Phenomenon Moving in Different Directions
Goal: discover subsequences of the complete temporal profile that are shared by a large number of profiles that are also spatial neighbors

Modified Sequences Generated
(figure: layered grids of symbols, e.g., a b b c / a c c c / c d d c / b c d b)
There are 4 layers, so the final sequences formed are of length 7, i.e., 2n − 1, where n is the number of layers.
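The complexity slide that follows counts the strings generated as a phenomenon moves between layers. As a hedged sketch of one plausible construction (the slides do not show theirs): let each cell move to any of its 8 neighbors, or stay, between consecutive layers, and interleave cell symbols with single-character move codes so that an n-layer path yields a string of length 2n − 1:

# 8 compass moves plus 'stay'; the single-character codes are an
# assumption standing in for the slides' N, NW, NE, S, SW, SE, W, E
MOVES = {"0": (0, 0), "1": (-1, 0), "2": (-1, 1), "3": (0, 1), "4": (1, 1),
         "5": (1, 0), "6": (1, -1), "7": (0, -1), "8": (-1, -1)}

def tracked_strings(layers, start):
    # Enumerate the strings generated by a phenomenon starting at cell
    # `start` of layer 0: symbol, move, symbol, move, ..., symbol.
    rows, cols = len(layers[0]), len(layers[0][0])
    out = []
    def walk(level, r, c, acc):
        acc += layers[level][r][c]
        if level == len(layers) - 1:
            out.append(acc)
            return
        for code, (dr, dc) in MOVES.items():
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                walk(level + 1, nr, nc, acc + code)
    walk(0, start[0], start[1], "")
    return out

layers = [["ab", "bc"], ["ac", "cc"], ["cd", "dc"], ["bc", "db"]]  # toy 4-layer grids
seqs = tracked_strings(layers, (0, 0))
print(len(seqs), len(seqs[0]))  # 64 strings, each of length 2*4 - 1 = 7

Ignoring boundary effects, a single start cell generates up to 9^(L−1) strings across L layers, which is the kind of blow-up the complexity slide quantifies.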
Complexity
• The number of strings generated when the phenomenon moves from layer 1 to layer 2 (formula in omitted figure)
• For L layers, the number of strings generated when the phenomenon moves from layer 1 to layer L (formula in omitted figure)

Results
• Test data with 20 profiles, each observed at 30 time points (figure)

Results (Brewery Data 1)
• Energy consumption data of size 51×14: 51 profiles, each observed over a period of 14 days (figure)

Results (Brewery Data 2)
• Production data of size 51×14: 51 profiles, each observed over a period of 14 days (figure)

Results
• Sequences generated from test data with 7 layers, each layer of size 10×10: 4480 strings generated (figure)

Sequences and Bioinformatics

Importance of Sequences
• DNA: a sequence over the letters A, C, G, T
• Proteins: sequences over 20 amino acids
• The central premise of bioinformatics: sequence → structure → function
• Ultimate goal: predict function based on sequence

Strings and Sequences (figure)
Exact String Matching (figure)

Solutions to Exact Matching
• O(n + m) upper bound for the worst case
• Either preprocess the pattern P, or preprocess the text T
  – Suffix tree method: Weiner, Ukkonen, and others
  – Building the tree: O(m); searching: O(n)

Suffix Trees
• T = cabca
• Its suffixes are: cabca, abca, bca, ca, a
• P = ca
(figure: the suffix tree of cabca$, with leaves numbered 1-5)

Approximate String Matching
• Errors in the experiment or observation
• Redundancy in biological systems
  – Insertions/deletions
  – Mutations

The Task: Best Alignment
Given the strings HEAGAWGHE and PAWHE, many alignments are possible:
HEAGAWGHE    HEAGAWGHE    HEAGAWGHE
P_A__W_HE    __P_AW_HE    _PA__W_HE

Scoring Function
• A score for a match (e.g., A aligned to A)
• A score for a mismatch (e.g., A aligned to P)
• A score for aligning a symbol against a gap

Algorithms
• Needleman-Wunsch (NW): global alignment
• Smith-Waterman (SW): local alignment
• Both are based on dynamic programming

All Possible Alignments
• All alignments of a1a2 with b1b2 can be represented in a table, with the first string along the top and the second along the left side
(figure: the lattice of paths a1b1a2b2, b1a1b2a2, a1a2b1b2, b1b2a1a2, b1a1a2b2, a1b1b2a2, with path counts 1, 1, 1, 1, 3, 5, 5, 13)

Main Idea of DP
(figure: a partial alignment of a1a2a3 with b1b2b3, with subproblem scores 5, 12, -3, and 4)
Choices for the last column:
• Align a3 to b3: s(a3, b3) = -5, score: 5 - 5 = 0
• Align a3 to a gap: s(a3, -) = -8, score: -3 - 8 = -11
• Align b3 to a gap: s(-, b3) = -8, score: 12 - 8 = 4

Main Idea of DP (continued)
• The score is a sum of independent piecewise scores; F(i, j) is computed from F(i-1, j-1), F(i-1, j), and F(i, j-1)
• Global alignment (Needleman-Wunsch):
  – F(0, 0) = 0; F(k, 0) = F(0, k) = -kd
  – F(i, j) = max{ F(i-1, j-1) + s(ai, bj); F(i-1, j) - d; F(i, j-1) - d }
• Local alignment (Smith-Waterman):
  – F(0, 0) = 0; F(k, 0) = F(0, k) = 0
  – F(i, j) = max{ 0; F(i-1, j-1) + s(ai, bj); F(i-1, j) - d; F(i, j-1) - d }

Overview of DP
• Start from the upper-left corner
• Assign scores for the leading gaps
• Iteratively fill in the scores in the table
• Find the maximum score: the bottom-right cell for global alignment
• Back-track to the upper left to find the best overall alignment

Pairwise Scores from BLOSUM50
     H   E   A   G   A   W   G   H   E
P   -2  -1  -1  -2  -1  -4  -2  -2  -1
A   -2  -1   5   0   5  -3   0  -2  -1
W   -3  -3  -3  -3  -3  15  -3  -3  -3
H   10   0  -2  -2  -2  -3  -2  10   0
E    0   6  -1  -3  -1  -3  -3   0   6

DP Table for Global Alignment (gap penalty d = 8; <, ^, and * mark left, up, and diagonal traceback moves)
         H     E     A     G     A     W     G     H     E
     0   <-8   <-16  <-24  <-32  <-40  <-48  <-56  <-64  <-72
P  ^-8   *-2   *-9   *-17  <-25  *-33  <-41  <-49  <-57  *-65
A  ^-16  ^-10  *-3   *-4   <-12  *-20  <-28  <-36  <-44  <-52
W  ^-24  ^-18  ^-11  *-6   *-7   *-15  *-5   <-13  <-21  <-29
H  ^-32  *-14  *-18  *-13  *-8   *-9   ^-13  *-7   *-3   <-11
E  ^-40  ^-22  *-8   <-16  ^-16  *-9   *-12  ^-15  *-7   *3

The resulting global alignment:
HEAGAWGHE
--P-AW-HE
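A compact Python sketch of the Needleman-Wunsch recurrence above, with the substitution scores passed in as a function (a toy scorer stands in for the BLOSUM50 lookup) and a linear gap penalty d:

def needleman_wunsch(a, b, s, d=8):
    # F[i][j] = best score aligning a[:i] with b[:j];
    # F(k,0) = F(0,k) = -k*d, and
    # F(i,j) = max(F(i-1,j-1)+s(ai,bj), F(i-1,j)-d, F(i,j-1)-d)
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d
    for j in range(1, n + 1):
        F[0][j] = -j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    return F[m][n]  # back-track through F to read off the alignment itself

score = lambda x, y: 5 if x == y else -2  # toy substitution scores
print(needleman_wunsch("HEAGAWGHE", "PAWHE", score))

Replacing max(...) with max(0, ...) and zeroing the boundary rows turns this into the Smith-Waterman local-alignment recurrence from the same slide.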
DP Table for Local Alignment
        H    E    A    G    A    W    G    H    E
    0   0    0    0    0    0    0    0    0    0
P   0   0    0    0    0    0    0    0    0    0
A   0   0    0    *5   0    *5   0    0    0    0
W   0   0    0    0    *2   0    *20  <12  <4   0
H   0   *10  <2   0    0    0    ^12  *18  *22  <14
E   0   ^2   *16  <8   0    0    ^4   ^10  *18  *28

The resulting local alignment:
AWGHE
AW-HE

Multiple Alignment
• Complexity of pairwise DP: O(n^2)
• For k strings: O(n^k)
• Explore other options, such as Hidden Markov Models

Markov Chain Models
• Similar to finite automata
• Emit sequences with certain probabilities
(figure: a four-state chain over A, C, G, T emitting sequences such as ATGT...)
• In general: P(ATGT) = P(A|TGT) P(T|GT) P(G|T) P(T)
• Under the Markov assumption: P(ATGT) = P(A|T) P(T|G) P(G|T) P(T)

Hidden Markov Models
• A generalization of Markov models
• Hidden states emit the sequences
• Adding four more states (A*, C*, T*, G*) to represent the "island" model, as opposed to the non-island model, with unlikely transitions between the two models, one obtains a "hidden" Markov model for CpG islands
(figure: two coupled four-state models, A, C, G, T and A*, C*, G*, T*)

HMMs for Gene Prediction (figure)

HMMs & Supervised Learning
• Input: a training set of aligned sequences
• Find the optimal transition and emission probabilities
• Criterion: maximize the probability of observing the training sequences
• Algorithms:
  – Baum-Welch (Expectation Maximization)
  – Viterbi training

Recognition Phase
• With the probabilities optimized, predict the likelihood that a sequence belongs to a family
• What is the probability that a sequence was generated by the HMM?

A Simple HMM Model
Each square in the model (Beg → M1 ... Mj ... → End) represents a match state that "emits" each letter a with a probability ej(a), defined by the frequency of a at position j.

Example
Training sequences: AGAAACT, AGGAATT, TGAATCT
Emission probabilities by position:
     1    2    3    4    5    6    7
A   2/3  0    2/3  1    2/3  0    0
T   1/3  0    0    0    1/3  1/3  1
C   0    0    0    0    0    2/3  0
G   0    1    1/3  0    0    0    0

P(AGAAACT) = 16/81
P(TGGATTT) = 1/81

Insertions
(figure: insert states Ij placed between the match states Mj)
• Insert states emit symbols just like match states; however, their emission probabilities are typically assumed to follow the background distribution and thus do not contribute to log-odds scores
• Transitions Ij → Ij are allowed and account for an arbitrary number of inserted residues, which are effectively unaligned (their order within an inserted region is arbitrary)

... and Deletions
(figure: silent delete states Dj above the match states)
• Deletions are represented by silent states that do not emit any letters
• A sequence of deletions (with D → D transitions) may connect any two match states, accounting for segments of the multiple alignment that are not aligned to any symbol in a query sequence (string)
• The total cost of a deletion is the sum of the costs of the individual transitions (M→D, D→D, D→M) that define it
• As with insertions, both linear and affine gap penalties can easily be incorporated in this scheme

HMMs for Multiple Alignment
(figure: the full profile-HMM architecture with match states Mj, insert states Ij, and delete states Dj)

Example
AG---C
A-AG-C
AG-AA-
--AAAC
AG---C
**   *
(figure: match-emission, insert-emission, and transition counts for columns C0-C3, tallied from this alignment)

Results
Training File 2 (figure)
Validation File 2 (figure)
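The slide's match-state arithmetic is easy to verify in code: the emissions e_j(a) are the per-column letter frequencies of the training sequences, and a query's probability is the product of the emissions along the match states (transitions are ignored, as in the slide's example):

from fractions import Fraction

def emission_probs(training):
    # e_j(a) = frequency of letter a at position j across the
    # (equal-length, gap-free) training sequences
    return [{a: Fraction(col.count(a), len(col)) for a in set(col)}
            for col in zip(*training)]

def sequence_prob(seq, e):
    # P(seq) = product over positions j of e_j(seq[j])
    p = Fraction(1)
    for j, a in enumerate(seq):
        p *= e[j].get(a, Fraction(0))
    return p

e = emission_probs(["AGAAACT", "AGGAATT", "TGAATCT"])
print(sequence_prob("AGAAACT", e))  # 16/81, as on the slide
print(sequence_prob("TGGATTT", e))  # 1/81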
Bibliography
1. G. Das, K. Lin, H. Mannila, G. Renganathan, and P. Smyth, "Rule Discovery from Time Series," in Knowledge Discovery and Data Mining, 1998, pp. 16-22.
2. H. Mannila, "Discovery of Frequent Episodes in Event Sequences," Data Mining and Knowledge Discovery, vol. 1, pp. 259-289, 1997.
3. P. Kam and A. W. Fu, "Discovering Temporal Patterns for Interval-Based Events," in Second International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2000), Lecture Notes in Computer Science, 2000, pp. 317-326.
4. J. F. Roddick and E. Winarko, "Discovering Richer Temporal Association Rules from Interval-Based Data: Extended Report," School of Informatics and Engineering, Flinders University, Adelaide, Australia, Tech. Rep. SIE-05-003, March 2005.
5. F. Mörchen and A. Ultsch, "Discovering Temporal Knowledge in Multivariate Time Series," in Proc. GfKl, Dortmund, 2004.
6. F. Höppner, "Learning Temporal Rules from State Sequences," in Proceedings of the IJCAI Workshop on Learning from Temporal and Spatial Data, Seattle, USA, 2001, pp. 25-31.
7. E. Ukkonen, "On-Line Construction of Suffix Trees," Algorithmica, vol. 14, pp. 249-260, 1995.
8. C. S. Daw, C. E. A. Finney, and E. R. Tracy, "A Review of Symbolic Analysis of Experimental Data," Review of Scientific Instruments, vol. 74, pp. 915-930, Feb. 2003.
9. S. Theodoridis and K. Koutroumbas, Pattern Recognition, San Diego: Elsevier Academic Press, 2003.