Download Efficient and Accurate Discovery of Patterns in Sequence Datasets

Efficient and Accurate Discovery of Patterns in Sequence Datasets Avrilia Floratou, SandeepTata, and Jignesh M. Patel ICDE 2010 Outline  Motivation  A powerful new model  FLAME(FLexible and Accurate Motif Detector) algorithm  Experimental Results  Conclusion 2 2010/5/27 Motivation  Existing sequence mining algorithms mostly focus on mining for subsequences(non-contiguous).  For instance, assume that a sequence is “a, b, a, c, b, a, c”, the sequence “a, b, b, c” is a subsequence constructed by choosing the 1st, 2nd, 5th, and 7th.  “Are there any frequently recurring patterns in this time series dataset?”  The recurring subsequences are similar, but not identical.  It allow for some noise.  Approximate subsequence mining problem.(contiguous) 3 2010/5/27 (Cont.)  Computational biology  To detect short sequences, usually of length 6-15, that occur frequently in a given set of DNA or protein sequences.  We call these frequently occurring patterns as motifs.  Goal  To present a new model for approximate motif mining in many different domains.  To present FLAME algorithm to efficiently find motifs that satisfy these model. 4 2010/5/27 The model  (L, M, s, k) model  L is the length of the motif.  M is a distance matrix that is used to compute the similarity between two strings.  s is the maximum distance threshold within which two strings are considered similar.  k is the min_sup.  If  at least k strings T1,…, Tk in the DB  each of them is length L, and d(S,Ti)  s, where d ( A, B)   M (a , b ) is a distance function. Then, a string S is an (L, M, s, k) motif. n i 1 5 i i 2010/5/27 (Cont.)  Protein motif mining  Some amino acids are very similar to each other, while some are very different.  Because Alanine and Valine are both hydrophobic, the matrix can be used to award a small penalty for M(X,Y) when X and Y are similar.  When Alanine is matched to Glycine, the large penalty is awarded. Because Glycine is a hydrophilic amino acid.  (L, d, k) model  (L, f, d, k) model 6 2010/5/27 (L, d, k) model  (L, d, k) model is a mismatch based model for finding DNA motifs.  The distance measure between two strings is the Hamming distance, M(X,Y)=1 if X  Y and M(X,Y)=0 if X=Y  The signature is usually a short string of DNA 6-15 bases long.  These signatures are seldom identical and differ in a few positions.  For instance  {ABCD, ACCD, ABCA} if ABCD is the model sequence, the other two sequences are within one mismatch of the motif, so these sequences would constitute a (4,1,3) motif. 7 2010/5/27 (L, f, d, k) model  This model builds on themotif (L, d, k) model to include positional (8,1,0,5) motif or (8,1,5) constraints onk)the mismatches. Use the (L, d, model to retrieve the pattern, we will end up  Tomany specify the number of fixed-position mismatches. with extraneous hits that might not be meningful.  {ABCD, ACCD, ADCD}, this set forms a (4,1,3) motif, but the mismatches, whenever they occur, are always in position two (AcCD, AdCD).  The advantage of this model  Consider a DNA DB consisting of 5 sequences, each of length 500.  Assume that each sequence has in it the motif GTGAACAC, and each instance of the motif has a mismatch at the fifth position. 8 2010/5/27 FLAME algorithm  It first construct a data suffix tree and model suffix tree.  To traversing the nodes of the model space in depth first order.  Using two subroutines  Evaluate_support: to compute the list of matches and the support.  Expand_Matches: to ensure that the number of mismatches to the model string does not exceed d. 9 2010/5/27 (Cont.)  The data suffix tree:  The model suffix tree:  On the set of all possible model srtings.  To help guide the exploration of the model space in a way that avoids redundant work. 10 2010/5/27 (Cont.)the strategy of pruning the model suffix tree  Assume that the dataset consists of sequences over the alphabet {A, B, C, D, E}  All the strings of length L starting with the symbol A form a subset of the model space.  … A AA AB AC AD AE … 11 2010/5/27 (Cont.)  No mismatches are allowed.  AA AB … k A AC AD AE … 12 2010/5/27 (Cont.)  As mismatches are allowed.  A AA AB … B AC AD AE … 13 2010/5/27 Experimental results 14 2010/5/27 (Cont.) 15 2010/5/27 Conclusion  This paper presented a powerful new model: (L, M, s, k)  It also presented FLAME to find (L, M, s, k) motifs 16 2010/5/27

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Efficient and Accurate Discovery of Patterns in Sequence Datasets