Efficient and Accurate Discovery of
Patterns in Sequence Datasets
Avrilia Floratou, Sandeep Tata, and Jignesh M. Patel
ICDE 2010
Outline
 Motivation
 A powerful new model
 FLAME (FLexible and Accurate Motif Detector) algorithm
 Experimental Results
 Conclusion
2
2010/5/27
Motivation
 Existing sequence mining algorithms mostly focus on mining
(non-contiguous) subsequences.
 For instance, given the sequence “a, b, a, c, b, a, c”, the
sequence “a, b, b, c” is a subsequence constructed by choosing
the 1st, 2nd, 5th, and 7th elements.
 “Are there any frequently recurring patterns in this time
series dataset?”
 The recurring subsequences are similar, but not identical.
 Mining must therefore allow for some noise.
 This is the approximate (contiguous) subsequence mining problem.
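The difference between a non-contiguous subsequence and a contiguous one can be checked on the example sequence above. This is a minimal sketch; the helper names `is_subsequence` and `is_contiguous` are illustrative and do not come from the paper.

```python
def is_subsequence(pattern, sequence):
    """True if pattern can be obtained from sequence by deleting elements."""
    it = iter(sequence)
    # 'symbol in it' consumes the iterator, so matches must appear in order.
    return all(symbol in it for symbol in pattern)


def is_contiguous(pattern, sequence):
    """True if pattern occurs in sequence as an unbroken run of elements."""
    n = len(pattern)
    return any(
        list(sequence[i:i + n]) == list(pattern)
        for i in range(len(sequence) - n + 1)
    )


seq = list("abacbac")
print(is_subsequence(list("abbc"), seq))  # picks the 1st, 2nd, 5th, 7th elements
print(is_contiguous(list("abbc"), seq))   # no unbroken run spells a, b, b, c
```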
(Cont.)
 Computational biology
 To detect short sequences, usually of length 6-15, that occur
frequently in a given set of DNA or protein sequences.
 We call these frequently occurring patterns motifs.
 Goal
 To present a new model for approximate motif mining in many
different domains.
 To present the FLAME algorithm, which efficiently finds motifs that
satisfy this model.
The model
 (L, M, s, k) model
 L is the length of the motif.
 M is a distance matrix that is used to compute the similarity
between two strings.
 s is the maximum distance threshold within which two strings
are considered similar.
 k is the minimum support (min_sup).
 If there exist at least k strings $T_1, \dots, T_k$ in the DB, each of
length L, with $d(S, T_i) \le s$, then the string S is an (L, M, s, k) motif.
Here $d(A, B) = \sum_{i=1}^{n} M(a_i, b_i)$ is a distance function, where
$a_i$ and $b_i$ are the i-th symbols of A and B and n is their common length.
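The (L, M, s, k) definition can be sketched directly as a brute-force check. The function names below are illustrative, and the sketch assumes that support is counted over all length-L windows of the database sequences; the slides do not state how occurrences are counted.

```python
def distance(a, b, M):
    """d(A, B) = sum of M(a_i, b_i) over aligned positions."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(M(x, y) for x, y in zip(a, b))


def support(model, sequences, M, s):
    """Number of length-L windows within distance s of the model string."""
    L = len(model)
    count = 0
    for seq in sequences:
        for i in range(len(seq) - L + 1):
            if distance(model, seq[i:i + L], M) <= s:
                count += 1
    return count


def is_motif(model, sequences, M, s, k):
    """True if the model string has at least k similar windows in the DB."""
    return support(model, sequences, M, s) >= k


def unit_mismatch(x, y):
    """Example matrix: 0 for equal symbols, 1 otherwise (Hamming distance)."""
    return 0 if x == y else 1


db = ["ABCDAA", "ACCDBB", "EABCAE"]  # made-up toy database
print(is_motif("ABCD", db, unit_mismatch, s=1, k=3))
```

Any function of two symbols can be passed as `M`, which is what lets the same check cover both mismatch counting and graded penalty matrices.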
(Cont.)
 Protein motif mining
 Some amino acids are very similar to each other, while some are
very different.
 Alanine and Valine are both hydrophobic, so the matrix can assign a
small penalty M(X,Y) when X and Y are similar.
 When Alanine is matched to Glycine, a large penalty is assigned,
because Glycine is a hydrophilic amino acid.
 (L, d, k) model
 (L, f, d, k) model
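The effect of a graded penalty matrix on protein strings can be illustrated with a toy example. The penalty values below are invented for illustration only and are not taken from the paper or from any real substitution matrix; A, V, and G are the one-letter codes for Alanine, Valine, and Glycine.

```python
# Hypothetical penalties: small for similar residues, large for dissimilar ones.
PENALTY = {
    frozenset("AV"): 1,  # Alanine vs. Valine: both hydrophobic
    frozenset("AG"): 4,  # Alanine vs. Glycine: treated as dissimilar
    frozenset("VG"): 4,  # Valine vs. Glycine: assumed dissimilar as well
}


def M(x, y):
    """Penalty for aligning residue x with residue y."""
    if x == y:
        return 0
    return PENALTY[frozenset((x, y))]


def distance(a, b):
    """Sum of per-position penalties for two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(M(x, y) for x, y in zip(a, b))


# Both candidates differ from "AVA" in exactly one position,
# but the matrix separates them.
print(distance("AVA", "VVA"))  # Alanine replaced by Valine
print(distance("AVA", "GVA"))  # Alanine replaced by Glycine
```

With a threshold such as s = 2, the first string would count as similar to "AVA" and the second would not, even though a pure mismatch count cannot tell them apart.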
(L, d, k) model
 The (L, d, k) model is a mismatch-based model for finding DNA
motifs.
 The distance measure between two strings is the Hamming
distance: M(X,Y)=1 if X ≠ Y and M(X,Y)=0 if X=Y.
 The signature is usually a short string of DNA 6-15 bases long.
 These signatures are seldom identical and differ in a few
positions.
 For instance
 Given {ABCD, ACCD, ABCA}, if ABCD is the model sequence, the other two
sequences are within one mismatch of it, so these three sequences
constitute a (4,1,3) motif.
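The (4,1,3) example can be verified with a few lines of code. This is a minimal sketch in which the candidate instances are given as a list; `is_ldk_motif` is an illustrative name, not one used by the authors.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_ldk_motif(model, instances, d, k):
    """True if at least k instances of length L lie within d mismatches."""
    L = len(model)
    hits = [t for t in instances if len(t) == L and hamming(model, t) <= d]
    return len(hits) >= k


instances = ["ABCD", "ACCD", "ABCA"]
print(is_ldk_motif("ABCD", instances, d=1, k=3))
```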
(L, f, d, k) model
 This model builds on the (L, d, k) model to include positional
constraints on the mismatches.
 The additional parameter f specifies the number of fixed-position
mismatches.
 {ABCD, ACCD, ADCD}: this set forms a (4,1,3) motif, but the
mismatches, whenever they occur, are always in position two
(AcCD, AdCD).
 The advantage of this model
 Consider a DNA DB consisting of 5 sequences, each of length
500.
 Assume that each sequence contains the motif GTGAACAC,
and each instance of the motif has a mismatch at the fifth
position.
 This is an (8,1,0,5) motif, or an (8,1,5) motif under the
(L, d, k) model.
 If we use the (L, d, k) model to retrieve the pattern, we will end up
with many extraneous hits that might not be meaningful.
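The positional constraint can be sketched as follows. The slides do not spell out the exact semantics, so this sketch assumes one plausible reading: there is a single set of f positions, shared by all instances, where mismatches are free, and each instance may have at most d further mismatches elsewhere. The function names are illustrative.

```python
from itertools import combinations


def mismatch_positions(model, t):
    """Set of indices at which t differs from the model string."""
    return {i for i, (x, y) in enumerate(zip(model, t)) if x != y}


def is_lfdk_motif(model, instances, f, d, k):
    """Assumed semantics: some set of f fixed positions absorbs mismatches,
    and at least k instances have at most d mismatches outside that set."""
    L = len(model)
    candidates = [t for t in instances if len(t) == L]
    for fixed in combinations(range(L), f):
        fixed = set(fixed)
        hits = sum(
            1 for t in candidates
            if len(mismatch_positions(model, t) - fixed) <= d
        )
        if hits >= k:
            return True
    return False


model = "GTGAACAC"
# Five made-up instances, each differing from the model only at the fifth symbol.
fifth = ["GTGATCAC", "GTGACCAC", "GTGAGCAC", "GTGATCAC", "GTGACCAC"]
print(is_lfdk_motif(model, fifth, f=1, d=0, k=5))
```

With f = 0 the check reduces to the plain (L, d, k) test, which accepts any instance within d mismatches regardless of where they fall.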
FLAME algorithm
 It first constructs a data suffix tree and a model suffix tree.
 It then traverses the nodes of the model space in depth-first
order.
 It uses two subroutines:
 Evaluate_support: to compute the list of matches and the
support.
 Expand_Matches: to ensure that the number of mismatches to
the model string does not exceed d.
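The depth-first exploration of the model space can be sketched without suffix trees. This is a simplified stand-in under the (L, d, k) model, not the authors' implementation: the two helpers borrow the subroutine names from the slide, but their bodies, and the representation of a match as a (sequence index, start, mismatches) tuple, are assumptions made for illustration.

```python
def find_motifs(sequences, alphabet, L, d, k):
    """Enumerate model strings of length L depth-first, pruning any prefix
    whose surviving matches already number fewer than k."""

    def expand_matches(matches, depth, symbol):
        # Extend every partial match by one symbol and drop those that
        # would exceed d mismatches against the model string.
        out = []
        for seq_idx, start, mism in matches:
            m = mism + (sequences[seq_idx][start + depth] != symbol)
            if m <= d:
                out.append((seq_idx, start, m))
        return out

    def evaluate_support(matches):
        # Support is taken here as the number of surviving windows.
        return len(matches)

    motifs = {}

    def dfs(prefix, matches):
        if evaluate_support(matches) < k:
            return  # extensions can only lose matches, so prune
        if len(prefix) == L:
            motifs[prefix] = evaluate_support(matches)
            return
        for symbol in alphabet:
            dfs(prefix + symbol, expand_matches(matches, len(prefix), symbol))

    root = [
        (seq_idx, start, 0)
        for seq_idx, seq in enumerate(sequences)
        for start in range(len(seq) - L + 1)
    ]
    dfs("", root)
    return motifs


db = ["ABCDAA", "ACCDBB", "EABCAE"]  # made-up toy database
print(find_motifs(db, "ABCDE", L=4, d=1, k=3))
```

The pruning is safe because the mismatch count of a window never decreases as the model string grows, so the match list of a longer model is always a subset of its prefix's list.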
(Cont.)
 The data suffix tree:
 The model suffix tree:
 It is built on the set of all possible model strings.
 It helps guide the exploration of the model space in a way that
avoids redundant work.
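As a rough picture of what the data-side index provides, the sketch below builds an uncompressed suffix trie with occurrence counts, truncated at a maximum depth. This is a simpler structure than a true suffix tree and is used here only for illustration; the dictionary layout and function names are assumptions.

```python
def build_suffix_trie(sequences, max_depth):
    """Insert the first max_depth symbols of every suffix, counting visits."""
    root = {"count": 0, "children": {}}
    for seq in sequences:
        for start in range(len(seq)):
            node = root
            for symbol in seq[start:start + max_depth]:
                node = node["children"].setdefault(
                    symbol, {"count": 0, "children": {}}
                )
                node["count"] += 1
    return root


def occurrences(trie, pattern):
    """Exact occurrences of a non-empty pattern no longer than max_depth."""
    node = trie
    for symbol in pattern:
        node = node["children"].get(symbol)
        if node is None:
            return 0
    return node["count"]


trie = build_suffix_trie(["ABAB", "BAB"], max_depth=3)
print(occurrences(trie, "AB"))
```

Each node's count is the number of data positions at which the path to that node occurs, so exact support for a prefix can be read off in time proportional to the prefix length.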
(Cont.) The strategy of pruning the model suffix tree
 Assume that the dataset consists of sequences over the
alphabet {A, B, C, D, E}
 All the strings of length L starting with the symbol A form a
subset of the model space.
[Figure: model suffix tree showing node A and its children AA, AB, AC, AD, AE]
(Cont.)
 No mismatches are allowed.
[Figure: model suffix tree showing node A and its children AA, AB, AC, AD, AE, annotated with k]
(Cont.)
 Mismatches are allowed.
[Figure: model suffix tree showing nodes A and B and the children AA, AB, AC, AD, AE]
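One way to see why allowing mismatches changes which data strings remain relevant to a model prefix is the sketch below. It is an illustration under the (L, d, k) model with made-up data; the function name and the (sequence index, start) result format are assumptions.

```python
def prefix_matches(sequences, prefix, L, d):
    """Start positions of length-L windows whose leading symbols differ
    from the model prefix in at most d positions."""
    hits = []
    for seq_idx, seq in enumerate(sequences):
        for start in range(len(seq) - L + 1):
            window = seq[start:start + len(prefix)]
            mism = sum(x != y for x, y in zip(prefix, window))
            if mism <= d:
                hits.append((seq_idx, start))
    return hits


db = ["ABCD", "BBCD"]  # made-up toy database
print(prefix_matches(db, "A", L=4, d=0))  # only the window starting with A
print(prefix_matches(db, "A", L=4, d=1))  # the window starting with B survives too
```

With k = 2 and no mismatches, the prefix A has a single match, so every model string starting with A can be discarded at once. With one mismatch allowed, the window starting with B still counts toward models under A, so that subtree has to be explored.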
Experimental results
Conclusion
 This paper presented a powerful new model: (L, M, s, k)
 It also presented the FLAME algorithm to find (L, M, s, k) motifs.