Minimized compact automaton for
clumps over degenerate patterns
Evgeniia Furletova*, Jan Holub, Mireille Regnier
Institute of mathematical problems in biology, Russia
November 27, 2015
Collaborators
Mireille Regnier
Ecole Polytechnique, INRIA, France
Jan Holub
Czech Technical University in Prague, Czech Republic
Challenge
Recognition of functional fragments in biological sequences can be
reduced to finding overrepresented occurrences of a pattern.
A measure of overrepresentation is the P-value of the pattern occurrences.
Problem. Create an efficient algorithm for computing the P-value of
pattern occurrences.
P-value of pattern occurrences
P-value is the probability of finding at least one occurrence of a word
from a pattern H in a random sequence of length n generated
according to a given probability model.
For a Bernoulli model, the P-value can be approximated by the formula*:
$$\text{P-value} \approx 1 - \frac{\rho^{-n}}{\rho\,\bigl(1 - C'(\rho)\bigr)},$$
where
• C(z) – generating function of clumps;
• ρ – the root of 1 − z + C(z) = 0 closest to 1.
* Regnier M., Fang B., Iakovishina D. Clump Combinatorics, Automata, and Word Asymptotics // Proceedings of
the Eleventh Workshop on Analytic Algorithmics and Combinatorics (ANALCO), 2014.
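A minimal numerical sketch of this approximation, assuming it reads P-value ≈ 1 − 1/(ρ^(n+1) (1 − C'(ρ))) and that C(z) is available as a polynomial given by a coefficient list `c` (so C(z) = Σ c[i] z^i). The function name `p_value_approx`, the use of Newton's iteration started at z = 1, and the single-letter check are illustrative assumptions, not taken from the cited paper. For the one-letter pattern H = {A} with letter probability p, the only clump is A itself, so C(z) = pz and the exact P-value 1 − (1 − p)^n is known.

```python
def p_value_approx(c, n, tol=1e-14, max_iter=100):
    """Evaluate 1 - 1 / (rho**(n + 1) * (1 - C'(rho))).

    c[i] is the (assumed) coefficient of z**i in C(z);
    rho is the root of 1 - z + C(z) = 0 found by Newton's method from z = 1.
    """
    def C(z):
        return sum(ci * z ** i for i, ci in enumerate(c))

    def dC(z):
        return sum(i * ci * z ** (i - 1) for i, ci in enumerate(c) if i > 0)

    rho = 1.0
    for _ in range(max_iter):
        g = 1.0 - rho + C(rho)      # value of 1 - z + C(z) at rho
        dg = -1.0 + dC(rho)         # its derivative
        step = g / dg
        rho -= step
        if abs(step) < tol:
            break
    return 1.0 - 1.0 / (rho ** (n + 1) * (1.0 - dC(rho)))


# Sanity check on H = {A}: C(z) = p*z, exact P-value is 1 - (1 - p)**n.
p, n = 0.25, 20
approx = p_value_approx([0.0, p], n)
exact = 1.0 - (1.0 - p) ** n
```

In this degenerate case the approximation coincides with the exact value; for real patterns it is only asymptotic in n.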
Clumps
A k-clump for a pattern H = {h1,…,hr} is a string s such that:
• s consists of k overlapping occurrences of words from H;
• any two consecutive letters of s belong to an occurrence of a word from H.
Examples of clumps for pattern ACATTACA
• ACATTACA (1-clump)
• ACATTACATTACACATTACA (3-clump), formed by occurrences at offsets 0, 5 and 12:
  ACATTACATTACACATTACA
  ACATTACA
       ACATTACA
              ACATTACA
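The definition can be checked mechanically. The sketch below is a direct, unoptimized reading of the two conditions; the function names `occurrences` and `clump_order` are hypothetical and the pattern is assumed to be given as an explicit list of words.

```python
def occurrences(s, words):
    """Return the sorted (start, end) spans of every occurrence of a word in s."""
    spans = []
    for w in set(words):
        start = s.find(w)
        while start != -1:
            spans.append((start, start + len(w)))
            start = s.find(w, start + 1)   # allow overlapping occurrences
    return sorted(spans)


def clump_order(s, words):
    """Return k if s is a k-clump for the given words, otherwise None."""
    spans = occurrences(s, words)
    covered = [False] * len(s)               # letter i lies in some occurrence
    linked = [False] * max(len(s) - 1, 0)    # letters i and i+1 share an occurrence
    for a, b in spans:
        for i in range(a, b):
            covered[i] = True
        for i in range(a, b - 1):
            linked[i] = True
    if s and all(covered) and all(linked):
        return len(spans)
    return None


H = ["ACATTACA"]
k1 = clump_order("ACATTACA", H)
k3 = clump_order("ACATTACATTACACATTACA", H)
not_a_clump = clump_order("ACATTACAACATTACA", H)   # two occurrences that only touch
```

The last string contains two occurrences, but letters 8 and 9 do not share one, so it is not a clump.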
Clumps generating function
$$C(z) = p_0 + p_1 z + \dots + p_n z^n,$$
where p_k is the sum of the probabilities of all k-clumps.
Our goal is to create an efficient method for computing the
probabilities of k-clumps.
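For very small patterns the quantities p_k can be obtained by brute force, which gives a reference value for any faster method. The sketch below assumes a Bernoulli model given as a dictionary `probs` of letter probabilities and uses the fact that a k-clump of words of length at most m has length at most k·m. The names `clump_order` and `clump_probabilities` are illustrative, and this is not the efficient method the talk aims at.

```python
from itertools import product


def clump_order(s, words):
    """Return k if the non-empty string s is a k-clump for the words, else None."""
    spans = [(i, i + len(w)) for w in set(words)
             for i in range(len(s) - len(w) + 1) if s.startswith(w, i)]
    covered = {i for a, b in spans for i in range(a, b)}
    linked = {i for a, b in spans for i in range(a, b - 1)}
    if len(covered) == len(s) and len(linked) == len(s) - 1:
        return len(spans)
    return None


def clump_probabilities(words, probs, k_max):
    """p[k] = sum of probabilities of all k-clumps for k = 1..k_max (p[0] unused)."""
    m = max(len(w) for w in words)
    p = [0.0] * (k_max + 1)
    for length in range(1, k_max * m + 1):
        for letters in product(sorted(probs), repeat=length):
            s = "".join(letters)
            k = clump_order(s, words)
            if k is not None and k <= k_max:
                weight = 1.0
                for ch in s:
                    weight *= probs[ch]
                p[k] += weight
    return p


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p = clump_probabilities(["ACA", "ATA"], uniform, k_max=2)
```

For H = A[CT]A under the uniform model, the 1-clumps are ACA and ATA and the 2-clumps are ACACA, ACATA, ATACA and ATATA, so p[1] = 2/64 and p[2] = 4/1024.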
Degenerate (intermediate) patterns
Degenerate alphabet Σ’ – an alphabet whose letters are subsets of the alphabet Σ.
A degenerate pattern is a string over Σ’.
Example: IUPAC alphabet
A = [A]
C = [C]
G = [G]
T = [T]
R = [AG]
Y = [CT]
S = [CG]
…
N = [ACGT]
Examples: IUPAC consensuses
TATA-box TATA[AT]A[AT] – 4 words of length 7
Consensus of the transcription factor binding site Antp (Drosophila)
ANNNNCATTA – 256 words of length 10
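The word counts above can be reproduced by expanding a degenerate pattern into its set of words. In the sketch below, the table `IUPAC` contains only the codes listed on the slide, and the helper names `parse_pattern` and `expand` are hypothetical; both bracket notation and single-letter codes are accepted.

```python
from itertools import product

# Only the codes listed above; the remaining IUPAC codes are omitted here.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "N": "ACGT"}


def parse_pattern(pattern):
    """Split a pattern such as 'TATA[AT]A[AT]' into one letter set per position."""
    positions = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            j = pattern.index("]", i)          # end of the bracketed set
            positions.append(pattern[i + 1:j])
            i = j + 1
        else:
            positions.append(IUPAC[pattern[i]])
            i += 1
    return positions


def expand(pattern):
    """Return all words matched by the degenerate pattern."""
    return ["".join(w) for w in product(*parse_pattern(pattern))]


tata_words = expand("TATA[AT]A[AT]")
antp_words = expand("ANNNNCATTA")
```

Explicit expansion grows exponentially with the number of degenerate positions, which is why the automata that follow avoid storing every word separately.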
Pattern matching (Aho-Corasick) automaton
for degenerate pattern H = A[CT]A
[Figure: automaton with states 0 to 5; edge labels A, C, T, A, A]
Clumps: ACA, ATA, ACACA, ACATA, …
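A minimal sketch of such an automaton, built as a trie of all words of the pattern with suffix (failure) links computed breadth-first in the usual Aho-Corasick manner. The pattern is assumed to be given as one letter set per position, and the names `build_automaton` and `count_occurrences` are illustrative. State numbers follow insertion order and need not match the numbering in the figure.

```python
from collections import deque
from itertools import product


def build_automaton(position_sets):
    """Trie of all words of a degenerate pattern plus suffix links.

    position_sets is e.g. ["A", "CT", "A"] for H = A[CT]A.
    Returns (children, suffix_link, terminal), lists indexed by state number.
    """
    children, terminal = [{}], [False]
    for word in product(*position_sets):
        state = 0
        for ch in word:
            if ch not in children[state]:
                children.append({})
                terminal.append(False)
                children[state][ch] = len(children) - 1
            state = children[state][ch]
        terminal[state] = True

    suffix_link = [0] * len(children)          # depth-1 states link to the root
    queue = deque(children[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in children[state].items():
            f = suffix_link[state]
            while f and ch not in children[f]:
                f = suffix_link[f]
            suffix_link[nxt] = children[f].get(ch, 0)
            queue.append(nxt)
    return children, suffix_link, terminal


def count_occurrences(text, automaton):
    """Count pattern occurrences in text (all words have the same length,
    so it is enough to test whether the current state is terminal)."""
    children, suffix_link, terminal = automaton
    state, count = 0, 0
    for ch in text:
        while state and ch not in children[state]:
            state = suffix_link[state]
        state = children[state].get(ch, 0)
        count += terminal[state]
    return count


small = build_automaton(["A", "CT", "A"])
n_states_small = len(small[0])
n_states_antp = len(build_automaton(["A"] + ["ACGT"] * 4 + list("CATTA"))[0])
```

For H = A[CT]A this gives six states, as in the figure, and the text ACATA contains two occurrences. For A followed by four unconstrained DNA letters and CATTA, the trie has 1622 states.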
Overlap walking automaton
[Figure, left: pattern matching automaton for H = A[CT]A; states 0 to 5; edge labels A, C, T, A, A]
[Figure, right: overlap walking automaton* for H = A[CT]A; states 0, 4, 5; edge labels ACA, ATA, CA, TA]
Clumps: ACA, ATA, ACACA, ACATA, …
* Regnier M., 2014
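One way to read the edge labels CA and TA in the figure is as the letters that must be appended to a clump ending in one word so that it ends in another, overlapping word. The sketch below follows that reading only; it is an assumption based on the figure, not the construction from Regnier 2014. It also assumes that all words have the same length, as for a degenerate pattern, so that a clump is determined by the sequence of appended labels. The names `overlap_edges` and `clumps_up_to` are hypothetical.

```python
def overlap_edges(words):
    """edges[h] lists (label, g): appending label to a string ending in h
    produces a string ending in g, where g overlaps h by at least one letter."""
    edges = {h: [] for h in words}
    for h in words:
        for g in words:
            for k in range(1, min(len(h), len(g))):   # proper overlaps only
                if h[-k:] == g[:k]:
                    edges[h].append((g[k:], g))
    return edges


def clumps_up_to(words, max_len):
    """Enumerate all clumps of length <= max_len by walking the overlap edges."""
    edges = overlap_edges(words)
    found = set()
    stack = [(w, w) for w in words if len(w) <= max_len]
    while stack:
        clump, last = stack.pop()
        if clump in found:
            continue
        found.add(clump)
        for label, g in edges[last]:
            if len(clump) + len(label) <= max_len:
                stack.append((clump + label, g))
    return sorted(found, key=lambda s: (len(s), s))


H = ["ACA", "ATA"]
edges = overlap_edges(H)
short_clumps = clumps_up_to(H, 5)
```

Starting from ACA or ATA and appending CA or TA reproduces the clumps listed on the slide: ACA, ATA, ACACA, ACATA, and so on.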
We propose a minimization of the overlap walking automaton
for degenerate patterns
Pattern matching automata minimization
degenerate pattern H = [AT][CG][AC]
Minimal pattern matching automaton
degenerate pattern H = [AT][CG][AC]
[Figure: automaton with states 0 to 4; edge labels [AT], [CG], [A], [C]]
This automaton can be constructed in time linear in the number of its states.
R-equivalence
Nodes x and y are R-equivalent (x R~ y) iff x = y or both of the following hold:
1. |x| = |y|;
2. suffix_link(x) R~ suffix_link(y).
For degenerate patterns, nodes of the same length have the
same paths below them.
Two words are R-equivalent iff they are Nerode-equivalent
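The recursive definition can be evaluated level by level: a node's class is determined by its length and the class of its suffix link, and the suffix link always points to a shorter node. The sketch below does this on the explicit trie of all words, so it illustrates the equivalence itself rather than the linear-time construction mentioned earlier. The function name `r_classes` and the input format (one letter set per position) are assumptions.

```python
from collections import deque
from itertools import product


def r_classes(position_sets):
    """Return (class_of, number_of_classes) for the trie nodes of a degenerate pattern."""
    children, depth = [{}], [0]
    for word in product(*position_sets):
        state = 0
        for ch in word:
            if ch not in children[state]:
                children.append({})
                depth.append(depth[state] + 1)
                children[state][ch] = len(children) - 1
            state = children[state][ch]

    suffix_link = [0] * len(children)
    class_of = [None] * len(children)
    class_ids = {(0, None): 0}                 # the root forms its own class
    class_of[0] = 0
    queue = deque([0])
    while queue:                               # breadth-first: shorter nodes first
        state = queue.popleft()
        for ch, nxt in children[state].items():
            if state:                          # children of the root keep link 0
                f = suffix_link[state]
                while f and ch not in children[f]:
                    f = suffix_link[f]
                suffix_link[nxt] = children[f].get(ch, 0)
            # Same length and R-equivalent suffix links give the same key.
            key = (depth[nxt], class_of[suffix_link[nxt]])
            class_of[nxt] = class_ids.setdefault(key, len(class_ids))
            queue.append(nxt)
    return class_of, len(class_ids)


_, n_classes = r_classes(["AT", "CG", "AC"])
_, n_classes_small = r_classes(["A", "CT", "A"])
```

For H = [AT][CG][AC] the 15 trie nodes fall into five classes, which is the number of states (0 to 4) of the minimal pattern matching automaton shown for this pattern.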
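Minimal pattern matching automaton and minimal overlap walking automaton for H = [AT][CG][AC]
[Figure, left: minimal pattern matching automaton; states 0 to 4; edge labels [AT], [CG], A, C]
[Figure, right: minimal overlap walking automaton; states 0, 3, 4; edge labels [AT][CG]A, [AT][CG]C, [CG]A, [CG]C]
Clumps: [AT][CG]A, [AT][CG]C, [AT][CG]A[CG]A, [AT][CG]A[CG]C, …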
Examples demonstrating efficiency
• H = LXDXLXD[DLE] (amino acid alphabet)
  PatAut: 40841 states and 81681 edges
  R-minimal PatAut: 25 states and 59 edges
  Minimal OWA: 6 states and 45 edges
• H = AXXXXCATTA (DNA alphabet)
  PatAut: 1622 states and 3243 edges
  R-minimal PatAut: 64 states and 140 edges
  Minimal OWA: 2 states and 3 edges
Thank you