Lynx: A Programmatic SAT Solver
for the RNA-folding Problem
Vijay Ganesh, Charles W. O’Donnell,
Mate Soos, Srinivas Devadas, Martin C. Rinard, and Armando Solar-Lezama
SAT Conference, Trento, Italy 2012
The “SAT Solvers are a Black-box” Problem
Users want more Control
• The story so far …
• SAT solvers have been amazingly successful in many fields
• AI, formal methods, testing, program analysis,…
• New applications every day (e.g., biology)
• However
• Diminishing returns of “baked-in” solver heuristics
• For most users, solvers are a magic black box that is difficult to control
• “How can I integrate my heuristic into the solver with minimal effort?”
2 / XX
RNA-folding Problem
A programmatic SAT-based Solution
• Why does a SAT solver-based approach for Bio make sense?
  • Computational biology is an ideal domain for declarative languages like SAT
  • Because problems are often modeled as mathematical constraints
  • Biologists prefer to prototype models quickly, minimize coding
• Is a simple translation to SAT sufficient?
  • Unfortunately No!
  • Naïve SAT representation of problem instances blows up
  • Users need to have greater control over the solver heuristics
3 / XX
An Effective Solution to the Black-box Problem
Programmatic Solvers
Key Idea:
• Expose solver internals to user through a callback programmatic API
• User writes code for the API to influence solver behavior
• Users gain control
[Diagram: input formula → SAT solver → result, with user code attached to the solver]
4 / XX
Central Dogma of Biology
From DNA through RNA to Function
DNA → RNA → Proteins → Function
• DNA → RNA: DNA transcribed into messenger RNA
• RNA → Proteins: RNA encodes amino acid chains that fold into proteins
• Proteins → Function: Folded proteins interact to control function
5 / XX
RNA-folding: 3-D Structure of non-coding RNA
From Structure to Function
[Figure: interactions of non-coding RNA, with panels labeled RNA-Protein, RNA-DNA, Protein-RNA-DNA, RNA-RNA-DNA, Protein-RNA, Protein-DNA, and RNA-RNA (Guttman, Rinn 2012)]
6 / XX
RNA
What is its Structure?
RiboNucleic Acid: phosphate + sugar + base
[Figure: chemical structure of the phosphate and sugar backbone with an attached base]
4 bases
C = Cytosine
G = Guanine
A = Adenine
U = Uracil
7 / XX
RNA-folding Problem
From Structure to Function
Question: How does RNA sequence determine its folded 3D structure, and thus its
function?
• Easy to determine RNA’s primary structure through bio experiments
• Very expensive to determine 3D structure through bio experiments alone
• Hence, we need computational prediction tools for RNA optimal 3D structure
[Figure: “Primary” structure (sequence) and the corresponding 3D structure]
8 / XX
RNA-folding problem
Optimal Secondary Structure Prediction Problem
Unfortunately, modeling every atom/electron is too computationally demanding
• Solution? Implement a reduced model
• The reduced model is called the secondary structure
• The secondary structure is an approximate planar representation of the 3D structure
[Figure: “Primary” structure (sequence), “Secondary” structure, “Tertiary” structure (3D)]
9 / XX
RNA-folding problem
Optimal Secondary Structure Prediction Problem
Computational thermodynamics-based solution:
• Define energetic “cost” function for all possible structures
• Must be optimal
[Figure: four candidate structures with Score = 839, 992, 1029, and 2267 kcal/mol (labels in figure: UIUC, UNC LCCC, Rothamstad Res)]
Search all structures to find “best” (minimum energy funnel, Dill/Chan)
10 / XX
RNA-folding problem
Quick Recap
Given the primary RNA sequence and thermodynamic cost function, can we
predict the optimal secondary structure?
[Figure: “Primary” structure (sequence), “Secondary” structure, “Tertiary” structure (3D)]
11 / XX
Obtaining RNA structures
Lynx RNA model for secondary structure:
Given a string (RNA sequence), any nucleotide at position i can pair with another at position j, subject to four general constraints (more later)
Lynx decision problem (energy function constraint):
Assign independent scores to all potential (i,j) pairs, then find a valid assignment of (i,j) pairs whose scores sum to more than some threshold t
12 / XX
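The decision problem on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary representation of scores, and the numeric values are all hypothetical and are not taken from Lynx. Validity of the pairing (the four structural constraints) is deliberately left out here.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]  # an (i, j) pairing of two sequence positions, i < j


def total_score(pairs: Set[Pair], score: Dict[Pair, float]) -> float:
    """Sum the independent per-pair scores of the selected (i, j) pairs."""
    return sum(score[p] for p in pairs)


def meets_threshold(pairs: Set[Pair], score: Dict[Pair, float], t: float) -> bool:
    """Decision problem: do the selected pairs score strictly more than t?"""
    return total_score(pairs, score) > t


# Hypothetical scores for a toy sequence (illustrative values only).
score = {(0, 5): 3.0, (1, 4): 2.0, (2, 3): 0.5}
chosen = {(0, 5), (1, 4)}

print(total_score(chosen, score))           # 5.0
print(meets_threshold(chosen, score, 4.0))  # True
```

Because each pair is scored independently, the total is a plain sum, which is what lets the energy requirement be stated as a single threshold constraint.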
Lynx RNA structure constraints
Based on this published energy model that assumes score independence, valid
structures can be “knot-free” or contain “crossing pseudo-knots”
[Figure: example of a “pseudo-knot”]
13 / XX
Lynx RNA structure constraints
Bit-vectors X and Y (length n^2) indicate two independent configurations of “knot-free” (i,j) pairings
Crossing pseudoknots are allowed by simultaneous assignment of X and Y
Constraint 1:
Every position (nucleotide) can only pair with at most one other position
[Figure: pairing diagrams over X and Y]
14 / XX
Lynx RNA structure constraints
Bit-vectors X and Y (length n^2) indicate two independent configurations of “knot-free” (i,j) pairings
Crossing pseudoknots are allowed by simultaneous assignment of X and Y
Constraint 2:
X and Y cannot assign the same pair
Constraint 3:
X and Y are knot-free on their own
[Figure: pairing diagrams over X and Y]
15 / XX
Lynx RNA structure constraints
Bit-vectors X and Y (length n^2) indicate two independent configurations of “knot-free” (i,j) pairings
Crossing pseudoknots are allowed by simultaneous assignment of X and Y
Constraint 4:
Only permit pseudoknots with well-characterized biophysical energetics (exclusion of this constraint would require construction of a novel energy function)
[Figure: pairing diagrams over X and Y]
16 / XX
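Constraints 1 to 3 can be sketched as a simple checker. The sketch below assumes X and Y are represented as Python sets of (i, j) pairs with i < j rather than as bit-vectors, and it assumes "knot-free" means that no two pairs in the same set cross (i < k < j < l). All names are hypothetical. Constraint 4 is not modeled, because the slides do not spell out which pseudoknots have well-characterized energetics.

```python
from collections import Counter
from itertools import combinations
from typing import Set, Tuple

Pair = Tuple[int, int]  # (i, j) with i < j


def crosses(p: Pair, q: Pair) -> bool:
    """True if the two pairs interleave, i.e. i < k < j < l after ordering."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l


def at_most_one_partner(x: Set[Pair], y: Set[Pair]) -> bool:
    """Constraint 1: every position takes part in at most one pair."""
    counts = Counter(pos for pair in (x | y) for pos in pair)
    return all(c <= 1 for c in counts.values())


def knot_free(pairs: Set[Pair]) -> bool:
    """Constraint 3 (for one set): no two pairs of the set cross."""
    return not any(crosses(p, q) for p, q in combinations(pairs, 2))


def valid_structure(x: Set[Pair], y: Set[Pair]) -> bool:
    """Check constraints 1-3; constraint 4 is intentionally not modeled."""
    return (at_most_one_partner(x, y)
            and not (x & y)                     # Constraint 2: no shared pair
            and knot_free(x) and knot_free(y))  # Constraint 3


print(valid_structure({(0, 4)}, {(2, 6)}))      # True: crossing split over X and Y
print(valid_structure({(0, 4), (2, 6)}, set())) # False: X alone is not knot-free
```

The first call shows how a crossing pseudoknot becomes expressible: each set stays knot-free on its own, and the crossing only appears when X and Y are assigned simultaneously.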
SAT Solvers and RNA representations
A Case for a Programmatic SAT Solver
• A SAT-based solution would be ideal given the constraint representation above
• However, the constraint size is O(n^6), where n is the length of the RNA primary structure
• The naïve representation is too large
• We want to use SAT but avoid the naïve representation and its cost
• We want to let users experiment with different secondary structure models and heuristics
17 / XX
An Effective Solution to the Black-box Problem
Programmatic Solvers
Key Idea:
• Expose solver internals to user through a callback programmatic API
• User writes code for the API to influence solver behavior
• Users gain control
[Diagram: input formula → SAT solver → result, with user code attached to the solver]
18 / XX
How does the Programmatic Solver Work?
Energetic Constraints are Input, Structural ones are Code
• Structural constraints can grow to O(N^6), where N is the length of the RNA
• Few solvers can deal with such large sizes when N is 100 or more
• Incrementally adding constraints in the inner loop gives fine-grained control of the search
[Diagram: energetic constraints are the input to the SAT solver, structural constraints (N^6) are supplied as user code, and the solver outputs the result]
19 / XX
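To give a feel for the growth rate, the short sketch below evaluates N^6 for a few sequence lengths. It treats N^6 as a literal count and ignores constant factors, which is an assumption made purely for illustration; the function name is hypothetical.

```python
def naive_constraint_count(n: int) -> int:
    """Order-of-magnitude size of the naive encoding, taken literally as n^6."""
    return n ** 6


for n in (20, 50, 100):
    print(n, naive_constraint_count(n))
# 20 64000000
# 50 15625000000
# 100 1000000000000
```

At N = 100 the naive encoding is already on the order of 10^12 constraints, which is why the structural constraints are moved out of the input formula and into code.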
An Effective Solution to the Black-box Problem
Programmatic Solvers
• User code examines the trail in the solver at regular intervals
• If the assignment violates a structural constraint in the user code, add a blocking clause to rule out the bad assignment
• Detect early, block bad assignments quickly. Far more efficient than outer-loop incrementality
[Diagram: energetic constraints are the input to the SAT solver, structural constraints (N^6) are supplied as user code, and the solver outputs the result]
20 / XX
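The inner-loop mechanism can be illustrated with a toy solver. The sketch below is not the Lynx API: the solver is a minimal backtracking search written for this example, the callback signature is hypothetical, and the callback runs at every search node rather than at regular intervals. The input clauses merely stand in for the energetic constraints, and the structural constraint checked in user code is "no two selected pairs cross".

```python
from itertools import combinations
from typing import Callable, Dict, List, Optional, Tuple

Clause = List[int]            # DIMACS-style literals: +v means v is true, -v false
Assignment = Dict[int, bool]  # simplified stand-in for the solver trail
Callback = Callable[[Assignment], List[Clause]]


def clause_status(clause: Clause, assign: Assignment) -> str:
    """Classify a clause as 'sat', 'conflict', or 'open' under a partial assignment."""
    has_unassigned = False
    for lit in clause:
        value = assign.get(abs(lit))
        if value is None:
            has_unassigned = True
        elif value == (lit > 0):
            return "sat"
    return "open" if has_unassigned else "conflict"


def solve(clauses: List[Clause], num_vars: int, callback: Callback,
          assign: Optional[Assignment] = None) -> Optional[Assignment]:
    """Toy backtracking search that calls user code at every search node."""
    assign = dict(assign or {})
    # Inner-loop hook: user code inspects the partial assignment and may
    # return blocking clauses, which are kept for the rest of the search.
    for clause in callback(assign):
        if clause not in clauses:
            clauses.append(clause)
    if any(clause_status(c, assign) == "conflict" for c in clauses):
        return None  # backtrack immediately
    free = [v for v in range(1, num_vars + 1) if v not in assign]
    if not free:
        return assign
    for value in (True, False):
        model = solve(clauses, num_vars, callback, {**assign, free[0]: value})
        if model is not None:
            return model
    return None


# Hypothetical encoding: variable v is true when pair PAIR_OF[v] is selected.
PAIR_OF: Dict[int, Tuple[int, int]] = {1: (0, 4), 2: (2, 6), 3: (1, 3)}


def crosses(p: Tuple[int, int], q: Tuple[int, int]) -> bool:
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l


def knot_free_callback(assign: Assignment) -> List[Clause]:
    """Structural constraint as code: block any two selected pairs that cross."""
    selected = [v for v, value in assign.items() if value]
    return [[-a, -b] for a, b in combinations(selected, 2)
            if crosses(PAIR_OF[a], PAIR_OF[b])]


# Input clauses stand in for the energetic constraints (illustrative only).
clauses = [[1, 2], [2, 3]]
model = solve(clauses, 3, knot_free_callback)
print(model)    # {1: True, 2: False, 3: True}
print(clauses)  # [[1, 2], [2, 3], [-1, -2]]
```

In this run the search first tries selecting pairs 1 and 2 together, the callback reports that they cross, and the blocking clause [-1, -2] is added on the spot. Only the structural clauses that the search actually runs into are ever materialized, instead of the full set being generated up front.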
RNA prediction results
21 / XX
An Effective Solution to the Black-box Problem
Advantages of Programmatic Solvers
• Memory savings if the simple SAT representation of the problem is large (n^6 for RNA)
• Time savings since bad assignments are detected in the inner loop of the SAT solver
• Domain-specific heuristics and user control
[Diagram: energetic constraints are the input to the SAT solver, structural constraints (N^6) are supplied as user code, and the solver outputs the result]
22 / XX
Related Work
Incrementality and DPLL(T)
• Incrementality
  • Stuckey et al. (2007)
  • Extensible solvers
  • Abstraction-refinement in model-checking and SMT
• DPLL(T)
  • Closest related work (Tinelli, Nieuwenhuis, Oliveras 06)
  • Programmatic solvers are for the lay users
  • Rich theory sub-solvers (more powerful, but more work)
• Dynamic Programming approach
  • Zuker (1981), PKNOTS, HOTKNOTS, Vienna RNA
  • Locks you into a set of modeling assumptions, unlike SAT
  • Have to make simplifying assumptions, otherwise NP-complete
23 / XX
Conclusions
The Power of Programmatic Solvers
• Benefits of Programmatic API
• Flexible: easy for lay users
• Adaptive: domain-specific sub-solvers
• Performance: Improve memory usage and time
• Possible other programmatic API choices
• Heuristics: branching heuristics
• Adaptive Strategies: search and restart
• User-controlled Portfolios: Parallel SAT with diff. heuristics
24 / XX