Lynx: A Programmatic SAT Solver for the RNA-folding Problem
Vijay Ganesh, Charles W. O’Donnell, Mate Soos, Srinivas Devadas, Martin C. Rinard, and Armando Solar-Lezama
SAT Conference, Trento, Italy, 2012

SAT Solvers “are a black-box” Problem
Users want more Control
• The story so far …
  • SAT solvers have been amazingly successful in many fields
  • AI, formal methods, testing, program analysis, …
  • New applications every day (e.g., biology)
• However
  • Diminishing returns from “baked-in” solver heuristics
  • For most users, solvers are a magic black box that is difficult to control
  • “How can I integrate my heuristic into the solver with minimal effort?”

RNA-folding Problem
A programmatic SAT-based Solution
• Why does a SAT solver-based approach for Bio make sense?
  • Computational biology is an ideal domain for declarative languages like SAT
  • Because problems are often modeled as mathematical constraints
  • Biologists prefer to prototype models quickly and minimize coding
• Is a simple translation to SAT sufficient?
  • Unfortunately, no!
  • The naïve SAT representation of problem instances blows up
  • Users need greater control over the solver heuristics

An Effective Solution to the Black-box Problem
Programmatic Solvers
• Key Idea: Expose solver internals to the user through a programmatic callback API
• The user writes code against the API to influence solver behavior
• Users gain control
[Diagram: Input Formula, SAT SOLVER, Result, USER CODE]

Central Dogma of Biology
From DNA through RNA to Function
[Diagram: DNA, RNA, Proteins, Function]
• DNA is transcribed into messenger RNA
• RNA encodes amino acid chains that fold into proteins
• Folded proteins interact to control function

RNA-folding: 3-D Structure of non-coding RNA
From Structure to Function
[Figure labels: RNA-Protein, DNA, RNA, RNA-DNA, Protein-RNA-DNA, RNA-RNA-DNA, Protein-RNAProtein-DNA, RNA-RNA. Source: Guttman, Rinn 2012]

RNA
What is its Structure?
[Diagram: RiboNucleic Acid = phosphate + sugar + base]
4 bases: C = Cytosine, G = Guanine, A = Adenine, U = Uracil

RNA-folding Problem
From Structure to Function
Question: How does the RNA sequence determine its folded 3D structure, and thus its function?
• It is easy to determine RNA’s primary structure through bio experiments
• It is very expensive to determine the 3D structure through bio experiments alone
• Hence, we need computational tools that predict the optimal 3D structure of RNA
[Figure: “Primary” structure (sequence), 3D structure]

RNA-folding problem
Optimal Secondary Structure Prediction Problem
Unfortunately, modeling every atom/electron is too computationally demanding
• Solution? Implement a reduced model
• The reduced model is called the secondary structure
• The secondary structure is an approximate planar representation of the 3D structure
[Figure: “Primary” structure (sequence), “Secondary” structure, “Tertiary” structure (3D)]

RNA-folding problem
Optimal Secondary Structure Prediction Problem
Computational thermodynamics-based solution:
• Define an energetic “cost” function for all possible structures
• Must be optimal
• Search all structures to find the “best” (minimum energy funnel) [Dill/Chan]
[Figure: four candidate structures with Score( ) = 839 kcal/mol, 992 kcal/mol, 1029 kcal/mol, and 2267 kcal/mol. Image credits: UIUC, UNC LCCC, Rothamsted Res]

RNA-folding problem
Quick Recap
Given the primary RNA sequence and a thermodynamic cost function, can we predict the optimal secondary structure?
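The recap question can be written as an optimization problem. The notation below is an assumption introduced only for illustration and does not appear on the slides: s is the primary sequence, V(s) is the set of valid secondary structures for s, and E is the thermodynamic cost function.

```latex
\hat{S}(s) \;=\; \operatorname*{arg\,min}_{S \,\in\, \mathcal{V}(s)} E(s, S)
```

Read this way, prediction means searching V(s) for the structure with the lowest cost, which matches the "search all structures to find the best" step of the thermodynamics-based approach.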
Obtaining RNA structures
• Lynx RNA model for secondary structure: given a string (the RNA sequence), any nucleotide at position i can pair with another at position j, subject to four general constraints (more later)
• Lynx decision problem (energy function constraint): assign independent scores to all potential (i,j) pairs, then find a valid assignment of (i,j) pairs whose scores sum to more than some threshold t

Lynx RNA structure constraints
Based on this published energy model, which assumes score independence, valid structures can be “knot-free” or contain “crossing pseudo-knots”
[Figure: “pseudo-knot”]

Lynx RNA structure constraints
• Bit-vectors X and Y (each of length n^2) indicate two independent configurations of “knot-free” (i,j) pairings
• Crossing pseudoknots are allowed by simultaneous assignment of X and Y
• Constraint 1: Every position (nucleotide) can pair with at most one other position
[Diagrams: pairing patterns of X and Y over positions i, j, k, l, m, n (panels a to h)]

Lynx RNA structure constraints
• Constraint 2: X and Y cannot assign the same pair
• Constraint 3: X and Y are knot-free on their own

Lynx RNA structure constraints
• Constraint 4: Only permit pseudoknots with well-characterized biophysical energetics (excluding this constraint would require constructing a novel energy function)

SAT Solvers and RNA representations
A Case for a Programmatic SAT Solver
• A SAT-based solution would be ideal given the constraint representation above
• However, the constraint size is n^6, where n is the length of the RNA primary structure
• The naïve representation is too large
• We want to use SAT but avoid the naïve representation and its cost
• We want to let users experiment with different secondary structure models and heuristics

How does the Programmatic Solver Work?
Energetic Constraints are Input, Structural ones are Code
• Structural constraints can grow to O(N^6), where N is the length of the RNA
• Few solvers can deal with such large sizes when N is 100 or more
• Incrementally adding constraints in the inner loop gives fine-grained control of the search
[Diagram: Energetic Constraints, SAT SOLVER, Result, Structural Constraints (N^6)]

An Effective Solution to the Black-box Problem
Programmatic Solvers
• User code examines the trail in the solver at regular intervals
• If the assignment violates a structural constraint in the user code, the user code adds a blocking clause that rules out the bad assignment
• Detect early, block bad assignments quickly.
  Far more efficient than outer-loop incrementality.
[Diagram: Energetic Constraints, SAT SOLVER, Result, Structural Constraints (N^6)]

RNA prediction results

An Effective Solution to the Black-box Problem
Advantages of Programmatic Solvers
• Memory savings if the simple SAT representation of the problem is large (n^6 for RNA)
• Time savings since bad assignments are detected in the inner loop of the SAT solver
• Domain-specific heuristics and user control
[Diagram: Energetic Constraints, SAT SOLVER, Result, Structural Constraints (N^6)]

Related Work
Incrementality and DPLL(T)
• Incrementality
  • Stuckey et al. (2007)
  • Extensible solvers
  • Abstraction-refinement in model-checking and SMT
• DPLL(T)
  • Closest related work (Tinelli, Nieuwenhuis, Oliveras 06)
  • Programmatic solvers are for lay users
  • Rich theory sub-solvers (more powerful, but more work)
• Dynamic Programming approach
  • Zuker (1981), PKNOTS, HOTKNOTS, Vienna RNA
  • Locks you into a set of modeling assumptions, unlike SAT
  • Have to make simplifying assumptions, otherwise NP-complete

Conclusions
The Power of Programmatic Solvers
• Benefits of Programmatic API
  • Flexible: easy for lay users
  • Adaptive: domain-specific sub-solvers
  • Performance: improved memory usage and time
• Possible other programmatic API choices
  • Heuristics: branching heuristics
  • Adaptive Strategies: search and restart
  • User-controlled Portfolios: parallel SAT with different heuristics
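The trail check described in the programmatic solver slides can be sketched as a small piece of user code. This is a minimal illustration, not the Lynx API: the function names `crosses` and `find_blocking_clause`, the representation of the trail as two sets of (i, j) pairs currently assigned true in X and Y, and the clause format are all assumptions made for this example. It covers Constraints 1 to 3 only; Constraint 4 depends on the energy model and is left out.

```python
from itertools import combinations


def crosses(p, q):
    """Return True if pairs p and q interleave (i < k < j < l), i.e. form a knot."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l


def find_blocking_clause(x_pairs, y_pairs):
    """Check the pairs currently set to true in X and Y (hypothetical trail view).

    x_pairs and y_pairs are sets of (i, j) tuples with i < j.
    Returns (reason, clause) for the first violation found, or None.
    The clause lists literals of the form ("not", vector, pair); at least
    one of them must hold, which blocks the offending combination.
    """
    assigned = [("X", p) for p in sorted(x_pairs)]
    assigned += [("Y", p) for p in sorted(y_pairs)]
    for (va, pa), (vb, pb) in combinations(assigned, 2):
        if pa == pb:
            # Inputs are sets, so equal pairs must come from X and Y.
            reason = "constraint 2: X and Y assign the same pair"
        elif set(pa) & set(pb):
            reason = "constraint 1: a position has two partners"
        elif va == vb and crosses(pa, pb):
            reason = "constraint 3: knot inside " + va
        else:
            # Crossing between X and Y is a permitted pseudoknot here.
            continue
        return reason, [("not", va, pa), ("not", vb, pb)]
    return None


if __name__ == "__main__":
    print(find_blocking_clause({(0, 5), (2, 8)}, set()))
    print(find_blocking_clause({(0, 5)}, {(2, 8)}))
```

In a programmatic solver, a check like this would run inside the search loop each time user code inspects the trail, so only the structural clauses that are actually violated are ever materialized rather than all O(N^6) of them up front.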