Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Extensible Storage Engine wikipedia , lookup
Concurrency control wikipedia , lookup
Microsoft SQL Server wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Relational algebra wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
Clusterpoint wikipedia , lookup
Symbolic Query Processing Eric Lo ETH Zurich a joint work with Carsten Binnig (U of Heidelberg), Donald Kossmann (ETH Zurich), Tamer Ozsu (U of Waterloo) and Peter Hass (IBM Almaden Research Center) © ETH Zürich Symbolic Query Processing Treat all data as symbols (think of variables) E.g., a1 represents any value under the domain of attribute a Table R and S are called symbolic relations ETH Zurich 2 Background – Symbolic Execution 1/3 Borrow the concept from symbolic execution A well known program verification technique Represent values of program variables with symbolic values instead of concrete data Manipulate expressions based on those symbolic values ETH Zurich 3 Background – Symbolic Execution 2/3 Find a test case for path 1236 Symbolic execution – start: 1. minsalary = read_input(); 1. minsalary = ben 2. bensalary = minsalary + 2000; 2. bensalary = ben + 2000; 3. if (bensalary < 80000) 3. bensalary = ben + 2000 4. output “no kidding!”; Symbolic execution – end 5. else 6. and !(bensalary < 80000);- ( ) output “that’s right”; Instantiate (): ben = 90000 expected input “that’s right” expected output ETH Zurich 4 Background – Symbolic Execution 3/3 Has been research for > 20 years Still have many limitations E.g., cannot handle highly complex software However, many large software vendors still put hope on this technique for program verifications E.g., Microsoft Research No progress on database applications involve an external database and SQL ETH Zurich 5 SQP Applications Extend program verification and symbolic execution techniques to support database applications For DBMS testing focus of today ETH Zurich 6 Symbolic Query Processing Query manipulates data according to different needs R b=c b1 S Want the join results to have one tuple? set c1=b1 Want the join results to have: four tuples Zipf distribution (t1 joins more, t2 joins less)? ETH Zurich 7 DBMS Testing To test a DBMS, we generate a lot of test databases and execute a lot of test queries DBMS vendors are looking for a way to control the intermediate results of a test query such that we can test an individual component of a DBMS under a particular test case ETH Zurich 8 DBMS Testing Example Test the accuracy of a cardinality estimation component of a query optimizer under a multi-way hash join query a two-way join query with aggregation If we can make sure executing the test query on the test database expected answer gives ETH Zurich 9 DBMS Testing The test query is given Physical join ordering can be fixed (by testers) Evaluation algorithm (e.g., using hash-join) can be fixed too However, the size of the intermediate results cannot be fixed easily ETH Zurich 10 DBMS Testing Problem Guarantee that executing a test query on a test database can obtain the desired intermediate query results (e.g.,. output cardinality, data distribution) ETH Zurich 11 DBMS Testing Problem A test case T is: a parametric query Qp with a set of constraints C on each intermediate result A good test database D means Qp (D) satisfies C - if the set of parameters p is properly instantiated Test case T D covers test case T ETH Zurich 12 Trial-and-error Generate Database 3, 2, and 1 Using traditional database generators such as IBM Test DB generator, MSR DB generator, etc Search for parameters T2 is never covered The database generation process does not care about the test queries ETH Zurich 13 Latest approach – Finding query parameters MSR realized this problem [TKDE06] Given the test database + the test query Qp, search parameter values for p such that Qp(D) (almost) fit the cardinality requirements defined on the test case It is a NP-hard problem Same as the previous approach, T2 is never covered ETH Zurich 14 QAGen – Query Aware test database Generator Based on symbolic query processing We can control the output size of each intermediate query result (and even more) ETH Zurich 15 QAGen – Generate a query-aware test database for each test case ETH Zurich 16 QAGen overview ETH Zurich 17 QAGen overview – Query Analyzer Analyzer the query and assign the knob to an operator A knob is a parameter of an operator to control the output (e.g., output cardinality, distribution) A knob for an operator is not always available for tuning ETH Zurich 18 QAGen overview – Query Analyzer A knob for an operator is not always available for tuning join distribution? Yes ETH Zurich join distribution? No 19 QAGen overview – Query Analyzer The available knob(s) for an operator depends on its input characteristics Definition: pre-grouping data Definition: non pre-grouping data ETH Zurich 20 QAGen overview – Query Analyzer ETH Zurich 21 Symbolic Query Engine and Symbolic Database ETH Zurich 22 Symbolic Query Engine and Symbolic Database (SDB) σa>p An SQL operator: Add predicates to a symbol Replace a symbol with another other >p <=p symbol (e.g., joining) E.g., SELECT a FROM R WHERE a > p; 1 output ETH Zurich 23 Symbolic Query Engine and Symbolic Database (SDB) How to physically store the symbolic data? >p <=p Options: Implement a native symbolic database Use relational database s - How to represent “a1 > p”? - Stores all predicates that are associated with a symbol s in a separate relation called Pred. a1 a1>p a2 a2<=p PTable PTable ETH Zurich 24 Data Instantiator ETH Zurich 25 Data Instantiation • Data instantiator uses a constraint solver: • Input: a (propositional) constraint (e.g., A + B > 50) • Output: any concrete values for the constraint (e.g., A=99, B=12) ETH Zurich 26 Symbolic Query Engine ETH Zurich 27 Symbolic Query Engine Iterator-based open(), getNext(), close() No naughty user ETH Zurich Contradicting knob values 28 SQP – Table operator Fill up the table with symbols ETH Zurich 29 SQP – σ operator ETH Zurich 30 SQP – operator (with FK constraint) Action: join key replacement ETH Zurich 31 SQP – operator (with FK constraint) Action: join key replacement ETH Zurich 32 SQP – operator (with FK constraint) When the input of the join is pre-grouped, the world has changed It sometimes happen, e.g., 2-way join Base tables A, B and C with foreign key relationships A B, B C ETH Zurich 33 SQP – operator (with FK constraint) Do not support join distribution (the knob is disabled by the analyzer) Controlling the output cardinality is a subset-sum problem (weakly NP-hard) Subset-sum has a pseudo-polynomial time exact solution using dynamic programming ETH Zurich 34 SQP – operator (with FK constraint) Blocking During open() Materialize Table S in a temporary relation SELECT COUNT(k) From S GROUP BY k Solve the subset-sum ETH Zurich 35 SQP – χ operator Action 1: Aggregation attribute replacement • o_date3 o_date1 • o_date4 o_date2 1st output group (o_date1) 2nd output group (o_date2) ETH Zurich 36 SQP – χ operator Action 2 (base case version): - Adding aggregation constraints to PTable, base case: <l_price1, aggsum1 = l_price1+ l_price2 + l_price3+l_price4 + l_price7> <l_price2, aggsum1 = l_price1+ l_price2 + l_price3+l_price4 + l_price7> <l_price3, aggsum1 = ‘’> <l_price4, aggsum1=‘’> <l_price7, aggsum1 = l_price1+ l_price2 + l_price3+l_price4 + l_price7> <l_price5, aggsum2 = l_price5+ l_price6 + l_price8> <l_price6, aggsum2 = l_price5+ l_price6 + l_price8> <l_price8, aggsum2 = ‘’> ETH Zurich 37 SQP – χ operator Action 2 (optimized version): - A constraint solver call is exponential to the size of predicates - Adding 2 aggregation constraints to PTable: <l_price1, aggsum1 = l_price1 x 5> <l_price5, aggsum2 = l_price5 x 3> and do l_price replacement ETH Zurich 38 Data Instantiation ETH Zurich 39 Data Instantiation Use a constraint solver to instantiate the symbolic database for each symbolic relation r for each tuple t for each symbol s load the related predicates P instantiate P cache P ETH Zurich 40 Experiment 1 – Operator Performance Study the performance (and scalability) of Individual operator during SQP The data instantiation phase Use TPC-dbgen to generate 3 TPCH-DB 10M, 100M, 1G Q8(TPCH-DB) to collect the intermediate results R for each operator QAGen(Q8, R) Q8 query aware database ETH Zurich 41 Experiments – TPC-H Query 8 ETH Zurich 42 Experiment 1 – TPC-H Query 8 ETH Zurich 43 Experiment 2 – Effects of knob values Use TPCH Q8 6 sets of knob values TPCH-Uniform, TPCH-Zipf Min-Uniform, Min-Zipf Max-Uniform, Max-Zipf ETH Zurich 44 Experiment 2 – Effects of knob values ETH Zurich 45 Experiment 3 – System Scalability ETH Zurich 46 Related Work, Future Work, Conclusions Reverse Query Processing (ICDE07) Given the result R, the query Q, reversely process Q to generate D for function testing database applications, view maintenance, debugging SQL Multiple SQL statements (to ACM TSE journal) ETH Zurich 47 ETH Zurich 48 Current approach 2 – Stochastically generate many test queries Based on a given test database, RAGS/QGen generates many valid SQL queries to test the system No guarantee that T1 can be covered Same as the previous approach, T2 is never covered ETH Zurich 49 QAGen overview – Query Analyzer Each knob combination (e.g., output cardinality + join distribution) for an operator may have different ways to implement it The output is an knobannotated execution plan ETH Zurich 50