Download Implementation of Two Semantic Query Optimization Techniques in

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Data Mining for Query Optimization
Jarek Gryz
Outline
•
•
•
•
Semantic Query Optimization
Soft Constraints
Query Optimization via Soft Constraints
Selectivity Estimation via Soft Constraints
2
Semantic Query Optimization
Use integrity constraints associated with a database to rewrite
a query into a form that may be evaluated more efficiently
Some Techniques:
•Join Elimination
•Predicate Elimination
•Join Introduction
•Predicate Introduction
•Detecting an Empty Answer Set
3
Commercial implementations of SQO
Few (if any!)
Early Experiences:
•Could not spend too much time on optimization
•Few integrity constraints are ever defined
•Association with deductive databases
4
Join elimination: example
select
from
where
p_name, p_retailprice, s_name, s_address
tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplier
p_partkey = ps_partkey and s_suppkey = ps_suppkey and
ps_partkey = l_partkey and ps_suppkey = l_suppkey;
RI constraints:
select
from
where
part-partsupp (on partkey)
supplier-partsupp (on partkey)
partsupp-lineitem (on partkey and suppkey)
p_name, p_retailprice, s_name, s_address
tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplier
p_partkey = l_partkey and s_suppkey = l_suppkey;
5
Algorithm for join elimination
1. Derive column transitivity classes from the
join predicates in the query
2. Divide the relations in the query that are
related through RI constraints into
removable and non-removable
3. Eliminate all removable relations from the
query
4. Add is not null predicate to foreign key
columns of all tables whose RI parents were
removed
6
Algorithm for join elimination: example
S.S
PS.S
S.S
PS.S
PS.S
C.C
O.C
C.C
O.C
O.C
7
Performance results for join
elimination
600
500
400
Original
Optimized
300
200
100
0
J1
J2
J3
J4
J5
J6
J7
J8
J9
J10
9
Predicate Introduction: Example
select
from
where
sum(l_extendedprice * l_discount) as revenue
tpcd.lineitem
shipdate >date('1994-01-01');
Check constraint: receiptdate >= shipdate
Clustered Index on receiptdate
select
from
where
sum(l_extendedprice * l_discount) as revenue
tpcd.lineitem
shipdate >date('1994-01-01') and receiptdate >= date('1994-01-01');
10
Algorithm for Predicate Introduction
N - set of predicates derivable from the query and
check constraints
•If N is inconsistent, stop.
•Else, for each predicate A op B in N, add it to the
query if:
•A or B is a join column
•B is a major column of an index
•no other index on B’s table can be used in the plan
for the original query
11
Queries
select
from
where
100.00 * sum
(case
when p_type like 'PROMO%'
then l_extendedprice * (1 - l_discount)
else 0
end)
/ sum(l_extendedprice * (1 - l_discount)) as promo_revenue
tpcd.lineitem, tpcd.part
l_partkey = p_partkey and
l_shipdate >= date('1998-09-01') and
l_shipdate < date('1998-09-01') + 1 month;
Given the check constraint l_receiptdate >= l_shipdate we may add
a new predicate to the query:
l_receiptdate >= date(‘1998-09-01’)
13
Performance Results for Index
Introduction
100
90
80
70
60
50
40
30
20
10
0
Original
Optimized
P1
P2
P3
P4
P5
14
The Culprit
New query plan uses an index, but the original table
scan is still better!
DATA READS
Original Query
“Optimized Query”
Physical
21607
10680
Logical
22439
286516
INDEX READS
Physical
12
2687
Logical
26
288326
CPU COST (S)
ESTIMATED # OF
QUALIFYING TUPLES
21.9
55.9
20839
12618
Why did this happen:
•incorrect estimate of the filter factor
•underestimation of the CPU cost of locking index pages
15
Soft Constraints
16
Soft Constraints
Traditional (“hard”) integrity constraints are
defined to prevent incorrect updates. A soft
constraint is a statement that is true about
the current state of the database, but does
not verify updates. In fact, a soft constraint
can be invalidated by an update.
17
Soft Constraints (cont.)
• Absolute soft constraints – no violation in
the current state of the database
Absolute soft constraints can be used for optimization in exactly
the same way traditional constraints are.
• Statistical soft constraints – can have some
(small) degree of violation
Statistical soft constraints can be used for improved selectivity
estimation
18
Implementation of Soft Constraints
In Oracle the standard integrity constraints are
marked with a rely option, so that they are
not verified on updates.
In DB2 soft constraints are called
informational constraints.
19
Informational Check Constraint
Example 1: Create an employee table where a
minimum salary of $25,000 is guaranteed by the
application
CREATE TABLE emp(empno INTEGER NOT NULL
PRIMARY KEY,
name VARCHAR(20),
firstname VARCHAR(20),
salary INTEGER CONSTRAINT minsalary
CHECK (salary >= 25000)
NOT ENFORCED
ENABLE QUERY
OPTIMIZATION);
20
Enforcing Validation
Example 2: Alter the employee table to start
enforcing the minimum wage of $25,000
using DB2. DB2 will also verify existing
data right away.
ALTER TABLE emp ALTER
CONSTRAINT minsalary ENFORCED
21
Informational RI Constraint
Example 3: Create a department table where the
application ensures the existence of departments to
which the employees belong.
CREATE TABLE dept(deptno INTEGER NOT NULL PRIMARY
KEY,
deptName VARCHAR(20),
budget INTEGER);
ALTER TABLE emp ADD COLUMN dept INTEGER NOT NULL
CONSTRAINT dept_exist
REFERENCES dept
NOT ENFORCED
ENABLE QUERY OPTIMIZATION);
22
Query Optimization via Empty
Joins
23
Example
select
from
where
Model
Tickets T, Registration R
T.RegNum = R.RegNum and T.date > “1990-01-01”
and R.Model LIKE “BMW Z3%”
First BMW Z3 series cars were made in 1997.
select
from
where
Model
Tickets T, Registration R
T.RegNum = R.RegNum and T.date > “1997-01-01”
and R.Model LIKE “BMW Z3%”
24
Matrix representation of empty joins
 (R
S)
A,B
A
3
1
3
B
6
7
8
1
6 0
7 1
8 0
2
0
0
0
3
1
0
1
25
Staircase data structure
1
( xr ,yr )
1
1
Y
1
0
1
0
1
( x1 ,y1)
1
( x, y )
1
X
0
0
0
1
0
0
0
0
0
0
26
Properties of the algorithm
•Time Complexity O(nm)
requires a single scan of the sorted data
•Space Complexity O(min(n,m))
only two rows of the matrix need be kept in memory
•Scalable with respect to:
•number of tuples in the join result
•number of discovered empty rectangles
•size of the domain of one of the attributes
27
How many empty rectangles are there?
Tests done on 4 pairs of attributes with numerical domain present in
typical joins in a real-world workload of a health insurance company.
45000
40000
35000
30000
25000
20000
Number of
discovered empty
rectangles
Number of tuples
in the join
15000
10000
5000
0
Test 1 Test 2 Test 3 Test 4
28
How big are the rectangles?
The sizes of the 5 largest rectangles as
% of the size of the matrix
100
50
0
Test 1 Test 2 Test 3 Test 4
5th
4th
3rd
2nd
1st largest
29
Query rewrite: simple case
select …
from R, S,...
where R.C=S.C and
60<R.A<80 and
20<S.B<80 and...
select …
from R, S,...
where R.C=S.C and
60<R.A<80 and
20<S.B<60 and...
30
Query rewrite: complex case
select
from
where
…
R, S,...
R.C=S.C and
60<R.A<80 and
20<S.B<80 and...
select
from
where
…
R, S,...
R.C=S.C and
(… and …) or
(… and …) or
(… and …) or
...
31
Experiment I: Size of the Overlap
100
90
80
70
60
50
40
30
20
10
0
Reduction of the
Size of the Table
(%)
Reduction of
Execution Time (%)
32
Experiment 2: Type of Overlap
100
80
60
40
20
Reduction of
Execution
Time (%)
0
-20
33
Experiment 3: Number of Empty
Joins Used in Rewrite
100
80
60
40
Reduction of
Execution
Time (%)
20
0
34
How much do the rectangles
overlap with queries?
25
20
15
% of rectangles
overlapping with
queries
10
5
0
Q1
Q2
Q3
Q4
Q5
35
Query optimization experiments
• real-world workload of 26 queries
• 5 of the queries “qualified” for the rewrite
• only simple rewrites were considered
• all rewrites led to improved performance
80
70
60
50
40
% improvement
in execution time
30
20
10
0
Q1
Q2
Q3
Q4
Q5
36
Query Cardinality Estimate via
Empty Joins
37
Query Cardinality Estimate via
Empty Joins (SIEQE)
• Cardinality estimates crucial for designing
good query evaluation plans
• Uniform data distribution (UDA): standard
assumption in database systems
• Histograms effective in single dimensions:
too expensive to build and maintain
otherwise
38
The Strategy
•With UDA, the “density”: 1 tuple/sq unit
•Empty joins cover 20% of the area
•Adjusted density: 1.25 tuples/sq unit
Q1
Cardinality UDA SIEQE
Q2
Q1
100
62
Q2
100
125
39
Experiments
Number of queries for which the error is less than a given limit
40
Discovery of Check Constraints
and Their Application in DB2
We discover two types of (rules) check constraints:
•correlations between attributes over ordered domains
•partitioning of attributes
42
Correlations between attributes
over ordered domains
Rules have the form:
Y = bX + a + [emin, emax]
Algorithm
for all tables in the database
for all comparable variable pairs (X and Y) in the table
apply OLS estimation to get the function of the
form: Y = a + bX
calculate the max and min error (or residual)
emax and emin
endfor
endfor
43
Partitioning
Rules have the form: If X = a, then Y
[emin, emax]
Algorithm
for all tables in the database
for any qualifying variable pair (X and Y) in the table
calculate partitions by using GROUP BY X statements
find the max and min value of Y for each partition
endfor
endfor
44
Experiments in TPC-H
TPC-H contains the following check constraint:
L_RECEIPTDATE > L_SHIPDATE
Our algorithm discovered the following rule:
L_RECEIPTDATE = L_SHIPDATE + (1, 30), m = 0.0114.
Rules discovered through partitioning:
If L_LINESTATUS=F, then L_SHIPDATE=(01/04/1992, 06/17/1995), m = 0.50
If L_LINESTATUS=O, then L_SHIPDATE=(06/19/1995, 12/25/1998), m = 0.50
45
Applications
• DBA Wizard
• Semantic Query Optimization
• Improved Filter Factor Estimates
46
Example
Consider a query issued against a hotel database, that requests the number of guests
staying in the hotel on a given date.
ARRIVAL DATE <= ‘1999-06-15’ AND DEPARTURE_DATE >= ‘1999-06-15’
The filter factor estimate for the query would be:
ff = ff1 * ff2
If ‘1999-06-15’ was approximately midway in the date ranges, we would estimate a
quarter of all the guests that came in over the number of years would be in the answer
of the query!
47
Example (cont.)
Assume that the following check constraint was discovered:
DEPARTURE_DATE >= ARRIVAL_DATE + (1 DAY, 5 DAYS)
The original condition in the query predicate can then be changed to:
ARRIVAL_DATE <= ‘1999-06-15’ AND ARRIVAL_DATE >= ‘1999-06-18’
or
ARRIVAL_DATE BETWEEN ‘1999-06-15’ AND ‘1999-06-18’
The filter factor is now estimated to:
ff = (ff1 + ff2 –1)
48
Other Research on the Use of Soft
Constraints in Query Optimization
49
Query-driven Approach
• Built multidimensional histograms based on
query results (Microsoft)
• Improve cardinality estimates by looking at
the intermediate query results (IBM)
Both techniques generate statistical soft constraints
50
Data-driven Approach
• Lots of methods using Bayesian networks to
infer statistical soft constraint
• Lots of methods to discover functional
dependencies in data (absolute soft
constraints)
• Most recently, BHUNT and CORDS use
sampling to discover soft constraints (IBM)
51
References
• Q. Cheng, J. Gryz, F. Koo, T. Y. Cliff Leung, L. Liu, X. Qian, B.
Schiefer: Implementation of Two Semantic Query Optimization
Techniques in DB2 Universal Database. VLDB 1999.
• J. Edmonds, J. Gryz, D. Liang, R. Miller: Mining for Empty Rectangles
in Large Data Sets. ICDT 2001.
• J. Gryz, B. Schiefer, J. Zheng, C. Zuzarte: Discovery and Application
of Check Constraints in DB2. ICDE 2001.
• P. Godfrey, J. Gryz, C. Zuzarte: Exploiting Constraint-Like Data
Characterizations in Query Optimization. SIGMOD 2001.
• J. Gryz, D. Liang: Query Optimization via Empty Joins. DEXA 2002.
• J. Gryz, D. Liang: Query Cardinality Estimation via Data Mining. IIS
2004.
52