Query Optimization (introduction to query processing)
Advanced Databases
By Dr. Akhtar Ali
Advanced Databases: Lecture 2 Query Optimization (I)
What is Optimization
• Best use of resources.
– Good time management
– Effective allocations of lecturers, labs to course units
• Efficient solution to a problem.
– Quick response to a user query
• Less costly.
– Solar Energy Vs. Nuclear Vs. hydro-electric power
– Minimum I/O, CPU cycles, Memory Space
Query Optimization
• A classical component of a DBMS.
• Choosing best composition of algebraic operators to answer a query.
– A query (e.g. in SQL) may have several alternative representations in
algebra.
– The optimizer selects a best possible algebraic representation.
• Choosing an efficient and less costly plan to answer a query.
– One that takes less time to compute.
– One with least cost (in terms of I/Os).
• Why Query Optimization?
– To make query evaluation faster.
– To reduce the response time of the query processor.
– To allow the user to write queries without being aware of the physical
access mechanisms and without having to tell the system explicitly
how the queries should be evaluated.
Recommended Text
• Database Management Systems
By R. Ramakrishnan, Chapters 12, 13 (copy provided)
• Fundamentals of Database Systems – 3rd Edition
By R. Elmasri and S. B. Navathe, Chapter 18
• An Introduction to Database Systems – 7th Edition
By C. J. Date, Chapter 17
Query Processing – the context
• The figure shows the stages through which a query passes:
– The user/application submits an SQL query.
– Scanning, parsing, validating: produces a parse tree.
– Translator: produces a Relational Algebra query tree.
– Logical Optimizer: uses transformations to produce an optimized Relational Algebra query tree.
– Physical Optimizer: uses a cost model to produce code to execute the query.
– Runtime Database Engine: executes the code and returns the result of the query.
• The Catalog supplies meta data and database statistics; the Database supplies the data.
Example database schema
• We will use the following schema throughout this lecture:
Sailors(sid:integer, sname:string, rating:integer, age:real)
Reserves(sid:integer, bid:integer, day:date, rname:string)
• Consider the following statistics about the relations.
– Each tuple of Reserves is 40 bytes long,
– A data page can hold 100 Reserves tuples,
– The size of Reserves relation is 1000 pages,
– Each tuple of Sailors is 50 bytes long,
– A data page can hold 80 Sailors tuples, and
– The size of Sailors relation is 500 pages.
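The statistics above can be captured in a small record so that derived quantities are easy to check. This is a minimal sketch: the `RelationStats` class and its field names are hypothetical, and the tuple counts assume that every page is full.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationStats:
    """Catalog-style statistics for one relation (hypothetical helper)."""
    name: str
    tuple_bytes: int      # size of one tuple in bytes
    tuples_per_page: int  # how many tuples fit on a data page
    pages: int            # size of the relation in pages

    @property
    def tuples(self) -> int:
        # Assumes every page is completely full.
        return self.tuples_per_page * self.pages

    @property
    def bytes_used_per_page(self) -> int:
        return self.tuple_bytes * self.tuples_per_page


RESERVES = RelationStats("Reserves", tuple_bytes=40, tuples_per_page=100, pages=1000)
SAILORS = RelationStats("Sailors", tuple_bytes=50, tuples_per_page=80, pages=500)

for rel in (RESERVES, SAILORS):
    print(rel.name, rel.tuples, "tuples,", rel.bytes_used_per_page, "bytes used per page")
```

Under the full-page assumption, Reserves holds 100,000 tuples and Sailors holds 40,000, and both relations use 4000 bytes of tuple data per page.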
Translating SQL into Relational Algebra
• After the SQL query has been parsed and found to be syntactically
correct, it is mapped onto a Relational Algebra (RA) expression,
usually shown as a query tree (read bottom up).
• Consider the SQL query:
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid = S.sid
AND R.bid = 100 AND S.rating > 5
• The same query in RA:
π sname (σ bid=100 and rating>5 (Reserves ⋈ sid=sid Sailors))
• As a query tree (bottom up): the leaves Reserves and Sailors feed ⋈ sid=sid, followed by σ bid = 100 and rating > 5, with π sname at the root.
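To make the bottom-up reading of the RA expression concrete, the sketch below evaluates join, then selection, then projection over in-memory lists. The sample rows and the helper names `join`, `select` and `project` are hypothetical, and the projection keeps duplicates as SQL does (strict RA would remove them).

```python
# Hypothetical sample rows, not taken from the lecture.
RESERVES = [
    {"sid": 22, "bid": 100, "day": "1996-10-10", "rname": "Joe"},
    {"sid": 31, "bid": 101, "day": "1996-11-10", "rname": "Ann"},
    {"sid": 58, "bid": 100, "day": "1996-11-12", "rname": "Bob"},
]
SAILORS = [
    {"sid": 22, "sname": "Dustin", "rating": 7, "age": 45.0},
    {"sid": 31, "sname": "Lubber", "rating": 8, "age": 55.5},
    {"sid": 58, "sname": "Rusty", "rating": 3, "age": 35.0},
]


def join(left, right, attr):
    """Equi-join on a shared attribute by comparing every pair of tuples."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]


def select(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def project(rows, attrs):
    """Keep only the listed attributes (duplicates are not removed)."""
    return [tuple(row[a] for a in attrs) for row in rows]


# Evaluate the tree bottom up: join, then selection, then projection.
joined = join(RESERVES, SAILORS, "sid")
selected = select(joined, lambda t: t["bid"] == 100 and t["rating"] > 5)
result = project(selected, ["sname"])
print(result)
```

With these sample rows only sailor 22 reserved boat 100 and has a rating above 5, so the result contains a single sname.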
Implementation of Relational Operators
• We will discuss how to implement:
– Selection (σ) Selects a subset of rows from a relation.
– Projection (π) Picks only required attributes and removes
unwanted attributes from a relation.
– Join (⋈) Combines two relations.
Access Paths
• There is usually more than one way to retrieve tuples from a
relation, if indexes are available and if the query contains
selection conditions.
• The selection condition comes from a select or a join.
• The alternative ways to retrieve tuples from a relation are
called access paths.
• An access path is either:
– A file scan (when there is no selection condition or no index can
be used).
– An index plus a matching selection condition. For example, attr
op value, where op is an operator (<, >, =), and there is an index
available on attr.
Implementing Selection operator
• Depends on the available file organizations, that is whether
we have:
– No index available and the physical file for a given relation is
unsorted. This is the most expensive case.
– No index but the file is sorted on some attribute.
– A B+ tree index is available.
– A Hash index is available.
• The cost of the selection operator differs in each of the above
cases, and that is the main thing to know.
Selection Operator – an Example Query
• Consider the following query:
SELECT *
FROM Reserves
WHERE rname = 'Joe'
• Assume that 100 tuples qualify for the result of the above
query, that is, 100 tuples have rname = 'Joe'.
Selection using no index & no sorting
• For a general selection query σ R.attr op value (R), we have to
scan the entire file to get the qualifying tuples. Note that op
can be <, >, =, <>, etc.
• Each tuple is tested to see if the given condition
(R.attr op value) holds. If the condition holds then the
tuple is added to the result.
• The cost of this approach is M I/Os, where M is the number
of pages in R.
• For the example query, the cost is 1000 I/Os because there
are 1000 pages in Reserves relation.
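The scan described above can be sketched as a loop that charges one I/O per page read. The file layout is hypothetical: `build_reserves` spreads the 100 'Joe' tuples evenly over the file, which is an assumption made only to have some data to scan.

```python
def build_reserves(pages=1000, tuples_per_page=100, matches=100):
    """Build an unsorted file as a list of pages (hypothetical data).

    Exactly `matches` tuples get rname 'Joe', spread evenly over the file.
    """
    total = pages * tuples_per_page
    step = total // matches
    file = []
    for p in range(pages):
        page = []
        for slot in range(tuples_per_page):
            i = p * tuples_per_page + slot
            rname = "Joe" if i % step == 0 else "Other"
            page.append({"sid": i, "rname": rname})
        file.append(page)
    return file


def scan_select(file, predicate):
    """Full file scan: test every tuple on every page."""
    ios = 0
    result = []
    for page in file:
        ios += 1  # reading one page costs one I/O
        result.extend(t for t in page if predicate(t))
    return result, ios


rows, ios = scan_select(build_reserves(), lambda t: t["rname"] == "Joe")
print(len(rows), "qualifying tuples,", ios, "I/Os")
```

The I/O count equals the number of pages regardless of how many tuples qualify, which is why this access path costs M I/Os.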
Selection using sorting but no index
• For a general selection query σ R.attr op value (R), if R is physically sorted on
R.attr, we use a binary search to locate the first qualifying tuple.
• We keep on testing the condition on the tuples in every page that is scanned
and add them to the result until the condition fails to hold.
• The cost of this approach is equal to the cost of binary search plus the
number of pages that have been read.
– The cost of binary search = log2 M I/Os
– The cost of retrieving tuples = T I/Os where T is the number of pages scanned to retrieve
the qualifying tuples.
• For the example query, the cost is computed as follows:
– The binary search cost = log2 1000 = log 1000 / log 2 ≈ 9.97 ≈ 10
– Since 100 tuples qualify and a data page holds 100 Reserves tuples, 1 page will
hold these tuples and scanning that page will cost 1 I/O.
– So the total cost is 10 + 1 = 11 I/Os.
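The cost formula for a sorted file can be written as a small function. This is a sketch under the same simplification as the slide: the qualifying tuples are assumed to be stored contiguously and to start at a page boundary, so T is the number of qualifying tuples divided by the page capacity, rounded up. The function name is hypothetical.

```python
import math


def sorted_file_selection_cost(pages_in_file, qualifying_tuples, tuples_per_page):
    """Return (binary search I/Os, scan I/Os, total I/Os) for a sorted file."""
    # Binary search over the pages of the file: log2 M, rounded up.
    search = math.ceil(math.log2(pages_in_file))
    # Pages scanned to retrieve the qualifying tuples (T).
    scan = math.ceil(qualifying_tuples / tuples_per_page)
    return search, scan, search + scan


search, scan, total = sorted_file_selection_cost(1000, 100, 100)
print(search, "+", scan, "=", total, "I/Os")
```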
B+ tree Index
[Figure: B+ tree index. Root: 10, 20. Second level: 6 | 12 | 23, 35. Leaf level: 3* 4* | 6* 9* | 10* 10* | 12* 13* | 20* 22* | 23* 31* | 35* 36*]
• A B+ tree index is a balanced tree in which the internal nodes (the top two levels) direct the search and
the leaf nodes contain data entries.
• Searching for a record requires just a traversal from the root to the appropriate leaf node.
• The length of the path from the root to a leaf is called the height of the tree (usually 2 or 3).
• To search for entry 9*, we follow the leftmost child pointer from the root (as 9 < 10). Then at level
two we follow the right child pointer (as 9 > 6). Once at the leaf node, data entries can be found
sequentially.
• Leaf nodes are interconnected, which makes the index suitable for range queries.
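The root-to-leaf traversal can be sketched in a few lines. The tree below is a reconstruction of the slide's figure and should be treated as an assumption (root keys 10 and 20, second-level keys 6, 12, and 23 with 35); the dictionary representation and the helper names are hypothetical.

```python
from bisect import bisect_right


def leaf(*entries):
    """A leaf node holding data entries."""
    return {"leaf": True, "entries": list(entries)}


def node(keys, children):
    """An internal node: len(children) == len(keys) + 1."""
    return {"leaf": False, "keys": keys, "children": children}


# Assumed layout, reconstructed from the figure.
TREE = node([10, 20], [
    node([6], [leaf(3, 4), leaf(6, 9)]),
    node([12], [leaf(10, 10), leaf(12, 13)]),
    node([23, 35], [leaf(20, 22), leaf(23, 31), leaf(35, 36)]),
])


def search(tree, key):
    """Return (entries of the leaf reached, number of internal nodes visited)."""
    current, visited = tree, 0
    while not current["leaf"]:
        # Keys smaller than a separator go left; equal or larger go right.
        i = bisect_right(current["keys"], key)
        current = current["children"][i]
        visited += 1
    return current["entries"], visited


entries, visited = search(TREE, 9)
print(entries, visited)
```

For key 9 the search takes the leftmost pointer at the root (9 < 10) and the right pointer at level two (9 > 6), visiting two internal nodes before reaching the leaf that holds 6* and 9*.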
Selection using B+ tree index
• For a general selection query σ R.attr op value (R), a B+ tree is best if
op is not equality (e.g. <, >). It is also good for the = operator.
• We search the B+ tree to find the first page that contains a
qualifying tuple. Assume that the tree index is clustered.
• We then read all those pages that contain the qualifying tuples.
• The cost of this approach is equal to the sum of the following:
– The cost of identifying the starting page = 2 or 3 I/Os. We assume 2 I/Os
throughout.
– The cost of retrieving tuples = T I/Os where T is the number of pages scanned
to retrieve the qualifying tuples.
• For the example query, the cost is computed as follows:
– Since 100 tuples qualify and a data page holds 100 Reserves tuples, 1 page will
hold these tuples and scanning that page will cost 1 I/O.
– So the total cost is 2 + 1 = 3 I/Os.
Hash Index
[Figure: hash index. Directory (global depth 2) with elements 00, 01, 10, 11, each pointing to a data page with local depth 2. Bucket A: 4* 12* 32* 16*. Bucket B: 1* 5* 21*. Bucket C: 10*. Bucket D: 15* 7* 19*]
• A function called a hash function is applied to the hash field value (key field)
to get the address of the disk page in which the record is stored.
• The directory is an array of size n (4 in the figure); each element is a pointer
to a bucket.
• A bucket is a set of records.
• To search for a data entry:
– the hash function is applied to the search field and the last bits of its binary form are used to get a
number between 0 and 3.
– this number gives the array position to get the pointer to the desired bucket.
– to locate a record with key field 5 (binary 101), we look at directory element 01 and follow the
pointer to the data page (Bucket B).
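The directory lookup can be sketched as follows. Two things here are assumptions: the hash function is taken to be the identity (the key's own binary form is used), and the bucket contents are read off the figure by grouping entries on their last two bits. The names `directory_slot` and `lookup` are hypothetical.

```python
GLOBAL_DEPTH = 2  # number of trailing bits used to index the directory

# Bucket contents as read from the figure (assumed grouping).
BUCKETS = {
    "A": [4, 12, 32, 16],
    "B": [1, 5, 21],
    "C": [10],
    "D": [15, 7, 19],
}

# Directory: an array of 4 pointers, indexed by the last two bits.
DIRECTORY = {"00": "A", "01": "B", "10": "C", "11": "D"}


def directory_slot(key, depth=GLOBAL_DEPTH):
    """Last `depth` bits of the key, as a bit string (identity hash assumed)."""
    return format(key & ((1 << depth) - 1), f"0{depth}b")


def lookup(key):
    """Return (directory element, bucket name, whether the key is stored there)."""
    slot = directory_slot(key)
    bucket = DIRECTORY[slot]
    return slot, bucket, key in BUCKETS[bucket]


print(lookup(5))
```

Key 5 is 101 in binary, so its last two bits select directory element 01, which points to Bucket B.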
Selection using Hash Index
• For a general selection query σ R.attr op value (R), a hash index is best if op
is equality (=). It is not suitable for other operators (e.g. <, >, <>).
• We retrieve the index page that contains the rids (record identifiers) of the
qualifying tuples.
• Then the pages that contain these tuples are scanned.
• The cost of this approach is equal to the sum of the following:
– The cost to retrieve the index page = 1 I/O
– The cost of retrieving tuples = T I/Os where T is the number of pages scanned to retrieve
the qualifying tuples.
– For non-equality operators, T = the number of qualifying tuples.
• For the example query, the cost is computed as follows:
– Since 100 tuples qualify and a data page holds 100 Reserves tuples, 1 page will
hold these tuples and scanning that page will cost 1 I/O.
– So the total cost is 1 + 1 = 2 I/Os.
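The four access paths for the example query can be compared side by side. This sketch simply restates the lecture's cost formulas; it assumes, as the slides do, that the 100 qualifying tuples fit on one page, that reaching the starting page through the B+ tree costs 2 I/Os, and that the hash index page costs 1 I/O. The variable names are hypothetical.

```python
import math

M = 1000               # pages in Reserves
QUALIFYING = 100       # tuples with rname = 'Joe'
TUPLES_PER_PAGE = 100  # Reserves tuples per data page

# Pages scanned to retrieve the qualifying tuples (T).
T = math.ceil(QUALIFYING / TUPLES_PER_PAGE)

costs = {
    "file scan (no index, unsorted)": M,
    "sorted file, binary search": math.ceil(math.log2(M)) + T,
    "clustered B+ tree index": 2 + T,
    "hash index": 1 + T,
}

# Print the access paths from cheapest to most expensive.
for path, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{path}: {cost} I/Os")
```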
Summary of Lecture 7
• Query Optimization
– What and why
• Query Processing
– The various stages through which a query goes
• Translation of SQL into Relational Algebra
– Internal representation of the query
• Access Paths
– Different paths and ways to get the same data
• Implementation of the Selection Operator
– Different ways of evaluating selection using different access paths