Query Optimization – Seminar 2
1. Introduction
Query optimization plays a vital role in query processing. Query processing consists of the
following stages:
   1. Parsing a user query (e.g. in SQL).
   2. Translating the parse tree (representing the query) into a relational algebra expression.
   3. Optimizing the initial algebraic expression.
   4. Choosing, for each relational algebra operator, the evaluation algorithm that incurs the
      least cost for answering the query.
Stages 3 and 4 are the two parts of query optimization. Query optimization is an important and
classical component of a database system. A query written in a high-level, declarative language
such as SQL, and requiring several algebraic operations, can have several alternative
compositions and orderings of those operations. Finding a “good” composition is the job of the
optimizer. The optimizer generates alternative evaluation plans for answering a query and chooses
the plan with the least estimated cost. To estimate the cost of a plan (in terms of I/O, CPU time,
memory usage, etc., not in pounds or dollars), the optimizer uses statistical information
available in the database system catalogue.
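The plan-selection step described above can be sketched in a few lines of Python. This is only an
illustration: the `Plan` class, the plan names, the cost figures, and the CPU weight are all
hypothetical, and a real optimizer derives its estimates from catalogue statistics rather than
from hard-coded numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A candidate evaluation plan with hypothetical cost estimates."""
    name: str
    io_pages: int      # estimated number of page I/Os
    cpu_units: float   # estimated CPU work, in arbitrary units


def estimated_cost(plan: Plan, cpu_weight: float = 0.01) -> float:
    """Combine the I/O and CPU estimates into one number.

    The weight is an assumption made for this sketch only.
    """
    return plan.io_pages + cpu_weight * plan.cpu_units


def choose_plan(plans: list[Plan]) -> Plan:
    """Return the plan with the least estimated cost."""
    return min(plans, key=estimated_cost)


# Hypothetical alternatives for a single query.
candidates = [
    Plan("full file scan", io_pages=1000, cpu_units=100_000),
    Plan("binary search on sorted file", io_pages=11, cpu_units=200),
    Plan("clustered B+ tree lookup", io_pages=3, cpu_units=150),
]

best = choose_plan(candidates)
print(best.name)
```

The point of the sketch is only that the optimizer compares estimates, not measured run times; how
good the final choice is depends on how accurate those estimates are.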
2. Objectives
Generally speaking, the purpose of this seminar is to present and discuss how relational algebra
(RA) operators are evaluated, in other words, which algorithms are used to implement them.
In particular, this seminar compares different evaluation algorithms for the selection (also
known as Restrict) operation. In addition, the exercises explore how the different physical
access methods available to an RDBMS influence the choice of evaluation strategy.
3. Reading
Please read Sections 12.1 to 12.2.4 of Chapter 12 of the book “Database Management Systems”
(copies provided). This will give you an introduction to query processing and to the different
ways of implementing the selection operation of relational algebra.
4. Reading Summary
Evaluation of the selection operation: This operation can be implemented using several access
methods, namely:
No Index, unsorted data: The cost of this approach is M I/Os, where M is the number of
pages in the file. The cost is generally the same whether the query selects all of the tuples
or only a few, because every page must be scanned. For the given example (Figure 12.1) this
cost is 1000 I/Os.
No Index, sorted data: The cost is the sum of the binary search cost (log2 M) and the cost
of scanning the required tuples sequentially (this depends on the number of such tuples).
For the example, the total cost is (log2 1000) + 1, where the 1 is for a single page, assuming
that the 100 qualifying tuples with rname = ‘Joe’ are all stored in the same page. That is, the
cost is 9.9657 + 1 = 10.9657 ≈ 11 I/Os. Note that log2 1000 is computed as follows, where log
denotes the base-10 logarithm:
Let x = log2 1000 ⇒ 2^x = 1000 ⇒ log 2^x = log 1000 ⇒ x log 2 = log 1000
⇒ x = log 1000 / log 2 ⇒ x = 3 / 0.30103 ≈ 9.9657
B+ Tree index: This is well suited to non-equality (for example, range) selection conditions.
The index can be clustered or unclustered. Clustered means that the records in the data file
are ordered similarly to the ordering of entries in the index. Assuming that we have a clustered
B+ tree index and that there are 100 qualifying tuples, the cost would be 2 I/Os (for
identifying the initial page) plus 1 I/O (for scanning the 100 tuples), i.e. 3 I/Os in total.
Hash Index: This is ideal for equality conditions. The cost depends on how many tuples
are likely to be selected and whether the index is clustered or not. The cost of evaluating
the example would be 1 I/O (for retrieving the index page that holds the rids, i.e. the record
identifiers, of the 100 tuples) plus 1 I/O (for scanning the 100 tuples, assuming that they
are in 1 page, i.e. clustered), i.e. 2 I/Os in total.
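The four cost estimates in this summary can be reproduced with a short Python sketch. The
function names are made up for illustration, and the default values (2 I/Os to reach the first
page through the B+ tree, 1 I/O for the hash lookup, one page of qualifying tuples) are simply
the figures assumed in the example above, not general constants.

```python
import math


def cost_unsorted_scan(num_pages: int) -> int:
    """No index, unsorted data: every page must be read."""
    return num_pages


def cost_sorted_file(num_pages: int, matching_pages: int) -> int:
    """No index, sorted data: binary search plus a scan of the matching pages.

    The fractional result is rounded up to a whole number of I/Os.
    """
    return math.ceil(math.log2(num_pages) + matching_pages)


def cost_clustered_btree(matching_pages: int, traversal_ios: int = 2) -> int:
    """Clustered B+ tree: reach the first qualifying page, then scan."""
    return traversal_ios + matching_pages


def cost_clustered_hash(matching_pages: int, lookup_ios: int = 1) -> int:
    """Clustered hash index: fetch the index page, then the data pages."""
    return lookup_ios + matching_pages


M = 1000             # pages in the example relation
MATCHING_PAGES = 1   # the 100 qualifying tuples are assumed to share one page

costs = {
    "no index, unsorted": cost_unsorted_scan(M),
    "no index, sorted": cost_sorted_file(M, MATCHING_PAGES),
    "clustered B+ tree": cost_clustered_btree(MATCHING_PAGES),
    "clustered hash index": cost_clustered_hash(MATCHING_PAGES),
}

for method, cost in costs.items():
    print(f"{method}: {cost} I/Os")
```

Changing `MATCHING_PAGES` shows how the last three estimates grow with the number of qualifying
pages, while the unsorted scan stays at M regardless.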
5. Exercise
Consider a relation R(a, b, c, d, e) containing 5,000,000 records (tuples), where each data page
of the relation holds 10 records. R is organized as a sorted file with dense secondary indexes.
Assume that R.a is a candidate key for R, with values lying in the range of 0 to 4,999,999, and
that R is sorted in R.a order. For each of the following relational algebra expressions (i.e.
queries), state which of the following three approaches (evaluation strategies) is most likely to be
the cheapest.
a) Access the sorted file for R directly.
b) Use a B+ tree index (clustered) on attribute R.a.
c) Use a hashed index (clustered) on attribute R.a.
1. σ_{a<50,000}(R)
2. σ_{a=50,000}(R)
3. σ_{a50,000}(R)
where σ denotes the selection operation of relational algebra (e.g. σ_{a<50,000}(R)).
Hints: First calculate how many pages there are in R. Then calculate the cost of each approach
using the formulas given in the reading material and select the one with the least cost.
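As a starting point, the hints can be turned into a small Python sketch for the first two
queries. It is a rough aid, not a model answer: the helper names are invented here, the B+ tree
traversal cost of 2 I/Os and the hash lookup cost of 1 I/O are borrowed from the reading example
(a tree over a file of this size may well be taller), and a range query through the hash index
is assumed to fall back to a full scan because hashing does not support range searches.

```python
import math

NUM_RECORDS = 5_000_000
RECORDS_PER_PAGE = 10
NUM_PAGES = NUM_RECORDS // RECORDS_PER_PAGE  # pages in R


def pages_for(num_tuples: int) -> int:
    """Pages occupied by num_tuples consecutive records of the sorted file."""
    return math.ceil(num_tuples / RECORDS_PER_PAGE)


def sorted_file_cost(matching_tuples: int) -> int:
    """Binary search for the first match, then scan the matching pages."""
    return math.ceil(math.log2(NUM_PAGES)) + pages_for(matching_tuples)


def btree_cost(matching_tuples: int, traversal_ios: int = 2) -> int:
    """Clustered B+ tree: assumed traversal cost plus the matching pages."""
    return traversal_ios + pages_for(matching_tuples)


def hash_cost(matching_tuples: int, is_equality: bool, lookup_ios: int = 1) -> int:
    """Clustered hash index; a non-equality condition falls back to a full scan."""
    if not is_equality:
        return NUM_PAGES
    return lookup_ios + pages_for(matching_tuples)


# Query 1: a < 50,000 matches the 50,000 keys 0..49,999.
# Query 2: a = 50,000 matches exactly one tuple, since a is a candidate key.
queries = {
    "a < 50,000": (50_000, False),
    "a = 50,000": (1, True),
}

estimates = {
    label: {
        "sorted file": sorted_file_cost(n),
        "B+ tree": btree_cost(n),
        "hash index": hash_cost(n, is_eq),
    }
    for label, (n, is_eq) in queries.items()
}

print(f"pages in R: {NUM_PAGES}")
for label, by_method in estimates.items():
    print(label, by_method)
```

Note that for a range starting at the smallest key, the sorted file can be scanned from its first
page without any binary search, so the sorted-file estimate for the first query is slightly
pessimistic.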