Query Optimization – Seminar 2

1. Introduction

Query optimization plays a vital role in query processing. Query processing consists of the following stages:

   1. Parsing a user query (e.g. in SQL).
   2. Translating the parse tree (representing the query) into a relational algebra expression.
   3. Optimizing the initial algebraic expression.
   4. Choosing, for each relational algebra operator, the evaluation algorithm that incurs the least cost for answering the query.

Stages 3 and 4 are the two parts of query optimization.

Query optimization is an important and classical component of a database system. A query written in a high-level, declarative language such as SQL that requires several algebraic operations can have several alternative compositions and orderings of those operations. Finding a “good” composition is the job of the optimizer. The optimizer generates alternative evaluation plans for answering a query and chooses the plan with the least estimated cost. To estimate the cost of a plan (in terms of I/O, CPU time, memory usage, etc., but not in pounds or dollars), the optimizer uses statistical information available in the database system catalogue.

2. Objectives

Generally speaking, the purpose of this seminar is to present and discuss how relational algebra (RA) operators are evaluated, in other words, how RA operators are implemented by algorithms. In particular, this seminar compares different evaluation algorithms for the selection (also known as Restrict) operation. In addition, the exercises explore how the physical access methods available to an RDBMS influence the choice of how to evaluate queries.

3. Reading

Please read sections 12.1 to 12.2.4 of Chapter 12 of the book “Database Management Systems” (copies provided). This will give you an introduction to query processing and to the different ways of implementing the selection operation of relational algebra.

4. Reading Summary

Evaluation of the selection operation: this operation can be implemented using several access methods, namely:

No index, unsorted data: The cost of this approach is M I/Os, where M is the number of pages in the file. Whether the selection returns all of the tuples or only a few, the cost is generally the same, because the whole file has to be scanned. For the given example (figure 12.1) this cost is 1000 I/Os.

No index, sorted data: The cost is the sum of the binary search cost (log2 M) and the cost of scanning the qualifying tuples sequentially (which depends on the number of such tuples). For the example, the total cost is (log2 1000) + 1, where the 1 is for a single page, assuming that the 100 qualifying tuples with rname = ‘Joe’ are stored on the same page. That is, the cost is 9.9658 + 1 = 10.9658 ≈ 11 I/Os. Note that log2 1000 is computed as follows:

   Let x = log2 1000
   2^x = 1000
   log 2^x = log 1000
   x log 2 = log 1000
   x = log 1000 / log 2
   x = 3 / 0.30103 ≈ 9.9658

B+ tree index: This is well suited to range (non-equality) selection conditions. The index can be clustered or unclustered. Clustered means that the records in the data file are ordered in the same way as the entries in the index. Assuming that we have a clustered B+ tree index and that there are 100 qualifying tuples, the cost would be 2 I/Os (for identifying the initial page) plus 1 I/O (for scanning the 100 tuples), i.e. 3 I/Os in total.

Hash index: This is ideal for equality conditions. The cost depends on how many tuples are likely to be selected and on whether the index is clustered or not. The cost of evaluating the example would be 1 I/O (for retrieving the index page that holds the rids, i.e. record identifiers, of the 100 tuples) plus 1 I/O (for scanning the 100 tuples, assuming that they are on 1 page, i.e. clustered), i.e. 2 I/Os in total.

5. Exercise

Consider a relation R(a, b, c, d, e) containing 5,000,000 records (tuples), where each data page of the relation holds 10 records. R is organized as a sorted file with dense secondary indexes.
Assume that R.a is a candidate key for R, with values lying in the range 0 to 4,999,999, and that R is sorted in R.a order. For each of the following relational algebra expressions (i.e. queries), state which of the following three approaches (evaluation strategies) is most likely to be the cheapest.

   a) Access the sorted file for R directly.
   b) Use a B+ tree index (clustered) on attribute R.a.
   c) Use a hashed index (clustered) on attribute R.a.

   1. σ_{a<50,000}(R)
   2. σ_{a=50,000}(R)
   3. σ_{a 50,000}(R)

where σ denotes the selection operation of relational algebra (e.g. σ_{a<50,000}(R)).

Hints: Work out how many pages there are in R. Then calculate the cost of each approach using the formulas given in the reading material and select the one that gives the least cost.
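
The cost formulas from the reading summary can be written as small functions, which may help when working through the hints. The sketch below applies them to the reading-summary example (M = 1000 pages, 100 qualifying tuples stored on 1 page), not to the exercise relation. The function and parameter names are hypothetical, and the defaults of 2 I/Os for the B+ tree descent and 1 I/O for the hash lookup are simply the figures assumed in that example, not general constants.

```python
import math


def cost_unsorted_scan(num_pages):
    """No index, unsorted data: every page of the file is read."""
    return num_pages


def cost_sorted_file(num_pages, matching_pages):
    """No index, sorted data: binary search plus a scan of the matching pages."""
    return math.log2(num_pages) + matching_pages


def cost_clustered_btree(matching_pages, descent_ios=2):
    """Clustered B+ tree: locate the first page, then scan the matching pages.

    descent_ios=2 is the value assumed in the reading-summary example.
    """
    return descent_ios + matching_pages


def cost_clustered_hash(matching_pages, lookup_ios=1):
    """Clustered hash index: fetch the index page, then read the matching pages.

    lookup_ios=1 is the value assumed in the reading-summary example.
    """
    return lookup_ios + matching_pages


# Reading-summary example: 1000 pages, 100 qualifying tuples on a single page.
M = 1000
MATCHING_PAGES = 1

costs = {
    "no index, unsorted": cost_unsorted_scan(M),
    "no index, sorted": cost_sorted_file(M, MATCHING_PAGES),
    "clustered B+ tree": cost_clustered_btree(MATCHING_PAGES),
    "clustered hash": cost_clustered_hash(MATCHING_PAGES),
}

for method, cost in costs.items():
    print(f"{method}: {cost:.4f} I/Os")
```

With these inputs the functions reproduce the figures in the reading summary: 1000, roughly 11, 3 and 2 I/Os. For the exercise, replace M and MATCHING_PAGES with the values you derive for R and for each query.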