Query Optimization (introduction to query processing)
Advanced Database Technologies
By Dr. Akhtar Ali
Advanced DB Tech (CG096): Lecture # 7 Query Optimization (I)

What is Optimization
• Best use of resources.
  – Good time management
  – Effective allocation of lecturers and labs to course units
• Efficient solution to a problem.
  – Quick response to a user query
• Less costly.
  – Solar energy vs. nuclear vs. hydro-electric power
  – Minimum I/O, CPU cycles, memory space

Query Optimization
• A classical component of a DBMS.
• Choosing the best composition of algebraic operators to answer a query.
  – A query (e.g. in SQL) may have several alternative representations in algebra.
  – The optimizer selects the best possible algebraic representation.
• Choosing an efficient and less costly plan to answer a query.
  – One that takes less time to compute.
  – One with the least cost (in terms of I/Os).
• Why Query Optimization?
  – To make query evaluation faster.
  – To reduce the response time of the query processor.
  – To allow users to write queries without being aware of the physical access mechanisms, and without having to tell the system explicitly how the queries should be evaluated.

Recommended Text
• Database Management Systems, by R. Ramakrishnan, Chapters 12, 13 (copy provided)
• Fundamentals of Database Systems, 3rd Edition, by R. Elmasri and S. B. Navathe, Chapter 18
• An Introduction to Database Systems, 7th Edition, by C. J. Date, Chapter 17

Query Processing – the black box view
[Figure: the same pipeline as in the clear view, drawn inside a single box labelled "DBMS Query Processor".]

Query Processing – the clear view
[Figure: user/application → SQL query → scanning, parsing, validating → parse tree → Translator → Relational Algebra query tree → Logical Optimizer (uses transformations) → optimized Relational Algebra query tree → Physical Optimizer (uses a cost model) → code to execute the query → Runtime Database Engine → result of the query. The Catalog supplies meta data and database statistics; the Database supplies data.]

Example database schema
• We will use the following schema throughout this lecture:
    Sailors(sid:integer, sname:string, rating:integer, age:real)
    Reserves(sid:integer, bid:integer, day:date, rname:string)
• Consider the following statistics about the relations.
  – Each tuple of Reserves is 40 bytes long,
  – A data page can hold 100 Reserves tuples,
  – The size of the Reserves relation is 1000 pages,
  – Each tuple of Sailors is 50 bytes long,
  – A data page can hold 80 Sailors tuples, and
  – The size of the Sailors relation is 500 pages.

Translating SQL into Relational Algebra
• After the SQL query has been parsed and found to be syntactically correct, it is mapped onto a Relational Algebra (RA) expression. This is usually shown as a query tree (bottom up).
• Consider the SQL query:
    SELECT S.sname
    FROM Reserves R, Sailors S
    WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5
• The same query in RA:
    π_sname(σ_{bid=100 and rating>5}(Reserves ⋈_{sid=sid} Sailors))
[Figure: query tree, read bottom up. Reserves and Sailors are the leaves, joined by ⋈_{sid=sid}; σ_{bid=100 and rating>5} is applied to the join; π_sname is the root.]

Implementation of Relational Operators
• We will discuss how to implement:
  – Selection (σ): selects a subset of rows from a relation.
  – Projection (π): picks only the required attributes and removes unwanted attributes from a relation.
  – Join (⋈): combines two relations.

Access Paths
• There is usually more than one way to retrieve tuples from a relation, if indexes are available and if the query contains selection conditions.
• The selection condition comes from a select or a join.
• The alternative ways to retrieve tuples from a relation are called access paths.
• An access path is either:
  – A file scan (when there is no selection condition or no index can be used).
  – An index plus a matching selection condition. For example, attr op value, where op is an operator (<, >, =) and there is an index available on attr.

Implementing Selection operator
• The implementation depends on the available file organization, that is, whether we have:
  – No index, and the physical file for the given relation is unsorted. This is the most expensive case.
  – No index, but the file is sorted on some attribute.
  – A B+ tree index.
  – A hash index.
• The cost of the selection operator differs in each of these cases, and that is the main thing to know.

Selection Operator – an Example Query
• Consider the following query:
    SELECT * FROM Reserves WHERE rname = 'Joe'
• Assume that 100 tuples qualify for the result of this query, that is, 100 tuples have rname = 'Joe'.
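
To make the translation from SQL to relational algebra concrete, the following minimal Python sketch evaluates the example query tree bottom up (join, then selection, then projection) over tiny in-memory relations. The sample rows and the function names `select`, `project`, `join` and `evaluate_example_query` are assumptions made for illustration; they are not part of the lecture, and a real DBMS operates on disk pages rather than Python lists.

```python
# Relations are modelled as lists of dicts keyed by attribute name.
# The rows below are invented sample data, not taken from the lecture.
sailors = [
    {"sid": 1, "sname": "Ann", "rating": 7, "age": 30.0},
    {"sid": 2, "sname": "Bob", "rating": 4, "age": 41.5},
    {"sid": 3, "sname": "Cy", "rating": 9, "age": 25.0},
]
reserves = [
    {"sid": 1, "bid": 100, "day": "2024-01-01", "rname": "Joe"},
    {"sid": 2, "bid": 100, "day": "2024-01-02", "rname": "Joe"},
    {"sid": 3, "bid": 101, "day": "2024-01-03", "rname": "Sue"},
]


def select(rows, predicate):
    """Selection (sigma): keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, attrs):
    """Projection (pi): keep only the listed attributes.

    Duplicates are not removed, matching SQL without DISTINCT.
    """
    return [{a: row[a] for a in attrs} for row in rows]


def join(left, right, attr):
    """Equi-join on a shared attribute, using naive nested loops."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]


def evaluate_example_query(reserves, sailors):
    """pi_sname(sigma_{bid=100 and rating>5}(Reserves join_{sid=sid} Sailors))."""
    joined = join(reserves, sailors, "sid")
    filtered = select(joined, lambda t: t["bid"] == 100 and t["rating"] > 5)
    return project(filtered, ["sname"])


result = evaluate_example_query(reserves, sailors)
print(result)
```

With this sample data only sailor 1 has both a reservation for boat 100 and a rating above 5, so one row is returned. The sketch fixes one evaluation order; choosing between equivalent orders is the optimizer's job.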

Selection using no index & no sorting
• For a general selection query σ_{R.attr op value}(R), we have to scan the entire file to get the qualifying tuples. Note that op can be <, >, =, <>, etc.
• Each tuple is tested to see whether the given condition (R.attr op value) holds. If the condition holds, the tuple is added to the result.
• The cost of this approach is M I/Os, where M is the number of pages in R.
• For the example query, the cost is 1000 I/Os because there are 1000 pages in the Reserves relation.

Selection using sorting but no index
• For a general selection query σ_{R.attr op value}(R), if R is physically sorted on R.attr, we use a binary search to locate the first qualifying tuple.
• We keep testing the condition on the tuples in every page that is scanned and add them to the result until the condition fails to hold.
• The cost of this approach is equal to the cost of the binary search plus the number of pages that have been read.
  – The cost of binary search = log₂ M I/Os
  – The cost of retrieving tuples = T I/Os, where T is the number of pages scanned to retrieve the qualifying tuples.
  – Total cost = log₂ M + T I/Os
• For the example query, the cost is computed as follows:
  – The binary search cost = log₂ 1000 = log 1000 / log 2 ≈ 9.97, rounded up to 10 I/Os.
  – Since 100 tuples qualify and a page holds 100 Reserves tuples, 1 page will hold these tuples, and scanning that page will cost 1 I/O.
  – So the total cost is 10 + 1 = 11 I/Os.

B+ tree Index
[Figure: B+ tree with root keys 10 and 20, second-level keys 6, 12, 23 and 35, and leaf data entries 3*, 4*, 6*, 9*, 10*, 10*, 12*, 13*, 20*, 22*, 23*, 31*, 35*, 36*.]
• A B+ tree index is a balanced tree in which the internal nodes (the top two levels) direct the search and the leaf nodes contain data entries.
• Searching for a record requires just a traversal from the root to the appropriate leaf node.
• The length of the path from the root to a leaf is called the height of the tree (usually 2 or 3).
• To search for entry 9*, we follow the leftmost child pointer from the root (as 9 < 10). Then at level two we follow the right child pointer (as 9 > 6). Once at the leaf node, data entries can be found sequentially.
• Leaf nodes are inter-connected, which makes the index suitable for range queries.

Selection using B+ tree index
• For a general selection query σ_{R.attr op value}(R), a B+ tree index is the best choice when op is not equality (e.g. <, >). It is also good for the = operator.
• We search the B+ tree to find the first page that contains a qualifying tuple. Assume that the tree index is clustered.
• We then read all those pages that contain the qualifying tuples.
• The cost of this approach is equal to the sum of the following:
  – The cost of identifying the starting page = 2 or 3 I/Os. We assume 2 I/Os throughout.
  – The cost of retrieving tuples = T I/Os, where T is the number of pages scanned to retrieve the qualifying tuples.
  – Total cost = 2 + T I/Os
• For the example query, the cost is computed as follows:
  – Since 100 tuples qualify and a page holds 100 Reserves tuples, 1 page will hold these tuples, and scanning that page will cost 1 I/O.
  – So the total cost is 2 + 1 = 3 I/Os.

Hash Index
[Figure: hash index with a directory of four elements (00, 01, 10, 11; global depth 2), each pointing to a bucket of local depth 2 on the data pages: Bucket A (00) holds 4*, 12*, 32*, 16*; Bucket B (01) holds 1*, 5*, 21*; Bucket C (10) holds 10*; Bucket D (11) holds 15*, 7*, 19*.]
• A function called the hash function is applied to the hash field value (key field) to get the address of the disk page in which the record is stored.
• The directory is an array of size n (4 in the figure); each element is a pointer to a bucket.
• A bucket is a set of records.
• To search for a data entry:
  – The hash function is applied to the search field, and the last bits of its binary form are used to get a number between 0 and 3.
  – This number gives the array position from which to get the pointer to the desired bucket.
  – To locate a record with key field 5 (binary 101), we look at directory element 01 and follow the pointer to the data page (Bucket B).

Selection using Hash Index
• For a general selection query σ_{R.attr op value}(R), a hash index is the best choice when op is equality (=). It is not suitable for other operators (e.g. <, >, <>).
• We retrieve the index page that contains the rids (record identifiers) of the qualifying tuples.
• Then the pages that contain these tuples are scanned.
• The cost of this approach is equal to the sum of the following:
  – The cost to retrieve the index page = 1 I/O
  – The cost of retrieving tuples = T I/Os, where T is the number of pages scanned to retrieve the qualifying tuples.
  – Total cost = 1 + T I/Os
• For the example query, the cost is computed as follows:
  – Since 100 tuples qualify and a page holds 100 Reserves tuples, 1 page will hold these tuples, and scanning that page will cost 1 I/O.
  – So the total cost is 1 + 1 = 2 I/Os.

Implementation of Selection (summary)
Assuming σ_{R.attr op value}(R):
• No index is available on attr and R is not sorted on attr
  – Cost = M I/Os, where M is the number of pages in R.
• No index is available on attr and R is sorted on attr
  – Cost = log₂ M + T I/Os, where T is the number of pages read to retrieve the qualifying tuples.
• A B+ tree index (clustered) is available on attr
  – Cost = B + T I/Os, where B is the height of the index (assumed to be 2).
• A hash index (clustered) is available on attr
  – If attr is not a primary key:
    Cost = H + T I/Os, where H (assumed to be 1) is the I/O required to obtain the rids of the qualifying tuples.
  – If attr is a primary key:
    Cost = (H + 1) * TP I/Os, where TP is the number of qualifying tuples. In this case, if the query is searching for only one record, the hash index performs better than the B+ tree.
    If the query is a range query then B+ tree is a better alternative.

Summary of Lecture 7
• Query Optimization
  – What and why
• Query Processing
  – The various stages through which a query goes
• Translation of SQL into Relational Algebra
  – Internal representation of the query
• Access Paths
  – Different paths and ways to get the same data
• Implementation of the Selection Operator
  – Different ways of evaluating selection using different access paths
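
The selection cost formulas from this lecture can be collected into a small calculator. The Python sketch below is illustrative only: the function names (`pages_needed`, `selection_costs`, `hash_primary_key_cost`) and the dictionary labels are assumptions, the B+ tree height of 2 and the hash lookup cost of 1 are the values assumed in the lecture, and the binary search cost is rounded up to whole I/Os. Offering the hash index only for equality selections is a simplification of the statement that hash indexes are not suitable for other operators.

```python
import math


def pages_needed(n_tuples, tuples_per_page):
    """Number of pages occupied by n_tuples stored contiguously."""
    return math.ceil(n_tuples / tuples_per_page)


def selection_costs(m_pages, t_pages, op="=", btree_height=2, hash_lookup=1):
    """I/O cost of each access path for a selection 'attr op value'.

    m_pages: pages in the relation (M); t_pages: pages holding the
    qualifying tuples (T). Indexes are assumed to be clustered.
    """
    costs = {
        "file scan (unsorted, no index)": m_pages,
        "binary search on sorted file": math.ceil(math.log2(m_pages)) + t_pages,
        "clustered B+ tree index": btree_height + t_pages,
    }
    if op == "=":  # assumption: hash index is only considered for equality
        costs["clustered hash index"] = hash_lookup + t_pages
    return costs


def hash_primary_key_cost(qualifying_tuples, hash_lookup=1):
    """(H + 1) * TP, the lecture's formula when attr is a primary key."""
    return (hash_lookup + 1) * qualifying_tuples


# Example query: SELECT * FROM Reserves WHERE rname = 'Joe'
# Reserves has 1000 pages, 100 tuples per page, and 100 qualifying tuples.
t = pages_needed(100, 100)
costs = selection_costs(1000, t)
best = min(costs, key=costs.get)

for path, cost in costs.items():
    print(f"{path}: {cost} I/Os")
print("cheapest access path:", best)
```

For the example query this reproduces the figures derived in the lecture: 1000, 11, 3 and 2 I/Os, with the hash index cheapest for the equality selection. Calling `selection_costs(1000, t, op=">")` leaves the hash index out, so the B+ tree becomes the cheapest path for a range selection.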