Chapter 4: Implementation of Relational Operators, Query Optimization, and Physical Database Design
IAM6133 Advanced Database Technology
1
Contents
• Relational Operators
• Query Optimization
• Physical Database Design
2
Relational Operators
(Slides 4–16)
3
Relational Query Languages
• Query languages: allow manipulation and retrieval of data from a database.
• The relational model supports simple, powerful QLs:
– Strong formal foundation based on logic.
– Allows for much optimization.
• Query languages != programming languages!
– QLs are not expected to be "Turing complete".
– QLs are not intended to be used for complex calculations.
– QLs support easy, efficient access to large data sets.
4
Formal Relational Query Languages
• Two mathematical query languages form the basis for "real" languages (e.g., SQL), and for implementation:
– Relational Algebra: more operational (procedural); very useful for representing execution plans.
– Relational Calculus: lets users describe what they want, rather than how to compute it. (Non-operational, declarative.)
5
Preliminaries
• A query is applied to relation instances, and the result of a query is also a relation instance.
– Schemas of input relations for a query are fixed (but the query will run regardless of instance!)
– The schema for the result of a given query is also fixed! It is determined by the definition of the query language constructs.
• Positional vs. named-field notation:
– Positional notation is easier for formal definitions; named-field notation is more readable.
– Both are used in SQL.
6
Example Instances
• "Sailors" and "Reserves" relations for our examples. "bid" = boats; "sid" = sailors.
• We'll use positional or named-field notation, and assume that names of fields in query results are "inherited" from the names of fields in the query input relations.

R1 (Reserves):
sid  bid  day
22   101  10/10/96
58   103  11/12/96

S1 (Sailors):
sid  sname   rating  age
22   dustin  7       45.0
31   lubber  8       55.5
58   rusty   10      35.0

S2 (Sailors):
sid  sname   rating  age
28   yuppy   9       35.0
31   lubber  8       55.5
44   guppy   5       35.0
58   rusty   10      35.0
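For readers who want to follow along, a minimal SQL sketch of these instances (names follow the slides; the column types and ISO date format are assumptions):

```sql
CREATE TABLE Sailors (
  sid    INTEGER PRIMARY KEY,
  sname  VARCHAR(20),
  rating INTEGER,
  age    REAL
);

CREATE TABLE Reserves (
  sid INTEGER REFERENCES Sailors(sid),
  bid INTEGER,
  day DATE,
  PRIMARY KEY (sid, bid, day)
);

-- Instance S1 plus the two reservations in R1
INSERT INTO Sailors VALUES (22, 'dustin', 7, 45.0);
INSERT INTO Sailors VALUES (31, 'lubber', 8, 55.5);
INSERT INTO Sailors VALUES (58, 'rusty', 10, 35.0);
INSERT INTO Reserves VALUES (22, 101, DATE '1996-10-10');
INSERT INTO Reserves VALUES (58, 103, DATE '1996-11-12');
```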
Relational Algebra
• Basic operations:
– Selection (σ): selects a subset of rows from a relation.
– Projection (π): deletes unwanted columns from a relation.
– Cross-product (×): allows us to combine two relations.
– Set-difference (−): tuples in relation 1, but not in relation 2.
– Union (∪): tuples in relation 1 and in relation 2.
• Additional operations:
– Intersection, join, division, renaming: not essential, but (very!) useful.
• Since each operation returns a relation, operations can be composed! (The algebra is "closed".)
8
Projection
• Deletes attributes that are not in the projection list.
• Schema of the result contains exactly the fields in the projection list, with the same names that they had in the (only) input relation.
• The projection operator has to eliminate duplicates! (Why? What are the consequences?)
– Note: real systems typically don't do duplicate elimination unless the user explicitly asks for it. (Why not?)

π_sname,rating(S2):
sname   rating
yuppy   9
lubber  8
guppy   5
rusty   10

π_age(S2):
age
35.0
55.5
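In SQL terms (a sketch against the Sailors table defined earlier; DISTINCT makes the duplicate elimination explicit):

```sql
-- pi_{sname,rating}(S2): duplicate elimination must be requested with
-- DISTINCT, since real systems skip it by default (it is expensive)
SELECT DISTINCT sname, rating FROM Sailors;

-- pi_{age}(S2): only two distinct ages survive
SELECT DISTINCT age FROM Sailors;
```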
Selection
• Selects rows that satisfy the selection condition.
• Schema of the result is identical to the schema of the (only) input relation.
• The result relation can be the input for another relational algebra operation! (Operator composition.)

σ_rating>8(S2):
sid  sname  rating  age
28   yuppy  9       35.0
58   rusty  10      35.0

π_sname,rating(σ_rating>8(S2)):
sname  rating
yuppy  9
rusty  10
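The composed expression maps directly to a single SQL block (a sketch, again using the Sailors table):

```sql
-- pi_{sname,rating}(sigma_{rating>8}(S2))
SELECT sname, rating
FROM Sailors
WHERE rating > 8;
```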
Union, Intersection, Set-Difference
• All of these operations take two input relations, which must be union-compatible:
– Same number of fields.
– "Corresponding" fields have the same type.
• What is the schema of the result?

S1 ∪ S2:
sid  sname   rating  age
22   dustin  7       45.0
31   lubber  8       55.5
58   rusty   10      35.0
44   guppy   5       35.0
28   yuppy   9       35.0

S1 ∩ S2:
sid  sname   rating  age
31   lubber  8       55.5
58   rusty   10      35.0

S1 − S2:
sid  sname   rating  age
22   dustin  7       45.0
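SQL exposes all three set operations directly (a sketch; S1 and S2 here stand for two union-compatible Sailors tables, and Oracle spells EXCEPT as MINUS):

```sql
SELECT * FROM S1 UNION     SELECT * FROM S2;  -- S1 ∪ S2, duplicates eliminated
SELECT * FROM S1 INTERSECT SELECT * FROM S2;  -- S1 ∩ S2
SELECT * FROM S1 EXCEPT    SELECT * FROM S2;  -- S1 − S2
```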
Cross-Product
• Each row of S1 is paired with each row of R1.
• Result schema has one field per field of S1 and R1, with field names "inherited" if possible.
– Conflict: both S1 and R1 have a field called sid.
• Renaming operator: ρ(C(1→sid1, 5→sid2), S1 × R1) renames the first and fifth fields of the result to sid1 and sid2.

S1 × R1:
(sid)  sname   rating  age   (sid)  bid  day
22     dustin  7       45.0  22     101  10/10/96
22     dustin  7       45.0  58     103  11/12/96
31     lubber  8       55.5  22     101  10/10/96
31     lubber  8       55.5  58     103  11/12/96
58     rusty   10      35.0  22     101  10/10/96
58     rusty   10      35.0  58     103  11/12/96
12
Joins
• Condition join: R ⋈_c S = σ_c(R × S)
• Result schema is the same as that of the cross-product.
• Fewer tuples than the cross-product: tuples not satisfying the join condition are filtered out.
• Sometimes called a theta-join.

S1 ⋈_{S1.sid < R1.sid} R1:
(sid)  sname   rating  age   (sid)  bid  day
22     dustin  7       45.0  58     103  11/12/96
31     lubber  8       55.5  58     103  11/12/96
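A sketch of the same condition join in SQL:

```sql
-- S1 ⋈_{S1.sid < R1.sid} R1 = sigma_{S1.sid < R1.sid}(S1 x R1)
SELECT S.sid, S.sname, S.rating, S.age, R.sid, R.bid, R.day
FROM Sailors S JOIN Reserves R ON S.sid < R.sid;
```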
13
Joins
• Equi-join: a special case of condition join where the condition c contains only equalities.
• Result schema is similar to that of the cross-product, but with only one copy of each field for which equality is specified.
• Natural join: an equi-join on all common fields.

π_{sid,…,age,bid,…}(S1 ⋈_sid R1):
sid  sname   rating  age   bid  day
22   dustin  7       45.0  101  10/10/96
58   rusty   10      35.0  103  11/12/96
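In SQL (a sketch; NATURAL JOIN keeps a single copy of the common column automatically):

```sql
-- Equi-join on sid, projecting one copy of the join column
SELECT S.sid, S.sname, S.rating, S.age, R.bid, R.day
FROM Sailors S JOIN Reserves R ON S.sid = R.sid;

-- Natural join: equi-join on ALL common fields (here, just sid)
SELECT * FROM Sailors NATURAL JOIN Reserves;
```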
14
Division
• Not supported as a primitive operator, but useful for expressing queries like: Find sailors who have reserved all boats.
• Precondition: in A/B, the attributes in B must be included in the schema of A. The result has the attributes of A that are not in B (A − B).
– SALES(supId, prodId);
– PRODUCTS(prodId);
– Relations SALES and PRODUCTS must be built using projections.
– SALES/PRODUCTS: the ids of the suppliers supplying ALL products (see the SQL sketch below).
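SQL has no division operator, so A/B is usually expressed by double negation with NOT EXISTS (a sketch using the slide's SALES and PRODUCTS relations):

```sql
-- SALES/PRODUCTS: suppliers for which no product exists that they do NOT sell
SELECT DISTINCT s.supId
FROM SALES s
WHERE NOT EXISTS (
    SELECT p.prodId
    FROM PRODUCTS p
    WHERE NOT EXISTS (
        SELECT 1
        FROM SALES s2
        WHERE s2.supId = s.supId
          AND s2.prodId = p.prodId));
```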
15
Examples of Division A/B

A:
sno  pno
s1   p1
s1   p2
s1   p3
s1   p4
s2   p1
s2   p2
s3   p2
s4   p2
s4   p4

B1:
pno
p2

B2:
pno
p2
p4

B3:
pno
p1
p2
p4

A/B1:
sno
s1
s2
s3
s4

A/B2:
sno
s1
s4

A/B3:
sno
s1
16
Query Optimization
(Slides 18–28)
17
Introduction
• Query optimization is a function of many relational database management systems. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.
• Generally, the query optimizer cannot be accessed directly by users: once queries are submitted to the database server and parsed by the parser, they are passed to the query optimizer, where optimization occurs. However, some database engines allow guiding the query optimizer with hints.
18
Introduction
• A query is a request for information from a database. It can be as simple as "find the address of the person with SS# 123-45-6789," or as complex as "find the average salary of all employed married men in California between the ages of 30 and 39 who earn less than their wives." Query results are generated by accessing relevant database data and manipulating it in a way that yields the requested information. Since database structures are complex, in most cases, and especially for non-trivial queries, the data needed for a query can be collected from a database by accessing it in different ways, through different data structures, and in different orders. Each way typically requires different processing time; processing times for the same query may vary widely, from a fraction of a second to hours, depending on the way selected. The purpose of query optimization, which is an automated process, is to find a way to process a given query in minimum time. This large possible variance in time justifies performing query optimization, though finding the exact optimal way to execute a query, among all possibilities, is typically very complex, time-consuming in itself, possibly too costly, and often practically impossible. Thus query optimization typically tries to approximate the optimum by comparing several common-sense alternatives, so as to provide, in reasonable time, a "good enough" plan that typically does not deviate much from the best possible result.
19
General considerations
• There is a trade-off between the amount of time spent figuring out the best query plan and the quality of the choice; the optimizer may not choose the best answer on its own. Different database management systems balance these two differently. Cost-based query optimizers evaluate the resource footprint of various query plans and use this as the basis for plan selection: they assign an estimated "cost" to each possible query plan and choose the plan with the smallest cost. Costs estimate the runtime cost of evaluating the query in terms of the number of I/O operations required, CPU path length, amount of disk buffer space, disk storage service time, interconnect usage between units of parallelism, and other factors determined from the data dictionary. The set of query plans examined is formed by examining the possible access paths (e.g., primary index access, secondary index access, full file scan) and various relational table join techniques (e.g., merge join, hash join, product join). The search space can become quite large depending on the complexity of the SQL query. There are two types of optimization: logical optimization, which generates a sequence of relational algebra operations to solve the query, and physical optimization, which determines the means of carrying out each operation.
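Most engines let you inspect the plan the cost-based optimizer chose. A sketch using EXPLAIN (supported, with dialect differences, by PostgreSQL, MySQL, and others), run against the Sailors/Reserves tables from earlier in the chapter:

```sql
-- Show the chosen plan: access paths, join method, and cost estimates
EXPLAIN
SELECT S.sname
FROM Sailors S JOIN Reserves R ON S.sid = R.sid
WHERE R.bid = 103;
```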
20
Implementation
• Most query optimizers represent query plans as a tree of "plan nodes". A plan node encapsulates a single operation that is required to execute the query. The nodes are arranged as a tree in which intermediate results flow from the bottom of the tree to the top. Each node has zero or more child nodes; those are nodes whose output is fed as input to the parent node. For example, a join node will have two child nodes, which represent the two join operands, whereas a sort node would have a single child node (the input to be sorted). The leaves of the tree are nodes that produce results by scanning the disk, for example by performing an index scan or a sequential scan.
21
Join ordering
• The performance of a query plan is determined largely by the order in which the tables are joined. For example, when joining three tables A, B, C of size 10 rows, 10,000 rows, and 1,000,000 rows, respectively, a query plan that joins B and C first can take several orders of magnitude more time to execute than one that joins A and C first. Most query optimizers determine join order via a dynamic programming algorithm pioneered by IBM's System R database project. This algorithm works in two stages:
• First, all ways to access each relation in the query are computed. Every relation in the query can be accessed via a sequential scan. If there is an index on a relation that can be used to answer a predicate in the query, an index scan can also be used. For each relation, the optimizer records the cheapest way to scan the relation, as well as the cheapest way to scan the relation that produces records in a particular sorted order.
• The optimizer then considers combining each pair of relations for which a join condition exists. For each pair, the optimizer will consider the available join algorithms implemented by the DBMS. It will preserve the cheapest way to join each pair of relations, in addition to the cheapest way to join each pair of relations that produces its output according to a particular sort order.
• Then all three-relation query plans are computed, by joining each two-relation plan produced by the previous phase with the remaining relations in the query.
• Sort order matters for two reasons: first, a particular sort order can avoid a redundant sort operation later in processing the query; second, a particular sort order can speed up a subsequent join because it clusters the data in a particular way.
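The practical consequence is that the join order written in SQL is only a starting point. A sketch (tables A, B, C and join columns x, y are hypothetical):

```sql
-- Logically equivalent queries: the optimizer is free to pick the cheaper
-- join order, e.g., bringing the small table A in early rather than
-- materializing B joined with C first
SELECT *
FROM A
JOIN B ON A.x = B.x
JOIN C ON B.y = C.y;

SELECT *
FROM B
JOIN C ON B.y = C.y
JOIN A ON A.x = B.x;
```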
22
Query planning for nested SQL queries
• A SQL query to a modern relational DBMS does more than just selections and joins. In particular, SQL queries often nest several layers of SPJ blocks (Select-Project-Join) by means of group by, exists, and not exists operators. In some cases such nested SQL queries can be flattened into a select-project-join query, but not always. Query plans for nested SQL queries can also be chosen using the same dynamic programming algorithm as used for join ordering, but this can lead to an enormous escalation in query optimization time. So some database management systems use an alternative rule-based approach that uses a query graph model.
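An example of the kind of flattening described above (a sketch; the DISTINCT on sid is what keeps the two forms equivalent):

```sql
-- Nested form: sailors with at least one reservation
SELECT S.sid, S.sname
FROM Sailors S
WHERE EXISTS (SELECT 1 FROM Reserves R WHERE R.sid = S.sid);

-- Flattened select-project-join form of the same query
SELECT DISTINCT S.sid, S.sname
FROM Sailors S JOIN Reserves R ON S.sid = R.sid;
```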
23
Cost estimation
• One of the hardest problems in query optimization is to accurately estimate the costs of alternative query plans. Optimizers cost query plans using a mathematical model of query execution costs that relies heavily on estimates of the cardinality, or number of tuples, flowing through each edge in a query plan. Cardinality estimation in turn depends on estimates of the selection factor of predicates in the query. Traditionally, database systems estimate selectivities through fairly detailed statistics on the distribution of values in each column, such as histograms. This technique works well for estimating the selectivities of individual predicates. However, many queries have conjunctions of predicates, such as select count(*) from R where R.make='Honda' and R.model='Accord'. Query predicates are often highly correlated (for example, model='Accord' implies make='Honda'), and it is very hard to estimate the selectivity of the conjunct in general. Poor cardinality estimates and uncaught correlation are among the main reasons why query optimizers pick poor query plans. This is one reason why a database administrator should regularly update the database statistics, especially after major data loads/unloads.
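Refreshing statistics is a one-line administrative command in most systems. A sketch (two common dialect spellings; R is the table from the example above):

```sql
-- PostgreSQL: recompute column statistics for the optimizer
ANALYZE R;

-- MySQL equivalent
ANALYZE TABLE R;
```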
24
Extensions
• Classical query optimization assumes that query plans are compared according to a single cost metric, usually execution time, and that the cost of each query plan can be calculated without uncertainty. Both assumptions are sometimes violated in practice[1], and multiple extensions of classical query optimization that overcome those limitations have been studied in the research literature. These extended problem variants differ in how they model the cost of individual query plans and in their optimization goals.
25
Parametric Query Optimization
• Classical query optimization associates each query plan with one scalar cost value. Parametric query optimization[2] assumes that query plan cost depends on parameters whose values are unknown at optimization time. Such parameters can, for instance, represent the selectivity of query predicates that are not fully specified at optimization time but will be provided at execution time. Parametric query optimization therefore associates each query plan with a cost function that maps from a multi-dimensional parameter space to a one-dimensional cost space.
• The goal of optimization is usually to generate all query plans that could be optimal for any of the possible parameter value combinations. This yields a set of relevant query plans. At run time, the best plan is selected out of that set once the true parameter values become known. The advantage of parametric query optimization is that optimization (which is in general a very expensive operation) is avoided at run time.
26
Multi-Objective Query Optimization
• There are often other cost metrics, in addition to execution time, that are relevant for comparing query plans[1]. In a cloud computing scenario, for instance, one should compare query plans not only in terms of how much time they take to execute but also in terms of how much money their execution costs. Or, in the context of approximate query optimization, it is possible to execute query plans on randomly selected samples of the input data in order to obtain approximate results with reduced execution overhead. In such cases, alternative query plans must be compared in terms of their execution time but also in terms of the precision or reliability of the data they generate.
• Multi-objective query optimization[3] models the cost of a query plan as a cost vector where each vector component represents cost according to a different cost metric. Classical query optimization can be considered a special case of multi-objective query optimization where the dimension of the cost space (i.e., the number of cost vector components) is one.
• Different cost metrics might conflict with each other (e.g., there might be one plan with minimal execution time and a different plan with minimal monetary execution fees in a cloud computing scenario). Therefore, the goal of optimization cannot be to find a query plan that minimizes all cost metrics; it must instead be to find a query plan that realizes the best compromise between different cost metrics. What the best compromise is depends on user preferences (e.g., some users might prefer a cheaper plan while others prefer a faster plan in a cloud scenario). The goal of optimization is therefore either to find the best query plan based on some specification of user preferences provided as input to the optimizer (e.g., users can define weights between different cost metrics to express relative importance, or define hard cost bounds on certain metrics), or to generate an approximation of the set of Pareto-optimal query plans (i.e., plans such that no other plan has better cost according to all metrics), so that the user can select the preferred cost tradeoff out of that plan set.
27
Multi-Objective Parametric Query Optimization
• Multi-objective parametric query optimization[1] generalizes parametric and multi-objective query optimization. Plans are compared according to multiple cost metrics, and plan costs may depend on parameters whose values are unknown at optimization time. The cost of a query plan is therefore modeled as a function from a multi-dimensional parameter space to a multi-dimensional cost space. The goal of optimization is to generate the set of query plans that can be optimal for each possible combination of parameter values and user preferences.
28
Physical Database Design
(Slides 30–56)
29
Physical Database Design
Purpose: translate the logical description of data into the technical specifications for storing and retrieving data
Goal: create a design for storing data that will provide adequate performance and ensure database integrity, security, and recoverability
30
Physical Design Process
Inputs:
• Normalized relations
• Volume estimates
• Attribute definitions
• Response time expectations
• Data security needs
• Backup/recovery needs
• Integrity expectations
• DBMS technology used
These inputs lead to key decisions:
• Attribute data types
• Physical record descriptions (doesn't always match logical design)
• File organizations
• Indexes and database architectures
• Query optimization
31
Physical Design for Regulatory Compliance
Regulations and standards that impact physical design decisions:
• Sarbanes-Oxley Act (SOX): protect investors by improving accuracy and reliability
• Committee of Sponsoring Organizations (COSO) of the Treadway Commission
• IT Infrastructure Library (ITIL)
• Control Objectives for Information and Related Technology (COBIT)
32
Designing Fields
Field: smallest unit of application data recognized by system software
Field design:
• Choosing data type
• Coding, compression, encryption
• Controlling data integrity
33
Choosing Data Types
34
Figure 5-1 Example of a code look-up table (Pine Valley Furniture Company)
A code saves space, but costs an additional lookup to obtain the actual value.
35
Field Data Integrity
 Default value: assumed value if no explicit value
 Range control: allowable value limitations (constraints or validation rules)
 Null value control: allowing or prohibiting empty fields
 Referential integrity: range control (and null value allowances) for foreign-key to primary-key match-ups
The Sarbanes-Oxley Act (SOX) legislates the importance of financial data integrity. All four controls can be declared in standard DDL, as sketched below.
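A minimal DDL sketch (table and column names are hypothetical):

```sql
CREATE TABLE Product (
  prod_id     INTEGER PRIMARY KEY,
  description VARCHAR(50) NOT NULL,       -- null value control
  price       DECIMAL(8,2) DEFAULT 0.00,  -- default value
  CHECK (price BETWEEN 0 AND 10000)       -- range control
);

CREATE TABLE OrderLine (
  order_id INTEGER,
  prod_id  INTEGER REFERENCES Product(prod_id),  -- referential integrity
  PRIMARY KEY (order_id, prod_id)
);
```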
36
Handling Missing Data
• Substitute an estimate of the missing value (e.g., using a formula)
• Construct a report listing missing values
• In programs, ignore missing data unless the value is significant (sensitivity testing)
Triggers can be used to perform these operations; simpler cases can also be handled in queries, as sketched below.
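A sketch of the first two strategies against the Sailors table (using the column average as the estimate is just one possible formula):

```sql
-- Substitute an estimate (the column average) for missing ages
SELECT sid, sname,
       COALESCE(age, (SELECT AVG(age) FROM Sailors)) AS age_estimated
FROM Sailors;

-- Report listing rows with missing values
SELECT * FROM Sailors WHERE age IS NULL;
```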
37
Denormalization
 Transforming normalized relations into non-normalized physical record specifications
 Benefits:
 Can improve performance (speed) by reducing the number of table lookups (i.e., reducing the number of necessary join queries)
 Costs (due to data duplication):
 Wasted storage space
 Data integrity/consistency threats
 Common denormalization opportunities (see the SQL sketch after this list):
 One-to-one relationship (Fig. 5-2)
 Many-to-many relationship with non-key attributes (associative entity) (Fig. 5-3)
 Reference data (1:N relationship where the 1-side has data not used in any other relationship) (Fig. 5-4)
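A sketch of the Fig. 5-2 case, collapsing a one-to-one relationship into a single physical table (the STUDENT/APPLICATION entities and their columns are assumptions based on the figure captions):

```sql
-- Merge STUDENT and its optional 1:1 APPLICATION into one physical record
CREATE TABLE Student (
  student_id       INTEGER PRIMARY KEY,
  campus_address   VARCHAR(50),
  application_date DATE,          -- nullable: not every student applied
  qualifications   VARCHAR(100)   -- columns absorbed from APPLICATION
);
```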
38
Figure 5-2 A possible denormalization situation: two entities with a one-to-one relationship
39
Figure 5-3 A possible denormalization situation: a many-to-many relationship with nonkey attributes
(Extra table access required; null description possible)
40
Figure 5-4 A possible denormalization situation: reference data
(Extra table access required; data duplication)
41
Denormalize with Caution
Denormalization can:
• Increase the chance of errors and inconsistencies
• Reintroduce anomalies
• Force reprogramming when business rules change
Perhaps other methods could be used to improve the performance of joins:
• Organization of tables in the database (file organization and clustering)
• Proper query design and optimization
42
Designing Physical Database Files
Physical file: a named portion of secondary memory allocated for the purpose of storing physical records
Tablespace: a named logical storage unit in which data from multiple tables/views/objects can be stored
Tablespace components (see the Oracle sketch below):
• Segment: a table, index, or partition
• Extent: a contiguous section of disk space
• Data block: the smallest unit of storage
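A minimal Oracle-flavored sketch of these concepts (tablespace, file, and table names are hypothetical):

```sql
-- Create a tablespace backed by one data file...
CREATE TABLESPACE users_ts
  DATAFILE 'users01.dbf' SIZE 100M;

-- ...then place a table (one segment) in it
CREATE TABLE Customer (
  cust_id INTEGER PRIMARY KEY,
  name    VARCHAR2(40)
) TABLESPACE users_ts;
```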
43
Figure 5-5 DBMS terminology in an Oracle 11g environment
44
File Organizations
A technique for physically arranging the records of a file on secondary storage.
Types of file organizations:
• Sequential
• Indexed
• Hashed
45
File Organizations
Factors for selecting a file organization:
• Fast data retrieval and throughput
• Efficient storage space utilization
• Protection from failure and data loss
• Minimizing the need for reorganization
• Accommodating growth
• Security from unauthorized use
46
Figure 5-6a Sequential file organization
Records of the file are stored in sequence by the primary key field values.
• If sorted: every insert or delete requires a re-sort
• If not sorted: average time to find a desired record = n/2
47
Indexed File Organizations
Storage of records sequentially or nonsequentially, with an index that allows software to locate individual records.
Index: a table or other data structure used to determine the location of records in a file that satisfy some condition.
Primary keys are automatically indexed. Other fields or combinations of fields can also be indexed; these are called secondary keys (or nonunique keys).
48
Figure 5-6b Indexed file organization: uses a tree search
Average time to find a desired record = depth of the tree
49
Figure 5-6c Hashed file organization
The hash algorithm usually uses division-remainder to determine record position. Records with the same position are grouped in lists.
50
Figure 5-7 Join indexes: speed up join operations
a) Join index for common non-key columns
b) Join index for matching foreign key (FK) and primary key (PK)
51
52
Using and Selecting Keys
• Creating a unique key index
– Example: CustomerID (primary key) of Customer
– Example: Composite primary key for OrderLine
• Creating a secondary key index
– Example: Description field for Product (not unique)
Each kind of index is created with a single DDL statement, as sketched below.
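A sketch of the three examples above (index names and the OrderLine key columns are hypothetical):

```sql
-- Unique key index on a primary key (often created automatically)
CREATE UNIQUE INDEX cust_pk_idx ON Customer (CustomerID);

-- Unique index on a composite primary key
CREATE UNIQUE INDEX ordline_pk_idx ON OrderLine (OrderID, ProductID);

-- Secondary key index on a nonunique search field
CREATE INDEX prod_desc_idx ON Product (Description);
```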
53
Rules for Using Indexes
1. Use on larger tables
2. Index the primary key of each table
3. Index search fields (fields frequently in the WHERE clause)
4. Index fields in SQL ORDER BY and GROUP BY commands
5. Index when there are >100 values, but not when there are <30 values
54
Rules for Using Indexes (cont.)
6. Avoid use of indexes for fields with long values; perhaps compress values first
7. If the key to an index is used to determine the location of a record, use a surrogate (like a sequence number) to allow an even spread in the storage area
8. The DBMS may have a limit on the number of indexes per table and the number of bytes per indexed field(s)
9. Be careful of indexing attributes with null values; many DBMSs will not recognize null values in an index search
55
Query Optimization
• Parallel query processing: possible when working in multiprocessor systems
• Overriding automatic query optimization: allows query writers to preempt the automated optimization
• Data warehouses are already configured for optimized query performance
56