Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Chapter 8 Physical Database Design Outline Overview of Physical Database Design Inputs of Physical Database Design File Structures Query Optimization Index Selection Additional Choices in Physical Database Design McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Overview of Physical Database Design Importance of the process and environment of physical database design – Process: inputs, outputs, objectives – Environment: file structures and query optimization Physical Database Design is characterized as a series of decision-making processes. Decisions involve the storage level of a database: file structure and optimization choices. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Storage Level of Databases The storage level is closest to the hardware and operating system. At the storage level, a database consists of physical records organized into files. A file is a collection of physical records organized for efficient access. The number of physical record accesses is an important measure of database performance. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Relationships between Logical Records (LR) and Physical Records (PR) (a) Multiple LRs per PR PR LR (b) LR split across PRs LR PR LR PR LRT1 LRT2 LR PR McGraw-Hill/Irwin (c) PR containing LRs from different tables LRT2 © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Transferring Physical Records Application buffers: Logical records (LRs) LR 1 DBMS Buffers: Logical records (LRs) inside of physical records (PRs) read PR 1 LR 2 LR 3 LR 4 Operating system: Physical records (PRs) on disk read LR 1 LR 2 write PR1 write PR2 LR 3 PR2 LR 4 McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Objectives Minimize response time to access and change a database. Minimizing computing resources is a substitute measure for response time. Database resources – Physical record transfers – CPU operations – Communication network usage (distributed processing) McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Constraints Main memory and disk space are considered as constraints rather than resources to minimize. Minimizing main memory and disk space can lead to high response times. Thus, reducing the number of physical record accesses can improve response time. CPU usage also can be a factor in some database applications. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Combined Measure of Database Performance To accommodate both physical record accesses and CPU usage, a weight can be used to combine them into one measure. The weight is usually close to 0 because many CPU operations can be performed in the time to perform one physical record transfer. . Combined Measure of Database Performance : PRA + W * CPU-OP McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Inputs, Outputs, and Environment Table design (from logical database design) File structures Table profiles Application profiles Physical database design Data placement Data formatting Denormalization Knowledge about file structures and query optimization McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Difficulty of physical database design Number of decisions Relationship among decisions Detailed inputs Complex environment Uncertainty in predicting physical record accesses McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Inputs of Physical Database Design Physical database design requires inputs specified in sufficient detail. Table profiles and application profiles are important and sometimes difficult-todefine inputs. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Table Profile A table profile summarizes a table as a whole, the columns within a table, and the relationships between tables. Typical Components of a Table Profile Component Table Column Relationship McGraw-Hill/Irwin Statistics Number of rows and physical records Number of unique values, distribution of values Distribution of the number of related rows © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Application profiles Application profiles summarize the queries, forms, and reports that access a database. Typical Components of an Application Profile Application Type Query Form Report McGraw-Hill/Irwin Statistics Frequency; distribution of parameter values Frequency of insert, update, delete, and retrieval operations to the main form and the subform Frequency; distribution of parameter values © 2004 The McGraw-Hill Companies, Inc. All rights reserved. File structures Selecting among alternative file structures is one of the most important choices in physical database design. In order to choose intelligently, you must understand characteristics of available file structures. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Sequential Files Simplest kind of file structure Unordered: insertion order Ordered: key order Simple to maintain Provide good performance for processing large numbers of records McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Unordered Sequential File StdSSN PR1 Name ... 123-45-6789 Joe Abbot ... 788-45-1235 Sue Peters ... 122-44-8655 Pat Heldon ... ... Insert a new logical record in the last physical record . PRn 466-55-3299 Bill Harper ... 323-97-3787 Mary Grant ... 543-01-9593 Tom Adtkins McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Ordered Sequential File StdSSN PR1 Name ... 122-44-8655 Pat Heldon ... 123-45-6789 Joe Abbot ... 323-97-3787 Mary Grant ... ... Rearrange physical record to insert new logical record. PRn 466-55-3299 Bill Harper ... 788-45-1235 Sue Peters ... 543-01-9593 Tom Adtkins McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Hash Files Support fast access unique key value Converts a key value into a physical record address Mod function: typical hash function – Divisor: large prime number close to the file capacity – Physical record number: hash function plus the starting physical record number McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Example: Hash Function Calculations for StdSSN Key . StdSSN 122448655 123456789 323973787 466553299 788451235 543019593 McGraw-Hill/Irwin StdSSN Mod 97 26 39 92 80 24 13 PR Number 176 189 242 230 174 163 © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Hash File after Insertions PR163 543-01-9593 Tom Adtkins 123-45-6789 Joe Abbot PR230 466-55-3299 Bill Harper ... ... PR189 PR174 788-45-1235 Sue Peters 122-44-8655 Pat Heldon McGraw-Hill/Irwin ... ... PR176 PR242 323-97-3787 Mary Grant © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Linear Probe Collision Handling During an Insert Operation Home address = Hash function value + Base address (122448946 mod 97 = 26) + 150 PR176 ... 122-44-8655 Pat Heldon ... 122-44-8752 Joe Bishop ... Home address (176) is full. 122-44-8849 Mary Wyatt ... 122-44-8946 Tom Atkins adtkins PR177 Linear probe to find a physical record with space 122-44-8753 Bill Hayes ... ... McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Multi-Way Tree (Btrees) Files A popular file structure supported by most DBMSs. Btree provides good performance on both sequential search and key search. Btree characteristics: – – – – McGraw-Hill/Irwin Balanced Bushy: multi-way tree Block-oriented Dynamic © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Structure of a Btree of Height 3 Root node Level 0 Level 1 Level 2 ... ... ... Leaf nodes McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Btree Node Containing Keys and Pointers Key1 Key 2 ... Key d ... Key 2d Pointer 1 Pointer 2 Pointer 3 Pointer d+1 ... Pointer 2d+1 Each non root node contains at least half capacity (d keys and d+1 pointers). Each non root node contains at most full capacity (2d keys and 2 d+1 pointers). McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Btree Insertion Examples (a) Initial Btree 20 22 28 35 45 70 40 50 60 65 (b) After inserting 55 20 22 28 35 45 70 40 50 55 (c) After inserting 58 20 22 28 35 40 50 45 58 60 65 Middle key value (58) moved up 70 55 60 65 Node split McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Btree Deletion Examples (a) Initial Btree 20 22 28 45 70 35 50 60 50 65 65 (b) After deleting 60 20 22 28 45 70 35 (c) After deleting 65 20 22 McGraw-Hill/Irwin 28 35 70 45 50 Borrowing a key © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Cost of Operations The height of Btree dominates the number of physical record accesses operation. Logarithmic search cost – Upper bound of height: log function’ – Log base: minimum number of keys in a node The cost to insert a key = [the cost to locate the nearest key] + [the cost to change nodes]. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. B+Tree Provides improved performance on sequential and range searches. In a B+tree, all keys are redundantly stored in the leaf nodes. To ensure that physical records are not replaced, the B+tree variation is usually implemented. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Index Matching Determining usage of an index for a query Complexity of condition determines match. Single column indexes: =, <, >, <=, >=, IN <list of values>, BETWEEN, IS NULL, LIKE ‘Pattern’ (meta character not the first symbol) Composite indexes: more complex and restrictive rules McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Bitmap Index Can be useful for stable columns with few values Bitmap: – String of bits: 0 (no match) or 1 (match) – One bit for each row Bitmap index record – Column value – Bitmap – DBMS converts bit position into row identifier. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Bitmap Index Example Faculty Table RowId 1 2 3 4 5 6 7 8 9 10 11 12 FacSSN 098-55-1234 123-45-6789 456-89-1243 111-09-0245 931-99-2034 998-00-1245 287-44-3341 230-21-9432 321-44-5588 443-22-3356 559-87-3211 220-44-5688 … FacRank Asst Asst Assc Prof Asst Prof Assc Asst Prof Assc Prof Asst Bitmap Index on FacRank FacRank Asst Assc Prof McGraw-Hill/Irwin Bitmap 110010010001 001000100100 000101001010 © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Bitmap Join Index Bitmap identifies rows of a related table. Represents a precomputed join Can define for a join column or a non-join column Typically used in query dominated environments such as data warehouses (Chapter 16) McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Summary of File Structures B+tree Unordered Ordered Hash Y Extra PRs Y Linear Linear N Y Constant time N Primary only Primary only Sequential Y search Key search Range search Usage McGraw-Hill/Irwin Bitmap N Logarithmic Y Y Primary or Primary secondary or secondary Y Secondary only © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Query Optimization Query optimizer determines implementation of queries. Major improvement in software productivity You can sometimes improve the optimization result through knowledge of the optimization process. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Translation Tasks Query Syntax and semantic analysis Parsed query Query transformation Relational algebra query Access plan evaluation Access Plan McGraw-Hill/Irwin Access Plan Access plan interpretation Code generation Query results Machine code © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Access Plans Sort Merge Join McGraw-Hill/Irwin Sort(FacSSN) BTree(FacSSN) Sort Merge Join Faculty Btree(OfferNo) BTree(OfferNo) Enrollment Offering © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Access Plan Evaluation Optimizer evaluates thousands of access plans Access plans vary by join order, file structures, and join algorithm. Some optimizers can use multiple indexes on the same table. Access plan evaluation can consume significant resources McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Join Algorithms Nested loops Sort merge Hybrid join Hash join Star join McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Optimization Tips I Detailed and current statistics needed Save access plans for repetitive queries Review access plans to determine problems Use hints carefully to improve results McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Optimization Tips II Replace Type II nested queries with separate queries. For conditions on join columns, test the condition on the parent table. Do not use the HAVING clause for row conditions. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Index Selection Most important decision Difficult decision Choice of clustered and nonclustered indexes McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Clustering Index Example Index set Sequence set <Abe, 1> <Adam, 2> 1. Abe, Denver, ... 2. Adam, Boulder, ... McGraw-Hill/Irwin <Bill, 4> <Bob, 3> 3. Bob, Denver, ... 4. Bill, Aspen, ... <Carl, 5> <Carol, 6> 5. Carl, Denver, ... 6. Carol, Golden, ... ... Physical records containing rows © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Nonclustering Index Example Index set Sequence set <Abe, 6> <Adam, 2> 1. Carl,Denver, ... 2. Adam, Boulder, ... McGraw-Hill/Irwin <Bill, 4> <Bob, 5> 3. Carol, Golden, ... 4. Bill, Aspen, ... <Carl, 1> <Carol, 3> 5. Bob, Denver, ... 6. Abe, Denver, ... ... Physical records containing rows © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Inputs and Outputs of Index Selection SQL statements and weights Clustered index choices Index Selection Table profiles McGraw-Hill/Irwin Nonclustered index choices © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Trade-offs in Index Selection Balance retrieval against update performance Nonclustering index usage: – Few rows satisfy the condition in the query – Join column usage if a small number of rows result in child table Clustering index usage: – Larger number of rows satisfy a condition than for nonclustering index – Use in sort merge join algorithm to avoid sorting – More expensive to maintain McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Difficulties of Index Selection Application weights are difficult to specify. Distribution of parameter values needed Behavior of the query optimization component must be known. The number of choices is large. Index choices can be interrelated. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Selection Rules Rule 1: A primary key is a good candidate for a clustering index. Rule 2: To support joins, consider indexes on foreign keys. Rule 3: A column with many values may be a good choice for a non-clustering index if it is used in equality conditions. Rule 4: A column used in highly selective range conditions is a good candidate for a nonclustering index. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Selection Rules (Cont.) Rule 5: A frequently updated column is not a good index candidate. Rule 6: Volatile tables (lots of insertions and deletions) should not have many indexes. Rule 7: Stable columns with few values are good candidates for bitmap indexes if the columns appear in WHERE conditions. Rule 8: Avoid indexes on combinations of columns. Most optimization components can use multiple indexes on the same table. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Index Creation To create the indexes, the CREATE INDEX statement can be used. The word following the INDEX keyword is the name of the index. CREATE INDEX is not part of SQL:1999. CREATE INDEX StdGPAIndex ON Student (StdGPA) Example: CREATE UNIQUE INDEX OfferNoIndex ON Offering (OfferNo) CREATE BITMAP INDEX OffYearIndex ON Offering (OffYear) McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Denormalization Additional choice in physical database design Denormalization combines tables so that they are easier to query. Use carefully because normalized designs have important advantages. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Normalized designs Better update performance Require less coding to enforce integrity constraints Support more indexes to improve query performance McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Repeating Groups A repeating group is a collection of associated values. The rules of normalization force repeating groups to be stored in an M table separate from an associated one table. If a repeating group is always accessed with its associated one table, denormalization may be a reasonable alternative. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Denormalizing a Repeating Group Normalized Denormalized Te rritory TerrNo TerrName TerrLoc 1 M Te rritorySale s Te rritory TerrNo TerrName TerrLoc Qtr1Sales Qtr2Sales Qtr3Sales Qtr4Sales TerrNo TerrQtr TerrSales McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Denormalizing a Generalization Hierarchy Normalized Denormalized Em p EmpNo EmpName EmpHireDate Em p 1 1 1 SalaryEm p HourlyEm p EmpNo EmpSalary EmpNo EmpRate McGraw-Hill/Irwin EmpNo EmpName EmpHireDate EmpSalary EmpRate © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Codes and Meanings Normalized Denormalized De pt De pt DeptNo DeptName DeptLoc DeptNo DeptName DeptLoc 1 1 M M Em p Em p EmpNo EmpName DeptNo EmpNo EmpName DeptNo DeptName McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Record Formatting Record formatting decisions involve compression and derived data. Compression is a trade-off between inputoutput and processing effort. Derived data is a trade-offs between query and update operations. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Storing Derived Data to Improve Query Performance Order Product OrdNo OrdDate OrdAmt ProdNo ProdName ProdPrice 1 derived data 1 M OrdLine M OrdNo ProdNo Qty McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Parallel Processing Parallel processing can improve retrieval and modification performance. Retrieving many records can be improved by reading physical records in parallel. Many DBMSs provide parallel processing capabilities with RAID systems. RAID is a collection of disks (a disk array) that operates as a single disk. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Striping in RAID Storage Systems Each stripe consists of four adjacent physical records. Three stripes are shown separated by dotted lines. PR1 PR2 PR3 PR4 PR5 PR6 PR7 PR8 PR9 PR10 PR11 PR12 McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Other Ways to Improve Performance Transaction processing: add computing capacity and improve transaction design. Data warehouses: add computing capacity and store derived data. Distributed databases: allocate processing and data to various computing locations. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Summary Goal: minimize computing resources Table profiles and application profiles must be specified in sufficient detail. Environment: file structures and query optimization Monitor and possibly improve query optimization results Index selection: most important decision Other techniques: denormalization, record formatting, and parallel processing McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.