Download Chapter 10 of Database Application Development and Design

Document related concepts

Relational model wikipedia , lookup

Clusterpoint wikipedia , lookup

Database model wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Transcript
Chapter 8
Physical Database Design
Outline
Overview of Physical Database Design
 Inputs of Physical Database Design
 File Structures
 Query Optimization
 Index Selection
 Additional Choices in Physical Database
Design

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Overview of Physical Database
Design

Importance of the process and environment of
physical database design
– Process: inputs, outputs, objectives
– Environment: file structures and query optimization


Physical Database Design is characterized as a
series of decision-making processes.
Decisions involve the storage level of a database:
file structure and optimization choices.
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Storage Level of Databases
The storage level is closest to the hardware
and operating system.
 At the storage level, a database consists of
physical records organized into files.
 A file is a collection of physical records
organized for efficient access.
 The number of physical record accesses is
an important measure of database
performance.

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Relationships between Logical
Records (LR) and Physical
Records (PR)
(a) Multiple LRs per PR
PR
LR
(b) LR split across PRs
LR
PR
LR
PR
LRT1
LRT2
LR
PR
McGraw-Hill/Irwin
(c) PR containing LRs
from different tables
LRT2
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Transferring Physical Records
Application buffers:
Logical records (LRs)
LR 1
DBMS Buffers:
Logical records (LRs) inside
of physical records (PRs)
read PR
1
LR 2
LR 3
LR 4
Operating system:
Physical records
(PRs) on disk
read
LR 1
LR 2
write
PR1
write
PR2
LR 3
PR2
LR 4
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Objectives
Minimize response time to access and
change a database.
 Minimizing computing resources is a
substitute measure for response time.
 Database resources

– Physical record transfers
– CPU operations
– Communication network usage (distributed
processing)
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Constraints




Main memory and disk space are considered
as constraints rather than resources to
minimize.
Minimizing main memory and disk space can
lead to high response times.
Thus, reducing the number of physical record
accesses can improve response time.
CPU usage also can be a factor in some
database applications.
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Combined Measure of
Database Performance
To accommodate both physical record
accesses and CPU usage, a weight can be
used to combine them into one measure.
 The weight is usually close to 0 because
many CPU operations can be performed in
the time to perform one physical record
transfer.
 . Combined Measure of Database Performance :

PRA + W * CPU-OP
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Inputs, Outputs, and
Environment
Table design
(from logical database design)
File structures
Table
profiles
Application
profiles
Physical
database
design
Data placement
Data formatting
Denormalization
Knowledge about file structures and
query optimization
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Difficulty of physical database
design
Number of decisions
 Relationship among decisions
 Detailed inputs
 Complex environment
 Uncertainty in predicting physical record
accesses

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Inputs of Physical Database
Design
Physical database design requires inputs
specified in sufficient detail.
 Table profiles and application profiles are
important and sometimes difficult-todefine inputs.

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Table Profile

A table profile summarizes a table as a
whole, the columns within a table, and the
relationships between tables.
Typical Components of a Table Profile
Component
Table
Column
Relationship
McGraw-Hill/Irwin
Statistics
Number of rows and physical records
Number of unique values, distribution of values
Distribution of the number of related rows
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Application profiles

Application profiles summarize the
queries, forms, and reports that access a
database.
Typical Components of an Application Profile
Application Type
Query
Form
Report
McGraw-Hill/Irwin
Statistics
Frequency; distribution of parameter values
Frequency of insert, update, delete, and retrieval
operations to the main form and the subform
Frequency; distribution of parameter values
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
File structures
Selecting among alternative file structures
is one of the most important choices in
physical database design.
 In order to choose intelligently, you must
understand characteristics of available file
structures.

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Sequential Files
Simplest kind of file structure
 Unordered: insertion order
 Ordered: key order
 Simple to maintain
 Provide good performance for processing
large numbers of records

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Unordered Sequential File
StdSSN
PR1
Name ...
123-45-6789 Joe Abbot ...
788-45-1235 Sue Peters ...
122-44-8655 Pat Heldon ...
...
Insert a new logical
record in the last
physical record .
PRn
466-55-3299 Bill Harper ...
323-97-3787 Mary Grant ...
543-01-9593 Tom Adtkins
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Ordered Sequential File
StdSSN
PR1
Name ...
122-44-8655 Pat Heldon ...
123-45-6789 Joe Abbot ...
323-97-3787 Mary Grant ...
...
Rearrange physical record
to insert new logical record.
PRn
466-55-3299 Bill Harper ...
788-45-1235 Sue Peters ...
543-01-9593 Tom Adtkins
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Hash Files
Support fast access unique key value
 Converts a key value into a physical record
address
 Mod function: typical hash function

– Divisor: large prime number close to the file
capacity
– Physical record number: hash function plus
the starting physical record number
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Example: Hash Function
Calculations for StdSSN Key
.
StdSSN
122448655
123456789
323973787
466553299
788451235
543019593
McGraw-Hill/Irwin
StdSSN Mod 97
26
39
92
80
24
13
PR Number
176
189
242
230
174
163
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Hash File after Insertions
PR163
543-01-9593 Tom Adtkins
123-45-6789 Joe Abbot
PR230
466-55-3299 Bill Harper
...
...
PR189
PR174
788-45-1235 Sue Peters
122-44-8655 Pat Heldon
McGraw-Hill/Irwin
...
...
PR176
PR242
323-97-3787 Mary Grant
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Linear Probe Collision Handling
During an Insert Operation
Home address = Hash function value + Base address
(122448946 mod 97 = 26) + 150
PR176
...
122-44-8655 Pat Heldon
...
122-44-8752 Joe Bishop
...
Home address (176) is full.
122-44-8849 Mary Wyatt
...
122-44-8946 Tom Atkins
adtkins
PR177
Linear probe to find
a
physical record with space
122-44-8753 Bill Hayes
...
...
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Multi-Way Tree (Btrees) Files



A popular file structure supported by most
DBMSs.
Btree provides good performance on both
sequential search and key search.
Btree characteristics:
–
–
–
–
McGraw-Hill/Irwin
Balanced
Bushy: multi-way tree
Block-oriented
Dynamic
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Structure of a Btree of Height 3
Root
node
Level
0
Level
1
Level
2
...
...
...
Leaf nodes
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Btree Node Containing Keys
and Pointers
Key1 Key 2 ... Key d
... Key 2d
Pointer 1 Pointer 2 Pointer 3 Pointer d+1 ... Pointer 2d+1
Each non root node contains at least half capacity
(d keys and d+1 pointers).
Each non root node contains at most full capacity
(2d keys and 2 d+1 pointers).
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Btree Insertion Examples
(a) Initial Btree
20
22
28
35
45
70
40
50
60
65
(b) After inserting 55
20
22
28
35
45
70
40
50
55
(c) After inserting 58
20
22
28
35
40
50
45
58
60
65
Middle key value
(58) moved up
70
55
60
65
Node split
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Btree Deletion Examples
(a) Initial Btree
20
22
28
45
70
35
50
60
50
65
65
(b) After deleting 60
20
22
28
45
70
35
(c) After deleting 65
20
22
McGraw-Hill/Irwin
28
35
70
45
50
Borrowing a key
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Cost of Operations
The height of Btree dominates the number
of physical record accesses operation.
 Logarithmic search cost

– Upper bound of height: log function’
– Log base: minimum number of keys in a node

The cost to insert a key = [the cost to
locate the nearest key] + [the cost to
change nodes].
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
B+Tree
Provides improved performance on
sequential and range searches.
 In a B+tree, all keys are redundantly stored
in the leaf nodes.
 To ensure that physical records are not
replaced, the B+tree variation is usually
implemented.

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Index Matching
Determining usage of an index for a query
 Complexity of condition determines match.
 Single column indexes: =, <, >, <=, >=, IN
<list of values>, BETWEEN, IS NULL,
LIKE ‘Pattern’ (meta character not the first
symbol)
 Composite indexes: more complex and
restrictive rules

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Bitmap Index


Can be useful for stable columns with few values
Bitmap:
– String of bits: 0 (no match) or 1 (match)
– One bit for each row

Bitmap index record
– Column value
– Bitmap
– DBMS converts bit position into row identifier.
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Bitmap Index Example
Faculty Table
RowId
1
2
3
4
5
6
7
8
9
10
11
12
FacSSN
098-55-1234
123-45-6789
456-89-1243
111-09-0245
931-99-2034
998-00-1245
287-44-3341
230-21-9432
321-44-5588
443-22-3356
559-87-3211
220-44-5688
… FacRank
Asst
Asst
Assc
Prof
Asst
Prof
Assc
Asst
Prof
Assc
Prof
Asst
Bitmap Index on FacRank
FacRank
Asst
Assc
Prof
McGraw-Hill/Irwin
Bitmap
110010010001
001000100100
000101001010
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Bitmap Join Index
Bitmap identifies rows of a related table.
 Represents a precomputed join
 Can define for a join column or a non-join
column
 Typically used in query dominated
environments such as data warehouses
(Chapter 16)

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Summary of File Structures
B+tree
Unordered Ordered Hash
Y
Extra PRs Y
Linear
Linear
N
Y
Constant
time
N
Primary
only
Primary
only
Sequential Y
search
Key
search
Range
search
Usage
McGraw-Hill/Irwin
Bitmap
N
Logarithmic Y
Y
Primary or
Primary
secondary
or
secondary
Y
Secondary
only
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Query Optimization
Query optimizer determines
implementation of queries.
 Major improvement in software
productivity
 You can sometimes improve the
optimization result through knowledge of
the optimization process.

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Translation Tasks
Query
Syntax and semantic
analysis
Parsed query
Query transformation
Relational algebra query
Access plan
evaluation
Access
Plan
McGraw-Hill/Irwin
Access
Plan
Access plan
interpretation
Code generation
Query results
Machine code
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Access Plans
Sort Merge Join
McGraw-Hill/Irwin
Sort(FacSSN)
BTree(FacSSN)
Sort Merge Join
Faculty
Btree(OfferNo)
BTree(OfferNo)
Enrollment
Offering
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Access Plan Evaluation
Optimizer evaluates thousands of access
plans
 Access plans vary by join order, file
structures, and join algorithm.
 Some optimizers can use multiple indexes
on the same table.
 Access plan evaluation can consume
significant resources

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Join Algorithms
Nested loops
 Sort merge
 Hybrid join
 Hash join
 Star join

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Optimization Tips I
Detailed and current statistics needed
 Save access plans for repetitive queries
 Review access plans to determine
problems
 Use hints carefully to improve results

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Optimization Tips II
Replace Type II nested queries with separate
queries.
 For conditions on join columns, test the
condition on the parent table.
 Do not use the HAVING clause for row
conditions.

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Index Selection
Most important decision
 Difficult decision
 Choice of clustered and nonclustered
indexes

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Clustering Index Example
Index set
Sequence set
<Abe, 1> <Adam, 2>
1. Abe, Denver, ...
2. Adam, Boulder, ...
McGraw-Hill/Irwin
<Bill, 4> <Bob, 3>
3. Bob, Denver, ...
4. Bill, Aspen, ...
<Carl, 5> <Carol, 6>
5. Carl, Denver, ...
6. Carol, Golden, ...
...
Physical records
containing rows
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Nonclustering Index Example
Index set
Sequence set
<Abe, 6> <Adam, 2>
1. Carl,Denver, ...
2. Adam, Boulder, ...
McGraw-Hill/Irwin
<Bill, 4> <Bob, 5>
3. Carol, Golden, ...
4. Bill, Aspen, ...
<Carl, 1> <Carol, 3>
5. Bob, Denver, ...
6. Abe, Denver, ...
...
Physical
records
containing rows
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Inputs and Outputs of Index
Selection
SQL statements
and weights
Clustered index
choices
Index
Selection
Table profiles
McGraw-Hill/Irwin
Nonclustered
index choices
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Trade-offs in Index Selection


Balance retrieval against update performance
Nonclustering index usage:
– Few rows satisfy the condition in the query
– Join column usage if a small number of rows result in
child table

Clustering index usage:
– Larger number of rows satisfy a condition than for
nonclustering index
– Use in sort merge join algorithm to avoid sorting
– More expensive to maintain
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Difficulties of Index Selection
Application weights are difficult to specify.
 Distribution of parameter values needed
 Behavior of the query optimization
component must be known.
 The number of choices is large.
 Index choices can be interrelated.

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Selection Rules
Rule 1: A primary key is a good candidate for a
clustering index.
Rule 2: To support joins, consider indexes on
foreign keys.
Rule 3: A column with many values may be a good
choice for a non-clustering index if it is used in
equality conditions.
Rule 4: A column used in highly selective range
conditions is a good candidate for a nonclustering index.
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Selection Rules (Cont.)
Rule 5: A frequently updated column is not a
good index candidate.
Rule 6: Volatile tables (lots of insertions and
deletions) should not have many indexes.
Rule 7: Stable columns with few values are good
candidates for bitmap indexes if the columns
appear in WHERE conditions.
Rule 8: Avoid indexes on combinations of
columns. Most optimization components can
use multiple indexes on the same table.
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Index Creation
To create the indexes, the CREATE
INDEX statement can be used.
 The word following the INDEX keyword
is the name of the index.
 CREATE INDEX is not part of SQL:1999.

CREATE INDEX StdGPAIndex ON Student (StdGPA)
Example:
CREATE UNIQUE INDEX OfferNoIndex ON Offering
(OfferNo)
CREATE BITMAP INDEX OffYearIndex ON Offering
(OffYear)
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Denormalization
Additional choice in physical database
design
 Denormalization combines tables so that
they are easier to query.
 Use carefully because normalized designs
have important advantages.

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Normalized designs
Better update performance
 Require less coding to enforce integrity
constraints
 Support more indexes to improve query
performance

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Repeating Groups
A repeating group is a collection of
associated values.
 The rules of normalization force repeating
groups to be stored in an M table separate
from an associated one table.
 If a repeating group is always accessed
with its associated one table,
denormalization may be a reasonable
alternative.

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Denormalizing a Repeating
Group
Normalized
Denormalized
Te rritory
TerrNo
TerrName
TerrLoc
1
M
Te rritorySale s
Te rritory
TerrNo
TerrName
TerrLoc
Qtr1Sales
Qtr2Sales
Qtr3Sales
Qtr4Sales
TerrNo
TerrQtr
TerrSales
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Denormalizing a Generalization
Hierarchy
Normalized
Denormalized
Em p
EmpNo
EmpName
EmpHireDate
Em p
1
1
1
SalaryEm p
HourlyEm p
EmpNo
EmpSalary
EmpNo
EmpRate
McGraw-Hill/Irwin
EmpNo
EmpName
EmpHireDate
EmpSalary
EmpRate
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Codes and Meanings
Normalized
Denormalized
De pt
De pt
DeptNo
DeptName
DeptLoc
DeptNo
DeptName
DeptLoc
1
1
M
M
Em p
Em p
EmpNo
EmpName
DeptNo
EmpNo
EmpName
DeptNo
DeptName
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Record Formatting
Record formatting decisions involve
compression and derived data.
 Compression is a trade-off between inputoutput and processing effort.
 Derived data is a trade-offs between query
and update operations.

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Storing Derived Data to
Improve Query Performance
Order
Product
OrdNo
OrdDate
OrdAmt
ProdNo
ProdName
ProdPrice
1
derived
data
1
M
OrdLine
M
OrdNo
ProdNo
Qty
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Parallel Processing
Parallel processing can improve retrieval
and modification performance.
 Retrieving many records can be improved
by reading physical records in parallel.
 Many DBMSs provide parallel processing
capabilities with RAID systems.
 RAID is a collection of disks (a disk array)
that operates as a single disk.

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Striping in RAID Storage Systems
Each stripe consists of four adjacent physical records. Three stripes are
shown separated by dotted lines.
PR1
PR2
PR3
PR4
PR5
PR6
PR7
PR8
PR9
PR10
PR11
PR12
McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Other Ways to Improve
Performance
Transaction processing: add computing
capacity and improve transaction design.
 Data warehouses: add computing capacity
and store derived data.
 Distributed databases: allocate processing
and data to various computing locations.

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Summary
Goal: minimize computing resources
 Table profiles and application profiles must be
specified in sufficient detail.
 Environment: file structures and query optimization
 Monitor and possibly improve query optimization
results
 Index selection: most important decision
 Other techniques: denormalization, record
formatting, and parallel processing

McGraw-Hill/Irwin
© 2004 The McGraw-Hill Companies, Inc. All rights reserved.