19. Implementation - University of St. Thomas

Part 19
Implementation
Performance
Efficiency depends on:
  - Physical data storage
  - Use of indices
  - Query optimization
  - Compiled vs. interpreted execution
  - Ability to predict database usage, communicate that prediction to the DBMS, and make use of that information
Copyright © 1971-2002 Thomas P. Sturm
Physical Data Storage
Flat files/tables can be inefficient
PROBLEM: Access to an employee and all assigned tasks may require one physical disk access per task
Possible Solutions:
  - use a hybrid DBMS (hierarchical, network) with different physical and logical structures
  - store first normal form relations
  - store in master/detail (JoinDef) form ("array" within relation)
One relation per entity can be inefficient
PROBLEM: The 80-20 rule states that 80% of all retrievals will occur against 20% of the attributes (e.g. emergency contact is rarely retrieved)
Possible Solutions:
  - multiple relations per entity, cluster attributes
  - separate access mechanism for each attribute
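
The "multiple relations per entity" idea can be sketched as follows. This is only an illustration using Python's sqlite3 module; the table names emp_core and emp_extra, the column names, and the sample row are all assumptions made up for the example, not part of the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Frequently retrieved attributes (hypothetical column names)
    CREATE TABLE emp_core (
        ename  TEXT PRIMARY KEY,
        deptno INTEGER,
        rate   REAL
    );
    -- Rarely retrieved attributes, kept in a separate relation
    CREATE TABLE emp_extra (
        ename             TEXT PRIMARY KEY REFERENCES emp_core(ename),
        emergency_contact TEXT
    );
    INSERT INTO emp_core  VALUES ('Jones', 10, 25.0);
    INSERT INTO emp_extra VALUES ('Jones', 'Pat Jones, 555-0100');
""")

# Common retrieval touches only the narrow, frequently used relation.
common = conn.execute(
    "SELECT ename, rate FROM emp_core WHERE deptno = 10"
).fetchall()

# Rare retrieval joins the two relations back together on the shared key.
rare = conn.execute("""
    SELECT c.ename, x.emergency_contact
    FROM emp_core c JOIN emp_extra x ON x.ename = c.ename
    WHERE c.ename = 'Jones'
""").fetchall()

print(common)
print(rare)
```

The trade-off is that the rare query now pays for a join, which is acceptable only if it really is rare.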
Table Organization
Default organization in most systems is a "heap"
  - non-sequential file, usually in order of input
Organizations commonly available:
  heap       non-sequential, duplicates, new records at end
  cheap      compressed heap
  heapsort   sorted at modify, maintained as heap
  cheapsort  compressed heapsort
  hash       random hash table, no duplicates
  chash      compressed hash
  btree      dynamic B-tree, no duplicates
  cbtree     compressed btree
  isam       static indexed-sequential, no duplicates
  cisam      compressed isam
  sorted     maintained as sorted sequential
Ingres allows table organization to be established via:
  MODIFY emp TO chash UNIQUE ON name, WHERE FILLFACTOR = 50;
Use of Indices
An index is used to speed up retrieval
  - aid "associative" retrieval by rapidly mapping values to locations
  - can answer existence questions without data access
  - “no free lunch” principle - slows updates
General index types
  - Sequential - useful for range queries
  - Direct - useful for list queries
Specific index types
  - sorted sequential
  - direct relation
  - B-tree
  - hashing
  - pointer chains
  - bit maps
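
The difference between the two general index types can be sketched in a few lines of Python. This is a toy in-memory model, not how any particular DBMS stores its indices; the rows, the rate values, and the function names range_query and list_query are assumptions made up for the illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows: (row_id, rate)
rows = [(0, 12.5), (1, 30.0), (2, 18.0), (3, 12.5), (4, 45.0)]

# Sequential index: (value, row_id) pairs kept in sorted order.
sorted_index = sorted((rate, rid) for rid, rate in rows)
sorted_keys = [rate for rate, _ in sorted_index]

def range_query(low, high):
    """Row ids with low <= rate <= high, located by two binary searches."""
    start = bisect_left(sorted_keys, low)
    stop = bisect_right(sorted_keys, high)
    return [rid for _, rid in sorted_index[start:stop]]

# Direct index: value -> list of row ids (a hash table).
direct_index = {}
for rid, rate in rows:
    direct_index.setdefault(rate, []).append(rid)

def list_query(values):
    """Row ids whose rate equals one of the listed values."""
    return [rid for v in values for rid in direct_index.get(v, [])]

print(range_query(12.5, 18.0))
print(list_query([45.0, 12.5]))
print(30.0 in direct_index)  # existence question answered from the index alone
```

The sorted structure finds a contiguous run of entries for a range, while the hash structure jumps straight to each listed value but has no useful ordering for ranges.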
Index Creation
1. Create an index
     CREATE INDEX rateindex ON emp(rate)
   generates a table of the form (rate, pointer).
   In Oracle, this new index table is sorted and can be used immediately. In Ingres, this index table is a “cheap” (compressed heap) and is useless until modified.
2. In Ingres, organize the index
     MODIFY rateindex TO Btree on rate;
3. In Oracle, you automatically get an index when you specify that a field is a PRIMARY KEY or UNIQUE
4. Use of the index is selected by the query optimizer
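
Point 4 can be observed directly in a system that exposes its query plans. The sketch below uses SQLite through Python's sqlite3 module as a stand-in, since it is easy to run; SQLite is neither Oracle nor Ingres, and its indices are usable as soon as they are created. The emp rows are generated sample data and the helper name plan is an assumption for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT, rate REAL)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [(f"emp{i}", 10.0 + i) for i in range(1000)],
)
conn.execute("CREATE INDEX rateindex ON emp(rate)")

def plan(sql):
    """Return the optimizer's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan text

# Equality test on the indexed column: the optimizer can pick rateindex.
uses_index = plan("SELECT ename FROM emp WHERE rate = 25.0")

# Pattern match on an unindexed column: rateindex is of no help here.
full_scan = plan("SELECT ename FROM emp WHERE ename LIKE '%7'")

print(uses_index)
print(full_scan)
```

The exact wording of the plan text varies between SQLite versions, so it is safer to look for the index name in the output than to compare whole strings.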
Query Optimization
REQUIRED in any mainframe / mini environment to get acceptable performance
OPPORTUNITY to capitalize on the strengths of the relational model
System free to
  - decide which record-level operations are needed
  - use information available to it
      - data values
      - history of access
  - select from a wide variety of alternatives
EXAMPLE:
  SELECT DISTINCT s.sname FROM s, sp
  WHERE s.s# = sp.s# AND sp.p# = 'p2';
(100 suppliers - 10,000 shipments - 50 shipments of p2)
Method 1: Generate Cartesian product, then restrict (100 x 10,000 = 1,000,000 tuples)
Method 2: Restrict sp before join (50 x 100 = 5,000 tuples)
Method 2 is 1/2 of 1% of the work of Method 1!
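
The "1/2 of 1%" figure can be checked with a short calculation. The sketch below assumes that "work" is measured by the number of (supplier, shipment) pairs each method has to form, which is a simplification that reproduces the slide's figure; the constant names are made up for the example.

```python
SUPPLIERS = 100        # rows in s
SHIPMENTS = 10_000     # rows in sp
P2_SHIPMENTS = 50      # rows in sp with p# = 'p2'

# Method 1: form every (s, sp) pair, then restrict to p2.
method1_tuples = SUPPLIERS * SHIPMENTS

# Method 2: restrict sp to the p2 rows first, then pair each one with s.
method2_tuples = P2_SHIPMENTS * SUPPLIERS

ratio = method2_tuples / method1_tuples
print(method1_tuples, method2_tuples, f"{ratio:.1%}")
```

A real optimizer would estimate disk accesses rather than tuple pairs, and an index on s.s# would reduce Method 2 further, but the ordering of the two methods does not change.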
Query Optimization Process
1. Cast query into an internal representation
     - SQL has many ways to state the same query
     - usually cast as an abstract syntax tree
2. Convert query into a canonical form
     - apply rules for restating query
     - usually converted to conjunctive normal form
         p or (q and r) --> (p or q) and (p or r)
3. Choose candidate low-level procedures
     - consider availability of indices, physical locations of data, and size of relations
4. Generate query plans and choose the cheapest
     - can be done at compile time or run time
     - estimate the number of disk accesses or use a rule-based system
     - don't generate/consider all combinations
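
The rewrite rule quoted in step 2 is safe only because both forms always give the same truth value. A brute-force check over all eight combinations of p, q, and r confirms this; the function names original and cnf are just labels for the two sides of the rule.

```python
from itertools import product

def original(p, q, r):
    """Left-hand side of the rewrite rule."""
    return p or (q and r)

def cnf(p, q, r):
    """Right-hand side: a conjunction of disjunctions."""
    return (p or q) and (p or r)

# Compare the two forms on every assignment of truth values.
equivalent = all(
    original(p, q, r) == cnf(p, q, r)
    for p, q, r in product([False, True], repeat=3)
)
print(equivalent)
```

In conjunctive normal form each conjunct must hold for a row to qualify, so the optimizer is free to evaluate the conjuncts in whatever order is cheapest.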
Index Selection
Query optimizers generally try to use indices when
available to speed retrieval.
However, indices slow update speed and take disk space.
This is further complicated by the fact that you can index
combinations of columns in addition to single
columns. So if you have a table with 12 columns,
there are over 1.3 billion possible column
combinations that can be indexed in this one table
alone. In addition, each combination of columns
can be indexed multiple times, using various index
organizations.
Below is a set of “rules of thumb” to use when setting up
indices:
1. If you are doing almost exclusively data entry, forget
about indices until you start doing queries.
2. Index your primary keys. (For example deptno in
table dept, and the combination of ename,
project_id, and tname in table task.)
3. Index any field(s) referenced as foreign keys. (For
example mgr and deptno in table emp)
4. Index any fields used in a significant number of
WHERE clauses, provided that no one value occurs
in more than 20% of the rows. (For example, if there
were lots of retrievals on the task table both by hours
and by tname, hours would be better than tname.)
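
Rule 4 amounts to a simple test on the distribution of values in a column. The sketch below is one way to express it; the function name worth_indexing, the max_share parameter, and the sample hours and tname values are assumptions invented for the illustration.

```python
from collections import Counter

def worth_indexing(values, max_share=0.20):
    """True if no single value accounts for more than max_share of the rows."""
    if not values:
        return False
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values) <= max_share

# Hypothetical column contents for a ten-row task table.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]      # every value is distinct
tname = ["design"] * 5 + ["test"] * 5        # each value covers half the rows

print(worth_indexing(hours))
print(worth_indexing(tname))
```

With these made-up values hours passes the test and tname fails it, because an index that returns half the table for one value saves little over scanning the table.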
Mathematical Sidelight on Number of Possible Indices
For 1 field, there is only 1 index possible.
For 2 fields, a and b, you can index a, b, ab, ba, so 4
indices are possible.
For 3 fields, a, b, and c, you can index a, b, c, ab, ba, ac,
ca, bc, cb, abc, acb, bac, bca, cab, cba, so 15 indices
are possible.
In general, you can create n + n(n-1) + n(n-1)(n-2) + …
+ n! indices on n columns.
This can be re-written as n!(1/(n-1)! + 1/(n-2)! + 1/(n-3)!
+ … + 1/(1!) + 1/(0!)), but the sum in the
parentheses approaches e = 2.718281828… as n
gets large, so the number of indices is
approximately e(n!).
The table below gives the exact values for all small
values of n.
# of columns   Number of indices possible
 1             1
 2             4
 3             15
 4             64
 5             325
 6             1956
 7             13699
 8             109600
 9             986409
10             9864100
11             108505111
12             1302061344
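
The counts can be reproduced by evaluating the sum n + n(n-1) + ... + n! term by term. The sketch below does this with exact integer arithmetic and compares the result with the e(n!) approximation; the function name index_count is an assumption for the example. Evaluated exactly, the sum gives 108505111 for 11 columns and 1302061344 for 12 columns.

```python
from math import e, factorial

def index_count(n):
    """Number of ordered, non-empty selections of distinct columns out of n."""
    # The k-th term n!/(n-k)! counts the indices that use exactly k columns.
    return sum(factorial(n) // factorial(n - k) for k in range(1, n + 1))

for n in range(1, 13):
    exact = index_count(n)
    approx = e * factorial(n)
    print(f"{n:2d}  {exact:>12d}  {approx:>16.1f}")
```

The exact count is always slightly below e(n!), since the sum in the parentheses stops after n terms and never reaches e.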
Selection of Views
Reports
  - select report items, do entry-level derivations
Frequent queries
  - include commonly retrieved combinations, predefine common joins
Logical groups of users
  - apply security to views
  - pre-select only items of interest to the group
Semantic Disintegrity
EXAMPLE (Martin):
Query: List all incidents that were reported on passenger Jones' voyage:
  passenger (passenger#, passenger_name, address, voyage#, ...)
  incident  (incident#, incident_name, details, voyage#, ...)
  join on voyage#
This join is valid because every incident on passenger Jones' voyage is needed
Query: List all projects that employee Jones works on:
  employee (employee#, employee_name, address, department#, ...)
  project  (project#, project_name, details, department#, ...)
  join on department#
This join is not valid because employee Jones does not, in general, work on every project in the department
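
The invalid join can be demonstrated on a few rows. The sketch below uses SQLite through Python's sqlite3 module; the sample data, the column names with _no in place of #, and the assignment table are all assumptions added for the illustration (the example above does not say how actual project assignments are recorded).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (employee_no INTEGER, employee_name TEXT, department_no INTEGER);
    CREATE TABLE project  (project_no INTEGER, project_name TEXT, department_no INTEGER);
    -- Hypothetical table recording who actually works on what
    CREATE TABLE assignment (employee_no INTEGER, project_no INTEGER);

    INSERT INTO employee VALUES (1, 'Jones', 10);
    INSERT INTO project  VALUES (100, 'Payroll', 10), (101, 'Billing', 10);
    INSERT INTO assignment VALUES (1, 100);
""")

# Join on department#: runs without error but answers a different question.
via_department = conn.execute("""
    SELECT p.project_name FROM employee e
    JOIN project p ON p.department_no = e.department_no
    WHERE e.employee_name = 'Jones' ORDER BY p.project_name
""").fetchall()

# Join through the relationship that records actual assignments.
via_assignment = conn.execute("""
    SELECT p.project_name FROM employee e
    JOIN assignment a ON a.employee_no = e.employee_no
    JOIN project p ON p.project_no = a.project_no
    WHERE e.employee_name = 'Jones' ORDER BY p.project_name
""").fetchall()

print(via_department)
print(via_assignment)
```

The DBMS cannot detect the problem, because both queries are legal joins on matching columns; only the meaning of the data shows that the first one lists every project in Jones' department rather than Jones' own projects.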
Read-Only Database
Database is created all at one time
  - from another file / database
Batch reports are generated
  - perhaps at database creation
On-line queries are performed
  - no updates, but maybe new tables
Database is destroyed after limited lifetime
  - from hours to a month
The Need for Multiple Databases
In general, it is not reasonable to have only one copy of the database serve the multiple purposes of:
  - On-line transaction processing and update
  - Regular reporting
  - Ad-hoc queries
  - Application development
  - Beta-testing new software
  - Stress testing applications
  - Training
  - Rapid prototyping
Therefore, it is reasonable to generate multiple databases, not all with identical content.
Suggested databases to develop are:
  - Production Copy - for on-line transactions/update
  - Read-Only Copy - for reporting and ad-hoc queries
  - Scaled-down Sample - for development and testing
  - Exceptional Cases - for stress testing
  - Tiny Sample - for training and prototyping
Applications of Read-Only Databases
  - Downloads
  - Historical data
  - Extracts
  - Freeze-Frame
  - Copy for performance reasons
  - External version of internal data
  - Enhance capabilities of non-relational database
Reasons to Identify Read-Only Situations
Security
  - Only include what needs to be read
  - No risk of altered data
  - Access not delayed by updates
Integrity
  - No risk of update creating an anomaly
Simplicity
  - Can create tables that directly relate to reports
Performance
  - Can store data in “processed” form
Altered design criteria
  - Normalization is required to eliminate update anomalies - what updates?