Chapter 4: Implementation of Relational Operators, Query Optimization, and Physical Database Design
IAM6133 Advanced Database Technology
1
Contents
• Relational Operators
• Query Optimization
• Physical Database Design
2
Relational Operators
(Slides 4–16)
3
Relational Query Languages
• Query languages: allow manipulation and retrieval of data from a database.
• The relational model supports simple, powerful QLs:
– Strong formal foundation based on logic.
– Allows for much optimization.
• Query languages != programming languages!
– QLs are not expected to be "Turing complete".
– QLs are not intended to be used for complex calculations.
– QLs support easy, efficient access to large data sets.
4
Formal Relational Query Languages
• Two mathematical query languages form the basis for "real" languages (e.g., SQL), and for implementation:
– Relational Algebra: more operational (procedural); very useful for representing execution plans.
– Relational Calculus: lets users describe what they want, rather than how to compute it. (Non-operational, declarative.)
5
Preliminaries
• A query is applied to relation instances, and the result of a query is also a relation instance.
– Schemas of input relations for a query are fixed (but the query will run regardless of instance!)
– The schema for the result of a given query is also fixed! It is determined by the definition of the query language constructs.
• Positional vs. named-field notation:
– Positional notation is easier for formal definitions; named-field notation is more readable.
– Both are used in SQL.
6
Example Instances
• "Sailors" and "Reserves" relations for our examples. "bid" = boats; "sid" = sailors.
• We'll use positional or named-field notation, and assume that names of fields in query results are "inherited" from the names of fields in the query input relations.

R1 (Reserves):
sid  bid  day
22   101  10/10/96
58   103  11/12/96

S1 (Sailors):
sid  sname   rating  age
22   dustin  7       45.0
31   lubber  8       55.5
58   rusty   10      35.0

S2 (Sailors):
sid  sname   rating  age
28   yuppy   9       35.0
31   lubber  8       55.5
44   guppy   5       35.0
58   rusty   10      35.0
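For readers who want to follow along, a minimal SQL sketch of these instances (names follow the slides; the column types and ISO date format are assumptions):

```sql
CREATE TABLE Sailors (
  sid    INTEGER PRIMARY KEY,
  sname  VARCHAR(20),
  rating INTEGER,
  age    REAL
);

CREATE TABLE Reserves (
  sid INTEGER REFERENCES Sailors(sid),
  bid INTEGER,
  day DATE,
  PRIMARY KEY (sid, bid, day)
);

-- Instance S1 plus the two reservations in R1
INSERT INTO Sailors VALUES (22, 'dustin', 7, 45.0);
INSERT INTO Sailors VALUES (31, 'lubber', 8, 55.5);
INSERT INTO Sailors VALUES (58, 'rusty', 10, 35.0);
INSERT INTO Reserves VALUES (22, 101, DATE '1996-10-10');
INSERT INTO Reserves VALUES (58, 103, DATE '1996-11-12');
```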
Relational Algebra
• Basic operations:
– Selection (σ): selects a subset of rows from a relation.
– Projection (π): deletes unwanted columns from a relation.
– Cross-product (×): allows us to combine two relations.
– Set-difference (−): tuples in relation 1, but not in relation 2.
– Union (∪): tuples in relation 1 and in relation 2.
• Additional operations:
– Intersection, join, division, renaming: not essential, but (very!) useful.
• Since each operation returns a relation, operations can be composed! (The algebra is "closed".)
8
Projection
• Deletes attributes that are not in the projection list.
• Schema of the result contains exactly the fields in the projection list, with the same names that they had in the (only) input relation.
• The projection operator has to eliminate duplicates! (Why? What are the consequences?)
– Note: real systems typically don't do duplicate elimination unless the user explicitly asks for it. (Why not?)

π_sname,rating(S2):
sname   rating
yuppy   9
lubber  8
guppy   5
rusty   10

π_age(S2):
age
35.0
55.5
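In SQL terms (a sketch against the Sailors table defined earlier; DISTINCT makes the duplicate elimination explicit):

```sql
-- pi_{sname,rating}(S2): duplicate elimination must be requested with
-- DISTINCT, since real systems skip it by default (it is expensive)
SELECT DISTINCT sname, rating FROM Sailors;

-- pi_{age}(S2): only two distinct ages survive
SELECT DISTINCT age FROM Sailors;
```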
Selection
• Selects rows that satisfy the selection condition.
• Schema of the result is identical to the schema of the (only) input relation.
• The result relation can be the input for another relational algebra operation! (Operator composition.)

σ_rating>8(S2):
sid  sname  rating  age
28   yuppy  9       35.0
58   rusty  10      35.0

π_sname,rating(σ_rating>8(S2)):
sname  rating
yuppy  9
rusty  10
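The composed expression maps directly to a single SQL block (a sketch, again using the Sailors table):

```sql
-- pi_{sname,rating}(sigma_{rating>8}(S2))
SELECT sname, rating
FROM Sailors
WHERE rating > 8;
```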
Union, Intersection, Set-Difference
• All of these operations take two input relations, which must be union-compatible:
– Same number of fields.
– "Corresponding" fields have the same type.
• What is the schema of the result?

S1 ∪ S2:
sid  sname   rating  age
22   dustin  7       45.0
31   lubber  8       55.5
58   rusty   10      35.0
44   guppy   5       35.0
28   yuppy   9       35.0

S1 ∩ S2:
sid  sname   rating  age
31   lubber  8       55.5
58   rusty   10      35.0

S1 − S2:
sid  sname   rating  age
22   dustin  7       45.0
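SQL exposes all three set operations directly (a sketch; S1 and S2 here stand for two union-compatible Sailors tables, and Oracle spells EXCEPT as MINUS):

```sql
SELECT * FROM S1 UNION     SELECT * FROM S2;  -- S1 ∪ S2, duplicates eliminated
SELECT * FROM S1 INTERSECT SELECT * FROM S2;  -- S1 ∩ S2
SELECT * FROM S1 EXCEPT    SELECT * FROM S2;  -- S1 − S2
```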
Cross-Product
• Each row of S1 is paired with each row of R1.
• Result schema has one field per field of S1 and R1, with field names "inherited" if possible.
– Conflict: both S1 and R1 have a field called sid.
• Renaming operator: ρ(C(1→sid1, 5→sid2), S1 × R1) renames the first and fifth fields of the result to sid1 and sid2.

S1 × R1:
(sid)  sname   rating  age   (sid)  bid  day
22     dustin  7       45.0  22     101  10/10/96
22     dustin  7       45.0  58     103  11/12/96
31     lubber  8       55.5  22     101  10/10/96
31     lubber  8       55.5  58     103  11/12/96
58     rusty   10      35.0  22     101  10/10/96
58     rusty   10      35.0  58     103  11/12/96
12
Joins
• Condition join: R ⋈_c S = σ_c(R × S)
• Result schema is the same as that of the cross-product.
• Fewer tuples than the cross-product: tuples not satisfying the join condition are filtered out.
• Sometimes called a theta-join.

S1 ⋈_{S1.sid < R1.sid} R1:
(sid)  sname   rating  age   (sid)  bid  day
22     dustin  7       45.0  58     103  11/12/96
31     lubber  8       55.5  58     103  11/12/96
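A sketch of the same condition join in SQL:

```sql
-- S1 ⋈_{S1.sid < R1.sid} R1 = sigma_{S1.sid < R1.sid}(S1 x R1)
SELECT S.sid, S.sname, S.rating, S.age, R.sid, R.bid, R.day
FROM Sailors S JOIN Reserves R ON S.sid < R.sid;
```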
13
Joins
• Equi-join: a special case of condition join where the condition c contains only equalities.
• Result schema is similar to that of the cross-product, but with only one copy of each field for which equality is specified.
• Natural join: an equi-join on all common fields.

π_{sid,…,age,bid,…}(S1 ⋈_sid R1):
sid  sname   rating  age   bid  day
22   dustin  7       45.0  101  10/10/96
58   rusty   10      35.0  103  11/12/96
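In SQL (a sketch; NATURAL JOIN keeps a single copy of the common column automatically):

```sql
-- Equi-join on sid, projecting one copy of the join column
SELECT S.sid, S.sname, S.rating, S.age, R.bid, R.day
FROM Sailors S JOIN Reserves R ON S.sid = R.sid;

-- Natural join: equi-join on ALL common fields (here, just sid)
SELECT * FROM Sailors NATURAL JOIN Reserves;
```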
14
Division
• Not supported as a primitive operator, but useful for expressing queries like: Find sailors who have reserved all boats.
• Precondition: in A/B, the attributes in B must be included in the schema of A. The result has the attributes of A that are not in B (A − B).
– SALES(supId, prodId);
– PRODUCTS(prodId);
– Relations SALES and PRODUCTS must be built using projections.
– SALES/PRODUCTS: the ids of the suppliers supplying ALL products (see the SQL sketch below).
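SQL has no division operator, so A/B is usually expressed by double negation with NOT EXISTS (a sketch using the slide's SALES and PRODUCTS relations):

```sql
-- SALES/PRODUCTS: suppliers for which no product exists that they do NOT sell
SELECT DISTINCT s.supId
FROM SALES s
WHERE NOT EXISTS (
    SELECT p.prodId
    FROM PRODUCTS p
    WHERE NOT EXISTS (
        SELECT 1
        FROM SALES s2
        WHERE s2.supId = s.supId
          AND s2.prodId = p.prodId));
```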
15
Examples of Division A/B

A:
sno  pno
s1   p1
s1   p2
s1   p3
s1   p4
s2   p1
s2   p2
s3   p2
s4   p2
s4   p4

B1:
pno
p2

B2:
pno
p2
p4

B3:
pno
p1
p2
p4

A/B1:
sno
s1
s2
s3
s4

A/B2:
sno
s1
s4

A/B3:
sno
s1
16
Query Optimization
(Slides 18–28)
17
Introduction
• Query optimization is a function of many relational database management systems. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.
• Generally, the query optimizer cannot be accessed directly by users: once queries are submitted to the database server and parsed by the parser, they are passed to the query optimizer, where optimization occurs. However, some database engines allow guiding the query optimizer with hints.
18
Introduction
• A query is a request for information from a database. It can be as simple as "find the address of the person with SS# 123-45-6789," or as complex as "find the average salary of all employed married men in California between the ages of 30 and 39 who earn less than their wives." Query results are generated by accessing relevant database data and manipulating it in a way that yields the requested information. Since database structures are complex, in most cases, and especially for non-trivial queries, the data needed for a query can be collected from a database by accessing it in different ways, through different data structures, and in different orders. Each way typically requires different processing time; processing times for the same query may vary widely, from a fraction of a second to hours, depending on the way selected. The purpose of query optimization, which is an automated process, is to find a way to process a given query in minimum time. This large possible variance in time justifies performing query optimization, though finding the exact optimal way to execute a query, among all possibilities, is typically very complex, time-consuming in itself, possibly too costly, and often practically impossible. Thus query optimization typically tries to approximate the optimum by comparing several common-sense alternatives, so as to provide, in reasonable time, a "good enough" plan that typically does not deviate much from the best possible result.
19
General considerations
• There is a trade-off between the amount of time spent figuring out the best query plan and the quality of the choice; the optimizer may not choose the best answer on its own. Different database management systems balance these two differently. Cost-based query optimizers evaluate the resource footprint of various query plans and use this as the basis for plan selection: they assign an estimated "cost" to each possible query plan and choose the plan with the smallest cost. Costs estimate the runtime cost of evaluating the query in terms of the number of I/O operations required, CPU path length, amount of disk buffer space, disk storage service time, interconnect usage between units of parallelism, and other factors determined from the data dictionary. The set of query plans examined is formed by examining the possible access paths (e.g., primary index access, secondary index access, full file scan) and various relational table join techniques (e.g., merge join, hash join, product join). The search space can become quite large depending on the complexity of the SQL query. There are two types of optimization: logical optimization, which generates a sequence of relational algebra operations to solve the query, and physical optimization, which determines the means of carrying out each operation.
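Most engines let you inspect the plan the cost-based optimizer chose. A sketch using EXPLAIN (supported, with dialect differences, by PostgreSQL, MySQL, and others), run against the Sailors/Reserves tables from earlier in the chapter:

```sql
-- Show the chosen plan: access paths, join method, and cost estimates
EXPLAIN
SELECT S.sname
FROM Sailors S JOIN Reserves R ON S.sid = R.sid
WHERE R.bid = 103;
```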
20
Implementation
• Most query optimizers represent query plans as a tree of "plan nodes". A plan node encapsulates a single operation that is required to execute the query. The nodes are arranged as a tree in which intermediate results flow from the bottom of the tree to the top. Each node has zero or more child nodes; those are nodes whose output is fed as input to the parent node. For example, a join node will have two child nodes, which represent the two join operands, whereas a sort node would have a single child node (the input to be sorted). The leaves of the tree are nodes that produce results by scanning the disk, for example by performing an index scan or a sequential scan.
21
Join ordering
• The performance of a query plan is determined largely by the order in which the tables are joined. For example, when joining three tables A, B, C of size 10 rows, 10,000 rows, and 1,000,000 rows, respectively, a query plan that joins B and C first can take several orders of magnitude more time to execute than one that joins A and C first. Most query optimizers determine join order via a dynamic programming algorithm pioneered by IBM's System R database project. This algorithm works in two stages:
• First, all ways to access each relation in the query are computed. Every relation in the query can be accessed via a sequential scan. If there is an index on a relation that can be used to answer a predicate in the query, an index scan can also be used. For each relation, the optimizer records the cheapest way to scan the relation, as well as the cheapest way to scan the relation that produces records in a particular sorted order.
• The optimizer then considers combining each pair of relations for which a join condition exists. For each pair, the optimizer will consider the available join algorithms implemented by the DBMS. It will preserve the cheapest way to join each pair of relations, in addition to the cheapest way to join each pair of relations that produces its output according to a particular sort order.
• Then all three-relation query plans are computed, by joining each two-relation plan produced by the previous phase with the remaining relations in the query.
• Sort order matters for two reasons: first, a particular sort order can avoid a redundant sort operation later in processing the query; second, a particular sort order can speed up a subsequent join because it clusters the data in a particular way.
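The practical consequence is that the join order written in SQL is only a starting point. A sketch (tables A, B, C and join columns x, y are hypothetical):

```sql
-- Logically equivalent queries: the optimizer is free to pick the cheaper
-- join order, e.g., bringing the small table A in early rather than
-- materializing B joined with C first
SELECT *
FROM A
JOIN B ON A.x = B.x
JOIN C ON B.y = C.y;

SELECT *
FROM B
JOIN C ON B.y = C.y
JOIN A ON A.x = B.x;
```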
22
Query planning for nested SQL queries
• A SQL query to a modern relational DBMS does more than just selections and joins. In particular, SQL queries often nest several layers of SPJ blocks (Select-Project-Join) by means of group by, exists, and not exists operators. In some cases such nested SQL queries can be flattened into a select-project-join query, but not always. Query plans for nested SQL queries can also be chosen using the same dynamic programming algorithm as used for join ordering, but this can lead to an enormous escalation in query optimization time. So some database management systems use an alternative rule-based approach that uses a query graph model.
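An example of the kind of flattening described above (a sketch; the DISTINCT on sid is what keeps the two forms equivalent):

```sql
-- Nested form: sailors with at least one reservation
SELECT S.sid, S.sname
FROM Sailors S
WHERE EXISTS (SELECT 1 FROM Reserves R WHERE R.sid = S.sid);

-- Flattened select-project-join form of the same query
SELECT DISTINCT S.sid, S.sname
FROM Sailors S JOIN Reserves R ON S.sid = R.sid;
```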
23
Cost estimation
• One of the hardest problems in query optimization is to accurately estimate the costs of alternative query plans. Optimizers cost query plans using a mathematical model of query execution costs that relies heavily on estimates of the cardinality, or number of tuples, flowing through each edge in a query plan. Cardinality estimation in turn depends on estimates of the selection factor of predicates in the query. Traditionally, database systems estimate selectivities through fairly detailed statistics on the distribution of values in each column, such as histograms. This technique works well for estimating the selectivities of individual predicates. However, many queries have conjunctions of predicates, such as select count(*) from R where R.make='Honda' and R.model='Accord'. Query predicates are often highly correlated (for example, model='Accord' implies make='Honda'), and it is very hard to estimate the selectivity of the conjunct in general. Poor cardinality estimates and uncaught correlation are among the main reasons why query optimizers pick poor query plans. This is one reason why a database administrator should regularly update the database statistics, especially after major data loads/unloads.
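Refreshing statistics is a one-line administrative command in most systems. A sketch (two common dialect spellings; R is the table from the example above):

```sql
-- PostgreSQL: recompute column statistics for the optimizer
ANALYZE R;

-- MySQL equivalent
ANALYZE TABLE R;
```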
24
Extensions
• Classical query optimization assumes that query plans are compared according to a single cost metric, usually execution time, and that the cost of each query plan can be calculated without uncertainty. Both assumptions are sometimes violated in practice[1], and multiple extensions of classical query optimization that overcome those limitations have been studied in the research literature. These extended problem variants differ in how they model the cost of individual query plans and in their optimization goals.
25
Parametric Query Optimization
• Classical query optimization associates each query plan with one scalar cost value. Parametric query optimization[2] assumes that query plan cost depends on parameters whose values are unknown at optimization time. Such parameters can, for instance, represent the selectivity of query predicates that are not fully specified at optimization time but will be provided at execution time. Parametric query optimization therefore associates each query plan with a cost function that maps from a multi-dimensional parameter space to a one-dimensional cost space.
• The goal of optimization is usually to generate all query plans that could be optimal for any of the possible parameter value combinations. This yields a set of relevant query plans. At run time, the best plan is selected out of that set once the true parameter values become known. The advantage of parametric query optimization is that optimization (which is in general a very expensive operation) is avoided at run time.
26
Multi-Objective Query Optimization
• There are often other cost metrics, in addition to execution time, that are relevant for comparing query plans[1]. In a cloud computing scenario, for instance, one should compare query plans not only in terms of how much time they take to execute but also in terms of how much money their execution costs. Or, in the context of approximate query optimization, it is possible to execute query plans on randomly selected samples of the input data in order to obtain approximate results with reduced execution overhead. In such cases, alternative query plans must be compared in terms of their execution time but also in terms of the precision or reliability of the data they generate.
• Multi-objective query optimization[3] models the cost of a query plan as a cost vector where each vector component represents cost according to a different cost metric. Classical query optimization can be considered a special case of multi-objective query optimization where the dimension of the cost space (i.e., the number of cost vector components) is one.
• Different cost metrics might conflict with each other (e.g., there might be one plan with minimal execution time and a different plan with minimal monetary execution fees in a cloud computing scenario). Therefore, the goal of optimization cannot be to find a query plan that minimizes all cost metrics; it must instead be to find a query plan that realizes the best compromise between different cost metrics. What the best compromise is depends on user preferences (e.g., some users might prefer a cheaper plan while others prefer a faster plan in a cloud scenario). The goal of optimization is therefore either to find the best query plan based on some specification of user preferences provided as input to the optimizer (e.g., users can define weights between different cost metrics to express relative importance, or define hard cost bounds on certain metrics), or to generate an approximation of the set of Pareto-optimal query plans (i.e., plans such that no other plan has better cost according to all metrics), so that the user can select the preferred cost tradeoff out of that plan set.
27
Multi-Objective Parametric Query Optimization
• Multi-objective parametric query optimization[1] generalizes parametric and multi-objective query optimization. Plans are compared according to multiple cost metrics, and plan costs may depend on parameters whose values are unknown at optimization time. The cost of a query plan is therefore modeled as a function from a multi-dimensional parameter space to a multi-dimensional cost space. The goal of optimization is to generate the set of query plans that can be optimal for each possible combination of parameter values and user preferences.
28
Physical Database Design
(Slides 30–56)
29
Physical Database Design
Purpose: translate the logical description of data into the technical specifications for storing and retrieving data
Goal: create a design for storing data that will provide adequate performance and ensure database integrity, security, and recoverability
30
Physical Design Process
Inputs:
• Normalized relations
• Volume estimates
• Attribute definitions
• Response time expectations
• Data security needs
• Backup/recovery needs
• Integrity expectations
• DBMS technology used
These inputs lead to key decisions:
• Attribute data types
• Physical record descriptions (doesn't always match logical design)
• File organizations
• Indexes and database architectures
• Query optimization
31
Physical Design for Regulatory Compliance
Regulations and standards that impact physical design decisions:
• Sarbanes-Oxley Act (SOX): protect investors by improving accuracy and reliability
• Committee of Sponsoring Organizations (COSO) of the Treadway Commission
• IT Infrastructure Library (ITIL)
• Control Objectives for Information and Related Technology (COBIT)
32
Designing Fields
Field: smallest unit of application data recognized by system software
Field design:
• Choosing data type
• Coding, compression, encryption
• Controlling data integrity
33
Choosing Data Types
34
Figure 5-1 Example of a code look-up table (Pine Valley Furniture Company)
A code saves space, but costs an additional lookup to obtain the actual value.
35
Field Data Integrity
 Default value: assumed value if no explicit value
 Range control: allowable value limitations (constraints or validation rules)
 Null value control: allowing or prohibiting empty fields
 Referential integrity: range control (and null value allowances) for foreign-key to primary-key match-ups
The Sarbanes-Oxley Act (SOX) legislates the importance of financial data integrity. All four controls can be declared in standard DDL, as sketched below.
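A minimal DDL sketch (table and column names are hypothetical):

```sql
CREATE TABLE Product (
  prod_id     INTEGER PRIMARY KEY,
  description VARCHAR(50) NOT NULL,       -- null value control
  price       DECIMAL(8,2) DEFAULT 0.00,  -- default value
  CHECK (price BETWEEN 0 AND 10000)       -- range control
);

CREATE TABLE OrderLine (
  order_id INTEGER,
  prod_id  INTEGER REFERENCES Product(prod_id),  -- referential integrity
  PRIMARY KEY (order_id, prod_id)
);
```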
36
Handling Missing Data
• Substitute an estimate of the missing value (e.g., using a formula)
• Construct a report listing missing values
• In programs, ignore missing data unless the value is significant (sensitivity testing)
Triggers can be used to perform these operations; simpler cases can also be handled in queries, as sketched below.
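A sketch of the first two strategies against the Sailors table (using the column average as the estimate is just one possible formula):

```sql
-- Substitute an estimate (the column average) for missing ages
SELECT sid, sname,
       COALESCE(age, (SELECT AVG(age) FROM Sailors)) AS age_estimated
FROM Sailors;

-- Report listing rows with missing values
SELECT * FROM Sailors WHERE age IS NULL;
```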
37
Denormalization
 Transforming normalized relations into non-normalized physical record specifications
 Benefits:
 Can improve performance (speed) by reducing the number of table lookups (i.e., reducing the number of necessary join queries)
 Costs (due to data duplication):
 Wasted storage space
 Data integrity/consistency threats
 Common denormalization opportunities (see the SQL sketch after this list):
 One-to-one relationship (Fig. 5-2)
 Many-to-many relationship with non-key attributes (associative entity) (Fig. 5-3)
 Reference data (1:N relationship where the 1-side has data not used in any other relationship) (Fig. 5-4)
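A sketch of the Fig. 5-2 case, collapsing a one-to-one relationship into a single physical table (the STUDENT/APPLICATION entities and their columns are assumptions based on the figure captions):

```sql
-- Merge STUDENT and its optional 1:1 APPLICATION into one physical record
CREATE TABLE Student (
  student_id       INTEGER PRIMARY KEY,
  campus_address   VARCHAR(50),
  application_date DATE,          -- nullable: not every student applied
  qualifications   VARCHAR(100)   -- columns absorbed from APPLICATION
);
```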
38
Figure 5-2 A possible denormalization situation: two entities with a one-to-one relationship
39
Figure 5-3 A possible denormalization situation: a many-to-many relationship with nonkey attributes
(Extra table access required; null description possible)
40
Figure 5-4 A possible denormalization situation: reference data
(Extra table access required; data duplication)
41
Denormalize with Caution
Denormalization can:
• Increase the chance of errors and inconsistencies
• Reintroduce anomalies
• Force reprogramming when business rules change
Perhaps other methods could be used to improve the performance of joins:
• Organization of tables in the database (file organization and clustering)
• Proper query design and optimization
42
Designing Physical Database Files
Physical file: a named portion of secondary memory allocated for the purpose of storing physical records
Tablespace: a named logical storage unit in which data from multiple tables/views/objects can be stored
Tablespace components (see the Oracle sketch below):
• Segment: a table, index, or partition
• Extent: a contiguous section of disk space
• Data block: the smallest unit of storage
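A minimal Oracle-flavored sketch of these concepts (tablespace, file, and table names are hypothetical):

```sql
-- Create a tablespace backed by one data file...
CREATE TABLESPACE users_ts
  DATAFILE 'users01.dbf' SIZE 100M;

-- ...then place a table (one segment) in it
CREATE TABLE Customer (
  cust_id INTEGER PRIMARY KEY,
  name    VARCHAR2(40)
) TABLESPACE users_ts;
```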
43
Figure 5-5 DBMS terminology in an Oracle 11g environment
44
File Organizations
A technique for physically arranging the records of a file on secondary storage.
Types of file organizations:
• Sequential
• Indexed
• Hashed
45
File Organizations
Factors for selecting a file organization:
• Fast data retrieval and throughput
• Efficient storage space utilization
• Protection from failure and data loss
• Minimizing the need for reorganization
• Accommodating growth
• Security from unauthorized use
46
Figure 5-6a Sequential file organization
Records of the file are stored in sequence by the primary key field values.
• If sorted: every insert or delete requires a re-sort
• If not sorted: average time to find a desired record = n/2
47
Indexed File Organizations
Storage of records sequentially or nonsequentially, with an index that allows software to locate individual records.
Index: a table or other data structure used to determine the location of records in a file that satisfy some condition.
Primary keys are automatically indexed. Other fields or combinations of fields can also be indexed; these are called secondary keys (or nonunique keys).
48
Figure 5-6b Indexed file organization: uses a tree search
Average time to find a desired record = depth of the tree
49
Figure 5-6c Hashed file organization
The hash algorithm usually uses division-remainder to determine record position. Records with the same position are grouped in lists.
50
Figure 5-7 Join indexes: speed up join operations
a) Join index for common non-key columns
b) Join index for matching foreign key (FK) and primary key (PK)
51
52
Using and Selecting Keys
• Creating a unique key index
– Example: CustomerID (primary key) of Customer
– Example: Composite primary key for OrderLine
• Creating a secondary key index
– Example: Description field for Product (not unique)
Each kind of index is created with a single DDL statement, as sketched below.
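A sketch of the three examples above (index names and the OrderLine key columns are hypothetical):

```sql
-- Unique key index on a primary key (often created automatically)
CREATE UNIQUE INDEX cust_pk_idx ON Customer (CustomerID);

-- Unique index on a composite primary key
CREATE UNIQUE INDEX ordline_pk_idx ON OrderLine (OrderID, ProductID);

-- Secondary key index on a nonunique search field
CREATE INDEX prod_desc_idx ON Product (Description);
```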
53
Rules for Using Indexes
1. Use on larger tables
2. Index the primary key of each table
3. Index search fields (fields frequently in the WHERE clause)
4. Index fields in SQL ORDER BY and GROUP BY commands
5. Index when there are >100 values, but not when there are <30 values
54
Rules for Using Indexes (cont.)
6. Avoid use of indexes for fields with long values; perhaps compress values first
7. If the key to an index is used to determine the location of a record, use a surrogate (like a sequence number) to allow an even spread in the storage area
8. The DBMS may have a limit on the number of indexes per table and the number of bytes per indexed field(s)
9. Be careful of indexing attributes with null values; many DBMSs will not recognize null values in an index search
55
Query Optimization
• Parallel query processing: possible when working in multiprocessor systems
• Overriding automatic query optimization: allows query writers to preempt the automated optimization
• Data warehouses are already configured for optimized query performance
56