Query Processing
CENG 352 Database Management Systems
The Role of Relational Algebra in a DBMS
Relational Operations
• We will consider how to implement:
– Selection (σ)
– Projection (π)
– Join (⋈)
Computing Selection (attr
op value)
• No index on attr:
– If rows are not sorted on attr:
• Scan all data pages to find rows satisfying selection condition
• Cost = F (number of pages in the file)
– If rows are sorted on attr and op is =, >, < then:
• Use binary search (at cost log2 F) to locate the first data page
containing a row in which (attr = value)
• Scan further to get all rows satisfying (attr op value)
• Cost = log2 F + (cost of scan)
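The two no-index cases above can be compared with a small cost calculator. This is only a sketch: the function names and the extra_scan_pages parameter (pages read after the one found by binary search) are illustrative assumptions, not part of the slides.

```python
import math


def scan_cost(file_pages: int) -> int:
    """Unsorted file: every data page has to be read once."""
    return file_pages


def sorted_selection_cost(file_pages: int, extra_scan_pages: int) -> int:
    """Sorted file: binary search for the first qualifying page,
    then read the pages that follow it.

    extra_scan_pages is an assumed estimate of how many further pages
    hold qualifying rows.
    """
    binary_search_ios = math.ceil(math.log2(file_pages))
    return binary_search_ios + extra_scan_pages


# A 1000-page file: a full scan reads all pages, while binary search
# needs about 10 page reads before the sequential scan starts.
print(scan_cost(1000))
print(sorted_selection_cost(1000, extra_scan_pages=0))
```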
Computing Selection (attr
op value)
• Primary B+ tree index on attr (for “=” or range search):
– Locate first index entry corresponding to a row in which (attr =
value). Cost = depth of tree
– Rows satisfying condition packed in sequence in successive data
pages; scan those pages.
Cost: number of pages occupied by qualifying rows
[Figure: B+ tree with the index entries (containing rows) that satisfy the condition]
Computing Selection (attr
op value)
• Secondary B+ tree index on attr (for “=” or range search):
– Locate first index entry corresponding to a row in which (attr =
value).
Cost = depth of tree
– Index entries with pointers to rows satisfying condition are packed in
sequence in successive index pages
• Scan entries and sort record IDs to identify table data pages with
qualifying rows
• Any page that has at least one such row must be fetched once.
• Cost: at most the number of rows that satisfy the selection condition
Computing Selection (attr = value)
• Hash index on attr (for “=” search only):
– Hash on value. Cost ≈ 1.2
• 1.2 – typical average cost of hashing (> 1 due to possible
overflow chains)
• Finds the (unique) bucket containing all index entries satisfying
selection condition
• Clustered index – all qualifying rows packed in the bucket (a
few pages)
Cost: number of pages occupied by the bucket
• Unclustered index – sort row IDs in the index entries to identify
data pages with qualifying rows
Each page containing at least one such row must be fetched once
Cost: at most the number of rows in the bucket
Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
• Similar to old schema; rname added for variations.
• Reserves:
– Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors:
– Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Simple Selections
• With no index, unsorted:
SELECT *
FROM Reserves R
WHERE R.rname = 'Joe'
Must essentially scan the whole relation; cost is 1000 I/Os (#pages in R).
• With no index, sorted data:
Use the sort order on rname: do a binary search to locate the first Joe.
Cost is log2 1000 ≈ 10 I/Os, plus the scan of the pages holding Joe's rows.
• With a B+ tree index on selection attribute:
Use index to find qualifying data entries, then retrieve corresponding data
records. Cost of finding the starting page is 2 or 3 I/Os; for a clustered index add
one more I/O; for an unclustered index add one page per qualifying tuple.
• Hash index:
1 or 2 I/Os to retrieve the index pages. If there are 100 reservations by Joe, then an
additional 1-100 disk accesses, depending on how these records are distributed.
Using an Index for Selections
SELECT *
FROM Reserves R
WHERE R.rname < 'C%'
• Cost depends on #qualifying tuples, and clustering.
• Assume we estimate roughly 10% of Reserves tuples
will be in result ( = 10,000 tuples, or 100 pages).
– With a primary B+tree index: cost is 100 I/Os + 1 or 2 disk
accesses for index.
– With a secondary B+tree index: cost could be as high as
10,000 I/Os in the worst case. (It might be cheaper to simply
scan the entire relation, at 1000 I/Os.)
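The comparison in this example can be sketched as a tiny plan chooser. The function names and the index_ios parameter are illustrative assumptions; the numbers in the usage lines come from the example above (1000 pages, 10% selectivity, about 2 I/Os to descend the index).

```python
def full_scan_cost(file_pages: int) -> int:
    """Read every page of the relation."""
    return file_pages


def clustered_index_cost(index_ios: int, qualifying_pages: int) -> int:
    """Primary (clustered) index: qualifying rows sit on consecutive pages."""
    return index_ios + qualifying_pages


def unclustered_index_worst_cost(index_ios: int, qualifying_tuples: int) -> int:
    """Secondary (unclustered) index, worst case: one page fetch per row."""
    return index_ios + qualifying_tuples


def cheapest_plan(costs: dict) -> str:
    """Return the name of the plan with the lowest estimated I/O cost."""
    return min(costs, key=costs.get)


plans = {
    "full scan": full_scan_cost(1000),
    "primary B+ tree": clustered_index_cost(2, 100),
    "secondary B+ tree (worst case)": unclustered_index_worst_cost(2, 10_000),
}
print(cheapest_plan(plans))
```

If only the secondary index exists, dropping the "primary B+ tree" entry makes the full scan the cheapest choice, which is the point the slide makes.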
The Projection Operation
• To implement projection we have to do the
following:
1. Remove unwanted attributes.
2. Eliminate any duplicate tuples produced.
• The expensive part is removing duplicates.
– SQL systems don’t remove duplicates unless the
keyword DISTINCT is specified in a query.
• There are two basic algorithms:
1. Sorting Approach.
2. Hashing Approach.
Approach based on sorting
• Modify Pass 1 of external sort to eliminate
unwanted fields. If B buffer pages are
available, runs of about 2B pages can be
produced, but tuples in runs are smaller than
input tuples. (Size ratio depends on # and size of
fields that are dropped.)
• Modify merging passes to eliminate
duplicates. Thus, number of result tuples
smaller than input. (Difference depends on # of
duplicates.)
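The two modifications can be illustrated with an in-memory stand-in for external sort. This is a sketch under simplifying assumptions: rows are Python dicts, run_size (tuples per run) stands in for the buffer capacity, and all runs are merged in a single pass. The function name is hypothetical.

```python
import heapq
from itertools import islice


def project_distinct(rows, keep, run_size):
    """Sort-based projection with duplicate elimination.

    rows: iterable of dicts; keep: attribute names to retain;
    run_size: number of tuples per sorted run (assumed buffer capacity).
    """
    it = iter(rows)
    runs = []
    while True:
        # Pass 1: drop unwanted fields while reading, then sort the run.
        chunk = [tuple(r[a] for a in keep) for r in islice(it, run_size)]
        if not chunk:
            break
        runs.append(sorted(chunk))

    result = []
    # Merge pass: equal tuples arrive next to each other, so a duplicate
    # is detected by comparing with the last tuple written out.
    for t in heapq.merge(*runs):
        if not result or result[-1] != t:
            result.append(t)
    return result


reserves = [
    {"sid": 1, "bid": 10, "day": "d1"},
    {"sid": 2, "bid": 20, "day": "d2"},
    {"sid": 1, "bid": 10, "day": "d3"},
    {"sid": 1, "bid": 5, "day": "d4"},
]
print(project_distinct(reserves, ("sid", "bid"), run_size=2))
```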
Example
SELECT DISTINCT R.sid, R.bid
FROM Reserves R
Cost:
• In Pass 1, read the original relation (1000 pages) and write out
the same number of smaller tuples.
– Assume we have 20 buffer pages
– Assume that each smaller tuple is 10 bytes long.
– Thus the write cost is 250 pages (7 runs of about 40 pages each).
• In merging passes, fewer tuples are written out in each pass.
– The temporary relation can be merged in 1 pass, since we have 20
buffer pages.
– We read the runs at a cost of 250 I/Os and merge them.
• The total cost is 1000 + 250 + 250 = 1500 I/Os.
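The arithmetic of this example can be reproduced with a short calculation. It is a sketch that follows the slide's assumptions (runs of about 2B pages, a single merge pass, output cost ignored); the function name and parameters are illustrative.

```python
import math


def sort_projection_cost(input_pages, tuple_bytes, projected_bytes, buffer_pages):
    """Return (total I/Os, number of runs) for sort-based projection,
    assuming runs of about 2 * buffer_pages pages and one merge pass."""
    # Projected tuples are smaller, so the temporary relation shrinks.
    temp_pages = math.ceil(input_pages * projected_bytes / tuple_bytes)
    runs = math.ceil(temp_pages / (2 * buffer_pages))
    if runs > buffer_pages - 1:
        raise ValueError("more than one merge pass would be needed")
    pass1 = input_pages + temp_pages   # read original, write projected runs
    merge = temp_pages                 # read the runs back once
    return pass1 + merge, runs


# Reserves: 1000 pages, 40-byte tuples projected to 10 bytes, 20 buffers.
print(sort_projection_cost(1000, 40, 10, 20))  # (1500, 7)
```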
Computing Joins: R ⋈_{A=B} S
• The cost of joining two relations makes the choice
of a join algorithm crucial
• Assume: M pages in R, pR tuples per page, N
pages in S, pS tuples per page.
– In our examples, R is Reserves and S is Sailors.
• Cost metric: # of I/Os. We will ignore output
costs.
Simple Nested Loops Join
foreach tuple r in R do
    foreach tuple s in S do
        if r_i == s_j then add <r, s> to result
• For each tuple in the outer relation R, we scan the entire inner
relation S.
– Cost: M + pR * M * N = 1000 + 100*1000*500 I/Os.
• Page-oriented Nested Loops join: For each page of R, get each
page of S, and write out matching pairs of tuples <r, s>, where r
is in R-page and s is in S-page.
– Cost: M + M*N = 1000 + 1000*500
– If smaller relation (S) is outer, cost = 500 + 500*1000
• Choose smaller relation for the outer loop
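The three cost figures on this slide can be checked with a few lines of arithmetic. The function names are illustrative; the formulas are the ones stated above, with output cost ignored.

```python
def simple_nested_loops_cost(outer_pages, outer_tuples_per_page, inner_pages):
    """Tuple-at-a-time: scan the inner relation once per outer tuple."""
    return outer_pages + outer_tuples_per_page * outer_pages * inner_pages


def page_nested_loops_cost(outer_pages, inner_pages):
    """Page-at-a-time: scan the inner relation once per outer page."""
    return outer_pages + outer_pages * inner_pages


# Reserves: M = 1000 pages, pR = 100; Sailors: N = 500 pages.
print(simple_nested_loops_cost(1000, 100, 500))  # 50001000
print(page_nested_loops_cost(1000, 500))         # 501000 (Reserves outer)
print(page_nested_loops_cost(500, 1000))         # 500500 (Sailors outer)
```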
Block Nested Loops Join
• Use one page as an input buffer for scanning the inner S, one page as the
output buffer, and use all remaining pages to hold a "block" of outer R.
– For each matching tuple r in R-block, s in S-page, add <r, s> to result. Then read
next R-block, scan S, etc.
• Cost can be reduced to M + ⌈M/(B-2)⌉ × N
(by using B buffer pages instead of 1.)
[Figure: R and S feed a hash table for a block of R (B-2 pages) and an input buffer for S; the join result leaves through an output buffer.]
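An in-memory sketch of the block nested loops idea follows. It assumes tuples are plain Python tuples, join columns are given as positions, and block_size counts tuples rather than pages; the in-block hash table mirrors the one in the figure. The function name is hypothetical.

```python
from collections import defaultdict
from itertools import islice


def block_nested_loops_join(outer, inner, outer_key, inner_key, block_size):
    """Join outer and inner on outer[outer_key] == inner[inner_key].

    outer is read one block at a time; inner (a list, so it can be
    rescanned) is scanned once per outer block.
    """
    result = []
    it = iter(outer)
    while True:
        block = list(islice(it, block_size))
        if not block:
            break
        # Hash the current block of the outer relation on the join column.
        table = defaultdict(list)
        for r in block:
            table[r[outer_key]].append(r)
        # One scan of the inner relation per block.
        for s in inner:
            for r in table.get(s[inner_key], []):
                result.append((r, s))
    return result


reserves = [(22, 101), (58, 103), (22, 102), (99, 104)]  # (sid, bid)
sailors = [(22, "Dustin"), (58, "Rusty")]                 # (sid, sname)
for pair in block_nested_loops_join(reserves, sailors, 0, 0, block_size=2):
    print(pair)
```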
Examples of Block Nested Loops
• Cost: Scan of outer + #outer blocks * scan of inner
– #outer blocks = ⌈# of pages of outer / blocksize⌉
– i.e. Cost = M + N * ⌈M/(B-2)⌉
• With Reserves (R) as outer, and 100-page blocks of R:
– Cost of scanning R is 1000 I/Os; a total of 10 blocks.
– Per block of R, we scan Sailors (S); 10*500 I/Os.
– If space for just 90 pages of R, we would scan S 12 times.
• With 100-page block of Sailors as outer:
– Cost of scanning S is 500 I/Os; a total of 5 blocks.
– Per block of S, we scan Reserves; 5*1000 I/Os.
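The block counts and totals above follow directly from the cost formula, as this short calculation shows. The function name is illustrative, and block_pages plays the role of B-2.

```python
import math


def block_nested_loops_cost(outer_pages, inner_pages, block_pages):
    """Scan of outer + (number of outer blocks) * scan of inner."""
    outer_blocks = math.ceil(outer_pages / block_pages)
    return outer_pages + outer_blocks * inner_pages


print(block_nested_loops_cost(1000, 500, 100))  # Reserves outer: 6000
print(block_nested_loops_cost(1000, 500, 90))   # 12 scans of Sailors: 7000
print(block_nested_loops_cost(500, 1000, 100))  # Sailors outer: 5500
```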
Index Nested Loops Join
foreach tuple r in R do
    foreach tuple s in S where r_i == s_j do
        add <r, s> to result
• If there is an index on the join column of one relation (say S), can
make it the inner and exploit the index.
– Cost: M + ( (M*pR) * cost of finding matching S tuples)
• For each R tuple, cost of probing S index is about 1.2 for hash
index, 2-4 for B+ tree. Cost of then finding S tuples depends on
clustering.
– Clustered index: 1 I/O (typical), unclustered: up to 1 I/O per matching S tuple.
• Effective if number of rows of S that match tuples in R is small and
index is clustered
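The loop above can be sketched in memory with a Python dict standing in for a hash index on the inner relation. Both helper names are hypothetical, and tuples are assumed to be plain Python tuples with join columns given as positions.

```python
from collections import defaultdict


def build_hash_index(relation, key):
    """Stand-in for a hash index: join-column value -> matching tuples."""
    index = defaultdict(list)
    for s in relation:
        index[s[key]].append(s)
    return index


def index_nested_loops_join(outer, inner_index, outer_key):
    """For each outer tuple, probe the index instead of scanning the inner."""
    return [
        (r, s)
        for r in outer
        for s in inner_index.get(r[outer_key], [])
    ]


reserves = [(22, 101), (58, 103), (99, 104)]  # (sid, bid)
sailors = [(22, "Dustin"), (58, "Rusty")]      # (sid, sname)
sid_index = build_hash_index(sailors, 0)
print(index_nested_loops_join(reserves, sid_index, 0))
```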
Examples of Index Nested Loops
• Hash-index (unclustered) on sid of Sailors (as inner):
– Scan Reserves: 1000 page I/Os, 100*1000 tuples.
– For each Reserves tuple: 1.2 I/Os to get data entry in index, plus 1 I/O
to get (the exactly one) matching Sailors tuple. Total: 100,000 * 2.2 = 220,000 I/Os
for the probes, plus 1000 I/Os for the scan.
• Hash-index (unclustered) on sid of Reserves (as inner):
– Scan Sailors: 500 page I/Os, 80*500 tuples.
– For each Sailors tuple: 1.2 I/Os to find index page with data entries
(1.2 * 40,000 = 48,000 I/Os, or 48,500 including the scan), plus cost of
retrieving matching Reserves tuples. Assuming uniform distribution, 2.5
reservations per sailor (100,000 / 40,000), thus cost of retrieving them is
2.5 * 40,000 = 100,000 I/Os.
Total cost is 48,500 + 100,000 = 148,500 I/Os (still better than simple
nested loops)
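Both examples can be recomputed from the index nested loops cost formula. This is a sketch: the function name and parameters are illustrative, each matching tuple is assumed to cost one I/O (unclustered index), and the result is rounded to whole I/Os. The scan of the outer relation is included, so the first case comes out as 221,000 (220,000 for the probes plus 1000 for the scan).

```python
def index_nested_loops_cost(outer_pages, outer_tuples_per_page,
                            probe_cost, matches_per_tuple):
    """Scan of outer + per-tuple (index probe + fetch of matching tuples)."""
    outer_tuples = outer_pages * outer_tuples_per_page
    per_tuple = probe_cost + matches_per_tuple
    # Rounded because probe_cost is an average (for example 1.2).
    return round(outer_pages + outer_tuples * per_tuple)


# Reserves outer, hash index on Sailors.sid: one match per Reserves tuple.
print(index_nested_loops_cost(1000, 100, 1.2, 1))    # 221000
# Sailors outer, hash index on Reserves.sid: 2.5 matches per sailor.
print(index_nested_loops_cost(500, 80, 1.2, 2.5))    # 148500
```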