Download PowerPoint

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Control table wikipedia , lookup

Bloom filter wikipedia , lookup

Linked list wikipedia , lookup

Quadtree wikipedia , lookup

Lattice model (finance) wikipedia , lookup

Hash table wikipedia , lookup

Red–black tree wikipedia , lookup

Interval tree wikipedia , lookup

Rainbow table wikipedia , lookup

Binary search tree wikipedia , lookup

B-tree wikipedia , lookup

Transcript
Chapter 11
Indexing & Hashing
Indexing & Hashing



Sophisticated database access methods
Basic concerns: access/insertion/deletion time,
space overhead
Indexing




An index is specified on one (or more) field(s), called
search key field, of the record, which is not
necessarily unique
Different index structures associated with different
search keys
Allows fast random access to records
Index record (forms an access path to the data record),
is of the form < field value, pointer to record >
2
Indexing


Dense Index: For every unique search-key value, there is
an index record
Sparse Index: Index records are created for some searchkey values


Sparse index is slower, but requires less space & overhead
Primary Index:




Defined on an ordered data file, ordered on a search key field
& is usually the primary key.
A sequentially ordered file with a primary index is called
index-sequential file
A binary search on the index yields a pointer to the record
Index value is the search-key value of the first data record in
the block
3
Figure. Dense index
Figure. Sparse index
4
Figure. Primary index
on the ordering key
field of a file
5
Primary Index

Index Deletion



(Dense Index) delete the search-key value (of the deleted record)
from the index file if the deleted record is the last record with
the search-key value
(Sparse Index) if the deleted record is the last record with the
search-key value & the search-key value v (of the deleted
record) exists in the index file F, replace v by w, where w is
the next search-key value in order; if v and w are in F, simply
delete v
Index Insertion

(Dense Index) if the search-key value v (of the new record) does
not exist in the index file, insert v

(Sparse Index) if a new data block B is created, then the 1st
search-key value of B is inserted into the index file
6
Multi-Level Indices





First-level index: the original index file
Second-level index: primary index to the original
index file
(Rare) Third-level index: top level index (fit in one
disk block)
Form a search tree, such as B-tree or B+-tree
structures
Insertion/deletion of new indexes are not trivial in
indexed files
7
Figure. A twolevel primary
index
8
Secondary Indices


Defined on an unordered data file, i.e., not by the indexed field
order (can be defined on a candidate key/non-key field)
Each pointer often points to a bucket which consists of pointers
to records with the same search-key value. The bucket
structure can be eliminated if


the index is dense, and the search-key values form a primary key, i.e., unique
Advantages:
i.
ii.
iii.
Improve the performance of queries that use candidate keys
Eliminate extra pointers within the records
Eliminate the need for scanning records sequentially

Disadvantages: overhead/modification

Types of Secondary Indices:

Dense: pointers in a bucket point to records w/ same search-key values

Sparse: a pointer in a bucket points to records w/ search-key values in the
appropriate range
9
Figure. A secondary
index on a key field
of a file.
10
Figure. A secondary
index on a non-key
field implemented
using a level of
indirection
11
B+-Tree (Multi-level) Indices

Frequently used index structure in DB

Allow efficient insertion/deletion of new/existing search-key values


A balanced tree structure: all leaf nodes are at the same level (which
may form a dense index)
Each node, corresponding to a disk block, has the format:
P1
K1
where Pi, 1
Ki, 1
…
P2
i
i
Pn-1
Kn-1
Pn
n, is a pointer
n-1, is a search-key value &
Ki < Kj, i < j, i.e., search-key values are in order
P1
X
X < K1

K1
…
Ki-1
Pi
Ki
X
Ki-1
X < Ki
…
Kn-1
Pn
X
Kn-1
X
In each leaf node, Pi points to either (i) a data record with search-key
value Ki or (ii) a bucket of pointers, each points to a data record
with search-key value Ki
12
B+-Tree (Multi-level) Indices




Each leaf node is kept between half full & completely full, i.e.,
( (n-1)/2 , n-1) search-key values
Non-leaf nodes form a sparse index
Each non-leaf node (except the root) must have ( n/2 , n)
pointers
No. of Block accesses required for searching a search-key value
@leaf-node level is log n/2 (K)
where K = no. of unique search-key values & n = no. of indices/node

Insertion into a full node causes a split into two nodes which may
propagate to higher tree levels
Note: if there are n search-key values to be split, put the first
( (n1)/2 in the existing node & the remaining in a new
node
13

A less than half full node caused by a deletion must be merged
B+-Tree Algorithms

Algorithm 1. Searching for a record with search-key value K, using a B+-Tree .
Begin
n block containing root node of B+-Tree ;
read block n;
while (n is not a leaf node of the B+-Tree) do
begin
q number of tree pointers in node n;
if K < n.K1 /* n.Ki refers to the ith search-key value in node n */
then n n.P1 /* n.Pi refers to the ith pointer in node n */
else if K  n.Kq-1
then n n.Pq
else begin
search node n for an entry i such that n.Ki-1 K < n.Ki;
n n.Pi ;
end; /*ELSE*/
read block n;
end; /*WHILE*/
search block n for entry Ki with K = Ki; /*search leaf node*/
if found, then read data file block with address Pi and retrieve record
else record with search-key value K is not in the data file;
end. /*Algorithm 1*/
14
B+-Tree Algorithms

Algorithm 2. Inserting a record with search-key value K in a B+-Tree of order p.
/* A B+-Tree of order p contains at most p-1 values an p pointers*/
Begin
n block containing root node of B+-Tree ;
read block n;
set stack S to empty;
while (n is not a leaf node of the B+-Tree ) do
begin
push address of n on stack S; /* S holds parent nodes that are needed in case of split */
q number of tree pointers in node n;
if K < n.K1 /* n.Ki refers to the ith search-key value in node n */
then n n.P1 /* n.Pi refers to the ith pointer in node n */
else if K  n.Kq-1
then n n.Pq
else begin
search node n for an entry i such that n.Ki-1 K < n.Ki;
n n.Pi ;
end; /* ELSE */
read block n;
end; /* WHILE */
15
search block n for entry Ki with K = Ki; /* search leaf node */
Algorithm 2 Continue
if found
then return /*record already in index file - no insertion is needed */
else
begin /* insert entry in B+-Tree to point to record */
create entry (P, K), where P points to file block containing new record;
if leaf node n is not full
then insert entry (P, K) in correct position in leaf node n
else
begin /* leaf node n is full – split */
copy n to temp; /* temp is an oversize leaf node to hold extra entry */
insert entry (P, K) in temp in correct position; /* temp now holds p+1 entries
of the form (Pi, Ki) */
new  a new empty leaf node for the tree;
* j  p/2
n  first j entries in temp (up to entry (Pj, Kj));
n.Pnext  new; /* Pnext points to the next leaf node*/
new  remaining entries in temp;
* K  Kj+1;
/* Now we must move (K, new) and insert in parent internal node. However, if parent is
full, split may propagate */
finished  false;
16
Algorithm 2 continue
Repeat
if stack S is empty, then /*no parent node*/
begin /* new root node is created for the B+-Tree */
root a new empty internal node for the tree;
*
root  <n, K, new>; /* set P1 to n & P2 to new */
finished true;
end
else
begin
n pop stack S;
if internal node n is not full, then
begin /* parent node not full - no split */
insert (K, new) in correct position in internal node n;
finished  true
end
else
17
Algorithm 2 continue
begin /* internal node n is full with p tree pointers – split */
copy n to temp; /* temp is an oversize internal node */
insert (K, new) in temp in correct position; /* temp has p+1 tree pointers */
new  a new empty internal node for the tree;
*
j  ( (p + 1)/2
n  entries up to tree pointer Pj in temp;
/* n contains <P1, K1, P2, K2, .., Pj-1, Kj-1, Pj> */
new  entries from tree pointer Pj+1 in temp;
/*new contains < Pj+1 , Kj+1, .., Kp-1, Pp, Kp, Pp+1 >*/
*
K  Kj;
/* now we must move (K, new) and insert in parent internal node */
end
end
until finished
end; /* ELSE */
end; /* ELSE */
end. /* Algorithm 2 */
18
Hashing

Uses dense index

Avoids accessing an index structure to locate data

Allocate search-key values to different buckets



(Static Hash Function) given a search-key value v, a hash
function h computes (assigns) the address of the desired
bucket (which contains a pointer to the record) for v
h: K  B
where K: set of search-key values
B: set of (fixed) bucket addresses
The hash function maps a search-key value to a bucket b and
perform a (linear) search of every record in b
An ideal hash function


Uniform distribution of search-key values, i.e., same no. of
search-key values in each bucket
Random distribution of search-key values, i.e., each search-key 19
value has the same possibility
Dynamic (Extendable) Hash Function (EHF)



Resolves the problems of static hashing

Allowing hash function to be modified dynamically, accommodating
changes in DB size (no reserved buckets for future growth)

Minimizing space overhead, i.e., bucket address table (b-a-t) is
small
Allows buckets to be split or combined to maintain space
efficiency
Buckets are created on demand, as records are inserted.

Result: low performance overhead (reorganization requires
one bucket at a time)
20
Dynamic (Extendable) Hash Function (EHF)



EHF uses i bits, which grows and shrinks with DB size,
as an offset into b-a-t
i bits (which changes as file grows) of h(K) are required
to determine the correct bucket for K
All entries of the i-bit b-a-t pointed to the same bucket j
have a common hash prefix (chp) and bucket j is
associated with an integer ij to denote the length of
the chp
No. of entries of b-a-t that point to bucket j = 2(i - ij)
21
=1
…
=2
=2
…
=2
…
Figure. General extendable hash structure
22
Dynamic (Extendable) Hash Function (EHF)


Lookup K, a search-key value: locate the bucket pointed
to by the b-a-t entry which is determined by the first i
high-order bits of h(K)
Insert a record r with search-key value K
1.
Lookup K and locate bucket j
2.
If j is not full, insert the info of K in j and r in the file
3.
If j is full, create a new bucket z. There are two cases to
be considered:
23
Dynamic (Extendable) Hash Function (EHF)
(a) Case i = ij (only one entry in b-a-t points to j):
1.
2.
Increase i by 1, i.e., doubling the size of b-a-t. Each entry
is replaced by 2 entries which contain the same
pointer as the original entry
(For the b-a-t entry that causes the split) Set the 2nd entry
created from the entry for j to point to z
3.
Set ij = i(new) and iz = i(new)
4.
Rehash the records in j based on (new) i and redistribute
records in j and r
5.
Re-attempt to insert r and repeat the whole process if r
and all records in j have the same hash prefix
24
Figure. Sample deposit file
Figure. Hash function for branch-name
Downtown
Round Hill
Figure. Initial extendable hash structure
(Each bucket can hold up to 2 records)
Figure. Hash structure after 3 insertions
(Downtown, Round Hill, Perryridge)
25
*
*
*
Figure. Sample deposit file
Figure. Hash function for branch-name
Figure. Hash structure after four insertions
26
Hashing

Insert a record r with search-key value K
(b) Case i > ij (> 1 entry in b-a-t points to j):

iz = ij + 1 and ij = ij + 1
2.
Adjust entries in b-a-t that point to j: set the first half of entries
point to j and the remaining ones to z
3.
Rehash and allocate records in j
4.
Reattempt to insert r and repeat the whole process (of insertion)
if r and all records in j have the same hash prefix
Delete a record r with search-key value K:
1.
Lookup K and locate bucket j
2.
Remove K from j and r from the file. Remove j if j becomes empty
Adjust b-a-t if necessary
3.

1.
Disadvantages


Lookup involves additional level of indirection (must access b-a-t)
Additional complexity in implementation
27
Figure. Sample deposit file
Figure. Hash function for branch-name
Figure. Extendable
hash structure for the
deposit file
28
Figure. Sample account file
Figure. Hash function for
branch-name
Figure. Initial extendable
hash structure.
29
Figure 11.28 Hash structure
after four insertions
30
Redwood
A-222 700 0011
Figure 11.29 Hash structure
after nine insertions
Round Hill A-305 350 1101
Figure 11.29 Hash structure after seven insertions
31
Figure 11.30 Extendable hash structure for the account file
32
Indexing & Hashing


Expected types of queries is critical to the choice
between indexing and hashing
Comparison

For query with an equality comparison of an attribute,
hashing is preferable

For query with a range of values specified, indexing is
preferable

Most DB systems use indexing - difficult to find a good
hash function that preserves order to support range
queries
33