Download Indexing Tree-Based Indexing

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Microsoft Jet Database Engine wikipedia , lookup

Database wikipedia , lookup

Concurrency control wikipedia , lookup

Relational model wikipedia , lookup

Functional Database Model wikipedia , lookup

Clusterpoint wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Database model wikipedia , lookup

Transcript
Indexing
Chapter 8, 10, 11
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
1
Tree-Based Indexing
The data entries are arranged in sorted order
by search key value.
 A hierarchical search data structure (tree) is
maintained that directs searches to the correct
page of data entries.
 Tree-structured indexing techniques support
both range searches and equality searches.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
2
Trees
Tree
A structure with a unique starting node (the
root), in which each node is capable of having
child nodes and a unique path exists from the
root to every other node
Root
The top node of a tree structure; a node with
no parent
Leaf Node
A tree node that has no children
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
3
Trees
Level
Distance of
a node from
the root
Height
The
maximum
level
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
4
Time Complexity of Tree Operations

Time complexity of Searching:
 O(height of the tree)

Time complexity of Inserting:
 O(height of the tree)

Time complexity of Deleting:
 O(height of the tree)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
5
Multi-way Tree
 If
we relax the restriction that each
node can have only one key, we
can reduce the height of the tree.
 A multi-way search tree is a tree in
which the nodes hold between 1 to
m-1 distinct keys
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
6
Range Searches

``Find all students with gpa > 3.0’’



If data is in sorted file, do binary search to find first
such student, then scan to find others.
Cost of binary search can be quite high.
Simple idea: Create an `index’ file.
Page 1
Page 2
Index File
kN
k1 k2
Page 3
Page N
Data File
Can do binary search on (smaller) index file!
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
7
Property of Multi-way Tree
The keys in each node are sorted.
 A node with k values has k+1 sub-trees, where
the sub-trees may be empty.
 The i’th sub-tree of a node [v1, ..., vk], 0 < i < k,
may hold only values v in the range
vi < v < vi+1 (v0 is assumed to equal -∞,
and vk+1is assumed to equal +∞).
 A m-way tree of height h has between h
and mh - 1 keys.

 The height of a complete m-ary tree with n nodes is
ceiling(logmn).
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
8
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
9
Searching m-way Tree
We make an m-way branching decision at
each node according to the number of the
node’s children.
 Searching is performed in a recursive way.
 Time complexity is O(h).

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
10
Two Popular Trees

ISAM (Indexed Sequential Access Method)
 A static structure;

B+ tree:
 A dynamic structure, adjusts gracefully under
inserts and deletes.

Leaf pages of both of them contain data
entries.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
ISAM
index entry
P
0

11
K
1
P
1
K 2
P
2
K m
P m
Index file may still be quite large. But we can
apply the idea repeatedly!
Non-leaf
Pages
Leaf
Pages
Overflow
page
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
Primary pages
12
Comments on ISAM



A multi-way tree
Each tree node is a disk page;
Leaf pages contain data entries .
 Alternative 1 : Leaf pages are created with data
record with key value k; All leaf pages are
allocated sequentially and sorted on the search
key value.
 Alternative 2 or 3 : Data records are created and
sorted in a separate file, and then storing <key,
rid> in the leaf pages of ISAM index.
 Then index pages allocated, then space for
overflow pages.
Data Pages
Index Pages
Overflow pages
• Index entries: <search key value, page id>; they
`direct’ search for data entries, which are in leaf
pages.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
Comments on ISAM

Search: Start at root; use key comparisons to
go to leaf. Cost  log F N ;
 F = # entries/index pg, N = # leaf pgs


Insert: Find leaf data entry belongs to, and
put it there.
Delete: Find and remove from leaf; if empty
overflow page, de-allocate.
Static
13
Data
Pages
Index Pages
Overflow pages
tree structure: inserts/deletes affect only leaf pages.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
14
Example ISAM Tree
Each node can hold 2 entries; no need for
`next-leaf-page’ pointers. (Why? ->static)
 Example: search a record with the key value
Root
27.
40

10*
15*
20
33
20*
27*
51
33*
37*
40*
46*
63
55*
51*
63*
97*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
15
After Inserting 23*, 48*, 41*, 42* ...
Root
40
Index
Pages
20
33
20*
27*
51
63
Primary
Leaf
10*
15*
33*
37*
40*
46*
48*
41*
51*
55*
63*
97*
Pages
Overflow
23*
Pages
42*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
16
... Then Deleting 42*, 51*, 97*
The number of primary leaf pages is fixed at file creation
time – STATIC.
Root
40
10*
15*
20
33
20*
27*
23*
51
33*
37*
40*
46*
48*
41*
63
55*
63*
tree structure: inserts/deletes affect only leaf pages.
Note that 51* appears in index levels, but not in leaf!
Static
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
17
Comments on ISAM

Static design leads to the problem that long
overflow chains could develop.
 To alleviate this problem, the tree is initially
created so that about 20% of each page is free.

Static design has the advantage that locking
step is not needed since index-level pages are
never modified.
Static
tree structure: inserts/deletes affect only leaf pages.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
18
B+ Trees
A Dynamic Index Structure
Most Widely Used Index
 A balanced tree in which the internal nodes
direct the search and the leaf nodes contain
the data entries.
 Tree structure grows and shrinks
dynamically, leaf pages are not sequentially
allocated.
 Leaf pages are organized in doubly linked
list.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
19
B+ Tree Indexes
Non-leaf
Pages
Leaf
Pages
(Sorted by search key)
Double linked list
Leaf pages contain data entries, and are chained (prev & next)
 Non-leaf pages have index entries; only used to direct searches:

index entry
P0
K 1
P1
K 2
P 2
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
K m Pm
20
Fan-out of the Tree
Fan-out of the tree: the average number of
children for a non-leaf node.
 If every non-leaf node has n children, a tree of
height h has nh leaf pages.
 A good approximation of the number of leaf
pages, Fh ( F is the average # of children, which
is at least 100).
 Example:

 A tree of height 4 contains 100 million leaf pages.
 A binary search will take log2100,000,000 > 25 I/Os.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
21
B+ Trees



Insert/delete at log F N cost; keep tree height-balanced. (F
= fanout, N = # leaf pages)
Minimum 50% occupancy (except for root). Each node
contains m entries, where d <= m <= 2d . Except the
root has m entries, where 1 <= m <= 2d. The parameter
d is called the order of the tree.
Supports equality and range-searches efficiently.
Index Entries
(Direct search)
Data Entries
("Sequence set")
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
22
Example B+ Tree
Note how data entries
in leaf level are sorted
Root
17
Entries <= 17
5
2*
3*
Entries > 17
27
13
5*
7* 8*
14* 16*
22* 24*
30
27* 29*
33* 34* 38* 39*
Find 28*? 29*? All > 15* and < 30*
 Insert/delete: Find data entry in leaf, then
change it. Need to adjust parent sometimes.

 And change sometimes bubbles up the tree
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
23
B+ Trees in Practice

Typical order: d=100. Typical fill-factor: 67%.


Typical capacities:



average fanout = 133
Height 4: 1334 = 312,900,700 records
Height 3: 1333 = 2,352,637 records
Can often hold top levels in buffer pool:



Level 1 =
1 page = 8 Kbytes
Level 2 =
133 pages = 1 Mbyte
Level 3 = 17,689 pages = 133 MBytes
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
24
Inserting a Data Entry into a B+ Tree
Find correct leaf node L.
 Put data entry onto L.



If L has enough space, done!
Else, must split L (into L and a new node L2)
• Redistribute entries evenly, copy up middle key.
• Insert index entry pointing to L2 into parent of L.

This can happen recursively


To split index node, redistribute entries evenly, but
push up middle key. (Contrast with leaf splits.)
Splits “grow” tree; root split increases height.

Tree growth: gets wider or one level taller at top.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
25
Inserting 8* entry into B+ Tree
Root
13
2*
3*
5* 7*
17
24
30
19* 20* 22*
14* 16*
24* 27* 29*
33* 34* 38* 39*
Entry to be inserted in parent node.
Split the full leaf
node, copy up the
middle key.
(Note that 5 is
s copied up and
continues to appear in the leaf.)
5
2*
3*
5*
7*
8*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
26
Inserting 8* into Example B+ Tree
To split index
node, redistribute
entries evenly, but
push up middle
key. (Contrast
with leaf splits.)
appears once in the index. Contrast
this with a leaf split.)
5


Entry to be inserted in parent node.
(Note that 17 is pushed up and only
17
13
24
30
Observe how minimum occupancy is guaranteed in
both leaf and index pg splits.
Note difference between copy-up and push-up; be sure
you understand the reasons for this.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
27
Example B+ Tree After Inserting 8*
Root
17
5
2*
3*
24
13
5*
7* 8*
14* 16*
19* 20* 22*
30
24* 27* 29*
33* 34* 38* 39*
 Notice that root was split, leading to increase in height.
 In this example, we can avoid split by re-distributing
entries; however, this is usually not done in practice.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
28
Deleting a Data Entry from a B+ Tree
Start at root, find leaf node L where entry belongs.
 Remove the entry.

If L is at least half-full, done!
If L has only d-1 entries,
• Try to re-distribute, borrowing from sibling (adjacent
node with same parent as L).
• If re-distribution fails, merge L and sibling.


If merge occurred, must delete entry (pointing to L
or sibling) from parent of L.
 Merge could propagate to root, decreasing height.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
29
Example Tree After (Inserting 8*,
Then) Deleting 19*
Root
17
5
2*
3*
24
13
5*
7* 8*
19* 20* 22*
14* 16*
30
24* 27* 29*
33* 34* 38* 39*
Root
17
5
2*
3*
24
13
5*
7* 8*
14* 16*
20* 22*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
30
24* 27* 29*
33* 34* 38* 39*
30
Example Tree After (Inserting 8*, Deleting
19*, then) Deleting 20* ...
Root
17
5
2* 3*
24
13
5* 7* 8*
20* 22*
14* 16*
30
33* 34* 38* 39*
24* 27* 29*
Root
17
5
2* 3*

27
13
5* 7* 8*
22* 24*
14* 16*
30
33* 34* 38* 39*
27* 29*
Deleting 20* is done with re-distribution. Notice how middle key is
copied up.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
31
... And Then Deleting 24*
Must merge.
 Observe `toss’
of index entry.
30

22*
27*
29*
33*
34*
38*
39*
Root
17
5
2* 3*
27
13
5* 7* 8*
14* 16*
22* 24*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
30
27* 29*
33* 34* 38* 39*
32
... And Then Deleting 24*
Merge recursively.
 Observe `pull down’ of index entry

Root
17
5
2* 3*
30
13
5* 7* 8*
22* 27* 29*
14* 16*
33* 34* 38* 39*
Root
5
2*
3*
5*
7*
8*
13
17
30
22* 27* 29*
14* 16*
33* 34* 38* 39*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
33
Example of Non-leaf Re-distribution
Tree is shown below during deletion of 24*.
 In contrast to previous example, can re-distribute
entry from left child of root to right child.

Root
22
5
2* 3*
5* 7* 8*
13
14* 16*
17
30
20
17* 18*
20* 21*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
22* 27* 29*
33* 34* 38* 39*
34
After Re-distribution
Intuitively, entries are re-distributed by `pushing
through’ the splitting entry in the parent node.
 It suffices to re-distribute index entry with key 20;
we’ve re-distributed 17 as well for illustration.

Root
17
5
2* 3*
5* 7* 8*
13
14* 16*
20
17* 18*
22
20* 21*
30
22* 27* 29*
33* 34* 38* 39*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
35
Bulk Loading of a B+ Tree
If we have a large collection of records, and we
want to create a B+ tree on some field, doing so
by repeatedly inserting records is very slow.
 Bulk Loading can be done much more efficiently.
 Initialization: Sort all data entries, insert pointer
to first (leaf) page in a new (root) page.

Root
3* 4*
Sorted pages of data entries; not yet in B+ tree
6* 9*
10* 11*
12* 13* 20* 22* 23* 31* 35* 36*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
38* 41* 44*
36
Bulk Loading (Contd.)
Root


Index entries for leaf
pages always
entered into rightmost index page just
3*
above leaf level.
When this fills up, it
splits. (Split may go
up right-most path
to the root.)
Much faster than
repeated inserts,
especially when one
considers locking!
10
20
Data entry pages
6
4*
3* 4*
6* 9*
12
23
not yet in B+ tree
10* 11* 12* 13* 20*22* 23* 31* 35*36* 38*41* 44*
Root
20
10
6
6* 9*
35
12
Data entry pages
not yet in B+ tree
35
23
38
10* 11* 12* 13* 20*22* 23* 31* 35*36* 38*41* 44*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
37
Summary of Bulk Loading

Option 1: multiple inserts.



Slow.
Does not give sequential storage of leaves.
Option 2: Bulk Loading




Has advantages for concurrency control.
Fewer I/Os during build.
Leaves will be stored sequentially (and linked, of
course).
Can control “fill factor” on pages.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
38
A Note on `Order’

Order (d) concept replaced by physical space
criterion in practice (`at least half-full’).



Index pages can typically hold many more entries
than leaf pages.
Variable sized records and search keys mean differnt
nodes will contain different numbers of entries.
Even with fixed length fields, multiple records with
the same search key value (duplicates) can lead to
variable-sized data entries (if we use Alternative (3)).
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
39
Summary
Tree-structured indexes are ideal for rangesearches, also good for equality searches.
 ISAM is a static structure.




Only leaf pages modified; overflow pages needed.
Overflow chains can degrade performance unless size
of data set and data distribution stay constant.
B+ tree is a dynamic structure.



Inserts/deletes leave tree height-balanced; log F N cost.
High fanout (F) means depth rarely more than 3 or 4.
Almost always better than maintaining a sorted file.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
40
Summary (Contd.)



Typically, 67% occupancy on average.
Usually preferable to ISAM, modulo locking
considerations; adjusts to growth gracefully.
If data entries are data records, splits can change rids!
Key compression increases fanout, reduces height.
 Bulk loading can be much faster than repeated
inserts for creating a B+ tree on a large data set.
 Most widely used index in database management
systems because of its versatility. One of the most
optimized components of a DBMS.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
41
Comparing File Organizations

A collection of employee records with
composite search key <age, sal>





Heap files (random order; insert at eof)
Sorted files, sorted on <age, sal>
Clustered B+ tree file, Alternative (1), search key
<age, sal>
Heap file with unclustered B + tree index on
search key <age, sal>
Heap file with unclustered hash index on search
key <age, sal>
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
42
Operations to Compare
Scan: Fetch all records from disk
Search with Equality search
Search with Range selection
Insert a record
Delete a record





Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
43
Cost Model for Our Analysis
We ignore CPU costs, for simplicity:



B: The number of data pages
R: Number of records per page
D: (Average) time to read or write a disk page
•

C: (Average) time to process a record (e.g.,
comparing a field value to a selection constant)
•

100 nanoseconds
H: the time to apply the hash function to a record.
•

15 milliseconds
100 nanoseconds
F: Fan-out of a index tree.
•
100
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
44
Cost Model
We expect the cost of I/O to dominate.
 Disk speeds are not increasing at a similar
pace as CPU speeds rises.
 Measuring number of page I/O’s ignores
gains of pre-fetching a sequence of pages;
thus, even I/O cost is only approximated.
 Average-case analysis; based on several
simplistic assumptions.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
45
Assumptions in Our Analysis

Heap Files:


Sorted Files:


Equality selection on key; exactly one match.
Files compacted after deletions.
Indexes:
 Alt (2), (3): data entry size = 10% size of record
 Hash: No overflow buckets.
•

80% page occupancy => File size = 1.25 data size
Tree: 67% occupancy (this is typical).
•
Implies file size = 1.5 data size
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
46
Assumptions (contd.)

Scans:
 Leaf levels of a tree-index are chained.
 Index data-entries plus actual file scanned for
unclustered indexes.

Range searches:
 We use tree indexes to restrict the set of data
records fetched, but ignore hash indexes.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
47
Heap Files
Scan: B(D+RC)
 Search with Equality Selection: 0.5B(D+RC)

 Average case of linear search
Search with Range Selection: B(D+RC)
 Insert: 2D+C

 Insert at the end of file

Delete: Searching Cost +(D+C)
 Search the page through rid <page #, slot #> 2D+C
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
48
Sorted Files
Scan: B(D+RC)
 Search with Equality Selection: Dlog2B +
Clog2R
 Search with Range Selection: The cost of
search + the cost of retrieving the set of
records.

 The cost of search includes fetching first page.
Insert: Searching cost + 2*0.5B(D+RC)
 Delete: Searching cost + 2*0.5B(D+RC)

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
49
Clustered Files with B+ Tree
Scan: 1.5B(D+RC)
 Search with Equality Selection: DlogF1.5B +
Clog2R
 Search with Range Selection: Similar to
Search with many qualifying records
 Insert/Delete: DlogF1.5B + Clog2R+D

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
50
Heap File with
Un-clustered Tree Index



The number of leaf pages is 0.1*1.5B = 0.15B.
The number of data entries on a page is 10 *0.67R
= 6.7R.
Scan: 0.15B(D+6.7RC) I/Os + BR(D+C) or 4B
(sorting)
 Data Entries Cost : 0.15B(D+6.7RC)
 Fetching records (one I/O per record): BR(D+C) or
 Just sort records : 4B

Search with Equality Selection: DlogF0.15B +
Clog26.7R+D
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
51
Heap File with
Un-clustered Tree Index

Search with Range Selection:
 Similar to Search with many qualifying records one I/O per each qualifying record
 If 10% of data records qualify, better sort the data
file.

Insert:
 Insert the record in heap file: 2D+C
 Insert the data entry in the index: DlogF0.15B +
Clog26.7R+D

Delete: DlogF0.15B + Clog26.7R+2D
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
52
Heap File with
Un-clustered Hash Index



The number of leaf pages is 1.25*0.1B = 0.125B
The number of data entries on a page is 10*0.8R=8R.
Scan: 0.125B(D+8RC) I/Os + BR(D+C).
 Data entries: 0.125B(D+8RC)
 Fetching each record: BR(D+C)

Search with Equality Selection: H+2D+4RC




Hashing cost: H
Fetching the data entry page: D
Search the data entry page: 0.5 *8RC= 4RC
Fetching the data record page: D
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
53
Heap File with
Un-clustered Hash Index

Search with Range Selection: B(D+RC)
 Scan the whole heap file.

Insert: 2D+C plus H+2D+C
 Insert the data record in heap file: 2D+C
 Update the hash index : H+2D+C

Delete: H+2D+4RC plus 2D.





Hashing cost: H
Fetching the data entry page: D
Search the data entry page: 0.5 *8RC= 4RC
Fetching the data record page: D
Update both data entry page and data record page: 2D
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
54
Cost of Operations
(a) Scan
(b) Equality
(c ) Range
(d) Insert (e) Delete
(1) Heap
BD
0.5BD
BD
2D
(2) Sorted
BD
Dlog 2B
D(log 2 B +
# pgs with
match recs)
(3)
1.5BD
Dlog F 1.5B D(log F 1.5B
Clustered
+ # pgs w.
match recs)
(4) Unclust. BD(R+0.15)
D(1 +
D(log F 0.15B
Tree index
log F 0.15B) + # pgs w.
match recs)
(5) Unclust. BD(R+0.125) 2D
BD
Hash index
Search
+ BD
Search
+D
Search
+BD
Search
+D
Search
+D
Search
+ 2D
Search
+ 2D
Search
+ 2D
Search
+ 2D
Several assumptions underlie these (rough) estimates!
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
55
Summary
Heap file: Good storage efficiency, fast scan,
slow search and deletion.
 Sorted file: Good storage efficiency, slow
insertion and deletion. Searching is faster
than heap file.
 Clustered file: as good as sorted file plus
efficient insertion and deletion. Space
overhead.
 Un-clustered file: fast search, insertion,
deletion. Slow scan and range searching.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
56
Impact of the Workload
An index supports efficient retrieval of data
entries that satisfy a given selection condition.
 Hash-based indexing are optimized for
equality selections and poor on range
selection.
 Tree-based indexing support both efficiently.
 Both of them support inserts, deletes, and
updates quite efficiently.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
57
Impact of the Workload

B+ tree has two important advantages over
sorted files
 Handle inserts and deletes of data entries
efficiently.
 Finding the correct leaf page when searching for a
record by search key value is much faster than
binary search in a sorted file.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
58
Understanding the Workload

For each query in the workload:




Which relations does it access?
Which attributes are retrieved?
Which attributes are involved in selection/join conditions?
How selective are these conditions likely to be?
For each update in the workload:


Which attributes are involved in selection/join conditions?
How selective are these conditions likely to be?
The type of update (INSERT/DELETE/UPDATE), and the
attributes that are affected.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
59
Choice of Indexes

What indexes should we create?


Which relations should have indexes? What field(s)
should be the search key? Should we build several
indexes?
For each index, what kind of an index should it
be?

Clustered? Hash/tree?
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
60
Choice of Indexes

Before creating an index, must also consider
the impact on updates in the workload!


Trade-off: Indexes can make queries go faster,
updates slower. Require disk space, too. Using
indexes on tables that are frequently updated can
result in poor performance.
Indexes should be used on tables whose data
does not change frequently but is used a lot in
queries.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
61
Choice of Indexes
One approach: Consider the most important
queries in turn. Consider the best plan using
the current indexes, and see if a better plan is
possible with an additional index. If so,
create it.
 Try to choose indexes that benefit as many
queries as possible. Since only one index can
be clustered per relation, choose it based on
important queries that would benefit the
most from clustering.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
62
Index Selection Guidelines

Attributes in WHERE clause are candidates
for index keys.
 Exact match condition suggests hash index.
 Range query suggests tree index.
 Clustering is especially useful for range queries;
can also help on equality queries if there are many
duplicates.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
63
Examples of Clustered Indexes
B+ tree index on E.age can be used to get
qualifying tuples.
 Alternative way is a sorted file on E.age.
 Considering:

 How selective is the condition?
 Is the index clustered?
SELECT E.dno
FROM Emp E
WHERE E.age>40;
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
64
Examples of Clustered Indexes

Consider the GROUP BY query.


If many tuples have E.age > 10, using E.age index
and sorting the retrieved tuples may be costly.
Clustered E.dno index may be better since sorting
is expensive.
SELECT E.dno, COUNT
FROM Emp E
WHERE E.age>10
GROUP BY E.dno;
(*)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
65
Examples of Clustered Indexes
If an index on a search key that does not
include a candidate key, clustering is
important.
 Equality queries and duplicates:


Clustering on E.hobby helps!
SELECT E.dno
FROM Emp E
WHERE E.hobby=“Stamps”;
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
66
Index Selection Guidelines

Composite search keys should be considered
when a WHERE clause contains several
conditions.



Hash index works for equality conditions on every
field.
Tree index works for equality or range condition
on a prefix of the composite search key. So order
of attributes is important for range queries.
Such indexes can sometimes enable index-only
strategies for important queries.
• For index-only strategies, clustering is not important!
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
67
Indexes with Composite Search Keys
Examples of composite key
indexes using lexicographic order.
Equality query:
<sal,age> index:
age=20 and sal =75
11,80
12,20
13,75
<age, sal>
Data entries in index
sorted by search key
to support range
queries.
Lexicographic order,
or
Spatial order.
11
12
12,10
10,12
20,12
75,13
name age sal
bob 12
10
cal
11
80
joe 12
20
sue 13
75
12
13
<age>
10
Data records
sorted by name
80,11
75
80
<sal, age>
Data entries in index
sorted by <sal,age>
20
<sal>
Data entries
sorted by <sal>
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
68
Composite Search Keys
To retrieve Emp records with age=30 AND
sal=4000, an index on <age,sal> would be better
than an index on age or an index on sal.
 If condition is: 20<age<30 AND 3000<sal<5000:



If condition is: age=30 AND 3000<sal<5000:


Clustered tree index on <age,sal> or <sal,age> is best.
Clustered <age,sal> index much better than <sal,age>
index!
Composite indexes are larger, updated more
often.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
69
Index-Only Evaluation
Plan A: sort Employees on E.dno to compute
the count.
 Plan B: if an index on E.dno is available, the
query could be answered by scanning only
the index.

SELECT E.dno, COUNT(*)
FROM Emp E
GROUP BY E.dno;
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
70
Index-Only Plans

A composite B+ tree index on <age,sal> could
answer the query with an index-only scan.
SELECT AVG(E.sal)
FROM Emp E
WHERE E.age=25 AND
E.sal BETWEEN 3000 AND 5000

A composite B+ tree index on <dno, sal> could
answer this query with an index-only scan
SELECT E.dno, MIN(E.sal)
FROM Emp E
GROUP BY E.dno
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
71
Create Index Basic Syntax

Indexes can be defined in two ways
 At the time of table creation
 After table has been created.

Example schemas
 Sailors (sid: integer, sname: string, rating: integer,
age: real)
 Boats (bid: integer, bname: string, color: string)
 Reserves (sid: integer, bid: integer, day: dates)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
72
Example

For Sailor table, we expect lots of searches to
the database on Sailor id, which is the
primary key.
CREATE TABLE Sailors
(sid INTEGER NOT NULL AUTO_INCREMENT,
sname CHAR(30) NOT NULL,
rating INTEGER,
age REAL,
CONSTRAINT StudentsKey PRIMARY KEY (sid) USING BTREE ,
CHECK (rating >=1 AND rating<=10))
*The primary key of the table have already been indexed by MySql.
Its name is PRIMARY KEY. This index is the clustered index.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
73
Example

For Sailor table, we could also create an index
on sname when we create the table.
CREATE TABLE Sailors
(sid INTEGER NOT NULL AUTO_INCREMENT,
sname CHAR(30) NOT NULL,
rating INTEGER,
age REAL,
CONSTRAINT StudentsKey PRIMARY KEY (sid) USING BTREE ,
CHECK (rating >=1 AND rating<=10),
INDEX sname_index (sname) USING HASH )
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
74
Example

Later, we found people search a lot with
sailors’ names and ages.
CREATE TABLE Sailors
(sid INTEGER NOT NULL AUTO_INCREMENT,
sname CHAR(30) NOT NULL,
rating INTEGER,
age REAL,
CONSTRAINT StudentsKey PRIMARY KEY (sid),
CHECK (rating >=1 AND rating<=10));
CREATE INDEX ‘sname_age_index’ ON
Sailors(sname, age) USING BTREE ;
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
75
Index Status

View the indexes defined on a particular table
SHOW INDEX FROM Sailors;

Sometimes, to drop an index to improve
performance.
DROP INDEX sname_index FROM Sailors;
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
76
Summary
Many alternative file organizations exist, each
appropriate in some situation.
 If selection queries are frequent, sorting the
file or building an index is important.




Hash-based indexes only good for equality search.
Sorted files and tree-based indexes best for range
search; also good for equality search. (Files rarely
kept sorted in practice; B+ tree index is better.)
Index is a collection of data entries plus a way
to quickly find entries with given key values.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
77
Summary (Contd.)

Data entries can be actual data records, <key,
rid> pairs, or <key, rid-list> pairs.

Choice orthogonal to indexing technique used to
locate data entries with a given key value.
Can have several indexes on a given file of
data records, each with a different search key.
 Indexes can be classified as clustered vs.
unclustered, primary vs. secondary, and
dense vs. sparse. Differences have important
consequences for utility/performance.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
78