CS5226
Indexing and Index Tuning
When Change is the Only Constant
CPU
Memory Speed and Size
Harddisk Speed and Size
Bandwidth
Moore's Law being proved...
                        1969        2001        Factor
main memory             200 KB      200 MB      10^3
cache                   20 KB       20 MB       10^3
cache pages             20          5000        <10^3
disk size               7.5 MB      20 GB       ~3*10^3
disk/memory size        40          100         ~2.5
transfer rate           150 KB/s    15 MB/s     10^2
random access           50 ms       5 ms        10
scanning full disk      130 s       1300 s      ~10x slower
[Chart: Improvement in Performance, 1980-2000 (log scale) -- CPU ~60%/yr, DRAM ~10%/yr, Disk ~5%/yr]
Indexing
Single dimensional Indexing
Multi-dimensional Indexing
High-dimensional Indexing
Indexing for advanced applications
Single Record and Range Searches
Single record retrievals:
"Find the name of the student whose matric# = 921000Y13"
Range queries:
"Find all students with cap > 3.0"
Sequentially scanning the file is costly.
If the data is in a sorted file, do a binary search to find the first such student, then scan to find the others (see the sketch below).
The cost of the binary search can still be quite high.
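To make the cost comparison concrete, here is a minimal sketch (not from the original slides) of the sorted-file approach, using Python's bisect to locate the first qualifying record and a sequential scan for the rest; the data file is modelled as an in-memory sorted list of cap values.

from bisect import bisect_left

def range_search(sorted_caps, low):
    """Return all values > low from an ascending list (e.g. cap > 3.0)."""
    i = bisect_left(sorted_caps, low)          # binary search: first candidate
    # skip records equal to the boundary, since the predicate is strict (>)
    while i < len(sorted_caps) and sorted_caps[i] <= low:
        i += 1
    return sorted_caps[i:]                     # sequential scan from the first match

print(range_search([2.1, 2.8, 3.0, 3.4, 4.2, 4.8], 3.0))   # [3.4, 4.2, 4.8]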
Indexes
An index on a file speeds up selections on the
search key fields for the index.
Any subset of the fields of a relation can be the
search key for an index on the relation.
Search key is not the same as key (minimal set of
fields that uniquely identify a record in a relation).
e.g., consider Student(matric#, name, addr, cap),
the key is matric#, but the search key can be
matric#, name, addr, cap or any combination of
them.
Simple Index File (Data File Sorted)
Dense Index
[Figure: dense index with one entry per record (10, 20, 30, ...) pointing into the sorted sequential data file]
Simple Index File (Cont)
Sparse Index
[Figure: sparse index with one entry per data page (10, 30, 50, ...) over the sorted sequential data file]
Simple Index File (Cont)
[Figure: two-level sparse index -- a sparse 2nd-level index (10, 90, 170, ...) over the 1st-level sparse index (10, 30, 50, ...), which points into the sorted sequential data file]
Secondary indexes
Sparse index on a non-sequence field does not make sense!
[Figure: data file stored in sequence-field order; a sparse secondary index on a different field cannot locate the records it skips]
Secondary indexes (Cont)
Dense index at the lowest level; higher levels can be sparse.
[Figure: dense secondary index (10, 20, 30, ...) on a non-sequence field, with a sparse higher-level index (10, 50, 90, ...) on top]
Conventional indexes
Advantages:
- Simple
- The index is a sequential file, good for scans
Disadvantages:
- Inserts are expensive, and/or
- The file loses sequentiality & balance
Example
[Figure: a sequential index file with continuous entries 10-90 plus free space; inserted keys (31-39) spill into an overflow area that is not sequential]
Tree-Structured Indexing
Tree-structured indexing techniques support both range searches and equality searches.
The index file may still be quite large, but we can apply the idea repeatedly, indexing the index until the top level fits in one page.
[Figure: multi-level index over the data pages]
B+ Tree: The Most Widely Used Index
Height-balanced.
Insert/delete at log_F N cost (F = fanout, N = # leaf pages).
Grows and shrinks dynamically.
Minimum 50% occupancy (except for the root).
Each node contains d <= m <= 2d entries; the parameter d is called the order of the tree.
A `next-leaf' pointer chains up the leaf nodes.
Data entries at the leaves are sorted.
Example B+ Tree
Each node can hold 4 entries (order = 2).
[Figure: root [17]; internal nodes [5 13] and [24 30]; leaves [2 3], [5 7 8], [14 16], [19 20 22], [24 27 29], [33 34 38 39]]
Node structure
Non-leaf nodes: an array of index entries <P0, K1, P1, K2, P2, ..., Km, Pm>, where pointer Pi leads to keys K with Ki <= K < Ki+1.
Leaf nodes: sorted data entries <K1, ..., Km> with their record pointers, plus a `next leaf node' pointer.
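As an illustration only (not part of the original slides), the two node layouts can be sketched as Python classes; the search and split sketches later in this section assume these classes.

class InternalNode:
    def __init__(self, keys, children):
        # invariant: len(children) == len(keys) + 1, i.e. P0..Pm around K1..Km
        self.keys = keys            # [K1, ..., Km], sorted
        self.children = children    # [P0, ..., Pm]
        self.is_leaf = False

class LeafNode:
    def __init__(self, entries, next_leaf=None):
        self.entries = entries      # sorted list of (key, record/rid) data entries
        self.next = next_leaf       # `next-leaf' pointer chaining the leaves
        self.is_leaf = True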
Searching in a B+ Tree
Search begins at the root, and key comparisons direct it to a leaf (as in ISAM).
Search for 5, 15, all data entries >= 24 ...
[Figure: root [13 17 24 30]; leaves [2 3 5], [14 16], [19 20 22], [24 27 29], [33 34 38 39]]
Based on the search for 15*, we know it is not in the tree!
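A minimal search sketch, assuming the InternalNode/LeafNode classes sketched above and the convention that child Pi holds keys K with Ki <= K < Ki+1:

from bisect import bisect_right

def search(node, key):
    """Descend from the root to the leaf that would contain `key` (as in ISAM)."""
    while not node.is_leaf:
        # follow P_i where K_i <= key < K_{i+1}
        i = bisect_right(node.keys, key)
        node = node.children[i]
    # node is now a leaf; return the matching data entries (may be empty)
    return [entry for entry in node.entries if entry[0] == key]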
B+-Tree Scalability
Typical order: 100. Typical fill-factor: 67%, so average fanout = 133.
Typical capacities (root at Level 1, holding 133 entries):
Level 5: 133^4 = 312,900,721 records
Level 4: 133^3 = 2,352,637 records
Can often hold the top levels in the buffer pool:
Level 1 = 1 page = 8 KB
Level 2 = 133 pages = 1 MB
Level 3 = 17,689 pages = 133 MB
A Note on `Order'
In practice, the order (d) concept is replaced by a physical space criterion (`at least half-full'):
Index pages can typically hold many more entries than leaf pages.
Variable-sized records and search keys mean different nodes will contain different numbers of entries.
Even with fixed-length fields, multiple records with the same search key value (duplicates) can lead to variable-sized data entries.
Inserting a Data Entry into a B+ Tree
Find the correct leaf L.
Put the data entry onto L.
If L has enough space, done!
Else, must split L (into L and a new node L2):
Redistribute entries evenly, copy up the middle key.
Insert an index entry pointing to L2 into the parent of L.
This can happen recursively.
To split an index node, redistribute entries evenly, but push up the middle key. (Contrast with leaf splits; see the sketch below.)
Splits "grow" the tree; a root split increases the height.
Tree growth: gets wider or one level taller at the top.
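The copy-up versus push-up distinction can be sketched as follows (illustrative only; it assumes the node classes above, that d is the order, and that the node passed in has just overflowed):

def split_leaf(leaf):
    """Leaf split: keep the left half, move the rest to a new leaf, and COPY the
    middle key up -- it still appears in the new leaf."""
    mid = len(leaf.entries) // 2
    new_leaf = LeafNode(leaf.entries[mid:], next_leaf=leaf.next)
    leaf.entries = leaf.entries[:mid]
    leaf.next = new_leaf
    copy_up_key = new_leaf.entries[0][0]      # first key of the new leaf
    return copy_up_key, new_leaf

def split_internal(node):
    """Index-node split: the middle key is PUSHED up -- it moves to the parent
    and does not remain in either half."""
    mid = len(node.keys) // 2
    push_up_key = node.keys[mid]
    new_node = InternalNode(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return push_up_key, new_node

With the example below, the overflowing leaf [2 3 5 7 8] splits into [2 3] and [5 7 8] with 5 copied up, while the overflowing index node [5 13 17 24 30] splits into [5 13] and [24 30] with 17 pushed up.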
Inserting 7 & 8 into the Example B+ Tree
[Figure: before the insert -- root [13 17 24 30]; leaves [2 3 5], [14 16], [19 20 22], [24 27 29], [33 34 38 39]. Inserting 7 fills the first leaf; inserting 8 overflows it, splitting it into [2 3] and [5 7 8] with 5 copied up, so the index node holds [5 13 17 24 30] and must itself split.]
(Note that 5 is copied up and continues to appear in the leaf.)
Observe how minimum occupancy is guaranteed in both leaf and index page splits.
Insertion (Cont)
Note the difference between copy-up and push-up; be sure you understand the reasons for this.
(Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.)
[Figure: the index node [5 13 17 24 30] splits into [5 13] and [24 30], with 17 pushed up into a new root]
Example B+ Tree After Inserting 8
[Figure: root [17]; internal nodes [5 13] and [24 30]; leaves [2 3], [5 7 8], [14 16], [19 20 22], [24 27 29], [33 34 38 39]]
Notice that the root was split, leading to an increase in height.
In this example, we can avoid splitting by re-distributing entries; however, this is usually not done in practice. Why?
Deleting a Data Entry from a B+ Tree
Start at the root, find the leaf L where the entry belongs.
Remove the entry.
If L is at least half-full, done!
If L has only d-1 entries:
Try to re-distribute, borrowing from a sibling (an adjacent node with the same parent as L).
If re-distribution fails, merge L and the sibling (see the sketch below).
If a merge occurred, must delete the entry (pointing to L or the sibling) from the parent of L.
Merges could propagate to the root, decreasing the height.
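A sketch of the leaf-underflow handling (illustrative only; it assumes the node classes above and that the sibling passed in is L's right sibling under the same parent):

def handle_leaf_underflow(leaf, sibling, parent, d):
    """Called after a delete leaves `leaf` with d-1 entries."""
    if len(sibling.entries) > d:
        # re-distribute: borrow the sibling's first entry, then fix the
        # separating key in the parent (the new boundary key is copied up)
        leaf.entries.append(sibling.entries.pop(0))
        parent.keys[parent.children.index(sibling) - 1] = sibling.entries[0][0]
    else:
        # merge: the sibling has exactly d entries, so leaf + sibling fit in one
        # node; the separating entry is deleted from the parent (may propagate up)
        leaf.entries.extend(sibling.entries)
        leaf.next = sibling.next
        idx = parent.children.index(sibling)
        del parent.children[idx]
        del parent.keys[idx - 1]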
Example Tree After (Inserting 8, Then) Deleting 19
Deleting 19 is easy.
[Figure: root [17]; internal nodes [5 13] and [24 30]; leaves [2 3], [5 7 8], [14 16], [20 22], [24 27 29], [33 34 38 39]]
Example Tree After Deleting 20 ...
[Figure: root [17]; internal nodes [5 13] and [27 30]; leaves [2 3], [5 7 8], [14 16], [22 24], [27 29], [33 34 38 39]]
Deleting 20 is done with re-distribution.
Notice how the middle key (27) is copied up.
... And Then Deleting 24
Must merge.
Observe the `toss' of the index entry (on the right), and the `pull down' of the index entry (below).
[Figure: the leaves merge into [22 27 29] and the index entry 27 is tossed; the underfull index node then merges with its sibling, pulling 17 down from the root. Result: root [5 13 17 30] over leaves [2 3], [5 7 8], [14 16], [22 27 29], [33 34 38 39]]
Example of Non-leaf Re-distribution (Delete 24)
In contrast to the previous example, we can re-distribute an entry from the left child of the root to the right child.
[Figure: intermediate tree during the deletion of 24 -- root [22]; left internal node [5 13 17 20] over leaves [2 3], [5 7 8], [14 16], [17 18], [20 21]; right internal node [30] over leaves [22 27 29], [33 34 38 39]. The right internal node is below minimum occupancy.]
After Re-distribution
Intuitively, entries are re-distributed by `pushing through' the splitting entry in the parent node.
It suffices to re-distribute the index entry with key 20; we've re-distributed 17 as well for illustration.
[Figure: root [17]; left internal node [5 13] over leaves [2 3], [5 7 8], [14 16]; right internal node [20 22 30] over leaves [17 18], [20 21], [22 27 29], [33 34 38 39]]
Index Classification
Primary vs. secondary: if the search key contains the primary key, the index is called a primary index.
Unique index: the search key contains a candidate key.
Clustered vs. unclustered: if the order of data records is the same as, or `close to', the order of data entries, the index is called a clustered index.
A file can be clustered on at most one search key.
The cost of retrieving data records through an index varies greatly based on whether the index is clustered or not!
Clustered vs. Unclustered Index
Suppose the data file is unsorted.
To build a clustered index, first sort the data file (with some free space on each page for future inserts).
Overflow pages may be needed for inserts. (Thus, the order of data records is `close to', but not identical to, the sort order.)
[Figure: CLUSTERED -- index entries direct the search to data entries whose order matches the order of the data records; UNCLUSTERED -- the data entries in the index file point to data records in arbitrary positions in the data file]
Index Classification (Cont.)
Dense vs. sparse: if there is at least one data entry per search key value (in some data record), then the index is dense.
Every sparse index is clustered!
Sparse indexes are smaller.
[Figure: a sparse index on Name and a dense index on Age over the data file with records (Ashby, 25, 3000), (Basu, 33, 4003), (Bristow, 30, 2007), (Cass, 50, 5004), (Daniels, 22, 6003), (Jones, 40, 6003), (Smith, 44, 3000), (Tracy, 44, 5004)]
Index Classification (Cont.)
Composite search keys: search on a combination of fields.
Equality query: every field value is equal to a constant value, e.g., wrt a <sal, age> index: age=20 & sal=75.
Range query: some field value is not a constant, e.g., age=20; or age=20 & sal>10.
Data entries in the index are sorted by the search key to support range queries:
Lexicographic order, or
Spatial order.
Examples of composite key indexes using lexicographic order (see the sketch below):
[Figure: data records sorted by name -- bob(age 12, sal 10), cal(11, 80), joe(12, 20), sue(13, 75); data entries sorted by <age, sal>: (11,80), (12,10), (12,20), (13,75); by <sal, age>: (10,12), (20,12), (75,13), (80,11); by <age>: 11, 12, 12, 13; by <sal>: 10, 20, 75, 80]
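The lexicographic order used for composite data entries is ordinary tuple ordering; the records below are the ones from the figure, and sorting by <age, sal> clusters all entries with age = 20 together, so a query on a prefix of the key is answered by one contiguous scan.

records = [("bob", 12, 10), ("cal", 11, 80), ("joe", 12, 20), ("sue", 13, 75)]
by_age_sal = sorted(records, key=lambda r: (r[1], r[2]))   # <age, sal> data entries
by_sal_age = sorted(records, key=lambda r: (r[2], r[1]))   # <sal, age> data entries
print([(r[1], r[2]) for r in by_age_sal])   # [(11, 80), (12, 10), (12, 20), (13, 75)]
print([(r[2], r[1]) for r in by_sal_age])   # [(10, 12), (20, 12), (75, 13), (80, 11)]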
Summary
Tree-structured indexes are ideal for range searches, and also good for equality searches.
The B+ tree is a dynamic structure.
Inserts/deletes leave the tree height-balanced; log_F N cost.
High fanout (F) means the depth is rarely more than 3 or 4.
Almost always better than maintaining a sorted file.
Summary (Cont.)
Typically, 67% occupancy on average.
Usually preferable to ISAM, modulo locking considerations; adjusts to growth gracefully.
If data entries are data records, splits can change rids!
Indexes can be classified as clustered vs. unclustered, primary vs. secondary, dense vs. sparse, and simple vs. composite.
New Database Challenges
More complex applications, e.g., GIS, OLAP, mobile.
Hardware advances: big and small.
Web and Internet, e.g., XML.
New Index?
A most effective mechanism to prune the search:
Orders of magnitude of difference between I/O and CPU cost
Increasing data size
Increasing complexity of data and search
Something that Transcends Time... the B+-tree
Success factors:
Robustness
Concurrency
Performance
Scalability
Fundamentals of building a DBMS
B+-trees forever?
Can the B+-tree, being a single-dimensional index, be used for emerging applications such as:
Spatial databases
High-dimensional databases
Temporal databases
Main memory databases
String databases
Genomic/sequence databases
...
CS5226
Multi-Dimensional Indexing
What is a Spatial Database?
A Spatial DBMS is a DBMS.
It offers spatial data types, data models, and a query language:
Support for spatial properties/operations.
It supports spatial data types in its implementation:
Support for spatial indexing, and algorithms for spatial selection and join based on spatial relationships.
Applications
Geographical Information Systems (GIS): dealing extensively with spatial data, e.g., map systems, resource management systems.
Computer-aided design and manufacturing (CAD/CAM): dealing mainly with surface data, e.g., design systems.
Multimedia databases: storing and manipulating characteristics of multimedia objects.
Spatial Data
Examples of non-spatial data:
Names, zip codes, ...
Examples of spatial data:
Census data
NASA satellite imagery
Weather and climate data
Rivers, farms, ecological impact
Medical imaging
Spatial Databases
Spatial objects:
Points: a spatial location, e.g., feature vectors
Lines: a set of points, e.g., roads, coastal lines
Polygons: a set of points, e.g., buildings, lakes
Data types:
Point: a spatial data object with no extension (no size or volume)
Region: a spatial object with a location and a boundary that defines the extension
Spatial Relationships
Topological relationships: adjacent, within/contain, intersect, disjoint, etc.
Direction relationships: above, below, north-of, south-of, etc.
Metric relationships: "distance < 100 km"
And operations to express these relationships.
Spatial Queries
Range queries: "Find all cities within 50 km of Madras."
Nearest neighbor queries: "Find the 5 cities that are nearest to Madras." "Find the 10 images most similar to this image."
Spatial join queries: "Find pairs of cities within 200 km of each other."
More Examples
Window range query: "Find the data points that satisfy x1 < A1 < y1, x2 < A2 < y2, ..."
Spatial query: "Find the buildings that are adjacent to the railway stations."
Nearest neighbour query: "Find the nearest fire station to Clementi Ave. 3."
Spatial Representation
Raster model
Vector model
Representation of Spatial Objects
Testing on real objects is expensive.
Minimum Bounding Box/Rectangle (MBR).
How to test if two 2-d rectangles intersect? (See the sketch below.)
[Figure: an object, its MBR representation spanning (x1, y1)-(x2, y2), and the intersection test between rectangles]
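One standard answer to the question above (a sketch, not taken from the slides): two axis-aligned MBRs intersect iff their projections overlap on both the x- and the y-axis.

def mbr_intersect(a, b):
    """a, b: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

print(mbr_intersect((0, 0, 2, 2), (1, 1, 3, 3)))   # True
print(mbr_intersect((0, 0, 1, 1), (2, 2, 3, 3)))   # False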
Query Operation & Spatial Index
Filter step:
Select the objects whose mbr satisfies the spatial predicate.
Traverse the index and apply the spatial test on the mbrs indexed by the index.
Output: a set of oids (which may include false hits).
Refinement step:
The spatial test is done on the actual geometries of the objects whose mbr satisfied the filter step (its output).
A costly operation, executed only on a limited number of objects. (A sketch of the two steps follows.)
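A sketch of the two steps, with the index reduced to a dictionary of oid -> MBR for brevity (a real SAM would traverse a tree instead), the exact geometric predicate passed in as a function, and mbr_intersect being the helper defined above:

def spatial_select(mbrs, geometries, query_mbr, exact_test):
    # Filter step: cheap test on MBRs; the output may contain false hits
    candidates = [oid for oid, mbr in mbrs.items() if mbr_intersect(mbr, query_mbr)]
    # Refinement step: costly test on the actual geometries, only for candidates
    return [oid for oid in candidates if exact_test(geometries[oid], query_mbr)]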
Why Spatial Access Methods (SAMs)?
B-trees & hash tables:
Guarantee that the number of I/O operations is respectively logarithmic and constant with respect to the collection's size.
Index a collection on a key.
Rely on a total order on the key domain, e.g., the order of the natural numbers or the lexicographic order on strings.
There is no such total order for geometric objects with spatial extent.
SAMs were designed to preserve spatial object proximity as much as possible.
Approaches to the Design of SAMs
Space-based structures:
Partition the embedding space into rectangular cells.
Independent of the distribution of the objects.
Objects are mapped to the cells based on some geometric criterion.
E.g., Grid file, Buddy-tree, KDB-tree.
Data-based structures:
Organize by partitioning the set of objects (as opposed to the embedding space) based on spatial proximity, such that each group fits in a page.
Adapt to the objects' distribution in the embedding space.
E.g., R-tree, R*-tree, R+-tree.
Mapping
The R-tree
A leaf entry is a pair (mbr, oid).
A non-leaf node contains an array of node entries.
The number of entries is between m (the fill-factor) and M.
For each entry (mbr, nodeid) in a non-leaf node N, mbr is the directory rectangle of a child node of N, whose page address is nodeid.
All leaves are at the same level.
An object appears in one, and only one, of the tree leaves.
R-trees
Height-balanced tree.
Problem: overlap of covering rectangles.
[Figure: two covering rectangles A and B whose regions overlap]
Insertion in the R-Tree
Algorithm ChooseSubtree
CS1 [Initialize] Set N to be the root node.
CS2 [Leaf check] If N is a leaf, return N; else [Choose subtree] choose the entry in N whose rectangle needs the least area enlargement to include the new data. Resolve ties by choosing the entry with the rectangle of smallest area.
CS3 [Descend until a leaf is reached] Set N to be the child node pointed to by the child pointer of the chosen entry. Repeat from CS2. (A sketch follows.)
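A sketch of ChooseSubtree, assuming nodes expose is_leaf and a list of (mbr, child) entries, with MBRs as (x1, y1, x2, y2) tuples; area and enlarge are small helpers introduced here, not part of the slides.

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def enlarge(r, new):
    """MBR of r extended to also cover new."""
    return (min(r[0], new[0]), min(r[1], new[1]),
            max(r[2], new[2]), max(r[3], new[3]))

def choose_subtree(root, new_mbr):
    node = root                                        # CS1
    while not node.is_leaf:                            # CS2 / CS3
        def cost(entry):
            mbr, _child = entry
            # least area enlargement, ties broken by smallest area
            return (area(enlarge(mbr, new_mbr)) - area(mbr), area(mbr))
        _mbr, node = min(node.entries, key=cost)
    return node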
Splitting Strategies in the R-Tree
Three versions, all designed to minimize the area covered by the two covering rectangles resulting from the split:
Exponential: finds the global minimum; CPU cost is too high.
Quadratic and Linear: find approximations.
Quadratic performs much better than Linear.
Splitting Strategies in the R-Tree
Algorithm QuadraticSplit [Divide a set of M+1 index entries into two groups]
QS1 [Pick first entry for each group] Invoke PickSeeds to choose two entries, each to be the first entry of one group.
QS2 [Check if done] Repeat DistributeEntry until all entries are distributed or one of the two groups has M-m+1 entries (so that the other group gets m entries).
QS3 [Select entries to assign] If entries remain, assign them all to the other group so that it has the minimum number m required.
Splitting in the R-Tree
Algorithm PickSeeds [Choose two entries to be the first entries of the groups]
PS1 [Calculate inefficiency of grouping entries together] For each pair of entries E1 and E2, compose a rectangle R including E1's rectangle and E2's rectangle; calculate d = area(R) - area(E1's rectangle) - area(E2's rectangle).
PS2 [Choose the most wasteful pair] Choose the pair with the largest d.
(The seeds will tend to be small if the rectangles are of very different sizes and/or the overlap between them is high.)
Splitting in the R-Tree
Algorithm DistributeEntry [Assign the remaining entries by the criterion of minimum area]
DE1 Invoke PickNext to choose the next entry to be assigned.
DE2 Add it to the group whose covering rectangle will have to be enlarged least to accommodate it. Resolve ties by adding the entry to the group with the smallest area, then to the one with the fewer entries, then to either.
Algorithm PickNext [Chooses the entry with the best area-goodness-value in every situation]
PN1 For each entry E not yet in a group, calculate d1 = the area increase required in the covering rectangle of Group 1 to include E's rectangle; calculate d2 analogously for Group 2.
PN2 Choose the entry with the maximum difference between d1 and d2.
(A combined sketch of PickSeeds, PickNext, and DistributeEntry follows.)
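A compact sketch combining PickSeeds, PickNext, and DistributeEntry (illustrative, not Guttman's exact pseudocode); it reuses the area and enlarge helpers from the ChooseSubtree sketch, takes the M+1 overflowing entries as (mbr, payload) pairs, and returns the two groups.

from itertools import combinations

def quadratic_split(entries, m):
    """entries: list of (mbr, payload); m: minimum fill per group."""
    # QS1 / PickSeeds: the pair that would waste the most area if grouped together
    def waste(e1, e2):
        r = enlarge(e1[0], e2[0])
        return area(r) - area(e1[0]) - area(e2[0])
    s1, s2 = max(combinations(entries, 2), key=lambda pair: waste(*pair))
    groups = [[s1], [s2]]
    covers = [s1[0], s2[0]]
    rest = [e for e in entries if e is not s1 and e is not s2]

    while rest:
        # QS2 stopping rule: if a group needs all remaining entries to reach m
        # (i.e. the other group already has M-m+1), hand them over and stop
        for g in (0, 1):
            if len(groups[g]) + len(rest) == m:
                groups[g].extend(rest)
                return groups
        # PickNext: the entry with the strongest preference (largest |d1 - d2|)
        def preference(e):
            d1 = area(enlarge(covers[0], e[0])) - area(covers[0])
            d2 = area(enlarge(covers[1], e[0])) - area(covers[1])
            return abs(d1 - d2)
        e = max(rest, key=preference)
        rest.remove(e)
        # DistributeEntry: least enlargement, ties by smaller area, then fewer entries
        d = [area(enlarge(covers[g], e[0])) - area(covers[g]) for g in (0, 1)]
        g = min((0, 1), key=lambda g: (d[g], area(covers[g]), len(groups[g])))
        groups[g].append(e)
        covers[g] = enlarge(covers[g], e[0])
    return groups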
Node Splitting in R-trees
[Figures: node-splitting examples]
R-trees
Range query
Insert
Node splitting
Optimization: coverage, overlap
Delete
Variants: R+-tree, R*-tree, Buddy-tree
The R*-Tree
A variant of the R-tree with several improvements to the insertion algorithm.
Aims at optimizing:
Node overlap
Area covered by a node
Perimeter of a node's directory rectangle (given a fixed area, the shape that minimizes the rectangle's perimeter is the square)
Two variants bring the most significant improvement:
Split algorithm
Forced reinsertion strategy
The R+ Tree
The directory rectangles at a given level do not overlap.
For a point query, a single path is followed from the root to a leaf; for a region query, the subtrees whose covering mbrs intersect the query region are traversed.
The I/O complexity is bounded by the depth of the tree.
Dead space problem.
Index Tuning
Index issues
Indexes may be better or worse than scans
Multi-table joins that run on for hours,
because the wrong indexes are defined
Concurrency control bottlenecks
Indexes that are maintained and never used
Information about indexes...
Application code
V$SQLAREA -- look for the statements with a high number of executions
INDEX_STATS: meta-information about indexes
HASH_AREA_SIZE
HASH_MULTIBLOCK_IO_COUNT
... homework
Clustered / Non-clustered Index
Clustered index (primary index): a clustered index on attribute X co-locates records whose X values are near to one another.
Non-clustered index (secondary index): a non-clustered index does not constrain the table organization. There might be several non-clustered indexes per table.
[Figure: record placement under a clustered vs. a non-clustered index]
Dense / Sparse Index
Sparse index: pointers are associated with pages.
Dense index: pointers are associated with records.
Non-clustered indexes are dense.
[Figure: a sparse index pointing to pages P1, P2, ..., Pi vs. a dense index pointing to individual records]
Index Implementations in some major DBMS
SQL Server: B+-tree data structure; clustered indexes are sparse; indexes are maintained as updates/insertions/deletes are performed.
DB2: B+-tree data structure, spatial extender for the R-tree; clustered indexes are dense; explicit command for index reorganization.
Oracle: B+-tree, hash, bitmap, spatial extender for the R-tree; clustered index; index-organized tables (unique/clustered); clusters used when creating tables.
TimesTen (main-memory DBMS): T-tree.
Types of Queries
Point query:
SELECT balance
FROM accounts
WHERE number = 1023;
Multipoint query:
SELECT balance
FROM accounts
WHERE branchnum = 100;
Range query:
SELECT number
FROM accounts
WHERE balance > 10000
AND balance <= 20000;
Prefix match query:
SELECT *
FROM employees
WHERE name LIKE 'J%';
More Types of Queries
Extremal query:
SELECT *
FROM accounts
WHERE balance = (SELECT max(balance) FROM accounts);
Ordering query:
SELECT *
FROM accounts
ORDER BY balance;
Grouping query:
SELECT branchnum, avg(balance)
FROM accounts
GROUP BY branchnum;
Join query:
SELECT DISTINCT branch.adresse
FROM accounts, branch
WHERE accounts.branchnum = branch.number
AND accounts.balance > 10000;
Index Tuning -- data
Settings:
employees(ssnum, name, lat, long, hundreds1, hundreds2);
clustered index c on employees(hundreds1) with fillfactor = 100;
nonclustered index nc on employees(hundreds2);
index nc3 on employees(ssnum, name, hundreds2);
index nc4 on employees(lat, ssnum, name);
1,000,000 rows; cold buffer.
Dual Xeon (550 MHz, 512 KB), 1 GB RAM, internal RAID controller from Adaptec (80 MB), 4x18 GB drives (10000 RPM), Windows 2000.
Index Tuning -- operations
Operations:
Update:
update employees set name = 'XXX' where ssnum = ?;
Insert:
insert into employees values (1003505, 'polo94064', 97.48, 84.03, 4700.55, 3987.2);
Multipoint query:
select * from employees where hundreds1 = ?;
select * from employees where hundreds2 = ?;
Covered query:
select ssnum, name, lat from employees;
Range query:
select * from employees where long between ? and ?;
Point query:
select * from employees where ssnum = ?;
Clustered Index
[Chart: throughput ratio of a multipoint query with a clustered index, a non-clustered index, and no index, on SQL Server, Oracle, and DB2]
Multipoint query that returns 100 records out of 1,000,000. Cold buffer.
A clustered index is twice as fast as a non-clustered index and orders of magnitude faster than a scan.
Positive Points of Clustering Indexes
If the index is sparse, it has fewer entries -- fewer I/Os.
Good for multipoint queries, e.g., looking up names in a telephone directory.
Good for equijoins. Why?
Good for range, prefix match, and ordering queries.
Index "Face Lifts"
[Chart: SQL Server throughput (queries/sec) vs. % increase in table size, with and without index maintenance]
The index is created with fillfactor = 100.
Insertions cause page splits and extra I/O for each query.
Maintenance consists of dropping and recreating the index.
With maintenance, performance stays constant, while performance degrades significantly if no maintenance is performed.
Index Maintenance
[Chart: Oracle throughput (queries/sec) vs. % increase in table size, without maintenance]
In Oracle, clustered indexes are approximated by an index defined on a clustered table.
No automatic physical reorganization.
Index defined with pctfree = 0.
Overflow pages cause performance degradation.
Covering Index - defined
Select name from employee where department = 'marketing';
A good covering index would be on (department, name).
An index on (name, department) is less useful.
An index on department alone is moderately useful.
Covering Index - impact
[Chart: SQL Server throughput (queries/sec) for covering, covering not ordered, non-clustering, and clustering indexes]
A covering index performs better than a clustering index when the first attributes of the index are in the where clause and the last attributes are in the select.
When the attributes are not in order, performance is much worse.
Positive/negative Points of Non-clustering Indexes
Eliminate the need to access the underlying table, e.g., with an index on (A, B, C):
Select B, C From R Where A = 5;
Good if each query retrieves significantly fewer records than there are pages in the database.
May not be good for multipoint queries.
Examples (worked through in the sketch below):
Table T has 50-byte records and attribute A has 20 different values which are uniformly distributed. Page size = 4 KB. Is a non-clustering index on A any good?
Now the record size is 2 KB.
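A back-of-the-envelope check for the two examples above, under the usual assumptions that each record fetched through a non-clustering index costs about one random page I/O and that the values of A are uniformly distributed (the row count of 1,000,000 is arbitrary; only the ratio matters):

def index_vs_scan(n_records, record_bytes, distinct_values, page_bytes=4096):
    records_per_page = page_bytes // record_bytes
    scan_pages = n_records / records_per_page      # full-scan cost in page I/Os
    matches = n_records / distinct_values          # records selected for one value
    index_reads = matches                          # ~1 page I/O per matching record
    return scan_pages, index_reads

print(index_vs_scan(1_000_000, 50, 20))    # ~12,346 scan pages vs ~50,000 index reads -> scan wins
print(index_vs_scan(1_000_000, 2048, 20))  # ~500,000 scan pages vs ~50,000 index reads -> index wins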
Scan Can Sometimes Win
[Chart: throughput (queries/sec) of a scan vs. a non-clustering index as the % of selected records grows from 0 to 25]
IBM DB2 v7.1 on Windows 2000; range query.
If a query retrieves 10% of the records or more, scanning is often better than using a non-clustering, non-covering index.
The crossover is above 10% when records are large or the table is fragmented on disk -- the scan cost increases.
Index on Small Tables
[Chart: throughput (updates/sec) with no index vs. with an index]
Small table: 100 records, i.e., a few pages.
Two concurrent processes perform updates (each process works for 10 ms before it commits).
No index: the table is scanned for each update; no concurrent updates.
A clustered index allows us to take advantage of row locking.
Bitmap vs. Hash vs. B+-Tree
Settings:
employees(ssnum, name, lat, long, hundreds1, hundreds2);
create cluster c_hundreds (hundreds2 number(8)) PCTFREE 0;
create cluster c_ssnum (ssnum integer) PCTFREE 0 SIZE 60;
create cluster c_hundreds (hundreds2 number(8)) PCTFREE 0 HASHKEYS 1000 SIZE 600;
create cluster c_ssnum (ssnum integer) PCTFREE 0 HASHKEYS 1000000 SIZE 60;
create bitmap index b on employees(hundreds2);
create bitmap index b2 on employees(ssnum);
1,000,000 rows; cold buffer.
Dual Xeon (550 MHz, 512 KB), 1 GB RAM, internal RAID controller from Adaptec (80 MB), 4x18 GB drives (10000 RPM), Windows 2000.
Multipoint Query: B-Tree, Hash, Bitmap
[Chart: throughput (queries/sec) of multipoint queries for a B-tree, a hash index, and a bitmap index]
There is an overflow chain in the hash index.
In a clustered B-tree index, records are on contiguous pages.
A bitmap is proportional to the size of the table and non-clustered for record access.
B-Tree, Hash, Bitmap
Range queries:
[Chart: throughput (queries/sec) of range queries for a B-tree, a hash index, and a bitmap index]
Hash indexes don't help when evaluating range queries.
Point queries:
[Chart: throughput (queries/sec) of point queries for a B-tree and a hash index]
A hash index outperforms a B-tree on point queries.
Summary
Indexes are the primary means of reducing search costs (I/O and CPU).
Desirable properties: robust, concurrent, scalable, efficient.
Most commonly supported indexes: hash, B+-tree, bitmap, and R-tree.
Tuning: usage, maintenance, drop/rebuild, index locking in the buffer...