SS ZG515: Data Warehousing
Index Structures
An index is any data structure that takes as input a property of records (typically the
values of one or more fields) and finds the records with that property quickly. An index
lets us find records while looking at only a small fraction of all possible
records. The field(s) on which the index is based are called the search key. For a given data
file, we create an index file consisting of key-pointer pairs.
There are many different data structures that serve as indexes:
1. Primary indexes on sorted files
2. Secondary indexes on unsorted files
3. B-trees, a commonly used index on any file
4. Hash indexes
Indexes can be classified as:
- Single-level
Examples: primary, secondary, clustering
- Multi-level
Examples: ISAM (Indexed Sequential Access Method), B-tree, B+-tree
Indexes can also be classified as:
- Dense
If the index file contains the same number of records as the data file
- Sparse
If the index file contains fewer records than the data file
Single-Level Indexes
1. Primary Indexes
These indexes require a sequential file (i.e. the file should be sorted on the search key
field). When the search key is a key of the relation, the index is called a primary index,
and when the search key is not a key of the relation, the index is called a clustering index.
The following examples make this clear:
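Example 1: Consider the data file sorted on the key field.
Data File (2 records per block)      Index File (block anchor -> block)
Block 1: 10, 20                      10 -> Block 1
Block 2: 30, 40                      30 -> Block 2
Block 3: 50, 60                      50 -> Block 3
Block 4: 70, 80                      70 -> Block 4
Block 5: 90, 100                     90 -> Block 5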
Dr. Navneet Goyal, BITS, Pilani
Points to note:
- A primary index requires that the ordering field of the data file have a distinct value
for each record.
- A primary index is sparse.
- It contains as many entries as there are blocks in the data file (there are 5 blocks in
this example, and each block can hold only 2 records).
- The first record in each block of the data file is called the anchor record of the block,
or simply the block anchor.
- There can be only one primary index on a table.
A dense index on the above data file would have 10 entries, one for each key value, and
record pointers instead of block pointers.
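The lookup procedure implied by Example 1 can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the in-memory lists standing in for disk blocks, and the names `blocks`, `primary_index`, and `lookup`, are assumptions made for the example.

```python
from bisect import bisect_right

# Data file from Example 1: five blocks, two records per block, sorted on the key.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]

# Sparse primary index: one (block anchor, block number) entry per block.
primary_index = [(block[0], block_no) for block_no, block in enumerate(blocks)]


def lookup(key):
    """Return (block_no, position) of the record with this key, or None."""
    anchors = [anchor for anchor, _ in primary_index]
    # The only candidate block is the one whose anchor is the last anchor <= key.
    i = bisect_right(anchors, key) - 1
    if i < 0:
        return None  # key is smaller than every block anchor
    block_no = primary_index[i][1]
    block = blocks[block_no]
    if key in block:
        return block_no, block.index(key)
    return None


print(primary_index)
print(lookup(60))
print(lookup(65))
```

Only the index and a single data block are examined per lookup, which is the point of keeping one entry per block rather than one per record.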
2. Clustering Indexes
If the records of a file are physically ordered on a nonkey field, that field is called the
clustering field. We can create a clustering index to speed up the retrieval of records that
have the same value for the clustering field. This differs from a primary index, which
requires that the ordering field of the data file have a distinct value for each record.
The clustering index file, like the primary index file, has two fields. The first field
contains the distinct values of the clustering field, and the second field contains block
pointers. Each block pointer points to the first block in the data file that has a record with
that value for its clustering field.
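Data File (clustering field values, in physical order):
1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5
Index File (one entry per distinct value, each pointing to the first block that holds it):
1, 2, 3, 4, 5
A clustering index is always non-dense.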
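A clustering index of this kind can be sketched in Python as below. The block size of two records per block, the particular block contents, and the helper names `build_clustering_index` and `fetch` are assumptions made for illustration only.

```python
# Hypothetical data file, physically ordered on a nonkey clustering field.
# Two records per block is an assumption for this sketch.
blocks = [[1, 1], [1, 2], [2, 3], [3, 3], [3, 3], [4, 5]]


def build_clustering_index(blocks):
    """Map each distinct clustering value to the first block that contains it."""
    index = {}
    for block_no, block in enumerate(blocks):
        for value in block:
            index.setdefault(value, block_no)  # keep only the first block seen
    return index


def fetch(value, blocks, index):
    """Return the numbers of all blocks holding records with this value."""
    if value not in index:
        return []
    hits = []
    block_no = index[value]
    # The file is ordered on the clustering field, so matches are contiguous.
    while block_no < len(blocks) and value in blocks[block_no]:
        hits.append(block_no)
        block_no += 1
    return hits


clustering_index = build_clustering_index(blocks)
print(clustering_index)
print(fetch(3, blocks, clustering_index))
```

The index has five entries for twelve records, one per distinct value, which is why it is non-dense.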
3. Secondary Indexes
Secondary indexes do not require the data file to be sorted on the indexed field. They can
be created on both key and nonkey fields.
Secondary Index on Key Field
The key field on which a secondary index is created is often called a secondary key.
There is one index entry for each record in the data file. Each entry contains the value of the
secondary key and a pointer either to the block in which the record is stored or to the
record itself. Such an index is dense.
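[Figure: Secondary index on a key field. The data file holds the key values 1 to 8 in
unsorted order; the index file holds the same eight values in sorted order, each with a
pointer to the corresponding record.]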
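A dense secondary index on a key field can be sketched in Python as follows. The unsorted data file, the (block number, slot) form of the record pointer, and the names `data_file`, `secondary_index`, and `find` are hypothetical choices for this example.

```python
from bisect import bisect_left

# Hypothetical unsorted data file: four blocks, two records per block.
# Each number is the value of the secondary key of one record.
data_file = [[5, 2], [7, 1], [4, 3], [8, 6]]

# Dense secondary index: one (key, record pointer) entry per record,
# sorted on the key. A record pointer here is a (block_no, slot) pair.
secondary_index = sorted(
    (key, (block_no, slot))
    for block_no, block in enumerate(data_file)
    for slot, key in enumerate(block)
)


def find(key):
    """Binary-search the index and return the record pointer, or None."""
    keys = [k for k, _ in secondary_index]
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return secondary_index[i][1]
    return None


print(find(7))
print(find(9))
```

Because the data file itself is not sorted on this field, the index needs an entry for every record; a sparse index would have no way to locate the records between two anchors.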
Secondary Index on Nonkey Field
We can also create a secondary index on a nonkey field of a file. There are two options
for implementing such an index:
- Option 1 is to include several index entries with the same index field value, one for each record. This would be a dense index.
Data File (columns: Emp#, SSN, Name, Dept#, DOB, SALARY; only the Dept# values are shown, in file order):
3, 5, 1, 3, 2, 3, 4, 5, 3
The index file would look like this (one entry per record, sorted on Dept#):
1, 2, 3, 3, 3, 3, 4, 5, 5
- Option 2 is to have variable-length records for the index entries, with a
repeating field for the pointer: one pointer to each block that contains a record
with a matching indexing field value. This would be a non-dense index.
The index file would look like this:
Index field value    Pointers
1                    B1(1)
2                    B2(1)
3                    B3(1), B3(2), B3(3), B3(4)
4                    B4(1)
5                    B5(1)
Here Bi(n) denotes the n-th pointer for indexing field value i, where n runs from 1 to the
number of matching records for that indexing field value.
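The two options can be contrasted with a short Python sketch. The Dept# values are the ones shown in the data file above; the blocking of three records per block, the use of record numbers and block numbers as pointers, and the names `option1` and `option2` are assumptions for illustration.

```python
# Dept# column of the data file, in file order.
dept_column = [3, 5, 1, 3, 2, 3, 4, 5, 3]
RECORDS_PER_BLOCK = 3  # assumed blocking, not stated in the notes

# Option 1: one fixed-length (value, record pointer) entry per record.
# The index is dense and repeats the same field value several times.
option1 = sorted((dept, rec_no) for rec_no, dept in enumerate(dept_column))

# Option 2: one variable-length entry per distinct value, holding a
# repeating list of block pointers (one per block with a matching record).
option2 = {}
for rec_no, dept in enumerate(dept_column):
    block_no = rec_no // RECORDS_PER_BLOCK
    pointers = option2.setdefault(dept, [])
    if block_no not in pointers:
        pointers.append(block_no)

print(len(option1), "entries under Option 1")
print(len(option2), "entries under Option 2")
print(option2[3])
```

Option 1 keeps entries simple at the cost of repetition, while Option 2 keeps the index small but forces the index code to handle entries of varying length.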
Summary of Single-Level Indexes
Types of Indexes
                     Key Field                Nonkey Field
Ordering Field       Primary Index            Clustering Index
Nonordering Field    Secondary Index (key)    Secondary Index (nonkey)
Properties of Index Types
Type of Index        Number of Index Entries                  Dense or Sparse    Block Anchoring
Primary              No. of blocks in data file               Sparse             Yes
Clustering           No. of distinct index field values       Sparse             Yes/no*
Secondary (key)      No. of records in data file              Dense              No
Secondary (nonkey)   No. of records**                         Dense              No
                     No. of distinct index field values***    Sparse
* Yes if every distinct value of the ordering field starts from a new block; no otherwise
** For Option 1
*** For Option 2
Multilevel Indexes
In all the single-level indexes seen so far, the index file is sorted on the search
key. For an index with b_i blocks, a binary search requires approximately log2(b_i) block
accesses, because each step of the algorithm halves the part of the index file that remains
to be searched. The idea behind multilevel indexes is to reduce the
part of the index file that we continue to search by a factor of bfr_i = (block size in
bytes / index record size in bytes), the blocking factor of the index, which in practice is much
greater than 2. Hence the search space shrinks much faster. The value bfr_i is called the fan-out (fo)
of the multilevel index. Searching a multilevel index requires approximately log_fo(b_i) block accesses,
which is a smaller number than for binary search if fo > 2.
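The difference between the two access counts can be checked with a small Python sketch. The function names and the sample figures (1024 index blocks, fan-out 64) are hypothetical; both functions round the logarithm up to a whole number of block accesses.

```python
def binary_search_accesses(bi):
    """ceil(log2(bi)): halvings needed to narrow bi index blocks down to one."""
    return (bi - 1).bit_length()


def multilevel_accesses(bi, fo):
    """ceil(log_fo(bi)): the smallest t such that fo**t >= bi."""
    t, reach = 0, 1
    while reach < bi:
        reach *= fo  # each extra level multiplies the reachable blocks by fo
        t += 1
    return t


bi, fo = 1024, 64  # hypothetical index size and fan-out
print(binary_search_accesses(bi))   # 10
print(multilevel_accesses(bi, fo))  # 2
```

With a fan-out of 2 the two counts coincide, which matches the remark that the multilevel scheme only wins when fo > 2.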
A multilevel index treats the first-level index file as an ordered file with a distinct value
for each entry. Hence we can create a primary index for the first-level index. This primary
index is called the second level of the multilevel index. Because the second-level index is a
primary index, we can use block anchors, so that the second level has one entry for each block
of the first level. All index entries have the same size, so the blocking factor is the same
at every level:
bfr_i (for the second level and all subsequent levels) = bfr_i (for the first-level index)
If the first level has r1 entries and the blocking factor (which is also the fan-out) is fo, then the
first level needs ⌈r1/fo⌉ blocks, which is therefore the number of entries r2 needed at the
second level of the index.
The same process can be repeated for the second level:
r3 = ⌈r2/fo⌉
Note that we need a second-level index only if the first level needs more than one block,
and similarly, we need a third-level index only if the second-level index needs more than
one block. The process of adding index levels continues until all the entries of
some index level t fit in a single block.
- The t-th level is called the top index level.
- Each level reduces the number of entries at the previous level by a factor of fo.
- t is the smallest integer for which ⌈r1/(fo)^t⌉ = 1.
- Hence t = ⌈log_fo(r1)⌉.
The above scheme can be applied to primary, clustering, or secondary first level index as
long as the first level index has distinct values for the index field and fixed length entries.
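The level-by-level construction can be traced with a short Python sketch. The function name `index_levels` and the sample values r1 = 30000 and fo = 68 are assumptions chosen for illustration; they do not come from the notes.

```python
def index_levels(r1, fo):
    """Return (entries, blocks) for each index level, first level first."""
    levels = []
    entries = r1
    while True:
        blocks = (entries + fo - 1) // fo  # ceil(entries / fo) in integers
        levels.append((entries, blocks))
        if blocks == 1:
            return levels  # this level fits in one block: the top level
        entries = blocks   # next level holds one entry per block below it


levels = index_levels(30000, 68)
for level_no, (entries, blocks) in enumerate(levels, start=1):
    print(f"level {level_no}: {entries} entries in {blocks} block(s)")
print("t =", len(levels))
```

For these values the levels hold 30000, 442, and 7 entries, so t = 3, which agrees with rounding log_68(30000) up to the next integer.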
Dynamic Multilevel Indexes: B-Trees & B+-Trees
While one, two, or more levels of index are very useful in speeding up queries, there is a more
general structure that is used in commercial RDBMSs. The general family of data
structures is called the B-tree, and the particular variant that is most often used is known as the
B+-tree. Important characteristics of B-trees are:
- B-trees automatically maintain as many levels of index as is appropriate for
the size of the file being indexed.
- B-trees manage the space on the blocks they use so that every block is
between half used and completely full. No overflow blocks are ever needed
for the index.
The layout of the blocks in a B-tree is determined by a parameter n. Each block will have
space for n search-key values and n+1 pointers. B-tree blocks are similar to conventional
index blocks, except that the B-tree block has an extra pointer. We pick n to be as large as
will allow n+1 pointers and n keys to fit in one block.
Suppose the block size is 4096 bytes, keys are integers of 4 bytes, and pointers are 8 bytes.
If no header information is kept on the blocks, we want to find the largest integer value n
such that 4n + 8(n+1) <= 4096, that is, 12n <= 4088.
This gives n = 340. This means that a block can hold 340 key values and 341 pointers.
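The same arithmetic can be expressed as a small Python helper. The function name `max_keys_per_block` and the optional `header_size` parameter are assumptions added for this sketch; the notes themselves consider only the case with no header.

```python
def max_keys_per_block(block_size, key_size, pointer_size, header_size=0):
    """Largest n with n*key_size + (n+1)*pointer_size <= block_size - header_size."""
    usable = block_size - header_size
    # Rearranged: n * (key_size + pointer_size) <= usable - pointer_size
    return (usable - pointer_size) // (key_size + pointer_size)


n = max_keys_per_block(block_size=4096, key_size=4, pointer_size=8)
print(n)                    # 340
print(4 * n + 8 * (n + 1))  # 4088 bytes used out of 4096
```

Trying n = 341 would need 4100 bytes, which no longer fits in the block.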
Rules for B-trees
- At the root, there are at least two used pointers. All pointers point to B-tree
blocks at the next lower level.
- At a leaf, the last pointer points to the next leaf block to the right, i.e., to the
block with the next higher keys. Among the other n pointers in a leaf, at least
⌊(n+1)/2⌋ are used to point to data records; unused pointers can be thought of
as null and do not point anywhere. The i-th pointer, if it is used, points to a
record with the i-th key.
- At any interior node, all n+1 pointers can be used to point to B-tree blocks
at the next lower level. At least ⌈(n+1)/2⌉ of them are actually used. If j pointers
are used, then there will be j-1 keys, k1, k2, ..., kj-1. The first pointer points to a
part of the B-tree where some of the records with keys less than k1 will be
found. The second pointer goes to that part of the tree where all the records
with keys that are at least k1 but less than k2 will be found, and so on. Finally,
the j-th pointer leads to that part of the B-tree where some of the records with
keys greater than or equal to kj-1 are found.
Note that some of the records with keys far below k1 or far above kj-1 may not
be reachable from this block at all, but will be reached via another block at the
same level.
- The nodes at any level, left to right, contain keys in non-decreasing order.
Keys:       57 | 81 | 95
Pointer 1:  to record with key 57
Pointer 2:  to record with key 81
Pointer 3:  to record with key 95
Pointer 4:  to next leaf in the sequence
Figure: A B-tree leaf
Keys:       57 | 81 | 95
Pointer 1:  to keys K < 57
Pointer 2:  to keys 57 <= K < 81
Pointer 3:  to keys 81 <= K < 95
Pointer 4:  to keys K >= 95
Figure: An interior node of a B-tree
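The choice of pointer at an interior node can be sketched in Python with the standard `bisect` module. The function name `child_slot` is hypothetical, and pointers are numbered from 0 here rather than from 1.

```python
from bisect import bisect_right


def child_slot(keys, search_key):
    """Index (from 0) of the pointer to follow for search_key.

    Slot 0 covers K < keys[0]; slot i covers keys[i-1] <= K < keys[i];
    the last slot covers K >= keys[-1].
    """
    return bisect_right(keys, search_key)


node_keys = [57, 81, 95]
for k in (40, 57, 80, 81, 95, 100):
    print(k, "-> pointer", child_slot(node_keys, k))
```

Using `bisect_right` rather than `bisect_left` sends a search key equal to a node key to the right-hand subtree, which matches the "at least k1, but less than k2" rule.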
Root:            13
Interior nodes:  7  |  23 31 43
Leaves:          2 3 5  |  7 11  |  13 17 19  |  23 29  |  31 37 41  |  43 47
Figure: A B-tree
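A lookup in a tree of this shape can be sketched in Python as below. The dictionary-based node layout and the names `root`, `left`, `right`, `leaves`, and `search` are assumptions for illustration; a real B-tree stores each node in a disk block and leaf entries carry record pointers rather than bare keys.

```python
from bisect import bisect_right

# Leaves, left to right, holding the keys of the example tree.
leaves = [[2, 3, 5], [7, 11], [13, 17, 19], [23, 29], [31, 37, 41], [43, 47]]

# Interior nodes: j-1 keys and j child pointers each.
left = {"keys": [7], "children": [leaves[0], leaves[1]]}
right = {"keys": [23, 31, 43], "children": leaves[2:6]}
root = {"keys": [13], "children": [left, right]}


def search(node, key):
    """Descend from node to a leaf; return (found, list of key lists visited)."""
    path = []
    while isinstance(node, dict):  # interior node
        path.append(node["keys"])
        node = node["children"][bisect_right(node["keys"], key)]
    path.append(node)              # the leaf that was reached
    return key in node, path


print(search(root, 41))
print(search(root, 12))
```

Every search visits exactly three nodes here, one per level, regardless of whether the key is present.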
The above notes have been compiled from the following sources:
1. H. Garcia-Molina, J. D. Ullman, & J. Widom, Database System
Implementation, Pearson Education, 2001.
2. R. Elmasri & S. B. Navathe, Fundamentals of Database Systems, 3e, Addison
Wesley, 2000.