CSE2132
Database Systems
Week 10 Lecture
Physical Database Design - File Structures
Physical DB Design 10. 1
Data Structures - What will we cover?
• Underlying data structures
– File organizations
– Access modes
– Binary trees
– B+ trees
• Oracle data structures
Underlying Data Structures
• Data structures are the bricks and mortar that hold databases together.
• In the ANSI/SPARC architecture, data structures are defined at the internal level and implemented in the physical data organization.
• Data structures are often hidden from the application programmer, since they are used primarily by the DBMS and the operating system.
• A good understanding and choice of data structures matters for machine performance, improves program design, and makes it easier to communicate with DBMS specialists.
File Organization
• A file organization is a technique for physically arranging the records of a file on a secondary storage device.
File organizations
– Sequential
– Indexed
   – Sequential (block index)
      – Hardware-dependent (ISAM)
      – Hardware-independent (VSAM)
   – Non-sequential (full index)
– Direct
   – Relative-addressed
   – Hash-addressed
Record Access Modes
• Sequential Access
In sequential access, processing starts at a designated point, usually the beginning, and proceeds in a linear sequence through the file. A record can only be retrieved after accessing all the records that physically precede it.
• Random Access
In random access, a given record is accessed directly, without referencing other records in the file.
File Organization and Access Mode
• A file organization is established when the file is created, and is rarely changed. However, record access mode can change each time the file is used.

                     Record access mode
File organization    Sequential          Random
Sequential           Yes                 No (impractical)
Indexed Seq.         Yes                 Yes
Direct-Relative      Yes                 Yes
Direct-Hashed        No (impractical)    Yes
Indexed Sequential Architecture
(Partial Index)
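[Figure: an index set (many levels) leads down to a sequence set, which leads to the actual data records. The records are stored in control intervals (e.g. keys 100 125 153 and 207 221 252), and control intervals are grouped into a control area. Index entries shown include 153, 252, 363, 575, 683 and 747.]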
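A partial (block) index lookup can be sketched in a few lines of Python. This is only an illustration, not the actual ISAM or VSAM layout: it assumes each index entry holds the highest key of one block, and it reuses the two groups of record keys from the figure. The names `blocks`, `block_index` and `lookup` are hypothetical.

```python
from bisect import bisect_left

# Two blocks of records in key order (keys taken from the figure).
blocks = [[100, 125, 153], [207, 221, 252]]

# Assumed partial index: one entry per block, holding that block's highest key.
block_index = [block[-1] for block in blocks]

def lookup(key):
    """Return (block number, position in block) for key, or None if absent."""
    i = bisect_left(block_index, key)   # first block whose highest key >= key
    if i == len(block_index) or key not in blocks[i]:
        return None
    return i, blocks[i].index(key)      # sequential search inside one block

print(lookup(221))
print(lookup(160))
```

Only one index entry per block is needed because the records inside each block are kept in key sequence.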
Direct - Relative Files
• Each record can be retrieved by specifying its relative record number. The relative record number is a number from 0 to n that gives the position of the record relative to the beginning of the file.
• This provides a method of direct file organization. Both sequential and direct access are supported, but finding a key allocation that maps cleanly onto record numbers is not always easy or possible.
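A minimal Python sketch of relative addressing, assuming fixed-length records so that the byte offset is simply the relative record number times the record length. The 16-byte record layout, the key values and the helper names are hypothetical, and an in-memory buffer stands in for the file.

```python
import io
import struct

# Hypothetical fixed-length layout: 4-byte unsigned key + 12-byte name.
RECORD = struct.Struct("<I12s")

def write_record(f, rrn, key, name):
    f.seek(rrn * RECORD.size)          # offset = record number x record length
    f.write(RECORD.pack(key, name.encode("ascii")))

def read_record(f, rrn):
    f.seek(rrn * RECORD.size)
    key, name = RECORD.unpack(f.read(RECORD.size))
    return key, name.rstrip(b"\x00").decode("ascii")

f = io.BytesIO()                       # stands in for a file on disk
for rrn, name in enumerate(["CHESS", "COMBAT", "DEFENDER"]):
    write_record(f, rrn, key=100 + rrn, name=name)

print(read_record(f, 2))                          # direct access by record number
print([read_record(f, r) for r in range(3)])      # sequential access, 0 to n
```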
Direct - Hashed Files
• Some applications (e.g. reservation systems) update and retrieve records in random mode and rarely need sequential access to the data records. For these, hashed file organization provides rapid access to individual records based on a key.
• The major disadvantage of hash organization is that sequential access is not convenient, because the records are not stored in primary key sequence. Highly concurrent environments doing random access, however, are well suited to hash organization.
• The basis of a hash file is an addressing algorithm which transforms the record identifier into a relative address.
Components of a Hashed File
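[Figure: the record identifier passes through a transformation that yields an address in the primary storage area. The primary storage area is divided into buckets 1 to b, each with slots 1 to s. A bucket overflow technique places records that do not fit in their bucket into the overflow storage area.]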
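The components above can be sketched in Python. This is an illustration under stated assumptions rather than any particular system's method: the transformation is the division-remainder technique, overflow goes to a single shared overflow area, and the sizes, keys and function names are hypothetical.

```python
BUCKETS = 7      # b: buckets in the primary storage area (assumed)
SLOTS = 2        # s: record slots per bucket (assumed)

primary = [[] for _ in range(BUCKETS)]
overflow = []    # one shared overflow storage area (assumed technique)

def home_address(key):
    """Transformation: record identifier -> relative bucket address."""
    return key % BUCKETS

def store(key, record):
    bucket = primary[home_address(key)]
    if len(bucket) < SLOTS:
        bucket.append((key, record))
    else:
        overflow.append((key, record))   # home bucket is full

def fetch(key):
    # Search the home bucket first, then the overflow area.
    for k, record in primary[home_address(key)] + overflow:
        if k == key:
            return record
    return None

for key, name in [(14, "PITFALL"), (21, "BERSERK"), (28, "ODYSSEY"), (9, "DONKEY KONG")]:
    store(key, name)

print(fetch(28), overflow)
```

Keys 14, 21 and 28 all share home address 0, so the third one lands in the overflow area and costs a longer search.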
Hashed File Design
Load Factor (Fill Factor): The load factor is the percentage of the space allocated to the file that is taken up by the records in the file. A low load factor reduces the number of records that overflow their home addresses. It is common to use 50% to 80%, with a lower load factor for files that will grow.
Bucket Capacity: Increasing the bucket capacity also reduces the number of overflows, and hence the average search length.
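[Figure: average search length (vertical axis, 1.0 to 1.5) plotted against load factor (horizontal axis, 20% to 100%), with one curve for each bucket capacity b = 1, 2, 3 and 4, where b = records per bucket.]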
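The load factor arithmetic can be worked through in Python. The record counts and the two function names are made up for illustration; integer arithmetic is used so that rounding up is exact.

```python
def load_factor(records, buckets, slots_per_bucket):
    """Fraction of the allocated record slots that are occupied."""
    return records / (buckets * slots_per_bucket)

def buckets_needed(records, slots_per_bucket, target_percent):
    """Smallest number of buckets that keeps the load factor at or below the target."""
    capacity_per_bucket = slots_per_bucket * target_percent   # in units of 1/100 record
    return -(-records * 100 // capacity_per_bucket)           # ceiling division

# Hypothetical file: 7000 records, 4 records per bucket, 70% target load factor.
b = buckets_needed(7000, 4, 70)
print(b, load_factor(7000, b, 4))
```

Choosing a lower target for a file that is expected to grow simply allocates more buckets up front.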
Comparison of Organizations
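[Figure: Sequential. Records are stored in key order from the start of the file: ASTEROIDS, BREAKOUT, COMBAT, ..., ZAXXON.]
[Figure: Indexed Sequential. An index on the key (entries H, P, Z) points into groups of records held in key order, from ASTEROIDS through MEGAMANIA to ZAXXON.]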
Comparison of Organizations(2)
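[Figure: Direct - Relative. Records such as CHESS, COMBAT, DEFENDER, ..., ZAXXON are located by relative record number 1, 2, 3, ..., n.]
[Figure: Direct - Hashed. A hashing routine transforms the key into a relative record number 1, 2, 3, ..., n. The records shown (PITFALL, BERSERK, ODYSSEY, ..., DONKEY KONG) are therefore not stored in key order.]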
Binary Trees
• A tree is a non-linear data structure in which each element can have several "next" elements (branching).
• A binary tree has a maximum of two branches per element or node.
• A node consists of some data and a maximum of two pointers: a left pointer to the left branch and a right pointer to the right branch. If there is no left or right branch, a nil pointer is used.
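A node and its two pointers can be sketched in Python as follows. This is a plain unbalanced binary search tree, with `None` playing the role of the nil pointer; the class and function names are illustrative, and the keys are the PRODUCT# values used in the insertion diagram that follows.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None     # nil pointer: no left branch yet
        self.right = None    # nil pointer: no right branch yet

def insert(root, key):
    """Return the root of the tree after adding key (duplicates are ignored)."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    """Follow less-than / greater-than pointers until the key or nil is reached."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root

root = None
for key in [1000, 1600, 350, 2000, 975, 625]:
    root = insert(root, key)

print(search(root, 975) is not None, search(root, 700) is not None)
```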
A Diagram of a Binary Tree
Basic binary tree record layout for PRODUCT: PRODUCT# (primary key), a less-than pointer (left link), a greater-than pointer (RLINK), and the data.
[Figure: the tree after each step: (1) initial tree, (2) insert 1000, (3) insert 1600, (4) insert 0350, (5) insert 2000, (6) insert 0975, (7) insert 0625. In the final tree 1000 is the root; 0350 is on its < branch, with 0975 on the > branch of 0350 and 0625 on the < branch of 0975; 1600 is on the > branch of the root, with 2000 on its > branch.]
An Example of a Binary Tree
[Figure: a binary tree with root 1000 holding the keys 0100, 0350, 0625, 0975, 1000, 1250, 1425, 1600, 1775 and 2000; < and > mark the left and right branches.]
Task: Indicate the different traversals on this diagram.
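As a starting point for the task, the three standard depth-first traversals can be sketched in Python. The exact shape of the tree on this slide is not reproduced here; the example instead uses the six-key tree built on the previous slide (insert 1000, 1600, 0350, 2000, 0975, 0625), written as nested (left, key, right) tuples. That representation and the function names are assumptions made for brevity.

```python
# (left, key, right) tuples; None marks a nil pointer.
tree = ((None, 350, ((None, 625, None), 975, None)),
        1000,
        (None, 1600, (None, 2000, None)))

def inorder(t):
    """Left subtree, node, right subtree: visits keys in ascending order."""
    if t is None:
        return []
    left, key, right = t
    return inorder(left) + [key] + inorder(right)

def preorder(t):
    """Node first, then left subtree, then right subtree."""
    if t is None:
        return []
    left, key, right = t
    return [key] + preorder(left) + preorder(right)

def postorder(t):
    """Left subtree, right subtree, node last."""
    if t is None:
        return []
    left, key, right = t
    return postorder(left) + postorder(right) + [key]

print(inorder(tree))    # [350, 625, 975, 1000, 1600, 2000]
print(preorder(tree))   # [1000, 350, 975, 625, 1600, 2000]
print(postorder(tree))  # [625, 975, 350, 2000, 1600, 1000]
```

The in-order traversal is the one that supports sequential processing in key order.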
B Trees
• The problem with binary trees is balance: the tree can easily deteriorate into a linked list, and the reduced search times are then lost. B trees overcome this problem.
• B is usually taken to stand for Balanced: all the leaves are the same distance from the root, so B trees guarantee a predictable efficiency.
• There are several varieties of B tree; most applications use the B+ tree.
A B+ tree of degree m has the following properties:
1. All leaves are at the same level, that is, the same depth from the root.
2. A non-leaf node that has n branches contains n-1 keys.
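The loss of balance in a plain binary tree is easy to demonstrate. The Python sketch below, with illustrative names and keys, inserts the same 15 keys twice: once in sorted order, which produces a chain, and once in an order that happens to keep the tree bushy.

```python
def insert(root, key):
    """Insert key into an unbalanced binary search tree made of dict nodes."""
    node = {"key": key, "left": None, "right": None}
    if root is None:
        return node
    cur = root
    while True:
        side = "left" if key < cur["key"] else "right"
        if cur[side] is None:
            cur[side] = node
            return root
        cur = cur[side]

def height(node):
    """Number of nodes on the longest path from this node down to a leaf."""
    if node is None:
        return 0
    return 1 + max(height(node["left"]), height(node["right"]))

def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root

chain = build(range(1, 16))    # sorted input: every node has only a right child
bushy = build([8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15])

print(height(chain), height(bushy))   # 15 4
```

With sorted input a search may visit all 15 nodes, exactly as in a linked list. A B tree keeps every leaf at the same depth regardless of insertion order.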
Example of a B Tree
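[Figure: an example B tree holding the keys 0350, 0625, 1000, 1250, 1277, 1282, 1291, 1425, 1600, 2000 and 2107; < and > mark the branches for smaller and larger keys.]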
A B tree provides balance and quick direct access, but sequential processing can be slow. The B+ tree was introduced to address this.
In a B+ tree every key value occurs in a leaf node, so that sequential processing can be supported. As a result the leaf nodes have a different structure from the high-level nodes, and some key values occur twice in the tree.
B+ Tree Node Structure
A high-level node
P1 K1 P2 K2 ..... P(n-1) K(n-1) Pn
– P1: pointer to subtree for keys < K1
– P2: pointer to subtree for keys >= K1 and < K2
– .....
– P(n-1): pointer to subtree for keys >= K(n-2) and < K(n-1)
– Pn: pointer to subtree for keys >= K(n-1)
A leaf node (every key value appears in a leaf node)
P1 K1 P2 K2 ....... P(n-1) K(n-1) Pn
– P1: pointer to record (block) with key K1
– P2: pointer to record (block) with key K2
– .......
– P(n-1): pointer to record (block) with key K(n-1)
– Pn: pointer to leaf with smallest key greater than K(n-1)
Example of a B+ Tree
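[Figure: < branches lead to smaller keys, >= branches to equal or larger keys.]
Root node:     1250
Index nodes:   0625 1000   |   1425 2000
Leaf nodes:    0350 | 0625 | 1000 | 1250 1300 | 1425 1600 | 2000
Actual data records: 0350, 0625, 1000, 1250, 1300, 1425, 1600, 2000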
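Searching this example tree can be sketched in Python. The tree is wired up by hand from the keys on the slide (written as integers, so 0350 becomes 350); the classes and function names are hypothetical, and the leaves hold only keys rather than pointers to data records.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None            # pointer to the leaf with the next-larger keys

class Index:
    def __init__(self, keys, children):
        self.keys = keys
        self.children = children    # always one more child than keys

leaves = [Leaf([350]), Leaf([625]), Leaf([1000]),
          Leaf([1250, 1300]), Leaf([1425, 1600]), Leaf([2000])]
for a, b in zip(leaves, leaves[1:]):
    a.next = b
root = Index([1250], [Index([625, 1000], leaves[:3]),
                      Index([1425, 2000], leaves[3:])])

def find_leaf(node, key):
    # Child i covers keys >= keys[i-1] and < keys[i], as in the node diagram.
    while isinstance(node, Index):
        node = node.children[bisect_right(node.keys, key)]
    return node

def contains(key):
    """Random access: one root-to-leaf path."""
    return key in find_leaf(root, key).keys

def range_scan(lo, hi):
    """Sequential access: find the first leaf, then follow the leaf pointers."""
    leaf, out = find_leaf(root, lo), []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

print(contains(1300), contains(1400))
print(range_scan(600, 1425))
```

The range scan never returns to the index nodes, which is why the B+ tree handles sequential processing well.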
Building a B+ Tree
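Keys to insert, in order: 67, 89, 123, 18, 34, 87, 99, 104, 36, 55, 78, 9
[Figure: the first steps. Inserting 67, 89 and 123 overflows the single leaf (node split a | b c: three keys do not fit, so split and promote the middle value), giving a root node (89) over leaf nodes (67) and (89, 123), which point to the data records. Inserting 18 and 34 then splits the first leaf in the same way, giving root (34, 89) over leaves (18), (34, 67) and (89, 123).]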
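The build can be followed in code. The Python sketch below assumes, as the figure suggests, that a node holds at most two keys, that a full leaf splits as a | b c with the middle key copied up, and that a full index node splits with its middle key moved up. Names are illustrative, and leaves hold keys only.

```python
from bisect import bisect_right, insort

MAX_KEYS = 2   # assumption: at most two keys per node, as in the figure

class Node:
    def __init__(self, leaf, keys=None, children=None):
        self.leaf = leaf
        self.keys = keys or []
        self.children = children or []   # used by index nodes only
        self.next = None                 # used by leaf nodes only

def _insert(node, key):
    """Insert below node; return (promoted key, new right node) if node split."""
    if node.leaf:
        insort(node.keys, key)
        if len(node.keys) <= MAX_KEYS:
            return None
        mid = len(node.keys) // 2                    # split a | b c
        right = Node(True, node.keys[mid:])
        node.keys = node.keys[:mid]
        right.next, node.next = node.next, right     # keep the leaf chain linked
        return right.keys[0], right                  # middle key is copied up
    i = bisect_right(node.keys, key)
    split = _insert(node.children[i], key)
    if split is None:
        return None
    up, right_child = split
    node.keys.insert(i, up)
    node.children.insert(i + 1, right_child)
    if len(node.keys) <= MAX_KEYS:
        return None
    mid = len(node.keys) // 2                        # middle key is moved up
    up = node.keys[mid]
    right = Node(False, node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return up, right

def insert(root, key):
    split = _insert(root, key)
    if split is None:
        return root
    up, right = split
    return Node(False, [up], [root, right])          # tree grows at the root

def leaf_keys(root):
    node = root
    while not node.leaf:
        node = node.children[0]
    out = []
    while node is not None:
        out.extend(node.keys)
        node = node.next
    return out

keys = [67, 89, 123, 18, 34, 87, 99, 104, 36, 55, 78, 9]
root = Node(True)
for n, key in enumerate(keys, start=1):
    root = insert(root, key)
    if n in (3, 5):
        print(n, "keys inserted, root holds", root.keys)
print(leaf_keys(root))
```

Because the tree only ever grows by splitting the root, every leaf stays at the same depth.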
A Review of Trees
• Trees permit rapid retrieval of data for both random and sequential processing.
• Trees can be built on primary or secondary keys.
• Trees are special cases of networks; in networks, records from different files are joined without a strict hierarchy being observed.
Indexes in Oracle(1)
CREATE [UNIQUE | BITMAP] INDEX index ON table (column [, column]...);
• An index is a schema object that contains an entry for each value that appears in the indexed column(s) of the table or cluster and provides direct, fast access to rows.
• Indexes may be created on
– one or more (up to 32) columns of a table, a partitioned table, or a cluster;
– one or more scalar typed object attributes of a table or a cluster.
• It is preferable to declare a primary key when creating the table, since CREATE UNIQUE INDEX will fail if the column already contains duplicate values.
Indexes in Oracle(2)
• An index is an ordered list of all the values that reside in a group of one or more columns at a given time. Such a list makes queries that test the values in those columns far more efficient. Indexes also take up storage space and must be changed whenever the data changes, so a cost-benefit analysis must be made in each case to determine whether and how indexes should be used. Oracle can use indexes to improve performance when:
– searching for rows with specified index column values
– accessing tables in index column order
• When you initially insert rows into a new table, it is generally faster to create the table, insert the rows, and then create the index. If you create the index before inserting the rows, Oracle must update the index for every row inserted.
Indexes in Oracle(3)
• Multiple Indexes Per Table
Any number of indexes can be created for a table, provided that the combination of columns differs for each index. You can create more than one index using the same columns, provided that you specify distinctly different combinations of the columns, such as a different column order. For example, the following statements specify valid combinations:
CREATE INDEX emp_idx1 ON emp (ename, job);
CREATE INDEX emp_idx2 ON emp (job, ename);
• Note that every index must be maintained whenever indexed data in the table is updated, which adds processing time. Updating a table with a single index therefore takes less time than updating the same table with five indexes.
Indexes in Oracle(4) - Nulls
• Table rows in which all key columns are NULL are not indexed. Consider the following statement:
SELECT ename
FROM emp
WHERE comm IS NULL;
The query above does not use an index created on the COMM column, because rows whose COMM is NULL have no entry in that index.
Indexes in Oracle(5) - Bitmap Index
• Bitmap indexes store the rowids associated with a key value as a bitmap. Each bit in the bitmap corresponds to a possible ROWID; if the bit is set, the row with the corresponding ROWID contains the key value. The internal representation of bitmaps is best suited to applications with low levels of concurrent transactions, such as data warehousing.
• Bitmap indexes are appropriate when the indexed column has few distinct values. An example would be a flag column that holds either Y or N.
CREATE BITMAP INDEX masterflagbitmap_ix ON film_copy(masterflag);
• The index holds one bitmap for each distinct value, with one bit for every row in the table:
Y<11011001............>
N<00100110............>
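The idea behind the two bitmaps can be sketched in Python. This is a conceptual illustration only and says nothing about Oracle's internal storage format: the eight flag values are chosen to reproduce the first eight bits shown above, and row positions stand in for ROWIDs.

```python
def build_bitmaps(values):
    """One bitmap per distinct value, one bit per row."""
    bitmaps = {}
    for row, value in enumerate(values):
        bitmaps.setdefault(value, [0] * len(values))[row] = 1
    return bitmaps

def rows_matching(bitmap):
    """Row positions whose bit is set."""
    return [row for row, bit in enumerate(bitmap) if bit]

masterflag = ["Y", "Y", "N", "Y", "Y", "N", "N", "Y"]   # hypothetical column contents
bitmaps = build_bitmaps(masterflag)

for value in ("Y", "N"):
    print(value, "".join(map(str, bitmaps[value])))
print(rows_matching(bitmaps["N"]))
```

With only two distinct values the whole index is two short bit strings, which is why low-cardinality columns suit this structure.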
Clusters(1)
• A cluster is a schema object that contains one or more tables that all have one or more columns in common. Rows of one or more tables that share the same value in these common columns are physically stored together within the database.
• Clustering provides more control over the physical storage of rows within the database. Clustering can reduce both the time it takes to access clustered tables and the space needed to store the table. After you create a cluster and add tables to it, the cluster is transparent. You can access clustered tables with SQL statements just as you can non-clustered tables.
• While clustering multiple tables improves the performance of joins, it is likely to reduce the performance of full table scans, INSERT statements, and UPDATE statements that modify cluster key values.
Clusters(2) - creating an Indexed Cluster
• The rows of two related tables are interleaved in a single area called a cluster. The cluster key is the column or columns by which the tables are usually joined in a query.
CREATE CLUSTER cluster (column datatype [, column datatype] ...);
e.g.
CREATE CLUSTER workerandskill (tempname VARCHAR2(25));
This sets aside space for the cluster. The name of the cluster key column is irrelevant, but its datatype must match that of Name in the table worker.
Next, the tables to be included in the cluster are created.
CREATE TABLE worker (
  Name    VARCHAR2(25) NOT NULL,
  Age     NUMBER,
  Lodging VARCHAR2(15) )
CLUSTER workerandskill (Name);
Clusters(3) - creating an Indexed Cluster
• Now a second table is added to the cluster.
CREATE TABLE workerskill (
  Name    VARCHAR2(25) NOT NULL,
  Skill   VARCHAR2(25) NOT NULL,
  Ability VARCHAR2(15) )
CLUSTER workerandskill (Name);
• Before inserting rows into worker and workerskill you must create a cluster index.
CREATE INDEX workerandskill_ix ON CLUSTER workerandskill;
Note that no index columns are specified, since the index is automatically built on all the columns of the cluster key. For cluster indexes, all rows are indexed.
Example of a Cluster: Name is the Cluster Key
Age  Lodging     Name                    Skill           Ability
23   PAPA KING   ADAH TALBOT             WORK            GOOD
29   ROSE HILL   ANDREW DYE
22   CRAMNER     BART SARJEANT
18   ROSE HILL   DICK JONES              SMITHY          EXCELLENT
16   MATTS       DONALD ROLLO
43   WEITBROCHT  ELBERT TALBOT           DISCUS          SLOW
27   ROSE HILL   JOHN PEARSON            COMBINE DRIVER
                                         WOODCUTTER      GOOD
                                         SMITHY          AVERAGE
     ROSE HILL   KAY AND PALMER WALLBOM
Age and Lodging come from the WORKER table; Skill and Ability come from the WORKERSKILL table.
Clusters(4) - creating an Indexed Cluster
• Each cluster key value is stored only once. It is as if the cluster were a big table containing data drawn from both of the tables that make it up.
• You may want to use indexed clusters in the following cases:
– Your queries retrieve rows over a range of cluster key values.
– Your clustered tables may grow unpredictably.
• You cannot specify integrity constraints as part of the definition of a cluster key column. Instead, you can associate integrity constraints with the tables that belong to the cluster.
Clusters(5) - creating a Hash Cluster
• In a hash cluster, Oracle stores together rows that have the same hash key value. The hash value for a row is the value returned by the cluster's hash function.
• When you create a hash cluster, you can either specify a hash function or use the Oracle internal hash function. Hash values are not actually stored in the cluster, although cluster key values are stored for every row in the cluster.
• You may want to use hash clusters in the following cases:
– Your queries retrieve rows based on equality conditions involving all cluster key columns.
– Your clustered tables are static, or you can determine the maximum number of rows and the maximum amount of space required by the cluster when you create the cluster.
Clusters(6) - creating a Hash Cluster
• The following statement creates a hash cluster named PERSONNEL with the cluster key column DEPARTMENT_NUMBER.
CREATE CLUSTER personnel
( department_number NUMBER )
HASHKEYS 500;
• The HASHKEYS clause makes this a hash cluster that uses the internal hash function, and specifies the number of hash values. Oracle rounds this number up to the next prime number (503 in this case).
• Now create the tables, naming the cluster in the CLUSTER clause.
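The figure 503 for HASHKEYS 500 can be checked with a short Python sketch. It only reproduces the arithmetic of rounding up to a prime, using simple trial division; it is not Oracle's code, and the function names are hypothetical.

```python
def is_prime(n):
    """Trial division; adequate for small values such as HASHKEYS settings."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def next_prime(n):
    """Smallest prime that is greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n

print(next_prime(500))   # 503
```

Note that 499 is also prime and is closer to 500, so the rounding is upward rather than to the nearest prime in either direction.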