Database Management System
Unit 1
Introduction
Topics to be covered – Database and DBMS – characteristics – importance – advantages – evolution – Codd rules – database architecture: data organization – file structure and indexing.
Table of Contents
Unit 1
Introduction
Topics to be covered – Database and DBMS – characteristics – importance – advantages – evolution – Codd rules – database architecture: data organization – file structure and indexing
1.1 Database and DBMS
1.2 Characteristics of DBMS
1.3 Importance of DBMS
1.4 Advantages of DBMS
1.5 Evolution of DBMS
1.6 Codd Rules
1.7 Database Architecture
1.8 Data Organization
1.9 File Structure and Indexing
1.9.1 File Structuring
1.9.2 Indexing
1.10 References
1.1 Database and DBMS
 File – a two-dimensional table summarizing multiple instances of a set of fields of an entity.
 Database – a collection of interrelated files. A database management system is a collection of databases, database utilities and a data dictionary/directory, operated by user groups/application developers and administered by a database administrator.
 A database system is essentially a database management system that is free from the drawbacks of the conventional file processing system.
 In a database system, data is independent of programs.
 Ex – files such as:
o Room
o Items
o Patients
o Physician
o Charges
o In-patient treatment
o Outpatient treatment
 A database holds 'operational data', which is distinct from input data and output data.
 A database is a collection of stored operational data used by the application systems of some particular organization.
 Input data is data that comes into the system from the outside world, i.e. from terminals.
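
To make the "two-dimensional table" picture concrete, here is a minimal Python sketch; the file name and field names (patient_id, name, ward) are illustrative assumptions, not part of the original example.

patient_file = [                                    # one "file" = a table
    {"patient_id": 101, "name": "Anand", "ward": "A2"},   # record 1
    {"patient_id": 102, "name": "Deepa", "ward": "B1"},   # record 2
]

# Rows are records (instances of the entity); columns are its fields.
for record in patient_file:
    print(record["patient_id"], record["name"], record["ward"])
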
1.2 Characteristics of DBMS
(i) Shared – Data in a database are shared among different users and applications.
(ii) Persistence – Data in a database exist permanently, in the sense that the data can live beyond the scope of the process that created it.
(iii) Validity / Integrity / Correctness – Data should be correct with respect to the real-world entities that they represent.
(iv) Security – Data should be protected from unauthorized access.
(v) Consistency – Whenever more than one data element in a database represents related real-world values, the values should be consistent with respect to the relationship.
(vi) Independence – The three levels of the schema (internal, conceptual and external) should be independent of each other, so that changes to the schema at one level do not affect the other levels.
1.3 Importance of DBMS
 It helps make data management more efficient and effective.
 Its query language allows quick answers to ad hoc queries.
 It gives end users better access to more and better-managed data.
 It provides an integrated view of the organization's operations – the "big picture".
 It reduces the probability of inconsistent data.
1.4 Advantages of DBMS
(i) Redundancy can be reduced
 In non-database systems, each application or department has its own private files, resulting in a considerable amount of redundancy in the stored data.
 Thus storage space is wasted.
 By having a centralized database, most of this redundancy can be eliminated.
 Sometimes there are sound business and technical reasons for maintaining multiple copies of the same data.
(ii) Inconsistency can be avoided
 When the same data is duplicated and changes made at one site are not propagated to the other sites, inconsistency arises.
 The two entries for the same data will then not agree.
 At such times the data is said to be inconsistent.
 If redundancy is removed, the chance of having inconsistent data is also removed.
(iii) Data can be shared
 Existing applications can share the data in a database.
(iv) Standards can be enforced
 With central control of the database, the database administrator can enforce standards.
(v) Security restrictions can be applied
 Having complete authority over the operational data enables the database administrator to ensure that the only means of access to the database is through proper channels.
 The DBA can define authorization checks to be carried out whenever access to sensitive data is attempted.
(vi) Integrity can be maintained
 Integrity means that the data in the database is accurate.
 Centralized control of the data permits the administrator to define integrity constraints on the data in the database.
(vii) Conflicting requirements can be balanced
 Knowing the overall requirements helps the database designers create a database design that is best for the organization.
1.5 Evolution of DBMS
(i) Late 1960s and 1970s:
 Widespread use of hard disks in the late 1960s changed the scenario for data processing greatly, since hard disks allowed direct access to data.
 The position of data on disk was immaterial, since any location on disk could be accessed in just tens of milliseconds.
 With disks, network and hierarchical databases could be created that allowed data structures such as lists and trees to be stored on disk. Programmers could construct and manipulate these data structures.
 Codd [1970] defined the relational model and nonprocedural ways of querying data in the relational model, and relational databases were born.
 The simplicity of the relational model and the possibility of hiding implementation details completely from the programmer were enticing indeed.
 Codd later won the prestigious Association for Computing Machinery Turing Award for his work.
(ii) 1980s:
 Although academically interesting, the relational model was not used in practice initially, because of its perceived performance disadvantages – relational databases could not match the performance of existing network and hierarchical databases.
 That changed with System R, a groundbreaking project at IBM Research that developed techniques for the construction of an efficient relational database system.
 The fully functional System R prototype led to IBM's first relational database product, SQL/DS.
 Initial commercial relational database systems, such as IBM DB2, Oracle, Ingres, and DEC Rdb, played a major role in advancing techniques for efficient processing of declarative queries.
 By the early 1980s, relational databases had become competitive with network and hierarchical database systems even in the area of performance.
 Relational databases were so easy to use that they eventually replaced network/hierarchical databases; programmers using those earlier databases had been forced to deal with many low-level implementation details and had to code their queries in a procedural fashion.
 Most importantly, they had to keep efficiency in mind when designing their programs, which involved a lot of effort.
 In contrast, in a relational database almost all these low-level tasks are carried out automatically by the database, leaving the programmer free to work at a logical level.
 Since attaining dominance in the 1980s, the relational model has reigned supreme among data models.
 The 1980s also saw much research on parallel and distributed databases, as well as initial work on object-oriented databases.
(iii) Early 1990s:
 The SQL language was designed primarily for decision support applications, which are query intensive, yet the mainstay of databases in the 1980s was transaction processing applications, which are update intensive.
 Decision support and querying re-emerged as a major application area for databases.
 Tools for analyzing large amounts of data saw large growths in usage.
 Many database vendors introduced parallel database products in this period.
 Database vendors also began to add object-relational support to their databases.
(iv) Late 1990s:
 The major event was the explosive growth of the World Wide Web.
 Databases were deployed much more extensively than ever before.
 Database systems now had to support very high transaction processing rates, as well as very high reliability and 24×7 availability.
 Database systems also had to support Web interfaces to data.
1.6 Codd Rules
 Codd's 12 rules are a set of thirteen rules (numbered zero to twelve) proposed by Edgar F. Codd, a pioneer of the relational model for databases, designed to define what is required from a database management system in order for it to be considered relational, i.e., an RDBMS.
 Codd produced these rules as part of a personal campaign to prevent his vision of the relational database being diluted, as database vendors scrambled in the early 1980s to repackage existing products with a relational veneer.
 Rule 12 was particularly designed to counter such a positioning. In fact, the rules are so strict that all popular so-called "relational" DBMSs fail on many of the criteria.
The rules
(i) Rule 0: The system must qualify as relational, as a database, and as a management system. For a system to qualify as a relational database management system (RDBMS), that system must use its relational facilities (exclusively) to manage the database.
(ii) Rule 1: The information rule: All information in the database is to be represented in one and only one way, namely by values in column positions within rows of tables.
(iii) Rule 2: The guaranteed access rule: All data must be accessible. This rule is essentially a restatement of the fundamental requirement for primary keys. It says that every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row.
(iv) Rule 3: Systematic treatment of null values: The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of "missing information and inapplicable information" that is systematic, distinct from all regular values (for example, "distinct from zero or any other number", in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.
(v) Rule 4: Active online catalog based on the relational model: The system must support an online, inline, relational catalog that is accessible to authorized users by means of their regular query language. That is, users must be able to access the database's structure (catalog) using the same query language that they use to access the database's data.
(vi) Rule 5: The comprehensive data sublanguage rule: The system must support at least one relational language that
o can be used both interactively and within application programs, and
o supports data definition operations (including view definitions), data manipulation operations (update as well as retrieval), security and integrity constraints, and transaction management operations (begin, commit, and rollback).
(vii) Rule 6: The view updating rule: All views that are theoretically updatable must be updatable by the system.
(viii) Rule 7: High-level insert, update, and delete: The system must support set-at-a-time insert, update, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.
(ix) Rule 8: Physical data independence: Changes to the physical level (how the data is stored, whether in arrays or linked lists, etc.) must not require a change to an application based on the structure.
(x) Rule 9: Logical data independence: Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.
(xi) Rule 10: Integrity independence: Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.
(xii) Rule 11: Distribution independence: The distribution of portions of the database to various locations should be invisible to users of the database. Existing applications should continue to operate successfully
o when a distributed version of the DBMS is first introduced, and
o when existing distributed data are redistributed around the system.
(xiii) Rule 12: The non-subversion rule: If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example, bypassing a relational security or integrity constraint.
1.7 Database Architecture
The goal of the three-schema architecture is to separate the user applications from the physical database. In this architecture, schemas can be defined at the following three levels:
1. The internal level has an internal schema, which describes the physical storage structure of
the database. The internal schema uses a physical data model and describes the complete
details of data storage and access paths for the database.
2. The conceptual level has a conceptual schema, which describes the structure of the whole
database for a community of users. The conceptual schema hides the details of physical
storage structures and concentrates on describing entities, data types, relationships, user
operations, and constraints. A high-level data model or an implementation data model can
be used at this level.
3. The external or view level includes a number of external schemas or user views. Each
external schema describes the part of the database that a particular user group is
interested in and hides the rest of the database from that user group. A high-level data
model or an implementation data model can be used at this level.
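
A minimal Python sketch of the three levels may help; the data, names, and view below are hypothetical illustrations of the idea, not a real DBMS implementation.

# Internal level: the physical layout (here, bare tuples in a list).
internal_storage = [
    (1, "Anand", 52000.0),
    (2, "Deepa", 61000.0),
]

# Conceptual level: the whole database as entities with named fields,
# hiding how the tuples above are actually laid out.
def conceptual_employees():
    return [{"emp_id": t[0], "name": t[1], "salary": t[2]}
            for t in internal_storage]

# External level: one user group's view. The payroll view hides names;
# a change to the internal layout would not affect users of this view.
def payroll_view():
    return [{"emp_id": e["emp_id"], "salary": e["salary"]}
            for e in conceptual_employees()]

print(payroll_view())
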
1.8 Data organization
 The choice of file organization and the access method used (direct or sequential) depends on five characteristics of the data:
o File volatility
o File activity
o File query needs
o File size
o Data currency
(i) File Volatility
 When data are frequently added to or deleted from a file, the file is said to have high volatility.
 Airline and railway reservation files are volatile files, as reservation transactions occur at the rate of hundreds or thousands per minute.
 But an employee master file for a company with low employee turnover would have very low volatility. When volatility is low, sequential or indexed-sequential forms of file organization and access work well, especially when few queries are made against the file.
(ii) File Activity
 The percentage of records in a file that is actually accessed during any one run is the file activity rate.
 In applications such as bank teller or hotel reservation systems, each transaction must be processed immediately.
 In other applications, such as sending out invoices at the end of the month, transactions can be batched and then processed as a group.
 When the file activity rate is around 60 percent or higher – meaning that 60 percent of the records in the file may be accessed at any one time – sequential techniques are often considered more efficient.
(iii) File Query
 When information must be retrieved very quickly, some form of direct organization must be used.
 Railway reservation systems, inventory systems, and automatic teller machine systems all fall into this category.
(iv) File Size
 When the records in a large file must be accessed immediately, direct organization must be used.
 But if the size of the file is small – under about 100 KB – sequential file organization can be used.
 Files of this size can be read in their entirety into a computer's main memory.
(v) File Currency
 Data currency refers to the timeliness of data.
 If the data need to be up-to-the-minute, then direct organization and processing will be required.
 Stock quote systems, airline reservation systems, and on-line shopping systems all depend on timely data and therefore depend on direct systems.
 A simple decision rule combining these five characteristics is sketched after this list.
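
The five characteristics can be read as a rough decision procedure. The Python sketch below is only an illustration: the 60 percent activity and 100 KB size thresholds come from the text above, while the rule ordering and function name are assumptions.

def suggest_organization(volatility_high, activity_rate,
                         fast_queries_needed, file_size_kb,
                         up_to_the_minute):
    # Query speed and data currency force direct organization.
    if fast_queries_needed or up_to_the_minute:
        return "direct"
    # High activity favours sequential processing (threshold from text).
    if activity_rate >= 0.60:
        return "sequential"
    # Small files (under about 100 KB) can be read whole into memory.
    if file_size_kb < 100:
        return "sequential"
    return "indexed-sequential"        # mixed access patterns

# An airline reservation file: query-heavy and must be up to the minute.
print(suggest_organization(True, 0.05, True, 500_000, True))    # direct
# A monthly invoicing run over a stable master file, batched as a group.
print(suggest_organization(False, 0.95, False, 80_000, False))  # sequential
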
1.9 File Structure and Indexing
1.9.1 File structuring
 Magnetic disks are now the first choice as secondary storage medium because of their data access speeds and the decreasing cost of storage.
 Magnetic material, in the form of a circular disk, stores information on the disk surface in concentric circles called tracks.
 In the case of disk packs (a number of disks together), the tracks with the same diameter on the various surfaces form a cylinder.
 Cylinders are important because data stored on the same cylinder can be accessed much faster than data that is distributed among different cylinders.
 A track is divided into smaller units called sectors.
 Sectors are further divided into blocks (done by the operating system during disk formatting/installation).
 The block size is fixed once during formatting and cannot be changed until the disk is formatted again.
 Block sizes range from 512 bytes to 4096 bytes.
 Blocks are separated by inter-block gaps. Transfer of data between main memory and disk takes place in units of blocks – the address of a block is supplied to the disk's input/output hardware.
 The area in main memory reserved to hold the contents of a block is called a buffer.
 When a read command is issued, the contents of the block are copied from the disk to the buffer.
 When a write command is issued, the contents of the buffer are copied to the disk.
 Sometimes the buffer in main memory is made large enough to accommodate several contiguous blocks, called a cluster, to improve access speeds.
 Access time consists of four factors: seek time, head switching time, rotational delay and data transfer time.
 The total time needed to locate and transfer data from a block is the sum of the above four factors, as illustrated below.
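
A small worked example of this sum, with assumed component times for one block access:

# All four component values below are assumptions for illustration.
seek_time_ms        = 9.0   # move the access arm to the right cylinder
head_switch_time_ms = 1.0   # activate the head for the right surface
rotational_delay_ms = 4.2   # wait for the block to rotate under the head
transfer_time_ms    = 0.1   # copy the block into the main-memory buffer

total_ms = (seek_time_ms + head_switch_time_ms +
            rotational_delay_ms + transfer_time_ms)
print(f"time to locate and transfer one block: {total_ms:.1f} ms")  # 14.3 ms
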
Record types:
 Data is stored as records, and each record is a collection of values; each value is formed by one or more bytes and corresponds to a specific field of the record.
 A BOOK record represents a book entity, and each field value of the record specifies an attribute of that book entity, such as ISBN, AUTHOR, PUBLISHER, PRICE and so on. A sketch of such a fixed-length record is given below.
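
A minimal sketch of such a record in Python, packing the book's field values into fixed-width byte fields; the field widths are assumptions chosen only for illustration.

import struct

# "<" = packed, no padding: 13-byte ISBN, 20-byte author,
# 20-byte publisher, 4-byte float price -> 57 bytes per record.
BOOK_FORMAT = "<13s20s20sf"

def pack_book(isbn, author, publisher, price):
    # struct pads short byte strings with NULs to the fixed field width,
    # so every record occupies the same number of bytes in a block.
    return struct.pack(BOOK_FORMAT, isbn.encode(), author.encode(),
                       publisher.encode(), price)

record = pack_book("9780131873254", "Silberschatz", "McGraw-Hill", 850.0)
print(struct.calcsize(BOOK_FORMAT), "bytes per record")   # 57
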
1.9.2 Indexing
 Database system indices play the same role as book indices or card catalogs in libraries.
 For example, to retrieve an account record given the account number, the database system would look up an index to find on which disk block the corresponding record resides, and then fetch the disk block, to get the account record.
 Keeping a sorted list of account numbers would not work well on very large databases with millions of accounts, since the index would itself be very big; further, even though keeping the index sorted reduces the search time, finding an account can still be rather time-consuming.
 Instead, more sophisticated indexing techniques may be used.
 There are two basic kinds of indices:
(1) Ordered indices. Based on a sorted ordering of the values.
(2) Hash indices. Based on a uniform distribution of values across a range of buckets. The bucket to which a value is assigned is determined by a function, called a hash function.
We shall consider several techniques for both ordered indexing and hashing. No one technique is the best. Rather, each technique is best suited to particular database applications. Each technique must be evaluated on the basis of these factors:
(i) Access types: The types of access that are supported efficiently. Access types can include finding records with a specified attribute value and finding records whose attribute values fall in a specified range.
(ii) Access time: The time it takes to find a particular data item, or set of items, using the technique in question.
(iii) Insertion time: The time it takes to insert a new data item. This value includes the time it takes to find the correct place to insert the new data item, as well as the time it takes to update the index structure.
(iv) Deletion time: The time it takes to delete a data item. This value includes the time it takes to find the item to be deleted, as well as the time it takes to update the index structure.
(v) Space overhead: The additional space occupied by an index structure. Provided that the amount of additional space is moderate, it is usually worthwhile to sacrifice the space to achieve improved performance.
(1) Ordered Indices
 To gain fast random access to records in a file, we can use an index structure.
 Each index structure is associated with a particular search key.
 Just like the index of a book or a library catalog, an ordered index stores the values of the search keys in sorted order, and associates with each search key the records that contain it.
 The records in the indexed file may themselves be stored in some sorted order, just as books in a library are stored according to some attribute such as the Dewey decimal number.
 A file may have several indices, on different search keys.
 If the file containing the records is sequentially ordered, a primary index is an index whose search key also defines the sequential order of the file.
 Primary indices are also called clustering indices.
 The search key of a primary index is usually the primary key, although that is not necessarily so. Indices whose search key specifies an order different from the sequential order of the file are called secondary indices, or non-clustering indices.
Primary Index
 In this section, we assume that all files are ordered sequentially on some search key.
 Such files, with a primary index on the search key, are called index-sequential files.
 They represent one of the oldest index schemes used in database systems.
 They are designed for applications that require both sequential processing of the entire file and random access to individual records.
Dense and Sparse Indices
 An index record, or index entry, consists of a search-key value, and pointers to one or more records with that value as their search-key value.
 The pointer to a record consists of the identifier of a disk block and an offset within the disk block to identify the record within the block.
 There are two types of ordered indices that we can use:
o Dense index: An index record appears for every search-key value in the file.
 In a dense primary index, the index record contains the search-key value and a pointer to the first data record with that search-key value.
 The rest of the records with the same search-key value would be stored sequentially after the first record, since, because the index is a primary one, records are sorted on the same search key.
 Dense index implementations may store a list of pointers to all records with the same search-key value; doing so is not essential for primary indices.
o Sparse index: An index record appears for only some of the search-key values.
 As is true in dense indices, each index record contains a search-key value and a pointer to the first data record with that search-key value.
 To locate a record, we find the index entry with the largest search-key value that is less than or equal to the search-key value for which we are looking.
 We start at the record pointed to by that index entry, and follow the pointers in the file until we find the desired record. This lookup is sketched below.
[Figures: a dense index and a sparse index on an ordered file]
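
The sparse-index lookup just described can be sketched in a few lines of Python; the sorted file, the index entries, and the key values are illustrative assumptions.

import bisect

# Sorted data file: (search-key value, record contents).
data_file = [(10, "A"), (20, "B"), (30, "C"), (40, "D"), (50, "E")]

# Sparse index: entries for only some keys, each pointing to the
# position of the first record with that key.
sparse_index = [(10, 0), (30, 2), (50, 4)]
index_keys = [k for k, _ in sparse_index]

def lookup(key):
    # Largest indexed search-key value <= the key we are looking for.
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None
    # Start at the pointed-to record and scan forward through the file.
    for k, value in data_file[sparse_index[i][1]:]:
        if k == key:
            return value
        if k > key:
            return None                  # passed the key: not present
    return None

print(lookup(40))   # 'D': enter at key 30, scan forward to 40
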
Index Update
 Regardless of what form of index is used, every index must be updated whenever a record is either inserted into or deleted from the file. We first describe algorithms for updating single-level indices.
o Insertion. First, the system performs a lookup using the search-key value that appears in the record to be inserted. The actions the system takes next depend on whether the index is dense or sparse:
Dense indices:
1. If the search-key value does not appear in the index, the system inserts an index record with the search-key value in the index at the appropriate position.
2. Otherwise the following actions are taken:
a. If the index record stores pointers to all records with the same search-key value, the system adds a pointer to the new record to the index record.
b. Otherwise, the index record stores a pointer to only the first record with the search-key value. The system then places the record being inserted after the other records with the same search-key values.
Sparse indices:
o We assume that the index stores an entry for each block.
o If the system creates a new block, it inserts the first search-key value (in search-key order) appearing in the new block into the index.
o On the other hand, if the new record has the least search-key value in its block, the system updates the index entry pointing to the block; if not, the system makes no change to the index.
o Deletion. To delete a record, the system first looks up the record to be deleted. The actions the system takes next depend on whether the index is dense or sparse:
Dense indices:
1. If the deleted record was the only record with its particular search-key value, then the system deletes the corresponding index record from the index.
2. Otherwise the following actions are taken:
a. If the index record stores pointers to all records with the same search-key value, the system deletes the pointer to the deleted record from the index record.
b. Otherwise, the index record stores a pointer to only the first record with the search-key value. In this case, if the deleted record was the first record with the search-key value, the system updates the index record to point to the next record.
Sparse indices:
1. If the index does not contain an index record with the search-key value of the deleted record, nothing needs to be done to the index.
2. Otherwise the system takes the following actions:
a. If the deleted record was the only record with its search key, the system replaces the corresponding index record with an index record for the next search-key value (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
b. Otherwise, if the index record for the search-key value points to the record being deleted, the system updates the index record to point to the next record with the same search-key value.
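
A compact Python sketch of these dense-index update steps, for the variant in which an index entry keeps pointers to all records with its search-key value; the keys and pointers are made-up illustrations.

dense_index = {}     # search-key value -> list of record pointers

def index_insert(key, pointer):
    # New key: create an index record; existing key: add the pointer.
    dense_index.setdefault(key, []).append(pointer)

def index_delete(key, pointer):
    pointers = dense_index[key]
    pointers.remove(pointer)
    if not pointers:             # last record with this search key:
        del dense_index[key]     # delete the index record itself

index_insert("Perryridge", "block 4, offset 0")
index_insert("Perryridge", "block 7, offset 2")
index_delete("Perryridge", "block 4, offset 0")
print(dense_index)   # {'Perryridge': ['block 7, offset 2']}
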
Secondary Indices
 Secondary indices must be dense, with an index entry for every search-key value, and a pointer to every record in the file.
 A primary index may be sparse, storing only some of the search-key values, since it is always possible to find records with intermediate search-key values by a sequential access to a part of the file, as described earlier.
 If a secondary index stores only some of the search-key values, records with intermediate search-key values may be anywhere in the file and, in general, we cannot find them without searching the entire file.
 A secondary index on a candidate key looks just like a dense primary index, except that the records pointed to by successive values in the index are not stored sequentially.
 In general, however, secondary indices may have a different structure from primary indices.
 If the search key of a primary index is not a candidate key, it suffices if the index points to the first record with a particular value for the search key, since the other records can be fetched by a sequential scan of the file.
 In contrast, if the search key of a secondary index is not a candidate key, it is not enough to point to just the first record with each search-key value.
 The remaining records with the same search-key value could be anywhere in the file, since the records are ordered by the search key of the primary index, rather than by the search key of the secondary index.
 Therefore, a secondary index must contain pointers to all the records.
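
The following Python sketch shows why a secondary index on a non-candidate key must point to every record; the file contents are illustrative assumptions.

# File sequentially ordered by account number (the primary search key).
records = [(101, "Chennai"), (102, "Madurai"),
           (103, "Chennai"), (104, "Salem")]

# Secondary index on branch city: records with the same city are
# scattered through the file, so every record position must be indexed.
secondary_index = {}
for position, (_, city) in enumerate(records):
    secondary_index.setdefault(city, []).append(position)

print(secondary_index["Chennai"])    # [0, 2]: non-adjacent records
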
(2) Hash Indices
 Hashing can be used not only for file organization, but also for index-structure creation.
 A hash index organizes the search keys, with their associated pointers, into a hash file structure.
 We construct a hash index as follows:
o We apply a hash function on a search key to identify a bucket, and store the key and its associated pointers in the bucket (or in overflow buckets).
o For example, a secondary hash index may be built on an account file, with account-number as the search key.
 We use the term hash index to denote hash file structures as well as secondary hash indices.
 Strictly speaking, hash indices are only secondary index structures.
 A hash index is never needed as a primary index structure, since, if a file itself is organized by hashing, there is no need for a separate hash index structure on it.
 However, since hash file organization provides the same direct access to records that indexing provides, we pretend that a file organized by hashing also has a primary hash index on it.
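
A minimal sketch of this construction in Python; the bucket count, hash function, and account numbers are assumptions for illustration.

N_BUCKETS = 4
buckets = [[] for _ in range(N_BUCKETS)]   # each bucket holds (key, ptr)

def h(account_number):
    return account_number % N_BUCKETS      # the hash function

def hash_insert(account_number, record_pointer):
    buckets[h(account_number)].append((account_number, record_pointer))

def hash_lookup(account_number):
    # Search only the one bucket the hash function identifies.
    return [ptr for key, ptr in buckets[h(account_number)]
            if key == account_number]

hash_insert(217, "block 3, offset 40")
hash_insert(101, "block 1, offset 0")
print(hash_lookup(217))                    # ['block 3, offset 40']
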
Dynamic Hashing
o The need to fix the set B of bucket addresses presents a serious problem with the static hashing technique.
o Most databases grow larger over time. If we are to use static hashing for such a database, we have three classes of options:
1. Choose a hash function based on the current file size. This option will result in performance degradation as the database grows.
2. Choose a hash function based on the anticipated size of the file at some point in the future. Although performance degradation is avoided, a significant amount of space may be wasted initially.
3. Periodically reorganize the hash structure in response to file growth. Such a reorganization involves choosing a new hash function, recomputing the hash function on every record in the file, and generating new bucket assignments. This reorganization is a massive, time-consuming operation. Furthermore, it is necessary to forbid access to the file during reorganization.
o Several dynamic hashing techniques allow the hash function to be modified dynamically to accommodate the growth or shrinkage of the database.
o We describe one form of dynamic hashing, called extendable hashing, sketched below.
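
As a rough illustration of the idea behind extendable (extendible) hashing, the Python sketch below grows a directory of bucket pointers, doubling it only when an overflowing bucket is already at the global depth; the bucket size, keys, and class structure are simplifying assumptions, not a full implementation.

BUCKET_SIZE = 2                        # tiny buckets to force splits

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _slot(self, key):
        # Use the low-order global_depth bits of the hash as the index.
        return hash(key) & ((1 << self.global_depth) - 1)

    def insert(self, key):
        while True:
            bucket = self.directory[self._slot(key)]
            if len(bucket.items) < BUCKET_SIZE:
                bucket.items.append(key)
                return
            self._split(bucket)        # make room, then retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the directory; new slots alias the old buckets.
            self.directory += self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)
        # Slots whose new distinguishing bit is 1 move to the new bucket.
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = new_bucket
        # Rehash only the overflowing bucket's items between the two.
        old_items, bucket.items = bucket.items, []
        for k in old_items:
            (new_bucket if hash(k) & bit else bucket).items.append(k)

eh = ExtendibleHash()
for key in [4, 8, 12, 16, 20, 24]:
    eh.insert(key)
print("global depth:", eh.global_depth)    # directory grew as needed

Note that a split rehashes only the one overflowing bucket, so growth never requires recomputing the hash function over the whole file, which is exactly the drawback of static hashing described above.
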
1.10 References
1. Database Management System – Alexis Leon and Mathews Leon
2. Database Management System – R. Panneerselvam
3. Database Management System – Rajesh Narang