INFORMATION MANAGEMENT
UNIT III DATABASE MANAGEMENT SYSTEMS
DBMS – HDBMS, NDBMS, RDBMS, OODBMS, Query Processing, SQL, Concurrency
Management, Data warehousing and Data Mart
Contents
3.1 Database Management System
3.2 HDBMS: Hierarchical Database Management System
3.3 NDBMS: Network Database Management System
3.4 RDBMS
3.5 OODBMS
3.6 Query Language and SQL
3.7 Concurrency Control
3.8 Data Warehouse
3.9 What Is a Data Mart?
3.1. Introduction
A database is a collection of related data. A database management system (DBMS) is software designed to assist in the maintenance and utilization of large collections of data. The DBMS came into existence in the early 1960s with Charles Bachman's Integrated Data Store (IDS), generally regarded as the first general-purpose DBMS. In the late 1960s IBM brought out IMS, the Information Management System. In 1970 Edgar Codd at IBM proposed a new kind of database, the relational model underlying the RDBMS, and in the 1980s SQL (Structured Query Language) became the standard query language for relational systems. From 1980 to 1990 there were further advances in DBMSs, e.g. DB2 and ORACLE.
3.1.1 In general, data management consists of the following tasks
 Data capture: the task associated with gathering data as and when it originates.
 Data classification: captured data has to be classified based on its nature and intended usage.
 Data storage: the segregated data has to be stored properly.
 Data arranging: it is very important to arrange the data properly.
 Data retrieval: data will be required frequently for further processing; hence it is very important to create indexes so that data can be retrieved easily.
 Data maintenance: the task concerned with keeping the data up-to-date.
 Data verification: before storing the data, it must be verified for any errors.
 Data coding: data will be coded for easy reference.
 Data editing: re-arranging or modifying the data for presentation.
 Data transcription: the activity where data is converted from one form into another.
 Data transmission: the function where data is forwarded to the place where it will be used further.
3.1.2 Database
 A database may be defined in simple terms as a collection of data.
 A database is a collection of related data.
 A database can be of any size and of varying complexity.
 A database may be generated and maintained manually, or it may be computerized.
3.1.3 Database Management System
 A Database Management System (DBMS) is a collection of programs that enables users to create and maintain a database.
 The DBMS is hence a general-purpose software system that facilitates the processes of defining, constructing and manipulating databases for various applications.
 With a DBMS, the user is not required to write the procedures for managing the database.
 A DBMS provides an abstract view of data that hides the details.
 A DBMS is efficient to use since there is a wide variety of sophisticated techniques to store and retrieve the data.
 A DBMS takes care of concurrent access using some form of locking.
 A DBMS has a crash recovery mechanism that protects users from the effects of system failures.
 A DBMS has a good protection mechanism.
3.1.4 Characteristics of DBMS
 To incorporate the requirements of the organization, the system should be designed for easy maintenance.
 Information systems should allow interactive access to data to obtain new information without writing fresh programs.
 The system should be designed to co-relate different data to meet new requirements.
 An independent central repository, which gives the information and meaning of available data, is required.
 An integrated database will help in understanding the inter-relationships between data stored in different applications.
 The stored data should be made available for access by different users simultaneously.
 An automatic recovery feature has to be provided to overcome the problems of processing-system failure.
3.1.5 Advantages of DBMS
Due to its centralized nature, a database system can overcome the disadvantages of a file-based system.
1. Data independence:
Application programs should not be exposed to the details of data representation and storage; the DBMS provides an abstract view that hides these details.
2. Efficient data access:
The DBMS utilizes a variety of sophisticated techniques to store and retrieve data efficiently.
3. Data integrity and security:
Since data is accessed through the DBMS, it can enforce integrity constraints, e.g. when inserting salary information for an employee.
4. Data administration:
When users share data, centralizing its administration is an important task. Experienced professionals can minimize data redundancy and perform fine tuning, which reduces retrieval time.
5. Concurrent access and crash recovery:
The DBMS schedules concurrent access to the data and protects users from the effects of system failures.
6. Reduced application development time:
The DBMS supports important functions that are common to many applications.
A database management system (DBMS) is a collection of programs that enables users to create and maintain a database. The DBMS is a general-purpose software system that facilitates the processes of defining, constructing, manipulating and sharing databases among various users and applications. Defining a database involves specifying the data types, constraints and structures of the data to be stored in the database. This descriptive information is also stored in the database in the form of a database catalog or dictionary; it is called meta-data.
Manipulating the data includes querying the database to retrieve specific data. An application program accesses the database by sending queries or requests for data to the DBMS. Important functions provided by the DBMS include protecting the database and maintaining it over time.
3.1.6 Example of a Database (with a Conceptual Data Model)
Mini-world for the example: part of a UNIVERSITY environment.
Some mini-world entities:
 STUDENTs
 COURSEs
 SECTIONs (of COURSEs)
 (academic) DEPARTMENTs
 INSTRUCTORs
Some mini-world relationships:
 SECTIONs are of specific COURSEs
 STUDENTs take SECTIONs
 COURSEs have prerequisite COURSEs
 INSTRUCTORs teach SECTIONs
 COURSEs are offered by DEPARTMENTs
 STUDENTs major in DEPARTMENTs
Figure: Example of a simple Database (not reproduced)
Figure: Example of a Student File (not reproduced)
3.1.7 Architecture of DBMS
A commonly used approach to views of data is the three-level architecture suggested by ANSI/SPARC (American National Standards Institute/Standards Planning and Requirements Committee). ANSI/SPARC produced an interim report in 1972 followed by a final report in 1977. The reports proposed an architectural framework for databases. Under this approach, a database is considered as containing data about an enterprise. The three levels of the architecture are three different views of the data:
 External Level
 Conceptual View
 Internal View
The three-level database architecture allows a clear separation of the information's meaning (the conceptual view) from the external data representation and from the physical data structure layout. A database system that is able to separate the three different views of data is likely to be flexible and adaptable. This flexibility and adaptability is data independence.
We now briefly discuss the three different views.
(A) External Level
The external level is the view that the individual user of the database has. This view is often
a restricted view of the database and the same database may provide a number of different
views for different classes of users. In general, the end users and even the application
programmers are only interested in a subset of the database. For example, a department head
may only be interested in the departmental finances and student enrolments but not the
library information. The librarian would not be expected to have any interest in the
information about academic staff. The payroll office would have no interest in student
enrolments.
(B) Conceptual View
The conceptual view is the information model of the enterprise and contains the view of the
whole enterprise without any concern for the physical implementation. This view is normally
more stable than the other two views. In a database, it may be desirable to change the
internal view to improve performance while there has been no change in the conceptual view of
the database. The conceptual view is the overall community view of the database and it
includes all the information that is going to be represented in the database. The conceptual
view is defined by the conceptual schema which includes definitions of each of the various
types of data.
(C) Internal View
The internal view is the view about the actual physical storage of data. It tells us what data
is stored in the database and how. At least the following aspects are considered at this level:
 Storage allocation
 Access paths
 Miscellaneous
Efficiency considerations are the most important at this level and the data structures are
chosen to provide an efficient database. The internal view does not deal with the physical
devices directly. Instead it views a physical device as a collection of physical pages and
allocates space in terms of logical pages.
The separation of the conceptual view from the internal view enables us to provide a
logical description of the database without the need to specify physical structures. This is
often called physical data independence. Separating the external views from the conceptual
view enables us to change the conceptual view without affecting the external views. This
separation is sometimes called logical data independence.
Assuming the three-level view of the database, a number of mappings are needed to enable users to work with one of the external views. For example, the payroll office may have an external view of the database that consists of the following information only:
Staff number, name and address.
Staff tax information e.g. number of dependents.
Staff bank information where salary is deposited
Staff employment statuses, salary level, leave information etc.
The conceptual view of the database may contain academic staff, general staff, casual staff etc.
A mapping will need to be created where all the staff in the different categories are combined
into one category for the payroll office. The conceptual view would include information about each staff member's position, the date employment started, full-time or part-time status, etc. This will need to be mapped to the salary level for the payroll office. Also, if there is some change in the conceptual view, the external view can stay the same if the mapping is changed.
3.2. HDBMS: Hierarchical Database Management System
The hierarchical structure was used in early mainframe DBMS. Records' relationships form a
tree like model. This structure is simple but inflexible because the relationship is confined to a
one-to-many relationship. The IBM Information Management System (IMS) and the RDM
Mobile are examples of a hierarchical database system with multiple hierarchies over the same
data. RDM Mobile is a newly designed embedded database for a mobile computer system.
A hierarchical database model is a data model in which the data is organized into a tree-like
structure. The structure allows representing information using parent/child relationships: each
parent can have many children, but each child has only one parent (also known as a 1-to-many
relationship). All attributes of a specific record are listed under an entity type.
Data is represented in a hierarchical structure, or upside down tree.
In a hierarchical model, data is accessed by following the arrows, or path, beginning at the
leftmost segment. This path is known as the hierarchical path or the preorder traversal.
For example:
Consider the following sample data
In order to access the "Offering 4" data, the hierarchical path, beginning from the left,
would be:
Teacher 1 > Subject 1 > Offering 1 > Offering 2 > Teacher 2 > Subject 2 > Offering 3 >
Subject 3 > Offering 4
3.2.1 Description of Hierarchical Database
In the hierarchical structure, data is represented by a simple tree structure. The record type at the
top of the tree is usually known as the "root." The simple hierarchical structure consists of a root
and a single dependent record type. In general, the root may have any number of dependent
records, each of which may have any number of lower-level dependents, and so on, to any
number of levels.
The hierarchy view contains records of different types connected by links. Hierarchical
relationships of records are explicitly defined in the data structure. A parent record can have
many child records but a child record can have only one parent. There are no many-to-many
relationships between records. No dependent record within a hierarchical data structure can exist
without its parent record. For this reason, records must be seen in context.
 Strengths of Hierarchical Databases
The advantages of a hierarchical database are:
o Efficient representation of hierarchical structures,
o Efficient single-key search and access time (if the hierarchical structure corresponds to application views of the data),
o Fast update performance where locality of reference exists (locality of reference states that performance is significantly enhanced when the processing is close to the data being processed).
 Weaknesses of Hierarchical Databases
The disadvantages of a hierarchical database are:
o Lack of flexibility (non-hierarchical relationships are awkward to represent; redundancy may be required),
o Poor performance for non-hierarchical accesses,
o Lack of maintainability (changing relationships may require physical reorganization of data).
3.3. NDBMS: Network Database Management System
A DBMS is said to be a network DBMS when it organizes the data in a network structure; a network may have as many connections as needed.
In DBMS terms, a parent can have many children and a child can have more than one parent, so a network DBMS supports many-to-many relationships.
There are some differences between a hierarchical DBMS and a network DBMS:
 In a hierarchical DBMS, a child can have only one parent; in a network DBMS it can have more than one.
 Unlike a hierarchical DBMS, a network DBMS does not necessarily follow a downward tree structure; in some cases it may follow an upward tree structure.
The network model supports all types of relationships { one-to-one, one-to-many, many-to-many }, and data is represented in the form of nodes and links. The features of the network model are implemented by software such as IDMS.
Disadvantages of NDBMS: because of the complex relationships that occur, it cannot access huge amounts of data with speed and accuracy.
Example:
ABC College has two children, i.e. Department A and the College Library; this represents a one-to-many relationship. Even though there is no relation between Department A and the College Library, a student can be a member of both Department A and the College Library. So, as per this example, the student has two parents, which tells us this is the network DBMS model. This is a simple and good example of a network DBMS.
3.4. RDBMS
 Identifying a record uniquely:
– A record is identified uniquely by the primary key associated with it.
– The two concepts gel with each other in the sense that both RDBMS and object technologies believe that records and objects have existence beyond their properties.
– Student class – student_ID
– Salesperson class – salesperson_ID
– Object ID (implicit, hidden from the entire world) versus Primary Key (explicit, visible to the entire world).
 Mapping classes to tables
– If we need to store classes on a disk, there must be some way of mapping them to RDBMS structures.
– The moment a program is removed from the main memory of the computer, all the objects associated with that program also die.
– We therefore think of mapping objects to tables. Three mappings are possible:
o One object maps to exactly one table.
o More than one object maps to one table.
o One object maps to more than one table.
– Example: a Student class with two attributes, Student Name and Marks. Map it to an appropriate table, outline the corresponding table model for this class, and write SQL code corresponding to the table model (see the sketch below).
– Transforming the object model to the table model, we add a record identifier (student ID); the student ID is unique and not null (the primary key).
Mapping Binary Associations to tables
–
Binary associations can be largely classified into 2 types (a) many – to – many
association (b) one to many association.
–
Ex – many students choosing many subjects for their courses obtaining some
marks in each one (Many to many) and A school enrolling many students per
standard (one to many)
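A hedged sketch of both kinds of binary association, building on the student table above; the table and column names (subject, enrolment, standard and so on) are illustrative assumptions, not from the original:
-- Many-to-many: many students choose many subjects; the association
-- becomes a junction table that also carries the marks obtained.
CREATE TABLE subject (
  Subject_ID  INTEGER   NOT NULL,
  Title       CHAR(30),
  PRIMARY KEY (Subject_ID));
CREATE TABLE enrolment (
  Student_ID  INTEGER  NOT NULL,
  Subject_ID  INTEGER  NOT NULL,
  Marks       INTEGER,
  PRIMARY KEY (Student_ID, Subject_ID),          -- one row per (student, subject) pair
  FOREIGN KEY (Student_ID) REFERENCES student,
  FOREIGN KEY (Subject_ID) REFERENCES subject);
-- One-to-many: a standard (class) enrols many students; the key of the
-- "one" side is placed as a foreign key on the "many" side.
CREATE TABLE standard (
  Standard_ID  INTEGER  NOT NULL,
  PRIMARY KEY (Standard_ID));
ALTER TABLE student ADD Standard_ID INTEGER REFERENCES standard;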
 Super class and subclass tables
– Map the super class (Employee) to a table and the two subclasses (i.e. Manager and Clerk) to their two respective tables.

Employee Table (super class)
Attributes       Nulls allowed?
Employee ID      N
Employee Name    N
Age              Y
Grade            N

Manager Table (sub class)
Attributes               Nulls allowed?
Employee ID              N
Bonus                    Y
Number of Subordinates   N

Clerk Table (sub class)
Attributes                Nulls allowed?
Employee ID               N
Number of Pending Tasks   Y

Create table employee (
  Employee_ID    Integer   Not null,
  Employee_Name  Char(20)  Not null,
  Age            Integer,
  Grade          Char(10)  Not null,
  Primary key (Employee_ID));

Create table manager (
  Employee_ID       Integer  Not null,
  Bonus             Integer,
  Num_Subordinates  Integer,
  Primary key (Employee_ID),
  Foreign key (Employee_ID) References employee);

Create table clerk (
  Employee_ID        Integer  Not null,
  Num_Pending_Tasks  Integer,
  Primary key (Employee_ID),
  Foreign key (Employee_ID) References employee);

 Many subclass tables
– No super class table; instead we obtain the information from the subclass tables alone. All subclass tables must be self-sufficient.
 One super class table
– We have only one table in the design now, and no separate subclass tables (see the sketch below).
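A hedged sketch of the one-super-class-table alternative: the superclass and both subclasses collapse into a single table, with an assumed discriminator column Emp_Type (not in the original) and the subclass-specific columns left nullable:
CREATE TABLE employee_single (
  Employee_ID        INTEGER   NOT NULL,
  Employee_Name      CHAR(20)  NOT NULL,
  Age                INTEGER,
  Grade              CHAR(10)  NOT NULL,
  Emp_Type           CHAR(10),  -- 'MANAGER' or 'CLERK' (assumed discriminator)
  Bonus              INTEGER,   -- meaningful only for managers
  Num_Subordinates   INTEGER,   -- meaningful only for managers
  Num_Pending_Tasks  INTEGER,   -- meaningful only for clerks
  PRIMARY KEY (Employee_ID));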
3.5. OODBMS
– Object-Oriented Database Management Systems (OODBMS).
– An OODBMS provides persistent (permanent) storage for objects.
– An OODBMS is used in a multi-user client/server environment: it controls concurrent access to objects, provides locking mechanisms and transactional features, offers security features at the object level, and also ensures object backup and restoration.
– An OODBMS generally uses class definitions in a traditional OOP language, such as C++ or Java, to define, manipulate and retrieve data.
– Java and C++ objects are stored directly in the OODBMS; OODBMS specializations include OQL (Object Query Language), ODL (Object Definition Language) and OML (Object Manipulation Language).
• When should an OODBMS be used?
– Traditional business applications require data to be stored in the form of rows and columns. For example, a payroll application would need the employee details stored in one table, the payment details in another, and so on, all in tabular form; this makes conceptual understanding and retrieval of the data quite easy.
– In other applications it is not wise to split the data into rows and columns; instead, the data should be stored in its original form.
– An object is then stored as an object, not as rows and columns.
Figure: in an RDBMS objects are decomposed into tables; in an OODBMS objects are stored directly as objects (not reproduced)
 Advantages of OODBMS
– (1) Quicker access to information:
An OODBMS keeps track of objects via their unique object IDs. A search operation moves from one object to another via these IDs, not through complex foreign-key traversals.
– (2) Creating new data types:
An OODBMS does not restrict the types of data that can be stored, whereas an RDBMS provides a fixed number of data types, such as integers and strings.
– (3) Integration with OOP languages:
An OODBMS is actually an extension of an OOP language such as Java or C++, so there is no impedance mismatch between the language and the DBMS. (Impedance mismatch: when we execute a SELECT query from a C program that returns multiple rows, the C program stores them in a buffer and processes them one by one; the C program's view is one row at a time, but the RDBMS delivers multiple rows at one shot.) An OODBMS instead deals with whole objects, preserving the original characteristics of each object.
3.6 Query Language
 Query languages provide the tools related to database management: creating tables, querying the database for information, modifying the data in the database, deleting data, granting access to users and so on.
3.6.1 History of SQL
 Dr. E. F. Codd published the paper "A Relational Model of Data for Large Shared Data Banks" in June 1970 in the Association for Computing Machinery (ACM) journal, Communications of the ACM.
 Codd's model is now accepted as the definitive model for relational database management systems (RDBMS).
 The language Structured English Query Language (SEQUEL) was developed by IBM Corporation to use Codd's model.
 SEQUEL later became SQL (still pronounced "sequel").
 In 1979, Relational Software, Inc. (now Oracle) introduced the first commercially available implementation of SQL. Today, SQL is accepted as the standard RDBMS language.
3.6.2 Advantages of SQL
 High speed
o SQL queries can be used to retrieve large numbers of records from a database quickly and efficiently.
 Well-defined standards exist
o SQL databases use long-established standards adopted by ANSI and ISO; non-SQL databases do not adhere to any clear standard.
 No coding required
o Using standard SQL, it is easier to manage database systems without having to write substantial amounts of code.
 Emergence of ORDBMS
o Previously, SQL databases were synonymous with relational databases. With the emergence of object-oriented DBMS, object storage capabilities have been extended to relational databases.
3.6.3 Disadvantages of SQL
 Difficulty in interfacing
o Interfacing an SQL database is more complex than adding a few lines of code.
 More features implemented in proprietary ways
o Although SQL databases conform to ANSI and ISO standards, some databases resort to proprietary extensions to standard SQL to ensure vendor lock-in.
3.6.4 SQL data types
 BOOLEAN –
o A Boolean value: either true, false or null.
 CHAR (size) or CHARACTER (size) –
o A string of fixed length. The maximum size of a CHAR string is 1 billion characters.
 VARCHAR (size), LONGVARCHAR (size), CHARACTER VARYING (size), LONG CHARACTER VARYING (size), TEXT (size) or STRING (size) –
o A string of variable length. The size constraint of these string types does not have to be given and defaults to the maximum possible size of strings that the database is able to store. The maximum size of these string types is 1 billion characters.
 TINYINT –
o An 8-bit signed integer value. The range of TINYINT is -128 to 127.
 SMALLINT –
o A 16-bit signed integer value. The range of SMALLINT is -32768 to 32767.
 INTEGER or INT –
o A 32-bit signed integer value. The range of INTEGER is -2147483648 to 2147483647.
 BIGINT –
o A 64-bit signed integer value. The range of BIGINT is -9223372036854775808 to 9223372036854775807.
 FLOAT or DOUBLE –
o A 64-bit precision floating-point value. These types are analogous to the Java double type.
 REAL, NUMERIC or DECIMAL –
o A higher-precision numeric value. These numeric types are represented by java.math.BigDecimal and can therefore represent numeric values of any precision and scale.
 DATE –
o A day/month/year value. The DATE type does not have any near-time bounding issues and is able to represent dates many millennia in the future and the past.
 TIME –
o A time-of-day value.
 TIMESTAMP –
o A day/month/year and time-of-day value. The TIMESTAMP type does not have any near-time bounding issues and is able to represent dates many millennia in the future and the past. The TIMESTAMP type is internally represented by java.util.Date.
 BINARY (size), VARBINARY (size) or LONGVARBINARY (size) –
o A variable-sized binary object. The size constraint is optional and defaults to the maximum size. The maximum size of a binary object is 2 billion bytes.
3.6.5 SQL Data Creation
 Creating a basic table involves naming the table and defining its columns and each column's data type.
 The SQL CREATE TABLE statement is used to create a new table.
 Syntax –
CREATE TABLE table_name(
column1 datatype,
column2 datatype,
column3 datatype,
.....
columnN datatype,
PRIMARY KEY( one or more columns )
);
 CREATE TABLE is the keyword telling the database system what you want to do. In this case, you want to create a new table.
 The unique name or identifier for the table follows the CREATE TABLE statement.
 Then in brackets comes the list defining each column in the table and what sort of data type it is.
 The syntax becomes clearer with the example below.
 A copy of an existing table can be created using a combination of the CREATE TABLE statement and the SELECT statement (see the sketch after the example).
 Ex –
SQL> CREATE TABLE CUSTOMERS(
ID INT NOT NULL,
NAME VARCHAR (20) NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR (25),
SALARY DECIMAL (18, 2),
PRIMARY KEY (ID)
);
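A minimal sketch of copying a table with CREATE TABLE plus SELECT, as mentioned above; the name CUSTOMERS_BKP is an assumption, and the exact form (CREATE TABLE ... AS SELECT versus SELECT ... INTO) varies by vendor:
SQL> CREATE TABLE CUSTOMERS_BKP AS
     SELECT ID, NAME, AGE, ADDRESS, SALARY
     FROM CUSTOMERS;    -- copies both the structure and the data of CUSTOMERS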
3.6.6 Retrieval and manipulation of data
SQL Logical Operators
ALL – The ALL operator is used to compare a value to all values in another value set.
AND – The AND operator allows the existence of multiple conditions in an SQL statement's WHERE clause.
ANY – The ANY operator is used to compare a value to any applicable value in the list according to the condition.
BETWEEN – The BETWEEN operator is used to search for values that are within a set of values, given the minimum value and the maximum value.
EXISTS – The EXISTS operator is used to search for the presence of a row in a specified table that meets certain criteria.
IN – The IN operator is used to compare a value to a list of literal values that have been specified.
LIKE – The LIKE operator is used to compare a value to similar values using wildcard operators.
NOT – The NOT operator reverses the meaning of the logical operator with which it is used.
OR – The OR operator is used to combine multiple conditions in an SQL statement's WHERE clause.
IS NULL – The IS NULL operator is used to compare a value with a NULL value.
UNIQUE – The UNIQUE operator searches every row of a specified table for uniqueness (no duplicates).
SQL Aggregate Functions
SQL aggregate functions return a single value, calculated from values in a column.
Useful aggregate functions:
AVG() - Returns the average value
COUNT() - Returns the number of rows
FIRST() - Returns the first value
LAST() - Returns the last value
MAX() - Returns the largest value
MIN() - Returns the smallest value
SUM() - Returns the sum
SQL Scalar Functions
SQL scalar functions return a single value, based on the input value.
Useful scalar functions:
UCASE() - Converts a field to upper case
LCASE() - Converts a field to lower case
MID() - Extracts characters from a text field
LEN() - Returns the length of a text field
ROUND() - Rounds a numeric field to the number of decimals specified
NOW() - Returns the current system date and time
FORMAT() - Formats how a field is to be displayed
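A short illustration of aggregate functions against the CUSTOMERS table used throughout this section (MAX, MIN, AVG, SUM and COUNT are standard; several of the scalar names above follow MS Access/MySQL conventions):
SQL> SELECT COUNT(*), MAX(SALARY), AVG(SALARY)
     FROM CUSTOMERS;
-- For the sample data below this returns 6, 8500.00 and 4166.67.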
(a) Select
o The SQL SELECT statement is used to fetch data from a database table; it returns the data in the form of a result table. These result tables are called result-sets.
Syntax
SELECT column1, column2, columnN FROM table_name;
SELECT * FROM table_name;
Consider the CUSTOMERS table:
ID   NAME      AGE   ADDRESS     SALARY
1    Ramesh    32    Ahmedabad   2000.00
2    Khilan    25    Delhi       1500.00
3    kaushik   23    Kota        2000.00
4    Chaitali  25    Mumbai      6500.00
5    Hardik    27    Bhopal      8500.00
6    Komal     22    MP          4500.00
o SQL> SELECT ID, NAME, SALARY FROM CUSTOMERS;
ID   NAME      SALARY
1    Ramesh    2000.00
2    Khilan    1500.00
3    kaushik   2000.00
4    Chaitali  6500.00
5    Hardik    8500.00
6    Komal     4500.00
(b) Where
o The SQL WHERE clause is used to specify a condition while fetching data from a single table or joining multiple tables.
o Only if the given condition is satisfied does it return specific values from the table.
o You would use the WHERE clause to filter the records and fetch only the necessary records.
o The WHERE clause is not only used in the SELECT statement; it is also used in the UPDATE and DELETE statements, which we examine below.
o Specify a condition using comparison or logical operators like >, <, =, LIKE, NOT etc. The example below makes this concept clear.
Syntax
SELECT column1, column2, columnN
FROM table_name
WHERE [condition]
Ex –
SQL> SELECT ID, NAME, SALARY
FROM CUSTOMERS
WHERE SALARY > 2000;
ID   NAME      SALARY
4    Chaitali  6500.00
5    Hardik    8500.00
6    Komal     4500.00
(c) Drop
 The SQL DROP TABLE statement is used to remove a table definition and all data, indexes, triggers, constraints, and permission specifications for that table.
 You have to be careful while using this command, because once a table is deleted, all the information available in the table is lost forever.
 Syntax –
o DROP TABLE table_name;
 Ex – first verify the CUSTOMERS table with DESC:
Field    Type           Null  Key  Default
ID       int(11)        NO    PRI
NAME     varchar(20)    NO
AGE      int(11)        NO
ADDRESS  char(25)       YES        NULL
SALARY   decimal(18,2)  YES        NULL
SQL> DROP TABLE CUSTOMERS;
After the drop, a DESC CUSTOMERS reports that the table no longer exists.
(d) Insert
 The SQL INSERT INTO statement is used to add new rows of data to a table in the database.
Syntax
INSERT INTO TABLE_NAME (column1, column2, column3,...columnN)
VALUES (value1, value2, value3,...valueN);
o Here column1, column2,...columnN are the names of the columns in the table into which you want to insert data.
o You may not need to specify the column names in the SQL query if you are adding values for all the columns of the table. But make sure the order of the values is the same as the order of the columns in the table; in that case the syntax is simply INSERT INTO TABLE_NAME VALUES (value1, value2,...valueN);
 Ex –
o INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (1, 'Ramesh', 32, 'Ahmedabad', 2000.00);
o INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (2, 'Khilan', 25, 'Delhi', 1500.00);
o INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (3, 'kaushik', 23, 'Kota', 2000.00);
o INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (4, 'Chaitali', 25, 'Mumbai', 6500.00);
o INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (5, 'Hardik', 27, 'Bhopal', 8500.00);
o INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (6, 'Komal', 22, 'MP', 4500.00);
The resulting CUSTOMERS table:
ID   NAME      AGE   ADDRESS     SALARY
1    Ramesh    32    Ahmedabad   2000.00
2    Khilan    25    Delhi       1500.00
3    kaushik   23    Kota        2000.00
4    Chaitali  25    Mumbai      6500.00
5    Hardik    27    Bhopal      8500.00
6    Komal     22    MP          4500.00
(e) Like
 The SQL LIKE clause is used to compare a value to similar values using wildcard operators. There are two wildcards used in conjunction with the LIKE operator:
 The percent sign (%)
 The underscore (_)
 The percent sign represents zero, one, or multiple characters. The underscore represents a single number or character.
 The symbols can be used in combination.
Syntax:
The basic syntax of % and _ is as follows:
SELECT * FROM table_name
WHERE column LIKE 'XXXX%'
or
SELECT * FROM table_name
WHERE column LIKE '%XXXX%'
or
SELECT * FROM table_name
WHERE column LIKE 'XXXX_'
or
SELECT * FROM table_name
WHERE column LIKE '_XXXX'
or
SELECT * FROM table_name
WHERE column LIKE '_XXXX_'
 You can combine any number of conditions using the AND or OR operators. Here XXXX could be any numeric or string value.
 Ex – here are examples showing a WHERE part with different LIKE clauses using the '%' and '_' wildcards:
WHERE SALARY LIKE '200%'   – Finds any values that start with 200
WHERE SALARY LIKE '%200%'  – Finds any values that have 200 in any position
WHERE SALARY LIKE '_00%'   – Finds any values that have 00 in the second and third positions
WHERE SALARY LIKE '2_%_%'  – Finds any values that start with 2 and are at least 3 characters in length
WHERE SALARY LIKE '%2'     – Finds any values that end with 2
WHERE SALARY LIKE '_2%3'   – Finds any values that have a 2 in the second position and end with a 3
WHERE SALARY LIKE '2___3'  – Finds any values in a five-digit number that start with 2 and end with 3
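A concrete query using the first pattern above against the sample CUSTOMERS table (many systems implicitly convert the numeric SALARY column to a string for LIKE):
SQL> SELECT ID, NAME, SALARY
     FROM CUSTOMERS
     WHERE SALARY LIKE '200%';
ID   NAME     SALARY
1    Ramesh   2000.00
3    kaushik  2000.00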
(f) Order by
 The SQL ORDER BY clause is used to sort the data in ascending or descending order, based on one or more columns.
 Most databases sort query results in ascending order by default.
Syntax:
The basic syntax of the ORDER BY clause is as follows:
SELECT column-list
FROM table_name
[WHERE condition]
[ORDER BY column1, column2, .. columnN] [ASC | DESC];
 Example: consider the CUSTOMERS table
ID   NAME      AGE   ADDRESS     SALARY
1    Ramesh    32    Ahmedabad   2000.00
2    Khilan    25    Delhi       1500.00
3    kaushik   23    Kota        2000.00
4    Chaitali  25    Mumbai      6500.00
5    Hardik    27    Bhopal      8500.00
6    Komal     22    MP          4500.00
SQL> SELECT * FROM CUSTOMERS
ORDER BY NAME;
ID   NAME      AGE   ADDRESS     SALARY
4    Chaitali  25    Mumbai      6500.00
5    Hardik    27    Bhopal      8500.00
3    kaushik   23    Kota        2000.00
2    Khilan    25    Delhi       1500.00
6    Komal     22    MP          4500.00
1    Ramesh    32    Ahmedabad   2000.00
(g) Group By
 The SQL GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups.
 The GROUP BY clause follows the WHERE clause in a SELECT statement and precedes the ORDER BY clause.
Syntax:
SELECT column1, column2
FROM table_name
WHERE [ conditions ]
GROUP BY column1, column2
ORDER BY column1, column2
o Ex – consider the CUSTOMERS table:
ID   NAME      AGE   ADDRESS     SALARY
1    Ramesh    32    Ahmedabad   2000.00
2    Khilan    25    Delhi       1500.00
3    kaushik   23    Kota        2000.00
4    Chaitali  25    Mumbai      6500.00
5    Hardik    27    Bhopal      8500.00
6    Komal     22    MP          4500.00
o SQL> SELECT NAME, SUM (SALARY) FROM CUSTOMERS
GROUP BY NAME;
NAME      SUM (SALARY)
Chaitali  6500.00
Hardik    8500.00
kaushik   2000.00
Khilan    1500.00
Komal     4500.00
Ramesh    2000.00
(h) Update
o The SQL UPDATE query is used to modify the existing records in a table.
o You can use a WHERE clause with an UPDATE query to update selected rows; otherwise all the rows would be affected.
Syntax:
UPDATE table_name
SET column1 = value1, column2 = value2...., columnN = valueN
WHERE [condition];
Ex – consider the CUSTOMERS table:
ID   NAME      AGE   ADDRESS     SALARY
1    Ramesh    32    Ahmedabad   2000.00
2    Khilan    25    Delhi       1500.00
3    kaushik   23    Kota        2000.00
4    Chaitali  25    Mumbai      6500.00
5    Hardik    27    Bhopal      8500.00
6    Komal     22    MP          4500.00
SQL> UPDATE CUSTOMERS
SET ADDRESS = 'Pune'
WHERE ID = 6;
ID   NAME      AGE   ADDRESS     SALARY
1    Ramesh    32    Ahmedabad   2000.00
2    Khilan    25    Delhi       1500.00
3    kaushik   23    Kota        2000.00
4    Chaitali  25    Mumbai      6500.00
5    Hardik    27    Bhopal      8500.00
6    Komal     22    Pune        4500.00
(i) AND & OR
o The SQL AND and OR operators are used to combine multiple conditions to narrow down the data in an SQL statement.
o These two operators are called conjunctive operators.
o They provide a means to make multiple comparisons with different operators in the same SQL statement.
Syntax
SELECT column1, column2, columnN
FROM table_name
WHERE [condition1] AND [condition2]...AND [conditionN];
Ex – consider the CUSTOMERS table:
ID   NAME      AGE   ADDRESS     SALARY
1    Ramesh    32    Ahmedabad   2000.00
2    Khilan    25    Delhi       1500.00
3    kaushik   23    Kota        2000.00
4    Chaitali  25    Mumbai      6500.00
5    Hardik    27    Bhopal      8500.00
6    Komal     22    Pune        4500.00
SQL> SELECT ID, NAME, SALARY
FROM CUSTOMERS
WHERE SALARY > 2000 AND age < 25;
ID   NAME   SALARY
6    Komal  4500.00
 The OR operator is used to combine multiple conditions in an SQL statement's WHERE clause.
Ex –
SQL> SELECT ID, NAME, SALARY
FROM CUSTOMERS
WHERE SALARY > 2000 OR age < 25;
ID   NAME      SALARY
3    kaushik   2000.00
4    Chaitali  6500.00
5    Hardik    8500.00
6    Komal     4500.00
(j) Sub query
 A subquery, or inner query, or nested query is a query within another SQL query, embedded within the WHERE clause.
 A subquery is used to return data that will be used in the main query as a condition to further restrict the data to be retrieved.
 Subqueries can be used with the SELECT, INSERT, UPDATE, and DELETE statements along with operators like =, <, >, >=, <=, IN, BETWEEN etc.
 There are a few rules that subqueries must follow:
 Subqueries must be enclosed within parentheses.
 A subquery can have only one column in the SELECT clause, unless multiple columns are in the main query for the subquery to compare its selected columns.
 An ORDER BY cannot be used in a subquery, although the main query can use an ORDER BY. The GROUP BY can be used to perform the same function as the ORDER BY in a subquery.
 Subqueries that return more than one row can only be used with multiple-value operators, such as the IN operator.
 The SELECT list cannot include any references to values that evaluate to a BLOB, ARRAY, CLOB, or NCLOB.
 A subquery cannot be immediately enclosed in a set function.
 The BETWEEN operator cannot be used with a subquery; however, BETWEEN can be used within the subquery.
Ex –
SQL> SELECT *
FROM CUSTOMERS
WHERE ID IN (SELECT ID
FROM CUSTOMERS
WHERE SALARY > 4500);
ID   NAME      AGE   ADDRESS   SALARY
4    Chaitali  25    Mumbai    6500.00
5    Hardik    27    Bhopal    8500.00
Subqueries with the INSERT Statement:
 Subqueries can also be used with INSERT statements. The INSERT statement uses the data returned from the subquery to insert into another table.
 The selected data in the subquery can be modified with any of the character, date, or number functions.
Syntax
INSERT INTO table_name [ (column1 [, column2 ]) ]
SELECT [ * | column1 [, column2 ] ]
FROM table1 [, table2 ]
[ WHERE VALUE OPERATOR ]
Subqueries with the UPDATE Statement:
 The subquery can be used in conjunction with the UPDATE statement. Either single or multiple columns in a table can be updated when using a subquery with the UPDATE statement.
Syntax
UPDATE table
SET column_name = new_value
[ WHERE OPERATOR [ VALUE ]
(SELECT COLUMN_NAME
FROM TABLE_NAME)
[ WHERE ] ]
Subqueries with the DELETE Statement:
 The subquery can be used in conjunction with the DELETE statement, as with the other statements mentioned above.
Syntax
DELETE FROM TABLE_NAME
[ WHERE OPERATOR [ VALUE ]
(SELECT COLUMN_NAME
FROM TABLE_NAME)
[ WHERE ] ]
Concrete sketches of all three forms follow.
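Hedged, concrete sketches of the three syntaxes above, assuming a backup table CUSTOMERS_BKP with the same structure as CUSTOMERS (the backup table is introduced here only for illustration):
-- INSERT with a subquery: copy the high earners into the backup table.
SQL> INSERT INTO CUSTOMERS_BKP
     SELECT * FROM CUSTOMERS
     WHERE SALARY > 4500;
-- UPDATE with a subquery: halve the salary of customers whose age
-- appears among the ages recorded in the backup table.
SQL> UPDATE CUSTOMERS
     SET SALARY = SALARY * 0.50
     WHERE AGE IN (SELECT AGE FROM CUSTOMERS_BKP WHERE AGE >= 27);
-- DELETE with a subquery: remove the corresponding rows.
SQL> DELETE FROM CUSTOMERS
     WHERE AGE IN (SELECT AGE FROM CUSTOMERS_BKP WHERE AGE >= 27);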
(k) Views
 A view is nothing more than an SQL statement that is stored in the database with an associated name.
 A view is actually a composition of a table in the form of a predefined SQL query.
 A view can contain all rows of a table or selected rows from a table.
 A view can be created from one or many tables, depending on the SQL query written to create the view.
 Views, which are a kind of virtual table, allow users to do the following:
 Structure data in a way that users or classes of users find natural or intuitive.
 Restrict access to the data such that a user can see and (sometimes) modify exactly what they need and no more.
 Summarize data from various tables, which can be used to generate reports.
 Database views are created using the CREATE VIEW statement. Views can be created from a single table, multiple tables, or another view.
 To create a view, a user must have the appropriate system privilege according to the specific implementation.
Syntax
CREATE VIEW view_name AS
SELECT column1, column2.....
FROM table_name
WHERE [condition];
Consider the CUSTOMERS table:
ID   NAME      AGE   ADDRESS     SALARY
1    Ramesh    32    Ahmedabad   2000.00
2    Khilan    25    Delhi       1500.00
3    kaushik   23    Kota        2000.00
4    Chaitali  25    Mumbai      6500.00
5    Hardik    27    Bhopal      8500.00
6    Komal     22    Pune        4500.00
Ex –
SQL> CREATE VIEW CUSTOMERS_VIEW AS
SELECT name, age
FROM CUSTOMERS;
SQL> SELECT * FROM CUSTOMERS_VIEW;
NAME      AGE
Ramesh    32
Khilan    25
kaushik   23
Chaitali  25
Hardik    27
Komal     22
Updating a View:
A view can be updated under certain conditions:
 The SELECT clause may not contain the keyword DISTINCT.
 The SELECT clause may not contain summary functions.
 The SELECT clause may not contain set functions.
 The SELECT clause may not contain set operators.
 The SELECT clause may not contain an ORDER BY clause.
 The FROM clause may not contain multiple tables.
 The WHERE clause may not contain subqueries.
 The query may not contain GROUP BY or HAVING.
 Calculated columns may not be updated.
 All NOT NULL columns from the base table must be included in the view in order for the INSERT query to function.
Ex –
SQL> UPDATE CUSTOMERS_VIEW
SET AGE = 35
WHERE name='Ramesh';
The update propagates to the base table:
ID   NAME      AGE   ADDRESS     SALARY
1    Ramesh    35    Ahmedabad   2000.00
2    Khilan    25    Delhi       1500.00
3    kaushik   23    Kota        2000.00
4    Chaitali  25    Mumbai      6500.00
5    Hardik    27    Bhopal      8500.00
6    Komal     22    Pune        4500.00
3.7. Concurrency Control
 Concurrency control coordinates simultaneous transaction execution in a multiprocessing database and ensures the serializability of transactions in a multi-user database environment.
 There are three main potential problems in multi-user environments: lost updates, uncommitted data, and inconsistent retrievals.
(I) Lost updates
 Ex – assume that two concurrent transactions (T1, T2) occur on a PRODUCT table which records a product's quantity on hand (PROD_QOH). The transactions are:
Transaction              Computation
T1: Purchase 100 units   PROD_QOH = PROD_QOH + 100
T2: Sell 30 units        PROD_QOH = PROD_QOH - 30
 Table 2 Normal Execution of Two Transactions
 Table 3 Lost Updates
o The first transaction (T1) has not yet been committed when the second transaction (T2) is executed.
o T2 still operates on the original value 35, and its subtraction yields 5 in memory.
o T1 writes the value 135 to disk, which is promptly overwritten by T2's write of 5, so T1's update is lost. (A sketch of preventing this with row-level locking follows.)
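A minimal sketch of preventing the lost update with row-level locking; the table and column names follow the example above, while the product code value and the BEGIN/COMMIT syntax (which varies by vendor) are assumptions. SELECT ... FOR UPDATE holds a lock on the row until COMMIT, so T2 cannot read the stale value 35:
-- T1: purchase 100 units
BEGIN;                                        -- start the transaction
SELECT PROD_QOH FROM PRODUCT
 WHERE PROD_CODE = '345TYX'                   -- assumed key value
 FOR UPDATE;                                  -- lock the row; T2's read must now wait
UPDATE PRODUCT SET PROD_QOH = PROD_QOH + 100
 WHERE PROD_CODE = '345TYX';
COMMIT;                                       -- release the lock; T2 then sees 135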
(II) Uncommitted Data
o Uncommitted data occurs when two transactions, T1 and T2, are executed concurrently and the first transaction (T1) is rolled back after the second transaction (T2) has already accessed the uncommitted data, thus violating the isolation property of transactions.
o The transactions are:
Transaction              Computation
T1: Purchase 100 units   PROD_QOH = PROD_QOH + 100 (rolled back)
T2: Sell 30 units        PROD_QOH = PROD_QOH - 30
 Table 4 Correct Execution of Two Transactions
 Table 5 An Uncommitted Data Problem
(III) Inconsistent Retrievals
 Inconsistent retrievals occur when a transaction calculates some summary (aggregate) functions over a set of data while other transactions are updating the data.
 The transaction might read some data before they are changed and other data after they are changed, thereby yielding inconsistent results.
 T1 calculates the total quantity on hand of the products stored in the PRODUCT table.
 T2 updates PROD_QOH for two of the PRODUCT table's products.
 Table 6 Retrieval During Update
 Table 7 Transaction Results: Data Entry Correction
 Table 8 Transaction Result: Data Entry Correction
 The transaction table in Table 8 demonstrates that inconsistent retrievals are possible during transaction execution, making the result of T1's execution incorrect.
 Unless the DBMS exercises concurrency control, a multi-user database environment can create chaos within the information system.
3.7.1 The Scheduler – Schedule, Serializability, Recovery, Isolation
 The previous examples executed the operations within a transaction in an arbitrary order:
 As long as two transactions, T1 and T2, access unrelated data, there is no conflict, and the order of execution is irrelevant to the final outcome.
 If the transactions operate on related (or the same) data, conflict is possible among the transaction components, and the selection of one operational order over another may have some undesirable consequences.
The scheduler:
 Establishes the order of concurrent transaction execution.
 Interleaves the execution of database operations to ensure serializability.
 Bases its actions on concurrency control algorithms: locking and time stamping.
 Ensures efficient use of the computer's CPU.
 A first-come-first-served (FCFS) basis is used for all transactions if there is no other way to schedule their execution.
 Within a multi-user DBMS environment, FCFS scheduling tends to yield unacceptable response times.
 The scheduler watches for READ and/or WRITE actions that can produce conflicts.
 Table 9 Read/Write Conflict Scenarios: Conflicting Database Operations Matrix
Schedules –
 A schedule is a sequence that indicates the chronological order in which the instructions of concurrent transactions are executed. A schedule for a set of transactions must consist of all instructions of those transactions and must preserve the order in which the instructions appear in each individual transaction.
 Examples of schedules:
 Schedule 1 (figure omitted): let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B. This is a serial schedule, in which T1 is followed by T2.
 Schedule 2 (figure omitted): let T1 and T2 be the transactions defined previously. This schedule is not a serial schedule, but it is equivalent to Schedule 1.
 Schedule 3 (figure omitted): this concurrent schedule does not preserve the value of the sum A + B.
Serializability
 A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of:
1. Conflict serializability
2. View serializability
 Conflict serializability: instructions Ii and Ij of transactions Ti and Tj respectively conflict if and only if there exists some item Q accessed by both Ii and Ij, and at least one of these instructions wrote Q.
o Ii = read(Q), Ij = read(Q): Ii and Ij don't conflict.
o Ii = read(Q), Ij = write(Q): they conflict.
o Ii = write(Q), Ij = read(Q): they conflict.
o Ii = write(Q), Ij = write(Q): they conflict.
 If a schedule S can be transformed into a schedule S' by a series of swaps of non-conflicting instructions, we say that S and S' are conflict equivalent.
 We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule.
 View serializability: let S and S' be two schedules with the same set of transactions. S and S' are view equivalent if the following three conditions are met:
 For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S', also read the initial value of Q.
 For each data item Q, if transaction Ti executes read(Q) in schedule S, and that value was produced by transaction Tj (if any), then transaction Ti must in schedule S' also read the value of Q that was produced by transaction Tj.
 For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S'.
 As can be seen, view equivalence is based purely on reads and writes alone.
 A schedule S is view serializable if it is view equivalent to a serial schedule.
 Every conflict serializable schedule is also view serializable.
 There exist schedules (figure omitted) that are view serializable but not conflict serializable.
 Every view serializable schedule that is not conflict serializable has blind writes.
 Other notions of serializability
 There are schedules (figure omitted) that produce the same outcome as the serial schedule <T1, T5>, yet are not conflict equivalent or view equivalent to it.
 Determining such equivalence requires analysis of operations other than read and write.
Recoverability
 We need to address the effect of transaction failures on concurrently running transactions.
 Recoverable schedule – if a transaction Tj reads a data item previously written by a transaction Ti, the commit operation of Ti appears before the commit operation of Tj.
 A schedule (figure omitted) is not recoverable if, say, T9 reads a value written by T8 and commits immediately after the read.
 If T8 should then abort, T9 would have read (and possibly shown to the user) an inconsistent database state. Hence the database must ensure that schedules are recoverable.
 Cascading rollback – a single transaction failure leads to a series of transaction rollbacks. Consider a schedule where none of the transactions has yet committed (so the schedule is recoverable):
 If T10 fails, T11 and T12 must also be rolled back.
 This can lead to the undoing of a significant amount of work.
 Cascadeless schedules – cascading rollbacks cannot occur; for each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the read operation of Tj.
 Every cascadeless schedule is also recoverable.
 It is desirable to restrict the schedules to those that are cascadeless.
Implementation of Isolation
 Schedules must be conflict or view serializable, and recoverable, for the sake of database consistency, and preferably cascadeless.
 A policy in which only one transaction can execute at a time generates serial schedules, but provides a poor degree of concurrency.
 Concurrency-control schemes trade off between the amount of concurrency they allow and the amount of overhead that they incur.
 Some schemes allow only conflict-serializable schedules to be generated, while others allow view-serializable schedules that are not conflict-serializable.
3.7.2 Concurrency Control with Locking Methods
 A lock guarantees the current transaction exclusive use of a data item, i.e., transaction T2 does not have access to a data item that is currently being used by transaction T1.
 A transaction acquires a lock prior to data access.
 The lock is released when the transaction is completed.
 The DBMS automatically initiates and enforces locking procedures.
 All lock information is managed by the lock manager.
Lock Granularity
 Lock granularity indicates the level of lock use: database, table, page, row, or field (attribute).
Database-Level
 The entire database is locked.
 Transaction T2 is prevented from using any table in the database while T1 is being executed.
 Good for batch processes, but unsuitable for online multi-user DBMSs.
Table-Level
 The entire table is locked. If a transaction requires access to several tables, each table may be locked.
 Transaction T2 is prevented from using any row in the table while T1 is being executed.
 Two transactions can access the same database as long as they access different tables.
 It causes traffic jams when many transactions are waiting to access the same table.
 Table-level locks are not suitable for multi-user DBMSs.
Page-Level
 The DBMS locks an entire disk page, which is the equivalent of a disk block as a (referenced) section of a disk.
 A page has a fixed size; a table can span several pages, and a page can contain several rows of one or more tables.
 Page-level locking is currently the most frequently used multi-user DBMS locking method.
 T2 must wait to use a locked page on which a row is located if T1 is using it.
Row-Level
 Less restrictive than the previous levels, row-level locking allows concurrent transactions to access different rows of the same table even if the rows are located on the same page.
 It improves the availability of data, but requires a high overhead cost for management.
Field-Level
 Field-level locking allows concurrent transactions to access the same row, as long as they require the use of different fields (attributes) within that row.
 It gives the most flexible multi-user data access, but costs an extremely high level of computer overhead.
Lock Types
 The DBMS may use different lock types: binary or shared/exclusive locks.
 A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.
Binary Locks
 A binary lock has two states: locked (1) or unlocked (0).
 Locked objects are unavailable to other transactions.
 Unlocked objects are open to any transaction.
 A transaction unlocks an object when it completes.
 Every transaction requires a lock and unlock operation for each data item that is accessed.
Shared/Exclusive Locks
 Shared (S mode)
 Exists when concurrent transactions are granted READ access.
 Produces no conflict for read-only transactions.
 Issued when a transaction wants to read and no exclusive lock is held on the item.
 Exclusive (X mode)
 Exists when access is reserved for the locking transaction.
 Used when the potential for conflict exists (also refer to Table 9).
 Issued when a transaction wants to update unlocked data.
Lock-compatibility matrix
 A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions.
 Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on it.
 If a lock cannot be granted, the requesting transaction is made to wait until all incompatible locks held by other transactions have been released. The lock is then granted.
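The matrix itself, reconstructed from the two rules above (S = shared, X = exclusive; "yes" means the requested lock can be granted):
Requested lock   Held: S   Held: X
S                yes       no
X                no        no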
Reasons for increased lock manager overhead:
 The type of lock held must be known before a lock can be granted.
 Three lock operations exist: READ_LOCK (to check the type of lock), WRITE_LOCK (to issue the lock), and UNLOCK (to release the lock).
 The schema can be enhanced to allow a lock upgrade (from shared to exclusive) and a lock downgrade (from exclusive to shared).
Problems with Locking
 The transaction schedule may not be serializable; this is managed through two-phase locking.
 The schedule may create deadlocks; these are managed by using deadlock detection and prevention techniques.
Two-Phase Locking
 Two-phase locking defines how transactions acquire and relinquish locks.
 Growing phase – the transaction acquires all the required locks without unlocking any data. Once all locks have been acquired, the transaction is at its locked point.
 Shrinking phase – the transaction releases all locks and cannot obtain any new lock.
 Governing rules:
 Two transactions cannot have conflicting locks.
 No unlock operation can precede a lock operation in the same transaction.
 No data are affected until all locks are obtained.
 When the locked point is reached, the data are modified to conform to the transaction's requirements. The transaction is completed as it releases all of the locks it acquired in the first phase. (A sketch follows.)
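A minimal sketch of a two-phase transaction, assuming a DBMS with explicit LOCK TABLE statements (Oracle-style syntax) and assumed table names; in practice the DBMS issues such locks automatically:
-- Growing phase: acquire every lock before touching any data.
LOCK TABLE account IN SHARE MODE;         -- read lock
LOCK TABLE audit_log IN EXCLUSIVE MODE;   -- write lock; the locked point is reached here
-- The data are modified only after the locked point.
SELECT balance FROM account WHERE acct_no = 1001;
INSERT INTO audit_log VALUES (1001, 'balance checked');
-- Shrinking phase: COMMIT releases all locks at once; no new lock
-- may be requested after this point.
COMMIT;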
Deadlocks
 A deadlock occurs when two transactions wait for each other to unlock data. For example:
 T1 = access data items X and Y
 T2 = access data items Y and X
 Deadly embrace – if T1 has not unlocked data item Y, T2 cannot begin; if T2 has not unlocked data item X, T1 cannot continue.
Control techniques
 Deadlock prevention – a transaction requesting a new lock is aborted if there is the possibility that a deadlock can occur. If the transaction is aborted, all the changes made by this transaction are rolled back, and all locks obtained by the transaction are released.
3.8. What is a Data Warehouse?
A data warehouse is a relational database that is designed for query and analysis rather than for
transaction processing. It usually contains historical data derived from transaction data, but it can
include data from other sources. It separates analysis workload from transaction workload and
enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction,
transportation, transformation, and loading (ETL) solution, an online analytical processing
(OLAP) engine, client analysis tools, and other applications that manage the process of gathering
data and delivering it to business users.
A common way of introducing data warehousing is to refer to the characteristics of a data
warehouse as set forth by William Inmon:
 Subject Oriented
 Integrated
 Nonvolatile
 Time Variant
Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more about your
company's sales data, you can build a warehouse that concentrates on sales. Using this
warehouse, you can answer questions like "Who was our best customer for this item last year?"
This ability to define a data warehouse by subject matter, sales in this case, makes the data
warehouse subject oriented.
Integrated
Integration is closely related to subject orientation. Data warehouses must put data from
disparate sources into a consistent format. They must resolve such problems as naming conflicts
and inconsistencies among units of measure. When they achieve this, they are said to be
integrated.
Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical
because the purpose of a warehouse is to enable you to analyze what has occurred.
Time Variant
In order to discover trends in business, analysts need large amounts of data. This is very much in
contrast to online transaction processing (OLTP) systems, where performance requirements
demand that historical data be moved to an archive. A data warehouse's focus on change over
time is what is meant by the term time variant.
3.9.1 Data Warehouse Architecture (Basic)
Figure 1-2 shows a simple architecture for a data warehouse. End users directly access data
derived from several source systems through the data warehouse.
Figure 1-2 Architecture of a Data Warehouse
This illustrates three things:

• Data Sources (operational systems and flat files)

• Warehouse (metadata, summary data, and raw data)

• Users (analysis, reporting, and mining)
In Figure 1-2, the metadata and raw data of a traditional OLTP system is present, as is an
additional type of data, summary data. Summaries are very valuable in data warehouses because
they pre-compute long operations in advance. For example, a typical data warehouse query is to
retrieve something like August sales. A summary in Oracle is called a materialized view.
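For example, a monthly sales summary could be precomputed as a materialized view; in the sketch below, the SALES table and its columns are assumed names used only for illustration:

-- Precompute monthly sales totals so that a query such as
-- "August sales" reads the summary instead of scanning detail rows.
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT TO_CHAR(sale_date, 'YYYY-MM') AS sale_month,
       SUM(amount)                   AS total_amount
FROM   sales
GROUP  BY TO_CHAR(sale_date, 'YYYY-MM');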
Data Warehouse Architecture (with a Staging Area)
In Figure 1-2, you need to clean and process your operational data before putting it into the
warehouse. You can do this programmatically, although most data warehouses use a staging
area instead. A staging area simplifies building summaries and general warehouse management.
Figure 1-3 illustrates this typical architecture.
Figure 1-3 Architecture of a Data Warehouse with a Staging Area
This illustrates four things:

• Data Sources (operational systems and flat files)

• Staging Area (where data sources go before the warehouse)

• Warehouse (metadata, summary data, and raw data)

• Users (analysis, reporting, and mining)
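In practice, the staging area is often a set of tables that are loaded as-is from the sources and then cleansed before the warehouse load, roughly as in this sketch (the STG_SALES and WH_SALES names are assumptions):

-- Cleanse and standardize staged rows, then load them into the warehouse.
INSERT INTO wh_sales (sale_id, sale_date, region, amount)
SELECT sale_id,
       sale_date,
       UPPER(TRIM(region)),   -- resolve naming inconsistencies
       amount
FROM   stg_sales
WHERE  amount IS NOT NULL;    -- filter out unusable rows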
Data Warehouse Architecture (with a Staging Area and Data Marts)
Although the architecture in Figure 1-3 is quite common, you may want to customize your
warehouse's architecture for different groups within your organization. You can do this by
adding data marts, which are systems designed for a particular line of business. Figure 1-4
illustrates an example where purchasing, sales, and inventories are separated. In this example, a
financial analyst might want to analyze historical data for purchases and sales.
Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts
This illustrates five things:

• Data Sources (operational systems and flat files)

• Staging Area (where data sources go before the warehouse)

• Warehouse (metadata, summary data, and raw data)

• Data Marts (purchasing, sales, and inventory)

• Users (analysis, reporting, and mining)
Warehouse data modeling levels
There are three levels of data modeling: conceptual, logical, and physical. Each level of data
modeling has its own purpose in data warehouse design.
Conceptual
The high-level data model is a consistent definition of all business subject areas and data
elements common to the business, from a high-level business view down to a generic logical data
design. From this, you can derive the general scope and understanding of the business
requirements. This conceptual data model is the basis for both current and future phases of data
warehouse development.
Logical
The logical data model contains much more detailed information about the business subject
areas. It captures the detailed business requirements in the target business subject areas. It is the
basis for the physical data modeling for the current project.
Starting from this stage, the solution adopts a bottom-up approach, which means that only
the most important and urgent business subject areas are targeted in this logical data model.
The features of the logical data model include:

• Specifications for all entities and the relationships among them

• Specifications for each entity's attributes

• Specifications for all primary keys and foreign keys

• Normalization and aggregation

• Specification for the multidimensional data structure
Physical
The physical data modeling applies physical constraints, such as space, performance, and the
physical distribution of data. The physical data model is tightly related to the database system
and data warehouse tools that you will use. The purpose of this phase is to design the actual
physical implementation.
It is particularly important to clearly separate logical modeling from physical modeling. Good
logical modeling practice focuses on the essence of the problem domain. Logical modeling
addresses the "what" question. Physical modeling addresses the question of "how" for the model,
which represents implementation reality in a given computing environment. Since the business
computing environment changes from time to time, the separation of logical and physical data
modeling will help stabilize the logical models from phase to phase.
Figure 4. Data warehouse logical data model life cycle
Once a data warehouse is implemented and your customers begin using it, they will often
generate new requests and requirements. This will start another cycle of development, continuing
the iterative and evolutionary process of building the data warehouse. As you can see, the logical
data model is a living part of a data warehouse, used and maintained throughout the entire life
cycle of the data warehouse. The process of data warehouse modeling can be truly endless.
3.10. What Is a Data Mart?
A data mart is a simple form of a data warehouse that is focused on a single subject (or
functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled
by a single department within an organization. Given their single-subject focus, data marts
usually draw data from only a few sources. The sources could be internal operational systems, a
central data warehouse, or external data.
3.10.1. Dependent and Independent Data Marts
There are two basic types of data marts: dependent and independent. The categorization is based
primarily on the data source that feeds the data mart. Dependent data marts draw data from a
central data warehouse that has already been created. Independent data marts, in contrast, are
standalone systems built by drawing data directly from operational or external sources of data, or
both.
The main difference between independent and dependent data marts is how you populate the data
mart; that is, how you get data out of the sources and into the data mart. This step, called the
Extraction, Transformation, and Loading (ETL) process, involves moving data from operational
systems, filtering it, and loading it into the data mart.
With dependent data marts, this process is somewhat simplified because formatted and
summarized (clean) data has already been loaded into the central data warehouse. The ETL
process for dependent data marts is mostly a process of identifying the right subset of data
relevant to the chosen data mart subject and moving a copy of it, perhaps in a summarized form.
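For a dependent data mart, that ETL step can therefore be as simple as copying a summarized subset of warehouse rows, as in this sketch (WH_SALES and SALES_MART are assumed names):

-- Populate the sales data mart from the central warehouse:
-- only the relevant subject area, summarized by month and region.
INSERT INTO sales_mart (sale_month, region, total_amount)
SELECT TO_CHAR(sale_date, 'YYYY-MM'),
       region,
       SUM(amount)
FROM   wh_sales
GROUP  BY TO_CHAR(sale_date, 'YYYY-MM'), region;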
With independent data marts, however, you must deal with all aspects of the ETL process, much
as you do with a central data warehouse. The number of sources is likely to be smaller, and the
amount of data associated with the data mart is less than in a warehouse, given the focus on a
single subject.
The motivations behind the creation of these two types of data marts are also typically different.
Dependent data marts are usually built to achieve improved performance and availability, better
control, and lower telecommunication costs resulting from local access of data relevant to a
specific department. The creation of independent data marts is often driven by the need to have a
solution within a shorter time.
3.10.2 What Are the Steps in Implementing a Data Mart?
Simply stated, the major steps in implementing a data mart are to design the schema, construct
the physical storage, populate the data mart with data from source systems, access it to make
informed decisions, and manage it over time.
This section contains the following topics:

• "Designing"

• "Constructing"

• "Populating"

• "Accessing"

• "Managing"
Designing
The design step is first in the data mart process. This step covers all of the tasks from initiating
the request for a data mart through gathering information about the requirements, to developing
the logical and physical design of the data mart. The design step involves the following tasks:

• Gathering the business and technical requirements

• Identifying data sources

• Selecting the appropriate subset of data

• Designing the logical and physical structure of the data mart
Constructing
This step includes creating the physical database and the logical structures associated with the
data mart to provide fast and efficient access to the data. It involves the following tasks (a sketch follows the list):

• Creating the physical database and storage structures, such as tablespaces, associated with the data mart

• Creating the schema objects, such as tables and indexes, defined in the design step

• Determining how best to set up the tables and the access structures
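A minimal construction sketch in Oracle-style SQL might look as follows; the tablespace, table, and index names are assumptions:

-- Physical storage for the data mart.
CREATE TABLESPACE sales_mart_ts
  DATAFILE 'sales_mart01.dbf' SIZE 500M;

-- A schema object defined in the design step.
CREATE TABLE sales_mart (
    sale_month   VARCHAR2(7),
    region       VARCHAR2(30),
    total_amount NUMBER(12,2)
) TABLESPACE sales_mart_ts;

-- An access structure to make typical queries fast.
CREATE INDEX sales_mart_month_idx ON sales_mart (sale_month);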
Populating
The populating step covers all of the tasks related to getting the data from the source, cleaning it
up, modifying it to the right format and level of detail, and moving it into the data mart. More
formally stated, the populating step involves the following tasks:

• Mapping data sources to target data structures

• Extracting data

• Cleansing and transforming the data

• Loading data into the data mart

• Creating and storing metadata
Accessing
The accessing step involves putting the data to use: querying the data, analyzing it, creating
reports, charts, and graphs, and publishing these. Typically, the end user uses a graphical
front-end tool to submit queries to the database and display the results. The accessing
step requires that you perform the following tasks:

• Set up an intermediate layer for the front-end tool to use. This layer, the metalayer, translates database structures and object names into business terms, so that the end user can interact with the data mart using terms that relate to the business function (see the sketch after this list).

• Maintain and manage these business interfaces.

• Set up and manage database structures, like summarized tables, that help queries submitted through the front-end tool execute quickly and efficiently.
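One common way to build such a metalayer is with views that rename technical structures into business terms, as in this sketch (SALES_MART is the assumed table from the earlier examples):

-- The metalayer: expose technical columns under business-friendly names.
CREATE VIEW monthly_sales_by_region AS
SELECT sale_month   AS "Month",
       region       AS "Sales Region",
       total_amount AS "Total Sales"
FROM   sales_mart;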
Managing
This step involves managing the data mart over its lifetime. In this step, you perform
management tasks such as the following:

• Providing secure access to the data

• Managing the growth of the data

• Optimizing the system for better performance

• Ensuring the availability of data even with system failures
3.10.3 Difference between Data Warehousing and Data Mart

• It is important to note that there are significant differences between these two tools even though they may serve the same purpose. A data mart contains the programs, data, software, and hardware of a specific department of a company. There can be separate data marts for finance, sales, production, or marketing. All these data marts are different, but they can be coordinated. The data mart of one department is different from the data mart of another department, and though indexed, a data mart is not suitable for a huge database, as it is designed to meet the requirements of a particular department.

• Data warehousing is not limited to a particular department; it represents the database of a complete organization. The data stored in a data warehouse is more detailed, though indexing is light as it has to store huge amounts of information. It is also difficult to manage and takes a long time to process. It follows that data marts are quick and easy to use, as they make use of small amounts of data. Data warehousing is also more expensive for the same reason.
3.10.4 What Is Metadata?

Metadata is information about the data. For a data mart, metadata includes:

• A description of the data in business terms

• Format and definition of the data in system terms

• Data sources and frequency of refreshing data
The primary objective for the metadata management process is to provide a directory of technical
and business views of the data mart metadata. Metadata can be categorized as technical metadata
and business metadata.
Technical metadata consists of metadata created during the creation of the data mart, as well as
metadata to support the management of the data mart. This includes data acquisition rules, the
transformation of source data into the format required by the target data mart, and schedules for
backing up and refreshing data.
Business metadata allows end users to understand what information is available in the data mart
and how it can be accessed.
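As a concrete sketch, technical metadata is often kept in ordinary tables alongside the data mart; all names below are assumptions:

-- A simple technical-metadata table recording the source, refresh
-- schedule, and transformation rule for each data mart table.
CREATE TABLE mart_metadata (
    table_name     VARCHAR2(30),
    source_system  VARCHAR2(50),
    refresh_cycle  VARCHAR2(20),    -- e.g. 'DAILY', 'WEEKLY'
    transform_rule VARCHAR2(200)
);

INSERT INTO mart_metadata
VALUES ('SALES_MART', 'ORDER_ENTRY_OLTP', 'DAILY',
        'Summarize WH_SALES by month and region');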
3.10.5 Data modeling for data marts
Since warehouse end users interact directly with data marts, data mart modeling is one of the
most effective tools for capturing end-user business requirements. The data mart modeling
process depends on many factors. Three of the most important are described below.
• Data mart modeling is end-user-driven. End users must be involved in the data mart modeling process, as they obviously are the ones who will use the data mart. Because you should expect that end users are not at all familiar with complex data models, the modeling techniques and the modeling process as a whole should be organized so that complexity is transparent to end users.

• Data mart modeling is driven by business requirements. Data mart models are useful for capturing the business requirements because they are often used directly by end users and are easy to understand.

• Data mart modeling is greatly affected by data analysis technologies. The techniques of data analysis can impact the type of data models selected and their content. Several techniques for data analysis are in common use today: query and reporting, multidimensional analysis, and data mining.
If the intent is simply to provide query and reporting capability, an ER model with a
normalized or denormalized data structure would be most appropriate. A dimensional
data model might also be a good choice because it is user-friendly and performs better.
If the objective is to perform multidimensional data analysis, a dimensional data model
is the only choice. Data mining, however, usually works best with the lowest level of
detail available. Thus, if the data warehouse is used for data mining, a low level of
detailed data should be included in the model.
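As an illustration, a dimensional (star schema) model for a sales data mart could be sketched as follows, with assumed table and column names:

-- Star schema: one fact table surrounded by dimension tables.
CREATE TABLE dim_time    (time_id    NUMBER PRIMARY KEY, sale_month VARCHAR2(7));
CREATE TABLE dim_region  (region_id  NUMBER PRIMARY KEY, region     VARCHAR2(30));
CREATE TABLE dim_product (product_id NUMBER PRIMARY KEY, product    VARCHAR2(50));

CREATE TABLE fact_sales (
    time_id    NUMBER REFERENCES dim_time,
    region_id  NUMBER REFERENCES dim_region,
    product_id NUMBER REFERENCES dim_product,
    amount     NUMBER(12,2)
);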