Download File - Be the change you wanna see in this world..

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Corecursion wikipedia , lookup

Transcript
UNIT 1 INTRODUCTION TO DBMS
1. File system organisation
1. Computer file contains information arranged in an electronic format. It also facilitates easy
storage, retrieval, and manipulation of data.
2. They are stored in the form bits and bytes. It has a name and the computer would recognize a
file based on this name.
3. A programmer working with this file can give instructions to the computer to open the file,
read from it, write to it, modify its contents, close it, and so on.
4. A program passes control to another in a sequence. This is called batch processing, where
no or minimum human interaction is required.
5. In many situations the program needs to be conversational. These days, the computer
performs both the searching and the answering operations in an automated manner.
6. A search that can take place at any time is called as an online query.
7. When an instantaneous answer is expected, it is called online processing or real-time
processing. Example: Airlines reservations
8. Data can be classified into two types:
 Master data -does not change with time.
 Transaction data -can change from time to time.
9. Example: Library Management
10. There is a library of books and a librarian to maintain it. The librarian has created one card
per book, which contains details such as book number, title, author, price and date of
purchase.
11. For this, the librarian has used the conceptual record layout as shown in the figure 1.1
Fig1.1Record layout for the Book file
12. A card is similar to a record and, in technical terms, the entire pile of cards is similar to a file.
13. A field used to identify a record is called as a record key, or just a key.
14. A record key can be of two types:
 Primary key: Identifies a record uniquely based on a field value. Example:Book number.
 Secondary key: May or may not identify a record uniquely, but can identify one or more
records based on a field value. Example: Author.
1.1 Sequential organisation
 A file is called as a sequential file when it contains records arranged in sequential fashion.
 The records are added as and when they are available. That is a new record is always added
to the end of the current file.
 Advantages of sequential organisation
 Simplicity:The sequential organisation of records is quite simple ant it just need to create
new record for the new books.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 1
 Less overhead: There is no need to keep any key or any other extra information on the
books file. The file is enough.
 Disadvantage
 Difficulties in searching:
 Searching can be a very slow process.
 It starts from the beginning and continue till the end, or until the desired record is
found, whichever is earlier. This is both time consuming and cumbersome.
 Lack of support for queries: To even find out whether something is available in the file
or not, the entire file has to be read.
 Problem with record deletion: It is not simple to delete records. The space freed by the
deletion of the record cannot be reclaimed.
1.2 pointers
 A pointer in a record is a special field, whose value is the address/reference of another record
in the same file.
 The special field forms a chain of records. A chain of records is a logical sequence of records
created by the use of pointer fields.
 These chains were called one-way chains.
 Problems with One-way Chains:
 They can suffer from the drawback of lost/damaged references.
 They are unidirectional by nature.
 Two-way Chains
1. In two-way chains another logical chain is created.
2. Add another chain in the reverse direction, such that there are two pointer fields. The new
pointer fields is called as the back-pointer.
3. The back-pointer points to the previous record in the chain.
4. A broken chain does not cause a major problem in two-way chains. This is for the simple
reason that another chain exists in the opposite direction.
5. Two-way chains do not suffer from the drawback of lost/damaged references.
 Disadvantages:
1. More effort goes into their maintenance. For every new record that is being added, there
is a need to make the entry in both forward and backward fields.
2. If a record is lost both the previous and next pointers need to be adjusted so that they
point to the correct record.
1.3 Indexed Organisation
1. One of the fields in the file is the primary key. This field identifies a record uniquely. In
every record, the primary key field should occupy the same position.
2. In order to create and maintain index files, a computer creates a data file and an index file.
The data file contains the actual contents (data) of the record, whereas the index file contains
the index entries.
3. The way files are organized in computers is as follows:
 The data file is sorted in the order of the primary key field values.
 The index file contains two fields: the key value and the pointer to the data area.
 One record in the index file thus consists of a key value and a pointer to the
corresponding data record.
 The key value is generally the largest primary key value in a given range of records. The
pointer points to the first entry within that range of data records.
4. This is illustrated in figure. In the first index entry, the index value is C, which is the highest
primary key value in the first data block.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 2
5. The pointer from this index entry points to the start of this range (i.e. A). The address (i.e.
memory location) of this on the disk is assumed to be 0, as shown.
6. The second index entry contains F as the highest primary key value for that range of records,
and a pointer to D, which is the start of the range, and so on.
7. The address of this on the disk is assumed to be 100, as shown.
Fig1.2 Indexed file organization
8. This arrangement works fine. There are two problems, as follows:
 To insert new index values between any two existing values.
 The number of index values becomes too high.
 Solution :
 Inserting a new index entry would necessitate a split in the index, and appropriate
adjustments in the address values.
 To solve the second problem create a multi-level index (index of indexes). In this type of
index, the very first line does not point to the data items as before. Instead, it points to
another lower-level index.
 Depending on the need, this lower-level index may point to yet another lower-level
index, and so on. Only the final level of index points to the actual data items.
1.4 Direct organisation
 The idea is quite simple.All records in direct file are of the same size.
 Every record has an associated record number.The record number serves the same purpose as
a primary key in an index file.
 Direct files can be classified in to two main types .They are:
 Hashed files
 Non-hashed files
 Non - hashed files
 Here, records are placed in its appropriate slot based on its record number.
 Th drawback of the non-hashed file approach is the creation of too many empty slots.
 Hashed file
 In hashed file the record number itself becomes an equivalent of the primary key.
 The term hash indicates splitting or chopping of key in to pieces.
 The are three primary hashing techniques they are:divison method,mid-square method
and folding method.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 3
2.Purpose of Database System
9. Database systems arose in response to early methods of computerized management of
commercial data.
10. As an example consider part of a university organization that among other data, keeps
information about all instructors, students, departments and course offerings.
11. One way to keep the information on a computer is to store it in operating system files.
12. To allow users to manipulate the information, the system has a number of application
programs that manipulate the files, including programs to:
 Add new students, instructors, and courses.
 Register students for courses.
 Assign grades to students, compute grade point averages (GPA) and generate transcripts.
13. New application programs are added to the system as the need arises.
2.1 File Processing System
 This system is supported by a conventional operating system.
 The system stores permanent records in various files.
 It needs different application programs to extract records from, and add records to, the
appropriate files.
 Before database management systems (DBMSs) were introduced, organizations usually
stored information in such systems.
2.1.1 Drawbacks of using file systems to store data

Data redundancy and inconsistency
 Different programmers create files and application program.
 The files created, have different structures and the programs may be written in several
programming language.
 The same information may be duplicated in several files.

Difficulty in accessing data
 Need to write a new program to carry out each new task.

Data isolation
 Data are scattered in various file. The files may be stored in different format
 Writing new application program to retrieve appropriate data is difficult.

Integrity problems
 Integrity constraints (e.g., account balance > 0) become “buried” in program code rather
than being stated explicitly.
 Hard to add new constraints or change existing ones.

Atomicity of updates
 Failures may leave database in an inconsistent state with partial updates carried out.
 Example: Transfer of funds from one account to another should either complete or not
happen at all.

Concurrent access anomalies
 Concurrent access needed for improved performance.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 4
 Uncontrolled concurrent accesses can lead to inconsistencies.
 Example: Two people reading a balance (say 100) and updating it by withdrawing money
(say 50 each) at the same time.

Security problems
 Not every user of the database system should be able to access all the data.
 Example: In a university, payroll personnel need to see only the financial information
.They do not see information about academic records.
 Since application programs are added to file processing system in an adhoc manner,
enforcing such security constraint is difficult.

Database systems offer solutions to all the above problems.
3. Database System Terminologies
Database: A collection of related data.
Data: Known facts that can be recorded and have an implicit meaning.
Mini-world: Some part of the real world about which data is stored in a database. For example,
student grades and transcripts at a university.
Database Management System (DBMS): A software package/ system to facilitate the creation and
maintenance of a computerized database.
Database System: The DBMS software together with the data itself. Sometimes, the applications
are also included.
4. Database Characteristics
The main characteristics of the database approach are the following:
1. Self-describing nature of a database system
2. Insulation between programs and data, and data abstraction
3. Support of multiple views of the data
4. Sharing of data and multiuser transaction processing
1.Self-describing nature of a database system
 A DBMS catalog stores the description of a particular database (e.g. data structures, types,
and constraints)
 The description is called meta-data.
 This allows the DBMS software to work with different database applications.
2.Insulation between programs and data, and data abstraction
 The structure of data files is stored in the DBMS catalog separately from the access
programs. This property is called program-data independence.
 Allows changing data structures and storage organization without having to change the
DBMS access programs.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 5


Data Abstraction: A data model is used to hide storage details and present the users with a
conceptual view of the database.
Programs refer to the data model constructs rather than data storage details.
3. Support of multiple views of the data
 Each user may see a different view of the database, which describes only the data of interest
to that user.
4. Sharing of data and multiuser transaction processing
 Allowing a set of concurrent users to retrieve from and to update the database.
 Concurrency control within the DBMS guarantees that each transaction is correctly executed
or aborted.
 Recovery subsystem ensures each completed transaction has its effect permanently recorded
in the database.
 OLTP (Online Transaction Processing) is a major part of database applications. This allows
hundreds of concurrent transactions to execute per second.
5. Data Models
Data abstraction:
 Suppression of details of data organization and Storage.
 Highlighting the essential features for an improved understanding of data.
Data model:
 Collection of concepts that describe the structure of a database.
 Provides means to achieve data abstraction.
 Basic operations
 Specify retrievals and updates on the database
 Dynamic aspect or behavior of a database application
 Allows the database designer to specify a set of valid operations allowed on database
objects.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 6
Categories of Data Models


High-level or conceptual data models
 Close to the way many users perceive data.
Conceptual data models use concepts such as entities, attributes, and relationships.
 Entity-Represents a real-world object or concept.
 Attribute-Represents some property of interest that further describes an entity.
 Relationship among two or more entities represents an association among the entities.

Low-level or physical data models
 Describe the details of how data is stored on computer storage media.
 Representational data models
 Easily understood by end users.
 Also similar to how data organized in computer storage.
 Relational data model
 Used most frequently in traditional commercial DBMSs.
 Object data model
 New family of higher-level implementation data models that are closer to conceptual
data models.
 Physical data models
 Describe how data is stored as files in the computer.
 Access path- Structure that makes the search for particular database records efficient.
 Index- Example of an access path that allows direct access to data using an index term or
keyword.
6. DBMS Components
 A DBMS is a complex software system.
 Figure illustrates, in a simplified form, the typical DBMS components.
Fig6.1Component modules of a DBMS and their interactions
CS6302 DATABASE MANAGEMENT SYSTEM
Page 7

The top part of the figure refers to the various users of the database environment and their
interfaces.
 The lower part shows the internals of the DBMS responsible for storage of data and
processing of transactions.
Let us consider the top part of Figure:
1. It shows interfaces for
 the DBA staff, casual users who work with interactive interfaces to formulate queries,
 application programmers who program using some host languages,
 parametric users who do data entry work by supplying parameters to predefined
transactions.
2. The DDL compiler: processes schema definitions, specified in the DDL, and stores
descriptions of the schemas (meta-data) in the DBMS catalog.
3. Casual users and persons with occasional need for information from the database interact
using some form of interface called as as interactive query interface.
4. Query compiler: handles high-level queries that are entered interactively.
5. The query optimizer is concerned with the rearrangement and possible reordering of
operations, elimination of redundancies, and use of correct algorithms and indexes during
execution.
6. Application programmers write programs in host languages such as Java, C, or C++that are
submitted to a precompiler.
7. The precompiler extracts DML commandsfrom an application program written in a host
programming language.
8. DML compiler: compiles the DML commands into objectcode for database access.
9. The rest of the program is sent to the host language compiler.
10. The object codes for the DML commands and the rest of the program are linked, forming a
canned transaction whose executable code includes calls to the runtime database processor.
Now, Let us consider the lower part of figure
1. Run-time database processor: handles database access at run time. It receives retrieval and
update operations and carries them out on the database.
2. It also works with the stored data manager, which controls access to DBMS information
that is stored on disk through interaction with operating system.
3. Concurrency control and backup and recovery systems are integrated into the working of
the runtime database processor for purposes of transaction management.
Database System Utilities
 There are some functions that are not provided through the normal DBMS components
rather they are provided through additional programs called utilities. Some of these are:
1. Loading or import utility: used to load or import existing data files into the database.
2. Backup utility: used to create backup copies of the database, usually by dumping the entire
database onto tape.
3. File reorganization utility: is used to reorganize a database file into a different file
organization to improve performance.
4. Performance monitoring utility: is used to monitor database usage and provides statistics to
the DBA.
7. Relational Algebra
1. A set of operators (unary and binary) that take relation instances as arguments and return new
relations.
2. Gives a procedural method of specifying a retrieval query.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 8
3.
4.
5.
6.
Forms the core component of a relational query engine.
SQL queries are internally translated into Relational Algebra expressions.
Provides a framework for query optimization.
A sequence of relational algebra operations forms a relational algebra expression
7.1 Unary Relational Operations: SELECT ,PROJECT and RENAME
 The Select operation ( denoted by σ ( sigma))can be used to select those tuples of a relation
that satisfy a given condition.






Notation:
σ : select operator ( read as sigma)
R: relation name
Examples of select expressions
Obtain information about a professor with name “giridhar”
σ name= “giridhar”(professor)
Obtain information about professors who joined the university between 1980 and 1985
σ startYear≥1980 ^ startYear < 1985(professor)
To select the tuples for all employees who either work in department 4 and make over
$25,000 per year, or work in department 5 and make over $30,000, the following SELECT
operation is given:
σ(Dno=4 AND Salary>25000) OR (Dno=5 AND Salary>30000)(EMPLOYEE)
The result is shown in Figure
Fig7.1.Results of select operation
 TheBoolean conditions AND, OR, and NOT have their normal interpretation, as follows:
 (cond1 AND cond2) is TRUE if both (cond1) and (cond2) are TRUE; otherwise,it is FALSE.
 (cond1 OR cond2) is TRUE if either (cond1) or (cond2) or both are TRUE;otherwise, it is
FALSE.
 (NOT cond) is TRUE if cond is FALSE; otherwise, it is FALSE.
 The project operation(denoted by π(pie)) can be used to keep only the required attributes of
a relation instance and throw away others.

Notation:
Π:project operator(read as pie)
R: relation name
 Examples of project expressions
 To list each employee’s first and last name and salary, the PROJECT operation is used as
follows:
πLname, Fname, Salary(EMPLOYEE)
The result is shown in figure
CS6302 DATABASE MANAGEMENT SYSTEM
Page 9
Fig7.2.Results of project operation
 The Rename operator is denoted by ρ (rho).
 It is used to rename the attributes of a relation or the relation name or both.
 The general RENAME operation ρ can be expressed by any of the following forms:
 ρS (B1, B2, …, Bn )(R) changes both:
 the relation name to S, and
 the column (attribute) names to B1, B1, …..Bn
 ρS(R) changes:
 the relation name only to S
 ρ(B1, B2, …, Bn )(R) changes:
 the column (attribute) names only to B1, B1, …..Bn
 Example of Rename operation
 To rename the attributes in a relation, simply list the new attribute names in parentheses, as in
the following example:
TEMP ← σ DNO = 4 (EMPLOYEE)
R (FN, LN, SAL)← π FNAME, LNAME, SALARY (TEMP)
 These two operations are illustrated in Figure
Fig7.3.Results of Rename operation
7.2 Relational Algebra Operations from Set Theory
Union Operation
1. Binary operation, denoted by .
2. The result of R S is a relation that includes all tuples that are either in R or in S or in both R
and S.
3. Duplicate tuples are eliminated.
4. The two operand relations R and S must be “typecompatible” (or UNION compatible):
 R and S must have same number of attributes.
 Each pair of corresponding attributes must be type compatible ( have same domains).
Intersection operation
1. INTERSECTION is denoted by ∩.
2. The result of the operation R ∩ S, is a relation that includes all tuples that are in both R and
S.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 10
3. The attribute names in the result will be the same as the attribute names in R.
4. The two operand relations R and S must be “type compatible”.
Set Difference
1. SET DIFFERENCE (also called MINUS or EXCEPT) is denoted by –
2. The result of R – S, is a relation that includes all tuples that are in R but not in S
3. The attribute names in the result will be the same as the attribute names in R
4. The two operand relations R and S must be“type compatible”
Example of union, intersection and set difference operations
Cartesian (Or Cross) Product Operation
1. This operation is used to combine tuples from two relations in a combinatorial fashion.
2. Denoted by R(A1, A2, . . ., An) x S(B1, B2, . . ., Bm).
3. Result is a relation Q with degree n + m attributes: Q(A1, A2, . . ., An, B1, B2, . . ., Bm), in
that order.
4. The resulting relation state has one tuple for each combination of tuples—one from R and
one from S.
5. Hence, if R has nR tuples and S has nS tuples, then R x S will have nR * nS tuples.
6. The two operands do NOT have to be "type compatible”.
7. Example:
FEMALE_EMPS ← σ SEX=’F’(EMPLOYEE)
EMPNAMES ← π FNAME, LNAME, SSN (FEMALE_EMPS)
EMP_DEPENDENTS ← EMPNAMES x DEPENDENT
8. EMP_DEPENDENTS will contain every combination of EMPNAMES and DEPENDENT.
9. The operations are illustrated in the figure
CS6302 DATABASE MANAGEMENT SYSTEM
Page 11
Fig7.4 . The Cartesian Product (Cross Product) operation
8. Relational DBMS (RDBMS)
 It is a database management system where the data are organized as tables of data values and
all the operations on the data work on these tables.
8.1 codd’s rule
 Dr. Edgar F. Codd proposed a set of 12 rules that were intended to define the important
characteristics and capabilities of any relational system [Codd 1986]. The rules are listed
below:
Rule
Rule Name
Description
Rule 1
Information rule
All information is represented logically by values in tables
Rule 2
Guaranteed Access
Every data value is logically accessible by a combination of table name,
Rule
primary key value and column name.
Rule 3
Missing Information
Null values are systematically supported independent of data type.
rule
Rule 4
System catalogue
The logical description of the database is represented and may be interrogated
Rule
by authorized users.
Rule 5
Comprehensive
A high level relational language that support all of the following: data
language Rule
definition, view definition, data manipulation, integrity constraints,
authorization, transaction boundaries.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 12
Rule 6
Rule 7
View update rule
Set level Update Rule
Rule 8
Physical data
independence rule
Physical data
independence rule
Rule 9
Rule 10
Integrity
independence rule
Distribution
independence rule
Non-subversion rule
Rule11
Rule 12
The system should able to perform all theoretically possible updates on view.
The ability to treat whole table as single object applies to insertion,
modification and deletion, as well as retrieval of data.
User operations and application program should be independent of any
changes in physical storage.
User operations and application program should be independent of any
changes in
Logical structure of base table provided they involve no loss of information.
Entity and referential integrity constraints should be defined in the high level
relational language, not by application programs.
User operations and application program should be independent of location of
data when it is distributed over multiple computers.
If a low-level procedural language is supported, it must not able to subvert
integrity or security constraints expressed in the high-level relational
language
9. Entity-Relationship model
 Entity-Relationship (ER) model- Popular high-level conceptual data model.
 ER diagrams -Diagrammatic notation associated with the ER model.
 Entity- Thing in real world with independent existence.
 Attributes-Particular properties that describe entity. For example, an EMPLOYEE entity
may be described by the attributes employee’s name, age, address, salary, and job.
 Several types of attributes occur in the ER model: simple, composite, single valued, multi
valued, stored, and derived.
 Simple or atomic attributes: Attributes that are not divisible.
 Composite attributes: It can be divided into smaller subparts, which represent more basic
attributes with independent meanings. Composite attributes can form a hierarchy.
 Example: Address attribute of the EMPLOYEE entity can be subdivided into Street_address,
City, State, and Zip.
Fig9.1. A hierarchy of composite attributes



Single-Valued Attributes: Attributes that have a single value for a particular entity. For
example, Age of a person.
Multivalued Attributes:
 An attribute can have a set of values for the same entity.
 A multivalued attribute may have lower and upper bounds to constrain the number of
values allowed for each individual entity.
Stored versus Derived Attributes:
Two (or more) attribute values are related.
Example: Age and Birth_date attributes of a person.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 13
 For a particular person entity, the value of Age can be determined from the current (today’s)
date and the value of that person’s Birth_date.
 The Age attribute is called a derived attribute and is said to be derivable from the
Birth_date attribute, which is called a stored attribute.
 Entity type: Collection (or set) of entities that have the same attributes.






Fig 9.2 Two entity types,EMPLOYEE andCOMPANY, and some member entities ofeach
Key or uniqueness constraint: Attributes whose values are distinct for each individual entity
in entity set
Key attribute: Uniqueness property must hold for every entity set of the entity type.
Value sets (or domain of values):Specifies set of values that may be assigned to that
attribute for each individual entity.
Relationship: attribute of one entity type refers to another entity type. Represent references
as relationships not attributes.
Relationship Types, Sets, and Instances:
 Relationship type R among n entity types E1, E2, ..., En:Defines a set of associations
among entities from these entity types.
 Relationship instances ri: Each ri associates n individual entities (e1,e2, ..., en)and each
entity ej in ri is a member of entity set Ej.
Relationship Degree
 Degree of a relationship type:1. Number of participating entity types 2. A relationship
type of degree two is called binary, and one of degree three is calledternary.
 Relationships as attributes:Think of a binary relationship type in terms of attributes.
Fig9.3. Some instances in the WORKS_FOR relationship set, which represents a relationship type
WORKS_FOR between EMPLOYEE and DEPARTMENT
CS6302 DATABASE MANAGEMENT SYSTEM
Page 14









Role names :Role name signifies the role that a participating entity plays in each
relationship instance.
Recursive relationships: Same entity type participates more than oncein a relationship type in
different roles.
Cardinality ratio for a binary relationship: Specifies maximum number of relationship
instances that entity can participate in.
Participation constraint: Specifies whether existence of entity depends on its being related
to another entity.
 Types: total and partial.
Attributes of Relationship Types
 Attributes of 1:1 or 1:N relationship types:can be migrated to one entity type.
 For a 1:N relationship type:Relationship attribute can be migrated only to
entity type on N-side of relationship.
 For M:N relationship types :1.Some attributes may be determined by combination of
participating entities2. be specified as relationship attributes.
Weak Entity Types
 Do not have key attributes of their own.
 Identified by being related to specific entities another entity type.
Regular entity types that do have a key attribute are called strong entity types.
Identifying relationship of the weak entity type: The relationship type that relates a weak
entity type to its owner.
Summary of the notation for ER diagram:
CS6302 DATABASE MANAGEMENT SYSTEM
Page 15
Fig 9.4 ER Design for the COMPANY Database
10.Functional dependencies
1. The whole database is described by a single universal relation schema R = { A1, A2, ..., An }.
a. Definition:
2. A functional dependency, denoted by X → Y, between two sets of attributes X and Y that are
subsets of R specifies a constraint on the possible tuples that can form a relation state r of R.
3. The constraint is that, for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also
have t1[Y] = t2[Y].
4. The values of the Y component of a tuple in r depend on, or are determined by, the values of
the X component.
5. The values of the X component of a tuple uniquely (or functionally) determine the values of
the Y component.
6. There is a functional dependency (FD or f.d) from X to Y, or that Y is functionally
dependent on X.
7. X functionally determines Y in a relation schema R if, and only if, whenever two tuples of
r(R) agree on their X-value, they must necessarily agree on their Y value. Note the following:
 If a constraint on R states that there cannot be more than one tuple with a given X-value
in any relation instance r(R)
 That is, X is a candidate key of R—this implies that X → Y for any subset of attributes Y
of R.
 If X→Y in R, this does not say whether or not Y→X in R.
8. A functional dependency is a property of the semantics or meaning of the attributes.
9. Whenever the semantics of two sets of attributes in R indicate that a functional dependency
should hold, specify the dependency as a constraint.
10. Relation extensions r(R) that satisfy the functional dependency constraints are called legal
relation states (or legal extensions) of R.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 16
Fig10.1 . Relation schemas EMP_PROJ.
11. Consider the relation schema EMP_PROJ in Figure10.1; from the semantics of the attributes
and the relation, the following functional dependencies should hold:
Ssn→Ename
Pnumber →{Pname, Plocation}
{Ssn, Pnumber}→Hours
12. These functional dependencies specify that
 The value of an employee’s Social Security number (Ssn) uniquely determines the
employee name (Ename),
 The value of a project’s number (Pnumber) uniquely determines the project name
(Pname) and location (Plocation),
 Acombination of Ssn and Pnumber values uniquely determines the number of hours the
employee currently works on the project per week (Hours).
 Alternatively, Ename is functionally determined by (or functionally dependent on) Ssn.
10.1Normal Forms Based on Primary Keys
10.1.1 Normalization of Relations:
 The normalization process, as first proposed by Codd (1972a), takes a relation schema
through a series of tests to certify whether it satisfies a certain normal form.
10.1.2Normalization of data:
1. It can be considered a process of analyzing the given relation schemas based on their FDs and
primary keys to achieve the desirable properties of (1) minimizing redundancy and (2)
minimizing the insertion, deletion, and update anomalies.
2. Unsatisfactory relation schemas that do not meet certain conditions—the normal form tests
are decomposed into smaller relation schemas that meet the tests and hence possess the
desirable properties.
3. Definition: The normal form of a relation refers to the highest normal form condition that it
meets, and hence indicates the degree to which it has been normalized.
4. Normalization must confirm the existence of additional properties:
5. The nonadditive join or lossless join property, which guarantees that the spurious tuple
generation problem does not occur with respect to the relation schemas created after
decomposition.
6. The dependency preservation property, which ensures that each functional dependency is
represented in some individual relation resulting after decomposition.
10.1.3 Denormalization:

It is the process of storing the join of higher normal form relations as a base relation, which is
in a lower normal form.
10.1.4 Definitions of Keys and Attributes Participating in Keys
1. A key K is a superkey with the additional property that removal of any attribute from K will
cause K not to be a superkey any more.
2. If a relation schema has more than one key, each is called a candidate key.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 17
3. One of the candidate keys is arbitrarily designated to be the primary key, and the others are
called secondary keys.
4. An attribute of relation schema R is called a prime attribute of R if it is a member of some
candidate key of R.
5. An attribute is called nonprime if it is not a prime attribute, that is, if it is not a member of
any candidate key.
10.2 First Normal form
 It states that the domain of an attribute must include only atomic (simple, indivisible) values
and that the value of any attribute in a tuple must be a single value from the domain of that
attribute.
 It disallows having a set of values, a tuple of values, or a combination of both as an attribute
value for a single tuple.
Fig10.2. A relation schema that is not in 1NF
Fig10.3 Sample state of relation DEPARTMENT Fig10.4 . 1NF version of the same relation
with redundancy
 Fig10.2is not in 1NF because Dlocations is not an atomic attribute.
 There are three main techniques to achieve first normal form:
 First technique:
1. Remove the attribute Dlocations and place it in a separate relation DEPT_LOCATIONS,
along with the primary key Dnumber.
2. The primary key of this relation is the combination {Dnumber, Dlocation}.
3. A distinct tuple in DEPT_LOCATIONS exists for each location of a department.
4. This decomposes the non-1NF relation into two 1NF relations.
 Second Technique:
1. Expand the key so that there will be a separate tuple, in the original DEPARTMENT
relation for each location of a DEPARTMENT.
2. The primary key becomes the combination {Dnumber, Dlocation}.
3. Disadvantage: introducing redundancy in the relation.
 Third technique:
1. If a maximum number of values is known for the attribute—for example, if it is known
that at most three locations can exist for a department—replace the Dlocations attribute
by three atomic attributes: Dlocation1, Dlocation2, and Dlocation3.
2. Disadvantage: Introducing NULL values if most departments have fewer than three
locations.
 The first solution is considered best because it does not suffer from redundancy and it is
completely general, having no limit placed on a maximum number of values.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 18
10.3Second Normal Form
1. It is based on the concept of full functional dependency.
2. A functional dependency X → Y is a full functional dependency if removal of any attribute
A from X means that the dependency does not hold any more.
3. A functional dependency X→Y is a partial dependency if some attribute A € X can be
removed from X and the dependency still holds.
4. In the following figure, {Ssn, Pnumber} → Hours is a full dependency (neither Ssn → Hours
nor Pnumber→Hours holds).
5. However, the dependency {Ssn, Pnumber}→Ename is partial because Ssn→Ename holds.
Fig10.5 Relation schema EMP_PROJ
6. The EMP_PROJ relation is in 1NF but is not in 2NF.
7. The functional dependencies FD2 and FD3 make Ename, Pname, and Plocation partially
dependent on the primary key {Ssn, Pnumber} of EMP_PROJ.
8. If a relation schema is not in 2NF, it can be second normalized or 2NF normalized into a
number of 2NF relations.
9. In that 2NF Relation , nonprime attributes are associated only with the part of the primary
key on which they are fully functionally dependent.
10. The functional dependencies FD1, FD2, and FD3 lead to the decomposition of EMP_PROJ
into the three relation schemas EP1, EP2, and EP3 shown in figure, each of which is in 2NF.
Fig10.6 . Normalizing EMP_PROJ into 2NF relations
10.4 Third Normal Form
1. It is based on the concept of transitive dependency.
2. A functional dependency X→Y in a relation schema R is a transitive dependency if there
exists a set of attributes Z in R that is neither a candidate key nor a subset of any key of R,
and both X→Z and Z→Y hold.
3. The dependency Ssn→Dmgr_ssn is transitive through Dnumber in EMP_DEPT in figure,
because both the dependencies Ssn → Dnumber and Dnumber → Dmgr_ssn hold and
Dnumber is neither a key itself nor a subset of the key of EMP_DEPT.



Fig10.7 . Relation schema EMP_DEPT
Definition: A relation schema R is in 3NF if it satisfies 2NF and no nonprime attribute of R
is transitively dependent on the primary key.
The relation schema EMP_DEPT is in 2NF but not in 3NF because of the transitive
dependency.
EMP_DEPT is normalized by decomposing it into the two 3NF relation schemas ED1 and
ED2.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 19
Fig10.8. Normalizing EMP_DEPT into 3NF relations
10.5 Boyce Codd Normal Form

Definition: A relation schema R is in BCNF if whenever a nontrivial functional dependency
X→A holds in R, then X is a superkey of R.
1. Example: Consider a relation TEACH with the following dependencies:
FD1: {Student, Course} → Instructor
FD2: Instructor → Course
2. {Student, Course} is a candidate key for this relation and that the dependencies shown follow
the pattern in figure, with Student as A, Course as B, and Instructor as C.
Fig10.9. A schematic relation with FDs; it is in 3NF, but not in BCNF
3. Hence this relation is in 3NF but not BCNF.
4. Decomposition of this relation schema into two schemas is not straightforward because it
may be decomposed into one of the three following possible pairs:
{Student, Instructor} and {Student, Course}
{Course, Instructor} and {Course, Student}
{Instructor, Course} and {Instructor, Student}
5. All three decompositions lose the functional dependency FD1. The desirable decomposition
of those just shown is 3 because it will not generate spurious tuples after a join.
6. A relation not in BCNF should be decomposed so as to meet this property. Nonadditive
decomposition is a must during normalization.
10.6 Formal definition of Multivalued dependencies(MVD):
 The MVD x →→ Y is said to hold for R(X,Y,Z) if, whenever t1 and t2 are two rows in R
that have the same values for attributes X and therefore t1[x]=t2[x] then R also contains t3
and t4,such that
t3 [X] = t4 [X] = t1 [X] = t2 [X]
t3 [Y] = t1 [Y] and t4[Y] = t2 [Y]
t3 [Z] = t2 [Z] and t4 [Z] = t1[Z]
10.6.1Fourth Normal Form


A relation schema R is in 4NF with respect to a set of dependencies F if, for every nontrivial
multivalued dependency X →→ Y in F+, X is a superkey for R.
Consider the EMP relation in figure. EMP is not in 4NF because in the nontrivial MVDs
Ename→→ Pname and Ename →→ Dname, and Ename is not a superkey of EMP.
Fig10.10. The EMP relation with two MVDs: Ename →→ Pname and Ename →→ Dname
CS6302 DATABASE MANAGEMENT SYSTEM
Page 20



Decompose EMP into EMP_PROJECTS and EMP_DEPENDENTS, shown in figure.
Both EMP_PROJECTS and EMP_DEPENDENTS are in 4NF, because the MVDs Ename
→→ Pname in EMP_PROJECTS and Ename →→ Dname in EMP_DEPENDENTS are
trivial MVDs.
No other nontrivial MVDs hold in either EMP_PROJECTS or EMP_DEPENDENTS. No
FDs hold in these relation schemas either.
Fig10.11. Decomposing the EMP relation into two 4NF relations EMP_PROJECTS and
EMP_DEPENDENTS
10.7 Join Dependencies
 Let a relation R have subset of its attribute A,B,C ,..Then R satisfies the Join dependency
(JD) written as *(A,B,C) if and only if every possible legal value of R is equal to the join of
its projection A,B,C…
10.7.1Definition of 5NF:
 A relation R is in 5NF (or project-join normal form, PJNF) if for all join dependencies of the
form *(R1, R2, ..., Rn), where each Ri is a subset of the set of attributes of R and R = R1⋃
R2⋃...⋃Rn, at least one of the following holds.
 *(R1, R2, ..., Rn) is a trivial join-dependency (i.e., one of Ri is R)
 Every Ri is a super key for R.
Example:
Department Subject Student
Comp. Sc.
CP1000 John Smith
Mathematics MA1000 John Smith
Comp. Sc.
CP2000 Arun Kumar
Comp. Sc.
CP3000 Reena Rani
Physics
PH1000 Raymond Chew
Chemistry
CH2000 Albert Garcia
1. The above relation says that Comp. Sc. offers subjects CP1000, CP2000 and CP3000 which are taken
by a variety of students.
2. No student takes all the subjects and no subject has all students enrolled in it and therefore all three
fields are needed to represent the information.
3. The above relation does not show MVDs since the attributes subject and student are not independent;
they are related to each other and the pairings have significant information in them.
4. The relation can therefore not be decomposed in two relations
 (dept, subject), and (dept, student)
Without losing some important information.
The relation can however be decomposed in the following three relations
 (dept, subject), and
 (dept, student)
 (subject, student)
Now it can be shown that this decomposition is lossless.
CS6302 DATABASE MANAGEMENT SYSTEM
Page 21