Normalization-Anomalies
Database Normalization is a technique for designing relational database tables to minimize duplication of information and, in so
doing, to safeguard the database against certain types of logical or structural problems, namely data anomalies. For example, when
multiple instances of a given piece of information occur in a table, the possibility exists that these instances will not be kept consistent
when the data within the table is updated, leading to a loss of data integrity. A table that is sufficiently normalized is less vulnerable to
problems of this kind, because its structure reflects the basic assumptions for when multiple instances of the same information should be
represented by a single instance only.
Higher degrees of normalization typically involve more tables and create the need for a larger number of joins, which can reduce
performance. Accordingly, more highly normalized tables are typically used in database applications involving many isolated
transactions (e.g. an Automated teller machine), while less normalized tables tend to be used in database applications that do not need
to map complex relationships between data entities and data attributes (e.g. a reporting application, or a full-text search application).
Database theory describes a table's degree of normalization in terms of normal forms of successively higher degrees of strictness. A
table in third normal form (3NF), for example, is consequently in second normal form (2NF) as well; but the reverse is not always the
case.
Although the normal forms are often defined informally in terms of the characteristics of tables, rigorous definitions of the normal forms
are concerned with the characteristics of mathematical constructs known as relations. Whenever information is represented relationally,
it is meaningful to consider the extent to which the representation is normalized.
Problems addressed by normalization
An update anomaly. Employee 519 is shown as having different addresses on different records.
An insertion anomaly. Until the new faculty member is assigned to teach at least one course, his details cannot be recorded.
A deletion anomaly. All information about Dr. Giddens is lost when he temporarily ceases to be assigned to any courses.
A table that is not sufficiently normalized can suffer from logical inconsistencies of various types, and from anomalies involving data
operations. In such a table:

• The same information can be expressed on multiple records; therefore updates to the table may result in logical inconsistencies. For example, each record in an "Employees' Skills" table might contain an Employee ID, Employee Address, and Skill; thus a change of address for a particular employee will potentially need to be applied to multiple records (one for each of his skills). If the update is not carried through successfully—if, that is, the employee's address is updated on some records but not others—then the table is left in an inconsistent state. Specifically, the table provides conflicting answers to the question of what this particular employee's address is. This phenomenon is known as an update anomaly.
• There are circumstances in which certain facts cannot be recorded at all. For example, each record in a "Faculty and Their Courses" table might contain a Faculty ID, Faculty Name, Faculty Hire Date, and Course Code—thus we can record the details of any faculty member who teaches at least one course, but we cannot record the details of a newly hired faculty member who has not yet been assigned to teach any courses. This phenomenon is known as an insertion anomaly.
• There are circumstances in which the deletion of data representing certain facts necessitates the deletion of data representing completely different facts. The "Faculty and Their Courses" table described in the previous example suffers from this type of anomaly, for if a faculty member temporarily ceases to be assigned to any courses, we must delete the last of the records on which that faculty member appears. This phenomenon is known as a deletion anomaly.
Ideally, a relational database table should be designed in such a way as to exclude the possibility of update, insertion, and deletion
anomalies. The normal forms of relational database theory provide guidelines for deciding whether a particular design will be
vulnerable to such anomalies. It is possible to correct an unnormalized design so as to make it adhere to the demands of the normal
forms: this is called normalization.
Normalization typically involves decomposing an unnormalized table into two or more tables that, were they to be combined (joined),
would convey exactly the same information as the original table.
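As an illustration of such a decomposition, the "Faculty and Their Courses" table described above could be split into one table per fact type. The following sketch is indicative only; the table and column names and types are assumed from the running example rather than taken from any actual schema.

-- Unnormalized: one row per (faculty, course) pairing, so faculty facts repeat.
-- FACULTY_AND_COURSES (FacultyId, FacultyName, FacultyHireDate, CourseCode)

CREATE TABLE FACULTY (
    FacultyId       INT PRIMARY KEY,
    FacultyName     VARCHAR(100) NOT NULL,
    FacultyHireDate DATE NOT NULL
);

CREATE TABLE FACULTY_COURSE (
    FacultyId  INT NOT NULL REFERENCES FACULTY (FacultyId),
    CourseCode VARCHAR(20) NOT NULL,
    PRIMARY KEY (FacultyId, CourseCode)
);

-- A newly hired faculty member can now be recorded with no course assignments
-- (no insertion anomaly), and removing a faculty member's last course assignment
-- no longer deletes that faculty member's details (no deletion anomaly).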
Background to Normalization: Definitions


Functional dependency: Attribute B has a functional dependency on attribute A i.e. A → B if, for each value of attribute A,
there is exactly one value of attribute B. In our example, Employee Address has a functional dependency on Employee ID,
because a particular Employee ID value corresponds to one and only one Employee Address value. (Note that the reverse
need not be true: several employees could live at the same address and therefore one Employee Address value could
correspond to more than one Employee ID. Employee ID is therefore not functionally dependent on Employee Address.) An
attribute may be functionally dependent either on a single attribute or on a combination of attributes. It is not possible to
determine the extent to which a design is normalized without understanding what functional dependencies apply to the
attributes within its tables; understanding this, in turn, requires knowledge of the problem domain. For example, an Employer
may require certain employees to split their time between two locations, such as New York City and London, and therefore
want to allow Employees to have more than one Employee Address. In this case, Employee Address would no longer be
functionally dependent on Employee ID.
Trivial functional dependency: A trivial functional dependency is a functional dependency of an attribute on a superset of
itself. {Employee ID, Employee Address} → {Employee Address} is trivial, as is {Employee Address} → {Employee Address}.
Full functional dependency: An attribute is fully functionally dependent on a set of attributes X if it is a) functionally
dependent on X, and b) not functionally dependent on any proper subset of X. {Employee Address} has a functional
dependency on {Employee ID, Skill}, but not a full functional dependency, for it is also dependent on {Employee ID}.
Transitive dependency: A transitive dependency is an indirect functional dependency, one in which X→Z only by virtue of
X→Y and Y→Z.
Multivalued dependency: A multivalued dependency is a constraint according to which the presence of certain rows in a table
implies the presence of certain other rows: see the Multivalued Dependency article for a rigorous definition.
Join dependency: A table T is subject to a join dependency if T can always be recreated by joining multiple tables each
having a subset of the attributes of T.
Superkey: A superkey is an attribute or set of attributes that uniquely identifies rows within a table; in other words, two distinct
rows are always guaranteed to have distinct superkeys. {Employee ID, Employee Address, Skill} would be a superkey for the
"Employees' Skills" table; {Employee ID, Skill} would also be a superkey.
Candidate key: A candidate key is a minimal superkey, that is, a superkey for which we can say that no proper subset of it is
also a superkey. {Employee ID, Skill} would be a candidate key for the "Employees' Skills" table.
Non-prime attribute: A non-prime attribute is an attribute that does not occur in any candidate key. Employee Address would
be a non-prime attribute in the "Employees' Skills" table.
Primary key: Most DBMSs require a table to be defined as having a single unique key, rather than a number of possible
unique keys. A primary key is a candidate key which the database designer has designated for this purpose.
* Functional dependency: a relationship between attributes in the same table, or a constraint between two sets of attributes in a relation from a database. It occurs when one attribute (or set of attributes) in a relation uniquely determines another attribute.
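To make the notion concrete, the functional dependency Employee ID → Employee Address from the example above can be checked against the stored data with a simple query. This is only a sketch; the table and column names (EMPLOYEE_SKILLS, EmployeeId, EmployeeAddress) are assumed for illustration.

-- Rows violating the assumed dependency EmployeeId -> EmployeeAddress:
-- any EmployeeId associated with more than one distinct address.
SELECT EmployeeId
FROM EMPLOYEE_SKILLS
GROUP BY EmployeeId
HAVING COUNT(DISTINCT EmployeeAddress) > 1;

If the query returns no rows, the dependency holds in the current data; whether it must hold in general is a question about the problem domain, not about the data currently stored.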
History
Edgar F. Codd first proposed the process of normalization and what came to be known as the 1st normal form:
There is, in fact, a very simple elimination[1] procedure which we shall call normalization. Through decomposition non-simple domains are
replaced by "domains whose elements are atomic (non-decomposable) values."
—Edgar F. Codd, A Relational Model of Data for Large Shared Data Banks [2]
In his paper, Edgar F. Codd used the term "non-simple" domains to describe a heterogeneous data structure, but later researchers
would refer to such a structure as an abstract data type. In his biography, Codd also cited as the inspiration for his work his eager assistant Tom Ward, who used to challenge him to rounds of database normalization, much like a chess match between master and apprentice. Tom Ward has often been quoted in industry magazines as saying that he has always enjoyed database normalization even more than sudoku.
Normal Forms
The normal forms (abbrev. NF) of relational database theory provide criteria for determining a table's degree of vulnerability to
logical inconsistencies and anomalies. The higher the normal form applicable to a table, the less vulnerable it is to such inconsistencies
and anomalies. Each table has a "highest normal form" (HNF): by definition, a table always meets the requirements of its HNF and of
all normal forms lower than its HNF; also by definition, a table fails to meet the requirements of any normal form higher than its HNF.
The normal forms are applicable to individual tables; to say that an entire database is in normal form n is to say that all of its tables
are in normal form n.
Newcomers to database design sometimes suppose that normalization proceeds in an iterative fashion, i.e. a 1NF design is first
normalized to 2NF, then to 3NF, and so on. This is not an accurate description of how normalization typically works. A sensibly designed
table is likely to be in 3NF on the first attempt; furthermore, if it is 3NF, it is overwhelmingly likely to have an HNF of 5NF. Achieving
the "higher" normal forms (above 3NF) does not usually require an extra expenditure of effort on the part of the designer, be cause
3NF tables usually need no modification to meet the requirements of these higher normal forms.
Edgar F. Codd originally defined the first three normal forms (1NF, 2NF, and 3NF). These normal forms have been summarized as
requiring that all non-key attributes be dependent on "the key, the whole key and nothing but the key". The fourth and fifth normal
forms (4NF and 5NF) deal specifically with the representation of many-to-many and one-to-many relationships among attributes. Sixth
normal form (6NF) incorporates considerations relevant to temporal databases.
First Normal Form
Main article: First normal form
A table is in first normal form (1NF) if and only if it faithfully represents a relation.[3] Given that database tables embody a relation-like form, the defining characteristic of one in first normal form is that it does not allow duplicate rows or nulls. Simply put, a table with a unique key (which, by definition, prevents duplicate rows) and without any nullable columns is in 1NF.
Note that the restriction on nullable columns as a 1NF requirement, as espoused by Chris Date et al., is controversial. This particular requirement for 1NF directly contradicts Dr. Codd's vision of the relational database, in which he stated that "null values" must be supported in a fully relational DBMS in order to represent "missing information and inapplicable information in a systematic way, independent of data type."[4] By redefining 1NF to exclude nullable columns, no level of normalization can ever be achieved unless all nullable columns are completely eliminated from the entire database. This is in line with Date's and Darwen's vision of the perfect relational database, but can introduce additional complexities in SQL databases to the point of impracticality.[5]
One requirement of a relation is that every table contain exactly one value for each attribute. This is sometimes expressed as "no
repeating groups"[6]. While that statement itself is axiomatic, experts disagree about what qualifies as a "repeating group", in
particular whether a value may be a relation value; thus the precise definition of 1NF is the subject of some controversy.
Notwithstanding, this theoretical uncertainty applies to relations, not tables. Table manifestations are intrinsically free of variable
repeating groups because they are structurally constrained to the same number of columns in all rows.
See the first normal form article for a fuller discussion of the nuances of 1NF.
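A minimal sketch of a table meeting the 1NF criteria just described (a unique key, no nullable columns); the names and types are illustrative only.

CREATE TABLE EMPLOYEE_SKILLS (
    EmployeeId      INT          NOT NULL,
    EmployeeAddress VARCHAR(200) NOT NULL,
    Skill           VARCHAR(50)  NOT NULL,
    PRIMARY KEY (EmployeeId, Skill)   -- unique key: no duplicate rows
);
-- Every column is declared NOT NULL, so, on the stricter (Date) reading of 1NF,
-- the table contains no nulls and each row is a complete tuple.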
Second Normal Form
Main article: Second normal form
The criteria for second normal form (2NF) are:
• The table must be in 1NF.
• None of the non-prime attributes of the table are functionally dependent on a part (proper subset) of a candidate key; in other words, all functional dependencies of non-prime attributes on candidate keys are full functional dependencies.[7] For example, in an "Employees' Skills" table whose attributes are Employee ID, Employee Address, and Skill, the combination of Employee ID and Skill uniquely identifies records within the table. Given that Employee Address depends on only one of those attributes – namely, Employee ID – the table is not in 2NF (a decomposition that brings the example into 2NF is sketched after this list).
• Note that if none of a 1NF table's candidate keys are composite – i.e. every candidate key consists of just one attribute – then we can say immediately that the table is in 2NF.
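A hedged sketch of the decomposition that brings the "Employees' Skills" example into 2NF; table and column names are assumed for illustration (EMPLOYEE_SKILLS is redefined here without the address column).

-- Not in 2NF: EmployeeAddress depends on EmployeeId alone,
-- a proper subset of the candidate key {EmployeeId, Skill}.
-- EMPLOYEES_SKILLS (EmployeeId, EmployeeAddress, Skill)

CREATE TABLE EMPLOYEES (
    EmployeeId      INT PRIMARY KEY,
    EmployeeAddress VARCHAR(200) NOT NULL
);

CREATE TABLE EMPLOYEE_SKILLS (
    EmployeeId INT NOT NULL REFERENCES EMPLOYEES (EmployeeId),
    Skill      VARCHAR(50) NOT NULL,
    PRIMARY KEY (EmployeeId, Skill)
);
-- An address change now touches exactly one row in EMPLOYEES,
-- eliminating the update anomaly described earlier.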

Third Normal Form
Main article: Third normal form
The criteria for third normal form (3NF) are:
• The table must be in 2NF.
• Every non-prime attribute of the table must be non-transitively dependent on every candidate key.[7] A violation of 3NF would mean that at least one non-prime attribute is only indirectly dependent (transitively dependent) on a candidate key. For example, consider a "Departments" table whose attributes are Department ID, Department Name, Manager ID, and Manager Hire Date; and suppose that each manager can manage one or more departments. {Department ID} is a candidate key. Although Manager Hire Date is functionally dependent on the candidate key {Department ID}, this is only because Manager Hire Date depends on Manager ID, which in turn depends on Department ID. This transitive dependency means the table is not in 3NF (a decomposition that removes it is sketched below).
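A sketch of the corresponding 3NF decomposition for the "Departments" example; the DDL below is illustrative, with assumed names and types.

CREATE TABLE MANAGERS (
    ManagerId       INT PRIMARY KEY,
    ManagerHireDate DATE NOT NULL
);

CREATE TABLE DEPARTMENTS (
    DepartmentId   INT PRIMARY KEY,
    DepartmentName VARCHAR(100) NOT NULL,
    ManagerId      INT NOT NULL REFERENCES MANAGERS (ManagerId)
);
-- ManagerHireDate now depends directly on the key of MANAGERS;
-- no non-prime attribute of DEPARTMENTS depends transitively on its key.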
Boyce-Codd Normal Form
Main article: Boyce-Codd normal form
A table is in Boyce-Codd normal form (BCNF) if and only if, for every one of its non-trivial functional dependencies X → Y, X is a
superkey—that is, X is either a candidate key or a superset thereof.
Fourth Normal Form
Main article: Fourth normal form
A table is in fourth normal form (4NF) if and only if, for every one of its non-trivial multivalued dependencies X →→ Y, X is a
superkey—that is, X is either a candidate key or a superset thereof.
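As a hypothetical illustration (not taken from the text above), suppose a table records, for each employee, both the skills they have and the languages they speak, where skills and languages are independent of one another. The multivalued dependencies EmployeeId →→ Skill and EmployeeId →→ Language would then force a 4NF decomposition such as the following sketch.

-- Violates 4NF: every (skill, language) combination must be stored,
-- so adding one language requires adding a row per existing skill.
-- EMPLOYEE_SKILL_LANGUAGE (EmployeeId, Skill, Language)

CREATE TABLE EMPLOYEE_SKILL (
    EmployeeId INT NOT NULL,
    Skill      VARCHAR(50) NOT NULL,
    PRIMARY KEY (EmployeeId, Skill)
);

CREATE TABLE EMPLOYEE_LANGUAGE (
    EmployeeId INT NOT NULL,
    Language   VARCHAR(50) NOT NULL,
    PRIMARY KEY (EmployeeId, Language)
);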
Fifth Normal Form
Main article: Fifth normal form
The criteria for fifth normal form (5NF and also PJ/NF) are:
• The table must be in 4NF.
• There must be no non-trivial join dependencies that do not follow from the key constraints. A 4NF table is said to be in 5NF if and only if every join dependency in it is implied by the candidate keys.
Domain/key Normal Form
Main article: Domain/key normal form
Domain/key normal form (or DKNF) requires that a table not be subject to any constraints other than domain constraints and key
constraints.
Sixth Normal Form
A table is in sixth normal form (6NF) if and only if it satisfies no non-trivial join dependencies at all.[10] This obviously means that the
fifth normal form is also satisfied. The sixth normal form was only defined when extending the relational model to take into account the
temporal dimension. Unfortunately, most current SQL technologies as of 2005 do not take into account this work, and most temporal
extensions to SQL are not relational. See work by Date, Darwen and Lorentzos[11] for a relational temporal extension, Zimyani[12] for
further discussion on Temporal Aggregation in SQL, or TSQL2 for a non-relational approach.
Denormalization
Main article: Denormalization
Databases intended for Online Transaction Processing (OLTP) are typically more normalized than databases intended for Online
Analytical Processing (OLAP). OLTP Applications are characterized by a high volume of small transactions such as updating a sales
record at a supermarket checkout counter. The expectation is that each transaction will leave the database in a consistent state. By
contrast, databases intended for OLAP operations are primarily "read mostly" databases. OLAP applications tend to extract historical
data that has accumulated over a long period of time. For such databases, redundant or "denormalized" data may facilitate Business
Intelligence applications. Specifically, dimensional tables in a star schema often contain denormalized data. The denormalized or
redundant data must be carefully controlled during ETL processing, and users should not be permitted to see the data until it is in a
consistent state. The normalized alternative to the star schema is the snowflake schema. It has never been proven whether this denormalization itself provides any increase in performance, or whether the concurrent removal of data constraints is what increases the performance. The need for denormalization has waned as computers and RDBMS software have become more powerful.
Denormalization is also used to improve performance on smaller computers as in computerized cash-registers and mobile devices, since
these may use the data for look-up only (e.g. price lookups). Denormalization may also be used when no RDBMS exists for a platform
(such as Palm), or no changes are to be made to the data and a swift response is crucial.
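As a hedged sketch of the star-schema style of denormalization mentioned above, a dimensional table may repeat attributes that a snowflake schema would factor out; all names below are hypothetical.

-- Snowflake (more normalized): product -> category is a separate table.
-- DIM_PRODUCT (ProductKey, ProductName, CategoryKey)
-- DIM_CATEGORY (CategoryKey, CategoryName)

-- Star (denormalized dimension): category attributes are repeated per product.
CREATE TABLE DIM_PRODUCT (
    ProductKey   INT PRIMARY KEY,
    ProductName  VARCHAR(100) NOT NULL,
    CategoryName VARCHAR(100) NOT NULL   -- duplicated for every product in the category
);

CREATE TABLE FACT_SALES (
    ProductKey INT NOT NULL REFERENCES DIM_PRODUCT (ProductKey),
    SaleDate   DATE NOT NULL,
    Amount     DECIMAL(10,2) NOT NULL
);
-- Queries such as "sales by category" need one join instead of two,
-- at the cost of redundant CategoryName values that must be controlled during ETL.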
Non-First Normal Form (NF²)
In recognition that denormalization can be deliberate and useful, the non-first normal form is a definition of database designs which do not conform to the first normal form, by allowing "sets and sets of sets to be attribute domains" (Schek 1982). This extension is a (non-optimal) way of implementing hierarchies in relations. Some theoreticians have dubbed this practitioner-developed method "First Abnormal Form"; Codd defined a relational database as using relations, so any table not in 1NF could not be considered to be relational.
Consider the following table:

Non-First Normal Form
Person   Favorite Colors
Bob      blue, red
Jane     green, yellow, red

Assume a person has several favorite colors. Obviously, favorite colors consist of a set of colors modeled by the given table.
To transform this NF² table into 1NF, an "unnest" operator is required, which extends the relational algebra of the higher normal forms. The reverse operator is called "nest"; "nest" is not always the mathematical inverse of "unnest", although "unnest" is the mathematical inverse of "nest". A further constraint is that the operators be bijective, which is covered by the Partitioned Normal Form (PNF). A sketch of the unnested (1NF) form follows.
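A minimal sketch of the 1NF ("unnested") representation of the favorite-colors table; the table and column names are illustrative.

CREATE TABLE PERSON_FAVORITE_COLOR (
    Person VARCHAR(50) NOT NULL,
    Color  VARCHAR(30) NOT NULL,
    PRIMARY KEY (Person, Color)
);

INSERT INTO PERSON_FAVORITE_COLOR (Person, Color) VALUES
    ('Bob',  'blue'),
    ('Bob',  'red'),
    ('Jane', 'green'),
    ('Jane', 'yellow'),
    ('Jane', 'red');
-- Each row now states one atomic fact; the set-valued "Favorite Colors"
-- attribute has been unnested into one row per color.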
Database Anomalies. Database anomalies are problems in relations that occur due to redundancy in the relations. These anomalies affect the process of inserting, deleting and modifying data in the relations, and some important data may be lost if a relation that contains database anomalies is updated. It is important to remove these anomalies in order to perform different processing on the relations without any problem. The different types of database anomalies are as follows:
• Insertion anomaly. The insertion anomaly occurs when a new record is inserted in the relation. In this anomaly, the user cannot insert a fact about an entity until he has an additional fact about another entity.
• Deletion anomaly. The deletion anomaly occurs when a record is deleted from the relation. In this anomaly, the deletion of facts about one entity automatically deletes facts about another entity.
• Modification anomaly. The modification anomaly occurs when a record is updated in the relation. In this anomaly, a modification to the value of a specific attribute requires modification in all records in which that value occurs.
BENEFITS OF DENORMALIZED RELATIONAL DATABASE TABLES
ABSTRACT
Heuristics for denormalizing relational database tables are examined with an objective of improving processing performance for data
insertions, deletions and selection. Client-server applications necessitate consideration of denormalized database schemas as a means
of achieving good system performance where the client-server interface is graphical (GUI) and the network capacity is limited by the
network channel.
INTRODUCTION
Relational database table design efforts encompass both the conceptual and physical modeling levels of the three-schema
architecture. Conceptual diagrams, either entity-relationship or object-oriented, are a precursor to designing relational table structures.
CASE tools will generate relational database tables at least to the third normal form (3NF) based on conceptual models, but have not
advanced to the point that they produce table structures that guarantee acceptable levels of system processing performance.
As firms move away from the mainframe toward cheaper client-server platforms, managers face a different set of issues in the
development of information systems. One critical issue is the need to continue to provide a satisfactory level of system performance,
usually reflected by system response time, for mission-critical, on-line, transaction processing systems.
A fully normalized database schema can fail to provide adequate system response time due to excessive table join operations. It is
difficult to find formal guidance in the literature that outlines approaches to denormalizing a database schema, also termed usage
analysis. This paper focuses on identifying heuristics or rules of thumb that designers may use when designing a relational schema.
Denormalization must balance the need for good system response time with the need to maintain data, while avoiding the various
anomalies or problems associated with denormalized table structures. Denormalization goes hand-in-hand with the detailed analysis of
critical transactions through view analysis. View analysis must include the specification of primary and secondary access paths for
tables that comprise end-user views of the database. Additionally, denormalization should only be attempted under conditions that
allow designers to collect detailed performance measurements for comparison with system performance requirements [1]. Further,
designers should only denormalize during the physical design phase and never during conceptual modeling.
Relational database theory provides guidelines for achieving an idealized representation of data and data relationships. Conversely,
client-server systems require physical database design measures that optimize performance for specific target systems under less than
ideal conditions [1]. The final database schema must be adjusted for characteristics of the environment such as hardware, software,
and other constraints.
We recommend following three general guidelines to denormalization [1]. First, perform a detailed view analysis in order to identify
situations where an excessive number of table joins appears to be required to produce a specific end-user view. While no hard rule
exists for defining "excessive," any view requiring more than three joins should be considered as a candidate for denormalization.
Beyond this, system performance testing by simulating the production environment is necessary to prove the need for denormalization.
Second, the designer should attempt to reduce the number of foreign keys in order to reduce index maintenance during insertions and
deletions. Reducing foreign keys is closely related to reducing the number of relational tables.
Third, the ease of data maintenance provided by normalized table structures must also be provided by the denormalized schema.
Thus, a satisfactory approach would not require excessive programming code (triggers) to maintain data integrity and consistency.
DENORMALIZING FIRST NORMAL FORM (1NF) TO UNNORMALIZED TABLES
There are more published formal guidelines for denormalizing 1NF than for any other normal form. This stems from the fact that
repeating fields occur fairly often in business information systems; therefore, designers and relational database theoreticians have been
forced to come to grips with the issue. It is helpful to examine an example problem for those new to the concept of normalization and
denormalization. Consider a situation where a customer has several telephone numbers that must be tracked. The third normal form
(3NF) solution and the denormalized table structure is given below with telephone number (Phone1, Phone2, Phone3, …) as a repeating
field:
3NF:
CUSTOMER (CustomerId, CustomerName,...)
CUST_PHONE (CustomerId, Phone)
Denormalized:
CUSTOMER (CustomerId, CustomerName, Phone1, Phone2, Phone3, ...)
Which approach is superior? The answer is that it depends on the processing and coding steps needed to store and retrieve the data in
these tables. In the denormalized solution, code must be written (or a screen designed) to enable any new telephone number to be
stored in any of the three Phone fields. Clearly this is not a difficult task and can be easily accomplished with modern CASE tools; still,
this denormalized solution is usually only appropriate if one can guarantee that a customer will have a limited, finite number of
telephone numbers, or if the firm makes a management decision not to store more than "X" telephone numbers.
Both solutions provide good data retrieval of telephone numbers as indicated by the following SQL statements. The denormalized
solution requires a simpler, smaller index structure for the CustomerId key field. The normalized solution would require at least two
indices for the Cust_Phone table - one on the composite key to ensure uniqueness integrity, and one on the CustomerId field to provide a
fast primary access path to records.
Select * from Cust_Phone where CustomerId = '3344';
Select * from Customer where CustomerId = '3344';
If the customer name is also required in the query, then the denormalized solution is superior as no table joins are involved.
Select * from Cust_Phone CP, Customer C where CP.CustomerId = '3344' and CP.CustomerId = C.CustomerId;
The effort required to extract telephone numbers for a specific customer based on the customer name is also more difficult for the 3NF
solution, as a table join is required.
Select * from Cust_Phone CP, Customer C where C.CustomerName = 'Tom Thumb' and CP.CustomerId = C.CustomerId;
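For comparison, the corresponding retrievals against the denormalized CUSTOMER structure shown earlier require no join at all; these statements are a sketch based on that denormalized table, with the column list assumed.

-- Denormalized equivalents of the two join queries above (no joins needed):
Select CustomerId, CustomerName, Phone1, Phone2, Phone3
from Customer
where CustomerId = '3344';

Select CustomerId, CustomerName, Phone1, Phone2, Phone3
from Customer
where CustomerName = 'Tom Thumb';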
If the telephone number fields represent different types of telephones, for example, a voice line, a dedicated fax line, and a
dedicated modem line, then the appropriate table structures are:
3NF:
CUSTOMER (CustomerId, CustomerName, ... )
CUST_PHONE (CustomerId, Phone)
PHONE (Phone, PhoneType)
Denormalized:
CUSTOMER (CustomerId, CustomerName, VoicePhone, FaxPhone, ModemPhone, ...)
Again the denormalized solution is simpler for data storage and retrieval and, as before, the only factor favoring a 3NF solution is the
number of potential telephone numbers that an individual customer may have. The Phone table would be at least 30 to 50 percent the
size of the Cust_Phone table. The decision to denormalize is most crucial when the Customer table is large in a client-server
environment; for example, one hundred thousand customers, each having two or three telephone numbers. The join of Customer,
Cust_phone,
and
Phone
may
be
prohibitive
in
terms
of
processing
efficiency.
DENORMALIZING SECOND NORMAL FORM (2NF) TO 1NF
The well-known order entry modeling problem involving Customers, Orders, and Items provides a realistic situation where
denormalization is possible without significant data maintenance anomalies. Consider the following 3NF table structures.
3NF:
CUSTOMER (CustomerId, CustomerName,...)
ORDER (OrderId, OrderDate, DeliveryDate, Amount, CustomerId)
ORDERLINE (OrderId, ItemId, QtyOrdered, OrderPrice)
ITEM (ItemId, ItemDescription, CurrentPrice)
The many-to-many relationship between the Order and Item entities represents a business transaction that occurs repeatedly over time.
Further, such transactions, once complete, are essentially "written in stone" since the transaction records would never be deleted. Even if
an order is partially or fully deleted at the request of the customer, other tables not shown above will be used to record the deletion of
an item or an order as a separate business transaction for archive purposes.
The OrderPrice field in the Orderline table represents the storage of data that is time-sensitive. The OrderPrice is stored in the Orderline
table because this monetary value may differ from the value in the CurrentPrice field of the Item table since prices may change over
time. While the OrderPrice field is functionally determined by the combination of ItemId and OrderDate, storing the OrderDate field in
the Orderline table is not necessary, since the OrderPrice is recorded at the time that the transaction takes place. Therefore, while
Orderline is not technically in 3NF, most designers would consider the above solution 3NF for all practical purposes. A true 3NF
alternative solution would store price data in an ItemPriceHistory table, but such a solution is not central to the discussion of
denormalization.
In the proposed denormalized 1NF solution shown below (the Customer and Order tables remain unchanged) the Item.ItemDescription
field is duplicated in the Orderline table. This solution violates second normal form (2NF) since the ItemDescription field is fully
determined by ItemId and is not determined by the full key (OrderId + ItemId). Again, this denormalized solution must be evaluated for
the potential effect on data storage and retrieval.
Denormalized 1NF:
ORDERLINE (OrderId, ItemId, QtyOrdered, OrderPrice, ItemDescription)
ITEM (ItemId, ItemDescription, CurrentPrice)
Orderline records for a given order are added to the appropriate tables at the time that an order is made. Since the OrderPrice for a
given item is retrieved from the CurrentPrice field of the Item table at the time that the sale takes place, the additional processing
required to retrieve and store the ItemDescription value in the Orderline table is negligible.
The additional expense of storing Orderline.ItemDescription must be weighed against the processing required to produce various end-user views of these data tables. Consider the production of a printed invoice. The 3NF solution requires joining four tables to produce
an invoice view of the data. The denormalized 1NF solution requires only the Customer, Order, and Orderline tables. Clearly, the
Order table will be accessed by an indexed search based on the OrderId field. Similarly, the retrieval of records from the Orderline
and Item tables may also be via indexes; still, the savings in processing time may offset the cost of extra storage.
An additional issue concerns the storage of consistent data values for the ItemDescription field in the Orderline and Item tables.
Suppose, for example, an item description of a record in the Item table is changed from "Table" to "Table, Mahogany." Is there a
need to also update corresponding records in the denormalized Orderline table? Clearly an argument can be made that these kinds of
data maintenance transactions are unnecessary since the maintenance of the non-key ItemDescription data field in the Orderline table is
not critical to processing the order.
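A sketch of the invoice retrieval under each design, using the table structures given above; the column list and the OrderId value are assumed for illustration, and the Order table is quoted because ORDER is a reserved word in SQL.

-- 3NF: four tables must be joined to produce an invoice line.
Select C.CustomerName, O.OrderId, O.OrderDate, I.ItemDescription, OL.QtyOrdered, OL.OrderPrice
from Customer C, "Order" O, Orderline OL, Item I
where O.OrderId = '9001'
  and O.CustomerId = C.CustomerId
  and OL.OrderId = O.OrderId
  and I.ItemId = OL.ItemId;

-- Denormalized 1NF: ItemDescription is carried in Orderline, so Item is not needed.
Select C.CustomerName, O.OrderId, O.OrderDate, OL.ItemDescription, OL.QtyOrdered, OL.OrderPrice
from Customer C, "Order" O, Orderline OL
where O.OrderId = '9001'
  and O.CustomerId = C.CustomerId
  and OL.OrderId = O.OrderId;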
DENORMALIZING THIRD NORMAL FORM (3NF) TO 2NF
An example for denormalizing from a 3NF to a 2NF solution can be found by extending the above example to include data for
salespersons. The relationship between the Salesperson and Order entities is one-to-many (many orders can be processed by a single
salesperson, but an order is normally associated with one and only one salesperson). The 3NF solution for the Salesperson and Order
tables is given below, along with a denormalized 2NF solution.
3NF:
SALESPERSON (SalespersonId, SalespersonName,...)
ORDER (OrderId, OrderDate, DeliveryDate, Amount, SalespersonId)
Denormalized 2NF:
SALESPERSON (SalespersonId, SalespersonName,...)
ORDER (OrderId, OrderDate, DeliveryDate, Amount, SalespersonId, SalespersonName)
Note that the SalespersonId in the Order table is a foreign key linking the Order and Salesperson tables. Denormalizing the table
structures by duplicating the SalespersonName in the Order table results in a solution that is 2NF, because the non-key SalespersonName
field is determined by the non-key SalespersonId field. What is the effect of this denormalized solution?
By using view analysis for a typical printed invoice or order form, we may discover that most end-user views require printing of a
salesperson's name, not their identification number, on order invoices and order forms. Thus, the 3NF solution requires joining the
Salesperson and Order tables, in addition to the Customer and Orderline tables from the denormalized 1NF solution given in the
preceding section, in order to meet processing requirements.
As before, a comparison of the 3NF solution and the denormalized 2NF solution reveals that the salesperson name could easily be
recorded in the denormalized Order table at the time that the order transaction takes place. Once a sales transaction takes place, the
salesperson credited with making the sale is very unlikely to change.
One should also question the need to maintain the consistency of data between the Order and Salesperson tables. In this situation, we
find that the requirement to support name changes for salespeople is very small, and arises, for the most part, only when a salesperson
changes names due to marriage. Furthermore, the necessity to update the Order table in such a situation is a decision for management
to make. An entirely conceivable notion is that such data maintenance activities may be ignored, since the important issue for
salespersons usually revolves around whether or not they get paid their commission, and the denormalized 2NF solution supports
payroll activities as well as the production of standard end-user views of the database.
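A sketch of the invoice-header retrieval under each design shown above; no join to Salesperson is needed in the denormalized 2NF case. The OrderId value is hypothetical, and the Order table is quoted because ORDER is a reserved word in SQL.

-- 3NF: the salesperson's name requires a join.
Select O.OrderId, O.OrderDate, O.Amount, S.SalespersonName
from "Order" O, Salesperson S
where O.OrderId = '9001'
  and O.SalespersonId = S.SalespersonId;

-- Denormalized 2NF: the name is carried on the order row itself.
Select OrderId, OrderDate, Amount, SalespersonName
from "Order"
where OrderId = '9001';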
DENORMALIZING HIGHER NORMAL FORMS
The concept of denormalizing also applies to the higher order normal forms (fourth normal form - 4NF or fifth normal form - 5NF), but
occurrences of the application of denormalization in these situations are rare. Recall that denormalization is used to improve processing
efficiency, but should not be used where there is the risk of incurring excessive data maintenance problems. Denormalizing from 4NF to
a lower normal form would almost always lead to excessive data maintenance problems. By definition, we normalize to 4NF to avoid
the problem of having to add multiple records to a single table as a result of a single transaction. Data anomalies associated with 4NF
violations only tend to arise when sets of binary relationships between three entities have been incorrectly modeled as a ternary
relationship. The resulting 4NF solution, when modeled in the form of an E-R diagram usually results in two binary one-to-many
relationships. If denormalization offers the promise of improving performance among the entities that are paired in these binary
relationships, then the guidance given earlier under each of the individual 1NF, 2NF, and 3NF sections applies; thus, denormalization
with 4NF would not require new heuristics.
While denormalization may also be used in 5NF modeling situations, the tables that result from the application of 5NF principles are
rarely candidates for denormalization. This is because the number of tables required for data storage has already been minimized.
In essence, the 5NF modeling problem is the mirror image of the 4NF problem. A 5NF anomaly only arises when a database designer
has modeled what should be a ternary relationship as a set of two or more binary relationships.
SUMMARY
This article has described situations where denormalization can lead to improved processing efficiency. The objective is to improve
system response time without incurring a prohibitive amount of additional data maintenance requirements. This is especially important
for client-server systems. Denormalization requires thorough system testing to prove the effect that denormalized table structures have
on processing efficiency. Furthermore, unseen ad hoc data queries may be adversely affected by denormalized table structures.
Denormalization must be accomplished in conjunction with a detailed analysis of the tables required to support various end-user views
of the database. This analysis must include the identification of primary and secondary access paths to data.
Additional consideration may be given to table partitioning that goes beyond the issues that surround table normalization. Horizontal
table partitioning may improve performance by minimizing the number of rows involved in table joins. Vertical partitioning may
improve performance by minimizing the size of rows involved in table joins. A detailed discussion of table partitioning may be found
elsewhere [1].
I work at a company that is doing the following:

* Taking a product line with some minor attribute difference between products (for example, they are identical except for the color of the product)
* Creating a record for the product, and filling in the first color.
* Copying it, filling in the second color.
* Repeating until all colors have been entered.
This creates an amazing amount of duplication, as you can imagine. I grabbed a sample set of data, and 95% of it was
redundant.
When a coworker and I undertook the task of making the data relational, we hit some serious organizational resistance. It
turns out that the decision to denormalize was on purpose (!). The justification was that it's easier to query ("Get me all the
products that are red").
Am I missing something? It seems to me that if a denormalized view is useful, it should be generated from an
AuthoritativeSource that's normalized.
I think I'm going to be in the midst of a data management nightmare in a few months.
Agreed. Denormalization should be a last resort if nothing else can be done to speed things up. It's a basic relational
database design principle that data should be stored OnceAndOnlyOnce.
-- DavidPeterson
Or perhaps these are different products, that just happen to be similar in attributes other than colour. Where is the
denormalisation then?
By normalising, you are saying that conceptually, these colours are kinds of the same product, or subclasses of a
product. But perhaps these concepts are not the most useful.
(More likely you are dead right, but I thought it was worth saying.)
I've seen this done because they didn't want to add an index. Pretty funny until my last contract where they bought an E64K (a
mega-buck Cray equivalent unix server) with gigabytes of memory rather than add an index. Indexes must be heap scary to
some people.
In DataWarehouse building, denormalization seems to be an acceptable technique. For instance, you may have summary
tables that contain information that you could get with a query, but not so quickly. The StarSchema and other techniques
seem to flout NormalForm regularly. (Or am I wrong? We're implementing a DataWarehouse soon, so I've been reading
up, but maybe those with more experience could correct me?) --PhilGroce
If your data isn't changing then DenormalizationIsOk, but if your data is changing then you should be very wary about
denormalizing - it's so easy for duplicated data to become out of sync. Writing insert, update and delete triggers to
update multiple tables is not straightforward and needs careful planning and testing.
It may be an issue of short-term versus long-term. Normalization is the better long-term bet because it allows new requirements
to be added with the minimal fuss. If you optimize the schema for a given current (short-term) usage, you may indeed speed up
that one operation, but open yourself to risk down the road. I have seen a lot of companies with messed-up schemas and it costs
them. Revamping it would be a huge investment because it would impact a lot of applications, so they live with bad schemas.
FutureDiscounting perhaps was overdone. I think it may be better to find a compromise, such as periodic batch jobs that create
read-only tables for special, high-performance needs. That way the live production tables can remain clean. --top
You can refactor things by keeping a normalized database with new tables and building a view with the name of the old
denormalized table, as in the sketch below.
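A minimal sketch of that refactoring, with hypothetical table and column names: the data lives in normalized tables, and a view preserves the old denormalized name for existing queries.

-- New, normalized tables
CREATE TABLE PRODUCT (
    ProductId   INT PRIMARY KEY,
    ProductName VARCHAR(100) NOT NULL
);

CREATE TABLE PRODUCT_COLOR (
    ProductId INT NOT NULL REFERENCES PRODUCT (ProductId),
    Color     VARCHAR(30) NOT NULL,
    PRIMARY KEY (ProductId, Color)
);

-- Compatibility view carrying the name of the old denormalized table
CREATE VIEW PRODUCTS_DENORMALIZED AS
SELECT p.ProductId, p.ProductName, pc.Color
FROM PRODUCT p
JOIN PRODUCT_COLOR pc ON pc.ProductId = p.ProductId;

-- "Get me all the products that are red" is still a single-table query:
-- SELECT * FROM PRODUCTS_DENORMALIZED WHERE Color = 'red';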
It is generally accepted that a DatabaseIsRepresenterOfFacts. This has implications on how we interpret a normalized
schema vs. a denormalized schema.
A fully normalized schema is when we cannot state a fact more than once. A denormalized schema on the contrary allows
us to state a fact more than once, with the following consequences:

* If we say a proposition twice it doesn't make it twice as true.
* If we state two different propositions about essentially the same fact they might not be consistent.
* When the fact has changed in the underlying reality, we might fail to update the corresponding database propositions in all the places where that fact is stated.
I agree with the thrust of what you are saying, but I didn't think that this was the usual definition of normalized.
Normalisation (1NF, 2NF, 3NF, BCNF, 4NF, 5NF; also 6NF if you use DateAndDarwensTemporalDatabaseApproach?) avoids
_particular_ kinds of redundancy and update anomaly, not all kinds -- it doesn't preclude "roll up" running totals in a separate
table, for example, because it deals with relations individually. Also, a schema may be "unnormalised" but still prevent a fact
from being stated more than once using additional constraints (and such a schema may be preferable to a normalised one, even
as a logical schema, because normalisation beyond 3NF (IIRC) may make certain constraints harder to write). I think we need a
different term for a schema (relations+constraints) in which it is impossible to be inconsistent. "Consistent" itself is rather
overused, perhaps, so maybe "consistency-constrained"?
Therefore a denormalized schema is never justified. A [de- surely?] normalized schema is very often justified by the claim that it
helps performance, but this is hardly a good justification.
{If the boss/owner wants a faster system, you either give it to them or lose your job to somebody who can. Purists are not
treated very well in the work world, for good or bad. Denormalization for performance takes advantage of the fact that
any given cell generally has a read-to-write ratio of roughly 10:1 or more in practice. As long as this lopsided relationship
exists, it appears that denormalization, or at least "join-and-copy", is an economical way to gain performance. It is roughly
analogous to caching on a larger scale: if you read the same item over and over, it makes sense to keep a local copy
instead of keep fetching from the "master copy" on each read.}
{{*First*, (even with current SQL DBMSs) there are better ways to achieve this than corrupting the logical data model:
materialised views (indexed views, for SQL Server), for example. *Second*, this is the worst kind of premature optimisation
-- not only are you spending the effort when it might not actually help (you've made the data bigger = slower than it
should be in cases when you don't need the "joined" columns), in the process you are corrupting perhaps the most important
public interface in the system, on which there will be many many dependencies (what happens when a new, more
performance-critical app using the same data wants an incompatible "denormalisation"?). *Third*, in many cases, any
performance benefits are lost in the additional validation needed on update to enforce consistency ("cache coherency", if
you want to think of it as a cache) between the duplicates -- although queries may be 90% of the load, updates have a
disproportionate effect (due to the "strength" difference between X and S locks, and because the "denormalisation" often
means that the consistency checks are expensive as they need to read several rows).}}
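For concreteness, a hedged sketch of the materialized-view alternative mentioned above, using generic CREATE MATERIALIZED VIEW syntax (as found in, e.g., Oracle or PostgreSQL); the table and column names are hypothetical.

-- Base tables stay fully normalized; the "join-and-copy" lives in the view.
CREATE MATERIALIZED VIEW ORDER_WITH_SALESPERSON AS
SELECT o.OrderId, o.OrderDate, o.Amount, s.SalespersonName
FROM ORDERS o
JOIN SALESPERSON s ON s.SalespersonId = o.SalespersonId;

-- Readers query the precomputed join; the DBMS, not application code, is
-- responsible for keeping the copy consistent (on refresh or on commit,
-- depending on the product and the refresh options chosen).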
Denormalization biases the schema towards one application. So the denormalized schema suits better the chosen application
and makes the life harder for other ones. This might be justified, e.g. a DataWarehouse needs denormalized schema.
Why do you think that a DataWarehouse needs a denormalized schema ? A denormalized schema is automatically a
redundant schema. A redundant schema can serve no logical purpose when compared to the equivalent non-redundant
schema.
What it serves is a physical purpose: to be able to answer certain queries faster, for example by grouping related
information closer together on the disk, precomputing the results of frequently asked queries, or partial results needed for
the same purpose and storing them on the disk. However, this all comes at the expense of a greater risk of breaking logical
correctness because physical concerns surface to the logical level. The traditional approach of denormalizing data
warehouses was favored by poor implementations of the relational model in commercial SQL databases. The relational
model mandates a sharp separation between the logical level and the physical level, which is rarely the case in current
DBMSes, although they are slowly making progress. For example, there will almost always be a 1-1 relationship between
logical structures (tables) and physical structures (table extents, type of row stored in physical pages).
The right solution to solving the datawarehouse kind of problem is to be able to have controlled redundancy and more
flexibility in storage at the physical level. For example, pieces of information from different tables that are accessed
together can be stored in the same physical disk page; even the same piece of information can be stored several times in
various pages together with other related information. However, this can all be done automatically by the DBMS, while the
user will still see only one logical structure (the initial table), so the user can only insert/update/delete the fact once. This is
not possible in a denormalized schema. The piece of the database kernel that controls the redundancy at the physical level
can be thoroughly tested and doesn't run the risk of corrupting data like all the ad-hoc user programs that are needed to
maintain a redundant schema.
Recently various SQL databases have been making progress toward this kind of flexibility in the physical layout through
the use of clustered tables (grouping information from several tables in the same physical location to avoid the join penalty),
table partitioning (breaking a table apart into several physical locations either by data range or by groups of columns), and
materialized views (storing and maintaining precomputed results for important queries).
-CostinCozianu
Agreed. DataWarehouse approaches like StarSchemas are doomed to be obsoleted (WithinFiveYears?? Perhaps even today,
for many DW applications) by improvements to logical-physical separation in main-stream RDBMS products.
The Dangerous Illusion: Denormalization, Performance and Integrity, Part 1
 Fabian Pascal
 DM Review Magazine, June 2002
"A traditional normalized structure cannot and will not outperform a denormalized star schema from a decision support system (DSS)
perspective. These schemas are designed for speed, not your typical record style transaction ... [DSS] requests put way to [sic] much stress on
a model that was originally developed to support the handling of single record transactions."
– Practitioner with 20 years of experience
"Third normal form seems to be regarded by many as the point where your database will be most efficient ... If your database is
overnormalized [sic] you run the risk of excessive table joins. So you denormalize and break theoretical rules for real world performance
gains."
– http://www.sqlteam.com/Forums/
"Denormalization can be described as a process for reducing the degree of normalization with the aim of improving query processing
performance...we present a practical view of denormalization, and provide fundamental guidelines for denormalization ... denormalization
can enhance query performance when it is deployed with a complete understanding of application requirements."
– Sanders & Shin, State University NY Buffalo
Those familiar with my work know I have deplored the lack of foundational knowledge and sound practices in the database trade for
years. One of the most egregiously abused aspects is normalization, as reflected in the opening quotes. The prevailing argument is as
follows:
1. The higher the level of normalization, the greater the number of tables in the database;
2. The greater the number of tables in the database, the more joins are necessary for data manipulation;
3. Joins slow performance;
4. Denormalization reduces the number of tables and, therefore, the reliance on joins, which speeds up performance.
The problem is that points 2 and 3 are not necessarily true, in which case point 4 does not hold. What is more, even if the claims were
all true, denormalization would still be highly questionable because performance gains, if any, can only be had at the expense of
integrity.
The purpose of this article is to explain just why the notion of "denormalization for performance" is a misconception and to expose its
costly dangers, of which practitioners are blissfully unaware. If the integrity cost of denormalization is taken into consideration, it will
cancel out performance gains, if any.
The Key, the Whole Key and Nothing But the Key
Informally, a fully normalized relational database consists of R-tables. Each represents just one entity type. In such tables, all non-key
columns represent attributes of one entity type. The attributes of an entity type are represented by columns, and the entities of that
type are represented by rows. Because a table represents only one entity type and the entities of that type are identified by the key,
non-key values or sets thereof are dependent on (associated with) key values. Stated another way, non-key columns are "directly about
the key, the whole key and nothing but the key."
By definition, key columns do not have duplicate values; therefore, non-key columns dependent on the key won't have duplicate values
either. That is why there is no redundancy due to such dependencies in fully normalized tables. Practically, what this means is that in a fully
normalized database, it is sufficient for the database designer to declare key constraints to the RDBMS, and the integrity of the
database insofar as column dependencies are concerned will be guaranteed.1
Denormalizing a database means, again informally, "bundling" multiple entity types in one table. In a table that represents multiple
entity types, some non-key columns will not be directly about the key, the whole key and nothing but the key. The following kinds of
dependency are possible in such tables, each of which is associated with a normal form lower than full normalization:
1. Dependency on part of a (composite) key; table not in 2NF.
2. Indirect dependency on the key; dependency on another non-key column that is dependent on the key; table not in 3NF.
3. Intra-key multivalued dependency; table not in 4NF.
4. Intra-key join dependency; table not in 5NF.2
Non-key columns and columns that are part of a (composite) key are not guaranteed to have unique values. If they have duplicates in
some rows in the table, other columns in the table dependent on such columns (rather than directly on the key, the whole key and nothing
but the key) will also have duplicate values in those same rows, as highlighted in table A in Figure 1, because of the association with
those non-key values. This causes redundancy in denormalized tables. For such tables, key constraints are no longer sufficient to
guarantee integrity. The DBMS must also control the redundancy, so that when a table such as A is updated, all stored instances of the
same data are updated. If this is not done, the database can be corrupted and become inconsistent. This complication is not necessary
for fully normalized tables (such as B1 and B2 in Figure 1) because they don't have redundancy due to under-normalization.
Figure 1: Denormalized versus Normalized Tables
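To make the integrity cost concrete, the following hedged sketch (with hypothetical tables, not the actual tables of Figure 1) shows the extra work the update path must perform once a fact is stored redundantly.

-- Fully normalized: the customer's address is stored once; one statement suffices,
-- and the declared key constraint is all the DBMS needs to keep the data consistent.
UPDATE CUSTOMER
SET    Address = '12 New Street'
WHERE  CustomerId = 'C42';

-- Denormalized: the address is repeated on every order row, so every stored
-- instance must also be updated (by application code or triggers), or the
-- table will give conflicting answers about the customer's address.
UPDATE ORDERS
SET    CustomerAddress = '12 New Street'
WHERE  CustomerId = 'C42';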
Thus, from an integrity perspective, there are two design options:
1. Fully normalize the database and maximize the simplicity of integrity enforcement;
2. Denormalize the database and complicate enormously (prohibitively, as we shall see shortly) integrity enforcement.
Obviously the first choice is the better option. Why, then, the prevailing insistence on the second choice? The argument for
denormalization is, of course, based on performance considerations; but does it make sense?
The Logical-Physical Confusion
Suppose I ask you to retrieve two sets of books from a library: one set of five books and one of ten. If I ask you which set will take you
longer to retrieve, would you know? When I ask practitioners this question, many realize that the time required for set retrieval
depends not on the number of books per se, but on the characteristics of the library: how the books are physically stored and retrieved.
It is entirely possible for 10 books to take less time if, for example, they are all stored together while the five are scattered across the
library; or if the 10 are on the first floor and the five on the sixth; or if the 10 are indexed in a catalog, but the five are not. (If any of
the books were being used or borrowed by others, retrieval could take a very long time indeed.)
By definition, normalization increases the number of logical tables – books in the analogy – in the database. If the number of books
says nothing about library retrieval performance, how can denormalization – decreasing the number of logical tables – say anything
about database performance? As in the library case, performance is determined entirely at the physical database level (storage and
access methods, hardware, physical design, DBMS implementation details, degree of concurrent access, etc.). What, then, does
"denormalization for performance" mean? When I ask this question, I generally get answers along the following line: "Well, it's all fine
in theory, but we live in the real world. The reality is that I cannot get acceptable performance unless I denormalize my databases."
Let us assume, for the sake of argument, that this is always so (even though it's not). The only logical conclusion possible is not that
normalization slows performance and denormalization speeds it up. To the extent that performance with fully normalized databases is
slow and it improves with denormalization, this can only be due to the physical implementations of the database and DBMS. Those
implementations are simply not good enough for correct logical database design, forcing you to introduce redundancy in the database
and, as we shall see, trade integrity for performance.
Okay, you say, but I must work with existing DBMS software anyway so what difference does your argument make? First of all, it is
very important to realize that it is not normalization, but physical implementation that determines performance (more on this in Part 2).
Secondly, the only reason you may sometimes achieve better performance when you denormalize your database is because you ignore
the integrity implications. If and when those are taken into account, you will lose the performance gain, if any, from denormalization and
may even end up with a net performance loss relative to the fully normalized alternative.
I will demonstrate this next month in Part 2. Stay tuned.
References:
1. There may be redundancies in the database for reasons other than such column dependencies (e.g. derived columns, cross-table duplicates), which must be resolved separately, by means other than full normalization (see Chapters 5 and 8 in Practical Issues In Database Management, Addison Wesley, 2000).
2. All R-tables are by definition in first normal form (1NF), or normalized – they don't have multivalued columns (known in older databases as repeating groups). Normalized databases can be further normalized to the higher normal forms 2NF–5NF. If there are no composite keys in the database, a normalized database is automatically also in 2NF, and a database in 3NF is automatically also in 5NF, or fully normalized. In other words, you need to worry about 4NF and 5NF only if composite keys are present (see Chapter 5 in Practical Issues In Database Management, Addison Wesley, 2000).
Definitions of the Normal Forms
1st Normal Form (1NF)
A table (relation) is in 1NF if:
1. There are no duplicated rows in the table.
2. Each cell is single-valued (no repeating groups or arrays).
3. Entries in a column (field) are of the same kind.
*The requirement that there be no duplicated rows in the table means that the table has a key (although the key might be made up of more than one column, possibly even of all the columns).
2nd Normal Form (2NF)
A table is in 2NF if it is in 1NF and if all non-key attributes are dependent on all of the key. Since a partial dependency
occurs when a non-key attribute is dependent on only a part of the composite key, the definition of 2NF is
sometimes phrased as, “A table is in 2NF if it is in 1NF and if it has no partial dependencies.”
3rd Normal Form (3NF)
A table is in 3NF if it is in 2NF and if it has no transitive dependencies.
Boyce-Codd Normal Form (BCNF)
A table is in BCNF if it is in 3NF and if every determinant is a candidate key.
4th Normal Form (4NF)
A table is in 4NF if it is in BCNF and if it has no multi-valued dependencies.
5th Normal Form (5NF)
A table is in 5NF, also called “Projection-join Normal Form” (PJNF), if it is in 4NF and if every join dependency in the table
is a consequence of the candidate keys of the table.
Domain-Key Normal Form (DKNF)
A table is in DKNF if every constraint on the table is a logical consequence of the definition of keys and domains.
Database Anomalies
Insertion Anomaly
It is a failure to place information about a new database entry into all the places in the database where information about the new entry needs to be stored. In a properly normalized database, information about a new entry needs to be inserted into only one place in the database; in an inadequately normalized database, information about a new entry may need to be inserted into more than one place and, human fallibility being what it is, some of the needed additional insertions may be missed.
Deletion Anomaly
It is a failure to remove information about an existing database entry when it is time to remove that entry. In a properly normalized database, information about an old, to-be-removed entry needs to be deleted from only one place in the database; in an inadequately normalized database, information about that old entry may need to be deleted from more than one place, and some of the needed deletions may be missed.
Update Anomaly
An update of a database involves modifications that may be additions, deletions, or both. Thus “update anomalies” can be
either of the kinds discussed above.
All three kinds of anomalies are highly undesirable, since their occurrence constitutes corruption of the database. Properly normalized databases are much less susceptible to corruption than are un-normalized databases.
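As an illustrative sketch (not part of the original text; the table and rows are hypothetical, borrowing the puppy/kennel example used in the sections below), here is a deletion anomaly playing out in an under-normalized SQLite table:
import sqlite3

con = sqlite3.connect(":memory:")
# Under-normalized: facts about the kennel live only on puppy rows.
con.execute("""CREATE TABLE puppy_flat (
    puppy_number INTEGER PRIMARY KEY,
    puppy_name   TEXT,
    kennel_code  TEXT,
    kennel_name  TEXT
)""")
con.execute("INSERT INTO puppy_flat VALUES (1, 'Fifi', 'K1', 'Puppy Farm')")
# Deleting the kennel's only puppy also destroys the last record of the kennel itself.
con.execute("DELETE FROM puppy_flat WHERE puppy_number = 1")
print(con.execute(
    "SELECT kennel_code, kennel_name FROM puppy_flat WHERE kennel_code = 'K1'").fetchall())
# -> []  the kennel has silently vanished from the database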
Rules of Normalization I
To normalize databases, there are certain rules to keep in mind. These pages will illustrate the basics of normalization in a
simplified way, followed by some examples.
Database Normalization Rule 1: Eliminate Repeating Groups. Make a separate table for each set of related attributes, and give
each table a primary key.
Unnormalized Data Items for Puppies
 puppy number
 puppy name
 kennel code
 kennel name
 kennel location
 trick ID
 trick name
 trick where learned
 skill level
In the original list of data, each puppy description is followed by a list of tricks the puppy has learned. Some might know
10 tricks, some might not know any. To answer the question “Can Fifi roll over?” we need first to find Fifi’s puppy record,
then scan the list of tricks associated with the record. This is awkward, inefficient, and extremely untidy.
Moving the tricks into a separate table helps considerably. Separating the repeating groups of tricks from the puppy
information results in first normal form. The puppy number in the trick table matches the primary key in the puppy table,
providing a foreign key for relating the two tables with a join operation. Now we can answer our question with a direct retrieval: look to see whether Fifi’s puppy number and the trick ID for “roll over” appear together in the trick table.
First Normal Form:
Puppy Table
puppy number (primary key)
puppy name
kennel code
kennel name
kennel location
Trick Table
puppy number
trick ID
trick name
trick where learned
skill level
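The retrieval just described can be sketched in SQLite as follows (this code is not part of the original tutorial; the sample rows, identifiers and column spellings are mine). The question “Can Fifi roll over?” becomes a single query over the join of the two tables:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE puppy (
    puppy_number    INTEGER PRIMARY KEY,
    puppy_name      TEXT,
    kennel_code     TEXT,
    kennel_name     TEXT,
    kennel_location TEXT
);
CREATE TABLE trick (
    puppy_number        INTEGER REFERENCES puppy(puppy_number),
    trick_id            INTEGER,
    trick_name          TEXT,
    trick_where_learned TEXT,
    skill_level         INTEGER,
    PRIMARY KEY (puppy_number, trick_id)
);
""")
con.execute("INSERT INTO puppy VALUES (1, 'Fifi', 'K1', 'Puppy Farm', 'Anytown')")
con.execute("INSERT INTO trick VALUES (1, 27, 'roll over', 'K1', 9)")
# "Can Fifi roll over?" -- one direct retrieval over the join of the two tables.
row = con.execute("""
    SELECT 1
    FROM puppy p JOIN trick t ON t.puppy_number = p.puppy_number
    WHERE p.puppy_name = 'Fifi' AND t.trick_name = 'roll over'
""").fetchone()
print("Yes" if row else "No")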
Rules of Normalization II
Database Normalization Rule 2: Eliminate Redundant Data. If an attribute depends on only part of a multi-valued (composite) key, remove it to a separate table.
The trick name (e.g. “roll over”) appears redundantly for every puppy that knows it. Just the trick ID would do.
TRICK TABLE
Puppy Number | Trick ID | Trick Name   | Where Learned | Skill Level
52           | 27       | “roll over”  | 16            | 9
53           | 16       | “Nose Stand” | 9             | 9
54           | 27       | “roll over”  | 9             | 5
*Note that trick name depends on only a part (the trick ID) of the multi-valued, i.e. composite key.
In the trick table, the primary key is made up of the puppy number and trick ID. This makes sense for the “where learned”
and “skill level” attributes, since they will be different for every puppy-trick combination. But the trick name depends only
on the trick ID. The same name will appear redundantly every time its associated ID appears in the trick table.
Second Normal Form
Puppy Table
puppy number
puppy name
kennel code
kennel name
kennel location
Tricks Table
tricks ID
tricks name
Puppy-Tricks
puppy number
trick ID
trick where learned
skill level
Suppose you want to reclassify a trick, i.e. to give it a different trick ID. The change has to be made for every puppy that knows the trick. If you miss some of the changes, you will have several puppies with the same trick under different IDs; this is an update anomaly.
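A sketch of the second-normal-form design in SQLite (not part of the original text; identifiers are mine). Because the trick name is now stored exactly once in the Tricks table, correcting or renaming it is a single-row update rather than one update per puppy that knows the trick:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tricks (
    trick_id   INTEGER PRIMARY KEY,
    trick_name TEXT
);
CREATE TABLE puppy_tricks (
    puppy_number        INTEGER,
    trick_id            INTEGER REFERENCES tricks(trick_id),
    trick_where_learned TEXT,
    skill_level         INTEGER,
    PRIMARY KEY (puppy_number, trick_id)
);
""")
con.execute("INSERT INTO tricks VALUES (27, 'roll over')")
con.executemany("INSERT INTO puppy_tricks VALUES (?, ?, ?, ?)",
                [(52, 27, '16', 9), (54, 27, '9', 5)])
# The trick name lives in one row only, so this touches exactly one record.
con.execute("UPDATE tricks SET trick_name = 'rolling over' WHERE trick_id = 27")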
Rules of Normalization III
Database Normalization Rule 3: Eliminate columns not dependent on key. If attributes do not contribute to a description of the
key, remove them to a separate table.
Puppy Table
puppy number
puppy name
kennel code
kennel name
The puppy table satisfies the first normal form, since it contains no repeating groups. It satisfies the second normal form, since it does not have a multivalued (composite) key. But the key is puppy number, and the kennel name and the kennel location describe only a kennel, not a puppy. To achieve the third normal form, they must be moved into a separate table. Since they describe a kennel, kennel code becomes the key of the new “kennels” table.
Third Normal Form
Puppies
puppy number
puppy name
kennel code
Kennel
kennel code
kennel name
kennel location
Tricks
trick ID
trick name
Puppy Tricks
puppy number
trick ID
trick where learned
skill level
The motivation for this is the same as for the second normal form: we want to avoid update and delete anomalies. For example, suppose no puppies from the Puppy Farm kennel were currently stored in the database. With the previous design, there would be no record of that kennel's existence.
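A sketch of the third-normal-form design in SQLite (not part of the original text; identifiers are mine), showing that the deletion anomaly just described no longer occurs: the kennel record survives even when its last puppy is removed.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE kennels (
    kennel_code     TEXT PRIMARY KEY,
    kennel_name     TEXT,
    kennel_location TEXT
);
CREATE TABLE puppies (
    puppy_number INTEGER PRIMARY KEY,
    puppy_name   TEXT,
    kennel_code  TEXT REFERENCES kennels(kennel_code)
);
""")
con.execute("INSERT INTO kennels VALUES ('K1', 'Puppy Farm', 'Anytown')")
con.execute("INSERT INTO puppies VALUES (1, 'Fifi', 'K1')")
# Removing the kennel's last puppy does not remove the kennel itself.
con.execute("DELETE FROM puppies WHERE kennel_code = 'K1'")
print(con.execute("SELECT * FROM kennels").fetchall())  # [('K1', 'Puppy Farm', 'Anytown')]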
Boyce-Codd Normal Form
The previous normal forms are considered elementary and should be applied to tables during the design process. This normal form, however, and the following ones, matter only for tables with special characteristics.
A table is considered to be in BCNF (Boyce-Codd Normal Form) if it is already in 3NF and contains no nontrivial functional dependency whose determinant is not a candidate key; that is, no field (other than a candidate key) determines the value of another field. Let’s take the following table:
Student | Subject | Teacher
Smith   | Math    | Dr. White
Smith   | English | Dr. Brown
Jones   | Math    | Dr. White
Jones   | English | Dr. Brown
Doe     | Math    | Dr. Green
Take the following conditions into consideration:
 For each subject, every student is educated by one teacher.
 Every teacher teaches one subject only.
 Each subject can be taught by more than one teacher.
It’s clear we have the following functional dependency:
Teacher -> Subject
and the left side of this dependency (Teacher) is not a candidate key.
So, to convert the table from 3NF to BCNF, we take these steps:
 Identify a determinant in the table other than the primary key: the left side of the offending functional dependency (here, Teacher).
 Remove the attribute on the right side of that dependency (Subject) from the main table.
 Create a separate table for the dependency, with its key being the left side of the dependency, as follows:
Student | Teacher
Smith   | Dr. White
Smith   | Dr. Brown
Jones   | Dr. White
Jones   | Dr. Brown
Doe     | Dr. Green
And
Teacher   | Subject
Dr. White | Math
Dr. Brown | English
Dr. Green | Math
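A sketch of the same BCNF decomposition in SQLite (not part of the original text; table names are mine). Making Teacher the key of its own table lets the declared key enforce Teacher -> Subject, and the original Student/Subject/Teacher view can still be recovered with a join:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Teacher -> Subject is enforced by making Teacher the key of its own table.
CREATE TABLE teacher_subject (
    teacher TEXT PRIMARY KEY,
    subject TEXT
);
CREATE TABLE student_teacher (
    student TEXT,
    teacher TEXT REFERENCES teacher_subject(teacher),
    PRIMARY KEY (student, teacher)
);
""")
con.executemany("INSERT INTO teacher_subject VALUES (?, ?)",
                [("Dr. White", "Math"), ("Dr. Brown", "English"), ("Dr. Green", "Math")])
con.executemany("INSERT INTO student_teacher VALUES (?, ?)",
                [("Smith", "Dr. White"), ("Smith", "Dr. Brown"),
                 ("Jones", "Dr. White"), ("Jones", "Dr. Brown"),
                 ("Doe",   "Dr. Green")])
# The original Student/Subject/Teacher view can be recovered with a join:
print(con.execute("""
    SELECT st.student, ts.subject, st.teacher
    FROM student_teacher st JOIN teacher_subject ts ON ts.teacher = st.teacher
""").fetchall())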
Rules of Normalization IV
Database Normalization Rule 4: Isolate independent multiple relationships. No table may contain two or more 1:n (one-to-many) or n:m (many-to-many) relationships that are not directly related.
This applies only to designs that include one-to-many and many-to-many relationships. An example of a one-to-many
relationship is that one kennel can hold many puppies. An example of a many-to-many relationship is that a puppy can
know many tricks and several puppies can know the same tricks.
Puppy Tricks and Costumes
puppy number
trick ID
trick where learned
skill level
costume
Suppose we want to add a new attribute to the Puppy-Tricks table, “Costume”; this way we can look for puppies that can both sit up and beg and wear a Groucho Marx mask, for example. The fourth normal form dictates against this (i.e. against using the Puppy-Tricks table this way, not against begging while wearing a Groucho mask). The two attributes do not share a meaningful relationship. A puppy may be able to wear a wet suit; this does not mean it can simultaneously sit up and beg. How would you represent this if you stored both attributes in the same table?
Fourth Normal Form
Puppies
puppy number
puppy name
kennel code
Kennels
kennel code
kennel name
kennel location
Tricks
trick ID
trick name
Puppy-Tricks
puppy number
trick ID
trick where learned
skill level
Costumes
costume number
costume name
Puppy-Costumes
puppy number
costume number
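A sketch of the fourth-normal-form design in SQLite (not part of the original text; identifiers are mine, and the detail columns of Puppy-Tricks are trimmed for brevity). Tricks and costumes are recorded in independent tables, so no artificial trick/costume pairing is stored:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE puppy_tricks (
    puppy_number INTEGER,
    trick_id     INTEGER,
    PRIMARY KEY (puppy_number, trick_id)
);
CREATE TABLE puppy_costumes (
    puppy_number   INTEGER,
    costume_number INTEGER,
    PRIMARY KEY (puppy_number, costume_number)
);
""")
# Saying that puppy 1 knows trick 27 and can also wear costume 3 takes one row
# in each table; the two facts never have to be combined in a single record.
con.execute("INSERT INTO puppy_tricks VALUES (1, 27)")
con.execute("INSERT INTO puppy_costumes VALUES (1, 3)")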
Examples
Question 1. Choose a key and write the dependencies for the following GRADES relation:
GRADES(Student_ID, Course#, Semester#, Grade)
Answer:
Key is:
Student_ID, Course#, Semester#
Dependency is:
Student_ID, Course#, Semester# -> Grade
Question 2. Choose a key and write the dependencies for the LINE_ITEMS relation:
LINE_ITEMS (PO_Number, ItemNum, PartNum, Description, Price, Qty)
Answer:
Key can be: PO_Number, ItemNum
Dependencies are:
PO_Number, ItemNum -> PartNum, Description, Price, Qty
PartNum -> Description, Price
Question 3. What normal form is the above LINE_ITEMS relation in?
Answer:
First off, LINE_ITEMS cannot be in BCNF because not all determinants are keys.
Next, it cannot be in 3NF because there is a transitive dependency:
PO_Number, ItemNum -> PartNum
and
PartNum -> Description
Therefore, it must be in 2NF. We can check that this is true because the key (PO_Number, ItemNum) determines all of the non-key attributes, while PO_Number by itself and ItemNum by itself cannot determine any other attributes.
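As a follow-up sketch (not part of the original answer; the table definitions are one reasonable reading of the question), the transitive dependency PartNum -> Description, Price can be removed by splitting LINE_ITEMS, which takes the design to 3NF:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Description and Price now depend on the key of their own table (PartNum).
CREATE TABLE parts (
    PartNum     TEXT PRIMARY KEY,
    Description TEXT,
    Price       REAL
);
CREATE TABLE line_items (
    PO_Number TEXT,
    ItemNum   INTEGER,
    PartNum   TEXT REFERENCES parts(PartNum),
    Qty       INTEGER,
    PRIMARY KEY (PO_Number, ItemNum)
);
""")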
Example II
Question 4: What normal form is the following relation in?
STORE_ITEM( SKU, PromotionID, Vendor, Style, Price )
SKU, PromotionID -> Vendor, Style, Price
SKU -> Vendor, Style
Answer:
STORE_ITEM is in 1NF but not 2NF, because non-key attributes (Vendor, Style) are dependent on only part of the key (SKU).
Question 5: Normalize the above (Q4) relation into the next higher normal form.
Answer:
STORE_ITEM (SKU, PromotionID, Price)
VENDOR ITEM (SKU, Vendor, Style)
Question 6: Choose a key and write the dependencies for the following SOFTWARE relation (assume all of the vendor’s
products have the same warranty).
SOFTWARE (SoftwareVendor, Product, Release, SystemReq, Price, Warranty)
SoftwareVendor, Product, Release -> SystemReq, Price, Warranty
Answer:
key is: SoftwareVendor, Product, Release
SoftwareVendor, Product, Release -> SystemReq, Price, Warranty
SoftwareVendor -> Warranty
Therefore, SOFTWARE is in 1NF (SoftwareVendor -> Warranty is a partial dependency on the key).
Example III
Question 7: Normalize the above Software relation into 4NF.
Answer:
SOFTWARE (SoftwareVendor, Product, Release, SystemReq, Price)
WARRANTY (SoftwareVendor, Warranty)
Question 8: What normal form is the following relation in?
STUFF (H, I, J, K, L, M, N, O)
Only H, I can act as the key.
H, I -> J, K, L
J -> M
K -> N
L -> O
Answer:
2NF (transitive dependencies J -> M, K -> N and L -> O exist, so it is not in 3NF)
Question 9: What normal form is the following relation in?
STUFF2 (D, O, N, T, C, R, Y)
D, O -> N, T, C, R, Y
C, R -> D
D -> N
Answer:
1NF (the partial-key dependency D -> N exists, so it is not in 2NF)
Example IV
Invoice relation
Is this relation in 1NF? 2NF? 3NF?
Convert the relation to 3NF.
Inv# | date  | custID | Name | Part# | Desc | Price | #Used | Ext Price | Tax rate | Tax  | Total
14   | 12/63 | 42     | Lee  | A38   | Nut  | 0.32  | 10    | 3.20      | 0.10     | 1.22 | 13.42
14   | 12/63 | 42     | Lee  | A40   | Saw  | 4.50  | 2     | 9.00      | 0.10     | 1.22 | 13.42
15   | 1/64  | 44     | Pat  | A38   | Nut  | 0.32  | 20    | 6.40      | 0.10     | 0.64 | 7.04
Table not in 1NF because it contains derived values:
- Ext Price (= Price × #Used): 3.20 = 0.32 × 10
- Tax (= sum of Ext Price for the same Inv# × Tax rate): 1.22 = (3.20 + 9.00) × 0.10
- Total (= sum of Ext Price + Tax): 13.42 = (3.20 + 9.00) + 1.22
To get 1NF, identify PK and remove derived attributes
Inv# | date  | custID | Name | Part# | Desc | Price | #Used | Tax rate
14   | 12/63 | 42     | Lee  | A38   | Nut  | 0.32  | 10    | 0.10
14   | 12/63 | 42     | Lee  | A40   | Saw  | 4.50  | 2     | 0.10
15   | 1/64  | 44     | Pat  | A38   | Nut  | 0.32  | 20    | 0.10
To get 2NF
- Remove partial dependencies, i.e. FDs on part of the composite key (Inv#, Part#):
- Inv# -> Date, CustID, Name, Tax rate
- Part# -> Desc, Price
Remove Partial FDs
(K1 = Inv#; D1 = date, custID, Name, Tax rate; K2 = Part#; D2 = Desc, Price; #Used depends on the whole key K1 + K2)
Inv# | date  | custID | Name | Tax rate | Part# | Desc | Price | #Used
14   | 12/63 | 42     | Lee  | 0.10     | A38   | Nut  | 0.32  | 10
14   | 12/63 | 42     | Lee  | 0.10     | A40   | Saw  | 4.50  | 2
15   | 1/64  | 44     | Pat  | 0.10     | A38   | Nut  | 0.32  | 20
=
Inv# | date  | custID | Name | Tax rate
14   | 12/63 | 42     | Lee  | 0.10
15   | 1/64  | 44     | Pat  | 0.10
+
Inv# | Part# | #Used
14   | A38   | 10
14   | A40   | 2
15   | A38   | 20
+
Part# | Desc | Price
A38   | Nut  | 0.32
A40   | Saw  | 4.50
Remove transitive FD
Inv# (PK) -> custID -> Name
Inv# | date  | custID | Name | Tax rate
14   | 12/63 | 42     | Lee  | 0.10
15   | 1/64  | 44     | Pat  | 0.10
=
Inv# | date  | custID | Tax rate
14   | 12/63 | 42     | 0.10
15   | 1/64  | 44     | 0.10
+
custID | Name
42     | Lee
44     | Pat
All relations in 3NF
Inv# | date  | custID | Tax rate
14   | 12/63 | 42     | 0.10
15   | 1/64  | 44     | 0.10
custID | Name
42     | Lee
44     | Pat
Inv# | Part# | #Used
14   | A38   | 10
14   | A40   | 2
15   | A38   | 20
Part# | Desc | Price
A38   | Nut  | 0.32
A40   | Saw  | 4.50
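As a sketch (not part of the original example; identifiers are mine), the final 3NF design can be declared in SQLite and the derived values dropped earlier (Ext Price, Tax, Total) recomputed on demand instead of being stored:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE invoice  (inv_no INTEGER PRIMARY KEY, inv_date TEXT,
                       cust_id INTEGER, tax_rate REAL);
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE part     (part_no TEXT PRIMARY KEY, descr TEXT, price REAL);
CREATE TABLE line     (inv_no INTEGER, part_no TEXT, used INTEGER,
                       PRIMARY KEY (inv_no, part_no));
""")
con.executemany("INSERT INTO invoice VALUES (?, ?, ?, ?)",
                [(14, "12/63", 42, 0.10), (15, "1/64", 44, 0.10)])
con.executemany("INSERT INTO customer VALUES (?, ?)", [(42, "Lee"), (44, "Pat")])
con.executemany("INSERT INTO part VALUES (?, ?, ?)",
                [("A38", "Nut", 0.32), ("A40", "Saw", 4.50)])
con.executemany("INSERT INTO line VALUES (?, ?, ?)",
                [(14, "A38", 10), (14, "A40", 2), (15, "A38", 20)])
# Derived values (Ext Price, Tax, Total) are computed when needed, not stored.
print(con.execute("""
    SELECT i.inv_no,
           SUM(p.price * l.used)                    AS ext_total,
           SUM(p.price * l.used) * i.tax_rate       AS tax,
           SUM(p.price * l.used) * (1 + i.tax_rate) AS total
    FROM invoice i JOIN line l ON l.inv_no = i.inv_no
                   JOIN part p ON p.part_no = l.part_no
    GROUP BY i.inv_no, i.tax_rate
""").fetchall())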
Example V
On this page you will see a basic database normalization example, transforming a BCNF table into 4NF one(s).
The next table is in BCNF; convert it to the fourth normal form (4NF).
Employee | Skill      | Language
Jones    | electrical | French
Jones    | electrical | German
Jones    | mechanical | French
Jones    | mechanical | German
Smith    | plumbing   | Spanish
The above table does not comply with the fourth normal form, because skill and language are independent of each other, yet every combination must be stored. If the table contains rows like these:
Jones | X | A
Jones | Y | B
then the rows
Jones | X | B
Jones | Y | A
must also be in the table, which means the same facts about Jones's skills and languages are recorded repeatedly.
To transform this into the fourth normal form (4NF) we must separate the original table into two tables, like this:
Employee | Skill
Jones    | electrical
Jones    | mechanical
Smith    | plumbing
and
Employee | Language
Jones    | French
Jones    | German
Smith    | Spanish
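A sketch of this 4NF result in SQLite (not part of the original example; table names are mine). Recording a new language for Jones is now a single row, independent of how many skills he has:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee_skill (
    employee TEXT,
    skill    TEXT,
    PRIMARY KEY (employee, skill)
);
CREATE TABLE employee_language (
    employee TEXT,
    language TEXT,
    PRIMARY KEY (employee, language)
);
""")
con.executemany("INSERT INTO employee_skill VALUES (?, ?)",
                [("Jones", "electrical"), ("Jones", "mechanical"), ("Smith", "plumbing")])
con.executemany("INSERT INTO employee_language VALUES (?, ?)",
                [("Jones", "French"), ("Jones", "German"), ("Smith", "Spanish")])
# Adding a language for Jones is one row, not one row per skill:
con.execute("INSERT INTO employee_language VALUES ('Jones', 'Spanish')")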
Source: DBNormalization.com