L10: Data Normalization - Logical Database Design
In this learning unit, you will learn about the process of data normalization, and how it can be used to transform
existing data into an efficient logical model.
Learning Objectives

- Define data normalization
- Explain why data normalization is important
- Explain how normalization helps reduce redundancy and anomalies
- Solve data anomalies by transforming data from one normal form to the next, up to the third normal form
- Apply normalization with data modeling to produce good database design
This week we’ll explore more of the activities performed in logical design, Data Normalization. If you look at
our methodology you can see that we are still in the Design phase.
Part 1: Data Normalization

What is Data Normalization?
Data Normalization is a process which, when applied correctly to a data model, will produce an efficient logical
model. By efficient logical model, we mean one with minimal data redundancy and high efficiency in terms of table
design. Webster defines normalization as “to make conform or to reduce to a form or standard.” This definition
holds true for data normalization, too. Data Normalization is a means of making your data conform to a standard. It
just so happens that standard is optimized for efficient storage in relational tables!
Highly efficient data models are minimally redundant, and vice-versa. The bottom line here is the more
“efficiency” we add into the data model, the more tables we produce, the less chance data is repeated in rows of
the table, and the greater the chance we don’t have unreliable or inconsistent data in those tables. (Okay Mike,
take a breath!)
Let’s look at an example. This data model is inefficient because it contains redundant data: the repeated values in the City and State columns. Redundancy leaves the door open to data inconsistency, as is the case with the attributes in bold italic. It is obvious we meant “New York” and “IL”, and not “News York” and “iL”, but because our data model does not protect us from making this mistake, it could (and is bound to) happen.
An improved, 'normalized', data model would look like the following:
And because of the PK and FK constraints on the logical model, it is now impossible to have two different cities or states for the same zip code; hence we’ve reduced the chance for someone to introduce bad data!
This is the modus operandi of data normalization. Make more tables and FK’s to eliminate the possibility of data
redundancy and inconsistency! 
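As a rough sketch of what that normalized design could look like in T-SQL (the table and column names here are assumptions, since the figures are not reproduced in this text):

-- Hypothetical T-SQL sketch of the normalized design; names are assumed.
CREATE TABLE ZipCode (
    ZipCode CHAR(5)     NOT NULL PRIMARY KEY,  -- each zip code appears exactly once
    City    VARCHAR(50) NOT NULL,
    State   CHAR(2)     NOT NULL
);

CREATE TABLE Customer (
    CustomerId   INT         NOT NULL PRIMARY KEY,
    CustomerName VARCHAR(50) NOT NULL,
    ZipCode      CHAR(5)     NOT NULL REFERENCES ZipCode (ZipCode)  -- FK: city/state live only in ZipCode
);

Because City and State now live only in the ZipCode table, a typo like “News York” can never creep into individual customer rows.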
Where does Data Normalization Fit?
Where does data normalization fit within the logical model? Normalization serves several purposes. It is used to:


- Improve upon existing data models and data implementations. Take a poor database design and make it better. I normalize frequently just to correct problems with existing designs.
- Check to make sure your Logical model is minimally redundant. You can run the normalization tests over your data model to ensure your logical model is efficient.
- “Reverse engineer” an external data model, such as a view or report, into its underlying table structure. I use normalization to estimate the internal data model of an application based on the external data model (the screens and reports of the application).
Another Example, “Dave D” Style
Here’s another example of normalization, taken from Professor Dave Dischiave. If you look at the following set of
data you’ll see where this is going. Let’s say you received the following employee data:
By inspecting the data in the above relation we notice that the same data values highlighted in yellow live in the
data set more than once. The Dept Name Marketing occurs twice as does the title Manager. When data values live
in the database more than once, we refer to this condition as data redundancy. If this were a set of 300,000 employees you can imagine how ugly this problem could get. Any new employee added to the set or changes made
to an existing employee now have the opportunity to be added or changed inconsistently with existing data values
already in the database.
When these types of errors are introduced to the data they are referred to as insertion or modification anomalies.
Anomaly is another way of saying a deviation from an established rule or in Dave D terms an error. Let’s insert a
new row of data into the table above to prove our point:
Here we inserted a new employee: 248, Carrie J, 10, marking, mg. We assigned her to the Marketing department, or did we? When we inspect the data, we find that the data is similar but unfortunately not the same. Is “marking” the same thing as Marketing? Is “mg” the same as Manager? When using data with anomalies to make decisions about employees that work in the Marketing department or have the title of Manager, your results may be inaccurate. Here you can visually inspect the data to determine what it means, but it doesn’t inspire confidence about what other data values may be incorrect as well.
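To make the cost of that anomaly concrete, here is a hedged example (the table and column names are assumptions) of a routine query that quietly returns the wrong answer:

-- Assumed table and column names, for illustration only.
-- Count the employees in the Marketing department:
SELECT COUNT(*) AS MarketingHeadcount
FROM   Employee
WHERE  DeptName = 'Marketing';
-- Carrie J's row stores 'marking', so she is silently excluded and the
-- headcount comes back one short -- a direct consequence of the anomaly.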
If you reflect back to week one, we stated that in order for data to be useful for making decisions data had to be
ARTC. Here the A stands for accurate. If we run the risk of insertion or modification anomalies (i.e. errors) with the
data you see how quickly the data can become inaccurate. I think you can now see why normalizing the data is
important. To normalize the data, that is to reduce anomalies, we need to first reduce redundancy. So how is that
done? To reduce redundancy we’ll apply the process of normalization, working through a set of steps called normal forms, which is also the name given to the end state after each step has been applied.
How does Data Normalization Work?

The Normal Forms
The process of normalization involves checking for data dependencies among the data in the columns of your data
model. Depending on the type of dependency, the model will be in a certain normal form, for example, 1st, 2nd or
3rd normal form. Moving your data model from its current normal form to a higher normal form involves applying a normalization rule. The end product of a normalization rule is an increased number of tables (and
foreign keys) from that in which you started. The higher the normal form, the more tables in your data model, the
greater the efficiency, and the lower the chances for data redundancy and errors.
The specific normal forms and rules are examined in the section below:

- 1st normal form - any multi-valued attributes have been removed so that there is a single value at the intersection of each row and column. This means eliminate repeating attributes or groups of repeating attributes. These are very common in many-to-many relationships.
- 2nd normal form - remove partial dependencies. Once in 1st normal form, eliminate attributes that are dependent on only part of the composite PK.
- 3rd normal form - remove transitive dependencies. From 2nd normal form, remove attributes that are dependent on non-PK attributes.
- Boyce-Codd normal form - remove remaining anomalies that result from functional dependencies.
- 4th normal form - remove multi-valued dependencies.
- 5th normal form - remove remaining anomalies (essentially a catchall).
As we review the above steps we have to wonder: What do they actually mean? Are all six really necessary? Isn’t
there an easier way to remove anomalies? Must they be processed in order? Before we determine the answer to
these questions let’s look at the normal forms pictorially and see how far we need to go with normalization.
Part 2: Functional Dependence

Functional Dependence
At its root, Normalization is all about functional dependence, or the relationship between two sets of data (typically columns in the tables you plan to normalize). Functional dependence says that for each distinct value in one column, say “column A”, there is one and only one value in another column, say “column B.” Furthermore, we say “the data in column B is functionally dependent on the data in column A” or just “B is functionally dependent on A.” You can also say “A determines B”, since column A is known as the determinant.
Where does all this formal mumbo-jumbo come from? Well, in the world of mathematics, a function, such as f(x) =
2x+3 takes all values in its domain (in this case x is all real numbers) and maps them to one and only one value in
the range (in this case 2x +3) so that for any given value of x, there should be one and only one value for f(x). This,
in my own words, is the formal definition of a function.
Functional dependence works the same way, but instead uses columns in a table as the domain and range. The domain (column A) is the determinant, and the range (column B) is the dependent column.
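For the mathematically inclined, the textbook way to write that definition (not stated this way in the original, but equivalent) is: for any two rows $t_1$ and $t_2$ of the table,

$$A \rightarrow B \iff \bigl(t_1[A] = t_2[A] \Rightarrow t_1[B] = t_2[B]\bigr)$$

that is, whenever two rows agree on A, they must also agree on B.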
Take the following example: When we say the data in the customer name column is functionally dependent on the customer id column, we’re saying that for each distinct customer id there is one and only one customer name. Don’t misinterpret what this says - it’s a-okay to have more than one customer id determine the same customer name (as is the case with customers 101 and 103, since they’re both “Tom”), but it’s not all right to have one customer id, such as 101, determine more than one name (Tom, Turk), or 104 (Ted, Teddy), as is the case with the items in red in the Customer2 table. So in table Customer1, Customer id determines Customer Name, but in table Customer2 this is not the case!
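If the Customer1 and Customer2 data were loaded into tables, you could check this dependency with a query along the following lines (the column names are assumptions):

-- Assumed column names. Any row returned is a violation of the
-- functional dependency "customer_id determines customer_name".
SELECT   customer_id,
         COUNT(DISTINCT customer_name) AS name_count
FROM     Customer2
GROUP BY customer_id
HAVING   COUNT(DISTINCT customer_name) > 1;

Run against Customer1 this returns no rows; run against Customer2 it flags customers 101 (Tom, Turk) and 104 (Ted, Teddy).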
Prime vs. Non-Prime Attributes
The very first step in the process of normalization is to establish a primary key from the existing columns in each table. When we do this they’re not called “primary keys” per se, but instead called candidate keys. This is a critical step, because if you create a surrogate key (think: int identity in T-SQL) for the table, then the question of functional dependence becomes trivial, since every column in the table is functionally dependent on the surrogate key!
Bottom line here is if you need to use surrogate keys, be sure to add them after you’ve normalized, not before!
You will always normalize based on the existing data, and therefore must resist the urge to use surrogate keys!
Okay. Now we move to the definitions. Once you’ve established a candidate key, you can categorize the columns in each table accordingly:

- Prime Attributes (a.k.a. Key attributes) - those columns which are the candidate key or part of the candidate key (in the case of a composite key).
- Non-Prime Attributes (a.k.a. Non-Key attributes) - those columns which are not part of the candidate key.

Example, from the figure below: (Foo1 + Foo3 = Key / Prime; Foo2, Foo4 = non-Key / non-Prime)
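Expressed as a hedged T-SQL sketch (the Foo table is purely illustrative), that classification looks like this:

-- Illustrative only: Foo1 + Foo3 form the composite candidate key.
CREATE TABLE Foo (
    Foo1 INT         NOT NULL,  -- prime (part of the candidate key)
    Foo2 VARCHAR(50) NULL,      -- non-prime
    Foo3 INT         NOT NULL,  -- prime (part of the candidate key)
    Foo4 VARCHAR(50) NULL,      -- non-prime
    CONSTRAINT PK_Foo PRIMARY KEY (Foo1, Foo3)
);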
Part 3: Extensive Evaluation of the Normal Forms

Overview
Normal forms are “states of being,” like sitting or standing. A data model is either “in” a given normal form or it isn’t. The normal form of a logical data model is determined by:

- Its current normal form
- The degree of functional dependence among the attributes in each table.
Zero Normal Form 0NF (Un-normalized Data)
Definitions are great and all, but sometimes they aren’t much help. Let’s look at a couple of examples to help
clarify the normal forms. As we inspect an un-normalized Employee table below what do we notice about the
attributes and the data values? Two things immediately jump out at us. There is an obvious repeating group and
there is significant redundancy. So the risk of data anomalies is high.
The Bottom Line (0NF): Data’s so bad, you can’t establish a primary key to create entity integrity from the existing
data! Ouch!
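A hedged sketch of what such an un-normalized table might look like (the column names are assumptions based on the 1NF discussion that follows):

-- Assumed 0NF structure: one row per department with a repeating group of
-- employee columns. No reliable primary key can be drawn from this data,
-- and each department is capped at three employees.
CREATE TABLE Employee0NF (
    deptNo    INT,
    deptName  VARCHAR(50),
    emp1No    INT,  emp1Name VARCHAR(50),  emp1Title VARCHAR(30),
    emp2No    INT,  emp2Name VARCHAR(50),  emp2Title VARCHAR(30),
    emp3No    INT,  emp3Name VARCHAR(50),  emp3Title VARCHAR(30)
);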
First Normal Form 1NF
Let’s take the table above and put it into first normal form by removing the repeating groups of attributes. The
result looks like the table below. You might say that this is not much of an improvement. In fact there is even more
redundancy. Well yes, more redundancy but now we can have more than three employees working in any
department without changing the structure of the database. You’ll also notice that I had to create a composite PK in order to maintain entity integrity. When you have entity integrity, the remaining attributes are functionally dependent on the primary key.
The Bottom Line (1NF): At this point you can at least establish a candidate key among the existing data.
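A hedged sketch of the 1NF table (column names are assumed to match the example data):

-- Assumed 1NF structure: repeating groups removed, one row per employee,
-- with the composite PK (deptNo, empNo) providing entity integrity.
CREATE TABLE Employee1NF (
    deptNo    INT         NOT NULL,
    deptName  VARCHAR(50) NOT NULL,  -- still repeated for every employee in the department
    empNo     INT         NOT NULL,
    empName   VARCHAR(50) NOT NULL,
    titleName VARCHAR(30) NOT NULL,  -- still repeated for every employee sharing a title
    CONSTRAINT PK_Employee1NF PRIMARY KEY (deptNo, empNo)
);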
Second Normal Form 2NF
Next, let’s take the table above from 1st normal form and remove the partial dependencies that exist. Let’s
remove the attributes that are only dependent on part of the primary key. Here you can see that there are some logical associations: deptName is dependent on deptNo, and empName is dependent on empNo; therefore we can remove each set of these attributes and place them in their own entities. Because deptName only needs
the deptNo for determination, we say there’s a partial (functional) dependency. Eliminating this partial
dependency by creating an additional table improves the data design since we've reduced the amount of
redundant data.
The Bottom Line (2NF): At this point, you have a candidate key and no partial dependencies in the existing data.
With each new table created, the data redundancy is reduced.
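A hedged sketch of the 2NF result (the keys and the deptNo foreign key are assumptions drawn from the description above):

-- Assumed 2NF structure: partial dependencies split into their own tables.
CREATE TABLE Department (
    deptNo   INT         NOT NULL PRIMARY KEY,
    deptName VARCHAR(50) NOT NULL  -- now stored once per department
);

CREATE TABLE Employee2NF (
    empNo     INT         NOT NULL PRIMARY KEY,
    empName   VARCHAR(50) NOT NULL,
    titleName VARCHAR(30) NOT NULL,  -- transitive dependency, dealt with in 3NF
    deptNo    INT         NOT NULL REFERENCES Department (deptNo)
);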
Third Normal Form 3NF
Let’s take the tables above from 2nd normal form and remove the transitive dependencies; that is, let’s remove the
attributes that are only dependent on non-primary key attributes or are not dependent on any attribute remaining
in the table. Whatever is left is considered normalized to the 3rd normal form. Here we only have the titleName
attribute remaining and it is not dependent on either of the PK attributes: deptNo or empNo; so we remove
titleName from the table above and place it in its own entity. But we still need a way to reference it. We need to
invent a PK with unique values and then associate the titleNames with the employees in the Employee entity via a
FK. The titleName attribute is a transitive (functional) dependency. After 3rd normal form our three new tables look like
this:
You can see we've improved the design yet again by eliminating yet another layer of redundancy. You probably
know the Title table as a lookup table. Lookup tables are a common practice for 3NF.
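A hedged sketch of the three 3NF tables (the invented titleNo key follows the text’s suggestion to invent a PK for the lookup table; exact names are assumptions):

-- Assumed 3NF structure: titleName moved to its own lookup table;
-- Department is unchanged from the 2NF sketch.
CREATE TABLE Title (
    titleNo   INT         NOT NULL PRIMARY KEY,  -- invented PK for the lookup table
    titleName VARCHAR(30) NOT NULL               -- 'Manager' now stored exactly once
);

CREATE TABLE Department (
    deptNo   INT         NOT NULL PRIMARY KEY,
    deptName VARCHAR(50) NOT NULL
);

CREATE TABLE Employee (
    empNo   INT         NOT NULL PRIMARY KEY,
    empName VARCHAR(50) NOT NULL,
    deptNo  INT         NOT NULL REFERENCES Department (deptNo),
    titleNo INT         NOT NULL REFERENCES Title (titleNo)  -- FK to the Title lookup table
);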
The Bottom Line (3NF): At this point, you have (1) a candidate key and (2) no partial dependencies and (3) no
transitive dependencies in each table of existing data. Furthermore, this logical design is better than the original
design since we have reduced the amount of redundant data, and minimized the possibility of inserting or
updating bad data.
NOTE: See the Slides from this learning unit for more examples on the normal forms.