Download 2. First Normal Form

Document related concepts

Entity–attribute–value model wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Object-relational impedance mismatch wikipedia , lookup

Database model wikipedia , lookup

Relational model wikipedia , lookup

Transcript
SQL Unit 17
Normalization
prepared by Kirk Scott
1
•
•
•
•
•
•
•
•
•
1.
2.
3.
4.
5.
6.
7.
8.
9.
Normal Forms
First Normal Form
Second Normal Form
Third Normal Form
Boyce-Codd Normal Form
Fourth Normal Form
Fifth Normal Form
Domain Key Normal Form
Nulls and Integrity
2
1. Normal Forms
• The benefits of relational database theory can
be summarized as follows:
• There is a step-by-step way of arriving at a
correct design
• There is a way of detecting flaws in a design
• The design process has to do with the
problem domain, not with computer-related
questions
3
• The database designer and user are protected
from questions related to the implementation
of the dbms and the hardware it’s running on
• Finally, if the design is correct, it will be
possible to:
• Store all desired information in it;
• Update the information on an ongoing basis;
• Retrieve any/all of the information as needed.
4
• Correct designs are based on what are called
normal forms.
• This section presents the background
information to the design process.
• It also discusses and illustrates the use of
normal forms.
5
Identifying Entities
• At its most basic level, design of a database
depends on determining what you want to
store information about.
• When deciding what the base tables will be,
you are trying to identify entities.
• From a language point of view, this involves
identifying nouns which do not modify other
things.
6
Identifying Attributes
• Identifying the entities leads to identifying
their attributes.
• Attribute names usually end up being nouns
too, but you figure out what they are when
you try to describe entities, and the
descriptions usually involve adjectives.
• One of the key points of database design is
that you only store information about the
entities and attributes you need to.
7
• There may be many possible entities
• All entities may have a long list of potential
attributes
• But you limit yourself to only those things you
will need to retrieve information about in the
future.
8
Identifying Keys
• You are familiar with primary keys and foreign
keys.
• When trying to organize the attributes around
entities in the design, the idea is to equate an
entity with a primary key field
• Then group the attributes with the entities that
they describe.
• Relationships between tables are captured by
embedding the primary keys of one or more
tables as foreign keys in other tables.
9
Functions, Determination, and
Dependency
• When described in general, the foregoing
sounds sensible enough.
• That’s why the book claims that if you can
model successfully, the result will be a correct
design
• In practice it can be difficult to do without
formal guidelines.
• This is what the normal forms provide.
10
• The normal forms are based on and described
in terms of an idea taken from math.
• One field in a table may functionally
determine another.
• Stated in reverse order: The other field
depends functionally on the one.
11
•
•
•
•
•
•
This is an example of a mathematical function:
y = f(x), for example, y = x2
y is a function of x.
x is in the domain and y is in the range.
x functionally determines y
Or, y functionally depends on x.
12
• For a mathematical function, you find the
dependent value by doing some sort of
computation on the determining value.
• The key point underlying a function is the
following:
• For each value of x, there can only be one
corresponding value of y.
• x uniquely determines y.
13
• The analogy in database design is the following:
• The primary key of a table should functionally
determine the values of the other fields in the
table.
• In other words, the non-key fields should
functionally depend on the primary key field.
• Just as the primary key uniquely identifies a
record, it uniquely determines the values of the
fields in the record
14
• Take this small table for example:
• This is its schema:
• Person(SSN, name, dob)
SSN
123-45-6789
…
Name
Bob
dob
1/1/01
15
• You don’t find a person’s name or birthdate by
doing a computation on their social security
number.
• However, given any one social security number,
there is exactly one corresponding name and
exactly one corresponding date of birth.
• It is true that different people with different
social security numbers may have the same name
and the same date of birth, but this is not a
problem.
16
• The point of the primary key field is that it is
the unique identifier that makes it possible to
distinguish between these two people.
• This idea came up at the beginning of the
course
• The point now is that the name and date of
birth fields functionally depend on the social
security number field.
17
• A new notation can be used to indicate this.
• In this notation, the arrows go from the field
that functionally determines another field, to
the field that is dependent.
• This is illustrated on the next overhead.
18
19
Normal Forms
• Some of the normal forms are identified by
number, for example 1st, 2nd, and 3rd normal
forms.
• Others are identified by name, for example
Boyce-Codd normal form, named after the
people who discovered it.
• These four normal forms are abbreviated 1NF,
2NF, 3NF, and BCNF, respectively.
• There are also higher normal forms, 4th, 5th, and
domain key normal forms (4NF, 5NF, DKNF).
20
• The normal forms have to do with finding
dependencies in tables which spring from fields
other than the primary key.
• These dependencies are undesirable and may be
referred to as stray dependencies.
• The normal forms make increasingly strict
statements about the kinds of stray dependencies
that have to be eliminated from correctly
designed tables.
• Designs containing stray dependencies are said to
violate the normal forms.
21
Eliminating Dependencies
• The design process using normal forms consists of
repetitive steps:
• Make a design
• Identify stray dependencies (normal form violations)
• Redesign to eliminate the dependencies
• Once you’ve eliminated all occurrences of one type
of violation, you will have promoted the design into
the next higher normal form
• Repeat until you’ve reached the highest normal form
22
• The rule of thumb at every stage is to remove
stray dependencies in the following way:
• Make any field which determines other fields
the primary key of a new table, and move the
fields that depend on that field to the new
table.
• Make sure that the new table is connected to
the old table by a primary key, foreign key pair.
23
Anomalies
• Design problems that are based on violations of
normal forms lead to what are called anomalies.
• The hallmark of a problematic design is that the
same information is stored multiple times.
• In other words, there is redundancy in the
database.
• Depending on the nature of the redundancy, this
can lead to problems when inserting data, when
updating data, and when deleting data.
24
Justifying Normal Forms
• The use of normal forms may seem unnecessarily
theoretical at first.
• However, they provide a convenient way of
identifying problems in designs and then
eliminating them.
• Normal forms are what justify these claims about
relation databases:
– There is a step-by-step way of arriving at a correct
design
– There is a way of detecting flaws in a design
25
The Plan of Action for the Following
Sections
• Each of the following sections will present a
normal form in this way:
• A definition of the normal form will be given.
• A scenario for information to be held in a
database will be given, with the underlying
assumptions given.
• An example database design which violates
the normal form will be given
26
• The violation will be shown using a diagram with
the notation indicating functional dependencies.
• The desired functional dependencies from the
primary key will be shown using arrows below
the field names.
• Undesired, stray dependencies, which need to be
eliminated in order to correct the design, will be
shown using arrows above the field names.
27
• Anomalies resulting from the incorrect design
will be discussed.
• In general, there will be insert, update, and
delete anomalies
• Finally, a corrected design will be given.
28
Basis for Examples
• All of the examples will be based on the general
topic of cars, salespeople, customers, and car
sales.
• Some of the field names are abbreviated, and
some of the fields clearly belong together in
some way.
• Here is a little preliminary explanation regarding
the fields that will be in the examples.
• Not all of the fields will appear in all of the
examples.
29
• vin: vehicle identification number. Vehicles
have makes, models, and years.
• spno, spname: Salesperson number and
name.
• custno, custname: Customer number and
name.
• A car sale has a salesprice and a date.
30
2. First Normal Form
• 1NF Definition:
• Formally (Watson):
• A relation is in first normal form if and only if
all columns are single-valued.
• Informally:
• Data is stored in flat files; there can be no
repeating groups in a record.
• (This was mentioned in the very first unit.)
31
• The assumptions underlying the design are
that a salesperson can sell many cars, but each
car can only be sold by one salesperson.
• In this design, each car is only sold once, so
the design captures information about the
sales of new cars.
• These assumptions don’t cause the problem.
• It is the implementation of them that causes
the problem.
32
• Here is the design that violates 1NF:
• Carsale(spno, spname, {vin, salesprice})
• The example design uses {} notation to indicate
repeating groups of fields.
• A diagram with arrows illustrating this design is
given on the next overhead
33
34
• In general, normal form violations have insert,
update, and delete anomalies.
• It would be possible to consider such problems
with a 1NF violation
• However, the general form of anomalies is much
clearer if they’re introduced with 2NF and higher
normal form violations
• The repeating group alone is a sufficient problem
to make this kind of design incorrect and
anomalies will not be discussed in this first case
35
• The solution to all basic normal form
violations is the same:
• Break out the stray dependency out into a
separate table
• In this case, break out the information
contained in the repeating group
36
• As stated in the assumptions, one salesperson
can sell many cars, but each car is sold only
once, so there is a 1-m relationship between
the two tables in the resulting design.
• The primary key of the table containing
salesperson information will have to be
embedded as a foreign key in the table
containing car information.
37
• Here is the corrected design:
• Salesperson(spno, spname)
• Carsale(vin, salesprice, spno f.k.)
38
3. Second Normal Form
• 2NF Definition:
• Formally (Watson):
• A relation is in second normal form if an only if it
is in first normal form, and all nonkey columns
are dependent on the key
• Informally:
• In a table with a concatenated primary key field,
there can be no stray dependencies that originate
in just part of the primary key field.
39
• The basic idea is that all nonkey fields have to
depend on the whole key.
• Stating it in this way will lead to a useful
mnemonic device which will be given later.
• When you lay it out in this way, you begin to
realize that 2NF deals with tables that have
concatenated key fields, where a dependency
from only one field of the key might be
possible.
40
• In this example the underlying assumptions are
that the same car can come back to the lot and
be sold more than once.
• It can be sold by the same salesperson more than
once, but not on the same day.
• It can also be sold by different salespeople at
different times.
• Although unlikely, the design is made so that two
different salespeople could sell the same car on
the same date.
41
• The design doesn’t contain any information
about customers, but the scenario would be
that one customer brought the car back, and a
different salesperson sold it again.
• It seems unlikely that the same customer
would buy the same car twice, whether on the
same date or different dates.
42
• In summary, this design works for used car
sales and both the date and the salesperson
information are needed, along with the car
information, to distinguish between different
sales.
43
• Here is the design that violates 2NF:
•
• Carsale(vin, spno, date, spname)
• A diagram with arrows illustrating this is given
on the next overhead
44
45
• This faulty design has insert, update, and delete
anomalies.
• Suppose a salesperson has not yet sold a car.
• In this case, it is not possible to insert information
about that salesperson.
• On the other hand, a salesperson may make many
sales.
• This means that the same information about that
salesperson would be stored in more than one record
in the table.
• This is redundancy.
46
• Not only is the redundancy itself wasteful, it
leads to the update anomaly.
• Suppose the salesperson’s name changes.
• Then it’s necessary to update multiple records
to reflect this fact, not just one.
47
• The delete anomaly is related to the insert
anomaly.
• Suppose that as part of the maintenance of
the database, on a yearly basis the sales table
is cleared.
• When you delete the last record containing a
sale by a particular salesperson, you not only
get rid of the sales record, you also lose the
salesperson’s name.
48
• As usual, the solution to the problem is to break
the stray dependency out into a table of its own.
• Each car sale has only one salesperson, but each
salesperson can be involved in many sales, so this
is a 1-m many relationship.
• The salesperson information is stored in a table
by itself, and the primary key of the salesperson
table is embedded as a foreign key in the car sale
table.
49
• Here is the corrected design:
• Salesperson(spno, spname)
• Carsale(vin, date, spno f.k.)
50
4. Third Normal Form
• 3NF Definition:
• Formally (Watson):
• A relation is in third normal form if an only if it
is in second normal form and has no transitive
dependencies
• Informally:
• There can be no stray dependencies from one
non-key field to another.
51
• In this example, for the sake of simplicity, it is
assumed that new cars are being sold and
they can only be sold once.
• Information about the customer is also
recorded with the sale.
• Each car can only be bought by one customer.
• It would be possible for a customer to buy
more than one car.
52
• Here is the design that violates 3NF:
• Carsale(vin, custno, custname, salesprice,
date)
• A diagram with arrows illustrating this is given
on the next overhead
53
54
• This design also has insert, update, and delete
anomalies and the pattern of the anomalies is
the same as in the previous example.
• They all stem from the presence of the stray
dependency in the design.
• If you have a potential customer who has not
yet bought a car, it is impossible to insert
information about that person.
55
• If a customer has bought more than one car, the
customer information is stored redundantly.
• In that case, if the customer’s name changes, it’s
necessary to change multiple records.
• Finally, if the sales table is cleared on a regular
basis, when you delete the last sales record for a
given customer, you not only get rid of the sales
record, you also lose the customer’s name.
56
• There is a situation that can arise in database
designs that appears to be a violation of 3NF,
but isn’t.
• The most common example of this situation is
a table which includes a city, state, and zip
code as part of an address.
• The postal service has divided up the country
into zones which are identified by zip codes.
57
• None of these zones cross city or state
boundaries.
• That means that a zip code, a non key field,
determines the city and state.
• It is not necessary for someone to break this
dependency out of their database design.
• The rule of thumb is that if you are not
responsible for maintaining the dependency,
then you can ignore it.
58
• The postal service has a table somewhere with
zip code as the primary key and all of the
descriptive fields about zip code that exist.
• The post office maintains this.
• A table not maintained by the postal service
can contain addresses with zip codes and
completely ignore the fact that there may in
reality be a dependency.
59
5. Boyce-Codd Normal Form
• BCNF Definition:
• Formally (Watson):
• A relation is in Boyce-Codd normal form if and
only if every determinant is a candidate key
• Informally:
• There can be no stray dependencies from a
non-key field to a field in the key
60
• Notice that the formal definition of BCNF
differs from that of the foregoing definitions
• If does not include the phrase “in 3NF and…”
• BCNF is actually a summation of 1NF through
3NF which covers one other case which is not
covered by the previous normal forms.
• BCNF will be explained by presenting an
example which includes this additional case
61
• For the purposes of this example suppose that
the same car can be sold by the same
salesperson more than once, but only one sale
of that car is possible per date.
• Suppose also that this dealership has a system
for assigning prospective customers to specific
salespeople, so that each salesperson is
associated with an exclusive list of clients.
62
• It would be normal to assume that this system
is implemented in some sort of table.
• Such a table is not shown here—it will
become part of the solution to the problem.
• The point now is to show the problem that
this assumption leads to in the table of
interest in the original, incorrect design.
63
• Here is the design which violates BCNF:
• Carsale(vin, spno, date, custno)
• A diagram with arrows illustrating this is given
on the next overhead
64
65
• It is obvious that there is a stray dependency
in the design.
• Technically we haven’t seen this kind of
dependency before, because it goes into the
primary key.
• If you studied the previous normal form
definitions, you would discover that they
don’t cover this case.
66
• The anomalies in this design are analogous to
the anomalies in the previous designs.
• It is not possible to insert information about
the relationship between a given customer
and salesperson without a sales record which
matches them.
• If the given customer has bought from the
same salesperson many times, their
relationship is in multiple records.
67
• An update would require changes in multiple
records.
• Finally, if you’re down to the last record
containing information about a particular
salesperson-customer pairing, deleting the
record would cause the information to be lost.
• As usual, the solution to the problem is to
break out the stray dependency in a separate
table.
68
• Here is the corrected design:
• Carsale(vin, date, custno f.k.)
• Customer-Salesperson(custno, spno)
• Which p.k. should be embedded as a f.k. is clear
• Under the assumptions of the example, one
customer can buy many cars, but not vice-versa
69
• The original design was intuitive, but wrong
• It recorded information linking salespeople
directly to their sales, but this turned out to
be problematic
• The new design is correct and solves the
problem
• However, it is not necessarily 100% intuitive
70
• Customers are uniquely associated with a single
salesperson
• The car sale record now tells you who bought the car
• If you want to know who the salesperson was for a
given sale, you have to do a join with the CustomerSalesperson table.
• To a user who doesn’t understand normalization and
the assumptions underlying the database, this will
seem like an unnecessary complication
• The “one table database” mindset is a hard one to
break
71
What is Allowed under BCNF
• There is another aspect of BCNF that needs to
be explained.
• Consider a design which includes both a
university-generated student id number and a
social security number.
• It would seem to violate BCNF as explained
above:
72
• Here is the design which seems to violate
BCNF:
• Student(studentIDno, SSN, name)
• A diagram with arrows illustrating this is given
on the next overhead
73
74
• The additional part of BCNF is that if the stray
dependency results from another field which
also could have been chosen as a primary key
for the table, then it is not a normal form
violation.
• This is what is meant by the formal definition:
• A relation is in Boyce-Codd normal form if and
only if every determinant is a candidate key
75
• In the example, both studentIDno and SSN are
valid, unique identifiers of students.
• You might want to record both.
• It is simply necessary to choose one of them
as the primary key of the field.
• The presence of the other one in the table
does no harm.
76
A Summary of the Normal Forms up
through BCNF
• Up through BCNF the normal forms can be
explained in terms of stray dependencies.
• An easy way to remember the requirements
for these normal forms is the following
statement:
• Every field in a table has to depend on the key,
the whole key, and nothing but the key.
77
• Because they are increasingly strict, the
normal forms can be thought of as nested.
• When checking a design, you begin with the
lowest normal form, make sure there are no
violations, and move on to the following ones.
• This is what makes the design process step-bystep.
78
• This idea can be represented using a Venn
diagram.
• The idea is that the set of designs which is in
some normal form is always a subset of those
designs which meet the conditions for a lower
normal form.
• A diagram of this is shown on the following
overhead
79
1NF
2NF
3NF
BCNF
…
80
6. Fourth Normal Form
• It was claimed earlier that there are only three
kinds of relationships: 1-1, 1-m, and m-n.
• This is not entirely true.
• There may be many-to-many-to-many
relationships (relationships between 3
different types of entities at the same time, mm-m),
81
• In theory there is no reason why there can’t
be relationships among 4 or more different
types of entities at the same time.
• Fourth and fifth normal form, 4NF and 5NF,
have to do with cases like these.
82
• 4NF and 5NF have been previewed earlier by
saying they are related to the question of a
cyclical design vs. a star design
• This is still true.
• Observe that there are essentially two options
for capturing a m-m-m relationship: either a
cycle or a star.
83
• The higher normal forms are not strictly
related to the question of stray dependencies
• The presentation of them will not follow
exactly the same plan as the lower normal
forms.
• However, the same basic ideas will be
covered: A definition plus examples and
explanations of how to do it right and what
can go wrong
84
Fourth Normal Form
• 4NF Definition:
• Formally (Watson):
• A relation is in fourth normal form if it is in
Boyce-Codd normal form and all multi-valued
dependencies on the relation are functional
dependencies.
• Informally:
• An informal definition will have to wait…
85
• The formal definition illustrates why the
presentation of the higher normal forms will
differ from that of the lower ones.
• It is not possible to understand the formal
definition or give a meaningful informal
definition without further explanation.
86
• There are three things to observe about the
definition:
• 1. It goes back to the model of, “in the
previous normal form and…”
• The definition of 4NF depends on the
definition of BCNF plus another condition
87
• 2. On the other hand, it is worded “if” rather
than “if and only if”
• I don’t know whether the author was
daydreaming or whether this change is
theoretically important.
• In any case, it is of no practical importance
• We can safely consider the two halves of the
proposition to be logically equivalent
88
• 3. The most important thing in the definition
requiring explanation is the introduction of a
new concept: multivalued dependencies.
• The new concept is actually “independent
multivalued dependencies”.
• It’s hard to understand a definition when it’s
based on concepts that haven’t been
explained yet.
89
• The rest of this section will be organized in
this way:
• 1. A correct example of a m-m-m relationship
implemented in a star design will be given.
• 2. Then the problems that underlie the
definition of the normal form will be
examined.
90
Correct Example
• Suppose that a given car can be sold more
than one time.
• In other words, you’re dealing in used cars.
• Suppose also that salespeople can sell more
than one different car, and customers can buy
more than one different car.
• This means that there are three 1-m
relationships: Car to Sale, Salesperson to Sale,
and Customer to Sale.
91
• For the three base tables, Car, Salesperson, and
Customer, there could be one table in the middle,
Carsale, which brought all three together.
• The idea can be represented using ER modeling.
• This results in the star shaped design shown on
the next overhead.
• It is reminiscent of the design of the example
database.
92
Car
Salesperson
Carsale
Customer
93
• The relationships are captured by embedding
primary keys as foreign keys, and a valid
design can be given as follows:
• Car(vin, make, model, year)
• Salesperson(spno, spname)
• Customer(custno, custname)
• Carsale(vin, spno, custno, date)
94
What Can Go Wrong?
• New assumptions are necessary in order to
create an example with 4NF problems.
• The oddness of the assumptions will probably
already suggest that the results will be
problematic.
• 1. Assume that it is desirable to keep a record of
all prospective customers salespeople have talked
to.
• Under this assumption, also assume that a given
customer only deals with one salesperson.
95
• 2. Assume also that it’s desirable to keep a
record of all cars salespeople have sold.
• Once the car is sold, it’s not important what
customer it was sold to.
• All that’s of interest is the salesprice, for
example, so that commissions can be
calculated.
• 3. Finally, assume that all of this data is to be
stored in one table.
96
• This set of assumptions captures the elements
of the definition of 4NF.
• The relationship between salespeople and
customers is a multivalued dependency.
• In other words, it’s a one-to-many
relationship, not a functional relationship.
• If you like notation, it can be indicated with a
double-headed arrow:
• Salesperson—>>Customer
97
• A table capturing this relationship alone would
have two columns, one for salesperson and
one for customer.
• Each row would consist of a pairing.
• In the table overall, a given salesperson could
appear more than once.
98
• The relationship between salespeople and
cars is also a multivalued dependency.
• It’s a one-to-many relationship, not a
functional relationship.
• Salesperson—>>Car
• It too, could be captured in a single table that
matched pairs, with salespeople potentially
appearing more than once.
99
• The two multivalued dependencies are
independent of each other.
• As spelled out in the assumptions, we’re not
keeping a record of a salesperson, a car, and
the customer that it was sold to.
• We’re keeping track of cars sold, and
independently, prospective customers that
have been talked to.
100
• Finally, the kicker assumption is that both of
these independent multivalued dependencies
are to be captured in a single table.
• As someone with background in database
modeling, you doubtless recognize that there
are three entities in the scenario and three
tables will be desirable in a solution.
• Just hang on; we’re not there yet.
101
• An example table is given on the next
overhead.
• This will be followed by explanations.
102
Salesperson
Customer
Car
Bob
Jay
Mustang
Bob
Jay
Camaro
Bob
Kay
Mustang
Bob
Kay
Camaro
Sue
Ann
Corolla
Sue
Ann
Civic
Sue
Jim
Corolla
Sue
Jim
Civic
103
• Salesperson multidetermines the other fields.
• In other words, there is no functional
dependency from salesperson to the other
fields.
• That means that salesperson alone isn’t the
key.
104
• The multivalued dependencies are
independent of each other.
• There is no relationship between customers
and cars.
• Therefore, those fields don’t determine each
other.
• Neither one can serve independently as a key.
105
• The conclusion is that the table is all key.
• Furthermore, there are no dependencies
among the constituent parts of the
concatenated key.
• Therefore, it is impossible for this design to
violate any of the normal forms up through
BCNF, which all depended on stray functional
dependencies.
106
• However, the design is full of anomalies.
• Look at the data in the table.
• Salesperson Bob talked to two customers and
sold two cars.
• The result is four rows in the table, each row
one of the four possible combinations of
customer and car.
107
• Why is this necessary?
• Because the table is all key, none of the
individual fields can be null.
• In order to record a customer, for example,
there has to be a car.
• Similarly, in order to record a car, there has to
be a customer.
108
• Why do every possible pairing?
• It is incidental that pairing is occurring, because
the example is based on two values for each field.
• The reality is that you’re getting a Cartesian
product for the multidetermined fields, the
number in one, m, times the number in the other,
n.
• Why not do what’s shown on the following
overhead instead—just show each car once and
show each customer once?
109
Salesperson
Customer
Car
Bob
Jay
Mustang
Bob
Kay
Camaro
Sue
Ann
Corolla
Sue
Jim
Civic
110
• The problem with this is that if you want to
delete the customer Jay, you lose the car
Mustang—or vice-versa.
• If you delete the customer Kay, you lose the
car Camaro—or vice-versa.
• So doing the Cartesian product (in this
example, pairing) is a consistent way to
protect from data loss.
111
• However, the pairing (Cartesian product) is
what causes the classic anomalies.
• To insert a single customer, you have to insert
as many rows as there are cars for that
salesperson.
• To insert a single car, you have to insert as
many rows as there are customers for that
salesperson.
112
• Now that there are multiple rows for each of
the multidetermined fields, updating a single
instance means updating every row.
• Also, deleting a single instance will require
deleting many rows.
• This is because the Cartesian product creates
just about as much redundancy as possible
113
• In short, this design doesn’t violate any of the
foregoing normal forms.
• However, it’s full of awful anomalies.
• Therefore, 4NF was defined, stating that a
design like this is not correct.
114
• The assumptions imply two one-to-many
relationships.
• Deriving a correct design seems pretty
straightforward.
• Have three tables, one each for Salesperson,
Customer and Car.
• Embed the primary key of Salesperson in
Customer and in Car.
115
• Books will sometimes tell you that you are
unlikely to find a mess like this in practice.
• Anyone who makes a design by working from 1NF
through BCNF, rooting out stray dependencies,
will most likely have broken out all entities into
their own tables.
• Even if that hasn’t happened due to
normalization, if they reach this point, they are
likely to realize intuitively that this is a bad
design.
116
• Anyone familiar with database design
principles is unlikely to put three types of
entities together in a single table in the first
place.
• On the other hand, compare with the idea of a
married couple as a single entity.
• If the relationship among 3 entities is always
1-1-1, then putting them into a single table is
at least theoretically possible
117
• On the other hand, people who are unfamiliar
with the rules have no clue about functional
dependencies or multi-valued dependencies
• They sometimes think that they should try
and cram as much information into a single
table as possible.
• If that happens, then a violation of 4NF is
possible.
118
• This is the classic worst-case scenario of the
one-table mindset.
• If you are ever asked to convert a “database”
maintained as an Excel spreadsheet into
relational form, expect to see nightmares like
this.
119
7. Fifth Normal Form
• 5NF Definition:
• Formally (Watson):
• A relation is in fifth normal form if and only if
every join dependency of the relation is a
consequence of candidate keys of the relation
• Informally:
• See the following commentary and
explanation
120
• It is apparent that 5NF is like BCNF in this way:
• It doesn’t depend on the earlier normal forms.
• It is a summative definition of what
characteristic a correct design should have.
• Unfortunately, because its definition is so high
level and theoretical, what it means is not
clear.
• Like the definition of 4NF, it involves a concept
that hasn’t been explained yet.
121
• What is a join dependency?
• Very informally, you can see that the definition
of normal forms now explicitly refers to the
fact that multiple tables are related to each
other and data is reassembled with joins.
• Rather than trying to define this concept in
detail, an example will be pursued.
• After looking at what the example reveals, an
informal description of 5NF will be given.
122
Example
• Consider the Salesperson, Customer, and Car
entities again
• In the 4NF example, the relationship between
Salesperson and Customer and the relationship
between Salesperson and Car were independent
• But suppose that we’re again interested in
capturing some sort of relationship which directly
ties together a salesperson, a customer, and a car
123
• In the sample database used for SQL practice,
this kind of relationship was captured by a star
design.
• The Carsale table was in the middle of the star
• Each of Car, Customer, and Salesperson were
in a one-to-many relationship with it.
124
• You may have realized that there is another
way to relate together all three of the base
tables.
• What if each pairing of Salesperson, Customer,
and Car were related in an m-m relationship?
• This results in a design with a cycle in it.
• This is shown in the ER diagram on the
following overhead.
125
Salesperson
-Car
Car
CarCustomer
Salesperson
CustomerSalesperson
Customer
126
• The design could also be represented in this
way:
• Car(vin, make, model, year)
• Car-Customer(vin f.k., custno f.k.)
• Customer(custno, custname)
• Customer-Salesperson(custno f.k., spno f.k.)
• Salesperson(spno, spname)
• Salesperson-Car(spno f.k., vin f.k., date)
127
• In general, designs with cycles in them are
difficult to understand.
• What you want to determine can be stated in
two different ways:
• What assumptions might make a cyclical
design rather than a star design correct?
• What can go wrong with a cyclical design
which would make it incorrect for a given
situation?
128
• If you traced all of the links in the design with
the cycle, you would find that every car is
connected to every salesperson is connected
to every customer.
• Put in business terms, the cyclical design for
car sales embodies this assumption:
• At one time or another every salesperson has
sold every car to every customer.
129
• In other words, all sales that could have
possibly occurred did occur.
• On the other hand, there is no obvious place
where information relating to a specific sale of
a car, such as salesprice, date, etc., should go
in the design.
130
• This is a way of simply recording all possible
relationships among the entities
• Roughly speaking, what this means is that if
we did a join involving all of the tables, we
would effectively get m-m-m Cartesian
product
• From a practical point of view, this is madness.
131
• At this point the car sale example breaks
down.
• It is unrealistic to think that every car would
be sold by every salesperson and bought by
every customer.
• It is difficult to imagine a scenario where these
new assumptions would hold true.
• We are dealing with assumptions that simply
aren’t realistic for this example.
132
Cycles vs. Star Models
• Although not realistic, the scenario is still
informative.
• The cyclical design is contrary to the star
design in this way:
• In the star design the table in the middle
captures information for that subset of
possible sales that actually occurred.
133
• Finally, this returns to the question of 5NF and
what a correct design would be.
• Suppose you did have a situation where every
possible pair of relationships actually does
exist.
• Which design is better, the one with the star
or the one with the cycle?
134
• Under this scenario, with a star design, the
table in the middle becomes the 3-way
Cartesian product of the primary keys of the
three base tables.
• This harks back to the 4NF example.
• The problem there was a table consisting of
Cartesian products of independent
multivalued dependencies
135
• We now have the converse problem—
everything really is related to everything else.
• If we know that each pairwise relationship is
valid, why are we creating a table in the
middle that forms an unneeded Cartesian
product?
• In a theoretical sense, at least, the design with
the cycle is better.
136
• Are there anomalies in the star design under
these assumptions?
• The anomalies aren’t unmanageable roadblocks,
but there is a practical reason why the star design
is undesirable:
• Suppose that m Salespeople are related to n
Customers and p Cars.
• Suppose you wanted to add one salesperson to
the design with a star under these assumptions.
137
• That would involve adding n x p triplets to the
table in the middle.
• In the cyclical design you would “only” have to
manage n additions to the SalespersonCustomer table and p additions to the
Salesperson-Car table.
• The cost of bringing them together would be
in a join query executed later
138
• Consider the problem overall
• The complete Cartesian product in the table in
the middle of a star design would contain m x
n x p triples
• With a cyclical design you would have three
pairwise tables in the middle with m x n, n x p,
and p x m entries each
• Once m, n, and p are above 3, there are fewer
entries to manage with the cyclical design
139
A More Realistic Example
• The book comes up with a reasonably realistic
example of where the cyclical design might
apply
• Suppose firms have contracts with consultants
• Consultants provide advice based on their skill
sets
• And the firms require advice on skills in
consultants’ skill sets
140
• Add this assumption to the suppositions above:
• If a company has a contract with a consultant,
and the consultant has a particular skill, and the
company requires advice that depends on that
skill, then a consultant under contract, by
definition, can or will or does provide advice
involving the skill.
• In shorthand, the relationships are “universal” (or
promiscuous…)
141
• This is a more general model than saying that a
consultant is only contracted to provide advice on
a given skill set.
• It is a realistic model from the firms’ point of
view.
• They may want to enforce contracts like this with
their consultants.
• Whether it is completely realistic or not, if all of
the assumptions hold true, then the cyclical
model is the technically correct one.
142
• ER diagrams of this example are given on the
following overheads.
• The first diagram is a star design where the
new assumption doesn’t apply
• The second diagram is the cyclical design
where the new assumption does apply
• In the captions of the diagrams, the “rule”
referred to is the assumption that the
relationships among entities are “universal”
143
144
145
Cycles vs. Linear Models
• There is another aspect of this question that
can be considered.
• Would it make a difference if you broke the
cycle by removing one of the tables in the
middle?
• You could still form a join query that included
all of the base tables, but you would definitely
lose information from the model.
146
• Suppose you removed “Firm Requires Skill” for
example.
• Now every company would be indiscriminately
contracting with every consultant for every
skill they had rather than every skill they had
that the company required
• The same kind of argument would apply to
removing any of the tables in the middle
147
• Situations where every entity in every base
table is related to every other entity in every
other base table are rare.
• However, they can occur.
• In those cases the cyclical design is
theoretically better.
148
A Concrete Illustration
• Even though the car example isn’t realistic,
because it’s familiar, it’s possible to develop a
concrete example that illustrates that the
cyclical design does ultimately capture the
Cartesian product of the base tables
• The example also illustrates how information
is lost if you forget to include one of the
tables, making the design a linear one instead
of a cyclical one
149
• These tables are shown on the following
overheads:
• Car
• Customer
• Salesperson
• Car-Customer
• Car-Salesperson
• Customer-Salesperson
• The tables in the middle contain the full m x n
Cartesian product of the base tables they connect
150
Car
carID
carName
1
jkl
2
uio
3
fds
151
Customer
customerID
customerName
1
Chris
2
Chet
3
Chad
152
Salesperson
salespersonID
salespersonName
1
Sam
2
Sid
3
Sue
153
Car-Customer
carID
customerID
1
1
1
2
1
3
2
1
2
2
2
3
3
1
3
2
3
3
154
Car-Salesperson
carID
SalespersonID
1
1
1
2
1
3
2
1
2
2
2
3
3
1
3
2
3
3
155
Customer-Salesperson
customerID
salespersonID
1
1
1
2
1
3
2
1
2
2
2
3
3
1
3
2
3
3
156
• The following overhead shows a screenshot of
the cyclical relationships between the tables.
157
158
• A join query that goes all the way around the
cycle is given on the next overhead.
• The results of the query are shown on the
overhead following the next one.
159
• SELECT *
• FROM Car, Car-Customer, Customer, Customer-Salesperson,
Salesperson, Car-Salesperson
• WHERE Car.carID = Car-Customer.carID
• And Car.carID = Car-Salesperson.carID
• And Customer.customerID = Car-Customer.customerID
• And Customer.customerID = CustomerSalesperson.customerID
• And Salesperson.salespersonID = CarSalesperson.salespersonID
• And Salesperson.salespersonID = CustomerSalesperson.salespersonID;
160
CycleQuery
CarCust
omer
.cust
omer
ID
Cust
omer
.cust
omer
ID
custo
mer
Nam
e
Cust
omer
Sales
pers
on.cu
stom
erID
Cust
omer
Sales
pers
on.Sa
lespe
rsonI
D
Sales
pers
on.Sa
lespe
rsonI
D
sales
pers
onNa
me
CarSales
pers
on.ca
rID
CarSales
pers
on.Sa
lespe
rsonI
D
Car.c
arID
carN
ame
CarCust
omer
.carI
D
1
jkl
1
1
1
Chris
1
1
1
Sam
1
1
2
uio
2
1
1
Chris
1
1
1
Sam
2
1
3
fds
3
1
1
Chris
1
1
1
Sam
3
1
1
jkl
1
2
2
Chet
2
1
1
Sam
1
1
2
uio
2
2
2
Chet
2
1
1
Sam
2
1
3
fds
3
2
2
Chet
2
1
1
Sam
3
1
1
jkl
1
3
3
Chad
3
1
1
Sam
1
1
2
uio
2
3
3
Chad
3
1
1
Sam
2
1
3
fds
3
3
3
Chad
3
1
1
Sam
3
1
1
jkl
1
1
1
Chris
1
2
2
Sid
1
2
2
uio
2
1
1
Chris
1
2
2
Sid
2
2
3
fds
3
1
1
Chris
1
2
2
Sid
3
2
1
jkl
1
2
2
Chet
2
2
2
Sid
1
2
2
uio
2
2
2
Chet
2
2
2
Sid
2
2
3
fds
3
2
2
Chet
2
2
2
Sid
3
2
1
jkl
1
3
3
Chad
3
2
2
Sid
1
2
2
uio
2
3
3
Chad
3
2
2
Sid
2
2
3
fds
3
3
3
Chad
3
2
2
Sid
3
2
1
jkl
1
1
1
Chris
1
3
3
Sue
1
3
2
uio
2
1
1
Chris
1
3
3
Sue
2
3
3
fds
3
1
1
Chris
1
3
3
Sue
3
3
1
jkl
1
2
2
Chet
2
3
3
Sue
1
3
2
uio
2
2
2
Chet
2
3
3
Sue
2
3
3
fds
3
2
2
Chet
2
3
3
Sue
3
3
1
jkl
1
3
3
Chad
3
3
3
Sue
1
3
2
uio
2
3
3
Chad
3
3
3
Sue
2
3
3
fds
3
3
3
Chad
3
3
3
Sue
3
3
161
• The main thing to note about the previous
results is that there are 3 x 3 x 3 = 27 rows
• The join query that goes all the way around
the cycle produces the Cartesian product of
the three base tables.
162
• The previous join query with the last joining
condition removed is shown on the next
overhead.
• The results of the query are shown on the
overhead following the next one.
• They are impossible to read because there are
81 rows.
163
• SELECT *
• FROM Car, Car-Customer, Customer, CustomerSalesperson, Salesperson, Car-Salesperson
• WHERE Car.carID = Car-Customer.carID
• And Car.carID = Car-Salesperson.carID
• And Customer.customerID = Car-Customer.customerID
• And Customer.customerID = CustomerSalesperson.customerID
• And Salesperson.salespersonID = CarSalesperson.salespersonID;
164
CycleQueryMinusOneLink
Car.c
arID
carNa
me
CarCusto
mer.c
arID
CarCusto
mer.c
ustom
erID
Custo
mer.c
ustom
erID
custo
merN
ame
Custo
merSales
perso
n.cust
omerI
D
Custo
merSales
perso
n.Sale
spers
onID
Sales
perso
n.Sale
spers
onID
salesp
erson
Name
CarSales
perso
n.carI
D
CarSales
perso
n.Sale
spers
onID
1
jkl
1
1
1
Chris
1
1
1
Sam
1
1
1
jkl
1
1
1
Chris
1
2
1
Sam
1
1
1
jkl
1
1
1
Chris
1
3
1
Sam
1
1
1
jkl
1
1
1
Chris
1
1
2
Sid
1
2
1
jkl
1
1
1
Chris
1
2
2
Sid
1
2
1
jkl
1
1
1
Chris
1
3
2
Sid
1
2
1
jkl
1
1
1
Chris
1
1
3
Sue
1
3
1
jkl
1
1
1
Chris
1
2
3
Sue
1
3
1
jkl
1
1
1
Chris
1
3
3
Sue
1
3
1
jkl
1
2
2
Chet
2
1
1
Sam
1
1
1
jkl
1
2
2
Chet
2
2
1
Sam
1
1
1
jkl
1
2
2
Chet
2
3
1
Sam
1
1
1
jkl
1
2
2
Chet
2
1
2
Sid
1
2
1
jkl
1
2
2
Chet
2
2
2
Sid
1
2
1
jkl
1
2
2
Chet
2
3
2
Sid
1
2
1
jkl
1
2
2
Chet
2
1
3
Sue
1
3
1
jkl
1
2
2
Chet
2
2
3
Sue
1
3
1
jkl
1
2
2
Chet
2
3
3
Sue
1
3
1
jkl
1
3
3
Chad
3
1
1
Sam
1
1
1
jkl
1
3
3
Chad
3
2
1
Sam
1
1
1
jkl
1
3
3
Chad
3
3
1
Sam
1
1
1
jkl
1
3
3
Chad
3
1
2
Sid
1
2
1
jkl
1
3
3
Chad
3
2
2
Sid
1
2
1
jkl
1
3
3
Chad
3
3
2
Sid
1
2
1
jkl
1
3
3
Chad
3
1
3
Sue
1
3
1
jkl
1
3
3
Chad
3
2
3
Sue
1
3
1
jkl
1
3
3
Chad
3
3
3
Sue
1
3
2
uio
2
1
1
Chris
1
1
1
Sam
2
1
2
uio
2
1
1
Chris
1
2
1
Sam
2
1
2
uio
2
1
1
Chris
1
3
1
Sam
2
1
2
uio
2
1
1
Chris
1
1
2
Sid
2
2
2
uio
2
1
1
Chris
1
2
2
Sid
2
2
2
uio
2
1
1
Chris
1
3
2
Sid
2
2
2
uio
2
1
1
Chris
1
1
3
Sue
2
3
2
uio
2
1
1
Chris
1
2
3
Sue
2
3
2
uio
2
1
1
Chris
1
3
3
Sue
2
3
2
uio
2
2
2
Chet
2
1
1
Sam
2
1
2
uio
2
2
2
Chet
2
2
1
Sam
2
1
2
uio
2
2
2
Chet
2
3
1
Sam
2
1
2
uio
2
2
2
Chet
2
1
2
Sid
2
2
2
uio
2
2
2
Chet
2
2
2
Sid
2
2
2
uio
2
2
2
Chet
2
3
2
Sid
2
2
2
uio
2
2
2
Chet
2
1
3
Sue
2
3
2
uio
2
2
2
Chet
2
2
3
Sue
2
3
2
uio
2
2
2
Chet
2
3
3
Sue
2
3
2
uio
2
3
3
Chad
3
1
1
Sam
2
1
2
uio
2
3
3
Chad
3
2
1
Sam
2
1
2
uio
2
3
3
Chad
3
3
1
Sam
2
1
2
uio
2
3
3
Chad
3
1
2
Sid
2
2
2
uio
2
3
3
Chad
3
2
2
Sid
2
2
2
uio
2
3
3
Chad
3
3
2
Sid
2
2
2
uio
2
3
3
Chad
3
1
3
Sue
2
3
2
uio
2
3
3
Chad
3
2
3
Sue
2
3
2
uio
2
3
3
Chad
3
3
3
Sue
2
3
3
fds
3
1
1
Chris
1
1
1
Sam
3
1
3
fds
3
1
1
Chris
1
2
1
Sam
3
1
3
fds
3
1
1
Chris
1
3
1
Sam
3
1
3
fds
3
1
1
Chris
1
1
2
Sid
3
2
3
fds
3
1
1
Chris
1
2
2
Sid
3
2
3
fds
3
1
1
Chris
1
3
2
Sid
3
2
3
fds
3
1
1
Chris
1
1
3
Sue
3
3
3
fds
3
1
1
Chris
1
2
3
Sue
3
3
3
fds
3
1
1
Chris
1
3
3
Sue
3
3
3
fds
3
2
2
Chet
2
1
1
Sam
3
1
3
fds
3
2
2
Chet
2
2
1
Sam
3
1
3
fds
3
2
2
Chet
2
3
1
Sam
3
1
3
fds
3
2
2
Chet
2
1
2
Sid
3
2
3
fds
3
2
2
Chet
2
2
2
Sid
3
2
3
fds
3
2
2
Chet
2
3
2
Sid
3
2
3
fds
3
2
2
Chet
2
1
3
Sue
3
3
3
fds
3
2
2
Chet
2
2
3
Sue
3
3
3
fds
3
2
2
Chet
2
3
3
Sue
3
3
3
fds
3
3
3
Chad
3
1
1
Sam
3
1
3
fds
3
3
3
Chad
3
2
1
Sam
3
1
3
fds
3
3
3
Chad
3
3
1
Sam
3
1
3
fds
3
3
3
Chad
3
1
2
Sid
3
2
3
fds
3
3
3
Chad
3
2
2
Sid
3
2
3
fds
3
3
3
Chad
3
3
2
Sid
3
2
3
fds
3
3
3
Chad
3
1
3
Sue
3
3
3
fds
3
3
3
Chad
3
2
3
Sue
3
3
3
fds
3
3
3
Chad
3
3
3
Sue
3
3
165
• The query missing one joining condition is the
equivalent of a design where one of the tables
in the middle is missing
• In such a design you’ve lost information
• However, the loss of design information
means that when you run the query you get
more result records
• In other words, what’s lost is connecting
information that would restrict the results
166
Finally, an Informal Definition of 5NF
• Recall the formal definition of 5NF:
• A relation is in fifth normal form if and only if
every join dependency of the relation is a
consequence of candidate keys of the relation
• For the purposes of comparison, also recall
the definition of BCNF:
• A relation is in Boyce-Codd normal form if and
only if every determinant is a candidate key
167
• BCNF seems to be directed towards the
definition of the correct design of a table
internally, without reference to other tables
• 5NF seems to be directed towards the
definition of the correct design of a table in
relation or reference to other tables
• The term “join dependency” hasn’t been
defined, but it will be informally defined now
168
• A general description of what 5NF says is the
following:
• A design is correct if two conditions are met:
• All real relationships between entities are
captured by the design;
• no false relationships between entities are
captured by the design.
169
• Empirically, the informal definition can be
translated into two statement about queries that
involve joining (join dependencies between) 2 or
more tables.
• 1. In a correct design in 5NF, you can write a
query that will successfully pull all “real”
information (based on the tables and their
relationships) out of the db.
• It may be informative to consider some related
negative statements.
170
• There is no information “trapped” in the
database that you cannot retrieve with a
query.
• There is no information that you have failed to
include due to a flaw in the database design.
• You can relate this back to the cycle vs. linear
model observations.
• For example, you haven’t left a (pk-fk, join)
relationship out of the design.
171
• 2. The second part of the informal definition
of 5NF is equally important.
• It is not possible to write a query that would
pull out results which did not hold true
• In other words, you haven’t included a “false”
relationship in the design
• You can relate this back to the cycle vs. star
model observations.
172
• If the star was the correct design, you would
only be able to pull out that specific set of
triplets that were correct.
• If you mistakenly implemented a cycle,
ultimately you would be able to write a query
that in a sense pulled out the full Cartesian
product of the base tables.
• If that is not correct, the model should not
make it possible.
173
General Rules for Normalizing
• Even if a cycle is theoretically the correct
design, for practical and understandability
reasons, you might go with star.
• Consider firms, consultants, and skills again
• Suppose the hiring of consultants constitutes
a contract
• Suppose that contracts have attributes of their
own, like dates and pay rates
174
• At that point a contract may become a base
entity and should probably be implemented as a
base table.
• If the relationships are universal, you’ll just have
to deal with the fact that the table in the middle
of the star is a Cartesian product.
• In any case, it will be necessary to figure out the
assumptions regarding dates and pay
• A cyclical solution is probably not possible, and
even if it were, it woulr probably not be prettier.
175
Normalizing Overall
• The general rule with normal forms is similar
to the rule with ER modeling.
• You take the model to the highest normal
form that is desirable.
• If you haven’t taken it to the highest normal
form possible, you document the decision.
176
• The whole sweep of numbered normal forms
can be considered now.
• At the very least you would expect to work a
design up through BCNF.
• 4NF is considered kind of exceptional because
when modeling from scratch you would never
arrive in that situation.
177
• As noted, you might have to deal with 4NF if
you are working with an existing “database”
designed by a one-table fanatic.
• 5NF is the final extreme.
• Whether you choose to implement 5NF, it is
definitely worthwhile to be conscious of the
implications of cycles in a final design.
• If you choose to avoid them, document that
decision.
178
8. Domain Key Normal Form
• The highest normal form, domain-key normal
form (DKNF), is a theoretical statement of which
attributes belong in tables and the relationships
among the attributes.
• This form is not numbered.
• Like BCNF and 5NF, it is a summative normal
form.
• DKNF encompasses all of the other normal forms.
• A design in DKNF is fully normalized.
179
• DKNF Definition:
• Formally (Watson):
• A relation is in domain-key normal form if and
only if every constraint on the relation is a logical
consequence of the domain constraints and the
key constraints that apply to the relation.
• Informally:
• Unless you’ve steeped yourself in the relational
theory of normalization, the meaning of this
statement is virtually impossible to interpret.
180
• DKNF is based on the idea of domains and
constraints.
• These ideas have come up before, but they
themselves haven’t been formally defined yet.
• Practically speaking, even if understood,
unlike the numbered normal forms, DKNF
does not give you a step-by-step procedure for
creating a design, determining whether or not
it has violations, and fixing them.
181
• Although this normal form is of little practical
use, it raises the ideas of domains and
constraints, which are important.
• They are the basis for correctly capturing
relationships between tables.
• The remainder of this section will consist of
remarks on these topics.
182
Domains and Data Types
• Domains and data types are not the same
thing
• When a table is created, a complete definition
has to tell the data type of each field.
• Some fields may hold numeric values, some
may hold strings of characters, some may hold
dates, etc.
• If a field holds strings of characters, its length,
or maximum length also has to be stated.
183
• So, for example, a person’s last name may be
defined as containing a maximum of 24
characters.
• TEXT(24) or CHAR(24) doesn’t tell you that the
field holds a name.
184
• The social security number came up earlier
illustrating another aspect of this.
• A social security number has 9 digits.
• Although the social security number is called a
number, it is never used numerically.
• There is no need to add, subtract, multiply, or
divide it, and a good design will prevent that.
185
• The wise choice is to define this field as a
character field containing 9 characters where
valid characters in this field are limited to
digits.
• The point of this is that even though the name
of a field tells you something meaningful
about it, the name doesn’t necessarily
accurately reflect the preferred data type for
it.
186
Domains
• Most of the time, the name of a field is
descriptive of the kind of information it can
hold.
• So for a “last name” field in a table containing
information about people, it is informally clear
what this means.
• In general, a name would consist of a
sequence of letters of the alphabet.
187
• Names could come from any language or
culture, translated into the English alphabet.
• Some names do contain numeric information,
usually indicated with Roman numerals, for
example, John Smith I, John Smith II, etc.
• It would not be practical to come up with a
formula that mathematically defined all
possible values.
• Still, the general idea is clear.
188
• In a technical sense, the term domain refers to
the whole set of values that could appear as
valid data in that field.
• However, the domain is a semantic concept.
• A field’s domain is a description of the
meaning of the field and the kind of data it
can contain.
189
• This goes back to the description of an attribute
and the requirements on it.
• An attribute captures one characteristic of the
entity which it belongs to.
• Throughout a table, a field should capture the
same characteristic for the different entities
recorded in each row.
• Different records should not be recording
different kinds of information in the same field.
190
Different Fields, Not on the Same
Domain
• The idea of domains can be further clarified by
giving examples of cases where fields are not
on the same domain.
• A person’s last name field may be defined as
24 characters.
• A city field could also be defined the same
way.
• There may be cases where a person’s name is
the same as the name of a city.
191
• There is a city of Lincoln in England.
• Abraham Lincoln’s ancestors might have came
from that area.
• There is also a city of Lincoln in Nebraska,
which was named after Abraham Lincoln.
• Even though there may be an intersection of
the values in the city and last name fields, you
could argue that conceptually, city name and
person last name are two distinct domains.
192
• On the other hand, you may be maintaining some
sort of database where the underlying concept of
interest is name itself, whether describing a
person or a place.
• In that case, you could argue that the two fields,
city name and person last name, really are on the
same domain.
• From a table point of view, instead of two tables,
PeopleName and PlaceName, you could have one
table, GenericName, with a single name field.
193
• Another example of two fields that are not
obviously on the same domain would be social
security number and zip code.
• A full zip code consists of 5 plus 4, or 9 digits,
like a social security number.
• Both might be defined as character fields
containing 9 characters.
194
• However, social security numbers and zip
codes have nothing in common.
• There are doubtless cases where someone’s
social security number matches some zip code
somewhere in the country, but this is purely
coincidental.
• On the other hand, you may be maintaining
some sort of database where the underlying
concept is random numeric characteristics…
195
Domains and Primary Key to Foreign
Key Relationships
• The relationship between tables has been
explained by the process of embedding the
primary key of one table as a foreign key in
another.
• The foreign key table has to have a field with a
suitable name, that is defined to hold the same
type of data as the primary key field.
• In other words, the primary key to foreign key
relationship requires fields on the same domain.
196
Fields on Common Domains in General
• There can be other relationships between
tables which are the result of domains, but
not the direct result of embedding keys.
• Going back to one of the earlier examples, a
database may distinguish between mothers
and children as different kinds of entities, and
store them in different tables.
197
• Each of these tables may have social security
number fields and last name fields.
• You would not expect a mother and child to
have the same social security number.
• This would be a mistake.
198
• However, in most cases you would expect
mothers and children to have the same last
names.
• The idea is that a social security number is a
social security number, regardless of what
table it appears in.
• The idea of a social security number defines a
domain.
199
• Similarly, if the last name fields in both the
mother and child tables were defined as
containing 24 characters, a last name is a last
name, regardless of what table it appears in.
• Incidentally, a last name would be a last name
even if the fields were not of the same length
200
• Both social security number and last name
define a domains.
• You might expect to see social security
numbers involved in primary key to foreign
key relationships.
• Because names aren’t unique, you don’t
expect that of them.
201
• You can have name fields in two different
tables.
• Even though they’re not a pk to fk pair, they
are still on the same domain.
• In theory, if you wanted to, you could write
join queries on those fields
202
• The point of this discussion is that a domain is a
cross-table concept.
• Any given database may contain many different
fields in its tables, but the database will contain
fewer domains than fields because various fields
are on the same domain.
• In a sense, the idea of a domain is more
fundamental than the idea of a field.
• A field is just a manifestation of a domain.
203
The Definition of DKNY, Again
• This was the formal definition of DKNY:
• A relation is in domain-key normal form if and
only if every constraint on the relation is a logical
consequence of the domain constraints and the
key constraints that apply to the relation.
• What are domain constraints?
• They are constraints on the values that are valid
in a field based on the meaning of the field.
204
• Thinking back to SQL, these are some
constraints that can be concretely identified in
all cases, regardless of the particular field in
question:
• Entity integrity (primary key)
• Relational integrity (foreign key)
• Unique
• Not null
205
• In very general terms, DKNF says that a database
is correctly designed if the fields in the tables, the
constraints on the individual fields, and the
dependencies among the tables are the result of
correctly defining the domains and choosing
which domain each field is on.
• It reaches beyond the question of whether one
table is correctly designed, because the
relationships between tables also depends on
domain choice.
206
• The DKNY definition is so high-level and cosmic
that it boils down to saying this:
• A database is correctly designed if and only if it’s
correctly designed.
• If the design stems from and captures the logical
meaning of the problem, the design is correct.
• If the design is correct, then it must be the case
that it stems from and captures the logical
meaning of the problem…
207
• This is just a random thought.
• It is conceivable that you might be asked to
design a database for someone who was truly
confused.
• Their database needs are based on a scenario
that contains logical contradictions.
• If that were the case, faithfully capturing the
scenario would lead to a bad model.
• The first problem in such a case is to iron out the
inconsistencies in the user’s thinking.
208
9. Nulls and Integrity
• This section is a review of material that was
explained earlier.
• It will not be gone over in class.
• However, it is provided below in its entirety in
case you want to read the overheads yourself.
209
9. Nulls and Integrity
• The term “null” refers to the idea that a
particular field in a particular record may not
have data in it.
• In general, this is permissible.
• Cases often arise in practice where the
information doesn’t exist or isn’t known.
210
• It would be impractical to insist that all fields
always contain data.
• If that restriction were imposed, people would
get around it by putting in bogus values for
information that didn’t exist or wasn’t known.
• However, filling a database with bogus values
is not a very good idea.
211
• When a database management system supports null
values in fields, it’s important to understand what this
does not mean.
• It does not mean that the fields contain the sequence
of characters “null”.
• It also does not mean that the field contains invisible
blanks.
• Blank spaces themselves are a form of character.
• What it means is that there is absolutely nothing in the
field, and the database management system is able to
recognize fields that are in that state.
212
• The term integrity in database management systems refers
to the validity and consistency of data entered into a
database.
• The phrase “entity integrity” is the formal expression of a
requirement that was stated informally earlier.
• Entity integrity puts the following requirement on a
correctly implemented database:
• Every table has a primary key field, and no part of the
primary key field can be null for any record in the table.
• Clearly, if all or part of a key were allowed to be null, that
would defeat the purpose that the primary key field be the
unique identifier for every record in the table.
213
• As seen in the long discussion of normal forms, it is the
primary key to foreign key relationships that support the
interconnection between related entities that have been
separated into different tables by the design process.
• Once this has been done, it is critically important that the
data maintaining the relationships be valid and consistent.
• The phrase “referential integrity” has the following
meaning:
• Every value that appears in a foreign key field also has to
appear as a value in the corresponding primary key field.
• This can also be stated negatively:
• There can be no foreign key value that does not have a
corresponding primary key value.
214
• The meaning and importance of referential
integrity can be most easily explained with a
small example showing a violation of it.
• Consider the tables shown on the following
overhead:
215
mid
1
2
Mother
Name
Lily
Matilda
kid
a
b
c
Child
name
Ned
Ann
June
mid
3
2
216
• Child b, Ann, is shown as having a mother with
mid equal to 3.
• There is no such mother in the Mother table.
This is literally nonsense.
• There is no sense in which this can be correct
and this is what referential integrity forbids.
217
• This example also illustrates two other things,
which are related to “non-existent” values.
• Observe that mother 1, Lily does not have any
matching records in the Child table.
• This does not violate referential integrity.
• It suggests that the Mother table is misnamed,
and should be named the Woman table, but it is
reasonable to think that you might be recording
information about women and children and some
women will not have children.
218
• The other thing visible in the table is that child
c, June, does not have a mother listed.
• In other words, the foreign key field is null.
• This also does not violate referential integrity.
• As with null in any situation, it may mean that
the mother is not known.
219
• Nobody literally doesn’t have a mother, but if
the woman table only records information on
living women, for example, then for an
orphan, the mother “wouldn’t exist”.
• It is unlikely that you would rename the table
“Children and Orphans”—but the idea is that
the null value is allowed and this in some
sense affects the meaning of what kinds of
entities are entered into the table.
220
• Referential integrity leads to one last
consideration.
• The idea behind normalization was to get the
stray dependency out of one table and break it
into two.
• The problem with stray dependencies was
redundancy and anomalies.
• By breaking a design into two tables with a
primary to foreign key pair, you introduce
interrelationship constraints.
221
• Put simply, the question is this: What do you
do with foreign key values if the
corresponding primary key values in another
table are deleted or updated?
• A fully-featured database management system
will enforce referential integrity constraints.
• The default settings for these constraints are
summarized in these two phrases:
• On delete, restrict; on update, cascade.
222
• If these defaults are implemented, this is a
fuller explanation of what they mean in terms
of the concrete mother and child example:
• On delete, restrict:
• No mother record can be deleted if she has
corresponding child records in the other table.
• To allow the deletion would lead to a
referential integrity violation.
223
• On update, cascade:
• If the primary key value of a mother record is
updated, if she has corresponding child
records in the other table, the foreign key
values in those records is automatically
updated to reflect the change.
• This problem arises less frequently because
once a primary key value is assigned to an
entity, it is rarely changed.
224
The End
225