Download Database design 2

Document related concepts

Entity–attribute–value model wikipedia , lookup

Object-relational impedance mismatch wikipedia , lookup

Healthcare Cost and Utilization Project wikipedia , lookup

Relational model wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Database model wikipedia , lookup

Transcript
Database Design - Normalization
 Normalization
are a set of techniques for
organizing data into tables in order to...
– Eliminate most redundancy
– Prevent incompleteness
 Much
of this presentation is devoted to a
rather long example... will work through it
together
1
Two Steps to Normalization
 1.
Put the data into tabular form (by
removing repeating groups)
 2. Remove duplicated data into separate
tables
2
First - A Simple Example
 Graphic
2-1 - Employee Qualifications
3
Put Data Into Tabular Form
 Problem:
What do we do if an employee
has more than one qualification?
 Graphic
2-2
4
1. Divide Data Into Two Tables
 Employee Table
– Emp#, Name, Dept#, DeptName, DeptLocation
 Qualification Table
– Emp#, Qualification Desc, Qualification Year
 Graphic
2-3
5
2. Remove Duplicated Data
 Notice
that Dept. Number "05" is
"Auditing" and is located at "HO"
 It is repeated for every employee in that
department
– This wastes space
– Makes updating data more complicated
– "Elegance" rule is violated
6
Whose Data is it Anyway?
 The
basic problem is that department name
and location are really data about
departments rather than employees
 It belongs in a separate Department table
 Graphic
2-4
7
A Very Basic Example
 This
example was presented informally
 The rules of normalization have their
foundation in mathematics
– On one hand, we can have confidence in
normalization as a technique
– On the other hand, it's very easy to become lost
in mathematical terminology and proofs
– Remember that data modeling is "design" and
we should be careful about anything that lead
us to one "right" answer.
8
Relational Notation
 The
sample tables we've seen so far take up
a lot of space
 We need a more concise notation
 If we eliminate the sample rows, we are left
with the table names and columns
 Graphic
2-5
9
Relational Notation - cont.
 Text
books usually use data displayed as in
Graphic 2-5
 Designers usually want a little more
information
 Graphic
2-6
10
Exercise
 Normalize
the database shown in the
graphic
 Take about 10 minutes
 Remember two steps...
– Put the information in tabular format
– Eliminate redundant information
 Graphic
2-E1
11
A More Complex Example
 We'll
introduce the rules of normalization as
we proceed
 The rules can be daunting at first, but we'll
look at the problems they solve
12
Hospital Survey Example
 The
form displayed in Graphic 2-7 is one
used in an actual survey of antibiotic drug
usage in Australian hospitals
 Survey used to determine which drugs and
dosages were being used for various
operations, to ensure that patients were
properly prescribed for, and that the public
was not paying for unnecessary drugs
13
One Form for Each Operation
 Each
hospital in the survey was given a
unique hospital number
 All hospital numbers were prefixed with 'H'
 Hospitals fell into three categories:
– 'P' for public
– 'V for private
– 'T' for training
» All training hospitals were public so,
'T' implied 'P'
14
Operations
 All
Operation Numbers were assigned
sequentially be each hospital
 Operation Code was a standard
international code for the named operation
 Procedure Group was a broader
classification
15
Surgeons
 Surgeon
Number was allocated by
individual hospitals to allow surgeons to
remain anonymous
 The prefix 'S' stood for surgeon
16
Drugs
 The
Total Drug Cost was the total cost of
all drug doses for the operation
 The bottom of the form recorded the
individual antibiotic drugs used in the
operation
 The Drug Code consisted of the short name
for the drug and the size of the dose
17
As the Study Continued...
 ...
It was decided to replace the heaps of
forms with a computerized database
 Graphic 2-8 shows the initial database
design using the relational notation
 This was done by one person, who was the
data modular, the physical database
designer, and the programmer
18
Exercise
 Normalize
the database as shown in
Graphic 2-8
 Take about 15 minutes
 Remember two steps...
– Put the information in tabular format
– Eliminate redundant information
19
Determining Columns
 Normalization
relies on certain assumptions
about the way data is represented
– We need to make sure that these are valid
 There
are some problems normalization
does not solve
– It is better to address these at the outset, rather
than carry excess baggage through the whole
normalization process
 The
following steps are necessary...
20
One Fact per Column
 First
we make sure that each column in the
table represents one fact only
– The Drug code column holds both a short name
for the drug and a dosage size
– The Dosage size consists of a numeric size and
a unit of measure
– The Hospital Category really provides two
facts
» Hospital Public or Private?
» Does the hospital provide training?
21
Hidden Data
 Make
sure that we have not lost any data in
the translation to tabular form
 Common problem: We cannot rely on the
rows of the table being stored in any
particular order
– Suppose the original survey forms had been
filed in order of return
– We would need a Return Date or Sequence
column
22
Derivable Data
 We
need to remove any data that can be
derived from other data in the table and
amend the columns accordingly
– Remember our basic objective is nonredundancy
– The Total drug cost was derived by adding
together the Drug Costs multiplied by the
Number of Doses
23
Determining the Key
 Key
- a minimal set of columns that hold a
different combination of values for each
row in the table
 The value of the Key uniquely identifies
one row on the table
– The combination of Hospital Number and
Operation number
– In relational notation, underline the key
column(s)
24
Exercise
 Using
Graphic 2-8, Clean up the design (i.e.
list all the columns that are valid in the new
design)
 Remember...
–
–
–
–
One fact per column
Don't lose hidden data
Remove data that can be derived
Determine the Key
 Take
about 10 minutes
25
After Tidying Up...
 Graphic
2-9
26
Repeating Groups & 1st Normal Form
 Our
first task was to put the data in tabular
format…
– it might seem like we've done this
– but we've actually hidden a problem about the
drug administered data
27
Limiting Number of Occurrences
 The
drug administration data is the major
cause of the tables complexity and
inelegance
– Drug Short Name 2, Drug Short Name 4 etc.
 The
columns needed to accommodate up to
four drugs account for most of the
complexity
28
Why Only Four Drugs?
 Why
not five or six or more?
 Four drugs represented the maximum
arrived at by asking one of the survey teams
"What would be the maximum number of
drugs used in an operation?"
 In fact, this number was frequently
exceeded
– some operations had 10 or more
29
What's the Problem?
 Part
of the problem is that the question was
poorly put… a line on the form was
required for each drug-dose combination,
rather than just for each different drug
 The maximum number of drugs could
increase later, so this model still rates
poorly against the stability criterion
30
Paper Over the Differences
 With
the paper form, a continuation sheet
was simply attached when more drugs were
needed
 We could add more columns to the table
easily, but the application program changes
would be much more difficult
31
What Was Done...
 The
original designer decided to handle
continuations by suffixing the operation
number with "a", "b", or "c" to indicate a
continuation
 This caused program changes and
comprised the original simplicity of the
system
32
Data Reusability & Program
Complexity
 The
main difficulties are with data
reusability and program complexity
 Program can easily answer questions…
– How many operations were performed by
neurosurgeons?
– Which hospital is spending the most money on
drugs?
 But
not …
– How much money was spent on Ampicillin?
33
Another Way
 You
might argue that some queries are
always going to be more difficult than
others
 What would happen if we had designed the
table on the basis of "one row per drug"?
34
Recognizing Repeating Groups
 A Set
of columns repeated a number of
times - a "repeating group" - results in…
– inflexibility
– complexity
– poor data reusability
 Graphic
2-10
– curly brackets indicate a repeating group with
an indefinite number of occurrences
35
Solution
 A general
and flexible solution should not
set any limit on the maximum number of
occurrences
 It should also handle of few or no
occurrences (the drug-free operation)
36
Removing Repeating Groups
 First
step of Normalization:
– Put the data in tabular form by identifying and
eliminating repeating group
 Split
the file into two tables
– basic operation data
– (repeating) drug administration table
37
Split into Two Tables
 Remove
all repeating group columns to a
new table (each occurrence of group is a
row in the new table
 Include the primary key of the original table
in the new table (this makes a foreign key
in the new table)
38
Split into Two Tables - cont.
 Add
a 'Sequence' column if needed
 Name the new table
 Identify and underline the primary key of
the new table
39
Exercise
 Using
Graphic 2-10 remove all repeating
groups
40
Solution
 Graphic
2-11
41
Determining Key of New Table
 Not
always an easy task
 Question:
What is the minimum number of
combination of columns needed to uniquely
identify one row?
42
A Six Column Key
 Hospital
Number - FK
 Operation Number - FK
 Drug Short Name
 Dose
 Unit of Measure
 Method of Administration
43
First Normal Form
 Our
tables are now technically in first
normal form (1NF). What have we
achieved?
– All data of the same kind in now held in the
same place
– The number of different drug dosages that can
be recorded for an operation is effectively
unlimited
– An operation that don't use any drugs is
allowed
44
Problems with First Normal Form
 Look
at Operation table in Graphic 2-11
 Every row for an operation (e.g. Hospital
number 17) will contain the facts that its
name is St. Vincent's and Fred Fleming is
the contact person
 Criterion of non-redundancy is not being
met
45
More Problems
 Change
a fact about a hospital and you will
have to change it for every operation in the
hospital
 If
we delete the last operation for the
hospital, then we effectively delete the
hospital
46
Eliminating Redundancy
 Solve
all these problems by removing the
hospital information to a separate table, in
which each hospital number appears once
only (which becomes the primary key)
 Graphic
2-12
47
Determinants
 For
a given hospital number there could be
only one...
–
–
–
–
hospital name
hospital type
contact person
training status
 The
hospital number is a determinant for
the other columns
48
Formal Procedure
 1.
Identify and determinants (other than
the primary key) and the columns they
determine
 2. Create a separate table for each
determinant and its columns (the
determinant becomes the primary key)
 3. Name the new table
 4. Remove the determined columns from
the original table, leave the determinant
49
Other Determinants
 Hospital
number + Surgeon Number =>
Surgeon Specialty
 Operation code => Operation Name,
Procedure Group
 Drug Short Name => Drug Name,
Manufacturer
 Drug Short Name + Method of
Administration + Size of Dose + Unit of
Measure = Cost of Dose
50
Exercise
 Finish
normalizing Graphic 2-10
51
Third Normal Form
 Graphic
 This
2-13 is the final model
model is in Third Normal Form (3NF)
52
Where is Second Normal Form?
 Our
approach took us directly to Third
Normal Form
 Most texts treat this as a two stage process
– deal first with the determinants that are part of
the table's key (Second Normal Form)
– then with the non-key determinants
53
Is Third Normal Form the End?
 Unfortunately,
no
 Boyce-Codd Normal Form
 Fourth Normal Form
 Fifth Normal Form
 We'll discuss later...
54
Third Normal Form Hint
 Every
non-key column must be a fact about
the key, the whole key, and nothing but the
key
55
Candidate Keys
 Sometime
more than one column or
combination of columns could serve as a
primary key (e.g. Drug Name instead of
Drug Short Name)
 These are called Candidate Keys
 In step 1 we should really say...
– Identify any determinants, other than candidate
keys...
56
Foreign Keys
 When
we removed repeating groups we
carried the primary key of the original table
to cross-reference back
 These columns are called Foreign Keys
57
Self-Referencing Tables
 A Foreign
Key may refer back to the Primary
Key of the same table
– Example: An Employees table might have
employee id as its Primary Key and a Foreign Key
of Manager id which refers to another row in the
same table
 The
convention to represent Foreign Keys is
an asterisk...
– EMPLOYEE (Employee Id, Name, Manager id*
58
Referential Integrity
 If
the Employees table has a Department Id
foreign key referring back to the
Departments table's Primary key, then we
would expect to find a valid Department Id
on the Departments table for every value of
Department id in the Employees table
 If not, our database lacks Referential
Integrity
59
Summary
 Normalization
is a set of techniques for
organizing data into tables to eliminate
certain types of redundancy an
incompleteness
 Normalization relies on correct
identification of determinants and keys
60
Last Slide - Normalization
 Assignment
#9 due next week
61