(Winter 2017)
Instructor: Craig Duckett
Lecture 05: Thursday, January 17th
Six Design Rules
Normalization Refresher
TEAMWORK
Assessment announcements, upcoming due dates, etc, will be posted here on each lecture
slide going forward.
• PHASE 1: DISCOVERY DUE: Tuesday, January 31st, uploaded to Team Web Site
and ZIPPED and uploaded to StudentTracker by Phase 1 Project Manager
Phase 2: Design due Thursday, February 16th
Monday, January 16th - NO SCHOOL: Martin Luther King Day
Thursday, January 26th - NO CLASSES: Non-Instructional Day
TBA - NO LECTURE: Instructor on CHEM Hiring Committee
TBD - NO LECTURE: Instructor on CHEM Hiring Committee
The Team Project
Five Phase Due Dates
One (1) Team Project for a Client (3-to-4 Members on Team) 1000 points Total
• Phase 1: Discovery (200 Points) DUE TUESDAY, January 31st
• Phase 2: Design (200 Points) DUE THURSDAY, February 16th
• Phase 3: Develop (200 Points) DUE THURSDAY, March 2nd
• Phase 4: Distribute (200 Points) DUE THURSDAY, MARCH 9th
• Phase 5: Documentation (200 Points) DUE THURSDAY, MARCH 16th (Last Day of Class)
Database Design for Mere Mortals: Chaps. 4 & 5 Summary
Conceptual Overview
The first phase in the database design process is to define a mission statement and mission
objective. This establishes the purpose of the database and provides a focus for the developer.
The second phase involves analyzing the current database, if one exists. It will typically be a
legacy (one that has been in use for several years) or paper-based (forms, index cards, folders,
etc.) database. It is very important to conduct interviews with users and management to identify
how they interact with the database on a daily basis. With this information, you then compile a
list of fields. This list will be refined as the design is developed.
The third phase is creating the data structures: defining tables and fields, establishing keys, and
defining field specifications.
Tables are the first structures you define in the database. Once subjects are identified, they’re
established as tables, then fields are associated with the appropriate tables. Tables should be
reviewed to be sure they represent only one subject and that no fields are duplicated.
Next, fields are reviewed to be sure there are no multipart or multivalued fields. If there are, you modify
those fields so each field stores a single value. Then a Primary Key is established, making sure it
uniquely identifies each record within the table.
Finally, field specifications are established. Interviews should be conducted with users to help
identify any specific field characteristics that may be important to them.
Conceptual Overview
In the fourth phase, table relationships are established. Interviews are conducted with users
and management to identify relationships and relationship characteristics and to establish
relationship-level integrity. Once relationships have been identified, it is necessary to establish the logical
connection for each relationship. Depending upon the type of relationship, you would use either
a Primary Key or a “linking” or “composite” table to make the connection between a pair of tables.
The fifth phase of the database design process is to define the business rules. How an
organization views and uses its data will determine limitations and requirements that must be
built into the database. Again, this information is gained through interviews with users and
management. Next, validation tables are defined. For example, if certain fields are found to have
a finite range of values due to the way they are used by an organization, validation tables are used
to ensure the consistency and validity of the values stored in those fields.
The sixth phase of the database design process is determining and establishing views.
Interviewing users and management will help identify the different ways data is viewed. One
group may view data from a different perspective than another group. Another group may only
need to view one specific field from a certain table.
The seventh phase is reviewing the final database structure for data integrity. First, each table is
reviewed to ensure that it meets proper design criteria. Then field specifications are reviewed and
checked. Then, you test the validity of each relationship. Finally, business rules are reviewed and
confirmed.
Chapter 5 Summary STARTING THE PROCESS
To begin the design process, you must identify the purpose of the database as well as a list of tasks that can
be performed against the data.
Conducting interviews provides valuable information that affects the design of the database structure.
Having a list of prepared questions is highly recommended. It’s important to ask open-ended questions.
This gives the participant an opportunity to provide complete and objective answers to questions.
The following are suggestions for interview guidelines:
• Set a limit of six people or fewer for each interview.
• Conduct separate interviews for users and managers.
• If several groups are interviewed, designate a group leader for each group.
• Prior to the interview, inform the participants of what will be discussed and how the interview will be
  conducted.
• Make sure everyone understands you appreciate their participation and that their responses are
  valuable to the overall design.
• Conduct the interview in a well-lit room, away from distracting noise, with a large table and
  comfortable chairs, and have coffee and munchies on hand.
• If you’re not good at taking notes, assign the task to a dependable transcriber or get the group’s
  permission to use a tape recorder.
• Give everyone your equal and undivided attention.
• Make sure everyone understands that you’re the official arbitrator.
• Keep the pace of the interview moving.
• Always maintain control of the interview.
Chapter 5 Summary (CONTINUED)
Defining the Mission Statement
A good mission statement is succinct and to the point. It should be very general and should not describe
specific tasks. Interviewing management and staff will bring an overall understanding of the organization
and a general comprehension of why the database is necessary. Here are a few sample questions that
you can use to arrive at your mission statement:
• How would you describe the purpose of your organization to a new client?
• What would you say is the purpose of your organization?
• What is the major function of your organization?
• How would you describe what your organization does?
• How would you define the single most important reason for the existence of your
  organization?
• What is the main focus of your organization?
Defining the Mission Objectives
Mission objectives are statements that represent the general tasks performed against the data in the
database. Each statement represents a single task and should not contain unnecessary detail. Mission
objectives are used to help define table structures, field specifications, relationship characteristics and
Views. Information used to define the mission objectives is gathered through interviews with users and
management. General tasks are determined by asking open-ended questions. The interviews should be
very general in nature to get an overall idea of the general tasks the database should support.
Six Important Database Design Rules
Rule 1: What is the nature of the application (OLTP or OLAP)?
When you start your database design, the first thing to analyze is the nature of the application you are designing for:
is it transactional or analytical? Many developers apply normalization rules by default without thinking
about the nature of the application, and later run into performance and customization issues. As said, there
are two kinds of applications: transaction-based and analytics-based. Let’s understand what these types are.
Transactional: In this kind of application, your end user is more interested in CRUD, i.e., creating, reading,
updating, and deleting records. The official name for this kind of database is OLTP (Online Transaction
Processing).
Analytical: In this kind of application, your end user is more interested in analysis, reporting, forecasting, etc.
These databases have fewer inserts and updates. The main intention here is to fetch and analyze
data as fast as possible. The official name for this kind of database is OLAP (Online Analytical Processing).
Rule 1: What is the nature of the application (OLTP or OLAP)? CONTINUED
In other words, if you think inserts, updates, and deletes are more prominent, go for a normalized table design;
otherwise, create a flat denormalized database structure.
Below is a simple diagram which shows the names and addresses on the left-hand side as a simple normalized
table, and how applying a denormalized structure produces a flat table structure.
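The contrast between the two structures can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table names (person, address, person_flat), the columns, and the sample rows are hypothetical and are not taken from the slide's diagram.

```python
import sqlite3

# In-memory database; all table names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (OLTP-style): each name is stored once, addresses reference it.
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        person_id  INTEGER NOT NULL REFERENCES person(person_id),
        city       TEXT NOT NULL
    );
    INSERT INTO person VALUES (1, 'Person A'), (2, 'Person B');
    INSERT INTO address VALUES (1, 1, 'City X'), (2, 1, 'City Y'), (3, 2, 'City X');

    -- Denormalized (OLAP-style): one flat table, names repeated on every row.
    CREATE TABLE person_flat AS
        SELECT p.name AS name, a.city AS city
        FROM person p JOIN address a ON a.person_id = p.person_id;
""")

flat_rows = conn.execute(
    "SELECT name, city FROM person_flat ORDER BY name, city"
).fetchall()
print(flat_rows)
```

The flat table can be read without a join, at the cost of repeating 'Person A' on two rows, which is the trade-off this rule asks you to weigh.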
Rule 2: Break your data into logical pieces, make life simpler
This rule is actually the first rule of 1st normal form. One sign that this rule is being violated is that your queries
use too many string-parsing functions like substring, charindex, etc.; if so, this rule probably needs to be applied.
For instance, the table below has student names; if you ever want to query student names having
“Winkus” and not “Abner”, you can imagine what kind of a query you will end up with.
So the better approach would be to break this field into further logical pieces so that we can write clean and optimal
queries.
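As a rough sketch of what the split buys you, the query for students named “Winkus” but not “Abner” becomes two plain comparisons once the name is stored in separate columns. The table name, column names, and sample rows below are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical student table with the name broken into two logical pieces.
conn.execute(
    "CREATE TABLE student ("
    " student_id INTEGER PRIMARY KEY,"
    " first_name TEXT NOT NULL,"
    " last_name  TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO student (first_name, last_name) VALUES (?, ?)",
    [("Abner", "Winkus"), ("Rex", "Winkus"), ("Abner", "Doon")],
)

# No substring or charindex parsing needed: each piece is compared directly.
rows = conn.execute(
    "SELECT first_name, last_name FROM student"
    " WHERE last_name = ? AND first_name <> ?"
    " ORDER BY first_name",
    ("Winkus", "Abner"),
).fetchall()
print(rows)
```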
Rule 3: Do not get overdosed with Rule 2
Developers are cute creatures. If you tell them this is the way, they keep doing it; well, they overdo it, leading to
unwanted consequences. This also applies to Rule 2, which we just discussed. When you think about decomposing,
pause and ask yourself: is it needed? As said, the decomposition should be logical.
For instance, take the phone number field; it’s rare that you will operate on the area codes of phone numbers
separately (unless your application demands it). So it would be a wise decision to leave it as is, since splitting it
can lead to more complications.
Rule 4: Treat duplicate non-uniform data as your biggest enemy
Focus on duplicate data and refactor it. My personal worry about duplicate data is not that it takes hard disk space, but the
confusion it creates.
For instance, in the below diagram, you can see “5th Standard” and “Fifth standard” mean the same thing. Now you can
say the data has come into your system due to bad data entry or poor validation. If you ever want to derive a report,
it would show them as different entities, which is very confusing from the end user’s point of view.
Rule 4: Treat duplicate non-uniform data as your biggest enemy CONTINUED
One of the solutions would be to move the data into a different master table altogether and refer them via foreign
keys. You can see in the below figure how we have created a new master table called “Standards” and linked the
same using a simple foreign key.
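A minimal sketch of the master-table approach, using Python's built-in sqlite3 module. The column names, roll numbers, and student names are hypothetical; only the “Standards” table name and the “5th Standard” value come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE standards (
        standard_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL UNIQUE
    );
    CREATE TABLE student (
        roll_number INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        standard_id INTEGER NOT NULL REFERENCES standards(standard_id)
    );
    INSERT INTO standards VALUES (5, '5th Standard');
    INSERT INTO student VALUES (1, 'Student A', 5), (2, 'Student B', 5);
""")

# The report now groups on a single spelling held in the master table.
report = conn.execute(
    "SELECT s.name, COUNT(*) FROM student st"
    " JOIN standards s ON s.standard_id = st.standard_id"
    " GROUP BY s.name"
).fetchall()

# A value that is not in the master table is rejected outright.
try:
    conn.execute("INSERT INTO student VALUES (3, 'Student C', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(report, rejected)
```

Because students can only point at an existing row in the master table, a second spelling such as “Fifth standard” has nowhere to creep in.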
Rule 5: Watch for data separated by separators (commas, dashes, slashes)
The second rule of 1st normal form says to avoid repeating groups. One example of a repeating group is shown
in the below diagram. If you look at the syllabus field closely, we have too much data stuffed into one field. These kinds of
fields are termed “repeating groups”. If we have to manipulate this data, the query would be complex, and I
doubt it would perform well.
Rule 5: Watch for data separated by separators CONTINUED
Columns like these, which have data stuffed in with separators, need special attention. A better approach would
be to move those fields to a different table and link them with keys for better management.
So now let’s apply the second rule of 1st normal form: “Avoid repeating groups”. You can see in the above figure I
have created a separate syllabus table and then made a many-to-many relationship with the subject table. With this
approach, the syllabus field in the main table is no longer repeating and no longer holds data separators.
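One way to picture the move is a small migration that splits each comma-separated value into rows of a linking table. This is a sketch under assumptions: the legacy rows, the table names (standard, subject, standard_subject), and the subject names are invented for illustration and do not reproduce the slide's figure.

```python
import sqlite3

# Hypothetical legacy rows: one field stuffed with comma-separated values.
legacy_rows = [
    ("5th Standard", "Maths, Science"),
    ("6th Standard", "Maths, History"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE standard (
        standard_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL UNIQUE
    );
    CREATE TABLE subject (
        subject_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL UNIQUE
    );
    -- Linking table for the many-to-many relationship.
    CREATE TABLE standard_subject (
        standard_id INTEGER NOT NULL REFERENCES standard(standard_id),
        subject_id  INTEGER NOT NULL REFERENCES subject(subject_id),
        PRIMARY KEY (standard_id, subject_id)
    );
""")

for standard_name, stuffed in legacy_rows:
    cur = conn.execute("INSERT INTO standard (name) VALUES (?)", (standard_name,))
    standard_id = cur.lastrowid
    for subject_name in (part.strip() for part in stuffed.split(",")):
        # Insert each distinct subject once, then link it to the standard.
        conn.execute("INSERT OR IGNORE INTO subject (name) VALUES (?)", (subject_name,))
        subject_id = conn.execute(
            "SELECT subject_id FROM subject WHERE name = ?", (subject_name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO standard_subject VALUES (?, ?)", (standard_id, subject_id)
        )

# Searching for one value is now an ordinary join, not string parsing.
maths_standards = [
    row[0]
    for row in conn.execute(
        "SELECT st.name FROM standard st"
        " JOIN standard_subject ss ON ss.standard_id = st.standard_id"
        " JOIN subject su ON su.subject_id = ss.subject_id"
        " WHERE su.name = ? ORDER BY st.name",
        ("Maths",),
    )
]
print(maths_standards)
```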
Rule 6: Watch for partial dependencies
Watch for fields which depend partially on primary keys. For instance, in the table to the right we can
see the primary key is created on roll number and standard. Now watch the syllabus field closely. The
syllabus is associated with the standard in which the student is studying and not directly with the
student (roll number). So if tomorrow we want to update the syllabus, we have to update it for each student,
which is painstaking and not logical. It makes more sense to move these fields out and associate them
with the Standards table.
You can see how we have moved the syllabus field and attached it to the Standards table.
This rule is nothing but the 2nd normal form: “All non-key fields should depend on the full primary key
and not partially”.
Normalization
https://www.youtube.com/watch?v=wp0N1tYjEWc&feature=youtu.be&hd=1
Normalization
Once we've started to plan out our tables, our columns, and relationships, we do something
called Database Normalization. This is a process where you take your database design, and
you apply a set of formal criteria and rules called Normal Forms. These were developed about
45 years ago mainly by Edgar Codd, the father of relational databases. And we step through
them 1, 2, 3, first normal form, second normal form, and third normal form. There are others
but these are the important ones.
Normalization
Normalization should be carried out for every database you design. It's really not that hard,
even though, yes, when you first start reading about database normalization, you'll run into
phrases like:
But you don't have to get into all this language unless you are mathematically inclined. The
entire point of normalization is to make your database easier and more reliable to work with.
You usually will end up creating a few new tables as part of the process. But the end result is
your database will contain a minimum of duplicate or redundant data. It will contain data
that's easy to get to, easier to edit and maintain, and you can perform operations, even difficult
ones on your database without creating garbage in it, without invalidating the state of it.
If you're a working database administrator or database designer, you can do normalization in
your sleep. It's a core competency of the job. It's important. And as you'll see, we've already
been doing a little of it.
First Normal Form (1NF)
Before we apply the first set of criteria, what's called first normal form, often shortened to
1NF, I'm taking as a given that we already have our columns and our primary keys specified.
First normal form says that each column in each of your tables should contain one
value, just one value, and there should be no repeating groups.
Okay, what does this actually mean?
First Normal Form (1NF)
Well, let's say I begin developing a database for my company and one of my tables is an
Employee table, very simple stuff, EmployeeID, LastName, FirstName, and so on. And we
allocate every employee a computer.
I want to keep track of that, so we'll add a ComputerSerial column to keep track of who has
what. Now, this is actually okay right now. This technically is in first normal form. Here's the
problem.
First Normal Form (1NF)
Let's say I figured out that some of our employees need a Mac and a PC to do the testing. Others
need a desktop and a laptop. So, several people have multiple computers, and I want to keep track of
all of them. There are a couple of ways that I could deal with this. I could just start stuffing extra data
into that one column. We could start putting in commas or vertical bars or any other delimiter and put
multiple values in the one ComputerSerial column.
This is something you just don't do in Relational Database Design. We're violating first normal
form.
Understand that relational databases will happily deal with hundreds of tables. Each table could have
hundreds of columns and millions of rows. But they do not want columns that have a variable
number of values. You would find it hard to search directly for a serial number. You'd find it hard to
sort. You'd find it hard to maintain. So, it's not in first normal form if you do this because first
normal form demands that every column, every field contains one and only one value.
First Normal Form (1NF)
So, what we might do then is go back to the original way, and instead start adding new
columns. So, ComputerSerial2, ComputerSerial3, this is what's called a repeating group, and
there should be no repeating groups. The classic sign of a repeating group is a column
with the same name and a number tacked onto the end of it just to make it unique, and
usually this is a sign of an inflexible design.
Sure, if we could guarantee that there would only ever be two or three, that's fine. But what
happens when we want to add the tablet and the smart phone? What happens when one
employee manages testing and needs to be associated with six computers? We don't want to
require a change to the database schema just because we buy a new computer. So, what do
we do here?
First Normal Form (1NF)
Well, what we do is the same thing for a lot of these normalization steps. We'll take this data
out of the Employee table, and put it in its own table called Computer.
First Normal Form (1NF)
This then has relationships. We create a one-to-many relationship between Employee, and
this new Computer, or it could be called an asset table or whatever else makes sense. And it
has a foreign key back to the Employee table. I can take any EmployeeID like 551, follow it to
the Computer table, and find his two computers or 553, find his three computers, there are no
repeating values, no repeating groups in either table. And this will get us into first normal
form.
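The Employee and Computer split described above might look like this with Python's built-in sqlite3 module. The employee names and serial numbers are made up for illustration; the table names, the ComputerSerial column, and the EmployeeID values 551 and 553 follow the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (
        EmployeeID INTEGER PRIMARY KEY,
        LastName   TEXT NOT NULL,
        FirstName  TEXT NOT NULL
    );
    -- One row per computer; the foreign key points back to its employee.
    CREATE TABLE Computer (
        ComputerSerial TEXT PRIMARY KEY,
        EmployeeID     INTEGER NOT NULL REFERENCES Employee(EmployeeID)
    );
    INSERT INTO Employee VALUES (551, 'Doe', 'Pat'), (553, 'Roe', 'Sam');
    INSERT INTO Computer VALUES
        ('SN-A1', 551), ('SN-A2', 551),
        ('SN-B1', 553), ('SN-B2', 553), ('SN-B3', 553);
""")

def computers_for(employee_id):
    """Return the serial numbers assigned to one employee."""
    rows = conn.execute(
        "SELECT ComputerSerial FROM Computer"
        " WHERE EmployeeID = ? ORDER BY ComputerSerial",
        (employee_id,),
    )
    return [row[0] for row in rows]

print(computers_for(551))
print(computers_for(553))
```

Giving an employee a sixth computer is now one more row in Computer, with no change to the schema.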
First Normal Form (1NF)
So, it's very common that the solution to a normalization issue is to create a new table.
Sometimes, it's a one-to-many relationship like this, other times it might even require a
many-to-many with a linking table.
Second Normal Form (2NF)
Before you attempt to go into second normal form or 2NF, well first, you have to be in first
normal form. You don't pick and choose between them. You go through this one, two,
three. Now whereas first normal form is about the idea of repeating values in a particular
column, second normal form, and third normal form are all about the relationship between
your columns that are your keys, and your other columns that aren't your keys. The second
normal form has the rather puzzling official description that any non-key field should be
dependent on the entire primary key. And that is about as simple as it can get phrased.
Second Normal Form (2NF)
Now, when I say the word field, it usually refers to the actual value in a particular
column position for a particular row. But what does this actually mean? Well, for most of what
we've done in this course, this actually won't be an issue for us. Second normal form is only
ever a problem when we're using a Composite Primary Key. That is a primary key made of two
or more columns. So, let me show you a table that currently is in first normal form but not in
second normal form. Going back to the idea of a database for a training center, I have an
Events table here that has an ID of a Course, a Date, CourseTitle, Room, Capacity,
AvailableSeats, and so on. Now, what's actually happening here is this table has been defined
to use two columns as the primary key. It's a composite primary key.
Second Normal Form (2NF)
Now, the issue with second normal form is that if you use a composite key, you need to look
closely at the other columns in this table. So, going along to my non-key columns, I have
CourseTitle, SQL Fundamentals, Room 4A, Capacity is 12, there are 4 seats available. A lot of
this information would be unique to this one entry, this one course on this particular date.
That's fine. But second normal form asks that all of my non-key columns, everything that isn't
part of the key, so Course Title, Room, Capacity, Available, they all have to be dependent on
the entire primary key. Now, that is the case for Room and Capacity and Available. These are
unique values based on the fact that we're running this particular Date, this particular Course,
and this particular room with a certain number of seats available. It will always be different.
But CourseTitle, well, I could get that just from half of the key. I could get that just from the
first part of the key. It has no connection to the Date whatsoever. SQL Fundamentals will
always be based on SQL101.
Second Normal Form (2NF)
It doesn't matter if it's being run in March or April or May. Now, this might sound a little bit
ivory tower. But here would be the impact.
What happens if somebody reached into this table, and they changed that Course ID, but they
didn't change the title? Now, we've got a conflict. We might have the wrong title for the wrong
piece of data. That's because my data now isn't in second normal form, and we're trying to prevent
that conflict from ever happening. So, how do we fix it?
Second Normal Form (2NF)
Well, once again, we're going to rip out the CourseTitle. We're going to create a separate
Courses table, where we want to again map the ID of the course into its own row. So, we'll
always have one specific title for one specific ID. And then we create a one-to-many
relationship between Courses and Events. And removing that from the Events table means that
everything in that table is now based on the entire key: a particular course on a particular date,
which may have a different room or different capacity, different number of available seats.
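Here is a small sketch of the Events and Courses split using Python's built-in sqlite3 module. The column name EventDate, the dates, the second event row, and the replacement title are assumptions for illustration; SQL101, SQL Fundamentals, Room 4A, and the capacity of 12 come from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- CourseTitle depends on CourseID alone, so it lives in its own table.
    CREATE TABLE Courses (
        CourseID    TEXT PRIMARY KEY,
        CourseTitle TEXT NOT NULL
    );
    -- Everything left here depends on the whole composite key.
    CREATE TABLE Events (
        CourseID       TEXT NOT NULL REFERENCES Courses(CourseID),
        EventDate      TEXT NOT NULL,
        Room           TEXT NOT NULL,
        Capacity       INTEGER NOT NULL,
        AvailableSeats INTEGER NOT NULL,
        PRIMARY KEY (CourseID, EventDate)
    );
    INSERT INTO Courses VALUES ('SQL101', 'SQL Fundamentals');
    INSERT INTO Events VALUES
        ('SQL101', '2017-03-01', '4A', 12, 4),
        ('SQL101', '2017-04-01', '7B', 14, 9);
""")

def event_titles():
    """Return (date, title) for every event, looking the title up by join."""
    return conn.execute(
        "SELECT e.EventDate, c.CourseTitle FROM Events e"
        " JOIN Courses c ON c.CourseID = e.CourseID"
        " ORDER BY e.EventDate"
    ).fetchall()

before = event_titles()
# The title is stored once, so one update renames every scheduled event.
conn.execute(
    "UPDATE Courses SET CourseTitle = ? WHERE CourseID = ?",
    ("Intro to SQL", "SQL101"),
)
after = event_titles()
print(before, after)
```

Since the title is held in exactly one row, an event can never carry a title that disagrees with its course ID.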
Third Normal Form (3NF)
Now, let's take a look at the third normal form. Well, as plainly as this can be described, it's
that no non-key field, meaning, a column that is not part of the primary key is dependent on
another non-key field. It is in a way similar to second normal form. Second normal form asks
can I figure out any of the values in this row from just part of the composite key? While third
normal form asks can I figure out any of the values in this row from any of the other values in
this row? And I shouldn't be able to do that.
Third Normal Form (3NF)
Let's take a look at an example. I've got this updated version of the Events and Courses table
from the previous example. So, it's in both first normal form, it doesn't have any repeating
values or repeating groups, and it's in second normal form. Meaning, there's no part of this
that's dependent on just a piece of the key. What I need to do for third normal form is look
at my non-key fields, Room, Capacity, Availability. If I scan the entire row, let's take the first
row, we've got SQL101 course occurring on the 1st of March. There are apparently 4 seats
available. It's in Room 4A with a capacity of 12. Now, this is, at first look, perfectly
acceptable, because this course could be scheduled in a different room every time with a
different number of available seats as we start to sell different seats for a particular date.
That's all okay.
Third Normal Form (3NF)
Here's the problem. It's between Room and Capacity. These are both non-key fields. These
columns aren't part of the primary key. But if I look down the column for Room, I see 4A has 12
seats capacity, 4A has 12 seats, 7B has 14 seats. So, if every time we're in Room 4A we always
have 12 seats, and every time we're in 7B we always have 14 seats, I don't need to repeat that
information. I could figure out capacity from Room and Room alone. I have one non-key field
that is based on another non-key field. So, we don't need these to be stored in the same table.
What we need to do is, you guessed it, split some of this information out into its own table.
Third Normal Form (3NF)
So, we need to pull out Capacity from the Event table, and just keep Room. As long
as Room will always tell us a fixed capacity, we'd create a separate table for it: 4A always has 12,
7B always has 14, and so on. Now, we're in third normal form, no non-key field is dependent
on another non-key field.
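A sketch of the resulting structure with Python's built-in sqlite3 module. The Rooms table name, the EventDate column, the dates, and the WEB201 course are assumptions for illustration; the room capacities (4A has 12, 7B has 14) come from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Capacity depends only on Room, so it is stored once per room.
    CREATE TABLE Rooms (
        Room     TEXT PRIMARY KEY,
        Capacity INTEGER NOT NULL
    );
    CREATE TABLE Events (
        CourseID       TEXT NOT NULL,
        EventDate      TEXT NOT NULL,
        Room           TEXT NOT NULL REFERENCES Rooms(Room),
        AvailableSeats INTEGER NOT NULL,
        PRIMARY KEY (CourseID, EventDate)
    );
    INSERT INTO Rooms VALUES ('4A', 12), ('7B', 14);
    INSERT INTO Events VALUES
        ('SQL101', '2017-03-01', '4A', 4),
        ('WEB201', '2017-03-02', '4A', 12),
        ('SQL101', '2017-04-01', '7B', 9);
""")

# Capacity is looked up through Room whenever it is needed.
schedule = conn.execute(
    "SELECT e.CourseID, e.EventDate, e.Room, r.Capacity FROM Events e"
    " JOIN Rooms r ON r.Room = e.Room"
    " ORDER BY e.EventDate, e.CourseID"
).fetchall()
print(schedule)
```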
Now, as you're seeing, it's all about the redundancy of the information. This is what we're
trying to do with normalization.
Third Normal Form (3NF)
Now, another example of third normal form would be something like this, which is very
common.
Let's say we've got an OrderItem table, which is calculating different parts of an invoice. So, it
has a ProductID with a Quantity, a UnitPrice, and a Total.
Third Normal Form (3NF)
Now, you don't have to worry about how this might relate to different tables. All I'm interested
in looking at is this part. We've got a Quantity of 4 at a UnitPrice of $10, and the Total is $40. Here is the
issue. We can see that Total is based purely on Quantity x UnitPrice. Now, Quantity and
UnitPrice are both non-key fields. So we're figuring out Total from these other two non-key
fields. We don't need to do this. We don't need to store this in the database. We don't want to
store information in our table that's easily ascertained by adding other non-key fields
together, or in this case multiplying them. One of the main reasons for this is to prevent
any conflicts.
If in this example I have a row that says we have a Quantity of 4 and a UnitPrice of 10, but
the Total says 50, well, where is the problem? There is a problem, but how
would we fix it?
Third Normal Form (3NF)
Is the Total wrong or is the Quantity wrong? Your data doesn't make sense anymore. So, we
would remove that Total column from this table. We can figure it out when we need to figure it
out. Now, third normal form will help you figure out these potential problems.
Now just a quick side bar, in cases like this where you might find a total useful in the table,
many database systems offer you the option of defining a computed or calculated column. It's
not actually stored in the database, it is a convenient read-only fiction. Its value is
automatically calculated based on the other columns in the table, and you may find that useful
from time to time.
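Computed-column syntax differs between database systems, so the sketch below swaps in a plain view, which gives a similar read-only Total derived at query time. It uses Python's built-in sqlite3 module; the view name OrderItemWithTotal, the OrderItemID column, and the sample row are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Only the independent facts are stored; there is no Total column.
    CREATE TABLE OrderItem (
        OrderItemID INTEGER PRIMARY KEY,
        ProductID   INTEGER NOT NULL,
        Quantity    INTEGER NOT NULL,
        UnitPrice   REAL NOT NULL
    );
    -- Total is derived on every read, so it cannot drift out of sync.
    CREATE VIEW OrderItemWithTotal AS
        SELECT OrderItemID, ProductID, Quantity, UnitPrice,
               Quantity * UnitPrice AS Total
        FROM OrderItem;
    INSERT INTO OrderItem VALUES (1, 100, 4, 10.0);
""")

def total_for(order_item_id):
    """Read the derived Total for one order item."""
    return conn.execute(
        "SELECT Total FROM OrderItemWithTotal WHERE OrderItemID = ?",
        (order_item_id,),
    ).fetchone()[0]

total_before = total_for(1)
conn.execute("UPDATE OrderItem SET Quantity = 5 WHERE OrderItemID = 1")
total_after = total_for(1)
print(total_before, total_after)
```

Changing Quantity changes the Total on the next read, so the conflicting row described earlier (Quantity 4, UnitPrice 10, Total 50) cannot be stored.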
Denormalization
Database Denormalization
So, we should always take our database design through the first, second, and third normal
forms. There are more criteria available. There are fourth, fifth, and sixth normal forms.
There's something called Boyce-Codd normal form. But taking it to third normal form is the
usual expectation in a business environment, and certainly all we need to cover in a course like
this one. Now, you will actually find a lot of tables out there that intentionally break normalization
rules, and some others that seem like they do but actually don't.
Denormalization
Here's one example. Let's say we've got an Employee table, and I'm storing an Email and a Phone
number.
Technically, this can be described as breaking first normal form. It's a repeating group. But in practice,
you may find it more convenient to just allow an Email and Email2 column or perhaps a HomePhone
and MobilePhone column rather than splitting everything out into multiple tables and having to
follow relationships every single time you read or write this data. This will be referred to as a denormalization decision. You're consciously making the choice that something could be normalized
out into another table. You could follow the official rules. But for convenience and/or for
performance, you're not going to.
Denormalization
Normalizing a table like this, thinking that I've immediately spotted a non-key field
dependency, that would actually be taking it too far and making things more inconvenient.
The point is that you really want to understand your data before you make these
choices about whether to normalize or de-normalize. And you might de-normalize to make things a
bit more efficient, but do it knowingly instead of accidentally.
Denormalization
These are really the three steps that we would go through: first normal form, second normal
form, and third normal form.
First being about having no repeating values and no repeating groups, second normal form, no
values based on just part of say half of a composite key, and third normal form, none of your
non-key values should be based on or determined from another non-key value.
Taking your database design through these three central criteria will vastly improve the quality
of your data.
Denormalization
One example that can seem like a normalization and/or de-normalization issue but really isn't
is any table that's full of address information. This situation can be a little deceptive. If I look at a
table like this, I can see I've got Zip code being stored as the last column here.
Theoretically, I could figure out what the City, and the State are just from the Zip Code, if I
separated them out into their own table. So technically, I have non-key fields, City and State
that are dependent on another non-key field, Zip, that could be figured out from Zip alone.
However, this kind of case is not the full story because while it might be true 99% of the time
that a Zip code maps to a particular City or Town, there are some cases where multiple towns
or cities are allowed in the same Zip code, and some Zip codes even cross multiple states.
TEAM TIME