IT 244 Database Management System
Topic 3: Normalization
Ref: A First Course in Database Systems (Jeffrey D. Ullman, Jennifer Widom) + online.

Different Technique
• Thus far in this course we have discussed Entity Relationship Modeling as a technique for designing a database
• There exists another important design technique, called Normalization, which we shall now cover.
• Before we discuss the details, let us first consider the applicability of each of these techniques

Entity Relationship Modeling
• Entity Relationship Modeling is appropriate when a database is to be built from scratch
• This arises when a completely new information system is being built (e.g. a company realizes that it needs to keep track of all purchases made from its suppliers in order to improve efficiency)
• In the above scenario no relations exist (or can be identified directly), thus we need to model the requirements in the form of an ER model, and then convert the latter to the Relational Model

Normalization
If you've been working with databases for a while, chances are you've heard the term normalization. Perhaps someone's asked you "Is that database normalized?" or "Is that in BCNF?" All too often, the reply is "Yeah." Normalization is often brushed aside as a luxury that only academics have time for. However, knowing the principles of normalization and applying them to your daily database design tasks really isn't all that complicated and it could drastically improve the performance of your DBMS.

Normalization
• However, there exists another scenario whereby a company decides to migrate from a file-based information system to a database system
• It is in this latter scenario that Normalization is most applicable.
• In this scenario it is possible to define relations directly (i.e.
without going through the intermediate ER modeling stage) by analyzing the report structures needed by the system

Normalization
• This analysis results in a set of relations; however, these relations are very often unsatisfactory from the viewpoint of database efficiency (as we shall see shortly)
• Normalization is the technique by which we decompose unsatisfactory relations into smaller relations, each of which is "better" in some sense than the original
• It is the step-by-step process of organizing data in a way that eliminates potential problems in updating and accessing data
• It is the process of efficiently organizing data in a database.

Normalized Student Database
Students (Idnumber, Surname, First name, Gender, Sponsor)
Dept (DeptID, Dept name)
DeptCourses (DeptID, CourseNumber, Course Title, Coordinator)
CourseTextbooks (ISBN, CourseNumber, Title, Author, Edition)
TextbookFees (FeeCode, ISBN, Fee, Fee Amount)
StudentCourses (Idnumber, CourseNumber, Fees ID)

Normalizations
• There are two goals of the normalization process: eliminate redundant data (for example, storing the same data in more than one table) and ensure data dependencies make sense (only storing related data in a table).
• Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.

Normalizations
The database community has developed a series of guidelines for ensuring that databases are normalized. These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through six (sixth normal form or 6NF). In practical applications, you'll often see 1NF, 2NF, and 3NF along with the occasional 4NF. Fifth and sixth normal forms are very rarely seen and won't be discussed in this course.

Normalizations
Before we begin our discussion of the normal forms, it's important to point out that they are guidelines and guidelines only.
Occasionally, it becomes necessary to deviate from them to meet practical business requirements. However, when deviations take place, it's extremely important to evaluate any possible effect they could have on your system and account for possible inconsistencies. That said, let's explore the normal forms.

Normal Form
• Relations can be in:
  - 1st Normal Form,
  - 2nd Normal Form,
  - 3rd Normal Form,
  - 4th Normal Form,
  - or 5th Normal Form,
  - or even 6th Normal Form (defined in 2002)

Normal Form
First normal form (1NF) sets the very basic rules for an organized database:
• Eliminate duplicative columns from the same table.
• A table is brought into 1NF by identifying every column or attribute that holds more than one value per row and removing that repetition.
• Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).

Normal Form
Second normal form (2NF) further addresses the concept of removing duplicative data:
• Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
• Create relationships between these new tables and their predecessors through the use of foreign keys.
• 2NF: A table is said to be in 2NF if it already satisfies 1NF and has no partial functional dependencies.

Normal Form
Third normal form (3NF) goes one large step further:
• Remove columns that are not dependent upon the primary key.
• A table is said to be in 3NF if it already satisfies 2NF and has no transitive dependencies.
Finally, fourth normal form (4NF) has one requirement:
• A relation is in 4NF if it already satisfies 3NF and has no multi-valued dependencies.

1st Normal Form
• Supplier(supp_no, pro_no, city, supp_status, (pro_no, qty))
• Supplier(supp_no, pro_no, city, supp_status, qty)
• Relation Supplier is in 1st Normal Form
• First normal form addresses the occurrence of repeating groups in an entity: an entity is said to be in 1NF if it contains no repeating group.
• K, A, B, C, D, E, F, R1, R2, R3
• K, A, B, C, D, E, F
• Even though Supplier can be represented in a relational database, it suffers from the following problems (anomalies)

1st Normal Form
• Problems with Relation Supplier
• Insert: We cannot insert a supplier until that supplier supplies at least one part
• Delete: If we delete the only tuple for a particular shipment for a supplier, we also lose supplier information in the process
• Modify: The city value for a given supplier appears many times, causing update problems
• The solution to these problems is to normalize further, into 2nd Normal Form

Progressing to the 1NF
1. Place all attributes which appear in a repeating group into a new table;
2. Duplicate in the new table the primary key of the table from which the repeating group was extracted;
3. Designate a primary key for each new table produced. This key will be the primary key of the original table concatenated with one or more attributes in the new table.

2nd Normal Form
• A relation is said to be in 2nd Normal Form if it has been placed in 1NF and if every data item in the entity is fully functionally dependent (FD) on the primary key of the entity.
• K1 + K2, A, B, C, D, E, F (F depends only on K2)
• K1 + K2, A, B, C, D, E
• A non-key attribute is any attribute that does not participate in the primary key

Progressing to the 2NF
1.
If a data item is functionally dependent on only a part of the primary key, move that data item and the part of the primary key on which it depends to a new table;
2. If other data items are functionally dependent on the same part of the key, place them in the new table;
3. Make the partial primary key copied from the original table the primary key of the new table;
4. Examine each new table to ensure that each data item is fully functionally dependent on the primary key, that is, that the table is in 2NF.

Functional Dependence
• Given a relation R, an attribute Y of R is functionally dependent on attribute X if and only if each X value is associated with only one Y value
• Y is fully functionally dependent on X if it is functionally dependent on X and not functionally dependent on any part of X

Functional Dependence
[Diagram: functional dependencies among the attributes status, supp_no, city, qty and pro_no]
The above diagram is called the Functional Dependency Diagram (FDD)

2nd Normal Form
• From the definition of 2nd NF we get:
  SECOND (supp_no, status, city)
  SP (supp_no, pro_no, qty)
• We need to verify that these 2 relations overcome all the problems associated with relation FIRST (the 1NF Supplier relation)
• However, the SECOND/SP structure still has problems associated with it

2nd Normal Form
• Problems
• Insert - We cannot express the fact that a particular city has a particular status value until at least one supplier is located in that city
• Delete - If we delete the only supplier for a particular city, then we not only lose that supplier's information but also lose the city information
• Update - The status value for a given city appears many times, causing the usual update problems
• Thus we need to normalize further, into 3NF

Anomalies
• The goal of relational schema design is to avoid anomalies and redundancy.
– Update anomaly: one occurrence of a fact is changed, but not all occurrences. Ideally a fact should need changing in only one place.
– Deletion anomaly: a valid fact is lost when a tuple is deleted, i.e. deleting a record might delete data that was not intended to be deleted
– Insertion anomaly: adding a new record to a relation would require entering another record in another relation

3rd Normal Form
• A relation R is in 3rd NF if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key
• An entity is said to be in 3NF if it has been placed in 2NF and if all fields which are not part of the primary key are mutually independent, that is, there are no transitive dependencies (where a field is dependent not only on the key but also on another field)
• K, A, B, C, D, E (E is also dependent on A)
• K, A, B, C, D

Progressing to 3NF
1. Identify all fields involved in transitive dependencies;
2. Move all items involved in transitive dependencies to a new entity;
3. Identify a primary key for the new entity;
4. Place the primary key of the new entity as a foreign key on the original entity. There should be no remaining transitive dependencies on the original entity.

Boyce-Codd Normal Form
• An entity is said to be in BCNF if it is in 3NF and every data item on which some other item is fully functionally dependent is a candidate key.
• 3rd NF does not deal satisfactorily with relations that have multiple candidate keys. A candidate key is any key that can act as a primary key.
• A relation is in BCNF if and only if every determinant is a candidate key. A determinant is any attribute on which some other attribute is fully functionally dependent.

BCNF Example
• Entity X (A, B, C, D, E, F, G, H)
• {A, B} and {B, C} are both candidate keys.
• Entity X (A + B, D, F, G)
• Entity Y (B + C, E, H)

BCNF Example
• Consider this relation SSP

sup_no  sname  pro_no  qty
S1      Smith  P1      300
S1      Smith  P2      200
S1      Smith  P3      400
S1      Smith  P4      200

• SSP has update problems even though it is in 3rd NF
• There are two candidate keys for SSP: (sup_no, pro_no) and (sname, pro_no)
• Here sname is a determinant, but not a candidate key
• Hence relation SSP is not in BCNF
• SSP can be decomposed into:
  SS (sup_no, sname)
  SP (sname, pro_no, qty)
• Both SS and SP are in BCNF
• It can be verified that the problems with SSP have now disappeared

Higher Normal Form
• Higher Normal Forms such as 4th NF and 5th NF do exist, but they are mainly of theoretical interest only
• This is because more than 99% of relations which are in 3rd NF or BCNF are also in 4th NF and 5th NF

Choosing Between ER Model and Normalisation
• ER modeling is appropriate when the application being developed is new
• In this case normalization should be applied once the ER model is transformed to the Relational Model
• On the other hand, the pure normalization approach is appropriate when the application being developed already exists
• E.g. a company wants to migrate its software from a file-based system to a database system. In this case the database could be designed by applying normalization techniques to the reports produced by the old (file-based) system.

Modeling Entities from Relations
• It is possible to construct an ER model from the relations produced by the normalization process
• The ER model is drawn in three steps
Step 1: For each relation, draw a rectangle to represent the equivalent entity type, using the relation name.
Step 2: For each pair of entity types, compare the primary keys. If the key of one is fully contained within the other, draw a relationship line between them.
Step 3: For each pair of entity types, if the key of one is a foreign key of another, draw a relationship line between the entity types.

Example
Customer (Cust#, Name, Address)
Order (Order#, Date, Total, Cust#)
Order Item (Order#, Item#, Quantity)
Stock Item (Item#, Name, Unit Price, Qty_on_hand)
Statement Item (Cust#, Date, Line#, PayType,
PayAmt)

Example of Bad Design
Drinkers(name, addr, beersLiked, manf, favBeer)

name  addr    beersLiked  manf  favBeer
Vili  Kalau   VB          A.B.  Ikale
Vili  ???     Royal       R.B.  ???
Tolu  Manima  VB          ???   VB

Data is redundant, because each of the ???'s can be figured out by using the FDs name -> addr favBeer and beersLiked -> manf.

This Bad Design Also Exhibits Anomalies

name  addr    beersLiked  manf  favBeer
Vili  Kalau   VB          A.B.  Ikale
Vili  Kalau   Royal       R.B.  Ikale
Tolu  Manima  VB          A.B.  VB

• Update anomaly: if Vili is transferred to Fanga, will we remember to change each of his tuples?
• Deletion anomaly: if nobody likes VB, we lose track of the fact that A.B. manufactures VB.
• Insertion anomaly: entering a record under beersLiked requires also entering a record under manf.

Normalization Benefits
• Make the data model more informative to users: normalized tables, and the relationship between one normalized table and another, mirror real-world concepts and their interrelationships.
• Avoid bias towards any particular pattern of querying: normalized tables are suitable for general-purpose querying. This means any queries against these tables, including future queries whose details cannot be anticipated, are supported. In contrast, tables that are not normalized lend themselves to some types of queries, but not others.

Normalization Benefits
Functional dependency: Attribute B has a functional dependency on attribute A (i.e., A → B) if, for each value of attribute A, there is exactly one value of attribute B. If a value of A repeats across tuples, the corresponding value of B repeats as well. In our example, Employee Address has a functional dependency on Employee ID, because a particular Employee ID value corresponds to one and only one Employee Address value. (Note that the reverse need not be true: several employees could live at the same address and therefore one Employee Address value could correspond to more than one Employee ID. Employee ID is therefore not functionally dependent on Employee Address.)
An attribute may be functionally dependent either on a single attribute or on a combination of attributes. It is not possible to determine the extent to which a design is normalized without understanding what functional dependencies apply to the attributes within its tables; understanding this, in turn, requires knowledge of the problem domain.
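
As a rough illustration of the definition of functional dependency, the sketch below tests whether a dependency is contradicted by a set of sample rows. It uses the Drinkers tuples from the bad-design example; the function name fd_holds and the representation of rows as dictionaries are assumptions made for this sketch, not part of the course material. Note that sample data can only refute a dependency: whether a dependency really holds is a fact about the problem domain, not about one particular table instance.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs.

    rows: list of dicts (one dict per tuple); lhs, rhs: lists of attribute names.
    Hypothetical helper for illustration only.
    """
    seen = {}  # maps each lhs value combination to the rhs values first seen with it
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y on first sight of x, otherwise returns the stored value
        if seen.setdefault(x, y) != y:
            return False  # same X value associated with two different Y values
    return True


# The completed Drinkers relation from the bad-design example.
drinkers = [
    {"name": "Vili", "addr": "Kalau", "beersLiked": "VB", "manf": "A.B.", "favBeer": "Ikale"},
    {"name": "Vili", "addr": "Kalau", "beersLiked": "Royal", "manf": "R.B.", "favBeer": "Ikale"},
    {"name": "Tolu", "addr": "Manima", "beersLiked": "VB", "manf": "A.B.", "favBeer": "VB"},
]

print(fd_holds(drinkers, ["name"], ["addr", "favBeer"]))  # True
print(fd_holds(drinkers, ["beersLiked"], ["manf"]))       # True
print(fd_holds(drinkers, ["name"], ["beersLiked"]))       # False: Vili likes two beers
```

The left-hand side is passed as a list so that the same check works for a dependency on a combination of attributes, for example fd_holds(rows, ["supp_no", "pro_no"], ["qty"]) for a shipment table.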