Normalization-Anomalies
Database Normalization is a technique for designing relational database tables to minimize duplication of information and, in so
doing, to safeguard the database against certain types of logical or structural problems, namely data anomalies. For example, when
multiple instances of a given piece of information occur in a table, the possibility exists that these instances will not be kept consistent
when the data within the table is updated, leading to a loss of data integrity. A table that is sufficiently normalized is less vulnerable to
problems of this kind, because its structure reflects the basic assumptions for when multiple instances of the same information should be
represented by a single instance only.
Higher degrees of normalization typically involve more tables and create the need for a larger number of joins, which can reduce
performance. Accordingly, more highly normalized tables are typically used in database applications involving many isolated
transactions (e.g. an Automated teller machine), while less normalized tables tend to be used in database applications that do not need
to map complex relationships between data entities and data attributes (e.g. a reporting application, or a full-text search application).
Database theory describes a table's degree of normalization in terms of normal forms of successively higher degrees of strictness. A
table in third normal form (3NF), for example, is consequently in second normal form (2NF) as well; but the reverse is not always the
case.
Although the normal forms are often defined informally in terms of the characteristics of tables, rigorous definitions of the normal forms
are concerned with the characteristics of mathematical constructs known as relations. Whenever information is represented relationally,
it is meaningful to consider the extent to which the representation is normalized.
Problems addressed by normalization
An update anomaly. Employee 519 is shown as having different addresses on different records.
An insertion anomaly. Until the new faculty member is assigned to teach at least one course, his details cannot be recorded.
A deletion anomaly. All information about Dr. Giddens is lost when he temporarily ceases to be assigned to any courses.
A table that is not sufficiently normalized can suffer from logical inconsistencies of various types, and from anomalies involving data
operations. In such a table:

• The same information can be expressed on multiple records; therefore updates to the table may result in logical inconsistencies. For example, each record in an "Employees' Skills" table might contain an Employee ID, Employee Address, and Skill; thus a change of address for a particular employee will potentially need to be applied to multiple records (one for each of his skills). If the update is not carried through successfully—if, that is, the employee's address is updated on some records but not others—then the table is left in an inconsistent state. Specifically, the table provides conflicting answers to the question of what this particular employee's address is. This phenomenon is known as an update anomaly.
• There are circumstances in which certain facts cannot be recorded at all. For example, each record in a "Faculty and Their Courses" table might contain a Faculty ID, Faculty Name, Faculty Hire Date, and Course Code—thus we can record the details of any faculty member who teaches at least one course, but we cannot record the details of a newly hired faculty member who has not yet been assigned to teach any courses. This phenomenon is known as an insertion anomaly.
• There are circumstances in which the deletion of data representing certain facts necessitates the deletion of data representing completely different facts. The "Faculty and Their Courses" table described in the previous example suffers from this type of anomaly, for if a faculty member temporarily ceases to be assigned to any courses, we must delete the last of the records on which that faculty member appears. This phenomenon is known as a deletion anomaly.
Ideally, a relational database table should be designed in such a way as to exclude the possibility of update, insertion, and deletion
anomalies. The normal forms of relational database theory provide guidelines for deciding whether a particular design will be
vulnerable to such anomalies. It is possible to correct an unnormalized design so as to make it adhere to the demands of the normal
forms: this is called normalization.
Normalization typically involves decomposing an unnormalized table into two or more tables that, were they to be combined (joined),
would convey exactly the same information as the original table.
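As an illustration of such a decomposition, the "Faculty and Their Courses" table described above could be split into one table per fact type. The following sketch is indicative only; the table and column names and types are assumed from the running example rather than taken from any actual schema.

-- Unnormalized: one row per (faculty, course) pairing, so faculty facts repeat.
-- FACULTY_AND_COURSES (FacultyId, FacultyName, FacultyHireDate, CourseCode)

CREATE TABLE FACULTY (
    FacultyId       INT PRIMARY KEY,
    FacultyName     VARCHAR(100) NOT NULL,
    FacultyHireDate DATE NOT NULL
);

CREATE TABLE FACULTY_COURSE (
    FacultyId  INT NOT NULL REFERENCES FACULTY (FacultyId),
    CourseCode VARCHAR(20) NOT NULL,
    PRIMARY KEY (FacultyId, CourseCode)
);

-- A newly hired faculty member can now be recorded with no course assignments
-- (no insertion anomaly), and removing a faculty member's last course assignment
-- no longer deletes that faculty member's details (no deletion anomaly).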
Background to Normalization: Definitions


Functional dependency: Attribute B has a functional dependency on attribute A i.e. A → B if, for each value of attribute A,
there is exactly one value of attribute B. In our example, Employee Address has a functional dependency on Employee ID,
because a particular Employee ID value corresponds to one and only one Employee Address value. (Note that the reverse
need not be true: several employees could live at the same address and therefore one Employee Address value could
correspond to more than one Employee ID. Employee ID is therefore not functionally dependent on Employee Address.) An
attribute may be functionally dependent either on a single attribute or on a combination of attributes. It is not possible to
determine the extent to which a design is normalized without understanding what functional dependencies apply to the
attributes within its tables; understanding this, in turn, requires knowledge of the problem domain. For example, an Employer
may require certain employees to split their time between two locations, such as New York City and London, and therefore
want to allow Employees to have more than one Employee Address. In this case, Employee Address would no longer be
functionally dependent on Employee ID.
Trivial functional dependency: A trivial functional dependency is a functional dependency of an attribute on a superset of
itself. {Employee ID, Employee Address} → {Employee Address} is trivial, as is {Employee Address} → {Employee Address}.
Full functional dependency: An attribute is fully functionally dependent on a set of attributes X if it is a) functionally
dependent on X, and b) not functionally dependent on any proper subset of X. {Employee Address} has a functional
dependency on {Employee ID, Skill}, but not a full functional dependency, for it is also dependent on {Employee ID}.
Transitive dependency: A transitive dependency is an indirect functional dependency, one in which X→Z only by virtue of
X→Y and Y→Z.
Multivalued dependency: A multivalued dependency is a constraint according to which the presence of certain rows in a table
implies the presence of certain other rows: see the Multivalued Dependency article for a rigorous definition.
Join dependency: A table T is subject to a join dependency if T can always be recreated by joining multiple tables each
having a subset of the attributes of T.
Superkey: A superkey is an attribute or set of attributes that uniquely identifies rows within a table; in other words, two distinct
rows are always guaranteed to have distinct superkeys. {Employee ID, Employee Address, Skill} would be a superkey for the
"Employees' Skills" table; {Employee ID, Skill} would also be a superkey.
Candidate key: A candidate key is a minimal superkey, that is, a superkey for which we can say that no proper subset of it is
also a superkey. {Employee ID, Skill} would be a candidate key for the "Employees' Skills" table.
Non-prime attribute: A non-prime attribute is an attribute that does not occur in any candidate key. Employee Address would
be a non-prime attribute in the "Employees' Skills" table.
Primary key: Most DBMSs require a table to be defined as having a single unique key, rather than a number of possible
unique keys. A primary key is a candidate key which the database designer has designated for this purpose.
* Functional dependency: a relationship between attributes in the same table, or a constraint between two sets of attributes in a relation from a database. It occurs when one attribute (or set of attributes) in a relation uniquely determines another attribute.
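To make the notion concrete, the functional dependency Employee ID → Employee Address from the example above can be checked against the stored data with a simple query. This is only a sketch; the table and column names (EMPLOYEE_SKILLS, EmployeeId, EmployeeAddress) are assumed for illustration.

-- Rows violating the assumed dependency EmployeeId -> EmployeeAddress:
-- any EmployeeId associated with more than one distinct address.
SELECT EmployeeId
FROM EMPLOYEE_SKILLS
GROUP BY EmployeeId
HAVING COUNT(DISTINCT EmployeeAddress) > 1;

If the query returns no rows, the dependency holds in the current data; whether it must hold in general is a question about the problem domain, not about the data currently stored.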
History
Edgar F. Codd first proposed the process of normalization and what came to be known as the 1st normal form:
There is, in fact, a very simple elimination[1] procedure which we shall call normalization. Through decomposition non-simple domains are
replaced by "domains whose elements are atomic (non-decomposable) values."
—Edgar F. Codd, A Relational Model of Data for Large Shared Data Banks [2]
In his paper, Edgar F. Codd used the term "non-simple" domains to describe a heterogeneous data structure, but later researchers
would refer to such a structure as an abstract data type. In his biography, Codd also cited as the inspiration for his work his eager assistant Tom Ward, who used to challenge him to rounds of database normalization, much like a chess match between master and apprentice. Tom Ward has often been quoted in industry magazines as saying that he has always enjoyed database normalization even more than sudoku.
Normal Forms
The normal forms (abbrev. NF) of relational database theory provide criteria for determining a table's degree of vulnerability to
logical inconsistencies and anomalies. The higher the normal form applicable to a table, the less vulnerable it is to such inconsistencies
and anomalies. Each table has a "highest normal form" (HNF): by definition, a table always meets the requirements of its HNF and of
all normal forms lower than its HNF; also by definition, a table fails to meet the requirements of any normal form higher than its HNF.
The normal forms are applicable to individual tables; to say that an entire database is in normal form n is to say that all of its tables
are in normal form n.
Newcomers to database design sometimes suppose that normalization proceeds in an iterative fashion, i.e. a 1NF design is first
normalized to 2NF, then to 3NF, and so on. This is not an accurate description of how normalization typically works. A sensibly designed
table is likely to be in 3NF on the first attempt; furthermore, if it is 3NF, it is overwhelmingly likely to have an HNF of 5NF. Achieving
the "higher" normal forms (above 3NF) does not usually require an extra expenditure of effort on the part of the designer, be cause
3NF tables usually need no modification to meet the requirements of these higher normal forms.
Edgar F. Codd originally defined the first three normal forms (1NF, 2NF, and 3NF). These normal forms have been summarized as
requiring that all non-key attributes be dependent on "the key, the whole key and nothing but the key". The fourth and fifth normal
forms (4NF and 5NF) deal specifically with the representation of many-to-many and one-to-many relationships among attributes. Sixth
normal form (6NF) incorporates considerations relevant to temporal databases.
First Normal Form
Main article: First normal form
A table is in first normal form (1NF) if and only if it faithfully represents a relation.[3] Given that database tables embody a relation-like form, the defining characteristic of one in first normal form is that it does not allow duplicate rows or nulls. Simply put, a table with a unique key (which, by definition, prevents duplicate rows) and without any nullable columns is in 1NF.
Note that the restriction on nullable columns as a 1NF requirement, as espoused by Chris Date et al., is controversial. This particular requirement for 1NF directly contradicts Dr. Codd's vision of the relational database, in which he stated that "null values" must be supported in a fully relational DBMS in order to represent "missing information and inapplicable information in a systematic way, independent of data type."[4] By redefining 1NF to exclude nullable columns, no level of normalization can ever be achieved unless all nullable columns are completely eliminated from the entire database. This is in line with Date's and Darwen's vision of the perfect relational database, but can introduce additional complexities in SQL databases to the point of impracticality.[5]
One requirement of a relation is that every table contain exactly one value for each attribute. This is sometimes expressed as "no
repeating groups"[6]. While that statement itself is axiomatic, experts disagree about what qualifies as a "repeating group", in
particular whether a value may be a relation value; thus the precise definition of 1NF is the subject of some controversy.
Notwithstanding, this theoretical uncertainty applies to relations, not tables. Table manifestations are intrinsically free of variable
repeating groups because they are structurally constrained to the same number of columns in all rows.
See the first normal form article for a fuller discussion of the nuances of 1NF.
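A minimal sketch of a table meeting the 1NF criteria just described (a unique key, no nullable columns); the names and types are illustrative only.

CREATE TABLE EMPLOYEE_SKILLS (
    EmployeeId      INT          NOT NULL,
    EmployeeAddress VARCHAR(200) NOT NULL,
    Skill           VARCHAR(50)  NOT NULL,
    PRIMARY KEY (EmployeeId, Skill)   -- unique key: no duplicate rows
);
-- Every column is declared NOT NULL, so, on the stricter (Date) reading of 1NF,
-- the table contains no nulls and each row is a complete tuple.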
Second Normal Form
Main article: Second normal form
The criteria for second normal form (2NF) are:
• The table must be in 1NF.
• None of the non-prime attributes of the table are functionally dependent on a part (proper subset) of a candidate key; in other words, all functional dependencies of non-prime attributes on candidate keys are full functional dependencies.[7] For example, in an "Employees' Skills" table whose attributes are Employee ID, Employee Address, and Skill, the combination of Employee ID and Skill uniquely identifies records within the table. Given that Employee Address depends on only one of those attributes – namely, Employee ID – the table is not in 2NF (a decomposition that brings the example into 2NF is sketched after this list).
• Note that if none of a 1NF table's candidate keys are composite – i.e. every candidate key consists of just one attribute – then we can say immediately that the table is in 2NF.
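A hedged sketch of the decomposition that brings the "Employees' Skills" example into 2NF; table and column names are assumed for illustration (EMPLOYEE_SKILLS is redefined here without the address column).

-- Not in 2NF: EmployeeAddress depends on EmployeeId alone,
-- a proper subset of the candidate key {EmployeeId, Skill}.
-- EMPLOYEES_SKILLS (EmployeeId, EmployeeAddress, Skill)

CREATE TABLE EMPLOYEES (
    EmployeeId      INT PRIMARY KEY,
    EmployeeAddress VARCHAR(200) NOT NULL
);

CREATE TABLE EMPLOYEE_SKILLS (
    EmployeeId INT NOT NULL REFERENCES EMPLOYEES (EmployeeId),
    Skill      VARCHAR(50) NOT NULL,
    PRIMARY KEY (EmployeeId, Skill)
);
-- An address change now touches exactly one row in EMPLOYEES,
-- eliminating the update anomaly described earlier.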

Third Normal Form
Main article: Third normal form
The criteria for third normal form (3NF) are:
• The table must be in 2NF.
• Every non-prime attribute of the table must be non-transitively dependent on every candidate key.[7] A violation of 3NF would mean that at least one non-prime attribute is only indirectly dependent (transitively dependent) on a candidate key. For example, consider a "Departments" table whose attributes are Department ID, Department Name, Manager ID, and Manager Hire Date; and suppose that each manager can manage one or more departments. {Department ID} is a candidate key. Although Manager Hire Date is functionally dependent on the candidate key {Department ID}, this is only because Manager Hire Date depends on Manager ID, which in turn depends on Department ID. This transitive dependency means the table is not in 3NF (a decomposition that removes it is sketched below).
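A sketch of the corresponding 3NF decomposition for the "Departments" example; the DDL below is illustrative, with assumed names and types.

CREATE TABLE MANAGERS (
    ManagerId       INT PRIMARY KEY,
    ManagerHireDate DATE NOT NULL
);

CREATE TABLE DEPARTMENTS (
    DepartmentId   INT PRIMARY KEY,
    DepartmentName VARCHAR(100) NOT NULL,
    ManagerId      INT NOT NULL REFERENCES MANAGERS (ManagerId)
);
-- ManagerHireDate now depends directly on the key of MANAGERS;
-- no non-prime attribute of DEPARTMENTS depends transitively on its key.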
Boyce-Codd Normal Form
Main article: Boyce-Codd normal form
A table is in Boyce-Codd normal form (BCNF) if and only if, for every one of its non-trivial functional dependencies X → Y, X is a
superkey—that is, X is either a candidate key or a superset thereof.
Fourth Normal Form
Main article: Fourth normal form
A table is in fourth normal form (4NF) if and only if, for every one of its non-trivial multivalued dependencies X →→ Y, X is a
superkey—that is, X is either a candidate key or a superset thereof.
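As a hypothetical illustration (not taken from the text above), suppose a table records, for each employee, both the skills they have and the languages they speak, where skills and languages are independent of one another. The multivalued dependencies EmployeeId →→ Skill and EmployeeId →→ Language would then force a 4NF decomposition such as the following sketch.

-- Violates 4NF: every (skill, language) combination must be stored,
-- so adding one language requires adding a row per existing skill.
-- EMPLOYEE_SKILL_LANGUAGE (EmployeeId, Skill, Language)

CREATE TABLE EMPLOYEE_SKILL (
    EmployeeId INT NOT NULL,
    Skill      VARCHAR(50) NOT NULL,
    PRIMARY KEY (EmployeeId, Skill)
);

CREATE TABLE EMPLOYEE_LANGUAGE (
    EmployeeId INT NOT NULL,
    Language   VARCHAR(50) NOT NULL,
    PRIMARY KEY (EmployeeId, Language)
);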
Fifth Normal Form
Main article: Fifth normal form
The criteria for fifth normal form (5NF and also PJ/NF) are:
• The table must be in 4NF.
• There must be no non-trivial join dependencies that do not follow from the key constraints. A 4NF table is said to be in 5NF if and only if every join dependency in it is implied by the candidate keys.
Domain/key Normal Form
Main article: Domain/key normal form
Domain/key normal form (or DKNF) requires that a table not be subject to any constraints other than domain constraints and key
constraints.
Sixth Normal Form
A table is in sixth normal form (6NF) if and only if it satisfies no non-trivial join dependencies at all.[10] This obviously means that the
fifth normal form is also satisfied. The sixth normal form was only defined when extending the relational model to take into account the
temporal dimension. Unfortunately, most current SQL technologies as of 2005 do not take into account this work, and most temporal
extensions to SQL are not relational. See work by Date, Darwen and Lorentzos[11] for a relational temporal extension, Zimyani[12] for
further discussion on Temporal Aggregation in SQL, or TSQL2 for a non-relational approach.
Denormalization
Main article: Denormalization
Databases intended for Online Transaction Processing (OLTP) are typically more normalized than databases intended for Online
Analytical Processing (OLAP). OLTP Applications are characterized by a high volume of small transactions such as updating a sales
record at a supermarket checkout counter. The expectation is that each transaction will leave the database in a consistent state. By
contrast, databases intended for OLAP operations are primarily "read mostly" databases. OLAP applications tend to extract historical
data that has accumulated over a long period of time. For such databases, redundant or "denormalized" data may facilitate Business
Intelligence applications. Specifically, dimensional tables in a star schema often contain denormalized data. The denormalized or
redundant data must be carefully controlled during ETL processing, and users should not be permitted to see the data until it is in a
consistent state. The normalized alternative to the star schema is the snowflake schema. It has never been proven whether this denormalization itself provides any increase in performance, or whether the concurrent removal of data constraints is what increases the performance. The need for denormalization has waned as computers and RDBMS software have become more powerful.
Denormalization is also used to improve performance on smaller computers as in computerized cash-registers and mobile devices, since
these may use the data for look-up only (e.g. price lookups). Denormalization may also be used when no RDBMS exists for a platform
(such as Palm), or no changes are to be made to the data and a swift response is crucial.
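As a hedged sketch of the star-schema style of denormalization mentioned above, a dimensional table may repeat attributes that a snowflake schema would factor out; all names below are hypothetical.

-- Snowflake (more normalized): product -> category is a separate table.
-- DIM_PRODUCT (ProductKey, ProductName, CategoryKey)
-- DIM_CATEGORY (CategoryKey, CategoryName)

-- Star (denormalized dimension): category attributes are repeated per product.
CREATE TABLE DIM_PRODUCT (
    ProductKey   INT PRIMARY KEY,
    ProductName  VARCHAR(100) NOT NULL,
    CategoryName VARCHAR(100) NOT NULL   -- duplicated for every product in the category
);

CREATE TABLE FACT_SALES (
    ProductKey INT NOT NULL REFERENCES DIM_PRODUCT (ProductKey),
    SaleDate   DATE NOT NULL,
    Amount     DECIMAL(10,2) NOT NULL
);
-- Queries such as "sales by category" need one join instead of two,
-- at the cost of redundant CategoryName values that must be controlled during ETL.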
Non-First Normal Form (NF²)
In recognition that denormalization can be deliberate and useful, the non-first normal form is a definition of database designs which do not conform to the first normal form, by allowing "sets and sets of sets to be attribute domains" (Schek 1982). This extension is a (non-optimal) way of implementing hierarchies in relations. Some theoreticians have dubbed this practitioner-developed method "First Abnormal Form"; Codd defined a relational database as using relations, so any table not in 1NF could not be considered to be relational.
Consider the following table:

Non-First Normal Form
Person   Favorite Colors
Bob      blue, red
Jane     green, yellow, red

Assume a person has several favorite colors. Obviously, favorite colors consist of a set of colors modeled by the given table.
To transform this NF² table into 1NF, an "unnest" operator is required, which extends the relational algebra of the higher normal forms. The reverse operator is called "nest"; "nest" is not always the mathematical inverse of "unnest", although "unnest" is the mathematical inverse of "nest". A further constraint is that the operators be bijective, which is covered by the Partitioned Normal Form (PNF). A sketch of the unnested (1NF) form follows.
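A minimal sketch of the 1NF ("unnested") representation of the favorite-colors table; the table and column names are illustrative.

CREATE TABLE PERSON_FAVORITE_COLOR (
    Person VARCHAR(50) NOT NULL,
    Color  VARCHAR(30) NOT NULL,
    PRIMARY KEY (Person, Color)
);

INSERT INTO PERSON_FAVORITE_COLOR (Person, Color) VALUES
    ('Bob',  'blue'),
    ('Bob',  'red'),
    ('Jane', 'green'),
    ('Jane', 'yellow'),
    ('Jane', 'red');
-- Each row now states one atomic fact; the set-valued "Favorite Colors"
-- attribute has been unnested into one row per color.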
Database Anomalies. Database anomalies are problems in relations that occur due to redundancy in the relations. These anomalies affect the process of inserting, deleting and modifying data in the relations, and some important data may be lost if a relation that contains database anomalies is updated. It is important to remove these anomalies in order to perform different processing on the relations without any problem. The different types of database anomalies are as follows:
• Insertion anomaly. The insertion anomaly occurs when a new record is inserted in the relation. In this anomaly, the user cannot insert a fact about an entity until he has an additional fact about another entity.
• Deletion anomaly. The deletion anomaly occurs when a record is deleted from the relation. In this anomaly, the deletion of facts about one entity automatically deletes facts about another entity.
• Modification anomaly. The modification anomaly occurs when a record is updated in the relation. In this anomaly, a modification to the value of a specific attribute requires modification in all records in which that value occurs.
BENEFITS OF DENORMALIZED RELATIONAL DATABASE TABLES
ABSTRACT
Heuristics for denormalizing relational database tables are examined with an objective of improving processing performance for data
insertions, deletions and selection. Client-server applications necessitate consideration of denormalized database schemas as a means
of achieving good system performance where the client-server interface is graphical (GUI) and the network capacity is limited by the
network channel.
INTRODUCTION
Relational database table design efforts encompass both the conceptual and physical modeling levels of the three-schema
architecture. Conceptual diagrams, either entity-relationship or object-oriented, are a precursor to designing relational table structures.
CASE tools will generate relational database tables at least to the third normal form (3NF) based on conceptual models, but have not
advanced to the point that they produce table structures that guarantee acceptable levels of system processing performance.
As firms move away from the mainframe toward cheaper client-server platforms, managers face a different set of issues in the
development of information systems. One critical issue is the need to continue to provide a satisfactory level of system performance,
usually reflected by system response time, for mission-critical, on-line, transaction processing systems.
A fully normalized database schema can fail to provide adequate system response time due to excessive table join operations. It is
difficult to find formal guidance in the literature that outlines approaches to denormalizing a database schema, also termed usage
analysis. This paper focuses on identifying heuristics or rules of thumb that designers may use when designing a relational schema.
Denormalization must balance the need for good system response time with the need to maintain data, while avoiding the various
anomalies or problems associated with denormalized table structures. Denormalization goes hand-in-hand with the detailed analysis of
critical transactions through view analysis. View analysis must include the specification of primary and secondary access paths for
tables that comprise end-user views of the database. Additionally, denormalization should only be attempted under conditions that
allow designers to collect detailed performance measurements for comparison with system performance requirements [1]. Further,
designers should only denormalize during the physical design phase and never during conceptual modeling.
Relational database theory provides guidelines for achieving an idealized representation of data and data relationships. Conversely,
client-server systems require physical database design measures that optimize performance for specific target systems under less than
ideal conditions [1]. The final database schema must be adjusted for characteristics of the environment such as hardware, software,
and other constraints.
We recommend following three general guidelines to denormalization [1]. First, perform a detailed view analysis in order to identify
situations where an excessive number of table joins appears to be required to produce a specific end-user view. While no hard rule
exists for defining "excessive," any view requiring more than three joins should be considered as a candidate for denormalization.
Beyond this, system performance testing by simulating the production environment is necessary to prove the need for denormalization.
Second, the designer should attempt to reduce the number of foreign keys in order to reduce index maintenance during insertions and
deletions. Reducing foreign keys is closely related to reducing the number of relational tables.
Third, the ease of data maintenance provided by normalized table structures must also be provided by the denormalized schema.
Thus, a satisfactory approach would not require excessive programming code (triggers) to maintain data integrity and consistency.
DENORMALIZING FIRST NORMAL FORM (1NF) TO UNNORMALIZED TABLES
There are more published formal guidelines for denormalizing 1NF than for any other normal form. This stems from the fact that
repeating fields occur fairly often in business information systems; therefore, designers and relational database theoreticians have been
forced to come to grips with the issue. It is helpful to examine an example problem for those new to the concept of normalization and
denormalization. Consider a situation where a customer has several telephone numbers that must be tracked. The third normal form
(3NF) solution and the denormalized table structure is given below with telephone number (Phone1, Phone2, Phone3, …) as a repeating
field:
3NF:
CUSTOMER (CustomerId, CustomerName,...)
CUST_PHONE (CustomerId, Phone)
Denormalized:
CUSTOMER (CustomerId, CustomerName, Phone1, Phone2, Phone3, ...)
Which approach is superior? The answer is that it depends on the processing and coding steps needed to store and retrieve the data in
these tables. In the denormalized solution, code must be written (or a screen designed) to enable any new telephone number to be
stored in any of the three Phone fields. Clearly this is not a difficult task and can be easily accomplished with modern CASE tools; still,
this denormalized solution is usually only appropriate if one can guarantee that a customer will have a limited, finite number of
telephone numbers, or if the firm makes a management decision not to store more than "X" telephone numbers.
Both solutions provide good data retrieval of telephone numbers as indicated by the following SQL statements. The denormalized
solution requires a simpler, smaller index structure for the CustomerId key field. The normalized solution would require at least two
indices for the Cust_Phone table - one on the composite key to ensure uniqueness integrity, and one on the CustomerId field to provide a
fast primary access path to records.
Select * from Cust_Phone where CustomerId = '3344';
Select * from Customer where CustomerId = '3344';
If the customer name is also required in the query, then the denormalized solution is superior as no table joins are involved.
Select * from Cust_Phone CP, Customer C where CP.CustomerId = '3344' and CP.CustomerId = C.CustomerId;
The effort required to extract telephone numbers for a specific customer based on the customer name is also more difficult for the 3NF
solution, as a table join is required.
Select * from Cust_Phone CP, Customer C where C.CustomerName = 'Tom Thumb' and CP.CustomerId = C.CustomerId;
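For comparison, the corresponding retrievals against the denormalized CUSTOMER structure shown earlier require no join at all; these statements are a sketch based on that denormalized table, with the column list assumed.

-- Denormalized equivalents of the two join queries above (no joins needed):
Select CustomerId, CustomerName, Phone1, Phone2, Phone3
from Customer
where CustomerId = '3344';

Select CustomerId, CustomerName, Phone1, Phone2, Phone3
from Customer
where CustomerName = 'Tom Thumb';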
If the telephone number fields represent different types of telephones, for example, a voice line, a dedicated fax line, and a
dedicated modem line, then the appropriate table structures are:
3NF:
CUSTOMER (CustomerId, CustomerName, ... )
CUST_PHONE (CustomerId, Phone)
PHONE (Phone, PhoneType)
Denormalized:
CUSTOMER (CustomerId, CustomerName, VoicePhone, FaxPhone, ModemPhone, ...)
Again the denormalized solution is simpler for data storage and retrieval and, as before, the only factor favoring a 3NF solution is the
number of potential telephone numbers that an individual customer may have. The Phone table would be at least 30 to 50 percent the
size of the Cust_Phone table. The decision to denormalize is most crucial when the Customer table is large in a client-server
environment; for example, one hundred thousand customers, each having two or three telephone numbers. The join of Customer,
Cust_phone,
and
Phone
may
be
prohibitive
in
terms
of
processing
efficiency.
DENORMALIZING SECOND NORMAL FORM (2NF) TO 1NF
The well-known order entry modeling problem involving Customers, Orders, and Items provides a realistic situation where
denormalization is possible without significant data maintenance anomalies. Consider the following 3NF table structures.
3NF:
CUSTOMER (CustomerId, CustomerName,...)
ORDER (OrderId, OrderDate, DeliveryDate, Amount, CustomerId)
ORDERLINE (OrderId, ItemId, QtyOrdered, OrderPrice)
ITEM (ItemId, ItemDescription, CurrentPrice)
The many-to-many relationship between the Order and Item entities represents a business transaction that occurs repeatedly over time.
Further, such transactions, once complete, are essentially "written in stone" since the transaction records would never be deleted. Even if
an order is partially or fully deleted at the request of the customer, other tables not shown above will be used to record the deletion of
an item or an order as a separate business transaction for archive purposes.
The OrderPrice field in the Orderline table represents the storage of data that is time-sensitive. The OrderPrice is stored in the Orderline
table because this monetary value may differ from the value in the CurrentPrice field of the Item table since prices may change over
time. While the OrderPrice field is functionally determined by the combination of ItemId and OrderDate, storing the OrderDate field in
the Orderline table is not necessary, since the OrderPrice is recorded at the time that the transaction takes place. Therefore, while
Orderline is not technically in 3NF, most designers would consider the above solution 3NF for all practical purposes. A true 3NF
alternative solution would store price data in an ItemPriceHistory table, but such a solution is not central to the discussion of
denormalization.
In the proposed denormalized 1NF solution shown below (the Customer and Order tables remain unchanged) the Item.ItemDescription
field is duplicated in the Orderline table. This solution violates second normal form (2NF) since the ItemDescription field is fully
determined by ItemId and is not determined by the full key (OrderId + ItemId). Again, this denormalized solution must be evaluated for
the potential effect on data storage and retrieval.
Denormalized 1NF:
ORDERLINE (OrderId, ItemId, QtyOrdered, OrderPrice, ItemDescription)
ITEM (ItemId, ItemDescription, CurrentPrice)
Orderline records for a given order are added to the appropriate tables at the time that an order is made. Since the OrderPrice for a
given item is retrieved from the CurrentPrice field of the Item table at the time that the sale takes place, the additional processing
required to retrieve and store the ItemDescription value in the Orderline table is negligible.
The additional expense of storing Orderline.ItemDescription must be weighed against the processing required to produce various end-user views of these data tables. Consider the production of a printed invoice. The 3NF solution requires joining four tables to produce
an invoice view of the data. The denormalized 1NF solution requires only the Customer, Order, and Orderline tables. Clearly, the
Order table will be accessed by an indexed search based on the OrderId field. Similarly, the retrieval of records from the Orderline
and Item tables may also be via indexes; still, the savings in processing time may offset the cost of extra storage.
An additional issue concerns the storage of consistent data values for the ItemDescription field in the Orderline and Item tables.
Suppose, for example, an item description of a record in the Item table is changed from "Table" to "Table, Mahogany." Is there a
need to also update corresponding records in the denormalized Orderline table? Clearly an argument can be made that these kinds of
data maintenance transactions are unnecessary since the maintenance of the non-key ItemDescription data field in the Orderline table is
not critical to processing the order.
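A sketch of the invoice retrieval under each design, using the table structures given above; the column list and the OrderId value are assumed for illustration, and the Order table is quoted because ORDER is a reserved word in SQL.

-- 3NF: four tables must be joined to produce an invoice line.
Select C.CustomerName, O.OrderId, O.OrderDate, I.ItemDescription, OL.QtyOrdered, OL.OrderPrice
from Customer C, "Order" O, Orderline OL, Item I
where O.OrderId = '9001'
  and O.CustomerId = C.CustomerId
  and OL.OrderId = O.OrderId
  and I.ItemId = OL.ItemId;

-- Denormalized 1NF: ItemDescription is carried in Orderline, so Item is not needed.
Select C.CustomerName, O.OrderId, O.OrderDate, OL.ItemDescription, OL.QtyOrdered, OL.OrderPrice
from Customer C, "Order" O, Orderline OL
where O.OrderId = '9001'
  and O.CustomerId = C.CustomerId
  and OL.OrderId = O.OrderId;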
DENORMALIZING THIRD NORMAL FORM (3NF) TO 2NF
An example for denormalizing from a 3NF to a 2NF solution can be found by extending the above example to include data for
salespersons. The relationship between the Salesperson and Order entities is one-to-many (many orders can be processed by a single
salesperson, but an order is normally associated with one and only one salesperson). The 3NF solution for the Salesperson and Order
tables is given below, along with a denormalized 2NF solution.
3NF:
SALESPERSON (SalespersonId, SalespersonName,...)
ORDER (OrderId, OrderDate, DeliveryDate, Amount, SalespersonId)
Denormalized 2NF:
SALESPERSON (SalespersonId, SalespersonName,...)
ORDER (OrderId, OrderDate, DeliveryDate, Amount, SalespersonId, SalespersonName)
Note that the SalespersonId in the Order table is a foreign key linking the Order and Salesperson tables. Denormalizing the table
structures by duplicating the SalespersonName in the Order table results in a solution that is 2NF, because the non-key SalespersonName
field is determined by the non-key SalespersonId field. What is the effect of this denormalized solution?
By using view analysis for a typical printed invoice or order form, we may discover that most end-user views require printing of a
salesperson's name, not their identification number, on order invoices and order forms. Thus, the 3NF solution requires joining the
Salesperson and Order tables, in addition to the Customer and Orderline tables from the denormalized 1NF solution given in the
preceding section, in order to meet processing requirements.
As before, a comparison of the 3NF solution and the denormalized 2NF solution reveals that the salesperson name could easily be
recorded in the denormalized Order table at the time that the order transaction takes place. Once a sales transaction takes place, the
salesperson credited with making the sale is very unlikely to change.
One should also question the need to maintain the consistency of data between the Order and Salesperson tables. In this situation, we
find that the requirement to support name changes for salespeople is very small, and arises, for the most part, only when a salesperson
changes names due to marriage. Furthermore, the necessity to update the Order table in such a situation is a decision for management
to make. An entirely conceivable notion is that such data maintenance activities may be ignored, since the important issue for
salespersons usually revolves around whether or not they get paid their commission, and the denormalized 2NF solution supports
payroll activities as well as the production of standard end-user views of the database.
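A sketch of the invoice-header retrieval under each design shown above; no join to Salesperson is needed in the denormalized 2NF case. The OrderId value is hypothetical, and the Order table is quoted because ORDER is a reserved word in SQL.

-- 3NF: the salesperson's name requires a join.
Select O.OrderId, O.OrderDate, O.Amount, S.SalespersonName
from "Order" O, Salesperson S
where O.OrderId = '9001'
  and O.SalespersonId = S.SalespersonId;

-- Denormalized 2NF: the name is carried on the order row itself.
Select OrderId, OrderDate, Amount, SalespersonName
from "Order"
where OrderId = '9001';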
DENORMALIZING HIGHER NORMAL FORMS
The concept of denormalizing also applies to the higher order normal forms (fourth normal form - 4NF or fifth normal form - 5NF), but
occurrences of the application of denormalization in these situations are rare. Recall that denormalization is used to improve processing
efficiency, but should not be used where there is the risk of incurring excessive data maintenance problems. Denormalizing from 4NF to
a lower normal form would almost always lead to excessive data maintenance problems. By definition, we normalize to 4NF to avoid
the problem of having to add multiple records to a single table as a result of a single transaction. Data anomalies associated with 4NF
violations only tend to arise when sets of binary relationships between three entities have been incorrectly modeled as a ternary
relationship. The resulting 4NF solution, when modeled in the form of an E-R diagram usually results in two binary one-to-many
relationships. If denormalization offers the promise of improving performance among the entities that are paired in these binary
relationships, then the guidance given earlier under each of the individual 1NF, 2NF, and 3NF sections applies; thus, denormalization
with 4NF would not require new heuristics.
While denormalization may also be used in 5NF modeling situations, the tables that result from the application of 5NF principles are
rarely candidates for denormalization. This is because the number of tables required for data storage has already been minimized.
In essence, the 5NF modeling problem is the mirror image of the 4NF problem. A 5NF anomaly only arises when a database designer
has modeled what should be a ternary relationship as a set of two or more binary relationships.
SUMMARY
This article has described situations where denormalization can lead to improved processing efficiency. The objective is to improve
system response time without incurring a prohibitive amount of additional data maintenance requirements. This is especially important
for client-server systems. Denormalization requires thorough system testing to prove the effect that denormalized table structures have
on processing efficiency. Furthermore, unseen ad hoc data queries may be adversely affected by denormalized table structures.
Denormalization must be accomplished in conjunction with a detailed analysis of the tables required to support various end-user views
of the database. This analysis must include the identification of primary and secondary access paths to data.
Additional consideration may be given to table partitioning that goes beyond the issues that surround table normalization. Horizontal
table partitioning may improve performance by minimizing the number of rows involved in table joins. Vertical partitioning may
improve performance by minimizing the size of rows involved in table joins. A detailed discussion of table partitioning may be found
elsewhere [1].
I work at a company that is doing the following:

* Taking a product line with some minor attribute difference between products (for example, they are identical except for the color of the product)
* Creating a record for the product, and filling in the first color.
* Copying it, filling in the second color.
* Repeating until all colors have been entered.
This creates an amazing amount of duplication, as you can imagine. I grabbed a sample set of data, and 95% of it was
redundant.
When a coworker and I undertook the task of making the data relational, we hit some serious organizational resistance. It
turns out that the decision to denormalize was on purpose (!). The justification was that it's easier to query ("Get me all the
products that are red").
Am I missing something? It seems to me that if a denormalized view is useful, it should be generated from an
AuthoritativeSource that's normalized.
I think I'm going to be in the midst of a data management nightmare in a few months.
Agreed. Denormalization should be a last resort if nothing else can be done to speed things up. It's a basic relational
database design principle that data should be stored OnceAndOnlyOnce.
-- DavidPeterson
Or perhaps these are different products, that just happen to be similar in attributes other than colour. Where is the
denormalisation then?
By normalising, you are saying that conceptually, these colours are kinds of the same product, or subclasses of a
product. But perhaps these concepts are not the most useful.
(More likely you are dead right, but I thought it was worth saying.)
I've seen this done because they didn't want to add an index. Pretty funny until my last contract where they bought an E64K (a
mega-buck Cray equivalent unix server) with gigabytes of memory rather than add an index. Indexes must be heap scary to
some people.
In DataWarehouse building, denormalization seems to be an acceptable technique. For instance, you may have summary
tables that contain information that you could get with a query, but not so quickly. The StarSchema and other techniques
seem to flout NormalForm regularly. (Or am I wrong? We're implementing a DataWarehouse soon, so I've been reading
up, but maybe those with more experience could correct me?) --PhilGroce
If your data isn't changing then DenormalizationIsOk, but if your data is changing then you should be very wary about
denormalizing - it's so easy for duplicated data to become out of sync. Writing insert, update and delete triggers to
update multiple tables is not straightforward and needs careful planning and testing.
It may be an issue of short-term versus long-term. Normalization is the better long-term bet because it allows new requirements
to be added with the minimal fuss. If you optimize the schema for a given current (short-term) usage, you may indeed speed up
that one operation, but open yourself to risk down the road. I have seen a lot of companies with messed-up schemas and it costs
them. Revamping it would be a huge investment because it would impact a lot of applications, so they live with bad schemas.
FutureDiscounting perhaps was overdone. I think it may be better to find a compromise, such as periodic batch jobs that create
read-only tables for special, high-performance needs. That way the live production tables can remain clean. --top
You can refactor things by keeping a normalized database with new tables and building a view with the name of the old
denormalized table, as in the sketch below.
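A minimal sketch of that refactoring, with hypothetical table and column names: the data lives in normalized tables, and a view preserves the old denormalized name for existing queries.

-- New, normalized tables
CREATE TABLE PRODUCT (
    ProductId   INT PRIMARY KEY,
    ProductName VARCHAR(100) NOT NULL
);

CREATE TABLE PRODUCT_COLOR (
    ProductId INT NOT NULL REFERENCES PRODUCT (ProductId),
    Color     VARCHAR(30) NOT NULL,
    PRIMARY KEY (ProductId, Color)
);

-- Compatibility view carrying the name of the old denormalized table
CREATE VIEW PRODUCTS_DENORMALIZED AS
SELECT p.ProductId, p.ProductName, pc.Color
FROM PRODUCT p
JOIN PRODUCT_COLOR pc ON pc.ProductId = p.ProductId;

-- "Get me all the products that are red" is still a single-table query:
-- SELECT * FROM PRODUCTS_DENORMALIZED WHERE Color = 'red';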
It is generally accepted that a DatabaseIsRepresenterOfFacts. This has implications on how we interpret a normalized
schema vs. a denormalized schema.
A fully normalized schema is when we cannot state a fact more than once. A denormalized schema on the contrary allows
us to state a fact more than once, with the following consequences:

* If we say a proposition twice it doesn't make it twice as true.
* If we state two different propositions about essentially the same fact they might not be consistent.
* When the fact has changed in the underlying reality, we might fail to update the corresponding database propositions in all the places where that fact is stated.
I agree with the thrust of what you are saying, but I didn't think that this was the usual definition of normalized.
Normalisation (1NF, 2NF, 3NF, BCNF, 4NF, 5NF; also 6NF if you use DateAndDarwensTemporalDatabaseApproach?) avoids
_particular_ kinds of redundancy and update anomaly, not all kinds -- it doesn't preclude "roll up" running totals in a separate
table, for example, because it deals with relations individually. Also, a schema may be "unnormalised" but still prevent a fact
from being stated more than once using additional constraints (and such a schema may be preferable to a normalised one, even
as a logical schema, because normalisation beyond 3NF (IIRC) may make certain constraints harder to write). I think we need a
different term for a schema (relations+constraints) in which it is impossible to be inconsistent. "Consistent" itself is rather
overused, perhaps, so maybe "consistency-constrained"?
Therefore a denormalized schema is never justified. A [de- surely?] normalized schema is very often justified by the claim that it
helps performance, but this is hardly a good justification.
{If the boss/owner wants a faster system, you either give it to them or lose your job to somebody who can. Purists are not
treated very well in the work world, for good or bad. Denormalization for performance takes advantage of the fact that
any given cell generally has a read-to-write ratio of roughly 10:1 or more in practice. As long as this lopsided relationship
exists, it appears that denormalization, or at least "join-and-copy", is an economical way to gain performance. It is roughly
analogous to caching on a larger scale: if you read the same item over and over, it makes sense to keep a local copy
instead of keep fetching from the "master copy" on each read.}
{{*First*, (even with current SQL DBMSs) there are better ways to achieve this than corrupting the logical data model:
materialised views (indexed views, for SQL Server), for example. *Second*, this is the worst kind of premature optimisation
-- not only are you spending the effort when it might not actually help (you've made the data bigger = slower than it
should be in cases when you don't need the "joined" columns), in the process you are corrupting perhaps the most important
public interface in the system, on which there will be many many dependencies (what happens when a new, more
performance-critical app using the same data wants an incompatible "denormalisation"?). *Third*, in many cases, any
performance benefits are lost in the additional validation needed on update to enforce consistency ("cache coherency", if
you want to think of it as a cache) between the duplicates -- although queries may be 90% of the load, updates have a
disproportionate effect (due to the "strength" difference between X and S locks, and because the "denormalisation" often
means that the consistency checks are expensive as they need to read several rows).}}
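For concreteness, a hedged sketch of the materialized-view alternative mentioned above, using generic CREATE MATERIALIZED VIEW syntax (as found in, e.g., Oracle or PostgreSQL); the table and column names are hypothetical.

-- Base tables stay fully normalized; the "join-and-copy" lives in the view.
CREATE MATERIALIZED VIEW ORDER_WITH_SALESPERSON AS
SELECT o.OrderId, o.OrderDate, o.Amount, s.SalespersonName
FROM ORDERS o
JOIN SALESPERSON s ON s.SalespersonId = o.SalespersonId;

-- Readers query the precomputed join; the DBMS, not application code, is
-- responsible for keeping the copy consistent (on refresh or on commit,
-- depending on the product and the refresh options chosen).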
Denormalization biases the schema towards one application. So the denormalized schema suits better the chosen application
and makes the life harder for other ones. This might be justified, e.g. a DataWarehouse needs denormalized schema.
Why do you think that a DataWarehouse needs a denormalized schema ? A denormalized schema is automatically a
redundant schema. A redundant schema can serve no logical purpose when compared to the equivalent non-redundant
schema.
What it serves is a physical purpose: to be able to answer certain queries faster, for example by grouping related
information closer together on the disk, precomputing the results of frequently asked queries, or partial results needed for
the same purpose and storing them on the disk. However, this all comes at the expense of a greater risk of breaking logical
correctness because physical concerns surface to the logical level. The traditional approach of denormalizing data
warehouses was favored by poor implementations of the relational model in commercial SQL databases. The relational
model mandates a sharp separation between the logical level and the physical level, which is rarely the case in current
DBMSes, although they are slowly making progress. For example, there will almost always be a 1-1 relationship between
logical structures (tables) and physical structures (table extents, type of row stored in physical pages).
The right solution to solving the datawarehouse kind of problem is to be able to have controlled redundancy and more
flexibility in storage at the physical level. For example, pieces of information from different tables that are accessed
together can be stored in the same physical disk page; even the same piece of information can be stored several times in
various pages together with other related information. However, this can all be done automatically by the DBMS, while the
user will still see only one logical structure (the initial table), so the user can only insert/update/delete the fact once. This is
not possible in a denormalized schema. The piece of the database kernel that controls the redundancy at the physical level
can be thoroughly tested and doesn't run the risk of corrupting data like all the ad-hoc user programs that are needed to
maintain a redundant schema.
Recently various SQL databases have been making progress toward this kind of flexibility in the physical layout through
the use of clustered tables (grouping information from several tables in the same physical location to avoid the join penalty),
table partitioning (breaking a table apart into several physical locations either by data range or by groups of columns), and
materialized views (storing and maintaining precomputed results for important queries).
-CostinCozianu
Agreed. DataWarehouse approaches like StarSchemas are doomed to be obsoleted (WithinFiveYears?? Perhaps even today,
for many DW applications) by improvements to logical-physical separation in main-stream RDBMS products.
The Dangerous Illusion: Denormalization, Performance and Integrity, Part 1
 Fabian Pascal
 DM Review Magazine, June 2002
"A traditional normalized structure cannot and will not outperform a denormalized star schema from a decision support system (DSS)
perspective. These schemas are designed for speed, not your typical record style transaction ... [DSS] requests put way to [sic] much stress on
a model that was originally developed to support the handling of single record transactions."
– Practitioner with 20 years of experience
"Third normal form seems to be regarded by many as the point where your database will be most efficient ... If your database is
overnormalized [sic] you run the risk of excessive table joins. So you denormalize and break theoretical rules for real world performance
gains."
– http://www.sqlteam.com/Forums/
"Denormalization can be described as a process for reducing the degree of normalization with the aim of improving query processing
performance...we present a practical view of denormalization, and provide fundamental guidelines for denormalization ... denormalization
can enhance query performance when it is deployed with a complete understanding of application requirements."
– Sanders & Shin, State University NY Buffalo
Those familiar with my work know I have deplored the lack of foundational knowledge and sound practices in the database trade for
years. One of the most egregiously abused aspects is normalization, as reflected in the opening quotes. The prevailing argument is as
follows:
1. The higher the level of normalization, the greater the number of tables in the database;
2. The greater the number of tables in the database, the more joins are necessary for data manipulation;
3. Joins slow performance;
4. Denormalization reduces the number of tables and, therefore, the reliance on joins, which speeds up performance.
The problem is that points 2 and 3 are not necessarily true, in which case point 4 does not hold. What is more, even if the claims were
all true, denormalization would still be highly questionable because performance gains, if any, can only be had at the expense of
integrity.
The purpose of this article is to explain just why the notion of "denormalization for performance" is a misconception and to expose its
costly dangers, of which practitioners are blissfully unaware. If the integrity cost of denormalization is taken into consideration, it will
cancel out performance gains, if any.
The Key, the Whole Key and Nothing But the Key
Informally, a fully normalized relational database consists of R-tables. Each represents just one entity type. In such tables, all non-key
columns represent attributes of one entity type. The attributes of an entity type are represented by columns, and the entities of that
type are represented by rows. Because a table represents only one entity type and the entities of that type are identified by the key,
non-key values or sets thereof are dependent on (associated with) key values. Stated another way, non-key columns are "directly about
the key, the whole key and nothing but the key."
By definition, key columns do not have duplicate values; therefore, non-key columns dependent on the key won't have duplicate values
either. That is why there is no redundancy due to such dependencies in fully normalized tables. Practically, what this means is that in a fully
normalized database, it is sufficient for the database designer to declare key constraints to the RDBMS, and the integrity of the
database insofar as column dependencies are concerned will be guaranteed.1
Denormalizing a database means, again informally, "bundling" multiple entity types in one table. In a table that represents multiple
entity types, some non-key columns will not be directly about the key, the whole key and nothing but the key. The following kinds of
dependency are possible in such tables, each of which is associated with a normal form lower than full normalization:
1. Dependency on part of a (composite) key; table not in 2NF.
2. Indirect dependency on the key; dependency on another non-key column that is dependent on the key; table not in 3NF.
3. Intra-key multivalued dependency; table not in 4NF.
4. Intra-key join dependency; table not in 5NF.2
Non-key columns and columns that are part of a (composite) key are not guaranteed to have unique values. If they have duplicates in
some rows in the table, other columns in the table dependent on such columns (rather than directly on the key, the whole key and nothing
but the key) will also have duplicate values in those same rows, as highlighted in table A in Figure 1, because of the association with
those non-key values. This causes redundancy in denormalized tables. For such tables, key constraints are no longer sufficient to
guarantee integrity. The DBMS must also control the redundancy, so that when a table such as A is updated, all stored instances of the
same data are updated. If this is not done, the database can be corrupted and become inconsistent. This complication is not necessary
for fully normalized tables (such as B1 and B2 in Figure 1) because they don't have redundancy due to under-normalization.
Figure 1: Denormalized versus Normalized Tables
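To make the integrity cost concrete, the following hedged sketch (with hypothetical tables, not the actual tables of Figure 1) shows the extra work the update path must perform once a fact is stored redundantly.

-- Fully normalized: the customer's address is stored once; one statement suffices,
-- and the declared key constraint is all the DBMS needs to keep the data consistent.
UPDATE CUSTOMER
SET    Address = '12 New Street'
WHERE  CustomerId = 'C42';

-- Denormalized: the address is repeated on every order row, so every stored
-- instance must also be updated (by application code or triggers), or the
-- table will give conflicting answers about the customer's address.
UPDATE ORDERS
SET    CustomerAddress = '12 New Street'
WHERE  CustomerId = 'C42';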
Thus, from an integrity perspective, there are two design options:
1. Fully normalize the database and maximize the simplicity of integrity enforcement;
2. Denormalize the database and complicate enormously (prohibitively, as we shall see shortly) integrity enforcement.
Obviously the first choice is the better option. Why, then, the prevailing insistence on the second choice? The argument for
denormalization is, of course, based on performance considerations; but does it make sense?
The Logical-Physical Confusion
Suppose I ask you to retrieve two sets of books from a library: one set of five books and one of ten. If I ask you which set will take you
longer to retrieve, would you know? When I ask practitioners this question, many realize that the time required for set retrieval
depends not on the number of books per se, but on the characteristics of the library: how the books are physically stored and retrieved.
It is entirely possible for 10 books to take less time if, for example, they are all stored together while the five are scattered across the
library; or if the 10 are on the first floor and the five on the sixth; or if the 10 are indexed in a catalog, but the five are not. (If any of
the books were being used or borrowed by others, retrieval could take a very long time indeed.)
By definition, normalization increases the number of logical tables – books in the analogy – in the database. If the number of books
says nothing about library retrieval performance, how can denormalization – decreasing the number of logical tables – say anything
about database performance? As in the library case, performance is determined entirely at the physical database level (storage and
access methods, hardware, physical design, DBMS implementation details, degree of concurrent access, etc.). What, then, does
"denormalization for performance" mean? When I ask this question, I generally get answers along the following line: "Well, it's all fine
in theory, but we live in the real world. The reality is that I cannot get acceptable performance unless I denormalize my databases."
Let us assume, for the sake of argument, that this is always so (even though it's not). The only logical conclusion possible is not that
normalization slows performance and denormalization speeds it up. To the extent that performance with fully normalized databases is
slow and it improves with denormalization, this can only be due to the physical implementations of the database and DBMS. Those
implementations are simply not good enough for correct logical database design, forcing you to introduce redundancy in the database
and, as we shall see, trade integrity for performance.
Okay, you say, but I must work with existing DBMS software anyway so what difference does your argument make? First of all, it is
very important to realize that it is not normalization, but physical implementation that determines performance (more on this in Part 2).
Secondly, the only reason you may sometimes achieve better performance when you denormalize your database is because you ignore
the integrity implications. If and when those are taken into account, you will lose the performance gain, if any, from denormalization and
may even end up with a net performance loss relative to the fully normalized alternative.
I will demonstrate this next month in Part 2. Stay tuned.
References:
1. There may be redundancies in the database for reasons other than such column dependencies (e.g. derived columns, cross-table duplicates), which must be resolved separately, by means other than full normalization (see Chapters 5 and 8 in Practical Issues In Database Management, Addison Wesley, 2000).
2. All R-tables are by definition in first normal form (1NF), or normalized – they don't have multivalued columns (known in older databases as repeating groups). Normalized databases can be further normalized to the higher normal forms 2NF–5NF. If there are no composite keys in the database, a normalized database is automatically also in 2NF, and a database in 3NF is automatically also in 5NF, or fully normalized. In other words, you need to worry about 4NF and 5NF only if composite keys are present (see Chapter 5 in Practical Issues In Database Management, Addison Wesley, 2000).
Definitions of the Normal Forms
1st Normal Form (1NF)
A table (relation) is in 1NF if:
1. There are no duplicated rows in the table.
2. Each cell is single-valued (no repeating groups or arrays).
3. Entries in a column (field) are of the same kind.
*The requirement that there be no duplicated rows in the table means that the table has a key (although the key might be made up of more than one column, possibly even of all the columns).
2nd Normal Form (2NF)
A table is in 2NF if it is in 1NF and if all non-key attributes are dependent on all of the key. Since a partial dependency
occurs when a non-key attribute is dependent on only a part of the composite key, the definition of 2NF is
sometimes phrased as, “A table is in 2NF if it is in 1NF and if it has no partial dependencies.”
3rd Normal Form (3NF)
A table is in 3NF if it is in 2NF and if it has no transitive dependencies.
Boyce-Codd Normal Form (BCNF)
A table is in BCNF if it is in 3NF and if every determinant is a candidate key.
4th Normal Form (4NF)
A table is in 4NF if it is in BCNF and if it has no multi-valued dependencies.
5th Normal Form (5NF)
A table is in 5NF, also called “Projection-join Normal Form” (PJNF), if it is in 4NF and if every join dependency in the table
is a consequence of the candidate keys of the table.
Domain-Key Normal Form (DKNF)
A table is in DKNF if every constraint on the table is a logical consequence of the definition of keys and domains.
Database Anomalies
Insertion Anomaly
It is a failure to place information about a new database entry into all the places in the database where information about the new entry needs to be stored. In a properly normalized database, information about a new entry needs to be inserted into only one place in the database; in an inadequately normalized database, information about a new entry may need to be inserted into more than one place and, human fallibility being what it is, some of the needed additional insertions may be missed.
Deletion Anomaly
It is a failure to remove information about an existing database entry when it is time to remove that entry. In a properly normalized database, information about an old, to-be-removed entry needs to be deleted from only one place in the database; in an inadequately normalized database, information about that old entry may need to be deleted from more than one place, and some of the needed deletions may be missed.
Update Anomaly
An update of a database involves modifications that may be additions, deletions, or both. Thus “update anomalies” can be
either of the kinds discussed above.
All three kinds of anomalies are highly undesirable, since their occurrence constitutes corruption of the database. Properly normalized databases are much less susceptible to corruption than are un-normalized databases.
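As an illustrative sketch (not part of the original text; the table and rows are hypothetical, borrowing the puppy/kennel example used in the sections below), here is a deletion anomaly playing out in an under-normalized SQLite table:
import sqlite3

con = sqlite3.connect(":memory:")
# Under-normalized: facts about the kennel live only on puppy rows.
con.execute("""CREATE TABLE puppy_flat (
    puppy_number INTEGER PRIMARY KEY,
    puppy_name   TEXT,
    kennel_code  TEXT,
    kennel_name  TEXT
)""")
con.execute("INSERT INTO puppy_flat VALUES (1, 'Fifi', 'K1', 'Puppy Farm')")
# Deleting the kennel's only puppy also destroys the last record of the kennel itself.
con.execute("DELETE FROM puppy_flat WHERE puppy_number = 1")
print(con.execute(
    "SELECT kennel_code, kennel_name FROM puppy_flat WHERE kennel_code = 'K1'").fetchall())
# -> []  the kennel has silently vanished from the database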
Rules of Normalization I
To normalize databases, there are certain rules to keep in mind. These pages will illustrate the basics of normalization in a
simplified way, followed by some examples.
Database Normalization Rule 1: Eliminate Repeating Groups. Make a separate table for each set of related attributes, and give
each table a primary key.
Unnormalized Data Items for Puppies
 puppy number
 puppy name
 kennel code
 kennel name
 kennel location
 trick ID
 trick name
 trick where learned
 skill level
In the original list of data, each puppy description is followed by a list of tricks the puppy has learned. Some might know
10 tricks, some might not know any. To answer the question “Can Fifi roll over?” we need first to find Fifi’s puppy record,
then scan the list of tricks associated with the record. This is awkward, inefficient, and extremely untidy.
Moving the tricks into a separate table helps considerably. Separating the repeating groups of tricks from the puppy
information results in first normal form. The puppy number in the trick table matches the primary key in the puppy table,
providing a foreign key for relating the two tables with a join operation. Now we can answer our question with a direct retrieval: look to see whether Fifi’s puppy number and the trick ID for “roll over” appear together in the trick table.
First Normal Form:
Puppy Table
puppy number (primary key)
puppy name
kennel code
kennel name
kennel location
Trick Table
puppy number
trick ID
trick name
trick where learned
skill level
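The retrieval just described can be sketched in SQLite as follows (this code is not part of the original tutorial; the sample rows, identifiers and column spellings are mine). The question “Can Fifi roll over?” becomes a single query over the join of the two tables:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE puppy (
    puppy_number    INTEGER PRIMARY KEY,
    puppy_name      TEXT,
    kennel_code     TEXT,
    kennel_name     TEXT,
    kennel_location TEXT
);
CREATE TABLE trick (
    puppy_number        INTEGER REFERENCES puppy(puppy_number),
    trick_id            INTEGER,
    trick_name          TEXT,
    trick_where_learned TEXT,
    skill_level         INTEGER,
    PRIMARY KEY (puppy_number, trick_id)
);
""")
con.execute("INSERT INTO puppy VALUES (1, 'Fifi', 'K1', 'Puppy Farm', 'Anytown')")
con.execute("INSERT INTO trick VALUES (1, 27, 'roll over', 'K1', 9)")
# "Can Fifi roll over?" -- one direct retrieval over the join of the two tables.
row = con.execute("""
    SELECT 1
    FROM puppy p JOIN trick t ON t.puppy_number = p.puppy_number
    WHERE p.puppy_name = 'Fifi' AND t.trick_name = 'roll over'
""").fetchone()
print("Yes" if row else "No")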
Rules of Normalization II
Database Normalization Rule 2: Eliminate Redundant Data. If an attribute depends on only part of a multi-valued (composite) key, remove it to a separate table.
The trick name (e.g. “roll over”) appears redundantly for every puppy that knows it. Just the trick ID would do.
TRICK TABLE
Puppy Number | Trick ID | Trick Name   | Where Learned | Skill Level
52           | 27       | “roll over”  | 16            | 9
53           | 16       | “Nose Stand” | 9             | 9
54           | 27       | “roll over”  | 9             | 5
*Note that trick name depends on only a part (the trick ID) of the multi-valued, i.e. composite key.
In the trick table, the primary key is made up of the puppy number and trick ID. This makes sense for the “where learned”
and “skill level” attributes, since they will be different for every puppy-trick combination. But the trick name depends only
on the trick ID. The same name will appear redundantly every time its associated ID appears in the trick table.
Second Normal Form
Puppy Table
puppy number
puppy name
kennel code
kennel name
kennel location
Tricks Table
tricks ID
tricks name
Puppy-Tricks
puppy number
trick ID
trick where learned
skill level
Suppose you want to reclassify a trick, i.e. to give it a different trick ID. The change has to be made for every puppy that knows the trick. If you miss some of the changes, you will have several puppies with the same trick under different IDs; this is an update anomaly.
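A sketch of the second-normal-form design in SQLite (not part of the original text; identifiers are mine). Because the trick name is now stored exactly once in the Tricks table, correcting or renaming it is a single-row update rather than one update per puppy that knows the trick:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tricks (
    trick_id   INTEGER PRIMARY KEY,
    trick_name TEXT
);
CREATE TABLE puppy_tricks (
    puppy_number        INTEGER,
    trick_id            INTEGER REFERENCES tricks(trick_id),
    trick_where_learned TEXT,
    skill_level         INTEGER,
    PRIMARY KEY (puppy_number, trick_id)
);
""")
con.execute("INSERT INTO tricks VALUES (27, 'roll over')")
con.executemany("INSERT INTO puppy_tricks VALUES (?, ?, ?, ?)",
                [(52, 27, '16', 9), (54, 27, '9', 5)])
# The trick name lives in one row only, so this touches exactly one record.
con.execute("UPDATE tricks SET trick_name = 'rolling over' WHERE trick_id = 27")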
Rules of Normalization III
Database Normalization Rule 3: Eliminate columns not dependent on key. If attributes do not contribute to a description of the
key, remove them to a separate table.
Puppy Table
puppy number
puppy name
kennel code
kennel name
The puppy table satisfies the first normal form, since it contains no repeating groups. It satisfies the second normal form, since it does not have a multivalued (composite) key. But the key is puppy number, and the kennel name and the kennel location describe only a kennel, not a puppy. To achieve the third normal form, they must be moved into a separate table. Since they describe a kennel, kennel code becomes the key of the new “kennels” table.
Third Normal Form
Puppies
puppy number
puppy name
kennel code
Kennel
kennel code
kennel name
kennel location
Tricks
trick ID
trick name
Puppy Tricks
puppy number
trick ID
trick where learned
skill level
The motivation for this is the same as for the second normal form: we want to avoid update and delete anomalies. For example, suppose no puppies from the Puppy Farm kennel were currently stored in the database. With the previous design, there would be no record of that kennel's existence.
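A sketch of the third-normal-form design in SQLite (not part of the original text; identifiers are mine), showing that the deletion anomaly just described no longer occurs: the kennel record survives even when its last puppy is removed.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE kennels (
    kennel_code     TEXT PRIMARY KEY,
    kennel_name     TEXT,
    kennel_location TEXT
);
CREATE TABLE puppies (
    puppy_number INTEGER PRIMARY KEY,
    puppy_name   TEXT,
    kennel_code  TEXT REFERENCES kennels(kennel_code)
);
""")
con.execute("INSERT INTO kennels VALUES ('K1', 'Puppy Farm', 'Anytown')")
con.execute("INSERT INTO puppies VALUES (1, 'Fifi', 'K1')")
# Removing the kennel's last puppy does not remove the kennel itself.
con.execute("DELETE FROM puppies WHERE kennel_code = 'K1'")
print(con.execute("SELECT * FROM kennels").fetchall())  # [('K1', 'Puppy Farm', 'Anytown')]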
Boyce-Codd Normal Form
The previous normal forms are considered elementary and should be applied to tables during the design process. This normal form, however, and the following ones, matter only for tables with special characteristics.
A table is considered to be in BCNF (Boyce-Codd Normal Form) if it is already in 3NF and contains no nontrivial functional dependency whose determinant is not a candidate key; that is, no field (other than a candidate key) determines the value of another field. Let’s take the following table:
Student | Subject | Teacher
Smith   | Math    | Dr. White
Smith   | English | Dr. Brown
Jones   | Math    | Dr. White
Jones   | English | Dr. Brown
Doe     | Math    | Dr. Green
Take the following conditions into consideration:
 For each subject, every student is educated by one teacher.
 Every teacher teaches one subject only.
 Each subject can be taught by more than one teacher.
It’s clear we have the following functional dependency:
Teacher -> Subject
and the left side of this dependency (Teacher) is not a candidate key.
So, to convert the table from 3NF to BCNF, we take these steps:
 Identify a determinant in the table other than the primary key: the left side of the offending functional dependency (here, Teacher).
 Remove the attribute on the right side of that dependency (Subject) from the main table.
 Create a separate table for the dependency, with its key being the left side of the dependency, as follows:
Student | Teacher
Smith   | Dr. White
Smith   | Dr. Brown
Jones   | Dr. White
Jones   | Dr. Brown
Doe     | Dr. Green
And
Teacher   | Subject
Dr. White | Math
Dr. Brown | English
Dr. Green | Math
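A sketch of the same BCNF decomposition in SQLite (not part of the original text; table names are mine). Making Teacher the key of its own table lets the declared key enforce Teacher -> Subject, and the original Student/Subject/Teacher view can still be recovered with a join:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Teacher -> Subject is enforced by making Teacher the key of its own table.
CREATE TABLE teacher_subject (
    teacher TEXT PRIMARY KEY,
    subject TEXT
);
CREATE TABLE student_teacher (
    student TEXT,
    teacher TEXT REFERENCES teacher_subject(teacher),
    PRIMARY KEY (student, teacher)
);
""")
con.executemany("INSERT INTO teacher_subject VALUES (?, ?)",
                [("Dr. White", "Math"), ("Dr. Brown", "English"), ("Dr. Green", "Math")])
con.executemany("INSERT INTO student_teacher VALUES (?, ?)",
                [("Smith", "Dr. White"), ("Smith", "Dr. Brown"),
                 ("Jones", "Dr. White"), ("Jones", "Dr. Brown"),
                 ("Doe",   "Dr. Green")])
# The original Student/Subject/Teacher view can be recovered with a join:
print(con.execute("""
    SELECT st.student, ts.subject, st.teacher
    FROM student_teacher st JOIN teacher_subject ts ON ts.teacher = st.teacher
""").fetchall())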
Rules of Normalization IV
Database Normalization Rule 4: Isolate independent multiple relationships. No table may contain two or more 1:n (one-to-many) or n:m (many-to-many) relationships that are not directly related.
This applies only to designs that include one-to-many and many-to-many relationships. An example of a one-to-many
relationship is that one kennel can hold many puppies. An example of a many-to-many relationship is that a puppy can
know many tricks and several puppies can know the same tricks.
Puppy Tricks and Costumes
puppy number
trick ID
trick where learned
skill level
costume
Suppose we want to add a new attribute to the Puppy-Tricks table, “Costume”; this way we can look for puppies that can both sit up and beg and wear a Groucho Marx mask, for example. The fourth normal form dictates against this (i.e. against using the Puppy-Tricks table this way, not against begging while wearing a Groucho mask). The two attributes do not share a meaningful relationship. A puppy may be able to wear a wet suit; this does not mean it can simultaneously sit up and beg. How would you represent this if you stored both attributes in the same table?
Fourth Normal Form
Puppies
puppy number
puppy name
kennel code
Kennels
kennel code
kennel name
kennel location
Tricks
trick ID
trick name
Puppy-Tricks
puppy number
trick ID
trick where learned
skill level
Costumes
costume number
costume name
Puppy-Costumes
puppy number
costume number
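A sketch of the fourth-normal-form design in SQLite (not part of the original text; identifiers are mine, and the detail columns of Puppy-Tricks are trimmed for brevity). Tricks and costumes are recorded in independent tables, so no artificial trick/costume pairing is stored:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE puppy_tricks (
    puppy_number INTEGER,
    trick_id     INTEGER,
    PRIMARY KEY (puppy_number, trick_id)
);
CREATE TABLE puppy_costumes (
    puppy_number   INTEGER,
    costume_number INTEGER,
    PRIMARY KEY (puppy_number, costume_number)
);
""")
# Saying that puppy 1 knows trick 27 and can also wear costume 3 takes one row
# in each table; the two facts never have to be combined in a single record.
con.execute("INSERT INTO puppy_tricks VALUES (1, 27)")
con.execute("INSERT INTO puppy_costumes VALUES (1, 3)")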
Examples
Question 1. Choose a key and write the dependencies for the following GRADES relation:
GRADES(Student_ID, Course#, Semester#, Grade)
Answer:
Key is:
Student_ID, Course#, Semester#
Dependency is:
Student_ID, Course#, Semester# -> Grade
Question 2. Choose a key and write the dependencies for the LINE_ITEMS relation:
LINE_ITEMS (PO_Number, ItemNum, PartNum, Description, Price, Qty)
Answer:
Key can be: PO_Number, ItemNum
Dependencies are:
PO_Number, ItemNum -> PartNum, Description, Price, Qty
PartNum -> Description, Price
Question 3. What normal form is the above LINE_ITEMS relation in?
Answer:
First off, LINE_ITEMS cannot be in BCNF because not all determinants are keys.
Next, it cannot be in 3NF because there is a transitive dependency:
PO_Number, ItemNum -> PartNum
and
PartNum -> Description
Therefore, it must be in 2NF. We can check that this is true because the key (PO_Number, ItemNum) determines all of the non-key attributes, while PO_Number by itself and ItemNum by itself cannot determine any other attributes.
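As a follow-up sketch (not part of the original answer; the table definitions are one reasonable reading of the question), the transitive dependency PartNum -> Description, Price can be removed by splitting LINE_ITEMS, which takes the design to 3NF:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Description and Price now depend on the key of their own table (PartNum).
CREATE TABLE parts (
    PartNum     TEXT PRIMARY KEY,
    Description TEXT,
    Price       REAL
);
CREATE TABLE line_items (
    PO_Number TEXT,
    ItemNum   INTEGER,
    PartNum   TEXT REFERENCES parts(PartNum),
    Qty       INTEGER,
    PRIMARY KEY (PO_Number, ItemNum)
);
""")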
Example II
Question 4: What normal form is the following relation in?
STORE_ITEM( SKU, PromotionID, Vendor, Style, Price )
SKU, PromotionID -> Vendor, Style, Price
SKU -> Vendor, Style
Answer:
STORE_ITEM is in 1NF but not 2NF, because non-key attributes (Vendor, Style) are dependent on only part of the key (SKU).
Question 5: Normalize the above (Q4) relation into the next higher normal form.
Answer:
STORE_ITEM (SKU, PromotionID, Price)
VENDOR ITEM (SKU, Vendor, Style)
Question 6: Choose a key and write the dependencies for the following SOFTWARE relation (assume all of the vendor’s
products have the same warranty).
SOFTWARE (SoftwareVendor, Product, Release, SystemReq, Price, Warranty)
SoftwareVendor, Product, Release -> SystemReq, Price, Warranty
Answer:
key is: SoftwareVendor, Product, Release
SoftwareVendor, Product, Release -> SystemReq, Price, Warranty
SoftwareVendor -> Warranty
Therefore, SOFTWARE is in 1NF (SoftwareVendor -> Warranty is a partial dependency on the key).
Example III
Question 7: Normalize the above Software relation into 4NF.
Answer:
SOFTWARE (SoftwareVendor, Product, Release, SystemReq, Price)
WARRANTY (SoftwareVendor, Warranty)
Question 8: What normal form is the following relation in?
STUFF (H, I, J, K, L, M, N, O)
Only H, I can act as the key.
H, I -> J, K, L
J -> M
K -> N
L -> O
Answer:
2NF (transitive dependencies J -> M, K -> N and L -> O exist, so it is not in 3NF)
Question 9: What normal form is the following relation in?
STUFF2 (D, O, N, T, C, R, Y)
D, O -> N, T, C, R, Y
C, R -> D
D -> N
Answer:
1NF (the partial-key dependency D -> N exists, so it is not in 2NF)
Example IV
Invoice relation
Is this relation in 1NF? 2NF? 3NF?
Convert the relation to 3NF.
Inv# | date  | custID | Name | Part# | Desc | Price | #Used | Ext Price | Tax rate | Tax  | Total
14   | 12/63 | 42     | Lee  | A38   | Nut  | 0.32  | 10    | 3.20      | 0.10     | 1.22 | 13.42
14   | 12/63 | 42     | Lee  | A40   | Saw  | 4.50  | 2     | 9.00      | 0.10     | 1.22 | 13.42
15   | 1/64  | 44     | Pat  | A38   | Nut  | 0.32  | 20    | 6.40      | 0.10     | 0.64 | 7.04
Table not in 1NF because it contains derived values:
- Ext Price (= Price × #Used): 3.20 = 0.32 × 10
- Tax (= sum of Ext Price for the same Inv# × Tax rate): 1.22 = (3.20 + 9.00) × 0.10
- Total (= sum of Ext Price + Tax): 13.42 = (3.20 + 9.00) + 1.22
To get 1NF, identify PK and remove derived attributes
Inv# | date  | custID | Name | Part# | Desc | Price | #Used | Tax rate
14   | 12/63 | 42     | Lee  | A38   | Nut  | 0.32  | 10    | 0.10
14   | 12/63 | 42     | Lee  | A40   | Saw  | 4.50  | 2     | 0.10
15   | 1/64  | 44     | Pat  | A38   | Nut  | 0.32  | 20    | 0.10
To get 2NF
- Remove partial dependencies, i.e. FDs on part of the composite key (Inv#, Part#):
- Inv# -> Date, CustID, Name, Tax rate
- Part# -> Desc, Price
Remove Partial FDs
(K1 = Inv#; D1 = date, custID, Name, Tax rate; K2 = Part#; D2 = Desc, Price; #Used depends on the whole key K1 + K2)
Inv# | date  | custID | Name | Tax rate | Part# | Desc | Price | #Used
14   | 12/63 | 42     | Lee  | 0.10     | A38   | Nut  | 0.32  | 10
14   | 12/63 | 42     | Lee  | 0.10     | A40   | Saw  | 4.50  | 2
15   | 1/64  | 44     | Pat  | 0.10     | A38   | Nut  | 0.32  | 20
=
Inv# | date  | custID | Name | Tax rate
14   | 12/63 | 42     | Lee  | 0.10
15   | 1/64  | 44     | Pat  | 0.10
+
Inv# | Part# | #Used
14   | A38   | 10
14   | A40   | 2
15   | A38   | 20
+
Part# | Desc | Price
A38   | Nut  | 0.32
A40   | Saw  | 4.50
Remove transitive FD
Inv# (PK) -> custID -> Name
Inv# | date  | custID | Name | Tax rate
14   | 12/63 | 42     | Lee  | 0.10
15   | 1/64  | 44     | Pat  | 0.10
=
Inv# | date  | custID | Tax rate
14   | 12/63 | 42     | 0.10
15   | 1/64  | 44     | 0.10
+
custID | Name
42     | Lee
44     | Pat
All relations in 3NF
Inv# | date  | custID | Tax rate
14   | 12/63 | 42     | 0.10
15   | 1/64  | 44     | 0.10
custID | Name
42     | Lee
44     | Pat
Inv# | Part# | #Used
14   | A38   | 10
14   | A40   | 2
15   | A38   | 20
Part# | Desc | Price
A38   | Nut  | 0.32
A40   | Saw  | 4.50
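As a sketch (not part of the original example; identifiers are mine), the final 3NF design can be declared in SQLite and the derived values dropped earlier (Ext Price, Tax, Total) recomputed on demand instead of being stored:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE invoice  (inv_no INTEGER PRIMARY KEY, inv_date TEXT,
                       cust_id INTEGER, tax_rate REAL);
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE part     (part_no TEXT PRIMARY KEY, descr TEXT, price REAL);
CREATE TABLE line     (inv_no INTEGER, part_no TEXT, used INTEGER,
                       PRIMARY KEY (inv_no, part_no));
""")
con.executemany("INSERT INTO invoice VALUES (?, ?, ?, ?)",
                [(14, "12/63", 42, 0.10), (15, "1/64", 44, 0.10)])
con.executemany("INSERT INTO customer VALUES (?, ?)", [(42, "Lee"), (44, "Pat")])
con.executemany("INSERT INTO part VALUES (?, ?, ?)",
                [("A38", "Nut", 0.32), ("A40", "Saw", 4.50)])
con.executemany("INSERT INTO line VALUES (?, ?, ?)",
                [(14, "A38", 10), (14, "A40", 2), (15, "A38", 20)])
# Derived values (Ext Price, Tax, Total) are computed when needed, not stored.
print(con.execute("""
    SELECT i.inv_no,
           SUM(p.price * l.used)                    AS ext_total,
           SUM(p.price * l.used) * i.tax_rate       AS tax,
           SUM(p.price * l.used) * (1 + i.tax_rate) AS total
    FROM invoice i JOIN line l ON l.inv_no = i.inv_no
                   JOIN part p ON p.part_no = l.part_no
    GROUP BY i.inv_no, i.tax_rate
""").fetchall())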
Example V
On this page you will see a basic database normalization example, transforming a BCNF table into 4NF one(s).
The next table is in BCNF; convert it to the fourth normal form (4NF).
Employee | Skill      | Language
Jones    | electrical | French
Jones    | electrical | German
Jones    | mechanical | French
Jones    | mechanical | German
Smith    | plumbing   | Spanish
The above table does not comply with the fourth normal form, because skill and language are independent of each other, yet every combination must be stored. If the table contains rows like these:
Jones | X | A
Jones | Y | B
then the rows
Jones | X | B
Jones | Y | A
must also be in the table, which means the same facts about Jones's skills and languages are recorded repeatedly.
To transform this into the fourth normal form (4NF) we must separate the original table into two tables, like this:
Employee | Skill
Jones    | electrical
Jones    | mechanical
Smith    | plumbing
and
Employee | Language
Jones    | French
Jones    | German
Smith    | Spanish
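A sketch of this 4NF result in SQLite (not part of the original example; table names are mine). Recording a new language for Jones is now a single row, independent of how many skills he has:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee_skill (
    employee TEXT,
    skill    TEXT,
    PRIMARY KEY (employee, skill)
);
CREATE TABLE employee_language (
    employee TEXT,
    language TEXT,
    PRIMARY KEY (employee, language)
);
""")
con.executemany("INSERT INTO employee_skill VALUES (?, ?)",
                [("Jones", "electrical"), ("Jones", "mechanical"), ("Smith", "plumbing")])
con.executemany("INSERT INTO employee_language VALUES (?, ?)",
                [("Jones", "French"), ("Jones", "German"), ("Smith", "Spanish")])
# Adding a language for Jones is one row, not one row per skill:
con.execute("INSERT INTO employee_language VALUES ('Jones', 'Spanish')")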
Source: DBNormalization.com