CS6302 DATABASE MANAGEMENT SYSTEM

UNIT 1 INTRODUCTION TO DBMS

1. File system organisation

1. A computer file contains information arranged in an electronic format. It facilitates easy storage, retrieval, and manipulation of data.
2. Files are stored in the form of bits and bytes. Each file has a name, and the computer recognizes a file by this name.
3. A programmer working with a file can instruct the computer to open the file, read from it, write to it, modify its contents, close it, and so on.
4. When one program passes control to another in a sequence, with little or no human interaction required, this is called batch processing.
5. In many situations the program needs to be conversational. These days, the computer performs both the searching and the answering operations in an automated manner.
6. A search that can take place at any time is called an online query.
7. When an instantaneous answer is expected, it is called online processing or real-time processing. Example: airline reservations.
8. Data can be classified into two types:
   - Master data: changes rarely, if at all, over time.
   - Transaction data: can change from time to time.
9. Example: Library Management
10. There is a library of books and a librarian to maintain it. The librarian has created one card per book, which contains details such as book number, title, author, price and date of purchase.
11. For this, the librarian has used the conceptual record layout shown in Figure 1.1.

Fig 1.1 Record layout for the Book file

12. A card is similar to a record and, in technical terms, the entire pile of cards is similar to a file.
13. A field used to identify a record is called a record key, or just a key.
14. A record key can be of two types:
   - Primary key: identifies a record uniquely based on a field value. Example: book number.
   - Secondary key: may or may not identify a record uniquely, but can identify one or more records based on a field value. Example: author.
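The difference between the two key types can be illustrated with a minimal Python sketch. The `Book` class, the sample records, and the two lookup functions are assumptions made for illustration and are not part of the course material; the date of purchase field is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Book:
    book_number: int  # primary key: unique for every record
    title: str
    author: str       # secondary key: may repeat across records
    price: float


# Hypothetical contents of the Book file, one record per card.
books = [
    Book(101, "Database Basics", "Rao", 350.0),
    Book(102, "File Structures", "Iyer", 275.0),
    Book(103, "Query Processing", "Rao", 410.0),
]


def find_by_primary_key(records, book_number):
    """Return the single record with this book number, or None."""
    for record in records:
        if record.book_number == book_number:
            return record
    return None


def find_by_secondary_key(records, author):
    """Return every record written by this author (possibly none)."""
    return [record for record in records if record.author == author]
```

A primary key lookup returns at most one record, while a secondary key lookup returns a list: here `find_by_secondary_key(books, "Rao")` yields two books.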
1.1 Sequential organisation

A file is called a sequential file when it contains records arranged in sequential fashion. Records are added as and when they become available; that is, a new record is always added to the end of the current file.

Advantages of sequential organisation
- Simplicity: The sequential organisation of records is quite simple. A new record only has to be created for each new book.
- Less overhead: There is no need to keep any key or other extra information about the books file. The file itself is enough.

Disadvantages
- Difficulties in searching: Searching can be a very slow process. It starts from the beginning and continues to the end, or until the desired record is found, whichever is earlier. This is both time consuming and cumbersome.
- Lack of support for queries: Even to find out whether something is available in the file, the entire file may have to be read.
- Problem with record deletion: It is not simple to delete records. The space freed by deleting a record cannot be reclaimed.

1.2 Pointers

A pointer in a record is a special field whose value is the address (reference) of another record in the same file. The pointer fields form a chain of records, that is, a logical sequence of records created by the use of pointer fields. Chains with a single pointer field per record are called one-way chains.

Problems with one-way chains:
- They can suffer from lost or damaged references.
- They are unidirectional by nature.

Two-way chains
1. In two-way chains another logical chain is created.
2. A second chain is added in the reverse direction, so that each record has two pointer fields. The new pointer field is called the back-pointer.
3. The back-pointer points to the previous record in the chain.
4. A broken chain does not cause a major problem in two-way chains, for the simple reason that another chain exists in the opposite direction.
5. Two-way chains therefore do not suffer from the drawback of lost or damaged references.
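The recovery property of two-way chains can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are held in a list, a pointer is a list position, and the field names `next` and `prev` as well as the function names are hypothetical.

```python
# Each record stores a forward pointer ("next") and a back-pointer ("prev").
# A pointer is the list position of another record; None marks the chain end.
records = [
    {"key": "A", "next": 1, "prev": None},
    {"key": "B", "next": 2, "prev": 0},
    {"key": "C", "next": None, "prev": 1},
]


def walk_forward(chain, head):
    """Follow the forward pointers from the head and collect the keys."""
    keys, pos = [], head
    while pos is not None:
        keys.append(chain[pos]["key"])
        pos = chain[pos]["next"]
    return keys


def walk_backward(chain, tail):
    """Follow the back-pointers from the tail and collect the keys."""
    keys, pos = [], tail
    while pos is not None:
        keys.append(chain[pos]["key"])
        pos = chain[pos]["prev"]
    return keys


def repair_forward_pointers(chain, tail):
    """Rebuild all forward pointers from the intact back-pointer chain.

    Returns the position of the head record.
    """
    pos = tail
    chain[pos]["next"] = None
    while chain[pos]["prev"] is not None:
        prev = chain[pos]["prev"]
        chain[prev]["next"] = pos
        pos = prev
    return pos
```

If the forward pointer of record A is lost, a forward walk stops after A, but the backward walk from the tail still reaches every record, so the forward chain can be rebuilt. The price is the extra pointer that has to be maintained on every insertion and deletion.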
Disadvantages:
1. More effort goes into their maintenance. For every new record that is added, entries must be made in both the forward and backward pointer fields.
2. If a record is lost, both the previous and next pointers of its neighbours need to be adjusted so that they point to the correct records.

1.3 Indexed Organisation

1. One of the fields in the file is the primary key. This field identifies a record uniquely. In every record, the primary key field should occupy the same position.
2. For an indexed file, the computer creates and maintains two files: a data file and an index file. The data file contains the actual contents (data) of the records, whereas the index file contains the index entries.
3. The files are organized as follows:
   - The data file is sorted in the order of the primary key field values.
   - The index file contains two fields: the key value and the pointer to the data area. One record in the index file thus consists of a key value and a pointer to the corresponding data record.
   - The key value is generally the largest primary key value in a given range of records.
   - The pointer points to the first entry within that range of data records.
4. This is illustrated in Figure 1.2. In the first index entry, the index value is C, which is the highest primary key value in the first data block.
5. The pointer from this index entry points to the start of this range (i.e. A). The address (i.e. location) of this record on the disk is assumed to be 0, as shown.
6. The second index entry contains F as the highest primary key value for that range of records, and a pointer to D, which is the start of the range, and so on.
7. The address of D on the disk is assumed to be 100, as shown.

Fig 1.2 Indexed file organization

8. This arrangement works fine, but there are two problems:
   - New index values may have to be inserted between two existing values.
   - The number of index values may become too high.
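The lookup path through such an index can be sketched as follows. This is a minimal illustration, not the layout of any particular system: the keys A to F and the addresses 0 and 100 follow the description above, while the dictionary used as a stand-in for the disk, the record payloads, and the function name `lookup` are assumptions.

```python
from bisect import bisect_left

# Stand-in for the data file: disk address -> block of (key, record) pairs,
# with the records sorted by primary key.
data_file = {
    0: [("A", "record A"), ("B", "record B"), ("C", "record C")],
    100: [("D", "record D"), ("E", "record E"), ("F", "record F")],
}

# Index file: (highest key in the block, address of the first record).
index = [("C", 0), ("F", 100)]


def lookup(key):
    """Find a record by primary key using the index, or return None."""
    highest_keys = [entry_key for entry_key, _ in index]
    # First index entry whose highest key is >= the search key.
    i = bisect_left(highest_keys, key)
    if i == len(index):
        return None  # the key is larger than every key in the file
    _, address = index[i]
    # Scan only the one block that can contain the key.
    for record_key, record in data_file[address]:
        if record_key == key:
            return record
    return None
```

Only one block is scanned per lookup, which is the gain over a sequential file. The sketch also makes the two problems visible: a new key such as "CA" needs a slot between existing entries, and a very long `index` list would itself become slow to search.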
Solution:
- Inserting a new index entry necessitates a split in the index and appropriate adjustments in the address values.
- To solve the second problem, create a multi-level index (an index of indexes). In this type of index, the first level does not point to the data items as before. Instead, it points to another, lower-level index. Depending on the need, this lower-level index may point to yet another lower-level index, and so on. Only the final level of the index points to the actual data items.

1.4 Direct organisation

The idea is quite simple. All records in a direct file are of the same size, and every record has an associated record number. The record number serves the same purpose as a primary key in an indexed file. Direct files can be classified into two main types:
- Hashed files
- Non-hashed files

Non-hashed files
Here, each record is placed in its appropriate slot based on its record number. The drawback of the non-hashed file approach is the creation of too many empty slots.

Hashed files
In a hashed file, the record number is derived from the primary key, so the key itself determines where the record is stored. The term hash indicates splitting or chopping of the key into pieces. There are three primary hashing techniques: the division method, the mid-square method and the folding method.

2. Purpose of Database System

Database systems arose in response to early methods of computerized management of commercial data.

As an example, consider part of a university organization that, among other data, keeps information about all instructors, students, departments and course offerings.

One way to keep the information on a computer is to store it in operating system files.

To allow users to manipulate the information, the system has a number of application programs that manipulate the files, including programs to:
- Add new students, instructors, and courses.
- Register students for courses.
- Assign grades to students, compute grade point averages (GPA) and generate transcripts.

New application programs are added to the system as the need arises.

2.1 File Processing System

This system is supported by a conventional operating system. The system stores permanent records in various files. It needs different application programs to extract records from, and add records to, the appropriate files. Before database management systems (DBMSs) were introduced, organizations usually stored information in such systems.

2.1.1 Drawbacks of using file systems to store data

Data redundancy and inconsistency
- Different programmers create the files and application programs.
- The files have different structures, and the programs may be written in several programming languages.
- The same information may be duplicated in several files.

Difficulty in accessing data
- A new program has to be written to carry out each new task.

Data isolation
- Data are scattered in various files.
- The files may be stored in different formats.
- Writing new application programs to retrieve the appropriate data is difficult.

Integrity problems
- Integrity constraints (e.g., account balance > 0) become "buried" in program code rather than being stated explicitly.
- It is hard to add new constraints or change existing ones.

Atomicity of updates
- Failures may leave the database in an inconsistent state with partial updates carried out.
- Example: A transfer of funds from one account to another should either complete or not happen at all.

Concurrent access anomalies
- Concurrent access is needed for improved performance.
- Uncontrolled concurrent accesses can lead to inconsistencies.
- Example: Two people read a balance (say 100) and update it by withdrawing money (say 50 each) at the same time.

Security problems
- Not every user of the database system should be able to access all the data.
- Example: In a university, payroll personnel need to see only the financial information. They do not need access to information about academic records.
- Since application programs are added to a file processing system in an ad hoc manner, enforcing such security constraints is difficult.

Database systems offer solutions to all the above problems.

3. Database System Terminologies

Database: A collection of related data.
Data: Known facts that can be recorded and have an implicit meaning.
Mini-world: Some part of the real world about which data is stored in a database. For example, student grades and transcripts at a university.
Database Management System (DBMS): A software package or system that facilitates the creation and maintenance of a computerized database.
Database System: The DBMS software together with the data itself. Sometimes, the applications are also included.

4. Database Characteristics

The main characteristics of the database approach are the following:
1. Self-describing nature of a database system
2. Insulation between programs and data, and data abstraction
3. Support of multiple views of the data
4. Sharing of data and multiuser transaction processing

1. Self-describing nature of a database system
- A DBMS catalog stores the description of a particular database (e.g. data structures, types, and constraints).
- The description is called meta-data.
- This allows the DBMS software to work with different database applications.

2. Insulation between programs and data, and data abstraction
- The structure of data files is stored in the DBMS catalog separately from the access programs.
- This property is called program-data independence.
- It allows data structures and storage organization to be changed without having to change the DBMS access programs.
- Data abstraction: A data model is used to hide storage details and present the users with a conceptual view of the database.
- Programs refer to the data model constructs rather than to data storage details.

3. Support of multiple views of the data
- Each user may see a different view of the database, which describes only the data of interest to that user.

4. Sharing of data and multiuser transaction processing
- A set of concurrent users is allowed to retrieve from and update the database.
- Concurrency control within the DBMS guarantees that each transaction is correctly executed or aborted.
- The recovery subsystem ensures that each completed transaction has its effect permanently recorded in the database.
- OLTP (Online Transaction Processing) is a major part of database applications. It allows hundreds of concurrent transactions to execute per second.

5. Data Models

Data abstraction:
- Suppression of details of data organization and storage.
- Highlighting of the essential features for an improved understanding of data.

Data model:
- A collection of concepts that describe the structure of a database.
- Provides the means to achieve data abstraction.

Basic operations
- Specify retrievals and updates on the database.

Dynamic aspect or behavior of a database application
- Allows the database designer to specify a set of valid operations allowed on database objects.

Categories of Data Models

High-level or conceptual data models
- Close to the way many users perceive data.
- Conceptual data models use concepts such as entities, attributes, and relationships.
- Entity: represents a real-world object or concept.
- Attribute: represents some property of interest that further describes an entity.
- Relationship: an association among two or more entities.

Low-level or physical data models
- Describe the details of how data is stored on computer storage media.

Representational data models
- Easily understood by end users.
- Also similar to how data is organized in computer storage.
- Relational data model: used most frequently in traditional commercial DBMSs.
- Object data model: a new family of higher-level implementation data models that are closer to conceptual data models.

Physical data models
- Describe how data is stored as files in the computer.
- Access path: a structure that makes the search for particular database records efficient.
- Index: an example of an access path that allows direct access to data using an index term or keyword.

6. DBMS Components

A DBMS is a complex software system. Figure 6.1 illustrates, in a simplified form, the typical DBMS components.

Fig 6.1 Component modules of a DBMS and their interactions

The top part of the figure refers to the various users of the database environment and their interfaces. The lower part shows the internals of the DBMS responsible for storage of data and processing of transactions.

Let us consider the top part of the figure:
1. It shows interfaces for the DBA staff, casual users who work with interactive interfaces to formulate queries, application programmers who program using some host languages, and parametric users who do data entry work by supplying parameters to predefined transactions.
2. DDL compiler: processes schema definitions, specified in the DDL, and stores descriptions of the schemas (meta-data) in the DBMS catalog.
3. Casual users and persons with an occasional need for information from the database interact using some form of interface called the interactive query interface.
4. Query compiler: handles high-level queries that are entered interactively.
5. Query optimizer: is concerned with the rearrangement and possible reordering of operations, elimination of redundancies, and use of correct algorithms and indexes during execution.
6. Application programmers write programs in host languages such as Java, C, or C++ that are submitted to a precompiler.
7. Precompiler: extracts DML commands from an application program written in a host programming language.
8. DML compiler: compiles the DML commands into object code for database access.
9. The rest of the program is sent to the host language compiler.
10. The object code for the DML commands and the rest of the program are linked, forming a canned transaction whose executable code includes calls to the runtime database processor.

Now, let us consider the lower part of the figure:
1. Run-time database processor: handles database access at run time. It receives retrieval and update operations and carries them out on the database.
2. It also works with the stored data manager, which controls access to DBMS information that is stored on disk through interaction with the operating system.
3. Concurrency control and backup and recovery systems are integrated into the working of the runtime database processor for purposes of transaction management.

Database System Utilities

Some functions are not provided through the normal DBMS components; they are provided through additional programs called utilities. Some of these are:
1. Loading or import utility: used to load or import existing data files into the database.
2. Backup utility: used to create backup copies of the database, usually by dumping the entire database onto tape.
3. File reorganization utility: used to reorganize a database file into a different file organization to improve performance.
4. Performance monitoring utility: used to monitor database usage and provide statistics to the DBA.

7. Relational Algebra

1. A set of operators (unary and binary) that take relation instances as arguments and return new relations.
2. Gives a procedural method of specifying a retrieval query.
3. Forms the core component of a relational query engine.
4. SQL queries are internally translated into relational algebra expressions.
5. Provides a framework for query optimization.
A sequence of relational algebra operations forms a relational algebra expression.

7.1 Unary Relational Operations: SELECT, PROJECT and RENAME

The SELECT operation (denoted by σ, sigma) is used to select those tuples of a relation that satisfy a given condition.

Notation: σ<condition>(R)
σ: select operator (read as sigma)
R: relation name

Examples of select expressions
- Obtain information about a professor with name "giridhar":
  σ name = "giridhar" (professor)
- Obtain information about professors who joined the university between 1980 and 1985:
  σ startYear ≥ 1980 ∧ startYear < 1985 (professor)
- To select the tuples for all employees who either work in department 4 and make over $25,000 per year, or work in department 5 and make over $30,000, the following SELECT operation is used:
  σ (Dno=4 AND Salary>25000) OR (Dno=5 AND Salary>30000) (EMPLOYEE)

The result is shown in Figure 7.1.

Fig 7.1 Results of select operation

The Boolean conditions AND, OR, and NOT have their normal interpretation, as follows:
- (cond1 AND cond2) is TRUE if both (cond1) and (cond2) are TRUE; otherwise, it is FALSE.
- (cond1 OR cond2) is TRUE if either (cond1) or (cond2) or both are TRUE; otherwise, it is FALSE.
- (NOT cond) is TRUE if cond is FALSE; otherwise, it is FALSE.

The PROJECT operation (denoted by π, pi) is used to keep only the required attributes of a relation instance and discard the others.

Notation: π<attribute list>(R)
π: project operator (read as pi)
R: relation name

Examples of project expressions
- To list each employee's first and last name and salary, the PROJECT operation is used as follows:
  π Lname, Fname, Salary (EMPLOYEE)

The result is shown in Figure 7.2.

Fig 7.2 Results of project operation

The RENAME operator is denoted by ρ (rho). It is used to rename the attributes of a relation, or the relation name, or both.
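The behaviour of SELECT and PROJECT can be mimicked on an in-memory relation with a short Python sketch. It is only an illustration under stated assumptions: a relation is modelled as a list of dictionaries, the employee rows are invented sample data rather than the contents of the figures, and the function names `select` and `project` are hypothetical.

```python
# Invented sample state of EMPLOYEE; not the data shown in the figures.
employees = [
    {"Fname": "Ann", "Lname": "Lee", "Dno": 4, "Salary": 43000},
    {"Fname": "Raj", "Lname": "Kumar", "Dno": 4, "Salary": 25000},
    {"Fname": "Mei", "Lname": "Chen", "Dno": 5, "Salary": 38000},
    {"Fname": "Tom", "Lname": "Diaz", "Dno": 5, "Salary": 30000},
    {"Fname": "Sam", "Lname": "Okoro", "Dno": 1, "Salary": 55000},
]


def select(relation, condition):
    """SELECT: keep the tuples for which the condition is true."""
    return [t for t in relation if condition(t)]


def project(relation, attributes):
    """PROJECT: keep only the listed attributes and drop duplicate tuples."""
    seen, result = set(), []
    for t in relation:
        row = tuple(t[a] for a in attributes)
        if row not in seen:
            seen.add(row)
            result.append(dict(zip(attributes, row)))
    return result


# (Dno=4 AND Salary>25000) OR (Dno=5 AND Salary>30000)
selected = select(
    employees,
    lambda t: (t["Dno"] == 4 and t["Salary"] > 25000)
    or (t["Dno"] == 5 and t["Salary"] > 30000),
)
names = project(employees, ["Lname", "Fname", "Salary"])
departments = project(employees, ["Dno"])
```

On this sample, `selected` keeps only the tuples for Lee and Chen, because a salary of exactly 25000 or 30000 does not satisfy the strict comparison. `departments` has three tuples although EMPLOYEE has five, since a projection is a relation and duplicate tuples are removed.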
The general RENAME operation ρ can be expressed in any of the following forms:
- ρS(B1, B2, ..., Bn)(R) changes both the relation name to S and the column (attribute) names to B1, B2, ..., Bn.
- ρS(R) changes the relation name only, to S.
- ρ(B1, B2, ..., Bn)(R) changes the column (attribute) names only, to B1, B2, ..., Bn.

Example of Rename operation
To rename the attributes in a relation, simply list the new attribute names in parentheses, as in the following example:
  TEMP ← σ DNO = 4 (EMPLOYEE)
  R(FN, LN, SAL) ← π FNAME, LNAME, SALARY (TEMP)

These two operations are illustrated in Figure 7.3.

Fig 7.3 Results of Rename operation

7.2 Relational Algebra Operations from Set Theory

Union operation
1. Binary operation, denoted by ∪.
2. The result of R ∪ S is a relation that includes all tuples that are either in R or in S or in both R and S.
3. Duplicate tuples are eliminated.
4. The two operand relations R and S must be "type compatible" (or UNION compatible):
   - R and S must have the same number of attributes.
   - Each pair of corresponding attributes must be type compatible (have the same domains).

Intersection operation
1. INTERSECTION is denoted by ∩.
2. The result of the operation R ∩ S is a relation that includes all tuples that are in both R and S.
3. The attribute names in the result will be the same as the attribute names in R.
4. The two operand relations R and S must be "type compatible".

Set Difference
1. SET DIFFERENCE (also called MINUS or EXCEPT) is denoted by -.
2. The result of R - S is a relation that includes all tuples that are in R but not in S.
3. The attribute names in the result will be the same as the attribute names in R.
4. The two operand relations R and S must be "type compatible".

Example of union, intersection and set difference operations

Cartesian (or Cross) Product Operation
1. This operation is used to combine tuples from two relations in a combinatorial fashion.
2. Denoted by R(A1, A2, ..., An) x S(B1, B2, ..., Bm).
3. The result is a relation Q with degree n + m: Q(A1, A2, ..., An, B1, B2, ..., Bm), with the attributes in that order.
4. The resulting relation state has one tuple for each combination of tuples, one from R and one from S.
5. Hence, if R has nR tuples and S has nS tuples, then R x S will have nR * nS tuples.
6. The two operands do NOT have to be "type compatible".
7. Example:
   FEMALE_EMPS ← σ SEX='F' (EMPLOYEE)
   EMPNAMES ← π FNAME, LNAME, SSN (FEMALE_EMPS)
   EMP_DEPENDENTS ← EMPNAMES x DEPENDENT
8. EMP_DEPENDENTS will contain every combination of tuples from EMPNAMES and DEPENDENT.
9. The operations are illustrated in Figure 7.4.

Fig 7.4 The Cartesian Product (Cross Product) operation

8. Relational DBMS (RDBMS)

It is a database management system in which the data are organized as tables of data values and all the operations on the data work on these tables.

8.1 Codd's rules

Dr. Edgar F. Codd proposed a set of 12 rules that were intended to define the important characteristics and capabilities of any relational system [Codd 1986]. The rules are listed below:

Rule | Rule Name | Description
Rule 1 | Information rule | All information is represented logically by values in tables.
Rule 2 | Guaranteed access rule | Every data value is logically accessible by a combination of table name, primary key value and column name.
Rule 3 | Missing information rule | Null values are systematically supported independent of data type.
Rule 4 | System catalogue rule | The logical description of the database is represented in the same way as ordinary data and may be interrogated by authorized users.
Rule 5 | Comprehensive data language rule | A high-level relational language that supports all of the following: data definition, view definition, data manipulation, integrity constraints, authorization, transaction boundaries.
Rule 6 | View update rule | The system should be able to perform all theoretically possible updates on views.
Rule 7 | Set level update rule | The ability to treat a whole table as a single object applies to insertion, modification and deletion, as well as retrieval of data.
Rule 8 | Physical data independence rule | User operations and application programs should be independent of any changes in physical storage.
Rule 9 | Logical data independence rule | User operations and application programs should be independent of any changes in the logical structure of base tables, provided they involve no loss of information.
Rule 10 | Integrity independence rule | Entity and referential integrity constraints should be defined in the high-level relational language, not by application programs.
Rule 11 | Distribution independence rule | User operations and application programs should be independent of the location of data when it is distributed over multiple computers.
Rule 12 | Non-subversion rule | If a low-level procedural language is supported, it must not be able to subvert integrity or security constraints expressed in the high-level relational language.

9. Entity-Relationship model

Entity-Relationship (ER) model: a popular high-level conceptual data model.
ER diagrams: the diagrammatic notation associated with the ER model.
Entity: a thing in the real world with an independent existence.
Attributes: particular properties that describe an entity. For example, an EMPLOYEE entity may be described by the employee's name, age, address, salary, and job.

Several types of attributes occur in the ER model: simple, composite, single-valued, multivalued, stored, and derived.

Simple or atomic attributes: Attributes that are not divisible.
Composite attributes: Attributes that can be divided into smaller subparts, which represent more basic attributes with independent meanings. Composite attributes can form a hierarchy.
Example: The Address attribute of the EMPLOYEE entity can be subdivided into Street_address, City, State, and Zip.

Fig 9.1 A hierarchy of composite attributes

Single-valued attributes: Attributes that have a single value for a particular entity. For example, the Age of a person.
Multivalued attributes: An attribute that can have a set of values for the same entity. A multivalued attribute may have lower and upper bounds to constrain the number of values allowed for each individual entity.
Stored versus derived attributes: In some cases two (or more) attribute values are related. Example: the Age and Birth_date attributes of a person. For a particular person entity, the value of Age can be determined from the current (today's) date and the value of that person's Birth_date. The Age attribute is called a derived attribute and is said to be derivable from the Birth_date attribute, which is called a stored attribute.

Entity type: A collection (or set) of entities that have the same attributes.

Fig 9.2 Two entity types, EMPLOYEE and COMPANY, and some member entities of each

Key or uniqueness constraint: Attributes whose values are distinct for each individual entity in the entity set.
Key attribute: The uniqueness property must hold for every entity set of the entity type.
Value sets (or domains of values): Specify the set of values that may be assigned to an attribute for each individual entity.
Relationship: Exists whenever an attribute of one entity type refers to another entity type. Such references should be represented as relationships, not as attributes.

Relationship Types, Sets, and Instances
Relationship type R among n entity types E1, E2, ..., En: Defines a set of associations among entities from these entity types.
Relationship instances ri: Each ri associates n individual entities (e1, e2, ..., en), and each entity ej in ri is a member of entity set Ej.

Relationship Degree
Degree of a relationship type: the number of participating entity types.
A relationship type of degree two is called binary, and one of degree three is called ternary.
Relationships as attributes: A binary relationship type can also be thought of in terms of attributes.

Fig 9.3 Some instances in the WORKS_FOR relationship set, which represents a relationship type WORKS_FOR between EMPLOYEE and DEPARTMENT

Role names: A role name signifies the role that a participating entity plays in each relationship instance.
Recursive relationships: The same entity type participates more than once in a relationship type, in different roles.
Cardinality ratio for a binary relationship: Specifies the maximum number of relationship instances that an entity can participate in.
Participation constraint: Specifies whether the existence of an entity depends on its being related to another entity. Types: total and partial.

Attributes of Relationship Types
- Attributes of 1:1 or 1:N relationship types can be migrated to one of the entity types.
- For a 1:N relationship type, a relationship attribute can be migrated only to the entity type on the N-side of the relationship.
- For M:N relationship types, some attributes may be determined by the combination of participating entities. Such attributes must be specified as relationship attributes.

Weak Entity Types
- Do not have key attributes of their own.
- Are identified by being related to specific entities from another entity type.
- Regular entity types that do have a key attribute are called strong entity types.
- Identifying relationship of the weak entity type: the relationship type that relates a weak entity type to its owner.

Summary of the notation for ER diagram:

Fig 9.4 ER Design for the COMPANY Database

10. Functional dependencies

1. The whole database is described by a single universal relation schema R = {A1, A2, ..., An}.
2. Definition: A functional dependency, denoted by X → Y, between two sets of attributes X and Y that are subsets of R specifies a constraint on the possible tuples that can form a relation state r of R.
3. The constraint is that, for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y].
4. The values of the Y component of a tuple in r depend on, or are determined by, the values of the X component.
5. The values of the X component of a tuple uniquely (or functionally) determine the values of the Y component.
6. We say that there is a functional dependency (FD or f.d.) from X to Y, or that Y is functionally dependent on X.
7. X functionally determines Y in a relation schema R if, and only if, whenever two tuples of r(R) agree on their X-value, they must necessarily agree on their Y-value. Note the following:
   - If a constraint on R states that there cannot be more than one tuple with a given X-value in any relation instance r(R), that is, X is a candidate key of R, then X → Y holds for any subset of attributes Y of R.
   - If X → Y in R, this does not say whether or not Y → X in R.
8. A functional dependency is a property of the semantics or meaning of the attributes.
9. Whenever the semantics of two sets of attributes in R indicate that a functional dependency should hold, the dependency is specified as a constraint.
10. Relation extensions r(R) that satisfy the functional dependency constraints are called legal relation states (or legal extensions) of R.

Fig 10.1 Relation schema EMP_PROJ

11. Consider the relation schema EMP_PROJ in Figure 10.1. From the semantics of the attributes and the relation, the following functional dependencies should hold:
   - Ssn → Ename
   - Pnumber → {Pname, Plocation}
   - {Ssn, Pnumber} → Hours
These functional dependencies specify that:
- The value of an employee's Social Security number (Ssn) uniquely determines the employee name (Ename).
- The value of a project's number (Pnumber) uniquely determines the project name (Pname) and location (Plocation).
- A combination of Ssn and Pnumber values uniquely determines the number of hours the employee currently works on the project per week (Hours).

Alternatively, Ename is functionally determined by (or functionally dependent on) Ssn.

10.1 Normal Forms Based on Primary Keys

10.1.1 Normalization of Relations:
The normalization process, as first proposed by Codd (1972a), takes a relation schema through a series of tests to certify whether it satisfies a certain normal form.

10.1.2 Normalization of data:
1. It can be considered a process of analyzing the given relation schemas based on their FDs and primary keys to achieve the desirable properties of (1) minimizing redundancy and (2) minimizing the insertion, deletion, and update anomalies.
2. Unsatisfactory relation schemas that do not meet certain conditions (the normal form tests) are decomposed into smaller relation schemas that meet the tests and hence possess the desirable properties.
3. Definition: The normal form of a relation refers to the highest normal form condition that it meets, and hence indicates the degree to which it has been normalized.
4. Normalization must also confirm the existence of two additional properties:
5. The nonadditive join or lossless join property, which guarantees that the spurious tuple generation problem does not occur with respect to the relation schemas created after decomposition.
6. The dependency preservation property, which ensures that each functional dependency is represented in some individual relation resulting after decomposition.

10.1.3 Denormalization:
It is the process of storing the join of higher normal form relations as a base relation, which is in a lower normal form.
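The nonadditive join property, and the idea of storing a join as a base relation, can be made concrete with a small Python sketch over the EMP_PROJ attributes. Everything beyond the attribute names is an assumption for illustration: the three sample tuples are invented, relations are modelled as lists of dictionaries, and the helper names `project`, `natural_join` and `as_set` are hypothetical.

```python
# Invented sample state of EMP_PROJ.
emp_proj = [
    {"Ssn": "111", "Pnumber": 1, "Hours": 32.5, "Ename": "Smith",
     "Pname": "ProductX", "Plocation": "Bellaire"},
    {"Ssn": "111", "Pnumber": 2, "Hours": 7.5, "Ename": "Smith",
     "Pname": "ProductY", "Plocation": "Sugarland"},
    {"Ssn": "222", "Pnumber": 2, "Hours": 20.0, "Ename": "Wong",
     "Pname": "ProductY", "Plocation": "Sugarland"},
]


def project(relation, attributes):
    """Projection without duplicate tuples."""
    rows = {tuple(t[a] for a in attributes) for t in relation}
    return [dict(zip(attributes, row)) for row in sorted(rows)]


def natural_join(r, s):
    """Combine tuples that agree on all attributes the relations share."""
    result = []
    for t in r:
        for u in s:
            common = t.keys() & u.keys()
            if all(t[a] == u[a] for a in common):
                result.append({**t, **u})
    return result


def as_set(relation):
    """Order-independent form of a relation, used for comparison."""
    return {frozenset(t.items()) for t in relation}


# Decomposition guided by the functional dependencies of EMP_PROJ.
works_on = project(emp_proj, ["Ssn", "Pnumber", "Hours"])
employee = project(emp_proj, ["Ssn", "Ename"])
project_info = project(emp_proj, ["Pnumber", "Pname", "Plocation"])
rejoined = natural_join(natural_join(works_on, employee), project_info)

# A poor decomposition that joins on Plocation, which is not a key.
emp_locs = project(emp_proj, ["Ename", "Plocation"])
emp_proj1 = project(emp_proj, ["Ssn", "Pnumber", "Hours", "Pname", "Plocation"])
bad_join = natural_join(emp_locs, emp_proj1)
```

On this sample, `rejoined` contains exactly the three original tuples, so the first decomposition is nonadditive, and storing `rejoined` as a table would be a denormalized design. `bad_join` contains five tuples: the two extra ones are spurious, for example a tuple that pairs Wong with the Ssn of Smith.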
10.1.4 Definitions of Keys and Attributes Participating in Keys
1. A key K is a superkey with the additional property that removal of any attribute from K will cause K not to be a superkey any more.
2. If a relation schema has more than one key, each is called a candidate key.
3. One of the candidate keys is arbitrarily designated to be the primary key, and the others are called secondary keys.
4. An attribute of relation schema R is called a prime attribute of R if it is a member of some candidate key of R.
5. An attribute is called nonprime if it is not a prime attribute, that is, if it is not a member of any candidate key.
10.2 First Normal Form
First normal form (1NF) states that the domain of an attribute must include only atomic (simple, indivisible) values and that the value of any attribute in a tuple must be a single value from the domain of that attribute. It disallows having a set of values, a tuple of values, or a combination of both as an attribute value for a single tuple.
Fig 10.2. A relation schema that is not in 1NF
Fig 10.3. Sample state of relation DEPARTMENT
Fig 10.4. 1NF version of the same relation with redundancy
The schema in Fig 10.2 is not in 1NF because Dlocations is not an atomic attribute. There are three main techniques to achieve first normal form:
First technique:
1. Remove the attribute Dlocations and place it in a separate relation DEPT_LOCATIONS, along with the primary key Dnumber.
2. The primary key of this relation is the combination {Dnumber, Dlocation}.
3. A distinct tuple in DEPT_LOCATIONS exists for each location of a department.
4. This decomposes the non-1NF relation into two 1NF relations.
Second technique:
1. Expand the key so that there will be a separate tuple in the original DEPARTMENT relation for each location of a department.
2. The primary key becomes the combination {Dnumber, Dlocation}.
3. Disadvantage: introduces redundancy in the relation.
Third technique:
1. 
If a maximum number of values is known for the attribute (for example, if it is known that at most three locations can exist for a department), replace the Dlocations attribute by three atomic attributes: Dlocation1, Dlocation2, and Dlocation3.
2. Disadvantage: introduces NULL values if most departments have fewer than three locations.
The first technique is considered best because it does not suffer from redundancy and it is completely general, having no limit placed on a maximum number of values.
10.3 Second Normal Form
1. It is based on the concept of full functional dependency.
2. A functional dependency X → Y is a full functional dependency if removal of any attribute A from X means that the dependency does not hold any more.
3. A functional dependency X → Y is a partial dependency if some attribute A ∈ X can be removed from X and the dependency still holds.
Definition: A relation schema R is in 2NF if every nonprime attribute A in R is fully functionally dependent on the primary key of R.
4. In Fig 10.5, {Ssn, Pnumber} → Hours is a full dependency (neither Ssn → Hours nor Pnumber → Hours holds).
5. However, the dependency {Ssn, Pnumber} → Ename is partial because Ssn → Ename holds.
Fig 10.5. Relation schema EMP_PROJ
6. The EMP_PROJ relation is in 1NF but is not in 2NF.
7. The functional dependencies FD2 (Ssn → Ename) and FD3 (Pnumber → {Pname, Plocation}) make Ename, Pname, and Plocation partially dependent on the primary key {Ssn, Pnumber} of EMP_PROJ.
8. If a relation schema is not in 2NF, it can be second normalized (2NF normalized) into a number of 2NF relations.
9. In those 2NF relations, nonprime attributes are associated only with the part of the primary key on which they are fully functionally dependent.
10. The functional dependencies FD1 ({Ssn, Pnumber} → Hours), FD2, and FD3 lead to the decomposition of EMP_PROJ into the three relation schemas EP1, EP2, and EP3 shown in Fig 10.6, each of which is in 2NF.
Fig 10.6. Normalizing EMP_PROJ into 2NF relations
10.4 Third Normal Form
1. It is based on the concept of transitive dependency.
2. 
A functional dependency X → Y in a relation schema R is a transitive dependency if there exists a set of attributes Z in R that is neither a candidate key nor a subset of any key of R, and both X → Z and Z → Y hold.
3. The dependency Ssn → Dmgr_ssn is transitive through Dnumber in EMP_DEPT (Fig 10.7), because both the dependencies Ssn → Dnumber and Dnumber → Dmgr_ssn hold and Dnumber is neither a key itself nor a subset of the key of EMP_DEPT.
Fig 10.7. Relation schema EMP_DEPT
Definition: A relation schema R is in 3NF if it satisfies 2NF and no nonprime attribute of R is transitively dependent on the primary key.
The relation schema EMP_DEPT is in 2NF but not in 3NF because of the transitive dependency of Dmgr_ssn on Ssn through Dnumber. EMP_DEPT is normalized by decomposing it into the two 3NF relation schemas ED1 and ED2.
Fig 10.8. Normalizing EMP_DEPT into 3NF relations
10.5 Boyce-Codd Normal Form
Definition: A relation schema R is in BCNF if whenever a nontrivial functional dependency X → A holds in R, then X is a superkey of R.
1. Example: Consider a relation TEACH with the following dependencies:
   FD1: {Student, Course} → Instructor
   FD2: Instructor → Course
2. {Student, Course} is a candidate key for this relation, and the dependencies shown follow the pattern in Fig 10.9, with Student as A, Course as B, and Instructor as C.
Fig 10.9. A schematic relation with FDs; it is in 3NF, but not in BCNF
3. Hence this relation is in 3NF but not in BCNF: FD2 violates BCNF because Instructor is not a superkey, while 3NF is not violated because Course is a prime attribute.
4. Decomposition of this relation schema into two schemas is not straightforward because it may be decomposed into one of the three following possible pairs:
   (1) {Student, Instructor} and {Student, Course}
   (2) {Course, Instructor} and {Course, Student}
   (3) {Instructor, Course} and {Instructor, Student}
5. All three decompositions lose the functional dependency FD1. Of these, the desirable decomposition is the third, because it will not generate spurious tuples after a join.
6. 
A relation not in BCNF should be decomposed so as to meet the nonadditive join property, even if some functional dependencies are not preserved in the decomposed relations. Nonadditive decomposition is a must during normalization.
10.6 Formal definition of Multivalued Dependencies (MVD):
The MVD X →→ Y is said to hold for R(X, Y, Z) if, whenever t1 and t2 are two rows in R that have the same values for attributes X (that is, t1[X] = t2[X]), R also contains rows t3 and t4 such that:
   t3[X] = t4[X] = t1[X] = t2[X]
   t3[Y] = t1[Y] and t4[Y] = t2[Y]
   t3[Z] = t2[Z] and t4[Z] = t1[Z]
10.6.1 Fourth Normal Form
A relation schema R is in 4NF with respect to a set of dependencies F if, for every nontrivial multivalued dependency X →→ Y in F+, X is a superkey for R.
Consider the EMP relation in Fig 10.10. EMP is not in 4NF because the nontrivial MVDs Ename →→ Pname and Ename →→ Dname hold, and Ename is not a superkey of EMP.
Fig 10.10. The EMP relation with two MVDs: Ename →→ Pname and Ename →→ Dname
Decompose EMP into EMP_PROJECTS and EMP_DEPENDENTS, shown in Fig 10.11. Both EMP_PROJECTS and EMP_DEPENDENTS are in 4NF, because the MVDs Ename →→ Pname in EMP_PROJECTS and Ename →→ Dname in EMP_DEPENDENTS are trivial MVDs. No other nontrivial MVDs hold in either EMP_PROJECTS or EMP_DEPENDENTS. No FDs hold in these relation schemas either.
Fig 10.11. Decomposing the EMP relation into two 4NF relations EMP_PROJECTS and EMP_DEPENDENTS
10.7 Join Dependencies
Let A, B, C, ... be subsets of the attributes of a relation R. Then R satisfies the join dependency (JD), written *(A, B, C, ...), if and only if every legal value of R is equal to the join of its projections on A, B, C, ...
10.7.1 Definition of 5NF: A relation R is in 5NF (or project-join normal form, PJNF) if, for all join dependencies of the form *(R1, R2, ..., Rn), where each Ri is a subset of the set of attributes of R and R = R1 ∪ R2 ∪ ... ∪ Rn, at least one of the following holds:
   *(R1, R2, ..., Rn) is a trivial join dependency (i.e., one of the Ri is R).
   Every Ri is a superkey for R.
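Before the 5NF example, the formal MVD definition in 10.6 can be made concrete with a small check on one relation state. This is a sketch under stated assumptions: the function name mvd_holds and the sample EMP rows are hypothetical, and Z is taken to be all attributes of the relation outside X and Y.

```python
def mvd_holds(rows, x_attrs, y_attrs):
    """Return True if the MVD X ->> Y is satisfied by this relation state.

    rows is a list of dicts. For every pair of rows t1, t2 that agree
    on X, the state must also contain the row that combines t1[Y] with
    t2[Z]; iterating over ordered pairs covers both t3 and t4 of the
    definition.
    """
    if not rows:
        return True
    z_attrs = [a for a in rows[0] if a not in x_attrs and a not in y_attrs]

    def split(row):
        return (tuple(row[a] for a in x_attrs),
                tuple(row[a] for a in y_attrs),
                tuple(row[a] for a in z_attrs))

    present = {split(row) for row in rows}
    for x1, y1, _ in present:
        for x2, _, z2 in present:
            if x1 == x2 and (x1, y1, z2) not in present:
                return False  # required row (X, t1[Y], t2[Z]) is missing
    return True


# Hypothetical state of EMP(Ename, Pname, Dname): projects and dependents
# of an employee are independent, so every combination appears.
EMP = [
    {"Ename": "Smith", "Pname": "X", "Dname": "John"},
    {"Ename": "Smith", "Pname": "Y", "Dname": "Anna"},
    {"Ename": "Smith", "Pname": "X", "Dname": "Anna"},
    {"Ename": "Smith", "Pname": "Y", "Dname": "John"},
]

full_state_ok = mvd_holds(EMP, ["Ename"], ["Pname"])         # True
truncated_ok = mvd_holds(EMP[:3], ["Ename"], ["Pname"])      # False
```

Dropping the last row breaks the MVD because the state then contains (Smith, Y, Anna) and (Smith, X, John) but not the combination (Smith, Y, John).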
Example (5NF):
Department    Subject   Student
Comp. Sc.     CP1000    John Smith
Mathematics   MA1000    John Smith
Comp. Sc.     CP2000    Arun Kumar
Comp. Sc.     CP3000    Reena Rani
Physics       PH1000    Raymond Chew
Chemistry     CH2000    Albert Garcia
1. The above relation says that Comp. Sc. offers subjects CP1000, CP2000 and CP3000, which are taken by a variety of students.
2. No student takes all the subjects and no subject has all students enrolled in it; therefore all three fields are needed to represent the information.
3. The above relation does not show MVDs, since the attributes Subject and Student are not independent; they are related to each other, and the pairings carry significant information.
4. The relation therefore cannot be decomposed into the two relations (dept, subject) and (dept, student) without losing some important information. It can, however, be decomposed into the following three relations:
   (dept, subject)
   (dept, student)
   (subject, student)
It can be shown that this decomposition is lossless.
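The claim about the two-way and three-way decompositions can be verified on the sample state above. The sketch below uses the six rows from the table; the helper names project, natural_join and as_set are hypothetical, and the check covers only this one relation state, not every legal state of the schema.

```python
def project(rows, attrs):
    """Projection onto attrs with duplicate elimination."""
    unique = {tuple(row[a] for a in attrs) for row in rows}
    return [dict(zip(attrs, values)) for values in unique]


def natural_join(left, right):
    """Natural join of two lists of dicts on their common attributes."""
    joined = []
    for l_row in left:
        for r_row in right:
            common = l_row.keys() & r_row.keys()
            if all(l_row[a] == r_row[a] for a in common):
                joined.append({**l_row, **r_row})
    return joined


def as_set(rows, attrs):
    """Order-independent view of a relation state."""
    return {tuple(row[a] for a in attrs) for row in rows}


ATTRS = ("dept", "subject", "student")
R = [dict(zip(ATTRS, t)) for t in [
    ("Comp. Sc.", "CP1000", "John Smith"),
    ("Mathematics", "MA1000", "John Smith"),
    ("Comp. Sc.", "CP2000", "Arun Kumar"),
    ("Comp. Sc.", "CP3000", "Reena Rani"),
    ("Physics", "PH1000", "Raymond Chew"),
    ("Chemistry", "CH2000", "Albert Garcia"),
]]

dept_subject = project(R, ("dept", "subject"))
dept_student = project(R, ("dept", "student"))
subject_student = project(R, ("subject", "student"))

# Joining only two projections pairs every Comp. Sc. subject with every
# Comp. Sc. student, which creates tuples that were never in R.
two_way = natural_join(dept_subject, dept_student)
spurious = as_set(two_way, ATTRS) - as_set(R, ATTRS)

# Joining in the third projection filters those tuples out again.
three_way = natural_join(two_way, subject_student)
lossless = as_set(three_way, ATTRS) == as_set(R, ATTRS)

print(len(spurious))  # 6
print(lossless)       # True
```

The two-way join yields 12 tuples, 6 of them spurious (for example, Comp. Sc. / CP1000 / Arun Kumar), while the three-way join returns exactly the original six rows.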