Lecture 12: Database Design
SIMS 202: Information Organization and Retrieval
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2003
http://www.sims.berkeley.edu/academics/courses/is202/f03/
IS 202 – FALL 2003 2003.10.02 - SLIDE 1

Lecture Overview
• Review
  – Databases and Database Design
  – Database Life Cycle
  – ER Diagrams
• Database Design
• Normalization
• Discussion Questions

Models (1)
[Figure: Applications 1-4 each have an External Model and a set of conceptual requirements; the requirements feed the Conceptual Model, which leads to the Logical Model and the Internal Model]

Database System Life Cycle
[Figure: cycle of six stages: 1 Design, 2 Physical Creation, 3 Conversion, 4 Integration, 5 Operations, 6 Growth, Change, & Maintenance]

Another View of the Life Cycle
[Figure: the same six stages in an alternative layout]

Database Design Process
[Figure: database design process diagram: application external models and conceptual requirements, Conceptual Model, Logical Model, Internal Model]

Entity
• An Entity is an object in the real world (or even imaginary worlds) about which we want or need to maintain information
  – Persons (e.g., customers in a business, employees, authors)
  – Things (e.g., purchase orders, meetings, parts, companies)
[Figure: ER entity box labeled Employee]

Attributes
• Attributes are the significant properties or characteristics of an entity that help identify it and provide the information needed to interact with it or use it (this is the metadata for the entities)
[Figure: Employee entity with attributes Birthdate, Age, Name (First, Middle, Last), SSN, Projects]

Relationships
• Relationships are the associations between entities
• They can involve one or more entities and belong to particular relationship types
  – One to One
  – One to Many
  – Many to Many

Relationships
[Figure: Student Attends Class; Project, Supplier, and Part joined by a "Supplies project parts" relationship]

Types of Relationships
• Concerned only with cardinality of relationship
[Figure: three Assigned relationships in Chen ER notation: one-to-one (Employee and Truck), one-to-many and many-to-many (Employee and Project)]

More Complex Relationships
[Figure: Evaluation relationship among Manager (1/1/1), Employee (1/n/n), and Project (n/n/1); Employee Assigned to Project with labels SSN, Date, Project and cardinality 4(2-10); Manages / Is Managed By relationship on Employee (1 to n)]

Weak Entities
• Owe existence entirely to another entity
[Figure: Order Contains Order-line; attribute labels shown: Invoice #, Rep#, Part#, Quantity]

Supertype and Subtype Entities
[Figure: Employee "Is one of" Sales-rep, Clerk, Other; additional relationships Manages and Sold (Invoice)]

Many to Many Relationships
[Figure: Employee (SSN) Is Assigned Project (Proj#), redrawn with an Assignment entity (SSN, Proj#, Hours) between Employee and Project]

Database Design Process
[Figure: database design process diagram, repeated as a section divider]

Requirements Analysis
• Conceptual Requirements
  – Systems Analysis Process
    • Examine all of the information sources used in existing applications
    • Identify the characteristics of each data element
      – Numeric
      – Text
      – Date/time
      – Etc.
    • Examine the tasks carried out using the information
    • Examine results or reports created using the information

Database Design Process
[Figure: database design process diagram, repeated as a section divider]

Conceptual Design
• Conceptual Model
  – Merge the collective needs of all applications
  – Determine what Entities are being used
    • Some object about which information is to be maintained
  – What are the Attributes of those entities?
    • Properties or characteristics of the entity
    • What attributes uniquely identify the entity?
  – What are the Relationships between entities?
    • How do the entities interact with each other?
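
As a small illustration of the question "What attributes uniquely identify the entity?", the sketch below checks whether a chosen set of attributes takes a distinct combination of values for every recorded instance of an entity. The helper name `uniquely_identifies` and the Employee records are hypothetical, invented for this example; only the attribute names SSN, Name, and Birthdate echo the ER review slides. Sample data can only disprove uniqueness: whether an attribute really identifies an entity is a decision about the domain, not about the rows that happen to exist.

```python
from typing import Iterable, Mapping, Sequence


def uniquely_identifies(rows: Iterable[Mapping[str, object]],
                        attrs: Sequence[str]) -> bool:
    """Return True if no two rows share the same values for attrs."""
    seen = set()
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key in seen:
            return False  # two instances collide on these attributes
        seen.add(key)
    return True


# Hypothetical Employee instances (made-up values, for illustration only).
employees = [
    {"SSN": "111-11-1111", "Name": "Pat Lee", "Birthdate": "1980-01-01"},
    {"SSN": "222-22-2222", "Name": "Pat Lee", "Birthdate": "1975-06-30"},
    {"SSN": "333-33-3333", "Name": "Sam Roe", "Birthdate": "1980-01-01"},
]

print(uniquely_identifies(employees, ["SSN"]))                # True
print(uniquely_identifies(employees, ["Name"]))               # False
print(uniquely_identifies(employees, ["Name", "Birthdate"]))  # True
```

In this made-up data, Name alone fails because two employees share it, while SSN alone and the pair (Name, Birthdate) both pass.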
Developing a Conceptual Model
• Overall view of the database that integrates all the needed information discovered during the requirements analysis
• Elements of the Conceptual Model are represented by diagrams, Entity-Relationship or ER Diagrams, that show the meanings and relationships of those elements independent of any particular database systems or implementation details
• Can also be represented using other modeling tools (such as UML)

Database Design Process
[Figure: database design process diagram: application external models and conceptual requirements, Conceptual Model, Logical Model, Internal Model]

Logical Design
• Logical Model
  – How is each entity and relationship represented in the Data Model of the DBMS?
    • Hierarchic?
    • Network?
    • Relational?
    • Object-Oriented?
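
If the relational model is chosen, each entity typically becomes a table and a many-to-many relationship becomes a table of its own. The following is a minimal sketch using Python's built-in sqlite3 module, loosely based on the Employee / Project / Assignment diagram from the ER review. The table and column names (`employee`, `project`, `assignment`, `ssn`, `proj_no`, `hours`) and all sample rows are assumptions made for illustration, not part of the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.executescript("""
CREATE TABLE employee (
    ssn  TEXT PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE project (
    proj_no INTEGER PRIMARY KEY,
    title   TEXT NOT NULL
);
-- The many-to-many "Is Assigned" relationship becomes its own table,
-- keyed by the combination of the two entity keys.
CREATE TABLE assignment (
    ssn     TEXT    NOT NULL REFERENCES employee(ssn),
    proj_no INTEGER NOT NULL REFERENCES project(proj_no),
    hours   REAL,
    PRIMARY KEY (ssn, proj_no)
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [("111-11-1111", "Pat Lee"), ("222-22-2222", "Sam Roe")])
conn.executemany("INSERT INTO project VALUES (?, ?)",
                 [(1, "Catalog"), (2, "Archive")])
conn.executemany("INSERT INTO assignment VALUES (?, ?, ?)",
                 [("111-11-1111", 1, 10.0), ("111-11-1111", 2, 5.0),
                  ("222-22-2222", 1, 8.0)])

# Reassemble the relationship by joining through the assignment table.
rows = conn.execute("""
    SELECT e.name, p.title, a.hours
    FROM assignment AS a
    JOIN employee AS e ON e.ssn = a.ssn
    JOIN project  AS p ON p.proj_no = a.proj_no
    ORDER BY e.name, p.title
""").fetchall()
for row in rows:
    print(row)
```

The composite primary key on `assignment` is one possible design choice: it allows an employee to appear on many projects and a project to have many employees, while preventing the same pairing from being recorded twice.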
Database Design Process
[Figure: database design process diagram: application external models and conceptual requirements, Conceptual Model, Logical Model, Internal Model]

Physical Design
• Internal Model
  – Choices of index file structure
  – Choices of data storage formats
  – Choices of disk layout

Database Design Process
[Figure: database design process diagram, repeated as a section divider]

Database Application Design
• External Model
  – User views of the integrated database
  – Making the old (or updated) applications work with the new database design

Terms and Concepts
• Key
  – An attribute or set of attributes used to identify or locate records in a file
• Primary Key
  – An attribute or set of attributes that uniquely identifies each record in a file
• Candidate Key
  – An attribute or set of attributes that might be used as a primary key

Normalization
• Normalization theory is based on the observation that relations with certain properties are more effective in inserting, updating and deleting data than other sets of relations containing the same data
• Normalization is a multi-step process beginning with an “unnormalized” relation

Normal Forms
• First Normal Form (1NF)
• Second Normal Form (2NF)
• Third Normal Form (3NF)
• Boyce-Codd Normal Form (BCNF)
• Fourth Normal Form (4NF)
• Fifth Normal Form (5NF)

Normalization
[Figure: nested normal forms, from outermost to innermost]
  – 1NF: Functional dependency of nonkey attributes on the primary key; atomic values only
  – 2NF: Full functional dependency of nonkey attributes on the primary key
  – 3NF: No transitive dependency between nonkey attributes
  – Boyce-Codd and higher: All determinants are candidate keys; single multivalued dependency

Unnormalized Relations
• First step in normalization is to convert the data into a two-dimensional table
• In unnormalized relations data can repeat within a column
• (The following is a highly contrived example that actually bears only a slight resemblance to the current implementation of the Phone/Photo project database)

Unnormalized Relations
Person # | People # | Picture Date | Person Name | Person Type | Location | People | Activity | Objects | Object_Feat
1111 | 145; 311 | Oct 1, 2003; Nov 12, 2003 | John White | Student | San Francisco; Berkeley | Beth Little; Michael Diamond | Shopping; Eating | Book bag; Pasta | Blue; none
1234 | 243; 467 | Sep 25, 2003; Oct 10, 2003 | Mary Jones | Auditor | 202 South Hall; Oakland | Charles Field; Patricia Gold | Reading; Drinking | Textbook; Teacup | None; Chinese
2345 | 189 | Sep 27, 2003 | Charles Brown | Student | Sather Gate | David Rosen | Singing | none | none
4876 | 145 | Nov 5, 2003 | Hal Kane | Student | Northside | Beth Little | Shopping | Book bag | Blue
5123 | 145 | Oct 10, 2003 | Paul Kosher | Student | South Hall | Beth Little | Reading | none | none
6845 | 243 | Oct 5, 2003; Dec 15, 2003 | Ann Hood | Student | Oakland; Oakland | Charles Field; Charles Field | Eating; Shopping | Burrito; none | vegetarian; none

First Normal Form
• To move to First Normal Form a relation must contain only atomic values at each row and column
  – No repeating groups
  – A column or set of columns is called a Candidate Key when its values can uniquely identify the row in the relation

First Normal Form
Person # | People # | Picture Date | Person Name | Person Type | Location | People | Activity | Objects | Object_Feat
1111 | 145 | Oct 1, 2003 | John White | Student | San Francisco | Beth Little | Shopping | Book bag | Blue
1111 | 311 | Nov 12, 2003 | John White | Student | Berkeley | Michael Diamond | Eating | Pasta | none
1234 | 243 | Sep 25, 2003 | Mary Jones | Auditor | 202 South Hall | Charles Field | Reading | Textbook | none
1234 | 467 | Oct 10, 2003 | Mary Jones | Auditor | Oakland | Patricia Gold | Drinking | Teacup | Chinese
2345 | 189 | Sep 27, 2003 | Charles Brown | Student | Sather Gate | David Rosen | Singing | none | none
4876 | 145 | Nov 5, 2003 | Hal Kane | Student | Northside | Beth Little | Shopping | Book bag | Blue
5123 | 145 | Oct 10, 2003 | Paul Kosher | Student | South Hall | Beth Little | Reading | none | none
6845 | 243 | Oct 5, 2003 | Ann Hood | Student | Oakland | Charles Field | Eating | Burrito | Vegetarian
6845 | 243 | Dec 15, 2003 | Ann Hood | Student | Oakland | Charles Field | Shopping | none | none

1NF Storage Anomalies
• Insertion: A new person has not yet taken a picture and hence has no Picture #. Since Picture # is part of the key, we can’t insert the person
• Insertion: If People are known and likely to be photographed, but haven’t been yet, there is no way to include them in the database
• Update: If a Person changes status (e.g., Mary Jones becomes a Student) we have to change multiple rows in the database
• Deletion (type 1): Deleting a Person record may also delete all info about People in the pictures
• Deletion (type 2): When there are functional dependencies (like Object and Object_features) changing one item eliminates other information

Second Normal Form
• A relation is said to be in Second Normal Form when every nonkey attribute is fully functionally dependent on the primary key
  – That is, every nonkey attribute needs the full primary key for unique identification

Second Normal Form
Person Table
Person # | Person Name | Person Type
1111 | John White | Student
1234 | Mary Jones | Auditor
2345 | Charles Brown | Student
4876 | Hal Kane | Student
5123 | Paul Kosher | Student
6845 | Ann Hood | Student

Second Normal Form
People Table
People # | People
145 | Beth Little
189 | David Rosen
243 | Charles Field
311 | Michael Diamond
467 | Patricia Gold

Second Normal Form
Picture Table
Person # | People # | Picture Date | Location | Activity | Objects | Object_Feat
1111 | 145 | 01-Oct-03 | San Francisco | Shopping | Book bag | Blue
1111 | 311 | 12-Nov-03 | Berkeley | Eating | Pasta | none
1234 | 243 | 25-Sep-03 | 202 South Hall | Reading | Textbook | none
1234 | 467 | 10-Oct-03 | Oakland | Drinking | Teacup | Chinese
2345 | 189 | 27-Sep-03 | Sather Gate | Singing | none | none
4876 | 145 | 05-Nov-03 | Northside | Shopping | Book bag | Blue
5123 | 145 | 10-Oct-03 | South Hall | Reading | none | none
6845 | 243 | 05-Oct-03 | Oakland | Eating | Burrito | vegetarian
6845 | 243 | 15-Dec-03 | Oakland | Shopping | none | none

1NF Storage Anomalies Removed
• Insertion: Can now enter new Persons who haven’t yet taken pictures
• Insertion: Can now enter People who haven’t been photographed
• Deletion (type 1): If Charles Brown withdraws his photos the corresponding tuples from Person and Picture tables can be deleted without losing information on David Rosen
• Update: If John White takes a third picture, and has changed status (e.g., graduate), we only need to change the Person table in one place

2NF Storage Anomalies
• Insertion: Cannot enter the fact that a particular object has a particular feature unless it is associated with a particular picture
• Deletion: If John White describes some other object that Beth Little has while shopping, we lose the fact that the bookbag is blue
• Update: If the features of an object change we have to update multiple occurrences of object features

Third Normal Form
• A relation is said to be in Third Normal Form if there are no transitive functional dependencies between nonkey attributes
  – When one nonkey attribute can be determined with one or more nonkey attributes there is said to be a transitive functional dependency
• The Object_Feature column in the Picture table is determined by the Object
  – Object_Feature is transitively functionally dependent on Object, so Picture is not in 3NF

Third Normal Form
Picture Table
Person # | People # | Picture Date | Location | Activity | Objects
1111 | 145 | 01-Oct-03 | San Francisco | Shopping | Book bag
1111 | 311 | 12-Nov-03 | Berkeley | Eating | Pasta
1234 | 243 | 25-Sep-03 | 202 South Hall | Reading | Textbook
1234 | 467 | 10-Oct-03 | Oakland | Drinking | Teacup
2345 | 189 | 27-Sep-03 | Sather Gate | Singing | none
4876 | 145 | 05-Nov-03 | Northside | Shopping | Book bag
5123 | 145 | 10-Oct-03 | South Hall | Reading | none
6845 | 243 | 05-Oct-03 | Oakland | Eating | Burrito
6845 | 243 | 15-Dec-03 | Oakland | Shopping | none

Third Normal Form
Object Table
Objects | Object_Feat
Book bag | Blue
Pasta | none
Textbook | none
Teacup | Chinese
Burrito | Vegetarian

2NF Storage Anomalies Removed
• Insertion: We can now enter the fact that an object has a particular feature
• Deletion: If John White describes some other object that Beth Little has while shopping, we don’t lose the fact that the bookbag is blue
• Update: The features for each object appear only once

Boyce-Codd Normal Form
• Most 3NF relations are also BCNF relations
• A 3NF relation can fail to be in BCNF only when all of the following hold:
  – Candidate keys in the relation are composite keys (they are not single attributes)
  – There is more than one candidate key in the relation, and
  – The keys are not disjoint, that is, some attributes in the keys are common

Most 3NF Relations Are Also BCNF – Is This One?
Person # | Person Name | Person Type
1111 | John White | Student
1234 | Mary Jones | Auditor
2345 | Charles Brown | Student
4876 | Hal Kane | Student
5123 | Paul Kosher | Student
6845 | Ann Hood | Student

BCNF Relations
Person # | Person Name
1111 | John White
1234 | Mary Jones
2345 | Charles Brown
4876 | Hal Kane
5123 | Paul Kosher
6845 | Ann Hood

Person # | Person Type
1111 | Student
1234 | Auditor
2345 | Student
4876 | Student
5123 | Student
6845 | Student

Additional Issues
• Why separate Person and People?
  – They are really all People/Persons in different roles
• Shouldn’t a picture have a unique ID regardless of who is in it?
• Can’t we have multiple people in the same picture, multiple objects, etc.?
• Can’t objects have multiple characteristics?
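
One way to act on these questions is to give each picture its own identifier and to move the people who appear in it into a linking table, so that a picture can show any number of people, including none. The sketch below uses Python's built-in sqlite3 module. All table and column names (`picture`, `people`, `picture_people`, and so on) are assumptions made for illustration; picture 10 and the second person linked to picture 1 are invented to show the empty and the many-to-many cases and do not come from the lecture data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE picture (
    picture_id   INTEGER PRIMARY KEY,   -- unique ID, independent of who is shown
    person_id    INTEGER NOT NULL,      -- the person who took the picture
    picture_date TEXT    NOT NULL
);
CREATE TABLE people (
    people_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
-- Linking table: one row per (picture, person shown) pair.
CREATE TABLE picture_people (
    picture_id INTEGER NOT NULL REFERENCES picture(picture_id),
    people_id  INTEGER NOT NULL REFERENCES people(people_id),
    PRIMARY KEY (picture_id, people_id)
);
""")

conn.execute("INSERT INTO picture VALUES (1, 1111, '2003-10-01')")
conn.execute("INSERT INTO picture VALUES (10, 1111, '2003-12-20')")  # hypothetical
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [(145, "Beth Little"), (311, "Michael Diamond")])
# Picture 1 shows two people (the second link is invented for this example);
# picture 10 has no identified people at all.
conn.executemany("INSERT INTO picture_people VALUES (?, ?)",
                 [(1, 145), (1, 311)])

# LEFT JOIN keeps pictures that have no rows in the linking table.
rows = conn.execute("""
    SELECT p.picture_id, COUNT(pp.people_id) AS n_people
    FROM picture AS p
    LEFT JOIN picture_people AS pp ON pp.picture_id = p.picture_id
    GROUP BY p.picture_id
    ORDER BY p.picture_id
""").fetchall()
print(rows)  # [(1, 2), (10, 0)]
```

The same pattern (an entity table plus a two-column linking table) could be repeated for objects, activities, and locations.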
BCNF Relations
Picture # | Person # | Picture Date
1 | 1111 | 01-Oct-03
2 | 1111 | 12-Nov-03
3 | 1234 | 25-Sep-03
4 | 1234 | 10-Oct-03
5 | 2345 | 27-Sep-03
6 | 4876 | 05-Nov-03
7 | 5123 | 10-Oct-03
8 | 6845 | 05-Oct-03
9 | 6845 | 15-Dec-03

loc # | Location
1 | San Francisco
2 | Berkeley
3 | 202 South Hall
4 | Oakland
5 | Sather Gate
6 | Northside
7 | South Hall

Picture # | loc #
1 | 1
2 | 2
3 | 3
4 | 4
5 | 5
6 | 6
7 | 7
8 | 4
9 | 4

Obj # | Objects
1 | Book bag
2 | Pasta
3 | Textbook
4 | Teacup
5 | Burrito

Picture # | Obj #
1 | 1
2 | 2
3 | 3
4 | 4
6 | 1
8 | 5

Act # | Activity
1 | Shopping
2 | Eating
3 | Reading
4 | Drinking
5 | Singing

Picture # | Act #
1 | 1
2 | 2
3 | 3
4 | 4
5 | 5
6 | 1
7 | 3
8 | 2
9 | 1

Picture # | People #
1 | 145
2 | 311
3 | 243
4 | 467
5 | 189
6 | 145
7 | 145
8 | 243
9 | 243

BCNF Added Capabilities
• Can now have a picture with no (identified) people in it
• Can have multiple objects, activities, and people associated with each picture

Fourth Normal Form
• Any relation is in Fourth Normal Form if it is BCNF and any multivalued dependencies are trivial
• Eliminate non-trivial multivalued dependencies by projecting into simpler tables

Fifth Normal Form
• A relation is in 5NF if every join dependency in the relation is implied by the keys of the relation
• Implies that relations that have been decomposed in previous normal forms can be recombined via natural joins to recreate the original relation

Fifth Normal Form Relations
[Tables: the same relations and values as on the BCNF Relations slide (Picture # / Person # / Picture Date; loc # / Location; Picture # / loc #; Obj # / Objects; Picture # / Obj #; Act # / Activity; Picture # / Act #; Picture # / People #), shown together with the People Table]

Normalizing to Death
• Normalization splits database information across multiple tables
• To retrieve complete information from a normalized database, the JOIN operation must be used
• JOIN tends to be expensive in terms of processing time, and very large joins are very expensive

Questions: Brooke Maury
• Discussion Questions on Hoffer & McFadden:
• If the goal of the relational database model is to encode a ‘conceptual’ design into a logical design, is it possible that improved technology and the development of new modeling techniques will supplant the RDBMS? Specifically, what impact will XML and the development of document engineering have on organizing information in multiple normalized tables?
• Conversely, what does the relational model have that would be lost if a conceptual design was encoded in another model?
• The drive to develop the RDBMS was in part motivated by a need to minimize the space required and improve the performance of database systems by removing redundancies. What impact will very inexpensive data storage and computing power have on the relational database model and the third normal form especially?

Questions: Shane Ahern
• Discussion Questions for "Logical Database Design and the Relational Model"
• Is the normalization process described really necessary? When I design a database schema, I find that by thinking of tables in terms of the entities they represent (employees, sales, events), I avoid most of the problems that the normalization process seeks to address (e.g., salesperson and region in a Sales table: salesperson is clearly a distinct entity from sales). If the formal process described in the article is not followed, are there potential pitfalls that might lead to problems with your database schema?
• The article points out that "the relational model does not yet directly support supertype/subtype relationships." Once the tables in a relational database have been decomposed to third normal form, the database is efficient from a systems point of view, but the tables no longer provide a representation of the data that is intuitive to humans. The object-oriented model more accurately mirrors the way we think about the concepts that we wish to store in databases. So perhaps object-oriented database systems are worth considering. What about XML databases?

Questions: Arthur Law
• The three models that we have been presented with (the Entity Relationship Model, the NIAM Model, and the Object Oriented Model) all enforce a specific thought process in the organization of, and relationships between, items in a database. With all of our recent discussion of computers understanding natural language, are these methods now out of date with how we should be organizing information? Should we use artificial intelligence or learning algorithms to statistically determine the relationships between entities, or is there still value in using these models?
• Each model is approximately one decade apart in development, and a quick Google search shows that companies are using databases with one of the three models. However, as new models arise there doesn't seem to be much interest in migrating from one data model to another, which makes sense given that an organization using a given model probably finds that it works. Now with the proliferation of XML, we see more information being shared between organizations, so are we fated for an expensive and lengthy translation process between databases? Or should all DB administrators be responsible for upgrading to the latest model?

Lecture Overview
• Review
  – Databases and Database Design
  – Database Life Cycle
  – ER Diagrams
• Database Design
• Normalization
• Discussion Questions
• Next Time/Readings

Next Time
• Guest Lecture
  – Bob Glushko on XML and “Document Engineering”
• Readings on Class website
• No assigned discussion questions (but bring your questions on the readings)