An Introduction to Relational Databases. Part I

1. Entities, Tables, and Primary Keys
2. One-to-Many Relationships and Foreign Keys
3. One-to-One and Many-to-Many Relationships

Chapter 0 in the book gives some preliminary background information. Some of this information is not particularly important, such as the discussions of the network and hierarchical database models and the discussion of the software engineering process. If I do not think a topic is important, it is not covered in these notes, and you’re not responsible for knowing anything about it or answering questions about it.

The book also brings up other topics which are important. I explain some of these topics differently than the book does. For some I go into greater detail. For others, such as the normal forms, I rely less on formal definitions and more on verbal descriptions. For yet others, such as finding and correcting normal form violations, I have my own notation, which is explained in these notes.

For the topics in Chapter 0 which I cover in these notes, the notes are supposed to be a supplement, not a substitute. You should be familiar with the book’s explanations. You should also be familiar with my notation because some of the assignment problems rely on it. The bottom line is that you need to be able to answer the assignment questions, whether your source of information is the book or these notes.

1. Entities

a. Definition of an Entity

The term “entity” refers to any individual item that can have information stored about it in a database. This may be a person, a thing, or some sort of abstraction that doesn’t have a physical existence.

b. Describing Entities with Attributes

For information about an entity to be stored in a database, it has to be possible to describe the entity using values, which may be words, phrases, or numeric quantities.

c. Storing Information about Entities in Tables

Information about entities is stored in rectangular tables.
A table may contain information about more than one entity of the same kind. Suppose you decide that persons are entities that you want to store information about. The information about each individual entity is contained in a single row in the table. The values which describe that entity are contained in the columns of that row. This is an illustration of the general idea:

    SSN          name  dob
    123-45-6789  Bob   1/1/01
    …

d. Table Schema Notation

This is a notation that can be used to specify the name of the table and its fields without providing sample data. Sometimes people are tempted to give tables plural names, like people. It turns out to be less confusing to talk about tables and their contents if the names of the tables are singular. Here is the table specification using this naming convention and the new notation:

    Person(SSN, name, dob)

e. Parallel Sets of Terminology for Relational Tables

    Table, Row, Column
    File, Record, Field
    Relation, Tuple, Attribute

There are three parallel sets of terminology when referring to the structure of data mentioned above. One set has already been used above. Each row in the table contains information about one entity. Each column contains a value describing a particular characteristic of the entity. The characteristic described by a given column is the same for all of the entities in the table.

An older set of terminology, which is still used, refers to files, records, and fields instead of tables, rows, and columns. Sometimes the term flat file is used. This means that each record in the file has the same number of fields.

A more theoretical set of terminology refers to relations, tuples, and attributes, respectively. The terms relation and tuple may be somewhat obscure, but the term attribute is very descriptive. Each column in the table contains a value which describes an attribute of the entity in question.
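As a minimal sketch of how the Person(SSN, name, dob) specification maps onto a real table, the following uses SQLite through Python's standard sqlite3 module. The choice of SQLite and the TEXT column types are assumptions made for illustration; the notes do not prescribe a particular database system.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# Person(SSN, name, dob): one table (file, relation) with three
# columns (fields, attributes). TEXT types are an assumption.
conn.execute("CREATE TABLE Person (SSN TEXT, name TEXT, dob TEXT)")

# One row (record, tuple) holds the values describing one entity.
conn.execute(
    "INSERT INTO Person VALUES (?, ?, ?)",
    ("123-45-6789", "Bob", "1/1/01"),
)

cursor = conn.execute("SELECT SSN, name, dob FROM Person")
columns = [d[0] for d in cursor.description]  # the column names
rows = cursor.fetchall()                      # the rows, as tuples

print(columns)
print(rows)
```

Each element of rows is a Python tuple, which lines up with the "relation, tuple, attribute" set of terms.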
In these notes all of these sets of terminology may be used at one time or another, and they may be mixed up, referring to attributes of records, or rows of files, for example. You may also use the terminology indiscriminately when answering questions. It’s simply important that you know what the different terms mean.

f. Defining Types of Entities by their Attributes; Distinguishing them from other Entities

The designer of a database has to determine what entities will have information stored about them. Entities are conceptual in nature. For any given situation, there is not necessarily just one correct set of entities which will describe the situation. Suppose you want to record information about mothers and children, for example. You could regard both mothers and children as instances of persons, and decide that the underlying entity is a person. You could also decide that mothers and children are distinct entities in your view of the world. This second approach is dictated if the attributes you store for mothers and children are different.

g. Incorrect Design 1: Mixing Types of Entities in Tables

Suppose that you have decided to record different attributes for mothers and children and that they are conceptually different entities. It may be convenient or customary to show information about mothers and their children as outlined below, but this is not a correct table design:

    motherA  …
    child1   …
    child2   …
    motherB  …
    child1   …
    …        …

This is not allowed under the relational model. The theory of relational databases is similar to set theory in math. The rows in the tables are like elements in a set. It is also true that the order of the elements of a set does not have any meaning. The corresponding idea in tables is that no meaning can be implied by the order of rows. It may be convenient for users to see the rows in some particular order, and it will be possible to display information in the desired way. But intrinsically the order is immaterial.
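The point that rows have no intrinsic order, yet can be displayed in whatever order is convenient, can be sketched with SQLite through Python's sqlite3 module. The three sample rows are hypothetical and exist only for this illustration; the order is requested at query time with ORDER BY and is not a property of the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (SSN TEXT, name TEXT, dob TEXT)")

# Hypothetical sample rows, deliberately entered in no useful order.
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?)",
    [
        ("333-33-3333", "Carol", "3/3/03"),
        ("111-11-1111", "Alice", "1/1/01"),
        ("222-22-2222", "Bob", "2/2/02"),
    ],
)

# The same stored rows, displayed in two different orders.
by_name = [r[0] for r in conn.execute(
    "SELECT name FROM Person ORDER BY name")]
by_ssn_desc = [r[0] for r in conn.execute(
    "SELECT name FROM Person ORDER BY SSN DESC")]

print(by_name)
print(by_ssn_desc)
```

Without an ORDER BY clause the database is free to return the rows in any order, which is why a design must never depend on where a row sits relative to other rows.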
If two kinds of entities are different, then they belong in different tables, even if they have the same number of attributes. Each table has to contain information about one kind of entity only. One table can’t contain records for two different kinds of entities where the records are distinguished by where they are relative to other records in the table.

h. Incorrect Design 2: Repeating Fields or Multi-valued Fields

It may also be convenient to show information as shown below, but this is also incorrect:

    motherA  child1  child1  …
    motherB  child1  …
    …

This is not allowed under the relational model. Different mothers may have different numbers of children. This means that the number of attributes in a row for a mother could vary. You may think that it would be possible to set a maximum number of children per mother and use that to set a fixed number of columns per row. However, for any number you choose there are two undesirable results: you may encounter a mother who has more children than the maximum, and for all mothers who have fewer than the maximum you have lots of wasted space for information about children. This example emphasizes the idea of a flat file. The rows can’t be jagged. They all have to contain the same set of values.

i. No Duplicate Records; Primary Keys

Relations are like sets in another way. Duplicate elements are not allowed in a set, and duplicate rows are not allowed in tables. This makes perfect sense. What purpose would it serve to store the information about a given entity more than one time? Another way of saying that there can be no duplicate records is that all of the records in a file are unique. In other words, when taking the values in all fields into account, there are no records in the table which contain exactly the same set of values.

It is also customary to have a single field which uniquely identifies each record. That is to say, there are no duplicate values for that field in the whole table.
When recording information about people, their social security number is a good example of this. No two people are supposed to have the same social security number. If you know their number, you have positive identification and you can look them up and find out other information about them. It is customary, but not required, to have the unique identifier be the first field in the table. This field is called the primary key of the table. This picture, given previously, illustrates the general idea:

    SSN          name  dob
    123-45-6789  Bob   1/1/01
    …

The notation for giving the name of the table and its fields can be expanded to show which field is the primary key. One alternative is to underline the primary key field:

    Person(SSN, name, dob)
           ---

Another alternative is to explicitly mark the primary key field by following it with the abbreviation p.k.:

    Person(SSN p.k., name, dob)

2. One-to-Many Relationships and Foreign Keys

a. All Tables in a Database are Related

A database typically consists of more than one related table. Virtually no real database consists of only one table. The first step in designing a relational database is determining the entities. As seen above, there may be entities, like mothers and children, which obviously have some sort of relationship but which are not stored together in the same table. The relationships between entities may not always be so clear, but every table in a database has to be related to at least one other table in the database in some way. Informally, the first step in database design is determining the entities and attributes involved. The second step is determining the relationships among the entities.

b. The Three Kinds of Relationships

There are three kinds of relationships that can exist between entities: one-to-one (1-1), one-to-many (1-m), and many-to-many (m-n). Each of these kinds of relationships can be captured in a relational database design.

c. The One-to-Many Relationship; ER Notation

It turns out that the 1-m relationship is the most basic one. The mother-child relationship is of this type, and it can be illustrated using another kind of notation, which is known as an entity-relationship (ER) diagram. In this diagram the line between the rectangles representing the tables shows the kind of relationship. One end of the line is forked. This is known as a crow’s foot, and it is this end of the relationship which is “many”:

    Mother  ------<  Child

It is also possible to include field names using this notation. In this example mid stands for mother id and kid stands for child id:

    Mother                 Child
    mid p.k.               kid p.k.
    name        ------<    name
    …                      …

d. Foreign Keys

Capturing the relationship between two tables depends on the use of what is called a foreign key. A foreign key is a field in one table which happens to be the primary key field of another table. Foreign key can be abbreviated f.k. The way to capture a 1-m relationship is to embed the primary key of the “one” table as a foreign key in the “many” table. Continuing to use ER notation, the example above can be expanded to show the 1-m, or primary key to foreign key relationship. The mid, the primary key of the mother table, is embedded as a foreign key (with the same name) in the child table:

    Mother                 Child
    mid p.k.               kid p.k.
    name        ------<    name
    …                      mid f.k.
                           …

In some books the list of fields in the Child table wouldn’t explicitly show the mid field. They are relying on the notation to indicate that the p.k. of the Mother table would be a f.k. in the Child table. It is clearer to show all of the fields in each table explicitly.

Another way of illustrating this relationship is as follows:

    Mother                 Child
                           kid
    mid     ---------->    mid

In this representation the arrow graphically shows the primary key field being embedded as a foreign key field. In some books you might see the arrow shown going in the opposite direction.
When done in that way it represents the fact that values in the “many” table refer back to values in the “one” table.

e. A Concrete Example

The presentation so far has basically been about notation and ideas. The underlying idea becomes clear if a concrete example is shown. Suppose that the Mother and Child tables consist of these simple designs. Notice the use of the abbreviation f.k. to identify the foreign key in the design:

    Mother(mid, name)
           ---
    Child(kid, name, mid f.k.)
          ---

Suppose that the tables contain this information:

    Mother
    mid  name
    1    Lily
    2    Matilda

    Child
    kid  name  mid
    a    Ned   2
    b    Ann   1
    c    June  2

What the data in the tables show is that Matilda is the mother of Ned and June, and Lily is the mother of Ann. Identifiers do not have to be of any particular type, either numeric or non-numeric. In this example mid was numeric and kid was non-numeric so that there would be no chance of confusion between the two.

3. One-to-One and Many-to-Many Relationships

a. Background

The correct approach to implementing a 1-m relationship was given in the previous section. This section will show how to implement 1-1 and m-n relationships. The idea behind the examples will be biological mating patterns. The 1-m pattern will be repeated using cows as an example. In a herd social structure, one bull will acquire a harem of cows. Geese will be used to illustrate the 1-1 structure. Canada geese, for example, typically mate for life. They will only find a new partner if their old partner dies. Finally, chimpanzees will be used to illustrate the m-n structure. Throughout their lives, chimpanzees may change partners and have children with different partners. Their social structure is not based on what humans call a nuclear family.

b. The One-to-Many Relationship Again

There is just the one way of implementing a 1-m relationship. The primary key of the “one” table is embedded as a foreign key in the “many” table. The tables below illustrate this idea again.
    Bull
    bid  name
    1    Ferdinand
    2    Durham

    Cow
    cid  name   bid
    a    Elsie  2
    b    Bossy  1
    c    Daisy  2

c. The One-to-One Relationship as a Single Table

With the 1-1 model, there is a choice to be made. Suppose that the match-up of two entities is truly permanent. In this case, it would be possible to reanalyze the situation and determine that the match-up itself was a base entity, and make a single table for that. For example:

    Goose Pairs
    pair id  goose name  gander name
    a        Gaye        Gary
    b        Gabriela    Gus
    c        Gladys      Gil

d. The One-to-One Relationship as Two Tables

Most of the time it is impractical to assume that the match-up is a base entity. The design alternative is to put the two kinds of entities into two different tables and link the tables with a primary key, foreign key pair. This is one alternative for handling the situation:

    Goose
    goose id  goose name  gander id
    a         Gaye        1
    b         Gabriela    2
    c         Gladys      3

    Gander
    gander id  gander name
    1          Gary
    2          Gus
    3          Gil

There are several things happening with this example that bear some explanation. This technique for capturing a 1-1 relationship allows for data to be included that would make it 1-m. Every database design has underlying assumptions. The assumption here is that the relationship is 1-1. It is up to the user to make sure that no data is entered into the tables which would imply a 1-m relationship.

There is a choice in modeling a 1-1 relationship in this way: Which primary key should be embedded in the other table as a foreign key? In this example, the primary key of the Gander table is embedded as a foreign key in the Goose table. It would also be possible to embed the primary key of the Goose table as a foreign key in the Gander table. It doesn’t make a difference in this example, but in real life, if there is any chance that in the future the relationship could become 1-m, it is important to embed the primary key of what would become the “one” table as a foreign key in what would become the “many” table.
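The two-table goose design can be sketched in SQLite through Python's sqlite3 module. The notes leave it to the user to keep the data 1-1; as one possible option that goes beyond the notes, this sketch adds a UNIQUE constraint on the foreign key so that the database itself rejects a second goose for the same gander. The column names with underscores and the extra goose "Greta" are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only checks foreign keys when this is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE Gander (gander_id INTEGER PRIMARY KEY, gander_name TEXT)"
)
conn.execute(
    """
    CREATE TABLE Goose (
        goose_id   TEXT PRIMARY KEY,
        goose_name TEXT,
        -- UNIQUE is the addition assumed here: it keeps the data 1-1
        gander_id  INTEGER UNIQUE REFERENCES Gander (gander_id)
    )
    """
)

conn.executemany(
    "INSERT INTO Gander VALUES (?, ?)",
    [(1, "Gary"), (2, "Gus"), (3, "Gil")],
)
conn.executemany(
    "INSERT INTO Goose VALUES (?, ?, ?)",
    [("a", "Gaye", 1), ("b", "Gabriela", 2), ("c", "Gladys", 3)],
)

# Follow the foreign key from Goose to Gander to list each pair.
pairs = conn.execute(
    """
    SELECT goose_name, gander_name
    FROM Goose JOIN Gander ON Goose.gander_id = Gander.gander_id
    ORDER BY goose_id
    """
).fetchall()
print(pairs)

# A second goose for gander 1 would make the data 1-m; UNIQUE blocks it.
try:
    conn.execute("INSERT INTO Goose VALUES (?, ?, ?)", ("d", "Greta", 1))
    second_goose_rejected = False
except sqlite3.IntegrityError:
    second_goose_rejected = True
print(second_goose_rejected)
```

Dropping the UNIQUE keyword gives exactly the design in the notes, where the same insert would succeed and the tables would silently describe a 1-m relationship.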
Another concern, which will be explained further in the future, is minimizing the number of null values in tables.

e. An Incorrect Design for the One-to-One Relationship

Beginning database designers sometimes make this mistake: They think that if one embedded primary key is a good thing, then two embedded primary keys are a better thing. Going back to the notation for embedding which uses arrows, this illustrates what they try to do, which is wrong:

    Gander                        Goose
    Ganderid, p.k.  ---------->   Ganderid, f.k.
    Gooseid, f.k.   <----------   Gooseid, p.k.

The problem with this design is that it’s redundant. There is only one relationship between the tables, but it is captured twice. First of all, this is wasteful. The second flaw in this arrangement is that it opens the possibility of mistakenly storing conflicting data. What if Gus’s primary key is the foreign key value in the record for Gabriela, but Gabriela’s primary key value is not the foreign key value in the record for Gus? Another concern, which will be explained further in the future, is the inability to enter data into tables if there is referential integrity between them.

f. The Many-to-Many Relationship with a "Table in the Middle"

If two kinds of entities are in an m-n relationship, in addition to the tables for the two kinds of entities, a third table is needed. The m-n relationship is captured by two 1-m relationships. Each of the base entity tables is in a 1-m relationship with the “table in the middle”. Using the ER modeling notation, this is how it works:

    Entity1  ------<  Table in the middle  >------  Entity2

Here is a simple example using data:

    Female chimp
    fcid  name
    1     Carol
    2     Alice
    3     Sue
    4     Jill
    5     Ann
    6     Barb

    Pairing
    fcid  mcid
    1     1
    1     2
    2     1
    2     2
    5     3
    6     3
    3     4

    Male chimp
    mcid  name
    1     Bob
    2     Ted
    3     John
    4     Lou

g. Cardinalities, or the Number of Entities on each End of a Relationship

This example raises several points. Notice that some of the chimps do not participate in a pairing at all, namely female 4.
Some of the chimps participate in a 1-1 relationship, namely female 3 and male 4. Some of the chimps are in a 1-m relationship, namely male 3 and females 5 and 6. Finally, males 1 and 2 and females 1 and 2 exhibit an m-n relationship. When referring to relationships like m-n or 1-m, unless other assumptions are explicitly stated, it’s customary to assume that the relationship may involve 0 or more match-ups on the “many” side of the relationship. It is also customary to assume that the relationship may involve 0 or 1 match-ups on the "one" side of the relationship.

h. The Primary Key of the Table in the Middle

The second point raised by this example has to do with the primary key of the table in the middle. As given, the pair of embedded foreign keys in the table in the middle is unique for each row. It is not necessary to invent a new primary key for the table. When more than one field together make a unique identifier and this is used as the primary key, this is known as a concatenated, or compound key field. Using the notation for table design given earlier, this m-n design can be represented as follows. Notice that underlining is used to show that something is part of the primary key, and at the same time the abbreviation f.k. is used to show that something is a foreign key:

    Female chimp(fcid, name)
                 ----
    Pairing(fcid f.k., mcid f.k.)
            ----       ----
    Male chimp(mcid, name)
               ----

The third point raised by this example has to do with adding a separate primary key field to the table. This may be handy, and there is nothing wrong with it. For example, the table in the middle might now take this form:

    Pairing
    pid  fcid  mcid
    1    1     1
    2    1     2
    3    2     1
    4    2     2
    5    5     3
    6    6     3
    7    3     4

The design could then look like this:

    Pairing(pid, fcid f.k., mcid f.k.)
            ---

i. Recording Historical Data

This leads to the fourth, and conceptually most interesting point. In relationships of this kind, the different pairings may occur at different times, and this may be important information to record.
There may be different pairings of the same partners at different times. It is also possible that the times of pairings of different partners could overlap. This kind of information can be captured by adding a time or date field to the table in the middle, or possibly a beginning time or date and ending time or date field. For example:

    Pairing
    fcid  mcid  beginning date  end date
    1     1     1/1/01          1/6/01
    1     2     2/2/02          2/7/02
    1     1     3/3/03          3/8/03
    ...

As shown above, with time or date fields included, it is still reasonable to assume that each row would be unique and that no separate primary key field would be needed. Every field in the table is part of the key, and the design would look like this:

    Pairing(fcid f.k., mcid f.k., beginning date, end date)
            ----       ----       --------------  --------

On the other hand, it might still be useful to have a simple primary key, in which case the design would look like this:

    Pairing(pid, fcid f.k., mcid f.k., beginning date, end date)
            ---

It is always possible that the table in the middle would also have other fields containing useful information concerning the match-up. For example, there might be a field which contains yes or no values depending on whether the pairing resulted in offspring. In that case, the table design would expand to this:

    Pairing(pid, fcid f.k., mcid f.k., beginning date, end date, offspring)
            ---

j. An Incorrect Design for the Many-to-Many Relationship

The last remark on implementation choices for m-n relationships is the following: Beginning designers sometimes make the same mistake here as with 1-1 relationships. They try to capture the relationship without the table in the middle and they embed the primary key of each of the base tables as a foreign key in the other. Not only does this suffer from the same general problems as when applied to 1-1 relationships; it is a deadly mistake when trying to represent m-n relationships.
A foreign key field in a row can hold only one value, so it is simply impossible to record all of the relationships when each primary key has to be matched with all of the corresponding records in the other table.
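The correct table-in-the-middle design can be sketched with the chimp data from section 3f, using SQLite through Python's sqlite3 module. The table names FemaleChimp and MaleChimp (written without spaces) and the INTEGER and TEXT column types are assumptions for illustration. The compound primary key on Pairing follows the Pairing(fcid f.k., mcid f.k.) design, without the optional pid or date fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only checks foreign keys when this is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript(
    """
    CREATE TABLE FemaleChimp (fcid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE MaleChimp   (mcid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Pairing (
        fcid INTEGER REFERENCES FemaleChimp (fcid),
        mcid INTEGER REFERENCES MaleChimp (mcid),
        PRIMARY KEY (fcid, mcid)  -- compound key: each pair appears once
    );
    """
)

conn.executemany(
    "INSERT INTO FemaleChimp VALUES (?, ?)",
    [(1, "Carol"), (2, "Alice"), (3, "Sue"),
     (4, "Jill"), (5, "Ann"), (6, "Barb")],
)
conn.executemany(
    "INSERT INTO MaleChimp VALUES (?, ?)",
    [(1, "Bob"), (2, "Ted"), (3, "John"), (4, "Lou")],
)
conn.executemany(
    "INSERT INTO Pairing VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (2, 2), (5, 3), (6, 3), (3, 4)],
)

# Each Pairing row is one match-up; two joins recover the names.
matches = conn.execute(
    """
    SELECT f.name, m.name
    FROM Pairing p
    JOIN FemaleChimp f ON f.fcid = p.fcid
    JOIN MaleChimp   m ON m.mcid = p.mcid
    ORDER BY p.fcid, p.mcid
    """
).fetchall()

# Females with no row in Pairing take part in zero match-ups.
unpaired = [row[0] for row in conn.execute(
    """
    SELECT f.name
    FROM FemaleChimp f LEFT JOIN Pairing p ON p.fcid = f.fcid
    WHERE p.fcid IS NULL
    """
)]

# The compound key rejects a repeated (fcid, mcid) pair.
try:
    conn.execute("INSERT INTO Pairing VALUES (?, ?)", (1, 1))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(matches)
print(unpaired)
print(duplicate_rejected)
```

Carol appears in two Pairing rows and so does Bob, which is what a single embedded foreign key field in either base table could not express. If repeated pairings of the same partners at different times need to be recorded, the key would have to include the date fields or be replaced by a separate pid, as described in section 3i.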