An Introduction to Relational Databases. Part I
1. Entities, Tables, and Primary Keys
2. One-to-Many Relationships and Foreign Keys
3. One-to-One and Many-to-Many Relationships
Chapter 0 in the book gives some preliminary background information. Some of
this information is not particularly important, like the discussions of the network and
hierarchical database models and the discussion of the software engineering process, for
example. If I do not think the topic is important, it is not covered in these notes and
you’re not responsible for knowing anything about it or answering questions about it.
The book also brings up other topics which are important. I explain some of these topics
differently than the book does. For some I go into greater detail. For others, such as the
normal forms, I rely less on formal definitions and more on verbal descriptions. For yet
others, such as finding and correcting normal form violations, I have my own notation,
which is explained in these notes. For the topics in Chapter 0 which I cover in these
notes, the notes are supposed to be a supplement, not a substitute. You should be familiar
with the book’s explanations. You should also be familiar with my notation because
some of the assignment problems rely on it. The bottom line is that you need to be able
to answer the assignment questions, whether your source of information is the book or
these notes.
1. Entities, Tables, and Primary Keys
a. Definition of an Entity
The term “entity” refers to any individual item that can have information stored
about it in a database. This may be a person, a thing, or some sort of abstraction that
doesn’t have a physical existence.
b. Describing Entities with Attributes
In order for an entity to have information about it stored in a database, it has to be
possible to describe the entity using values, which may be words, phrases, or numeric quantities.
c. Storing Information about Entities in Tables
Information about entities is stored in rectangular tables. A table may contain
information about more than one entity of the same kind. Suppose you decide that
persons are entities that you want to store information about. The information about each
individual entity is contained in a single row in the table. The values which describe that
entity are contained in the columns of that row. This is an illustration of the general idea:
SSN          name  dob
123-45-6789  Bob   1/1/01
…
d. Table Schema Notation
This is a notation that can be used to specify the name of the table and its fields
without providing sample data. Sometimes people are tempted to give tables plural
names, like people. It turns out to be less confusing to talk about tables and their contents
if the names of the tables are singular. Here is the table specification using this naming
convention and the new notation:
Person(SSN, name, dob)
e. Parallel Sets of Terminology for Relational Tables
Table, Row, Column
File, Record, Field
Relation, Tuple, Attribute
There are three parallel sets of terminology when referring to the structure of data
mentioned above. One set has already been used above. Each row in the table contains
information about one entity. Each column contains a value describing a particular
characteristic of the entity. The characteristic described by a given column is the same
for all of the entities in the table. An older set of terminology, which is still used, refers
to files, records, and fields instead of tables, rows, and columns. Sometimes the term flat
file is used. This means that each record in the file has the same number of fields. A
more theoretical set of terminology refers to relations, tuples, and attributes, respectively.
The terms relation and tuple may be somewhat obscure, but the term attribute is very
descriptive. Each column in the table contains a value which describes an attribute of the
entity in question. In these notes all of these sets of terminology may be used at one time
or another, and they may be mixed up, referring to attributes of records, or rows of files,
for example. You may also use the terminology indiscriminately when answering
questions. It’s simply important that you know what the different terms mean.
f. Defining Types of Entities by their Attributes; Distinguishing them from other
Entities
The designer of a database has to determine what entities will have information
stored about them. Entities are conceptual in nature. For any given situation, there is not
necessarily just one correct set of entities which will describe the situation. Suppose you
want to record information about mothers and children, for example. You could regard
both mothers and children as instances of persons, and decide that the underlying entity is
a person. You could also decide that mothers and children are distinct entities in your
view of the world. This second approach is dictated if the attributes you store for
mothers and children are different.
g. Incorrect Design 1: Mixing Types of Entities in Tables
Suppose that you have decided to record different attributes for mothers and
children and that they are conceptually different entities. It may be convenient or
customary to show information about mothers and their children as outlined below, but
this is not a correct table design:
motherA  …
child1   …
child2   …
motherB  …
child1   …
…        …
This is not allowed under the relational model. The theory of relational databases
is similar to set theory in math. The rows in the tables are like elements in a set. It is also
true that the order of the elements of a set does not have any meaning. The
corresponding idea in tables is that no meaning can be implied by the order of rows. It
may be convenient for users to see the rows in some particular order, and it will be
possible to display information in the desired way. But intrinsically the order is
immaterial. If two kinds of entities are different, then they belong in different tables,
even if they have the same number of attributes. Each table has to contain information
about one kind of entity only. One table can’t contain records for two different kinds of
entities where the records are distinguished by where they are relative to other records in
the table.
h. Incorrect Design 2: Repeating Fields or Multi-valued Fields
It may also be convenient to show information as shown below, but this is also
incorrect:
motherA  child1  child1  …
motherB  child1  …
…
This is not allowed under the relational model. Different mothers may have
different numbers of children. This means that the number of attributes in a row for a
mother could vary. You may think that it would be possible to set a maximum number of
children per mother and use that to set a fixed number of columns per row. However, for
any number you choose there are two undesirable results: You may encounter a mother
who has more children than the maximum; and for all mothers who have fewer than the
maximum you have lots of wasted space for information about children. This example
emphasizes the idea of a flat file. The rows can’t be jagged. They all have to contain the
same set of values.
i. No Duplicate Records; Primary Keys
Relations are like sets in another way. Duplicate elements are not allowed in a
set, and duplicate rows are not allowed in tables. This makes perfect sense. What
purpose would it serve to store the information about a given entity more than one time?
Another way of saying that there can be no duplicate records is that all of the records in a
file are unique. In other words, when taking the values in all fields into account, there are
no records in the table which contain exactly the same set of values.
It is also customary to have a single field which uniquely identifies each record.
That is to say, there are no duplicate values for that field in the whole table. When
recording information about people, their social security number is a good example of
this. No two people are supposed to have the same social security number. If you know
their number, you have positive identification and you can look them up and find out
other information about them. It is customary, but not required, to have the unique
identifier be the first field in the table. This field is called the primary key of the table.
This picture, given previously, illustrates the general idea:
SSN          name  dob
123-45-6789  Bob   1/1/01
…
The notation for giving the name of the table and its fields can be expanded to
show which field is the primary key. One alternative is to underline the primary key
field:
Person(SSN, name, dob)
       ---
Another alternative is to explicitly mark the primary key field by following it with
the abbreviation p.k.:
Person(SSN p.k., name, dob)
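As a minimal sketch of what a primary key means in practice, the Person design can be declared in a real database system. This assumes Python's built-in sqlite3 module as a stand-in for whatever system is actually used. The column types and the second person ("Robert", born 2/2/02) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Person(SSN p.k., name, dob): SSN is declared as the primary key.
conn.execute("CREATE TABLE Person (SSN TEXT PRIMARY KEY, name TEXT, dob TEXT)")
conn.execute("INSERT INTO Person VALUES ('123-45-6789', 'Bob', '1/1/01')")

# A second row with the same SSN (hypothetical data) violates the
# primary key, so the database refuses to store it.
try:
    conn.execute("INSERT INTO Person VALUES ('123-45-6789', 'Robert', '2/2/02')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = conn.execute("SELECT SSN, name, dob FROM Person").fetchall()
print(duplicate_rejected, rows)
```

Only Bob's row remains in the table, because the primary key value has to be unique across the whole table.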
2. One-to-Many Relationships and Foreign Keys
a. All Tables in a Database are Related
A database typically consists of more than one related table. Virtually no real
database consists of only one table. The first step in designing a relational database is
determining the entities. As seen above, there may be entities, like mothers and children,
which obviously have some sort of relationship but which are not stored together in the
same table. The relationships between entities may not always be so clear, but every
table in a database has to be related to at least one other table in the database in some
way. Informally, the first step in database design is determining the entities and attributes
involved. The second step is determining the relationships among the entities.
b. The Three Kinds of Relationships
There are three kinds of relationships that can exist between entities: one-to-one
(1-1), one-to-many (1-m), and many-to-many (m-n). Each of these kinds of relationships
can be captured in a relational database design.
c. The One-to-Many Relationship; ER Notation
It turns out that the 1-m relationship is the most basic one. The mother-child relationship
is of this type, and it can be illustrated using another kind of notation, which is known as
an entity-relationship (ER) diagram. In this diagram the line between the rectangles
representing the tables shows the kind of relationship. One end of the line is forked.
This is known as a crow’s foot, and it is this end of the relationship which is “many”:
Mother ------< Child
It is also possible to include field names using this notation. In this example mid
stands for mother id and kid stands for child id:
Mother               Child
mid p.k.   ------<   kid p.k.
name                 name
…                    …
d. Foreign Keys
Capturing the relationship between two tables depends on the use of what is
called a foreign key. A foreign key is a field in one table which happens to be the
primary key field of another table. Foreign key can be abbreviated f.k. The way to
capture a 1-m relationship is to embed the primary key of the “one” table as a foreign key
in the “many” table. Continuing to use ER notation, the example above can be expanded
to show the 1-m, or primary key to foreign key relationship. The mid, the primary key of
the mother table, is embedded as a foreign key (with the same name) in the child table:
Mother               Child
mid p.k.   ------<   kid p.k.
name                 name
…                    mid f.k.
                     …
In some books the list of fields in the Child table wouldn’t explicitly show the
mid field. They are relying on the notation to indicate that the p.k. of the Mother table
would be a f.k. in the Child table. It is clearer to show all of the fields in each table
explicitly.
Another way of illustrating this relationship is as follows:
Mother        Child
              kid
mid --------> mid
In this representation the arrow graphically shows the primary key field being
embedded as a foreign key field. In some books you might see the arrow shown going in
the opposite direction. When done in that way it represents the fact that values in the
“many” table refer back to values in the “one” table.
e. A Concrete Example
The presentation so far has basically been about notation and ideas. The
underlying idea becomes clear if a concrete example is shown. Suppose that the Mother
and Child tables consist of these simple designs. Notice the use of the abbreviation f.k. to
identify the foreign key in the design:
Mother(mid, name)
Child(kid, name, mid f.k.)
Suppose that the tables contain this information:
Mother
mid  name
1    Lily
2    Matilda

Child
kid  name  mid
a    Ned   2
b    Ann   1
c    June  2
What the data in the tables show is that Matilda is the mother of Ned and June,
and Lily is the mother of Ann. Identifiers do not have to be of any particular type, either
numeric or non-numeric. In this example mid was numeric and kid was non-numeric so
that there would be no chance of confusion between the two.
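The Mother and Child example can be tried out directly. The following is a sketch, assuming Python's built-in sqlite3 module and column types chosen for illustration. It loads the data shown above and matches each child's mid foreign key to the mid primary key in the Mother table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Mother(mid p.k., name) and Child(kid p.k., name, mid f.k.)
conn.execute("CREATE TABLE Mother (mid INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE Child ("
    " kid TEXT PRIMARY KEY,"
    " name TEXT,"
    " mid INTEGER REFERENCES Mother(mid))"
)
conn.executemany("INSERT INTO Mother VALUES (?, ?)", [(1, "Lily"), (2, "Matilda")])
conn.executemany(
    "INSERT INTO Child VALUES (?, ?, ?)",
    [("a", "Ned", 2), ("b", "Ann", 1), ("c", "June", 2)],
)

# Pair every child with the mother whose primary key equals the child's foreign key.
pairs = conn.execute(
    "SELECT Mother.name, Child.name FROM Mother"
    " JOIN Child ON Child.mid = Mother.mid"
    " ORDER BY Mother.mid, Child.kid"
).fetchall()
print(pairs)
```

The result lists Lily with Ann, and Matilda with Ned and June, which is the same reading of the data given in the text.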
3. One-to-One and Many-to-Many Relationships
a. Background
The correct approach to implementing a 1-m relationship was given in the
previous section. This section will show how to implement 1-1 and m-n relationships.
The idea behind the examples will be biological mating patterns. The 1-m pattern will be
repeated using cows as an example. In a herd social structure, one bull will acquire a
harem of cows. Geese will be used to illustrate the 1-1 structure. Canada geese, for
example, typically mate for life. They will only find a new partner if their old partner
dies. Finally, chimpanzees will be used to illustrate the m-n structure. Throughout their
lives, chimpanzees may change partners and have children with different partners. Their
social structure is not based on what humans call a nuclear family.
b. The One-to-Many Relationship Again
There is just the one way of implementing a 1-m relationship. The primary key of
the “one” table is embedded as a foreign key in the “many” table. The tables below
illustrate this idea again.
Bull
bid  name
1    Ferdinand
2    Durham

Cow
cid  name   bid
a    Elsie  2
b    Bossy  1
c    Daisy  2
c. The One-to-One Relationship as a Single Table
With the 1-1 model, there is a choice to be made. Suppose that the match-up of
two entities is truly permanent. In this case, it would be possible to reanalyze the
situation and determine that the match-up itself was a base entity, and make a single table
for that. For example:
Goose Pairs
pair id  goose name  gander name
a        Gaye        Gary
b        Gabriela    Gus
c        Gladys      Gil
d. The One-to-One Relationship as Two Tables
Most of the time it is impractical to assume that the match-up is a base entity.
The design alternative is to put the two kinds of entities into two different tables and link
the tables with a primary key, foreign key pair. This is one alternative for handling the
situation:
Goose
goose id  goose name  gander id
a         Gaye        1
b         Gabriela    2
c         Gladys      3

Gander
gander id  gander name
1          Gary
2          Gus
3          Gil
There are several things happening with this example that bear some explanation.
This technique for capturing a 1-1 relationship allows for data to be included that would
make it 1-m. Every database design has underlying assumptions. The assumption here is
that the relationship is 1-1. It is up to the user to make sure that no data is entered into
the tables which would imply a 1-m relationship. There is a choice in modeling a 1-1
relationship in this way: Which primary key should be embedded in the other table as a
foreign key? In this example, the primary key of the Gander table is embedded as a
foreign key in the Goose table. It would also be possible to embed the primary key of the
Goose table as a foreign key in the Gander table. It doesn’t make a difference in this
example, but in real life, if there is any chance that in the future the relationship could
become 1-m, it is important to embed the primary key of what would become the “one”
table as a foreign key in what would become the “many” table. Another concern which
will be explained further in the future is minimizing the number of null values in tables.
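The point that this design would accept 1-m data can be seen by experiment. The sketch below assumes Python's built-in sqlite3 module and goes one step beyond the notes: it declares the foreign key column UNIQUE, which is one way a database system can be asked to enforce the 1-1 assumption instead of leaving it entirely to the user. The underscored column names and the fourth goose ("Greta") are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Gander (gander_id INTEGER PRIMARY KEY, gander_name TEXT)")

# UNIQUE on the foreign key column keeps two geese from naming the same gander.
conn.execute(
    "CREATE TABLE Goose ("
    " goose_id TEXT PRIMARY KEY,"
    " goose_name TEXT,"
    " gander_id INTEGER UNIQUE REFERENCES Gander(gander_id))"
)
conn.executemany(
    "INSERT INTO Gander VALUES (?, ?)", [(1, "Gary"), (2, "Gus"), (3, "Gil")]
)
conn.executemany(
    "INSERT INTO Goose VALUES (?, ?, ?)",
    [("a", "Gaye", 1), ("b", "Gabriela", 2), ("c", "Gladys", 3)],
)

# A fourth goose (hypothetical) paired with Gary would make the data 1-m.
try:
    conn.execute("INSERT INTO Goose VALUES ('d', 'Greta', 1)")
    one_to_many_rejected = False
except sqlite3.IntegrityError:
    one_to_many_rejected = True

goose_count = conn.execute("SELECT COUNT(*) FROM Goose").fetchone()[0]
print(one_to_many_rejected, goose_count)
```

Without the UNIQUE declaration the fourth row would be accepted, which is exactly the situation the text warns about.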
e. An Incorrect Design for the One-to-One Relationship
Beginning database designers sometimes make this mistake: They think that if
one embedded primary key is a good thing, then two embedded primary keys is a better
thing. Going back to the notation for embedding which uses arrows, this illustrates what
they try to do, which is wrong:
Gander                    Goose
Ganderid, p.k. ---------> Ganderid, f.k.
Gooseid, f.k.  <--------- Gooseid, p.k.
The problem with this design is that it’s redundant. There is only one relationship
between the tables, but it is captured twice. First of all, this is wasteful. The second flaw
in this arrangement is that it opens the possibility of mistakenly storing
conflicting data. What if Gus’s primary key is the foreign key value in the record for
Gabriela, but Gabriela’s primary key value is not the foreign key value in the record for
Gus? Another concern which will be explained further in the future is the inability to
enter data into tables if there is referential integrity between them.
f. The Many-to-Many Relationship with a "Table in the Middle"
If two kinds of entities are in a m-n relationship, in addition to the tables for the
two kinds of entities, a third table is needed. The m-n relationship is captured by two 1-m relationships. Each of the base entity tables is in a 1-m relationship with the “table in
the middle”. Using the ER modeling notation, this is how it works:
Entity1 ------< Table in the middle >------ Entity2
Here is a simple example using data:
Female chimp
fcid  name
1     Carol
2     Alice
3     Sue
4     Jill
5     Ann
6     Barb

Pairing
fcid  mcid
1     1
1     2
2     1
2     2
5     3
6     3
3     4

Male chimp
mcid  name
1     Bob
2     Ted
3     John
4     Lou
g. Cardinalities, or the Number of Entities on each End of a Relationship
This example raises several points. Notice that some of the chimps do not
participate in a pairing at all, namely female 4. Some of the chimps participate in a 1-1
relationship, namely female 3 and male 4. Some of the chimps are in a 1-m relationship,
namely male 3 and females 5 and 6. Finally, males 1 and 2 and females 1 and 2 exhibit
an m-n relationship. When referring to relationships like m-n or 1-m, unless other
assumptions are explicitly stated, it’s customary to assume that the relationship may
involve 0 or more match-ups on the “many” side of the relationship. It is also customary
to assume that the relationship may involve 0 or 1 match-ups on the "one" side of the
relationship.
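The number of match-ups on each side can be counted from the data. This is a sketch, assuming Python's built-in sqlite3 module, with table names written with underscores and column types chosen for illustration. It loads the chimp data shown earlier and counts the partners of each female and each male.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Female_chimp (fcid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE Male_chimp (mcid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE Pairing (fcid INTEGER, mcid INTEGER)")

conn.executemany(
    "INSERT INTO Female_chimp VALUES (?, ?)",
    [(1, "Carol"), (2, "Alice"), (3, "Sue"), (4, "Jill"), (5, "Ann"), (6, "Barb")],
)
conn.executemany(
    "INSERT INTO Male_chimp VALUES (?, ?)",
    [(1, "Bob"), (2, "Ted"), (3, "John"), (4, "Lou")],
)
conn.executemany(
    "INSERT INTO Pairing VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (2, 2), (5, 3), (6, 3), (3, 4)],
)

# LEFT JOIN keeps chimps that have no pairing, so their count comes out as 0.
female_counts = conn.execute(
    "SELECT f.fcid, COUNT(p.mcid) FROM Female_chimp f"
    " LEFT JOIN Pairing p ON p.fcid = f.fcid"
    " GROUP BY f.fcid ORDER BY f.fcid"
).fetchall()
male_counts = conn.execute(
    "SELECT m.mcid, COUNT(p.fcid) FROM Male_chimp m"
    " LEFT JOIN Pairing p ON p.mcid = m.mcid"
    " GROUP BY m.mcid ORDER BY m.mcid"
).fetchall()
print(female_counts)
print(male_counts)
```

Female 4 has a count of 0, female 3 and male 4 each have a count of 1, and male 3 has a count of 2, matching the cases described in the text.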
h. The Primary Key of the Table in the Middle
The second point raised by this example has to do with the primary key of the
table in the middle. As given, the pair of embedded foreign keys in the table in the
middle is unique for each row. It is not necessary to invent a new primary key for the
table. When more than one field together make a unique identifier and this is used as the
primary key, this is known as a concatenated, or compound key field. Using the notation
for table design given earlier, this m-n design can be represented as follows. Notice that
underlining is used to show that something is part of the primary key, and at the same
time the abbreviation f.k. is used to show that something is a foreign key:
Female chimp(fcid, name)
             ----
Pairing(fcid f.k., mcid f.k.)
        ----       ----
Male chimp(mcid, name)
           ----
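A concatenated key can be declared in much the same way as a single-field key. The following sketch assumes Python's built-in sqlite3 module. For brevity it leaves out the two base tables and the foreign key declarations, and it loads only the first three pairings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Both fields together form the primary key of the table in the middle.
conn.execute(
    "CREATE TABLE Pairing ("
    " fcid INTEGER,"
    " mcid INTEGER,"
    " PRIMARY KEY (fcid, mcid))"
)

# fcid 1 and mcid 1 each appear twice, which is fine: only the pair must be unique.
conn.executemany("INSERT INTO Pairing VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])

# Entering the same pair a second time violates the concatenated key.
try:
    conn.execute("INSERT INTO Pairing VALUES (1, 2)")
    repeat_rejected = False
except sqlite3.IntegrityError:
    repeat_rejected = True

pairing_count = conn.execute("SELECT COUNT(*) FROM Pairing").fetchone()[0]
print(repeat_rejected, pairing_count)
```

Neither field is unique on its own. It is the combination of values that identifies a row.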
The third point raised by this example has to do with adding a separate primary
key field to the table. This may be handy, and there is nothing wrong with it. For
example, the table in the middle might now take this form:
Pairing
pid  fcid  mcid
1    1     1
2    1     2
3    2     1
4    2     2
5    5     3
6    6     3
7    3     4
The design could then look like this:
Pairing(pid, fcid f.k., mcid f.k.)
i. Recording Historical Data
This leads to the fourth, and conceptually most interesting point. In relationships
of this kind, the different pairings may occur at different times, and this may be important
information to record. There may be different pairings of the same partners at different
times. It is also possible that the times of pairings of different partners could overlap.
This kind of information can be captured by adding a time or date field to the table in the
middle, or possibly a beginning time or date and ending time or date field. For example:
Pairing
fcid  mcid  beginning date  end date
1     1     1/1/01          1/6/01
1     2     2/2/02          2/7/02
1     1     3/3/03          3/8/03
...
As shown above, with time or date fields included, it is still reasonable to assume
that each row would be unique and that no separate primary key field would be needed.
Every field in the table is part of the key, and the design would look like this:
Pairing(fcid f.k., mcid f.k., beginning date, end date)
On the other hand, it might still be useful to have a simple primary key, in which
case the design would look like this:
Pairing(pid, fcid f.k., mcid f.k., beginning date, end date)
It is always possible that the table in the middle would also have other fields
containing useful information concerning the match-up. For example there might be a
field which contains yes or no values depending on whether the pairing resulted in
offspring. In that case, the table design would expand to this:
Pairing(pid, fcid f.k., mcid f.k., beginning date, end date, offspring)
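To see why the date fields allow the same partners to be recorded more than once, here is a sketch of the design in which every field is part of the key. It assumes Python's built-in sqlite3 module. It also assumes the dates in the example are month/day/year in the 2000s, rewritten here as year-month-day text, and the underscored column names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Pairing(fcid f.k., mcid f.k., beginning date, end date), all four fields in the key.
conn.execute(
    "CREATE TABLE Pairing ("
    " fcid INTEGER,"
    " mcid INTEGER,"
    " beginning_date TEXT,"
    " end_date TEXT,"
    " PRIMARY KEY (fcid, mcid, beginning_date, end_date))"
)
conn.executemany(
    "INSERT INTO Pairing VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2001-01-01", "2001-01-06"),
        (1, 2, "2002-02-02", "2002-02-07"),
        (1, 1, "2003-03-03", "2003-03-08"),  # same partners, later time
    ],
)

# Count how many separate periods are recorded for each pair of partners.
periods = conn.execute(
    "SELECT fcid, mcid, COUNT(*) FROM Pairing"
    " GROUP BY fcid, mcid ORDER BY fcid, mcid"
).fetchall()
print(periods)
```

With only fcid and mcid in the key, the third row would be rejected as a duplicate of the first. Including the dates in the key is what makes the repeated pairing a distinct row.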
j. An Incorrect Design for the Many-to-Many Relationship
The last remark on implementation choices for m-n relationships is the following:
Beginning designers sometimes make the same mistake here as with 1-1 relationships.
They try to capture the relationship without the table in the middle and they embed the
primary key of each of the base tables as a foreign key in the other. Not only does this
suffer from the same general problems as when applied to 1-1 relationships; it is a
deadly mistake when trying to represent m-n relationships. It is simply impossible to
record all of the relationships when each primary key has to be matched with all
corresponding records in the other table.