Chapter 9 - Joining tables
This is not too exciting (so far). Tables pretty much look like big fast programmable spreadsheets with rows, columns, and
commands. The power comes when we have more than one table and we can exploit the relationships between the tables.
9.1 Basic data modeling
The real power of a relational database comes when we create multiple tables and link them together. The act of
deciding how to break up your application data into multiple tables and establishing the relationships between those tables
is called data modeling or database design. It is an art form of its own that requires particular skills and experience. Its goal
is to avoid really bad mistakes and to design clean, easily understood databases. Others, called database administrators, may
tune performance later. Database design starts with a picture : the design document that shows the tables and their
relationships, called a data model.
Data modeling is a relatively sophisticated skill and we will only introduce the most basic concepts of relational data modeling
in this section. For more detail on data modeling you can start with:
http://en.wikipedia.org/wiki/Relational_model
Here is an example of a data model :
01_data_model
Let’s say that for our Music application, instead of just storing track titles, we wanted to keep lists of
the albums containing the tracks ;
the artists authoring the albums ;
the musical genre of any given track ;
so that we could find tracks according to various search criteria. We could build a KISS (keep it simple, stupid) Music data
model with a single table in which each piece of data is one column. It would look like this :
02_music_0
That solution breaks a basic design rule : the same data is stored more than once. For example, “AC/DC” would occur
several times in the table. This is not a good design, because
1. all those redundant strings use too much disk space in the database ;
2. if we need to correct an error in the artist’s name, we have to make the same update in several rows of the
table ;
3. the artist “AC/DC” is one thing in the real world and should exist only once in the database.
This duplication of string data violates the best practices of database normalization, which basically state that we should
never put the same string data in the database more than once. If we need the data more than once, we create a numeric key
for the data and reference the actual data using this key.
In practical terms, a string takes up a lot more space than an integer on disk and in memory, and takes
more processor time to compare and sort. If we only have a few hundred entries, the storage and processor time hardly
matter. But if we have a thousand albums in our database and possibly ten thousand tracks, it is important to be able to
scan data as quickly as possible.
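The update problem described above can be tried out with a small sketch using Python's standard sqlite3 module. The table names FlatTrack and KeyedTrack and the misspelling 'ACDC' are made up for this illustration and are not part of the chapter's Music database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Flat design (hypothetical table): the artist name is repeated on every row.
cur.execute("CREATE TABLE FlatTrack (title TEXT, artist TEXT)")
cur.executemany(
    "INSERT INTO FlatTrack (title, artist) VALUES (?, ?)",
    [("About to Rock", "ACDC"), ("Who Made Who", "ACDC")],
)

# Keyed design: the name is stored once and referenced by an integer key.
cur.execute("CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE KeyedTrack (title TEXT, artist_id INTEGER)")
cur.execute("INSERT INTO Artist (name) VALUES ('ACDC')")
artist_id = cur.lastrowid  # key that SQLite assigned to the new artist row
cur.executemany(
    "INSERT INTO KeyedTrack (title, artist_id) VALUES (?, ?)",
    [("About to Rock", artist_id), ("Who Made Who", artist_id)],
)

# Correct the misspelled name in both designs and count the rows touched.
cur.execute("UPDATE FlatTrack SET artist = 'AC/DC' WHERE artist = 'ACDC'")
flat_rows_changed = cur.rowcount
cur.execute("UPDATE Artist SET name = 'AC/DC' WHERE name = 'ACDC'")
keyed_rows_changed = cur.rowcount

print(flat_rows_changed, keyed_rows_changed)
```

With two tracks the flat design needs two row updates and the keyed design needs one. The gap grows with every track an artist has.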
To build a better data model, one would start by drawing a picture of the data objects for the application and then figure out
how to represent the objects and their relationships. A few tips may help :
Basic Rule: Don’t put the same string data in twice - use a relationship instead.
When there is one thing in the “real world” there should be one copy of that thing in the database.
For each piece of information, namely
1. Len (of a track)
2. Album (containing a track)
3. Genre (a track belongs to)
4. Artist (authoring the album)
5. Track (in an album)
6. Rating (of a track)
7. Count (of a track)
we should ask whether it is an object or an attribute of another object. In this case, one could argue that
items 2 to 5 are objects : they exist by themselves ;
items 1, 6 and 7 are attributes : they exist only as characteristics of the object Track .
Once we define objects we need to define the relationships between objects :
We want to keep track of which band is the “creator” of each music track and which album each track “belongs to”.
So, a track belongs to an album (it is contained in the album).
It also belongs to a musical genre (it has been classified in that genre).
An album belongs to an artist (it has been recorded by that artist).
We then record those decisions in a diagram representing the data model, where rectangular boxes represent objects and
their attributes, while arrows represent relationships between objects :
03_music_dm1
9.2 Database Normalization
How do we then transform that design into a database ? There is a great deal of database theory - far too much to cover
without a heavy dose of predicate calculus. We will not study that theory but learn to apply a few simple rules :
Do not replicate data : let each object have its own table ;
Instead, reference data : point from one table to another to materialize relationships between objects ;
Use integers for keys and for references ;
Add a special key column to each table that we want to reference. By convention many programmers call this
column id .
More information :
http://en.wikipedia.org/wiki/Database_normalization
For example, we could use an integer column artist_id in a table Album to reference rows identified by a column id in a
table Artist .
04_intRefPattern
Those integers are called keys and help us find our way around the many tables in a database.
We select INTEGER PRIMARY KEY as the type of our id column, thereby indicating that we would like SQLite to
manage this column and automatically assign a unique numeric key to each row we insert. We also add the keyword UNIQUE
to state explicitly that SQLite must not accept two rows with the same value for id (a primary key is already unique, so
here the keyword is redundant but harmless).
When we add UNIQUE clauses to our tables, we are communicating a set of rules that we ask the database to enforce
whenever we attempt to insert records. We create these rules as a convenience for our programs : they both keep us
from making mistakes and make it simpler to write code that uses the database.
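As a minimal sketch of these rules in action, the following Python snippet (standard sqlite3 module, in-memory database) creates the Artist table with the column definition used in this chapter, lets SQLite assign the keys, and then tries to insert a row with an id that already exists. The artist name 'Duplicate id' is a placeholder invented for the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same column definition as the chapter's Artist table.
cur.execute(
    "CREATE TABLE Artist ("
    " id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,"
    " name TEXT)"
)

# We do not supply id: SQLite assigns it.
cur.execute("INSERT INTO Artist (name) VALUES ('Led Zeppelin')")
first_id = cur.lastrowid
cur.execute("INSERT INTO Artist (name) VALUES ('AC/DC')")
second_id = cur.lastrowid

# Forcing an id that is already taken violates the uniqueness rule.
try:
    cur.execute("INSERT INTO Artist (id, name) VALUES (1, 'Duplicate id')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

artist_count = cur.execute("SELECT COUNT(*) FROM Artist").fetchone()[0]
print(first_id, second_id, duplicate_rejected, artist_count)
```

The database itself refuses the third insert, so the program does not need its own check for duplicate keys.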
In essence, in creating this Album table, we are modeling a “relationship” where one album “has been authored” by some
artist and representing it with a number indicating that (a) the artist and the album are connected and (b) the direction of the
relationship.
9.4 Three kinds of keys
Now that we have started building a data model putting our data into multiple linked tables, and linking the rows in those
tables using keys, we need to look at some terminology around keys. There are generally three kinds of keys used in a
database model.
A logical key is a key that the “real world” might use to look up a row. In our example data model, the title field in
table Track is a logical key. It is the name of a musical piece, and we will often look up a track’s row
using the title field. It often makes sense to add a UNIQUE constraint to a logical key : since the
logical key is how we look up a row from the outside world, it makes little sense to allow multiple rows with the same
value in the table.
A primary key is usually a number that is assigned automatically by the database. It generally has no meaning outside
the program and is only used to link rows from different tables together. When we want to look up a row in a table, usually
searching for the row using the primary key is the fastest way to find a row. Since primary keys are integer numbers, they
take up very little storage and can be compared or sorted very quickly. In our data model, the id fields are examples of
primary keys.
A foreign key is usually a number that points to the primary key of an associated row in a different table. An example of a
foreign key in our data model is the artist_id field.
We are using a naming convention of always calling the primary key field id and appending the suffix _id to any field
name that is a foreign key.
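The three kinds of keys can be seen working together in a short Python sketch. It assumes a UNIQUE constraint on Artist.name (the chapter recommends this for logical keys, but the Music schema below does not declare it), and get_artist_id is a hypothetical helper written for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# name is the logical key (UNIQUE is an assumption of this sketch),
# id is the primary key, Album.artist_id is the foreign key.
cur.execute("CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
cur.execute(
    "CREATE TABLE Album (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT)"
)


def get_artist_id(name):
    """Return the primary key for a logical key, inserting the row if needed."""
    cur.execute("INSERT OR IGNORE INTO Artist (name) VALUES (?)", (name,))
    cur.execute("SELECT id FROM Artist WHERE name = ?", (name,))
    return cur.fetchone()[0]


led_id = get_artist_id("Led Zeppelin")
acdc_id = get_artist_id("AC/DC")
acdc_again = get_artist_id("AC/DC")  # same logical key, same primary key

# Store the primary key of the artist as a foreign key in Album.
cur.execute(
    "INSERT INTO Album (title, artist_id) VALUES (?, ?)",
    ("Who Made Who", acdc_id),
)
stored_fk = cur.execute("SELECT artist_id FROM Album").fetchone()[0]
print(led_id, acdc_id, acdc_again, stored_fk)
```

The outside world only ever supplies the name. The integer keys stay inside the program and the database.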
Primary Key Rules
A few best practices should be mentioned :
Never use your logical key as the primary key ;
Logical keys can and do change albeit slowly ;
Relationships based on matching string fields are far less efficient than relationships based on matching integers.
For example :
05_PKRules
in the table User , we have the attributes : id , login , password , name , email , created_at , modified_at , login_at .
login and email are logical keys because we could use either one to look up a user.
We should use id as the primary key because login and email could change in the future.
Foreign Keys
A foreign key is a column in one table that contains a key pointing to the primary key of another table.
When all primary keys are integers, then all foreign keys are integers - this is good - very good.
If you use strings as foreign keys - you show yourself to be an uncultured swine.
Example :
06_FKRules
In the table Site , the foreign key user_id references or “points to” the primary key id of table User .
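The sketch below illustrates why the integer id, and not the login, should carry the relationship. Only a few of the User attributes are included, the url column of Site is an assumption made for this example, and site_owner_login is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Reduced User table; Site.url is a made-up column for this sketch.
cur.execute(
    "CREATE TABLE User (id INTEGER PRIMARY KEY, login TEXT UNIQUE, email TEXT UNIQUE)"
)
cur.execute("CREATE TABLE Site (id INTEGER PRIMARY KEY, user_id INTEGER, url TEXT)")

cur.execute("INSERT INTO User (login, email) VALUES ('alice', 'alice@example.com')")
user_id = cur.lastrowid
cur.execute(
    "INSERT INTO Site (user_id, url) VALUES (?, ?)",
    (user_id, "https://example.com"),
)


def site_owner_login(url):
    """Follow Site.user_id to User.id and return the owner's current login."""
    cur.execute(
        "SELECT User.login FROM Site JOIN User ON Site.user_id = User.id"
        " WHERE Site.url = ?",
        (url,),
    )
    return cur.fetchone()[0]


before = site_owner_login("https://example.com")
# The logical key changes; the primary key and the foreign key do not.
cur.execute("UPDATE User SET login = 'alice2' WHERE id = ?", (user_id,))
after = site_owner_login("https://example.com")
print(before, after)
```

Because Site stores the integer user_id, renaming the login touches one row in User and nothing in Site.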
From design to tables
In order to create a relational database from a data design (or data model), we need to define tables and keys (a relational
schema). For example, from this model :
07_design2tables1
we will define the following tables and keys :
08_design2tables2
The complete relational schema for the Music database is :
09_design2tables3
9.5 Defining the database in DB Browser for SQLite
We can now use DB Browser for SQLite to create the Music database by executing
the following SQL instructions :
CREATE TABLE `Artist` (
    `id` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    `name` TEXT
);
CREATE TABLE `Genre` (
    `id` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    `name` TEXT
);
CREATE TABLE `Album` (
    `id` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    `artist_id` INTEGER,
    `title` TEXT
);
CREATE TABLE `Track` (
    `id` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    `album_id` INTEGER,
    `genre_id` INTEGER,
    `len` INTEGER,
    `rating` INTEGER,
    `title` TEXT,
    `count` INTEGER
);
and these to add test data to the database :
Artists :
INSERT INTO Artist (name) VALUES ('Led Zeppelin');
INSERT INTO Artist (name) VALUES ('AC/DC');
Genres :
INSERT INTO Genre (name) VALUES ('Rock');
INSERT INTO Genre (name) VALUES ('Metal');
Albums :
INSERT INTO Album (title, artist_id) VALUES ('Who Made Who', 2);
INSERT INTO Album (title, artist_id) VALUES ('IV', 1);
Tracks :
INSERT INTO Track (title, rating, len, count, album_id, genre_id)
VALUES ('Black Dog', 5, 297, 0, 2, 1);
INSERT INTO Track (title, rating, len, count, album_id, genre_id)
VALUES ('Stairway', 5, 482, 0, 2, 1);
INSERT INTO Track (title, rating, len, count, album_id, genre_id)
VALUES ('About to Rock', 5, 313, 0, 1, 2);
INSERT INTO Track (title, rating, len, count, album_id, genre_id)
VALUES ('Who Made Who', 5, 207, 0, 1, 2);
We have built relationships into our data :
10_music_relationships
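For readers who prefer to try this outside DB Browser for SQLite, here is a sketch that loads the same schema and test data into an in-memory database with Python's standard sqlite3 module and prints the key columns. The variable names are choices made for this illustration.

```python
import sqlite3

MUSIC_SQL = """
CREATE TABLE Artist (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, name TEXT);
CREATE TABLE Genre (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, name TEXT);
CREATE TABLE Album (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
                    artist_id INTEGER, title TEXT);
CREATE TABLE Track (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
                    album_id INTEGER, genre_id INTEGER, len INTEGER,
                    rating INTEGER, title TEXT, count INTEGER);

INSERT INTO Artist (name) VALUES ('Led Zeppelin');
INSERT INTO Artist (name) VALUES ('AC/DC');
INSERT INTO Genre (name) VALUES ('Rock');
INSERT INTO Genre (name) VALUES ('Metal');
INSERT INTO Album (title, artist_id) VALUES ('Who Made Who', 2);
INSERT INTO Album (title, artist_id) VALUES ('IV', 1);
INSERT INTO Track (title, rating, len, count, album_id, genre_id)
    VALUES ('Black Dog', 5, 297, 0, 2, 1);
INSERT INTO Track (title, rating, len, count, album_id, genre_id)
    VALUES ('Stairway', 5, 482, 0, 2, 1);
INSERT INTO Track (title, rating, len, count, album_id, genre_id)
    VALUES ('About to Rock', 5, 313, 0, 1, 2);
INSERT INTO Track (title, rating, len, count, album_id, genre_id)
    VALUES ('Who Made Who', 5, 207, 0, 1, 2);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(MUSIC_SQL)  # runs every statement in the script

# Primary keys were assigned in insertion order; foreign keys point at them.
albums = conn.execute("SELECT id, title, artist_id FROM Album ORDER BY id").fetchall()
tracks = conn.execute(
    "SELECT title, album_id, genre_id FROM Track ORDER BY id"
).fetchall()

for row in albums:
    print("Album:", row)
for row in tracks:
    print("Track:", row)
```

Note that 'Who Made Who' was inserted first and therefore receives album id 1, which is why the AC/DC tracks carry album_id 1.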
9.6 Using JOIN to retrieve data
Now that we have followed the rules of database normalization and have data separated into four tables, linked together using
primary and foreign keys, we need to be able to build a SELECT that re-assembles the data across the tables.
By removing the replicated data and replacing it with references to a single copy of each bit of data we build a “web” of
information that the relational database can read through very quickly - even for very large amounts of data. Often when you
want some data it comes from a number of tables linked by these foreign keys.
SQL uses the JOIN clause to re-connect these tables. In the JOIN clause you specify the fields that are used to re-connect
the rows between the tables.
The following is an example of a SELECT with a JOIN clause :
SELECT *
FROM Album JOIN Artist
ON Album.artist_id = Artist.id
WHERE Artist.id = 1
The JOIN clause indicates that the fields we are selecting span both the Artist and Album tables. The ON clause
indicates how the two tables are to be joined : take the rows from Album and append the row from Artist where the field
artist_id in Album is the same as the id value in the Artist table.
The result of the JOIN is a set of extra-long “meta-rows” which have both the fields from Artist and the matching fields
from Album . Where there is more than one match between the id field from Artist and the artist_id from Album ,
JOIN creates a meta-row for each of the matching pairs of rows, duplicating data as needed.
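The meta-rows can be inspected directly with a short Python sketch. It uses simplified column definitions and adds one extra album, 'Houses of the Holy', purely so that one artist matches two albums; that row is not part of the chapter's test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Album (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT);
INSERT INTO Artist (name) VALUES ('Led Zeppelin');
INSERT INTO Artist (name) VALUES ('AC/DC');
INSERT INTO Album (title, artist_id) VALUES ('Who Made Who', 2);
INSERT INTO Album (title, artist_id) VALUES ('IV', 1);
-- Extra row, added only for this sketch, so that artist 1 matches twice.
INSERT INTO Album (title, artist_id) VALUES ('Houses of the Holy', 1);
""")

cur = conn.execute("""
    SELECT *
    FROM Album JOIN Artist
    ON Album.artist_id = Artist.id
    WHERE Artist.id = 1
    ORDER BY Album.id
""")
column_count = len(cur.description)  # Album columns followed by Artist columns
meta_rows = cur.fetchall()

print(column_count)
for row in meta_rows:
    print(row)
```

Each meta-row holds the three Album fields followed by the two Artist fields, and the artist's data is repeated once per matching album.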
More information : http://en.wikipedia.org/wiki/Join_(SQL)
The JOIN operation links across several tables as part of a select operation.
You must tell the JOIN how to use the keys that make the connection between the tables using an ON clause.
Example : a list of album titles and the names of the artists who recorded them :
SELECT Album.title, Artist.name   -- What we want to see
FROM Album JOIN Artist            -- The tables which hold the data
ON Album.artist_id = Artist.id    -- How the tables are linked
11_music_join1
It can get more complex if we need to join more than two tables :
SELECT Track.title, Artist.name,       -- What we want to see
       Album.title, Genre.name
FROM Track
JOIN Album JOIN Artist JOIN Genre      -- The tables which hold the data
ON Track.genre_id = Genre.id           -- How the tables are linked
AND Track.album_id = Album.id
AND Album.artist_id = Artist.id
12_music_join2
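A common alternative way to write the same query gives each JOIN its own ON clause, which some readers find easier to follow. The sketch below (simplified column definitions, in-memory database) runs both forms and compares the results; the ORDER BY clause is added here only to make the comparison deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Genre (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Album (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT);
CREATE TABLE Track (id INTEGER PRIMARY KEY, album_id INTEGER, genre_id INTEGER,
                    len INTEGER, rating INTEGER, title TEXT, count INTEGER);
INSERT INTO Artist (name) VALUES ('Led Zeppelin'), ('AC/DC');
INSERT INTO Genre (name) VALUES ('Rock'), ('Metal');
INSERT INTO Album (title, artist_id) VALUES ('Who Made Who', 2), ('IV', 1);
INSERT INTO Track (title, rating, len, count, album_id, genre_id) VALUES
    ('Black Dog', 5, 297, 0, 2, 1),
    ('Stairway', 5, 482, 0, 2, 1),
    ('About to Rock', 5, 313, 0, 1, 2),
    ('Who Made Who', 5, 207, 0, 1, 2);
""")

# Form used in the chapter: one ON clause after the last JOIN.
chapter_query = """
SELECT Track.title, Artist.name, Album.title, Genre.name
FROM Track JOIN Album JOIN Artist JOIN Genre
ON Track.genre_id = Genre.id
AND Track.album_id = Album.id
AND Album.artist_id = Artist.id
ORDER BY Track.id
"""

# Alternative form: each JOIN states its own link.
per_join_query = """
SELECT Track.title, Artist.name, Album.title, Genre.name
FROM Track
JOIN Album  ON Track.album_id = Album.id
JOIN Artist ON Album.artist_id = Artist.id
JOIN Genre  ON Track.genre_id = Genre.id
ORDER BY Track.id
"""

chapter_rows = conn.execute(chapter_query).fetchall()
per_join_rows = conn.execute(per_join_query).fetchall()

for row in per_join_rows:
    print(row)
print(chapter_rows == per_join_rows)
```

Both forms return one row per track, with the artist, album and genre names looked up through the integer keys.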
9.7 Complexity Enables Speed
A more complex design makes speed possible : it keeps results fast as the data size grows.
By normalizing the data and linking it with integer keys, the overall amount of data which the relational database must
scan is far lower than if the data were simply flattened out.
It might seem like a tradeoff, but it pays off : spend some time designing your database so it continues to be fast when
your application is a success.
9.8 Additional SQL Topics
The following topics are useful but will not be studied in this course :
Indexes - improve access performance for things like string fields.
Constraints on data - (cannot be NULL, etc.).
Transactions - allow SQL operations to be grouped and done as a unit.
9.9 Summary
This chapter has covered a lot of ground to give you an overview of the basics of using a database in Python. It is more
complicated to write the code to use a database to store data than Python dictionaries or flat files, so there is little reason to
use a database unless your application truly needs the capabilities of a database. The situations where a database can be
quite useful are : (1) when your application needs to make many small random updates within a large data set, (2) when your
data is so large it cannot fit in a text file and you need to look up information repeatedly, or (3) when you have a long-running
process that you want to be able to stop and restart while retaining the data from one run to the next.
You can build a simple database with a single table to suit many application needs, but most problems will require several
tables and links/relationships between rows in different tables. When you start making links between tables, it is important to
do some thoughtful design and follow the rules of database normalization to make the best use of the database’s capabilities.
Since the primary motivation for using a database is that you have a large amount of data to deal with, it is important to model
your data efficiently so your programs run as fast as possible.
9.10 Glossary
constraint : When we tell the database to enforce a rule on a field or a row in a table. A common constraint is to insist that
there can be no duplicate values in a particular field (i.e. all the values must be unique).
foreign key : A numeric key that points to the primary key of a row in another table. Foreign keys establish relationships
between rows stored in different tables.
logical key : A key that the “outside world” uses to look up a particular row. For example in a table of user accounts, a
person’s e-mail address might be a good candidate as the logical key for the user’s data.
normalization : Designing a data model so that no data is replicated. We store each item of data at one place in the database
and reference it elsewhere using a foreign key.
primary key : A numeric key assigned to each row that is used to refer to one row in a table from another table. Often the
database is configured to automatically assign primary keys as rows are inserted.