Chapter 9 - Joining tables

This is not too exciting (so far). Tables pretty much look like big, fast, programmable spreadsheets with rows, columns, and commands. The power comes when we have more than one table and we can exploit the relationships between the tables.

9.1 Basic data modeling

The real power of a relational database comes when we make multiple tables and make links between those tables. The act of deciding how to break up your application data into multiple tables and establishing the relationships between those tables is called data modeling or database design. It is an art form of its own, with particular skills and experience. Its goal is to avoid really bad mistakes and to design clean and easily understood databases. Others, called database administrators, may tune performance later.

Database design starts with a picture: the design document that shows the tables and their relationships, called a data model. Data modeling is a relatively sophisticated skill and we will only introduce the most basic concepts of relational data modeling in this section. For more detail on data modeling you can start with: http://en.wikipedia.org/wiki/Relational_model

Here is an example of a data model:

01_data_model

Let’s say that for our Music application, instead of just storing the titles of tracks, we wanted to keep lists of the albums containing the tracks; the artists authoring the albums; and the musical genre of any given track, so that we could find tracks according to various search criteria.

We could build a KISS (keep it simple, stupid) Music data model having one table with each piece of data as one column in that table. It would look like this:

02_music_0

That solution does not respect basic rules: for example, “AC/DC” would occur several times in the table. This is not a good design, because

1. all those redundant strings use too much disk space in the database;
2. if we need to correct an error in the artist’s name, we would have to make the same update in several rows of the table;
3. the artist “AC/DC” is one thing in the real world and should exist only once in the database.

This duplication of string data violates the best practices of database normalization, which basically state that we should never put the same string data in the database more than once. If we need the data more than once, we create a numeric key for the data and reference the actual data using this key.

In practical terms, a string takes up a lot more space than an integer on the disk and in the memory of our computer, and takes more processor time to compare and sort. If we only have a few hundred entries, the storage and processor time hardly matter. But if we have a thousand albums in our database and possibly ten thousand tracks, it is important to be able to scan data as quickly as possible.

To build a better data model, one would start by drawing a picture of the data objects for the application and then figuring out how to represent the objects and their relationships. A few tips may help:

Basic Rule: Don’t put the same string data in twice - use a relationship instead.
When there is one thing in the “real world” there should be one copy of that thing in the database.

For each piece of information, namely

1. Len (of a track)
2. Album (containing a track)
3. Genre (a track belongs to)
4. Artist (authoring the album)
5. Track (in an album)
6. Rating (of a track)
7. Count (of a track)

we should ask whether this data is an object or an attribute of another object. In this case, one could argue that 2 to 5 are objects: they exist by themselves; 1, 6 and 7 are attributes: they exist only as characteristics of the object Track.

Once we define objects, we need to define the relationships between objects: We want to keep track of which band is the “creator” of each music track… What album does this song “belong to”?
Which album is this song related to?

So, a track belongs to an album (it is contained in the album). It also belongs to a musical genre (it has been classified in that genre). An album belongs to an artist (it has been recorded by that artist).

We then record those decisions in a diagram representing the data model, where rectangular boxes represent objects and their attributes, while arrows represent relationships between objects:

03_music_dm1

9.2 Database Normalization

How do we then transform that design into a database? There is a great deal of database theory, far too much to understand without a lot of predicate calculus. We will not study that theory but learn to apply a few simple rules:

Do not replicate data, and let each object have its own table;
Instead, reference data: point at data from one table to another to materialize relationships between objects;
Use integers for keys and for references;
Add a special key column to each table that we want to make references to. By convention many programmers call this column id.

More information: http://en.wikipedia.org/wiki/Database_normalization

For example, we could use an integer column artist_id in a table Album to reference rows identified by a column id in a table Artist.

04_intRefPattern

Those integers are called keys and help us find our way around the many tables in a database…

We have to select INTEGER PRIMARY KEY as the type of our id column, thereby indicating that we would like SQLite to manage this column and automatically assign a unique numeric key to each row we insert. We also add the keyword UNIQUE to indicate that we will not allow SQLite to insert two rows with the same value for id.

When we add UNIQUE clauses to our tables, we are communicating a set of rules that we are asking the database to enforce when we attempt to insert records. We are creating these rules as a convenience in our programs.
The rules both keep us from making mistakes and make it simpler to write code using the database.

In essence, in creating this Album table, we are modeling a “relationship” where one album “has been authored” by some artist, and representing it with a number indicating (a) that the artist and the album are connected and (b) the direction of the relationship.

9.4 Three kinds of keys

Now that we have started building a data model by putting our data into multiple linked tables, and linking the rows in those tables using keys, we need to look at some terminology around keys. There are generally three kinds of keys used in a database model.

A logical key is a key that the “real world” might use to look up a row. In our example data model, the title field in table Track is a logical key. It is the name of a musical piece, and we could indeed look up a track’s row using the title field. You will often find that it makes sense to add a UNIQUE constraint to a logical key. Since the logical key is how we look up a row from the outside world, it makes little sense to allow multiple rows with the same value in the table.

A primary key is usually a number that is assigned automatically by the database. It generally has no meaning outside the program and is only used to link rows from different tables together. When we want to look up a row in a table, searching by primary key is usually the fastest way to find it. Since primary keys are integers, they take up very little storage and can be compared or sorted very quickly. In our data model, the id fields are examples of primary keys.

A foreign key is usually a number that points to the primary key of an associated row in a different table. An example of a foreign key in our data model is the artist_id field.

We are using a naming convention of always calling the primary key field id and appending the suffix _id to the name of any field that is a foreign key.
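The three kinds of keys described above can be seen together in a small Python sketch. It is only an illustration under stated assumptions: it uses Python's standard sqlite3 module with an in-memory database, a cut-down Artist and Album pair of tables rather than the full Music schema, and it treats the artist name (instead of the track title) as the logical key.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE Artist (
    id   INTEGER PRIMARY KEY,   -- primary key, assigned by SQLite
    name TEXT UNIQUE            -- logical key (assumed here for illustration)
);
CREATE TABLE Album (
    id        INTEGER PRIMARY KEY,
    artist_id INTEGER,          -- foreign key, points at Artist.id
    title     TEXT
);
""")

cur.execute("INSERT INTO Artist (name) VALUES (?)", ("AC/DC",))

# Look the artist up by its logical key to learn its primary key.
cur.execute("SELECT id FROM Artist WHERE name = ?", ("AC/DC",))
artist_id = cur.fetchone()[0]

# Store that primary key as the foreign key of a new album row.
cur.execute("INSERT INTO Album (title, artist_id) VALUES (?, ?)",
            ("Who Made Who", artist_id))
conn.commit()

cur.execute("SELECT title, artist_id FROM Album")
album_row = cur.fetchone()
print(album_row)   # ('Who Made Who', 1)
```

The outside world only ever supplies the name; the integer keys stay inside the program and do the linking.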
Primary Key Rules

A few best practices should be mentioned:

Never use your logical key as the primary key;
Logical keys can and do change, albeit slowly;
Relationships based on matching string fields are far less efficient than relationships based on matching integers.

For example:

05_PKRules

In the table User, we have the attributes id, login, password, name, email, created_at, modified_at, login_at. login and email are logical keys because we could use either one to look up a user. We should use id as the primary key because login and email could change in the future.

Foreign Keys

A foreign key is a column in one table that contains a key pointing to the primary key of another table.

When all primary keys are integers, then all foreign keys are integers - this is good - very good
If you use strings as foreign keys - you show yourself to be an uncultured swine

Example:

06_FKRules

In the table Site, the foreign key user_id references or “points to” the primary key id of table User.

From design to tables

In order to create a relational database from a data design (or data model), we need to define tables and keys (a relational schema).
For example, from this model:

07_design2tables1

we will define the following tables and keys:

08_design2tables2

The complete relational schema for the Music database is:

09_design2tables3

9.5 Defining the database in DB Browser for SQLite

We can now use DB Browser for SQLite to create the Music database by executing the following SQL instructions:

CREATE TABLE `Artist` (
    `id` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    `name` TEXT
);

CREATE TABLE `Genre` (
    `id` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    `name` TEXT
);

CREATE TABLE `Album` (
    `id` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    `artist_id` INTEGER,
    `title` TEXT
);

CREATE TABLE `Track` (
    `id` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    `album_id` INTEGER,
    `genre_id` INTEGER,
    `len` INTEGER,
    `rating` INTEGER,
    `title` TEXT,
    `count` INTEGER
);

and these to add test data to the database:

Artists:

INSERT INTO Artist (name) VALUES ('Led Zeppelin');
INSERT INTO Artist (name) VALUES ('AC/DC');

Genres:

INSERT INTO Genre (name) VALUES ('Rock');
INSERT INTO Genre (name) VALUES ('Metal');

Albums:

INSERT INTO Album (title, artist_id) VALUES ('Who Made Who', 2);
INSERT INTO Album (title, artist_id) VALUES ('IV', 1);

Tracks:

INSERT INTO Track (title, rating, len, count, album_id, genre_id) VALUES ('Black Dog', 5, 297, 0, 2, 1);
INSERT INTO Track (title, rating, len, count, album_id, genre_id) VALUES ('Stairway', 5, 482, 0, 2, 1);
INSERT INTO Track (title, rating, len, count, album_id, genre_id) VALUES ('About to Rock', 5, 313, 0, 1, 2);
INSERT INTO Track (title, rating, len, count, album_id, genre_id) VALUES ('Who Made Who', 5, 207, 0, 1, 2);

We have built relationships into our data:

10_music_relationships

9.6 Using JOIN to retrieve data

Now that we have followed the rules of database normalization and have data separated into four tables, linked together using primary and foreign keys, we need to be able to build a SELECT that re-assembles the data across the tables.
By removing the replicated data and replacing it with references to a single copy of each bit of data, we build a “web” of information that the relational database can read through very quickly - even for very large amounts of data.

Often when you want some data, it comes from a number of tables linked by these foreign keys. SQL uses the JOIN clause to re-connect these tables. In the JOIN clause you specify the fields that are used to re-connect the rows between the tables.

The following is an example of a SELECT with a JOIN clause:

SELECT *
FROM Album JOIN Artist
ON Album.artist_id = Artist.id
WHERE Artist.id = 1

The JOIN clause indicates that the fields we are selecting cross both the Artist and Album tables. The ON clause indicates how the two tables are to be joined: take the rows from Album and append the row from Artist where the field artist_id in Album is the same as the id value in the Artist table.

The result of the JOIN is to create extra-long “meta-rows” which have both the fields from Artist and the matching fields from Album. Where there is more than one match between the id field from Artist and the artist_id from Album, JOIN creates a meta-row for each of the matching pairs of rows, duplicating data as needed.

More information: http://en.wikipedia.org/wiki/Join_(SQL)

The JOIN operation links across several tables as part of a select operation. You must tell the JOIN how to use the keys that make the connection between the tables using an ON clause.
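The meta-rows described above can be observed from Python. The following is a minimal sketch, assuming Python's standard sqlite3 module and an in-memory database; it recreates only the Artist and Album tables in simplified form, and the third album ('Back in Black') is a row added here purely so that one artist matches more than one album.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified tables; 'Back in Black' is an extra illustrative row
# that is not part of the chapter's test data.
cur.executescript("""
CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Album  (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT);
INSERT INTO Artist (name) VALUES ('Led Zeppelin');
INSERT INTO Artist (name) VALUES ('AC/DC');
INSERT INTO Album (title, artist_id) VALUES ('Who Made Who', 2);
INSERT INTO Album (title, artist_id) VALUES ('IV', 1);
INSERT INTO Album (title, artist_id) VALUES ('Back in Black', 2);
""")

# One result row per matching (Album, Artist) pair.
cur.execute("""
    SELECT Album.title, Artist.name
    FROM Album JOIN Artist ON Album.artist_id = Artist.id
    ORDER BY Album.id
""")
rows = cur.fetchall()
for title, name in rows:
    print(title, "/", name)
```

The string 'AC/DC' is stored once in Artist but appears in two result rows: the duplication exists only in the query output, not in the stored data.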
Example: a list of album titles and the names of the artists who recorded them:

SELECT Album.title, Artist.name      -- What we want to see
FROM Album JOIN Artist               -- The tables which hold the data
ON Album.artist_id = Artist.id       -- How the tables are linked

11_music_join1

It can get complex if we need to join more than two tables:

SELECT Track.title, Artist.name,                 -- What we want to see
       Album.title, Genre.name
FROM Track JOIN Album JOIN Artist JOIN Genre     -- The tables which hold the data
ON Track.genre_id = Genre.id                     -- How the tables are linked
   AND Track.album_id = Album.id
   AND Album.artist_id = Artist.id

12_music_join2

9.7 Complexity Enables Speed

Complexity makes speed possible and allows you to get very fast results as the data size grows. By normalizing the data and linking it with integer keys, the overall amount of data which the relational database must scan is far lower than if the data were simply flattened out. It might seem like a tradeoff - spend some time designing your database so it continues to be fast when your application is a success.

9.8 Additional SQL Topics

The following topics are useful but will not be studied in this course:

Indexes improve access performance for things like string fields.
Constraints on data (cannot be NULL, etc.).
Transactions allow SQL operations to be grouped and done as a unit.

9.8 Summary

This chapter has covered a lot of ground to give you an overview of the basics of using a database in Python. It is more complicated to write code that stores data in a database than code that uses Python dictionaries or flat files, so there is little reason to use a database unless your application truly needs the capabilities of a database.
The situations where a database can be quite useful are: (1) when your application needs to make many small random updates within a large data set, (2) when your data is so large that it cannot fit in a text file and you need to look up information repeatedly, or (3) when you have a long-running process that you want to be able to stop and restart while retaining the data from one run to the next.

You can build a simple database with a single table to suit many application needs, but most problems will require several tables and links/relationships between rows in different tables. When you start making links between tables, it is important to do some thoughtful design and follow the rules of database normalization to make the best use of the database’s capabilities. Since the primary motivation for using a database is that you have a large amount of data to deal with, it is important to model your data efficiently so your programs run as fast as possible.

9.9 Glossary

constraint: A rule that we tell the database to enforce on a field or a row in a table. A common constraint is to insist that there can be no duplicate values in a particular field (i.e. all the values must be unique).

foreign key: A numeric key that points to the primary key of a row in another table. Foreign keys establish relationships between rows stored in different tables.

logical key: A key that the “outside world” uses to look up a particular row. For example, in a table of user accounts, a person’s e-mail address might be a good candidate as the logical key for the user’s data.

normalization: Designing a data model so that no data is replicated. We store each item of data at one place in the database and reference it elsewhere using a foreign key.

primary key: A numeric key assigned to each row that is used to refer to one row in a table from another table. Often the database is configured to automatically assign primary keys as rows are inserted.
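As a small illustration of the constraint entry in the glossary, the sketch below shows SQLite refusing a duplicate value in a UNIQUE field. It is only a sketch under stated assumptions: it uses Python's standard sqlite3 module with an in-memory database, and a simplified Artist table in which the name column is declared UNIQUE (the chapter's own schema puts UNIQUE on id only).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Assumption for illustration: name carries a UNIQUE constraint.
cur.execute("CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

cur.execute("INSERT INTO Artist (name) VALUES (?)", ("AC/DC",))
try:
    # Second insert with the same name violates the constraint.
    cur.execute("INSERT INTO Artist (name) VALUES (?)", ("AC/DC",))
    duplicate_rejected = False
except sqlite3.IntegrityError as err:
    duplicate_rejected = True
    print("Rejected:", err)

cur.execute("SELECT COUNT(*) FROM Artist")
artist_count = cur.fetchone()[0]
print(artist_count)   # 1
```

The database, not the program, is what guarantees that “AC/DC” exists only once; the program merely has to decide what to do when the rule is triggered.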