Models of Databases and Database Design

File-Based Approach: Shortcomings
1. Data redundancy – Data is duplicated in many files. Some of the duplicated data may never be accessed, yet it still takes up space.
2. Data inconsistency – Some tables or files may be skipped during an update transaction. E.g. if someone's marital status changes, the salaries department may record the change but the operations department might not. This may lead to inconsistency within the system.
3. Difficult to share – File-based systems do not allow for multiple simultaneous access.
4. Data dependence – The physical structure of the data is defined in the code, so any change to the structure requires many changes to the rest of the code.
5. Incompatibility – Data stored as files by programs written in one programming language is difficult to share with programs written in other languages.
6. Difficult to make queries or generate reports – Complex code is required to make queries and generate reports.

Database Approach
An approach whereby a pool of related data is shared.
Database: A collection of related data.
Data: Known facts that can be recorded and have implicit meaning.

History of Database Systems
- First generation: Hierarchical and Network
- Second generation: Relational
- Third generation: Object-Relational, Object-Oriented

Database Management System (DBMS)
A collection of software to support the storage, retrieval and modification of large volumes of data. Support is also provided for multiple users, along with administration tools. E.g. Oracle, SQL Server, and Access.

Functions of a DBMS

Data Storage, Retrieval and Updating

Data Dictionary
- Describes the structure and content of the DB, e.g. the names of tables, names of fields, characteristics of fields, relationships between entities, and what the user is allowed to do.

Transaction support
- Feedback to the user. Did a transaction fail? Why?

Concurrency Control
- Make sure that users aren't accessing or changing data that is being changed by another user at the same time.
- Concurrent access must be handled in a controlled manner to avoid mistakes.

Recovery Services
- The database can be recovered to some past correct state in the event of failure.
- This can be done using a system log, which contains information on all the previous transactions so that they can be reversed.
- The system log may also indicate the cause of failure.
- Backups can also be made regularly.

Authorisation services
- Who can access what data, and how.

Support for data communication
- Many DBs are accessed remotely and the DBMS must support this.

Integrity services
- Enforcing constraints to ensure that data remains correct.
- An example would be the data type. Other constraints are derived from the semantics of the data.

Providing multiple interfaces
- Query language interfaces for casual users, programming language interfaces for programmers, menu-driven interfaces for beginners.

Database advantages
- Reduction of redundancy
- Consistency
- More available information
- Sharing of data
- Data integrity
- Security
- Enforcement of standards
- Economy of scale
- Management of conflicts
- Improved accessibility
- Increased productivity
- Improved maintenance
- Increased concurrency
- Improved backup and recovery

Database disadvantages
- Complexity
- Size
- Software cost
- Hardware cost
- Conversion
- Performance
- Vulnerability to system failure

DBMS Interfaces
Stand-alone query language interfaces.
Programmer interfaces for embedding DML in programming languages:
- Pre-compiler approach
- Procedure (subroutine) call approach
User-friendly interfaces:
- Menu-based, popular for browsing on the web
- Forms-based, designed for naïve users
- Graphics-based (point and click, drag and drop, etc.)
- Natural language: requests in written English
- Combinations of the above
Others:
- Speech as input (?) and output
- Web browser as an interface
- Parametric interfaces (e.g., bank tellers) using function keys
- Interfaces for the DBA: creating accounts and granting authorizations, setting system parameters, changing schemas or access paths

Classification of DBMSs
Based on the data model used:
- Traditional: Relational, Network, Hierarchical.
- Emerging: Object-oriented, Object-relational.
Other classifications:
- Single-user (typically used with microcomputers) vs. multi-user (most DBMSs).
- Centralized (uses a single computer with one database) vs. distributed (uses multiple computers, multiple databases).

Distributed database systems have now come to be known as client-server based database systems because they do not support a totally distributed environment, but rather a set of database servers supporting a set of clients.

Variations of distributed environments:
- Homogeneous DDBMS
- Heterogeneous DDBMS
- Federated or multidatabase systems

ANSI-SPARC: Objectives of the Three-Level Architecture
- All users should be able to access the same data.
- A user's view is immune to changes made in other views.
- Users should not need to know physical database storage details.
- The DBA should be able to change database storage structures without affecting the users' views.
- The internal structure of the database should be unaffected by changes to physical aspects of storage.
- The DBA should be able to change the conceptual structure of the database without affecting all users.

ANSI-SPARC and Data Independence

Logical Data Independence
- Refers to immunity of external schemas to changes in the conceptual schema.
- Conceptual schema changes (e.g. addition/removal of entities) should not require changes to external schemas or rewrites of application programs.

Physical Data Independence
- Refers to immunity of the conceptual schema to changes in the internal schema.
- Internal schema changes (e.g. using different file organizations, storage structures/devices) should not require changes to the conceptual or external schemas.

Schemas versus Instances
Database Schema: The description of a database.
Includes descriptions of the database structure and the constraints that should hold on the database.
Schema Diagram: A diagrammatic display of (some aspects of) a database schema.
Schema Construct: A component of the schema or an object within the schema, e.g., STUDENT, COURSE.
Database Instance: The actual data stored in a database at a particular moment in time. Also called database state (or occurrence).

Database Schema vs. Database State
Database State: Refers to the content of a database at a moment in time.
Initial Database State: Refers to the database when it is first loaded.
Valid State: A state that satisfies the structure and constraints of the database.
Distinction:
- The database schema changes very infrequently.
- The database state changes every time the database is updated.
- The schema is also called the intension, whereas the state is called the extension.

Normalization
Normalization is a design technique that is widely used as a guide in designing relational databases. Normalization is essentially a two-step process that puts data into tabular form by removing repeating groups and then removes duplicated data from the relational tables.

Normalization theory is based on the concept of normal forms. A relational table is said to be in a particular normal form if it satisfies a certain set of constraints. Many normal forms have been defined. In this section, we will cover the first three normal forms, which were defined by E. F. Codd.
- 1NF – First Normal Form
- 2NF – Second Normal Form
- 3NF – Third Normal Form
- BCNF – Boyce-Codd Normal Form
- 4NF – Fourth Normal Form
- 5NF – Fifth Normal Form
- DKNF – Domain-Key Normal Form

Introducing the normal forms
Initially only three forms of normalisation (1NF, 2NF and 3NF) were put forward by E. F. Codd in 1972. The Boyce-Codd normal form was later introduced by R. Boyce and E. F. Codd in 1974. The later forms were mainly the work of R. Fagin in the period from 1977 through to 1981.

The process of normalizing a database is so well known that there are formal rules governing how a normalized database should be structured. There are seven of these rules, known as normal forms, in all, but the first four are adequate most of the time:
- First Normal Form (1NF): This rule has several requirements, including that there are no multivalued items or repeating groups; that each field is atomic, meaning each field must contain the smallest data element possible; and that the table contains a key.
- Second Normal Form (2NF): The table must be normalized to 1NF. All fields must refer to (or describe) the primary key value. If the primary key is based on more than one field, each nonkey field must depend on the whole composite key, not just one field within the key. Nonkey fields that don't describe the primary key should be moved to another table.
- Third Normal Form (3NF): The table must meet 1NF and 2NF requirements. All nonkey fields must be mutually independent. Any field that describes a nonkey field must be moved to another table.
- Boyce-Codd Normal Form (BCNF): Every field (or set of fields) that determines another field must be a candidate key. This rule is a stricter version of 3NF and catches dependencies that might otherwise sneak through the process. It's rather abstract and can be difficult to apply at first.

Functional Dependencies
The concept of functional dependencies is the basis for the first three normal forms. A column, Y, of the relational table R is said to be functionally dependent upon column X of R if and only if each value of X in R is associated with precisely one value of Y at any given time. X and Y may be composite. Saying that column Y is functionally dependent upon X is the same as saying the values of column X identify the values of column Y. If column X is a primary key, then all columns in the relational table R must be functionally dependent upon X.
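
The definition of functional dependence can be checked mechanically against sample data. The sketch below is illustrative only: the helper name holds_fd and the sample rows are assumptions, not part of the original notes. Note also that a functional dependency is a statement about every valid state of a table, so a check like this can show that a sample violates a dependency but cannot prove that the dependency holds in general.

```python
def holds_fd(rows, x_cols, y_cols):
    """Return True if, in these rows, each X value is paired with exactly one Y value."""
    seen = {}
    for row in rows:
        x = tuple(row[c] for c in x_cols)
        y = tuple(row[c] for c in y_cols)
        if x in seen and seen[x] != y:
            return False  # same X value, two different Y values
        seen[x] = y
    return True


# Hypothetical sample rows, for illustration only.
rows = [
    {"PartNo": 1234, "Description": "Nuts", "Quantity": 200},
    {"PartNo": 2234, "Description": "Bolts", "Quantity": 200},
    {"PartNo": 1234, "Description": "Nuts", "Quantity": 50},
]

part_determines_description = holds_fd(rows, ["PartNo"], ["Description"])
part_determines_quantity = holds_fd(rows, ["PartNo"], ["Quantity"])
print(part_determines_description, part_determines_quantity)
```

Because x_cols and y_cols are lists, the same helper covers composite X and Y.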
A short-hand notation for describing a functional dependency is:

R.x -> R.y

which can be read as: in the relational table named R, column X functionally determines (identifies) column Y.

Full functional dependence applies to tables with composite keys. Column Y in relational table R is fully functionally dependent on X of R if it is functionally dependent on X and not functionally dependent upon any subset of X. Full functional dependence means that when a primary key is composite, made of two or more columns, then the other columns must be identified by the entire key and not just some of the columns that make up the key.

Case study
Let's consider an example where we may want to commit to a database the details of packing notes raised by a supplier.

Un-normalized data
The first important fact to realise is that there are fields which appear only once on the packing note (those in the header group) and there are fields that repeat for every separate item listed on the packing note (those in the invoice body group). If we were to try to make one record for each packing note, the result would be as below.

Un-normalized
NoteNo  Packer  Name    Address  ItemNo  Quantity  PartNo  Description
300     JW      Bloggs  Perth    1       200       1234    Nuts
                                 2       200       2234    Bolts
                                 3       200       3334    Washers

Here we can clearly identify repeating groups. But fields must be 'atomic' in the sense that there can only be one value in any field (no multi-valued attributes). In theory we could extend the number of columns and introduce the following fields:
- Item1: Quantity, Item1: Partnumber, Item1: Description
- Item2: Quantity, Item2: Partnumber, Item2: Description
- Item3: Quantity, Item3: Partnumber, Item3: Description
- etc.

However, this has many problems associated with it. First, we do not know in advance the number of separate items on any particular packing note. This would result in having to cater for the maximum possible number of items that could be expected.
The vast majority of entries would likely have far fewer items than this maximum, making the database unnecessarily large. It would also make queries much less efficient, as we would have to search for the required data in multiple columns.

1NF – first normal form
A better approach would be to repeat the common data so that only atomic values are stored, as shown below.

NoteNo  Packer  CoName  CoAddress  ItemNo  Quantity  PartNo  Description
300     JW      Bloggs  Perth      1       200       1234    Nuts
300     JW      Bloggs  Perth      2       200       2234    Bolts
300     JW      Bloggs  Perth      3       200       3334    Washers

Although this satisfies the requirement for atomic values, clearly a great deal of redundancy has been introduced. Nevertheless this table satisfies the rules of 1NF. An alternative approach at this stage would be to split the table into two parts. Not surprisingly, going either route should end up with much the same solution, although arguably the second approach is perhaps a quicker way of getting to 2NF. For the moment we will proceed with the single table.

2NF – second normal form
To be in 2NF we must remove any part-key dependencies. Here we quickly run into a problem. NoteNo cannot by itself be used as a key as it is not unique. We must therefore consider the use of a composite key (i.e. one containing more than one column). By looking at the table data, it should be apparent that a key would have to utilise both NoteNo and ItemNo. A different NoteNo would be used for a different dispatch, but it is not unique for every line. An ItemNo would be unique within the context of a single packing note but not necessarily between different dispatch notes – conceivably you may ship the same item to two different customers.

Let's have a look at the implications of using a composite key consisting of NoteNo and ItemNo. This would work as a primary key but would fail on the dependency issue, as we have part-key dependencies.
The reason for this is that fields such as Packer, CoName and CoAddress are dependent only on NoteNo, whereas the fields Quantity, PartNo and Description are dependent on the composite key of NoteNo and ItemNo. To solve this problem we have to split the table (as we suggested when we looked at 1NF), as below. In PackingNote the repeated header rows collapse into a single row, because NoteNo alone is now the key.

PackingNote
NoteNo  Packer  CoName  CoAddress
300     JW      Bloggs  Perth

PackingNoteItem
NoteNo  ItemNo  Quantity  PartNo  Description
300     1       200       1234    Nuts
300     2       200       2234    Bolts
300     3       200       3334    Washers

We can consider 2NF to consist of two tables:
- PackingNote (NoteNo, Packer, CoName, CoAddress)
- PackingNoteItem (NoteNo, ItemNo, Quantity, PartNo, Description)

It is interesting to note that this is the same solution we would have achieved for 1NF (as well as 2NF) had we simply split tables from the outset. Although splitting tables at the start does not always make the transition from 1NF to 2NF so straightforward, it does generally reduce the amount of work to be carried out at this stage.

3NF – third normal form
In third normal form, we must ensure that no columns are dependent on other non-key attributes, often termed transitive dependencies. Here we have to look at both tables. In the 'PackingNote' table, a dependency between non-key values does exist, in that the customer's address (CoAddress) is dependent on the company name (CoName) and is not directly related to the 'NoteNo'. A similar situation exists in the 'PackingNoteItem' table, where Description is dependent on the 'PartNo'. Therefore the 3NF would consist of four tables:
- PackingNote (NoteNo, Packer, CoName)
- CustomerDetail (CoName, CoAddress)
- PackingNoteItem (NoteNo, ItemNo, Quantity, PartNo)
- Part (PartNo, Description)

Properties of Relational Tables
Relational tables have six properties:
1. Values are atomic.
2. Column values are of the same kind.
3. Each row is unique.
4. The sequence of columns is insignificant.
5. The sequence of rows is insignificant.
6. Each column must have a unique name.
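
Returning briefly to the packing-note case study: the four 3NF tables can be derived from the single 1NF table by projecting it onto each table's columns and dropping duplicate rows. The following is a minimal sketch under stated assumptions: the helper name project is hypothetical, CoName is assumed to be the company column in PackingNote, and the data is the one packing note from the case study.

```python
def project(rows, cols):
    """Keep only cols from each row and drop duplicate rows (first-seen order)."""
    result = []
    for row in rows:
        values = tuple(row[c] for c in cols)
        if values not in result:
            result.append(values)
    return result


# The 1NF table from the case study, one dict per row.
header = {"NoteNo": 300, "Packer": "JW", "CoName": "Bloggs", "CoAddress": "Perth"}
flat = [
    {**header, "ItemNo": 1, "Quantity": 200, "PartNo": 1234, "Description": "Nuts"},
    {**header, "ItemNo": 2, "Quantity": 200, "PartNo": 2234, "Description": "Bolts"},
    {**header, "ItemNo": 3, "Quantity": 200, "PartNo": 3334, "Description": "Washers"},
]

packing_note = project(flat, ["NoteNo", "Packer", "CoName"])
customer_detail = project(flat, ["CoName", "CoAddress"])
packing_note_item = project(flat, ["NoteNo", "ItemNo", "Quantity", "PartNo"])
part = project(flat, ["PartNo", "Description"])

for name, table in [("PackingNote", packing_note), ("CustomerDetail", customer_detail),
                    ("PackingNoteItem", packing_note_item), ("Part", part)]:
    print(name, table)
```

The header data that was repeated three times in the 1NF table now appears once, while each item line and each part still has its own row.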

Values Are Atomic
This property implies that columns in a relational table are not repeating groups or arrays. Such tables are referred to as being in the "first normal form" (1NF). The atomic value property of relational tables is important because it is one of the cornerstones of the relational model. The key benefit of the one-value property is that it simplifies data manipulation logic.

Column Values Are of the Same Kind
In relational terms this means that all values in a column come from the same domain. A domain is a set of values which a column may have. For example, a Monthly_Salary column contains only specific monthly salaries. It never contains other information such as comments, status flags, or even weekly salary. This property simplifies data access because developers and users can be certain of the type of data contained in a given column. It also simplifies data validation. Because all values are from the same domain, the domain can be defined and enforced with the Data Definition Language (DDL) of the database software.

Each Row is Unique
This property ensures that no two rows in a relational table are identical; there is at least one column, or set of columns, the values of which uniquely identify each row in the table. Such columns are called primary keys and are discussed in more detail in Relationships and Keys. This property guarantees that every row in a relational table is meaningful and that a specific row can be identified by specifying the primary key value.

The Sequence of Columns is Insignificant
This property states that the ordering of the columns in the relational table has no meaning. Columns can be retrieved in any order and in various sequences. The benefit of this property is that it enables many users to share the same table without concern for how the table is organized. It also permits the physical structure of the database to change without affecting the relational tables.
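
Several of these properties can be observed in a small experiment. The sketch below uses Python's built-in sqlite3 module; the employee table, its column names and the helper try_insert are assumptions made for illustration. A CHECK constraint stands in for a simple domain rule (no negative salaries), a PRIMARY KEY enforces row uniqueness, and columns are retrieved by name in either order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the DDL declares the key and a simple domain rule.
conn.execute("""
    CREATE TABLE employee (
        emp_id         INTEGER PRIMARY KEY,
        monthly_salary REAL NOT NULL CHECK (monthly_salary >= 0)
    )
""")
conn.execute("INSERT INTO employee (emp_id, monthly_salary) VALUES (1, 2500.0)")


def try_insert(emp_id, salary):
    """Attempt an insert and report whether the DBMS accepted it."""
    try:
        conn.execute(
            "INSERT INTO employee (emp_id, monthly_salary) VALUES (?, ?)",
            (emp_id, salary),
        )
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


duplicate_key = try_insert(1, 3000.0)   # same primary key as an existing row
out_of_domain = try_insert(2, -5.0)     # violates the CHECK constraint

# Columns are referenced by name, so either order can be requested.
by_id_first = conn.execute("SELECT emp_id, monthly_salary FROM employee").fetchall()
by_salary_first = conn.execute("SELECT monthly_salary, emp_id FROM employee").fetchall()
print(duplicate_key, out_of_domain, by_id_first, by_salary_first)
```

This only demonstrates a range rule; how strictly declared column types are enforced differs between database products, so type checking is deliberately left out of the sketch.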

The Sequence of Rows is Insignificant
This property is analogous to the one above but applies to rows instead of columns. The main benefit is that the rows of a relational table can be retrieved in different orders and sequences. Adding information to a relational table is simplified and does not affect existing queries.

Each Column Has a Unique Name
Because the sequence of columns is insignificant, columns must be referenced by name and not by position. In general, a column name need not be unique within an entire database but only within the table to which it belongs.

Relationships and Keys
A relationship is an association between two or more tables. Relationships are expressed in the data values of the primary and foreign keys.

A primary key is a column or columns in a table whose values uniquely identify each row in the table. A foreign key is a column or columns whose values are the same as the primary key of another table. You can think of a foreign key as a copy of the primary key from another relational table. The relationship is made between two relational tables by matching the values of the foreign key in one table with the values of the primary key in another.

Keys are fundamental to the concept of relational databases because they enable tables in the database to be related to each other. Navigation around a relational database depends on the ability of the primary key to unambiguously identify specific rows of a table. Navigating between tables requires that the foreign key is able to correctly and consistently reference the values of the primary keys of a related table. For example, the figure below shows how the keys in the relational tables are used to navigate from AUTHOR to TITLE to PUBLISHER. AUTHOR_TITLE is an all-key table used to link AUTHOR and TITLE. This relational table is required because AUTHOR and TITLE have a many-to-many relationship.
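
The navigation from AUTHOR to TITLE to PUBLISHER can be sketched with Python's built-in sqlite3 module. Only the four table names come from the text; the column names and the sample rows below are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce foreign keys
conn.executescript("""
    CREATE TABLE publisher (pub_id INTEGER PRIMARY KEY, pub_name TEXT NOT NULL);
    CREATE TABLE author    (au_id  INTEGER PRIMARY KEY, au_name  TEXT NOT NULL);
    CREATE TABLE title (
        title_id   INTEGER PRIMARY KEY,
        title_name TEXT NOT NULL,
        pub_id     INTEGER REFERENCES publisher (pub_id)
    );
    -- All-key linking table for the many-to-many relationship.
    CREATE TABLE author_title (
        au_id    INTEGER REFERENCES author (au_id),
        title_id INTEGER REFERENCES title (title_id),
        PRIMARY KEY (au_id, title_id)
    );
    INSERT INTO publisher VALUES (10, 'Example Press');
    INSERT INTO author VALUES (1, 'A. Writer'), (2, 'B. Scribe');
    INSERT INTO title VALUES (100, 'First Book', 10), (101, 'Second Book', 10);
    INSERT INTO author_title VALUES (1, 100), (2, 100), (1, 101);
""")

# Follow foreign keys to primary keys: AUTHOR -> AUTHOR_TITLE -> TITLE -> PUBLISHER.
rows = conn.execute("""
    SELECT a.au_name, t.title_name, p.pub_name
    FROM author AS a
    JOIN author_title AS link ON link.au_id = a.au_id
    JOIN title AS t ON t.title_id = link.title_id
    JOIN publisher AS p ON p.pub_id = t.pub_id
    WHERE a.au_id = 1
    ORDER BY t.title_id
""").fetchall()
print(rows)
```

Each JOIN condition matches a foreign key value in one table with a primary key value in another, which is exactly how the relationship is expressed in the data.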

Data Integrity
Data integrity means, in part, that you can correctly and consistently navigate and manipulate the tables in the database. There are two basic rules to ensure data integrity: entity integrity and referential integrity.

The entity integrity rule states that the value of the primary key can never be a null value (a null value is one that has no value and is not the same as a blank). Because a primary key is used to identify a unique row in a relational table, its value must always be specified and should never be unknown. The entity integrity rule requires that insert, update, and delete operations maintain the uniqueness and existence of all primary keys.

The referential integrity rule states that if a relational table has a foreign key, then every value of the foreign key must either be null or match the values in the relational table in which that foreign key is a primary key.

Domain Integrity
- Domain integrity requires that a set of data values fall within a specific range (domain) in order to be valid. In other words, domain integrity defines the permissible entries for a given column by restricting the data type, format, or range of possible values.
- A domain in database terminology refers to a set of permissible values for a column (it should not be confused with an Internet or DNS 'domain' or a Windows NT 'domain').
- Examples of domain integrity: correct data type; values that fall within the range supported by the system; null status; permitted size values.
- Example: Domain integrity might be used to ensure that an entry in the 'age' field is an integer between 0 and 120.
- Domain integrity is sometimes referred to as 'attribute' integrity.
- Domain integrity can be enforced with a DEFAULT constraint, FOREIGN KEY, CHECK constraint, data types, and, less frequently with SQL Server 7, rules or defaults.
- Data types limit fields to broad categories (e.g., integers).
- A default is a definition of a value that can be inserted into a column; a rule is a definition of acceptable values that can be inserted into a column.
- Rules and defaults are similar to constraints but are not ANSI standard; their continued use is not encouraged.

Referential Integrity
- Referential integrity is concerned with keeping the relationships between tables synchronized.
- Referential integrity is typically enforced with a Primary Key (PK) and Foreign Key (FK) combination.
- A Foreign Key (FK) is a column or combination of columns in one table (referred to as the 'child table') that takes its values from the PK in another table (referred to as the 'parent table').
- Example: If you want to relate the 'orders' table to the 'customers' table, you could add a Customer ID column to the 'orders' table, declare this column as a FK, and then reference it to the PK (Customer ID) in the 'customers' table.
- Once this relationship is established, it is possible to 'relate' or 'tie' each order to a particular customer.
- Note that while PK-FK combinations represent logical relationships among data, they do not necessarily limit the possible access paths through the data.
- In order for referential integrity to be maintained, the FK in the 'child' table can only accept values that exist in the PK of the 'parent' table.
- Example: The 'Customer ID' column that is declared as a FK in the 'orders' table must not contain a value that does not exist in the PK (Customer ID) in the 'customers' table; if it did, you would not be able to relate that order to a valid customer, which is a significant data integrity violation.
- The primary objective of referential integrity is to prevent 'orphans', i.e., records in the child table that cannot be related to a record in the parent table.
- Enforcing referential integrity means the relationship between the tables must be preserved when records are added (INSERT), changed (UPDATE), or deleted (DELETE).
- Example: You cannot change a Customer ID in the 'customers' table if that change would produce an 'orphan'; in other words, if that change would leave records in the 'orders' table that did not reference a valid Customer ID.
- Although referential integrity is often implemented with a PK-FK combination, database developers can also use triggers or stored procedures.
- There are three fundamental approaches to implementing referential integrity: 1) restrict (disallow the data modification); 2) cascade (extend the data modification to related tables); or 3) nullify (set the values of matching FKs to NULL).

Threats to Referential Integrity

The UPDATE Threat to Referential Integrity
- UPDATEs can produce orphans when either the PK of the parent is changed or the FK of the child is changed.
- Example: Changing a Customer ID value in the 'customers' table may result in orphan records in the 'orders' table; likewise, changing a Customer ID in the 'orders' table to a value that does not exist in the 'customers' table will produce orphan records.
- In order to preserve referential integrity, the offending UPDATE can be disallowed; this happens automatically when a FK references a PK.
- Alternatively, the UPDATE can be 'cascaded' from the parent table to the child table.
- A third option for dealing with the UPDATE threat is to set the FK values to NULL when the PK is changed; this is generally not a good solution.

The INSERT Threat to Referential Integrity
- The INSERT threat only applies to data modifications to the child table.
- The INSERT threat involves adding records to the child table with no associated record in the parent table; again, the result is orphaned records.
- Example: A record is inserted into the 'orders' (child) table without a FK or with a FK that does not match a value in the PK column of the 'customers' (parent) table.
- There are two ways to preserve referential integrity in the case of an INSERT:
  - The INSERT can be disallowed; this is what happens automatically when a FK references a PK.
  - Alternatively, the FK can be set to null (but, as with the UPDATE threat, this option is generally not a good idea).
- Note that unlike UPDATEs and DELETEs, INSERTs cannot be cascaded.

The DELETE Threat to Referential Integrity
- The DELETE threat applies only to data modifications to the parent table.
- The DELETE threat involves deleting records in the parent table when there are matching records in the child table; as always, the result is orphaned records.
- Example: Deleting a record in the 'customers' table when the customer has open orders; the entries in the 'orders' table then become orphans because they cannot be related to a customer.
- As with UPDATEs, there are three ways to preserve referential integrity with a DELETE:
  - The offending DELETE can be disallowed; this happens automatically when a FK references a PK.
  - Alternatively, the DELETE can be 'cascaded' from the parent table to the child table.
  - The third (and bad) option for dealing with the DELETE threat is to set the FK values to NULL when the parent record is deleted.

Relational Data Manipulation
Relational tables are sets. The rows of the tables can be considered as elements of the set. Operations that can be performed on sets can be done on relational tables. The eight relational operations are:

Union
The union operation of two relational tables is formed by appending rows from one table to those of a second table to produce a third. Duplicate rows are eliminated. The notation for the union of Tables A and B is A UNION B.

The relational tables used in the union operation must be union compatible. Tables that are union compatible must have the same number of columns, and corresponding columns must come from the same domain. Figure 1 shows the union of A and B. Note that the duplicate row [1, A, 2] has been removed.

Figure 1: A UNION B

Difference
The difference of two relational tables is a third that contains those rows that occur in the first table but not in the second. The difference operation requires that the tables be union compatible. As with arithmetic, the order of subtraction matters. That is, A - B is not the same as B - A.
Figure 2 shows the different results.

Figure 2: The Difference Operator

Intersection
The intersection of two relational tables is a third table that contains the rows common to both. Both tables must be union compatible. The notation for the intersection of A and B is A ∩ B = C or A INTERSECT B. Figure 3 shows that the single row [1, A, 2] appears in both A and B.

Figure 3: Intersection

Product
The product of two relational tables, also called the Cartesian product, is the concatenation of every row in one table with every row in the second. The product of table A (having m rows) and table B (having n rows) is the table C (having m x n rows). The product is denoted as A X B or A TIMES B.

Figure 4: Product

The product operation is by itself not very useful. However, it is often used as an intermediate step in a join.

Projection
The project operator retrieves a subset of columns from a table, removing duplicate rows from the result.

Selection
The select operator, sometimes called restrict to prevent confusion with the SQL SELECT command, retrieves subsets of rows from a relational table based on the value(s) in a column or columns.

Join
A join operation combines the product, selection, and, possibly, projection. The join operator horizontally combines (concatenates) data from one row of a table with rows from another or the same table when certain criteria are met. The criteria involve a relationship among the columns in the joined relational tables. If the join criterion is based on equality of column values, the result is called an equijoin. A natural join is an equijoin with redundant columns removed.

Figure 5 illustrates a join operation. Tables D and E are joined based on the equality of k in both tables. The first result is an equijoin. Note that there are two columns named k; the second result is a natural join with the redundant column removed.

Figure 5: Join

Joins can also be done on criteria other than equality.
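
Because relational tables are sets of rows, the operations described so far can be sketched directly with Python sets of tuples. The contents of tables A, B, D and E below are assumptions chosen for illustration (only the row [1, A, 2] is taken from the text), and project and select are hypothetical helper names.

```python
from itertools import product as cartesian

# Hypothetical union-compatible tables; only the row (1, "A", 2) comes from the text.
A = {(1, "A", 2), (2, "B", 4)}
B = {(1, "A", 2), (3, "C", 6)}

union = A | B                 # duplicate row (1, "A", 2) appears once
a_minus_b = A - B             # rows in A but not in B
b_minus_a = B - A             # order matters: a different result
intersection = A & B          # rows common to both
times = {a + b for a, b in cartesian(A, B)}   # every row of A with every row of B


def project(rows, columns):
    """Keep the given column positions; the set removes duplicate rows."""
    return {tuple(row[i] for i in columns) for row in rows}


def select(rows, predicate):
    """Keep the rows for which the predicate is true (also called restrict)."""
    return {row for row in rows if predicate(row)}


# Hypothetical tables D(k, d) and E(k, e), joined on equality of k.
D = {(1, "x"), (2, "y")}
E = {(1, "p"), (1, "q"), (3, "r")}
equijoin = select({d + e for d, e in cartesian(D, E)}, lambda row: row[0] == row[2])
natural_join = project(equijoin, (0, 1, 3))   # drop the redundant second k column

print(sorted(union))
print(sorted(equijoin))
print(sorted(natural_join))
```

The join is built exactly as the text describes: a product, followed by a selection on the join criterion, followed by a projection that removes the redundant column. Real database systems compute joins far more efficiently than by materialising the full product.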

Division
The division operator returns the column values in one table that are paired with matching column values for every row in another table.

Figure 6: Division
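
Division is the least intuitive of the operations, so a small sketch may help. It assumes the simplest case, in which the dividend is a two-column table and the divisor a one-column table; the function name divide and the supplier data are hypothetical.

```python
def divide(dividend, divisor):
    """Return each x that is paired in dividend with every value in divisor.

    dividend: set of (x, y) pairs; divisor: set of y values.
    """
    candidates = {x for x, _ in dividend}
    return {x for x in candidates if all((x, y) in dividend for y in divisor)}


# Hypothetical data: which supplier supplies which part.
supplies = {
    ("S1", "Nuts"), ("S1", "Bolts"),
    ("S2", "Nuts"),
    ("S3", "Nuts"), ("S3", "Bolts"), ("S3", "Washers"),
}
required = {"Nuts", "Bolts"}

suppliers_of_all_required = divide(supplies, required)
print(sorted(suppliers_of_all_required))
```

Here the result contains the suppliers that supply every required part. S2 is excluded because it has no matching row for Bolts.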