Models of Databases and Database Design
File Based approach
Shortcomings
1. Data redundancy – Data is duplicated in many files.
e.g. some of the data is so heavily duplicated that certain copies may never be
accessed, yet they still take up space.
2. Data inconsistency – Some tables or files may be skipped during an
update transaction.
e.g. if someone’s marital status changes, the salaries department may record
the change but the operations department might not. This may lead to
inconsistencies within the system.
3. Difficult to share – File based systems do not allow multiple users to
access the data concurrently.
4. Data dependence – The physical structure of data is defined in the code
hence any changes to the structure will require a lot of changes to the rest
of the code.
5. Incompatibility – Data stored in files by programs written in different
programming languages is difficult to share between those programs.
6. Difficult to make queries or generate reports – Complex code is required
to make queries and generate reports.
Database Approach
An approach whereby a pool of related data is shared
Database : A collection of related data.
Data: Known facts that can be recorded and have implicit meaning.
History of Database Systems
First-generation
- Hierarchical and Network
Second generation
- Relational
Third generation
- Object Relational
- Object-Oriented
Database Management System (DBMS): A collection of software that supports
the storage, retrieval and modification of large volumes of data. Support is also
provided for multiple users, along with administration tools.
E.g. Oracle, SQL Server, and Access
Functions of a DBMS
Data Storage, Retrieval and Updating
Data Dictionary
This describes the structure and content of the DB. E.g. The names of tables, names of
fields, characteristics of fields, relationships between entities, what the user is allowed to
do.
Transaction support
Feedback to the user. Did a transaction fail? Why?
Concurrency Control
Make sure that users aren’t accessing or changing data that is being changed by
another user at the same time. Must be done in a controlled manner to avoid mistakes
Recovery Services
Database can be recovered to some past correct state in the event of failure.
This can be done using a system log which contains information on all previous
transactions so that they can be reversed.
The system log may also indicate the cause of failure.
Backups can also be made regularly
Authorisation services
Who can access what data and how.
Support for data communication
Many DBs are accessed remotely and the DBMS must support this
Integrity services
Enforcing constraints to ensure that data remains correct. An example would
be the data type. Other constraints are derived from the semantics of the data
Providing multiple interfaces
Query language interface for casual users, programming language interfaces
for programmers, menu driven interfaces for beginners
Database advantages
Reduction of redundancy
Consistency
More available information
Sharing of data
Data integrity
Security
Enforcement of standards
Economy of scale
Management of conflicts
Improved accessibility
Increased productivity
Improved maintenance
Increased concurrency
Improved backup and recovery
Database disadvantages
Complexity
Size
Software Cost
Hardware cost
Conversion
Performance
Vulnerability to system failure
DBMS Interfaces
- Stand-alone query language interfaces.
- Programmer interfaces for embedding DML in programming languages:
  - Pre-compiler Approach
  - Procedure (Subroutine) Call Approach
- User-friendly interfaces:
  - Menu-based, popular for browsing on the web
  - Forms-based, designed for naïve users
  - Graphics-based (Point and Click, Drag and Drop etc.)
  - Natural language: requests in written English
  - Combinations of the above
- Others:
  - Speech as Input (?) and Output
  - Web Browser as an interface
  - Parametric interfaces (e.g., bank tellers) using function keys
  - Interfaces for the DBA:
    - Creating accounts, granting authorizations
    - Setting system parameters
    - Changing schemas or access paths
Classification of DBMSs
Based on the data model used:
- Traditional: Relational, Network, Hierarchical.
- Emerging: Object-oriented, Object-relational.
Other classifications:
- Single-user (typically used with microcomputers) vs. multi-user (most DBMSs).
- Centralized (uses a single computer with one database) vs. distributed (uses
multiple computers, multiple databases).
Distributed database systems have now come to be known as client-server database
systems because they do not support a totally distributed environment, but rather a set of
database servers supporting a set of clients.
Variations of Distributed Environments:
Homogeneous DDBMS
Heterogeneous DDBMS
Federated or Multidatabase Systems
ANSI-SPARC
Objectives of Three-Level Architecture
All users should be able to access the same data
A user’s view is immune to changes made in other views
Users should not need to know physical database storage details
DBA should be able to change database storage structures without affecting the
users’ views
Internal structure of database should be unaffected by changes to physical
aspects of storage
DBA should be able to change conceptual structure of database without affecting
all users
The ANSI-SPARC Architecture and Data Independence
Logical Data Independence
- Refers to immunity of external schemas to changes in conceptual schema.
- Conceptual schema changes (e.g. addition/removal of entities).
- Should not require changes to external schema or rewrites of application
programs.
Physical Data Independence
- Refers to immunity of conceptual schema to changes in the internal schema.
- Internal schema changes (e.g. using different file organizations, storage
structures/devices).
- Should not require change to conceptual or external schemas.
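As a small illustration of physical data independence, the sketch below assumes a
hypothetical Student table (not taken from these notes): adding an index changes only
the internal schema, so queries written against the conceptual schema continue to work
unchanged.

    -- Internal-schema change: a new access structure, invisible to applications.
    CREATE INDEX idx_student_surname ON Student (Surname);

    -- The same query runs before and after the index exists;
    -- only the physical access path the DBMS chooses may differ.
    SELECT * FROM Student WHERE Surname = 'Bloggs';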
Schemas versus Instances
Database Schema: The description of a database. Includes descriptions
of the database structure and the constraints that should hold on the
database.
Schema Diagram: A diagrammatic display of (some aspects of) a
database schema.
Schema Construct: A component of the schema or an object within the
schema, e.g., STUDENT, COURSE.
Database Instance: The actual data stored in a database at a particular
moment in time. Also called database state (or occurrence).
Database Schema Vs. Database State
Database State: Refers to the content of a database at a moment in time.
Initial Database State: Refers to the database when it is loaded
Valid State: A state that satisfies the structure and constraints of the database.
Distinction
- The database schema changes very infrequently. The database state changes
every time the database is updated.
- Schema is also called intension, whereas state is called extension.
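The following sketch illustrates the distinction using the STUDENT construct named
above (the columns and values are assumptions added for illustration): the CREATE
TABLE statement is part of the schema (intension), while the rows inserted afterwards
make up the current state (extension).

    -- Schema construct: the description of STUDENT; it changes very infrequently.
    CREATE TABLE STUDENT (
        StudentNo INTEGER PRIMARY KEY,
        Name      VARCHAR(50)
    );

    -- Database state: the actual data held at this moment;
    -- it changes every time the database is updated.
    INSERT INTO STUDENT (StudentNo, Name) VALUES (1, 'J. Bloggs');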
Normalization
Normalization is a design technique that is widely used as a guide in designing
relational databases. Normalization is essentially a two-step process that puts
data into tabular form by removing repeating groups and then removes
duplicated data from the relational tables.
Normalization theory is based on the concept of normal forms. A relational
table is said to be in a particular normal form if it satisfies a certain set of
constraints. Many normal forms have been defined; in this section, we will
cover the first three normal forms, which were defined by E. F. Codd.
1NF – First Normal Form
2NF – Second Normal Form
3NF – Third Normal Form
BCNF – Boyce-Codd Normal Form
4NF – Fourth Normal Form
5NF – Fifth Normal Form
DKNF – Domain-key Normal Form
Introducing the normal forms
Initially only three forms of normalisation (1NF, 2NF and 3NF) were put
forward by E. F. Codd in 1972. The Boyce-Codd normal form was later
introduced by R. Boyce and E. F. Codd in 1974. The later forms were mainly the
work of R. Fagin in the period from 1977 through to 1981.
The process of normalizing a database is so well known that there are formal
rules governing how a normalized database should be structured. There are
seven of these rules, known as normal forms, in all, but the first four are
adequate most of the time:
- First Normal Form (1NF) – This rule has several requirements, including
that there are no multivalued items or repeating groups; that each field is
atomic, meaning each field must contain the smallest data element
possible; and that the table contains a key.
- Second Normal Form (2NF) – The table must be normalized to 1NF. All
fields must refer to (or describe) the primary key value. If the primary key
is based on more than one field, each nonkey field must depend on the
whole composite key, not just one field within the key. Nonkey fields that
don't support the primary key should be moved to another table.
- Third Normal Form (3NF) – The table must meet 1NF and 2NF
requirements. All nonkey fields must be mutually independent. Any field
that describes a nonkey field must be moved to another table.
- Boyce-Codd Normal Form (BCNF) – No field may be determined by
anything other than a candidate key; in other words, every determinant
must be a candidate key. This rule is really a stricter version of 3NF and
catches dependencies that might otherwise sneak through the process. It's
rather abstract and can be difficult to apply at first.
Functional Dependencies
The concept of functional dependencies is the basis for the first three normal
forms. A column, Y, of the relational table R is said to be functionally dependent
upon column X of R if and only if each value of X in R is associated with
precisely one value of Y at any given time. X and Y may be composite. Saying
that column Y is functionally dependent upon X is the same as saying the values
of column X identify the values of column Y. If column X is a primary key, then
all columns in the relational table R must be functionally dependent upon X.
A short-hand notation for describing a functional dependency is:
R.X → R.Y
which can be read as: in the relational table named R, column X functionally
determines (identifies) column Y.
Full functional dependence applies to tables with composite keys. Column Y in
relational table R is fully functionally dependent on X of R if it is functionally
dependent on X and not functionally dependent upon any proper subset of X. Full functional
dependence means that when a primary key is composite, made of two or more
columns, then the other columns must be identified by the entire key and not just
some of the columns that make up the key.
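As a short worked illustration (using a hypothetical Enrolment table, not one from the
case study that follows), suppose Enrolment(StudentNo, CourseNo, StudentName, Grade)
has the composite primary key (StudentNo, CourseNo). Then:

    Enrolment.(StudentNo, CourseNo) → Enrolment.Grade
        (full functional dependence: a grade is identified only by the whole key)
    Enrolment.StudentNo → Enrolment.StudentName
        (a part-key dependency: the name is identified by only part of the key)

The second dependency is exactly the kind of part-key dependency that second normal
form, discussed below, removes.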
Case study
Let’s consider an example where we may want to commit to a database the
details of packing notes raised by a supplier.
Un-normalized data
The first important fact to realise is that there are fields which appear only once on the
packing note (those in the header group) and there are fields that repeat for every separate
item listed on the packing note (those in the invoice body group). If we were to try to
make one record for each packing note, the result would be as shown below.
Un-normalized
NoteNo  Packer  Name    Address  ItemNo  Quantity  PartNo  Description
300     JW      Bloggs  Perth    1       200       1234    Nuts
                                 2       200       2234    Bolts
                                 3       200       3334    Washers
Here we can clearly identify repeating groups. But fields must be ‘atomic’ in the
sense that there can only be one value in any field (no multi-valued attributes).
In theory we could extend the number of columns and introduce the following
fields:
Item1 : Quantity
Item1 : Partnumber
Item1 : Description
Item2 : Quantity
Item2 : Partnumber
Item2 : Description
Item3 : Quantity
Item3 : Partnumber
Item3 : Description
etc
However, this has many problems associated with it. First, we do not know in
advance the number of separate items on any particular packing note. This
would result in having to cater for the maximum possible number of items that
could be expected. The vast majority of entries would likely be much less than
this maximum making the database unnecessarily large. It would also make
queries much less efficient as we would have to search for the required data in
multiple columns.
1NF – first normal form
A better approach would be to repeat the common data to ensure that this
resulted in storage only of atomic values, as shown below
NoteNo  Packer  CoName  CoAddress  ItemNo  Quantity  PartNo  Description
300     JW      Bloggs  Perth      1       200       1234    Nuts
300     JW      Bloggs  Perth      2       200       2234    Bolts
300     JW      Bloggs  Perth      3       200       3334    Washers
Although satisfying the issue of atomic values, clearly a great deal of redundancy
has been introduced. Nevertheless this table satisfies the rules of 1NF.
An alternative approach at this stage would be to split the table into two parts.
Not surprisingly, going either route should end up with much the same solution
although arguably the second approach is perhaps a quicker way of getting to
2NF. For the moment we will proceed with the single table.
2NF – second normal form
To be in 2NF we must remove any part-key dependencies.
Here we quickly run into a problem. NoteNo cannot by itself be used as a key as
this is not unique. We must therefore consider the use of a composite key (i.e.
one containing more than one column). By looking at the table data, it should be
apparent that a key would have to utilise both the NoteNo as well as ItemNo. A
different NoteNo would be used for a different dispatch but is not unique for
every line. An ItemNo would be unique within the context of a single packing
note but not necessarily between different dispatch notes – conceivably you may
ship the same item to two different customers.
Let’s have a look at the implications of using a composite key consisting of NoteNo and
ItemNo. This would work as a primary key but would fail on the dependency
issue as we have part-key dependencies. The reason for this is that fields such as
Packer, CoName and CoAddress are dependent only on NoteNo whereas fields
Quantity, PartNo and Description would be dependent on the composite key
NoteNo and ItemNo. To solve this problem we have to split the table (as we
suggested when we looked at 1NF) as below
PackingNote
NoteNo  Packer  CoName  CoAddress
300     JW      Bloggs  Perth

PackingNoteItem
NoteNo  ItemNo  Quantity  PartNo  Description
300     1       200       1234    Nuts
300     2       200       2234    Bolts
300     3       200       3334    Washers
We can consider 2NF to consist of two tables:
PackingNote (NoteNo, Packer, CoName, CoAddress);
PackingNoteItem (NoteNo, ItemNo, Qty, PartNo, Desc).
It is interesting to note that this is the same solution we would have achieved for
1NF (as well as 2NF) had we simply split tables from the outset. Although
splitting tables at the start does not always make the transition from 1NF to 2NF
so straightforward, it does generally reduce the amount of work to be carried out
at this stage.
3NF – third normal form
In third normal form, we must ensure that no columns are dependent on other
non-key attributes, often termed transitive dependencies. Here we have to look
at both tables. In the ‘PackingNote’ table, a dependency between non-key values
does exist in that Customer’s address is dependent on the Company name and
not related to the ‘NoteNo’. A similar situation exists in the ‘PackingNoteItem’
table where Description is dependent on the ‘PartNo’.
Therefore the 3NF would consist of four tables:
PackingNote (NoteNo, Packer, CoName);
CustomerDetail (CoName, CoAddress);
PackingNoteItem (NoteNo, ItemNo, Qty, PartNo);
Part (PartNo, Description).
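A rough sketch of this 3NF design expressed in SQL follows; the column data types (and
the use of CoName as the customer key) are assumptions added for illustration rather
than something stated in the notes above.

    -- Hedged sketch of the four 3NF tables.
    CREATE TABLE CustomerDetail (
        CoName    VARCHAR(50) PRIMARY KEY,
        CoAddress VARCHAR(100)
    );

    CREATE TABLE Part (
        PartNo      INTEGER PRIMARY KEY,
        Description VARCHAR(50)
    );

    CREATE TABLE PackingNote (
        NoteNo INTEGER PRIMARY KEY,
        Packer VARCHAR(10),
        CoName VARCHAR(50) REFERENCES CustomerDetail (CoName)
    );

    CREATE TABLE PackingNoteItem (
        NoteNo INTEGER REFERENCES PackingNote (NoteNo),
        ItemNo INTEGER,
        Qty    INTEGER,
        PartNo INTEGER REFERENCES Part (PartNo),
        PRIMARY KEY (NoteNo, ItemNo)  -- the composite key discussed under 2NF
    );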
Properties of Relational Tables
Relational tables have six properties:
1. Values are atomic.
2. Column values are of the same kind.
3. Each row is unique.
4. The sequence of columns is insignificant.
5. The sequence of rows is insignificant.
6. Each column must have a unique name.
Values Are Atomic
This property implies that columns in a relational table are not repeating groups or arrays. Such tables are
referred to as being in the "first normal form" (1NF). The atomic value property of relational tables is
important because it is one of the cornerstones of the relational model.
The key benefit of the one value property is that it simplifies data manipulation logic.
Column Values Are of the Same Kind
In relational terms this means that all values in a column come from the same domain. A domain is a set of
values which a column may have. For example, a Monthly_Salary column contains only specific monthly
salaries. It never contains other information such as comments, status flags, or even weekly salary.
This property simplifies data access because developers and users can be certain of the type of data
contained in a given column. It also simplifies data validation. Because all values are from the same domain,
the domain can be defined and enforced with the Data Definition Language (DDL) of the database software.
Each Row is Unique
This property ensures that no two rows in a relational table are identical; there is at least one column, or set
of columns, the values of which uniquely identify each row in the table. Such columns are called primary
keys and are discussed in more detail in Relationships and Keys.
This property guarantees that every row in a relational table is meaningful and that a specific row can be
identified by specifying the primary key value.
The Sequence of Columns is Insignificant
This property states that the ordering of the columns in the relational table has no meaning. Columns can be
retrieved in any order and in various sequences. The benefit of this property is that it enables many users to
share the same table without concern of how the table is organized. It also permits the physical structure of
the database to change without affecting the relational tables.
The Sequence of Rows is Insignificant
This property is analogous to the one above but applies to rows instead of columns. The main benefit is that
the rows of a relational table can be retrieved in different orders and sequences. Adding information to a
relational table is simplified and does not affect existing queries.
Each Column Has a Unique Name
Because the sequence of columns is insignificant, columns must be referenced by name and not by position.
In general, a column name need not be unique within an entire database but only within the table to which
it belongs.
Relationships and Keys
A relationship is an association between two or more tables. Relationships are expressed in the data values of
the primary and foreign keys.
A primary key is a column or columns in a table whose values uniquely identify each row in a table. A
foreign key is a column or columns whose values are the same as the primary key of another table. You can
think of a foreign key as a copy of the primary key from another relational table. The relationship is made
between two relational tables by matching the values of the foreign key in one table with the values of the
primary key in another.
Keys are fundamental to the concept of relational databases because they enable tables in the database to be
related with each other. Navigation around a relational database depends on the ability of the primary key
to unambiguously identify specific rows of a table. Navigating between tables requires that the foreign key
is able to correctly and consistently reference the values of the primary keys of a related table. For example,
the figure below shows how the keys in the relational tables are used to navigate from AUTHOR to TITLE
to PUBLISHER. AUTHOR_TITLE is an all key table used to link AUTHOR and TITLE. This relational table
is required because AUTHOR and TITLE have a many-to-many relationship.
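A minimal sketch of how the AUTHOR, TITLE, PUBLISHER and AUTHOR_TITLE tables
might be declared; the column names and types are assumptions, since the notes only name
the tables and the relationships between them.

    CREATE TABLE AUTHOR (
        AuthorID INTEGER PRIMARY KEY,
        Name     VARCHAR(50)
    );

    CREATE TABLE PUBLISHER (
        PubID INTEGER PRIMARY KEY,
        Name  VARCHAR(50)
    );

    CREATE TABLE TITLE (
        TitleID INTEGER PRIMARY KEY,
        Name    VARCHAR(100),
        PubID   INTEGER REFERENCES PUBLISHER (PubID)  -- foreign key navigating TITLE to PUBLISHER
    );

    -- AUTHOR_TITLE is the "all key" link table that resolves the
    -- many-to-many relationship between AUTHOR and TITLE.
    CREATE TABLE AUTHOR_TITLE (
        AuthorID INTEGER REFERENCES AUTHOR (AuthorID),
        TitleID  INTEGER REFERENCES TITLE (TitleID),
        PRIMARY KEY (AuthorID, TitleID)
    );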
Data Integrity
Data integrity means, in part, that you can correctly and consistently navigate and manipulate the tables in
the database. There are two basic rules to ensure data integrity: entity integrity and referential integrity.
The entity integrity rule states that the value of the primary key can never be a null value (a null value is one
that has no value and is not the same as a blank). Because a primary key is used to identify a unique row in
a relational table, its value must always be specified and should never be unknown. The integrity rule
requires that insert, update, and delete operations maintain the uniqueness and existence of all primary
keys.
The referential integrity rule states that if a relational table has a foreign key, then every value of the foreign
key must either be null or match the values in the relational table in which that foreign key is a primary key.
Domain Integrity
Domain integrity requires that a set of data values fall within a specific range
(domain) in order to be valid
In other words, domain integrity defines the permissible entries for a given
column by restricting the data type, format, or range of possible values
A domain in database terminology refers to a set of permissible values for a
column (it should not be confused with an Internet or DNS 'domain' or a
Windows NT 'domain')
Examples of domain integrity: correct data type; values that fall within the range
supported by the system; null status; permitted size values
Example: Domain integrity might be used to ensure an entry in the 'age' field is
an integer and must be between the values of 0 and 120
Domain integrity is sometimes referred to as 'attribute' integrity
- Domain integrity can be enforced with a DEFAULT constraint, FOREIGN
KEY, CHECK constraint, data types, and, less frequently with SQL Server
7, rules or defaults (a small CHECK-constraint sketch follows this list)
- Data types limit fields to broad categories (e.g., integers)
- A default is a definition of a value that can be inserted into a column; a
rule is a definition of acceptable values that can be inserted into a column
- Rules and defaults are similar to constraints but are not ANSI standard;
their continued use is not encouraged
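A minimal sketch of the 'age' example above, assuming a hypothetical Person table; the
data type restricts the field to integers and the CHECK constraint restricts the range of
permissible values.

    CREATE TABLE Person (
        PersonID INTEGER PRIMARY KEY,
        Age      INTEGER CHECK (Age BETWEEN 0 AND 120)  -- domain: integers from 0 to 120
    );

    -- This insert would be rejected because 150 falls outside the domain:
    -- INSERT INTO Person (PersonID, Age) VALUES (1, 150);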
Referential Integrity
- Referential integrity is concerned with keeping the relationships between
tables synchronized
- Referential integrity is typically enforced with a Primary Key (PK) and
Foreign Key (FK) combination
- A Foreign Key (FK) is a column or combination of columns in one table
(referred to as the 'child table') that takes its values from the PK in another
table (referred to as the 'parent table')
- Example: If you want to relate the 'orders' table to the 'customers' table,
you could add a Customer ID column to the 'orders' table, declare this
column as a FK, and then reference it to the PK (Customer ID) in the
'customers' table
- Once this relationship is established, it is possible to 'relate' or 'tie' each
order to a particular customer
- Note that while PK-FK combinations represent logical relationships
among data, they do not necessarily limit the possible access paths
through the data
- In order for referential integrity to be maintained, the FK in the 'child'
table can only accept values that exist in the PK of the 'parent' table
- Example: the 'Customer ID' column that is declared as a FK in the 'orders'
table must not contain a value that does not exist in the PK (Customer ID)
in the 'customers' table; if it did, you would not be able to relate that order
to a valid customer - a significant data integrity violation
- The primary objective of referential integrity is to prevent 'orphans', i.e.
records in the child table that cannot be related to a record in the parent
table
- Enforcing referential integrity means the relationship between the tables
must be preserved when records are added (INSERT), changed
(UPDATE), or deleted (DELETE)
- Example: You cannot change a Customer ID in the 'customers' table if that
change would produce an 'orphan', in other words if that change would
leave records in the 'orders' table that did not reference a valid Customer ID
- Although referential integrity is often implemented with a PK-FK
combination, database developers can also use triggers or stored
procedures
- There are three fundamental approaches to implementing referential
integrity: 1) restrict (disallow the data modification); 2) cascade (extend
the data modification to related tables); or 3) nullify (set the values of
matching FKs to NULL); a small sketch of an FK declaration using these
options follows this list
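A minimal sketch of the 'customers'/'orders' example, showing how the restrict, cascade
and nullify approaches are typically declared on the FK; the column names are
assumptions, and the exact keywords (e.g. RESTRICT vs. NO ACTION) vary between
database products.

    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- PK in the parent table
        name        VARCHAR(50)
    );

    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,               -- FK in the child table
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
            ON UPDATE CASCADE   -- cascade: propagate a changed Customer ID to child rows
            ON DELETE RESTRICT  -- restrict: disallow deleting a customer with matching orders
            -- ON DELETE SET NULL would be the 'nullify' option instead
    );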
Threats to Referential Integrity

The UPDATE Threat to Referential Integrity
- UPDATEs can produce orphans when either the PK of the parent is
changed or the FK of the child is changed
- Example: Changing a Customer ID value in the 'customers' table may
result in orphan records in the 'orders' table; likewise, changing a
Customer ID in the 'orders' table to a value that does not exist in the
'customers' table will produce orphan records
- In order to preserve referential integrity, the offending UPDATE can be
disallowed; this happens automatically when a FK references a PK
- Alternatively, the UPDATE can be 'cascaded' from the parent table to the
child table
- A third option for dealing with the UPDATE threat is to set the FK values
to NULL when the PK is changed; this is generally not a good solution

The INSERT Threat to Referential Integrity
- The INSERT threat only applies to data modifications to the child table
- The INSERT threat involves adding records to the child table with no
associated record in the parent table; again, the result is orphaned records
- Example: A record is inserted into the 'orders' (child) table without a FK
or with a FK that does not match a value in the PK column of the
'customers' (parent) table
- There are two ways to preserve referential integrity in the case of an
INSERT:
  - The INSERT can be disallowed; this is what happens automatically
when a FK references a PK
  - Alternatively, the FK can be set to null (but, as with the UPDATE
threat, this option is generally not a good idea)
- Note that unlike UPDATEs and DELETEs, INSERTs cannot be cascaded

The DELETE Threat to Referential Integrity
- The DELETE threat applies only to data modifications to the parent table
- The DELETE threat involves deleting records in the parent table when
there are matching records in the child table; as always, the result is
orphaned records
- Example: Deleting a record in the 'customers' table when the customer has
open orders; the entries in the 'orders' table then become orphans because
they cannot be related to a customer
- Like UPDATEs, there are three ways to preserve referential integrity with
a DELETE (a few example statements illustrating these threats follow this
list):
  - The offending DELETE can be disallowed; this happens automatically
when a FK references a PK
  - Alternatively, the DELETE can be 'cascaded' from the parent table to
the child table
  - The third (and bad) option for dealing with the DELETE threat is to set
the FK values to NULL when the PK is deleted
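Continuing the hypothetical 'customers'/'orders' sketch, the statements below illustrate
the three threats; whether each is rejected, cascaded or nullified depends on the
referential actions declared on the FK.

    -- Assume customer 42 exists and has rows in 'orders', while customer 777 does not exist.

    -- UPDATE threat: changing the parent PK would orphan the matching child rows.
    UPDATE customers SET customer_id = 99 WHERE customer_id = 42;

    -- INSERT threat: adding a child row whose FK matches no parent row
    -- (this can only be disallowed or nullified; INSERTs cannot be cascaded).
    INSERT INTO orders (order_id, customer_id) VALUES (1001, 777);

    -- DELETE threat: removing a parent row that still has matching child rows.
    DELETE FROM customers WHERE customer_id = 42;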
Relational Data Manipulation
Relational tables are sets. The rows of the tables can be considered as elements of the set. Operations that
can be performed on sets can be done on relational tables. The eight relational operations are:
Union
The union operation of two relational tables is formed by appending rows from one table to those of a
second table to produce a third. Duplicate rows are eliminated. The notation for the union of Tables A and B
is A UNION B.
The relational tables used in the union operation must be union compatible. Tables that are union
compatible must have the same number of columns and corresponding columns must come from the same
domain. Figure 1 shows the union of A and B.
Note that the duplicate row [1, A, 2] has been removed.
Figure 1: A UNION B
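In SQL the union of two union-compatible tables A and B can be sketched as below; the
column names are assumptions, and UNION removes duplicate rows just as described
above (UNION ALL would keep them).

    SELECT col1, col2, col3 FROM A
    UNION
    SELECT col1, col2, col3 FROM B;  -- a duplicate row such as [1, A, 2] appears only once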
Difference
The difference of two relational tables is a third that contains those rows that occur in the first table but not in
the second. The Difference operation requires that the tables be union compatible. As with arithmetic, the
order of subtraction matters. That is, A - B is not the same as B - A. Figure 2 shows the different results.
Figure 2: The Difference Operator
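The difference is usually written with EXCEPT in SQL (MINUS in some products); as
noted, the order of the operands matters. Column names are again assumptions.

    SELECT col1, col2, col3 FROM A
    EXCEPT
    SELECT col1, col2, col3 FROM B;  -- rows that occur in A but not in B; swap A and B for B - A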
Intersection
The intersection of two relational tables is a third table that contains common rows. Both tables must be
union compatible. The notation for the intersection of A and B is A ∩ B = C or A INTERSECT B.
Figure 3 shows that the single row [1, A, 2] appears in both A and B.
Figure 3: Intersection
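A corresponding SQL sketch of the intersection, again assuming union-compatible tables
A and B with the same illustrative column names:

    SELECT col1, col2, col3 FROM A
    INTERSECT
    SELECT col1, col2, col3 FROM B;  -- only the rows common to both tables, e.g. [1, A, 2]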
Product
The product of two relational tables, also called the Cartesian Product, is the concatenation of every row in
one table with every row in the second. The product of table A (having m rows) and table B (having n rows)
is the table C (having m x n rows). The product is denoted as A X B or A TIMES B.
Figure 4: Product
The product operation is by itself not very useful. However, it is often used as an intermediate process in a
Join.
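The Cartesian product can be sketched in SQL with CROSS JOIN; with m rows in A and
n rows in B, the result has m x n rows.

    SELECT *
    FROM A
    CROSS JOIN B;  -- every row of A concatenated with every row of B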
Projection
The project operator retrieves a subset of columns from a table, removing duplicate rows from the result.
Selection
The select operator, sometimes called restrict to prevent confusion with the SQL SELECT command,
retrieves subsets of rows from a relational table based on a value(s) in a column or columns.
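Rough SQL sketches of projection and selection, using an assumed table A and assumed
columns; DISTINCT mirrors the removal of duplicate rows mentioned for projection, and
the WHERE clause performs the restriction.

    -- Projection: retrieve a subset of columns, removing duplicate rows.
    SELECT DISTINCT col1, col2 FROM A;

    -- Selection (restrict): retrieve the subset of rows that satisfy a condition.
    SELECT * FROM A WHERE col1 = 1;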
Join
A join operation combines the product, selection, and, possibly, projection. The join operator horizontally
combines (concatenates) data from one row of a table with rows from another or the same table when
certain criteria are met. The criteria involve a relationship among the columns of the tables being joined. If
the join criterion is based on equality of column value, the result is called an equijoin. A natural join is an
equijoin with redundant columns removed.
Figure 5 illustrates a join operation. Tables D and E are joined based on the equality of k in both tables. The
first result is an equijoin. Note that there are two columns named k; the second result is a natural join with
the redundant column removed.
Figure 5: Join
Joins can also be done on criteria other than equality.
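A sketch of the join of tables D and E described above, assuming each has a column k
(other columns are unnamed here); the first query is an equijoin, in which both copies of
k appear, the second a natural join, where NATURAL JOIN is supported, in which the
redundant column appears only once.

    -- Equijoin on the equality of k in both tables.
    SELECT *
    FROM D JOIN E ON D.k = E.k;

    -- Natural join: the redundant k column is removed from the result.
    SELECT *
    FROM D NATURAL JOIN E;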
Division
The division operator results in the column values in one table for which there are matching column
values corresponding to every row in another table.
Figure 6: Division
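Division has no direct SQL operator; a common formulation uses a double NOT EXISTS.
As a hedged sketch with hypothetical tables Supplies(SupplierNo, PartNo) and
Part(PartNo), the query below finds the suppliers that supply every part, i.e. Supplies
divided by Part.

    -- Suppliers for which there is no part that they do not supply.
    SELECT DISTINCT s.SupplierNo
    FROM Supplies s
    WHERE NOT EXISTS (
        SELECT * FROM Part p
        WHERE NOT EXISTS (
            SELECT * FROM Supplies s2
            WHERE s2.SupplierNo = s.SupplierNo
              AND s2.PartNo     = p.PartNo
        )
    );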