Download Module 1 – Background - Dalhousie University

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Business intelligence wikipedia , lookup

Data model wikipedia , lookup

Concurrency control wikipedia , lookup

Database wikipedia , lookup

Versant Object Database wikipedia , lookup

Clusterpoint wikipedia , lookup

Data vault modeling wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Database model wikipedia , lookup

Relational algebra wikipedia , lookup

Relational model wikipedia , lookup

Transcript
CSCI 3140
Module 1 – Background
(Based on Chapters 1 – 13 of
Database Systems by Connolly and Begg)
Theodore Chiasson
Dalhousie University
Definitions
• Database
- A collection of related data
• Database Management System (DBMS)
– Software that manages and controls access to the database
File-based systems
• Early attempt to computerize manual filing systems
• Limitations of file-based systems include:
– Separation and isolation of data
• Application programs often access many files
– Duplication of data
• Data needed in several departments will be entered into each
department’s system
– Data dependence
• Application programs need to be updated if file formats are changed
– Incompatibility of files
• Different application development environments produce different
file formats
– Fixed queries/proliferation of application programs
• Application programmer needs to intervene for ad hoc queries
Two underlying problems with
file-based systems:
1)
The definition of data is embedded within the application programs rather
than being stored separately and independently
2)
There is no control over access and manipulation of data beyond that
imposed by the application programs
Definitions (revisited)
• Database
- A shared collection of logically related data, and a description of
this data, designed to meet the information needs of an
organization
• Database Management System (DBMS)
– A software systems that allows user to define, create, maintain,
and control access to the database.
Important database terms
• System catalog, data dictionary, or meta-data
– The ‘data about the data’
– The part of the database that makes it self-describing
• Entity
– A distinct object (person, place, thing, concept or event) that is
represented in the database
• Attribute
– A property describing some aspect of an entity
• Relationship
– An association between entities
Facilities provided by a DBMS:
• Data Definition Language (DDL)
– Used to specify the data types and structures and the constraints on the
data to be stored in the database
• Data Manipulation Language (DML)
– Used to insert, update, delete, and retrieve data from the database
• Structured Query Language (SQL)
– De facto standard language for relational DBMSs.
• Controlled access to the database
–
–
–
–
–
Security system
Integrity system
Concurrency control system
Recovery control system
User-accessible catalog
ANSI-SPARC three-level architecture
• External level
– The users’ view of the database
– Describes the part of the database that is relevant to each user
– External schemas, or subschemas
• Conceptual level
– The community view of the database
– Describes what data is stored in the database and the relationships
among the data
– Conceptual schema
• Internal level
– The physical representation of the database on the computer
– Describes how the data is stored in the database
– Internal schema
ANSI-SPARC three-level architecture
External
level
Conceptual
level
Internal
level
Physical data
organization
User 1
User 2
User n
View 1
View 2
View n
Conceptual
Schema
Internal
Schema
The Relational Model
• All data is logically structured in relations (tables)
• Each relation has a name and is made up of attributes (columns) of data
• Each tuple (row) contains one value per attribute
Relation: a table with columns and rows (also called a file)
Attribute: a named column of a relation (also called a field)
Tuple: a row of a relation (also called a record)
Domain: the set of allowable values for one or more attributes
Degree: the number of attributes in a relation (unary, binary, ternary, n-ary)
Cardinality: the number of tuples in a relation
Relational database: a collection of normalized relations with distinct relation
names
Mathematical review
•
•
•
•
Let A and B be sets, where A = {2,4} and B = {1,3,5).
The CARTESIAN PRODUCT of these two sets, written A X B, is the set of
all ordered pairs such that the first element is an element of A and the
second element is an element of B, which gives us the set
{(2,1),(2,3),(2,5),(4,1),(4,3),(4,5)}
Any subset of this set is termed a relation
For example,
R = {(2,1),(4,1)}
is a relation. In this case, R can be specified as all those ordered pairs with
the second element equal to 1, or
R = {(x,y) | x E A, y E B, and y = 1}
We could also define a relation S as
S = {(x,y) | x E A, y E B, and x = 2y}
which, given set A and B above, would yield
S = {(2,1)}
Database Relations
•
Relation Schema
– a named relation defined by a set of attribute and domain name pairs
•
Let A1, A2, … An be attributes with domains D1, D2, ... Dn. Then the set
{A1:D1, A2: D2, … An: Dn} is a relation schema. Thus, a relation R is a set of
n-tuples:
(A1:d1, A2,d2, … An:dn) such that d1E D1, d2E D2, … dnE Dn
Tuples are normally written without the column names, yielding (d1,d2, … dn)
A relation instance is a set of tuples, such as
{(6050, Theo, 317)}
{(ID:6050, NAME: Theo, Office: 317)}
•
Relational database schema
– a set of relation schemas, each with a distinct name
– Also called a relational schema
Relational keys
• Superkey
– An attribute, or set of attributes, that uniquely identifies a tuple within a
relation
• Candidate key
– A superkey such that no proper subset is a superkey within the relation
• Primary key
– The candidate key that is selected to uniquely identify tuples with the
relation
• Foreign key
– An attribute, or set of attributes, within one relation that matches the
candidate key of some (possibly the same) relation
Relational Integrity
• Domain constraints
– Restrict the allowable set of values for the attributes of a relation
• Null
– Represents a value for an attribute that is currently unknown or is not
applicable for this tuple
• Entity integrity
– In a base relation, no attribute of a primary key may be null
• Referential integrity
– If a foreign key exists in a relation, either the foreign key value must
match a candidate key value of some tuple in its home relation or the
foreign key value must be wholly null
• Enterprise constraints
– Additional rules specified by the users or database administrators of a
database
Views
• Base Relation
– A named relation corresponding to an entity in the conceptual schema,
whose tuples are physically stored in the database
• View
– The dynamic results of one or more relational operations operating on
the base relations to produce another relation.
– A view is a virtual relation that does not necessarily exist in the
database but can be produced upon request by a particular user, at the
time of the request
– Not all views are updatable.
Relation Algebra
• A theoretical language with operations that work on one or more
relations to define another relation without changing the original
relation(s)
• Five fundamental operations:
–
–
–
–
–
Selection
Projection
Cartesian product
Union
Set difference
• Three additional operations (can be derived from the five)
– Join
– Intersection
– Division
Unary operations
• Selection operation
– Input is a relation and a predicate. The predicate is constructed from
boolean expressions involving attributes of the relation using logical
operators ^ (AND), v (OR), and ~ (NOT). Output is a new relation that
consists of only those tuples in the original relation for which the
predicate evaluates to TRUE.
– Example: σsalary > 100000(Staff)
– Result: A new relation with the same attributes as the Staff relation, and
with tuples corresponding to staff with a salary greater than 100,000.
– The degree of the new relation is the same as the degree of the original
relation.
• Projection operation
– Input is a relation and a set of attributes of the relation. Output is a new
relation with only those attributes listed in the input set. The degree of
the new relation is the number of attributes in the input set, and the
cardinality of the new relation is the same as the cardinality of the
original relation.
– Π(staffNo, salary)(Staff)
Set Operations
• Union (R U S)
– Input is two union-compatible relations R and S, output is a new relation
containing all of the tuples of R and S
– Union-compatibility is defined as having the same number of attributes
with each pair of corresponding attributes having the same domain
• Set Difference (R – S)
– Yields a new relation consisting of tuples that are in relation R that are
not in relation S. R and S must be union-compatible.
• Intersection (R
S)
– Yields a new relation consisting of tuples that are in both relation R and
relation S. Equivalent to R – (R – S)
• Cartesian Product (R x S)
– Yields a new relation that is the concatenation of every tuple of R with
every tuple of S
Join Operations
• Theta join ( R
FS)
– Yields a relation with all tuples of the Cartesian Product of R and S that
satisfy the predicate F
• Equijoin ( R
FS)
– A Theta join in which the only predicate is equal (=)
• Natural join ( R
S)
– Equijoin over all common attributes of R and S. One occurrence of each
common attribute is eliminated from the result.
• Outer join ( R
S)
– Tuples from R which do not have matching values in the common
attributes of S are also included in the join, with missing values in the
second relation set to null.
• Semijoin ( R
FS)
– Contains the tuples of R that participate in a join of R with S
Division operation (R – S)
• The division operation defines a relation over the attributes C that
consists of the set of tuples from R that match the combination of
every tuple is S, where C is the set of attributes that are in R but are
not in S.
Relational Calculus
• Based on a branch of symbolic logic called predicate calculus
– A predicate is a truth-valued function with arguments
– A proposition is a predicate with values filled in for the arguments
– If P is a predicate, the set of all x such that P is true for x is expressed
as
{x | P(x)}
Tuple Relational Calculus
• Tuple relational calculus
– Find tuples for which a predicate is true
– Tuple variables `range over’ a named relation
– Specify a range of a tuple variabe S as the Staff relation using the
notation
Staff(S)
– To express the query `find all tuples of S such that F(S) is true’, we write
{S | F(S)}
F is a formula, or well-formed formula (wff)
– Quantifiers tell how many instances the predicate applies to
• Universal quantifier `for all’
• Existential quantifier `there exists’
– Variables qualified with a quantifier are bound variables, those not
qualified are free variables
– All free variables are to the left of the vertical bar in a wff
General form of an expression in Tuple Relational Calculus
{ S1.a1, S2.a2, …, Sn.an | F(S1, S2, …, Sm)} m ≥ n
Where S1, S2, …, Sm are tuple variables, each ai is an attribute of the
relation over which Si ranges, and F is a formula
A wff is made up of atoms, where atoms can be:
- R(Si) where Si is a tuple variable and R is a relation
- Si.a1 θ Sj.a2 where Si and Sj are tuple variables, a1 is an attribute of
the relation over which Si ranges, a2 is an attributes of the relation
over which Sj ranges, and θ is a comparison operator (<,≤,>,≥,=,≠);
the attributes a1 and a2 must have domains whose members can be
compared by theta
- Si.a1 θ c where where Si is a tuple variable, a1 is an attribute of the
relation over which Si ranges, c is a constant from the domain of
attribute a1, and θ is one of the comparison operators.
Domain Relational Calculus
•
Domain relational calculus
– Variables take their values from the domains of attributes rather than the tuples
of relations
– Expressions have the general form:
{d1, d2, …, dn | F(d1, d2, …, dm)} m >= n
Where d1, d2, …, dn, …, dm represent domain variables and F(d1, d2, …, dm)
represents a formula composed of atoms, where each atom has one of the
following forms:
- R(d1, d2, …, dn) where R is a relation of degree n and each di is a domain
variable
- di θ dj ,where di and dj are domain variables and θ is a comparison operator
- di θ c,where di is a domain variable, c is a constant from the domain of di, and θ
is a comparison operator
We recursively build up formulae from atoms where
- an atom is a formula,
- the conjunction or disjunction of two formulae is a formula, the negation of a
formula is a formula
- Existential and universal quantifiers can be applied
Entity-Relationship Modeling
• Entity type
– A group of objects with the same properties, which are identified by the
enterprise as having an independent existence
• Entity occurance
– A uniquely identifiable object of an entity type
• Each entity type in an E-R diagram is represented as a rectangle
labeled with the name of the entity
Staff
PropertyForRent
Entity-Relationship Modeling
• Relationship type
– A set of meaningful associations among entity types
• Relationship occurance
– A uniquely identifiable association, which includes one occurrence from
each participating entity type
• Each relationship type in an E-R diagram is represented as a line
connecting the associated entity types, labeled with the name of the
relationship
Staff
Has
Branch
Entity-Relationship Modeling
• Degree of a relationship type
– The number of participating entity types in a relationship
• Binary: A branch has staff
Has
Staff
Branch
• Ternary: Staff registers a client at a branch
Staff
Registers
Branch
Client
• Quaternary: A solicitor arranges a bid for a buyer through a bank
Solicitor
Buyer
Arranges
Bid
Bank
Entity-Relationship Modeling
• Recursive relationship
– A relationship type where the same entity type participates more than
once in different roles
• Example: Staff (supervisor) supervises staff (Supervisee)
Supervises
Supervisor
Supervisee
Staff
• Example of entities associated through two distinct relationships
– Manager manages branch office
– Branch office has member of staff
Manager
Manages
Branch
Staff
Member of staff
Branch
Has
Branch
Entity-Relationship Modeling
•
Attribute
– A property of an entity or relationship type
•
Attribute domain
– The set of allowable values for one or more attributes
•
Simple Attribute
– An attribute composed of a single component with an independent existence
•
Composite Attribute
– An attribute composed of multiple components, each with an independent
existence
• Examples
– Address (could be broken into street, city, postal code)
– Name (could be broken into FirstName, MiddleInitial, LastName)
•
Single-valued Attribute
– An attribute that holds a single value for each occurrence of an entity type
•
Multi-valued Attribute
– An attribute that holds multiple values for each occurance of an entity type
• Example: Each branch may have several phone numbers
•
Derived Attribute
– An attribute that represents a value that is derivable from the value of a related
attribute or set of attributes, not necessarily in the same entity type
Entity-Relationship Modeling
• Candidate Key
– The minimal set of attributes that uniquely identifies each occurance of
an entity type
• Primary Key
– The candidate key that is selected to uniquely identify each occurrence
of an entity type
• Composite Key
– A candidate key that consists of two or more attributes
Entity-Relationship Modeling
• Diagrammatic representation of attributes in ER diagram
Manages
Staff
Area for
attribute
list
staffNo {PK}
name
position
salary
/totalStaff
Derived
attribute
Branch
Has
branchNo {PK}
address
street
city
postalCode
telNo[1..3]
Multi-valued
attribute
Primary key
Composite
attribute
Entity-Relationship Modeling
• Strong entity type
– An entity type that is not existence-dependent on some other entity type
• Weak entity type
– An entity type that is existence-dependent on some other entity type
• Attributes on relationships
– Attributes can be attached to relationships as well as entities
– Diagrammatic representation:
Advertises
Newspaper
PropertyForRent
propertyNo
newspaperName
dateAdvert
cost
Entity-Relationship Modeling
• Multiplicity
– The number of possible occurrences of an entity type that may relate to
a single occurrence of an associated entity type through a particular
relationship
• One-to-one (1:1)
• One-to-many (1:*)
• Many-to-many (*:*)
– Cardinality
• Describes the maximum number of possible relationship occurrences for an
entity participating in a given relationship
– Participation
• Determines whether all or only some entity occurrences participate in a
relationship
Entity-Relationship Modeling
• Specialization/Generalization
– Superclass
• An entity type that includes one or more distinct subgroupings of its
occurrences
– Subclass
• A distinct subgrouping of occurrences of an entity type
– (name of subclass) IS-A (name of superclass)
• Manager IS-A Staff
• Secretary IS-A Staff
– Specialization
• The process of maximizing the differences between members of an entity by
identifying their distinguishing characteristics
– Generalization
• The process of minimizing the differences between entites by identifying
their common characteristics
Entity-Relationship Modeling
• Diagrammatic representation of specialization/generalization
Has
Branch
branchNo {PK}
address
street
city
postalCode
1..1
1..*
Staff
staffNo {PK}
name
position
salary
1..1
{Optional, And}
Manages
1..1
Manager
mgrStartDate
bonus
SalesPersonnel
salesArea
carAllowance
Secretary
typingSpeed
Normalization
• Unnormalized form
– A table that contains one or more repeating groups
Artist
Album
Supertramp
Crime of the Century
Even in the quietest moments
Breakfast in America
Sting
Dream of the Blue Turtles
Nothing like the Sun
Normalization
• First normal form (1NF)
– A relation in which the intersection of each row and column contains one
and only one value
Artist
Album
Supertramp
Crime of the Century
Supertramp
Even in the quietest moments
Supertramp
Breakfast in America
Sting
Dream of the Blue Turtles
Sting
Nothing like the Sun
Normalization
• Functional dependency
– Describes the relationship between attributes in a relation
– If A and B are attributes of a relation R, B is functionally dependent on A
(denoted A -> B) if each value of A is associated with exactly one value
of B.
• (A and B may each consist of one or more attributes)
• Determinant
– The attribute or set of attributes on the left-hand-side of the arrow of a
functional dependency
• Full functional dependency
– Indicates that if A and B are attributes in a relation, B is fully functionally
dependent on A if B is functionally dependent on A, but not on any
proper subset of A
• Second normal form (2NF)
– A relation that is in first normal form and every non-primary-key attribute
is fully functionally dependent on every candidate key
Normalization
• Transitive dependency
– A condition where A, B, and C are attributes of a relation such that if
A->B and B->C , then C is transitively dependent on A via B
(provided that A is not fully functionally dependent on B or C)
• Third normal form (3NF)
– A relation that is in first and second normal form, and in which no
non-primary-key attribute is transitively dependent on any candidate
key
Normalization
• Boyce-Codd normal form (BCFN)
– A relation that is in BCNF if and only if every determinant is a
candidate key
• Example:
ClientInterview(clientNo, interviewDate, interviewTime, staffNo, roomNo)
clientNo, interviewDate -> interviewTime, staffNo, roomNo
staffNo, interviewDate, interviewTime -> clientNo
roomNo, interviewDate, interviewTime -> staffNo, clientNo
staffNo, interviewDate -> roomNo
(Primary key)
(Candidate key)
(Candidate key)
Split into two relations that are in BCNF:
Interview(clientNo, interviewDate, interviewTime, staffNo)
StaffRoom(staffNo, interviewDate, roomNo)
Normalization
• Multivalued dependency
– Represents a dependency between attributes in a relation, such that for
each value of A there is a set of values for B and a set of values for C.
However, the set of values for B and C are independent of each other.
• Trivial multivalued dependency
– If A->> B is a multivalued dependency in relation R, it is considered
trivial if either
(a) B is a subset of A
or
(b) A U B = R
• Fourth normal form (4NF)
– A relation that is in BCNF and contains no nontrivial multivalued
dependencies
Example:
BranchStaffOwner(branchNo, sName, oName)
BCNF
branchNo ->> sName
branchNo ->> oName
4NF
4NF
Normalization
• Lossless-join dependency
– A property of decomposition, which ensures that no spurious tuples are
generated when relations are reunited through a natural join operation
• Fifth normal form (5NF)
– A relation has no join dependencies
Example:
Chrysler, Dodge, Jeep dealership carries cars from all three
manufacturers. If the dealership decides to offer trucks for sale, trucks
from all manufacturers that make trucks must be offerred.
Manufacturer(manufacturerNo, vehicleType)
Dealer(dealerNo, dealerName)
VehicleTypes(vehicleTypeNo, vehicleType, vehicleTypeDescription)
VehicleTypesOffered(manufacturerNo, dealerNo, vehicleTypeNo)
Normalization
•
Unnormalized form
– A table that contains one or more repeating groups
•
First normal form (1NF)
– A relation in which the intersection of each row and column contains one and
only one value
•
Second normal form (2NF)
– A relation that is in first normal form and every non-primary-key attribute is fully
functionally dependent on every candidate key
•
Third normal form (3NF)
– A relation that is in first and second normal form, and in which no non-primarykey attribute is transitively dependent any candidate key
•
Boyce-Codd normal form (BCNF)
– A relation is in BCNF, if and only if, every determinant is a candidate key
•
Fourth normal form (4NF)
– A relation is in Boyce-Codd normal form and contains no nontrivial multi-valued
dependencies
•
Fifth normal form (5NF)
– A relation has no join dependencies