Data Modeling Overview
By: Dave Wentzel
What we will accomplish
• Review of DBMS
• Issues related to DBMS
• Entity Relationship Modeling
– Process flow
– Model types
– Component definition
• Selecting entities and attributes
• Defining relationships
• Defining cardinality
• Selecting primary keys
• Review of recursive relationships, weak entities, and ternary relationships
• Participation constraints
• Erwin notation
• NULL issues
• The physical model
• Generalization / specialization
• Transaction processing
• Normalization rules
• History issues
What is data?
• Data
– Raw facts that can be described, observed, and measured.
• Information
– Data organized in a form that is useful for decision making. The meaning behind the data.
– Something new, not previously observed, that is created based on the data.
• Knowledge
– Information that is used for decision making.
What is a Database?
• Collection of interrelated data
• Data which can be visualized in a table format
• Contains relationships between data
• Can be of any size and varying complexity
• Can be maintained manually or by computer
Database Management System (DBMS)
• Collection of programs (software) that allows users to create and maintain a database
• Supports data:
– Definition: specification of data types, structures, and constraints
– Construction: storing of the data itself
– Manipulation: updating and querying of the data
• Is self-describing: contains a catalog which describes its data.
Components of a DBMS
• Catalog
– Maintains information about the data in the database
– Considered data about data (metadata)
• Databases
– Collection of related tables
• Tables
– Rows and columns containing data
Issues in DBMS
• Data independence
• Query optimization
– Improve efficiency
– Faster responses
• Transaction management
– Sequence of operations that are treated as a unit
– Once the 1st step is completed, the 2nd step must also be completed; otherwise the 1st step is undone (ROLLBACK mechanism)
– Example: transferring bank funds
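The bank transfer example can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module; the `account` table, its balances, and the `transfer` helper are assumptions made up for the example, not part of the original slides.

```python
import sqlite3

# Hypothetical schema and balances, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one unit of work; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))


ok = transfer(conn, 1, 2, 30)       # both steps succeed and are committed
failed = transfer(conn, 1, 2, 500)  # the debit violates the CHECK, so the credit is rolled back too
```

In the failed call the credit to account 2 has already run when the debit is rejected, and the rollback is what keeps the two balances consistent.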
Issues in DBMS continued
• Transaction management
– Concurrency
– Recovery
• Controlled redundancy
– Goal of database design is to minimize redundancy (duplicate data)
• Integrity constraints
– Includes business rules and data rules
• Security and privacy
– Protect against unauthorized access
• Data / database administration
– Involves managing people, data, performance, security, etc.
Entity Relationship Modeling
[Figure: example entities Employee, Person, Account, Transaction]
Data Model
• Tool for describing data, its relationships, semantics, and integrity constraints
• Provides for data abstraction
• Hides details of data storage
Why use an ER Model?
• Easy to use for modeling DB design
• Succinct representation of database layout
• Good communication tool among project team members
• Most CASE tools support ER modeling
• Implementation independent
Categories of Data Models
• Logical model
– Conceptual data model
– High level model
– Closest view user has of the data
• Physical model
– Low level model
– Defines how data is stored
Steps in Database Design
[Figure: database design flow. Mini World → Requirements Collection and Analysis → Functional Requirements and Database Requirements. Functional side: Functional Analysis → High Level Transaction Requirement → Application Program Design → Transaction Implementation → Application Programs. Database side: Logical Model → Data Model Mapping → Physical Design → Internal Schema. Other labels: DBMS Independent, DBMS Specific, API.]
ER Modeling composed of
• Entity (table)
• Attribute (field)
• Relationship
– Binary relationships
– Cardinality of relationships
What is an entity?
• Conceptual definition
– Distinguishable object that exists
• Operational definition
– Business object that has properties we are interested in storing
• Physical definition
– Set of related data forming a table composed of attributes (fields)
Entities
• Primary THINGS of a business about which users need to record data
• Objects about which the business is interested in tracking information
• When an ER diagram is translated into a relational model, the entities become the tables.
Selecting Entities
• Nouns are candidate entities
• Possible classes of entities:
– People who carry out some function (employees, students, customers)
– Places (cities, offices, routes)
– Things which are tangible physical objects (equipment, products, buildings)
– Organizations (teams, suppliers, departments)
– Events which occur at a given date/time or have steps (employee promotions, project phases, account payments)
– Concepts which are intangible ideas used to keep track of business activities (projects, accounts, complaints)
Questions to ask...
• What things do we need to keep data about?
• What things are essential to the organization?
• What things do we talk about in the organization?
• What questions do we have that reports can help answer?
• What information should the reports contain?
Naming entities
• Use a SINGULAR noun
• Meaningful and intuitive
• Avoid names which may be misinterpreted within the problem domain
• Follow organizational / industry trends
• Do not try to rename entities the organization already has names for
• Avoid overused names such as Task, Form, Operation, Schedule...
Is it an entity to worry about?
• Decide if an entity is relevant to your problem domain by determining if it has attributes you need to track
• If it does not have attributes you need to track, it is NOT a valid entity for your problem
Is it really an entity?
• Can you define attributes for it? An attribute is a piece of information that we are interested in tracking about an entity. It is a property of an entity.
• In general, if two objects differ by one attribute, they are separate entities.
• Does it participate in a relationship? Two entities that are related somehow interact with one another.
Attributes
• Properties of an object (entity)
• Each attribute has a data type (char, int, datetime)
• Each attribute in an RDBMS (relational database management system) has only one value at a time (atomic)
Categories of Attributes
• Descriptive
– Property of the entity that helps describe the entity
• Identifying (key attributes)
– Property of the entity that helps uniquely identify the entity
– Normally short
– If one does not exist it MUST be created
– If creating a key, use a numeric/integer data type
Types of Attributes
• Atomic
– Indivisible value
– Most desired state
• Composite
– Can be divided into smaller parts
– Need to convert into atomic
• Multi-valued
– Multiple instances of an attribute
– Normally create another entity
• Derived
– Can be determined by the value of another attribute or attributes
– In most cases, do NOT store derived attributes
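As a small illustration of the derived-attribute rule, age can be computed from a stored birth date whenever it is needed instead of being stored and kept up to date. The function name `age_on` and the sample dates are assumptions for the example.

```python
from datetime import date


def age_on(birth_date, today):
    """Derive age in whole years from a stored birth date."""
    years = today.year - birth_date.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


birth = date(1990, 6, 15)               # the stored (atomic) attribute
day_before = age_on(birth, date(2024, 6, 14))
on_birthday = age_on(birth, date(2024, 6, 15))
```

A stored age column would need an update on every birthday; the derived value cannot go stale.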
Naming Attributes
• Use a noun, adjective, or adverb
• Name should be unique database wide
• Use attribute names consistently
• Use singular names
• Define a naming convention for the organization
Rules for Entity Analysis
• Every noun is a candidate for an entity
• Every entity should be relevant to the problem
• If an object has only one property of importance, then it should be considered an attribute of another entity
• If an object has only one data instance (1 row), then do not model it as an entity
• If an object needs a unique identifier, then model it as an entity
Relationships
• Way entities interact with one another
• An association between two or more entities
• Depicts business interactions between entities
• They DO NOT represent business flow
• Number of entities associated through a relationship defines its degree (unary, binary, ternary, n-ary)
• Cardinality defines the maximum number of entities that can participate in the relationship
How to Identify a Relationship
• Ask what is the action or verb used to describe how one entity interacts with another
• Three types of relationships to consider:
– Existence (Employee HAS Children)
– Functional (Professor TEACHES Course)
– Event (Customer PLACES Order)
• Ignore verbs not important to the organization
More on Relationships
• Relationships and cardinality constraints represent business rules
• When naming a relationship, use an active verb in the present tense
• Relationships are read bi-directionally
Example notes:
• Together the customer and account tables form a schema: the structure / layout of a logical database design
• Note the attributes. Order DOES NOT MATTER, but convention puts the primary key first.
• No duplicate attributes.
• No duplicate tuples (rows).
• Relationship: the same attribute name (or a different attribute name with the same meaning) in 2 tables.
Cardinality Constraints
• Express the MAXIMUM number of entities that can be associated with another entity via a relationship
• Also known as mapping constraints
• Types:
– 1:1 (one to one)
– 1:N (one to many)
– N:M (many to many)
The Key to It All
Identifiers...
• Attribute(s) which uniquely identify a record
• An entity may have multiple identifiers
• Every entity MUST have at least one
• Can be made up of more than one attribute
Candidate vs. Primary Keys
• Both are identifiers
• Candidate keys are all the identifiers that uniquely identify the record and from which you can choose
• Primary key is the one candidate key which is selected to always uniquely identify the record
Selecting the Primary Key
• In general we create a primary key; however, when choosing among existing attributes:
• Choose the attribute most widely used in queries
• Select the shorter data type
• If one does not exist, must create one
• Select a MINIMUM key if using compound attributes (not recommended)
Key Requirements and Preferences
• Known at all times
• Can NOT be null
• Should not be changed
• Shorter is better
• Numeric / integer is better
• Avoid keys containing letters O, I, Z, S, which can be confused with numbers
• If key includes time, it should be in 24hr format
• Avoid carrying meaning
With this all said...
• It is difficult to come up with a primary key based on real attributes which will not change over time (phone numbers, SSN, addresses, driver’s license numbers…)
• In most cases it is best to create the primary key
• In SQL Server you can use an identity column, which creates a sequential number
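A created (surrogate) key can be sketched as follows. The slides mention SQL Server's identity column; this example instead uses SQLite through Python's sqlite3 module, where an `INTEGER PRIMARY KEY` column is assigned the next integer automatically. The `customer` table and its rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In SQLite, an INTEGER PRIMARY KEY column gets the next integer when no value is supplied.
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " phone TEXT)"
)

ids = []
for name, phone in [("Ann", "555-0100"), ("Raj", "555-0101")]:
    cur = conn.execute(
        "INSERT INTO customer (name, phone) VALUES (?, ?)", (name, phone)
    )
    ids.append(cur.lastrowid)  # the generated key for the row just inserted

# A real-world attribute can change without touching the key.
conn.execute(
    "UPDATE customer SET phone = ? WHERE customer_id = ?", ("555-0199", ids[0])
)
phone_now = conn.execute(
    "SELECT phone FROM customer WHERE customer_id = ?", (ids[0],)
).fetchone()[0]
```

The key carries no meaning and never needs to change, which matches the key preferences listed earlier.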
Primary Keys and Relationships
• In a 1:1 relationship, the primary key of either one of the entities must migrate to the other entity
• In a 1:N, the primary key of the 1 side must migrate to the entity on the N side
• In an M:N, the keys of both entities are used to identify a new entity which resolves the M:N into two 1:N relationships
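Resolving an M:N relationship into two 1:N relationships might look like the sketch below. The `student`, `course`, and `enrollment` tables are hypothetical names chosen for the example; SQLite via Python's sqlite3 module stands in for whichever DBMS is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);

-- New entity resolving the M:N: its key is the pair of migrated keys.
CREATE TABLE enrollment (
    student_id INTEGER NOT NULL REFERENCES student(student_id),
    course_id  INTEGER NOT NULL REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)
);

INSERT INTO student VALUES (1, 'Ann'), (2, 'Raj');
INSERT INTO course  VALUES (10, 'Databases'), (20, 'Networks');
INSERT INTO enrollment VALUES (1, 10), (1, 20), (2, 10);
""")

# One student, many courses (first 1:N).
courses_for_ann = [row[0] for row in conn.execute(
    "SELECT c.title FROM enrollment e "
    "JOIN course c ON c.course_id = e.course_id "
    "WHERE e.student_id = 1 ORDER BY c.title"
)]

# One course, many students (second 1:N).
students_in_databases = [row[0] for row in conn.execute(
    "SELECT s.name FROM enrollment e "
    "JOIN student s ON s.student_id = e.student_id "
    "WHERE e.course_id = 10 ORDER BY s.name"
)]
```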
Foreign Key
• When a key migrates to another entity it is called a Foreign Key
• A foreign key CAN BE null if it is not part of an entity’s primary key
• If the FK value is NOT null, then that value MUST exist in the table in which it is the primary key. This is called Referential Integrity (RI)
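The two foreign key rules above can be checked with a short sketch. It uses SQLite through Python's sqlite3 module, where foreign key enforcement has to be switched on per connection; the `department` and `employee` tables and the `try_insert` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " dept_id INTEGER REFERENCES department(dept_id))"  # nullable, not part of the PK
)
conn.execute("INSERT INTO department VALUES (1, 'Research')")


def try_insert(emp_id, name, dept_id):
    """Return True if the employee row is accepted, False if RI rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO employee VALUES (?, ?, ?)", (emp_id, name, dept_id)
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "existing parent": try_insert(1, "Ann", 1),      # FK value exists in department
    "null foreign key": try_insert(2, "Raj", None),  # NULL FK is allowed
    "missing parent": try_insert(3, "Lee", 99),      # no department 99, so RI rejects it
}
```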
Recursive Relationships
• An entity having a relationship with itself
• Same entity participates more than once in a relationship type in different roles
• The same cardinality options apply to recursive relationships
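A common recursive relationship is "employee MANAGES employee", where one table plays both the manager role and the report role. The sketch below assumes a hypothetical `employee` table with a `manager_id` column that refers back to the same table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    emp_id     INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    manager_id INTEGER REFERENCES employee(emp_id)  -- NULL at the top of the hierarchy
);
INSERT INTO employee VALUES (1, 'Ann', NULL), (2, 'Raj', 1), (3, 'Lee', 1);
""")

# Join the table to itself: alias m is the manager role, alias e is the report role.
reports = conn.execute(
    "SELECT m.name, e.name FROM employee e "
    "JOIN employee m ON m.emp_id = e.manager_id "
    "ORDER BY e.name"
).fetchall()
```

This is a 1:N recursive relationship: one manager has many reports, and each report has at most one manager.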
Weak Entity Type
• Entity that does not have a key attribute of its own
• Identified by its relationship with another entity
• Created for multi-valued attributes and time dependent attributes
• Weak entity has EXISTENCE dependence on the parent. It only exists if the owner entity exists.
Primary Keys of Weak Entities
• Can use the primary key of the owner entity along with a qualifier such as sequence number or date/time
• Can create a surrogate key but make sure you migrate the key of the parent
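The first option (owner key plus a sequence number) might be sketched like this, using the "Employee HAS Children" relationship as the weak entity. The table names, the `seq_no` column, and the `add_child` helper are assumptions for the example; `ON DELETE CASCADE` is one way to express the existence dependence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on
conn.executescript("""
CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Weak entity: identified by the owner's key plus a sequence number.
CREATE TABLE child (
    emp_id INTEGER NOT NULL REFERENCES employee(emp_id) ON DELETE CASCADE,
    seq_no INTEGER NOT NULL,
    name   TEXT NOT NULL,
    PRIMARY KEY (emp_id, seq_no)
);

INSERT INTO employee VALUES (1, 'Ann'), (2, 'Raj');
""")


def add_child(conn, emp_id, name):
    """Insert a child row using the next sequence number for that employee."""
    (next_seq,) = conn.execute(
        "SELECT COALESCE(MAX(seq_no), 0) + 1 FROM child WHERE emp_id = ?", (emp_id,)
    ).fetchone()
    conn.execute("INSERT INTO child VALUES (?, ?, ?)", (emp_id, next_seq, name))
    return (emp_id, next_seq)


keys = [add_child(conn, 1, "Sam"), add_child(conn, 1, "Kim"), add_child(conn, 2, "Lou")]

# Removing the owner removes its weak entities as well.
conn.execute("DELETE FROM employee WHERE emp_id = 1")
remaining = conn.execute(
    "SELECT emp_id, seq_no FROM child ORDER BY emp_id, seq_no"
).fetchall()
```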
Ternary Relationship
• Relationship between 3 entities
• Differs from 3 binary relationships
• States that all three entities occur at the same time
• Must be converted to binary relationships
Creating Binary Relationships from a Ternary Relationship
Participation Constraints
• Specifies whether the existence of an entity depends on its being related to another entity via a relationship
• Notes the minimum cardinality
• Total participation (mandatory)
• Partial participation (optional)
Identifying Participation Constraints
• Can entity A exist without entity B?
– If no, entity A has total participation in the relationship
– If yes, entity A has partial participation in the relationship
Identifying Relationships In Erwin
• An identifying relationship is a relationship between two tables in which an instance of a child table is identified through its association with a parent table, which means the child table is dependent on the parent table for its identity, and cannot exist without it. In an identifying relationship, one instance of the parent table is related to multiple instances of the child.
Non-Identifying Relationship In Erwin
• A non-identifying relationship is a relationship between two tables in which an instance of the child table is not identified through its association with a parent table, which means the child table is not dependent on the parent table for its identity, and can exist without it. In a non-identifying relationship, one instance of the parent table is related to multiple instances of the child.
Optional Non-Identifying
• In an optional non-identifying relationship, the columns that are migrated into the non-key area of the child table are not required in the child table. This means that nulls are allowed in the foreign key. Erwin draws an optional non-identifying relationship differently depending on the notation for your diagram.
Mandatory Non-Identifying
• In a mandatory non-identifying relationship, the columns that are migrated into the non-key area of the child table are required in the child table. This means that the foreign key cannot be null.
Erwin Notation
[Figure: Erwin notation symbols. Cardinality description: one to 0, 1, or M. Relationship types: identifying; non-identifying (nulls / no nulls).]
To Null or Not to Null….
• NULL means no value
• Two types of null values
– Unknown
– None (does not exist or not applicable)
Null Examples
Employee
e#   name    salary   spouse
1    Bob     10,000   Mary
2    Jack    20,000   Kate
3    Mary    30,000   NULL
4    Kelly   NULL     John
Questions:
• How many people make more than 15K?
• What is the average salary?
• Is Mary married?
Problems with NULL
• Null values are ambiguous
• More programming is required to deal with NULL values
• Where applicable, store an explicit UNKNOWN or NONE value instead of NULL
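The ambiguity shows up as soon as the Employee example is queried. The sketch below loads the four rows from the Null Examples slide into SQLite through Python's sqlite3 module; the table definition and variable names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (e_no INTEGER PRIMARY KEY, name TEXT, salary INTEGER, spouse TEXT)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Bob", 10000, "Mary"),
        (2, "Jack", 20000, "Kate"),
        (3, "Mary", 30000, None),   # no spouse, or spouse unknown?
        (4, "Kelly", None, "John"),  # salary unknown
    ],
)


def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


over_15k = scalar("SELECT COUNT(*) FROM employee WHERE salary > 15000")
not_over_15k = scalar("SELECT COUNT(*) FROM employee WHERE NOT (salary > 15000)")
total = scalar("SELECT COUNT(*) FROM employee")

# AVG skips NULLs: it divides by the 3 known salaries, not by the 4 employees.
avg_known = scalar("SELECT AVG(salary) FROM employee")
avg_all_rows = scalar("SELECT SUM(salary) * 1.0 / COUNT(*) FROM employee")
```

Kelly is counted in neither `over_15k` nor `not_over_15k`, so the two groups do not add up to the number of employees, and the two averages disagree. Nothing in the data says whether Mary's NULL spouse means "not married" or "unknown".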
Getting Physical…
• Converting the logical data model into the physical data model
Things to do when converting
• Identify data type
– Is it a string (character field) or a number?
– Use of varchar() or char()?
– Dates are dates, not strings
• Identify data length
– Consider growth over time and maximum size requirements
• Identify value constraints (valid ranges, values, etc.)
• Follow proper naming conventions
• Determine indexes
• Consider combining 1:1 relationship entities
• Roll up generalization / specialization hierarchies
• Add organizational attributes if any
Indexes
• Index is a physical access structure
• Makes queries more efficient
• Things to consider when creating
– Create an index for each PK (primary key)
– Create an index for each FK (foreign key)
– Create an index for each AK (alternate key) which will be used in queries
– Try to minimize number of indexes (update overhead)
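Indexing a foreign key can be sketched as below, again with SQLite through Python's sqlite3 module. The `customer` and `orders` tables and the index name are assumptions for the example. SQLite indexes the primary key on its own, so only the foreign key needs an explicit index here; other DBMSs may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    placed_on   TEXT
);
-- Index on the foreign key used to look up a customer's orders.
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
""")

# Ask SQLite how it would run a lookup by foreign key.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT order_id FROM orders WHERE customer_id = ?", (1,)
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)  # last column holds the plan description
uses_index = "idx_orders_customer_id" in plan_text
```

Every extra index also has to be maintained on insert, update, and delete, which is the update overhead the last bullet warns about.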
Specialization / Generalization
• Inheritance / Abstraction
• Subclasses / Superclasses
• Two processes resulting in the same model
• Specialization is a top-down approach. Can a high level entity be broken down?
• Generalization is a bottom-up approach. Can entities be combined at a higher level?
[Figure: specialization / generalization example]
Notes on Generalization/Specialization
• Key of subclass is always key of superclass
• Subclasses can participate in their own relationships
• Participation in a subclass can either be inclusive or exclusive
• Exclusive subclasses should be defined by a type
• Multiple inheritance not allowed in most modeling tools
• When converting to physical, could combine into one entity
Database Operations
• CRUD
– Create (Insert)
– Read
– Update (Modify)
– Delete
• Transactions cannot violate any integrity constraints
• Several operations may be grouped into a transaction
• Operations may propagate to maintain integrity constraints
If update violations occur
• Cancel the operation (Restrict)
• Perform additional updates / deletes so the violation is corrected (Cascade)
• Execute a user specified operation to correct (Trigger)
• Perform the operation but inform the user
Normalization - What’s normal...
• Process to design a highly desirable relational schema using functional dependencies
• Guidelines for relational database design which
– Minimize redundancy
– Avoid potential inconsistency
– Help predict data behavior problems
– Avoid update anomalies
Update Anomalies
• Insert extra values
• Add redundant records
• Delete records not intended
• Change a fact more than once, possibly in multiple tables
• Miss changing a fact which is repeated multiple times
Normal Forms
• First Normal Form
• Second Normal Form
• Third Normal Form
• Boyce-Codd Normal Form
• Fourth Normal Form
• Fifth Normal Form
[Figure label alongside the list: # of Tables, Joins]
First Normal Form
• A relation is in 1NF if it contains only scalar (atomic) values
– One value for an attribute
– No repeating groups
– No composite attributes
– No multi-valued attributes
• To convert to 1NF
– Create 1 table for each repeating group by adding the PK of the original table
– Remove the repeating group from the original table
Example of Non-1NF w/ Conversion
Non-1NF
Dname            Dnumber   DMGRSSN     Dlocations
Research         5         333445555   {Bellaire, Sugarland, Houston}
Administration   4         987654321   {Stafford, Voorhees}
Headquarters     1         888665555   {Houston}

1NF (note redundancy)
Dname            Dnumber   DMGRSSN     Dlocations
Research         5         333445555   Bellaire
Research         5         333445555   Sugarland
Research         5         333445555   Houston
Administration   4         987654321   Stafford
Administration   4         987654321   Voorhees
Headquarters     1         888665555   Houston
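The conversion rule from the First Normal Form slide (one new table per repeating group, carrying the PK of the original table) avoids the redundancy noted above. The sketch below applies it to the department rows; the function name `to_1nf` and the choice of Dnumber as the primary key are assumptions for illustration.

```python
# Non-1NF rows: (Dname, Dnumber, DMGRSSN, Dlocations), with Dlocations multi-valued.
departments = [
    ("Research", 5, "333445555", ["Bellaire", "Sugarland", "Houston"]),
    ("Administration", 4, "987654321", ["Stafford", "Voorhees"]),
    ("Headquarters", 1, "888665555", ["Houston"]),
]


def to_1nf(rows):
    """Split the repeating group into its own table keyed by the original PK."""
    # Original table with the repeating group removed.
    department = [(dname, dnumber, mgr_ssn) for dname, dnumber, mgr_ssn, _ in rows]
    # New table: PK of the original table plus one location per row.
    dept_location = [
        (dnumber, location)
        for _, dnumber, _, locations in rows
        for location in locations
    ]
    return department, dept_location


department, dept_location = to_1nf(departments)
```

Each department's name and manager SSN now appear once, and each location is a single atomic value.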
Example of Non-1NF
EmployeeProject - NON-1NF
SSN         Ename             Pnumber   Hours
123456789   Smith, John       1         32.5
                              2         7.5
666885555   Narayan, Ramesh   3         40
453223344   English, Joyce    1         20
                              2         20

Conversion
SSN         Ename
123456789   Smith, John
666885555   Narayan, Ramesh
453223344   English, Joyce

SSN         Pnumber   Hours
123456789   1         32.5
123456789   2         7.5
666885555   3         40
453223344   1         20
453223344   2         20
Second Normal Form
• All attributes in the relation have a functional dependency on the complete PK
• Each non-key attribute is uniquely defined by all components of the primary key
Example of Non-2NF w/ Conversion
EmployeeProject (SSN, Pnumber, Hours, Ename, Pname, Plocation)
FD1: SSN, Pnumber → Hours
FD2: SSN → Ename
FD3: Pnumber → Pname, Plocation

Conversion to 2NF
EP1 (SSN, Pnumber, Hours)
EP2 (SSN, Ename)
EP3 (Pnumber, Pname, Plocation)
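Whether an attribute depends on the whole key or only part of it can be tested against sample rows. The sketch below assumes the dependencies implied by the EP1, EP2, EP3 split, reuses the SSN, Ename, Pnumber, and Hours values from the earlier EmployeeProject example, and invents the Pname and Plocation values. The helper `holds` is hypothetical. A check like this can only refute a dependency on the given data; it cannot prove a business rule.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # The same left-hand side must always map to the same right-hand side.
        if seen.setdefault(key, value) != value:
            return False
    return True


columns = ["SSN", "Pnumber", "Hours", "Ename", "Pname", "Plocation"]
data = [
    ("123456789", 1, 32.5, "Smith, John", "Project A", "Site 1"),      # Pname/Plocation invented
    ("123456789", 2, 7.5, "Smith, John", "Project B", "Site 2"),
    ("666885555", 3, 40, "Narayan, Ramesh", "Project C", "Site 1"),
    ("453223344", 1, 20, "English, Joyce", "Project A", "Site 1"),
    ("453223344", 2, 20, "English, Joyce", "Project B", "Site 2"),
]
rows = [dict(zip(columns, record)) for record in data]

fd1 = holds(rows, ["SSN", "Pnumber"], ["Hours"])      # depends on the full key
fd2 = holds(rows, ["SSN"], ["Ename"])                 # depends on part of the key
fd3 = holds(rows, ["Pnumber"], ["Pname", "Plocation"])  # depends on the other part
hours_by_ssn = holds(rows, ["SSN"], ["Hours"])        # Hours needs both key parts

# Projecting onto (SSN, Ename) gives the EP2 table without repeated names.
ep2 = sorted({(r["SSN"], r["Ename"]) for r in rows})
```

Ename and the project attributes depend on only part of the (SSN, Pnumber) key, which is why they move to their own tables in 2NF.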
Third Normal Form
• Every non-key attribute (one that does not participate in the primary key) is mutually independent of the other non-key attributes
• Every non-key attribute is irreducibly dependent on the primary key
Example of Non-3NF w/ Conversion
Example
Lots (PropertyID#, CountyName, Lot#, Area, Price, TaxRate)

2NF
Lots1 (PropertyID#, CountyName, Lot#, Area, Price)
Lots2 (CountyName, TaxRate)

3NF
Lots1A (PropertyID#, CountyName, Lot#, Area)
Lots1B (Area, Price)
Maintaining History
• Maintaining history can serve one of two purposes:
– Tracking changes in the entity over time
– Tracking record history in order to maintain inactive records over time and maintain RI
• Tracking changes in an entity over time is very difficult and requires significant storage
• Tracking inactive records is our standard here and provides value to the end user
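One common way to keep inactive records while maintaining RI is a status flag instead of a physical delete. The sketch below assumes a hypothetical `is_active` column and `customer` / `orders` tables; the slides do not say how the standard is actually implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    is_active   INTEGER NOT NULL DEFAULT 1  -- hypothetical flag: 1 = active, 0 = inactive
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);
INSERT INTO customer (customer_id, name) VALUES (1, 'Ann'), (2, 'Raj');
INSERT INTO orders VALUES (100, 1);
""")

# Retire the customer instead of deleting the row, so order 100 still has a parent.
conn.execute("UPDATE customer SET is_active = 0 WHERE customer_id = 1")

active_names = [row[0] for row in conn.execute(
    "SELECT name FROM customer WHERE is_active = 1 ORDER BY name"
)]
order_owner = conn.execute(
    "SELECT c.name FROM orders o "
    "JOIN customer c ON c.customer_id = o.customer_id "
    "WHERE o.order_id = 100"
).fetchone()[0]
```

Day-to-day queries filter on the flag, while historical rows such as old orders can still be joined to the retired customer.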
Examples of History…