Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)
Business Systems Intelligence:
1. Data Warehousing 2
Entity Relationship Data Model (ERD)
ER Approach works by dividing the data into many
discrete entities
Each entity becomes a Table in the physical
schema
Why it has been so successful
– Coupled with the concept of Normalization it drives
all the redundancy out of the database
– Change (or add or delete) the data at just one point
– Can build very fast access methods (index)
– Results in efficient transactional processing
Where is the catch?
ERD: Where Is The Catch?
Let’s have a look at a typical ER data model first
Some Observations
– A Symmetric Model
• All the tables look the same
• Which table is more important? Which is the largest?
• Which tables contain numerical measurements of the
business?
• Which tables contain nearly static descriptive attributes?
– Very hard to visualize and keep in one’s head
– A large number of possible connections between any two
(or more) tables
A Typical OLTP Oriented ER Data Model
ERD: Catch Continues
ERD and Normalization result in a large number of
tables
– Hard for users (db programmers) to understand
– Hard for DBMS software to navigate in an optimum
way
Real value of ERD is in using tables individually or
in pairs
Too complex for queries that span multiple tables
with a large number of records
How to Simplify a Data Model
Two general methods
– De-Normalization
– Dimensional Modeling
De-Normalization
– Reverses the effect of Normalization
– Reintroduces redundancy while reducing the number
of tables
– Popular approaches: Pre-Join de-normalization,
Column Replication or Movement and Aggregation
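As a minimal sketch of pre-join de-normalization, the example below stores the result of a join as one wider table. It uses Python's built-in sqlite3 module; the table names (customer, sales_order, order_denorm), columns and data are hypothetical and are not taken from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized source tables (hypothetical names and columns).
    CREATE TABLE customer (customer_nr INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE sales_order (order_nr INTEGER PRIMARY KEY,
                              customer_nr INTEGER REFERENCES customer(customer_nr),
                              amount REAL);
    INSERT INTO customer VALUES (1, 'Ann', 'Dublin'), (2, 'Bob', 'Cork');
    INSERT INTO sales_order VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);

    -- Pre-join de-normalization: store the join result as one wider table,
    -- so customer name and city are repeated on every order row.
    CREATE TABLE order_denorm AS
        SELECT o.order_nr, o.amount, c.customer_nr, c.name, c.city
        FROM sales_order o JOIN customer c ON c.customer_nr = o.customer_nr;
""")

# Queries now need no join, at the price of redundant customer data.
rows = conn.execute(
    "SELECT city, SUM(amount) FROM order_denorm GROUP BY city ORDER BY city"
).fetchall()

# Ann's details are stored once per order instead of once overall.
ann_copies = conn.execute(
    "SELECT COUNT(*) FROM order_denorm WHERE name = 'Ann'"
).fetchone()[0]
```

The trade-off is visible in the last query: a change to a customer's city would now have to be applied to every matching order row.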
Data Warehouse Design
The database component of a data warehouse is
described using a technique called dimensionality
modelling
Logical design technique that aims to present the
data in a standard, intuitive form that allows for
high-performance access
Uses the concepts of ER modelling with some
important restrictions
Data Warehouse Design (cont…)
Every dimensional model (DM) is composed of
– One table with a composite primary key, called the
fact table
– A set of smaller tables called dimension tables
Each dimension table has a simple (non-composite) primary key that corresponds exactly to
one of the components of the composite key in the
fact table.
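The key structure described above can be sketched as follows, assuming SQLite via Python's sqlite3 module. The table and column names (time_dim, product_dim, sales_fact) are illustrative assumptions. The fact table's composite primary key is made up of the simple primary keys of the two dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    -- Dimension tables: each has a simple (single-column) primary key.
    CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, full_date TEXT);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT);

    -- Fact table: composite primary key, one component per dimension.
    CREATE TABLE sales_fact (
        time_key      INTEGER NOT NULL REFERENCES time_dim(time_key),
        product_key   INTEGER NOT NULL REFERENCES product_dim(product_key),
        quantity_sold INTEGER,
        sale_amount   REAL,
        PRIMARY KEY (time_key, product_key)
    );
    INSERT INTO time_dim VALUES (1, '2004-04-01');
    INSERT INTO product_dim VALUES (100, 'Widget');
    INSERT INTO sales_fact VALUES (1, 100, 3, 29.97);
""")

def try_insert(row):
    """Attempt to insert a fact row; report whether the constraints allowed it."""
    try:
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_ok = try_insert((1, 100, 1, 9.99))  # same composite key again
orphan_ok = try_insert((1, 999, 1, 9.99))     # product 999 is not in product_dim
```

Both inserts are rejected: the first repeats an existing composite key, the second refers to a dimension row that does not exist.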
Data Warehouse Design
Forms a star-like structure, which is called a star
schema or star join
Star schema is a logical structure that has a fact
table containing factual data in the centre,
surrounded by dimension tables containing
reference data, which can be denormalised.
Facts are generated by events that occurred in the
past, and are unlikely to change, regardless of how
they are analysed
Data Warehouse Design (cont…)
[Figure: Star Schema, showing a central Fact Table surrounded by four Dimension Tables]
Data Warehouse Design (cont…)
Bulk of data in data warehouse is in fact tables,
which can be extremely large.
Important to treat fact data as read-only reference
data that will not change over time.
Most useful fact tables contain one or more
numerical measures, or ‘facts’ that occur for each
record and are numeric and additive.
Data Warehouse Design (cont…)
Dimension tables usually contain descriptive
textual information.
Dimension attributes are used as the constraints in
data warehouse queries.
Star schemas can be used to speed up query
performance by denormalizing reference
information into a single dimension table.
Dimensional Modeling
Models the data around two basic concepts: Facts
& Dimensions.
Facts
– Facts are numeric measurements (values) that
represent a specific business aspect or activity.
– Facts can be computed or derived at run-time
(metrics).
– Examples: Unit Cost, Sale Amount, Quantity Sold
Dimensional Modeling (cont…)
Dimensions
– Dimensions are qualifying characteristics that provide
additional perspectives to a given fact.
– Examples: Date (Day, Month, Qtr, Year), Product
(Type, Category)
Dimensional Modeling (cont...)
Every dimensional model (DM) is composed of one
(or more) fact tables, and a set of smaller
dimension tables.
View the fact table through one (or more)
dimensions.
– What is the sale amount in the Consumer Product
category, for elderly customers in the second quarter
of 2004?
Forms ‘star-like’ structure, which is called a star
schema or star join.
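The example question above can be sketched as a star join: the fact table is joined to each dimension, and the dimension attributes supply the constraints. The schema below (date_dim, product_dim, customer_dim, sales_fact), the age_group attribute and the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim     (date_key INTEGER PRIMARY KEY, qtr INTEGER, year INTEGER);
    CREATE TABLE product_dim  (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, age_group TEXT);
    CREATE TABLE sales_fact   (date_key INTEGER, product_key INTEGER,
                               customer_key INTEGER, sale_amount REAL);
    INSERT INTO date_dim VALUES (1, 2, 2004), (2, 3, 2004);
    INSERT INTO product_dim VALUES (1, 'Consumer Product'), (2, 'Industrial');
    INSERT INTO customer_dim VALUES (1, 'Elderly'), (2, 'Adult');
    INSERT INTO sales_fact VALUES
        (1, 1, 1, 100.0),  -- matches all three constraints
        (1, 1, 1, 50.0),   -- matches all three constraints
        (1, 1, 2, 70.0),   -- wrong age group
        (2, 1, 1, 30.0),   -- wrong quarter
        (1, 2, 1, 20.0);   -- wrong category
""")

# Constrain on dimension attributes, aggregate the additive fact.
total = conn.execute("""
    SELECT SUM(f.sale_amount)
    FROM sales_fact f
    JOIN date_dim d     ON d.date_key = f.date_key
    JOIN product_dim p  ON p.product_key = f.product_key
    JOIN customer_dim c ON c.customer_key = f.customer_key
    WHERE p.category = 'Consumer Product'
      AND c.age_group = 'Elderly'
      AND d.qtr = 2 AND d.year = 2004
""").fetchone()[0]
```

Only the two rows that satisfy all three dimension constraints contribute to the total.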
Example Dimensional Model
Time Dimension Exercise
Every Data Warehouse will need Time information
– I.e. a Time Dimension
Compose a generic Time Dimension Table
– E.g. what are the different attributes you can use to
describe 11th February, 2008.
Time Dimension Exercise
create table time_dimension (
    date_key                    Number not null,
    full_date                   Date,
    day_of_week                 Number,
    day_num_in_month            Number,
    day_num_overall             Number,
    day_name                    Varchar2(9),
    day_abbrev                  Varchar2(3),
    weekday_flag                Varchar2(1),
    week_num_in_year            Number,
    week_num_overall            Number,
    week_begin_date             Date,
    week_begin_date_key         Number,
    month                       Number,
    month_num_overall           Number,
    month_name                  Varchar2(9),
    month_abbrev                Varchar2(3),
    quarter                     Number,
    year                        Number,
    yearmo                      Number,
    fiscal_month                Number,
    fiscal_quarter              Number,
    fiscal_year                 Number,
    last_day_in_month_flag      Varchar2(1),
    same_weekday_year_ago_date  Date,
    primary key (date_key));
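As a rough illustration of the exercise, the sketch below derives some of these attributes for 11th February, 2008 using only the Python standard library. The YYYYMMDD key format, the Monday-first day numbering and the ISO 8601 week numbering are assumptions; the slides do not say which conventions were intended, and the fiscal and "overall" columns are left out because they depend on choices the slides do not give.

```python
import calendar
from datetime import date, timedelta

def time_dimension_row(d: date) -> dict:
    """Derive a subset of the time_dimension attributes for one calendar date."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return {
        "date_key": int(d.strftime("%Y%m%d")),   # assumed YYYYMMDD key format
        "full_date": d.isoformat(),
        "day_of_week": d.isoweekday(),           # assumed 1 = Monday ... 7 = Sunday
        "day_num_in_month": d.day,
        "day_name": calendar.day_name[d.weekday()],
        "day_abbrev": calendar.day_abbr[d.weekday()],
        "weekday_flag": "Y" if d.weekday() < 5 else "N",
        "week_num_in_year": d.isocalendar()[1],  # assumed ISO 8601 week numbering
        "week_begin_date": (d - timedelta(days=d.weekday())).isoformat(),
        "month": d.month,
        "month_name": calendar.month_name[d.month],
        "month_abbrev": calendar.month_abbr[d.month],
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
        "yearmo": d.year * 100 + d.month,
        "last_day_in_month_flag": "Y" if d.day == last_day else "N",
    }

row = time_dimension_row(date(2008, 2, 11))
for name, value in row.items():
    print(f"{name:24} {value}")
```

In a warehouse load, a function like this would typically be run once per day over the whole date range to pre-populate the dimension.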
Data Model Design for Data Warehouses
Nine-Step Methodology includes the following steps:
1. Choosing the subject
2. Choosing the grain
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and the query mode
Step 1: Choosing The Subject
The subject (or function) refers to the subject
matter of a particular data mart
A business process is a major operational process
in an organization
– Typically supported by a legacy system (database) or
an OLTP
– Examples: Orders, Invoices, Inventory etc.
Step 1: Choosing The Subject (cont…)
The first data mart built should be the one that is most
likely to be
– Delivered on time
– Delivered within budget
– Able to answer the most commercially important business
questions
Step 2: Choosing The Grain
Grain is the fundamental, atomic level of data to be
represented.
Decide what a record of the fact table is to
represent.
– Grain is also termed the unit of analysis.
– Typical grains
• Individual Transactions
• Daily aggregates (snapshots)
• Monthly aggregates
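To make the choice of grain concrete, the sketch below holds the same sales at two grains: one row per individual transaction, and one row per product per day. The table names (sales_txn_fact, sales_daily_fact) and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Finest grain: one fact row per individual transaction (hypothetical data).
    CREATE TABLE sales_txn_fact (txn_id INTEGER PRIMARY KEY, sale_date TEXT,
                                 product_key INTEGER, amount REAL);
    INSERT INTO sales_txn_fact VALUES
        (1, '2008-02-11', 100, 5.0),
        (2, '2008-02-11', 100, 7.5),
        (3, '2008-02-11', 200, 2.0),
        (4, '2008-02-12', 100, 4.0);

    -- Coarser grain: one fact row per product per day (a daily aggregate).
    CREATE TABLE sales_daily_fact AS
        SELECT sale_date, product_key,
               COUNT(*) AS txn_count, SUM(amount) AS total_amount
        FROM sales_txn_fact
        GROUP BY sale_date, product_key;
""")

daily = conn.execute(
    "SELECT sale_date, product_key, txn_count, total_amount "
    "FROM sales_daily_fact ORDER BY sale_date, product_key"
).fetchall()
```

The daily table is smaller, but it can no longer answer questions about individual transactions; detail that is aggregated away at load time cannot be recovered later.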
Step 2: Choosing The Grain (cont…)
Identify dimensions of the fact table. The grain
decision for the fact table also determines the grain
of each dimension table.
Also include time as a core dimension, which is
always present in star schemas.
Sometimes the grain varies for different facts within
the same business process. How?
Step 3: Identifying & Conforming The Dimensions
Dimensions set the context for asking questions
about the facts in the fact table.
If any dimension occurs in two data marts, they
must be exactly the same dimension, or one must
be a mathematical subset of the other.
A dimension used in more than one data mart is
referred to as being conformed.
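The conformance rule above can be sketched as a simple check. This is one possible reading, in which "subset" means a subset of rows; each dimension is modelled as a dictionary from key to attribute tuple, and the function name and sample data are hypothetical.

```python
def is_conformed(dim_a: dict, dim_b: dict) -> bool:
    """Return True if the two dimensions are identical or one is a subset of the other.

    Each dimension is modelled as {key: attribute_tuple}. Rows that share a key
    must carry exactly the same attribute values in both data marts.
    """
    small, large = sorted((dim_a, dim_b), key=len)
    return all(key in large and large[key] == row for key, row in small.items())

# Product dimension as used by a sales mart and by an inventory mart.
sales_products = {
    1: ("Widget", "Hardware"),
    2: ("Gadget", "Hardware"),
    3: ("Manual", "Books"),
}
inventory_products = {1: ("Widget", "Hardware"), 3: ("Manual", "Books")}
mismatched_products = {1: ("Widget", "Tools")}  # same key, different description

subset_conforms = is_conformed(sales_products, inventory_products)
mismatch_conforms = is_conformed(sales_products, mismatched_products)
```

The second check fails because product 1 is described differently in the two marts, so results from those marts could not be combined reliably.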
Step 3: Identifying & Conforming The Dimensions
Choose the dimensions that apply to each fact in
the fact table.
– Typical dimensions: time, product, customer etc.
– Need to identify the descriptive attributes that explain
each dimension
– Need to determine hierarchies within each
dimension
Steps 4 & 5: Choosing The Facts
The grain of the fact table determines which facts can
be used in the data mart.
Facts should be numeric and additive.
– Example: Quantity Sold, Amount etc.
Unusable facts include:
– Non-numeric facts
– Non-additive facts
– Fact at different granularity from other facts in table
Storing Pre-Calculations in the Fact Table
– Once the facts have been selected, each should be re-examined to determine whether there are opportunities to
use pre-calculations.
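A small sketch of both points, using a hypothetical sales_fact table: line_total is a pre-calculation stored in the fact table, quantity and line_total are additive, and unit_price is a numeric fact that is not additive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (
        store_key  INTEGER,
        quantity   INTEGER,
        unit_price REAL,
        line_total REAL   -- pre-calculated as quantity * unit_price at load time
    );
    INSERT INTO sales_fact (store_key, quantity, unit_price) VALUES
        (1, 2, 10.0), (1, 1, 40.0), (2, 5, 4.0);
    UPDATE sales_fact SET line_total = quantity * unit_price;
""")

# Additive facts can be summed across any dimension.
total_qty, total_sales = conn.execute(
    "SELECT SUM(quantity), SUM(line_total) FROM sales_fact"
).fetchone()

# unit_price is not additive: its sum across rows has no business meaning.
summed_prices = conn.execute("SELECT SUM(unit_price) FROM sales_fact").fetchone()[0]
```

Storing line_total avoids recomputing the product in every query, and it gives users an additive figure to sum in place of the non-additive unit_price.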
Step 6: Rounding Out The Dimension Tables
Text descriptions are added to the dimension
tables.
Text descriptions should be as intuitive and
understandable to the users as possible.
Usefulness of a data mart is determined by the
scope and nature of the attributes of the dimension
tables.
(See exercise on Time Dimension)
Step 7: Choosing The Duration Of The Database
Duration measures how far back in time the fact
table goes.
Very large fact tables raise at least two very
significant data warehouse design issues.
– Often difficult to source increasingly old data.
– It is mandatory that the old versions of the important
dimensions be used, not the most current versions.
Known as the ‘Slowly Changing Dimension’ problem.
Step 8: Tracking Slowly Changing Dimensions
The slowly changing dimension problem means
that the proper description of the old dimension
data must be used with old fact data.
Often, a generalized key must be assigned to
important dimensions in order to distinguish
multiple snapshots of dimensions over a period of
time.
Step 8: Tracking Slowly Changing Dimensions (cont…)
Three basic types of slowly changing dimensions:
– Where a changed dimension attribute is overwritten.
– Where a changed dimension attribute causes a new
dimension record to be created.
– Where a changed dimension attribute causes an
alternate attribute to be created so that both the old
and new values of the attribute are simultaneously
accessible in the same dimension record.
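The three approaches can be sketched on a tiny in-memory customer dimension. The data, the attribute names and the previous_city column are hypothetical; the dictionary key plays the role of the generalized (surrogate) key.

```python
from copy import deepcopy

# A tiny customer dimension: surrogate key -> attributes (hypothetical data).
base = {1: {"customer_id": "C42", "city": "Cork"}}

def scd_type1(dim, key, new_city):
    """Type 1: overwrite the attribute; the old value is lost."""
    dim = deepcopy(dim)
    dim[key]["city"] = new_city
    return dim

def scd_type2(dim, key, new_city):
    """Type 2: add a new record under a new surrogate key; the old record is kept."""
    dim = deepcopy(dim)
    new_key = max(dim) + 1
    dim[new_key] = {**dim[key], "city": new_city}
    return dim

def scd_type3(dim, key, new_city):
    """Type 3: keep the old value in an alternate attribute of the same record."""
    dim = deepcopy(dim)
    dim[key]["previous_city"] = dim[key]["city"]
    dim[key]["city"] = new_city
    return dim

t1 = scd_type1(base, 1, "Dublin")
t2 = scd_type2(base, 1, "Dublin")
t3 = scd_type3(base, 1, "Dublin")
```

With the second approach, old fact rows keep pointing at surrogate key 1 (Cork) and new fact rows use key 2 (Dublin), so old facts stay paired with the old description.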
Step 9: Deciding The Query Priorities And The Query
Modes
The most critical physical design issues affecting the
end-user’s perception include:
– Physical sort order of the fact table on disk
– Presence of pre-stored summaries or aggregations.
Additional physical design issues include
administration, backup, indexing performance, and
security.
Dimensional Modelling
Dimensional Modelling is a logical design
technique that seeks to present the data in a
standard framework that is intuitive and allows for
high-performance access
Dimensional Modelling (cont…)
Fact table
– Consists of a multi-part primary key and usually
contains numeric data
– Numeric data is aggregated based on the multi-part
primary key
– Additivity is crucial because DW applications
typically retrieve data based on more than one set of
facts
Dimension table
– Contains descriptive information
– Is the entry point for queries
Dimensional Modelling
Dimensional Model:
-- The fact table alias is f rather than "of", because OF is a reserved word in Oracle.
SELECT description, SUM(quoted_price), SUM(quantity),
       SUM(unit_price), SUM(total_comm)
FROM order_fact f,
     part_dimension pd
WHERE f.part_nr = pd.part_nr
GROUP BY description;
ER-Model:
-- Note: ORDER is a reserved word in SQL, so a table with this name has to be
-- quoted or renamed before the query will run.
SELECT description, SUM(quoted_price), SUM(quantity),
       SUM(unit_price), SUM(total_comm)
FROM order o,
     order_detail od,
     part p,
     customer c,
     slsrep s
WHERE o.order_nr = od.order_nr
  AND p.part_nr = od.part_nr
  AND o.customer_nr = c.customer_nr
  AND s.slsrep_nr = c.slsrep_nr
GROUP BY description;
Notice that the dimensional model only joins two tables, while the ER model joins
all five in the ER Diagram. This is very typical of highly normalized ER models.
Imagine a typical normalized database with hundreds of tables.
Dimensional Modelling
Simple DW Example
Supermarket (Chain Store)
– Business Area: Sales
– Grain: Individual Purchases
– Dimensions:
• Time
• Product
• Store
• Customer
• Employee
– Facts
• Total Sales
• Number of items
• Total Cost Value
Data Warehouse Design (cont…)
[Figure: star schema with the Individual Purchases fact table at the centre, surrounded by the Customer, Time, Employee, Products and Store dimensions]
More Detailed Exercise
A more detailed exercise – See handout
Exam Question Example
You are working on a data warehousing project for the examinations department at the
Dublin Institute of Technology (DIT). The examinations department looks after all exam and
continuous assessment results for all of the students within the Institute. The purpose of the
data warehousing project is to allow new reporting capabilities so that examinations
department staff can examine grade patterns for particular courses; monitor average
grades for modules and patterns in grades for courses given by particular staff members;
and help with student retention efforts.
Currently, in the examinations department’s transactional systems, information about each
student is indexed by a student number, and includes name, address, date of birth, etc.
Similarly, information stored about the instructors working within the Institute is indexed by
staff ID and includes name, address, department, etc. The information that needs to be
stored about each course includes the course title, course code, and the weighting given to
the continuous assessment element and exam element of a particular course. At the
Institute’s progression boards each year the continuous assessment and exam results for
each student, for each course, are entered into the Institute’s transactional grade storage
system and this information should be transferred across to the data warehouse.
Design a star schema for the above scenario. (20 marks)
Discuss how the star schema supports the reporting requirements outlined in the above
scenario. (15 marks)
Different Types Of Dimensional Model
The star schema can be extended in two ways:
– Snowflake model
– Multi-star model (also known as fact constellation)
Example Star Schema
Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city, state_or_province, country
Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
Example Snowflake Schema
Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_key
– supplier: supplier_key, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city_key
– city: city_key, city, state_or_province, country
Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
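The difference between the star and snowflake forms of the location dimension can be sketched as below. The location, city and sales_fact tables follow the column names in the example schemas; location_star is a hypothetical name for the star-form table, introduced so that both forms can sit in one database. All sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star form: city attributes are repeated on every location row.
    CREATE TABLE location_star (location_key INTEGER PRIMARY KEY, street TEXT,
                                city TEXT, state_or_province TEXT, country TEXT);
    -- Snowflake form: city attributes are normalized into their own table.
    CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT,
                       state_or_province TEXT, country TEXT);
    CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                           city_key INTEGER REFERENCES city(city_key));
    CREATE TABLE sales_fact (location_key INTEGER, dollars_sold REAL);

    INSERT INTO location_star VALUES
        (1, '1 Main St', 'Dublin', 'Leinster', 'Ireland'),
        (2, '9 High St', 'Dublin', 'Leinster', 'Ireland'),
        (3, '5 Quay Rd', 'Cork', 'Munster', 'Ireland');
    INSERT INTO city VALUES (1, 'Dublin', 'Leinster', 'Ireland'),
                            (2, 'Cork', 'Munster', 'Ireland');
    INSERT INTO location VALUES (1, '1 Main St', 1), (2, '9 High St', 1),
                                (3, '5 Quay Rd', 2);
    INSERT INTO sales_fact VALUES (1, 10.0), (2, 20.0), (3, 5.0);
""")

# Star: one join from the fact table to the dimension.
star = conn.execute("""
    SELECT l.city, SUM(f.dollars_sold) FROM sales_fact f
    JOIN location_star l ON l.location_key = f.location_key
    GROUP BY l.city ORDER BY l.city""").fetchall()

# Snowflake: the same question needs an extra join through the city table.
snowflake = conn.execute("""
    SELECT c.city, SUM(f.dollars_sold) FROM sales_fact f
    JOIN location l ON l.location_key = f.location_key
    JOIN city c     ON c.city_key = l.city_key
    GROUP BY c.city ORDER BY c.city""").fetchall()
```

Both queries return the same totals; the snowflake form stores each city once but pays for it with an additional join.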
Example Fact Constellation
Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city, province_or_state, country
– shipper: shipper_key, shipper_name, location_key, shipper_type
Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
– Keys: time_key, item_key, shipper_key, from_location, to_location
– Measures: dollars_cost, units_shipped
Summary
Over the last two lectures we have introduced the
idea of data warehouses
Data warehouses evolved to address the issues of
using transactional databases to answer new kinds
of questions
IBM have a very detailed RedBook on
dimensional modelling that is well
worth looking at:
http://www.redbooks.ibm.com/redbooks/pdfs/sg247138.pdf