Download Data Warehouse

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Data Warehouse and
Business Intelligence
Dr. Minder Chen
[email protected]
Fall 2008
Online Resources
• Additional resources:
– Teradata Student Network.
» The Premier Learning Resource for Data Warehousing,
DSS/BI, and Database. The URL is
http://www.teradatastudentnetwork.com
» PSW: smartdecisions
© Minder Chen, 2004-2008
Data Warehouse - 2
BI
Business Intelligence (BI) is the process of gathering meaningful
information to answer questions and identify significant trends or
patterns, giving key stakeholders the ability to make better
business decisions.
“The key in business is to
know something that
nobody else knows.”
-- Aristotle Onassis
PHOTO: HULTON-DEUTSCH COLL
“To understand is to perceive patterns.”
— Sir Isaiah Berlin
"The manager asks how and when,
the leader asks what and why."
— “On Becoming a Leader” by Warren Bennis
© Minder Chen, 2004-2008
Data Warehouse - 3
BI Questions
• What happened?
–
What were our total sales this month?
• What’s happening?
–
Are our sales going up or down, trend analysis
• Why?
–
Why have sales gone down?
• What will happen?
–
Forecasting & “What If” Analysis
• What do I want to happen?
–
Planning & Targets
Source: Bill Baker, Microsoft
© Minder Chen, 2004-2008
Data Warehouse - 4
Business Intelligence
Increasing potential
to support
business decisions (MIS)
Making
Decisions
Data Presentation
Visualization Techniques
End User
Business
Analyst
Data
Analyst
Data Mining
Information Discovery
Data Exploration
OLAP, MDA,
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
Data Sources
(Paper, Files, Information Providers, Database Systems, OLTP)
© Minder Chen, 2004-2008
DBA
Data Warehouse - 5
Where is Business Intelligence applied?
Operational Efficiency
•
•
•
•
•
•
•
•
ERP Reporting
KPI Tracking
Product Profitability
Risk Management
Balanced Scorecard
Activity Based Costing
Global Sourcing
Logistics
© Minder Chen, 2004-2008
Customer Interaction
•
•
•
•
•
•
•
Sales Analysis
Sales Forecasting
Segmentation
Cross-selling
CRM Analytics
Campaign Planning
Customer Profitability
Data Warehouse - 6
© Minder Chen, 2004-2008
Data Warehouse - 7
Inmon's Definition of Data Warehouse – Data View
• A warehouse is a
– subject-oriented,
– integrated,
– time-variant and
– non-volatile
collection of data in support of
management's decision making process.
Source: http://www.intranetjournal.com/features/datawarehousing.html
– Bill Inmon in 1990
© Minder Chen, 2004-2008
Data Warehouse - 8
Inmon's Definition Explain
• Subject-oriented: They are organized around major
subjects such as customer, supplier, product, and
sales. Data warehouses focus on modeling and
analysis to support planning and management
decisions v.s. operations and transaction processing.
• Integrated: Data warehouses involve an integration of
sources such as relational databases, flat files, and online transaction records. Processes such as data
cleansing and data scrubbing achieve data
consistency in naming conventions, encoding
structures, and attribute measures.
• Time-variant: Data contained in the warehouse provide
information from an historical perspective.
• Nonvolatile: Data contained in the warehouse are
physically separate from data present in the
operational environment.
© Minder Chen, 2004-2008
Data Warehouse - 9
Kimball's Definition – Process View
• A data warehouse is a system that
extracts, cleans, conforms, and delivers
source data into a dimensional data store
and then supports and implements
querying and analysis for the purpose of
decision making.
» Ralph Kimball
© Minder Chen, 2004-2008
Data Warehouse - 10
© Minder Chen, 2004-2008
Data Warehouse - 11
The Data Warehouse Process
Data Marts
and cubes
Source
Systems
Clients
Data
Warehouse
1
Design the
Data Warehouse
© Minder Chen, 2004-2008
2
Populate
Data Warehouse
Query Tools
Reporting
Analysis
Data Mining
3
Create
OLAP Cubes
4
Query
Data
Data Warehouse - 12
Key Concepts in BI Development Lifecycle
© Minder Chen, 2004-2008
Data Warehouse - 13
Business Valuation Models for BI
© Minder Chen, 2004-2008
Data Warehouse - 14
Performance Dashboards for Information Delivery
© Minder Chen, 2004-2008
Data Warehouse - 15
Scorecards for Information Delivery
© Minder Chen, 2004-2008
Data Warehouse - 16
OLTP Normalized Design
Warehouse
Ordering
Process
Chain
Retailer
Store
Retailer
Payments
Retailer
Returns
Product
POS
Process
Retail
Promo
Brand
GL
Account
Retail
Cust
Cash
Register
© Minder Chen, 2004-2008
Clerk
Data Warehouse - 17
OLTP Versus Business Intelligence: Who asks what?
OLTP Questions
• When did that order ship?
• How many units are in
inventory?
• Does this customer have
unpaid bills?
• Are any of customer X’s line
items on backorder?
© Minder Chen, 2004-2008
Analysis Questions
• What factors affect order
processing time?
• How did each product line (or
product) contribute to profit last
quarter?
• Which products have the lowest
Gross Margin?
• What is the value of items on
backorder, and is it trending up
or down
over time?
Data Warehouse - 18
OLTP vs. OLAP
Source: http://www.rainmakerworks.com/pdfdocs/OLTP_vs_OLAP.pdf#search=%22OLTP%20vs.%20OLAP%22
© Minder Chen, 2004-2008
Data Warehouse - 19
Dimensional Design Process
Business
Requirements
• Select the business process to model
• Declare the grain of the business process/data
in the fact table
• Choose the dimensions that apply to each fact
table row
• Identify the numeric facts that will populate
each fact table row
Data
Realities
© Minder Chen, 2004-2008
Data Warehouse - 20
Select a business process to model
• Not business departments or business
functions
• Cross-functional business processes
• Business events
• Examples:
–
–
–
–
–
–
Raw materials purchasing
Order fulfillment process
Shipments
Invoicing
Inventory
General ledger
© Minder Chen, 2004-2008
Data Warehouse - 21
Requirements
© Minder Chen, 2004-2008
Data Warehouse - 22
Identifying Measures and Dimensions
Performance Measures
for KPI
Measures
Performance Drivers
Dimensions
The attribute varies
continuously:
The attribute is perceived as
a constant or discrete value:
•Balance
•Unit Sold
•Cost
•Sales
•Description
•Location
•Color
•Size
© Minder Chen, 2004-2008
Data Warehouse - 23
A Dimensional Model for a Grocery Store Sales
© Minder Chen, 2004-2008
Data Warehouse - 24
Product Dimension
• SKU: Stock Keeping Unit
• Hierarchy:
–
Department  Category  Subcategory  Brand  Product
© Minder Chen, 2004-2008
Data Warehouse - 25
Creating Dimensional Model
• Identify fact tables
• Translate business measures into fact tables
• Analyze source system information for additional
measures
• Identify base and derived measures
• Document additivity of measures
• Identify dimension tables
• Link fact tables to the dimension tables
• Create views for users
© Minder Chen, 2004-2008
Data Warehouse - 26
Transaction Level Order Item Fact Table
© Minder Chen, 2004-2008
Data Warehouse - 27
Inside a Dimension Table
• Dimension table key: Uniquely identify each row. Use
surrogate key (integer).
• Table is wide: A table may have many attributes
(columns).
• Textual attributes. Descriptive attributes in string
format. No numerical values for calculation.
• Attributes not directly related: E.g., product color and
product package size. No transitive dependency.
• Not normalized (star schemar).
• Drilling down and rolling up along a dimension.
• One or more hierarchy within a dimension.
• Fewer number of records.
© Minder Chen, 2004-2008
Data Warehouse - 28
Fact Tables
Fact tables have the following characteristics:
• Contain numeric measures (metric) of the
business
• May contain summarized (aggregated) data
• May contain date-stamped data
• Are typically additive
• Have key value that is typically a concatenated
key composed of the primary keys of the
dimensions
• Joined to dimension tables through foreign
keys that reference primary keys in the
dimension tables
© Minder Chen, 2004-2008
Data Warehouse - 29
Facts Table
Measurements of business events.
DateID
ProductID
Dimensions
CustomerID
Units
Dollars
Measures
The Fact Table contains keys and units of
measure
© Minder Chen, 2004-2008
Data Warehouse - 30
Snowflake Schema
Brands
Products
Channels
Dates
Sales
Promotions
© Minder Chen, 2004-2008
Customers
Data Warehouse - 31
Hierarchy
© Minder Chen, 2004-2008
Data Warehouse - 32
OLAP Solutions
•
•
•
•
•
Data Warehouse/Data Mart
Dimensions
Measures
Cubes
Europe
Asia
Cells
© Minder Chen, 2004-2008
US
Gadgets
130
135
140
142
Gizmos
205
390
350
475
Thingies
175
230
190
250
Widgets
310
340
410
450
Q1
Q2
Q3
Q4
Data Warehouse - 33
Operations in Multidimensional Data Model
• Aggregation (roll-up)
– dimension reduction: e.g., total sales by city
– summarization over aggregate hierarchy: e.g., total
sales by city and year  total sales by region and
by year
• Selection (slice) defines a subcube
– e.g., sales where city = Palo Alto and date = 1/15/96
• Navigation to detailed data (drill-down)
– e.g., (sales - expense) by city, top 3% of cities by
average income
• Visualization Operations (e.g., Pivot)
© Minder Chen, 2004-2008
Data Warehouse - 34
A Visual Operation: Pivot (Rotate)
Juice
10
Cola
47
Milk
30
Cream 12
Product
3/1 3/2 3/3 3/4
Date
© Minder Chen, 2004-2008
Data Warehouse - 35
Date Dimension of the Retail Sales Model
© Minder Chen, 2004-2008
Data Warehouse - 36
Store Dimension
• It is not uncommon to represent multiple
hierarchies in a dimension table. Ideally, the
attribute names and values should be unique
across the multiple hierarchies.
© Minder Chen, 2004-2008
Data Warehouse - 37
Multidimensional Query Techniques
Why?
What?
Slicing
Product
Time
Geography
Why?
Dicing
Why?
Drilling
down
© Minder Chen, 2004-2008
Data Warehouse - 38
ETL
ETL = Extract, Transform, Load
• Moving data from production systems to DW
• Checking data integrity
• Assigning surrogate key values
• Collecting data from disparate systems
• Reorganizing data
© Minder Chen, 2004-2008
Data Warehouse - 39
Pivot Table in Excel
© Minder Chen, 2004-2008
Data Warehouse - 40
Data Quality Issues
•
•
•
•
•
•
•
•
•
No common time basis
Different calculation algorithms
Different levels of extraction
Different levels of granularity
Different data field names
Different data field meanings
Missing information
No data correction rules
No drill-down capability
© Minder Chen, 2004-2008
Data Warehouse - 41
Building The Warehouse
Transforming Data
© Minder Chen, 2004-2008
Data Warehouse - 42
The Anomalies Nightmare
CUST #
NAME
ADDRESS
90328574
Digital Equipment
187 N. PARK St. Salem NH 01458
OEM
90328575
DEC
187 N. Pk. St. Salem NH 01458
OEM
90238475
Digital
187 N. Park St Salem NH 01458
$#%
90233479
Digital Corp
187 N. Park Ave. Salem NH 01458
Comp
90233489
Digital Consulting
15 Main Street Andover MA 02341
Consult
90234889
Digital Info Service
PO Box 9 Boston MA 02210
Mail List
90345672
Digital Integration
Park Blvd. Boston MA 04106
SYS INT
No Unique Key
Anomalies
No Standardization
TYPE
Spelling
Noise in
Blank Fields
How does one correctly identify and consolidate
anomalies from millions of records?
© Minder Chen, 2004-2008
Data Warehouse - 43
OLAP and Data Mining Address Different Types of Questions
While reporting and OLAP are informative about past facts, only data mining can help
you predict the future of your business.
OLAP
Data Mining
What was the response rate to our mailing?
What is the profile of people who are likely
to respond to future mailings?
How many units of our new product did we Which existing customers are likely to buy
sell to our existing customers?
our next new product?
Who were my 10 best customers last year?
Which 10 customers offer me the greatest
profit potential?
Which customers didn't renew their policies Which customers are likely to switch to the
last month?
competition in the next six months?
Which customers defaulted on their loans?
Is this customer likely to be a good credit
risk?
What were sales by region last quarter?
What are expected sales by region next
year?
What percentage of the parts we produced
yesterday are defective?
What can I do to improve throughput and
reduce scrap?
Source: http://www.dmreview.com/editorial/dmreview/print_action.cfm?articleId=2367
© Minder Chen, 2004-2008
Data Warehouse - 44
Use of Data Mining
•
•
•
•
•
Customer profiling
Market segmentation
Buying pattern affinities
Database marketing
Credit scoring and risk analysis
© Minder Chen, 2004-2008
Data Warehouse - 45
Associates
Which items are purchased in a retail store at the
same time?
© Minder Chen, 2004-2008
Data Warehouse - 46
Sequential Patterns
What is the likelihood that a customer will
buy a product next month, if he buys a related
item today?
© Minder Chen, 2004-2008
Data Warehouse - 47
Classifications
Determine customers’ buying patterns
and then find other customers with
similar attributes that may be targeted for
a marketing campaign.
© Minder Chen, 2004-2008
Data Warehouse - 48
Modeling
Use factors, such as location, number of
bedrooms, and square footage, to
Determine the market value of a property
© Minder Chen, 2004-2008
Data Warehouse - 49