OLAP
Products, Challenges, and Related Technologies
Agenda

Basic Introduction and Overview

Thoughts on Models and Schemas

Product Challenges

RDBMS Focus on Warehousing

Parting Comments
Some Terms

Data Warehouse
  The primary repository for reporting
  Typically a secondary data source created with extracts from primary data sources
Datamarts
  Smaller repositories created from the warehouse
  Also small warehouses or summary tables
DSS
  Reporting -- accomplished by any type of reporting tool or application
Business Intelligence
  Same as above
Some More Terms

OLAP
  Any type of “live” reporting, as opposed to batch
Query Tools
  Requires user to write SQL, may be on-line or batch
Datamining Tools
  Stand alone applications focused on high end analytics
OLAP Tools
  Generates SQL for user based on semantic model, metadata
Difference between ROLAP and MOLAP
  Amount of work done in the relational database. And that’s it...
The DSS Value Proposition

Banks and Financial Services
  4-7% churn rates
  Fraud costs billions worldwide
  1% default rate = $500 million for $50 billion assets
  Cross-selling products is worth tens of millions
Telecommunications
  25% churn rates; turn over entire customer base in 4 years
  Typical Telco loses $100+ million/yr to churn
  $300,000+/day in losses
Database Marketing
  More targeted mail campaigns can save hundreds of millions
Retail
  Inventory Management
The Basic Idea

Warehousing
  Collect the data
  Who purchased mutual funds in the last 3 years?
Basic Reporting
  Analyze data
  What is the income distribution of mutual fund buyers?
  Who are my most profitable customers?
Advanced Reporting
  Predict
  What do customers buy in combination?
  Who will buy a mutual fund in the next 6 months and why?
The DSS System Life Cycle


Customers “grow” into high end DSS
Most customers struggle to build the warehouse. Once the warehouse is in place they progress fairly rapidly up the “DSS chain”
Customers opt to build Datamarts because of the “difficulty” of building a proper warehouse
[Diagram: the “DSS chain”. Labels: Raw Data (OLTP, external data); Extracts, Load, Transformation; Warehouse; Simple Reporting; OLAP; Data Mining; Closed Loop DSS; “Actionable DSS”]
So How Are Customers Doing?

Customers succeed when they know what they want to know.
  Victoria’s Secret Inventory Management
  Wal*Mart Pharmacy
Many customers “fail / struggle” at first.
  Poor source information
  Distracted by technology
    Database benchmarks?
  Internal politics
  Unrealistic scope/timeframes
  Attempt to implement inappropriate technology
    Database gateways?
Types of Reporting / OLAP Strategies

MD OLAP
  Extract data into a cube or MD database and perform analysis on extracted data
Datamarts
  Extract data into a smaller relational repository and perform analysis on the datamart, using some SQL based tool
Structured ROLAP
  Build a schema tailored to a ROLAP tool and perform analysis on the structured schema using a SQL based tool
Flexible ROLAP
  Perform analysis directly on the warehouse
  Requires an intelligent SQL engine which is used for more than simple extractions
Different Approaches to DSS
[Diagram. Labels: MOLAP/HOLAP; ROLAP; MD API; SQL; MDDB; Structured Schema; Datamart; Warehouse; Raw Data (OLTP, external data)]
Models & Schemas

Dimensional Modeling?
  Process of putting a semantic object layer over the physical schema
  Semantic model typically includes dimensions; attributes or levels; facts, metrics or measures; and possibly other objects
  Wide degree of variance in products on how closely the physical structure must resemble the logical presentation layer.
Star -- Structured
  An MDDB stored relationally
Snowflake -- Normalized
  Terrible “new” name for an old concept
The Real World?
  TPC-D as a good example
The Original Star Schema
Lookup_Product: product_key, item_name, class_name, department_name, division_name, level
Lookup_Geography: geo_key, store_name, market_name, region_name, level
Lookup_Time: time_key, day, month_name, year, level
Fact_Sales: product_key, geo_key, time_key, reg_sls_unit, reg_sls_dollar, cle_sls_unit, cle_sls_dollar, pml_sls_unit, pml_sls_dollar, pln_sls_unit, pln_sls_dollar
“A relational cube”
The Snowflake Schema
Lookup_year: year
Lookup_month: month_id, month_name, year
Lookup_day: day, month_id
Lookup_division: division_id, division_name
Lookup_dept: department_id, department_name, division_id
Lookup_class: class_id, class_name, department_id
Lookup_item: item_id, item_name, class_id
Lookup_region: region_id, region_name
Lookup_market: market_id, market_name, region_id
Lookup_store: store_id, store_name, market_id
Fact_Sales: item_id, store_id, day, reg_sls_unit, reg_sls_dollar, cle_sls_unit, cle_sls_dollar, pml_sls_unit, pml_sls_dollar, pln_sls_unit, pln_sls_dollar
The Star / Snowflake Schema?
Lookup_year: year
Lookup_month: month_id, month_name, year
Lookup_day: day, month_id, month_name, year
Lookup_division: division_id, division_name
Lookup_dept: department_id, department_name, division_id
Lookup_class: class_id, class_name, department_id
Lookup_item: item_id, item_name, class_id, class_name, department_id, department_name, division_id, division_name
Lookup_region: region_id, region_name
Lookup_market: market_id, market_name, region_id
Lookup_store: store_id, store_name, market_id, market_name, region_id, region_name
Fact_Sales: item_id, store_id, day, reg_sls_unit, reg_sls_dollar, cle_sls_unit, cle_sls_dollar, pml_sls_unit, pml_sls_dollar, pln_sls_unit, pln_sls_dollar
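One way to see what the hybrid layout buys: because Lookup_store repeats region_id and region_name next to its own key, a region-level rollup needs only one join from the fact table. The sketch below is illustrative only. It assumes Python's sqlite3 module with an in-memory database, a Fact_Sales trimmed to a single measure, and made-up sales figures; none of those specifics come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hybrid lookup: each store row repeats its parents' ids and names.
    CREATE TABLE Lookup_store (
        store_id INTEGER PRIMARY KEY, store_name TEXT,
        market_id INTEGER, market_name TEXT,
        region_id INTEGER, region_name TEXT);
    -- Fact table trimmed to one measure for brevity (an assumption).
    CREATE TABLE Fact_Sales (
        item_id INTEGER, store_id INTEGER, day TEXT, reg_sls_unit INTEGER);

    INSERT INTO Lookup_store VALUES
        (101, 'Boston',    20, 'New England',  1, 'Northeast'),
        (104, 'Baltimore', 10, 'Mid-Atlantic', 1, 'Northeast'),
        (109, 'Atlanta',   40, 'Deep South',   2, 'South');
    -- Hypothetical sales figures, for illustration only.
    INSERT INTO Fact_Sales VALUES
        (1, 101, '1999-01-01', 5),
        (1, 104, '1999-01-01', 7),
        (2, 109, '1999-01-01', 3);
""")

# Region rollup with a single join: no hop through Lookup_market
# or Lookup_region is needed.
region_units = conn.execute("""
    SELECT s.region_name, SUM(f.reg_sls_unit)
    FROM Fact_Sales AS f
    JOIN Lookup_store AS s ON s.store_id = f.store_id
    GROUP BY s.region_name
    ORDER BY s.region_name
""").fetchall()
print(region_units)
```

The trade-off is redundancy: a market that moves to another region must be updated in every store row as well as in Lookup_market.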
A Real Schema: TPC-D for example
Dimensional Model for TPC-D
Supp Region: Region Key, Name, Comment
Supp Nation: Nation Key, Name, Region Key, Comment
Supplier: Supp Key, Name, Address, Nation Key, Phone, Acct Bal, Comment
Cust Region: Region Key, Name, Comment
Cust Nation: Nation Key, Name, Region Key, Comment
Customer: Cust Key, Name, Address, Nation Key, Phone, Acct Bal, Mkt Segment, Comment
Part: Part Key, Name, MFGR, Brand, Type, Size, Container, Retail Price, Comment
Part Supp: Part Key, Supp Key, Avail Qty, Supply Cost, Comment
Order: Order Key, Cust Key, Order Status, Total Price, Order Date, Order Priority, Clerk, Ship Priority, Comment
Line Item: Order Key, Part Key, Supp Key, Line Number, Quantity, Extend Price, Discount, Tax, Return Flag, Line Status, Ship Date, Commit Date, Receipt Date, Ship Instruct, Ship Mode, Comment
Order Time: Time Key, Alpha, Year, Month, Week, Day
Ship Time: Time Key, Alpha, Year, Month, Week, Day
Commit Time: Time Key, Alpha, Year, Month, Week, Day
Receipt Time: Time Key, Alpha, Year, Month, Week, Day
Logical Business Model for the Order Dimension
Tables shown:
  Cust Region: Region Key, Name, Comment
  Cust Nation: Nation Key, Name, Region Key, Comment
  Customer: Cust Key, Name, Address, Nation Key, Phone, Acct Bal, Mkt Segment, Comment
  Order: Order Key, Cust Key, Order Status, Total Price, Order Date, Order Priority, Clerk, Ship Priority, Comment
  Order Time: Time Key, Alpha, Year, Month, Week, Day
  Line Item: Order Key, Part Key, Supp Key, Line Number, Quantity, Extend Price, Discount, Tax, Return Flag, Line Status, Ship Date, Commit Date, Receipt Date, Ship Instruct, Ship Mode, Comment
Attribute labels shown (Orders):
  Cust Region, Cust Nation, Mkt Segment, Customer
  Clerk, Order Time, Order Date, Order Priority, Order Status, Ship Priority, Order
  Receipt Date, Part Key, Commit Date, Supp Key, Ship Date, Ship Mode, Return Flag, Ship Instruct, Line Status, Line Item
Modeling Conclusions




Seem complicated?
Modeling data is fairly simple if the data and the capabilities/requirements of the tools are well understood.
Not all tools are created equal, so many data transformations must often occur to achieve desired results.
Real world data is rarely “cube” or “star” like.

Caveat...
Industry Benchmarks - Comparison

There exist interesting differences between the two DSS benchmarks: APB-1 and TPC-D
APB-1 (built by the OLAP Council)
  Is a basic budgeting application
  Contains no many-to-many relationships
  Contains “clean” dimensions
  Is very “star” and “cube” like
TPC-D (built by the RDBMS community)
  Is a basic order entry system
  Contains facts with different dimensional keys
  Is relatively normalized
  Contains cross-dimensional attribute relationships
  Contains “table-less” dimensions
The Tough Problems

Handling Large Volumes

Working with Complex / Varied Data Structures

Performing Advanced Calculations -- Efficiently
Also called “Depth - Breadth - Reporting Range”
State of Technology

Vendors - Database (good)
  Database engines add more scalability and flexibility
Vendors - OLAP (bad)
  Continue to focus on making simple problems simpler
  Basing solutions on too many assumptions
  Working to confuse market -- ROLAP, HOLAP, MOLAP
  Working with inherently limited architectures
  Not utilizing underlying RDBMS capabilities
  Working within fixed database schemas
Net Result (still much room for improvement)
  Vendors failing to solve customers’ true needs
  The market is pushing back to datamarts
  Market is living with simpler reporting -- which may not be bad
Large Systems?

>> 50-100+ Gigabytes of Raw Data

Customers
  Want central data warehouse
  Find large systems difficult to build and maintain
  Have data in a variety of structures (table formats)
OLAP Vendors
  Advocate storing subsets of data in different structure
  Build proprietary MDDB
  Push for datamarts
ROLAP / RDBMS
  Push for less data restructuring
  Design for less data movement
RDBMS Vendors
Fortunately, with advances in RDBMS technology, ROLAP is increasingly recognized as the best approach
Key Enhancements -- System
  Better support for large systems
  Partitioning (a.k.a. segmentation or fragmentation)
  Hash and bitmap joins, hash and bitmap index technology
  Parallel and clustered processing
Key Enhancements -- Function
  Temporary table support
  Derived table support
  Outer Joins
  “OLAP” functions
OLAP Demands On An RDBMS
Ability to efficiently perform
  Joins and Aggregations
  Row restrictions -- filters
Ability to generate Counts and Sums
Ability to perform iterative calculations and filters
  Temporary Tables
  Derived Tables or Table Expressions
The above gets you 80 percent of the way there, except for Ranks, Cumulative Sums, Moving Sums
Example Analysis
DSS Question
  “Show me customer revenue & customers’ percent contribution (customer rev / total rev), only for those customers who contributed more than 1% to total revenue”
Popular OLAP Approach
  Fetch revenue data for each Customer into OLAP Server
  Calculate percent to total revenue for each Customer
  Restrict result set to those Customers whose Contribution is greater than 1%
Example Analysis (2)
Pure ROLAP Approach
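-- Pass 1: revenue per customer
select Customer, Sum(Revenue) as REV
into Temp1
from Customer_Fact
group by Customer

-- Pass 2: total revenue across all customers
select Sum(REV) as TOT_REV
into Temp2
from Temp1

-- Pass 3: revenue and contribution, kept only above 1% of the total
select Temp1.Customer, Temp1.REV, Temp1.REV/Temp2.TOT_REV as CONT
from Temp1, Temp2
where Temp1.REV/Temp2.TOT_REV > .01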
SQL Extensions


Temporary Table -- Declared Local Tables (ANSI ‘92)
Derived Tables -- Selects in FROM clause (ANSI ‘92)
Example Analysis (3)
ROLAP using Table Expressions
select
  Temp1.Customer,
  Temp1.REV,
  Temp1.REV/Temp2.TOT_REV as CONT
from
  (select Customer, Sum(Revenue) as REV
   from Customer_Fact
   group by Customer) as Temp1,
  (select Sum(Revenue) as TOT_REV
   from Customer_Fact) as Temp2
where Temp1.REV/Temp2.TOT_REV > .01

Either implementation is known as multi-pass SQL
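Both multi-pass variants can be tried end to end. The following is a minimal sketch, assuming Python's sqlite3 module, an in-memory database, and a made-up Customer_Fact with four rows (total revenue 1000). SQLite has no `select ... into`, so the temporary-table passes are written as `CREATE TEMP TABLE ... AS`, and the derived tables are aliased T1 and T2 so they do not clash with the temporary tables. Those substitutions are mine, not the slides'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer_Fact (Customer TEXT, Revenue REAL);
    -- Hypothetical data: total revenue is 1000.
    INSERT INTO Customer_Fact VALUES
        ('A', 600.0), ('A', 300.0), ('B', 95.0), ('C', 5.0);
""")

# Approach 1: temporary tables, one pass per statement.
conn.executescript("""
    CREATE TEMP TABLE Temp1 AS
        SELECT Customer, SUM(Revenue) AS REV
        FROM Customer_Fact GROUP BY Customer;
    CREATE TEMP TABLE Temp2 AS
        SELECT SUM(REV) AS TOT_REV FROM Temp1;
""")
with_temp_tables = conn.execute("""
    SELECT Temp1.Customer, Temp1.REV, Temp1.REV / Temp2.TOT_REV AS CONT
    FROM Temp1, Temp2
    WHERE Temp1.REV / Temp2.TOT_REV > 0.01
    ORDER BY Temp1.Customer
""").fetchall()

# Approach 2: derived tables (selects in the FROM clause), one statement.
with_derived_tables = conn.execute("""
    SELECT T1.Customer, T1.REV, T1.REV / T2.TOT_REV AS CONT
    FROM (SELECT Customer, SUM(Revenue) AS REV
          FROM Customer_Fact GROUP BY Customer) AS T1,
         (SELECT SUM(Revenue) AS TOT_REV FROM Customer_Fact) AS T2
    WHERE T1.REV / T2.TOT_REV > 0.01
    ORDER BY T1.Customer
""").fetchall()

for customer, rev, cont in with_derived_tables:
    print(customer, rev, round(cont, 3))
```

Customer C contributes 0.5% of the total and is filtered out inside the database, so only the qualifying rows ever travel to the client.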
So What About the Other 20%?

How do you calculate Ranking, Moving Sums, and Cumulative Sums?

Currently OLAP tools must do this on their own.

RDBMS vendors are beginning to add support for this.
  Teradata and Red Brick have commercial implementations.
  Proposal put forth by Oracle and IBM for ANSI SQL ‘99 (just approved).
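For a sense of what in-database support for the remaining 20% looks like, here is a hedged sketch using the window-function syntax (`RANK() OVER`, `SUM() OVER`) that many current SQL engines accept. It assumes Python's sqlite3 module linked against SQLite 3.25 or later, and a hypothetical Monthly_Sales table with made-up figures. It is not the syntax of any specific product named on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical monthly totals, for illustration only.
    CREATE TABLE Monthly_Sales (month_id INTEGER, sls_dollar INTEGER);
    INSERT INTO Monthly_Sales VALUES (1, 100), (2, 300), (3, 200), (4, 400);
""")

rows = conn.execute("""
    SELECT month_id,
           sls_dollar,
           -- Rank: 1 = highest sales
           RANK() OVER (ORDER BY sls_dollar DESC) AS sales_rank,
           -- Cumulative sum: everything up to and including this month
           SUM(sls_dollar) OVER (
               ORDER BY month_id
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS cumulative_sum,
           -- Moving sum: this month plus the one before it
           SUM(sls_dollar) OVER (
               ORDER BY month_id
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS moving_sum_2
    FROM Monthly_Sales
    ORDER BY month_id
""").fetchall()

for row in rows:
    print(row)
```

All three calculations come back from a single statement, so the OLAP tool does not have to fetch the detail rows and compute them itself.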
Aggregate Navigation

Aggregate Navigation involves two parts
  Materialized View support
  Query rewrite capabilities
Materialized Views
  A “Summary Table” defined as a view
  Additional properties telling the database how to update the view
  An advanced type of index
Query Rewrite
  The ability for the optimizer to redirect a query to a “higher” materialized view based on group by and where clause evaluation
Query Rewrite Example
Query as written (against the base table)
  Select Region, Sum(Sls_unit)
  from Base_Sales
  group by Region
Query after rewrite (against the materialized view)
  Select Region, Sls_unit
  from Aggregate_Sales
Base Table
  Base_Sales: store_id, sls_unit, sls_dollar
Materialized View
  Aggregate_Sales: region_id, sls_unit, sls_dollar
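The equivalence that query rewrite relies on can be checked by hand. The sketch below assumes Python's sqlite3 module. SQLite has neither materialized views nor query rewrite, so Aggregate_Sales is emulated as an ordinary summary table built once, and both forms of the query are run explicitly. The Lookup_store table that maps stores to regions and all data values are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed store-to-region mapping (not shown on the slide).
    CREATE TABLE Lookup_store (store_id INTEGER PRIMARY KEY, region_id INTEGER);
    CREATE TABLE Base_Sales (store_id INTEGER, sls_unit INTEGER, sls_dollar INTEGER);
    INSERT INTO Lookup_store VALUES (101, 1), (104, 1), (109, 2);
    INSERT INTO Base_Sales VALUES
        (101, 5, 50), (101, 2, 20), (104, 7, 70), (109, 3, 30);

    -- Stand-in for a materialized view: a summary table built by hand.
    CREATE TABLE Aggregate_Sales AS
        SELECT s.region_id       AS region_id,
               SUM(b.sls_unit)   AS sls_unit,
               SUM(b.sls_dollar) AS sls_dollar
        FROM Base_Sales AS b
        JOIN Lookup_store AS s ON s.store_id = b.store_id
        GROUP BY s.region_id;
""")

# The query as a user or tool writes it, against the base table.
from_base = conn.execute("""
    SELECT s.region_id, SUM(b.sls_unit)
    FROM Base_Sales AS b
    JOIN Lookup_store AS s ON s.store_id = b.store_id
    GROUP BY s.region_id
    ORDER BY s.region_id
""").fetchall()

# What a rewriting optimizer could run instead: no join, no aggregation.
from_aggregate = conn.execute("""
    SELECT region_id, sls_unit FROM Aggregate_Sales ORDER BY region_id
""").fetchall()

print(from_base, from_aggregate)
```

In a database with real query rewrite the user only ever issues the first query; the optimizer decides whether the summary can answer it.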
OLAP & RDBMS

How does all this affect OLAP tools?
  As RDBMS vendors add more functionality, OLAP tools must become smarter in terms of generating SQL
  DOLAP does not replace OLAP tools; the tools must work together more intelligently
  It lessens the appeal of MD OLAP solutions
Database Technologies
Who are the “leaders”?

By market share...
  International Data Corp -- $9.7 billion market
    40.4% Oracle with $3.93 billion
    17.8% UDB with $1.73 billion
    5.7% Informix
    5.1% Microsoft
    4.4% Sybase
  Dataquest Inc
    32.3% UDB
    29.3% Oracle
    10.2% Microsoft
    4.4% Informix
    3.5% Sybase
Database Technologies (cont.)
Who are the “leaders”?

By benchmarks...


See TPC-D results
Another Tangent -- Database Gateways?

Cohera, ISG Navigator, IBM’s DataJoiner
OLAP Technologies
Who are the “leaders”?

By Market Share
  “The OLAP Report” - $2+ billion
  34% Hyperion Solutions Inc. (merged with Arbor)
  17% Oracle Express (slipping from 21%)
  9.6% Cognos (slipping slightly)
  6.4% MicroStrategy (up from 4.5% and rising)
Parting Comments

Customers need...
  More flexible OLAP tools
  RDBMS optimized for DSS
Sophisticated problems
  Limitations of Multidimensional Model?
  Large volumes
  Schema support
  Management of the environment
  True ROLAP calculations -- minimize data movement
    HOLAP is not necessarily ROLAP
DSS is becoming mission critical
  Systems need to ensure success and availability.
The End
Some further detail on schemas
A Sample Star Lookup Table
Star #1
Lookup_Geography: geo_key, geo_name, store_id, market_id, region_id, level

GEO_KEY  GEO_NAME      STORE_ID  MARKET_ID  REGION_ID  LEVEL
1001     Boston        101       20         1          1
1002     Greenwich     102       20         1          1
1003     Providence    103       20         1          1
1004     Baltimore     104       10         1          1
1005     Philadelphia  105       10         1          1
1006     Charlotte     106       30         2          1
1007     Durham        107       30         2          1
1008     Greenville    108       30         2          1
1009     Atlanta       109       40         2          1
1010     Fayetteville  110       40         2          1
1011     Mid-Atlantic            10         1          2
1012     New England             20         1          2
1013     Carolinas               30         2          2
1014     Deep South              40         2          2
1015     Northeast                          1          3
1016     South                              2          3

Star Schema lookup tables hold all of the elements within a dimension in one physical lookup table.
Each dimensional lookup table will have a single-column primary key that is unique within the dimension, regardless of the attribute.
Each dimension lookup table will include a ‘level’ field which indicates the attribute level.
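The three rules above can be sketched with a handful of the sample rows. This is an illustration only, assuming Python's sqlite3 module and an in-memory database. Treating the empty cells of the sample table as NULL and reading level 1, 2, 3 as store, market, region are my assumptions; the helper `names_at_level` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Lookup_Geography (
        geo_key   INTEGER PRIMARY KEY,  -- unique across the whole dimension
        geo_name  TEXT,
        store_id  INTEGER,              -- assumed NULL above store level
        market_id INTEGER,              -- assumed NULL above market level
        region_id INTEGER,
        level     INTEGER);             -- assumed: 1 store, 2 market, 3 region
    INSERT INTO Lookup_Geography VALUES
        (1001, 'Boston',       101,  20,   1, 1),
        (1004, 'Baltimore',    104,  10,   1, 1),
        (1011, 'Mid-Atlantic', NULL, 10,   1, 2),
        (1012, 'New England',  NULL, 20,   1, 2),
        (1015, 'Northeast',    NULL, NULL, 1, 3);
""")

def names_at_level(level):
    """Return the element names stored at one attribute level."""
    rows = conn.execute(
        "SELECT geo_name FROM Lookup_Geography WHERE level = ? ORDER BY geo_key",
        (level,))
    return [name for (name,) in rows]

stores = names_at_level(1)
markets = names_at_level(2)
regions = names_at_level(3)

# Finding a store's market means joining the lookup table to itself,
# because stores and markets live in the same physical table.
boston_market = conn.execute("""
    SELECT m.geo_name
    FROM Lookup_Geography AS s
    JOIN Lookup_Geography AS m
      ON m.market_id = s.market_id AND m.level = 2
    WHERE s.geo_name = 'Boston' AND s.level = 1
""").fetchone()[0]

print(stores, markets, regions, boston_market)
```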
Original Star Fact Tables
Fact_Sales: product_key, geo_key, time_key, reg_sls_unit, reg_sls_dollar, cle_sls_unit, cle_sls_dollar, pml_sls_unit, pml_sls_dollar, pln_sls_unit, pln_sls_dollar

Two types of Star Schemas
  Atomic data only
  Consolidated
Atomic data only
  Base tables contain only one level of data (per table).
  No ‘in-table’ aggregation.
Consolidated
  Base tables contain base table data as well as aggregate data for every possible level of aggregation
  ‘In-table’ aggregation = storing aggregate data in the same table as atomic level data, for example, storing store, market, and region level information within the same fact table.
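One practical consequence of in-table aggregation is that every query against a consolidated fact table has to pick a level, or the same sales are counted once per level. A minimal sketch, assuming Python's sqlite3 module, a Fact_Sales trimmed to geo_key and one measure, a Lookup_Geography trimmed to three columns, and made-up unit counts (`units_at_level` is a hypothetical helper):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Lookup_Geography (
        geo_key INTEGER PRIMARY KEY, geo_name TEXT, level INTEGER);
    CREATE TABLE Fact_Sales (geo_key INTEGER, reg_sls_unit INTEGER);
    INSERT INTO Lookup_Geography VALUES
        (1001, 'Boston', 1), (1002, 'Greenwich', 1),
        (1012, 'New England', 2), (1015, 'Northeast', 3);
    -- Consolidated layout: store rows plus pre-aggregated market and
    -- region rows in the same table (hypothetical unit counts).
    INSERT INTO Fact_Sales VALUES
        (1001, 5), (1002, 7), (1012, 12), (1015, 12);
""")

def units_at_level(level):
    """Total units using only the fact rows stored at one level."""
    return conn.execute("""
        SELECT SUM(f.reg_sls_unit)
        FROM Fact_Sales AS f
        JOIN Lookup_Geography AS g ON g.geo_key = f.geo_key
        WHERE g.level = ?
    """, (level,)).fetchone()[0]

# Summing without a level filter mixes atomic and aggregate rows.
naive_total = conn.execute(
    "SELECT SUM(reg_sls_unit) FROM Fact_Sales").fetchone()[0]

store_total = units_at_level(1)
region_total = units_at_level(3)
print(naive_total, store_total, region_total)
```

In an atomic-only fact table the unfiltered sum is already correct, which is part of why that layout is simpler for SQL-generating tools.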
A Sample Snowflake Lookup
Lookup_store: store_id, store_name, market_id, region_id
Lookup_market: market_id, market_name, region_id
Lookup_region: region_id, region_name

Lookup_region
REGION_ID  REGION_NAME
1          Northeast
2          South

Lookup_market
MARKET_ID  MARKET_NAME   REGION_ID
10         Mid-Atlantic  1
20         New England   1
30         Carolinas     2
40         Deep South    2

Lookup_store
STORE_ID  STORE_NAME    MARKET_ID  REGION_ID
101       Boston        20         1
102       Greenwich     20         1
103       Providence    20         1
104       Baltimore     10         1
105       Philadelphia  10         1
106       Charlotte     30         2
107       Durham        30         2
108       Greenville    30         2
109       Atlanta       40         2
110       Fayetteville  40         2

The snowflake design typically has one physical lookup table per attribute, with each attribute identified by a unique key and having its own description column.
Attributes are related to each other by including foreign key columns in attribute lookup tables; for example, region_id is stored in the Lookup_market table.
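The foreign-key chain described above can be walked with ordinary joins. The sketch below assumes Python's sqlite3 module and an in-memory database, and loads a subset of the sample rows. For a strictly normalized illustration it leaves out the extra region_id column that the sample Lookup_store also carries; that omission is my choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One lookup table per attribute, each with its own key and name.
    CREATE TABLE Lookup_region (
        region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE Lookup_market (
        market_id INTEGER PRIMARY KEY, market_name TEXT,
        region_id INTEGER REFERENCES Lookup_region(region_id));
    CREATE TABLE Lookup_store (
        store_id INTEGER PRIMARY KEY, store_name TEXT,
        market_id INTEGER REFERENCES Lookup_market(market_id));

    INSERT INTO Lookup_region VALUES (1, 'Northeast'), (2, 'South');
    INSERT INTO Lookup_market VALUES
        (10, 'Mid-Atlantic', 1), (20, 'New England', 1),
        (30, 'Carolinas', 2), (40, 'Deep South', 2);
    INSERT INTO Lookup_store VALUES
        (101, 'Boston', 20), (104, 'Baltimore', 10),
        (106, 'Charlotte', 30), (109, 'Atlanta', 40);
""")

# Resolve each store's market and region by following the foreign keys:
# store -> market -> region.
store_regions = conn.execute("""
    SELECT s.store_name, m.market_name, r.region_name
    FROM Lookup_store AS s
    JOIN Lookup_market AS m ON m.market_id = s.market_id
    JOIN Lookup_region AS r ON r.region_id = m.region_id
    ORDER BY s.store_id
""").fetchall()

for row in store_regions:
    print(row)
```

Each name is stored exactly once, at the cost of one join per level climbed.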