Download Data Warehouse

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
MTAT.03.183
MTAT
03 183
Data Mining
Week 7: Online Analytical Processing
and Data Warehouses
Marlon Dumas
marlon.dumas ät ut . ee
Acknowledgment
• Thi
This slide
lid d
deck
k iis a ““mashup”
h ” off th
the ffollowing
ll i
publicly available slide decks:
– http://www.postech.ac.kr/~swhwang/grass/DataCube.
//
/
/
/
C
ppt
– http://classweb.gmu.edu/kersch/inft864/Readings/Sho
htt // l
b
d /k
h/i ft864/R di
/Sh
shani/DataCube/CubeNotesKerschberg.ppt
– http://ohr.gsfc.nasa.gov/wfstatistics/Data_Cube_Traini
http://ohr gsfc nasa gov/wfstatistics/Data Cube Traini
ng.ppt
– http://www.cs.uiuc.edu/homes/hanj/bk2/03.ppt
http://www cs uiuc edu/homes/hanj/bk2/03 ppt
2
Outline
• The “data cube” abstraction
• Multidimensional data models
• Data warehouses
3
Typical Data Analysis Process
•
•
•
•
•
Formulate a query to extract relevant information
gg g
data from the database
Extract aggregated
Visualize the result to look for patterns.
Analyze the result and formulate new queries
queries.
Online Analytical Processing (OLAP) is about
supporting
ti such
h processes
• OLAP characteristics: No updates, lots of
aggregation, need to visualize and to interact
gg g
• Let’s first talk about aggregation…
4
Relational Aggregation Operators
• SQL has several aggregate operators:
– SUM(), MIN(), MAX(), COUNT(), AVG()
• The basic idea is:
– Combine all values in a column
into a single scalar value
• Syntax
17
AVG()
2
?
…
...
– SELECT AVG(Temp)
AVG(T
)
FROM Weather;
5
13
5
IDSLab.
The Relational GROUP BY Operator
• GROUP BY allows aggregates
gg g
over table sub-groups
g p
– SELECT
Time, Altitude, AVG(Temp)
FROM
Weather
GROUP BY Time, Altitude;
Time
Latitude
Longitude
Altitude
(m)
Temp
07/9/5:1500
…
…
20
24
Time
Altitude
(m)
AVG(Temp)
07/9/5:1500
…
…
20
22
07/9/5:1500
20
23
07/9/5:1500
…
…
100
17
07/9/5:1500
100
17
07/9/9:1500
…
…
50
19
07/9/9:1500
50
20
07/9/9:1500
…
…
50
21
6
IDSLab.
Limitations of the GROUP BY
• Group-by
p y is one-dimensional: one g
group
pp
per
combination of the selected attribute values
 Does not give sub-totals
sub totals
Model
Year
Color
Sales
Ch
Chevy
1994
Bl k
Black
50
Chevy
1995
Black
85
Chevy
1994
White
40
Chevy
1995
White
115
1. Calculate total sales per year
1
2. Compute total sales per year and per color
3. Calculate sales per year, per color and per model
7
Grouping with Sub-Totals (Pivot table)
• Sales by Model by Year by Color
• Note that sub-totals by color are missing, if added it
becomes a cross-tabulation
cross tab lation
8
Grouping with sub-totals
sub totals (cross
(cross-tab)
tab)
9
Grouping with Sub-Totals
Sub Totals (Relational version)
Sub-totals byy color are still missing…
g
10
IDSLab.
SQL Query
11
Adding the colors…
colors
12
CUBE and Roll Up Operators
Aggregate
Group By
(with total)
Sum
By Color
RED
WHITE
BLUE
Cross Tab
Chevy
y Ford By
y Co
Color
o
Sum
RED
WHITE
BLUE
The Data Cube and
The Sub-Space Aggregates
Byy Make
Sum
By Year
By Make
By Make & Year
RED
WHITE
BLUE
By Color & Year
Sum
By Make & Color
B Color
By
C l
13
The Cube
• An Example of 3D Data Cube
By Year
By Make
Sum
IDSLab.
By Color
14
14
Cube: Each Attribute is a Dimension
• N-dimensional
N dimensional Aggregate (sum()
(sum(), max()
max(),...))
– Fits relational model exactly:
• a1, a2, ...., aN, f()
• Super-aggregate over N-1 Dimensional sub-cubes
•
•
•
•
ALL, a2, ...., aN , f()
a3 , ALL, a3, ...., aN , f()
...
a1, a2, ...., ALL,
ALL f()
– This is the N-1 Dimensional cross-tab.
• Super-aggregate
Super aggregate over N-2
N 2 Dimensional sub
sub-cubes
cubes
• ALL, ALL, a3, ...., aN , f()
• ...
• a1, a2 ,...., ALL, ALL, f()
15
The Data Cube Concept
W
B
1994
1995
MAKE
1994
1995
B
W
Ford
Black
F
C
1994
Chevy
1994
1995
White
COLOR
1995
F
YEAR
C
F
C
B
W
16
Sub-cube Derivation
<M,Y,C>
<M,Y,*>
<M,*,C>
<M,*,*>
M, ,
<*,Y,*>
,Y,
<*,Y,C>
<*,*,C>
<*,*,*>
• Dimension collapse, * denotes ALL
17
CUBE Operator
Possible syntax
• Proposed syntax example:
– SELECT
Model, Make, Year, SUM(Sales)
FROM
Sales
WHERE
Model IN {“Chevy”, “Ford”}
AND
Year BETWEEN 1990 AND 1994
GROUP BY CUBE Model, Make, Year
HAVING
SUM(Sales) > 0;
– Note: GROUP BY operator repeats aggregate list
• in select list
• in group by list
18
IDSLab.
18
Rollup Operator
• ROLLUP Operator: special case of CUBE Operator
Return “Sales Roll Up by Store by Quarter” in 1994.:
SELECT
Store, quarter, SUM(Sales)
FROM
Sales
WHERE
nation=“Korea” AND Year=1994
GROUP BY ROLLUP Store, Quarter(Date) AS
quarter;
19
IDSLab.
Cube Operator Example
SALES
Model Year Color
Chevy 1990 red
Chevy 1990 white
Chevy 1990 blue
Chevy 1991 red
Chevy 1991 white
Chevy 1991 blue
Chevy 1992 red
Chevy 1992 white
Chevy 1992 blue
Ford 1990 red
Ford 1990 white
Ford 1990 blue
Ford 1991 red
Ford 1991 white
Ford 1991 blue
Ford 1992 red
F d 1992 white
Ford
hit
Ford 1992 blue
Sales
5
87
62
54
95
49
31
54
71
64
62
63
52
9
55
27
62
39
CUBE
DATA CUBE
Model Year Color
ALL ALL ALL
chevy ALL ALL
ford ALL ALL
ALL 1990 ALL
ALL 1991 ALL
ALL 1992 ALL
ALL ALL red
ALL ALL white
ALL ALL blue
chevy 1990 ALL
chevy 1991 ALL
chevy
h
1992 ALL
ford 1990 ALL
ford 1991 ALL
ford 1992 ALL
chevy ALL red
chevy ALL white
chevy ALL blue
ford ALL red
ford ALL white
ford ALL blue
ALL 1990 red
ALL 1990 white
ALL 1990 blue
ALL 1991 red
ALL 1991 white
ALL 1991 blue
ALL 1992 red
ALL 1992 white
ALL 1992 blue
Sales
942
510
432
343
314
285
165
273
339
154
199
157
189
116
128
91
236
183
144
133
156
69
149
125
107
104
104
59
116
110
20
Summary
• Problems with GROUP BY
– GROUP BY cannot directly construct
• Pivot tables / roll-up reports
• Cross-Tabs
Cross Tabs
• CUBE Operator
– Generalizes GROUP BY and Roll
Roll-Up
Up and Cross
Cross-Tabs!!
Tabs!!
21
IDSLab.
21
Now let’s
let s have a look at one…
one
• NASA Workforce cubes
http://nasapeople.nasa.gov/workforce
• Btell demo reports
– http://www.btell.de
– Follow the “demo” link and start a demo, the go to
reports
22
Multidimensional Data
• S
Sales
l volume
l
as a ffunction
ti off product,
d t month,
th
and region
Dimensions: Product,
Product Location,
Location Time
Hierarchical summarization paths
P
Produc
t
I d t
Industry
R
Region
i
J. Han: Data Mining: Concepts and Techniques
Y
Year
Category Country Quarter
Product
City
Office
Month Week
Day
23
Multidimensional Data Modeling
• Key concepts
– Dimensions: entities along which data can be
aggregated (e.g. Car model, time, color)
– Measures: scalar values to be aggregated (e.g.
number of sales)
• Two types of tables
– Dimension tables
– Fact table: stores measures and references to
dimensions
J. Han: Data Mining: Concepts and Techniques
24
Star Schema
time
time_key
day
day_of_the_week
y_ _ _
month
quarter
year
item
Sales
Sa
es Fact
act Table
ab e
time_key
item key
item_key
item_key
item name
item_name
brand
type
supplier
pp _type
yp
branch_key
branch
branch_key
branch_name
branch type
branch_type
location
l ti
location_key
k
units_sold
dollars_sold
avg_sales
Measure
J. Han: Data Mining:sConcepts and Techniques
location_key
street
city
state_or_province
country
25
Snowflake Schema
time
time_key
day
day_of_the_week
y_ _ _
month
quarter
year
it
item
Sales Fact Table
time_key
item
e _key
ey
item_key
item_name
brand
type
supplier_key
supplier
supplier_key
supplier
key
supplier_type
branch_key
location
branch
location key
location_key
branch_key
branch_name
branch type
branch_type
units_sold
d ll
dollars_sold
ld
avg_sales
Measure
M
s
J. Han: Data Mining: Concepts and Techniques
location_key
street
city_key
city
city_key
city
state or province
state_or_province
country
26
OLTP vs. OLAP
• OLTP – Online Transaction Processing
– Traditional database technology
– Many small transactions
(point queries: UPDATE or INSERT)
– Avoid redundancy, normalize schemas
– Access to consistent, up-to-date database
• OLTP Examples:
p
– Flight reservation
– Banking
g and financial transactions
– Order Management, Procurement, ...
• Extremely fast response times
times...
Carsten Binnig, ETH Zürich
27
OLTP vs. OLAP
• OLAP – Online Analytical Processing
– Big aggerate queries, no Updates
– Redundancy a necessity (Materialized Views, specialpurpose indexes, de-normalized schemas)
– Periodic refresh of data (daily or weekly)
• OLAP Examples
– Decision support
pp ((sales p
per employee)
p y )
– Marketing (purchases per customer)
– Biomedical databases
• Goal: Response Time of seconds / few minutes
Carsten Binnig, ETH Zürich
28
OLTP vs. OLAP (Water and Oil)
• Lock Conflicts: OLAP blocks OLTP
• Database design:
– OLTP normalized, OLAP de-normalized
• Tuning, Optimization
– OLTP: inter-query parallelism, heuristic optimization
– OLAP: intra-query
q yp
parallelism,, full-fledged
g optimization
p
• Freshness of Data:
– OLTP: serializability
– OLAP: reproducibility
• Integrity:
– OLTP: ACID
– OLAP: Sampling,
p g, Confidence Intervals
Carsten Binnig, ETH Zürich
29
Solution: Data Warehouse
• S
Special
i lS
Sandbox
db ffor OLAP
• Data input using OLTP systems
• Data Warehouse aggregates and replicates data
((special
p
schema))
• New Data is periodically uploaded to Warehouse
Carsten Binnig, ETH Zürich
30
DW Architecture
OLTP
OLAP
OLTP Applications
GUI, Spreadsheets
Extract
Transform
Load
(ETL)
DB1
DB2
D t W
Data
Warehouse
h
DB3
Carsten Binnig, ETH Zürich
31
DW Products and Tools
• Oracle 11g, IBM DB2, Microsoft SQL Server, ...
– All provide OLAP extensions
• SAP Business Information Warehouse
– ERP vendors
• MicroStrategy, Cognos (now IBM)
– Specialized vendors
– Kind of Web-based EXCEL
• Niche Players (e.g., Btell)
– Vertical application domain
32
Reference (highly recommended)
• Jim Gray et al. “Data Cube: A Relational
Aggregation Operator Generalizing Group-By,
Cross-Tab, and Sub-Totals”. Data Mining and
Knowledge Discovery 1(1), 1997.
• http://citeseer.ist.psu.edu/old/392672.html
• Data Warehousing chapter of Jianwei Han’s
Han s
textbook (chapter 3)
33
Homework
• Exercises 1 and 4 at:
– http://www.systems.ethz.ch/education/courses/fs09/d
ata-warehousing/ex2.pdf
• Multidimensional data modeling exercise in
course Wiki pages
34
Related documents