19.09.15
Data from a DW
Data Warehouse: Subject-Oriented
•  Organized around major subjects, such as customer, product, and sales
•  Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
•  Provides a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse: Integrated
•  Constructed by integrating multiple, heterogeneous data sources
   •  Relational databases, flat files, on-line transaction records
•  Data cleaning and data integration techniques are applied
   •  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
   •  E.g., hotel price: currency, tax, breakfast covered, etc.
•  When data is moved to the warehouse, it is converted
Data Warehouse: Nonvolatile
•  A physically separate store of data transformed from the operational environment
•  Operational update of data does not occur in the data warehouse environment
   •  Does not require transaction processing, recovery, and concurrency control mechanisms
   •  Requires only two operations in data accessing: initial loading of data and access of data
What is OLAP?
•  The term OLAP ("online analytical processing") was coined in a white paper written for Arbor Software Corp. in 1993
•  Interactive process of creating, managing, analyzing and reporting on data
•  Analyzing large quantities of data in real time
OLTP vs. OLAP
                     OLTP                              OLAP
users                clerk, IT professional            knowledge worker
function             day-to-day operations             decision support
DB design            application-oriented              subject-oriented
data                 current, up-to-date;              historical;
                     detailed, flat relational;        summarized, multidimensional;
                     isolated                          integrated, consolidated
usage                repetitive                        ad hoc
access               read/write;                       lots of scans
                     index/hash on primary key
unit of work         short, simple transaction         complex query
# records accessed   tens                              millions
# users              thousands                         hundreds
DB size              100 MB to GB                      100 GB to TB
metric               transaction throughput            query throughput, response
Conceptual Modeling of Data Warehouses
•  Modeling data warehouses: dimensions and measures instead of the relational model
   •  Subject-oriented, facilitates on-line data analysis
•  Most popular model is the multidimensional model
•  Most common modeling paradigm: star schema
   •  The data warehouse contains a large central table (fact table)
      •  Contains the data without redundancy
   •  A set of dimension tables (one for each dimension)
•  For two dimensions
   •  A spreadsheet (Excel) with spreadsheet formula calculations
•  For more than two dimensions
   •  We would require several spreadsheet tables
   •  -> Data explosion
•  We look for one "Excel" table with several dimensions
•  How do we represent an Excel table in a computer?
   •  Multidimensional model
   •  For Excel: two dimensions (pointers to the data) and the data itself
Example of Star Schema

Sales Fact Table
   Keys:      time_key, item_key, branch_key, location_key
   Measures:  units_sold, dollars_sold, avg_sales

Dimension tables
   time:      time_key, day, day_of_the_week, month, quarter, year
   item:      item_key, item_name, brand, type, supplier_type
   branch:    branch_key, branch_name, branch_type
   location:  location_key, street, city, state_or_province, country
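A star schema is typically queried by joining the fact table to the dimension tables it references and then aggregating the measures. The following is a minimal sketch using Python's built-in `sqlite3`; the table names `time_dim`, `item_dim` and `sales_fact`, the reduced column set, and all rows are assumptions made for illustration only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 2015, 'Q1'), (2, 2015, 'Q2');
INSERT INTO item_dim VALUES (10, 'pen', 'Acme'), (11, 'ink', 'Acme');
INSERT INTO sales_fact VALUES (1, 10, 5, 10.0), (1, 11, 2, 8.0), (2, 10, 7, 14.0);
""")

# The fact table holds only keys and measures; descriptive attributes
# (quarter, item_name) come from the dimension tables via the keys.
rows = con.execute("""
    SELECT t.quarter, i.item_name, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()

for row in rows:
    print(row)
```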
Snowflake schema
•  Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
Example of Snowflake Schema

Sales Fact Table
   Keys:      time_key, item_key, branch_key, location_key
   Measures:  units_sold, dollars_sold, avg_sales

Dimension tables
   time:      time_key, day, day_of_the_week, month, quarter, year
   item:      item_key, item_name, brand, type, supplier_key
   supplier:  supplier_key, supplier_type
   branch:    branch_key, branch_name, branch_type
   location:  location_key, street, city_key
   city:      city_key, city, state_or_province, country
Fact constellations
•  Fact constellations: multiple fact tables share dimension tables; viewed as a collection of stars, this is therefore called a galaxy schema or fact constellation
Example of Fact Constellation

Sales Fact Table
   Keys:      time_key, item_key, branch_key, location_key
   Measures:  units_sold, dollars_sold, avg_sales

Shipping Fact Table
   Keys:      time_key, item_key, shipper_key, from_location, to_location
   Measures:  dollars_cost, units_shipped

Dimension tables
   time:      time_key, day, day_of_the_week, month, quarter, year
   item:      item_key, item_name, brand, type, supplier_type
   branch:    branch_key, branch_name, branch_type
   location:  location_key, street, city, province_or_state, country
   shipper:   shipper_key, shipper_name, location_key, shipper_type
OLAP
•  Data is perceived and manipulated as though it were stored in a "multidimensional array"
•  Ideas are explained in terms of conventional SQL-style tables
Data aggregation
•  Data can be aggregated in many different ways
•  The number of possible groupings quickly becomes large
•  The user has to consider all groupings
•  Analytical processing problem
Queries for the suppliers-and-parts database
1)  Get the total shipment quantity
2)  Get total shipment quantities by supplier
3)  Get total shipment quantities by part
4)  Get total shipment quantities by supplier and part
SP
S#    P#    QTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200

1. SELECT SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY () ;

TOTQTY
1600

2. SELECT S#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY (S#) ;

S#    TOTQTY
S1    500
S2    700
S3    200
S4    200

3. SELECT P#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY (P#) ;

P#    TOTQTY
P1    600
P2    1000

4. SELECT S#, P#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY (S#,P#) ;

S#    P#    TOTQTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
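The four queries differ only in their grouping columns, which a small Python sketch can make explicit. It is an illustration rather than a translation of any SQL engine: `total_qty` and `COLUMNS` are hypothetical names, and the rows are the SP table from the slides.

```python
from collections import defaultdict

# The SP table: (S#, P#, QTY).
SP = [
    ("S1", "P1", 300), ("S1", "P2", 200),
    ("S2", "P1", 300), ("S2", "P2", 400),
    ("S3", "P2", 200), ("S4", "P2", 200),
]
COLUMNS = {"S#": 0, "P#": 1}

def total_qty(rows, group_by):
    """Return {group key tuple: SUM(QTY)} for the given grouping columns."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[COLUMNS[name]] for name in group_by)
        totals[key] += row[2]
    return dict(totals)

q1 = total_qty(SP, [])              # GROUP BY ()
q2 = total_qty(SP, ["S#"])          # GROUP BY (S#)
q3 = total_qty(SP, ["P#"])          # GROUP BY (P#)
q4 = total_qty(SP, ["S#", "P#"])    # GROUP BY (S#,P#)
```

Each call scans all rows again, which mirrors the cost of running four separate queries.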
Drawbacks
•  Formulating so many similar but distinct queries is tedious
•  Executing the queries is expensive
•  Make life easier, with more efficient computation
   •  Single query
•  GROUPING SETS, ROLLUP, CUBE options
   •  Added to the SQL standard in 1999
GROUPING SETS
•  Execute several queries simultaneously

SELECT S#, P#, SUM ( QTY ) AS TOTQTY
FROM SP
GROUP BY GROUPING SETS ( ( S# ), ( P# ) ) ;

S#    P#    TOTQTY
S1    null  500
S2    null  700
S3    null  200
S4    null  200
null  P1    600
null  P2    1000

•  Single results table
•  Not a relation!
•  null -> missing information

SELECT CASE GROUPING ( S# )
          WHEN 1 THEN '??'
          ELSE S#
       END AS S#,
       CASE GROUPING ( P# )
          WHEN 1 THEN '!!'
          ELSE P#
       END AS P#,
       SUM ( QTY ) AS TOTQTY
FROM SP
GROUP BY GROUPING SETS ( ( S# ), ( P# ) ) ;

S#    P#    TOTQTY
S1    !!    500
S2    !!    700
S3    !!    200
S4    !!    200
??    P1    600
??    P2    1000
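What GROUPING SETS computes can be emulated in a few lines of Python, which may help in reading the result tables. This is a sketch under simplifying assumptions: `grouping_sets` is a hypothetical helper, `None` stands in for SQL null, and the marker substitution at the end works only because SP itself contains no nulls (the real `GROUPING()` function distinguishes the two cases).

```python
from collections import defaultdict

SP = [
    ("S1", "P1", 300), ("S1", "P2", 200),
    ("S2", "P1", 300), ("S2", "P2", 400),
    ("S3", "P2", 200), ("S4", "P2", 200),
]
COLUMNS = ("S#", "P#")

def grouping_sets(rows, sets):
    """Emulate GROUP BY GROUPING SETS: one result list covering every set.

    Each result row is (S#, P#, TOTQTY); a column that is not part of the
    current grouping set is reported as None.
    """
    result = []
    for group_cols in sets:
        totals = defaultdict(int)
        for s, p, qty in rows:
            values = {"S#": s, "P#": p}
            key = tuple(values[c] if c in group_cols else None for c in COLUMNS)
            totals[key] += qty
        result.extend(key + (total,) for key, total in totals.items())
    return result

result = grouping_sets(SP, [("S#",), ("P#",)])

# Replace the "not grouped" placeholders by visible markers.
labelled = [
    (s if s is not None else "??", p if p is not None else "!!", total)
    for s, p, total in result
]
```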
ROLLUP
SELECT S#, P#, SUM ( QTY ) AS TOTQTY
FROM SP
GROUP BY ROLLUP ( S#, P# ) ;

S#    P#    TOTQTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
S1    null  500
S2    null  700
S3    null  200
S4    null  200
null  null  1600

Equivalent to: GROUP BY GROUPING SETS ( ( S#, P# ), ( S# ), ( ) )
ROLLUP
•  The quantities have been "rolled up" for each supplier
•  Rolled up "along the supplier dimension"
•  GROUP BY ROLLUP (A,B,...,Z) groups by
      (A,B,...,Z)
      (A,B,...)
      (A,B)
      (A)
      ()
•  GROUP BY ROLLUP (A,B) is not symmetric in A and B!
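The grouping sets that ROLLUP expands to are simply the prefixes of its column list, which also explains the asymmetry. A minimal sketch (`rollup_sets` is a hypothetical helper name):

```python
def rollup_sets(columns):
    """Grouping sets produced by ROLLUP: every prefix, longest first."""
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]

by_supplier_first = rollup_sets(["S#", "P#"])
by_part_first = rollup_sets(["P#", "S#"])

print(by_supplier_first)  # contains ('S#',) but not ('P#',)
print(by_part_first)      # contains ('P#',) but not ('S#',)
```

For N columns this yields N + 1 grouping sets.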
CUBE
SELECT S#, P#, SUM ( QTY ) AS TOTQTY
FROM SP
GROUP BY CUBE ( S#, P# ) ;

S#    P#    TOTQTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
S1    null  500
S2    null  700
S3    null  200
S4    null  200
null  P1    600
null  P2    1000
null  null  1600

Equivalent to: GROUP BY GROUPING SETS ( ( S#, P# ), ( S# ), ( P# ), ( ) )
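CUBE expands to every subset of its column list. The sketch below enumerates these grouping sets with `itertools.combinations`; `cube_sets` is a hypothetical helper name, and the three-column example uses the city, item and year dimensions as an illustration.

```python
from itertools import combinations

def cube_sets(columns):
    """Grouping sets produced by CUBE: every subset, largest first."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]

two_dims = cube_sets(["S#", "P#"])
three_dims = cube_sets(["city", "item", "year"])

print(two_dims)
print(len(three_dims))
```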
Cross Tabulations
•  Display query results as cross tabulations
   •  A more readable way
   •  Formatted as a simple array
•  Example: two dimensions (supplier and part)

         P1     P2     Total
S1       300    200    500
S2       300    400    700
S3       0      200    200
S4       0      200    200
Total    600    1000   1600
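A cross tabulation of this kind can be assembled from the detail rows plus row and column totals. The following sketch builds it as a list of rows; `crosstab` is a hypothetical helper, and filling absent supplier/part combinations with 0 follows the table shown here.

```python
SP = [
    ("S1", "P1", 300), ("S1", "P2", 200),
    ("S2", "P1", 300), ("S2", "P2", 400),
    ("S3", "P2", 200), ("S4", "P2", 200),
]

def crosstab(rows):
    """Return the cross tabulation as a list of rows, header row first."""
    suppliers = sorted({s for s, _, _ in rows})
    parts = sorted({p for _, p, _ in rows})
    # Combinations without a shipment count as 0.
    cell = {(s, p): 0 for s in suppliers for p in parts}
    for s, p, qty in rows:
        cell[(s, p)] += qty
    table = [[""] + parts + ["Total"]]
    for s in suppliers:
        values = [cell[(s, p)] for p in parts]
        table.append([s] + values + [sum(values)])
    column_totals = [sum(cell[(s, p)] for s in suppliers) for p in parts]
    table.append(["Total"] + column_totals + [sum(column_totals)])
    return table

table = crosstab(SP)
for line in table:
    print("".join(f"{str(value):<8}" for value in line))
```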
CUBE
•  Confusing term CUBE (?)
   •  Derived from the fact that, in multidimensional terminology, data values are stored in cells of a multidimensional array or hypercube
      •  The actual physical storage may differ
   •  In our example
      •  The cube has just two dimensions (supplier, part)
      •  The two dimensions are unequal (a rectangle, not a square)
•  Means "group by" all possible subsets of the set {A, B, ..., Z}
   •  $M = \{A, B, \ldots, Z\}$, $|M| = N$
   •  Power set (algebra): $P(M) := \{U \mid U \subseteq M\}$, $|P(M)| = 2^N$
      •  ..proof by induction
•  The subsets represent different grades of summarization
•  Data Mining: such a subset is called a cuboid
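The induction behind $|P(M)| = 2^N$ is only mentioned on the slide; one standard way to carry it out (not taken from the slides) is:

```latex
\textbf{Claim.} If $|M| = N$, then $|P(M)| = 2^N$.

\textbf{Base case} ($N = 0$): $M = \emptyset$ and $P(\emptyset) = \{\emptyset\}$,
so $|P(M)| = 1 = 2^0$.

\textbf{Inductive step.} Assume the claim holds for every set with $N$ elements.
Let $M' = M \cup \{a\}$ with $|M| = N$ and $a \notin M$. Every subset of $M'$
either does not contain $a$, in which case it is some $U \subseteq M$, or it
contains $a$, in which case it has the form $U \cup \{a\}$ with $U \subseteq M$.
The two cases are disjoint and each contributes $|P(M)| = 2^N$ subsets, hence
\[
  |P(M')| = 2^N + 2^N = 2^{N+1}.
\]
```

In grouping terms: adding one more column to CUBE doubles the number of grouping sets.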
•  For a cube with n dimensions, there are in total $2^n$ cuboids
•  A cube operator was first proposed by Gray et al., 1997:
   •  Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals; J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, H. Pirahesh: Data Mining and Knowledge Discovery 1(1), 1997, 29-53.
   •  http://research.microsoft.com/~Gray/
•  For three dimensions (city, item, year), the total number of data cuboids is $2^3 = 8$
   •  {(city,item,year), (city,item), (city,year), (item,year), (city), (item), (year), ()}
   •  (): the dimensions are not grouped
•  These group-bys form a lattice of cuboids for the data cube
•  The basic cuboid contains all three dimensions
•  Hasse diagram (Helmut Hasse, 1898-1979, did fundamental work in algebra and number theory)

   ()
   (city)   (item)   (year)
   (city, item)   (city, year)   (item, year)
   (city, item, year)
Cuboid (Data Mining Definition)
•  Names in data warehousing literature:
   •  The n-D cuboid, which holds the lowest level of summarization, is called the base cuboid
      •  .. {{A},{B},..}
   •  The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid
      •  .. {∅}
   •  The lattice of cuboids forms a data cube
Cube: A Lattice of Cuboids (Power Set)

0-D (apex) cuboid:  all
1-D cuboids:        time; item; location; supplier
2-D cuboids:        time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids:        time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid:  time, item, location, supplier

$2^4 = 16$ cuboids
Hierarchies
•  Independent variables are often related in hierarchies (taxonomy)
   •  Determine ways in which dependent data can be aggregated
•  Temporal hierarchy
   •  Seconds, minutes, hours, days, weeks, months, years
•  The same data can be aggregated in many different ways
•  The same independent variable can belong to different hierarchies
Hierarchy - Location
office < city < country < region < all
   all:      all
   region:   Europe, ..., North_America
   country:  Germany, ..., Spain; Canada, ..., Mexico
   city:     Frankfurt, ...; Vancouver, ..., Toronto
   office:   L. Chan, ..., M. Wind
Storage space may explode...
•  If there are no hierarchies, the total number of cuboids for an n-dimensional cube is $2^n$
•  But....
   •  Many dimensions may have hierarchies, for example time
      •  day < week < month < quarter < year
   •  For an n-dimensional data cube, where $L_i$ is the number of levels of dimension i (for time, $L_{time} = 5$), the total number of cuboids that can be generated is

      $$T = \prod_{i=1}^{n} (L_i + 1)$$
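The formula is easy to evaluate directly. In the sketch below, `total_cuboids` is a hypothetical helper; the level counts for location (region, country, city, office) and product (industry, category, product) are read off the hierarchies in this deck, and combining them into one three-dimensional cube is an assumption made for illustration. The "+ 1" per dimension accounts for the top level "all".

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """T = product over all dimensions of (L_i + 1)."""
    return prod(levels + 1 for levels in levels_per_dimension)

# No hierarchies: every dimension has a single level, so T = 2**n.
flat = total_cuboids([1, 1, 1])

# Only time carries a hierarchy (5 levels); the other two are flat.
time_only = total_cuboids([5, 1, 1])

# Time (5 levels), location (4 levels), product (3 levels).
all_hierarchies = total_cuboids([5, 4, 3])

print(flat, time_only, all_hierarchies)
```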
View of Warehouses and Hierarchies
Specification of hierarchies
•  Schema hierarchy
      day < {month < quarter; week} < year
•  Set_grouping hierarchy
      {1..10} < inexpensive
Multidimensional Data
•  Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
   Product:   Industry > Category > Product
   Location:  Region > Country > City > Office
   Time:      Year > Quarter > Month > Day; Year > Week > Day
Drill up and down
•  Drill up:
   •  going from a lower level of aggregation to a higher one
•  Drill down:
   •  means the opposite
•  Difference between drill up and roll up
   •  Roll up: creating the desired groupings or aggregations
   •  Drill up: accessing the aggregations
•  Example for drill down:
   •  Given the total shipment quantity, get the total quantities for each individual supplier
Typical OLAP Operations
•  Roll up (drill-up): summarize data
   •  by climbing up a hierarchy or by dimension reduction
•  Drill down (roll down): reverse of roll-up
   •  from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
•  Slice and dice: project and select
•  Pivot (rotate):
   •  reorient the cube, visualization, 3D to series of 2D planes
•  Other operations
   •  drill across: involving (across) more than one fact table
   •  drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
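Some of these operations can be sketched on a tiny in-memory cube. Everything here is an assumption for illustration: the cell values are invented, and `roll_up`, `slice_` and `dice` are hypothetical helpers rather than an OLAP product's API. Roll-up is shown by dimension reduction only, not by climbing a hierarchy.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, item, year) -> units sold.
AXES = ("city", "item", "year")
cube = {
    ("Toronto", "pen", 2014): 5,
    ("Toronto", "ink", 2014): 3,
    ("Toronto", "pen", 2015): 4,
    ("Vancouver", "pen", 2014): 2,
    ("Vancouver", "ink", 2015): 6,
}

def roll_up(cells, keep):
    """Roll up by dimension reduction: aggregate away every axis not in keep."""
    positions = [AXES.index(axis) for axis in keep]
    totals = defaultdict(int)
    for coords, value in cells.items():
        totals[tuple(coords[i] for i in positions)] += value
    return dict(totals)

def slice_(cells, axis, value):
    """Slice: select on a single dimension."""
    i = AXES.index(axis)
    return {coords: v for coords, v in cells.items() if coords[i] == value}

def dice(cells, **allowed):
    """Dice: select on several dimensions at once."""
    return {
        coords: v
        for coords, v in cells.items()
        if all(coords[AXES.index(axis)] in values for axis, values in allowed.items())
    }

by_city = roll_up(cube, ["city"])                 # summary per city
by_city_item = roll_up(cube, ["city", "item"])    # drill down: item re-introduced
only_2014 = slice_(cube, "year", 2014)
pens = dice(cube, item={"pen"}, year={2014, 2015})
# Pivot: same cells, axes swapped for presentation.
pivoted = {(item, city): v for (city, item), v in by_city_item.items()}
```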
Fig. 3.10 Typical OLAP Operations
Discovery-Driven Data Cubes
[Figure labels: SelExp, InExp]
Measures of Data Cube: Three Categories (Depending on the aggregate functions)
•  Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
   •  E.g., count(), sum(), min(), max()
•  Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
   •  E.g., avg(), min_N(), standard_deviation()
•  Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
   •  E.g., median(), mode(), rank()
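The three categories can be checked on a small partitioned data set. The numbers and the split into two partitions below are invented for illustration; the holistic case shows one counterexample, which is enough to see that partial medians do not suffice in general.

```python
from statistics import median

# Hypothetical data, split into two partitions.
part_a = [1, 2, 15]
part_b = [3, 20]
everything = part_a + part_b

# Distributive: sum() of the partial sums equals sum() over all the data.
distributive_ok = sum([sum(part_a), sum(part_b)]) == sum(everything)

# Algebraic: avg() follows from two distributive values, sum() and count().
avg_from_parts = (sum(part_a) + sum(part_b)) / (len(part_a) + len(part_b))

# Holistic: combining the partial medians does not give the true median here.
median_of_medians = median([median(part_a), median(part_b)])
true_median = median(everything)

print(distributive_ok, avg_from_parts, median_of_medians, true_median)
```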
•  Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
Statistics for one Variable
•  Sample size
   •  The sample size, denoted by N, is the number of data items in a sample
•  Mean
   •  The arithmetic mean is the average value: the sum of all values in the sample divided by the number of values

      $$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

•  Maximum, minimum, range
   •  The range is the difference between the maximum and the minimum
Standard Deviation and Variance
•  The standard deviation is the square root of the variance, which is the sum of squared distances between each value and the mean divided by the population size (finite population)

      $$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

•  Example
   •  1, 2, 15; mean = 6
   •  $\frac{(1-6)^2 + (2-6)^2 + (15-6)^2}{3} = 40.67$
   •  $\sigma = 6.38$
Sample Standard Deviation and Sample Variance
•  The sample standard deviation is the square root of the sample variance, which is the sum of squared distances between each value and the mean divided by the sample size minus one

      $$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

•  Example
   •  1, 2, 15; mean = 6
   •  $\frac{(1-6)^2 + (2-6)^2 + (15-6)^2}{3-1} = 61$
   •  $s = 7.81$
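The two worked examples can be reproduced with Python's standard `statistics` module, where `pvariance`/`pstdev` divide by N and `variance`/`stdev` divide by N - 1. This is only a check of the arithmetic on the slide's data 1, 2, 15.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [1, 2, 15]

average = mean(data)

# Population versions: divide the sum of squared distances by N.
population_variance = round(pvariance(data), 2)   # 40.67
population_sd = round(pstdev(data), 2)            # 6.38

# Sample versions: divide by N - 1.
sample_variance = variance(data)
sample_sd = round(stdev(data), 2)                 # 7.81

print(average, population_variance, population_sd, sample_variance, sample_sd)
```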
•  Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
Statistics for one Variable
•  Median
   •  If the values in the sample are sorted into non-decreasing order, the median is the value that splits the distribution in half
   •  (1 1 1 2 3 4 5): the median is 2
   •  If N is even, the sample has two middle values, and the median can be found by interpolating between them or by selecting one of the two arbitrarily
Mode
•  The mode is the most common value in the distribution
   •  (1 2 2 3 4 4 4): the mode is 4
•  If the data are real numbers, the mode carries nearly no information
   •  Low probability that two or more data items will have exactly the same value
•  Solution: map into discrete numbers, by rounding or sorting into bins for frequency histograms
•  We often speak of a distribution having two or more modes
   •  The distribution has two or more values that are common
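Median and mode are both available in Python's `statistics` module, which makes the points above easy to try out. The odd-length sample and the mode sample are the ones from the slides; `even_sample` and `reals` are invented, and rounding is just one of the possible binning choices.

```python
from statistics import median, median_low, multimode

odd_sample = [1, 1, 1, 2, 3, 4, 5]
even_sample = [1, 2, 3, 4]

middle = median(odd_sample)            # the single middle value
interpolated = median(even_sample)     # interpolates between 2 and 3
picked = median_low(even_sample)       # selects one of the two middle values

modes = multimode([1, 2, 2, 3, 4, 4, 4])

# Real-valued data: exact repeats are unlikely, so every value is "a mode".
reals = [1.02, 0.98, 2.51, 2.49, 2.46, 3.9]
uninformative = multimode(reals)

# Binning by rounding makes the common values visible (here: two modes).
binned = [round(x) for x in reals]
binned_modes = multimode(binned)

print(middle, interpolated, picked, modes, binned_modes)
```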
Outliers
•  Because they are averages, both the mean and the variance are sensitive to outliers
•  Big effects that can wreck our interpretation of data
•  For example:
   •  The presence of a single outlier in a distribution of over 200 values can render some statistical comparisons insignificant
The Problem of Outliers
•  One cannot do much about outliers except find them and, sometimes, remove them
•  Removing them requires judgment and depends on one's purpose
Trimmed mean
•  Another robust alternative to the mean is the trimmed mean
   •  Lop off a fraction of the upper and lower ends of the distribution, and take the mean of the rest
      •  0,0,1,2,5,8,12,17,18,18,19,19,20,26,86,116
   •  Lop off the two smallest and two largest values and take the mean of the rest
      •  The trimmed mean is 13.75
      •  The arithmetic mean is 22.94
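The trimming step can be written as a short function. `trimmed_mean` and its parameter `k` (how many values to drop at each end) are hypothetical names for this sketch; the data are the sixteen values listed above.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    if 2 * k >= len(values):
        raise ValueError("k removes every value")
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)

data = [0, 0, 1, 2, 5, 8, 12, 17, 18, 18, 19, 19, 20, 26, 86, 116]

result = trimmed_mean(data, 2)   # drops 0, 0 and 86, 116
print(result)                    # 13.75
```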
•  Interquartile range
   •  The interquartile range is found by dividing a sorted distribution into four contiguous parts, each containing the same number of values
   •  Each part is called a quartile
   •  The difference between the highest value in the third quartile and the lowest value in the second quartile is the interquartile range
Quartile example
•  1,1,2,3,3,5,5,5,5,6,6,100
•  The quartiles are
   •  (1 1 2), (3 3 5), (5 5 5), (6 6 100)
•  Interquartile range: 5-3=2
•  Range: 100-1=99
•  The interquartile range is robust against outliers
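The definition used here (highest value of the third quartile minus lowest value of the second quartile) can be coded directly. `quartile_iqr` is a hypothetical helper and assumes the number of values is divisible by four; other quartile conventions, including those in statistics libraries, can give different numbers.

```python
def quartile_iqr(values):
    """Interquartile range as defined above.

    Splits the sorted values into four equal parts and returns
    max(third part) - min(second part).
    """
    ordered = sorted(values)
    if len(ordered) % 4:
        raise ValueError("this sketch needs a length divisible by 4")
    size = len(ordered) // 4
    second = ordered[size:2 * size]
    third = ordered[2 * size:3 * size]
    return third[-1] - second[0]

data = [1, 1, 2, 3, 3, 5, 5, 5, 5, 6, 6, 100]
iqr = quartile_iqr(data)
value_range = max(data) - min(data)

# Making the outlier ten times larger changes the range but not the IQR.
extreme = data[:-1] + [1000]
iqr_extreme = quartile_iqr(extreme)
range_extreme = max(extreme) - min(extreme)

print(iqr, value_range, iqr_extreme, range_extreme)
```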