Download Data Warehouse and OLAP Technology

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Outline
What is a data warehouse?
 A multi-dimensional data model
 Data warehouse design & architecture
 Data warehouse implementation
 From data warehousing to data mining

Data Warehouse &
OLAP Technology
CS 5331 by Rattikorn Hewett
Texas Tech University
1
2
Motivation
Data Warehouses
In business domains,
 Large amount of business data
 Data are in different formats and locations
(heterogeneous vs. distributed)
 Decision makers need fast accesses of
summarized information (usually) on a
single site
Loosely speaking, a data warehouse is


a data store that integrates data from multiple
heterogeneous sources to support decision making
maintained separately from operational databases
A data warehouse system provides a solid
platform of consolidated data for on-line analytical
processing (OLAP) and data analysis
Data warehousing is a process of
constructing and using data warehouses
3
4
1
Other Related terms

What is a data warehouse?
More precisely,
A data warehouse (DW) is a
 subject-oriented
 integrated
 time-variant, and
 non-volatile (i.e., stable)
collection of data in support of management’s
decision-making process [Inmon, 96]
Data Warehouse: enterprise-based
 Concerns
with decision subjects of the whole
enterprise or organization

Data Mart: department-based
 Specialized
single line of business
warehouses e.g., within departments or
groups of people
5
Subject-oriented
Integrated
A data warehouse:
 Is organized around subjects of interest
e.g., customer, product, sales.

Gives a simple and concise view of relevant issues
by excluding non-useful data to the decision support

6
Focuses on the modeling/analysis of data for
decision makers,

A data warehouse integrates multiple heterogeneous data
sources: relational databases, on-line transaction records

Moving (or converting) data to the warehouse requires
 data cleaning and
 data integration techniques
to ensure consistency among different data sources in


not on daily operations or transaction processing.
naming conventions
encoding structures, attribute measures, etc.
E.g., Hotel price: currency, tax, breakfast covered, etc.
7
8
2
Time-variant
Non-volatile
A data warehouse
 is a physically separate store
Key structures In data warehouses

contain “time” information explicitly or implicitly

Not necessary in operational databases

give historical perspective (e.g., past 5-10 years)

Operational time period is shorter (e.g., current data)
of data transformed from the operational environment.


does not require transaction processing, recovery,
and concurrency control mechanisms
requires only two operations in data accessing:
 initial loading of data
 access of data
 Non-volatile
9
Data Warehouse vs. Heterogeneous DB

Traditional heterogeneous DB integration: Query-driven



Heter.DB Site 1
Local query
Processor 1
Heter.DB Site n
Local query
Processor n

Major task of operational (traditional relational) DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
mediator

Builds wrappers/mediators on top of heterogeneous DBs
 requires complex information filtering and integration processes
Data Warehouse: Update-driven

DW systems vs. Operational DBMS
OLTP (on-line transaction processing)
Global answer
Client query
site
Metadata
Dictionary
10


11
Are structured, repetitive, short, atomic and isolated transactions
Require detailed and up-to-date data
OLAP (on-line analytical processing)

Major task of data warehouse system
Serve knowledge workers: Data analysis and decision making

OLAP Operations:

information from multiple heterogeneous sources is integrated in
advance and stored for query
 more efficient than heterogeneous DB
but data (stored over longer period) are not as current
OLTP Operations:


View data flexibly from different perspectives and abstractions
Require historical data with time elements
12
3
OLTP vs. OLAP
DW systems vs. Operational DBMS
Distinct features
OLTP
vs.
OLAP
OLTP
OLAP
users
clerk, IT professional
knowledge worker
function
day to day operations
decision support
DB design
application-oriented
subject-oriented
current, up-to-date
detailed, flat relational
isolated
repetitive
historical,
summarized, multidimension
al
integrated, consolidated
ad-hoc
read/write
index/hash on prim. key
short, simple transaction
lots of scans
unit of work
# records accessed
tens
millions
#users
thousands
hundreds
DB size
100MB-GB
100GB-TB
metric
transaction throughput
query throughput, response

User/system orientation: customer

Data contents:
current, detailed vs. historical, consolidated
data

Design:
ER &
app-oriented
vs. star &
subj-oriented
usage

View:
current, local
vs. evolutionary, integrated

Access patterns:
update
vs. market
access
vs. read-only but complex queries
complex query
13
Outline
Why Separate Data Warehouse?

High performance is desired in both DBMS and DW systems
DBMS – tuned for OLTP: indexing, searching certain records, querying
“canned” queries
 DW – tuned for OLAP: complex OLAP queries, use of special
organization, multidimensional view, consolidation.

OLAP necessitates special data organization that supports
multidimensional views
 If mixed - OLAP queries would degrade operational DBMS

14
Different functions, contents and uses in both systems

DBMS – Transactional operations, requires concurrency controls (to
ensure data consistency)
 OLAP is read only and does not need concurrency control.
Concurrency control on OLAP’s operations would jeopardize execution
of concurrent transactions in DBMS

What is a data warehouse?
 A multi-dimensional data model
 Data warehouse design & architecture
 Data warehouse implementation
 From data warehousing to data mining

DW – Decision support, requires consolidation of data from
heterogeneous sources (for aggregation etc.) and historical data
 DBMS does not maintain historical data nor facilitates data consolidation
15
16
4
Data Cubes
Lattice of Cuboids



A data warehouse is based on a multi-dimensional data model which
views data in the form of a data cube
Example: A data cube with three dimensions: item (type)
time (quarter)
location (city)
50
with measure: Dollars_sold
300



Time (quarter)
(location)
Dollars_sold (in thousands) of
(phone, Q1, Vancouver) = 400
400
0-D (apex) cuboid
()
200
100
Q1
The lattice of cuboids forms a data cube.
Each cuboid represents a different degree of summarization
The bottom most n-D base cube is called a base cuboid
The top most 0-D cuboid is called the apex cuboid - holds the
highest-level of summarization – denoted by all
(item)
(time)
1-D cuboids
Q2
Q3
Dollars_sold of (TV, Q1) = 650
Q4
(location, item) (location, time)
(item, time) 2-D cuboids
Dollars_sold of (Q3)
TV
VCR oven
phone
(location, item, time)
17
Item (type)
Cuboids of 4-D data cube
all
0-D(apex) cuboid
item
time,location
location
supplier
item,location
time,supplier
1-D cuboids

location,supplier
time,item
2-D cuboids
item,supplier

time,location,supplier
3-D cuboids
time,item,location
time,item,supplier
18
Three data model schemas

time
3-D (base) cuboid
item,location,supplier
Star schema: A fact table in the middle connected to a
set of dimension tables
Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into additonal
smaller dimension tables, forming a shape similar to
snowflake
Fact constellation schema: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
4-D(base) cuboid
time, item, location, supplier
19
20
5
Star  Snowflake
Example of Star Schema
Item: dimension table
Time: dimension table
time_key
day
day_of_the_week
month
quarter
year
Sales: Fact Table
time_key
item_key
item_name
brand
type
supplier_type

Location: dimension table
Star introduces redundancy
E.g., (203 Pine St, Abilene, TX, USA) location_key
street
(205 14th St, Abilene, TX, USA) city
state_or_province
country
item_key
branch_key
Branch: dimension table
branch_key
branch_name
branch_type
units_sold
Measures

location_key
dollars_sold
avg_sales
Location: dimension table
Snowflake normalizes this dimension table to
location_key
street
city_key
location_key
street
city
state_or_province
country
city_key
state_or_province
country
(203 Pine St, Abilene_key)
(205 14th St, Abilene_key)
(Abilene_key, TX, USA)
  
21
Example of Snowflake Schema
time_key
day
day_of_the_week
month
quarter
year
Sales: Fact Table
time_key
branch_key
branch_name
branch_type
Measures
time_key
day
day_of_the_week
month
quarter
year
item_key
Supplier_key
Supplier_type
location_key
units_sold
dollars_sold
avg_sales
Branch:
dimension table
branch_key
branch_name
branch_type
Location: dimension table
location_key
street
city-key
state_or_province
country
Item: dimension table
Time: dimension table
item_key
item_name
brand
type
key
supplier_type
branch_key
Branch: dimension table
Example of Constellation Schema
Item: dimension table
Time: dimension table
22
Measures
city_key
State_or_province
country
23
Sales: Fact Table
time_key
item_key
item_key
item_name
brand
type
supplier_type
units_sold
dollars_sold
avg_sales
item_key
time_key
shipper_key
from_location
to_location
branch_key
location_key
Shipping: Fact Table
dollars_cost
units_shipped
Location: dimension table
location_key
street
city
state_or_province
country
shipper_key
shipper_name
location_key
shipper_type
Shipper: dimension table
24
6
Which design schema to use?
DMQL: A Data Mining Query Language



Benchmarking performance can be used to select the best
schema
Snowflake vs. Star
 Snowflake: easier to maintain dimension tables when
they are very large (reduce space)
 Star: more effective for browsing (less joins for query)
Data Warehouse: uses “fact constellation” since it can model
multiple, interrelated subjects
Data Mart: uses “star” (more popular) or “snowflake”
Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]:
<measure_list>

Dimension Definition ( Dimension Table )
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)

Special Case (Shared Dimension Tables)
time as “cube definition”
dimension <dimension_name> as
<dimension_name_first_time> in cube
<cube_name_first_time>
 First
 define
25
Example of Star Schema
time_key
day
day_of_the_week
month
quarter
year
A Star Schema in DMQL
Item: dimension table
Time: dimension table
Sales: Fact Table
time_key
define cube sales [time, item, branch, location]:
item_key
item_name
brand
type
supplier_type
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)
item_key
branch_key
Branch: dimension table
branch_key
branch_name
branch_type
location_key
units_sold
Measures
dollars_sold
avg_sales
26
Location: dimension table
location_key
street
city
state_or_province
country
27
28
7
Example of Snowflake Schema
Item: dimension table
Time: dimension table
time_key
day
day_of_the_week
month
quarter
year
define cube sales [time, item, branch, location]:
item_key
item_name
brand
type
supplier_key
Sales: Fact Table
time_key
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
item_key
Supplier_key
Supplier_type
branch_key
Branch: dimension table
branch_key
branch_name
branch_type
A Snowflake Schema in DMQL
location_key
units_sold
Measures
dollars_sold
avg_sales
Location: dimension table
location_key
street
City_key
city_key
State_or_province
country
29
Time: dimension table
time_key
day
day_of_the_week
month
quarter
year
Branch:
dimension table
branch_key
branch_name
branch_type
Measures
Sales: Fact Table
time_key
item_key
item_key
item_name
brand
type
supplier_type
location_key
units_sold
dollars_sold
avg_sales
Shipping: Fact Table
item_key
time_key
from_location
dollars_cost
units_shipped
Location: dimension table
location_key
street
city
state_or_province
country
shipper_key
shipper_name
location_key
shipper_type
Shipping: dimension table
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter,
year)
define dimension item as (item_key, item_name, brand, type,
supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as
location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
shipper_key
to_location
branch_key
30
A fact constellation Schema in DMQL
Example of Constellation Schema
Item: dimension table
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier (supplier-key, supplier_type))
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city (citykey, province_or_state, country))
31
32
8
Measures
Concept Hierarchy
Quantities computed by aggregating values in data cube
cells in various ways:

distributive: if the result can be derived by aggregating results
from partitioned data.



Europe
region
country
E.g., avg() = sum()/count(), standard_deviation().
holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.

all
all
E.g., count(), sum(), min(), max().
algebraic: if it can be computed by an algebraic function with
a finite number of arguments, each of which is obtained by
applying a distributive function.

Example: Hierarchy for the dimension “location”
city
Germany
Frankfurt
...
...
...
Spain
North_America
Canada
Vancouver ...
...
Mexico
Toronto
E.g., median(), mode(), rank().
office
L. Chan
... M. Wind
33
Specifications of Hierarchies
Dimensions: Location
Time
Office < City < Country < Region
Region
Year
Day < {month<quarter; week} < year
Country Quarter


Schema hierarchy:
Set_grouping hierarchy
(1..10] < inexpensive
City
Office
34
Hierarchy and Warehouse View
DBMiner:
• Importing data
• Table browsing
• Dimension
creation/ browsing
• Cube building/browsing
Month Week
Day
35
36
9
Building cube:
Sales volume as a function of product, month, and region
Total annual sales of
TVs in USA
Time (quarter)
1Qtr
2Qtr
3Qtr
4Qtr

sum

U.S.A
Canada
Mexico
Location (city)
TV
PC
VCR
sum
Browsing a Data Cube

Visualization
OLAP
capabilities
Interactive
manipulation
sum
ALL = Total sales
37
Typical OLAP operations

38
Examples: Roll-up & drill-down
USA
350
Canada 300
Roll up (increase the level of abstraction):
Roll-up
on location
 Summarize
data by climbing up hierarchy or by
dimension reduction

50
300
200
100
Drill down (decrease the level of abstraction):
Q1
 Reverse
of roll-up from higher level summary to lower
level summary or detailed data, or introducing new
dimensions
Q1
Q2
Q3
Q4
400
Q2
TV
VCR oven
Q4
Drill-down
On time for Q1
TV
VCR oven
phone
Jan
100
Feb
200
100
Mar
39
phone
Q3
TV
VCR oven
phone
40
10
Typical OLAP operations (cont)



Examples: dice, slice, pivot
Q1
50
300
200
100
150
Q1
Dice half top
left corner
Q2
TV
VCR
250 400
Q2
Chicago
TV
100
Q3
New York
VCR
150
Q4
Toronto
oven
250
Vancouver
TV
VCR oven
phone
400
Pivot
Vancouver
VCR oven phone
Toronto
TV
New York
Slice for Q1
phone
100 150 250 400
Chicago
Slice and dice(project and select):
Pivot (rotate):
 reorient the cube, visualization, 3D to series
of 2D planes.
Other operations
 drill across (links to the raw data): involving
(across) more than one fact table
 drill through: through the bottom level of the
cube to its back-end relational tables (using
SQL)
Toronto
200
Vancouver 100
41
OLAP functions
42
A Star-Net Query Model
Customer Orders
Shipping Method

Customer
OLAP offers functions to:
CONTRACTS

Calculate e.g., ratios, variance, etc.
Generate e.g., summarizations, aggregations and hierarchies at
each level and every dimension intersection
 Support forecasting functions e.g., trend and stat analysis
AIR-EXPRESS


PRODUCT LINE
Time
OLAPs and SDBs (statistical Databases) are similar
Product
ANNUALY QTRLY

SDBs focus on socioeconomic applications, and are sensitive to
privacy issues when drilling down
 OLAPs are designed to handle HUGE amounts of data

ORDER
TRUCK
DAILY
PRODUCT ITEM
PRODUCT GROUP
CITY
SALES PERSON
COUNTRY
DISTRICT
OLAP’s querying is based on a starnet model –
representing abstraction levels in each dimension
REGION
43
Location
Each circle is called
a footprint
Promotion
DIVISION
44
Organization
11
Outline
A Data Warehouse Design
What is a data warehouse?
 A multi-dimensional data model
 Data warehouse design & architecture
 Data warehouse implementation
 From data warehousing to data mining
Four views to be considered:

 Top-down view:
allows selection of the relevant
information necessary for the data warehouse
 Data source view: exposes the information being
captured, stored, and managed by operational systems
 Data warehouse view: consists of fact tables and
dimension tables
 Business query view: perspectives of data in the
warehouse from the view of end-users
45
Design approaches

Design process
Top-down: Starts with overall design and planning
Typical steps in data warehouse design process:
Select

good for mature technology and well-defined problems
 Systematic and min. integration problem
 expensive and lack flexibility



 business
process to model
(e.g., orders, invoices, etc.)
Bottom-up: Starts with experiments and prototypes

46
 grain
(atomic level of data) of the business process
(e.g., individual transactions, daily-snapshots etc.)
Rapid returns, low cost
Flexible but hard to integrate
Combined
 dimensions
 measure
47
that will apply to each fact table record
that will populate each fact table record
48
12
Three data warehouse models

Enterprise warehouse


As in Software Engineering, two development methods:
collects all of the information about subjects spanning the entire
organization
Data Mart


Data warehouse development
a subset of corporate-wide data that is of value to a specific
groups of users. Its scope is confined to specific, selected
groups, such as marketing data mart
 Independent vs. dependent (directly from warehouse) data
mart
structured and systematic analysis at each
step before proceeding to the next
 Spiral:
rapid generation of increasingly functional systems
 short turn around time
 Recommended for the development of data warehouses
and data marts
Virtual warehouse


 Waterfall: do
A set of views over operational databases
Only some of the possible summary views may be materialized
49
A recommended approach
Data Warehouse Architecture
For data warehouse development
Data
Mart
Model refinement
External
sources
Multi-Tier Data
Warehouse
Distributed
Data Marts
50
Operational
DBs
Enterprise
Data
Warehouse
Data
Mart
Monitor
&
Integrator
Meta
data
Extract
Transform
Load
Refresh
Data
Warehouse
OLAP Server
• Analysis
• Query
Serve Reports
• Data mining
Data Marts
Model
refinement
Data Sources
Define a high-level corporate data model
51
Data Warehouse
Server
Bottom tier
OLAP
Engine
Middle tier
Front-End
(Client) Tools
Top tier
52
13
Data sources
Back-End Tools & Utilities
Data sources are

 Often
the operational DBMS
 Designed for operational use, not for decision
support
Multiple data sources



 Are
often from different systems using different
hardware and customized software
 Create problems e.g., semantic conflicts

Data Cleaning: eliminates noise and errors. Three classes of tools:
 Migration: simple data transformation
 Scrubbing: by using domain-specific knowledge
 Auditing: detects outliers
Data Extraction: by a program interface called gateway
Data Transformation: convert data from legacy or host format to
warehouse format
Load: sort, summarize, consolidate, compute views, check integrity
constraints, and build indices and partitions
Refresh: propagates updates from data sources to the data store


When to refresh? Depends on usage and types of data
How to refresh? Data shipping (by triggers) and
Transaction shipping (by updating transaction logs)
53
The Bottom tier: warehouse server
54
The Middle tier: OLAP server
Implemented by:

Monitor:
ROLAP (Relational OLAP) e.g., microstrategy’s DSS Server, Informix’s Metacube

Detect changes that are of interest
 Propagate the change to the integrator


Uses relational or extended-relational DBMS (often in a star schema)
Includes optimization of DBMS backend tools
Supports extended SQL (Sequence Query Language)
A cell in a multi-dimensional structure is represented by a tuple
Advantage: scalable (no empty cell for sparse cube)
Disadvantage: no direct access to cells



Integrator:

Integrate changes into the warehouse
Make the data conform to conceptual schema in the warehouse
 Resolve possible update anomalies


Metadata Repository: data about data (warehouse objects)

Administrative metadata: source DBs and their contents, security
Business metadata: business terms, ownership of data
 Operational metadata:history of migrated data and transformations applied

55
56
14
The Middle tier: OLAP server (cont)
Outline
MOLAP (Multidimensional OLAP) e.g., Arbor’s Essbase


Stores data in a multi-dimensional data structure (MDDS)
Uses Array-based multidimensional storage engine (sparse matrix
techniques)
Advantage: fast indexing to pre-computed aggregations.
Disadvantage: not very scalable and sparse
HOLAP (Hybrid OLAP)


Scalability of ROLAP + efficiency of MOLAP
User flexibility, e.g., low level: relational, high-level: array-based
What is a data warehouse?
 A multi-dimensional data model
 Data warehouse design & architecture
 Data warehouse implementation
 From data warehousing to data mining

Specialized SQL servers

specialized support for SQL queries over star/snowflake schemas
57
Efficient Cube Computation

Data cube can be viewed as a lattice of cuboids


The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains an empty cell
 Suppose that at most one level of each dimension appears in a cuboid.
How many cuboids in an n-dimensional cube with L levels?
n
T = Õ (Li +1)
i =1
10-D cube with 4 levels gives
T = 510 ~ 9.8 x 106 !!!!
Ex: A data cube of 3 dimensions: time, location, item
The total number of cuboids = 2x2x2
The cuboid contains time or none
58
Partial Materialization

Pre-computing and materializing all possible cuboids of a data cube is
not feasible when there are large number of them

Given a base cuboid, there are three choices

Full materialization – pre-compute every cuboid

No materialization – do not pre-compute any “nonbase” cuboid

Partial materialization – select subset of cuboids to compute

Selection of which cuboids to materialize depends on the queries in the
workload, their frequencies, and access costs etc.
E.g., OLAP product uses heuristic – “Prefer cuboids that are most
referred”
Suppose now time has 3 levels of hierarchy: month < quarter < year
The total number of cuboids = 4x2x2
The cuboid contains month or
quarter or year or none 59
60
15
Full materialization


To ensure fast on-line processing, full materialization may
be necessary  Efficient cube computation
ROLAP cube computation

Optimization ideas:



Multi-way Array Aggregation
Reorder or cluster related tuples
Partial grouping of subaggregates and use them to speedup
computation of other subaggregates and aggregates

Partition arrays into chunks (a small subcube which fits in memory).

Compress chunks to remove wasted space -- handle sparse cubes

Compute aggregates in “multiway” by visiting cube cells in the order
which minimizes the # of times to visit each cell, and reduces
memory access and storage cost.
C
MOLAP cube computation
 Multi-way array aggregation
b3 B13
14
15
16
b0 1
2
a0 a1
3
a2
4
a3
b2 9
(can’t use ROLAP’s approach since it is based on different data structure and
data access/index from ROLAP)
61
Multi-way Array Aggregation
B b1
5
Multi-way Array Aggregation
C
c3 61
62
63
64
c2 45
46
47
48
c1 29
30
31
32
c0
B
60
14
15
16
b3 13
44
28 56
b2 9
40
24 52
b1 5
36
20
2
3
4
b0 1
a0
a1
a2
62
A
C
B
What is the best
traversing order
to do multi-way
aggregation?
B
a3
c3 61
62
63
64
c2 45
46
47
48
c1 29
30
31
32
c0
B
60
14
15
16
b3 13
44
28 56
b2 9
40
24 52
b1 5
36
20
2
3
4
b0 1
a0
A
a1
a2
a3
A
63
64
16
Multi-way Array Aggregation
Multi-way Array Aggregation
C
B
C
c3 61
62
63
64
c2 45
46
47
48
c1 29
30
31
32
c0
B
60
14
15
16
b3 13
44
28 56
b2 9
40
24 52
b1 5
36
20
2
3
4
b0 1
a0
a1
a2
B
a3
c3 61
62
63
64
c2 45
46
47
48
c1 29
30
31
32
c0
B
60
14
15
16
b3 13
44
28 56
b2 9
40
24 52
b1 5
36
20
2
3
4
b0 1
a0
A
a1
a2
a3
A
65
66
Indexing OLAP data
Multi-way Array Aggregation

Method: the planes should be sorted and computed from
size small to large (see Han and Kamber, Example 2.12 pp. 75-78)
Bitmap Indexing

Idea: keep the smallest plane in the main memory, fetch
and compute only one chunk at a time for the largest
plane

Limitation: efficient for a small number of dimensions






Index on a particular column (attribute)
Each value in the column corresponds to a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the
indexed column
not suitable for high cardinality domains
Base table
If there is a large number of dimensions,
bottom-up computation [Beyer & Ramakrishnan, SIGMOD’99]
 iceberg cube computation

67
Cust
C1
C2
C3
C4
C5
Region
Asia
Europe
Asia
America
Europe
Type
Retail
Dealer
Dealer
Retail
Dealer
Index on Region
Index on Type
RecID Asia Europe America
1
1
0
0
2
0
1
0
3
1
0
0
4
0
0
1
5
0
1
0
RecID Retail Dealer
1
1
0
2
0
1
3
0
1
4
1
0
5
0
1
68
17
Indexing OLAP data
Example: Selecting cuboids
Join index

Allows records to identify joinable tuples without performing join
operation (which is expensive) - used to identify subcubes of interest
E.g., in star schema data warehouse
Fact table, “sale” and dimensions: location, item
sale_k Loc_k Br_k Item_kTime_k
Loc_k
Str
City State country

Item_k Name Brand type
Which cuboid to process query on {brand, state} with “year = 2000” ?
Country
& cuboid1: {item_name, city, year}
cuboid2: {brand, country, year}
State
Type
cuboid3: {brand, state, year}
City
Brand
cuboid4: {item_name, state} where year = 2000
Street
Item_name
Eliminate cuboid2
since “country” can’t be generalized to “state” (lower level)

sale_k Loc_k
sale_k Item_k
sale_k Item_k Loc_k
By generalization rule, the rest of cuboids will work
But which is better?
cuboid1: requires two generalizations  cost most
if not many year values associated with brand but each brand has several
item_names  couboid3 is less costly than cuboid4
Join indices
69
71
Outline
Data Warehouse Usage
What is a data warehouse?
 A multi-dimensional data model
 Data warehouse design & architecture
 Data warehouse implementation
 From data warehousing to data mining


Three kinds of data warehouse applications

Information processing




73
supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
Analytical processing

multidimensional analysis of data warehouse data

supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining

knowledge discovery from hidden patterns

supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools.
Differences among the three tasks
74
18
From OLAP to OLAM
OLAM Architecture
Layer4
Mining query

Mining result
Why OLAM (online analytical mining)?
 High

quality of data in data warehouses (DW)
User GUI API
OLAM
Engine
DW contains integrated, consistent, cleaned data
 Available
information processing structure
surrounding data warehouses

OLAP
Engine
User Interface
Layer3
OLAP/OLAM
Data Cube API
ODBC, OLEDB, Web accessing, service facilities, reporting
and OLAP tools
Layer2
MDDB
 OLAP gives a good framework for data exploration
 “interactive” mining with drilling, dicing, pivoting, etc.
 On-line selection of data mining functions e.g., integration
and swapping of multiple mining functions, algorithms, and
tasks.
Meta Data
Filtering&Integration Database API
Databases
75
Discovery-Driven Exploration
Filtering
Data
Data cleaning
Data integration Warehouse
MDDB
(Multi-dimensional DB)
Layer1
Data
Repository
76
Example
Data cube exploration:
 Hypothesis-driven
 exploration

by user, huge search space, can overlook
Discovery-driven (Sarawagi, et al.’98)
 Effective
navigation of large OLAP data cubes
measures indicating exceptions, guide
user in the data analysis, at all levels of aggregation
 Exception: significantly different from the value
anticipated, based on a statistical model
 Visual cues such as background color are used to
reflect the degree of exception of each cell
 pre-compute
77
78
19
“Exception” and its computation



Parameters
 SelfExp: surprise of cell relative to other cells at same
level of aggregation
 InExp: surprise beneath the cell
 PathExp: surprise beneath cell for each drill-down path
Computation of exception indicator (model fitting and
computing SelfExp, InExp, and PathExp values) can be
overlapped with cube construction
Exception themselves can be stored, indexed and
retrieved like pre-computed aggregates
Data Mining & Warehouses

Construction of data warehouses requires data cleaning
and integration
 important preprocessing in data mining

Data mining tools can interface with the OLAP engine
 facilitates interactive data exploration, including
data integration, aggregation and navigation

OLAP-based mining  OLAM
79
For more, see …
Summary





Data warehouse
A multi-dimensional model of a data warehouse
 Representation: A data cube consists of dimensions & measure
 Schema: Star, snowflake, and fact constellations
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
Efficient computation of data cubes




Partial vs. full vs. no materialization
Multi-way array aggregation
Bitmap index and join index implementations
Data mining and data warehouse


80
Discovery-driven in data warehouse
From OLAP to OLAM (on-line analytical mining)
81

S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S.
Sarawagi. On the computation of multidimensional aggregates. VLDB’96

D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses.
SIGMOD’97.

R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97

K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs..
SIGMOD’99.

S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM
SIGMOD Record, 26:65-74, 1997.

OLAP council. MDAPI specification version 2.0. In http://www.olapcouncil.org/research/apily.htm,
1998.

G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained Gradients in Data
Cubes. VLDB’2001

J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H.
Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and subtotals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
82
20
For more, see …

J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures.
SIGMOD’01

V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD’96

Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998.

K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB’97.

K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities.
EDBT'98.

S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes.
EDBT'98.

E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. John Wiley & Sons,
1997.

W. Wang, H. Lu, J. Feng, J. X. Yu, Condensed Cube: An Effective Approach to Reducing Data
Cube Size. ICDE’02.

Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous
multidimensional aggregates. SIGMOD’97.
83
21