Download Data Warehousing and Decision Support

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Data Warehousing and Decision Support
M. Tamer Özsu
David R. Cheriton School of Computer Science
University of Waterloo
CS 348
Introduction to Database Management
Fall 2012
CS 348
Notes
Warehousing
Fall 2012
1 / 16
Outline
1
2
3
Introduction to Decision Support
On-Line Analytical Processing
Multidimensional Data
Multidimensional Queries
Data Warehousing
Creating and Maintaining a Warehouse
Materializing Views
CS 348
Notes
Warehousing
Fall 2012
2 / 16
Transaction Processing
The most common use of relational databases is for operational data.
• Examples:
• Students enrolling in courses
• Customers purchasing products
• Passengers purchasing airline tickets
On-Line Transactional Processing (OLTP)
Databases that support the basic operations of a business are generally
classified as OLTP systems.
• Workload characteristics:
1
2
simple queries
many short transactions making small changes
• Systems tuned to maximize throughput of concurrent transactions
CS 348
Notes
Warehousing
Fall 2012
3 / 16
Beyond Transaction Processing
More recent uses of operational data:
Decision Support Summarizing data to support high-level decision
making
• Complex queries with much aggregation
Data Mining Searching for trends or patterns in data for a business to
exploit
• Simple queries, but very data-intensive
Data Warehousing
A data warehouse is a separate copy of the operational data used for
executing decision support and/or data mining queries.
CS 348
Notes
Warehousing
Fall 2012
4 / 16
On-Line Analytical Processing
On-Line Analytical Processing (OLAP)
OLAP is a particular type of decision support
• Data is modeled as multidimensional array
• Queries are usually ad hoc
• Queries select and aggregate cells of the array
• OLAP systems are divided into two categories:
1 Special-purpose OLAP systems
• store data as multidimensional arrays (“MOLAP”)
• provide an OLAP-specific query language
2
Relational databases
• store data in relations (“ROLAP”)
• queries written in SQL
CS 348
Notes
Warehousing
Fall 2012
5 / 16
Multidimensional Data
PRODUCT
• Example: Number of Sales
3
14
45
22
2
12
20
27
1
11
27
30
A
B
C
LOCATION
CS 348
Notes
Warehousing
2
1
3
TIME
Fall 2012
6 / 16
Star Schemas
Location
lid store
A Weber
B
F-H
C
Park
city
Waterloo
Kitchener
Kitchener
Product
pid pname
1
Bolt
2
Nut
3
Wrench
Time
tid date
CS 348
Notes
category
Hardware
Hardware
Tools
province
ON
ON
ON
country
CA
CA
CA
price
.10
.05
1.99
week month quarter
virtual relation
year
Warehousing
Sales
lid pid
A
1
A
2
A
3
B
1
B
2
B
3
C
1
C
2
C
3
A
1
A
2
A
3
..
.
tid
1
1
1
1
1
1
1
1
1
2
2
2
Fall 2012
sales
11
12
14
27
20
45
30
27
22
16
20
55
7 / 16
OLAP Queries
• OLAP queries typically aggregate over one or more dimensions.
Examples:
• Total sales
• Total sales this year for each product category
• Total sales for each store per quarter
• OLAP is a tool for ad hoc data exploration/visualization
• Ad hoc queries tend to be iterative
• Desirable to express queries using operations over previous result
CS 348
Notes
Warehousing
Fall 2012
8 / 16
OLAP Query Operations
14
45
22
2
12
20
27
1
11
27
30
A
B
C
LOCATION
PRODUCT
• Roll-up and Drill-down
3
14
45
22
2
12
20
27
1
11
27
30
A
B
C
LOCATION
CS 348
Notes
2
1
3
3
TIME
1
2
TIME
/
,
3
14
45
22
2
12
20
27
1
11
27
30
3
1
A
B
C
LOCATION
PRODUCT
3
PRODUCT
PRODUCT
• Slicing and Dicing
2
TIME
Tools
14
45
22
Hardware
23
47
57
A
B
C
LOCATION
Warehousing
1
2
3
TIME
Fall 2012
9 / 16
Data Cube
• A data cube extends a multidimensional array of data to include
PRODUCT
all possible aggregated totals
3
14
45
22
81
2
12
20
27
59
1
11
27
30
68
37
92
79
208
A
B
C
LOCATION
CS 348
Notes
Warehousing
3
2
1 TIME
Fall 2012
10 / 16
Data Cubes as Relations
Sales
lid pid
A
1
A
2
A
3
A
B
1
B
2
B
3
B
C
1
C
2
C
3
C
1
2
3
A
1
tid
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
sales
11
12
14
37
27
20
45
92
30
27
22
79
68
59
81
208
16
..
.
CS 348
Notes
Warehousing
Fall 2012
11 / 16
CUBE operator in SQL:1999
• Generating the
1 SUM(sales)
2 SUM(sales)
3 SUM(sales)
4 SUM(sales)
5 SUM(sales)
6 SUM(sales)
7 SUM(sales)
8 SUM(sales)
data cube:
GROUP
GROUP
GROUP
GROUP
GROUP
GROUP
GROUP
BY
BY
BY
BY
BY
BY
BY
location, product, time (raw cells)
location, time
product, time
product, location
product
location
time
• CUBE operator in SQL:1999 groups by all combinations
SELECT lid, pid, tid, SUM(sales)
FROM Sales
GROUP BY CUBE(lid, pid, tid)
CS 348
Notes
Warehousing
Fall 2012
12 / 16
Data Warehousing
OLAP
Data
Visualization
Data
Mining
Data
Warehouse
Operational Databases
CS 348
Notes
Warehousing
Fall 2012
13 / 16
Creating and Maintaining a Warehouse
Necessary steps when creating a warehouse:
Extract Run queries against the operational databases to retrieve
necessary data
Clean Delete or repair tuples with missing or invalid information
Transform Reorganize the data to fit the conceptual schema of the
warehouse
Load Populate the warehouse tables; build indexes and/or
materialized views
Note
The data in the warehouse needs to be refreshed periodically (typically
nightly or weekly). To make this process efficient, the above steps need
to be executed incrementally.
CS 348
Notes
Warehousing
Fall 2012
14 / 16
Materializing Views
3
14
67
2
12
47
1
11
57
Kitchener
Waterloo
CREATE VIEW
ByCityQuarter(city, pid, quarter, sales) AS
SELECT city, pid, QUARTER(tid), SUM(sales)
FROM Sales s, Location l
WHERE s.lid = l.lid
GROUP BY city, pid, QUARTER(tid)
PRODUCT
• Consider the following view of the Sales data:
2007Q1
2007Q2
TIME
LOCATION
• View ByCityQuarter is useful for any query that
1 Rolls-up the Location dimension to at least City; and
2 Rolls-up the Time dimension to at least Quarter
CS 348
Notes
Warehousing
Fall 2012
15 / 16
Materializing Views (cont’d)
• Issues related to using materialized views:
1 Which views to materialize (view selection)
2 Which views are useful to answer a query (view matching)
3 Which indexes to build on the views
4 How to refresh the data in the view. Options:
• Synchronous incremental maintenance
• Asynchronous incremental maintenance
• No synchronization (periodic re-creation)
Observation
These are the very same issues that apply to the entire data
warehouse, relative to the data in the operational databases.
CS 348
Notes
Warehousing
Fall 2012
16 / 16