Download Database Management Systems Objectives of Lecture 5 Data

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Big data wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Concurrency control wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Database wikipedia , lookup

Relational model wikipedia , lookup

Clusterpoint wikipedia , lookup

Database model wikipedia , lookup

Transcript
Lecture 5
Database Management
Systems
Objectives of Lecture 5
Data Warehousing and OLAP
• Realize the purpose of data warehousing.
Winter 2004
• Comprehend the data structures behind data
warehouses and understand the OLAP technology.
CMPUT 391: Data Warehousing
Dr. Osmar R. Zaïane
• Get an overview of the schemas used for multidimensional data.
University of Alberta
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
• See some implementations of OLAP operators with
SQL
Chapter 19 of
Textbook
University of Alberta
1
Data Warehouse and OLAP
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
2
Incentive for a Data Warehouse
• Businesses have a lot of data, operational data and facts.
• This data is usually in different databases and in different
physical places.
• Data is available (or archived), but in different formats and
locations. (heterogeneous and distributed).
• What is a data warehouse and what is it for?
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Decision makers need to access information (data that has been
summarized) virtually on one single site.
• This access needs to be fast regardless of the size of the data, and
how old the data is.
• Are there issues related to data cube technology?
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
3
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
4
What Is Data Warehouse?
Evolution of Decision Support Systems
1970s
1960s
A
ata
ys
nal
ls
oo
is T
•Data Analyst
Inflexible and
non-integrated
tools
•Data Analyst
Flexible integrated
spreadsheets.
Slow access to
operational data
Database Management Systems
g
nd
ssin
g a roce
P
sin
hou cal
are alyti
ta W An
Da -Line
On
D
op
skt
De
al
nu
 Dr. Osmar R. Zaïane, 2001-2004
s
em
yst
d
ase rt S
l-b po
ina up
rm n S
Te cisio
De
a
dM
an
tch ng
Ba porti
Re
•Statistician
•Computer scientist
Difficult and limited
queries highly
specific to some
distinctive needs
• A data warehouse consolidates different data sources.
• A data warehouse is a database that is different and maintained
separately from an operational database.
• A data warehouse combines and merges information in a consistent
database (not necessarily up-to-date) to help decision support.
1990s
1980s
•Executive
Integrated tools
Data Mining
University of Alberta
5
Decision support systems access data warehouse and
do not need to access operational databases Î do
not unnecessarily over-load operational databases.
 Dr. Osmar R. Zaïane, 2001-2004
Definitions
Database Management Systems
University of Alberta
6
Definitions (con’t)
Data Warehouse is a subject-oriented, integrated, time-variant
and non-volatile collection of data in support of management’s
decision making process. (W.H. Inmon)
Data Warehousing is the process of constructing and using
data warehouses.
Subject oriented: oriented to the major subject areas of the corporation that
have been defined in the data model.
A corporate data warehouse collects data about subjects
spanning the whole organization. Data Marts are specialized,
single-line of business warehouses. They collect data for a
department or a specific group of people.
Integrated: data collected in a data warehouse originates from different
heterogeneous data sources.
Time-variant: The dimension “time” is all-pervading in a data warehouse.
The data stored is not the current value, but an evolution of the value in time.
Non-volatile: update of data does not occur frequently in the data
warehouse. The data is loaded and accessed.
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
7
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
8
Data Warehouse and OLAP
Building a Data Warehouse
• What is a data warehouse and what is it for?
Option 1:
Consolidate Data Marts
Data Mart
Data Mart
Data Mart
Corporate
Data Warehouse
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
Option 2:
Build from
scratch
Data Mart
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Are there issues related to data cube technology?
Corporate data
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
9
We sell products in various
markets, and we measure our
performance over time
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
e
Ti
m
Markets
Products
University of Alberta
University of Alberta
10
• Think of it as a cube with labels
on each edge of the cube.
• The cube doesn’t just have 3
dimensions, but may have many
dimensions (N).
• Any point inside the cube is at
the intersection of the coordinates
defined by the edge of the cube.
• A point in the cube may store
values (measurements) relative to
the combination of the labeled
dimensions.
Business Manager
Data Warehouse Designer
Database Management Systems
Construction of Data Warehouse
Based on Multi-dimensional Model
Describing the Organization
We sell Products in various
Markets, and we measure our
performance over Time
 Dr. Osmar R. Zaïane, 2001-2004
11
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
12
Concept-Hierarchies
Data Warehouse and OLAP
Most Dimensions are hierarchical by nature: total orders or partial orders
Example: Location(continent Î country Î province Î city)
Time(yearÎquarterÎ(month,week)Îday)
• What is a data warehouse and what is it for?
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
Industry
Dimensions: Product, Region, Time
Hierarchical summarization paths
Country
Category Region
Product
City
Office
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
Year
• What is the general architecture of a data warehouse?
Quarter
• How can we implement a data warehouse?
Month Week
• Are there issues related to data cube technology?
Day
University of Alberta
13
On-Line Transaction Processing
•OLAP operations:
DW tend to be in the order of Tb
Databases tend to be hundreds of Mb to Gb.
University of Alberta
University of Alberta
14
• On-line analytical processing (OLAP) is essential for
decision support.
• OLAP is supported by data warehouses.
• Data warehouse consolidation of operational databases.
• The key structure of the data warehouse always contains
some element of time.
•Owing to the hierarchical nature of the dimensions, OLAP
operations view the data flexibly from different perspectives
(different levels of abstractions).
¾OLTP operations are structured and repetitive
¾OLTP operations require detailed and up-to-date data
¾OLTP operations are short, atomic and isolated transactions
Database Management Systems
Database Management Systems
On-Line Analytical Processing
• Database management systems are typically used for on-line
transaction processing (OLTP)
• OLTP applications normally automate clerical data
processing tasks of an organization, like data entry and
enquiry, transaction handling, etc. (access, read, update)
• Database is current, and consistency and recoverability are
critical. Records are accessed one at a time.
 Dr. Osmar R. Zaïane, 2001-2004
 Dr. Osmar R. Zaïane, 2001-2004
15
 Dr. Osmar R. Zaïane, 2001-2004
• roll-up (increase the level of abstraction)
• drill-down (decrease the level of abstraction)
• slice and dice (selection and projection)
• pivot (re-orient the multi-dimensional view)
• drill-through (links to the raw data)
Database Management Systems
University of Alberta
16
OLTP vs OLAP
Data Warehouse OLAP Example
Time (Quarters)
Q1 Q2 Q3 Q4
Edmonton
Compact
Calgary
AB)Lethbridge
Mid-size
Red Deer
Family
(city,
Minivan.
OLTP
OLAP
users
Clerk, IT professional
Knowledge worker
function
day to day operations
decision support
DB design
application-oriented
subject-oriented
data
current, up-to-date
detailed, flat relational
isolated
repetitive
historical,
summarized, multidimensional
integrated, consolidated
ad-hoc
lots of scans
unit of work
read/write
index/hash on prim. key
short, simple transaction
complex query
# records accessed
tens
millions
#users
thousands
hundreds
DB size
100MB-GB
100GB-TB
metric
transaction throughput
query throughput, response
Time (Months, Q3)
Location
Category
Edmonton
Location Calgary
(city, AB)Lethbridge
Red Deer
Jul Aug Sep
Compact
Mid-size
Family
Category
Minivan.
Drill down on Q3
Roll-up on Location
Time (Quarters)
Q1 Q2 Q3 Q4
Maritimes
Location Quebec
Compact
(province, Ontario
Mid-size
Prairies
Canada)Western
Pr
Family
Slice on Category=mid-size
Time (Quarters)
usage
access
Category
Minivan.
Q1 Q2 Q3 Q4
Edmonton
Calgary
AB)Lethbridge
Mid-size
Red Deer
Location
(city,
Pivot
Time (Quarters)
 Dr. Osmar R. Zaïane, 2001-2004
er
De
Red ge
id
hbr
Le t g a r y n
l
Ca onto
m
Ed
Category
Q1
Q2
Q3
Q4
Database Management Systems
Location
(city, AB)
Mid-size
Category
(Source: JH)
University of Alberta
17
Why Do We Separate DW From DB?
• Performance reasons:
University of Alberta
18
Data Warehouse and OLAP
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Decision support requires historical data.
• Decision support requires consolidated data.
Database Management Systems
Database Management Systems
• What is a data warehouse and what is it for?
– OLAP necessitates special data organization
that supports multidimensional views.
– OLAP queries would degrade operational DB.
– OLAP is read only.
– No concurrency control and recovery.
 Dr. Osmar R. Zaïane, 2001-2004
 Dr. Osmar R. Zaïane, 2001-2004
University of Alberta
• Are there issues related to data cube technology?
19
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
20
Three-tier Architecture
Monitor
&
Integrator
metadata
External
sources
Operational
DBs
Data Warehouse and OLAP
OLAP Server
Analysis
• What is a data warehouse and what is it for?
• What is the multi-dimensional data model?
Extract
Transform
Load
Refresh
Serve
Data
Warehouse
• What is the difference between OLAP and OLTP?
Query
Reports
• What is the general architecture of a data warehouse?
Data
mining
• How can we implement a data warehouse?
• Are there issues related to data cube technology?
Client Tools
Data sources
 Dr. Osmar R. Zaïane, 2001-2004
• Can we mine data warehouses?
Data Marts
Database Management Systems
University of Alberta
21
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
Product
Date
Most data warehouses use a star schema to represent the multidimensional model.
Day
Month
Year
Each dimension is represented by a dimension-table that
describes it.
Date
Product
ProductNo
ProdName
ProdDesc
Category
Store
Cust
StoreID
Customer
City
CustId
State
CustName
unit_sales
Country
CustCity
dollar_sales
Region
CustCountry
Star schema: A single object (fact table) in the middle
connected to a number of objects (dimension tables)
Each dimension is represented by one table
ÎUn-normalized (introduces redundancy).
Normalize dimension tables Î Snowflake schema
The links between the fact-table in the centre and the dimensiontables in the extremities form a shape like a star. (Star Schema)
University of Alberta
Sales Fact Table
Store
A fact-table connects to all dimension-tables with a multiple
join. Each tuple in the fact-table consists of a pointer to each of
the dimension-tables that provide its multi-dimensional
coordinates and stores measures for those coordinates.
Database Management Systems
22
Example of Star Schema
Data Warehouse Design
 Dr. Osmar R. Zaïane, 2001-2004
University of Alberta
23
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
(Source: JH)
24
Example of Snowflake Schema
Year
Year
Month
Month
Year
Date
Sales Fact Table
Date
Month
Date
Product
City
Country
City
State
Country
Region
Multidimensional view of data in the warehouse:
Stress on aggregation of measures by one or more
dimensions
Cust
StoreID
City
Customer
unit_sales
dollar_sales
State
Country
ProductNo
ProdName
ProdDesc
Category
Store
Store
State
Aggregation in Data Warehouses
Product
The Data Cube and
The Sub-Space Aggregates
CustId
CustName
CustCity
CustCountry
By City
Group By
Aggregate
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
Sum
University of Alberta
Time
City
Edmonton
Calgary
Lethbridge
Sum
2002 Sum
25
Sum
Drama
Comedy
Horror
By Category & City
Sum
Sum
By Time & Category
By Category
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
26
Implementation of the OLAP Server
ROLAP: Relational OLAP - data is stored in tables in relational
database or extended-relational databases. They use an RDBMS to
manage the warehouse data and aggregations using often a star
schema.
•They support extensions to SQL
•A cell in the multi-dimensional structure is represented by a tuple.
Advantage: Scalable (no empty cells for sparse cube).
Disadvantage: no direct access to cells.
All Years
Drama, Edmonton
Drama
Comedy
… ...
Drama
Comedy
Horror
Drama
Comedy
Horror
By Time
By Time & City
(Source: JH)
Construction of Multi-dimensional
Data Cube
1999 2000 2001
By Category
By Time
Snowflake schema: Easier to maintain dimension tables when
dimension tables are very large (reduces overall space).
Star schema: More effective for data cube browsing (less joins):
can affect performance.
Cross Tab
Q1Q2Q3Q4
Category
Q3Q Red Deer
4
Q1 Q2
Lethbridge
Calgary
Edmonton
Category
Sum
Ex: Microstrategy
Metacube (Informix)
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
27
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
28
Implementation of the OLAP Server
Example of Fact and Dimension
tables for ROLAP
MOLAP: Multidimensional OLAP – implements the
multidimensional view by storing data in special multidimensional
data structures (MDDS)
Advantage: Fast indexing to pre-computed aggregations. Only
values are stored.
Disadvantage: Not very scalable and sparse
• The dimensions of the fact table are further
described with dimension tables
• Fact table:
Sales (Market_id, Product_Id, Time_Id, Sales_Amt)
• Dimension Tables:
Ex: Essbase of Arbor
Market (Market_Id, City, Province, Region)
Product (Product_Id, Name, Category, Price)
Time (Time_Id, Week, Month, Quarter)
HOLAP: Hybrid OLAP - combines ROLAP and MOLAP
technology. (Scalability of ROLAP and faster computation of MOLAP)
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
29
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
30
Aggregation over Time
Aggregation
• The output of the previous query
• Many OLAP queries involve aggregation of the
data in the fact table
• For example, to find the total sales (over time) of
each product in each market, we might use
Market_Id
M1
Product_Id
SELECT
S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM
Sales S
GROUP BY S.Market_Id, S.Product_Id
• The aggregation is over the entire time dimension
and thus produces a two-dimensional view of the
data
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
31
SUM(Sales_Amt)
P1
P2
P3
P4
P5
 Dr. Osmar R. Zaïane, 2001-2004
M2
3003
1503
6003
2402
4503
3
7503
7000
…
…
Database Management Systems
M3
M4
…
…
…
…
…
University of Alberta
32
Drilling Down and Rolling Up
Drilling Down
• Some dimension tables form an aggregation hierarchy
•
Market_Id → City → Province → Region
Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
Market (Market_Id, City, Province, Region)
• Executing a series of queries that moves down a
hierarchy (e.g., from aggregation over regions to that
over provinces) is called drilling down
1.
• Executing a series of queries that moves up the hierarchy
(e.g., from provinces to regions) is called rolling up
– Note: In a rollup, coarser aggregations can be computed using
prior queries for finer aggregations
Database Management Systems
University of Alberta
33
Rolling Up
•
1.
If we have already created a table, Province_Sales,
Province_Sales using
Database Management Systems
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
34
SELECT
S.Product_Id, T.Quarter, SUM (S.Sales_Amt)
FROM
Sales S, Time T
WHERE
T.Time_Id = S.Time_Id
Pivot
GROUP BY S.Product_Id, T.Quarter
T.Product_Id, M.Region, SUM (T.Sales_Amt)
Province_Sales T, Market M
M.Province = T.Province
GROUP BY T.Product_Id, M.Region
 Dr. Osmar R. Zaïane, 2001-2004
SELECT
S.Product_Id, M.Province, SUM (S.Sales_Amt)
FROM
Sales S, Market M
WHERE
M.Market_Id = S.Market_Id
GROUP BY S.Product_Id, M.Province,
– Pivoting on dimensions D1,…,Dk in a data cube
D1,…,Dk,Dk+1,…,Dn means that we use GROUP BY
A1,…,Ak and aggregate over Ak+1,…An, where Ai is an
attribute of the dimension Di
– Example: Pivoting on Product and Time corresponds to
grouping on Product_id and Quarter and aggregating
Sales_Amt over Market_id:
SELECT
S.Product_Id, M.Province, SUM (S.Sales_Amt)
FROM
Sales S, Market M
WHERE
M.Market_Id = S.Market_Id
GROUP BY S.Product_Id, M.Province
SELECT
FROM
WHERE
2.
• When we view the data as a multi-dimensional
cube and group on a subset of the axes, we are said
to be performing a pivot on those axes
then we can roll up from there to:
2.
S.Product_Id, M.Region, SUM (S.Sales_Amt)
Pivoting
Rolling up on market, from Province to Region
–
SELECT
FROM
Sales S, Market M
WHERE
M.Market_Id = S.Market_Id
GROUP BY S.Product_Id, M.Region
– Requires the use of the fact table or information more specific
than the requested aggregation (e.g., cities)
 Dr. Osmar R. Zaïane, 2001-2004
Drilling down on market: from Region to Province
University of Alberta
35
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
36
Slicing-and-Dicing
Slicing-and-Dicing
• When we use WHERE to specify a particular
value for an axis (or several axes), we are
performing a slice
• Typically slicing and dicing involves several queries to
find the “right slice.”
For instance, change the slice and the axes:
– Slicing the data cube in the Time dimension
(choosing sales only in week 12) then pivoting to
Product_id (aggregating over Market_id)
SELECT S.Product_Id, SUM (Sales_Amt)
• Slicing on Time and Market dimensions then pivoting to Product_id
and Week (in the time dimension)
SELECT
FROM
WHERE
S.Product_Id, T.Quarter, SUM (Sales_Amt)
Sales S, Time T
T.Time_Id = S.Time_Id
AND T.Quarter = 4
AND S.Market_id = 12345
GROUP BY S.Product_Id, T.Week
Slice
FROM Sales S, Time T
WHERE T.Time_Id = S.Time_Id AND T.Week = ‘Wk-12’
GROUP BY S. Product_Id
Pivot
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
Pivot
37
Product_Id
SUM(Sales_Amt)
P1
P2
P3
P4
Total
 Dr. Osmar R. Zaïane, 2001-2004
3003
1503
6003
2402
4503
3
7503
7000
…
…
Database Management Systems
M3 Total
…
…
…
…
…
Database Management Systems
University of Alberta
38
• For the table entries, without the totals (aggregation on time)
SELECT
S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM
Sales S
GROUP BY S.Market_Id, S.Product_Id
• For the row totals (aggregation on time and supermarkets)
SELECT
S.Product_Id, SUM (S.Sales_Amt)
FROM
Sales S
GROUP BY S.Product_Id
• For the column totals (aggregation on time and products)
SELECT
S.Market_Id, SUM (S.Sales)
FROM
Sales S
GROUP BY S.Market_Id
• To construct the following table, would take 3
queries (next slide)
Market_Id
M2
 Dr. Osmar R. Zaïane, 2001-2004
The Three Queries
The CUBE Operator
M1
Slice
…
…
…
…
…
University of Alberta
39
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
40
Definition of the CUBE Operator
Example of CUBE Operator
• Doing these three queries is wasteful
• The following query returns all the information
needed to make the previous products/markets
table:
– The first does much of the work of the other two: if
we could save that result and aggregate over
Market_Id and Product_Id, we could compute the
other queries more efficiently
SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY CUBE (S.Market_Id, S.Product_Id)
• The CUBE clause is part of SQL:1999
– GROUP BY CUBE (v1, v2, …, vn)
– Equivalent to a collection of GROUP BYs, one for
each of the 2n subsets of v1, v2, …, vn
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
41
ROLLUP
Database Management Systems
University of Alberta
42
Example of ROLLUP Operator
• ROLLUP is similar to CUBE except that instead of
aggregating over all subsets of the arguments, it
creates subsets moving from right to left
• GROUP BY ROLLUP (A1,A2,…,An) is a series of
these aggregations:
–
–
–
–
–
–
 Dr. Osmar R. Zaïane, 2001-2004
SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY ROLLUP (S.Market_Id, S. Product_Id)
– first aggregates with the finest granularity:
GROUP BY
GROUP BY A1 ,…, An-1 ,An
GROUP BY A1 ,…, An-1
………
GROUP BY A1, A2
GROUP BY A1
No GROUP BY
S.Market_Id, S.Product_Id
– then with the next level of granularity:
GROUP BY
S.Market_Id
– then the grand total is computed with no GROUP
BY clause
• ROLLUP is also in SQL:1999
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
43
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
44
Materialized Views
ROLLUP vs. CUBE
• The same query with CUBE:
- first aggregates with the finest granularity:
GROUP BY
The CUBE operator is often used to precompute aggregations on all dimensions of a
fact table and then save them as a materialized
views to speed up future queries
S.Market_Id, S.Product_Id
- then with the next level of granularity:
GROUP BY
S.Market_Id
and
GROUP BY
S.Product_Id
- then the grand total with no GROUP BY
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
45
 Dr. Osmar R. Zaïane, 2001-2004
Data Warehouse and OLAP
Scalability
Sparseness
Curse of dimensionality
Materialization of the multidimensional
data cube (total, virtual, partial)
• Efficient computation of aggregations
• Indexing
• What is the difference between OLAP and OLTP?
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Are there issues related to data cube technology?
University of Alberta
46
•
•
•
•
• What is the multi-dimensional data model?
Database Management Systems
University of Alberta
Issues
• What is a data warehouse and what is it for?
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
47
 Dr. Osmar R. Zaïane, 2001-2004
Database Management Systems
University of Alberta
48