Download mview

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Database wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Relational algebra wikipedia , lookup

Clusterpoint wikipedia , lookup

Functional Database Model wikipedia , lookup

Database model wikipedia , lookup

Relational model wikipedia , lookup

Transcript
Materialized View Selection in
a Multidimensional Database
Presenter: Dong Wang
3/14/2006
outlines
1.
2.
3.
4.
What is multidimensional database.
Why materialize views.
The cost evaluation.
The MDred-lattice.
Multidimensional Database
• A multidimensional database (MDDB) is a data
repository that provides an integrated environment for
decision support queries that require complex
aggregations on huge amounts of historical data.
• An MDDB is a relational data warehouse where the
information is organized following the so-called starmodel.
A Practical Example
•
•
•
Consider the MDDB for a large store chain, characterized by a
large number of stores, each of which is a supermarket selling a
wide variety of different products. We can identify the following
dimensions:
Product, which can be characterized by Product_id, Department,
Manufactured_date and Price.
Store, which can be characterized by Store_id, store address
(which can be decomposed into City, State, and Zip).
Time, which can be characterized by Timestamp, Date, Week,
Month, Quarter, Year.
The schema of the example
Time
Timestamp
Date
Week
Month
Quarter
Year
Store
Sales
Transaction_id
Product
Product_id
Department
Manufactured_
date
Price
Timestamp
Product_id
Store_id
Store_id
City
State
Zip
Example queries
• Query 1: the total sales for year 2003.
SELECT SUM (Price)
FROM Sales, Time, Product
WHERE Sales.Product_id = Product.Product_id
AND Sales.Timestamp = Time.Timestamp
AND Time.Year = ‘2003’
• Query 2: the total sales for store at Ohio.
SELECT SUM (Price)
FROM Sales, Store, Product
WHERE Sales.Product_id = Product.Product_id
AND Sales.Store_id = Store.Store_id
AND Store.State = ‘Ohio’
How many views an MDDB can have?
• It depends on the number of attributes of the dimensions
of the MDDB
without hierarchies on the dimensional tables, the
number is
ntotal   (2ni  1)
i
In our example database with only 3 dimension tables of 6, 4, 4
attributes, this number is 18785, but for a real-world database with 50
attributes, this number is 250~1015,
outlines
1.
2.
3.
4.
What is multidimensional database.
Why materialize views.
The cost evaluation.
Data-cube lattice, MD-lattice and
MDred-lattice.
Materialized View
• A materialized view is the result of some queries, which we choose
to store in the database, rather than reconstructing it as needed in
response to queries.
INSERT INTO SalesV1
SELECT SUM (Price)
FROM Sales, Time, Product
WHERE Sales.Product_id = Product.Product_id
AND Sales.Timestamp = Time.Timestamp
GROUP BY (Time.Year)
The materialized view SalesV1 can answer the query 1 directly.
outlines
1.
2.
3.
4.
What is multidimensional database.
Why materialize views.
The cost evaluation.
MDred-lattice.
The MDmat-Problem—the cost
• Query cost Cqi (M ) : the cost of computing query qi, given a set
of materializations M. We want to minimize this cost.
• Update cost Cu (M ) 
f
mi M
mi
 cu (mi )
, here mi is the ith view in M and
fmi is the frequency mj is updated and cu(mi) is the update cost for
mi. We want to minimize this cost too.
So, given the query set and the materialized view set, the cost of
this solution is the sum of the above two costs:
C (Q, M )  Cqi ( M )  Cu ( M )
choose the right views to materialize
•
Compare to the possible views we can have, the number of queries is
extremely small. Consider the data-cube lattice we have below, among the
total 16 nodes, only 4 nodes may be used to answer queries. So we can
only select a small number of views to materialize.
psdr
q4
q3
psd
psr
pd
r
sdr
pd
pr
sd
sr
p
s
d
r
q2
ps
q1
none
dr
Functional Dependence
• Functional dependency is a constraint on the content of
the dimension table:
for each tuple pair t1,t2 and fd: Al→Ar, t1[Al]=t2[Al]→t1[Ar]=t2[Ar]
Examples:
1. In the dimension table Store, we have fds1: Store_id →Zip,
fds2: Zip → City, fds3: City →State.
2. In the dimension talbe Time, we have fdt1: timestamp
→week, fdt2: timestamp →date, fdt3: date →month, fdt4: month
→quarter, fdt5: quarter →year.
Use the attributes hierarchy, we can get the multidimensional lattice.
The MD-lattice
Timestamp
Store_id
Date
Zip
Week
Month
City
Quarter
State
Year
all
all
The MD-lattice of the
Store dimension
The MD-lattice of the Time
dimension
Candidate Views
•
•
•
It’s impossible (and no need) to materialize all the possible views
in the data cube. We only need the views which can help us to
answer the queries.
We only consider the views that can provide some contribution to
reduce the total cost, the candidate views.
A view vi belonging to an MD-lattice is a candidate view if one of
the following two conditions holds:
1. View vi is associated to some query qi;
2. There exist two candidate views vj and vk, and vi is the least upper
bound of vj and vk.
The materialization of a noncandidate view will not help
•
1.
2.
Suppose there is a non-candidate view vi and it’s materialized. We
consider two cases:
There is no candidate view depending on vi. Since vi will not change the
query cost, and the update cost for view vi is always positive, so
materialize vi will not help.
At least one candidate view exists depending on vi. Say there’s a
candidate view vj depending on vi. Since the size of vj is smaller than vi, we
can see the update cost of vj is always smaller than vi. That means the
materialization of vi always costs more.
: materialized view
: unmaterialized view
case 1
Both views are materialized
case 2
only the non-candidate view is materialized
Conclusion: we should always choose the candidate view to materialize.
Candidate views examples
• For query 1 on slide #5, we can choose the view
SalesV1 to materialize.
• For query 2, we can do:
CREAT MATERIALIZED VIEW SalesV2
SELECT SUM (Price)
FROM Sales, Store, Product
WHERE Sales.Product_id = Product.Product_id
AND Sales.Store_id = Store.Store_id
GROUP BY (Store.State)
In both examples, we choose the view which is associated to the
query to materialize.
outlines
1.
2.
3.
4.
What is multidimensional database.
Why materialize views.
The cost evaluation.
The MDred-lattice.
The MDred-lattice
•
Given an MD-lattice and a set of queries Q, the set of its candidate views forms the
MDred-lattice.
The MDred-lattice Construction Algorithm
/* input : a finite set Q of queries * /
/* output : the MDred  lattice obtained by Q * /
L : Q; lastViews : L; newViews : null ;
while lastViews  null
for each vi  lastViews do
for each v j  L, v j  vi do
if vi  v j  L
newViews : newViews  (vi  v j )
L : L  newViews;
lastViews : newViews; newViews : null ;
return L;
An MD-lattice construction
• Suppose we have two queries:
query 1: the total sale of the
week 50.
query 2: the total sale of the 3rd
quarter of year 2005.
From the MDred-lattice construction
algorithm, first we need to materialize the
views group by attribute Week and attribute
Quarter to answer the queries, then we need
to extend the view set by adding the least
upper bound, attribute Timestamp to the view
set.
The cost evaluation
• Suppose we have two queries qj and qk,
consider both the query cost and the update
cost, we have two options:
Option 1: materialize vj and vk. The total cost is
C1  fu  cu (v j )  f q j  cq j (v j )  fu  cu (vk )  f qk  cqk (vk )
Option 2: only materialize vi, which is the least upper
bound of vj and vk. The total cost is
C2  fu  cu (vi )  f q j  cq j (vi )  f qk  cqk (vi )
The cost evaluation (cont.)
• For option 1, let fu=0.8, cu(vj)=100, fqj=0.5, cqj(vj)=100, cu(vk)=100,
cqk(vk)=100, we can get
C1=0.8×100+0.5×100+0.8×100+0.5×100=260.
• For option 2, let fu=0.8, the update cost will be larger (since the
cardinality of vi is larger), say cu(vi)=120, the query cost will also be
larger (since additional aggregation will be used to answer the
queries), say cqj(vi)=110, cqk(vi)=110, we can get
C2=0.8×120+0.5×110+0.5×110=206.
So option 2 is the better choice!
References
•
•
Materialized view selection in a multidimensional database. Elena Baralis,
Stefano Paraboschi and Ernest Teniente. Proceedings of the 23rd VLDB
Conference.1997
Designing Data Warehouses. Dimitri Theodoratos, Timos Sellis. 1999