Materialized View Selection in a Multidimensional Database
Presenter: Dong Wang
3/14/2006

Outline
1. What is a multidimensional database.
2. Why materialize views.
3. The cost evaluation.
4. The MDred-lattice.

Multidimensional Database
• A multidimensional database (MDDB) is a data repository that provides an integrated environment for decision support queries that require complex aggregations on huge amounts of historical data.
• An MDDB is a relational data warehouse where the information is organized following the so-called star model.

A Practical Example
Consider the MDDB for a large store chain, characterized by a large number of stores, each of which is a supermarket selling a wide variety of different products. We can identify the following dimensions:
• Product, which can be characterized by Product_id, Department, Manufactured_date and Price.
• Store, which can be characterized by Store_id and the store address (which can be decomposed into City, State, and Zip).
• Time, which can be characterized by Timestamp, Date, Week, Month, Quarter, Year.

The schema of the example
• Sales (fact table): Transaction_id, Timestamp, Product_id, Store_id
• Time: Timestamp, Date, Week, Month, Quarter, Year
• Product: Product_id, Department, Manufactured_date, Price
• Store: Store_id, City, State, Zip

Example queries
• Query 1: the total sales for year 2003.
    SELECT SUM(Price)
    FROM Sales, Time, Product
    WHERE Sales.Product_id = Product.Product_id
      AND Sales.Timestamp = Time.Timestamp
      AND Time.Year = '2003'
• Query 2: the total sales for stores in Ohio.
    SELECT SUM(Price)
    FROM Sales, Store, Product
    WHERE Sales.Product_id = Product.Product_id
      AND Sales.Store_id = Store.Store_id
      AND Store.State = 'Ohio'

How many views can an MDDB have?
• It depends on the number of attributes of the dimensions of the MDDB. Without hierarchies on the dimension tables, the number is
$$n_{total} = \prod_i (2^{n_i} + 1)$$
where $n_i$ is the number of attributes of the i-th dimension table.
• In our example database, with only 3 dimension tables of 6, 4 and 4 attributes, this number is 18785. For a real-world database with 50 attributes, it is about $2^{50} \approx 10^{15}$.

Materialized View
• A materialized view is the result of a query that we choose to store in the database, rather than reconstructing it as needed in response to queries.
    INSERT INTO SalesV1
    SELECT Time.Year, SUM(Price)
    FROM Sales, Time, Product
    WHERE Sales.Product_id = Product.Product_id
      AND Sales.Timestamp = Time.Timestamp
    GROUP BY Time.Year
The materialized view SalesV1 can answer query 1 directly.

The MDmat-Problem: the cost
• Query cost $C_{q_i}(M)$: the cost of computing query $q_i$, given a set of materializations $M$. We want to minimize this cost.
• Update cost $C_u(M) = \sum_{m_i \in M} f_{m_i} \, c_u(m_i)$, where $m_i$ is the i-th view in $M$, $f_{m_i}$ is the frequency with which $m_i$ is updated, and $c_u(m_i)$ is the update cost of $m_i$. We want to minimize this cost too.
So, given the query set $Q$ and the materialized view set $M$, the cost of this solution is the sum of the above two costs:
$$C(Q, M) = \sum_{q_i \in Q} C_{q_i}(M) + C_u(M)$$

Choose the right views to materialize
• Compared to the number of possible views, the number of queries is extremely small. Consider the data-cube lattice below: of its 16 nodes in total, only 4 may be used to answer queries. So we need to select only a small number of views to materialize.
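The view count and the cost model above can be checked with a few lines of Python. This is only a sketch: it assumes the view count is the product of (2^n_i + 1) over the dimension tables, which reproduces the 18785 quoted for the example, and the function names and the dictionary layout for a view (keys `f_u` and `c_u`) are illustrative choices, not part of the original slides.

```python
from math import prod


def count_views(attrs_per_dimension):
    """Number of possible views, assuming the product of (2**n_i + 1)
    over the dimension tables (no hierarchies)."""
    return prod(2 ** n + 1 for n in attrs_per_dimension)


def update_cost(views):
    """C_u(M): sum over the materialized views of update frequency
    times update cost. Each view is a dict with hypothetical keys
    'f_u' (update frequency) and 'c_u' (update cost)."""
    return sum(v["f_u"] * v["c_u"] for v in views)


def total_cost(query_costs, views):
    """C(Q, M): the sum of the per-query costs C_qi(M) plus C_u(M).
    query_costs holds one already-computed cost per query."""
    return sum(query_costs) + update_cost(views)


# The example schema has dimension tables with 6, 4 and 4 attributes.
print(count_views([6, 4, 4]))  # 18785

# Made-up numbers, only to show how the two cost terms combine.
materialized = [{"f_u": 0.8, "c_u": 100}, {"f_u": 0.8, "c_u": 100}]
print(total_cost([50, 50], materialized))
```

Keeping the query costs as plain numbers sidesteps the question of how each $C_{q_i}(M)$ is estimated, which the slides leave open at this point.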
[Figure: the data-cube lattice over the attributes p, s, d, r, with 16 nodes from psdr (top) through the three-, two- and one-attribute nodes down to none (bottom). Four nodes are marked with the queries q1, q2, q3 and q4.]

Functional Dependency
• A functional dependency is a constraint on the content of a dimension table: for each pair of tuples $t_1, t_2$ and each fd: $A_l \rightarrow A_r$,
$$t_1[A_l] = t_2[A_l] \Rightarrow t_1[A_r] = t_2[A_r]$$
Examples:
1. In the dimension table Store, we have fds1: Store_id → Zip, fds2: Zip → City, fds3: City → State.
2. In the dimension table Time, we have fdt1: Timestamp → Week, fdt2: Timestamp → Date, fdt3: Date → Month, fdt4: Month → Quarter, fdt5: Quarter → Year.
Using the attribute hierarchy, we can build the multidimensional lattice.

The MD-lattice
[Figure: the MD-lattice of the Store dimension (Store_id, Zip, City, State, all) and the MD-lattice of the Time dimension (Timestamp, Date, Week, Month, Quarter, Year, all).]

Candidate Views
• It is impossible (and unnecessary) to materialize all the possible views in the data cube.
• We only need the views that can help us answer the queries.
• We only consider the views that can contribute to reducing the total cost: the candidate views.
A view vi belonging to an MD-lattice is a candidate view if one of the following two conditions holds:
1. View vi is associated with some query qi;
2. There exist two candidate views vj and vk, and vi is the least upper bound of vj and vk.

The materialization of a non-candidate view will not help
• Suppose there is a non-candidate view vi and it is materialized. We consider two cases:
1. No candidate view depends on vi. Materializing vi does not change the query cost, and the update cost of vi is always positive, so materializing vi will not help.
2. At least one candidate view depends on vi. Say a candidate view vj depends on vi. Since vj is smaller than vi, the update cost of vj is always smaller than that of vi. That means materializing vi always costs more.
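The two conditions that define a candidate view amount to closing the set of query views under least upper bounds. The sketch below illustrates this for the Time dimension only, under several assumptions that are not in the slides: each view is identified by a single grouping attribute, the lattice is encoded through the functional dependencies listed earlier (with Week and Year both leading to all), and the names `FDS`, `closure`, `lub` and `candidate_views` are hypothetical.

```python
# Direct functional dependencies of the Time dimension, as listed above.
# Assumption: Week and Year both lead to the bottom element "all".
FDS = {
    "Timestamp": ["Date", "Week"],
    "Date": ["Month"],
    "Month": ["Quarter"],
    "Quarter": ["Year"],
    "Week": ["all"],
    "Year": ["all"],
    "all": [],
}


def closure(attr):
    """All attributes functionally determined by attr, including itself."""
    seen, stack = set(), [attr]
    while stack:
        a = stack.pop()
        if a not in seen:
            seen.add(a)
            stack.extend(FDS[a])
    return seen


def lub(a, b):
    """Least upper bound: the coarsest attribute that determines both a and b.
    Among the common upper bounds, it is the one with the smallest closure."""
    common = [c for c in FDS if {a, b} <= closure(c)]
    return min(common, key=lambda c: len(closure(c)))


def candidate_views(query_views):
    """Close the query views under lub (conditions 1 and 2 of the definition)."""
    candidates = set(query_views)
    changed = True
    while changed:
        changed = False
        for a in list(candidates):
            for b in list(candidates):
                c = lub(a, b)
                if c not in candidates:
                    candidates.add(c)
                    changed = True
    return candidates


print(sorted(candidate_views({"Week", "Quarter"})))
# ['Quarter', 'Timestamp', 'Week']
```

For queries grouped by Week and by Quarter, the only view added is Timestamp, since it is the only attribute that determines both. A full MDDB view combines attributes from several dimensions, so a real implementation would apply the same idea per dimension.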
[Figure: the two cases, with a legend distinguishing materialized views from unmaterialized views. Case 1: both views are materialized. Case 2: only the non-candidate view is materialized.]
Conclusion: we should always choose candidate views to materialize.

Candidate views examples
• For query 1 (the total sales for year 2003), we can choose the view SalesV1 to materialize.
• For query 2, we can do:
    CREATE MATERIALIZED VIEW SalesV2 AS
    SELECT Store.State, SUM(Price)
    FROM Sales, Store, Product
    WHERE Sales.Product_id = Product.Product_id
      AND Sales.Store_id = Store.Store_id
    GROUP BY Store.State
In both examples, we materialize the view that is associated with the query.

The MDred-lattice
• Given an MD-lattice and a set of queries Q, the set of its candidate views forms the MDred-lattice.

The MDred-lattice Construction Algorithm
    /* input: a finite set Q of queries */
    /* output: the MDred-lattice obtained from Q */
    L := Q; lastViews := L; newViews := ∅;
    while lastViews ≠ ∅
        for each vi ∈ lastViews do
            for each vj ∈ L, vj ≠ vi do
                if lub(vi, vj) ∉ L
                    newViews := newViews ∪ {lub(vi, vj)}
        L := L ∪ newViews; lastViews := newViews; newViews := ∅;
    return L;
Here lub(vi, vj) denotes the least upper bound of vi and vj.

An MDred-lattice construction
• Suppose we have two queries:
    query 1: the total sales of week 50.
    query 2: the total sales of the 3rd quarter of year 2005.
Following the MDred-lattice construction algorithm, we first take the views grouped by attribute Week and by attribute Quarter, which answer the queries, and then extend the view set by adding their least upper bound, the view grouped by attribute Timestamp.

The cost evaluation
• Suppose we have two queries qj and qk. Considering both the query cost and the update cost, we have two options:
Option 1: materialize vj and vk. The total cost is
$$C_1 = f_u \, c_u(v_j) + f_{q_j} \, c_{q_j}(v_j) + f_u \, c_u(v_k) + f_{q_k} \, c_{q_k}(v_k)$$
Option 2: materialize only vi, which is the least upper bound of vj and vk.
The total cost is
$$C_2 = f_u \, c_u(v_i) + f_{q_j} \, c_{q_j}(v_i) + f_{q_k} \, c_{q_k}(v_i)$$

The cost evaluation (cont.)
• For option 1, let fu=0.8, cu(vj)=100, fqj=0.5, cqj(vj)=100, cu(vk)=100, fqk=0.5, cqk(vk)=100. We get C1 = 0.8×100 + 0.5×100 + 0.8×100 + 0.5×100 = 260.
• For option 2, let fu=0.8 again. The update cost of vi will be larger (since the cardinality of vi is larger), say cu(vi)=120, and the query costs will also be larger (since additional aggregation is needed to answer the queries), say cqj(vi)=110 and cqk(vi)=110. We get C2 = 0.8×120 + 0.5×110 + 0.5×110 = 206.
So, with these values, option 2 is the better choice!

References
• Elena Baralis, Stefano Paraboschi and Ernest Teniente. Materialized view selection in a multidimensional database. Proceedings of the 23rd VLDB Conference, 1997.
• Dimitri Theodoratos and Timos Sellis. Designing Data Warehouses. 1999.