Improving the Performance of Data Cube Queries Using Families of
Statistics Trees
Joachim Hammer and Lixin Fu
Computer & Information Science & Engineering
University of Florida
Gainesville, Florida 32611-6120
{jhammer, lfu}@cise.ufl.edu
Abstract. We propose a new approach to speeding up the evaluation of cube queries, an important class of OLAP
queries which return aggregated values rather than sets of tuples. Specifically, in this paper we focus on those
cube queries which navigate over the abstraction hierarchies of the dimensions. Abstraction hierarchies are the
result of partitioning dimension values of a data cube into different levels of granularity. Our approach is based on
our previous version of CubiST (Cubing with Statistics Trees) which represents a drastic departure from existing
approaches in that it does not use the familiar view lattice to compute and materialize new views from existing
views. Instead, CubiST computes and stores all possible aggregate views in the leaves of a statistics tree during a
one-time scan of the detailed data. However, it uses this single statistics tree to answer all possible cube queries, which limits its efficiency. The
new version described here remedies this limitation by applying a greedy strategy to select a family of candidate
trees which represent superviews over the different hierarchical levels of the dimensions. In addition, we have
developed an algorithm to materialize the candidate trees in the family starting from the single statistics tree (base
tree). Given an input query, our new query evaluation algorithm selects the smallest tree in the family which can
provide the answer. This significantly reduces I/O time and improves in-memory performance over the original
CubiST. Experimental evaluations of our prototype implementation have demonstrated its superior run-time
performance and scalability when compared with existing MOLAP and ROLAP systems.
1 Introduction
Most OLAP queries involve aggregates of measures over arbitrary regions of the dimensions. This ad-hoc
“navigation” through the data cube helps users obtain a “big picture” view of the underlying data. However, on any
but the smallest databases, these kinds of queries will tax even the most sophisticated query processors since the
database size often grows to hundreds of GBs or TBs and the data in the tables is of high dimensionality with large
domain sizes.
In [6], Fu and Hammer describe an efficient algorithm (CubiST) for evaluating ad-hoc OLAP queries using so-called Statistics Trees (STs). CubiST can be used to evaluate queries that return (aggregate) values (e.g., What is the
percent difference in car sales between the southeast and northeast regions for the years 1990 through 2000?) rather
than the records that satisfy the query (e.g., Find all car dealers in the southeast who sold more cars in 1999 than in
1998). We have termed the former type of query a cube query. However, the original CubiST algorithm in [6] does not
efficiently support complex cube queries which involve abstraction hierarchies for multiple dimensions. Abstraction
hierarchies are the result of partitioning dimension values of a data cube into different levels of granularity (e.g., days
into weeks, quarters, and years).
In this paper, we present a new methodology called CubiST++ for evaluating ad-hoc cube queries, which is based on
the original version of CubiST but greatly expands its usefulness and efficiency for all types of cube queries.
CubiST++ specifically improves the evaluation of cube queries involving arbitrary navigation over the abstraction
hierarchies in a data cube, an important and frequent class of queries. Instead of using a single (base) tree to compute
all cube queries, our algorithms materialize a family of trees from which CubiST++ selects potentially smaller derived
trees to compute the answer. This reduces memory requirements and I/O, and improves processing speed.
2 Related Research
Research related to this work falls into three broad categories: OLAP servers (including ROLAP and MOLAP),
indexing, and view materialization.
ROLAP servers use the familiar “row-and-column view” to store the data in relational tables using a star or
snowflake schema design [4]. In the ROLAP approach, OLAP queries are translated into relational queries using the
standard relational operators such as selection, projection, relational join, group-by, etc. However, directly executing
translated SQL can be very inefficient. As a result, many commercial ROLAP servers extend SQL to support
important OLAP operations directly inside the relational engine (e.g., RISQL from Redbrick Warehouse [24], the
cube operator in Microsoft SQL Server [17]). MicroStrategy [18], Redbrick [23], and Informix's Metacube [12] are
examples of ROLAP servers.
MOLAP servers use proprietary data structures such as multidimensional arrays to store the data cubes. Arbor
Software's Essbase [2], Oracle Express [21], and Pilot LightShip [22] are based on MOLAP technology. When the
number of dimensions and their domain sizes increase, the data becomes very sparse. Storing sparse data in an array
directly is inefficient. A popular technique to deal with sparse data is chunking [27] in which the full cube (array) is
chunked into small pieces called cuboids.
Often specialized index structures are used in conjunction with OLAP servers to improve query performance.
When the domain sizes are small, a bitmap index structure [20] can be used to help speed up OLAP queries. However,
simple bitmap indexes are not efficient for large-cardinality domains and large range queries. In order to overcome
this deficiency, an encoded bitmap scheme has been proposed [3]. A well-defined encoding can reduce the complexity
of the retrieve function thus optimizing the computation. However, designing well-defined encoding algorithms
remains an open problem. A good alternative to encoded bitmaps for large domain sizes is the B-Tree index structure
[5]. O’Neil and Quass [19] provide an excellent overview of and detailed analyses for index structures which can be
used to speed up OLAP queries.
View materialization refers to the pre-computing of partial query results used to derive the answer for frequently
asked queries. Since it is impractical to materialize all possible views, view selection is an important research
problem. For example, [11] introduced a greedy algorithm for choosing a near-optimal subset of views from a view
materialization lattice. The computation of the materialized views, some of which depend on previously materialized
views in the lattice, can be expensive when the views are stored on disk. More recently, [9, 13, 14], for example,
have developed various algorithms for view selection in data warehouse environments.
Another optimization to processing OLAP queries using view materialization is to pipeline and overlap the
computation of group-by operations to amortize the disk reads, as proposed by [1]. Other related research in this area
has focused on indexing pre-computed aggregates [25] and incrementally maintaining them [16]. Also relevant is the
work on maintenance of materialized views (see [15] for a summary of excellent papers) and processing of
aggregation queries [8, 26]. However, in order to support truly ad-hoc OLAP queries, indexing and pre-computation of results alone will not suffice. For example, building an index for each attribute of the
warehouse or pre-computing every sub-cube requires too much space and results in unacceptable maintenance costs.
We believe that our approach, which is based on a new data structure together with view selection and query answering
algorithms, holds the answer to supporting ad-hoc OLAP queries efficiently.
3 Answering Cube Queries Using Statistics Trees
In this section, we briefly review CubiST and the statistics tree (ST) data structure on which our new approach is
based. Complete details can be found in [6]. An ST is a multi-way, balanced tree whose leaf nodes hold the
aggregate values for a set of records consisting of one or more attributes. By aggregate value we are referring to the
result of applying an aggregation function (e.g., count, min, max) to each combination of dimension values in the
data set. Leaf nodes are connected to facilitate retrieval of multiple aggregates. Each level in the tree (except the leaf
level) corresponds to an attribute. An internal node has one pointer for each domain value, and an additional “star”
pointer representing aggregation along the attribute (the star has the intuitive meaning “any”, analogous to the star in
Gray’s cube operator [7]). Internal nodes contain no data values.
In order to represent a data cube, an empty ST is created which has as many levels as there are dimensions in the
cube (plus one for the leaves). The size of the nodes on each level corresponds to the cardinality of the dimension
(plus one for the star pointer). Initially, the values in the leaf nodes are set to zero. Populating the tree is done during
a one-time scan of the data set: for each record in the set, the tree is traversed based on the dimension values in the
tuple and the leaf node at the end of the path is updated accordingly (e.g., incremented by one in the case of the count
function).
Figure 1 depicts a statistics tree corresponding to a data cube with three dimensions with cardinalities d1=2, d2=3,
and d3=4 respectively. The contents of the tree are shown after inserting the record (1,2,4) into the empty tree. Note
that the leaf nodes store the result of applying the aggregation function count to the data set, i.e., counting all
possible combinations of domain values. Further note that each tuple results in updates to eight different leaf nodes: at
each node the algorithm follows the pointer corresponding to the data value as well as the star pointer, resulting in 2^n
paths, where n is the number of dimensions (the number of internal levels of the tree). The update paths relating to the sample record are shown as solid thick lines in
Figure 1.
[Figure omitted: a four-level statistics tree; levels 1-3 are internal nodes for the three dimensions (fan-outs 3, 4, and 5, each including a star pointer), level 4 holds the leaf nodes, and the leaves on the eight update paths hold the value 1.]
Figure 1: Statistics tree after processing input record (1,2,4)
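
To make this concrete, the following is a minimal sketch in Python of an ST supporting the count function (an illustration only; names such as STNode are ours, not the paper's implementation, and the chaining of leaf nodes is omitted). Each inserted record follows both the value pointer and the star pointer at every internal level and therefore updates 2^n leaves:

class STNode:
    # Internal nodes hold one child per domain value plus a star ("any")
    # child; leaf nodes hold the aggregate value (here: a count).
    def __init__(self, cardinalities, level=0):
        self.level = level
        if level < len(cardinalities):
            self.children = [STNode(cardinalities, level + 1)
                             for _ in range(cardinalities[level] + 1)]
            self.count = None
        else:
            self.children = None   # leaf node
            self.count = 0

    def insert(self, record):
        # Follow the value pointer and the star pointer at every level,
        # updating all 2^n reachable leaves.
        if self.children is None:
            self.count += 1        # aggregation function: count
            return
        value = record[self.level]                # dimension values are 1-based
        self.children[value - 1].insert(record)  # value pointer
        self.children[-1].insert(record)         # star pointer

# The tree of Figure 1 (d1=2, d2=3, d3=4) after inserting record (1,2,4):
root = STNode([2, 3, 4])
root.insert((1, 2, 4))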
Once all the records are inserted, the ST is ready to answer queries. An ad-hoc cube query can be viewed as an
aggregation of the measure attributes over a region defined by a set of dimensional constraints. Formally, a cube query
q is a tuple defining the regions of the cube over which to aggregate: q = agg(si1, si2, ..., sir), where agg describes the
aggregation function to be used, {i1, i2, ..., ir} ⊆ {1, 2, ..., k}, r is the number of dimensions specified in the query, and k
is the total number of dimensions in the data set. Component sij describes a constraint for dimension ij, specifying the
selected values, which can either be a singleton, a range, or a partial set. For instance, query q = count(s1=4, s4=[1,3],
s5={1,3}), specifies three constraints on dimensions 1, 4, and 5 of the data set: “4” states that the selected value of the
first dimension must be the singleton value 4, “[1,3]” states that the desired values for the fourth dimension fall in the
range from 1 to 3, and “{1,3}” states that the selected values for the fifth dimension must be either 1 or 3 (partial
selection). The function count specifies that the total number of records satisfying the constraints be returned as
result. In order to satisfy the above query, CubiST uses the constraint values in q to follow the paths to the desired
leaves in the underlying ST. In the case of s1=4 this means following the fourth pointer of the node representing the
first dimension in the tree, and so forth. There are a total of six different paths since the query is composed of 1*3*2 = 6
partial results, one for each combination of the selected values of the three constrained dimensions. The partial results have been pre-computed during
the creation of the tree (using the function count). The sum of the values at the six leaf nodes forms the result. Details
about CubiST and how it can be integrated with relational warehouse systems are provided in [6].
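
Continuing the sketch above (again with our own hypothetical names), query evaluation walks only the constrained paths and sums the pre-computed leaf values, following the star pointer for every unconstrained dimension:

def evaluate(node, query):
    # query maps a 0-based dimension index to the list of selected domain
    # values; a dimension absent from query is unconstrained.
    if node.children is None:
        return node.count
    selected = query.get(node.level)
    if selected is None:
        return evaluate(node.children[-1], query)   # star pointer
    return sum(evaluate(node.children[v - 1], query) for v in selected)

# count(s1=1, s3=[2,4]) on the tree built above; dimension 2 is unconstrained:
print(evaluate(root, {0: [1], 2: [2, 3, 4]}))       # prints 1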
The original CubiST provides a useful framework for answering cube queries. However, CubiST does NOT
efficiently answer queries on data cubes containing hierarchies on the dimensional values. In those cases, one must
first transform the conditions of the query which are expressed using different levels of abstraction into conditions not
involving hierarchies, and then treat them as range queries or partial queries. Given the amount of data to be stored in
the tree, the underlying ST may be large and may not fit into memory all at once. In order to overcome this problem
we have developed CubiST++ ("CubiST plus Family") to automatically support abstraction hierarchies in the
underlying data set and to alleviate the memory problems of its predecessor. In the next sections we describe a suite of
algorithms for generating families of ST's and for using them to efficiently compute ad-hoc cube queries.
4 Generating Families of Statistics Trees
One way to reduce the cost of paging STs in and out of memory is to transmit only the leaves of the ST. The internal
nodes can be generated in memory without additional I/O. A more effective approach is to partition an ST into
multiple, smaller subtrees, each representing a certain part of the data. In most cases, cube queries can be answered
using one of the smaller subtrees rather than the single large ST. We term this single large ST, as defined in the
original version of CubiST, the base tree. Using this base tree, one can compute and materialize other statistics trees with
the same dimensions but containing different levels of the abstraction hierarchy for one or more of the dimensions.
We term these smaller trees derived trees. A base tree and its derived trees form a family of statistics trees with
respect to the base tree. Before we describe CubiST++, we first show how to formally represent hierarchies.
4.1 Hierarchical Partitioning and Mapping of Domain Values
In relational systems, dimensions containing hierarchies are stored in the familiar row-column format in which
different levels are represented as separate attributes. In our new framework, we regard the interdependent levels in a
hierarchy as one hierarchical attribute and map the values at different levels into integers. In this representation, a
value at a higher level of abstraction includes multiple values at a lower level. By scanning a dimension table, we can
establish the mappings, hierarchical partitions, as well as the inclusion relationships.
To illustrate, consider Table 1 which represents the location information in a data source. We first group the
records by columns Region, State and City, remove duplicates, and then map the domain values into integers using the
sorted order. The result is shown in Table 2 (numbers in parentheses are the integer assignments for the strings).
LocationID   City          State   Region
1            Gainesville   FL      SE
2            Atlanta       GA      SE
3            Los Angeles   CA      WC
4            Chicago       IL      MW
5            Miami         FL      SE
6            San Jose      CA      WC
7            Seattle       WA      WC
8            Twin City     MN      MW

Table 1: Original location data

Region   State    City
MW (1)   IL (1)   Chicago (1)
         MN (2)   Twin City (2)
SE (2)   FL (3)   Gainesville (3)
                  Miami (4)
         GA (4)   Atlanta (5)
WC (3)   CA (5)   Los Angeles (6)
                  San Jose (7)
         WA (6)   Seattle (8)

Table 2: Partitioning and mapping of location data
Table 3 is created by replacing the strings in Table 1 with their integer assignments. Mapping strings into integers
saves space and facilitates data processing internally. The compressed representation of the transformed tables reduces
the runtime of join operations during the initialization phase of the ST since most relational warehouses store
measures and dimension values in separate tables. Note that we only need to store Tables 2 and 3 because they allow
us to deduce the contents of Table 1.
LocationID   City   State   Region
1            3      3       2
2            5      4       2
3            6      5       3
4            1      1       1
5            4      3       2
6            7      5       3
7            8      6       3
8            2      2       1

Table 3: Location data after transformation
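
The grouping, duplicate elimination, and integer assignment just described can be sketched as follows (our own illustration; the dimension table is modeled as an in-memory list of (region, state, city) tuples):

def build_mapping(rows):
    # Sort by (region, state, city) and assign integers in order of first
    # appearance, which reproduces the level-wise codes of Table 2.
    region_code, state_code, city_code = {}, {}, {}
    for region, state, city in sorted(set(rows)):
        region_code.setdefault(region, len(region_code) + 1)
        state_code.setdefault(state, len(state_code) + 1)
        city_code.setdefault(city, len(city_code) + 1)
    return region_code, state_code, city_code

rows = [("SE", "FL", "Gainesville"), ("SE", "GA", "Atlanta"),
        ("WC", "CA", "Los Angeles"), ("MW", "IL", "Chicago"),
        ("SE", "FL", "Miami"), ("WC", "CA", "San Jose"),
        ("WC", "WA", "Seattle"), ("MW", "MN", "Twin City")]
regions, states, cities = build_mapping(rows)
# regions == {"MW": 1, "SE": 2, "WC": 3}; cities["Gainesville"] == 3, as in Table 2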
4.2 A Greedy Algorithm for Selecting Family Members
To generate a family of statistics trees, we first need an algorithm for choosing, from the set of all possible candidate
trees (i.e., trees containing all combinations of dimensions and corresponding hierarchy levels), those trees that will
eventually make up the family. Second, we need an algorithm for deriving the selected trees from the base tree.
In principle, any combination of hierarchy levels can be selected as a candidate tree. However, we need to consider
space limitations and maintenance overhead. We introduce a greedy algorithm to choose the members in a step-by-step fashion. During each step, starting from the base tree, we roll up the values of the largest dimension to its next
level in the hierarchy, keeping the other dimensions the same. This newly formed ST becomes a new member of the
family. The iteration continues until each dimension is rolled up to its highest level or until the size of the family
exceeds the maximum available space. The pseudo-code for this greedy selection mechanism is shown in Figure 2.
The total size of the derived trees is only a small fraction of the size of the base tree. In line 1, the original ST initialization
algorithm proposed in [6] is used. The roll-up algorithm of line 7 is discussed in Section 4.3.
1   Setup base tree T0 by scanning input data;
2   Save T0 in repository;
3   Save its parameters to Metadata Repository;
4   T = T0;
5   Repeat {
6     Find dimension with largest cardinality in T;
7     Rollup dimension to generate a new derived tree T;
8     Save T;
9     Save params of T to Metadata Repository;
10  } Until (stop conditions);

Figure 2: Greedy algorithm for generating a family of statistics trees
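
In Python, the greedy loop of Figure 2 might be sketched as follows (a sketch only: size, num_dims, can_roll_up, cardinality, and roll_up are hypothetical helpers standing in for the tree operations; the roll-up itself is the subject of Section 4.3):

def select_family(base_tree, max_size):
    # Greedily roll up the largest remaining dimension until every dimension
    # is at its highest hierarchy level or the space budget is exhausted.
    family = [base_tree]
    tree = base_tree
    while True:
        candidates = [d for d in range(tree.num_dims()) if tree.can_roll_up(d)]
        if not candidates:
            break                                  # all at the highest level
        d = max(candidates, key=tree.cardinality)  # largest dimension first
        tree = tree.roll_up(d)                     # see Section 4.3
        if sum(t.size() for t in family) + tree.size() > max_size:
            break                                  # family exceeds the budget
        family.append(tree)                        # save tree and its metadata
    return family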
Let us illustrate with an example. Suppose a base tree T0 represents a data set with three dimensions d1, d2, d3
whose cardinalities are 100 each. Each dimension is organized into ten groups of ten elements to form a two-level
hierarchy. We use a superscripted hash mark to indicate values corresponding to the second hierarchy level. For
example, 0# represents values 0 through 9, 1# represents values 10 through 19, etc. The degrees of the internal nodes of T0
are 101 for each level of the tree (the extra 1 accounts for the star pointer). The selected derived trees T1, T2, T3 that
make up a possible family together with T0 are as follows: T1 is the result of rolling up d3 in T0 (since all three
dimensions are of the same size, we pick one at random). T1 has degrees 101, 101, and 11. T2 is the result of rolling up
d2 in T1. Its degrees are 101, 11, and 11. T3 is the result of rolling up d1 in T2. Its degrees are 11, 11, and 11. The
family generated from T0 is shown in Figure 3.
[Figure omitted: the four trees arranged by hierarchy level; T0 has all of d1, d2, d3 at hierarchy level 0, while T1, T2, and T3 have successively more dimensions rolled up to level 1.]
Figure 3: A family of ST's consisting of {T0, T1, T2, T3}
4.3 Deriving a New Statistics Tree
Except for the base tree, which is computed from the initial data set, each derived tree is computed by rolling up one of
its dimensions. Before introducing our new tree derivation algorithm, we first define a new operator “⊕” on STs as
follows:
Definition: Two STs are isomorphic if they have the same structure except for the values of their leaves. The
summation of two isomorphic ST's S1, S2 is a new ST S that has the same structure as S1 or S2 but its leaf values are
the summation of the corresponding leaf values of S1 and S2. This relationship is denoted by S = S1 ⊕ S2.
To derive a new tree, proceed from the root to the level that is being rolled up. According to the mapping
relationship described in Section 4.1, recursively reorganize and merge the sub-trees for all the nodes in that level to
form new subtrees. For each node, adjust the degree and update its pointers to point to the newly created subtrees.
The best way to illustrate this tree derivation process is through a simple example. Suppose that during the
derivation, we need to roll-up a dimension that has nine values and forms a two-level hierarchy consisting of three
groups with three values each. A sample node N of this dimension with children S1,.., S9, S* is shown in the left of
Figure 4. We group and sum up its nine sub-trees into three new trees S1#, S2# and S3# each having three subtrees. The
new node N' with degree four is shown on the right side of the figure. The star sub-tree remains the same. Using the
definitions from above, we can see that the following relationships hold among the subtrees of nodes N and N':
S1# = S1 ⊕ S2 ⊕ S3, S2# = S4 ⊕ S5 ⊕ S6, and S3# = S7 ⊕ S8 ⊕ S9.
[Figure omitted: node N with children S1, ..., S9 and star child S* is rolled up into node N' of degree four with children S1#, S2#, S3# and the unchanged star child S*.]
Figure 4: Rolling-up one dimension
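
On the ST sketch from Section 3, the ⊕ operator and the roll-up of Figure 4 might be implemented as follows (our own illustration; groups gives the number of consecutive value subtrees merged into each new higher-level value, e.g. [3, 3, 3] for node N above):

import copy
from functools import reduce

def tree_sum(s1, s2):
    # The ⊕ operator: leaf-wise sum of two isomorphic STs.
    out = copy.deepcopy(s1)
    def add(a, b):
        if a.children is None:
            a.count += b.count
        else:
            for ca, cb in zip(a.children, b.children):
                add(ca, cb)
    add(out, s2)
    return out

def roll_up(tree, dim, groups):
    # Merge the value subtrees of every node at depth dim according to
    # groups; the star subtree is carried over unchanged (Figure 4).
    tree = copy.deepcopy(tree)
    def rec(node):
        if node.children is None:
            return
        if node.level == dim:
            merged, i = [], 0
            for g in groups:
                merged.append(reduce(tree_sum, node.children[i:i + g]))
                i += g
            node.children = merged + [node.children[-1]]  # keep star subtree
        else:
            for child in node.children:
                rec(child)
    rec(tree)
    return tree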
5 Answering Cube Queries Using a Family of ST's
Once the family is established and materialized, it can be used to answer cube queries as described in the following
sections.
5.1 Choosing the Optimal Derived Tree
Given a family of trees, potentially multiple trees can answer a given query; however, the smallest possible one is
preferred. The proper tree is the one with the highest level of abstraction that still matches the query. This tree
can be found using the matching scheme described below, which is based on a new concept and
data structure.
Definition: A view matrix (VM) is a matrix in which rows represent dimensions, columns represent the hierarchy
levels, and the entries are the labels of the ST's in the family. Intuitively speaking, a VM describes what views are
materialized in the hierarchies.
Continuing with our running example, T1 has fan-out structure 101*101*11. Its first two dimensions have not
been rolled-up (= level 0 of the abstraction hierarchy) whereas the third dimension has been rolled-up to level 1; hence
T1 is placed in rows 1 and 2 of column 1 and in row 3 of column 2, as shown in Figure 6.
              Level of Abstraction
Dimension   Roll-up level 0   Roll-up level 1
1           T0, T1, T2        T3
2           T0, T1            T2, T3
3           T0                T1, T2, T3

Figure 6: A sample view matrix
A VM is easily implemented as a stack of vectors, each recording, for one tree, the abstraction level of every
dimension. For example, ST T3, on top of the stack, has all of its dimensions rolled up to level 1. Removing (i.e., popping) a
layer from the stack is equivalent to peeling the right-most ST from the VM. The stack corresponding to the VM
above is shown in Figure 7.
T3   (1, 1, 1)   <- top of stack
T2   (0, 1, 1)
T1   (0, 0, 1)
T0   (0, 0, 0)

Figure 7: Stack representation of the VM
5.2 Rewriting and Answering the Query
Next, we introduce a new concept for representing hierarchies in cube queries.
Definition: A query hierarchy vector (QHV) is a vector (l1, l2, ..., lk), where li is the hierarchy level used by the query
for the ith dimension.
Definition: A QHV (l1, l2, ..., lk) matches view T in the VM iff li ≥ the column index of the ith row entry of T in the
VM (the column indices of T compose T's column index vector), for all i = 1, 2, ..., k. In other words, T's column index vector ≤ QHV, component-wise.
Definition: An optimal matching view with respect to a query is the view in the VM that has the highest index (i.e., the
smallest, most aggregated tree) and matches the query's QHV.
              Level of Abstraction
Dimension   Roll-up level 0   Roll-up level 1
1           T0, T1            —
2           T0, T1            —
3           T0                T1

Figure 8: VM after removing T2 and T3
Consider the sample cube query q = count(3#, 2, 5#). Its QHV is (1,0,1). Since the column vector for T3 is (1,1,1),
which does not match (1,0,1), T3 must be removed. In the same way, T2 must also be removed. The resulting VM is
shown in Figure 8. This time, the column index vector of T1 is (0,0,1) ≤ (1,0,1), i.e., T1 matches the QHV. So T1 is the
optimal matching view, which is used to answer q. The pseudo-code for finding the optimal matching tree is shown in
Figure 9.
Each query is evaluated based on its optimal matching ST. Before evaluation, the query must first be rewritten if
its QHV is not equal to the ST's column index vector. For example, query q = count(3#, 2, 5#) will be rewritten into
q' = count([30,39], 2, 5#) so that its new QHV (0,0,1) exactly equals the levels of the optimal ST T1. Then we compute the
query using T1 by applying the original query evaluation algorithm of CubiST as outlined in Section 3. The interested
reader is invited to refer to the technical report [10] for additional details on CubiST++.
BEGIN
1  Compute the QHV for the query;
2  Peel off the entries of the VM layer by layer from right to left
3  until the view matches the QHV;
4  Return selected tree;
END

Figure 9: Finding the smallest matching ST for a given query
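
Putting Figures 7 through 9 together, the matching test, the peeling loop, and the rewriting step can be sketched as follows (names are ours):

def matches(qhv, col_vector):
    # T matches iff its column index vector is component-wise <= the QHV.
    return all(c <= l for c, l in zip(col_vector, qhv))

def optimal_tree(vm_stack, qhv):
    # vm_stack holds (label, column index vector) pairs with the most
    # aggregated tree on top; peel layers until a view matches the QHV.
    for label, col_vector in vm_stack:
        if matches(qhv, col_vector):
            return label
    raise ValueError("unreachable: the base tree T0 matches every QHV")

vm_stack = [("T3", (1, 1, 1)), ("T2", (0, 1, 1)),
            ("T1", (0, 0, 1)), ("T0", (0, 0, 0))]
print(optimal_tree(vm_stack, (1, 0, 1)))   # q = count(3#, 2, 5#)  ->  "T1"
# Before evaluation, q is rewritten against T1's levels (0, 0, 1): with a
# factor of 10, the level-1 constraint 3# on dimension 1 becomes [30, 39].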
6 Performance Evaluation
We simulate a relational warehouse using synthetic data sets with the following parameters: r is the number of records
and k is the number of dimensions with cardinalities d1, d2, ..., dk. The elements in a data set with r rows and k columns
are uniformly distributed over the domains of the corresponding columns. The data set is stored as a
text file and serves as input to the different query evaluation algorithms. For each set of experiments, a set of
randomly generated queries is used. The testbed is implemented using a SUN ULTRA 10 workstation running Sun OS
5.6 with 90MB of available main memory.
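
A sketch of the synthetic data generator implied by this setup (our own naming; the value of column i is drawn uniformly from 1..di):

import random

def generate(r, cardinalities, path, seed=42):
    # Write r records, one per line; column i is uniform over 1..d_i.
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(r):
            record = [str(rng.randint(1, d)) for d in cardinalities]
            f.write(" ".join(record) + "\n")

# e.g. the data set of Section 6.2: five dimensions 10, 10, 15, 15, 15
generate(200000, [10, 10, 15, 15, 15], "warehouse.txt")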
6.1 CubiST++ vs. CubiST
In this experiment we show the effectiveness of using families of statistics trees (CubiST++). We have fixed the
number of records at r = 1,000,000. We also assume there are three dimensions, each of which has a two-level
hierarchy. To simplify the hierarchical inclusion relationships, we further assume that each higher-level value includes
an equal number of lower-level values; this number is referred to as the factor. The test query is q = count(s1) =
count([0,29]). For both approaches we compare the size of the STs (based on the number of leaves), the I/O time spent
on loading the ST from the ST repository, and the total query answering time. Table 4 summarizes the results, which
show a dramatic decrease in I/O time and query answering time over the course of the experiment in favor
of CubiST++.

Factors   Dimension Sizes   Number of Leaves   I/O Time (ms)   Total Time (ms)
1, 1, 1   60, 60, 60        226,981            16,818          25,387
3, 2, 2   20, 30, 30        20,181             1,517           1,655
4, 4, 3   15, 15, 20        5,376              384             446

Table 4: Sample runtimes for different derived trees
6.2 CubiST++ vs. Bitmap
In this experiment we compare the setup and response times of CubiST++ with those of two other approaches: a naive
query evaluation technique, henceforth referred to as scanning, and a bitmap-based query evaluation algorithm
referred to as bitmap. The bitmap-based query evaluation algorithm uses a bitmap index structure to answer cube
queries without touching the original data set once the bit vectors have been set up. In this series of experiments, the
domain sizes of the five dimensions in the data set are 10, 10, 15, 15, and 15 respectively. The goal is to validate the
claim that CubiST++ has excellent scalability in terms of the number of records in the set.
Figures 10 and 11 show the performance of the three algorithms on the same query:
q = count(s1, s2, s3) = count([2,8], [2,8], [5,10]). Although the setup time for CubiST++ is larger than that of Bitmap,
its response time is much faster and independent of the number of records.
Setup Time (ms) vs. Number of Records
            10K      100K     200K
Writing     1,211    11,527   23,125
Bitmap      1,045    10,039   21,660
CubiST++    16,961   32,776   50,786

Figure 10: Setup time vs. number of records

Response Time (ms) vs. Number of Records
            10K      100K     200K
Scanning    979      9,196    18,347
Bitmap      51       494      989
CubiST++    32       31       32

Figure 11: Response time vs. number of records

6.3 CubiST++ vs. Commercial RDBMS
In our last set of experiments we compare running times of CubiST++ against those of a commercial RDBMS. The
single relational table in the warehouse has 1,000,000 rows and 5 columns with domain sizes 10, 10, 10, 10, and 15.
First, we loaded the table and answered five randomly generated cube queries q1 to q5 using the commercial
ROLAP engine. Next, we created a view to group columns in support of the count operation. The view was
materialized and used to answer the same set of queries. The same queries were also processed by CubiST++.
Writing   Loading   View Setup   ST Setup
15        93        122          52

Figure 12: Setup times (in seconds)

            q1      q2      q3      q4      q5
W/o View    2,520   2,640   2,300   1,483   3,580
With View   376     370     290     153     533
ST          1       1       1       1       1

Figure 13: Query response times (in ms)
The results are shown in Figures 12 and 13. The column labeled "Writing" in Figure 12 refers to the time it takes to
write the entire table to disk (a comparison measure only). Notice that in Figure 13 the query execution times for
CubiST++ are around 1 ms and cannot be properly displayed at the relatively large scale of the chart. It is worth pointing out
that CubiST++ beats the setup time of the RDBMS. More importantly, its evaluation times are two orders of magnitude faster than
those posted by the RDBMS, even when the view is used.
7 Conclusion
In this paper, we have presented an overview of CubiST++, a new algorithm for efficiently answering cube queries that
involve constraints on arbitrary abstraction hierarchies. Starting from a single statistics tree, we select and derive an
appropriate family of statistics trees which are smaller than the base tree and, taken together, can answer most
ad-hoc cube queries while reducing the requirements on I/O and memory. We have proposed a new greedy algorithm
to choose family members and a new roll-up algorithm to compute derived trees. Using our matching algorithm, the
smallest tree that can answer a given query is chosen. Our experiments demonstrate the performance benefits of
CubiST++ over existing OLAP techniques.
References
[1] S. Agarwal, R. Agrawal, P. Deshpande, J. Naughton, S. Sarawagi, and R. Ramakrishnan, “On The Computation of Multidimensional Aggregates,” in Proceedings of the International Conference on Very Large Databases, Mumbai (Bombay), India, 1996.
[2] Arbor Systems, “Large-Scale Data Warehousing Using Hyperion Essbase OLAP Technology,” Arbor Systems, White Paper, http://www.hyperion.com/whitepapers.cfm.
[3] C. Y. Chan and Y. E. Ioannidis, “Bitmap Index Design and Evaluation,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, pp. 355-366, 1998.
[4] S. Chaudhuri and U. Dayal, “An Overview of Data Warehousing and OLAP Technology,” SIGMOD Record, 26:1, pp. 65-74, 1997.
[5] D. Comer, “The Ubiquitous B-Tree,” ACM Computing Surveys, 11:2, pp. 121-137, 1979.
[6] L. Fu and J. Hammer, “CubiST: A New Algorithm for Improving the Performance of Ad-hoc OLAP Queries,” in Proceedings of the ACM Third International Workshop on Data Warehousing and OLAP (DOLAP), Washington, DC, pp. 72-79, 2000.
[7] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals,” Data Mining and Knowledge Discovery, 1:1, pp. 29-53, 1997.
[8] A. Gupta, V. Harinarayan, and D. Quass, “Aggregate-query Processing in Data Warehousing Environments,” in Proceedings of the International Conference on Very Large Databases, Zurich, Switzerland, pp. 358-369, 1995.
[9] H. Gupta and I. Mumick, “Selection of Views to Materialize Under a Maintenance Cost Constraint,” Stanford University, Technical Report.
[10] J. Hammer and L. Fu, “Improving the Performance of Data Cube Queries Using Families of Statistics Trees,” University of Florida, Gainesville, FL, Research Report, January 2001, http://www.cise.ufl.edu/~jhammer/publications.html.
[11] V. Harinarayan, A. Rajaraman, and J. D. Ullman, “Implementing Data Cubes Efficiently,” SIGMOD Record (ACM Special Interest Group on Management of Data), 25:2, pp. 205-216, 1996.
[12] Informix Inc., “Informix MetaCube 4.2, Delivering the Most Flexible Business-Critical Decision Support Environments,” Informix, Menlo Park, CA, White Paper, http://www.informix.com/informix/products/tools/metacube/datasheet.htm.
[13] W. Labio, D. Quass, and B. Adelberg, “Physical Database Design for Data Warehouses,” in Proceedings of the International Conference on Data Engineering, Birmingham, England, pp. 277-288, 1997.
[14] M. Lee and J. Hammer, “Speeding Up Warehouse Physical Design Using A Randomized Algorithm,” in Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW '99), Heidelberg, Germany, 1999, ftp://ftp.dbcenter.cise.ufl.edu/Pub/publications/VSgenetic-full.doc/pdf.
[15] D. Lomet, ed., Bulletin of the Technical Committee on Data Engineering, vol. 18, IEEE Computer Society, 1995.
[16] Z. Michalewicz, Statistical and Scientific Databases, Ellis Horwood, 1992.
[17] Microsoft Corp., “Microsoft SQL Server,” Microsoft, Seattle, WA, White Paper, http://www.microsoft.com/federal/sql7/white.htm.
[18] MicroStrategy Inc., “The Case For Relational OLAP,” MicroStrategy, White Paper, http://www.microstrategy.com/publications/whitepapers/Case4Rolap/execsumm.HTM.
[19] P. O'Neil and D. Quass, “Improved Query Performance with Variant Indexes,” SIGMOD Record (ACM Special Interest Group on Management of Data), 26:2, pp. 38-49, 1997.
[20] P. E. O'Neil, “Model 204 Architecture and Performance,” in Proceedings of the 2nd International Workshop on High Performance Transaction Systems, Asilomar, CA, pp. 40-59, 1987.
[21] Oracle Corp., “Oracle Express OLAP Technology,” Web site, http://www.oracle.com/olap/index.html.
[22] Pilot Software Inc., “An Introduction to OLAP Multidimensional Terminology and Technology,” Pilot Software, Cambridge, MA, White Paper, http://www.pilotsw.com/olap/olap.htm.
[23] Redbrick Systems, “Aggregate Computation and Management,” Redbrick, Los Gatos, CA, White Paper, http://www.informix.com/informix/solutions/dw/redbrick/wpapers/redbrickvistawhitepaper.html.
[24] Redbrick Systems, “Decision-Makers, Business Data and RISQL,” Informix, Los Gatos, CA, White Paper, 1997, http://www.informix.com/informix/solutions/dw/redbrick/wpapers/risql.html.
[25] J. Srivastava, J. S. E. Tan, and V. Y. Lum, “TBSAM: An Access Method for Efficient Processing of Statistical Queries,” IEEE Transactions on Knowledge and Data Engineering, 1:4, pp. 414-423, 1989.
[26] W. P. Yan and P. Larson, “Eager Aggregation and Lazy Aggregation,” in Proceedings of the International Conference on Very Large Databases, Zurich, Switzerland, pp. 345-357, 1995.
[27] Y. Zhao, P. M. Deshpande, and J. F. Naughton, “An Array-Based Algorithm for Simultaneous Multidimensional Aggregates,” SIGMOD Record (ACM Special Interest Group on Management of Data), 26:2, pp. 159-170, 1997.