Answering Data Cube Queries Using Families of Statistics Trees
Lixin Fu and Joachim Hammer
Department of Computer & Information Sciences and Engineering
University of Florida
Gainesville, Florida 32611-6120
{lfu, jhammer}@cise.ufl.edu
Abstract
In data warehousing and OLAP, the modeling and implementation of the data cube
operation have been a major concern. A cube query computes aggregates of
measures over an arbitrary combination of dimensions in a relational warehouse and
often requires multiple table joins and multiple scans. Speeding up cube
computation therefore has a great influence on decision support systems (DSS). The CubiST
(CUBing with Statistics Trees) algorithm evaluates an ad-hoc cube query using new
statistics tree structures, which are initialized by scanning the original table.
In this paper, we present further cube query optimizations over CubiST. In
particular, the data cube operator is generalized so that arbitrary navigation over
the abstraction hierarchies of different dimensions becomes much more efficient. We view
hierarchies as iterative, layered partitionings of low-level values into high-level values
after mapping them to integers. Instead of using only flat statistics trees to compute
queries, we choose potentially much smaller trees from families of
statistics trees, thus improving query performance significantly.
Experiments demonstrate the effectiveness of these optimizations and show superior
performance compared to other alternatives.
1. Introduction
OLAP (On-line Analytical Processing) [6, 7] and data warehousing have been active research areas. A
warehouse contains data from a number of independent sources, integrated and cleansed to support clients
who wish to analyze the data for trends and anomalies. The decision support is provided by OLAP tools
which present their users with a multi-dimensional perspective of the data in the warehouse and facilitate
the writing of reports involving aggregations along the various dimensions of the data set [8].
In the multi-dimensional data model, data is structured around measures and dimensions. Measures are
numerical data being tracked (e.g. sales). Dimensions are the natural business parameters that define the
individual transactions (e.g. time, location, product). Some dimensions may have hierarchies. For example,
time may have a “day→month→year” hierarchy. OLAP queries select data sets called subcubes that are
represented in multi-dimensional regions. Slicing, dicing, rolling-up, drilling-down, and pivoting are
typical OLAP operators. The data cube operator, which was proposed in [12], contains these operators and
generalizes aggregates, subtotals, cross tabulations, and group-bys.
Most OLAP queries involve aggregates of measures over arbitrary regions of the dimensions. This ad-hoc
aggregate navigation helps users to obtain the “big picture” of underlying data and generate or verify user
hypotheses. In our recent paper [ref], we gave these aggregate-aware queries a new term, cube
queries (a.k.a. cube operations or cubing), and provided a new efficient cubing algorithm called CubiST
(CUBing with Statistics Trees).
A cube query computes the aggregates of measures over an arbitrary combination of dimensions in a
relational warehouse. The cube query generalizes the cube operator in that each selected dimension set in
the query can be a value, a range, or an arbitrary subset of domains. For example, a data warehouse
containing sales information for automobiles across the US may be used to answer a cube query such as
“How many red Toyota cars have been sold in Florida and Georgia between 1990 and 2000?” Answering
this query requires the selection of tuples satisfying the specified conditions as well as the aggregation of
the sales data along the location, time, and product attributes. On any but the smallest databases, these
kinds of queries will tax even the most sophisticated query processors. Such queries often need multiple
table joins and multiple scans. In the data warehouse environment, implementing cube queries over large databases
poses great challenges. The database size often grows to hundreds of GBs or TBs with millions or even
billions of records with high dimensionality and large domain sizes. There are many algorithms for
evaluating OLAP and cube queries, but no existing indexing or query optimization technique performs sufficiently
well for high-dimensional data [4]. Current systems (e.g. ROLAP, MOLAP, indexing, materialized views,
etc.) suffer from the fact that the types of supported OLAP queries must be known beforehand in order to
set up the appropriate index structures and assure that relevant partial results have been pre-computed. This
limits the analysts’ ability to navigate freely through the entire data set, in order to get a high-level
overview of possible trends and patterns. Speeding up cube computation therefore has a great influence on
decision support systems (DSS).
CubiST evaluates an ad-hoc cube query based on a novel data structure called statistics trees (STs). Simply
speaking, a statistics tree is a multi-way tree in which internal nodes contain references to next-level nodes
and are used to direct the query evaluation. Leaf nodes hold the statistics or histograms for the data (e.g.,
SUM, COUNT, MIN, MAX values) and are linked together to facilitate scanning, similar to the B/B+-Tree data
structure [9]. In order to use an ST to answer cube queries over a particular data set, one must
first pre-compute the aggregations on all subcubes by scanning the detailed data set: while reading each
record, we walk through the tree according to its column values and update the aggregate values in the leaves.
CubiST computes the aggregation information rather than listing the details of the records that satisfy the
query conditions. Duplicate records are aggregated into the same leaves. In this way, many operations
performed on records, such as sorting and hashing, are eliminated.
In this paper, we present further cube query optimizations over CubiST. In particular, the data cube
operator is generalized so that arbitrary navigation over the abstraction hierarchies of different
dimensions becomes much more efficient. We view hierarchies as iterative, layered partitionings of low-level
values into high-level values after mapping them to integers. Instead of using only flat statistics trees to
compute queries, we choose potentially much smaller trees from families of statistics trees, thus
improving query performance significantly. The main contribution of this paper is a comprehensive new
methodology for answering cube queries with arbitrary-level aggregations over any subset of the underlying
dimensions of the data cube.
The remainder of the paper is organized as follows. Section 2 reviews current and past research activities
related to the work presented here, focusing chiefly on OLAP query processing, indexing and view
materialization in data warehouses. In section 3, we briefly introduce statistics trees and CubiST in
preparation for the new algorithms presented later in the paper. Section 4 presents our new view of
hierarchies and mappings. Section 5 describes the generation of families of statistics trees. We give our
cube query optimization algorithm using these families in section 6. A description of our experimental system
and the results of our evaluation are presented in section 7. A summary and concluding remarks are
presented in section 8.
2. State-of-the-Art
Research related to this work falls into three broad categories: OLAP servers including ROLAP and
MOLAP, indexing, and view materialization in data warehousing.
2.1. ROLAP Servers
ROLAP servers use the familiar “row-and-column view” to store the data in relational tables using a star or
snowflake schema design [7]. In the star schema, there is a fact table plus one or more dimension tables.
The snowflake schema is a generalization of the star schema where the core dimensions have aggregation
levels of different granularities. In the ROLAP approach, cube queries are translated into relational queries
against the underlying star or snowflake schema using the standard relational operators such as selection,
projection, relational join, group-by, etc. However, directly executing translated SQL can be very
inefficient and as a result, many commercial ROLAP servers extend SQL to support important OLAP
operations directly (e.g., RISQL from Redbrick Warehouse [30], cube operator in Microsoft SQL Server
[23]).
A simple algorithm, the 2^N-algorithm, for evaluating the cube operator is proposed in [12]. In this algorithm,
where N is the number of dimensions, a handle is allocated for each cell of the data cube. For each
new record (x1,x2,…,xN,v) the handle function is called 2^N times – once for each handle of each cell of the
cube matching this value. Here, the xi are the dimension values and v is the measure. When all input values
have been processed, the final aggregate for each of the nodes in the cube is computed. Due to the large
number of handles, this algorithm does not scale well for large N.
To speed up the group-by’s, indices and materialized views are widely used. As far as we know, there is no
internal ROLAP algorithm in the literature for evaluating cube queries efficiently. MicroStrategy [24],
Redbrick [29], Informix's Metacube [18] and Information Advantage [17] are examples of ROLAP servers.
2.2. MOLAP Servers
MOLAP servers use proprietary data structures such as multidimensional arrays to store the data
cubes. MOLAP is often several orders of magnitude faster than the ROLAP alternative when the dimensionality and
domain size are relatively small compared to the available memory. However, when the number of
dimensions and their domain sizes increase, the data becomes very sparse resulting in many empty cells in
the array structure (especially cells containing high dimensional data). Storing sparse data in an array in this
fashion is inefficient.
A popular technique to deal with sparse data is chunking [33]. The full cube (array) is chunked into
small pieces called cuboids. For each non-empty cell, an (OffsetInChunk, data) pair is stored. Zhao et al.
describe a single-pass, multi-way algorithm that overlaps the different group-by computations to minimize
the memory requirement. The authors also give a lower-bound for the memory which is required by the
minimal memory spanning tree (MMST) of the optimal dimension order (which increases with the domain
sizes of these dimensions). Their performance evaluations show that a MOLAP server using an appropriate
chunk-offset compression algorithm is much faster than most ROLAP servers. However, if there is not
enough memory to hold the MMST, several passes over the input data are needed. In the first read-write
pass, data is partitioned. In the second read-write pass, the partitions are clustered further into chunks.
Additional passes may be needed to compute all aggregates in the MMST execution plan. In this case, the
initialization time may be prohibitively large. In addition, since the materialized views reside on disk,
answering OLAP queries may require multiple disk I/Os.
To address the scalability problem of MOLAP, Goil and Choudhary proposed a parallel MOLAP
infrastructure called PARSIMONY [10, 11]. Their algorithm incorporates chunking, data compression,
view optimization using a lattice framework, as well as data partitioning and parallelism. The chunks can
be stored as multi-dimensional arrays or (OffsetInChunk,data) pairs depending on whether they are
dense or sparse. The OffsetInChunk is bit-encoded (BESS). However, like other MOLAP implementations,
the algorithm still suffers from high I/O costs during aggregation because of frequent paging operations that
are necessary to access the underlying data.
In general, ROLAP is more scalable in terms of the data size, while MOLAP has better performance when
the number of dimensions is small. However, the decision of whether to use ROLAP or MOLAP does not
only depend on the original data size but also on the data volume, which is defined as the product of the
cardinalities. To illustrate our point, consider a database with 50 billion records and three dimensions, each
having 100 values. Suppose that each row needs 4*4 bytes (assuming an integer uses 4 bytes and the table
has one measure); then the data size is 16*50*10^9 = 800GB, but the data volume is only 100*100*100 = 1M. In
this case, MOLAP would be a better choice.
Arbor software's Essbase [3], Oracle Express [27] and Pilot LightShip [28] are based on MOLAP
technology. The latest trend is to combine ROLAP and MOLAP in order to take advantage of the best of
both worlds. For example, in PARSIMONY, some of the operations within sparse chunks are relational
while operations between chunks are multidimensional.
2.3. Work on Indexing
Specialized index structures are another way to improve the performance of OLAP queries. The use of
complex index structures is made possible by the fact that the data warehouse is a “read-mostly”
environment in which updates are performed in large batch processes. This allows time for reorganizing the
data and indexes to a new optimal clustered form.
When the domain sizes are small, a bitmap index structure [26] can be used to help speed up OLAP
queries. A bitmap index for a dimension with m values generates m bitmaps (bit vectors) of length N, where
N is the number of records in the underlying table. To initialize a bitmap index on a particular attribute
(dimension) of a table, we set the bits in each bitmap as follows: for each record we indicate the occurrence
of a particular value with a 1 in the same row of the bitmap that represents the value; the bits in all other
bitmaps for this row will be set to 0.
Bitmap indexes use bit-wise logical AND, OR, and NOT operations to speed up the computation of the where-clause
predicates in queries. However, simple bitmap indexes are not efficient for large-cardinality domains
and large range queries. In order to overcome this deficiency, an encoded bitmap scheme has been
proposed [5]. Suppose a dimension has 1,024 values. Instead of using 1,024 bit vectors, most rows of which
are zero, log2 1024 = 10 bit vectors are used, plus a mapping table and a Boolean retrieve function. A
well-defined encoding can reduce the complexity of the retrieve function, thus optimizing the computation.
However, designing well-defined encoding algorithms remains an open problem.
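To make the bitmap technique concrete, the following Python sketch (our illustration, not taken from [26] or [5]) builds a simple bitmap index for one column and combines two bitmaps with a bit-wise OR; the toy column and domain size are hypothetical.

    def build_bitmap_index(column, domain_size):
        # One bit vector per domain value (here a Python int used as a bit
        # string); bit r of bitmap v is 1 iff record r has value v.
        bitmaps = [0] * domain_size
        for r, v in enumerate(column):
            bitmaps[v] |= 1 << r
        return bitmaps

    # toy column with values drawn from a domain of size 3
    col = [0, 2, 1, 2, 0]
    bm = build_bitmap_index(col, 3)
    # records whose value is 0 OR 1, via bit-wise OR of the two bitmaps
    matches = bm[0] | bm[1]
    print(bin(matches))          # 0b10101 -> records 0, 2 and 4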
Bitmap schemes are a powerful means to evaluate complex OLAP queries when the number of records is
small enough so that the entire bitmap fits into main memory. If not all of the bitmaps fit into memory (e.g.,
when the number of records is large), query processing will require many I/O operations. Even when all the
bitmaps fit into memory, the runtime of an algorithm using bitmap indexes is proportional to the number of
records in the table. Later in the paper, we show that the runtime of a bitmap-based algorithm is much
larger than the runtime of CubiST, which has a worst-case runtime proportional to the number of
dimensions of the data cube.
A good alternative to encoded bitmaps for large domain sizes is the B-Tree index structure [9]. O’Neil and
Quass [25] provide an excellent overview of and detailed analyses for index structures which can be used to
speed up OLAP queries.
2.4. View Materialization
View materialization in decision support systems refers to the pre-computing of partial query results which
may be used to derive the answer for frequently asked queries. Since it is impractical to materialize all
possible views, view selection is an important research problem. For example, [16] introduced a greedy
algorithm for choosing a near-optimal subset of views from a view materialization lattice based on user-specified criteria such as available space, number of views, etc. In their approach, the next view to be
materialized is chosen such that its benefit is maximal among all the non-materialized views. The
computation of the materialized views, some of which depend on previously materialized views in the
lattice, can be expensive when the views are stored on disk. More recently [14, 19, 20], for example,
developed various algorithms for view selection in data warehouse environments.
Another optimization to processing OLAP queries using view materialization is to pipeline and overlap the
computation of group-by operations to amortize the disk reads, as proposed by [1]. To illustrate, assume a
table has four dimensions A, B, C, and D. The computation of the following group-by operations can be
pipelined: ABCD → ABC → AB → A (i.e., from view ABCD we can compute view ABC, from which
view AB can be computed, and so on). When a data partition is computed, the memory can be reused for the
next partition. Interleaved with the pipelined aggregations are sorting or hashing operations on some nodes
on the minimal cost processing tree. When the data set is large, external sorting is required which often
needs multiple passes and is very expensive.
Other related research in this area has focused on indexing pre-computed aggregates [31] and incrementally
maintaining them [22]. Also relevant is the work on maintenance of materialized views (see [21] for a
summary of excellent papers) and processing of aggregation queries [13, 32].
However, in order to be able to support true ad-hoc OLAP queries, indexing and pre-computation of results
alone will not produce good results. For example, building an index for each attribute of the warehouse or
pre-computing every subcube requires a lot of space and results in high maintenance costs. On the other hand, if
we index only some of the dimensions or pre-compute only a few views, queries for which no indexes or views
are available will be slow. In the remainder of this paper we introduce a new approach which may bring us
closer to the goal of being able to answer ad-hoc OLAP queries efficiently.
3. Answering Cube Queries Using Statistics Trees
Before describing our new algorithms, which are designed to improve CubiST for hierarchies, we first
briefly introduce in this section the statistics trees and the CubiST algorithm as necessary preparation for
later sections. Many details can be found in [ref].
3.1. Data Structure and Maintenance of ST
A Statistics Tree (ST) is a multi-way balanced tree structure which holds statistics or aggregate information
(e.g., SUM, AVG, MIN, MAX) for one or more attributes over a set of records (e.g., tuples in a relational
table). Assume R to be a relational table with attributes A1,A2,…,Ak with cardinalities d1,d2,…,dk
respectively. The height of the statistics tree for R is k+1, where each level in the tree (except the leaf level)
corresponds to an attribute in R. The root is at level 1. The fan-out (degree) of an internal node at level j is
d_j + 1, where j = 1,2,...,k. The first d_j pointers point to the subtrees which store information for the j-th column
value of the input data. The (d_j + 1)-th pointer is called the star pointer; it leads to a region in the tree where
this domain has been “collapsed,” meaning it contains all of the domain values for this dimension. This
“collapsed” domain is related to the definition of super-aggregate (a.k.a “ALL”) presented in Gray et al.
[12]. The leaf nodes at level k+1 contain the aggregation information and form a linked list.
Without loss of generality, we use the COUNT aggregate to demonstrate how to populate and maintain a
Statistics Tree. STs for other aggregate operations such as SUM, MIN, MAX, etc. can be processed
similarly. Given the number of dimensions and their cardinalities in the input data, initially, an empty
statistics tree is constructed by creating the nodes and pointers that form the entire tree structure described
above, with the count values in the leaf nodes being zero. Then, we scan the relational data set record by
record, using the attribute values to update the aggregates in the statistics tree with the recursive procedure
update_count(). For each record in the input set, the update procedure descends into the tree as follows:
starting at the root, for each component x_i of the input record x = (x_1, x_2, …, x_k), where i indicates the current
level in the tree, follow the x_i-th pointer as well as the star pointer to the two nodes at the next-lower level.
When reaching the nodes at level k (after repeated calls to update_count()), increment the count values of
the two leaves following the x_k-th pointer and the star pointer. Repeat for each record until all input records
have been processed in this fashion. A pseudo-code description of the algorithm is shown in Figure 1.
update_count(Node n, record x, int level) {
    IF level == k THEN
        increase count field of the x_k-th child of node n;
        increase count field of the child following the star pointer;
        return;
    level := level + 1;
    update_count(x_{level-1}-th child of n, x, level);
    update_count(child of n following the star pointer, x, level);
}  // end update_count

WHILE ( more records ) DO
    read next record x;
    update_count(root, x, 1);

Figure 1: Recursive algorithm for updating aggregates in a statistics tree.
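As a concrete illustration of the update procedure (a minimal sketch of ours, not the authors' implementation), the following Python code keeps only the leaf level of a COUNT statistics tree in a dictionary keyed by leaf paths; the index card[i] stands in for the star pointer of dimension i, and dimension values are assumed to be 0-based.

    from itertools import product

    class StatisticsTree:
        # Minimal COUNT-only sketch: leaves are kept in a dictionary keyed by
        # the path of child indices; the index card[i] plays the role of the
        # star ("ALL") pointer of dimension i.  Values are assumed 0-based.
        def __init__(self, cardinalities):
            self.card = list(cardinalities)      # d_1, ..., d_k
            self.k = len(self.card)
            self.leaves = {}

        def update_count(self, record):
            # Each record updates 2^k leaves: for every dimension we follow
            # both the value pointer and the star pointer (cf. Figure 1).
            choices = [(x, d) for x, d in zip(record, self.card)]
            for path in product(*choices):
                self.leaves[path] = self.leaves.get(path, 0) + 1

    # the setting of Figure 2: cardinalities 2, 3, 4; record (1,2,4) becomes
    # (0,1,3) with 0-based values
    st = StatisticsTree([2, 3, 4])
    st.update_count((0, 1, 3))
    print(st.leaves[(2, 3, 4)])    # the all-star leaf counts every record -> 1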
Figure 2 depicts the contents of the statistics tree after inserting the first input record (1,2,4) into the empty
tree corresponding to a relational table with three dimensions A1, A2, A3 with cardinalities d1=2, d2=3, and
d3=4, respectively. The update paths relating to this record are indicated as dashed lines. To improve
readability, we omit from the picture the pointers and nodes that do not play a role in this update.
Figure 2: Statistics tree after processing input record (1,2,4).
Statistics trees differ from multidimensional arrays in that a multi-dimensional array does not have a star
pointer, although the extension “ALL” to the domain of a dimension attribute has been used in the query
model and summary table described in [12]. Hence one can regard the statistics tree as a generalization of
the multidimensional array data structure. A B-Tree is an index structure for one attribute (dimension) only, as
opposed to a statistics tree, which can contain aggregates for multiple dimensions. In addition, in the B-Tree
the degree of the internal nodes is restricted. The statistics tree, on the other hand, is naturally
balanced (it is always a full tree) and its height depends on the number of dimensions but is independent of
the number of records and of the data input order.
3.2. Formal Representation of Cube Queries and CubiST Algorithm
The conceptual model in this paper is the multidimensional database model [2]. A cell of a k-dimensional
data cube with attributes A_1, A_2, ..., A_k with cardinalities d_1, d_2, ..., d_k, respectively, is the smallest
full-dimensional cube (i.e., a cube containing no “ALL” values) seated at a point P = (x_1, x_2, ..., x_k),
x_i ∈ [1..d_i]. The total number of cells in a k-dimensional data cube is ∏_{i=1}^{k} d_i. All these cells form a
set C. An ad-hoc cube query can be viewed as an aggregation of the measure attributes over a subset of C.
Formally, a cube query q is a tuple that defines the regions to aggregate on: q = (s_{i_1}, s_{i_2}, ..., s_{i_r}),
where {i_1, i_2, ..., i_r} ⊆ {1, 2, ..., k} and r is the number of dimensions specified in the query. Each
selection s_{i_j} can be one of the following (in decreasing order of generality); let w = i_j:
1. A partial selection {t_1, t_2, ..., t_m}, 2 ≤ m ≤ d_w, t_i ∈ {1, 2, ..., d_w}, specifying any subset of the
domain values of dimension i_j.
2. A range selection [a, b], specifying a contiguous range in the domain of dimension i_j, a, b ∈ [1..d_w],
[a, b] ⊆ [1, d_w].
3. A singleton value a, a ∈ [1..d_w].
If r = k, we say the query is in normal form. A query can be transformed into its normal form by adding
collapsed dimensions in the query using “ALL” values.
Continuing our car sales example from the beginning, assume that the domain of manufacturer is {Ford,
GM, Honda, Toyota}, of color is {blue, red, white, black}, of style is {sports car, sedan, SUV}, of time is
{Jan, Feb, …, Dec} and of location is {Florida, Alabama, Georgia, South Carolina, Tennessee}. Then a
formal representation of the sample cube query “How many Toyota cars have been sold from January to
March this year in Florida and Georgia?” is q = (s_1, s_4, s_5) = (4, [1,3], {1,3}). Its normal form is (4, ALL,
ALL, [1,3], {1,3}).
INT cubist(Node n, query q, int level) {
    IF level == k THEN
        count := summation of the counts of the z_{k,1}-th, z_{k,2}-th, ... children of n;
        return count;
    level := level + 1;
    count := cubist(z_{level-1,1}-th child of n, q, level) +
             cubist(z_{level-1,2}-th child of n, q, level) + ...;
    return count;
}  // end cubist

Total count in the query regions := cubist(root, q, 1);

Figure 3: Recursive algorithm for evaluating cube queries.
The statistics tree is capable of answering complex cube queries. Figure 3 shows a sketch of our recursive
CubiST query-answering algorithm. It assumes that the input query is in its normal form:
q = {z_{i,j} ∈ [1..d_i] | i = 1, 2, ..., k; j = 1, 2, ..., d_i + 1}. Here z_{i,j} refers to the j-th selected value of the
i-th dimension. cubist is a recursive function that returns the aggregate for a cube query q.
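Assuming the same flat leaf representation as in the sketch after Figure 1, the recursion of Figure 3 reduces to summing the leaves addressed by the cross product of the per-dimension selections, with ALL replaced by the star index. The following Python sketch is our illustration; the names and toy data are hypothetical.

    from itertools import product

    def cubist(leaves, card, query):
        # query is in normal form: one entry per dimension, either the string
        # "ALL" or an iterable of selected 0-based values (a singleton, a
        # range expanded to a list, or an arbitrary subset).
        selections = []
        for d, sel in zip(card, query):
            selections.append([d] if sel == "ALL" else list(sel))  # d = star index
        return sum(leaves.get(path, 0) for path in product(*selections))

    # tiny example over cardinalities (2, 3, 4) with two records
    card, leaves = [2, 3, 4], {}
    for rec in [(0, 1, 3), (1, 1, 0)]:
        for path in product(*[(x, d) for x, d in zip(rec, card)]):
            leaves[path] = leaves.get(path, 0) + 1
    print(cubist(leaves, card, [[0, 1], "ALL", [0, 3]]))   # -> 2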
k
The amount of memory that is need to store the Statistics Tree is bounded by: c
 (d
i 1
i
 1)  M , where M
is the size of memory available to store the tree, the constant value c accounts for the space that is needed
to store the internal nodes of the tree. Please note that c is typically around 2 or 3 because the number of
1
A cube containing no “ALL” values.
9
internal nodes is no larger than twice the number of leaves. The update time of ST for each record is
O( 2 k ). The runtime of answering a singleton query is O(k), where k is the number of dimensions. The
worst case runtime of answering an arbitrary query is O(size of ST).
CubiST is particularly useful when the number of records is large and the aggregated data volume does
not exceed the memory size. For example, this is the case in data marts where the dimensionality of the
data is low, yet the number of tuples is large enough to warrant the use of special query processing
algorithms such as CubiST. CubiST is also well suited for large data warehouses to provide a high-level
“view” over the data in order to decide where to start drilling down; in this scenario there are many
dimensions with large domain sizes. Finally, with the continued increase in available memory sizes, more
and more cube queries can be answered using STs. We introduce a new data warehouse system
architecture in the next section that achieves good performance for cube queries when combined with
CubiST.
3.3. System Architecture Integrating with CubiST
The above algorithms require that the particular ST be in memory. In many cases, having one ST to
answer ad-hoc cube queries over all dimensions is not sufficient, especially when the queries involve many
dimensions with large domain sizes. For this case, we propose a data warehousing architecture that
integrates with CubiST. To support efficient cube query processing over large data sets, we use multiple
STs. Conceptually, we can join all the dimensions with the fact table along the foreign keys into a single
wide table. Table 1 displays a single-table schema for the car sales example.
Maker:     m-name, m-loc, m-date
Product:   color, style, #cyl
Customer:  c-name, age, income
Sale:      s-name, s-loc, s-date
Measures:  sales, invoice

Table 1: Sample 1-table schema for car sales.
The warehouse administrator can specify the groups of dimensions to setup STs on according to domain
knowledge or user query patterns. In general, attributes from the same source (i.e. the same dimension table
in star schema) are grouped together, since queries are more likely to request aggregates over attributes
from the same group. STs are stored in the ST Repository which is shown in the upper-left hand corner of
Figure 4. Using the ST Manager, shown directly above the data warehouse, STs are selected from the
repository to match the incoming cube queries: the dimensions and aggregate operators specified in the
query must match those of the stored STs. When the ST Manager receives a query submission, it checks whether there
is an ST that matches the query. If there is, the query is evaluated using this ST. Otherwise, a new ST is set
up and placed into the ST Repository. The parameters of the ST (i.e., number of dimensions, their
cardinalities, and the aggregate operators used to compute the contents of the leaves) are stored in the metadata
repository. At the same time, the usage statistics of the STs are recorded. Due to space constraints and
maintenance overhead, large and less frequently queried STs may be deleted. To improve the hit ratio
of queries against existing STs, small, frequently matched STs are periodically merged to form one larger
ST, memory size permitting. The rationale behind merging is that the attributes from frequently queried
STs are more likely to be queried together later.
Figure 4: Warehouse architecture with CubiST.
In the same figure, the familiar back-end, consisting of the source data extractors, the warehouse integrator, and
the warehouse, is responsible for accessing, cleaning, integrating and storing the potentially heterogeneous data
from the underlying data sources. Data in the warehouse is typically stored in the form of a star or
snowflake schema with materialized views and indexes as optional support structures.
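The matching step performed by the ST Manager can be pictured with the following Python sketch; it is a simplified illustration of ours (the keys, names and the subset-based notion of a match are assumptions, not the authors' implementation).

    # Hypothetical sketch of the ST Manager's lookup: an ST is considered
    # reusable for a query if it covers the queried dimensions and was built
    # with the same aggregate operator.  Keys and names are illustrative only.
    class STManager:
        def __init__(self):
            self.repository = {}     # (frozenset(dims), op) -> leaf data
            self.usage = {}          # usage statistics for eviction decisions

        def register(self, dims, op, leaves):
            self.repository[(frozenset(dims), op)] = leaves

        def find(self, query_dims, op):
            for (dims, stored_op), leaves in self.repository.items():
                if stored_op == op and set(query_dims) <= dims:
                    key = (dims, stored_op)
                    self.usage[key] = self.usage.get(key, 0) + 1
                    return leaves
            return None              # caller would build and register a new ST

    mgr = STManager()
    mgr.register(["time", "location", "product"], "COUNT", {})
    print(mgr.find(["time", "product"], "COUNT") is not None)   # True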
4. Hierarchies
CubiST sets up a new framework for answering cube queries efficiently. However, it does not address an
important issue appearing frequently in DSS: hierarchies. Many dimensions, e.g., time, location, and product,
have concept hierarchies with multiple abstraction levels, where different levels reflect different degrees of
granularity of the concepts. For example, a “day→month→year” hierarchy reflects different granularities of the time
dimension. To roll up and drill down, users often submit queries that have constraints on different levels of
the hierarchies in addition to constraints on different combinations of the dimensions. This new requirement adds
another level of complexity to OLAP systems, but it also provides great optimization opportunities to improve
the performance of these queries. From this section on, we focus on these optimizations. We start with
our discussion of hierarchies.
4.1. Hierarchical Partitioning and Mapping of the Domains
Usually, the dimensions with hierarchies are viewed as common tables with the different levels as their
attributes. These attributes are processed in the same way as other non-hierarchical simple attributes. The
dimensions are joined with the fact table along the foreign keys, and then selections, projections and
aggregations (group by) are performed to compute the final results. In our new framework, we use quite
different techniques from the traditional ones. Naturally, we regard the interdependent levels in a hierarchy as
one hierarchical attribute instead of multiple attributes. The values at different levels are mapped to
integers, and a value at a higher level includes multiple values at a lower level. That is, the lower level values
are partitioned into groups; the indexes or labels of the groups constitute the higher level values. By
scanning the dimension table, we establish the mappings, hierarchical partitions, and the value inclusion
relationships.
Definition: For a set V, if there exist V_1, V_2, …, V_k with V_i ⊆ V, i = 1, 2, …, k, such that
(1) V_i ∩ V_j = ∅ for all i, j = 1, 2, …, k, i ≠ j, and
(2) V_1 ∪ V_2 ∪ … ∪ V_k = V,
then V_1, V_2, …, V_k is a partition of V.
Higher level values are just the labels of the subsets of lower level domain values that form a partition. The
higher level values, as a set, can be further partitioned. In this way, the hierarchical partitioning structure of
the domain values captures the concept of an attribute/dimension hierarchy.
For example, in time dimension, a popular hierarchy is “year—quarter—month”. For each year of 1997
and 1998, there are four quarters Q1, Q2, Q3, Q4. For each of these quarters, there are three months among
January, February, …, December. We can represent this as a table with three columns “year”, “quarter”
and “month” whose cardinalities seem to be 2, 4 and 12 respectively. However, for the hierarchy to be
valid, the functional dependency constraints between the columns must be satisfied. If column A is
functionally dependent on column B, the constraint stipulates that if two rows have the same value for
column B then they must have the same value for column A. Obviously, in this example, the year and quarter
columns do not satisfy the constraint, since Q1 can belong to year 1997 or year 1998. To eliminate the
ambiguity, we can qualify Q1 as 1997/Q1 or 1998/Q1, respectively. In the same way, January needs to be
qualified as 1997/January or 1998/January. After qualifying all the quarters and months, the “year—
quarter—month” hierarchy is valid.
For categorical attributes (domains with strings) or attributes with real-valued domains, a mapping that maps
each value to an integer facilitates cubing with statistics trees or multidimensional arrays. After such a
mapping, the hierarchical partitioning structure of the time dimension domain values becomes clearly visible.
Some dimensions, e.g., time, have static domain values and hierarchical inclusion relationships, while others
do not. In the latter case, we establish the partitioning and mapping during the load of the dimension tables. Let
us take the Location dimension as an example.
Lockey  City         State  Region
1       Gainesville  FL     SE
2       Atlanta      GA     SE
3       Los Angeles  CA     WC
4       Chicago      IL     MW
5       Miami        FL     SE
6       San Jose     CA     WC
7       Seattle      WA     WC
8       Twin City    MN     MW

Table 1: Location dimension

Region  State   City
MW (1)  IL (1)  Chicago (1)
        MN (2)  Twin City (2)
SE (2)  FL (3)  Gainesville (3)
                Miami (4)
        GA (4)  Atlanta (5)
WC (3)  CA (5)  Los Angeles (6)
                San Jose (7)
        WA (6)  Seattle (8)

Table 2: Partitioning and Mapping of Location
Suppose Table 1 is part of the original dimension table of Location, which represents a composite attribute
(City, State, Region). Table 2 is the result of sorting along the columns Region, State and City and mapping the
column values to integers (shown in parentheses). This process has the effect of executing the following SQL
statement:
SELECT DISTINCT Region, State, City
FROM Location
GROUP BY Region, State, City
ORDER BY Region, State, City;
LocKey  City  State  Region
1       3     3      2
2       5     4      2
3       6     5      3
4       1     1      1
5       4     3      2
6       7     5      3
7       8     6      3
8       2     2      1

Table 3: Location after transformation
Table 3 results from the mapping of the column values of Table 1. The mapping itself can be represented and
stored as an encoding index structure, e.g., a trie or dictionary (string → integer), and a decoding data structure
such as an array (integer → string). The encoding mapping is very important: through it, strings or other
types in the user input queries are transformed into the internal integer representation to facilitate the
application of CubiST. Table 2 can also be implemented directly as a multi-way tree.
Notice that we only need to store Table 2 and Table 3, because they are a reversible transformation of Table
1. Furthermore, we only store the finest level column; the other level columns are implied by the mapping. This
is a potentially very large compression of the dimensional data. In the example, columns 3 and 4 of Table 3
need not be stored. In addition, instead of storing strings like “Gainesville” (consuming 11 bytes), integers like 3
are stored (consuming at most two bytes if there are no more than 65,536 cities in the data warehouse).
Though we still store the mapping, only the transformed tables (e.g., Table 3) are read during the
initialization of STs. The data compression of the transformed tables significantly reduces the runtime of the
join operations in the initialization phase.
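A minimal sketch of the encoding/decoding structures for the Location example is given below (our illustration; class and variable names are hypothetical). Per level, a dictionary encodes strings to 1-based integers and a list decodes them back, while parent maps record the hierarchical inclusion city → state → region established during the load.

    class LevelCodec:
        # string -> integer encoding and integer -> string decoding per level
        def __init__(self):
            self.encode, self.decode = {}, []
        def code_of(self, value):
            if value not in self.encode:
                self.decode.append(value)
                self.encode[value] = len(self.decode)   # 1-based, as in Table 2
            return self.encode[value]

    city, state, region = LevelCodec(), LevelCodec(), LevelCodec()
    city_parent, state_parent = {}, {}                  # inclusion relationships

    # a few (Region, State, City) rows in the sorted order of Table 2
    for r, s, c in [("MW", "IL", "Chicago"), ("MW", "MN", "Twin City"),
                    ("SE", "FL", "Gainesville"), ("SE", "FL", "Miami"),
                    ("SE", "GA", "Atlanta")]:
        city_parent[city.code_of(c)] = state.code_of(s)
        state_parent[state.code_of(s)] = region.code_of(r)

    print(city.code_of("Gainesville"), state.decode[city_parent[3] - 1])  # 3 FL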
Some attributes have complex hierarchies, where multiple paths from high-level concepts to low-level
concepts exist. We can decompose the complex structure into simple hierarchies. For example, the
following time dimension hierarchy lattice can be decomposed into two simple hierarchies:
“year—quarter—month—day” with four levels and “year—week—day” with three levels. The two paths
correspond to two different ways of subsetting/partitioning the days.
year → quarter → month → day
year → week → day
4.2. Free Aggregate Navigation
In the new setting, let us explore what “ad-hoc” or “free navigation” really means. We also shed new light
on the buzzwords “rollup” and “drill down”.
Figure: Abstraction levels of the attributes A1, A2, ..., Ak, with L1, L2, ..., Lk hierarchy levels respectively.
To navigate freely over the aggregations, queries can be constrained on any combination of the dimensions.
For any constrained dimension with a concept hierarchy, the constraints can be at any abstraction level. For a
particular level, the selected values can form any subset of the domain values at that level. In addition, the
aggregate functions can differ.
Theorem: If a group contains k attributes A_1, A_2, …, A_k with L_1, L_2, …, L_k levels in their hierarchies,
respectively, then the number of level combinations is ∏_{i=1}^{k}(L_i + 1).
For the i-th attribute (i = 1, 2, …, k), there are L_i + 1 choices of levels; the additional choice is not to
constrain the attribute at all, so that the combinations of the attributes are also counted. By the
multiplication principle, the total number of combinations is the product of the numbers of choices for the
attributes.
Theorem: If a group contains k attributes A_1, A_2, …, A_k with cardinalities d_1, d_2, …, d_k, respectively,
at their finest levels, then the number of potential queries at the lowest level is 2^{d_1} · 2^{d_2} ⋯ 2^{d_k}.
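As a small worked illustration of the two counts (our own example numbers, not from the paper), consider a group of three attributes with 2, 3 and 3 hierarchy levels and finest-level cardinalities 4, 4 and 4:

    from math import prod

    levels = [2, 3, 3]
    combinations = prod(l + 1 for l in levels)          # (2+1)*(3+1)*(3+1) = 48
    cards = [4, 4, 4]
    lowest_level_queries = prod(2 ** d for d in cards)  # 2^4 * 2^4 * 2^4 = 4096
    print(combinations, lowest_level_queries)           # 48 4096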
Definition: A rolling-up operation is performed when one or both of the following cases occur:
(1) One or more dimensions go up one or more levels along their abstraction hierarchy ladders, and no
dimension goes down at the same time.
(2) One or more constrained dimensions are left out.
Similarly, we can define a drilling-down operation. Rolling-up and drilling-down are two common
operators in OLAP. In a strict sense, they are not new operators but a cube operation relative to another cube
operation. However, the requirement of free navigation over the hierarchies adds additional complexity to
implementing OLAP systems. Our objective is to design new algorithms that improve the efficiency of
free aggregate navigation.
4.3. Motivation and Overview of the New Optimizations
In the statistics tree framework, CubiST can freely navigate the hierarchies internally, provided that the
core statistics trees are built at levels no higher than the queried levels. To implement free navigation using
CubiST directly, we first transform conditions expressed at different abstraction levels into conditions
expressed at the lowest levels. We can map the higher level values to lower level values according to the
mapping functions (e.g., Table 2 or the corresponding multi-way tree). The hierarchy-navigating queries are
then simply regarded as range queries or partial queries that can be answered directly by CubiST.
Even though these queries can be answered in memory, in the worst case answering them may require
walking through most of the statistics tree. In the car sales example, suppose that the time dimension has
a year-month-day hierarchy and the location hierarchy is region-state-city, and consider the cube query "How
many cars were sold in the southeast in 1998?". If the southeast region has 150 cities in the database and the
records are at the finest granularity, answering the query is equivalent to answering 150*365 = 54,750
singleton queries.
In real-world applications, multiple STs may be used, thus requiring paging STs in and out of memory. If
the STs are large, the I/O overhead can also be large, so CubiST should be further optimized to cut down
this overhead. Notice that in order to reduce the cost of paging STs in and out of memory, we only
transmit the leaves of the STs to and from the ST Repository; the internal nodes can be regenerated in
memory without extra I/O operations. In this way, the I/O complexity of STs is the same as that of
multidimensional arrays.
The idea is to materialize some statistics trees at higher levels. For instance, if we materialize the
statistics tree in our car sales example at the region and year levels, then the sample cube query becomes a
singleton query and the I/O cost is negligible. To implement free navigation, we select, compute and
materialize a set of statistics trees and then answer an arbitrary query by applying CubiST to an
appropriately chosen tree from these STs. First, let us give some definitions.
Definition: The base abstraction levels are the lowest levels in the concept hierarchies such that the
statistics trees fit in memory. If each dimension uses its lowest level, the statistics trees may not fit in
memory; in this case, we use higher levels for the large dimensions or take some dimensions out of the
group. The resulting statistics trees are called base trees.
Definition: The derived trees are the STs in the same group but with higher levels for one or more
dimensions. They are computed from the base trees.
Definition: A base tree and its derived trees form a family of statistics trees.
5. Generating a Family of Statistics Trees
The families of STs are used for answering queries at different abstraction levels of different
dimensions. To generate a family, we first need an algorithm for choosing candidates from all possible
derived trees as members of the family. Second, methods for computing the derived trees from the base tree
must be found.
5.1. A Greedy Algorithm to Choose Family Members
Potentially, any combination of the levels in the attribute group can be selected as a member of the family.
However, we need to consider the space limitation and maintenance overhead; again, it is a space-time
trade-off. Here, we introduce a greedy algorithm to choose the members. Starting from the base tree, in each
step the largest dimension climbs up one level in its hierarchy while the other dimensions remain the same,
and the newly formed ST is included in the family. The iteration continues until each dimension has climbed
to its highest level or the size of the family reaches the space limitation (the stop conditions). The
pseudo-code is shown in the figure below. The algorithm of line 1 was discussed in Section 3.1; the rollup
algorithm for line 7 will be given in Subsection 5.2. Lines 2 and 8 save the leaves, while the parameters
saved in lines 3 and 9 are used to register the STs stored in the ST Repository.
The size of the family is no more than two times the size of the base tree; usually, the total size of the derived
trees is just a very small fraction of the base tree. Of course, more sophisticated approaches such as cost-based
algorithms are also possible. The algorithm of rolling up the largest dimension one after another is greedy
in terms of the size reduction of the new derived tree. Without loss of generality, numerical attributes in
the examples of this subsection are grouped into equal-size segments, except possibly the last one, which may
be smaller.
1   Setup base tree T0 by scanning the input data;
2   Save T0 to the ST Repository;
3   Save its parameters to the Metadata Repository;
4   T := T0;
5   Repeat {
6     Find the dimension with the largest cardinality in T;
7     Rollup that dimension to generate a new derived tree T;
8     Save T to the ST Repository;
9     Save the parameters of T to the Metadata Repository;
10  } Until (stop conditions);

Figure: Greedy algorithm for generating a family of statistics trees.
Example 5.1
Suppose the base tree T0 has three dimensions whose sizes are 100, 100, and 100. Each dimension can be
grouped ten by ten to form a two-level hierarchy. The cardinalities of the base tree T0 are therefore 101, 101,
101 (the 100 domain values plus the star).
T1 is the result of rolling up the third dimension of T0 and therefore has cardinalities 101, 101, 11.
T2 is the result of rolling up the second dimension of T1; its cardinalities are 101, 11, 11.
T3 is the result of rolling up the first dimension of T2; its cardinalities are 11, 11, 11.
T1, T2, and T3 are derived trees. Together with the base tree T0, they form a family.
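The greedy loop above can be sketched in Python on the cardinality vectors alone (our illustration; the real algorithm also materializes the leaf arrays, and the tie-breaking rule used here is an assumption):

    def generate_family(base_cards, factors, max_trees=10):
        # Repeatedly roll up the dimension with the largest cardinality;
        # factors[i] is the grouping factor of dimension i's two-level
        # hierarchy.  Only the cardinality vectors are tracked here.
        family = [list(base_cards)]
        levels = [0] * len(base_cards)
        while len(family) < max_trees:
            candidates = [i for i, l in enumerate(levels) if l == 0]
            if not candidates:
                break                           # every dimension at its top level
            i = max(candidates, key=lambda j: (family[-1][j], j))
            rolled = list(family[-1])
            # d low-level values plus the star become ceil(d/factor) groups plus the star
            rolled[i] = -(-(rolled[i] - 1) // factors[i]) + 1
            levels[i] = 1
            family.append(rolled)
        return family

    # Example 5.1: three dimensions of 100 values (cardinality 101 with star),
    # grouped ten by ten
    print(generate_family([101, 101, 101], [10, 10, 10]))
    # [[101, 101, 101], [101, 101, 11], [101, 11, 11], [11, 11, 11]]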
5.2. Computing a New Derived Tree from Existing Trees
The base tree is set up during the initialization of the ST described in Section 3.1. Except for the first
derived tree, which is computed from the base tree, a new derived tree is computed from an existing one by
merging some sub-trees of the old tree. Before introducing this roll-up algorithm, we define a new operator
“⊕” on STs as follows.
Definition: Two STs are isomorphic if they are identical except for the values of their leaves, i.e., they have
the same structure. All pairs of corresponding nodes in two isomorphic STs have the same degree.
Definition: The summation S of two isomorphic STs S1 and S2 is a new ST that has the same structure as S1
or S2 (i.e., is isomorphic to them) but whose leaf values are the sums of the corresponding leaf values (i.e.,
those at the same position) of S1 and S2. Their relationship is denoted by S = S1 ⊕ S2.
The implementation of the new operator is straightforward: keep one of the STs (say S1) and update its
leaves by repeatedly adding the count values of the leaves directed by two sliding pointers. Notice
that the new operator can be adapted to other aggregation functions such as MIN/MAX; in that case, instead
of summing the leaves, compute the minimum or maximum of the corresponding leaves as the leaves of
the resulting ST.
The summation of several isomorphic STs is an ST which has exactly the same structure as any of the STs
except for the leaves; the count fields are the sums of the count fields of the corresponding leaves. Next, we
give an algorithm for the roll-up operation of case (1) in the definition of Section 4.2. Case (2) is covered by
the CubiST algorithm directly.
To derive a new tree by rolling up a dimension, proceed from the root to the level corresponding to the rollup
dimension. According to the mapping relationship described in Section 4.1, reorganize and merge the
sub-trees of all the nodes at that level to form new children. Adjust the degree and update these nodes.
Lastly, link the newly formed children (summations of subtrees) to the new nodes. The best way to illustrate
the roll-up process is through a simple example.
Example 5.2
Suppose that during the derivation we reach a node at the level of the rollup dimension. The dimension has
nine values and a two-level hierarchy which groups the lower level values three by three. The sample node is
shown on the left of the figure below with its children S1, ..., S9, S*. We group and sum its sub-trees (which
are themselves STs) three by three to form three new children S1#, S2# and S3# of the new node with degree
four, as shown on the right side. We use the superscript # to indicate higher level values. The star sub-tree
remains the same. Notice that
S1# = S1 ⊕ S2 ⊕ S3,  S2# = S4 ⊕ S5 ⊕ S6,  S3# = S7 ⊕ S8 ⊕ S9.
Figure ?: Rolling up one dimension.
The following is the roll-up algorithm.

1   change(Node nd, int level, int rollDim, int factor) {
2     IF level == rollDim THEN
3       group the children of nd such that each group has factor
4         subtrees, except the last group and the star node;
5       compute the summation of the subtrees in each group;
6       make a new node nd' whose children are the summations
7         and the star subtree; adjust the domain of nd';
8       link nd' to nd's original parent IF level is not equal to 1;
9       return;
10    level++;
11    FOR each child Ci of nd: change(Ci, level, rollDim, factor);
12  }
13  rollUP(int rollDim, int factor) {
14    change(root, 1, rollDim, factor);
15  }

Figure ?: Algorithm for rolling up one dimension to form a new derived tree.
Line 1 is the signature of the recursive function change, which proceeds from a node nd at level level down to
the rollup dimension rollDim. Suppose that the hierarchy is formed by grouping the lower level values factor at
a time. Lines 2 to 9 correspond to the modification process illustrated in the example above. Line 14 invokes
change.
Before computing the new derived tree, we first store the old ST or base tree on disk (swap out). The
metadata about the ST, such as the dimensions in the group, hierarchy levels, etc., is also stored in the Metadata
Repository. The new ST is usually much smaller than the old one, and the remaining memory space can be
reclaimed for other use.
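In the flat leaf representation used in our earlier sketches, the roll-up amounts to re-addressing each leaf by mapping the rolled dimension's value to its group and adding the counts of merged leaves, which realizes the ⊕ operator. The following Python sketch is our illustration under the even-grouping assumption:

    def roll_up(leaves, card, dim, factor):
        # Leaf paths are re-addressed by mapping value v of dimension `dim`
        # to group v // factor; the star index card[dim] maps to the new star
        # index.  Counts of merged leaves are added (the summation operator).
        groups = -(-card[dim] // factor)             # ceil(d / factor)
        new_card = list(card)
        new_card[dim] = groups
        new_leaves = {}
        for path, count in leaves.items():
            v = path[dim]
            new_v = groups if v == card[dim] else v // factor
            new_path = path[:dim] + (new_v,) + path[dim + 1:]
            new_leaves[new_path] = new_leaves.get(new_path, 0) + count
        return new_leaves, new_card

    # usage: roll up dimension 0 of a 4-value domain (star index 4) three by three
    leaves = {(0, 0): 1, (2, 0): 2, (3, 0): 1, (4, 0): 4}
    print(roll_up(leaves, [4, 2], 0, 3)[0])   # {(0, 0): 3, (1, 0): 1, (2, 0): 4}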
6. Answering Cube Queries Using the Families
Once the family is established and materialized, it is ready for answering queries: we select an ST from the
family and answer the submitted query using CubiST, rewriting the query if necessary.
6.1. Selecting a Tree in the Family for Answering the Query
Potentially, multiple trees can answer a given query, but the smaller ones are preferred. The proper tree to
select from the family is the tree with the highest levels that still match the query. This can be done by the
following matching scheme. We first introduce some new concepts and data structures.
Definition: A view matrix (VM) is a matrix whose rows are dimensions, whose columns are the hierarchy
levels, and whose entries are the statistics tree labels in the family. It describes which views are materialized
in the hierarchies.
Example 6.1
Continuing Example 5.1, suppose the level with cardinality 101 has level value 0 and the grouped level with
cardinality 11 has level value 1. For example, T1 has cardinality structure 101*101*11: its first two
dimensions are at level 0 and the last dimension is at level 1; therefore, T1 appears in the first column of the
first two rows and in the second column of row 2. Here we assume that dimension and level labels start from
0 (to match the C/C++ array convention). The following is the VM.
dim\lev   0            1
0         T0, T1, T2   T3
1         T0, T1       T2, T3
2         T0           T1, T2, T3
VM is easily implemented as a stack of level vectors which are composed of level values for the
dimensions. The stack corresponding to the above VM is
(1, 1, 1)   T3
(0, 1, 1)   T2
(0, 0, 1)   T1
(0, 0, 0)   T0

Figure: Stack representation of the VM.
We can peel a VM one layer off from the right; that is, each dimension pops off its rightmost ST. The peel
operator corresponds to a pop operation on the stack, and VM creation corresponds to pushing the level
vectors onto an empty stack.
Example 6.2
After peeling off T3, the VM of Example 6.1 becomes

dim\lev   0            1
0         T0, T1, T2
1         T0, T1       T2
2         T0           T1, T2
Definition: A query hierarchy vector (QHV) is a vector (l_1, l_2, ..., l_k), where l_i is the level value, in its
hierarchy, of the i-th dimension in the query.
Definition: A QHV (l_1, l_2, ..., l_k) matches view T in the VM iff l_i >= the column index of the i-th row entry
of T in the VM (the column indices of T compose T's column vector), for all i = 1, 2, ..., k. In other words,
when T's column vector <= QHV, we say "the query or its QHV matches view T". If a query's QHV matches
view T, the vector corresponding to T in the stack is less than or equal to the QHV.
Definition: An optimal matching view with respect to a VM is the view in the VM that has the highest index
and matches the query's QHV.
Example 6.3
The QHV of the query ([3#,5#], 1#, [4#,6#]) is (1,1,1). It matches T0, T1, T2, and T3, but T3 is the optimal
matching view. As before, a number with a # superscript indicates that it belongs to the higher level values;
[3#,5#] means [30,59] in Example 5.1.
If a query's QHV is (0,1,0), T0 is the only matching view, so it is also the optimal tree in the family.
Theorem: Any query with no finer granularities than the base tree has an optimal view that matches its
QHV.
This is quite straightforward, because T0 always matches and the indices of the derived trees have a total
order (increasing).
The VM is generated during the generation of the family.
Now the problem is to find the optimal matching view for the query. The pseudo-code is as follows.

1   Compute the QHV for the query;
2   Peel off the entries of the VM layer by layer, from right to left,
3     until a view matches the QHV;
4   Return that view;

Algorithm 2: Find the optimal matching view for the query.
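Using the stack-of-level-vectors representation of the VM, Algorithm 2 can be sketched in a few lines of Python (our illustration; the data and names follow Examples 6.1 and 6.4):

    def optimal_matching_view(stack, qhv):
        # `stack` holds (column_vector, tree_label) pairs with the base tree
        # at the bottom.  Peel (pop) until a view's column vector is
        # componentwise <= the QHV.
        work = list(stack)
        while work:
            vector, tree = work[-1]
            if all(v <= q for v, q in zip(vector, qhv)):
                return tree
            work.pop()
        return None   # cannot happen if the query is no finer than the base tree

    stack = [((0, 0, 0), "T0"), ((0, 0, 1), "T1"),
             ((0, 1, 1), "T2"), ((1, 1, 1), "T3")]
    print(optimal_matching_view(stack, (1, 0, 1)))   # Example 6.4 -> T1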
Example 6.4
Suppose the query is Cube(3#, 2, 5#). Its QHV is (1,0,1). Because the column vector of T3 is (1,1,1), which
does not match (1,0,1), T3 must be peeled off. In the same way, T2 must also be peeled off. The resulting VM is

dim\lev   0        1
0         T0, T1
1         T0, T1
2         T0       T1

This time the column vector of T1 is (0,0,1) <= (1,0,1), i.e., T1 matches the QHV, so T1 is the optimal
matching view.
6.2. Rewriting and Answering the Query
In this phase, the query is evaluated based on its optimal matching view. The answering algorithms are the
same as in CubiST. Before evaluation, the query must first be transformed if its QHV is not equal to the
optimal view's column vector. In Example 6.4, the query Cube(3#, 2, 5#) is translated into
Cube([30,39], 2, 5#) so that its QHV (0,0,1) exactly matches the optimal view's levels; the query is then
computed using T1. The query is rewritten according to the mapping functions: higher level values are
translated into lower level values via the inclusion relationship.
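The rewriting step can be sketched as follows (our illustration, assuming the even grouping factor of Example 5.1 and 0-based values; the function name is hypothetical):

    def rewrite_selection(sel, query_level, view_level, factor):
        # A high-level value v# covers the low-level range
        # [v*factor, (v+1)*factor - 1] under the even-grouping assumption.
        # Selections already at (or below) the view's level pass through.
        if query_level <= view_level:
            return sel
        span = factor ** (query_level - view_level)
        lo, hi = sel if isinstance(sel, tuple) else (sel, sel)
        return (lo * span, (hi + 1) * span - 1)

    # Example 6.4: Cube(3#, 2, 5#) evaluated on T1 with levels (0, 0, 1)
    print(rewrite_selection(3, 1, 0, 10))   # 3# -> range (30, 39)
    print(rewrite_selection(2, 0, 0, 10))   # already at the view's level -> 2
    print(rewrite_selection(5, 1, 1, 10))   # 5# stays 5# on the rolled-up dimension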
7. Performance Experiments
7.1. Datasets
We use synthetic data sets in our experiments. Our random data generator simulates a relational
warehouse with the following parameters: r is the number of records and k is the number of dimensions
with cardinalities d_1, d_2, ..., d_k, respectively. The elements of the array with r rows and k columns are
uniformly distributed over the domains of the corresponding columns. The array is written to
disk as a text file and serves as input to our new cubing algorithms.
The testbed consists of a SUN ULTRA 10 workstation running SunOS 5.6 with 90MB of available main
memory.
7.2. Performance improvement of Hierarchical vs Flat CubiST
In this subsection, we show the effectiveness of materializing and using families of statistics
trees. Since the number of records does not affect the size of the statistics tree or the cube query
answering algorithms, in this set of experiments we fixed the number of records (r = 1,000,000). Suppose
there are three dimensions, each of which has a two-tier hierarchy. To simplify the hierarchical inclusion
relationship of the mapping described in Section 4.1, we assume that each higher level value includes an equal
number of lower level values and call this number the factor. For example, each year has exactly four
quarters, so the factor is four. Of course, in the real world there are many more dimensions, and the
even-inclusion assumption may not hold (e.g., for the location dimension). However, these assumptions do not
affect the generality of our new algorithms.
Factors           1, 1, 1      3, 2, 2      3, 3, 3      4, 4, 3
Dim. Sizes        60, 60, 60   20, 30, 30   20, 20, 20   15, 15, 20
#Leaves           226,981      20,181       9,261        5,376
I/O Time (ms)     16,818       1,517        681          384
Total Time (ms)   25,387       1,655        731          446

Table: Statistics tree size and query time for different hierarchy factors.
In this set of experiments, we fixed the query but varied the factors. We compare the statistics tree sizes
(measured by the number of leaves in the STs), the I/O time spent loading the ST from the ST Repository,
and the total answering time. Notice that, in general, the total time to answer a query consists of the time
for choosing the proper ST, the in-memory empty tree formation time, the leaf loading time, and finally the
query answering time using the newly formed tree. As the table above shows, the I/O time and the total time
decrease dramatically as the factors increase and the number of leaves decreases. The more low-level values
a high-level value includes (i.e., the larger the factors), the faster the high-level derived trees answer the
queries.
7.3. Varying Number of Records
In this subsection, we compare the setup and response times of CubiST with those of two other frequently
used techniques: a simple query evaluation technique, henceforth referred to as scanning, and a bitmap-based
query evaluation algorithm, henceforth referred to as bitmap. The scanning algorithm
computes the aggregates for each cube query by scanning the entire data set each time. The bitmap
algorithm uses a bitmap index structure to answer cube queries without touching the
original data set (assuming the bit vectors are already set up). CubiST is implemented as described in
Sections xx.
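For reference, a minimal sketch of the scanning baseline is shown below (Python; the file name matches the illustrative generator sketch in Section 7.1, and the function name is ours):

# Scanning baseline: re-read the whole data file for every cube query and
# count the records whose values fall inside the selected ranges.
def scan_count(path, ranges):
    # ranges: one (lo, hi) pair per dimension, or None for an unconstrained dimension
    count = 0
    with open(path) as f:
        for line in f:
            values = [int(v) for v in line.split()]
            if all(r is None or r[0] <= v <= r[1] for v, r in zip(values, ranges)):
                count += 1
    return count

# e.g. query q1 below: first two dimensions in [2,8], third in [5,10]
print(scan_count("synthetic.txt", [(2, 8), (2, 8), (5, 10), None, None]))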
In this series of experiments, the number of dimensions and the corresponding domain sizes are fixed while
we increase the number of records. Specifically, there are five dimensions with sizes 10, 10, 15, 15, and 15,
respectively. The goal is to validate our claim that CubiST scales well with the number of records.
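To complement the scanning sketch above, the following minimal sketch shows the idea behind the bitmap baseline, assuming one bit vector per (dimension, value) pair has been built during the setup phase (again, all names are illustrative and this is not the indexing code used in the experiments):

# Setup phase: one Python integer per (dimension, value) pair serves as a bit vector.
def build_bitmaps(rows, cardinalities):
    bitmaps = [[0] * d for d in cardinalities]
    for i, row in enumerate(rows):
        for dim, v in enumerate(row):
            bitmaps[dim][v] |= 1 << i          # set bit i in vector (dim, v)
    return bitmaps

# Query phase: OR the vectors inside each range, AND across dimensions, count bits.
def bitmap_count(bitmaps, ranges, num_rows):
    result = (1 << num_rows) - 1               # all rows selected initially
    for dim, r in enumerate(ranges):
        if r is None:
            continue                           # unconstrained dimension
        lo, hi = r
        selected = 0
        for v in range(lo, hi + 1):
            selected |= bitmaps[dim][v]
        result &= selected
    return bin(result).count("1")

rows = [(2, 3, 7, 0, 1), (9, 2, 5, 4, 4), (3, 8, 10, 1, 0)]
bm = build_bitmaps(rows, [10, 10, 15, 15, 15])
print(bitmap_count(bm, [(2, 8), (2, 8), (5, 10), None, None], len(rows)))  # -> 2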
Setup Time (ms)      10K       100K      200K
Writing              1,211     11,527    23,125
Bitmap               1,045     10,039    21,660
Cubist               16,961    32,776    50,786

Figure 6: Setup time vs. number of records.

Response Time (ms)   10K       100K      200K
Scanning             979       9,196     18,347
Bitmap               51        494       989
Cubist               32        31        32

Figure 7: Response time vs. number of records.
Figures 6 and 7 show the performance of the three algorithms. The sample cube query on which
the experiments are based is q1 = (s1, s2, s3) = ([2,8], [2,8], [5,10]). Query q1 is an
example of a range query that computes the total number of records lying in the region where
the first two dimension values are in the range [2,8] and the third is in [5,10]. Although the setup
time for CubiST is larger than that of bitmap because of the insertion operation performed for each
record, its response time is much faster and almost independent of the number of records. Since
there is no setup time for the scanning algorithm, we simply measured how long it takes to
write the data out to disk as a reference value; this set of measurements appears as “writing” in the
first row of Figure 6.
7.4. Varying Domain Size and Query Complexity
In this series of experiments, we measure response times as a function of domain size and query complexity
(measured in terms of the number of selection components). First, the number of records (100,000) and the
domain sizes (15, 15, 15, 15, 15) are fixed, but the complexity of the input queries increases as follows:
Q1: (s1) = ([1,10]);
Q2: (s1, s2) = ([1,10], [1,10]);
Q3: (s1, s2, s3) = ([1,10], [1,10], [1,10]);
Q4: (s1, s2, s3, s4) = ([1,10], [1,10], [1,10], [1,10]);
Q5: (s1, s2, s3, s4, s5) = ([1,10], [1,10], [1,10], [1,10], [1,10]);
Although the performance of the scanning approach is less sensitive to query complexity than bitmap and
CubiST, its response times were nonetheless several orders of magnitude larger (Figure 12). As before,
CubiST is considerably faster than the bitmap approach. The large increase in the number of selected cells
per query, from 10 to 100,000, accounts for the increase in CubiST's response time.
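These cell counts follow directly from the query ranges: each Qm selects ten values ([1,10]) in m dimensions, so a flat statistics tree must aggregate 10^m leaf cells, from 10 for Q1 up to 100,000 for Q5. The short check below states this only as arithmetic on the ranges listed above:

# Number of leaf cells each query Qm selects in a flat statistics tree.
for m in range(1, 6):
    print(f"Q{m}: {10 ** m} selected cells")
# -> 10, 100, 1000, 10000, 100000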
Response Time (ms)   Q1      Q2      Q3      Q4      Q5
Scanning             8,201   8,109   7,930   7,870   7,964
Bitmap               252     723     1,093   1,385   1,742
Cubist               4       4       6       19      42

Figure 12: Response time vs. query complexity.

Response Time (ms)   Q6      Q7      Q8      Q9
Scanning             5,040   4,812   5,017   4,898
Bitmap               831     1,964   2,155   2,823
Cubist               5       6       12      24

Figure 13: Response time vs. domain sizes.
We repeated the series of experiments using another data set, where we fixed the number of records
(100,000) but increased the domain size of each dimension from 20 to 80 (i.e., the four groups of domain
sizes are 20,20,20; 40,40,40; 60,60,60; 80,80,80) and used the following queries:
Q6: (s1, s2, s3) = ([0,10], [0,10], [0,10]);
Q7: (s1, s2, s3) = ([0,20], [0,20], [0,20]);
Q8: (s1, s2, s3) = ([0,30], [0,30], [0,30]);
Q9: (s1, s2, s3) = ([0,40], [0,40], [0,40]);
From Figure 13, we can see the drastic increase in response times for both CubiST and bitmap as the
number of selected cells grows. In summary, the above experiments show that CubiST achieves
significant improvements in performance and scalability over both the scanning algorithm and
ROLAP implementations based on bitmap indexing techniques.
7.5. Comparison of Setup and Response Times of CubiST and a Top Commercial Product for a Set of Random Queries
8. Conclusion
In this paper, we have presented new algorithms for efficiently answering cube queries that contain
constraints on arbitrary hierarchy levels. Hierarchies are viewed as an iterative partitioning of the
lower-level values. All attribute values are mapped to integers, and the value inclusion relationship is
established among values at different levels. From the flat base statistics tree, we expand and
materialize a family of statistics trees. We proposed a greedy algorithm for choosing the family members
and an algorithm for computing the derived trees. The smallest query-matching tree from the family is
selected to answer each query; before applying CubiST, the query may need to be rewritten according
to the value mapping and inclusion relationship.
Together with the basic statistics tree data structure and the CubiST algorithm, the new methods proposed in
this paper provide an effective way to support truly ad-hoc cube queries. Our simulation results
demonstrate the performance benefits of the new algorithms.
References
[1]
S. Agarwal, R. Agrawal, P. Deshpande, J. Naughton, S. Sarawagi, and R. Ramakrishnan, “On The
Computation of Multidimensional Aggregates,” in Proceedings of the International Conference on
Very Large Databases, Mumbai (Bombay), India, 1996.
[2]
R. Agrawal, A. Gupta, and S. Sarawagi, “Modeling Multidimensional Databases,” in Proceedings
of the Thirteenth International Conference on Data Engineering, Birmingham, U.K., 1997.
[3]
Arbor Systems, “Large-Scale Data Warehousing Using Hyperion Essbase OLAP Technology,”
Arbor Systems, White Paper, http://www.hyperion.com/whitepapers.cfm.
[4]
S. Berchtold and D. A. Keim, “High-dimensional index structures database support for next
decade's applications (tutorial),” in Proceedings of the ACM SIGMOD International Conference on
Management of Data, Seattle, WA, pp. 501, 1998.
[5]
C. Y. Chan and Y. E. Ioannidis, “Bitmap Index Design and Evaluation,” in Proceedings of the
ACM SIGMOD International Conference on Management of Data, Seattle, WA, pp. 355-366,
1998.
[6]
S. Chaudhuri and U. Dayal, “Data Warehousing and OLAP for Decision Support,” SIGMOD
Record (ACM Special Interest Group on Management of Data), 26:2, pp. 507-508, 1997.
[7]
S. Chaudhuri and U. Dayal, “An Overview of Data Warehousing and OLAP Technology,”
SIGMOD Record, 26:1, pp. 65-74, 1997.
[8]
E. F. Codd, S. B. Codd, and C. T. Salley, “Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate,” Technical Report, http://www.arborsoft.com/OLAP.html.
[9]
D. Comer, “The Ubiquitous B-Tree,” ACM Computing Surveys, 11:2, pp. 121-137, 1979.
[10]
S. Goil and A. Choudhary, “High Performance OLAP and Data Mining on Parallel Computers,”
Journal of Data Mining and Knowledge Discovery, 1:4, pp. 391-417, 1997.
[11]
S. Goil and A. Choudhary, “PARSIMONY: An Infrastructure for Parallel Multidimensional
Analysis and Data Mining,” Journal of Parallel and Distributed Computing, to appear.
[12]
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H.
Pirahesh, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and
Sub-Totals,” Data Mining and Knowledge Discovery, 1:1, pp. 29-53, 1997.
[13]
A. Gupta, V. Harinarayan, and D. Quass, “Aggregate-query Processing in Data Warehousing
Environments,” in Proceedings of the Eighth International Conference on Very Large Databases,
Zurich, Switzerland, pp. 358-369, 1995.
[14]
H. Gupta and I. Mumick, “Selection of Views to Materialize Under a Maintenance Cost
Constraint,” Stanford University, Technical Report.
[15]
J. Han, “Towards On-Line Analytical Mining in Large Databases,” SIGMOD Record, 27:1, pp. 97-107, 1998.
[16]
V. Harinarayan, A. Rajaraman, and J. D. Ullman, “Implementing data cubes efficiently,” SIGMOD
Record (ACM Special Interest Group on Management of Data), 25:2, pp. 205-216, 1996.
[17]
Information Advantage, “Business Intelligence,” White Paper, 1998,
http://www.sterling.com/eureka/.
[18]
Informix Inc., “Informix MetaCube 4.2, Delivering the Most Flexible Business-Critical Decision
Support Environments,” Informix, Menlo Park, CA, White Paper,
http://www.informix.com/informix/products/tools/metacube/datasheet.htm.
[19]
W. Labio, D. Quass, and B. Adelberg, “Physical Database Design for Data Warehouses,” in
Proceedings of the International Conference on Data Engineering, Birmingham, England, pp.
277-288, 1997.
[20]
M. Lee and J. Hammer, “Speeding Up Warehouse Physical Design Using A Randomized
Algorithm,” in Proceedings of the International Workshop on Design and Management of Data
Warehouses (DMDW '99), Heidelberg, Germany, 1999.
[21]
D. Lomet, Bulletin of the Technical Committee on Data Engineering, vol. 18, IEEE Computer
Society, 1995.
[22]
Z. Michalewicz, Statistical and Scientific Databases, Ellis Horwood, 1992.
[23]
Microsoft Corp., “Microsoft SQL Server,” Microsoft, Seattle, WA, White Paper,
http://www.microsoft.com/federal/sql7/white.htm.
[24]
MicroStrategy Inc., “The Case For Relational OLAP,” MicroStrategy, White Paper,
http://www.microstrategy.com/publications/whitepapers/Case4Rolap/execsumm.HTM.
[25]
P. O'Neil and D. Quass, “Improved Query Performance with Variant Indexes,” SIGMOD Record
(ACM Special Interest Group on Management of Data), 26:2, pp. 38-49, 1997.
[26]
P. E. O'Neil, “Model 204 Architecture and Performance,” in Proceedings of the 2nd International
Workshop on High Performance Transaction Systems, Asilomar, CA, pp. 40-59, 1987.
[27]
Oracle Corp., “Oracle Express OLAP Technology”, Web site,
http://www.oracle.com/olap/index.html.
[28]
Pilot Software Inc., “An Introduction to OLAP Multidimensional Terminology and Technology,”
Pilot Software, Cambridge, MA, White Paper, http://www.pilotsw.com/olap/olap.htm.
[29]
Redbrick Systems, “Aggregate Computation and Management,” Redbrick, Los Gatos, CA, White
Paper,
http://www.informix.com/informix/solutions/dw/redbrick/wpapers/redbrickvistawhitepaper.html.
[30]
Redbrick Systems, “Decision-Makers, Business Data and RISQL,” Informix, Los Gatos, CA, White
Paper, 1997,
http://www.informix.com/informix/solutions/dw/redbrick/wpapers/risql.html.
[31]
J. Srivastava, J. S. E. Tan, and V. Y. Lum, “TBSAM: An Access Method for Efficient Processing
of Statistical Queries,” IEEE Transactions on Knowledge and Data Engineering, 1:4, pp. 414-423,
1989.
[32]
W. P. Yan and P. Larson, “Eager Aggregation and Lazy Aggregation,” in Proceedings of the
Eighth International Conference on Very Large Databases, Zurich, Switzerland, pp. 345-357,
1995.
[33]
Y. Zhao, P. M. Deshpande, and J. F. Naughton, “An Array-Based Algorithm for Simultaneous
Multidimensional Aggregates,” SIGMOD Record (ACM Special Interest Group on Management of
Data), 26:2, pp. 159-170, 1997.