Improving the Performance of Data Cube Queries Using Families of
Statistics Trees
Joachim Hammer and Lixin Fu
Computer & Information Science & Engineering
University of Florida
Gainesville, Florida 32611-6120
{jhammer, lfu}@cise.ufl.edu
Abstract. We propose a new approach to speeding up the evaluation of cube queries, an important class of OLAP
queries which return aggregated values rather than sets of tuples. Specifically, in this paper we focus on those
cube queries which navigate over the abstraction hierarchies of the dimensions. Abstraction hierarchies are the
result of partitioning dimension values of a data cube into different levels of granularity. Our approach is based on
our previous version of CubiST (Cubing with Statistics Trees) which represents a drastic departure from existing
approaches in that it does not use the familiar view lattice to compute and materialize new views from existing
views. Instead, CubiST computes and stores all possible aggregate views in the leaves of a statistics tree during a
one-time scan of the detailed data. However, it uses this single statistics tree to answer all possible cube queries, which limits its efficiency. The
new version described here remedies this limitation by applying a greedy strategy to select a family of candidate
trees which represent superviews over the different hierarchical levels of the dimensions. In addition, we have
developed an algorithm to materialize the candidate trees in the family starting from the single statistics tree (base
tree). Given an input query, our new query evaluation algorithm selects the smallest tree in the family which can
provide the answer. This significantly reduces I/O time and improves in-memory performance over the original
CubiST. Experimental evaluations of our prototype implementation have demonstrated its superior run-time
performance and scalability when compared with existing MOLAP and ROLAP systems.
1 Introduction
Most OLAP queries involve aggregates of measures over arbitrary regions of the dimensions. This ad-hoc
“navigation” through the data cube helps users obtain a “big picture” view of the underlying data. However, on any
but the smallest databases, these kinds of queries will tax even the most sophisticated query processors since the
database size often grows to hundreds of GBs or TBs and the data in the tables is of high dimensionality with large
domain sizes.
In [6], Fu and Hammer describe an efficient algorithm (CubiST) for evaluating ad-hoc OLAP queries using so-called Statistics Trees (STs). CubiST can be used to evaluate queries that return (aggregate) values (e.g., What is the
percent difference in car sales between the southeast and northeast regions for the years 1990 through 2000?) rather
than the records that satisfy the query (e.g., Find all car dealers in the southeast who sold more cars in 1999 than in
1998). We have termed the former type of query a cube query. However, the original CubiST algorithm in [6] does not
efficiently support complex cube queries which involve abstraction hierarchies for multiple dimensions. Abstraction
hierarchies are the result of partitioning dimension values of a data cube into different levels of granularity (e.g., days
into weeks, quarters, and years).
In this paper, we present a new methodology called CubiST++ for evaluating ad-hoc cube queries, which is based on
the original version of CubiST but greatly expands its usefulness and efficiency for all types of cube queries.
CubiST++ specifically improves the evaluation of cube queries involving arbitrary navigation over the abstraction
hierarchies in a data cube, an important and frequent class of queries. Instead of using a single (base) tree to compute
all cube queries, our algorithms materialize a family of trees from which CubiST++ selects potentially smaller derived
trees to compute the answer. This reduces memory requirements and I/O, and improves processing speed.
2 Related Research
Research related to this work falls into three broad categories: OLAP servers (including ROLAP and MOLAP),
indexing, and view materialization.
ROLAP servers use the familiar “row-and-column view” to store the data in relational tables using a star or
snowflake schema design [4]. In the ROLAP approach, OLAP queries are translated into relational queries using the
standard relational operators such as selection, projection, relational join, group-by, etc. However, directly executing
translated SQL can be very inefficient. As a result, many commercial ROLAP servers extend SQL to support
important OLAP operations directly inside the relational engine (e.g., RISQL from Redbrick Warehouse [24], the
cube operator in Microsoft SQL Server [17]). MicroStrategy [18], Redbrick [23], and Informix's Metacube [12] are
examples of ROLAP servers.
MOLAP servers use proprietary data structures such as multidimensional arrays to store the data cubes. Arbor
Software's Essbase [2], Oracle Express [21], and Pilot LightShip [22] are based on MOLAP technology. When the
number of dimensions and their domain sizes increase, the data becomes very sparse. Storing sparse data in an array
directly is inefficient. A popular technique to deal with sparse data is chunking [27] in which the full cube (array) is
chunked into small pieces called cuboids.
Often specialized index structures are used in conjunction with OLAP servers to improve query performance.
When the domain sizes are small, a bitmap index structure [20] can be used to help speed up OLAP queries. However,
simple bitmap indexes are not efficient for large-cardinality domains and large range queries. In order to overcome
this deficiency, an encoded bitmap scheme has been proposed [3]. A well-defined encoding can reduce the complexity
of the retrieve function thus optimizing the computation. However, designing well-defined encoding algorithms
remains an open problem. A good alternative to encoded bitmaps for large domain sizes is the B-Tree index structure
[5]. O’Neil and Quass [19] provide an excellent overview of and detailed analyses for index structures which can be
used to speed up OLAP queries.
View materialization refers to the pre-computing of partial query results used to derive the answer for frequently
asked queries. Since it is impractical to materialize all possible views, view selection is an important research
problem. For example, [11] introduced a greedy algorithm for choosing a near-optimal subset of views from a view
materialization lattice. The computation of the materialized views, some of which depend on previously materialized
views in the lattice, can be expensive when the views are stored on disk. More recently, [9, 13, 14], for example,
have developed various algorithms for view selection in data warehouse environments.
Another optimization to processing OLAP queries using view materialization is to pipeline and overlap the
computation of group-by operations to amortize the disk reads, as proposed by [1]. Other related research in this area
has focused on indexing pre-computed aggregates [25] and incrementally maintaining them [16]. Also relevant is the
work on maintenance of materialized views (see [15] for a summary of excellent papers) and processing of
aggregation queries [8, 26]. However, in order to support truly ad-hoc OLAP queries, indexing and pre-computation of results alone will not suffice. For example, building an index for each attribute of the
warehouse or pre-computing every sub-cube requires too much space and results in unacceptable maintenance costs.
We believe that our approach, which is based on a new data structure together with view selection and query answering
algorithms, holds the answer to supporting ad-hoc OLAP queries efficiently.
3 Answering Cube Queries Using Statistics Trees
In this section, we briefly review CubiST and the statistics tree (ST) data structure on which our new approach is
based. Complete details can be found in [6]. An ST is a multi-way, balanced tree whose leaf nodes hold the
aggregate values for a set of records consisting of one or more attributes. By aggregate value we are referring to the
result of applying an aggregation function (e.g., count, min, max) to each combination of dimension values in the
data set. Leaf nodes are connected to facilitate retrieval of multiple aggregates. Each level in the tree (except the leaf
level) corresponds to an attribute. An internal node has one pointer for each domain value, and an additional “star”
pointer representing aggregation along the attribute (the star has the intuitive meaning “any”, analogous to the star in
Gray’s cube operator [7]). Internal nodes contain no data values.
In order to represent a data cube, an empty ST is created which has as many levels as there are dimensions in the
cube (plus one for the leaves). The size of the nodes on each level corresponds to the cardinality of the dimension
(plus one for the star pointer). Initially, the values in the leaf nodes are set to zero. Populating the tree is done during
a one-time scan of the data set: for each record in the set, the tree is traversed based on the dimension values in the
tuple and the leaf node at the end of the path is updated accordingly (e.g., incremented by one in the case of the count
function).
Figure 1 depicts a statistics tree corresponding to a data cube with three dimensions with cardinalities d1=2, d2=3,
and d3=4 respectively. The contents of the tree are shown after inserting the record (1,2,4) into the empty tree. Note
that the leaf nodes store the result of applying the aggregation function count to the data set, i.e., counting all
possible combinations of domain values. Further note that each tuple results in updates to eight different leaf nodes: at
each node the algorithm follows the pointer corresponding to the data value as well as the star pointer, resulting in 2^n
paths, where n is the number of dimensions (the number of internal levels of the tree). The update paths relating to the sample record are shown as solid thick lines in
Figure 1.
[Figure omitted: a four-level statistics tree; levels 1-3 are internal nodes for the three dimensions (fan-outs 3, 4, and 5, each including a star pointer), level 4 holds the leaf nodes, and the leaves on the eight update paths hold the value 1.]
Figure 1: Statistics tree after processing input record (1,2,4)
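
To make this concrete, the following is a minimal sketch in Python of an ST supporting the count function (an illustration only; names such as STNode are ours, not the paper's implementation, and the chaining of leaf nodes is omitted). Each inserted record follows both the value pointer and the star pointer at every internal level and therefore updates 2^n leaves:

class STNode:
    # Internal nodes hold one child per domain value plus a star ("any")
    # child; leaf nodes hold the aggregate value (here: a count).
    def __init__(self, cardinalities, level=0):
        self.level = level
        if level < len(cardinalities):
            self.children = [STNode(cardinalities, level + 1)
                             for _ in range(cardinalities[level] + 1)]
            self.count = None
        else:
            self.children = None   # leaf node
            self.count = 0

    def insert(self, record):
        # Follow the value pointer and the star pointer at every level,
        # updating all 2^n reachable leaves.
        if self.children is None:
            self.count += 1        # aggregation function: count
            return
        value = record[self.level]                # dimension values are 1-based
        self.children[value - 1].insert(record)  # value pointer
        self.children[-1].insert(record)         # star pointer

# The tree of Figure 1 (d1=2, d2=3, d3=4) after inserting record (1,2,4):
root = STNode([2, 3, 4])
root.insert((1, 2, 4))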
Once all the records are inserted, the ST is ready to answer queries. An ad-hoc cube query can be viewed as an
aggregation of the measure attributes over a region defined by a set of dimensional constraints. Formally, a cube query
q is a tuple defining the regions of the cube over which to aggregate: q = agg(si1, si2, ..., sir), where agg describes the
aggregation function to be used, {i1, i2, ..., ir} ⊆ {1, 2, ..., k}, r is the number of dimensions specified in the query, and k
is the total number of dimensions in the data set. Component sij describes a constraint for dimension ij, specifying the
selected values, which can either be a singleton, a range, or a partial set. For instance, query q = count(s1=4, s4=[1,3],
s5={1,3}), specifies three constraints on dimensions 1, 4, and 5 of the data set: “4” states that the selected value of the
first dimension must be the singleton value 4, “[1,3]” states that the desired values for the fourth dimension fall in the
range from 1 to 3, and “{1,3}” states that the selected values for the fifth dimension must be either 1 or 3 (partial
selection). The function count specifies that the total number of records satisfying the constraints be returned as
result. In order to satisfy the above query, CubiST uses the constraint values in q to follow the paths to the desired
leaves in the underlying ST. In the case of s1=4 this means following the fourth pointer of the node representing the
first dimension in the tree, and so forth. There are a total of six different paths since the query is composed of 1*3*2 = 6
partial results, one for each combination of the selected values of the three constrained dimensions. The partial results have been pre-computed during
the creation of the tree (using the function count). The sum of the values at the six leaf nodes forms the result. Details
about CubiST and how it can be integrated with relational warehouse systems are provided in [6].
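
Continuing the sketch above (again with our own hypothetical names), query evaluation walks only the constrained paths and sums the pre-computed leaf values, following the star pointer for every unconstrained dimension:

def evaluate(node, query):
    # query maps a 0-based dimension index to the list of selected domain
    # values; a dimension absent from query is unconstrained.
    if node.children is None:
        return node.count
    selected = query.get(node.level)
    if selected is None:
        return evaluate(node.children[-1], query)   # star pointer
    return sum(evaluate(node.children[v - 1], query) for v in selected)

# count(s1=1, s3=[2,4]) on the tree built above; dimension 2 is unconstrained:
print(evaluate(root, {0: [1], 2: [2, 3, 4]}))       # prints 1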
The original CubiST provides a useful framework for answering cube queries. However, CubiST does NOT
efficiently answer queries on data cubes containing hierarchies on the dimensional values. In those cases, one must
first transform the conditions of the query which are expressed using different levels of abstraction into conditions not
involving hierarchies, and then treat them as range queries or partial queries. Given the amount of data to be stored in
the tree, the underlying ST may be large and may not fit into memory all at once. In order to overcome this problem
we have developed CubiST++ ("CubiST plus Family") to automatically support abstraction hierarchies in the
underlying data set and to alleviate the memory problems of its predecessor. In the next sections we describe a suite of
algorithms for generating families of ST's and for using them to efficiently compute ad-hoc cube queries.
4 Generating Families of Statistics Trees
One way to reduce the cost of paging STs in and out of memory is to transmit only the leaves of the ST. The internal
nodes can be generated in memory without additional I/O. A more effective approach is to partition an ST into
multiple, smaller subtrees, each representing a certain part of the data. In most cases, cube queries can be answered
using one of the smaller subtrees rather than the single large ST. We term this single large ST, as defined in the
original version of CubiST, the base tree. Using this base tree, one can compute and materialize other statistics trees with
the same dimensions but containing different levels of the abstraction hierarchy for one or more of the dimensions.
We term these smaller trees derived trees. A base tree and its derived trees form a family of statistics trees with
respect to the base tree. Before we describe CubiST++, we first show how to formally represent hierarchies.
4.1 Hierarchical Partitioning and Mapping of Domain Values
In relational systems, dimensions containing hierarchies are stored in the familiar row-column format in which
different levels are represented as separate attributes. In our new framework, we regard the interdependent levels in a
hierarchy as one hierarchical attribute and map the values at different levels into integers. In this representation, a
value at a higher level of abstraction includes multiple values at a lower level. By scanning a dimension table, we can
establish the mappings, hierarchical partitions, as well as the inclusion relationships.
To illustrate, consider Table 1 which represents the location information in a data source. We first group the
records by columns Region, State and City, remove duplicates, and then map the domain values into integers using the
sorted order. The result is shown in Table 2 (numbers in parentheses are the integer assignments for the strings).
LocationID   City          State   Region
1            Gainesville   FL      SE
2            Atlanta       GA      SE
3            Los Angeles   CA      WC
4            Chicago       IL      MW
5            Miami         FL      SE
6            San Jose      CA      WC
7            Seattle       WA      WC
8            Twin City     MN      MW

Table 1: Original location data

Region   State    City
MW (1)   IL (1)   Chicago (1)
         MN (2)   Twin City (2)
SE (2)   FL (3)   Gainesville (3)
                  Miami (4)
         GA (4)   Atlanta (5)
WC (3)   CA (5)   Los Angeles (6)
                  San Jose (7)
         WA (6)   Seattle (8)

Table 2: Partitioning and mapping of location data
Table 3 is created by replacing the strings in Table 1 with their integer assignments. Mapping strings into integers
saves space and facilitates data processing internally. The compressed representation of the transformed tables reduces
the runtime of join operations during the initialization phase of the ST since most relational warehouses store
measures and dimension values in separate tables. Note that we only need to store Tables 2 and 3 because they allow
us to deduce the contents of Table 1.
LocationID   City   State   Region
1            3      3       2
2            5      4       2
3            6      5       3
4            1      1       1
5            4      3       2
6            7      5       3
7            8      6       3
8            2      2       1

Table 3: Location data after transformation
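
The grouping, duplicate elimination, and integer assignment just described can be sketched as follows (our own illustration; the dimension table is modeled as an in-memory list of (region, state, city) tuples):

def build_mapping(rows):
    # Sort by (region, state, city) and assign integers in order of first
    # appearance, which reproduces the level-wise codes of Table 2.
    region_code, state_code, city_code = {}, {}, {}
    for region, state, city in sorted(set(rows)):
        region_code.setdefault(region, len(region_code) + 1)
        state_code.setdefault(state, len(state_code) + 1)
        city_code.setdefault(city, len(city_code) + 1)
    return region_code, state_code, city_code

rows = [("SE", "FL", "Gainesville"), ("SE", "GA", "Atlanta"),
        ("WC", "CA", "Los Angeles"), ("MW", "IL", "Chicago"),
        ("SE", "FL", "Miami"), ("WC", "CA", "San Jose"),
        ("WC", "WA", "Seattle"), ("MW", "MN", "Twin City")]
regions, states, cities = build_mapping(rows)
# regions == {"MW": 1, "SE": 2, "WC": 3}; cities["Gainesville"] == 3, as in Table 2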
4.2 A Greedy Algorithm for Selecting Family Members
To generate a family of statistics trees, we first need an algorithm for choosing, from the set of all possible candidate
trees (i.e., trees containing all combinations of dimensions and corresponding hierarchy levels), those trees that will
eventually make up the family. Second, we need an algorithm for deriving the selected trees from the base tree.
In principle, any combination of hierarchy levels can be selected as a candidate tree. However, we need to consider
space limitations and maintenance overhead. We introduce a greedy algorithm to choose the members in a step-by-step fashion. During each step, starting from the base tree, we roll up the values of the largest dimension to its next
level in the hierarchy, keeping the other dimensions the same. This newly formed ST becomes a new member of the
family. The iteration continues until each dimension is rolled up to its highest level or until the size of the family
exceeds the maximum available space. The pseudo-code for this greedy selection mechanism is shown in Figure 2.
The total size of the derived trees is only a small fraction of the size of the base tree. In line 1, the original ST initialization
algorithm proposed in [6] is used. The roll-up algorithm of line 7 is discussed in Section 4.3.
1   Setup base tree T0 by scanning input data;
2   Save T0 in repository;
3   Save its parameters to Metadata Repository;
4   T = T0;
5   Repeat {
6     Find dimension with largest cardinality in T;
7     Rollup dimension to generate a new derived tree T;
8     Save T;
9     Save params of T to Metadata Repository;
10  } Until (stop conditions);

Figure 2: Greedy algorithm for generating a family of statistics trees
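
In Python, the greedy loop of Figure 2 might be sketched as follows (a sketch only: size, num_dims, can_roll_up, cardinality, and roll_up are hypothetical helpers standing in for the tree operations; the roll-up itself is the subject of Section 4.3):

def select_family(base_tree, max_size):
    # Greedily roll up the largest remaining dimension until every dimension
    # is at its highest hierarchy level or the space budget is exhausted.
    family = [base_tree]
    tree = base_tree
    while True:
        candidates = [d for d in range(tree.num_dims()) if tree.can_roll_up(d)]
        if not candidates:
            break                                  # all at the highest level
        d = max(candidates, key=tree.cardinality)  # largest dimension first
        tree = tree.roll_up(d)                     # see Section 4.3
        if sum(t.size() for t in family) + tree.size() > max_size:
            break                                  # family exceeds the budget
        family.append(tree)                        # save tree and its metadata
    return family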
Let us illustrate with an example. Suppose a base tree T0 represents a data set with three dimensions d1, d2, d3
whose cardinalities are 100 each. Each dimension is organized into ten groups of ten elements to form a two-level
hierarchy. We use a superscripted hash mark to indicate values corresponding to the second hierarchy level. For
example, 0# represents values 0 through 9, 1# represents values 10 through 19, etc. The degrees of the internal nodes of T0
are 101 for each level of the tree (the extra 1 accounts for the star pointer). The selected derived trees T1, T2, T3 that
make up a possible family together with T0 are as follows: T1 is the result of rolling up d3 in T0 (since all three
dimensions are of the same size, we pick one at random). T1 has degrees 101, 101, and 11. T2 is the result of rolling up
d2 in T1. Its degrees are 101, 11, and 11. T3 is the result of rolling up d1 in T2. Its degrees are 11, 11, and 11. The
family generated from T0 is shown in Figure 3.
[Figure omitted: the four trees arranged by hierarchy level; T0 has all of d1, d2, d3 at hierarchy level 0, while T1, T2, and T3 have successively more dimensions rolled up to level 1.]
Figure 3: A family of ST's consisting of {T0, T1, T2, T3}
4.3 Deriving a New Statistics Tree
Except for the base tree, which is computed from the initial data set, each derived tree is computed by rolling up one of
its dimensions. Before introducing our new tree derivation algorithm, we first define a new operator “⊕” on STs as
follows:
Definition: Two STs are isomorphic if they have the same structure except for the values of their leaves. The
summation of two isomorphic ST's S1, S2 is a new ST S that has the same structure as S1 or S2 but its leaf values are
the summation of the corresponding leaf values of S1 and S2. This relationship is denoted by S = S1 ⊕ S2.
To derive a new tree, proceed from the root to the level that is being rolled up. According to the mapping
relationship described in Section 4.1, recursively reorganize and merge the sub-trees for all the nodes in that level to
form new subtrees. For each node, adjust the degree and update its pointers to point to the newly created subtrees.
The best way to illustrate this tree derivation process is through a simple example. Suppose that during the
derivation, we need to roll-up a dimension that has nine values and forms a two-level hierarchy consisting of three
groups with three values each. A sample node N of this dimension with children S1,.., S9, S* is shown in the left of
Figure 4. We group and sum up its nine sub-trees into three new trees S1#, S2# and S3# each having three subtrees. The
new node N' with degree four is shown on the right side of the figure. The star sub-tree remains the same. Using the
definitions from above, we can see that the following relationships hold among the subtrees of nodes N and N':
S1# = S1 ⊕ S2 ⊕ S3, S2# = S4 ⊕ S5 ⊕ S6, and S3# = S7 ⊕ S8 ⊕ S9.
[Figure omitted: node N with children S1, ..., S9 and star child S* is rolled up into node N' of degree four with children S1#, S2#, S3# and the unchanged star child S*.]
Figure 4: Rolling-up one dimension
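
On the ST sketch from Section 3, the ⊕ operator and the roll-up of Figure 4 might be implemented as follows (our own illustration; groups gives the number of consecutive value subtrees merged into each new higher-level value, e.g. [3, 3, 3] for node N above):

import copy
from functools import reduce

def tree_sum(s1, s2):
    # The ⊕ operator: leaf-wise sum of two isomorphic STs.
    out = copy.deepcopy(s1)
    def add(a, b):
        if a.children is None:
            a.count += b.count
        else:
            for ca, cb in zip(a.children, b.children):
                add(ca, cb)
    add(out, s2)
    return out

def roll_up(tree, dim, groups):
    # Merge the value subtrees of every node at depth dim according to
    # groups; the star subtree is carried over unchanged (Figure 4).
    tree = copy.deepcopy(tree)
    def rec(node):
        if node.children is None:
            return
        if node.level == dim:
            merged, i = [], 0
            for g in groups:
                merged.append(reduce(tree_sum, node.children[i:i + g]))
                i += g
            node.children = merged + [node.children[-1]]  # keep star subtree
        else:
            for child in node.children:
                rec(child)
    rec(tree)
    return tree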
5 Answering Cube Queries Using a Family of ST's
Once the family is established and materialized, it can be used to answer cube queries as described in the following
sections.
5.1 Choosing the Optimal Derived Tree
Given a family of trees, potentially multiple trees can answer a given query; however, the smallest possible one is
preferred. The proper tree is the one with the highest level of abstraction that still matches the query. This tree
can be found using the matching scheme described below, which is based on a new concept and
data structure.
Definition: A view matrix (VM) is a matrix in which rows represent dimensions, columns represent the hierarchy
levels, and the entries are the labels of the ST's in the family. Intuitively speaking, a VM describes what views are
materialized in the hierarchies.
Continuing with our running example, T1 has fan-out structure 101*101*11. Its first two dimensions have not
been rolled-up (= level 0 of the abstraction hierarchy) whereas the third dimension has been rolled-up to level 1; hence
T1 is placed in rows 1 and 2 of column 1 and in row 3 of column 2, as shown in Figure 6.
              Level of Abstraction
Dimension   Roll-up level 0   Roll-up level 1
1           T0, T1, T2        T3
2           T0, T1            T2, T3
3           T0                T1, T2, T3

Figure 6: A sample view matrix
A VM is easily implemented as a stack of vectors, each recording, for one tree, the abstraction level of every
dimension. For example, ST T3, on top of the stack, has all of its dimensions rolled up to level 1. Removing (i.e., popping) a
layer from the stack is equivalent to peeling the right-most ST from the VM. The stack corresponding to the VM
above is shown in Figure 7.
T3   (1, 1, 1)   <- top of stack
T2   (0, 1, 1)
T1   (0, 0, 1)
T0   (0, 0, 0)

Figure 7: Stack representation of the VM
5.2 Rewriting and Answering the Query
Next, we introduce a new concept for representing hierarchies in cube queries.
Definition: A query hierarchy vector (QHV) is a vector (l1, l2, ..., lk), where li is the hierarchy level used by the query
for the ith dimension.
Definition: A QHV (l1, l2, ..., lk) matches view T in the VM iff li ≥ the column index of the ith row entry of T in the
VM (the column indices of T compose T's column index vector), for all i = 1, 2, ..., k. In other words, T's column index vector ≤ QHV, component-wise.
Definition: An optimal matching view with respect to a query is the view in the VM that has the highest index (i.e., the
smallest, most aggregated tree) and matches the query's QHV.
              Level of Abstraction
Dimension   Roll-up level 0   Roll-up level 1
1           T0, T1            —
2           T0, T1            —
3           T0                T1

Figure 8: VM after removing T2 and T3
Consider the sample cube query q = count(3#, 2, 5#). Its QHV is (1,0,1). Since the column vector for T3 is (1,1,1),
which does not match (1,0,1), T3 must be removed. In the same way, T2 must also be removed. The resulting VM is
shown in Figure 8. This time, the column index vector of T1 is (0,0,1) ≤ (1,0,1), i.e., T1 matches the QHV. So T1 is the
optimal matching view, which is used to answer q. The pseudo-code for finding the optimal matching tree is shown in
Figure 9.
Each query is evaluated based on its optimal matching ST. Before evaluation, the query must first be rewritten if
its QHV is not equal to the ST's column index vector. For example, query q = count(3#, 2, 5#) will be rewritten into
q' = count([30,39], 2, 5#) so that its new QHV (0,0,1) exactly equals the levels of the optimal ST T1. Then we compute the
query using T1 by applying the original query evaluation algorithm of CubiST as outlined in Section 3. The interested
reader is invited to refer to the technical report [10] for additional details on CubiST++.
BEGIN
1  Compute the QHV for the query;
2  Peel off the entries of the VM layer by layer from right to left
3  until the view matches the QHV;
4  Return selected tree;
END

Figure 9: Finding the smallest matching ST for a given query
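
Putting Figures 7 through 9 together, the matching test, the peeling loop, and the rewriting step can be sketched as follows (names are ours):

def matches(qhv, col_vector):
    # T matches iff its column index vector is component-wise <= the QHV.
    return all(c <= l for c, l in zip(col_vector, qhv))

def optimal_tree(vm_stack, qhv):
    # vm_stack holds (label, column index vector) pairs with the most
    # aggregated tree on top; peel layers until a view matches the QHV.
    for label, col_vector in vm_stack:
        if matches(qhv, col_vector):
            return label
    raise ValueError("unreachable: the base tree T0 matches every QHV")

vm_stack = [("T3", (1, 1, 1)), ("T2", (0, 1, 1)),
            ("T1", (0, 0, 1)), ("T0", (0, 0, 0))]
print(optimal_tree(vm_stack, (1, 0, 1)))   # q = count(3#, 2, 5#)  ->  "T1"
# Before evaluation, q is rewritten against T1's levels (0, 0, 1): with a
# factor of 10, the level-1 constraint 3# on dimension 1 becomes [30, 39].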
6 Performance Evaluation
We simulate a relational warehouse using synthetic data sets with the following parameters: r is the number of records
and k is the number of dimensions with cardinalities d1, d2, ..., dk. The elements in a data set with r rows and k columns
are uniformly distributed over the domains of the corresponding columns. The data set is stored as a
text file and serves as input to the different query evaluation algorithms. For each set of experiments, a set of
randomly generated queries is used. The testbed is implemented using a SUN ULTRA 10 workstation running Sun OS
5.6 with 90MB of available main memory.
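
A sketch of the synthetic data generator implied by this setup (our own naming; the value of column i is drawn uniformly from 1..di):

import random

def generate(r, cardinalities, path, seed=42):
    # Write r records, one per line; column i is uniform over 1..d_i.
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(r):
            record = [str(rng.randint(1, d)) for d in cardinalities]
            f.write(" ".join(record) + "\n")

# e.g. the data set of Section 6.2: five dimensions 10, 10, 15, 15, 15
generate(200000, [10, 10, 15, 15, 15], "warehouse.txt")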
6.1 CubiST++ vs. CubiST
In this experiment we show the effectiveness of using families of statistics trees (CubiST++). We have fixed the
number of records at r = 1,000,000. We also assume there are three dimensions, each of which has a two-level
hierarchy. To simplify the hierarchical inclusion relationships, we further assume that each higher-level value includes
an equal number of lower-level values; this number is referred to as the factor. The test query is q = count(s1) =
count([0,29]). For both approaches we compare the size of the STs (based on the number of leaves), the I/O time spent
on loading the ST from the ST repository, and the total query answering time. Table 4 summarizes the results, which
show a dramatic decrease in I/O time and query answering time over the course of the experiment in favor
of CubiST++.

Factors   Dimension Sizes   Number of Leaves   I/O Time (ms)   Total Time (ms)
1, 1, 1   60, 60, 60        226,981            16,818          25,387
3, 2, 2   20, 30, 30        20,181             1,517           1,655
4, 4, 3   15, 15, 20        5,376              384             446

Table 4: Sample runtimes for different derived trees
6.2 CubiST++ vs. Bitmap
In this experiment we compare the setup and response times of CubiST++ with those of two other approaches: a naive
query evaluation technique, henceforth referred to as scanning, and a bitmap-based query evaluation algorithm
referred to as bitmap. The bitmap-based query evaluation algorithm uses a bitmap index structure to answer cube
queries without touching the original data set once the bit vectors have been set up. In this series of experiments, the
domain sizes of the five dimensions in the data set are 10, 10, 15, 15, and 15 respectively. The goal is to validate the
claim that CubiST++ has excellent scalability in terms of the number of records in the set.
Figures 10 and 11 show the performance of the three algorithms on the same query:
q = count(s1, s2, s3) = count([2,8], [2,8], [5,10]). Although the setup time for CubiST++ is larger than that of Bitmap,
its response time is much faster and independent of the number of records.
Setup Time (ms) vs. Number of Records
            10K      100K     200K
Writing     1,211    11,527   23,125
Bitmap      1,045    10,039   21,660
CubiST++    16,961   32,776   50,786

Figure 10: Setup time vs. number of records

Response Time (ms) vs. Number of Records
            10K      100K     200K
Scanning    979      9,196    18,347
Bitmap      51       494      989
CubiST++    32       31       32

Figure 11: Response time vs. number of records

6.3 CubiST++ vs. Commercial RDBMS
In our last set of experiments we compare running times of CubiST++ against those of a commercial RDBMS. The
single relational table in the warehouse has 1,000,000 rows and 5 columns with domain sizes 10, 10, 10, 10, and 15.
First, we loaded the table and answered five randomly generated cube queries q1 to q5 using the commercial
ROLAP engine. Next, we created a view to group columns in support of the count operation. The view was
materialized and used to answer the same set of queries. The same queries were also processed by CubiST++.
Writing   Loading   View Setup   ST Setup
15        93        122          52

Figure 12: Setup times (in seconds)

            q1      q2      q3      q4      q5
W/o View    2,520   2,640   2,300   1,483   3,580
With View   376     370     290     153     533
ST          1       1       1       1       1

Figure 13: Query response times (in ms)
The results are shown in Figures 12 and 13. The column labeled "Writing" in Figure 12 refers to the time it takes to
write the entire table to disk (a comparison measure only). Notice that in Figure 13 the query execution times for
CubiST++ are around 1 ms and cannot be properly displayed at the relatively large scale of the chart. It is worth pointing out
that CubiST++ beats the setup time of the RDBMS. More importantly, its evaluation times are two orders of magnitude faster than
those posted by the RDBMS, even when the view is used.
7 Conclusion
In this paper, we have presented an overview of CubiST++, a new algorithm for efficiently answering cube queries that
involve constraints on arbitrary abstraction hierarchies. Starting from a single statistics tree, we select and derive an
appropriate family of statistics trees which are smaller than the base tree and, taken together, can answer most
ad-hoc cube queries while reducing the requirements on I/O and memory. We have proposed a new greedy algorithm
to choose family members and a new roll-up algorithm to compute derived trees. Using our matching algorithm, the
smallest tree that can answer a given query is chosen. Our experiments demonstrate the performance benefits of
CubiST++ over existing OLAP techniques.
References
[1] S. Agarwal, R. Agrawal, P. Deshpande, J. Naughton, S. Sarawagi, and R. Ramakrishnan, “On The Computation of Multidimensional Aggregates,” in Proceedings of the International Conference on Very Large Databases, Mumbai (Bombay), India, 1996.
[2] Arbor Systems, “Large-Scale Data Warehousing Using Hyperion Essbase OLAP Technology,” Arbor Systems, White Paper, http://www.hyperion.com/whitepapers.cfm.
[3] C. Y. Chan and Y. E. Ioannidis, “Bitmap Index Design and Evaluation,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, pp. 355-366, 1998.
[4] S. Chaudhuri and U. Dayal, “An Overview of Data Warehousing and OLAP Technology,” SIGMOD Record, 26:1, pp. 65-74, 1997.
[5] D. Comer, “The Ubiquitous B-Tree,” ACM Computing Surveys, 11:2, pp. 121-137, 1979.
[6] L. Fu and J. Hammer, “CubiST: A New Algorithm for Improving the Performance of Ad-hoc OLAP Queries,” in Proceedings of the ACM Third International Workshop on Data Warehousing and OLAP (DOLAP), Washington, DC, pp. 72-79, 2000.
[7] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals,” Data Mining and Knowledge Discovery, 1:1, pp. 29-53, 1997.
[8] A. Gupta, V. Harinarayan, and D. Quass, “Aggregate-query Processing in Data Warehousing Environments,” in Proceedings of the International Conference on Very Large Databases, Zurich, Switzerland, pp. 358-369, 1995.
[9] H. Gupta and I. Mumick, “Selection of Views to Materialize Under a Maintenance Cost Constraint,” Stanford University, Technical Report.
[10] J. Hammer and L. Fu, “Improving the Performance of Data Cube Queries Using Families of Statistics Trees,” University of Florida, Gainesville, FL, Research Report, January 2001, http://www.cise.ufl.edu/~jhammer/publications.html.
[11] V. Harinarayan, A. Rajaraman, and J. D. Ullman, “Implementing Data Cubes Efficiently,” SIGMOD Record (ACM Special Interest Group on Management of Data), 25:2, pp. 205-216, 1996.
[12] Informix Inc., “Informix MetaCube 4.2, Delivering the Most Flexible Business-Critical Decision Support Environments,” Informix, Menlo Park, CA, White Paper, http://www.informix.com/informix/products/tools/metacube/datasheet.htm.
[13] W. Labio, D. Quass, and B. Adelberg, “Physical Database Design for Data Warehouses,” in Proceedings of the International Conference on Data Engineering, Birmingham, England, pp. 277-288, 1997.
[14] M. Lee and J. Hammer, “Speeding Up Warehouse Physical Design Using A Randomized Algorithm,” in Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW '99), Heidelberg, Germany, 1999, ftp://ftp.dbcenter.cise.ufl.edu/Pub/publications/VSgenetic-full.doc/pdf.
[15] D. Lomet, ed., Bulletin of the Technical Committee on Data Engineering, vol. 18, IEEE Computer Society, 1995.
[16] Z. Michalewicz, Statistical and Scientific Databases, Ellis Horwood, 1992.
[17] Microsoft Corp., “Microsoft SQL Server,” Microsoft, Seattle, WA, White Paper, http://www.microsoft.com/federal/sql7/white.htm.
[18] MicroStrategy Inc., “The Case For Relational OLAP,” MicroStrategy, White Paper, http://www.microstrategy.com/publications/whitepapers/Case4Rolap/execsumm.HTM.
[19] P. O'Neil and D. Quass, “Improved Query Performance with Variant Indexes,” SIGMOD Record (ACM Special Interest Group on Management of Data), 26:2, pp. 38-49, 1997.
[20] P. E. O'Neil, “Model 204 Architecture and Performance,” in Proceedings of the 2nd International Workshop on High Performance Transaction Systems, Asilomar, CA, pp. 40-59, 1987.
[21] Oracle Corp., “Oracle Express OLAP Technology,” Web site, http://www.oracle.com/olap/index.html.
[22] Pilot Software Inc., “An Introduction to OLAP Multidimensional Terminology and Technology,” Pilot Software, Cambridge, MA, White Paper, http://www.pilotsw.com/olap/olap.htm.
[23] Redbrick Systems, “Aggregate Computation and Management,” Redbrick, Los Gatos, CA, White Paper, http://www.informix.com/informix/solutions/dw/redbrick/wpapers/redbrickvistawhitepaper.html.
[24] Redbrick Systems, “Decision-Makers, Business Data and RISQL,” Informix, Los Gatos, CA, White Paper, 1997, http://www.informix.com/informix/solutions/dw/redbrick/wpapers/risql.html.
[25] J. Srivastava, J. S. E. Tan, and V. Y. Lum, “TBSAM: An Access Method for Efficient Processing of Statistical Queries,” IEEE Transactions on Knowledge and Data Engineering, 1:4, pp. 414-423, 1989.
[26] W. P. Yan and P. Larson, “Eager Aggregation and Lazy Aggregation,” in Proceedings of the International Conference on Very Large Databases, Zurich, Switzerland, pp. 345-357, 1995.
[27] Y. Zhao, P. M. Deshpande, and J. F. Naughton, “An Array-Based Algorithm for Simultaneous Multidimensional Aggregates,” SIGMOD Record (ACM Special Interest Group on Management of Data), 26:2, pp. 159-170, 1997.