Answering Data Cube Queries Using Families of Statistics Trees
Lixin Fu and Joachim Hammer
Department of Computer & Information Sciences and Engineering
University of Florida
Gainesville, Florida 32611-6120
{lfu, jhammer}@cise.ufl.edu
Abstract
In data warehousing and OLAP, the modeling and implementation of the data cube
operation have been a major concern. A cube query computes aggregates of
measures over an arbitrary combination of dimensions in a relational warehouse and
often requires multiple table joins and multiple scans. Speeding up cube
computation therefore has a great influence on decision support systems (DSS). The CubiST
(CUBing with Statistics Trees) algorithm evaluates an ad-hoc cube query using new
statistics tree structures, which are initialized by scanning the original table.
In this paper, we present further cube query optimizations over CubiST. In
particular, the data cube operator is generalized so that arbitrary navigation over
the abstraction hierarchies of different dimensions becomes much more efficient. We view
hierarchies as iterative, layered partitionings of low-level values into high-level values
after mapping them to integers. Instead of using only flat statistics trees to compute
queries, we choose potentially much smaller trees from families of
statistics trees, thus improving query performance significantly.
Experiments demonstrate the effectiveness of these optimizations and show superior
performance compared to other alternatives.
1. Introduction
OLAP (On-line Analytical Processing) [6, 7] and data warehousing have been active research areas. A
warehouse contains data from a number of independent sources, integrated and cleansed to support clients
who wish to analyze the data for trends and anomalies. The decision support is provided by OLAP tools
which present their users with a multi-dimensional perspective of the data in the warehouse and facilitate
the writing of reports involving aggregations along the various dimensions of the data set [8].
In the multi-dimensional data model, data is structured around measures and dimensions. Measures are
numerical data being tracked (e.g. sales). Dimensions are the natural business parameters that define the
individual transactions (e.g. time, location, product). Some dimensions may have hierarchies. For example,
time may have a “day→month→year” hierarchy. OLAP queries select data sets called subcubes that are
represented in multi-dimensional regions. Slicing, dicing, rolling-up, drilling-down, and pivoting are
typical OLAP operators. The data cube operator, which was proposed in [12], contains these operators and
generalizes aggregates, subtotals, cross tabulations, and group-bys.
Most OLAP queries involve aggregates of measures over arbitrary regions of the dimensions. This ad-hoc
aggregate navigation helps users to obtain the “big picture” of underlying data and generate or verify user
hypotheses. In our recent paper [ref], we gave these aggregate-aware queries a new term, cube
queries (a.k.a. cube operations or cubing), and provided a new efficient cubing algorithm called CubiST
(CUBing with Statistics Trees).
A cube query computes the aggregates of measures over an arbitrary combination of dimensions in a
relational warehouse. The cube query generalizes the cube operator in that each selected dimension set in
the query can be a value, a range, or an arbitrary subset of domains. For example, a data warehouse
containing sales information for automobiles across the US may be used to answer a cube query such as
“How many red Toyota cars have been sold in Florida and Georgia between 1990 and 2000?” Answering
this query requires the selection of tuples satisfying the specified conditions as well as the aggregation of
the sales data along the location, time, and product attributes. On any but the smallest databases, these
kinds of queries will tax even the most sophisticated query processors. Such queries often need multiple
table joins and multiple scans. In the data warehouse environment, implementing cube queries over large databases
poses great challenges. The database size often grows to hundreds of GBs or TBs with millions or even
billions of records with high dimensionality and large domain sizes. There are many algorithms for
evaluating OLAP and cube queries, but no existing indexing or query optimization technique performs sufficiently
well for high-dimensional data [4]. Current systems (e.g. ROLAP, MOLAP, indexing, materialized views,
etc.) suffer from the fact that the types of supported OLAP queries must be known beforehand in order to
set up the appropriate index structures and assure that relevant partial results have been pre-computed. This
limits the analysts’ ability to navigate freely through the entire data set, in order to get a high-level
overview of possible trends and patterns. Speeding up cube computation therefore has a great influence on
decision support systems (DSS).
CubiST evaluates an ad-hoc cube query based on a novel data structure called statistics trees (STs). Simply
speaking, a statistics tree is a multi-way tree in which internal nodes contain references to next-level nodes
and are used to direct the query evaluation. Leaf nodes hold the statistics or histograms for the data (e.g.,
SUM, COUNT, MIN, MAX values) and are linked together to facilitate scanning, similar to the B/B+-Tree data
structure [9]. In order to use an ST to answer cube queries over a particular data set, one must
first pre-compute the aggregations on all subcubes by scanning the detailed data set: while reading each
record, we walk through the tree according to its column values and update the aggregate values in the leaves.
CubiST computes the aggregation information rather than listing the details of the records that satisfy the
query conditions. Duplicate records are aggregated into the same leaves. In this way, many operations
performed on records, such as sorting and hashing, are eliminated.
In this paper, we present further cube query optimizations over CubiST. In particular, the data cube
operator is generalized so that arbitrary navigation over the abstraction hierarchies of different
dimensions becomes much more efficient. We view hierarchies as iterative, layered partitionings of low-level
values into high-level values after mapping them to integers. Instead of using only flat statistics trees to
compute queries, we choose potentially much smaller trees from families of statistics trees, thus
improving query performance significantly. The main contribution of this paper is a comprehensive new
methodology for answering cube queries with arbitrary-level aggregations over any subset of the underlying
dimensions of the data cube.
The remainder of the paper is organized as follows. Section 2 reviews current and past research activities
related to the work presented here, focusing chiefly on OLAP query processing, indexing and view
materialization in data warehouses. In section 3, we briefly introduce statistics trees and CubiST in
preparation for the new algorithms presented later in the paper. Section 4 presents our new view of
hierarchies and mappings. Section 5 describes the generation of families of statistics trees. We give our
cube query optimization algorithm using these families in section 6. A description of our experimental system
and the results of our evaluation are presented in section 7. A summary and concluding remarks are
presented in section 8.
2. State-of-the-Art
Research related to this work falls into three broad categories: OLAP servers including ROLAP and
MOLAP, indexing, and view materialization in data warehousing.
2.1. ROLAP Servers
ROLAP servers use the familiar “row-and-column view” to store the data in relational tables using a star or
snowflake schema design [7]. In the star schema, there is a fact table plus one or more dimension tables.
The snowflake schema is a generalization of the star schema where the core dimensions have aggregation
levels of different granularities. In the ROLAP approach, cube queries are translated into relational queries
against the underlying star or snowflake schema using the standard relational operators such as selection,
projection, relational join, group-by, etc. However, directly executing translated SQL can be very
inefficient and as a result, many commercial ROLAP servers extend SQL to support important OLAP
operations directly (e.g., RISQL from Redbrick Warehouse [30], cube operator in Microsoft SQL Server
[23]).
A simple algorithm, the 2^N-algorithm, for evaluating the cube operator is proposed in [12]. In this algorithm,
where N is the number of dimensions, a handle is allocated for each cell of the data cube. For each
new record (x1,x2,…,xN,v) the handle function is called 2^N times – once for each handle of each cell of the
cube matching this value. Here, the xi are the dimension values and v is the measure. When all input values
have been processed, the final aggregate for each of the nodes in the cube is computed. Due to the large
number of handles, this algorithm does not scale well for large N.
To speed up the group-by’s, indices and materialized views are widely used. As far as we know, there is no
internal ROLAP algorithm in the literature for evaluating cube queries efficiently. MicroStrategy [24],
Redbrick [29], Informix's Metacube [18] and Information Advantage [17] are examples of ROLAP servers.
2.2. MOLAP Servers
MOLAP servers use proprietary data structures such as multidimensional arrays to store the data
cubes. MOLAP is often several orders of magnitude faster than the ROLAP alternative when the dimensionality and
domain size are relatively small compared to the available memory. However, when the number of
dimensions and their domain sizes increase, the data becomes very sparse resulting in many empty cells in
the array structure (especially cells containing high dimensional data). Storing sparse data in an array in this
fashion is inefficient.
A popular technique to deal with sparse data is chunking [33]. The full cube (array) is chunked into
small pieces called cuboids. For each non-empty cell, an (OffsetInChunk, data) pair is stored. Zhao et al.
describe a single-pass, multi-way algorithm that overlaps the different group-by computations to minimize
the memory requirement. The authors also give a lower-bound for the memory which is required by the
minimal memory spanning tree (MMST) of the optimal dimension order (which increases with the domain
sizes of these dimensions). Their performance evaluations show that a MOLAP server using an appropriate
chunk-offset compression algorithm is much faster than most ROLAP servers. However, if there is not
enough memory to hold the MMST, several passes over the input data are needed. In the first read-write
pass, data is partitioned. In the second read-write pass, the partitions are clustered further into chunks.
Additional passes may be needed to compute all aggregates in the MMST execution plan. In this case, the
initialization time may be prohibitively large. In addition, since the materialized views reside on disk,
answering OLAP queries may require multiple disk I/Os.
To address the scalability problem of MOLAP, Goil and Choudhary proposed a parallel MOLAP
infrastructure called PARSIMONY [10, 11]. Their algorithm incorporates chunking, data compression,
view optimization using a lattice framework, as well as data partitioning and parallelism. The chunks can
be stored as multi-dimensional arrays or (OffsetInChunk,data) pairs depending on whether they are
dense or sparse. The OffsetInChunk is bit-encoded (BESS). However, like other MOLAP implementations,
the algorithm still suffers from high I/O costs during aggregation because of frequent paging operations that
are necessary to access the underlying data.
In general, ROLAP is more scalable in terms of the data size, while MOLAP has better performance when
the number of dimensions is small. However, the decision of whether to use ROLAP or MOLAP does not
only depend on the original data size but also on the data volume, which is defined as the product of the
cardinalities. To illustrate our point, consider a database with 50 billion records and three dimensions, each
having 100 values. Suppose that each row needs 4*4 bytes (assuming an integer uses 4 bytes and the table
has one measure); then the data size is 16*50*10^9 = 800GB, but the data volume is only 100*100*100 = 1M. In
this case, MOLAP would be a better choice.
Arbor software's Essbase [3], Oracle Express [27] and Pilot LightShip [28] are based on MOLAP
technology. The latest trend is to combine ROLAP and MOLAP in order to take advantage of the best of
both worlds. For example, in PARSIMONY, some of the operations within sparse chunks are relational
while operations between chunks are multidimensional.
2.3. Work on Indexing
Specialized index structures are another way to improve the performance of OLAP queries. The use of
complex index structures is made possible by the fact that the data warehouse is a “read-mostly”
environment in which updates are performed in large batch processes. This allows time for reorganizing the
data and indexes to a new optimal clustered form.
When the domain sizes are small, a bitmap index structure [26] can be used to help speed up OLAP
queries. A bitmap index for a dimension with m values generates m bitmaps (bit vectors) of length N, where
N is the number of records in the underlying table. To initialize a bitmap index on a particular attribute
(dimension) of a table, we set the bits in each bitmap as follows: for each record we indicate the occurrence
of a particular value with a 1 in the same row of the bitmap that represents the value; the bits in all other
bitmaps for this row will be set to 0.
Bitmap indexes use bit-wise logical AND, OR, and NOT operations to speed up the computation of the where-clause
predicates in queries. However, simple bitmap indexes are not efficient for large-cardinality domains
and large range queries. In order to overcome this deficiency, an encoded bitmap scheme has been
proposed [5]. Suppose a dimension has 1,024 values. Instead of using 1,024 bit vectors, most rows of which
are zero, log2 1024 = 10 bit vectors are used, plus a mapping table and a Boolean retrieve function. A
well-defined encoding can reduce the complexity of the retrieve function, thus optimizing the computation.
However, designing well-defined encoding algorithms remains an open problem.
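To make the bitmap technique concrete, the following Python sketch (our illustration, not taken from [26] or [5]) builds a simple bitmap index for one column and combines two bitmaps with a bit-wise OR; the toy column and domain size are hypothetical.

    def build_bitmap_index(column, domain_size):
        # One bit vector per domain value (here a Python int used as a bit
        # string); bit r of bitmap v is 1 iff record r has value v.
        bitmaps = [0] * domain_size
        for r, v in enumerate(column):
            bitmaps[v] |= 1 << r
        return bitmaps

    # toy column with values drawn from a domain of size 3
    col = [0, 2, 1, 2, 0]
    bm = build_bitmap_index(col, 3)
    # records whose value is 0 OR 1, via bit-wise OR of the two bitmaps
    matches = bm[0] | bm[1]
    print(bin(matches))          # 0b10101 -> records 0, 2 and 4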
Bitmap schemes are a powerful means to evaluate complex OLAP queries when the number of records is
small enough so that the entire bitmap fits into main memory. If not all of the bitmaps fit into memory (e.g.,
when the number of records is large), query processing will require many I/O operations. Even when all the
bitmaps fit into memory, the runtime of an algorithm using bitmap indexes is proportional to the number of
records in the table. Later in the paper, we show that the runtime of a bitmap-based algorithm is much
larger than the runtime of CubiST, which has a worst-case runtime proportional to the number of
dimensions of the data cube.
A good alternative to encoded bitmaps for large domain sizes is the B-Tree index structure [9]. O’Neil and
Quass [25] provide an excellent overview of and detailed analyses for index structures which can be used to
speed up OLAP queries.
2.4. View Materialization
View materialization in decision support systems refers to the pre-computing of partial query results which
may be used to derive the answer for frequently asked queries. Since it is impractical to materialize all
possible views, view selection is an important research problem. For example, [16] introduced a greedy
algorithm for choosing a near-optimal subset of views from a view materialization lattice based on user-specified criteria such as available space, number of views, etc. In their approach, the next view to be
materialized is chosen such that its benefit is maximal among all the non-materialized views. The
computation of the materialized views, some of which depend on previously materialized views in the
lattice, can be expensive when the views are stored on disk. More recently [14, 19, 20], for example,
developed various algorithms for view selection in data warehouse environments.
Another optimization to processing OLAP queries using view materialization is to pipeline and overlap the
computation of group-by operations to amortize the disk reads, as proposed by [1]. To illustrate, assume a
table has four dimensions A, B, C, and D. The computation of the following group-by operations can be
pipelined: ABCD → ABC → AB → A (i.e., from view ABCD we can compute view ABC, from which
view AB can be computed, and so on). When a data partition is computed, the memory can be reused for the
next partition. Interleaved with the pipelined aggregations are sorting or hashing operations on some nodes
on the minimal cost processing tree. When the data set is large, external sorting is required which often
needs multiple passes and is very expensive.
Other related research in this area has focused on indexing pre-computed aggregates [31] and incrementally
maintaining them [22]. Also relevant is the work on maintenance of materialized views (see [21] for a
summary of excellent papers) and processing of aggregation queries [13, 32].
However, in order to be able to support true ad-hoc OLAP queries, indexing and pre-computation of results
alone will not produce good results. For example, building an index for each attribute of the warehouse or
pre-computing every subcube requires a lot of space and results in high maintenance costs. On the other hand, if
we index only some of the dimensions or pre-compute only a few views, queries for which no indexes or views
are available will be slow. In the remainder of this paper we introduce a new approach which may bring us
closer to the goal of being able to answer ad-hoc OLAP queries efficiently.
3. Answering Cube Queries Using Statistics Trees
Before describing our new algorithms, which are designed to improve CubiST for hierarchies, we first
briefly introduce in this section the statistics trees and the CubiST algorithm as necessary preparation for
later sections. Many details can be found in [ref].
3.1. Data Structure and Maintenance of ST
A Statistics Tree (ST) is a multi-way balanced tree structure which holds statistics or aggregate information
(e.g., SUM, AVG, MIN, MAX) for one or more attributes over a set of records (e.g., tuples in a relational
table). Assume R to be a relational table with attributes A1,A2,…,Ak with cardinalities d1,d2,…,dk
respectively. The height of the statistics tree for R is k+1, where each level in the tree (except the leaf level)
corresponds to an attribute in R. The root is at level 1. The fan-out (degree) of an internal node at level j is
d_j + 1, where j = 1,2,...,k. The first d_j pointers point to the subtrees which store information for the j-th column
value of the input data. The (d_j + 1)-th pointer is called the star pointer; it leads to a region in the tree where
this domain has been “collapsed,” meaning it contains all of the domain values for this dimension. This
“collapsed” domain is related to the definition of super-aggregate (a.k.a “ALL”) presented in Gray et al.
[12]. The leaf nodes at level k+1 contain the aggregation information and form a linked list.
Without loss of generality, we use the COUNT aggregate to demonstrate how to populate and maintain a
Statistics Tree. STs for other aggregate operations such as SUM, MIN, MAX, etc. can be processed
similarly. Given the number of dimensions and their cardinalities in the input data, initially, an empty
statistics tree is constructed by creating the nodes and pointers that form the entire tree structure described
above, with the count values in the leaf nodes being zero. Then, we scan the relational data set record by
record, using the attribute values to update the aggregates in the statistics tree with the recursive procedure
update_count(). For each record in the input set, the update procedure descends into the tree as follows:
starting at the root, for each component x_i of the input record x = (x_1, x_2, …, x_k), where i indicates the current
level in the tree, follow the x_i-th pointer as well as the star pointer to the two nodes at the next-lower level.
When reaching the nodes at level k (after repeated calls to update_count()), increment the count values of
the two leaves following the x_k-th pointer and the star pointer. Repeat for each record until all input records
have been processed in this fashion. A pseudo-code description of the algorithm is shown in Figure 1.
update_count(Node n, record x, int level) {
    IF level == k THEN
        increase count field of the x_k-th child of node n;
        increase count field of the child following the star pointer;
        return;
    level := level + 1;
    update_count(x_{level-1}-th child of n, x, level);
    update_count(child of n following the star pointer, x, level);
}  // end update_count

WHILE ( more records ) DO
    read next record x;
    update_count(root, x, 1);

Figure 1: Recursive algorithm for updating aggregates in a statistics tree.
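As a concrete illustration of the update procedure (a minimal sketch of ours, not the authors' implementation), the following Python code keeps only the leaf level of a COUNT statistics tree in a dictionary keyed by leaf paths; the index card[i] stands in for the star pointer of dimension i, and dimension values are assumed to be 0-based.

    from itertools import product

    class StatisticsTree:
        # Minimal COUNT-only sketch: leaves are kept in a dictionary keyed by
        # the path of child indices; the index card[i] plays the role of the
        # star ("ALL") pointer of dimension i.  Values are assumed 0-based.
        def __init__(self, cardinalities):
            self.card = list(cardinalities)      # d_1, ..., d_k
            self.k = len(self.card)
            self.leaves = {}

        def update_count(self, record):
            # Each record updates 2^k leaves: for every dimension we follow
            # both the value pointer and the star pointer (cf. Figure 1).
            choices = [(x, d) for x, d in zip(record, self.card)]
            for path in product(*choices):
                self.leaves[path] = self.leaves.get(path, 0) + 1

    # the setting of Figure 2: cardinalities 2, 3, 4; record (1,2,4) becomes
    # (0,1,3) with 0-based values
    st = StatisticsTree([2, 3, 4])
    st.update_count((0, 1, 3))
    print(st.leaves[(2, 3, 4)])    # the all-star leaf counts every record -> 1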
Figure 2 depicts the contents of the statistics tree after inserting the first input record (1,2,4) into the empty
tree corresponding to a relational table with three dimensions A1, A2, A3 with cardinalities d1=2, d2=3, and
d3=4, respectively. The update paths relating to this record are indicated as dashed lines. To improve
readability, we omit from the picture the pointers and nodes that do not play a role in this update.
Figure 2: Statistics tree after processing input record (1,2,4).
Statistics trees differ from multidimensional arrays in that a multi-dimensional array does not have a star
pointer, although the extension “ALL” to the domain of a dimension attribute has been used in the query
model and summary table described in [12]. Hence one can regard the statistics tree as a generalization of
the multidimensional array data structure. A B-Tree is an index structure for one attribute (dimension) only, as
opposed to a statistics tree, which can contain aggregates for multiple dimensions. In addition, in the B-Tree
the degree of the internal nodes is restricted. The statistics tree, on the other hand, is naturally
balanced (it is always a full tree) and its height depends on the number of dimensions but is independent of
the number of records and of the data input order.
3.2. Formal Representation of Cube Queries and CubiST Algorithm
The conceptual model in this paper is the multidimensional database model [2]. A cell of a k-dimensional
data cube with attributes A_1, A_2, ..., A_k with cardinalities d_1, d_2, ..., d_k, respectively, is the smallest
full-dimensional cube (i.e., a cube containing no “ALL” values) seated at a point P = (x_1, x_2, ..., x_k),
x_i ∈ [1..d_i]. The total number of cells in a k-dimensional data cube is ∏_{i=1}^{k} d_i. All these cells form a
set C. An ad-hoc cube query can be viewed as an aggregation of the measure attributes over a subset of C.
Formally, a cube query q is a tuple that defines the regions to aggregate on: q = (s_{i_1}, s_{i_2}, ..., s_{i_r}),
where {i_1, i_2, ..., i_r} ⊆ {1, 2, ..., k} and r is the number of dimensions specified in the query. Each
selection s_{i_j} can be one of the following (in decreasing order of generality); let w = i_j:
1. A partial selection {t_1, t_2, ..., t_m}, 2 ≤ m ≤ d_w, t_i ∈ {1, 2, ..., d_w}, specifying any subset of the
domain values of dimension i_j.
2. A range selection [a, b], specifying a contiguous range in the domain of dimension i_j, a, b ∈ [1..d_w],
[a, b] ⊆ [1, d_w].
3. A singleton value a, a ∈ [1..d_w].
If r = k, we say the query is in normal form. A query can be transformed into its normal form by adding
collapsed dimensions in the query using “ALL” values.
Continuing our car sales example from the beginning, assume that the domain of manufacturer is {Ford,
GM, Honda, Toyota}, of color is {blue, red, white, black}, of style is {sports car, sedan, SUV}, of time is
{Jan, Feb, …, Dec} and of location is {Florida, Alabama, Georgia, South Carolina, Tennessee}. Then a
formal representation of the sample cube query “How many Toyota cars have been sold from January to
March this year in Florida and Georgia?” is q = (s_1, s_4, s_5) = (4, [1,3], {1,3}). Its normal form is (4, ALL,
ALL, [1,3], {1,3}).
INT cubist(Node n, query q, int level) {
    IF level == k THEN
        count := summation of the counts of the z_{k,1}-th, z_{k,2}-th, ... children of n;
        return count;
    level := level + 1;
    count := cubist(z_{level-1,1}-th child of n, q, level) +
             cubist(z_{level-1,2}-th child of n, q, level) + ...;
    return count;
}  // end cubist

Total count in the query regions := cubist(root, q, 1);

Figure 3: Recursive algorithm for evaluating cube queries.
The statistics tree is capable of answering complex cube queries. Figure 3 shows a sketch of our recursive
CubiST query-answering algorithm. It assumes that the input query is in its normal form:
q = {z_{i,j} ∈ [1..d_i] | i = 1, 2, ..., k; j = 1, 2, ..., d_i + 1}. Here z_{i,j} refers to the j-th selected value of the
i-th dimension. cubist is a recursive function that returns the aggregate for a cube query q.
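Assuming the same flat leaf representation as in the sketch after Figure 1, the recursion of Figure 3 reduces to summing the leaves addressed by the cross product of the per-dimension selections, with ALL replaced by the star index. The following Python sketch is our illustration; the names and toy data are hypothetical.

    from itertools import product

    def cubist(leaves, card, query):
        # query is in normal form: one entry per dimension, either the string
        # "ALL" or an iterable of selected 0-based values (a singleton, a
        # range expanded to a list, or an arbitrary subset).
        selections = []
        for d, sel in zip(card, query):
            selections.append([d] if sel == "ALL" else list(sel))  # d = star index
        return sum(leaves.get(path, 0) for path in product(*selections))

    # tiny example over cardinalities (2, 3, 4) with two records
    card, leaves = [2, 3, 4], {}
    for rec in [(0, 1, 3), (1, 1, 0)]:
        for path in product(*[(x, d) for x, d in zip(rec, card)]):
            leaves[path] = leaves.get(path, 0) + 1
    print(cubist(leaves, card, [[0, 1], "ALL", [0, 3]]))   # -> 2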
k
The amount of memory that is need to store the Statistics Tree is bounded by: c
 (d
i 1
i
 1)  M , where M
is the size of memory available to store the tree, the constant value c accounts for the space that is needed
to store the internal nodes of the tree. Please note that c is typically around 2 or 3 because the number of
1
A cube containing no “ALL” values.
9
internal nodes is no larger than twice the number of leaves. The update time of ST for each record is
O( 2 k ). The runtime of answering a singleton query is O(k), where k is the number of dimensions. The
worst case runtime of answering an arbitrary query is O(size of ST).
CubiST is particularly useful when the number of records is large and the aggregated data volume does
not exceed the memory size. For example, this is the case in data marts where the dimensionality of the
data is low, yet the number of tuples is large enough to warrant the use of special query processing
algorithms such as CubiST. CubiST is also well suited for large data warehouses to provide a high-level
“view” over the data in order to decide where to start drilling down; in this scenario there are many
dimensions with large domain sizes. Finally, with the continued increase in available memory sizes, more
and more cube queries can be answered using STs. We introduce a new data warehouse system
architecture in the next section that achieves good performance for cube queries when combined with
CubiST.
3.3. System Architecture Integrating with CubiST
The above algorithms require that the particular ST be in memory. In many cases, having one ST to
answer ad-hoc cube queries over all dimensions is not sufficient, especially when the queries involve many
dimensions with large domain sizes. For this case, we propose a data warehousing architecture that
integrates with CubiST. To support efficient cube query processing over large data sets, we use multiple
STs. Conceptually, we can join all the dimensions with the fact table along the foreign keys into a single
wide table. Table 1 displays a single-table schema for the car sales example.
Maker:     m-name, m-loc, m-date
Product:   color, style, #cyl
Customer:  c-name, age, income
Sale:      s-name, s-loc, s-date
Measures:  sales, invoice

Table 1: Sample 1-table schema for car sales.
The warehouse administrator can specify the groups of dimensions to setup STs on according to domain
knowledge or user query patterns. In general, attributes from the same source (i.e. the same dimension table
in star schema) are grouped together, since queries are more likely to request aggregates over attributes
from the same group. STs are stored in the ST Repository which is shown in the upper-left hand corner of
Figure 4. Using the ST Manager, shown directly above the data warehouse, STs are selected from the
repository to match the incoming cube queries: the dimensions and aggregate operators specified in the
query must match those of the stored STs. When the ST Manager receives a query submission, it checks whether there
is an ST that matches the query. If there is, the query is evaluated using this ST. Otherwise, a new ST is set
up and placed into the ST Repository. The parameters of the ST (i.e., number of dimensions, their
cardinalities, and the aggregate operators used to compute the contents of the leaves) are stored in the metadata
repository. At the same time, the usage statistics of the STs are recorded. Due to space constraints and
maintenance overhead, large and less frequently queried STs may be deleted. To improve the hit ratio
of queries against existing STs, small, frequently matched STs are periodically merged to form one larger
ST, memory size permitting. The rationale behind merging is that the attributes from frequently queried
STs are more likely to be queried together later.
Figure 4: Warehouse architecture with CubiST.
In the same figure, the familiar back-end, consisting of the source data extractors, the warehouse integrator, and
the warehouse, is responsible for accessing, cleaning, integrating and storing the potentially heterogeneous data
from the underlying data sources. Data in the warehouse is typically stored in the form of a star or
snowflake schema with materialized views and indexes as optional support structures.
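The matching step performed by the ST Manager can be pictured with the following Python sketch; it is a simplified illustration of ours (the keys, names and the subset-based notion of a match are assumptions, not the authors' implementation).

    # Hypothetical sketch of the ST Manager's lookup: an ST is considered
    # reusable for a query if it covers the queried dimensions and was built
    # with the same aggregate operator.  Keys and names are illustrative only.
    class STManager:
        def __init__(self):
            self.repository = {}     # (frozenset(dims), op) -> leaf data
            self.usage = {}          # usage statistics for eviction decisions

        def register(self, dims, op, leaves):
            self.repository[(frozenset(dims), op)] = leaves

        def find(self, query_dims, op):
            for (dims, stored_op), leaves in self.repository.items():
                if stored_op == op and set(query_dims) <= dims:
                    key = (dims, stored_op)
                    self.usage[key] = self.usage.get(key, 0) + 1
                    return leaves
            return None              # caller would build and register a new ST

    mgr = STManager()
    mgr.register(["time", "location", "product"], "COUNT", {})
    print(mgr.find(["time", "product"], "COUNT") is not None)   # True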
4. Hierarchies
CubiST sets up a new framework for answering cube queries efficiently. However, it does not address an
important issue appearing frequently in DSS: hierarchies. Many dimensions, e.g., time, location, and product,
have concept hierarchies with multiple abstraction levels, where different levels reflect different degrees of
granularity of the concepts. For example, a “day→month→year” hierarchy reflects different granularities of the time
dimension. To roll up and drill down, users often submit queries that have constraints on different levels of
the hierarchies in addition to constraints on different combinations of the dimensions. This new requirement adds
another level of complexity to OLAP systems, but it also provides great optimization opportunities to improve
the performance of these queries. From this section on, we focus on these optimizations. We start with
our discussion of hierarchies.
4.1. Hierarchical Partitioning and Mapping of the Domains
Usually, the dimensions with hierarchies are viewed as common tables with the different levels as their
attributes. These attributes are processed in the same way as other non-hierarchical simple attributes. The
dimensions are joined with the fact table along the foreign keys, and then selections, projections and
aggregations (group by) are performed to compute the final results. In our new framework, we use quite
different techniques from the traditional ones. Naturally, we regard the interdependent levels in a hierarchy as
one hierarchical attribute instead of multiple attributes. The values at different levels are mapped to
integers, and a value at a higher level includes multiple values at a lower level. That is, the lower level values
are partitioned into groups; the indexes or labels of the groups constitute the higher level values. By
scanning the dimension table, we establish the mappings, hierarchical partitions, and the value inclusion
relationships.
Definition: For a set V, if there exist V_1, V_2, …, V_k with V_i ⊆ V, i = 1, 2, …, k, such that
(1) V_i ∩ V_j = ∅ for all i, j = 1, 2, …, k, i ≠ j, and
(2) V_1 ∪ V_2 ∪ … ∪ V_k = V,
then V_1, V_2, …, V_k is a partition of V.
Higher level values are just the labels of the subsets of lower level domain values that form a partition. The
higher level values, as a set, can be further partitioned. In this way, the hierarchical partitioning structure of
the domain values captures the concept of an attribute/dimension hierarchy.
For example, in time dimension, a popular hierarchy is “year—quarter—month”. For each year of 1997
and 1998, there are four quarters Q1, Q2, Q3, Q4. For each of these quarters, there are three months among
January, February, …, December. We can represent this as a table with three columns “year”, “quarter”
and “month” whose cardinalities seem to be 2, 4 and 12 respectively. However, for the hierarchy to be
valid, the functional dependency constraints between the columns must be satisfied. If column A is
functionally dependent on column B, the constraint stipulates that if two rows have the same value for
column B then they must have the same value for column A. Obviously, in this example, the year and quarter
columns do not satisfy the constraint, since Q1 can belong to year 1997 or year 1998. To eliminate the
ambiguity, we can qualify Q1 as 1997/Q1 or 1998/Q1, respectively. In the same way, January needs to be
qualified as 1997/January or 1998/January. After qualifying all the quarters and months, the “year—
quarter—month” hierarchy is valid.
For categorical attributes (domains with strings) or attributes with real-valued domains, a mapping that maps
each value to an integer facilitates cubing with statistics trees or multidimensional arrays. After such a
mapping, the hierarchical partitioning structure of the time dimension domain values becomes clearly visible.
Some dimensions, e.g., time, have static domain values and hierarchical inclusion relationships, while others
do not. In the latter case, we establish the partitioning and mapping during the load of the dimension tables. Let
us take the Location dimension as an example.
Lockey  City         State  Region
1       Gainesville  FL     SE
2       Atlanta      GA     SE
3       Los Angeles  CA     WC
4       Chicago      IL     MW
5       Miami        FL     SE
6       San Jose     CA     WC
7       Seattle      WA     WC
8       Twin City    MN     MW

Table 1: Location dimension

Region  State   City
MW (1)  IL (1)  Chicago (1)
        MN (2)  Twin City (2)
SE (2)  FL (3)  Gainesville (3)
                Miami (4)
        GA (4)  Atlanta (5)
WC (3)  CA (5)  Los Angeles (6)
                San Jose (7)
        WA (6)  Seattle (8)

Table 2: Partitioning and Mapping of Location
Suppose Table 1 is part of the original dimension table of Location, which represents a composite attribute
(City, State, Region). Table 2 is the result of sorting along the columns Region, State and City and mapping the
column values to integers (shown in parentheses). This process has the effect of executing the following SQL
statement:
SELECT DISTINCT Region, State, City
FROM Location
GROUP BY Region, State, City
ORDER BY Region, State, City;
LocKey  City  State  Region
1       3     3      2
2       5     4      2
3       6     5      3
4       1     1      1
5       4     3      2
6       7     5      3
7       8     6      3
8       2     2      1

Table 3: Location after transformation
Table 3 results from the mapping of the column values of Table 1. The mapping itself can be represented and
stored as an encoding index structure, e.g., a trie or dictionary (string → integer), and a decoding data structure
such as an array (integer → string). The encoding mapping is very important: through it, strings or other
types in the user input queries are transformed into the internal integer representation to facilitate the
application of CubiST. Table 2 can also be implemented directly as a multi-way tree.
Notice that we only need to store Table 2 and Table 3, because they are a reversible transformation of Table
1. Furthermore, we only store the finest level column; the other level columns are implied by the mapping. This
is a potentially very large compression of the dimensional data. In the example, columns 3 and 4 of Table 3
need not be stored. In addition, instead of storing strings like “Gainesville” (consuming 11 bytes), integers like 3
are stored (consuming at most two bytes if there are no more than 65,536 cities in the data warehouse).
Though we still store the mapping, only the transformed tables (e.g., Table 3) are read during the
initialization of STs. The data compression of the transformed tables significantly reduces the runtime of the
join operations in the initialization phase.
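A minimal sketch of the encoding/decoding structures for the Location example is given below (our illustration; class and variable names are hypothetical). Per level, a dictionary encodes strings to 1-based integers and a list decodes them back, while parent maps record the hierarchical inclusion city → state → region established during the load.

    class LevelCodec:
        # string -> integer encoding and integer -> string decoding per level
        def __init__(self):
            self.encode, self.decode = {}, []
        def code_of(self, value):
            if value not in self.encode:
                self.decode.append(value)
                self.encode[value] = len(self.decode)   # 1-based, as in Table 2
            return self.encode[value]

    city, state, region = LevelCodec(), LevelCodec(), LevelCodec()
    city_parent, state_parent = {}, {}                  # inclusion relationships

    # a few (Region, State, City) rows in the sorted order of Table 2
    for r, s, c in [("MW", "IL", "Chicago"), ("MW", "MN", "Twin City"),
                    ("SE", "FL", "Gainesville"), ("SE", "FL", "Miami"),
                    ("SE", "GA", "Atlanta")]:
        city_parent[city.code_of(c)] = state.code_of(s)
        state_parent[state.code_of(s)] = region.code_of(r)

    print(city.code_of("Gainesville"), state.decode[city_parent[3] - 1])  # 3 FL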
Some attributes have complex hierarchies, where multiple paths from high-level concepts to low-level
concepts exist. We can decompose the complex structure into simple hierarchies. For example, the
following time dimension hierarchy lattice can be decomposed into two simple hierarchies:
“year—quarter—month—day” with four levels and “year—week—day” with three levels. The two paths
correspond to two different ways of subsetting/partitioning the days.
year → quarter → month → day
year → week → day
4.2. Free Aggregate Navigation
In the new setting, let us explore what “ad-hoc” or “free navigation” really means. We also shed new light
on the buzzwords “rollup” and “drill down”.
Figure: Abstraction levels of the attributes A1, A2, ..., Ak, with L1, L2, ..., Lk hierarchy levels respectively.
To navigate freely over the aggregations, queries can be constrained on any combination of the dimensions.
For any constrained dimension with a concept hierarchy, the constraints can be at any abstraction level. For a
particular level, the selected values can form any subset of the domain values at that level. In addition, the
aggregate functions can differ.
Theorem: If a group contains k attributes A_1, A_2, …, A_k with L_1, L_2, …, L_k levels in their hierarchies,
respectively, then the number of level combinations is ∏_{i=1}^{k}(L_i + 1).
For the i-th attribute (i = 1, 2, …, k), there are L_i + 1 choices of levels; the additional choice is not to
constrain the attribute at all, so that the combinations of the attributes are also counted. By the
multiplication principle, the total number of combinations is the product of the numbers of choices for the
attributes.
Theorem: If a group contains k attributes A_1, A_2, …, A_k with cardinalities d_1, d_2, …, d_k, respectively,
at their finest levels, then the number of potential queries at the lowest level is 2^{d_1} · 2^{d_2} ⋯ 2^{d_k}.
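As a small worked illustration of the two counts (our own example numbers, not from the paper), consider a group of three attributes with 2, 3 and 3 hierarchy levels and finest-level cardinalities 4, 4 and 4:

    from math import prod

    levels = [2, 3, 3]
    combinations = prod(l + 1 for l in levels)          # (2+1)*(3+1)*(3+1) = 48
    cards = [4, 4, 4]
    lowest_level_queries = prod(2 ** d for d in cards)  # 2^4 * 2^4 * 2^4 = 4096
    print(combinations, lowest_level_queries)           # 48 4096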
Definition: A rolling-up operation is performed when one or both of the following cases occur:
(1) One or more dimensions go up one or more levels along their abstraction hierarchy ladders, and no
dimension goes down at the same time.
(2) One or more constrained dimensions are left out.
Similarly, we can define a drilling-down operation. Rolling-up and drilling-down are two common
operators in OLAP. In a strict sense, they are not new operators but a cube operation relative to another cube
operation. However, the requirement of free navigation over the hierarchies adds additional complexity to
implementing OLAP systems. Our objective is to design new algorithms that improve the efficiency of
free aggregate navigation.
4.3. Motivation and Overview of the New Optimizations
In the statistics tree framework, CubiST can freely navigate the hierarchies internally, provided that the
core statistics trees are built at levels no higher than the queried levels. To implement free navigation using
CubiST directly, we first transform conditions expressed at different abstraction levels into conditions
expressed at the lowest levels. We can map the higher level values to lower level values according to the
mapping functions (e.g., Table 2 or the corresponding multi-way tree). The hierarchy-navigating queries are
then simply regarded as range queries or partial queries that can be answered directly by CubiST.
Even though these queries can be answered in memory, in the worst case answering them may require
walking through most of the statistics tree. In the car sales example, suppose that the time dimension has
a year-month-day hierarchy and the location hierarchy is region-state-city, and consider the cube query "How
many cars were sold in the southeast in 1998?". If the southeast region has 150 cities in the database and the
records are at the finest granularity, answering the query is equivalent to answering 150*365 = 54,750
singleton queries.
In real-world applications, multiple STs may be used, thus requiring paging STs in and out of memory. If
the STs are large, the I/O overhead can also be large, so CubiST should be further optimized to cut down
this overhead. Notice that in order to reduce the cost of paging STs in and out of memory, we only
transmit the leaves of the STs to and from the ST Repository; the internal nodes can be regenerated in
memory without extra I/O operations. In this way, the I/O complexity of STs is the same as that of
multidimensional arrays.
The idea is to materialize some statistics trees at higher levels. For instance, if we materialize the
statistics tree in our car sales example at the region and year levels, then the sample cube query becomes a
singleton query and the I/O cost is negligible. To implement free navigation, we select, compute and
materialize a set of statistics trees and then answer an arbitrary query by applying CubiST to an
appropriately chosen tree from these STs. First, let us give some definitions.
Definition: The base abstraction levels are the lowest levels in the concept hierarchies such that the
statistics trees fit in memory. If each dimension uses its lowest level, the statistics trees may not fit in
memory; in this case, we use higher levels for the large dimensions or take some dimensions out of the
group. The resulting statistics trees are called base trees.
Definition: The derived trees are the STs in the same group but with higher levels for one or more
dimensions. They are computed from the base trees.
Definition: A base tree and its derived trees form a family of statistics trees.
5. Generating a Family of Statistics Trees
The families of STs are used for answering queries at different abstraction levels of different
dimensions. To generate a family, we first need an algorithm for choosing candidates from all possible
derived trees as members of the family. Second, methods for computing the derived trees from the base tree
must be found.
5.1. A Greedy Algorithm to Choose Family Members
Potentially, any combination of the levels in the attribute group can be selected as a member of the family.
However, we need to consider the space limitation and maintenance overhead; again, it is a space-time
trade-off. Here, we introduce a greedy algorithm to choose the members. Starting from the base tree, in each
step the largest dimension climbs up one level in its hierarchy while the other dimensions remain the same,
and the newly formed ST is included in the family. The iteration continues until each dimension has climbed
to its highest level or the size of the family reaches the space limitation (the stop conditions). The
pseudo-code is shown in the figure below. The algorithm of line 1 was discussed in Section 3.1; the rollup
algorithm for line 7 will be given in Subsection 5.2. Lines 2 and 8 save the leaves, while the parameters
saved in lines 3 and 9 are used to register the STs stored in the ST Repository.
The size of the family is no more than two times the size of the base tree; usually, the total size of the derived
trees is just a very small fraction of the base tree. Of course, more sophisticated approaches such as cost-based
algorithms are also possible. The algorithm of rolling up the largest dimension one after another is greedy
in terms of the size reduction of the new derived tree. Without loss of generality, numerical attributes in
the examples of this subsection are grouped into equal-size segments, except possibly the last one, which may
be smaller.
1   Setup base tree T0 by scanning the input data;
2   Save T0 to the ST Repository;
3   Save its parameters to the Metadata Repository;
4   T := T0;
5   Repeat {
6     Find the dimension with the largest cardinality in T;
7     Rollup that dimension to generate a new derived tree T;
8     Save T to the ST Repository;
9     Save the parameters of T to the Metadata Repository;
10  } Until (stop conditions);

Figure: Greedy algorithm for generating a family of statistics trees.
Example 5.1
Suppose the base tree T0 has three dimensions whose sizes are 100, 100, and 100. Each dimension can be
grouped ten by ten to form a two-level hierarchy. The cardinalities of the base tree T0 are therefore 101, 101,
101 (the 100 domain values plus the star).
T1 is the result of rolling up the third dimension of T0 and therefore has cardinalities 101, 101, 11.
T2 is the result of rolling up the second dimension of T1; its cardinalities are 101, 11, 11.
T3 is the result of rolling up the first dimension of T2; its cardinalities are 11, 11, 11.
T1, T2, and T3 are derived trees. Together with the base tree T0, they form a family.
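The greedy loop above can be sketched in Python on the cardinality vectors alone (our illustration; the real algorithm also materializes the leaf arrays, and the tie-breaking rule used here is an assumption):

    def generate_family(base_cards, factors, max_trees=10):
        # Repeatedly roll up the dimension with the largest cardinality;
        # factors[i] is the grouping factor of dimension i's two-level
        # hierarchy.  Only the cardinality vectors are tracked here.
        family = [list(base_cards)]
        levels = [0] * len(base_cards)
        while len(family) < max_trees:
            candidates = [i for i, l in enumerate(levels) if l == 0]
            if not candidates:
                break                           # every dimension at its top level
            i = max(candidates, key=lambda j: (family[-1][j], j))
            rolled = list(family[-1])
            # d low-level values plus the star become ceil(d/factor) groups plus the star
            rolled[i] = -(-(rolled[i] - 1) // factors[i]) + 1
            levels[i] = 1
            family.append(rolled)
        return family

    # Example 5.1: three dimensions of 100 values (cardinality 101 with star),
    # grouped ten by ten
    print(generate_family([101, 101, 101], [10, 10, 10]))
    # [[101, 101, 101], [101, 101, 11], [101, 11, 11], [11, 11, 11]]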
5.2. Computing a New Derived Tree from Existing Trees
The base tree is set up during the initialization of the ST described in Section 3.1. Except for the first
derived tree, which is computed from the base tree, a new derived tree is computed from an existing one by
merging some sub-trees of the old tree. Before introducing this roll-up algorithm, we define a new operator
“⊕” on STs as follows.
Definition: Two STs are isomorphic if they are identical except for the values of their leaves, i.e., they have
the same structure. All pairs of corresponding nodes in two isomorphic STs have the same degree.
Definition: The summation S of two isomorphic STs S1 and S2 is a new ST that has the same structure as S1
or S2 (i.e., is isomorphic to them) but whose leaf values are the sums of the corresponding leaf values (i.e.,
those at the same position) of S1 and S2. Their relationship is denoted by S = S1 ⊕ S2.
The implementation of the new operator is straightforward: keep one of the STs (say S1) and update its
leaves by repeatedly adding the count values of the leaves directed by two sliding pointers. Notice
that the new operator can be adapted to other aggregation functions such as MIN/MAX; in that case, instead
of summing the leaves, compute the minimum or maximum of the corresponding leaves as the leaves of
the resulting ST.
The summation of several isomorphic STs is an ST which has exactly the same structure as any of the STs
except for the leaves; the count fields are the sums of the count fields of the corresponding leaves. Next, we
give an algorithm for the roll-up operation of case (1) in the definition of Section 4.2. Case (2) is covered by
the CubiST algorithm directly.
To derive a new tree by rolling up a dimension, proceed from the root to the level corresponding to the rollup
dimension. According to the mapping relationship described in Section 4.1, reorganize and merge the
sub-trees of all the nodes at that level to form new children. Adjust the degree and update these nodes.
Lastly, link the newly formed children (summations of subtrees) to the new nodes. The best way to illustrate
the roll-up process is through a simple example.
Example 5.2
Suppose that during the derivation we reach a node at the level of the rollup dimension. The dimension has
nine values and a two-level hierarchy which groups the lower level values three by three. The sample node is
shown on the left of the figure below with its children S1, ..., S9, S*. We group and sum its sub-trees (which
are themselves STs) three by three to form three new children S1#, S2# and S3# of the new node with degree
four, as shown on the right side. We use the superscript # to indicate higher level values. The star sub-tree
remains the same. Notice that
S1# = S1 ⊕ S2 ⊕ S3,  S2# = S4 ⊕ S5 ⊕ S6,  S3# = S7 ⊕ S8 ⊕ S9.
Figure ?: Rolling up one dimension.
The following is the roll-up algorithm.

1   change(Node nd, int level, int rollDim, int factor) {
2     IF level == rollDim THEN
3       group the children of nd such that each group has factor
4         subtrees, except the last group and the star node;
5       compute the summation of the subtrees in each group;
6       make a new node nd' whose children are the summations
7         and the star subtree; adjust the domain of nd';
8       link nd' to nd's original parent IF level is not equal to 1;
9       return;
10    level++;
11    FOR each child Ci of nd: change(Ci, level, rollDim, factor);
12  }
13  rollUP(int rollDim, int factor) {
14    change(root, 1, rollDim, factor);
15  }

Figure ?: Algorithm for rolling up one dimension to form a new derived tree.
Line 1 is the signature of the recursive function change, which proceeds from a node nd at level level down to
the rollup dimension rollDim. Suppose that the hierarchy is formed by grouping the lower level values factor at
a time. Lines 2 to 9 correspond to the modification process illustrated in the example above. Line 14 invokes
change.
Before computing the new derived tree, we first store the old ST or base tree on disk (swap out). The
metadata about the ST, such as the dimensions in the group, hierarchy levels, etc., is also stored in the Metadata
Repository. The new ST is usually much smaller than the old one, and the remaining memory space can be
reclaimed for other use.
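In the flat leaf representation used in our earlier sketches, the roll-up amounts to re-addressing each leaf by mapping the rolled dimension's value to its group and adding the counts of merged leaves, which realizes the ⊕ operator. The following Python sketch is our illustration under the even-grouping assumption:

    def roll_up(leaves, card, dim, factor):
        # Leaf paths are re-addressed by mapping value v of dimension `dim`
        # to group v // factor; the star index card[dim] maps to the new star
        # index.  Counts of merged leaves are added (the summation operator).
        groups = -(-card[dim] // factor)             # ceil(d / factor)
        new_card = list(card)
        new_card[dim] = groups
        new_leaves = {}
        for path, count in leaves.items():
            v = path[dim]
            new_v = groups if v == card[dim] else v // factor
            new_path = path[:dim] + (new_v,) + path[dim + 1:]
            new_leaves[new_path] = new_leaves.get(new_path, 0) + count
        return new_leaves, new_card

    # usage: roll up dimension 0 of a 4-value domain (star index 4) three by three
    leaves = {(0, 0): 1, (2, 0): 2, (3, 0): 1, (4, 0): 4}
    print(roll_up(leaves, [4, 2], 0, 3)[0])   # {(0, 0): 3, (1, 0): 1, (2, 0): 4}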
6. Answering Cube Queries Using the Families
Once the family is established and materialized, it is ready for answering queries: we select an ST from the
family and answer the submitted query using CubiST, rewriting the query if necessary.
6.1. Selecting a Tree in the Family for Answering the Query
Potentially, multiple trees can answer a given query, but the smaller ones are preferred. The proper tree to
select from the family is the tree with the highest levels that still match the query. This can be done by the
following matching scheme. We first introduce some new concepts and data structures.
Definition: A view matrix (VM) is a matrix whose rows are dimensions, whose columns are the hierarchy
levels, and whose entries are the statistics tree labels in the family. It describes which views are materialized
in the hierarchies.
Example 6.1
Continuing Example 5.1, suppose the level with cardinality 101 has level value 0 and the grouped level with
cardinality 11 has level value 1. For example, T1 has cardinality structure 101*101*11: its first two
dimensions are at level 0 and the last dimension is at level 1; therefore, T1 appears in the first column of the
first two rows and in the second column of row 2. Here we assume that dimension and level labels start from
0 (to match the C/C++ array convention). The following is the VM.
dim\lev   0            1
0         T0, T1, T2   T3
1         T0, T1       T2, T3
2         T0           T1, T2, T3
VM is easily implemented as a stack of level vectors which are composed of level values for the
dimensions. The stack corresponding to the above VM is
(1, 1, 1)   T3
(0, 1, 1)   T2
(0, 0, 1)   T1
(0, 0, 0)   T0

Figure: Stack representation of the VM.
We can peel a VM one layer off from the right; that is, each dimension pops off its rightmost ST. The peel
operator corresponds to a pop operation on the stack, and VM creation corresponds to pushing the level
vectors onto an empty stack.
Example 6.2
After peeling off T3, the VM of Example 6.1 becomes

dim\lev   0            1
0         T0, T1, T2
1         T0, T1       T2
2         T0           T1, T2
Definition: A query hierarchy vector (QHV) is a vector (l_1, l_2, ..., l_k), where l_i is the level value, in its
hierarchy, of the i-th dimension in the query.
Definition: A QHV (l_1, l_2, ..., l_k) matches view T in the VM iff l_i >= the column index of the i-th row entry
of T in the VM (the column indices of T compose T's column vector), for all i = 1, 2, ..., k. In other words,
when T's column vector <= QHV, we say "the query or its QHV matches view T". If a query's QHV matches
view T, the vector corresponding to T in the stack is less than or equal to the QHV.
Definition: An optimal matching view with respect to a VM is the view in the VM that has the highest index
and matches the query's QHV.
Example 6.3
The QHV of the query ([3#,5#], 1#, [4#,6#]) is (1,1,1). It matches T0, T1, T2, and T3, but T3 is the optimal
matching view. As before, a number with a # superscript indicates that it belongs to the higher level values;
[3#,5#] means [30,59] in Example 5.1.
If a query's QHV is (0,1,0), T0 is the only matching view, so it is also the optimal tree in the family.
Theorem: Any query with no finer granularities than the base tree has an optimal view that matches its
QHV.
This is quite straightforward, because T0 always matches and the indices of the derived trees have a total
order (increasing).
The VM is generated during the generation of the family.
Now the problem is to find the optimal matching view for the query. The pseudo-code is as follows.

1   Compute the QHV for the query;
2   Peel off the entries of the VM layer by layer, from right to left,
3     until a view matches the QHV;
4   Return that view;

Algorithm 2: Find the optimal matching view for the query.
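Using the stack-of-level-vectors representation of the VM, Algorithm 2 can be sketched in a few lines of Python (our illustration; the data and names follow Examples 6.1 and 6.4):

    def optimal_matching_view(stack, qhv):
        # `stack` holds (column_vector, tree_label) pairs with the base tree
        # at the bottom.  Peel (pop) until a view's column vector is
        # componentwise <= the QHV.
        work = list(stack)
        while work:
            vector, tree = work[-1]
            if all(v <= q for v, q in zip(vector, qhv)):
                return tree
            work.pop()
        return None   # cannot happen if the query is no finer than the base tree

    stack = [((0, 0, 0), "T0"), ((0, 0, 1), "T1"),
             ((0, 1, 1), "T2"), ((1, 1, 1), "T3")]
    print(optimal_matching_view(stack, (1, 0, 1)))   # Example 6.4 -> T1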
Example 6.4
Suppose the query is Cube(3#, 2, 5#). Its QHV is (1,0,1). Because the column vector of T3 is (1,1,1), which
does not match (1,0,1), T3 must be peeled off. In the same way, T2 must also be peeled off. The resulting VM is

dim\lev   0        1
0         T0, T1
1         T0, T1
2         T0       T1

This time the column vector of T1 is (0,0,1) <= (1,0,1), i.e., T1 matches the QHV, so T1 is the optimal
matching view.
6.2. Rewriting and Answering the Query
In this phase, the query is evaluated based on its optimal matching view. The answering algorithms are the
same as in CubiST. Before evaluation, the query must first be transformed if its QHV is not equal to the
optimal view's column vector. In Example 6.4, the query Cube(3#, 2, 5#) is translated into
Cube([30,39], 2, 5#) so that its QHV (0,0,1) exactly matches the optimal view's levels; the query is then
computed using T1. The query is rewritten according to the mapping functions: higher level values are
translated into lower level values via the inclusion relationship.
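The rewriting step can be sketched as follows (our illustration, assuming the even grouping factor of Example 5.1 and 0-based values; the function name is hypothetical):

    def rewrite_selection(sel, query_level, view_level, factor):
        # A high-level value v# covers the low-level range
        # [v*factor, (v+1)*factor - 1] under the even-grouping assumption.
        # Selections already at (or below) the view's level pass through.
        if query_level <= view_level:
            return sel
        span = factor ** (query_level - view_level)
        lo, hi = sel if isinstance(sel, tuple) else (sel, sel)
        return (lo * span, (hi + 1) * span - 1)

    # Example 6.4: Cube(3#, 2, 5#) evaluated on T1 with levels (0, 0, 1)
    print(rewrite_selection(3, 1, 0, 10))   # 3# -> range (30, 39)
    print(rewrite_selection(2, 0, 0, 10))   # already at the view's level -> 2
    print(rewrite_selection(5, 1, 1, 10))   # 5# stays 5# on the rolled-up dimension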
7. Performance Experiments
7.1. Datasets
We use synthetic data sets in our experiments. Our random data generator simulates a relational
warehouse with the following parameters: r is the number of records and k is the number of dimensions
with cardinalities d_1, d_2, ..., d_k, respectively. The elements of the array with r rows and k columns are
uniformly distributed over the domains of the corresponding columns. The array is written to
disk as a text file and serves as input to our new cubing algorithms.
The testbed consists of a SUN ULTRA 10 workstation running SunOS 5.6 with 90MB of available main
memory.
7.2. Performance improvement of Hierarchical vs Flat CubiST
In this subsection, we show the effectiveness of materializing and using families of statistics
trees. Since the number of records does not affect the size of the statistics tree or the cube query
answering algorithms, in this set of experiments we fixed the number of records (r = 1,000,000). Suppose
there are three dimensions, each of which has a two-tier hierarchy. To simplify the hierarchical inclusion
relationship of the mapping described in Section 4.1, we assume that each higher level value includes an equal
number of lower level values and call this number the factor. For example, each year has exactly four
quarters, so the factor is four. Of course, in the real world there are many more dimensions, and the
even-inclusion assumption may not hold (e.g., for the location dimension). However, these assumptions do not
affect the generality of our new algorithms.
Factors           1, 1, 1      3, 2, 2      3, 3, 3      4, 4, 3
Dim. Sizes        60, 60, 60   20, 30, 30   20, 20, 20   15, 15, 20
#Leaves           226,981      20,181       9,261        5,376
I/O Time (ms)     16,818       1,517        681          384
Total Time (ms)   25,387       1,655        731          446

Table: Statistics tree size and query time for different hierarchy factors.
In this set of experiments, we fixed the query but varied the factors. We compare the statistics tree sizes
(measured by the number of leaves in the STs), the I/O time spent loading the ST from the ST Repository,
and the total answering time. Notice that, in general, the total time to answer a query consists of the time
for choosing the proper ST, the in-memory empty tree formation time, the leaf loading time, and finally the
query answering time using the newly formed tree. As the table above shows, the I/O time and the total time
decrease dramatically as the factors increase and the number of leaves decreases. The more low-level values
a high-level value includes (i.e., the larger the factors), the faster the high-level derived trees answer the
queries.
7.3. Varying Number of Records
In this subsection, we compare the setup and response times of CubiST with those of two other frequently
used techniques: a simple query evaluation technique, henceforth referred to as scanning, and a bitmap-based
query evaluation algorithm, henceforth referred to as bitmap. The scanning algorithm
computes the aggregates for each cube query by scanning the entire data set each time. The bitmap
algorithm uses a bitmap index structure to answer cube queries without touching the
original data set (assuming the bit vectors are already set up). CubiST is implemented as described in
Sections xx.
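For reference, a minimal sketch of the scanning baseline is shown below (Python; the file name matches the illustrative generator sketch in Section 7.1, and the function name is ours):

# Scanning baseline: re-read the whole data file for every cube query and
# count the records whose values fall inside the selected ranges.
def scan_count(path, ranges):
    # ranges: one (lo, hi) pair per dimension, or None for an unconstrained dimension
    count = 0
    with open(path) as f:
        for line in f:
            values = [int(v) for v in line.split()]
            if all(r is None or r[0] <= v <= r[1] for v, r in zip(values, ranges)):
                count += 1
    return count

# e.g. query q1 below: first two dimensions in [2,8], third in [5,10]
print(scan_count("synthetic.txt", [(2, 8), (2, 8), (5, 10), None, None]))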
In this series of experiments, the number of dimensions and the corresponding domain sizes are fixed while
we increase the number of records. Specifically, there are five dimensions with sizes 10, 10, 15, 15, and 15,
respectively. The goal is to validate our claim that CubiST scales well with the number of records.
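To complement the scanning sketch above, the following minimal sketch shows the idea behind the bitmap baseline, assuming one bit vector per (dimension, value) pair has been built during the setup phase (again, all names are illustrative and this is not the indexing code used in the experiments):

# Setup phase: one Python integer per (dimension, value) pair serves as a bit vector.
def build_bitmaps(rows, cardinalities):
    bitmaps = [[0] * d for d in cardinalities]
    for i, row in enumerate(rows):
        for dim, v in enumerate(row):
            bitmaps[dim][v] |= 1 << i          # set bit i in vector (dim, v)
    return bitmaps

# Query phase: OR the vectors inside each range, AND across dimensions, count bits.
def bitmap_count(bitmaps, ranges, num_rows):
    result = (1 << num_rows) - 1               # all rows selected initially
    for dim, r in enumerate(ranges):
        if r is None:
            continue                           # unconstrained dimension
        lo, hi = r
        selected = 0
        for v in range(lo, hi + 1):
            selected |= bitmaps[dim][v]
        result &= selected
    return bin(result).count("1")

rows = [(2, 3, 7, 0, 1), (9, 2, 5, 4, 4), (3, 8, 10, 1, 0)]
bm = build_bitmaps(rows, [10, 10, 15, 15, 15])
print(bitmap_count(bm, [(2, 8), (2, 8), (5, 10), None, None], len(rows)))  # -> 2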
Setup Time (ms)      10K       100K      200K
Writing              1,211     11,527    23,125
Bitmap               1,045     10,039    21,660
Cubist               16,961    32,776    50,786

Figure 6: Setup time vs. number of records.

Response Time (ms)   10K       100K      200K
Scanning             979       9,196     18,347
Bitmap               51        494       989
Cubist               32        31        32

Figure 7: Response time vs. number of records.
Figures 6 and 7 show the performance of the three algorithms. The sample cube query on which
the experiments are based is q1 = (s1, s2, s3) = ([2,8], [2,8], [5,10]). Query q1 is an
example of a range query that computes the total number of records lying in the region where
the first two dimension values are in the range [2,8] and the third is in [5,10]. Although the setup
time for CubiST is larger than that of bitmap because of the insertion operation performed for each
record, its response time is much faster and almost independent of the number of records. Since
there is no setup time for the scanning algorithm, we simply measured how long it takes to
write the data out to disk as a reference value; this set of measurements appears as “writing” in the
first row of Figure 6.
7.4. Varying Domain Size and Query Complexity
In this series of experiments, we measure response times as a function of domain size and query complexity
(measured in terms of the number of selection components). First, the number of records (100,000) and the
domain sizes (15, 15, 15, 15, 15) are fixed, but the complexity of the input queries increases as follows:
Q1: (s1) = ([1,10]);
Q2: (s1, s2) = ([1,10], [1,10]);
Q3: (s1, s2, s3) = ([1,10], [1,10], [1,10]);
Q4: (s1, s2, s3, s4) = ([1,10], [1,10], [1,10], [1,10]);
Q5: (s1, s2, s3, s4, s5) = ([1,10], [1,10], [1,10], [1,10], [1,10]);
Although the performance of the scanning approach is less sensitive to query complexity than bitmap and
CubiST, its response times were nonetheless several orders of magnitude larger (Figure 12). As before,
CubiST is considerably faster than the bitmap approach. The large increase in the number of selected cells
per query, from 10 to 100,000, accounts for the increase in CubiST's response time.
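These cell counts follow directly from the query ranges: each Qm selects ten values ([1,10]) in m dimensions, so a flat statistics tree must aggregate 10^m leaf cells, from 10 for Q1 up to 100,000 for Q5. The short check below states this only as arithmetic on the ranges listed above:

# Number of leaf cells each query Qm selects in a flat statistics tree.
for m in range(1, 6):
    print(f"Q{m}: {10 ** m} selected cells")
# -> 10, 100, 1000, 10000, 100000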
Response Time (ms)   Q1      Q2      Q3      Q4      Q5
Scanning             8,201   8,109   7,930   7,870   7,964
Bitmap               252     723     1,093   1,385   1,742
Cubist               4       4       6       19      42

Figure 12: Response time vs. query complexity.

Response Time (ms)   Q6      Q7      Q8      Q9
Scanning             5,040   4,812   5,017   4,898
Bitmap               831     1,964   2,155   2,823
Cubist               5       6       12      24

Figure 13: Response time vs. domain sizes.
We repeated the series of experiments using another data set, where we fixed the number of records
(100,000) but increased the domain size of each dimension from 20 to 80 (i.e., the four groups of domain
sizes are 20,20,20; 40,40,40; 60,60,60; 80,80,80) and used the following queries:
Q6: (s1, s2, s3) = ([0,10], [0,10], [0,10]);
Q7: (s1, s2, s3) = ([0,20], [0,20], [0,20]);
Q8: (s1, s2, s3) = ([0,30], [0,30], [0,30]);
Q9: (s1, s2, s3) = ([0,40], [0,40], [0,40]);
From Figure 13, we can see the drastic increase in response times for both CubiST and bitmap as the
number of selected cells grows. In summary, the above experiments show that CubiST achieves
significant improvements in performance and scalability over both the scanning algorithm and
ROLAP implementations based on bitmap indexing techniques.
7.5. Comparison of Setup and Response Times of CubiST and a Top Commercial Product for a Set of Random Queries
8. Conclusion
In this paper, we have presented new algorithms for efficiently answering cube queries that contain
constraints on arbitrary hierarchy levels. Hierarchies are viewed as an iterative partitioning of the
lower-level values. All attribute values are mapped to integers, and the value inclusion relationship is
established among values at different levels. From the flat base statistics tree, we expand and
materialize a family of statistics trees. We proposed a greedy algorithm for choosing the family members
and an algorithm for computing the derived trees. The smallest query-matching tree from the family is
selected to answer each query; before applying CubiST, the query may need to be rewritten according
to the value mapping and inclusion relationship.
Together with the basic statistics tree data structure and the CubiST algorithm, the new methods proposed in
this paper provide an effective way to support truly ad-hoc cube queries. Our simulation results
demonstrate the performance benefits of the new algorithms.
References
[1]
S. Agarwal, R. Agrawal, P. Deshpande, J. Naughton, S. Sarawagi, and R. Ramakrishnan, “On The
Computation of Multidimensional Aggregates,” in Proceedings of the International Conference on
Very Large Databases, Mumbai (Bombay), India, 1996.
[2]
R. Agrawal, A. Gupta, and S. Sarawagi, “Modeling Multidimensional Databases,” in Proceedings
of the Thirteenth International Conference on Data Engineering, Birmingham, U.K., 1997.
[3]
Arbor Systems, “Large-Scale Data Warehousing Using Hyperion Essbase OLAP Technology,”
Arbor Systems, White Paper, http://www.hyperion.com/whitepapers.cfm.
[4]
S. Berchtold and D. A. Keim, “High-dimensional index structures database support for next
decade's applications (tutorial),” in Proceedings of the ACM SIGMOD International Conference on
Management of Data, Seattle, WA, pp. 501, 1998.
[5]
C. Y. Chan and Y. E. Ioannidis, “Bitmap Index Design and Evaluation,” in Proceedings of the
ACM SIGMOD International Conference on Management of Data, Seattle, WA, pp. 355-366,
1998.
[6]
S. Chaudhuri and U. Dayal, “Data Warehousing and OLAP for Decision Support,” SIGMOD
Record (ACM Special Interest Group on Management of Data), 26:2, pp. 507-508, 1997.
[7]
S. Chaudhuri and U. Dayal, “An Overview of Data Warehousing and OLAP Technology,”
SIGMOD Record, 26:1, pp. 65-74, 1997.
[8]
E. F. Codd, S. B. Codd, and C. T. Salley, “Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate,” Technical Report, http://www.arborsoft.com/OLAP.html.
[9]
D. Comer, “The Ubiquitous B-Tree,” ACM Computing Surveys, 11:2, pp. 121-137, 1979.
[10]
S. Goil and A. Choudhary, “High Performance OLAP and Data Mining on Parallel Computers,”
Journal of Data Mining and Knowledge Discovery, 1:4, pp. 391-417, 1997.
[11]
S. Goil and A. Choudhary, “PARSIMONY: An Infrastructure for Parallel Multidimensional
Analysis and Data Mining,” Journal of Parallel and Distributed Computing, to appear.
[12]
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H.
Pirahesh, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and
Sub-Totals,” Data Mining and Knowledge Discovery, 1:1, pp. 29-53, 1997.
[13]
A. Gupta, V. Harinarayan, and D. Quass, “Aggregate-query Processing in Data Warehousing
Environments,” in Proceedings of the Eighth International Conference on Very Large Databases,
Zurich, Switzerland, pp. 358-369, 1995.
[14]
H. Gupta and I. Mumick, “Selection of Views to Materialize Under a Maintenance Cost
Constraint,” Stanford University, Technical Report.
[15]
J. Han, “Towards On-Line Analytical Mining in Large Databases,” SIGMOD Record, 27:1, pp. 97-107, 1998.
[16]
V. Harinarayan, A. Rajaraman, and J. D. Ullman, “Implementing data cubes efficiently,” SIGMOD
Record (ACM Special Interest Group on Management of Data), 25:2, pp. 205-216, 1996.
[17]
Information Advantage, “Business Intelligence,” White Paper, 1998,
http://www.sterling.com/eureka/.
[18]
Informix Inc., “Informix MetaCube 4.2, Delivering the Most Flexible Business-Critical Decision
Support Environments,” Informix, Menlo Park, CA, White Paper,
http://www.informix.com/informix/products/tools/metacube/datasheet.htm.
[19]
W. Labio, D. Quass, and B. Adelberg, “Physical Database Design for Data Warehouses,” in
Proceedings of the International Conference on Data Engineering, Birmingham, England, pp.
277-288, 1997.
[20]
M. Lee and J. Hammer, “Speeding Up Warehouse Physical Design Using A Randomized
Algorithm,” in Proceedings of the International Workshop on Design and Management of Data
Warehouses (DMDW '99), Heidelberg, Germany, 1999.
[21]
D. Lomet, Bulletin of the Technical Committee on Data Engineering, vol. 18, IEEE Computer
Society, 1995.
[22]
Z. Michalewicz, Statistical and Scientific Databases, Ellis Horwood, 1992.
[23]
Microsoft Corp., “Microsoft SQL Server,” Microsoft, Seattle, WA, White Paper,
http://www.microsoft.com/federal/sql7/white.htm.
[24]
MicroStrategy Inc., “The Case For Relational OLAP,” MicroStrategy, White Paper,
http://www.microstrategy.com/publications/whitepapers/Case4Rolap/execsumm.HTM.
[25]
P. O'Neil and D. Quass, “Improved Query Performance with Variant Indexes,” SIGMOD Record
(ACM Special Interest Group on Management of Data), 26:2, pp. 38-49, 1997.
[26]
P. E. O'Neil, “Model 204 Architecture and Performance,” in Proceedings of the 2nd International
Workshop on High Performance Transaction Systems, Asilomar, CA, pp. 40-59, 1987.
[27]
Oracle Corp., “Oracle Express OLAP Technology”, Web site,
http://www.oracle.com/olap/index.html.
[28]
Pilot Software Inc., “An Introduction to OLAP Multidimensional Terminology and Technology,”
Pilot Software, Cambridge, MA, White Paper, http://www.pilotsw.com/olap/olap.htm.
[29]
Redbrick Systems, “Aggregate Computation and Management,” Redbrick, Los Gatos, CA, White
Paper,
http://www.informix.com/informix/solutions/dw/redbrick/wpapers/redbrickvistawhitepaper.html.
[30]
Redbrick Systems, “Decision-Makers, Business Data and RISQL,” Informix, Los Gatos, CA, White
Paper, 1997,
http://www.informix.com/informix/solutions/dw/redbrick/wpapers/risql.html.
[31]
J. Srivastava, J. S. E. Tan, and V. Y. Lum, “TBSAM: An Access Method for Efficient Processing
of Statistical Queries,” IEEE Transactions on Knowledge and Data Engineering, 1:4, pp. 414-423,
1989.
[32]
W. P. Yan and P. Larson, “Eager Aggregation and Lazy Aggregation,” in Proceedings of the
Eighth International Conference on Very Large Databases, Zurich, Switzerland, pp. 345-357,
1995.
[33]
Y. Zhao, P. M. Deshpande, and J. F. Naughton, “An Array-Based Algorithm for Simultaneous
Multidimensional Aggregates,” SIGMOD Record (ACM Special Interest Group on Management of
Data), 26:2, pp. 159-170, 1997.