INTEGRATING DATA CUBE COMPUTATION AND
EMERGING PATTERN MINING FOR
MULTIDIMENSIONAL DATA ANALYSIS
by
Wei Lu
a Report submitted in partial fulfillment
of the requirements for the SFU-ZU dual degree of
Bachelor of Science
in the School of Computing Science
Simon Fraser University
and
the College of Computer Science and Technology
Zhejiang University
© Wei Lu 2010
SIMON FRASER UNIVERSITY AND ZHEJIANG UNIVERSITY
April 2010
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without the permission of the author.
APPROVAL
Name: Wei Lu

Degree: Bachelor of Science

Title of Report: Integrating Data Cube Computation and Emerging Pattern Mining for Multidimensional Data Analysis

Examining Committee:
Dr. Jian Pei, Associate Professor, Computing Science, Simon Fraser University (Supervisor)
Dr. Qianping Gu, Professor, Computing Science, Simon Fraser University (Supervisor)
Dr. Ramesh Krishnamurti, Professor, Computing Science, Simon Fraser University (SFU Examiner)

Date Approved:
Abstract
Online analytical processing (OLAP) in multidimensional text databases has recently become an effective tool for analyzing text-rich data such as web documents. In this capstone
project, we follow the trend of using OLAP and the data cube to analyze web documents,
but want to address a new problem from the data mining perspective. In particular, we
wish to find contrast patterns in documents of different classes and then use those patterns
in OLAP-style text data and web document analysis. To this end, we propose to integrate
the data cube with an important kind of contrast pattern called the emerging pattern, to
build a new data model for solving the document analysis problem.
Specifically, this novel data model is implemented on top of the traditional data cube by
seamlessly integrating the bottom-up cubing (BUC) algorithm with two different emerging
pattern mining algorithms, the Border-Differential and the DPMiner. The processes of cube
construction and emerging pattern mining are merged together and carried out simultaneously; patterns are stored into the cube as cell measures. Moreover, we study and compare
the performance of those two integrations by conducting experiments on datasets derived
from the Frequent Itemset Mining Implementations Repository (FIMI). Finally, we suggest
improvements and optimizations that can be done in future work.
To my family
“For those who believe, no proof is necessary; for those who don’t believe, no proof is
possible.”
— Stuart Chase, Writer and Economist, 1888
Acknowledgments
First of all, I would like to express my deepest appreciation to Dr. Jian Pei, for his support
and guidance during my studies at Simon Fraser University. In various courses I took with
him and particularly this capstone project, Dr. Pei showed me his broad knowledge and
deep insights in the area of data management and mining, as well as his great personality
and patience with a research beginner like me. In his precious time, he provided me with
lots of help and advice for the project and other concerns (especially my graduate school
applications). This work would not have been possible without his supervision.
I would love to thank Dr. Qianping Gu and Dr. Ramesh Krishnamurti for reviewing my
report and directing the capstone projects for this amazing dual degree program. My gratitude also goes to Dr. Ze-Nian Li, Dr. Stella Atkins, Dr. Greg Mori and Dr. Ted Kirkpatrick
for their wonderful classes I took at SFU and their good advice for my studies and career
development. Also thanks to Dr. Guozhu Dong at Wright State University and Dr. Guimei
Liu at National University of Singapore for making useful resources available for my work.
I would also like to thank Mr. Thusjanthan Kubendranathan at SFU for his time and help
in our discussions about this project.
Deepest gratitude to my family and friends, who make my life enjoyable. In particular,
I am greatly indebted to my beloved parents for their unconditional support and encouragement. Their love accompanies me wherever I go. This work is dedicated to them and I
hope they are proud of me, as I am always proud of them.
Contents

Approval
Abstract
Dedication
Quotation
Acknowledgments
Contents
List of Tables
List of Figures

1 Introduction
  1.1 Overview of Text Data Mining
  1.2 Related Work on Multidimensional Text Data Analysis
  1.3 Contrast Pattern Based Document Analysis
  1.4 Structure of the Report

2 Literature Review
  2.1 Data Cubes and Online Analytical Processing
    2.1.1 An Example of The Data Cube
    2.1.2 Data Cubing Algorithms
    2.1.3 BUC: Bottom-Up Computation for Data Cubing
  2.2 Frequent Pattern Mining
  2.3 Emerging Pattern Mining
    2.3.1 The Border-Differential Algorithm
    2.3.2 The DPMiner Algorithm
  2.4 Summary

3 Motivation
  3.1 Motivation for Mining Contrast Patterns
  3.2 Motivation for Utilizing The Data Cube
  3.3 Summary

4 Our Methodology
  4.1 Problem Formulation
    4.1.1 Normalizing Data Schema for Text Databases
    4.1.2 Problem Modeling with Normalized Data Schema
  4.2 Processing Framework
    4.2.1 Integrating Data
    4.2.2 Integrating Algorithms
  4.3 Implementations
    4.3.1 BUC with DPMiner
    4.3.2 BUC with Border-Differential and PADS
  4.4 Summary

5 Experimental Results and Performance Study
  5.1 The Test Dataset
  5.2 Comparative Performance Study and Analysis
    5.2.1 Evaluating the BUC Implementation
    5.2.2 Comparing Border-Differential with DPMiner
  5.3 Summary

6 Conclusions
  6.1 Summary of The Project
  6.2 Limitations and Future Work

Bibliography

Index
List of Tables

2.1 A base table storing sales data [15]
2.2 Aggregates computed by group-by Branch
2.3 The full data cube based on Table 2.1
2.4 A sample transaction database [21]
3.1 A multidimensional text database concerning Olympic news
4.1 A normalized dataset derived from the Olympic news database
4.2 A normalized dataset reproduced from Table 4.1
5.1 Sizes of synthetic datasets for experiments
5.2 The complete experimental results
List of Figures

2.1 BUC Algorithm [5, 27]
2.2 An example of an FP-tree based on Table 2.4 [21]
5.1 Running time and cube size of our BUC implementation
5.2 Comparing running time of the two integration algorithms
Chapter 1
Introduction
1.1 Overview of Text Data Mining
Analysis of documents in text databases and on the World Wide Web has been attracting
researchers from various areas, such as data mining, machine learning, information retrieval,
database systems, and natural language processing.
In general, studies in different areas have different emphases. Traditional information
retrieval techniques (e.g., the inverted index and vector-space model) prove to be efficient
and effective in searching for relevant documents to answer unstructured keyword-based queries.
Machine learning approaches are also widely used in text mining, providing effective
solutions to various problems. For example, the Naive Bayes model and Support Vector Machines (SVMs) are used in document classification; K-means and the Expectation-Maximization (EM) algorithm are used in document clustering. The textbook by Manning
et al. [19] covers topics summarized above and much more in both traditional information
retrieval and machine learning based document analysis.
On the other hand, data warehousing and data mining also play important roles in
analyzing documents, especially those stored in a special kind of database called the multidimensional text database (one with both relational dimensions and text fields). While
information retrieval mainly addresses searching for documents and for information within
documents according to users’ information needs, the goal of text mining differs in the following sense: it focuses on finding and extracting useful patterns and hidden knowledge
from the information in documents and/or text databases, so as to improve the decision
making process based on the text information.
Currently, many real-life business, administration and scientific databases are multidimensional text databases, containing both structured attributes and unstructured text
attributes. An example of these databases can be found in Table 3.1. Since data warehousing and online analytical processing (OLAP) have proven their great usefulness in managing
and mining multidimensional data of varied granularities [11], they have recently become
important tools in analyzing such text databases [6, 17, 24, 26].
1.2 Related Work on Multidimensional Text Data Analysis
A data warehouse is a “subject-oriented, integrated, time-varying, non-volatile collection
of data that is used primarily in organizational decision making” [13]. Online analytical
processing (OLAP), which is dominated by “stylized queries that involve group-by and
aggregate operators” [27], is a powerful tool in data warehousing.
Being a multidimensional data model with various features, the data cube [10] has become an essential OLAP facility in data warehousing. Conceptually, the data cube is an
extended database with aggregates in multiple levels and multiple dimensions [15]. It generalizes the group-by operator, by precomputing and storing group-bys with regard to all
possible combinations of dimensions. Data cubes are widely used in data warehousing for
analyzing multidimensional data.
Applying OLAP techniques, especially data cubes, to analyze documents in multidimensional text databases has made significant advances. Important information retrieval
measures, such as term frequencies and inverted indexes, have been integrated into the traditional data cube, leading to the text cube [17]. It explores both the dimension hierarchy and
the term hierarchy in the text data, and is able to answer OLAP queries by navigating to a
specific cell via roll-up and drill-down operations. More recently, the work in [6] proposes
a query answering technique called TopCells to address the top-k query answering in the
text cube. Given a keyword query, TopCells is able to find the top-k ranked cells containing
aggregated documents that are most relevant to the query.
Another OLAP-based model dealing with multidimensional text data is the topic cube
[26]. The topic cube combines OLAP with probabilistic topic modeling. It explores the topic hierarchy of documents and stores probability-based measures learned through a probabilistic
model. Moreover, text cubes and topic cubes have been applied to information network analysis. They are combined into an information-network-enhanced text cube called iNextCube
[24].
Most previous works emphasize data warehousing more than data mining. They mainly
deal with problems such as how to explore and establish dimensional hierarchies within the
text data, and how to efficiently answer OLAP queries using cubes built on text data.
1.3 Contrast Pattern Based Document Analysis
We follow the trend of using data cubes to analyze documents in multidimensional text
databases. But as the previous works are more data warehousing oriented, we intend to
address a more data mining oriented problem called contrast pattern based document analysis.
More specifically, we wish to find contrast patterns in documents of different classes and
then use those patterns in OLAP style document analysis (like the work in [6, 17]). This
application is promising and has real-life demands. For example, from a large collection
of documents containing information and reviews of laptop computers of various brands,
a user interested in comparing Dell and Sony laptops might wish to find text information
describing Dell’s special features that do not characterize Sony. These features contrast the
two brands effectively, and would probably make the user’s decision to select Dell easier.
To achieve this goal, we propose to integrate frequent pattern mining, especially the
emerging pattern mining, and data cubing in an efficient and effective way. Frequent pattern
mining [2] aims to find itemsets, that is, sets of items that frequently occur in a dataset.
Furthermore, for patterns that can contrast different classes of data, intuitively they must
be frequent patterns in one class, but are comparatively infrequent in other classes.
There is one important class of contrast patterns, called the emerging pattern [7], defined
as itemsets whose supports increase significantly from dataset D1 to dataset D2. That is,
those patterns are frequent in D2 but infrequent in D1. Because of the sharp change of their
supports among different datasets, such patterns meet our needs of showing contrasts in
different classes of web documents.
Our Contributions
To tackle the contrast pattern based document analysis problem, we propose a novel data
model by integrating efficient emerging pattern algorithms (e.g., the Border-Differential [7]
and the state-of-the-art, DPMiner [16]) with the traditional data cube. This integrated
model is novel, but also preserves features of traditional data cubes:
1. It is based on the data cube, and is constructed through a classical data cubing
algorithm called BUC (the Bottom-Up Computation for data cubing) [5].
2. It contains multidimensional text data and multiple granularity aggregates of such
data, in order to support fast OLAP operations (such as roll-up and drill-down) and
query answering.
3. Each cell in the cube contains a set of aggregated documents in the multidimensional
text database with matched dimension attributes.
4. The measure of each cell is the set of emerging patterns whose supports rise rapidly from
the documents not aggregated in the cell to the documents aggregated in the cell.
In this capstone project, we implement this integrated data model by incorporating
emerging pattern mining seamlessly into the data cubing process. We choose BUC as our
cubing algorithm to build the cube on structured dimensions. While aggregating documents
and materializing cells, we simultaneously mine emerging patterns in documents aggregated
in each particular cell, and store such patterns as the measure of this cell. Two widely
used emerging pattern mining algorithms, the Border-Differential and the DPMiner are
integrated with BUC cubing so as to compare their performance.
We tested these two different integrations on synthetic datasets to evaluate their performance on different sizes of input data. The datasets are derived from the Frequent
Itemset Mining Implementations Repository (FIMI) [9]. Experimental results show that the
state-of-the-art emerging pattern mining algorithm, the DPMiner, is a better choice over
the Border-Differential.
Our cube-based model shares similarity with the text cube [17] and the topic cube [26]
at the level of data structure, since all three cubes are built based on multidimensional
text data. The similarity of cube-based structure allows OLAP query answering techniques
developed in [6, 17, 24, 26] to be directly applied to our cube. In that sense, point queries
(seeking a cell), sub-cube queries (seeking an entire group-by) and top-k queries (seeking k
most relevant cells) can be answered in contrast pattern based document analysis using our
model.
Major Differences with Existing Works
This cube-based data model with emerging patterns as cell measures differs from all previous
related work. It is unlike traditional data cubes, which use simple aggregate functions as cell
measures that are only adequate for relational databases. Also, our approach differs from
the text cube, which uses term frequencies and inverted indexes as cell measures, and the
topic cube, which uses probabilistic measures.
Most importantly, to the best of our knowledge, our data model is novel in comparison to
previous emerging pattern applications in OLAP. Specifically, a previous work in [20] used
the Border-Differential algorithm to perform cube comparisons and capture trend changes
between two precomputed data cubes. However, that work is of limited use and cannot be
applied to multidimensional text data analysis.
First, their approach worked on datasets different in kind from ours. The previous
method only works on traditional data cubes built upon relational databases with categorical
dimension attributes, while ours is designed for multidimensional text databases. Second,
their approach is to find cells with supports growing significantly from one cube to another,
but ours is able to determine emerging patterns for every single cell in the cube. Last but
not least, their approach performs the Border-Differential algorithm after two data cubes
were completely built, but our approach introduces a seamless integration: the data cubing
and emerging pattern mining are carried out simultaneously.
1.4 Structure of the Report
The rest of this capstone project report is organized as follows: Chapter 2 conducts a
literature review on previous work and background knowledge that lays the foundation for
this project. Chapter 3 motivates the contrast pattern based document analysis problem.
Chapter 4 describes our methodology to tackle the problem. This chapter formulates the
problem and proposes algorithms for constructing the integrated data model. Chapter 5
reports experimental results and studies the performance of our algorithm. Lastly, Chapter
6 concludes this capstone project and suggests improvements and optimizations that can be
done in future work.
Chapter 2
Literature Review
This chapter reviews three categories of previous research that are related to this capstone
project: data cubes and OLAP, frequent pattern mining, and emerging pattern mining.
In Section 2.1 we talk about fundamentals of data warehousing, online analytical processing (OLAP), and data cubing. We highlight BUC [5], a bottom-up approach for data
cubing. Section 2.2 introduces frequent pattern mining and an important mining algorithm
called FP-Growth [12]. Section 2.3 reviews emerging pattern mining algorithms (Border-Differential [7] and DPMiner [16]) that are particularly useful to our work.
2.1 Data Cubes and Online Analytical Processing
A data warehouse is "a subject-oriented, integrated, time-varying, non-volatile collection
of data in support of management's decision-making process" [13]. A powerful tool for
exploiting data warehouses is the so-called online analytical processing (OLAP). Typically,
OLAP systems are dominated by “stylized queries involving many group-by and aggregation
operations” [27].
The data cube was introduced in [10] to facilitate answering OLAP queries on multidimensional data stored in data warehouses. A data cube can be viewed as “an extended
multi-level and multidimensional database with various multiple granularity aggregates”
[15]. The term data cubing refers to the process of constructing a data cube based on a
relational database table, which is often referred to as the base table. In a cubing process,
cells with non-empty aggregates will be materialized. Given a base table, we precompute
group-bys and the corresponding aggregate values with respect to all possible combinations
of dimensions in this table. Each group-by corresponds to a set of cells, and the aggregate
value computed for each cell is stored as its measure. Cell measures provide a good
and concise summary of information aggregated in the cube.
In light of the above, the data cube is a powerful data model allowing fast retrieval and
analysis of multidimensional data for decision making processes based on data warehouses.
It generalizes the group-by operator in SQL (Structured Query Language) and enables data
analysts to avoid long and complicated SQL queries when searching for unusual data patterns
in multidimensional databases [10].
2.1.1 An Example of The Data Cube
Example (Data Cube):
Table 2.1 is a sample base table in a marketing management
data warehouse [15]. It shows data organized under the schema (Branch, Product, Season,
Sales).
Branch   Product   Season   Sales
B1       P1        spring   6
B1       P2        spring   12
B2       P1        fall     9

Table 2.1: A base table storing sales data [15].
To build a data cube upon this table, group-bys are computed on three dimensions
Branch, Product and Season. Aggregate values of Sales will be the cell measures. We
choose Average(Sales) as the aggregate function for this example. Since most
intermediate steps of a data cubing process are basically computing group-bys and aggregate
values to form cells, we illustrate the two cells computed by “group-by Branch” in Table 2.2.
Cell No.   Branch   Product   Season   AVG(Sales)
1          B1       ∗         ∗        9
2          B2       ∗         ∗        9

Table 2.2: Aggregates computed by group-by Branch.
In the same manner, the full data cube contains all possible group-bys on Branch,
Product and Season. It is shown in Table 2.3. Note that cells 1, 2 and 3 are derived
from the least aggregated group-by: group-by Branch, Product, Season. Such cells are
called base cells. On the other hand, cell 18 (∗, ∗, ∗) is the apex cuboid aggregating all
tuples in the base table.
Cell No.   Branch   Product   Season   AVG(Sales)
1          B1       P1        spring   6
2          B1       P2        spring   12
3          B2       P1        fall     9
4          B1       P1        ∗        6
5          B1       P2        ∗        12
6          B1       ∗         spring   9
7          B2       P1        ∗        9
8          B2       ∗         fall     9
9          ∗        P1        spring   6
10         ∗        P1        fall     9
11         ∗        P2        spring   12
12         ∗        ∗         spring   9
13         ∗        ∗         fall     9
14         B1       ∗         ∗        9
15         B2       ∗         ∗        9
16         ∗        P1        ∗        7.5
17         ∗        P2        ∗        12
18         ∗        ∗         ∗        9

Table 2.3: The full data cube based on Table 2.1.
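To make the construction of Table 2.3 concrete, the following small C++ sketch (our illustration only, not the implementation used in this project; all names are ours) enumerates the 2^3 possible group-bys of Table 2.1 and computes AVG(Sales) for every resulting cell, which reproduces the 18 cells above.

#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Row { std::string branch, product, season; double sales; };

int main() {
    // The base table of Table 2.1.
    std::vector<Row> base = {{"B1", "P1", "spring", 6},
                             {"B1", "P2", "spring", 12},
                             {"B2", "P1", "fall", 9}};
    // Enumerate all 2^3 group-bys; a set bit means the dimension is grouped on,
    // otherwise it is aggregated away and shown as "*".
    for (int mask = 7; mask >= 0; --mask) {
        std::map<std::vector<std::string>, std::pair<double, int>> cells;  // key -> (sum, count)
        for (const Row& r : base) {
            std::vector<std::string> key = {(mask & 4) ? r.branch  : "*",
                                            (mask & 2) ? r.product : "*",
                                            (mask & 1) ? r.season  : "*"};
            cells[key].first += r.sales;   // accumulate Sales
            cells[key].second += 1;        // count tuples aggregated in the cell
        }
        for (const auto& [key, agg] : cells)   // one output line per materialized cell
            std::cout << key[0] << " " << key[1] << " " << key[2]
                      << "  AVG(Sales) = " << agg.first / agg.second << "\n";
    }
    return 0;
}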
2.1.2 Data Cubing Algorithms
Efficient and scalable data cubing is challenging. When a base table has a large number
of dimensions and each dimension has high cardinality, time and space complexity grows
exponentially.
In general, there are three approaches to cubing in terms of the order in which cells are
materialized: top-down, bottom-up and a mix of both. A top-down approach (e.g., the Multiway
Array Aggregation [28]) constructs the cube from the least aggregated base cells towards
the most aggregated apex cuboid. On the contrary, a bottom-up approach such as BUC [5]
computes cells in the opposite order. Other methods, such as Star-Cubing [23], combine
the top-down and bottom-up mechanisms together to carry out the cubing process.
On fast computation of multidimensional aggregates, [11] summarizes the following optimization principles:
(1) Sorting or hashing dimension attributes to cluster related tuples that are likely to be aggregated together in certain group-bys.
(2) Computing higher-level aggregates from previously computed lower-level aggregates, and caching intermediate results in memory to reduce expensive I/O operations.
(3) Computing a group-by from the smallest previously computed group-by.
(4) Mapping dimension attributes in various formats to integers ranging between zero and the cardinality of the dimension.
There are also many other heuristics proposed to improve the efficiency of data cubing [1, 5, 11].
2.1.3 BUC: Bottom-Up Computation for Data Cubing
BUC [5] constructs the data cube bottom-up, from the most aggregated apex cuboid to
group-bys on a single dimension, then on a pair of dimensions, and so on. It also uses
many optimization techniques introduced in the previous section. Figure 2.1 illustrates
the processing tree and the partition method used in BUC on a 4-dimensional base table.
Subfigure (b) shows the recursive nature of BUC: after sorting and partitioning data on
dimension A, we deal with the partition (a1 , ∗, ∗, ∗) first and recursively partition it on
dimension B to proceed to its parent cell (a1 , b1 , ∗, ∗) and then the ancestor (a1 , b1 , c1 , ∗)
and so on. After dealing with partition a1 , BUC continues on to process partitions a2 , a3
and a4 in the same manner until all cells are materialized.
Figure 2.1: BUC Algorithm [5, 27].
The depth-first search process for building our integrated data model (covered in Chapter
4) follows the basic framework of BUC.
2.2 Frequent Pattern Mining
Frequent patterns are patterns (sets of items, sequences, etc.) that occur frequently in a
database [2]. The supports of frequent patterns must exceed a pre-defined minimal support
threshold.
Frequent pattern mining has been studied extensively in the past two decades. It lays the
foundation for many data mining tasks such as association rule mining [3] and emerging pattern
mining. Although its definition is concise, the mining algorithms are not trivial. Two
notable algorithms are Apriori [3] and FP-Growth [12]. FP-Growth is more important to
our work, as efficient emerging pattern mining algorithms such as [4, 16] use the FP-tree
proposed in FP-Growth as their underlying data structure.
FP-Growth addressed the limitations of the breadth-first-search-based Apriori, such as
multiple database scans and large amounts of candidate generation and support counting. It
is a depth-first search algorithm. The first scan of the database finds all frequent items, ranks
them in frequency-descending order, and puts them into a header table. Then it compresses
the database into a prefix tree called the FP-tree. The complete set of frequent patterns can be
mined by recursively constructing projected databases and the FP-trees based on them. For
example, given the transaction database in Table 2.4 [21], we can build an FP-tree accordingly
(shown in Figure 2.2).
TID   Items                      (Ordered) Frequent Items
100   f, a, c, d, g, i, m, p     f, c, a, m, p
200   a, b, c, f, l, m, o        f, c, a, b, m
300   b, f, h, j, o              f, b
400   b, c, k, s, p              c, b, p
500   a, f, c, e, l, p, m, n     f, c, a, m, p

Table 2.4: A sample transaction database [21].
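As a brief clarification (our addition, derived from the table itself): with a minimal support threshold of 3, the first scan of Table 2.4 finds the frequent items f:4, c:4, a:3, b:3, m:3 and p:3; the last column lists the frequent items of each transaction in this descending support order, which is the order used when the transactions are inserted into the FP-tree.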
Next, we define three special types of frequent patterns: the maximal frequent patterns
(max-patterns for short), the closed frequent patterns and frequent generators, as they are
closely related to emerging pattern mining.
Figure 2.2: An example of an FP-tree based on Table 2.4 [21].

Definition (Max-Pattern): An itemset X is a maximal frequent pattern, or max-pattern, in dataset D if X is frequent in D and every proper super-itemset Y with X ⊂ Y is infrequent in D [11].
Definition (Closed Pattern and Generator): An itemset X is closed in dataset D
if there exists no proper super-itemset Y s.t. X ⊂ Y and support(X) = support(Y ) in D.
X is a closed frequent pattern in D if it is both closed and frequent in D [11].
An itemset Z is a generator in D if there exists no proper sub-itemset Z′ ⊂ Z such that
support(Z′) = support(Z) [18].
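To make these definitions concrete, consider a small illustrative dataset (ours, not from the report) D = {{a, b, c}, {a, b}, {a, c}} with a minimal support threshold of 2. The frequent patterns are {a}:3, {b}:2, {c}:2, {a, b}:2 and {a, c}:2. The max-patterns are {a, b} and {a, c}, since neither has a frequent proper superset. The closed frequent patterns are {a}, {a, b} and {a, c}; {b} is not closed because its proper superset {a, b} has the same support. The generators include {b} and {c}, whereas {a} is not a generator because it has the same support as its proper sub-itemset, the empty set.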
The state-of-the-art max-pattern mining algorithm is called the Pattern-Aware Dynamic
Search (PADS) [25]. The DPMiner, the state-of-the-art emerging pattern mining algorithm,
is also the most powerful algorithm for mining closed frequent patterns and frequent generators.
2.3 Emerging Pattern Mining
Emerging patterns [7] are patterns whose supports increase significantly from one class of
data to another. Mathematical details can be found in Section 4.1 (Problem Formulation) of
this report and in [4, 7, 8, 16]. The original work on emerging patterns in [7] gives an algorithm
called the Border-Differential for mining such patterns. It uses borders to succinctly represent patterns and mines the patterns by manipulating the borders only. The work in [4]
used the FP-tree introduced in [12] for emerging pattern mining. Following that, the work
in [16] improves the FP-tree-based algorithm by simultaneously generating closed frequent
patterns and frequent generators to form emerging patterns. This algorithm is called the
DPMiner and is considered as the state-of-the-art for emerging pattern mining.
2.3.1 The Border-Differential Algorithm
Border-Differential uses borders to represent patterns. It involves mining max-patterns
and manipulating borders initiated by the patterns to derive the border representation of
emerging patterns.
A border is an ordered pair ⟨L, R⟩, where L and R are the left and right bounds of the
border respectively. Both L and R are collections of itemsets, but are much smaller than
the original patterns in size. Emerging patterns represented by ⟨L, R⟩ are the intervals of
⟨L, R⟩, defined as [L, R] = {Y | ∃X ∈ L, ∃Z ∈ R, s.t. X ⊆ Y ⊆ Z}. For example, suppose
[L, R] = {{1}, {1, 2}, {1, 3}, {1, 2, 3}, {2, 3}, {2, 3, 4}}; it has border L = {{1}, {2, 3}}, R =
{{1, 2, 3}, {2, 3, 4}}. Itemsets other than those in L and R (e.g., {1, 3}) are intervals of
⟨L, R⟩.
Given a pair of borders ⟨{φ}, R1⟩ and ⟨{φ}, R2⟩ whose left bounds are initially empty,
the differential border ⟨L1, R1⟩ is derived to satisfy [L1, R1] = [{φ}, R1] − [{φ}, R2]. This
operation is the so-called Border-Differential.
Furthermore, given two datasets D1 and D2, to determine emerging patterns using the
Border-Differential operation, we first determine the max-patterns U1 of D1 and U2 of D2
using PADS, and initiate two borders ⟨{φ}, U1⟩ and ⟨{φ}, U2⟩. Then, we compute the differential
between those two borders. Let U1 = {X1, X2, ..., Xn} and U2 = {Y1, Y2, ..., Ym}, where
the Xi and Yj are itemsets; the left bound of the differential border is computed as

L1 = ⋃ᵢ₌₁ⁿ ( PowerSet(Xi) − ⋃ⱼ₌₁ᵐ PowerSet(Yj) ).

The right bound U1 remains the same. Lastly, we form the border ⟨L1, U1⟩, and the intervals
[L1, U1] of ⟨L1, U1⟩ are the emerging patterns in D1.
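As a minimal sketch of the formula above (our illustration with assumed names, not the report's implementation), the following C++ code computes L1 by explicit set enumeration, representing each itemset as a bitmask over at most 32 items.

#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

using Itemset = std::uint32_t;  // bit i set means item i is in the itemset

// Insert every subset of 'mask' (i.e., PowerSet(mask)) into 'out'.
static void addPowerSet(Itemset mask, std::unordered_set<Itemset>& out) {
    Itemset sub = mask;
    while (true) {
        out.insert(sub);
        if (sub == 0) break;
        sub = (sub - 1) & mask;  // standard subset-enumeration trick
    }
}

// L1 = union over Xi in U1 of ( PowerSet(Xi) - union over Yj in U2 of PowerSet(Yj) ).
// (The full Border-Differential would usually keep only the minimal itemsets of L1.)
std::vector<Itemset> borderDifferential(const std::vector<Itemset>& U1,
                                        const std::vector<Itemset>& U2) {
    std::unordered_set<Itemset> negPower;
    for (Itemset y : U2) addPowerSet(y, negPower);
    std::unordered_set<Itemset> L1;
    for (Itemset x : U1) {
        std::unordered_set<Itemset> posPower;
        addPowerSet(x, posPower);
        for (Itemset s : posPower)
            if (negPower.find(s) == negPower.end()) L1.insert(s);
    }
    return {L1.begin(), L1.end()};
}

int main() {
    std::vector<Itemset> U1 = {0b1110};          // max-pattern {1, 2, 3}
    std::vector<Itemset> U2 = {0b0110, 0b1010};  // max-patterns {1, 2} and {1, 3}
    for (Itemset p : borderDifferential(U1, U2))
        std::cout << "pattern mask: " << p << "\n";  // masks for {2, 3} and {1, 2, 3}
    return 0;
}

Because every power set is enumerated explicitly, the cost grows exponentially with the length of the max-patterns, which leads directly to the efficiency concerns discussed next.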
As the size of the datasets grows, the Border-Differential becomes problematic because
it involves set enumerations, resulting in exponential computational costs. The work in [8], a
more recent version of [7], proposed several optimization techniques to improve the efficiency
of the Border-Differential. In fact, the complexity of finding emerging patterns is MAX
SNP-hard, which means that polynomial-time approximation schemes do not exist unless
P = NP [22].
2.3.2 The DPMiner Algorithm
The work in [4] used the FP-tree and pattern-growth methods to mine emerging patterns,
but it still needs to call the Border-Differential to find emerging patterns. The DPMiner (which stands
for Discriminative Pattern Miner) in [16] also uses the FP-tree but mines emerging patterns in
a different way. It finds closed frequent patterns and frequent generators simultaneously to
form equivalent classes of such patterns, and then determines emerging patterns as "non-redundant δ-discriminative equivalent classes" [16].
An equivalent class EC is “a set of itemsets that always occur together in some transactions of dataset D” [16]. It can be uniquely represented by its set of frequent generators
G and closed frequent patterns C, in the form of EC = [G, C].
Suppose D can be divided into various classes, denoted as D = D1 ∪ D2 ∪ ... ∪ Dn. Let
δ be a small integer (usually 1 or 2) and θ be a minimal support threshold. An equivalent
class EC is a δ-discriminative equivalent class provided that the support of its closed pattern C
is greater than θ in D1 but smaller than δ in D − D1 = D2 ∪ ... ∪ Dn. Furthermore, EC is a
non-redundant δ-discriminative equivalent class if and only if (1) it is δ-discriminative, and (2)
there exists no δ-discriminative equivalent class EC′ such that C′ ⊆ C, where C′ and C are
the closed patterns of EC′ and EC respectively. The closed frequent patterns of a non-redundant
δ-discriminative equivalent class are emerging patterns in D1.
Data Structures and Computational Steps of The DPMiner
The high efficiency of the DPMiner is mainly attributed to its revised FP-tree structure.
Unlike traditional FP-trees, it does not store items that appear in every transaction and hence
have full support in D. These items are removed because they cannot form generators.
This modification results in a much smaller FP-tree compared to the original.
The computational framework of the DPMiner consists of the following five steps:
(1). Given k classes of data D1 , D2 , ..., Dk as input, obtain a union of them to get D =
D1 ∪ D2 ∪ ... ∪ Dk . Also specify a minimal support threshold θ and a maximal threshold δ
(thus, patterns with supports above θ in Di but below δ in D − Di are candidate emerging
patterns in Di ).
(2). Construct an FP-tree based on D and run a depth-first search on the tree to find
frequent generators and closed patterns simultaneously. For each search path along the tree,
the search terminates whenever a δ-discriminative equivalent class is reached.
(3). Determine the class label distribution for every closed pattern, i.e., find in which
class a closed pattern has the highest support. This step is necessary because patterns are
not mined separately for each Di (1 ≤ i ≤ k), but rather on the entire D.
(4). Pair up generators and closed frequent patterns to form δ-discriminative equivalent
classes.
(5). Output the non-redundant δ-discriminative equivalent classes as emerging patterns.
If a pattern is labeled as i (1 ≤ i ≤ k), then it is an emerging pattern in Di .
2.4 Summary
In this chapter, we discussed previous research addressing data cubing, frequent pattern
mining and emerging pattern mining, all of which are essential for our project. Algorithms
(the Bottom-Up Cubing, the Border-Differential and the DPMiner) closely related to our
work have been described in detail.
Chapter 3
Motivation
In this chapter, we motivate the problem of contrast pattern based document analysis. We
explain why contrast patterns (in particular, the emerging patterns) are useful, and why
data cubes should be used in analyzing documents in multidimensional text databases.
3.1 Motivation for Mining Contrast Patterns
This section answers the following two questions: (1) Why do we need to mine and use contrast
patterns to analyze web documents? (2) How useful are those patterns? In other words, can
they make a significant contribution to a good text mining application? We answer these
questions by introducing motivating scenarios in real life.
Example (Contrast Patterns in Documents): After the Calgary 1988 Olympic Winter Games, Canada did not host the Olympic Games again for 22 years. Therefore, people may want to know the most attractive and discriminative features of
the Vancouver 2010 Winter Olympics, compared to all previous Olympic Games. Indeed,
there are exciting and touching stories in almost all Olympics and Vancouver certainly has
its unique moments. For example, the Canadian figure skater Joannie Rochette won a
bronze medal under the keenly felt pain of losing her mother a day before her event started.
Suppose a user searches the web and Google returns her a collection of documents on
Olympics, consisting of many online sports news and commentaries. There may be too
much information for her to read through and find unique stories about Vancouver 2010.
Although there is no doubt that Joannie Rochette’s accomplishment will occur frequently
in articles related to Vancouver 2010, a user who is previously unaware of Rochette may
not be able to learn about her quickly from the search results.
Similar situations may also happen when users compare products online by searching
and reading reviews by previous buyers. Here is an example we have seen in Section 1.3:
Suppose a user is comparing Dell’s laptop computers with Sony’s. She probably wants
to know the special features of Dell laptops that are not shared by Sony's. For example, many
reviewers would speak in favor of Dell by commenting "high performance-price ratio" but
would not say the same of Sony. Then "high performance-price ratio" is a
pattern contrasting Dell laptops with Sony laptops.
Letting users manually determine such contrast patterns is not feasible. Therefore,
given a collection of documents, which are ideally pre-classified and stored into a multidimensional text database, we need to develop efficient data models and corresponding
algorithms to determine contrast patterns in documents of different classes.
As mentioned in Section 1.3, we choose the emerging pattern [7] since it is a representative class of contrast patterns widely used in data mining. Also, there are good algorithms
[4, 7, 16] available for efficient mining of such patterns. Moreover, emerging patterns can
make a contribution to some other problems in text mining. A novel document classifier
could be constructed based on those patterns as they are claimed useful in building accurate
classifiers [8]. Also, since emerging patterns are able to capture discriminative features of a
class of data, they may be helpful in extracting keywords to summarize the given text.
3.2 Motivation for Utilizing The Data Cube
In many real-life database applications, documents and the text data within them are stored
in multidimensional text databases [24]. These kinds of databases are distinct from traditional data sources we deal with, including relational databases, transaction databases, and
text corpora. Formally, a multidimensional text database is defined as a relational database
with text fields. A sample text database is shown in Table 3.1. The first three dimensions (Event, Time, and Publisher) are standard dimensions, just like those in relational
databases. The last column contains the text dimension, which consists of documents with text terms.
Text databases provide structured attributes of documents, and users' information needs
vary and can be modeled hierarchically. This makes OLAP and data
cubes applicable. For instance (using Table 3.1), if a user wants to read news on the ice
hockey games reported by the Vancouver Sun on February 20, 2010, then the two documents d1
and d3 matching the query {Event = Ice hockey, Time = 2010/2/20, Publisher = Vancouver
Sun} will be returned to her. If another user wants to skim all Olympic news reported by
the Vancouver Sun on that day, we shall roll up to the query {Event = ∗, Time = 2010/2/20,
Publisher = Vancouver Sun} and return documents d1, d3 and d5 to her. The opposite
operation of roll-up is called drill-down. In fact, roll-up and drill-down are two OLAP
operations of great importance [11]. Therefore, to meet different levels of information needs,
it is natural for us to apply the data cube to model and extend this text database. This is
exactly what the previous work in [17, 24, 26] did.

Event            Time       Publisher        ...   Text Data: Documents
Ice hockey       2010/2/20  Vancouver Sun    ...   d1 = {t1, t2, t3, t4}
Ice hockey       2010/2/23  Globe and Mail   ...   d2 = {t2, t3, t7, t8}
Ice hockey       2010/2/20  Vancouver Sun    ...   d3 = {t1, t2, t3, t6}
Figure skating   2010/2/20  Globe and Mail   ...   d4 = {t2, t4, t6, t7}
Figure skating   2010/2/20  Vancouver Sun    ...   d5 = {t1, t3, t5, t7}
Curling          2010/2/23  New York Times   ...   d6 = {t2, t5, t7, t9}
Curling          2010/2/28  Globe and Mail   ...   d7 = {t3, t6, t8, t9}
...              ...        ...              ...   ...

Table 3.1: A multidimensional text database concerning Olympic news.
3.3 Summary
In light of the above, this chapter shows that contrast patterns are useful in analyzing
large-scale text data and that they are able to give concise information about the data. Also, the
nature of multidimensional text databases makes OLAP and the most essential OLAP tool,
the data cube, particularly suitable for modeling and analyzing text data in documents.
Chapter 4
Our Methodology
In this chapter, we describe our methodology to tackle the contrast pattern based document
analysis by building a novel integrated data model through BUC data cubing [5] and two
emerging pattern mining algorithms, the Border-Differential [7] and the DPMiner [16].
Section 4.1 formulates the problem we try to address in this work. Section 4.2 describes
the processing framework and our algorithms, from both data integration level and algorithm
integration level. Section 4.3 discusses issues related to implementation.
4.1 Problem Formulation

4.1.1 Normalizing Data Schema for Text Databases
Suppose a collection of web documents is stored in a multidimensional text database. The
text data in documents are collected under the schema containing a set of standard non-text
dimensions {SD1, SD2, ..., SDn}, and a set of text dimensions (terms) {TD1, TD2, ..., TDm},
where m is the number of distinct text terms in this collection. For simplicity, text terms
can be mapped to items, so documents can be mapped to transactions, or itemsets (sets
of items that appear together). This mapping is similar to the bag-of-words model, which
represents text data as an unordered collection of words, disregarding word order and count.
In that sense, a multidimensional text database can be mapped to a relational base table
with a transaction database.
Under the above mapping mechanism, each tuple in a text database corresponds to a
certain document, in the form of ⟨S, T⟩, where S is the set of standard dimension attributes
and T is a transaction. The dimension attributes can be learned through a certain classifier
or labeled manually. Words in the document are tokenized and each distinct token will
be treated as an item in the transaction. For example, the tuple corresponding to the first
row in Table 3.1 is ⟨Ice hockey, 2010/2/20, Vancouver Sun, ..., d1 = {t1, t2, t3, t4}⟩, with
d1 = {t1, t2, t3, t4} being the transaction.
Furthermore, we normalize text database tuples to derive a simplified data schema. We
map standard dimensions to letters, e.g., Event to A, Time to B and Publisher to C, to make
them unified. Likewise, dimension attributes are mapped to items in the same manner: Ice
hockey is mapped to a1 , Figure skating is mapped to a2 and so on. Table 4.1 shows a
normalized dataset derived from the Olympic news database (Table 3.1).
A     B     C     ...   Transactions
a1    b1    c1    ...   d1 = {t1, t2, t3, t4}
a1    b2    c2    ...   d2 = {t2, t3, t7, t8}
a1    b1    c1    ...   d3 = {t1, t2, t3, t6}
a2    b1    c2    ...   d4 = {t2, t4, t6, t7}
a2    b1    c1    ...   d5 = {t1, t3, t5, t7}
a3    b2    c3    ...   d6 = {t2, t5, t7, t9}
a3    b3    c2    ...   d7 = {t3, t6, t8, t9}
...   ...   ...   ...   ...

Table 4.1: A normalized dataset derived from the Olympic news database.
4.1.2 Problem Modeling with Normalized Data Schema
Given a normalized dataset as a base table, we build our integrated cube-based data model
by computing a full data cube grouped by all standard dimensions (e.g., {A, B, C} in the
above table). In the data cubing process, every subset of {A, B, C} will be enumerated
to form a group-by corresponding to a set of cells. Emerging patterns in each cell will be
mined simultaneously and stored as cell measures.
When materializing each cell, we aggregate tuples whose dimension attributes match
this particular cell. The transactions of matched tuples form the target class (or positive
class), denoted as TC. We also virtually aggregate all unmatched tuples and extract their
transactions to form the background class (or negative class), denoted as BC. The membership in TC and BC varies from cell to cell; both classes are dynamically computed and
formed for each cell.
A transaction T is the full itemset in a tuple. A pattern X is a sub-itemset of T having a
non-zero support (i.e., the number of times X appears) in the given dataset. Let θ be the
minimal support threshold for TC and δ be the maximal support threshold for BC. Pattern
X is an emerging pattern in TC if and only if support(X, TC) ≥ θ and support(X, BC) ≤ δ.
In other words, the support of X grows significantly from BC to TC, exceeding a minimal
growth rate threshold ρ = θ/δ. Mathematically, growth rate(X) = support(X, TC) / support(X, BC) ≥ ρ. Note that δ can be 0, hence ρ = θ/δ = ∞. If growth rate(X) = ∞, X is
a jumping emerging pattern [7], which does not appear in BC at all.
Given predefined support thresholds θ and δ, for each cell in this cube-based model,
we mine all patterns whose support is above θ in the target class T C and below δ in its
background class BC. Thus, such patterns automatically exceed the minimal growth rate
threshold ρ, and become a measure of this cell. Upon obtaining all cells and corresponding
emerging patterns, the model building process is complete. The entire process is based on
data cubing and also requires a seamless integration of cubing and emerging pattern mining.
Example: Let us consider a simple example regarding the base table in Table 4.1.
Let θ = 2 and δ = 1. Suppose at a certain stage, we are carrying out the group-by operation
on dimension A. We get three cells: (a1, ∗, ∗), (a2, ∗, ∗) and (a3, ∗, ∗). For cell (a1, ∗, ∗),
aggregating the first three tuples in Table 4.1, TC = {d1, d2, d3} and BC = {d4, d5, d6, d7}.
Then consider the pattern X = {t1, t2, t3}. It appears twice in TC (in d1 and d3) but zero times
in BC, so support(X, TC) ≥ θ and support(X, BC) ≤ δ. In that sense, X = {t1, t2, t3} is a
(jumping) emerging pattern in TC and hence is a measure of cell (a1, ∗, ∗).
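As a minimal sketch of this cell-level check (our illustration; the names are ours, not the project's code), the following C++ function decides whether a pattern is an emerging pattern of a cell given its target class TC and background class BC:

#include <set>
#include <vector>

using Item = int;
using Transaction = std::set<Item>;

// support(X, D): number of transactions in D containing every item of pattern X.
static int support(const std::vector<Item>& X, const std::vector<Transaction>& D) {
    int count = 0;
    for (const Transaction& t : D) {
        bool contained = true;
        for (Item i : X)
            if (t.find(i) == t.end()) { contained = false; break; }
        if (contained) ++count;
    }
    return count;
}

// X is an emerging pattern in TC iff support(X, TC) >= theta and support(X, BC) <= delta.
bool isEmergingPattern(const std::vector<Item>& X,
                       const std::vector<Transaction>& TC,
                       const std::vector<Transaction>& BC,
                       int theta, int delta) {
    return support(X, TC) >= theta && support(X, BC) <= delta;
}

For cell (a1, ∗, ∗) above, calling isEmergingPattern with X = {t1, t2, t3}, θ = 2 and δ = 1 returns true, since X occurs twice in TC and never in BC.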
4.2 Processing Framework
To recapitulate, Chapter 1 introduced the contrast pattern based document analysis in
multidimensional text databases. We follow the idea of using data cubes and OLAP to
analyze multidimensional text data, and propose to merge the BUC data cubing process
with two different emerging pattern mining algorithms (the Border-Differential and the
DPMiner) to build an integrated data model based on the data cube. This model is designed
to support the contrast pattern based document analysis.
In this section, following the problem formulation in Section 4.1, we propose our algorithm to integrate emerging pattern mining into data cubing. The entire processing
framework includes both data integration and algorithm integration.
4.2.1 Integrating Data
To begin with, we reproduce Table 4.1 (with slight revisions) to make the following discussion
clear. It shows a standard and ideal format of data that simplifies a multidimensional text
database. The data used in our testing will strictly follow this format: each row in a certain
dataset D is a tuple in the form of ⟨S, T⟩, where S is the set of dimension attributes and T
is a transaction.
Tuple No.   A    B    C    F    Transactions
1           a1   b1   c1   f1   d1 = {t1, t2, t3, t4}
2           a1   b2   c2   f1   d2 = {t2, t3, t7, t8}
3           a1   b1   c2   f2   d3 = {t1, t2, t3, t6}
4           a2   b1   c2   f2   d4 = {t2, t4, t6, t7}
5           a2   b1   c1   f1   d5 = {t1, t3, t5, t7}
6           a3   b2   c3   f3   d6 = {t2, t5, t7, t9}
7           a3   b3   c2   f3   d7 = {t3, t6, t8, t9}
8           a4   b2   c3   f1   d8 = {t6, t8, t11, t12}

Table 4.2: A normalized dataset reproduced from Table 4.1.
The integration of data is indispensable because of the nature of the multidimensional
text mining problem. In addition, data cubing and emerging pattern mining algorithms
work with data from heterogeneous sources originally. Data cubing mainly deals with relational base tables in data warehouses, while emerging pattern mining concerns transaction
databases (see an example in Table 2.4). Therefore, we should unify heterogeneous data
first and then develop algorithms for a seamless integration. Thus, we model the text
database and its normalized schema (Table 4.2) by appending transaction database tuples
to relational base table tuples.
Moreover, for the integrated data, we also apply one of the optimization techniques
discussed in Section 2.1.2: mapping all dimension attributes in various kinds of formats to
integers between zero and the cardinality of the attribute [11]. For example, in Table 4.2,
dimension A has cardinality |A| = 4, so in implementation and testing, we map a1 to
0, a2 to 1, a3 to 2 and a4 to 3. Similarly, items in transactions are also mapped to integers ranging
from one to the total number of items in the dataset. For instance, if all items in a
dataset are labeled from t1 to t100 , we can represent them by integers ranging from 1 to 100.
This kind of mapping facilitates sorting and hashing in data cubing. Particularly for BUC,
such mapping allows the use of the linear counting sort algorithm to reorder input tuples
efficiently.
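As a small illustration of why this mapping helps (our sketch; names are assumptions, not the project's code), the following C++ function counting-sorts tuples on one dimension whose attribute values are already integers in the range 0 .. cardinality - 1, which is exactly the reordering BUC needs for partitioning:

#include <vector>

struct Tuple {
    std::vector<int> dims;   // dimension attributes, mapped to 0 .. cardinality-1
    std::vector<int> items;  // the transaction, with items mapped to integers
};

// Reorder 'tuples' so that tuples sharing a value on dimension 'd' become contiguous.
// Runs in O(n + cardinality) time, which is why the integer mapping pays off in BUC.
std::vector<Tuple> countingSortOnDim(const std::vector<Tuple>& tuples, int d, int cardinality) {
    std::vector<int> offset(cardinality + 1, 0);
    for (const Tuple& t : tuples) ++offset[t.dims[d] + 1];            // histogram
    for (int v = 0; v < cardinality; ++v) offset[v + 1] += offset[v]; // prefix sums = start positions
    std::vector<Tuple> sorted(tuples.size());
    for (const Tuple& t : tuples) sorted[offset[t.dims[d]]++] = t;    // place tuples stably
    return sorted;
}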
4.2.2 Integrating Algorithms
Our algorithm integrates data cubing and emerging pattern mining seamlessly. It carries out
a depth-first search (DFS) to build data cubes and mine emerging patterns as cell measures
simultaneously. The algorithm is designed to work on any valid integrated datasets like
Table 4.2 (both dimension attributes and transactions should be non-empty for tuples). We
outline the algorithm in the following pseudo-code (adapted from [5]).
Algorithm
Procedure BottomUpCubeWithDPMiner(data, dim, theta, delta)
Inputs:
  data:  the dataset upon which we build our integrated model.
  dim:   number of standard dimensions in the input data.
  theta: the minimal support threshold of candidate emerging patterns in the target class.
  delta: the maximal support threshold of candidate emerging patterns in the background class.
Outputs:
  cells with their measures (patterns).
Method:
 1: aggregate(data);
 2: if (data.count == 1) then
 3:   writeAncestors(data, dim);
 4:   return;
 5: endif
 6: for each dimension d (from 0 to (dim - 1)) do
 7:   C := cardinality(d);
 8:   newData := partition(data, d); // counting sort.
 9:   for each partition i (from 0 to (C - 1)) do
10:     cell := createEmptyCell();
11:     posData := newData.gatherPositiveTransactions();
12:     negData := newData.gatherNegativeTransactions();
13:     isDuplicate := determineCoverage(posData, negData);
14:     if (!isDuplicate) then
15:       cell.measure := DPMiner(posData, negData, theta, delta);
16:       writeOutput(cell);
17:       subData := newData.getPartition(i);
18:       BottomUpCubeWithDPMiner(subData, d+1, theta, delta);
19:     endif
20:   endfor
21: endfor
To integrate BUC with the other emerging pattern mining algorithm, the Border-Differential, replace line 15 with the following pseudo-code:
15.1: posMaxPat := PADS(posData, theta);
15.2: negMaxPat := PADS(negData, theta);
15.3: cell.measure := BorderDifferential(posMaxPat, negMaxPat);
The Execution Flow
To illustrate the execution flow of our integrated algorithm, suppose the algorithm is given an
input dataset D like Table 4.2, with four dimensions, namely A, B, C and F. To begin with, the
algorithm aggregates D (line 1). Then it determines the cardinality of the first dimension
A (line 7) and partitions the aggregated D on A (line 8), which creates four partitions
(a1 , ∗, ∗, ∗), (a2 , ∗, ∗, ∗), (a3 , ∗, ∗, ∗) and (a4 , ∗, ∗, ∗). Each partition is sorted linearly using
the counting sort algorithm.
Then the algorithm iterates through these partitions to construct cells and mine patterns
(line 9). It starts with cell (a1, ∗, ∗, ∗), gathers transactions with a1 on A as the target
class (line 11), and collects the remaining ones as the background class (line 12). Both
classes will then be passed on to the DPMiner procedure to find emerging patterns in the
target class (line 15), provided that this cell’s target class is not identical to that of its
descendant cells that have been processed (line 13, more on this later).
Then, BUC is called recursively on the current partition to materialize cells, mine patterns and output them. The algorithm further sorts and partitions (a1 , ∗, ∗, ∗) to proceed
to its parent (a1 , b1 , ∗, ∗). As it continues to execute, it recurses further on ancestor cells
(a1 , b1 , c1 , ∗) and (a1 , b1 , c1 , f1 ). Upon reaching the base cells, the algorithm backtracks to
the nearest descendant cell (a1 , b1 , c2 , ∗). The complete processing order follows Figure 2.1.
Optimizations
The duplicate checking function in line 13 is an optimization aimed at avoiding producing
cells with identical aggregated tuples and patterns. For example, the cell (a2 , b1 , ∗, ∗) aggregates tuples 4 and 5 in Table 4.2. Since we have already computed its descendant cell
(a2 , ∗, ∗, ∗), which also covers exactly the same two tuples, these two cells will have exactly
the same target class and background class. Therefore, processing cells like (a2 , b1 , ∗, ∗)
leads to duplicate work that is unnecessary and should be avoided. The duplicate checking
function helps in such situations.
The above duplicate checking function generalizes the original BUC optimization called
writeAncestors (line 3 in the pseudo code). Our algorithm also includes writeAncestors
with slight modifications, as a special case of the duplicate checking. Consider that when
the algorithm proceeds to (a4, ∗, ∗, ∗), the partition has only one tuple. In the same sense
as we have discussed above, the ancestor cells (a4 , b2 , ∗, ∗), (a4 , b2 , c3 , ∗), and (a4 , b2 , c3 , f1 )
all contain exactly the same tuple and hence will have identical patterns. These four cells
actually form an equivalent class. We choose to output the lower bound (a4 , ∗, ∗, ∗) together
with the upper bound (a4 , b2 , c3 , f1 ) and skip all intermediate cells in this equivalent class.
Both optimization techniques shorten the running time of our program and reduce the
number of cells to output. Experiments conducted in [5] found that in real-life data
warehouses, about 20% of the aggregates contain only one tuple. Therefore, empirically
such optimizations are useful and helpful.
4.3 Implementations
For this capstone project, we implemented BUC and the Border-Differential in C++. We
also made use of the original DPMiner package from [16] for emerging pattern mining and
the PADS package from [25] for max-pattern mining needed by the Border-Differential.
To ensure smooth data flow in the integration, both the DPMiner and PADS packages are
modified to meet our specific needs. The original packages read input data from files in
formats different from our test datasets (like Table 4.2), so for our implementation, those
packages are changed to get the input directly from BUC on the fly. Primarily,
the data structures for holding transactions and the corresponding functions for manipulating
items in transactions are modified accordingly.
4.3.1 BUC with DPMiner
When integrating BUC with the DPMiner, for each cell we label the tuples whose dimension attributes match the current partition in BUC as class 1 (the target class) and the tuples
which do not match as class 0 (the background class). We then pass their transactions in two
arrays to the DPMiner procedure, which carries out the pattern mining task. It mines
frequent generators and closed patterns for both data classes by executing the computational steps described in Section 2.3.2. After the mining process, the most general closed
patterns, i.e., the ones that have the shortest length within their equivalent classes, are
determined as the so-called non-redundant δ-discriminative patterns. Such patterns will be
returned to the cell and stored as its measure. Lastly, one file per cell will be output to disk
and the file name is a string containing the dimension attributes of that cell.
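As a rough illustration of this labeling step, the following sketch splits the tuples of a cell into the two classes before handing their transactions to a mining routine; Tuple, Cell, matchesCell and the commented-out minePatterns call are hypothetical stand-ins for the corresponding structures in our code, not the actual DPMiner interface.

#include <cstddef>
#include <string>
#include <vector>

// A base-table tuple: its dimension attribute values plus its transaction (item ids).
struct Tuple {
    std::vector<std::string> dims;
    std::vector<int> items;
};

// A cell is described by one value (or "*") per dimension.
using Cell = std::vector<std::string>;

// A tuple falls into the cell's partition if it agrees on every instantiated dimension.
bool matchesCell(const Tuple& t, const Cell& c) {
    for (std::size_t d = 0; d < c.size(); ++d)
        if (c[d] != "*" && c[d] != t.dims[d]) return false;
    return true;
}

// Label matching tuples as class 1 (target) and the rest as class 0 (background),
// then pass the two transaction arrays to the pattern mining procedure.
void splitAndMine(const std::vector<Tuple>& table, const Cell& cell) {
    std::vector<std::vector<int>> target, background;
    for (const Tuple& t : table)
        (matchesCell(t, cell) ? target : background).push_back(t.items);
    // minePatterns(target, background);  // hypothetical call into the (modified) miner
}

int main() {
    std::vector<Tuple> table = {
        {{"a1", "b1"}, {1, 2, 3}},
        {{"a1", "b2"}, {2, 3}},
        {{"a2", "b1"}, {4, 5}},
    };
    splitAndMine(table, {"a1", "*"});   // cell (a1, *): the first two tuples form the target class
    return 0;
}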
4.3.2 BUC with Border-Differential and PADS
For the integration of BUC and the Border-Differential, the target class and the background class of transactions are collected for each cell in the same manner as above. Unlike the DPMiner, the Border-Differential algorithm cannot determine the candidate emerging patterns (i.e., the max-patterns in both classes) by itself. Instead, our algorithm first employs PADS [25] to find the max-patterns of both classes, and these patterns are passed on to the Border-Differential procedure once they are mined.
The Border-Differential procedure is then invoked to compute the differential between the two borders initiated by the max-patterns. As there can be more than one max-pattern for either the target or the background class, we may obtain multiple borders (each corresponding to a max-pattern) for a single cell. Finally, one file per cell is output to disk, and the file name is a string containing the dimension attributes of that cell.
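To give a flavor of what the Border-Differential procedure computes, here is a naive, self-contained sketch of its core operation for a single target max-pattern U and background max-patterns S1, ..., Sk: finding the minimal itemsets contained in U but in none of the Si. This simple candidate-expansion version can blow up on larger inputs (the underlying problem is MAX SNP-hard [22]); the function names are ours, not those of the real procedure.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <set>
#include <vector>

using Itemset = std::set<int>;

// True if a is a subset of b (both iterate in sorted order).
bool isSubset(const Itemset& a, const Itemset& b) {
    return std::includes(b.begin(), b.end(), a.begin(), a.end());
}

// Keep only the minimal itemsets of a collection (drop duplicates and proper supersets).
std::vector<Itemset> keepMinimal(const std::vector<Itemset>& sets) {
    std::vector<Itemset> out;
    for (const Itemset& x : sets) {
        bool minimal = true;
        for (const Itemset& y : sets)
            if (y != x && isSubset(y, x)) { minimal = false; break; }
        if (minimal && std::find(out.begin(), out.end(), x) == out.end()) out.push_back(x);
    }
    return out;
}

// Naive border differential: the minimal itemsets that are subsets of U but of none of the S[i].
// Candidates are grown one background max-pattern at a time, pruning non-minimal sets.
std::vector<Itemset> borderDiff(const Itemset& U, const std::vector<Itemset>& S) {
    std::vector<Itemset> L;
    for (std::size_t i = 0; i < S.size(); ++i) {
        Itemset diff;                                   // D_i = U \ S_i
        std::set_difference(U.begin(), U.end(), S[i].begin(), S[i].end(),
                            std::inserter(diff, diff.end()));
        std::vector<Itemset> next;
        if (i == 0) {
            for (int x : diff) next.push_back({x});
        } else {
            for (const Itemset& X : L)
                for (int x : diff) { Itemset y = X; y.insert(x); next.push_back(y); }
        }
        L = keepMinimal(next);
    }
    return L;
}

int main() {
    // Toy example: U = {1,2,3,4}, background max-patterns {1,2,3} and {2,3,4}.
    for (const Itemset& p : borderDiff({1, 2, 3, 4}, {{1, 2, 3}, {2, 3, 4}})) {
        for (int x : p) std::cout << x << ' ';
        std::cout << '\n';                              // prints: 1 4
    }
    return 0;
}

In our implementation the procedure is invoked once for each target max-pattern, so a single cell may yield several borders.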
4.4 Summary
In this chapter, we formulated the data model construction problem in the first section. We then described our processing framework, i.e., the integration of data cubing and emerging pattern mining at both the data level and the algorithm level; both levels of integration are important. Lastly, we concluded the chapter by addressing some issues related to the actual implementations.
Chapter 5
Experimental Results and Performance Study
In this chapter, we present a comparative empirical evaluation of the algorithms developed and implemented in this capstone project. The experiments were run on a machine with an Intel Core 2 Duo CPU, 3.0 GB of main memory, and the Ubuntu 9.04 Linux operating system. The machine is physically located in the Computing Science Instructional Labs (CSIL) at Simon Fraser University. Our programs are implemented in C++.
5.1 The Test Dataset
To the best of our knowledge, there is no widely accepted dataset that follows the data schema specified in Section 4.1, mainly because data cubing and emerging pattern mining have previously been applied separately, to entirely heterogeneous data sources.
We generated five relational base tables containing 100, 1,000, 2,500, 4,000 and 8,124 tuples respectively. All base tables have four standard dimensions, each with a cardinality of four. In comparison, the experiments for the text cube in [17] use a test dataset with far fewer tuples (2,013) but more dimensions (14). We did intend to test our programs on datasets with eight or more dimensions, but doing so easily produced tens of thousands of files and exhausted our disk quota on the Ubuntu system.
For transactional data, we adopted a dataset named mushroom.dat from the Frequent
Itemset Mining Implementations Repository (FIMI) [9]. It contains 8,124 transactions with
an average length of 30 items. The total number of items in mushroom.dat is 113.
We synthesized our normalized datasets by appending rows of mushroom.dat to tuples in each of the five base tables. The synthesis was not done at random. Instead, we simulated a real multidimensional text database in which tuples sharing more dimension attribute values tend to have more similar transactions. In the data integration process, we therefore first clustered similar transactions into groups and then assigned the transactions within a group to tuples with overlapping dimension attributes; conversely, tuples sharing few dimension attribute values were appended with dissimilar transactions drawn from different clusters (a simplified sketch of this pairing follows Table 5.1). Table 5.1 shows the sizes of the five synthetic datasets used in our experiments.
Num. of Tuples:    100     1,000    2,500    4,000    8,124
Size (KB):         6.9     69.2     174.1    278.5    565.7

Table 5.1: Sizes of synthetic datasets for experiments
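As a crude illustration of the synthesis described above, the sketch below pairs base-table tuples with transactions so that tuples with similar dimension values receive similar transactions; sorting both sides lexicographically is only a stand-in for the clustering we actually performed, and all names here are illustrative.

#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct BaseTuple { std::vector<std::string> dims; };
using Transaction = std::vector<int>;

// Append one transaction to each base-table tuple so that tuples sharing dimension values
// tend to receive similar transactions.  Sorting both sides and pairing them index by index
// keeps "nearby" transactions on "nearby" tuples; the real process clusters transactions first.
std::vector<std::pair<BaseTuple, Transaction>>
synthesize(std::vector<BaseTuple> tuples, std::vector<Transaction> txns) {
    std::sort(tuples.begin(), tuples.end(),
              [](const BaseTuple& a, const BaseTuple& b) { return a.dims < b.dims; });
    std::sort(txns.begin(), txns.end());          // lexicographic order groups similar item lists
    std::vector<std::pair<BaseTuple, Transaction>> rows;
    for (std::size_t i = 0; i < tuples.size() && i < txns.size(); ++i)
        rows.emplace_back(tuples[i], txns[i]);
    return rows;
}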
5.2 Comparative Performance Study and Analysis
We tested three implementations (BUC alone, BUC with the DPMiner, and BUC with the Border-Differential) on all five synthetic datasets described above. Each test case was run ten times, and the average running time is reported.
5.2.1 Evaluating the BUC Implementation
Figure 5.1 presents the test results of our BUC implementation. In a pure BUC cubing test, the transaction items are still included in the input data, but the program does not process them, since they are irrelevant to pure data cubing. Carrying the transaction data certainly adds computational overhead, because during execution the data is moved around, both in memory and on disk, as entire tuples. Nevertheless, the test results show that our BUC implementation remains robust under these conditions.
As can be seen from Figure 5.1, the running time of BUC grows only linearly as the size of the input data increases. This behaviour is mainly attributed to the integer representation of the data over a fixed range of values, which makes it possible to use a linear-time counting sort; partitioning and sorting the data are known to be the most time-consuming steps in a cubing process [5].
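As an illustration of why the integer encoding matters, a partition can be split on one dimension with a counting sort like the sketch below; the function name and signature are ours, not BUC's, and assume dimension values are integers in [0, cardinality).

#include <cstddef>
#include <iostream>
#include <vector>

// Counting sort of tuple indices on a single dimension whose values lie in [0, cardinality).
// Each BUC partitioning step can use this to split a partition into sub-partitions in time
// linear in the number of tuples, instead of using a comparison sort.
std::vector<std::size_t> countingSortByDim(const std::vector<std::size_t>& tupleIds,
                                           const std::vector<int>& dimValue,  // value per tuple
                                           int cardinality) {
    std::vector<std::size_t> counts(cardinality + 1, 0);
    for (std::size_t t : tupleIds) ++counts[dimValue[t] + 1];                 // histogram
    for (int v = 0; v < cardinality; ++v) counts[v + 1] += counts[v];         // bucket start offsets
    std::vector<std::size_t> sorted(tupleIds.size());
    for (std::size_t t : tupleIds) sorted[counts[dimValue[t]]++] = t;         // stable placement
    return sorted;                                    // tuples grouped by dimension value
}

int main() {
    std::vector<int> dim = {2, 0, 1, 0};              // dimension value of tuples 0..3
    for (std::size_t t : countingSortByDim({0, 1, 2, 3}, dim, 4))
        std::cout << t << ' ';                        // prints: 1 3 2 0
    std::cout << '\n';
    return 0;
}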
The implementation also achieves a good compression ratio of the cube size when the data size is relatively small. However, as the number of tuples in the synthetic dataset grows, cells containing identical aggregates become rare, so optimization heuristics such as writeAncestors become of little use.
[Figure 5.1: Running time and cube size of our BUC implementation. (a) Running time of BUC (seconds) vs. number of tuples: (100, 0.11), (1000, 0.98), (2500, 2.21), (4000, 3.30), (8124, 6.56). (b) Number of cells created through BUC vs. number of tuples: (100, 211), (1000, 613), (2500, 624), (4000, 624), (8124, 624).]
5.2.2 Comparing Border-Differential with DPMiner
We compared the two integration algorithms, (1) BUC with the DPMiner and (2) BUC with the Border-Differential, with respect to both running time and the size of the output cubes. The running time comparison is illustrated in Figure 5.2, and the complete experimental results are summarized in Table 5.2.
For the test cases with the 100-tuple input, the minimal support threshold θ in the target classes is set to 3; for the larger test datasets (1,000 tuples and more), 3 is no longer a reasonable value, so we take roughly the square root of the number of tuples as the minimal support threshold (for example, the threshold for 2,500 tuples is 50). The maximal support threshold δ is set to 1 for all test cases.
When compared on running time (the third column in Table 5.2), the first algorithm is faster than the second on every dataset except the 1,000-tuple one (and even there it is less than 1% slower). On the 2,500- and 4,000-tuple datasets, the Border-Differential takes about 1.3 and 1.7 times as long as the DPMiner, respectively. On the 8,124-tuple dataset, the Border-Differential failed to complete within 120 seconds; we exclude this case, as it does not affect the comparison. The running time of the algorithm integrating the DPMiner grows close to linearly, while the Border-Differential, which solves a MAX SNP-hard problem [22], proved much slower in practice.
[Figure 5.2: Comparing running time of the two integration algorithms. (a) Running time of BUC + DPMiner (seconds) vs. number of tuples: (100, 3.8), (1000, 18.6), (2500, 30.5), (4000, 43.6), (8124, 66.9). (b) Running time of BUC + Border-Differential (seconds) vs. number of tuples: (100, 9.6), (1000, 17.5), (2500, 41.1), (4000, 73.6).]
When compared on cube size (the fifth column in Table 5.2), the Border-Differential appears to perform better than the DPMiner. In fact this is not the case, for two reasons. First, the Border-Differential does not generate the actual patterns but only their border descriptions, whereas the DPMiner generates the full representations of the patterns (i.e., the actual items). The border representation is more succinct but far less comprehensible, and deriving the actual patterns from it adds further computational cost. For contrast pattern based document analysis, users want to see the actual text terms, so in that sense the DPMiner is the preferable choice.
Second, in terms of the output cubes, the Border-Differential produces more cells with empty pattern sets than the DPMiner. This is mainly because the max-pattern mining procedure does not return enough max-patterns, or because the borders formed by the max-patterns differ too much from each other to derive a valid differential border. Lowering the minimal support threshold θ can help: as the first row of Table 5.2 indicates, with θ = 3 (a very small threshold compared to the others) the Border-Differential produced cubes of almost the same size as the DPMiner.
Tuples    Threshold θ    Time (sec)        Cells    Avg. Cell Size (KB)
100       3              3.8 vs. 9.6       210      7.2 vs. 7.1
1,000     30             18.6 vs. 17.5     613      8.2 vs. 2.4
2,500     50             30.5 vs. 41.1     624      13.4 vs. 1.8
4,000     64             43.6 vs. 73.6     624      18.8 vs. 3.5
8,124     90             66.9 vs. N/A      624      20.6 vs. N/A

Table 5.2: The complete experimental results (times and average cell sizes are reported as DPMiner vs. Border-Differential).
5.3 Summary
In light of the above, the feasibility of integrating data cubing (BUC) with emerging pattern mining has been demonstrated by a series of comparative experiments. The integration of BUC with the DPMiner is efficient and robust on input data of reasonably large size. Moreover, despite its larger cube size, the DPMiner finds more emerging patterns under a large support threshold and presents them in an easily intelligible way. The DPMiner is therefore a better choice than the Border-Differential for building our integrated data model for contrast pattern based document analysis.
On the other hand, the results would be more convincing if an ideal dataset, collected directly and entirely from real-life multidimensional text databases, could be used in the performance study.
Chapter 6
Conclusions
6.1 Summary of the Project
It has been shown that OLAP techniques and data cubes are widely applicable to the analysis of documents stored in multidimensional text databases [6, 17, 24, 26]. However, no previous work has addressed a data mining problem of this kind on multidimensional text data. In this work we proposed an OLAP-style, contrast pattern based document analysis and adopted the emerging pattern [7], an important class of contrast patterns, to study this problem.
In this capstone project, we developed algorithms for a novel data-cube-based model to support contrast pattern based document analysis. We implemented this model by integrating the data cubing algorithm BUC [5] with two emerging pattern mining algorithms, the Border-Differential [7] and the DPMiner [16]. Our empirical evaluations showed that the DPMiner is preferable to the Border-Differential because of its seamless, effective, efficient and robust integration with BUC. Thanks to the structural similarity between these cube-based data models, the OLAP query answering techniques (point queries, sub-cube queries and top-k queries) developed in [6, 17, 24, 26] can be applied directly to analyze documents.
6.2 Limitations and Future Work
This work could be extended further to fully address the non-trivial document analysis problem. One limitation of our model construction is that, despite the two optimizations used in the algorithm, the cube size is still not as small as it could be. Meanwhile, it is costly to invoke the pattern mining procedure once for every single cell. Therefore, we
propose the following ideas for future improvements.
First, it is possible to compress the data cube by introducing more optimization heuristics, such as incremental pattern mining. When materializing an ancestor cell, it is sometimes unnecessary to gather the target class and the background class and mine patterns from scratch; instead, one can take the patterns of its descendant cells and test their support against the ancestor's target class to see whether they still exceed the support threshold (a rough sketch of this check is given below). Besides, instead of full data cubes, we can construct iceberg cubes, i.e., partially materialized cubes in which cells aggregating few documents (fewer than a threshold) are excluded.
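A minimal sketch of this incremental idea, under the assumption that patterns are simply item sets and that the ancestor cell's target-class transactions are at hand, might look as follows; the names are hypothetical, and the background-class condition (support at most δ) would still have to be re-checked separately.

#include <algorithm>
#include <set>
#include <vector>

using Itemset = std::set<int>;
using Transaction = std::set<int>;

// Support of a pattern: the number of target-class transactions that contain it.
int support(const Itemset& pattern, const std::vector<Transaction>& targetClass) {
    int cnt = 0;
    for (const Transaction& t : targetClass)
        if (std::includes(t.begin(), t.end(), pattern.begin(), pattern.end())) ++cnt;
    return cnt;
}

// Incremental reuse: keep the descendant cell's patterns whose support in the ancestor's
// target class still reaches the minimal support threshold theta, avoiding a full mining
// pass when enough of them survive.
std::vector<Itemset> reuseDescendantPatterns(const std::vector<Itemset>& descendantPatterns,
                                             const std::vector<Transaction>& ancestorTarget,
                                             int theta) {
    std::vector<Itemset> kept;
    for (const Itemset& p : descendantPatterns)
        if (support(p, ancestorTarget) >= theta) kept.push_back(p);
    return kept;
}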
Second, we can explore other data cubing techniques, such as Star-Cubing [23], and use them to build the cube in our model construction process. BUC is considered the most efficient cubing algorithm for computing iceberg cubes, but it would still be interesting to see whether other algorithms are more efficient in our setting.
Last but not least, it is also possible to employ advanced cubing techniques to achieve a higher level of summarization of the text data aggregated in the cube. Such techniques include the Quotient Cube [14] and the QC-tree [15], which build on BUC to compress and summarize cells. Our duplicate checking idea described in Chapter 4 is similar in spirit but achieves less compression than the Quotient Cube can.
Bibliography
[1] Sameet Agarwal et al. On the Computation of Multidimensional Aggregates. In VLDB
’96: Proceedings of the 22th International Conference on Very Large Data Bases, pages
506–521, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.
[2] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining Association Rules Between Sets of Items in Large Databases. In SIGMOD ’93: Proceedings of the 1993 ACM
SIGMOD International Conference on Management of Data, pages 207–216, New York,
NY, USA, 1993. ACM.
[3] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association
Rules in Large Databases. In VLDB ’94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 487–499, San Francisco, CA, USA, 1994. Morgan
Kaufmann Publishers Inc.
[4] James Bailey, Thomas Manoukian, and Kotagiri Ramamohanarao. Fast Algorithms for
Mining Emerging Patterns. In PKDD ’02: Proceedings of the 6th European Conference
on Principles of Data Mining and Knowledge Discovery, pages 39–50, London, UK,
2002. Springer-Verlag.
[5] Kevin Beyer and Raghu Ramakrishnan. Bottom-up Computation of Sparse and Iceberg CUBE. In SIGMOD ’99: Proceedings of the 1999 ACM SIGMOD International
Conference on Management of Data, pages 359–370, New York, NY, USA, 1999. ACM.
[6] Bolin Ding et al. TopCells: Keyword-based Search of Top-k Aggregated Documents
in Text Cube. In ICDE ’10: Proceedings of the 26th International Conference on Data
Engineering, Long Beach, CA, USA, 2010. IEEE.
[7] Guozhu Dong and Jinyan Li. Efficient Mining of Emerging Patterns: Discovering
Trends and Differences. In KDD ’99: Proceedings of the fifth ACM SIGKDD International Conference on Knowledge discovery and data mining, pages 43–52, New York,
NY, USA, 1999. ACM.
[8] Guozhu Dong and Jinyan Li. Mining Border Descriptions of Emerging Patterns from
Dataset Pairs. Knowledge Information System, 8(2):178–202, 2005.
[9] Bart Goethals et al. Frequent itemset mining implementations repository. Website,
2003. http://fimi.cs.helsinki.fi/data/.
[10] Jim Gray et al. Data Cube: A Relational Aggregation Operator Generalizing Group-by,
Cross-tab, and Sub-totals. Data Mining and Knowledge Discovery, 1(1):29–53, 1997.
[11] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann, 2nd edition, 2006.
[12] Jiawei Han, Jian Pei, and Yiwen Yin. Mining Frequent Patterns Without Candidate
Generation. In SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD International
Conference on Management of Data, pages 1–12, New York, NY, USA, 2000. ACM.
[13] William Inmon. What Is A Data Warehouse, 1995.
[14] Laks V. S. Lakshmanan, Jian Pei, and Jiawei Han. Quotient Cube: How to Summarize
the Semantics of A Data Cube. In VLDB ’02: Proceedings of the 28th International
Conference on Very Large Data Bases, pages 778–789. VLDB Endowment, 2002.
[15] Laks V. S. Lakshmanan, Jian Pei, and Yan Zhao. QC-Trees: An Efficient Summary
Structure for Semantic OLAP. In SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 64–75, New York, NY,
USA, 2003. ACM.
[16] Jinyan Li, Guimei Liu, and Limsoon Wong. Mining Statistically Important Equivalence Classes and Delta-discriminative Emerging Patterns. In KDD ’07: Proceedings
of the 13th ACM SIGKDD International Conference on Knowledge discovery and data
mining, pages 430–439, New York, NY, USA, 2007. ACM.
[17] Cindy Xide Lin et al. Text Cube: Computing IR Measures for Multidimensional Text
Database Analysis. In ICDM ’08: Proceedings of the 2008 Eighth IEEE International
Conference on Data Mining, pages 905–910, Washington, DC, USA, 2008. IEEE Computer Society.
[18] Guimei Liu, Jinyan Li, and Limsoon Wong. A New Concise Representation of Frequent
Itemsets Using Generators and A Positive Border. Knowledge and Information Systems,
17(1):35–56, 2008.
[19] Christopher Manning, Prabhakar Raghavan, and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[20] Sébastien Nedjar, Alain Casali, Rosine Cicchetti, and Lotfi Lakhal. Emerging Cubes for
Trends Analysis in OLAP Databases. Lecture Notes in Computer Science, 4654:135–
144, 2007.
[21] Jian Pei. Pattern-Growth Methods for Frequent Pattern Mining. PhD thesis, Simon
Fraser University, 2002.
[22] Lusheng Wang, Hao Zhao, Guozhu Dong, and Jianping Li. On the complexity of finding
emerging patterns. Theoretical Computer Science, 335(1):15–27, 2005.
[23] Dong Xin et al. Star-Cubing: Computing Iceberg Cubes by Top-down and Bottom-up
Integration. In VLDB ’2003: Proceedings of the 29th International Conference on Very
Large Data Bases, pages 476–487. VLDB Endowment, 2003.
[24] Yintao Yu et al. iNextCube: Information Network-Enhanced Text Cube. Proc. VLDB
Endow., 2(2):1622–1625, 2009.
[25] Xinghuo Zeng, Jian Pei, et al. PADS: A Simple Yet Effective Pattern-Aware Dynamic
Search Method for Fast Maximal Frequent Pattern Mining. Knowledge and Information
Systems, 20(3):375–391, 2009.
[26] Duo Zhang et al. Topic Modeling for OLAP on Multidimensional Text Databases:
Topic Cube and Its Applications. Stat. Anal. Data Min., 2(5–6):378–395, 2009.
[27] Yan Zhao. Quotient Cube and QC-Tree: Efficient Summarizations for Semantic OLAP.
Master’s thesis, The University of British Columbia, 2003.
[28] Yihong Zhao, Prasad M. Deshpande, and Jeffrey F. Naughton. An Array-based Algorithm for Simultaneous Multidimensional Aggregates. In SIGMOD ’97: Proceedings
of the 1997 ACM SIGMOD International Conference on Management of Data, pages
159–170, New York, NY, USA, 1997. ACM.
Index
Apex cuboid, 8
Background class, or negative class, 19
Base cell, 8
Base table, 6
Border, 12
Border-Differential algorithm, 12
BUC, the Bottom-Up Computation, 9
Closed pattern, 11
Data cube, 6
Data cubing, 6
Data warehouse, 6
Delta-discriminative equivalent class, 13
DPMiner, the Discriminative Pattern Miner, 13
Emerging pattern mining, 11
Equivalent class, 13
FP-Growth algorithm, 10
FP-tree, 10
Frequent pattern mining, 10
Generator, 11
Maximal frequent pattern, 10
Multidimensional text database, 16
OLAP, Online Analytical Processing, 6
PADS, the Pattern-Aware Dynamic Search, 11
Target class, or positive class, 19
Transaction, 20