INTEGRATING DATA CUBE COMPUTATION AND EMERGING PATTERN MINING FOR MULTIDIMENSIONAL DATA ANALYSIS by Wei Lu a Report submitted in partial fulfillment of the requirements for the SFU-ZU dual degree of Bachelor of Science in the School of Computing Science Simon Fraser University and the College of Computer Science and Technology Zhejiang University c Wei Lu 2010 SIMON FRASER UNIVERSITY AND ZHEJIANG UNIVERSITY April 2010 All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author. APPROVAL Name: Wei Lu Degree: Bachelor of Science Title of Report: Integrating Data Cube Computation and Emerging Pattern Mining for Multidimensional Data Analysis Examining Committee: Dr. Jian Pei Associate Professor, Computing Science Simon Fraser University Supervisor Dr. Qianping Gu Professor, Computing Science Simon Fraser University Supervisor Dr. Ramesh Krishnamurti Professor, Computing Science Simon Fraser University SFU Examiner Date Approved: ii Abstract Online analytical processing (OLAP) in multidimensional text databases has recently become an effective tool for analyzing text-rich data such as web documents. In this capstone project, we follow the trend of using OLAP and the data cube to analyze web documents, but want to address a new problem from the data mining perspective. In particular, we wish to find contrast patterns in documents of different classes and then use those patterns in OLAP style text data and web document analysis. To this end, we propose to integrate the data cube with an important kind of contrast pattern called the emerging pattern, to build a new data model for solving the document analysis problem. Specifically, this novel data model is implemented on top of the traditional data cube by seamlessly integrating the bottom-up cubing (BUC) algorithm with two different emerging pattern mining algorithms, the Border-Differential and the DPMiner. The processes of cube construction and emerging pattern mining are merged together and carried out simultaneously; patterns are stored into the cube as cell measures. Moreover, we study and compare the performance of those two integrations by conducting experiments on datasets derived from the Frequent Itemset Mining Implementations Repository (FIMI). Finally, we suggest improvements and optimizations that can be done in future work. iii To my family iv “For those who believe, no proof is necessary; for those who don’t believe, no proof is possible.” — Stuart Chase, Writer and Economist, 1888 v Acknowledgments First of all, I would like to express my deepest appreciation to Dr. Jian Pei, for his support and guidance during my studies at Simon Fraser University. In various courses I took with him and particularly this capstone project, Dr. Pei showed me his broad knowledge and deep insights in the area of data management and mining, as well as his great personality and patience to a research beginner like me. In his precious time, he provided me with lots of help and advice for the project and other concerns (especially my graduate school applications). This work would not be possible without his supervision. I would love to thank Dr. Qianping Gu and Dr. Ramesh Krishnamurti for reviewing my report and directing the capstone projects for this amazing dual degree program. My gratitude also goes to Dr. Ze-Nian Li, Dr. Stella Atkins, Dr. Greg Mori and Dr. 
Ted Kirkpatrick for their wonderful classes I took at SFU and their good advice for my studies and career development. Also thanks to Dr. Guozhu Dong at Wright State University and Dr. Guimei Liu at National University of Singapore for making useful resources available for my work. I would also like to thank Mr. Thusjanthan Kubendranathan at SFU for his time and help in our discussions about this project. Deepest gratefulness to my family and friends who make my life enjoyable. In particular, I am greatly indebted to my beloved parents, for their unconditional support and encouragement. Their love accompany me wherever I go. This work is dedicated to them and I hope they are proud of me, as I am always proud of them. vi Contents Approval ii Abstract iii Dedication iv Quotation v Acknowledgments vi Contents vii List of Tables x List of Figures xi 1 Introduction 1 1.1 Overview of Text Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Related Work on Multidimensional Text Data Analysis . . . . . . . . . . . . . 2 1.3 Contrast Pattern Based Document Analysis . . . . . . . . . . . . . . . . . . . 3 1.4 Structure of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Literature Review 2.1 6 Data Cubes and Online Analytical Processing . . . . . . . . . . . . . . . . . . 6 2.1.1 An Example of The Data Cube . . . . . . . . . . . . . . . . . . . . . . 7 2.1.2 Data Cubing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1.3 BUC: Bottom-Up Computation for Data Cubing . . . . . . . . . . . . 9 vii 2.2 Frequent Pattern Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Emerging Pattern Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 2.3.1 The Border-Differential Algorithm . . . . . . . . . . . . . . . . . . . . 12 2.3.2 The DPMiner Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 13 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3 Motivation 15 3.1 Motivation for Mining Contrast Patterns . . . . . . . . . . . . . . . . . . . . . 15 3.2 Motivation for Utilizing The Data Cube . . . . . . . . . . . . . . . . . . . . . 16 3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4 Our Methodology 4.1 4.2 4.3 4.4 18 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.1.1 Normalizing Data Schema for Text Databases . . . . . . . . . . . . . . 18 4.1.2 Problem Modeling with Normalized Data Schema . . . . . . . . . . . 19 Processing Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.2.1 Integrating Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.2.2 Integrating Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.3.1 BUC with DPMiner . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.3.2 BUC with Border-Differential and PADS . . . . . . . . . . . . . . . . 25 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 5 Experimental Results and Performance Study 26 5.1 The Test Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 5.2 Comparative Performance Study and Analysis . . . . . . . . . . . . . . . . . . 27 5.3 5.2.1 Evaluating the BUC Implementation . . . . . . . . . . . . . . . . . . . 27 5.2.2 Comparing Border-Differential with DPMiner . . . . . . . . . . . . . . 
28 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 6 Conclusions 31 6.1 Summary of The Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 6.2 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Bibliography 33 viii Index 36 ix List of Tables 2.1 A base table storing sales data [15]. . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Aggregates computed by group-by Branch. . . . . . . . . . . . . . . . . . . . 7 2.3 The full data cube based on Table 2.1. . . . . . . . . . . . . . . . . . . . . . . 8 2.4 An sample transaction database [21]. . . . . . . . . . . . . . . . . . . . . . . . 10 3.1 A multidimensional text database concerning Olympic news. . . . . . . . . . 17 4.1 A normalized dataset derived from the Olympic news database. . . . . . . . . 19 4.2 A normalized dataset reproduced from Table 4.1. . . . . . . . . . . . . . . . . 21 5.1 Sizes of synthetic datasets for experiments . . . . . . . . . . . . . . . . . . . . 27 5.2 The complete experimental results. . . . . . . . . . . . . . . . . . . . . . . . . 30 x List of Figures 2.1 BUC Algorithm [5, 27]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 A example of FP-tree based on Table 2.4 [21]. . . . . . . . . . . . . . . . . . . 11 5.1 Running time and cube size of our BUC implementation. . . . . . . . . . . . 28 5.2 Comparing running time of the two integration algorithms. . . . . . . . . . . 29 xi 9 Chapter 1 Introduction 1.1 Overview of Text Data Mining Analysis of documents in text databases and on the World Wide Web has been attracting researchers from various areas, such as data mining, machine learning, information retrieval, database systems, and natural language processing. In general, studies in different areas have different emphases. Traditional information retrieval techniques (e.g., the inverted index and vector-space model) prove to be efficient and effective in searching relevant documents to answer unstructured keyword-based queries. Machine learning approaches are also widely used in text mining, providing with effective solutions to various problems. For example, the Naive Bayes model and the Support Vector Machines (SVMs) are used in document classification; K-means and the ExpectationMaximization (EM) algorithms are used in document clustering. The textbook by Manning et al. [19] covers topics summarized above and much more in both traditional information retrieval and machine learning based document analysis. On the other hand, data warehousing and data mining also play important roles in analyzing documents, especially those stored in a special kind of databases called multidimensional text databases (ones with both relational dimensions and text fields). While information retrieval mainly addresses searching for documents and for information within documents according to users’ information needs, the goal of text mining differs in the following sense: it focuses on finding and extracting useful patterns and hidden knowledge from the information in documents and/or text databases, so as to improve the decision making process based on the text information. 1 CHAPTER 1. INTRODUCTION 2 Currently, many real-life business, administration and scientific databases are multidimensional text databases, containing both structured attributes and unstructured text attributes. An example of these databases can be found in Table 3.1. 
Since data warehousing and online analytical processing (OLAP) have proven their great usefulness in managing and mining multidimensional data of varied granularities [11], they have recently become important tools in analyzing such text databases [6, 17, 24, 26]. 1.2 Related Work on Multidimensional Text Data Analysis A data warehouse is a “subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making” [13]. Online analytical processing (OLAP), which is dominated by “stylized queries that involve group-by and aggregate operators” [27], is a powerful tool in data warehousing. Being a multidimensional data model with various features, the data cube [10] has become an essential OLAP facility in data warehousing. Conceptually, the data cube is an extended database with aggregates in multiple levels and multiple dimensions [15]. It generalizes the group-by operator, by precomputing and storing group-bys with regard to all possible combinations of dimensions. Data cubes are widely used in data warehousing for analyzing multidimensional data. Applying OLAP techniques, especially data cubes, to analyze documents in multidimensional text databases has made significant advances. Important information retrieval measures, i.e., term frequencies and inverted indexes, have been integrated into the traditional data cube, leading to the text cube [17]. It explores both dimension hierarchy and term hierarchy in the text data, and is able to answer OLAP queries by navigating to a specific cell via roll-up and drill-down operations. More recently, the work in [6] proposes a query answering technique called TopCells to address the top-k query answering in the text cube. Given a keyword query, TopCells is able to find the top-k ranked cells containing aggregated documents that are most relevant to the query. Another OLAP-based model dealing with multidimensional text data is the topic cube [26]. Topic cube combines OLAP with probabilistic topic modeling. It explores topic hierarchy of documents and stores probability-based measures learned through a probabilistic model. Moreover, text cubes and topic cubes have been applied to information network analysis. They are combined into an information-network-enhanced text cube called iNextCube CHAPTER 1. INTRODUCTION 3 [24]. Most previous works emphasize data warehousing more than data mining. They mainly deal with problems such as how to explore and establish dimensional hierarchies within the text data, and how to efficiently answer OLAP queries using cubes built on text data. 1.3 Contrast Pattern Based Document Analysis We follow the trend of using data cubes to analyze documents in multidimensional text databases. But as the previous works are more data warehousing oriented, we intend to address a more data mining oriented problem called contrast pattern based document analysis. More specifically, we wish to find contrast patterns in documents of different classes and then use those patterns in OLAP style document analysis (like the work in [6, 17]). This application is promising and has real-life demands. For example, from a large collection of documents containing information and reviews of laptop computers of various brands, a user interested in comparing Dell and Sony laptops might wish to find text information describing Dell’s special features that do not characterize Sony. 
These features contrast the two brands effectively, and would probably make the user’s decision to select Dell easier. To achieve this goal, we propose to integrate frequent pattern mining, especially the emerging pattern mining, and data cubing in an efficient and effective way. Frequent pattern mining [2] aims to find itemsets, that is, sets of items that frequently occur in a dataset. Furthermore, for patterns that can contrast different classes of data, intuitively they must be frequent patterns in one class, but are comparatively infrequent in other classes. There is one important class of contrast patterns, called the emerging pattern [7], defined as itemsets whose supports increase significantly from dataset D1 to dataset D2 . That said, those patterns are frequent in D2 but infrequent in D1 . Because of the sharp change of their supports among different datasets, such patterns meet our needs of showing contrasts in different classes of web documents. Our Contributions To tackle the contrast pattern based document analysis problem, we propose a novel data model by integrating efficient emerging pattern algorithms (e.g., the Border-Differential [7] CHAPTER 1. INTRODUCTION 4 and the state-of-the-art, DPMiner [16]) with the traditional data cube. This integrated model is novel, but also preserves features of traditional data cubes: 1. It is based on the data cube, and is constructed through a classical data cubing algorithm called BUC (the Bottom-Up Computation for data cubing) [5]. 2. It contains multidimensional text data and multiple granularity aggregates of such data, in order to support fast OLAP operations (such as roll-up and drill-down) and query answering. 3. Each cell in the cube contains a set of aggregated documents in the multidimensional text database with matched dimension attributes. 4. The measure of each cell is the emerging patterns whose support rises rapidly from the documents not aggregated in the cell to the documents aggregated in the cell. In this capstone project, we implement this integrated data model by incorporating emerging pattern mining seamlessly into the data cubing process. We choose BUC as our cubing algorithm to build the cube on structured dimensions. While aggregating documents and materializing cells, we simultaneously mine emerging patterns in documents aggregated in each particular cell, and store such patterns as the measure of this cell. Two widely used emerging pattern mining algorithms, the Border-Differential and the DPMiner are integrated with BUC cubing so as to compare their performance. We tested these two different integrations on synthetic datasets to evaluate their performance on different sizes of input data. The datasets are derived based on the Frequent Itemset Mining Implementations Repository (FIMI) [9]. Experimental results show that the state-of-the-art emerging pattern mining algorithm, the DPMiner, is a better choice over the Border-Differential. Our cube-based model shares similarity with the text cube [17] and the topic cube [26] at the level of data structure, since all three cubes are built based on multidimensional text data. The similarity of cube-based structure allows OLAP query answering techniques developed in [6, 17, 24, 26] to be directly applied to our cube. In that sense, point queries (seeking a cell), sub-cube queries (seeking an entire group-by) and top-k queries (seeking k most relevant cells) can be answered in contrast pattern based document analysis using our model. CHAPTER 1. 
INTRODUCTION 5 Major Differences with Existing Works This cube-based data model with emerging patterns as cell measures differs from all previous related work. It is unlike traditional data cubes using simple aggregate functions as cell measures, which are only adequate for relational databases. Also, our approach differs from the text cube which uses term frequencies and inverted indexes as cell measures, and the topic cube which uses probabilistic measures. Most importantly, to the best of our knowledge, our data model is novel in comparison to previous emerging pattern applications in OLAP. Specifically, a previous work in [20] used the Border-Differential algorithm to perform cube comparisons and capture trend changes between two precomputed data cubes. However, that work is of limited use and cannot be applied to multidimensional text data analysis. First, their approach worked on datasets different in kind from ours. The previous method only works on traditional data cubes built upon relational databases with categorical dimension attributes, while ours is designed for multidimensional text databases. Second, their approach is to find cells with supports growing significantly from one cube to another, but ours is able to determine emerging patterns for every single cell in the cube. Last but not least, their approach performs the Border-Differential algorithm after two data cubes were completely built, but our approach introduces a seamless integration: the data cubing and emerging pattern mining are carried out simultaneously. 1.4 Structure of the Report The rest of this capstone project report is organized as follows: Chapter 2 conducts a literature review on previous work and background knowledge that lays the foundation for this project. Chapter 3 motivates the contrast pattern based document analysis problem. Chapter 4 describes our methodology to tackle the problem. This chapter formulates the problem and proposes algorithms for constructing the integrated data model. Chapter 5 reports experimental results and studies the performance of our algorithm. Lastly, Chapter 6 concludes this capstone project and suggests improvements and optimizations that can be done in future work. Chapter 2 Literature Review This chapter reviews three categories of previous research that are related to this capstone project: data cubes and OLAP, frequent pattern mining, and emerging pattern mining. In Section 2.1 we talk about fundamentals of data warehousing, online analytical processing (OLAP), and data cubing. We highlight BUC [5], a bottom-up approach for data cubing. Section 2.2 introduces frequent pattern mining and an important mining algorithm called FP-Growth [12]. Section 2.3 reviews emerging pattern mining algorithms (BorderDifferential [7] and DPMiner [16]) that are particularly useful to our work. 2.1 Data Cubes and Online Analytical Processing A data warehouse is “a subject oriented, integrated, time-varying, non-volatile collection of data in support of management’s decision-making process” [13]. A powerful tool of exploiting data warehouses is the so-called online analytical processing (OLAP). Typically, OLAP systems are dominated by “stylized queries involving many group-by and aggregation operations” [27]. The data cube was introduced in [10] to facilitate answering OLAP queries on multidimensional data stored in data warehouses. A data cube can be viewed as “an extended multi-level and multidimensional database with various multiple granularity aggregates” [15]. 
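To make the idea of precomputing group-bys concrete, the following small C++ sketch (purely illustrative, and not the BUC-based implementation developed later in this project) brute-forces every combination of three dimensions over the small sales table used in the example of Section 2.1.1 below, keeping an AVG(Sales) measure per cell. The Row structure and the '*'-based cell keys are assumptions made only for this illustration.

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Brute-force illustration of what a full data cube stores: one aggregate cell
// per combination of dimension values, for every subset of the dimensions.
// (Real cubing algorithms such as BUC share sorting/partitioning work instead.)
struct Row { std::string branch, product, season; double sales; };

int main() {
    std::vector<Row> base = {            // the sample base table of Section 2.1.1
        {"B1", "P1", "spring", 6},
        {"B1", "P2", "spring", 12},
        {"B2", "P1", "fall",   9},
    };
    // running sum and count per cell, keyed by the (possibly '*'-ed) dimension values
    std::map<std::string, std::pair<double, int>> cells;
    for (const Row& r : base) {
        std::string dims[3] = {r.branch, r.product, r.season};
        for (int mask = 0; mask < (1 << 3); ++mask) {   // every subset of the 3 dimensions
            std::string key;
            for (int d = 0; d < 3; ++d) {
                key += (mask & (1 << d)) ? dims[d] : std::string("*");
                key += ' ';
            }
            cells[key].first  += r.sales;
            cells[key].second += 1;
        }
    }
    for (const auto& [key, sc] : cells)                  // print each cell and AVG(Sales)
        std::cout << key << " AVG(Sales) = " << sc.first / sc.second << "\n";
}

For the three-tuple table below, the two cells this sketch produces for the group-by on Branch, (B1, *, *) and (B2, *, *), both with average sales 9, match Table 2.2.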
The term data cubing refers to the process of constructing a data cube based on a relational database table, which is often referred to as the base table . In a cubing process, cells with non-empty aggregates will be materialized. Given a base table, we precompute group-bys and the corresponding aggregate values with respect to all possible combinations 6 CHAPTER 2. LITERATURE REVIEW 7 of dimensions in this table. Each group-by corresponds to a set of cells. The aggregate value for that group-by is stored as the measure of that cell. Cell measures provide with a good and concise summary of information aggregated in the cube. In light of the above, the data cube is a powerful data model allowing fast retrieval and analysis of multidimensional data for decision making processes based on data warehouses. It generalizes the group-by operator in SQL (Structured Query Language), and enable data analysts to avoid long and complicated SQL queries when searching for unusual data patterns in multidimensional databases [10]. 2.1.1 An Example of The Data Cube Example (Data Cube): Table 2.1 is a sample base table in a marketing management data warehouse [15]. It shows data organized under the schema (Branch, Product, Season, Sales). Branch Product Season Sales B1 B1 B2 P1 P2 P1 spring spring fall 6 12 9 Table 2.1: A base table storing sales data [15]. To build a data cube upon this table, group-bys are computed on three dimensions Branch, Product and Season. Aggregate values of Sales will be cell measures. In this example, we choose Average(Sales) as the aggregate function for this example. Since most intermediate steps of a data cubing process are basically computing group-bys and aggregate values to form cells, we illustrate the two cells computed by “group-by Branch” in Table 2.2. Cell No. Branch Product Season AVG(Sales) 1 2 B1 B2 ∗ ∗ ∗ ∗ 9 9 Table 2.2: Aggregates computed by group-by Branch. In the same manner, the full data cube contains all possible group-bys on Branch, Product and Season. It is shown in Table 2.3. Note that cells 1, 2 and 3 are derived from the least aggregated group-by: group-by Branch, Product, Season. Such cells are CHAPTER 2. LITERATURE REVIEW 8 called base cells. On the other hand, cell 18 (∗, ∗, ∗) is the apex cuboid aggregating all tuples in the base table. Cell No. Branch Product Season AVG(Sales) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 B1 B1 B2 B1 B1 B1 B2 B2 ∗ ∗ ∗ ∗ ∗ B1 B2 ∗ ∗ ∗ P1 P2 P1 P1 P2 ∗ P1 ∗ P1 P1 P2 ∗ ∗ ∗ ∗ P1 P2 ∗ spring spring fall ∗ ∗ spring ∗ fall spring fall spring spring fall ∗ ∗ ∗ ∗ ∗ 6 12 9 6 12 9 9 9 6 9 12 9 9 9 9 7.5 12 9 Table 2.3: The full data cube based on Table 2.1. 2.1.2 Data Cubing Algorithms Efficient and scalable data cubing is challenging. When a base table has a large number of dimensions and each dimension has high cardinality, time and space complexity grows exponentially. In general, there are three approaches of cubing in terms of the order to materialize cells: top-down, bottom-up and a mix of both. A top-down approach (e.g., the Multiway Array Aggregation [28]) constructs the cube from the least aggregated base cells towards the most aggregated apex cuboid. On the contrary, a bottom-up approach such as BUC [5] computes cells in the opposite order. Other methods, such as Star-Cubing [23], combines the top-down and bottom-up mechanisms together to carry out the cubing process. On fast computation of multidimensional aggregates, [11] summarizes the following optimization principles: (1). 
Sorting or hashing dimension attributes to cluster related tuples CHAPTER 2. LITERATURE REVIEW 9 that are likely to be aggregated together in certain group-bys. (2). Computing higher-level aggregates from previously computed lower-level aggregates, and caching intermediate results in memory to reduce expensive I/O operations. (3). Computing a group-by from the smallest previously-computed group-by. (4). Mapping dimension attributes in various kinds of formats to integers ranging between zero and the cardinality of the dimension. There are also many other heuristics being proposed to improve the efficiency of data cubing [1, 5, 11]. 2.1.3 BUC: Bottom-Up Computation for Data Cubing BUC [5] constructs the data cube bottom-up, from the most aggregated apex cuboid to group-bys on a single dimension, then on a pair of dimensions, and so on. It also uses many optimization techniques introduced in the previous section. Figure 2.1 illustrates the processing tree and the partition method used in BUC on a 4-dimensional base table. Subfigure (b) shows the recursive nature of BUC: after sorting and partitioning data on dimension A, we deal with the partition (a1 , ∗, ∗, ∗) first and recursively partition it on dimension B to proceed to its parent cell (a1 , b1 , ∗, ∗) and then the ancestor (a1 , b1 , c1 , ∗) and so on. After dealing with partition a1 , BUC continues on to process partitions a2 , a3 and a4 in the same manner until all cells are materialized. Figure 2.1: BUC Algorithm [5, 27]. The depth-first search process for building our integrated data model (covered in Chapter CHAPTER 2. LITERATURE REVIEW 10 4) follows the basic framework of BUC. 2.2 Frequent Pattern Mining Frequent patterns are patterns (sets of items, sequence, etc.) that occur frequently in a database [2]. The supports of frequent patterns must exceed a pre-defined minimal support threshold. Frequent pattern mining has been studied extensively in the past two decades. It lays the foundation for many data mining tasks such as association rules [3] and emerging pattern mining. Although its definition is concise, the mining algorithms are not trivial. Two notable algorithms are Apriori [3] and FP-Growth [12] . FP-Growth is more important to our work as efficient emerging pattern mining algorithms such as [4, 16] use the FP-tree proposed in FP-Growth as data structures. FP-Growth addressed the limitations of the breadth-first-search-based Apriori such as multiple database scans, large amounts of candidate generations and support counting. It is a depth-first search algorithm. The first scan of a database finds all frequent items, ranks them in frequency-descending order, and puts them into a head table. Then it compresses the database into a prefix tree called FP-tree. A complete set of frequent patterns can be mined by recursively constructing projected databases and the FP-trees based on them. For example, given a transaction database in Table 2.4 [21], we can build a FP-tree accordingly (shown in Figure 2.2). TID Items (Ordered) Frequent Items 100 200 300 400 500 f, a, c, d, g, i, m, p a, b, c, f, l, m, o b, f, h, j, o b, c, k, s, p a, f, c, e, l, p, m, n f, a, c, m, p f, c, a, b, m f, b c, b, p f, c, a, m, p Table 2.4: An sample transaction database [21]. Next, we define three special types of frequent patterns: the maximal frequent patterns (max-patterns for short), the closed frequent patterns and frequent generators, as they are closely related to emerging pattern mining. 
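Before turning to those definitions, the following minimal C++ sketch illustrates only the first scan of FP-Growth described above: counting item supports, discarding infrequent items, and rewriting each transaction with its frequent items in support-descending order, as in the "(Ordered) Frequent Items" column of Table 2.4. The minimal support threshold of 3 is an assumption (the report does not state the value used for this example), and the FP-tree construction and recursive mining steps are not shown.

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// First scan of FP-Growth over a transaction database: count item supports,
// drop infrequent items, and reorder each transaction support-descending.
int main() {
    std::vector<std::vector<std::string>> db = {   // transactions of Table 2.4
        {"f","a","c","d","g","i","m","p"},
        {"a","b","c","f","l","m","o"},
        {"b","f","h","j","o"},
        {"b","c","k","s","p"},
        {"a","f","c","e","l","p","m","n"},
    };
    const int minsup = 3;                          // minimal support threshold (assumed)

    std::map<std::string, int> support;            // head-table counts
    for (const auto& t : db)
        for (const auto& item : t) ++support[item];

    for (const auto& t : db) {
        std::vector<std::string> freq;
        for (const auto& item : t)
            if (support[item] >= minsup) freq.push_back(item);
        std::stable_sort(freq.begin(), freq.end(),
            [&](const std::string& x, const std::string& y) {
                return support[x] > support[y];    // support-descending; ties keep input order
            });
        for (const auto& item : freq) std::cout << item << ' ';
        std::cout << '\n';
    }
}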
Figure 2.2: An example of an FP-tree based on Table 2.4 [21].

Definition (Max-Pattern): An itemset X is a maximal frequent pattern, or max-pattern, in dataset D if X is frequent in D, and every proper super-itemset Y with X ⊂ Y is infrequent in D [11].

Definition (Closed Pattern and Generator): An itemset X is closed in dataset D if there exists no proper super-itemset Y s.t. X ⊂ Y and support(X) = support(Y) in D. X is a closed frequent pattern in D if it is both closed and frequent in D [11]. An itemset Z is a generator in D if there exists no proper sub-itemset Z′ ⊂ Z such that support(Z′) = support(Z) [18].

The state-of-the-art max-pattern mining algorithm is called the Pattern-Aware Dynamic Search (PADS) [25]. The DPMiner, the state-of-the-art emerging pattern mining algorithm, is also the most powerful algorithm for mining closed frequent patterns and frequent generators.

2.3 Emerging Pattern Mining

Emerging patterns [7] are patterns whose supports increase significantly from one class of data to another. Mathematical details can be found in Section 4.1 (Problem Formulation) of this report and in [4, 7, 8, 16]. The original work on emerging patterns [7] gives an algorithm called the Border-Differential for mining such patterns. It uses borders to succinctly represent patterns and mines the patterns by manipulating the borders only. The work in [4] used the FP-tree introduced in [12] for emerging pattern mining. Following that, the work in [16] improves the FP-tree-based algorithm by simultaneously generating closed frequent patterns and frequent generators to form emerging patterns. This algorithm is called the DPMiner and is considered the state of the art for emerging pattern mining.

2.3.1 The Border-Differential Algorithm

The Border-Differential uses borders to represent patterns. It involves mining max-patterns and manipulating the borders initiated by those patterns to derive the border representation of emerging patterns. A border is an ordered pair ⟨L, R⟩, where L and R are the left and right bounds of the border respectively. Both L and R are collections of itemsets, but are much smaller than the original pattern collections in size. The patterns represented by ⟨L, R⟩ are the intervals of ⟨L, R⟩, defined as [L, R] = {Y | ∃X ∈ L, ∃Z ∈ R, s.t. X ⊆ Y ⊆ Z}. For example, the collection [L, R] = {{1}, {1, 2}, {1, 3}, {1, 2, 3}, {2, 3}, {2, 3, 4}} has the border L = {{1}, {2, 3}}, R = {{1, 2, 3}, {2, 3, 4}}; itemsets other than those in L and R (e.g., {1, 3}) are intervals of ⟨L, R⟩.

Given a pair of borders ⟨{φ}, R1⟩ and ⟨{φ}, R2⟩ whose left bounds are initially empty, the differential border ⟨L1, R1⟩ is derived to satisfy [L1, R1] = [{φ}, R1] − [{φ}, R2]. This operation is the so-called Border-Differential. Furthermore, given two datasets D1 and D2, to determine emerging patterns using the Border-Differential operation, we first determine the max-patterns U1 of D1 and U2 of D2 using PADS, and initiate the two borders ⟨{φ}, U1⟩ and ⟨{φ}, U2⟩. Then, we take the differential between those two borders. Let U1 = {X1, X2, ..., Xn} and U2 = {Y1, Y2, ..., Ym}, where the Xi and Yj are itemsets. The left bound of the differential border is computed by

L1 = ∪(i=1..n) ( PowerSet(Xi) − ∪(j=1..m) PowerSet(Yj) ).

The right bound U1 remains the same. Lastly, we form the border ⟨L1, U1⟩, and the intervals [L1, U1] of ⟨L1, U1⟩ are the emerging patterns in D1.
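As a concrete, deliberately naive illustration of the set operation above, the following C++ sketch materializes the power sets of two small max-pattern collections U1 and U2 and computes ∪i (PowerSet(Xi) − ∪j PowerSet(Yj)). The toy itemsets are invented for illustration only, and a practical implementation (as in [7, 8]) manipulates the borders directly rather than expanding power sets, since the expansion is exponential.

#include <iostream>
#include <set>
#include <vector>

// Naive sketch of the Border-Differential set algebra: expand power sets of the
// max-patterns explicitly and take the difference (real implementations do not).
using Itemset = std::set<int>;

static std::set<Itemset> powerSet(const Itemset& s) {
    std::vector<int> v(s.begin(), s.end());
    std::set<Itemset> ps;
    for (unsigned mask = 0; mask < (1u << v.size()); ++mask) {
        Itemset sub;
        for (size_t i = 0; i < v.size(); ++i)
            if (mask & (1u << i)) sub.insert(v[i]);
        ps.insert(sub);
    }
    return ps;
}

int main() {
    std::vector<Itemset> U1 = {{1, 2, 3}, {2, 3, 4}};   // max-patterns of D1 (toy data)
    std::vector<Itemset> U2 = {{1, 2}, {3, 4}};         // max-patterns of D2 (toy data)

    std::set<Itemset> right;                            // union of PowerSet(Yj), Yj in U2
    for (const auto& y : U2)
        for (const auto& sub : powerSet(y)) right.insert(sub);

    std::set<Itemset> diff;                             // union over i of PowerSet(Xi) - right
    for (const auto& x : U1)
        for (const auto& sub : powerSet(x))
            if (!right.count(sub)) diff.insert(sub);

    for (const auto& s : diff) {                        // itemsets covered only by U1's border
        for (int item : s) std::cout << item << ' ';
        std::cout << '\n';
    }
}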
As the size of the datasets grows, the Border-Differential becomes problematic because it involves set enumerations, resulting in exponential computational costs. The work in [8], a more recent version of [7], proposed several optimization techniques to improve the efficiency of the Border-Differential. However, the complexity of finding emerging patterns is in fact MAX SNP-hard, which means that polynomial time approximation schemes do not exist unless P = NP [22].

2.3.2 The DPMiner Algorithm

The work in [4] used the FP-tree and pattern-growth methods to mine emerging patterns, but it still needs to call the Border-Differential to find the emerging patterns. The DPMiner (short for Discriminative Pattern Miner) in [16] also uses the FP-tree but mines emerging patterns in a different way. It finds closed frequent patterns and frequent generators simultaneously to form equivalent classes of such patterns, and then determines emerging patterns as "non-redundant δ-discriminative equivalent classes" [16].

An equivalent class EC is "a set of itemsets that always occur together in some transactions of dataset D" [16]. It can be uniquely represented by its set of frequent generators G and closed frequent patterns C, in the form EC = [G, C]. Suppose D can be divided into various classes, denoted as D = D1 ∪ D2 ∪ ... ∪ Dn. Let δ be a small integer (usually 1 or 2) and θ be a minimal support threshold. An equivalent class EC is a δ-discriminative equivalent class provided that the support of its closed pattern C is greater than θ in D1 but smaller than δ in D − D1 = D2 ∪ ... ∪ Dn. Furthermore, EC is a non-redundant δ-discriminative equivalent class if and only if (1) it is δ-discriminative, and (2) there exists no EC′ such that C′ ⊆ C, where C′ and C are the closed patterns of EC′ and EC respectively. The closed frequent patterns of a non-redundant δ-discriminative equivalent class are emerging patterns in D1.

Data Structures and Computational Steps of the DPMiner

The high efficiency of the DPMiner is mainly attributed to its revised FP-tree structure. Unlike traditional FP-trees, it does not store items that appear in every transaction and hence have full support in D. These items are removed because they cannot form generators. This modification results in a much smaller FP-tree compared to the original. The computational framework of the DPMiner consists of the following five steps:

(1). Given k classes of data D1, D2, ..., Dk as input, obtain their union D = D1 ∪ D2 ∪ ... ∪ Dk. Also specify a minimal support threshold θ and a maximal threshold δ (thus, patterns with supports above θ in Di but below δ in D − Di are candidate emerging patterns in Di).

(2). Construct an FP-tree based on D and run a depth-first search on the tree to find frequent generators and closed patterns simultaneously. For each search path along the tree, the search terminates whenever a δ-discriminative equivalent class is reached.

(3). Determine the class label distribution for every closed pattern, i.e., find in which class a closed pattern has the highest support. This step is necessary because patterns are not mined separately for each Di (1 ≤ i ≤ k), but rather on the entire D.

(4). Pair up generators and closed frequent patterns to form δ-discriminative equivalent classes.

(5). Output the non-redundant δ-discriminative equivalent classes as emerging patterns. If a pattern is labeled as i (1 ≤ i ≤ k), then it is an emerging pattern in Di.
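The following small C++ sketch illustrates only the support test implied by steps (1) and (3)-(5): given a closed pattern's support count in each class, it reports the class (if any) for which the pattern is a candidate emerging pattern. The counts, thresholds and function name are hypothetical, and the actual mining of generators and closed patterns from the revised FP-tree, which is where the DPMiner's efficiency comes from, is not shown.

#include <iostream>
#include <vector>

// Decide whether one closed pattern, with the given per-class support counts,
// qualifies as a candidate emerging pattern, and if so for which class.
int classOfCandidate(const std::vector<int>& supportPerClass, int theta, int delta) {
    for (size_t i = 0; i < supportPerClass.size(); ++i) {
        int rest = 0;                                   // support in D - Di
        for (size_t j = 0; j < supportPerClass.size(); ++j)
            if (j != i) rest += supportPerClass[j];
        if (supportPerClass[i] >= theta && rest <= delta)
            return static_cast<int>(i);                 // candidate emerging pattern in class i
    }
    return -1;                                          // not discriminative for any class
}

int main() {
    // Hypothetical counts for one closed pattern across three classes.
    std::vector<int> support = {12, 1, 0};
    int cls = classOfCandidate(support, /*theta=*/10, /*delta=*/1);
    std::cout << "candidate emerging pattern in class: " << cls << '\n';   // prints 0
}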
2.4 Summary In this chapter, we discussed previous research addressing data cubing, frequent pattern mining and emerging pattern mining, all of which are essential for our project. Algorithms (the Bottom-Up Cubing, the Border-Differential and the DPMiner) closely related to our work have been described in detail. Chapter 3 Motivation In this chapter, we motivate the problem of contrast pattern based document analysis. We explain why contrast patterns (in particular, the emerging patterns) are useful, and why data cubes should be used in analyzing documents in multidimensional text databases. 3.1 Motivation for Mining Contrast Patterns This section answers the following two questions: (1) Why we need to mine and use contrast patterns to analyze web documents? (2) How useful are those patterns? In other words, can they make a significant contribution to a good text mining application? We answer these questions by introducing motivating scenarios in real life. Example (Contrast Patterns in Documents) Since the Calgary 1988 Olympic Winter Games, Canada has not been a host country for the Olympic Games for 22 years. Therefore, people may want to know what are the most attractive and discriminative features of the Vancouver 2010 Winter Olympics, compared to all previous Olympic Games. Indeed, there are exciting and touching stories in almost all Olympics and Vancouver certainly has its unique moments. For example, the Canadian figure skater Joannie Rochette won a bronze medal under the keenly felt pain of losing her mother a day before her event started. Suppose a user searches the web and Google returns her a collection of documents on Olympics, consisting of many online sports news and commentaries. There may be too much information for her to read through and find unique stories about Vancouver 2010. Although there is no doubt that Joannie Rochette’s accomplishment will occur frequently in articles related to Vancouver 2010, a user who is previously unaware about Rochette may 15 CHAPTER 3. MOTIVATION 16 not be able to learn about her quickly from the search results. Similar situations may also happen when users compare products online by searching and reading reviews by previous buyers. Here is an example we have seen in Section 1.3: Suppose a user is comparing Dell’s laptop computers with Sony’s. She probably wants to know the special features of Dell which are not owned by Sony’s. For example, many reviewers would speak in favor of Dell by commenting “high performance-price ratio” but would not do that for Sony as it is not the case. Then “high performance-price ratio” is a pattern contrasting Dell laptops with Sony laptops. To let the users manually determine such contrast patterns is not feasible. Therefore, given a collection of documents, which are ideally pre-classified and stored into a multidimensional text database, we need to develop efficient data models and corresponding algorithms to determine contrast patterns in documents of different classes. As mentioned in Section 1.3, we choose the emerging pattern [7] since it is a representative class of contrast patterns widely used in data mining. Also, there are good algorithms [4, 7, 16] available for efficient mining of such patterns. Moreover, emerging patterns can make a contribution to some other problems in text mining. A novel document classifier could be constructed based on those patterns as they are claimed useful in building accurate classifiers [8]. 
Also, since emerging patterns are able to capture discriminative features of a class of data, they may be helpful in extracting keywords to summarize the given text. 3.2 Motivation for Utilizing The Data Cube In many real-life database applications, documents and the text data within them are stored in multidimensional text databases [24]. These kinds of databases are distinct from traditional data sources we deal with, including relational databases, transaction databases, and text corpora. Formally, a multidimensional text database is defined as a relational database with text fields. A sample text database is shown in Table 3.1. The first three dimensions (Event, Time, and Publisher) are standard dimensions, just like those in relational databases. The last column contains text dimensions which are documents with text terms. Text databases provide structured attributes of documents, and the information needs of users vary where such needs can be modeled hierarchically. This makes OLAP and data cubes applicable. For instance (using Table 3.1), if a user wants to read news on the ice hockey games reported by the Vancouver Sun on February 20, 2010, then two documents d1 CHAPTER 3. MOTIVATION 17 Event Time Publisher ... Ice hockey Ice hockey Ice hockey Figure skating Figure skating Curling Curling ... 2010/2/20 2010/2/23 2010/2/20 2010/2/20 2010/2/20 2010/2/23 2010/2/28 ... Vancouver Sun Global and Mail Vancouver Sun Global and Mail Vancouver Sun New York Times Global and Mail ... ... ... ... ... ... ... ... ... Text Data: Documents d1 d2 d3 d4 d5 d6 d7 = {t1 , t2 , t3 , t4 } = {t2 , t3 , t7 , t8 } = {t1 , t2 , t3 , t6 } = {t2 , t4 , t6 , t7 } = {t1 , t3 , t5 , t7 } = {t2 , t5 , t7 , t9 } = {t3 , t6 , t8 , t9 } ... Table 3.1: A multidimensional text database concerning Olympic news. and d3 matching the query {Event = Ice hockey, Time = 2010/2/20, Publisher = Vancouver Sun} will be returned to her. If another user wants to skim all Olympic news reported by the Vancouver Sun on that day, we shall roll up to query {Event = ∗, Time = 2010/2/20, Publisher = Vancouver Sun} and return documents d1 , d3 and d5 to her. The opposite operation of roll-up is called drill-down. In fact, roll-up and drill-down are two OLAP operations of great importance [11]. Therefore, to meet different levels of information needs, it is natural for us to apply the data cube to model and extend this text database. This is exactly what the previous work in [17, 24, 26] did. 3.3 Summary In light of the above, this chapter shows that contrast patterns are useful in analyzing large scale text data and they are able to give concise information about the data. Also, the nature of multidimensional text databases makes OLAP and the most essential OLAP tool, the data cube, particularly suitable for modeling and analyzing text data in documents. Chapter 4 Our Methodology In this chapter, we describe our methodology to tackle the contrast pattern based document analysis by building a novel integrated data model through BUC data cubing [5] and two emerging pattern mining algorithms, the Border-Differential [7] and the DPMiner [16]. Section 4.1 formulates the problem we try to address in this work. Section 4.2 describes the processing framework and our algorithms, from both data integration level and algorithm integration level. Section 4.3 discusses issues related to implementation. 
4.1 4.1.1 Problem Formulation Normalizing Data Schema for Text Databases Suppose a collection of web documents are stored in a multidimensional text database. The text data in documents are collected under the schema containing a set of standard non-text dimensions {SD1 , SD2 , ..., SDn }, and a set of text dimensions (terms) {T D1 , T D2 , ..., T Dm }, where m is the number of distinct text terms in this collection. For simplicity, text terms can be mapped to items, so documents can be mapped to transactions, or itemsets (sets of items that appear together). This mapping is similar to the bag-of-words model, which represents text data as an unordered collection of words, disregarding word order and count. In that sense, a multidimensional text database can be mapped to a relational base table with a transaction database. Under the above mapping mechanism, each tuple in a text database corresponds to a certain document, in the form of hS, T i, where S is the set of standard dimension attributes 18 CHAPTER 4. OUR METHODOLOGY 19 and T is a transaction. The dimension attributes can be learned through a certain classifier or labeled artificially. Words in the document are tokenized and each distinct token will be treated as an item in the transaction. For example, the tuple corresponding to the first row in Table 3.1 is Ice hockey, 2010/2/20, Vancouver Sun, ..., d1 = {t1 , t2 , t3 , t4 }, with d1 = {t1 , t2 , t3 , t4 } being the transaction. Furthermore, we normalize text database tuples to derive a simplified data schema. We map standard dimensions to letters, e.g, Event to A, Time to B and Publisher to C, to make them unified. Likewise, dimension attributes are mapped to items in the same manner: Ice hockey is mapped to a1 , Figure skating is mapped to a2 and so on. Table 4.1 shows a normalized dataset derived from the Olympic news database (Table 3.1). A B C ... a1 a1 a1 a2 a2 a3 a3 ... b1 b2 b1 b1 b1 b2 b3 ... c1 c2 c1 c2 c1 c3 c2 ... ... ... ... ... ... ... ... ... Transactions d1 d2 d3 d4 d5 d6 d7 = {t1 , t2 , t3 , t4 } = {t2 , t3 , t7 , t8 } = {t1 , t2 , t3 , t6 } = {t2 , t4 , t6 , t7 } = {t1 , t3 , t5 , t7 } = {t2 , t5 , t7 , t9 } = {t3 , t6 , t8 , t9 } ... Table 4.1: A normalized dataset derived from the Olympic news database. 4.1.2 Problem Modeling with Normalized Data Schema Given a normalized dataset as a base table, we build our integrated cube-based data model by computing a full data cube grouped by all standard dimensions (e.g., {A, B, C} in the above table). In the data cubing process, every subset of {A, B, C} will be gone through to form a group-by corresponding to a set of cells. Emerging patterns in each cell will be mined simultaneously and stored as cell measures. When materializing each cell, we aggregate tuples whose dimension attributes match this particular cell. The transactions of matched tuples form the target class (or positive class), denoted as T C. We also virtually aggregate all unmatched tuples and extract their transactions to form the background class (or negative class), denoted as BC. The membership in T C and BC varies from cell to cell; both classes are dynamically computed and formed for each cell. CHAPTER 4. OUR METHODOLOGY 20 A transaction T is a full itemset in a tuple. A pattern X is a sub-itemset of T having a non-zero support (i.e., the number of times X appears) in the given dataset. Let θ be the minimal support threshold for T C and δ be the maximal support threshold for BC. 
Pattern X is an emerging pattern in TC if and only if support(X, TC) ≥ θ and support(X, BC) ≤ δ. In other words, the support of X grows significantly from BC to TC, exceeding a minimal growth rate threshold ρ = θ/δ. Mathematically, growth rate(X) = support(X, TC) / support(X, BC) ≥ ρ. Note that δ can be 0, in which case ρ = θ/δ = ∞. If growth rate(X) = ∞, X is a jumping emerging pattern [7], which does not appear in BC at all.

Given predefined support thresholds θ and δ, for each cell in this cube-based model, we mine all patterns whose support is above θ in the target class TC and below δ in its background class BC. Such patterns automatically exceed the minimal growth rate threshold ρ and become the measure of this cell. Upon obtaining all cells and their corresponding emerging patterns, the model building process is complete. The entire process is based on data cubing and also requires a seamless integration of cubing and emerging pattern mining.

Example: Now let us consider a simple example regarding the base table in Table 4.1. Let θ = 2 and δ = 1. Suppose at a certain stage we are carrying out the group-by operation on dimension A. We get three cells: (a1, ∗, ∗), (a2, ∗, ∗) and (a3, ∗, ∗). For cell (a1, ∗, ∗), which aggregates the first three tuples in Table 4.1, TC = {d1, d2, d3} and BC = {d4, d5, d6, d7}. Then consider the pattern X = (t1, t2, t3). It appears twice in TC (in d1 and d3) but zero times in BC, so support(X, TC) ≥ θ and support(X, BC) ≤ δ. Therefore, X = (t1, t2, t3) is a (jumping) emerging pattern in TC and hence is a measure of cell (a1, ∗, ∗).

4.2 Processing Framework

To recapitulate, Chapter 1 introduced contrast pattern based document analysis in multidimensional text databases. We follow the idea of using data cubes and OLAP to analyze multidimensional text data, and propose to merge the BUC data cubing process with two different emerging pattern mining algorithms (the Border-Differential and the DPMiner) to build an integrated data model based on the data cube. This model is designed to support contrast pattern based document analysis. In this section, following the problem formulation in Section 4.1, we propose our algorithm to integrate emerging pattern mining into data cubing. The entire processing framework includes both data integration and algorithm integration.

4.2.1 Integrating Data

To begin with, we reproduce Table 4.1 (with slight revisions) as Table 4.2 to make the following discussion clear. It shows a standard and ideal format of data that simplifies a multidimensional text database. The data used in our testing strictly follows this format: each row in a dataset D is a tuple of the form ⟨S, T⟩, where S is the set of dimension attributes and T is a transaction.

Tuple No.  A   B   C   F   Transactions
1          a1  b1  c1  f1  d1 = {t1, t2, t3, t4}
2          a1  b2  c2  f1  d2 = {t2, t3, t7, t8}
3          a1  b1  c2  f2  d3 = {t1, t2, t3, t6}
4          a2  b1  c2  f2  d4 = {t2, t4, t6, t7}
5          a2  b1  c1  f1  d5 = {t1, t3, t5, t7}
6          a3  b2  c3  f3  d6 = {t2, t5, t7, t9}
7          a3  b3  c2  f3  d7 = {t3, t6, t8, t9}
8          a4  b2  c3  f1  d8 = {t6, t8, t11, t12}

Table 4.2: A normalized dataset reproduced from Table 4.1.

The integration of data is indispensable because of the nature of the multidimensional text mining problem. In addition, data cubing and emerging pattern mining algorithms originally work with data from heterogeneous sources.
Data cubing mainly deals with relational base tables in data warehouses, while emerging pattern mining concerns transaction databases (see the example in Table 2.4). Therefore, we should unify the heterogeneous data first and then develop algorithms for a seamless integration. Thus, we model the text database and its normalized schema (Table 4.2) by appending transaction database tuples to relational base table tuples.

Moreover, for the integrated data, we also apply one of the optimization techniques discussed in Section 2.1.2: mapping dimension attributes in various kinds of formats to integers between zero and the cardinality of the attribute [11]. For example, in Table 4.1, dimension A has cardinality |A| = 3, so in implementation and testing we map a1 to 0, a2 to 1 and a3 to 2. Similarly, items in transactions are mapped to integers ranging from one to the total number of items in the dataset. For instance, if all items in a dataset are labeled from t1 to t100, we can represent them by the integers from 1 to 100. This kind of mapping facilitates sorting and hashing in data cubing. Particularly for BUC, such a mapping allows the use of the linear counting sort algorithm to reorder input tuples efficiently.

4.2.2 Integrating Algorithms

Our algorithm integrates data cubing and emerging pattern mining seamlessly. It carries out a depth-first search (DFS) to build the data cube and mine emerging patterns as cell measures simultaneously. The algorithm is designed to work on any valid integrated dataset like Table 4.2 (both the dimension attributes and the transaction should be non-empty for every tuple). We outline the algorithm in the following pseudo-code (adapted from [5]).

Algorithm: Procedure BottomUpCubeWithDPMiner(data, dim, theta, delta)

Inputs:
  data:  the dataset upon which we build our integrated model.
  dim:   number of standard dimensions in the input data.
  theta: the minimal support threshold of candidate emerging patterns in the target class.
  delta: the maximal support threshold of candidate emerging patterns in the background class.
Outputs: cells with their measures (patterns)

Method:
 1: aggregate(data);
 2: if (data.count == 1) then
 3:     writeAncestors(data, dim);
 4:     return;
 5: endif
 6: for each dimension d (from 0 to (dim - 1)) do
 7:     C := cardinality(d);
 8:     newData := partition(data, d);   // counting sort.
 9:     for each partition i (from 0 to (C - 1)) do
10:         cell := createEmptyCell();
11:         posData := newData.gatherPositiveTransactions();
12:         negData := newData.gatherNegativeTransactions();
13:         isDuplicate := determineCoverage(posData, negData);
14:         if (!isDuplicate) then
15:             cell.measure := DPMiner(posData, negData, theta, delta);
16:             writeOutput(cell);
17:             subData := newData.getPartition(i);
18:             BottomUpCubeWithDPMiner(subData, d+1, theta, delta);
19:         endif
20:     endfor
21: endfor

For integrating BUC with the other emerging pattern algorithm, the Border-Differential, replace line 15 with the following pseudo-code:

15.1: posMaxPat := PADS(posData, theta);
15.2: negMaxPat := PADS(negData, theta);
15.3: cell.measure := BorderDifferential(posMaxPat, negMaxPat);

The Execution Flow

To illustrate the execution flow of our integrated algorithm, suppose the algorithm is given an input dataset D like Table 4.2, with four dimensions, namely A, B, C and F. To begin with, the algorithm aggregates D (line 1).
Then it determines the cardinality of the first dimension A (line 7) and partitions the aggregated D on A (line 8), which creates four partitions (a1 , ∗, ∗, ∗), (a2 , ∗, ∗, ∗), (a3 , ∗, ∗, ∗) and (a4 , ∗, ∗, ∗). Each partition is sorted linearly using the counting sort algorithm. Then the algorithm iterates through these partitions to construct cells and mine patterns (line 9). It starts with cell (a1 , ∗, ∗, ∗), gathering transactions with a1 on A as the target class (line 11), and collects the remaining ones as the background class (line 12). Both classes will then be passed on to the DPMiner procedure to find emerging patterns in the target class (line 15), provided that this cell’s target class is not identical to that of its descendant cells that have been processed (line 13, more on this later). Then, BUC is called recursively on the current partition to materialize cells, mine patterns and output them. The algorithm further sorts and partitions (a1 , ∗, ∗, ∗) to proceed to its parent (a1 , b1 , ∗, ∗). As it continues to execute, it recurses further on ancestor cells (a1 , b1 , c1 , ∗) and (a1 , b1 , c1 , f1 ). Upon reaching the base cells, the algorithm backtracks to the nearest descendant cell (a1 , b1 , c2 , ∗). The complete processing order follows Figure 2.1. CHAPTER 4. OUR METHODOLOGY 24 Optimizations The duplicate checking function in line 13 is an optimization aimed at avoiding producing cells with identical aggregated tuples and patterns. For example, the cell (a2 , b1 , ∗, ∗) aggregates tuples 4 and 5 in Table 4.2. Since we have already computed its descendant cell (a2 , ∗, ∗, ∗), which also covers exactly the same two tuples, these two cells will have exactly the same target class and background class. Therefore, processing cells like (a2 , b1 , ∗, ∗) leads to duplicate work that is unnecessary and should be avoided. The duplicate checking function helps in this kind of situations. The above duplicate checking function generalizes the original BUC optimization called writeAncestors (line 3 in the pseudo code). Our algorithm also includes writeAncestors with slight modifications, as a special case of the duplicate checking. Consider that when the algorithm proceeds to (a4 , ∗, ∗, ∗), a partition has only one tuple. In the same sense as we have discussed above, the ancestor cells (a4 , b2 , ∗, ∗), (a4 , b2 , c3 , ∗), and (a4 , b2 , c3 , f1 ) all contain exactly the same tuple and hence will have identical patterns. These four cells actually form an equivalent class. We choose to output the lower bound (a4 , ∗, ∗, ∗) together with the upper bound (a4 , b2 , c3 , f1 ) and skip all intermediate cells in this equivalent class. Both optimization techniques shorten the running time of our program and reduces the number of cells to output. Experiments conducted in [5] found out that in real-life data warehouses, about 20% of the aggregates contain only one tuple. Therefore, empirically such optimizations are useful and helpful. 4.3 Implementations For this capstone project, we implemented BUC and the Border-Differential in C++. We also made use of the original DPMiner package from [16] for emerging pattern mining and the PADS package from [25] for max-pattern mining needed by the Border-Differential. To ensure smooth data flow in the integration, both DPMiner and PADS packages are modified sufficiently to meet our specific needs. 
The original packages read input data from certain files different from our test datasets (like Table 4.2), so for our implementation, those packages are changed to get the input directly from BUC on the fly. Therefore, primarily, the data structures for holding transactions, and corresponding functions to manipulate items in transactions are modified accordingly. CHAPTER 4. OUR METHODOLOGY 4.3.1 25 BUC with DPMiner Integrating BUC with the DPMiner, for each cell, label the tuples whose dimension attributes match the current partition in BUC as class 1 (the target class) and the tuples which do not match as class 0 (the background class). Then pass their transactions in two arrays to the DPMiner procedure, which will carry out the pattern mining task. It mines frequent generators and closed patterns for both data classes by executing the computational steps described in Section 2.3.2. After the mining process, the most general closed patterns, i.e., the ones that have the shortest length among others in its equivalent class are determined as the so-called non-redundant δ-discriminative patterns. Such patterns will be returned to the cell and stored as its measure. Lastly, one file per cell will be output to disk and the file name is a string containing the dimension attributes of that cell. 4.3.2 BUC with Border-Differential and PADS On the integration of BUC and Border-Differential, for each cell, the target class of transactions and the background class will be collected in the same manner as above. Unlike the DPMiner, the Border-Differential algorithm cannot determine candidate emerging patterns (i.e., the max-patterns in both classes) itself. Instead, our algorithm employs PADS [25] first to determine the max-patterns for both classes. Patterns will be passed on to the Border-Differential procedure after they are mined. Then, invoke the Border-Differential procedure to make the differential between two borders initiated by the max-patterns. As there could be more than one max-pattern for either the target or the background class, we might get multiple borders (each corresponding to a max-pattern) for a single cell. Finally, one file per cell will be output to disk and the file name is a string containing the dimension attributes of that cell. 4.4 Summary In this chapter, we formulated the data model construction problem in the first section. Then we described our processing framework, i.e., the integration of data cubing and emerging pattern mining from the data level and the algorithm level. Both levels of integrations are important. Lastly, we concluded this chapter by addressing some issues related to real implementations. Chapter 5 Experimental Results and Performance Study In this chapter, we present a comparative empirical evaluation on the algorithms developed and implemented in this capstone project. The experiments are run on a machine with Intel Core 2 Duo CPU, 3.0 GB main memory, and the Ubuntu 9.04 Linux operating system. The machine is physically located in the Computing Science Instructional Labs (CSIL) at Simon Fraser University. Our programs are implemented in C++. 5.1 The Test Dataset To the best of our knowledge, there are no widely accepted datasets available which follow the data schema specified in Section 4.1. This is mainly because previously data cubing and emerging pattern mining work separately with entirely heterogeneous data sources. We generated five relational base tables containing 100, 1,000, 2,500, 4,000 and 8,124 tuples respectively. 
4.4 Summary

In this chapter, we formulated the data model construction problem in the first section. We then described our processing framework, i.e., the integration of data cubing and emerging pattern mining at both the data level and the algorithm level; both levels of integration are important. Lastly, we concluded the chapter by addressing some issues related to the actual implementations.

Chapter 5  Experimental Results and Performance Study

In this chapter, we present a comparative empirical evaluation of the algorithms developed and implemented in this capstone project. The experiments were run on a machine with an Intel Core 2 Duo CPU, 3.0 GB of main memory, and the Ubuntu 9.04 Linux operating system. The machine is physically located in the Computing Science Instructional Labs (CSIL) at Simon Fraser University. Our programs are implemented in C++.

5.1 The Test Dataset

To the best of our knowledge, there are no widely accepted datasets that follow the data schema specified in Section 4.1, mainly because data cubing and emerging pattern mining have previously worked separately, with entirely heterogeneous data sources. We generated five relational base tables containing 100, 1,000, 2,500, 4,000 and 8,124 tuples respectively. All base tables have four standard dimensions, each with a cardinality of four. In comparison, the experiments for the text cube in [17] use a test dataset with far fewer tuples (2,013) but more dimensions (14). We did intend to test our programs on datasets with eight or more dimensions, but doing so easily produced tens of thousands of files and exhausted our disk quota on the Ubuntu system.

For transactional data, we adopted the dataset mushroom.dat from the Frequent Itemset Mining Implementations Repository (FIMI) [9]. It contains 8,124 transactions with an average length of 30 items, and the total number of items in mushroom.dat is 113. We synthesized our normalized datasets by appending rows of mushroom.dat to the tuples in each of the five base tables. The synthesis was not done randomly. Instead, we simulate a real multidimensional text database in which tuples sharing more dimension attributes tend to have more similar transactions. Therefore, in the data integration process, we first clustered similar transactions into groups and then assigned the transactions within a group to tuples having overlapping dimension attributes. Conversely, tuples sharing few dimension attributes were appended with dissimilar transactions drawn from different clusters. Table 5.1 shows the sizes of the five synthetic datasets used in our experiments.

Num. of Tuples    Size (KB)
100               6.9
1,000             69.2
2,500             174.1
4,000             278.5
8,124             565.7

Table 5.1: Sizes of synthetic datasets for experiments
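The synthesis procedure can be pictured with the following C++ sketch, which greedily clusters transactions by Jaccard similarity and then hands out transactions cluster by cluster to tuples sorted by their dimension attributes, so that tuples sharing attribute values receive similar transactions. This is only an approximation of the process described above; the similarity threshold (0.6) and all helper names are assumptions made for illustration, not the actual scripts used to build our datasets.

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

using Transaction = std::set<int>;

// Jaccard similarity between two transactions.
static double jaccard(const Transaction& a, const Transaction& b) {
    std::vector<int> inter;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(inter));
    std::size_t uni = a.size() + b.size() - inter.size();
    return uni == 0 ? 0.0 : static_cast<double>(inter.size()) / uni;
}

// Greedy clustering: a transaction joins the first cluster whose seed is similar enough.
static std::vector<std::vector<Transaction>> clusterTransactions(
        const std::vector<Transaction>& txns, double minSim) {
    std::vector<std::vector<Transaction>> clusters;
    for (const Transaction& t : txns) {
        bool placed = false;
        for (auto& c : clusters)
            if (jaccard(c.front(), t) >= minSim) { c.push_back(t); placed = true; break; }
        if (!placed) clusters.push_back({t});
    }
    return clusters;
}

struct BaseTuple { std::vector<int> dims; Transaction txn; };

// Sort tuples so that tuples sharing dimension attribute values are adjacent, then
// hand out transactions cluster by cluster, so similar tuples get similar transactions.
void synthesize(std::vector<BaseTuple>& table, const std::vector<Transaction>& txns) {
    std::sort(table.begin(), table.end(),
              [](const BaseTuple& x, const BaseTuple& y) { return x.dims < y.dims; });
    std::vector<Transaction> ordered;
    for (const auto& c : clusterTransactions(txns, 0.6))   // 0.6 is an assumed threshold
        ordered.insert(ordered.end(), c.begin(), c.end());
    if (ordered.empty()) return;
    for (std::size_t i = 0; i < table.size(); ++i)
        table[i].txn = ordered[i % ordered.size()];
}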
5.2 Comparative Performance Study and Analysis

We test three implementations (BUC alone, BUC with the DPMiner, and BUC with the Border-Differential) on all five synthetic datasets described above. Each test case was run ten times, and the average running time is reported.

5.2.1 Evaluating the BUC Implementation

Figure 5.1 presents the test results of our BUC implementation. In a pure BUC cubing test, the transaction items are still included in the input data, but the program does not process them, as they are not related to pure data cubing. Including the transaction data certainly adds computational overhead, since during execution the data is moved around, both in memory and on disk, as entire tuples. Nevertheless, the test results show that our BUC implementation remains robust under such conditions. As can be seen from Figure 5.1, the running time of BUC grows linearly as the size of the input data increases. This behaviour is mainly attributable to the (fixed-range) integer representation of the data, which makes it possible to use the linear-time counting sort algorithm; partitioning and sorting are known to be the most time-consuming steps in a cubing process [5]. The implementation also achieves a good compression ratio in cube size when the data size is relatively small. However, as the number of tuples in the synthetic dataset grows, cells containing identical aggregates become rare, so optimization heuristics such as writeAncestors become of little use.

[Figure 5.1: Running time and cube size of our BUC implementation. Panel (a): running time (seconds) vs. number of tuples, growing roughly linearly from 0.11 s at 100 tuples to 6.56 s at 8,124 tuples. Panel (b): number of cells created vs. number of tuples, growing from 211 cells at 100 tuples to 613 at 1,000 tuples and levelling at 624 from 2,500 tuples onward.]

5.2.2 Comparing Border-Differential with DPMiner

We compared the two integration algorithms, (1) BUC with the DPMiner and (2) BUC with the Border-Differential, with respect to both running time and the size of the output cubes. The running-time comparison is illustrated in Figure 5.2, and the complete experimental results are summarized in Table 5.2. For the test cases with 100-tuple input, the minimal support threshold θ in the target classes is set to 3; for larger test data (1,000 tuples and more), 3 is no longer a reasonable value, so we take the square root of the number of tuples as the minimal support threshold. For example, the threshold for 2,500 tuples is 50. The maximal support threshold δ is set to 1 for all test cases.

When compared on running time (the third column in Table 5.2), the first algorithm is faster than the second on every dataset except the 1,000-tuple one (where it is less than 1% slower). For the 2,500- and 4,000-tuple datasets, the Border-Differential takes about 1.3 and 1.7 times as long as the DPMiner, respectively. On the 8,124-tuple dataset, the Border-Differential failed to complete within 120 seconds; we exclude this case, as it does not affect the comparison. The running time of the algorithm integrating the DPMiner is close to linear, while the MAX SNP-hard [22] Border-Differential proved to be much slower in practice.

[Figure 5.2: Comparing running time of the two integration algorithms. Panel (a): BUC + DPMiner, running time (seconds) vs. number of tuples; panel (b): BUC + Border-Differential. The plotted values appear in the Time column of Table 5.2.]

When compared on cube size (the fifth column in Table 5.2), the Border-Differential appears to perform better than the DPMiner, but this is actually not the case, for two reasons. First, the Border-Differential does not generate the actual patterns but only a border description of them, whereas the DPMiner generates the full representations of the patterns (i.e., the actual items). The border representation is more succinct but much less comprehensible, and deriving the actual patterns from it adds further computational cost. For contrast pattern based document analysis, users want to see the actual text terms, so in that sense the DPMiner is the preferable choice. Second, in the output cubes, the Border-Differential generates more empty patterns than the DPMiner, mainly because the max-pattern mining procedure does not return enough max-patterns, or because the borders formed by the max-patterns differ from each other too much to derive a valid differential border. Lowering the minimal support threshold θ could help: as indicated by the first row of Table 5.2, when θ = 3 (a very small threshold compared to the others), the Border-Differential produced cubes of almost the same size as the DPMiner.

Tuples    Threshold    Time (sec)            Cells    Avg. Cell Size (KB)
                       DPMiner vs. B-D                DPMiner vs. B-D
100       3            3.8 vs. 9.6           210      7.2 vs. 7.1
1,000     30           18.6 vs. 17.5         613      8.2 vs. 2.4
2,500     50           30.5 vs. 41.1         624      13.4 vs. 1.8
4,000     64           43.6 vs. 73.6         624      18.8 vs. 3.5
8,124     90           66.9 vs. N/A          624      20.6 vs. N/A

Table 5.2: The complete experimental results.
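To show how the observations in this subsection follow from Table 5.2, the small, self-contained C++ snippet below hard-codes the table rows and recomputes the running-time ratios (for example, 41.1/30.5 ≈ 1.3 for 2,500 tuples) and an estimate of the total output size per algorithm (number of cells times average cell size). It is only a reading aid for the table, not part of our implementation.

#include <cstdio>

// Rows of Table 5.2: tuples, threshold, number of cells, then time (s) and
// average cell size (KB) for BUC+DPMiner vs. BUC+Border-Differential.
// A negative value stands for the N/A entries (Border-Differential timed out).
struct Row { int tuples, theta, cells; double tDp, tBd, szDp, szBd; };

int main() {
    const Row rows[] = {
        { 100,  3, 210,  3.8,  9.6,  7.2, 7.1},
        {1000, 30, 613, 18.6, 17.5,  8.2, 2.4},
        {2500, 50, 624, 30.5, 41.1, 13.4, 1.8},
        {4000, 64, 624, 43.6, 73.6, 18.8, 3.5},
        {8124, 90, 624, 66.9, -1.0, 20.6, -1.0},
    };
    for (const Row& r : rows) {
        if (r.tBd < 0) {
            std::printf("%5d tuples: Border-Differential did not finish\n", r.tuples);
            continue;
        }
        // e.g. for 2,500 tuples: 41.1 / 30.5 is about 1.35, matching "1.3 times" above.
        std::printf("%5d tuples: B-D/DPMiner time ratio %.2f, "
                    "total cube size %.0f KB vs. %.0f KB\n",
                    r.tuples, r.tBd / r.tDp, r.cells * r.szDp, r.cells * r.szBd);
    }
    return 0;
}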
5.3 Summary

In light of the above, the feasibility of integrating data cubing (BUC) with emerging pattern mining has been demonstrated by a series of comparative experiments. The integration of BUC with the DPMiner is efficient and robust on input data of reasonably large size. Moreover, despite its larger cube size, the DPMiner is able to find more emerging patterns under a large support threshold and to present them in an intelligible way. Thus, the DPMiner is a better choice than the Border-Differential for building our integrated data model for contrast pattern based document analysis. On the other hand, the results would be more convincing if an ideal dataset, collected directly and entirely from real-life multidimensional text databases, could be used in our performance study.

Chapter 6  Conclusions

6.1 Summary of the Project

It has been shown that OLAP techniques and data cubes are widely applicable to the analysis of documents stored in multidimensional text databases [6, 17, 24, 26]. However, no previous work has addressed a data mining problem related to multidimensional text data. In this work, we proposed an OLAP-style, contrast pattern based document analysis and adopted the emerging pattern [7], an important class of contrast patterns, to study this problem. In this capstone project, we developed algorithms for a novel data-cube-based model to support contrast pattern based document analysis. We implemented this model by integrating a data cubing algorithm, BUC [5], with two emerging pattern mining algorithms, the Border-Differential [7] and the DPMiner [16]. Our empirical evaluations showed that the DPMiner is preferable to the Border-Differential for its seamless, effective, efficient and robust integration with BUC. The OLAP query answering techniques (point queries, sub-cube queries and top-k queries) developed in [6, 17, 24, 26] can be applied directly to analyze documents, thanks to the structural similarity between these cube-based data models.

6.2 Limitations and Future Work

This work could be further extended to fully address the non-trivial document analysis problem. One limitation of our model construction is that, despite the two optimizations used in the algorithm, the cube size is still not as small as it could be. Meanwhile, it is costly to invoke the pattern mining procedure once for every single cell. We therefore propose the following ideas for future improvements.

First, it is possible to compress the data cube by introducing more optimization heuristics, such as incremental pattern mining. When materializing an ancestor cell, it is sometimes unnecessary to gather the target class and the background class and find patterns from scratch; instead, one can take the patterns of its descendant cells and test their support against the ancestor's target class to see whether they still exceed the support threshold. Besides, instead of full data cubes, we can construct iceberg cubes, partially materialized cubes in which cells aggregating few documents (fewer than a threshold) are excluded.
Second, we can explore other data cubing techniques, such as Star-Cubing [23], and use them to build the cube in our model construction process. BUC is considered the most efficient cubing algorithm for computing iceberg cubes, but it would still be interesting to see whether other algorithms can do better in our setting. Last but not least, it is also possible to employ advanced cubing techniques to achieve a higher level of summarization of the text data aggregated in the cube. Such techniques include the Quotient Cube [14] and the QC-tree [15], both built on BUC, to compress and summarize cells. The duplicate checking idea described in Chapter 4 is similar to, but achieves less compression than, what the Quotient Cube can do.

Bibliography

[1] Sameet Agarwal et al. On the Computation of Multidimensional Aggregates. In VLDB '96: Proceedings of the 22nd International Conference on Very Large Data Bases, pages 506–521, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.
[2] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining Association Rules Between Sets of Items in Large Databases. In SIGMOD '93: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, New York, NY, USA, 1993. ACM.
[3] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In VLDB '94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 487–499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
[4] James Bailey, Thomas Manoukian, and Kotagiri Ramamohanarao. Fast Algorithms for Mining Emerging Patterns. In PKDD '02: Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, pages 39–50, London, UK, 2002. Springer-Verlag.
[5] Kevin Beyer and Raghu Ramakrishnan. Bottom-up Computation of Sparse and Iceberg CUBE. In SIGMOD '99: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 359–370, New York, NY, USA, 1999. ACM.
[6] Bolin Ding et al. TopCells: Keyword-based Search of Top-k Aggregated Documents in Text Cube. In ICDE '10: Proceedings of the 26th International Conference on Data Engineering, Long Beach, CA, USA, 2010. IEEE.
[7] Guozhu Dong and Jinyan Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. In KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 43–52, New York, NY, USA, 1999. ACM.
[8] Guozhu Dong and Jinyan Li. Mining Border Descriptions of Emerging Patterns from Dataset Pairs. Knowledge and Information Systems, 8(2):178–202, 2005.
[9] Bart Goethals et al. Frequent Itemset Mining Implementations Repository. Website, 2003. http://fimi.cs.helsinki.fi/data/.
[10] Jim Gray et al. Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-tab, and Sub-totals. Data Mining and Knowledge Discovery, 1(1):29–53, 1997.
[11] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd edition, 2006.
[12] Jiawei Han, Jian Pei, and Yiwen Yin. Mining Frequent Patterns Without Candidate Generation. In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 1–12, New York, NY, USA, 2000. ACM.
[13] William Inmon. What Is A Data Warehouse, 1995.
[14] Laks V. S. Lakshmanan, Jian Pei, and Jiawei Han. Quotient Cube: How to Summarize the Semantics of A Data Cube. In VLDB '02: Proceedings of the 28th International Conference on Very Large Data Bases, pages 778–789. VLDB Endowment, 2002.
[15] Laks V. S. Lakshmanan, Jian Pei, and Yan Zhao. QC-Trees: An Efficient Summary Structure for Semantic OLAP. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 64–75, New York, NY, USA, 2003. ACM.
[16] Jinyan Li, Guimei Liu, and Limsoon Wong. Mining Statistically Important Equivalence Classes and Delta-discriminative Emerging Patterns. In KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 430–439, New York, NY, USA, 2007. ACM.
[17] Cindy Xide Lin et al. Text Cube: Computing IR Measures for Multidimensional Text Database Analysis. In ICDM '08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 905–910, Washington, DC, USA, 2008. IEEE Computer Society.
[18] Guimei Liu, Jinyan Li, and Limsoon Wong. A New Concise Representation of Frequent Itemsets Using Generators and A Positive Border. Knowledge and Information Systems, 17(1):35–56, 2008.
[19] Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[20] Sébastien Nedjar, Alain Casali, Rosine Cicchetti, and Lotfi Lakhal. Emerging Cubes for Trends Analysis in OLAP Databases. Lecture Notes in Computer Science, 4654:135–144, 2007.
[21] Jian Pei. Pattern-Growth Methods for Frequent Pattern Mining. PhD thesis, Simon Fraser University, 2002.
[22] Lusheng Wang, Hao Zhao, Guozhu Dong, and Jianping Li. On the Complexity of Finding Emerging Patterns. Theoretical Computer Science, 335(1):15–27, 2005.
[23] Dong Xin et al. Star-Cubing: Computing Iceberg Cubes by Top-down and Bottom-up Integration. In VLDB '03: Proceedings of the 29th International Conference on Very Large Data Bases, pages 476–487. VLDB Endowment, 2003.
[24] Yintao Yu et al. iNextCube: Information Network-Enhanced Text Cube. Proceedings of the VLDB Endowment, 2(2):1622–1625, 2009.
[25] Xinghuo Zeng, Jian Pei, et al. PADS: A Simple Yet Effective Pattern-Aware Dynamic Search Method for Fast Maximal Frequent Pattern Mining. Knowledge and Information Systems, 20(3):375–391, 2009.
[26] Duo Zhang et al. Topic Modeling for OLAP on Multidimensional Text Databases: Topic Cube and Its Applications. Statistical Analysis and Data Mining, 2(5-6):378–395, 2009.
[27] Yan Zhao. Quotient Cube and QC-Tree: Efficient Summarizations for Semantic OLAP. Master's thesis, The University of British Columbia, 2003.
[28] Yihong Zhao, Prasad M. Deshpande, and Jeffrey F. Naughton. An Array-based Algorithm for Simultaneous Multidimensional Aggregates. In SIGMOD '97: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 159–170, New York, NY, USA, 1997. ACM.

Index

Apex cuboid
Background class, or negative class
Base cell
Base table
Border
Border-Differential algorithm
BUC, the Bottom-Up Computation
Closed pattern
Data cube
Data cubing
Data warehouse
Delta-discriminative equivalent class
DPMiner, the Discriminative Pattern Miner
Emerging pattern mining
Equivalent class
FP-Growth algorithm
FP-tree
Frequent pattern mining
Generator
Maximal frequent pattern
Multidimensional text database
OLAP, Online Analytical Processing
PADS, the Pattern-Aware Dynamic Search
Target class, or positive class
Transaction