Università Ca’ Foscari di Venezia
Dipartimento di Informatica
Dottorato di Ricerca in Informatica
Ph.D. Thesis: TD-2006-4
Distributed and Stream Data Mining
Algorithms for Frequent Pattern Discovery
Claudio Silvestri
Supervisor
Prof. Salvatore Orlando
PhD Coordinator
Prof. Simonetta Balsamo
Author’s Web Page: http://www.dsi.unive.it/~claudio
Author’s e-mail: [email protected]
Author’s address:
Dipartimento di Informatica
Università Ca’ Foscari di Venezia
Via Torino, 155
30172 Venezia Mestre – Italia
tel. +39 041 2348411
fax. +39 041 2348419
web: http://www.dsi.unive.it
To my wife
Abstract
The use of distributed systems is continuously spreading in several application
domains. Extracting valuable knowledge from raw data produced by distributed
parties, in order to produce a unified global model, may present various challenges
related to the huge amount of managed data as well as to their physical location and
ownership. When data are continuously produced (streams) and their analysis must
be performed in real time, communication costs and resource usage are issues that
require careful attention in order to run the computation in the optimal location.
In this thesis, we examine in detail the problems related to Frequent Pattern
Mining (FPM) on distributed and stream data, and present a general framework for
adapting an exact FPM algorithm to a distributed or streaming context. The FPM
problems we consider are Frequent Itemset Mining (FIM) and Frequent Sequence
Mining (FSM). In the first case, the input data are sets of items and the frequent
patterns are those included in at least a user-specified number of input sets. The
second consists in finding frequent sequential patterns in a database of time-stamped
events. Since the proposed framework uses (exact) frequent pattern mining algorithms
as the building blocks of the approximate distributed/stream algorithms, we also
describe two efficient algorithms for FIM and FSM: DCI, introduced by Orlando et
al., and CCSM, which is one of the original contributions of this thesis.
The resulting algorithms for distributed and stream FIM have been tested with
real-world and synthetic datasets; they efficiently find a good approximation of the
exact results and scale gracefully. The framework for FSM is almost identical, but
has not been tested yet. The few differences are highlighted in the conclusion chapter.
Sommario
The adoption of distributed systems is continuously growing in several application
fields, and the extraction of non-obvious correlations from the raw data they produce
can be strategic for the organizations involved. This kind of operation is generally
non-trivial and, when the data are distributed, presents additional difficulties related
both to the amount of data involved and to their ownership and physical location.
When data are produced as continuous flows (streams) and need to be analyzed in
real time, the optimization of communication costs and of the required resources are
aspects that must be carefully taken into account.
This thesis analyzes in detail the problems related to frequent pattern mining
(FPM) on distributed data and data streams. In particular, it presents a general
method for obtaining, starting from any exact FPM algorithm, an approximate
algorithm for FPM on distributed data and data streams. The kinds of patterns
considered are frequent itemsets (FIM) and frequent sequences (FSM). In the first
case, the input data are sets of elements (transactions) and the frequent patterns are
in turn sets contained in at least a user-specified number of transactions. The second
consists instead in searching for frequent sequential patterns in a collection of
sequences of events associated with precise time instants. Since the proposed method
uses exact frequent pattern mining algorithms as the building blocks of the algorithms
for distributed data and data streams, two efficient algorithms for FIM and FSM are
also described: DCI, introduced by Orlando et al., and CCSM, which is one of the
original contributions of this thesis.
The algorithms obtained by applying the proposed method have been run on both
real and synthetic data in order to evaluate their effectiveness. The FIM algorithms
proved to be scalable and able to efficiently extract a good approximation of the
exact solution. The framework for FSM is almost identical, but has not yet been
verified experimentally. The few differences are highlighted in the final chapter.
Acknowledgments
I would like to thank Prof. Salvatore Orlando for his guidance and support during
my Ph.D. studies. I am also grateful to him for the opportunity to collaborate with
the ISTI-CNR High Performance Computing Lab. In this context, I would like to
thank Raffaele Perego, Fabrizio Silvestri, and Claudio Lucchese who co-authored
some of the papers I published and, in several ways, helped me in improving the
quality of my work.
I thank my referees, Prof. Hillol Kargupta and Prof. Rosa Meo, for their attention in reading this thesis and their valuable comments.
Most of this work has been carried out at the Dipartimento di Informatica,
Università Ca’ Foscari di Venezia. I would like to thank all the faculty and personnel
for their support and for making the department a friendly place for doing research.
Special thanks to Moreno and Matteo, for the long discussions on free software and
other interesting subjects, and to all the other (ex-)Ph.D. students for the pleasant
time spent together: Chiara, Claudio, Damiano, Fabrizio, Francesco, Giulio, Marco,
Matteo, Massimiliano, Ombretta, Paolo, Silvia, and Valentino.
In this last year I have been a guest at the Dipartimento di Informatica e Comunicazione, Università degli Studi di Milano. I am grateful to Maria Luisa Damiani,
for the opportunity of collaboration, and to the people working at the DB&SEC
Lab, for the friendly working environment.
This work was partially supported by the PRIN’04 Research Project entitled
”GeoPKDD - Geographic Privacy-aware Knowledge Discovery and Delivery”.
Finally, I would like to thank my extended family, who has never lost faith in
this long-term project, and all of my friends.
Contents

1 Introduction
  1.1 Data distribution
  1.2 Data evolution
  1.3 Applications
  1.4 Association Rules Mining
    1.4.1 Frequent Itemsets Mining
    1.4.2 Frequent Sequence Mining
    1.4.3 Taxonomy of Algorithms
  1.5 Contributions
  1.6 Thesis overview

I First Part

2 Frequent Itemset Mining
  2.1 The problem
    2.1.1 Related works
  2.2 DCI
    2.2.1 Candidate generation
    2.2.2 Counting phase
    2.2.3 Intersection phase
  2.3 Conclusions

3 Frequent Sequence Mining
  3.1 Introduction
  3.2 Sequential patterns mining
    3.2.1 Problem statement
    3.2.2 Apriori property and constraints
    3.2.3 Contiguous sequences
    3.2.4 Constraints enforcement
  3.3 GSP
    3.3.1 Candidate generation
    3.3.2 Counting
  3.4 SPADE
    3.4.1 Candidate generation
    3.4.2 Candidate support check
    3.4.3 cSPADE: managing constraints
  3.5 CCSM
    3.5.1 Overview
    3.5.2 The CCSM algorithm
    3.5.3 Experimental evaluation
  3.6 Related works
  3.7 Conclusions

II Second Part

4 Distributed datasets
  4.1 Introduction
    4.1.1 Frequent itemset mining
  4.2 Approximated distributed frequent itemset mining
    4.2.1 Overview
    4.2.2 The Distributed Partition algorithm
    4.2.3 The APRed algorithm
    4.2.4 The APInterp algorithm
    4.2.5 Experimental evaluation
  4.3 Conclusions

5 Streaming data
  5.1 Streaming data
    5.1.1 Issues
  5.2 Frequent items
    5.2.1 Problem
    5.2.2 Count-based algorithms
    5.2.3 Sketch-based algorithms
  5.3 Frequent itemsets
    5.3.1 Related work
    5.3.2 The APStream algorithm
  5.4 Conclusions

III Conclusions

A Approximation assessment

Bibliography
List of Figures

1.1 Incremental data mining
1.2 Data stream mining
1.3 Transaction dataset
1.4 Sequence dataset
1.5 Effect of maxGap constraint
1.6 Taxonomy of algorithms for frequent pattern mining
2.1 Set of itemsets compressed data structure
2.2 Example of cache usage
3.1 GSP candidate generation
3.2 CCSM candidate generation
3.3 Example of cache usage
3.4 CCSM idlist reuse
3.5 Number of intersections for different intersection methods
3.6 Number of frequent sequences in datasets CS11 and CS21
3.7 Execution times of CCSM and cSPADE - variable maxGap value
3.8 Execution times of CCSM and cSPADE - fixed maxGap value
4.1 Similarity of APRed approximate results
4.2 Number of spurious patterns as a function of the reduction factor r
4.3 fpSim of the APInterp results
4.4 Comparison of Distributed One-pass Partition vs. APInterp
4.5 Speedup for two of the experimental datasets
5.1 Similarity and ASR as a function of memory/transactions/hash entries
5.2 Similarity and ASR as a function of stream length
C.3 Distributed stream mining framework

List of Tables

1.1 Taxonomy of data mining environments
4.1 Datasets used in APRed experimental evaluation
4.2 Datasets used in APInterp experimental evaluation
4.3 Test results for APRed
4.4 Accuracy indicators for APInterp results
5.1 Sample supports and reduction ratios
5.2 Datasets used in experimental evaluation
1 Introduction
Data mining is, informally, the extraction of knowledge hidden in huge amounts of
data. However, if we are interested in a more detailed definition, several different
ones do exist [23]. Depending on the application domain (and on the author), data
mining could just mean the extraction of particular aggregate information from
somehow preprocessed data, or the whole process beginning with data cleaning and
integration and ending with the visual representation of results. From now on, we
will reserve the term Data Mining for the first meaning, using the more general KDD
(Knowledge Discovery in Databases) for the whole workflow needed to apply data
mining algorithms to real-world problems.
The kind of knowledge we are interested in, together with the organization of the
input data and the criteria used to discriminate between useful and useless information,
contributes to characterizing a specific data mining problem and its possible algorithmic
solutions. Common data mining tasks are the classification of new objects
according to a scheme learned from examples, the partitioning of a set of objects
into homogeneous subsets, and the extraction of association rules and numerical rules
from a database.
In several interesting application frameworks, such as wireless network analysis
and fraud detection, data are naturally distributed among several entities and/or
evolve continuously. In all of the above-indicated data mining tasks, dealing with
either of these peculiarities poses additional challenges. In this thesis we will focus
on the distribution and evolution issues related to the extraction of Association Rules
from transactional databases (ARM), one of the most important and common data
mining tasks, both for the immediate applicability of the knowledge extracted by this
kind of analysis, and for the wide range of application fields where it can be used.
Association Rules are rules relating the occurrence of distinct subsets of items in the
same set, e.g. "65% of market baskets containing steak and salad also contain
wine", or in the same collection of sets, e.g. "50% of customers that buy a CD player
will later buy CDs". In particular, we will concentrate our attention on the most
computationally expensive phase of ARM, the mining of frequent patterns from
distributed and stream data. These patterns can be either frequent itemsets (FIM)
or frequent sequences (FSM), i.e., respectively, subsets contained in at least a user-specified
number of input sets and subsequences contained in at least a user-specified number of
input sequences. Since we will use frequent pattern mining algorithms for static and
non-evolving datasets as the building block for our approximate algorithms, to be
exploited on distributed and stream data, we will also describe efficient algorithms
for FIM and FSM: DCI, introduced by Orlando et al. in [44], and CCSM, which is
one of the original contributions of this thesis.
This chapter introduces, without focusing on any particular data mining task, the
general issues concerning the evolution of data, and their distribution/partitioning
among several entities. Then it quickly introduces ARM and its core FIM/FSM
phase in centralized and non-evolving datasets. Both will be discussed in more
detail in the first part of the thesis, since they constitute the foundation for the
distributed and streaming FIM/FSM problems. We also present a taxonomy of
FIM and FSM algorithms, which will be useful in understanding the reasons that
led us to choose DCI and CCSM as the building blocks for our
distributed and stream algorithms. The chapter concludes with a summary of the
achievements of our research, and a description of the structure of the rest of the
thesis.
1.1 Data distribution
Reasons leading to data distribution. In many real systems, data are naturally
distributed, usually due to multiple ownership or to a geographical distribution of
the processes that produce the data. The logistic organization of the entities involved in
the data collection process, performance and storage constraints, as well as privacy
and company interests, may lead to the choice of using separate databases, instead
of a centralized one accessed by several remote locations.
The sales points of a large chain are a typical example: there is no need for a
central database to perform ordinary sale activities, and using one would make
the operations of each shop dependent on the reliability and bandwidth of the
communication infrastructure. Gathering all data at a single site, after they have been
produced, would be subject to the same ownership/privacy issues as using a centralized
database.
In other cases, data are produced locally in large volumes and immediately moved
to other storage and analysis locations, due to the impossibility of storing or processing
them with the resources available at a single site, as in the case of satellite image
analysis or high-energy physics experiments. In all of these cases, performing a data
mining task means coordinating the sites in a mix of partial data movement and
exchange of local knowledge, in order to obtain the required global model.
Homogeneous vs. heterogeneous data. Problems that are seemingly similar
may need significantly different solutions, if considered in different communication and
data localization settings. Data can be either homogeneous or heterogeneous. If data
are represented by tuples, in the first case all data present the same dimensions,
while in the second each node has its own schema. Let us consider two
examples: the sales data of a chain of shops and the personal data collected about us
by different departments of public administration. Sales data contain a representation
of the sale transactions and are maintained by the shop where the items were bought.
In this case, data are homogeneous: data collected at different shops are similar, but
related to different transactions. Personal data are also maintained at different
sites: the register office manages birth data, the tax register owns tax data, another
register collects data about our cars. In this case, data are heterogeneous, since for
each individual each register maintains a different kind of data.
Data localization is a key factor in characterizing data mining problems. Most
classical data mining algorithms expect all data to be grouped in a unique data
repository. Each data mining task presents different issues when it is considered
in a distributed environment, instead of a centralized one. However, it is possible
to identify a general requirement common to most distributed data mining system
architectures and tasks: careful attention should be paid to both communication
and computation resources, in order to use them in a nearly optimal way. Data
distribution issues and algorithms will be discussed in more detail in chapter 4,
with a focus on frequent pattern mining algorithms.
A good survey on distributed data mining algorithms and applications is [48].
1.2 Data evolution
In different application contexts, data mining can be applied either to past data,
as a one-time task, or repeatedly to evolving datasets. The classical data mining
algorithms refer to the first case: the full dataset is available and there will be no
data modification during the computation or between two consecutive computations.
This is enough to understand a phenomenon and make plans for the future in most
cases.
In several applications, like wireless network analysis, intrusion detection, stock
market analysis, sensor network data analysis, and, in general, any setting in which
all available information should be used to make an immediate decision, the approach
based on finite, statically stored datasets may not be satisfactory. These
cases demand new classes of algorithms, able to cope with the evolution of data.
In particular, two issues need to be addressed: the complexity of recomputing everything from scratch and the potential infiniteness of data. In case only the first
issue is present, the setting is referred to as Incremental/Evolving Data Mining;
otherwise, it is indicated as Stream Data Mining.
The presence and kind of data evolution is another key factor in characterizing
data mining problems.
Data localization:
  Centralized: a single entity can access all the data.
  Distributed: each node can access just a part of the data, and ...
    homogeneous: ... data related to the same entity (e.g. people) are owned by just one node.
    heterogeneous: ... data related to the same entity (e.g. people) may be spread among several nodes.

Data evolution:
  Static: data are definitively stored and invariable (e.g. related to some past and concluded event).
  Incremental: new data are inserted and access to past data is possible (e.g. related to an ongoing event).
  Evolving: the dataset is modified with updates, insertions or deletions, and access to past data is possible.
  Streaming: data arrive continuously and for an indefinite time; access to past data is restricted to a limited part of them or to summaries.

Table 1.1: Taxonomy of data mining environments.
Incremental and Evolving Data Mining.
In incremental data mining, new data are repeatedly inserted into the dataset. Some
algorithms also take care of deletions or modifications of previous data; this case is
referred to as evolving data mining. In a typical situation, we have a dataset D and
the results of the required data mining task on D. Then D is modified and the
system is asked for a new result set. Obviously, a way to obtain the new result is to
recompute everything from scratch, and this is possible since all past data are accessible.
However, this implies a computation time that in some cases may clash with near
real-time system response requirements, whereas in other cases it is just a waste of
resources, especially when the dataset gets bigger. Incremental/Evolving data mining
algorithms, instead, are able to update the solution according to dataset updates,
modifying just the part of the result set that is affected by the modifications of
the dataset. A fitting example could concern the sales data of a supermarket: at the
end of each day, the daily update is performed. The overall amount of data is still
reasonable for an ordinary computation; however, there is no point in reprocessing
several years of past sales data. A better approach would be to consider the past result
and the new data, and to query the past data only when a modification of the result
is expected. Figure 1.1 summarizes the simultaneous evolution of data and results
after each mining step in incremental data mining.

Figure 1.1: Incremental data mining: previous result availability allows for a reduction of the necessary computation.
Stream Data Mining.
An increasing number of applications require support for processing data that arrive
continuously and in huge volumes. This setting is referred to as Stream Data Mining.
The main difference with Incremental/Evolving Data Mining is the large and potentially
infinite amount of data, but the continuity aspect also deserves some attention.
The first consequence is that Stream Data Mining algorithms, since they deal with
unbounded data, cannot access every item received in the past, but just a limited
subset of them. In case of a sustained arrival rate, this means that each received item
can be read only a few times, often just once.
An algorithm dealing with data streams should require an amount of memory
that is not related to the (potentially infinite) amount of data analyzed. At the same
time, it should be able to cope with the data arrival rate, returning, if necessary, an
approximate solution in order to keep up with the stream.
Building a model based on every received item only when the user issues the
query is simply impossible in most cases, either for response time or for resource
constraints. Even the apparently trivial task of exactly counting the number of
items received so far potentially requires unbounded memory, since after N items
we need log2(N) bits to represent the counter. A solution that requires O(log N)
memory is, however, considered suitable for a streaming context, since for real data
streams "infinite" actually means "really long". However, if we slightly extend the
problem, asking for the number of distinct items, an exact answer is impossible
without using O(N) memory. For this reason, in data stream mining, approximate
algorithms are quite common. Another way to reduce the resource requirements is to
restrict the problem to a user-specified temporal window, e.g. the last week. This
approach is called the Window Model, whereas the previously introduced one is the
Landmark Model. Figure 1.2 summarizes these two different approaches.

Figure 1.2: Data stream mining: data are potentially infinite and accessible just on arrival. Results can refer to the whole stream or to a limited part of it.
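To make the difference between the two models concrete, the following minimal Python sketch (class names and the toy stream are ours, not part of the thesis) counts item frequencies over the whole stream, as in the landmark model, and over the most recent arrivals only, as in the window model.

from collections import Counter, deque

class LandmarkCounter:
    """Counts item occurrences over the whole stream (landmark model).
    Memory grows with the number of distinct items, not with the stream
    length; each counter only needs O(log N) bits."""
    def __init__(self):
        self.counts = Counter()

    def update(self, item):
        self.counts[item] += 1

class SlidingWindowCounter:
    """Counts item occurrences over the last `window_size` arrivals only
    (window model): expired items are forgotten, bounding memory usage."""
    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.counts = Counter()

    def update(self, item):
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

# Tiny usage example on a toy stream.
stream = ["a", "b", "a", "c", "a", "b"]
lm, win = LandmarkCounter(), SlidingWindowCounter(window_size=3)
for it in stream:
    lm.update(it)
    win.update(it)
print(lm.counts)   # whole stream: a=3, b=2, c=1
print(win.counts)  # last 3 items only: c=1, a=1, b=1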
1.3 Applications
The issues encountered when mining data originated by distributed sources may
be related to the quality of received data, the high data arrival rate, the kind of
communication infrastructure available between data sources and data sinks, or
the need to avoid privacy breaches. Let us examine three practical cases of distributed
systems and practical motivations that may lead to the use of distributed data
mining algorithms for the analysis of data, instead of collecting and processing
everything in a central repository.
Geographically distributed Web-farm. Popular web sites generate a lot of traffic
from the server to the clients. A viable solution to ensure high availability and
throughput is to use several geographically distributed replicas and let each client
connect to the closest one (e.g. www.tucows.com). This approach, even if really
practical for system availability and network response time, makes the analysis of
data for user behavior and intrusion detection more complex. In fact, while using
a single web-farm all access logs are available at the same site, in this case they are
partitioned among several farms, sometimes connected by high-latency links. A naïve
solution is to collect all data in a centralized database, either periodically or in real
time, and in most cases it is the best solution, at least if the data arrival rate is low
or we are not interested in recent data. However, this is not satisfying when log
data are huge and real-time analysis is required, as for intrusion detection.
Sensor network. The same kind of problems may arise, even worse, when the
sources of data streams are sensors connected by a network. Quite often the
communication links with sensors have a reduced bandwidth, for example in the case of
seismic sensors placed in inhabited places, far from computation infrastructures.
Financial network. Furthermore, data centralization may be unfeasible when
confidential information is handled and must not be shared with unauthorized
subjects, in order to protect privacy rights or company interests. A classical example
concerns credit card fraud detection. Let us suppose that a group of banks is
interested in automatically detecting possible frauds; each participating entity is
interested in making the resulting model accurate, and based on as much data as
possible, but banks cannot communicate the transactions of their customers to other
banks.
In all these cases, even if for different reasons, collecting all raw data into a
repository before analyzing them is unfeasible, and distributed techniques are needed in
order to process, at least partially, the data in place.
1.4 Association Rules Mining
As we have seen in the previous section, dealing with evolving and distributed data
presents several issues, independently of the particular targeted data mining task.
However, each data mining task has its peculiarities, and the issues in different cases
are not really the same, but just similar and related to the same aspect. In order
to analyze more thoroughly the issues and possible solutions, we have to focus on a
particular task or group of tasks. We have decided, in this thesis, to concentrate our
attention on Association Rules Mining, and more precisely on its most computationally
expensive phase, the mining of frequent patterns in distributed datasets and
data streams, where these patterns can be either itemsets (FIM) or sequences (FSM).
In this section we will quickly introduce Association Rule Mining (ARM), one
of the most popular DM tasks [4, 18, 19, 54], both for the immediate applicability of
the knowledge extracted by this kind of analysis and for the wide range of application
fields where it can be applied, from medical symptoms developed in a patient to
objects sold in a commercial transaction.
Here our goal is just to quickly introduce this topic, and its computationally
challenging Frequent Pattern Mining subproblem, by limiting our attention to the
centralized case. A more detailed description of the problem can be found in
chapters 2 and 3 for the centralized subproblems, in chapter 4 for the distributed one,
and in chapter 5 for the stream case.
The essence of association rules is the analysis of the co-occurrence of facts in a
collection of sets of facts. If, for instance, the data represent the objects bought in
the same shopping cart by the customers of a supermarket, then the goal will be
finding rules relating the fact that a market basket contains an item with the fact
that another item has been bought at the same time. One of these rules could be
”people who buy item A also buy item C in conf % cases”, but also the more complex
”people who buy item A and item B also buy item C in conf % cases” where conf %
is the confidence of the rule, i.e. a measure of how much that rule can be trusted.
Another interestingness measure, frequently used in conjunction with confidence, is
the support of a rule, which is defined as the number of records in the database
that confirm the rule. Generally, the user specifies minimum thresholds for both, so
an interesting rule should have both a high support and a high confidence, i.e. it
should be based on a significant number of cases to be useful, and at the same time,
there should be few cases in which it is not valid.
The combined use of support and confidence is the measure of interestingness
most commonly adopted in the literature, but in some cases it can be misleading if the
user does not look carefully at the big picture. Consider the following example: both A
and B appear in 80% of the input data, and in 60% of the cases they appear in the same
transaction. The rule "A implies B" has support 60% and confidence 60/80 = 75%,
thus apparently this is a good rule. However, if we analyze the full context, we can
see that the confidence is lower than the support of B, hence the actual meaning
of this rule is that A negatively influences B. The usage of other interestingness
measures has been widely discussed. However, there is no clear winner, and the
choice depends on the specific application field.
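As a concrete illustration of the two measures, the following minimal Python sketch (function names and the toy dataset are ours, chosen only to reproduce the numbers of the example above) computes the support and confidence of a rule from a list of transactions.

def support(pattern, transactions):
    """Fraction of transactions containing all the items of `pattern`."""
    pattern = set(pattern)
    return sum(pattern <= set(t) for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent union consequent) / support(antecedent)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

# A 10-transaction toy dataset reproducing the numbers of the example:
# A and B each appear in 8 transactions (80%), together in 6 (60%).
transactions = (
    [{"A", "B"}] * 6 +      # both A and B
    [{"A"}] * 2 +           # A without B
    [{"B"}] * 2             # B without A
)
print(support({"A", "B"}, transactions))      # 0.6  -> rule support 60%
print(confidence({"A"}, {"B"}, transactions)) # 0.75 -> confidence 60/80
print(support({"B"}, transactions))           # 0.8  -> higher than the confidence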
Sequential rules (or temporal association rules) are an extension of association
rules which also consider sequential relationships. In this case, the input data are
sequences of sets of facts, and the rules have to deal with both co-occurrences and
”followed by” relationships. Continuing with the previous example about market
basket analysis (MBA), this means considering each transaction as related to a
customer, identified by a fidelity card or something similar. So each input sequence
is the shopping history of a customer and a rule could be ”people who buy item A
and item B at the same time will also buy item C later in conf % cases” or ”people
who buy item A followed by item B within one week will also buy item C later in
conf % cases”.
The extraction of both association rules and sequential rules from a database is
typically composed of two phases. First, it is necessary to find the so-called frequent
patterns, i.e. patterns that occur in a significant number of records. Once such
patterns are determined, the actual association rules can be derived in the form
of logical implications: X ⇒ Y , which reads whenever X occurs in a transaction
(sequence), most likely also Y will occur (later). The computationally intensive part
is the determination of frequent patterns, more precisely of frequent itemsets for
association rules and frequent sequences for sequential rules.
1.4.1 Frequent Itemsets Mining
The Frequent Itemsets Mining (FIM) problem consists in the discovery of subsets
that are common to at least a user-defined number of input sets. Figure 1.3 shows a
small dataset related to the previous MBA example. There are eight transactions,
each containing a variable number of distinct items. If the user-chosen minimum
support is three, then the pair "scanner and speaker" is a frequent pattern, whereas
"scanner and telephone" is not a frequent one. Obviously, any larger pattern containing
both a scanner and a telephone cannot be frequent. This fact is known as
the apriori principle and, expressed in a more formal way, states that a pattern can be
frequent only if all its subsets are frequent too.
Figure 1.3: Transaction dataset.
The computational complexity of the FIM problem derives from the exponential
size of its search space P(M ), i.e. the power set of M , where M is the set of items
contained in the various transactions of a dataset D. In the example in Figure 1.3,
there are 8 distinct items and the largest transaction contains four items; this leads
to the sum of binomial coefficients C(8,k) for k = 1..4, i.e. 8 + 28 + 56 + 70 = 162
possible patterns to examine, considering all transactions of maximal length, and 48
considering the actual transaction lengths. However, the number of distinct patterns
is 29 and the number of frequent patterns is even smaller, e.g., there are just 7 items
and 4 pairs occurring more than once, but only 4 items contained in more than two
transactions.
Clearly, the naïve approach consisting in generating all subsets of every transaction
and updating a set of counters would be extremely inefficient. A way to prune
the search space is to consider only those patterns whose subsets are all frequent.
The correctness of this approach derives from the apriori principle, which grants
that it is impossible for a discarded pattern to be frequent. The Apriori algorithm [6]
and other derived algorithms [2, 9, 11, 25, 49, 50, 44, 55, 68] exploit exactly this
pruning technique.
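The following Python sketch illustrates this level-wise, apriori-based search on a toy dataset; the helper functions and item names are ours and only mimic the spirit of Figure 1.3, they are not the pseudo-code used later in the thesis.

from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining: candidates of length k are built
    from frequent (k-1)-itemsets and pruned with the apriori principle."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # Level 1: frequent single items.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: s for c, s in count(items).items() if s >= min_support}
    result, k = dict(frequent), 2
    while frequent:
        prev = list(frequent)
        # Join step: unions of two frequent (k-1)-itemsets having k items...
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # ...prune step: keep only those whose (k-1)-subsets are all frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = {c: s for c, s in count(candidates).items()
                    if s >= min_support}
        result.update(frequent)
        k += 1
    return result

# Toy dataset loosely in the spirit of Figure 1.3 (item names are ours).
data = [{"scanner", "speaker", "cd"}, {"scanner", "speaker"},
        {"scanner", "speaker", "telephone"}, {"telephone", "cd"}]
print(apriori(data, min_support=3))  # {scanner, speaker} has support 3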
1.4.2 Frequent Sequence Mining
Sequential pattern mining (FSM) [7] represents an evolution of Frequent Itemsets
Mining, allowing also for the discovery of before-after relationships between subsets
of input data. The patterns we are looking for are sequences of sets, indicating that
the elements of a set occurred at the same time and before the elements contained
in the following sets. The ”occurs after” relationship is indicated with an arrow,
e.g. {A, B} → {B} indicates an occurrence of both items A and B followed by an
occurrence of item B. Clearly, the inclusion relationship is more complex than in the
case of subsets, so it needs to be defined. Here we informally introduce this concept,
which we will define formally in chapter 3. For now, we consider that a sequence pattern Z
is supported by an input sequence IS if Z can be obtained by removing items and
sets from IS. As an example, the input sequence {A, B} → {C} → {A} supports the
sequential patterns {A, B}, {A} → {C}, and {A} → {A}, but not the pattern {A, C},
because the occurrences of A and C in the input sequence are not simultaneous.
We highlight that the ”occurs after” relationship is satisfied by {A} → {A}, since
anything between the two items can be removed.
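A minimal Python sketch of this informal containment test (a greedy left-to-right match; the function name and the example data are ours) could look as follows; the formal definition is given in chapter 3.

def supports(input_sequence, pattern):
    """True if the sequential pattern is supported by the input sequence,
    i.e. each set of the pattern is included in a distinct element of the
    input sequence, respecting the left-to-right ("occurs after") order."""
    pos = 0
    for element in pattern:
        while pos < len(input_sequence) and not set(element) <= set(input_sequence[pos]):
            pos += 1
        if pos == len(input_sequence):
            return False
        pos += 1   # the next pattern element must occur strictly later
    return True

# The example from the text: {A, B} -> {C} -> {A}
IS = [{"A", "B"}, {"C"}, {"A"}]
print(supports(IS, [{"A", "B"}]))     # True
print(supports(IS, [{"A"}, {"C"}]))   # True
print(supports(IS, [{"A"}, {"A"}]))   # True
print(supports(IS, [{"A", "C"}]))     # False: A and C never co-occur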
Figure 1.4: Sequence dataset.

Figure 1.4 shows a small dataset containing just three input sequences, each
associated with a customer according to the above example. For each transaction,
the date is printed, but for the moment we consider the time just as a key for sorting
transactions. If we set the minimum support to 50%, we can see that the pattern
"computer and camera followed by a speaker" is frequent and supported by
the behavior of two customers. We observe that the apriori principle still holds for
sequence patterns. If we define the containment relationship between patterns
analogously to the one defined between patterns and input sequences, we can state that
every subsequence of a frequent sequence is frequent. So we are sure that ”computer
followed by a speaker” is a frequent pattern without looking at the dataset, because
the above-mentioned pattern is frequent, and, at the same time, we know that every
pattern containing a ”lamp” is not frequent.
The computational complexity of FSM is higher than that of FIM, due to the
possible repetitions of items within each pattern. Thus, having a small number of
distinct items often does not help, unless the length of input sequences is small too.
However, since the apriori principle is still valid, several efficient algorithms for FSM
exist, based on the generation of frequent patterns from smaller frequent ones.
In several application contexts it is interesting to exploit the presence of a time
dimension in order to obtain more precise knowledge, and, in some cases, also to
transform an intractable problem into a tractable one by restricting our attention
only to the cases we are looking for. For example, if the data represent the failures
in a network infrastructure, when looking for congestion we are interested in short
time periods, and the failure of a piece of equipment a day after another one may not
be as significant as the same sequence of failures within a few seconds. In this case,
a domain expert can enforce a constraint on the maximum gap between
occurrences of events, thus obtaining a better focus on the actually important patterns
and a strong reduction in complexity. In the example in Figure 1.4, if we decide
to limit our search to occurrences having a maximum gap smaller than seven
months, the pattern "computer followed by a speaker" will be supported by just one
customer shopping sequence, since in the first one the gap between the occurrence
of the computer and the occurrence of the speaker is too large.

Figure 1.5: Effect of the maxGap constraint.

Figure 1.5 shows the effect of the maximum gap constraint on the support of some
of the patterns of the above example: the deleted ones simply disappear, because
their occurrences have inadequate gaps. This behavior poses serious problems to
some of the most efficient algorithms, as we will explain in chapter 3, since some of
their super-patterns may be frequent anyway. This is the case of the pattern "camera
followed by scanner followed by speaker", which has one occurrence with maximum
gap equal to seven months even if "camera followed by speaker" has no occurrence
at all.
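To show how the gap constraint changes the notion of occurrence, the following sketch (ours, with hypothetical timestamps expressed in months) extends the previous containment test to timestamped input sequences with a maxGap constraint between consecutive pattern elements.

def supports_maxgap(input_sequence, pattern, max_gap):
    """Backtracking check: `input_sequence` is a list of (time, itemset)
    pairs; consecutive pattern elements must be matched to elements whose
    time difference does not exceed `max_gap`."""
    def match(p_idx, s_idx, prev_time):
        if p_idx == len(pattern):
            return True
        for i in range(s_idx, len(input_sequence)):
            time, items = input_sequence[i]
            if set(pattern[p_idx]) <= set(items) and \
               (prev_time is None or time - prev_time <= max_gap):
                if match(p_idx + 1, i + 1, time):
                    return True
        return False
    return match(0, 0, None)

# Times in months, loosely inspired by the Figure 1.4 example.
IS = [(1, {"computer", "camera"}), (4, {"scanner"}), (11, {"speaker"})]
print(supports_maxgap(IS, [{"computer"}, {"speaker"}], max_gap=7))
# False: the only occurrence has a gap of 10 months
print(supports_maxgap(IS, [{"camera"}, {"scanner"}, {"speaker"}], max_gap=7))
# True: each consecutive gap (3 and 7 months) respects the constraint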
1.4.3 Taxonomy of Algorithms
The apriori principle states that no superset of an infrequent set can be frequent.
This determines a powerful pruning strategy that suggests a level-wise approach to
solve both FIM [4] and FSM [7] problems. Apriori is the first of a family of algorithms
based on this method. First, every frequent item is found, and then the focus is on
pairs composed of frequent items, and so on. Exploring the search space level-wise
grants that every time a new candidate is considered, the support of all its
sub-patterns is known. An alternative approach is the depth-first discovery of frequent
patterns: by enforcing the apriori constraint just on some of the sub-patterns, the
search space is explored in depth. This is usually done in an attempt to preserve
locality, examining similar patterns consecutively [25, 24, 52].
In both cases, the support of patterns is computed by updating a counter each
time an occurrence is found. Moreover, when all the data fit in main memory, a more
efficient approach based on intersection can be devised. Each item x is associated
with the IDs of all the transactions where x appears, and the support of a pattern is
the size of the intersection of the ID sets of its items. These sets of IDs can be
represented using either bitmaps [44] or tidlists [68]. In FSM, the technique is similar,
but needs a longer description; an exhaustive explanation can be found in chapter 3.
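A minimal Python illustration of this vertical, intersection-based support computation (plain Python sets play the role of tidlists; all names are ours) is the following. In practice, as discussed in the next paragraph, the tidlist associated with a common prefix is reused to compute the support of its extensions, instead of re-intersecting the item tidlists from scratch.

def build_tidlists(transactions):
    """Vertical layout: map each item to the set of IDs of the
    transactions in which it appears."""
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def support_by_intersection(itemset, tidlists):
    """The support of an itemset is the size of the intersection of the
    tidlists of its items."""
    items = list(itemset)
    tids = tidlists[items[0]]
    for item in items[1:]:
        tids = tids & tidlists[item]
    return len(tids)

data = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
tl = build_tidlists(data)
print(support_by_intersection({"a", "b"}, tl))       # 3
print(support_by_intersection({"a", "b", "c"}, tl))  # 2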
The use of intersection in depth-first algorithms is highly efficient, thanks to the
availability of partial intersection results related to shorter patterns. For example,
if we examine every pattern with a given prefix before moving to a different one,
then the list of occurrences associated with that prefix can be reused, with little
waste of memory, in the computations related to its descendants. However, the
results obtained are unsorted, and this can be a problem when the results are
to be merged with other ones, as in the case of mining distributed and streaming
data, since we are forced to wait for the end of the computation before being able to
merge the results. On the other hand, level-wise algorithms pose a strong obstacle
to the efficient reuse of partial intersection results, due to the limited locality in the
search space traversal. When the search space is not partitioned as in the depth-first
algorithms, it is impossible to exploit the partial intersection results computed
at level k − 1 in order to compute the intersections at level k, as partial results can
quickly become too large to be maintained in main memory.
To the best of our knowledge, the only two level-wise algorithms that solved this
issue, using a result cache and an efficient partial result reuse, are DCI for FIM and
CCSM for FSM. DCI was introduced in [44] and extended in [43] with an efficient
support inference optimization, whereas CCSM was introduced in [47].
Since these algorithms grant some ordering on the results, they have been chosen
as the basic building blocks of our distributed and streaming algorithms in the second
part of this thesis (in the future work chapter for the part concerning FSM), since
they make heavy use of result merging. Figure 1.6 summarizes this taxonomy of
FIM and FSM algorithms.

Figure 1.6: Taxonomy of algorithms for frequent pattern mining.
1.5 Contributions
In this thesis, we present original contributions in three related areas: frequent
sequence mining with gap constraints, approximate mining of frequent patterns on
distributed datasets, and approximate mining of frequent patterns on streaming data.
The original contribution in the sequence mining field is CCSM, a novel algorithm
for the discovery of frequent sequence patterns in collections of lists of temporally
annotated sets, with constraints on the maximum gap between the occurrences of
two parts of the sequence (maxGap). The proposed method consists in choosing an
ordering that improves locality and reduces the number of tests on pattern support
when the maxGap constraint is enforced, combined with an effective caching policy
for intermediate results. This work has been published in [46, 47].
Another original contribution, this one on approximate distributed frequent itemset
mining, deals with homogeneous distributed datasets: several entities cooperate,
and each one has its own dataset with exclusive access. The two proposed
algorithms [59, 61] allow for obtaining a good approximate solution, and need just one
synchronization in one case and none in the other. In APRed, the algorithm proposed
in [59], each node begins the computation with a reduced support threshold. After
a first phase, needed to understand the peculiarities of the dataset, the minimum
support is increased again to an intermediate value chosen according to the behavior
of the patterns computed during the first phase. Thereafter each node can continue
independently and send, at the end of the computation, its results to the master, which
reconstructs an approximation of the set of all globally frequent patterns. The goal
of the support reduction is to force infrequent patterns to be revealed in partitions
where they have nearly frequent support. The results obtained by this method are
close to the exact ones for several real-world datasets originated by shopping carts
and web navigation. To the best of our knowledge, this is the first algorithm for
approximate distributed FIM based on an adaptive support reduction scheme.
Similar accuracy in the results, but higher performance, thanks to the asynchronous
behavior and to the absence of support reduction, is achieved by the APInterp algorithm
that we introduced in [61]. It interpolates the unknown pattern supports on the
basis of the knowledge acquired from the other partitions.
The absence of synchronizations, and of any two-way communication between the
master and the worker nodes, makes APInterp suitable for streaming data, considering
each new incoming block of data as a partition, and the rest of the data as another one.
In this way the merge-and-interpolate task can be applied repeatedly. This is the
basic idea of APStream, the algorithm we presented in [60]. In our tests on real-world
datasets, the results are similar to the exact ones, and the algorithm processes the
stream in linear time.
The described interpolation framework can be easily extended to distributed
streams and to the FSM problem, using the CCSM algorithm locally. A more challenging
extension, due to the subsumption-related result merging issues, concerns the
approximate distributed computation of Frequent Closed Itemsets (FCI), described
in our preliminary work [32]. Furthermore, the heuristic used in the interpolation can
easily be substituted with another one, better fitted to the particular target application.
However, even the very simple and generic one used in our tests gives good
results.
To the best of our knowledge, the AP method is the first distributed approach
that requires just one-way communications (i.e., with the global pruning optimization
disabled, the worker nodes use only local information), tries to interpolate the missing
supports by exploiting the available knowledge, and is suitable for both distributed
and stream settings.
1.6 Thesis overview
This thesis is divided into self-contained chapters. Each chapter begins with a
short overview containing an informal introduction to the subject and a description
of the scope of the chapter. The first section in most chapters is usually a more
formal introduction to the problem, with definitions and references to related works.
When other algorithms are used, either to describe the proposal contained in the
core of the chapter or its improvements in relation to the state of the art, these
algorithms are described immediately after the introduction. The core part of the
chapter contains an in-depth description of the proposed method, followed by a
discussion of its pros and cons, and the description of the experimental setup and
results. For the sake of readability, since parts of the citations are common to several
chapters, the references are listed at the end of the thesis. For the same reason, the
measures used for evaluating the approximation of the solutions are described in
an appendix.
The first part of the thesis is made of two chapters that deal with algorithms that
we will use in the following chapters about distributed and streaming data mining, as
previously explained in the section about FIM and FSM algorithm taxonomy. The
first chapter introduces the frequent itemset mining problem and describes DCI [44],
a state of the art algorithm for frequent itemset mining that we will use extensively
in the rest of the thesis. The second chapter describes CCSM, a new algorithm for
gap constrained sequence mining that we presented in [47].
In the second part of the thesis, the third chapter deals with approximate frequent
itemset mining in homogeneous distributed datasets, and describes our two
novel approximate algorithms APRed and APInterp, based on support reduction and
interpolation. The fourth chapter extends the support interpolation method, introduced
in the previous chapter, to streaming data [60]. Finally, the last chapter
describes some future work and draws some conclusions. In particular, we describe
how to extend the proposed interpolation framework in order to deal with frequent
sequences, using CCSM for local computation. Moreover, we discuss how to combine
APInterp and APStream in an algorithm for the discovery of frequent itemsets on
distributed data streams.
I First Part
2 Frequent Itemset Mining
Each data mining task has its peculiarities and issues when dealing with evolving and
distributed data, as we have briefly outlined in the introduction. A more detailed
analysis requires focusing on a particular task. In this thesis, we have decided to
analyze in detail this problem by discussing Association Rules Mining (ARM) and
Sequential Association Rules Mining (SARM), two of the most popular DM task.
The crucial steps in ARM, and by far the most computationally challenging, is
the extraction of frequent subsets from an input database of sets of distinct items,
also known as Frequent Itemset Mining (FIM). In case the datasets is referred to the
activities of a shop, and data are sale transactions composed of several items, the
goal of FIM is to find the sets of items that are bought together, at least, in a user
specified number of transactions. The challenges in FIM derive from the large size of
its search space, which, in the worst case, corresponds to the power set of the set of
items, and thus is exponential in the number of distinct items. Restricting as much
as possible this space and efficiently performing computations on the remaining part
are key issues for FIM algorithms.
This chapter formally introduces the itemset mining problem and describes DCI
(Direct Count and Intersect), a hybrid level-wise algorithm, which dynamically
adapts its search strategy to the characteristics of the dataset and to the evolution of the computation. This algorithm was introduced in [44], and extended in
[43] with an efficient, key pattern based, support inference method.
With respect to the Frequent Pattern Mining algorithms taxonomy, presented
in the introduction, DCI is a level-wise algorithm, able to ensure an ordering of
the results, which uses an efficient hybrid counting strategy, switching to an in-core
intersection-based support computation as soon as there is enough memory available.
DCI has been chosen as the building block for our approximate algorithms for
distributed and stream settings due to its efficiency and to the ordering of its results,
which is particularly important when merging different result sets. Moreover, DCI
knows the exact amount of
memory needed for the whole intersection phase before starting it, and this has
been exploited in APStream , our stream algorithm, for dynamically choosing the size
of the block of transactions to process at once.
2.1 The problem
A dataset D is a collection of subsets of a set of items I = {it1, . . . , itm}. Each element
of D is called a transaction. A pattern x is frequent in dataset D, with respect to
a minimum support minsup, if its support is not lower than σmin = minsup · |D|,
i.e., the pattern occurs in at least σmin transactions, where |D| is the number of
transactions in D. A k-pattern is a pattern composed of k items, Fk is the set of
all frequent k-patterns, and F = ∪k Fk is the set of all frequent patterns. F1 is also
called the set of frequent items. The computational complexity of the FIM problem
derives from the exponential size of its search space P(M), i.e., the power set of M,
where M is the set of items contained in the various transactions of D.
2.1.1 Related works
A way to prune the search space P(M ), first introduced in the Apriori [6] algorithm,
is to restrict the search to itemsets whose subsets are all frequent. Apriori is a
level-wise algorithm, since it examines the k-patterns only when all the frequent patterns
of length k − 1 have been discovered. At each iteration k, a set of potentially frequent
patterns, having all of their subsets frequent, is generated starting from the
previous level results. Then the dataset is read sequentially, and the counters associated
with each candidate are updated according to the occurrences found. After
the database scan, only the candidates having a support greater than the threshold
are inserted in the result set and used for generating the candidates of the next iteration.
Several other algorithms based on the apriori principle have been proposed. Some
use the same level-wise approach, but introduce efficient optimizations, like a hybrid
count/intersection support computation [44] or the reduction of the number of
candidates using a hash-based technique [49]. Others use a depth-first approach,
either class-based [68] or projection-based [2, 25]. Still others use completely
different approaches, based on multiple independent computations on smaller parts
of the dataset [55, 50].
Related research topics are the discovery of maximal and closed frequent itemsets. The first ones are those frequent itemsets that are not included in any other
larger frequent itemset. As an example, consider the FIM result set F = {{A} :
4, {B} : 4, {C} : 3, {A, B} : 4, {A, C} : 3, {B, C} : 3, {A, B, C} : 3}, where the
notation set : count indicates frequent itemsets along with their supports. In this
case there is only one maximal frequent pattern Fmax = {{A, B, C} : 3}, since the
other itemsets are included in it. Clearly, the algorithms that are able to directly
mine the set of maximal patterns, like [9, 3, 11], are faster and produce a more
compact output than FIM algorithms. Unfortunately, the information contained in
the result set is not the same: in the above example, there is no way to deduce
the support of pattern {A} from Fmax. Frequent closed itemsets are those frequent
itemsets that are not set-included in any larger frequent itemset having the same
support. The group of patterns subsumed by the same closed itemset appears in
exactly the same set of transactions, and forms an equivalence class, whose
representative element is the largest. Considering again the previous example, the patterns {A}
and {B} are subsumed by the pattern {A, B}, whereas the patterns {C}, {A, C}
and {B, C} are subsumed by {A, B, C}. Thus, the set of frequent closed itemsets
is Fclosed = {{A, B} : 4, {A, B, C} : 3}. Note that in this case the support of any
frequent itemset can be deduced as the support of its smallest superset contained in
the result; thus the {A, C} pattern has support equal to 3, i.e., it has the same support
as the pattern {A, B, C}.
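Both notions can be checked mechanically against the example above. The following small Python sketch (ours) derives Fmax and Fclosed from a complete FIM result F.

def maximal_and_closed(F):
    """F maps frozenset itemsets to their supports (the FIM result).
    Returns the maximal and the closed frequent itemsets, as defined above:
    maximal = no frequent proper superset; closed = no frequent proper
    superset with the same support."""
    maximal, closed = {}, {}
    for itemset, supp in F.items():
        supersets = [s for s in F if itemset < s]
        if not supersets:
            maximal[itemset] = supp
        if not any(F[s] == supp for s in supersets):
            closed[itemset] = supp
    return maximal, closed

F = {frozenset("A"): 4, frozenset("B"): 4, frozenset("C"): 3,
     frozenset("AB"): 4, frozenset("AC"): 3, frozenset("BC"): 3,
     frozenset("ABC"): 3}
print(maximal_and_closed(F))
# maximal: {A,B,C}: 3    closed: {A,B}: 4 and {A,B,C}: 3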
2.2 DCI
The approximate algorithms that we will propose in the second part of the thesis
for distributed and stream data are built on traditional FPM algorithms, used for
local computations. The partial ordering of the results, the foreseeable resource usage,
and the ability to quickly recompute a pattern support using the in-core vertical
bitmap made DCI our algorithm of choice.
DCI is a multi-strategy algorithm that runs in two phases, both level-wise. During
its initial count-based phase, DCI exploits an out-of-core horizontal database, with
variable-length records. At the beginning of each iteration k, a set Ck of k-candidates
is generated, based on the frequent patterns contained in Fk−1 , then their number of
occurrences is verified during a database scan. At the end of the scan, the itemsets
in Ck having a support greater than the threshold σmin are inserted into Fk . As
the execution progresses, the dataset size is reduced by removing transactions and items
no longer needed for computation using a technique inspired by DHP [49]. As soon as
the pruned dataset becomes small enough to fit in memory, DCI adaptively changes
its behavior. It builds a vertical layout database in-core, and starts adopting an
intersection-based approach to determine frequent sets.
During this second phase DCI uses intersections to check the support of kcandidates, generated on the fly by composing all the pairs of (k − 1)-itemsets
that are included in Fk−1 and share a common (k − 2)-prefix. When a candidate
is found to be frequent, it is inserted into Fk . In order to ensure high spatial and
temporal locality, each Fi is maintained lexicographically ordered. This guarantees that (k−1)-patterns sharing a common prefix are stored contiguously in Fk−1 and, at the same time, that the candidates are considered in lexicographical order, thus ensuring the
ordering of the result. Furthermore, this allows accessing previous iteration results
from disk in a nearly sequential way and storing immediately each pattern as soon
as it is discovered to be frequent.
DCI uses several optimization techniques, such as support counting inference
based on key patterns [43] and heuristics to dynamically adapt to both dense and
sparse datasets. Here, however, we will focus only on candidate generation and the counting/intersection phases. Also in the pseudo-code, contained in Algorithm 1, the part related to optimizations has been removed.
2.2.1 Candidate generation
Candidates are generated in both phases, even if at different times. In the count-based phase, all the candidates are generated at the beginning of each iteration and then their supports are verified, whereas in the intersection-based one, the candidates are generated and their supports are checked on the fly. Another important difference concerns memory usage: during the first phase the candidates and the results are maintained in memory and the dataset is on disk, whereas during the second phase the candidates are generated on the fly, the results are immediately offloaded to disk, and the dataset is kept in main memory.
The generation of candidates of length k is based on the composition of patterns of k − 1 items sharing the same k − 2 long prefix. For example, if F2 =
{A, B}, {A, C}, {A, D}, {B, C} is the set of frequent 2-patterns, then the set of
candidates for the 3rd iteration will be C3 = {A, B, C}, {A, B, D}. DCI organizes
itemsets of length k in a compressed data structure, optimized for spatial locality
and fast access to groups of candidates sharing the same prefix, taking advantage
of lexicographical ordering. A first array contains the (k − 1)-prefixes and a second one
contains an index to the contiguous block of item suffixes contained in the third
array. Figure 2.1 shows the usage of these arrays. The patterns {A, B, D, M} and {A, B, D, I} are represented by the second prefix followed by the suffixes in positions from 7 to 8, i.e., from the index position to the position before the one associated with the next prefix. Generating the candidates using this data structure is straightforward, and simply consists of the generation of all the pairs for each block of suffixes. E.g., for the block corresponding to the prefix {A, B, C}, {A, B, C, G} is inserted in the candidate prefixes, with suffixes H, I and L, followed by {A, B, C, H} with suffixes I and L, followed by {A, B, C, I} with suffix L.
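To make the generation scheme concrete, the following Python sketch (ours, not DCI's actual code) enumerates the candidates produced by one block of suffixes; we assume here that each index entry marks the start of the suffix block of the corresponding prefix.

# Illustrative sketch: candidate generation from the compressed prefix/index/suffix
# arrays; every ordered pair of suffixes within a block yields one candidate.
def generate_candidates(prefixes, index, suffixes):
    candidates = []
    for b, prefix in enumerate(prefixes):
        start = index[b]
        end = index[b + 1] if b + 1 < len(prefixes) else len(suffixes)
        block = suffixes[start:end]          # suffixes sharing this prefix
        for i in range(len(block)):          # all ordered pairs within the block
            for j in range(i + 1, len(block)):
                candidates.append(tuple(prefix) + (block[i], block[j]))
    return candidates

# Block for the prefix {A, B, C} with suffixes G, H, I, L, as in the example above.
print(generate_candidates([("A", "B", "C")], [0], ["G", "H", "I", "L"]))
# [('A','B','C','G','H'), ('A','B','C','G','I'), ('A','B','C','G','L'),
#  ('A','B','C','H','I'), ('A','B','C','H','L'), ('A','B','C','I','L')]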
Not every generated candidate obeys the apriori principle, so we can observe that
the candidate pattern {A, B, D}, in the first example, cannot be frequent, since its
subpattern {B, D} is not frequent. When the candidates are stored in memory,
during the counting-based phase, the apriori principle is enforced before inserting
candidates into the candidate set. On the other hand, checking the presence of every
subset has a cost, which increases as the patterns get longer. If we also consider that
the relevant subpatterns are not in any particular order, this disrupts both spatial
and temporal locality in the access to the previous iteration results (Fk−1). For this reason, and given the low cost and high locality of intersection-based support checking, the authors have decided to limit the candidate pruning step to the count-based phase.
2.2.2 Counting phase
In the first iteration, similarly to all FSC algorithms, DCI exploits a vector of counters. In subsequent iterations, it uses a Direct Count technique, introduced by the
same authors in [42]. The goal of this technique is to make the access to the counters associated with candidates as fast as possible. So, instead of using a hash tree,
[Figure 2.1 shows three arrays: the prefix array (a b c, a b d, b d f), the index array (3, 7, 9), and the suffix array (d, e, f, g, h, i, l, m, i, n, at positions 0-9). The compressed representation uses 9 + 3 + 10 = 21 cells, against 4 × 10 = 40 cells for the non-compressed one.]
Figure 2.1: The compressed data structure used for the itemset collection can also improve candidate generation. This figure originally appeared in [43].
or other complex data structures, it extends the approach used for items. When
k = 2, each pair of (frequent) items is associated with a counter in an array through
an order-preserving perfect hash function. Since the order within pairs of items is not
significant, and the elements of a pair are distinct, the number of counters needed is m(m − 1)/2, where m is the number of frequent items.
When k > 2, using direct access to counters would require a large amount of
memory. In this case, the direct access prefix table contains a pointer to a contiguous
block of ordered candidates sharing the same 2-prefix. Note that the number of locations in the prefix table is mk(mk − 1)/2 ≤ m(m − 1)/2, where mk is the number of distinct items in the dataset during iteration k, which is less than or equal to m, thanks to pruning. Indeed, during the k-th count-based iteration, DCI removes from each generic transaction t every item that is not contained in at least k − 1 frequent itemsets of Fk−1 and k candidate itemsets of Ck.
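A common way to realize such an order-preserving perfect hash is to map each pair (i, j), i < j, of re-mapped frequent item identifiers onto a triangular array; the following Python sketch (ours, not necessarily the exact formula used by DCI) illustrates the idea.

# Illustrative sketch: order-preserving perfect hash for item pairs (i, j), i < j,
# with identifiers 0..m-1, mapped onto 0..m(m-1)/2 - 1; a plain array can then hold
# the counters (or the prefix-table pointers) with direct access.
def pair_index(i, j, m):
    assert 0 <= i < j < m
    # pairs starting with 0..i-1 come first, then the offset of j within "row" i
    return i * m - i * (i + 1) // 2 + (j - i - 1)

m = 5
counters = [0] * (m * (m - 1) // 2)
for i in range(m):
    for j in range(i + 1, m):
        counters[pair_index(i, j, m)] += 1    # touch every cell exactly once
assert all(c == 1 for c in counters)          # the mapping is perfect (no collisions)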
Clearly, as the execution progresses, the size of the dataset actually used in the computation decreases and, thanks to pruning, the whole dataset rapidly shrinks enough to fit in main memory for the final intersection-based phase. Even with large datasets and limited memory, this often happens after 3 or 4 iterations, thus limiting the drawbacks of the count-based phase, which becomes less efficient as k increases.
Algorithm 1: DCI
input: D, minsup

    // find the frequent itemsets
    F1 ← first_scan(D, minsup);
    // second and following scans on a temporary db D′
    F2 ← second_scan(D′, minsup);
    k ← 2;
    while D′.vertical_size() > memory_available() do
        k ← k + 1;
        Fk ← DCI_count(D′, minsup, k);
    end
    k ← k + 1;
    // count-based iteration and creation of the vertical database VD
    Fk ← DCI_count(D′, VD, minsup, k);
    while Fk ≠ ∅ do
        k ← k + 1;
        Fk ← DCI_intersect(VD, minsup, k);
    end
2.2.3 Intersection phase
The intersection-based phase uses a vertical database, in which each item α is paired
with a set of transactions tids(α) containing α, different from the horizontal one
used before, in which a set of items is associated with each transaction. Since a
transaction t supports pattern x iff x ⊆ t, the set of transactions supporting x can
be obtained by intersecting the sets of transactions (tidlists) associated with each item in x. Thus the support σ(x) of a pattern x will be

σ(x) = |∩_{α∈x} tids(α)|
In DCI the sets of transactions are represented as bit-vectors, where the i-th bit is equal to 1 when the i-th transaction contains the item and is equal to 0 otherwise. This representation allows for efficient intersections based on the bitwise AND operator.
The memory necessary to contain this bitmap-based vertical representation is mk ·nk
bits, where mk and nk are respectively the numbers of items and transactions in the
pruned database used at iteration k. As soon as this amount is less than the available
memory, the vertical dataset representation can be built on the fly in main memory
in order to begin the intersection-based phase of DCI.
During this phase, the candidates are generated on the fly in lexicographical
order, and their supports are checked using tidlist intersections. The above-described
method for support computation is indicated as k-way intersection. The k bit-vectors
associated with the items contained in a k-pattern are AND-intersected, and the support
is obtained as the number of 1’s present in the resulting bit-vector. If this value is
greater than the support threshold σmin , then the candidate is inserted into Fk .
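The following Python sketch (ours, using plain integers in place of DCI's bit-vectors) illustrates the k-way intersection just described: the tidlists of the k items are AND-ed together and the support is the number of set bits in the result.

# Illustrative sketch of the k-way intersection over bitmap tidlists: bit t of an
# item's tidlist is set iff transaction t contains the item.
N_TRANSACTIONS = 4                       # transactions are numbered 0..3
ALL_ONES = (1 << N_TRANSACTIONS) - 1

def support(candidate, tidlist):
    bits = ALL_ONES                      # start from the full transaction set
    for item in candidate:
        bits &= tidlist[item]            # bitwise AND with the item's tidlist
    return bin(bits).count("1")          # support = number of set bits

# transactions: 0:{A,B,C}, 1:{A,B}, 2:{A,B,C}, 3:{B,C}
tidlist = {"A": 0b0111, "B": 0b1111, "C": 0b1101}
print(support(("A", "B", "C"), tidlist))   # 2 (transactions 0 and 2)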
Since the candidates are generated on the fly, the set of candidates no longer needs to be maintained. Moreover, both Fk−1 and Fk can be kept on disk. Indeed, Fk−1 is lexicographically ordered and can be loaded in blocks having the same (k − 2)-prefix, and, thanks to the order of candidate generation, appending frequent patterns at the end of Fk preserves the lexicographic order.
The set intersection is a commutative and associative operation, thus the
operands can be intersected in any order and grouped in any way. A possible
method is intersecting the tidlists of the items pairwise, starting from the beginning, i.e.,
the first with the second, the result with the third, the result with the fourth and
so on. Since the candidates are lexicographically ordered, consecutive candidates
are likely to share a prefix of some length. Hence, the intersections related to this
prefix are pointlessly repeated for each candidate. In order to exploit this locality,
DCI uses an effective cache containing the intermediate results of intersections.
When the support of a candidate c is checked immediately after that of a candidate c′,
the tidlist associated with their common prefix can be obtained directly from the
cache.
    Cached Pattern    Cached tidList
1   {A}               tids(A)
2   {A, B}            tids(A) ∩ tids(B)
3   {A, B, C}         (tids(A) ∩ tids(B)) ∩ tids(C)
4   {A, B, C, D}      ((tids(A) ∩ tids(B)) ∩ tids(C)) ∩ tids(D)
Figure 2.2: Example of cache usage.
For example, after the computation of the support of the itemset {A, B, C, D}, the tidlists associated with all of its prefixes are present in the cache, as shown in Figure 2.2. Note that each cache position is obtained from the previous one by intersection with the tidlist of a single item. Hence, if the next candidate pattern is {A, B, C, G}, only the last position of the cache needs to be replaced, and this implies just one tidlist intersection, since the intersection of the tidlists of {A, B, C} can be retrieved from the third entry of the cache.
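The cache update can be sketched as follows (our illustration, not DCI's actual code, using the same bit-vector representation as the previous sketch): only the cache lines beyond the longest prefix shared with the previous candidate are recomputed, each with a single intersection.

# Illustrative sketch of the prefix cache used during the intersection phase:
# cache line i holds the tidlist of the first i+1 items of the last candidate.
N_TRANSACTIONS = 4
ALL_ONES = (1 << N_TRANSACTIONS) - 1

def cached_support(candidate, cache_items, cache_lists, tidlist):
    p = 0                                   # length of the shared prefix
    while p < len(cache_items) and p < len(candidate) and candidate[p] == cache_items[p]:
        p += 1
    del cache_items[p:], cache_lists[p:]    # drop stale cache lines
    for item in candidate[p:]:
        prev = cache_lists[-1] if cache_lists else ALL_ONES
        cache_lists.append(prev & tidlist[item])   # one intersection per new line
        cache_items.append(item)
    return bin(cache_lists[-1]).count("1")

tidlist = {"A": 0b0111, "B": 0b1111, "C": 0b1101, "D": 0b0011, "G": 0b0101}
items, lists = [], []
print(cached_support(("A", "B", "C", "D"), items, lists, tidlist))  # 4 intersections
print(cached_support(("A", "B", "C", "G"), items, lists, tidlist))  # 1 intersection (reuses {A,B,C})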
2.3 Conclusions
In this chapter, we have described the frequent itemset mining (FIM) problem, the
state of the art of FIM algorithms, and DCI, an efficient FIM algorithm, introduced
by Orlando et al. in [44]. We will use DCI in the second part of the thesis as a
building block for our approximate algorithms for distributed and stream data. DCI
has been chosen among the other FIM algorithms thanks to its efficiency and its result ordering, which is particularly important when merging different result sets. Moreover, we can predict the exact amount of memory needed by DCI for the whole intersection phase before starting it, and this has been exploited in APStream, our stream algorithm, for dynamically choosing the size of the block of transactions to process at the same time.
3 Frequent Sequence Mining
The previous chapter has introduced the Frequent Itemset Mining (FIM) problem, the most computationally challenging part of Association Rules Mining. This chapter deals with Sequential Association Rules Mining (SARM) and in particular with its Frequent Sequence Mining (FSM) phase. In this thesis work we have decided to focus on these two popular data mining tasks, with particular regard to the issues related to distributed and stream settings, and to the usage of approximate algorithms in order to overcome these problems. The algorithm proposed in this chapter can be used as a building block for the Frequent Sequence version of our approximate distributed and stream algorithms described in the second part of this thesis, thanks to its efficiency and its result ordering, which is particularly important when merging different result sets.
The frequent sequence mining (FSM) problem consists in finding frequent sequential patterns in a database of time-stamped events. Continuing with the supermarket example, market baskets are linked to a time-line and are no longer anonymous. An important extension to the base FSM problem is the introduction of time constraints.
For example, several application domains require limiting the maximum temporal
gap between events occurring in the input sequences. However pushing down this
constraint is critical for most sequence mining algorithms.
This chapter formally introduces the sequence mining problem and proposes
CCSM (Cache-based Constrained Sequence Miner), a new level-wise algorithm that
overcomes the troubles usually related to this kind of constraint. CCSM adopts an
innovative approach based on k-way intersections of idlists to compute the support
of candidate sequences. Our k-way intersection method is enhanced by the use
of an effective cache that stores intermediate idlists for future reuse inspired by
DCI [44] (see previous chapter). The reuse of intermediate results entails a surprising
reduction in the actual number of join operations performed on idlists.
CCSM has been experimentally compared with cSPADE [69], a state-of-the-art
algorithm, on several synthetically generated datasets, obtaining better or similar
results in most cases.
Since some concepts introduced in the GSP [62] and SPADE [70] algorithms are used to explain the CCSM algorithm, a quick description of these two follows the problem description. Other related works are discussed at the end of the chapter.
3.1 Introduction
The problem of mining frequent sequential patterns was introduced by Agrawal and Srikant in [7]. In a subsequent work, the same authors discussed the introduction of constraints on the mined sequences, and proposed GSP [62], a new algorithm dealing with them. In recent years, many innovative algorithms have been presented for solving the same problem, also under different user-provided constraints [69, 70, 53, 20, 52, 8].
We can think of Frequent Sequence Mining (FSM) as a generalization of Frequent Itemset Mining (FIM) to temporal databases. FIM algorithms aim to find patterns (itemsets) occurring with a given minimum support
within a transactional database D, whose transactions correspond to collections of
items. A pattern is frequent if its support is greater than (or equal to) a given
threshold s%, i.e. if it is set-included in at least s%·|D| input transactions, where |D|
is the total number of transactions in D. An input database D for the FSM problem
is instead composed of a collection of sequences. Each sequence corresponds to a
temporally ordered list of events, where each event is a collection of items (itemset)
occurring simultaneously. The temporal ordering among the events is induced from
the absolute timestamps associated with the events.
A sequential pattern is frequent if its support is greater than (or equal to) a
given threshold s%, i.e. if it is "contained" in (or it is a subsequence of) at least
s% · |D| input sequences, where |D| is the number of sequences included in D.
To make more intuitive both problem formulations, we may consider them within
the application context of the market basket analysis (MBA). In this context, each
transaction (itemset) occurring in a database D of the FIM problem corresponds to
the collection of items purchased by a customer during a single visit to the market.
The FIM problem for MBA consists in finding frequent associations among the items
purchased by customers. In the general case, we are thus not interested in the timestamp of each purchased basket, or in the identity of its customer, so the input
database does not need to store such information. Conversely, FSM problem for
MBA consists in predicting customer behaviors on the basis of their past purchases.
Thus, D has also to include information about timestamp and customer identity
of each basket. The sequences of events included in D correspond to sequences
of ”baskets” (transactions) purchased by the same customer during distinct visits
to the market, and the items of a sequential pattern can span a set of subsequent
transactions belonging to the same customer. Thus, while the FIM problem is
interested in finding intra-transaction patterns, the FSM problem determines inter-transaction sequential patterns.
Due to the similarities between the FIM and FSM problems, several FIM algorithms have been adapted for mining frequent sequential patterns as well. Like FIM algorithms, FSM ones can also adopt either a count-based or an intersection-based
approach for determining the support of frequent patterns. The GSP algorithm,
which is derived from Apriori [7], adopts a count-based approach, together with a
level-wise visit (Breadth-First) of the search space. At each iteration k, a set of
candidate k-sequences (sequences of length k) is generated, and the dataset, stored
in horizontal form, is scanned to count how many times each candidate is contained within the input sequences. The other approach, i.e. the intersection-based one,
within each input sequences. The other approach, i.e. the intersection-based one,
relies on a vertical-layout database, where for each item X appearing in the various input sequences we store an idlist L(X). The idlist contains information about
the identifiers of the input sequences (sid ) that include X, and the timestamps
(eid ) associated with each occurrence of X. Idlists are thus composed of pairs (sid,
eid ), and are considerably more complex than the lists of transaction identifiers
(tidlists) exploited by intersection-based FIM algorithms. Using an intersection-based method, the support of a candidate is determined by joining lists. In the FIM
case, tidlist joining is done by means of simple set-intersection operations. Conversely, idlist joining in FSM intersection-based algorithms exploits a more complex
temporal join operation. Zaki’s SPADE algorithm [70] is the best representative of
such intersection-based FSM algorithms.
Several real applications of FSM enforce specific constraints on the type of sequences extracted [62, 53]. For example, we might be interested in finding frequent
sequences of purchase events which contain a given subsequence (super pattern constraint), or where the average price of items purchased is over a given threshold
(aggregate constraint), or where the temporal interval between each pair of consecutive purchases is below a given threshold (maxGap constraint). Obviously, we
could solve this problem with a post-processing phase: first, we extract from the
database all the frequent sequences, and then we filter them on the basis of the
posed constraints. Unfortunately, when the constraint is not on the sequence itself
but on its occurrences (as in the case of the maxGap constraint), sequence filtering requires an additional scan of the database to verify whether a given frequent
pattern still has a minimum support under the constraint. In general, FSM algorithms that directly deal with user-provided constraints during the mining process are much more efficient, since constraints may allow an effective pruning of candidates, thus resulting in a strong reduction of the computational cost. Unfortunately,
the inclusion of some support-related constraints may require large modifications in
the code of an unconstrained FSM algorithm. For example, the introduction of the
maxGap constraint in the SPADE algorithm gave rise to cSPADE, a very different
algorithm [69].
All the FSM algorithms rely on the anti-monotonic property of sequence frequency: every subsequence of a frequent sequence is frequent as well. More precisely, most algorithms rely on a weaker property, restricted to a well-characterized subset of the subsequences. This property is used to generate candidate k-sequences from frequent (k − 1)-sequences. When an intersection-based approach is adopted, we can determine the support of any k-sequence by means of join operations performed [55] on the idlists associated with its subsequences. As a limit case, we could compute the support of a sequence by joining the atomic idlists associated with the single items included in the sequence, i.e., through a k-way join operation [44]. More efficiently,
we could compute the support of a sequence by joining the idlists associated with
two generating (k − 1)-subsequences, i.e., through a 2-way join operation. SPADE
[70] just adopts this 2-way intersection method, and computes the support of a k-sequence by joining two of its (k − 1)-subsequences that share a common suffix. Unfortunately, the adoption of 2-way intersections requires maintaining the idlists of all the (k − 1)-subsequences computed during the previous iteration. To limit memory requirements, SPADE subdivides the search space into small, manageable chunks. This is accomplished by exploiting suffix-based equivalence classes: two k-sequences are in the same class only if they share a common (k − 1)-suffix. Since all the generating subsequences of a given sequence belong to the same equivalence class, equivalence classes are used to partition the search space in a way that allows each class to be processed independently in memory. Unfortunately, the efficient method used by SPADE to generate candidates and join their idlists cannot be exploited when a maximum gap constraint is considered. Therefore, cSPADE is forced
to adopt a different and much more expensive way to generate sequences and join
idlists, also maintaining in memory F2 , the set of frequent 2-sequences.
This chapter discusses CCSM (Cache-based Constrained Sequence Miner), a new level-wise intersection-based FSM algorithm, dealing with the challenging maximum gap constraint. The main innovation of CCSM is the adoption of k-way intersections to compute the support of candidate sequences. Our k-way intersection method is enhanced by the use of an effective cache, which stores intermediate idlists. The idlist reuse allowed by our cache entails a surprising reduction in the actual number of join operations performed, so that the number of joins performed by CCSM approaches the number of joins performed when a pure 2-way intersection method is adopted, while requiring much less memory. In this context, it becomes interesting to compare the performance of CCSM with that achieved by cSPADE when a maximum gap constraint is enforced.
The rest of the chapter is organized as follows. Section 3.2 formally defines the
FSM problem, while Section 3.5.2 describes the CCSM algorithm. Section 3.5.3 presents some experimental results and a discussion about them. Finally, Section 5.4
presents some concluding remarks.
3.2 Sequential patterns mining
3.2.1 Problem statement
Definition 1. (Sequence of events) Let I = {i1 , ..., im } be a set of m distinct
items. An event (itemset) is a non-empty subset of I. A sequence is a temporally
ordered list of events. We denote an event as (j1 , . . . , jm ) and a sequence as (α1 →
. . . → αk ), where each ji is an item and each αi is an event (ji ∈ I and αi ⊆ I).
The symbol → denotes a happens-after relationship. The items that appear together
in an event happen simultaneously. The length |x| of a sequence x is the number of items contained in the sequence (|x| = Σi |αi|). A sequence of length k is called a k-sequence.
Even if an event represents a set of items occurring simultaneously, it is convenient to assume that there exists an ordering relationship R among them. Such an order makes unequivocal the way in which a sequence is written, e.g., we cannot write BA → DBF since the correct way is AB → BDF. This allows us to say, without ambiguity, that the sequence A → BD is a prefix of A → BDF → A, while DF → A is a suffix. A prefix/suffix of a given sequence α is a particular subsequence of α (see the definitions below).
Definition 2. (Subsequence) A sequence α = (α1 → . . . → αk) is contained in a sequence β = (β1 → ... → βm) (denoted as α ⪯ β) if there exist integers 1 ≤ i1 < ... < ik ≤ m such that α1 ⊆ βi1, ..., αk ⊆ βik. We also say that α is a subsequence of β, and that β is a super-sequence of α.
Definition 3. (Database) A temporal database is a collection of input sequences:
D = {α| α = (sid, α, eid)},
where sid is a sequence identifier, α = (α1 → . . . → αk ) is an event sequence, and
eid = (eid1 , . . . , eidk ) is a tuple of unique event identifiers, where each eidi is the
timestamp (occurring time) of event αi .
Definition 4. (Gap constrained occurrence of a sequence) Let β be a given input sequence, whose events (β1 → . . . → βm) are time-stamped with (eid1, . . . , eidm). The gap between two consecutive events βi and βi+1 is thus defined as (eidi+1 − eidi). A sequence α = (α1 → . . . → αk) occurs in β under the minimum gap and maximum gap constraints, denoted as α ⊑c β, if there exist integers 1 ≤ i1 < ... < ik ≤ m such that α1 ⊆ βi1, ..., αk ⊆ βik, and ∀j, 1 < j ≤ k, minGap ≤ (eidij − eidij−1) ≤ maxGap, where minGap and maxGap are user-specified thresholds.
When no constraints are specified, we denote the occurrence of α in β as α ⊑ β. This is a simpler case of sequence occurrence, since α ⊑ β holds simply if α ⪯ β holds.
Definition 5. (Support and constraints) The support of a sequence pattern α, denoted as σ(α), is the number of distinct input sequences β such that α ⊑ β. If a maximum/minimum gap constraint has to be satisfied, the occurrence relation that must hold is α ⊑c β.
Definition 6. (Sequential pattern mining) Given a sequential database and a
positive integer minsup (a user-specified threshold), the sequential mining problem
deals with finding all patterns α along with their corresponding supports, such that
σ(α) ≥ minsup.
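As an illustration of Definitions 4 and 5, the following Python sketch (ours) checks whether a sequence α occurs in a time-stamped input sequence β under the minGap/maxGap constraints; the support of α would then be the number of input sequences for which the check succeeds.

# Illustrative sketch of the occurrence test of Definition 4: alpha is a list of
# events (sets of items), beta a list of (eid, event) pairs with increasing eids.
def occurs(alpha, beta, min_gap=0, max_gap=float("inf")):
    def search(a_pos, prev_eid):
        if a_pos == len(alpha):                    # every event of alpha was matched
            return True
        for eid, event in beta:
            if prev_eid is not None:
                gap = eid - prev_eid
                if gap <= 0 or not (min_gap <= gap <= max_gap):
                    continue                       # wrong order or gap out of range
            if alpha[a_pos] <= event and search(a_pos + 1, eid):
                return True
        return False
    return search(0, None)

beta = [(1, {"A", "B"}), (3, {"C"}), (10, {"D"})]
print(occurs([{"A"}, {"C"}], beta, max_gap=4))         # True:  gap 3 - 1 = 2 <= 4
print(occurs([{"A"}, {"C"}, {"D"}], beta, max_gap=4))  # False: gap 10 - 3 = 7 > 4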
3.2.2 Apriori property and constraints
Also in the FSM problem the Apriori property holds: all the subsequences of a frequent sequence are frequent. An FSM constraint C is anti-monotone if and only if for any sequence β satisfying C, all the subsequences α of β satisfy C as well (or, equivalently, if α does not satisfy C, none of the super-sequences β of α can satisfy C). Note that the Apriori property is a particular anti-monotone constraint, since it can be restated as 'the constraint on minimum support is anti-monotone'.
In the problem statement above, we have already defined two new constraints
besides the minimum support one: given two consecutive events appearing in a
sequence, these constraints regard the maximum/minimum valid gap between the
occurrences of the two events in the various input database sequences.
Consider first the minGap constraint. Let δ be an input database sequence. If β ⊑c δ, then all its subsequences α, α ⪯ β, satisfy α ⊑c δ. This property holds because α ⪯ β implies that the gaps between the events of α are not shorter than the gaps relative to β. Hence, we can deduce that the minGap constraint is an anti-monotone constraint.
Conversely, if the maxGap constraint is considered and α ⪯ β ⊑c δ, we do not know whether α ⊑c δ holds or not. This is because α ⪯ β implies that the gaps between the events of α may be larger than the gaps relative to β. For example, if (A→B→C) ⊑c δ, the gaps relative to A→C (i.e. the gaps between the events A and C in δ) are surely larger than the gaps relative to A→B and B→C. Therefore, if the gap between the events B and C is exactly equal to maxGap, the maximum gap constraint cannot be satisfied by A→C, i.e. A→C ⊑c δ does not hold. Hence, we can conclude that, using this definition of sub/super-sequence based on ⪯, the maxGap constraint is not anti-monotone.
3.2.3 Contiguous sequences
We have shown that the property 'β satisfies the maxGap constraint' does not propagate to all subsequences α of β (α ⪯ β). Nevertheless, we can introduce a new definition of subsequence that allows such an inference to hold.
Definition 7. (Contiguous subsequence) Given a sequence β = (β1 → ... → βm) and a subsequence α = (α1 → ... → αn), α is a contiguous subsequence of β, denoted as α ⊴ β, if one of the following holds:
1. α is obtained from β by dropping an item from either β1 or βm ;
2. α is obtained from β by dropping an item from βi , where |βi | ≥ 2;
3. α is a contiguous subsequence of α′, and α′ is a contiguous subsequence of β.
Note that during the derivation of a contiguous subsequence α from β, middle
events of β cannot be removed, so that the gaps between events are preserved.
Therefore, if δ is an input database sequence, β ⊑c δ, and α ⊴ β, then α ⊑c δ is satisfied in the presence of the maxGap constraint.
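The following Python sketch (ours) enumerates the contiguous (k − 1)-subsequences of a k-sequence according to the one-step cases of Definition 7, showing that they may be fewer than the k generic (k − 1)-subsequences.

# Illustrative sketch: the contiguous (k-1)-subsequences of alpha are obtained by
# dropping one item from the first or the last event, or from any event with at
# least two items (cases 1 and 2 of Definition 7).
def contiguous_subsequences(alpha):
    result = []
    for i, event in enumerate(alpha):
        if not (i == 0 or i == len(alpha) - 1 or len(event) >= 2):
            continue                               # middle one-item events cannot be touched
        for item in sorted(event):
            rest = event - {item}
            sub = alpha[:i] + ([rest] if rest else []) + alpha[i + 1:]
            if sub not in result:
                result.append(sub)
    return result

alpha = [{"A"}, {"B"}, {"C"}, {"D"}]               # the 4-sequence A -> B -> C -> D
print(contiguous_subsequences(alpha))
# [[{'B'}, {'C'}, {'D'}], [{'A'}, {'B'}, {'C'}]] : only 2 contiguous 3-subsequences,
# against the 4 generic 3-subsequences obtainable by dropping any single item.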
Lemma 8. If we use the concept of contiguous subsequence (⊴), the maximum gap constraint becomes anti-monotone as well. Therefore, if β is a frequent sequential pattern that satisfies the maxGap constraint, then every α, α ⊴ β, is frequent and satisfies the same constraint.
Definition 9. (Prefix/Suffix subsequence) Given a sequence α = (α1 → ... → αn) of length k = |α|, let (k − 1)-prefix(α) ((k − 1)-suffix(α)) be the sequence obtained from α by removing the last (first) item of the event αn (α1). We can say that an item is the first/last one of an event without ambiguity, due to the lexicographic order of items within events. We can now recursively define a generic n-prefix(α) in terms of the (n + 1)-prefix(α): the n-prefix(α) is obtained by removing the last item of the last event appearing in the (n + 1)-prefix(α). A generic n-suffix(α) can be defined similarly, by removing the first item of the first event of the (n + 1)-suffix(α). It is worth noting that a prefix/suffix of a sequence α is a particular contiguous subsequence of α, i.e. n-prefix(α) ⊴ α and n-suffix(α) ⊴ α.
3.2.4 Constraints enforcement
Algorithms solving the FSM problem usually search for Fk exploiting in some way the knowledge of Fk−1. The enforcement of anti-monotone constraints can be pushed deep into the mining algorithm, since patterns not satisfying an anti-monotone constraint C can be discarded immediately, with no alteration to the algorithm completeness (since their super-patterns cannot satisfy C either). More importantly, the anti-monotone constraint C is used during the generation of candidates. Remember that, according to the Apriori definition, a k-sequence α can be a "candidate" to be included in Fk only if all of its (k − 1)-subsequences are included in Fk−1.
We will use the ⊴ relation to support the notion of subsequence, in order to ensure that all the contiguous (k − 1)-subsequences of α ∈ Fk will belong to Fk−1. Note that if we used the general notion of subsequence (⪯), the number of (k − 1)-subsequences of α would be k. Each of them could be obtained by removing a distinct item from one of the events of α. Conversely, since we have to use the contiguous subsequence relation (⊴), the number of contiguous (k − 1)-subsequences of α may be less than k: each of them can be obtained by removing a single item only from particular events in α, e.g. items belonging to the starting/ending event of α, or contained in events composed of more than one item. In practice, each candidate k-sequence can simply be generated by combining a single pair of its contiguous (k − 1)-subsequences in Fk−1.
3.3 GSP
The first algorithm that proposed this candidate generation method, based on pairs of contiguous sequences, was presented in [62] by Srikant and Agrawal. Their algorithm, GSP, is a level-wise algorithm that repeatedly scans the dataset and counts the occurrences of the candidate frequent patterns contained in a set, which is generated before the beginning of each iteration. Each k-candidate is generated by merging a pair of frequent (k − 1)-patterns that share a (k − 2)-long contiguous sub-sequence.
Figure 3.1: GSP candidate generation. The 3-patterns and 4-patterns are connected
with their generators using a thick line. Candidates discarded after support check
are not shown.
3.3.1 Candidate generation
During the k-candidate generation phase, GSP merges every pair of frequent (k − 1)-patterns α and β such that (k − 2)-suffix(α) = (k − 2)-prefix(β). The result of the merge is the pattern α concatenated with the last item contained in β, i.e., 1-suffix(β). This item is inserted as part of the last event, if this was the case in β, or as a new event otherwise. For example, the patterns A → B and B → C generate the candidate A → B → C, whereas the patterns A → B and BC generate the candidate A → BC. In case some of the (k − 1)-subsequences of the obtained candidate are not frequent, the candidate is discarded. In the above example, in case A → C is not frequent, A → BC can be safely discarded. However, the same is not true for A → B → C, since A → C is not one of its contiguous subsequences. Indeed, even in case A → C were not frequent due to the maxGap constraint, A → B → C could be frequent.
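The merge step just described can be sketched as follows (our illustration, not GSP's actual code); we represent events as sorted tuples of items and a sequence as a list of events, and merge α and β when dropping the first item of α gives the same sequence as dropping the last item of β.

# Illustrative sketch of GSP's merge of two (k-1)-patterns into a k-candidate.
def drop_first(seq):
    head = seq[0][1:]
    return ([head] if head else []) + seq[1:]

def drop_last(seq):
    tail = seq[-1][:-1]
    return seq[:-1] + ([tail] if tail else [])

def gsp_merge(alpha, beta):
    if drop_first(alpha) != drop_last(beta):
        return None                                 # no common contiguous part
    last_item = beta[-1][-1]
    if len(beta[-1]) > 1:                           # item belonged to beta's last event
        return alpha[:-1] + [alpha[-1] + (last_item,)]
    return alpha + [(last_item,)]                   # item was a one-item event in beta

print(gsp_merge([("A",), ("B",)], [("B",), ("C",)]))   # [('A',), ('B',), ('C',)]  = A->B->C
print(gsp_merge([("A",), ("B",)], [("B", "C")]))       # [('A',), ('B', 'C')]      = A->BC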
The set of candidates Ck is represented using a hash tree. Each node in the tree
is either a leaf node, containing sequences along with their counters, or an internal
node, containing pointers to other nodes. In order to find the counter for a pattern,
the tree is traversed starting from the root. The next branch to visit is chosen using
a hash function on the pth item in the sequence, where p is the depth of the node.
Figure 3.1 represents a lattice of frequent patterns. Each ellipse indicates a pattern, a line indicates the relationship includes/included by, and a thick line indicates
the ones exploited by GSP for the generation of candidates containing more than
two items.
3.3.2 Counting
As soon as GSP completes the generation of the set of candidates Ck, it starts reading the input sequences in the dataset one by one. When an input sequence d is processed, GSP searches the hash tree recursively, processing all the branches that are compatible with the time-stamps contained in d. Each time a leaf is reached, GSP checks if any of the sequence patterns in the leaf is supported by d, and, in case the time constraints are satisfied, it increments the associated counter. The inclusion check of a sequence pattern s in the input sequence d is performed using a vertical representation of d, i.e., each item in d is associated with a list of time-stamps corresponding to its occurrences in d. This representation enables GSP to efficiently align the pattern s with the input sequence d, starting from the first element and stretching gaps as long as the constraints are satisfied.
3.4 SPADE
A completely different approach was proposed by Zaki in SPADE [70]. SPADE is an intersection-based algorithm, i.e. each item is associated with a list of pairs (sid, eid) and the support of a pattern is obtained using intersections. The pair (sid, eid) corresponds to an occurrence of the item in an input sequence sid (sequence id) with time-stamp eid (event id). Since these lists are kept in memory, the candidates can be generated and checked on the fly, and there is no need to maintain the set of candidates in memory, or to scan the dataset multiple times. SPADE, like GSP, merges pairs of (k − 1)-patterns to obtain k-candidates; however, the pairs are chosen in a different way.
3.4.1 Candidate generation
SPADE generates a candidate k-sequence from a pair of frequent (k − 1)-subsequences that share a common (k − 2)-prefix (in some versions of the algorithm the author uses suffixes instead of prefixes; this is not relevant, however, unless we need to restrict the search space to patterns beginning/ending with some items or sequences of items). The generated candidate is composed of α
followed by the last element of β, either as a one-item event or as part of the last event, as we will explain later. For example, α = A→B→C→D is obtained by combining the two subsequences A→B→C and A→B→D, which share the 2-prefix A→B. Since the resulting candidates also share the same prefixes, a set of k-patterns sharing a common (k − 1)-prefix is closed with respect to candidate generation, and can be processed independently. The generation of 2-candidates is in some way an exception: every pair of frequent items can generate candidates, since they share a 0-prefix. For this reason, SPADE uses intersections for candidates containing at least 3 items, but uses a count-based approach for frequent items and 2-patterns.
In order to generate k-candidates, SPADE considers each pair of frequent (k − 1)-patterns sharing the same (k − 2)-prefix, including pairs containing the same pattern twice. Each pair can produce one, two, three or no candidates at all, depending on their last events. Let α and β be two frequent (k − 1)-patterns sharing a common prefix P and ending respectively with items X and Y. The last event in α may contain one or more items. The first case is indicated as α = P → X, the second
one as α = PX. Four cases may arise:
α = P → X, β = P → Y: P → XY, P → X → Y and P → Y → X are valid candidates. In case X = Y, P → X → X is the only candidate generated by α and β.
α = P → X, β = PY: PY → X is the only candidate generated by α and β.
α = PX, β = P → Y: PX → Y is the only candidate generated by α and β.
α = PX, β = PY: In case X < Y, the candidate is PXY. Otherwise the candidate is PYX, unless X = Y, in which case no candidate is generated.
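The four cases can be sketched as follows (our illustration, not SPADE's actual code); each (k − 1)-pattern is described by its last item and by a flag telling whether that item extends P's last event (PX) or forms a new event (P → X), and candidates are returned in textual form.

# Illustrative sketch of SPADE's four candidate-generation cases.
def spade_candidates(x, x_in_event, y, y_in_event):
    if not x_in_event and not y_in_event:            # alpha = P -> X, beta = P -> Y
        if x == y:
            return [f"P->{x}->{x}"]
        return [f"P->{min(x, y)}{max(x, y)}", f"P->{x}->{y}", f"P->{y}->{x}"]
    if not x_in_event and y_in_event:                 # alpha = P -> X, beta = PY
        return [f"P{y}->{x}"]
    if x_in_event and not y_in_event:                 # alpha = PX, beta = P -> Y
        return [f"P{x}->{y}"]
    if x == y:                                        # alpha = PX, beta = PY, X = Y
        return []
    return [f"P{min(x, y)}{max(x, y)}"]               # PXY, with the two items ordered

print(spade_candidates("X", False, "Y", False))   # ['P->XY', 'P->X->Y', 'P->Y->X']
print(spade_candidates("X", True, "Y", False))    # ['PX->Y']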
3.4.2 Candidate support check
Immediately after the generation of a candidate, SPADE checks its support using idlist intersections. An idlist is a sorted list of occurrences, i.e., pairs (sid, eid), where sid identifies a specific input sequence, and eid one of its events. The ordering is on sid, with eid as secondary key. In SPADE, an idlist can refer either to an item or to a pattern. In the first case, the list corresponds to the occurrences of the item, in the second case to the last position of each occurrence of the sequence. For example, if the only input sequence in the dataset is (sid = 1, {({A, B}, eid = 1), ({A, C}, eid = 2), ({C}, eid = 3)}), then idlist(A) = {(1, 1), (1, 2)}, idlist(AC) = {(1, 2)}, and idlist(A → C) = {(1, 2), (1, 3)}. Note
that it is not relevant that there are two distinct occurrences of A → C ending in
(1, 3).
Two kinds of intersections are possible: ordinary intersection, or equality join, and temporal intersection, or temporal join. The first one is used when the candidate is PXY = αY, or P → XY = αY, and exactly corresponds to the common set intersection of idlist(α) and idlist(Y): the result is the set of pairs (sid, eid) appearing in both idlists. The second one is slightly more complex and corresponds to the candidates α → Y (PX → Y and P → X → Y). In this case the result is the subset of idlist(Y) containing only those entries (sid, eid2) such that an entry (sid, eid1), with eid1 < eid2, exists in idlist(α). Thanks to the ordering of the idlists, both operations can be implemented efficiently. Furthermore, the idlist of α is available
from the previous level. Note that, thanks to the closure of common-prefix classes with respect to candidate generation, the search space can be traversed depth-first by recursively exploring each prefix class. Thus, the idlists of prefixes can be reused with limited memory requirements. SPADE can also be implemented in a strictly level-wise manner; however, it would be far less efficient.
3.4.3 cSPADE: managing constraints
In case the maxGap constraint is enforced, the solution found by the SPADE algorithm is no longer complete. For example, α = A→B→C→D is obtained by combining the two subsequences A→B→C and A→B→D, which share the 2-prefix A→B. Unfortunately, A→B→D is not a contiguous subsequence of α. This implies that, even if α is frequent and satisfies a given maxGap constraint, i.e. α ∈ F4, its subsequence A→B→D might not have been included in F3, since it may not satisfy the same maxGap constraint. In other words, SPADE might lose candidates and the related frequent sequences. cSPADE [69] overcomes this limit by using exactly the contiguous subsequence concept: α = A→B→C→D is now obtained from A→B→C and C→D, i.e. by combining the (k − 1)-prefix and the 2-suffix of α. It is straightforward to see that both the (k − 1)-prefix and the 2-suffix of α are contiguous subsequences of α. Unfortunately, the need for contiguous subsequences to guarantee anti-monotonicity under the maxGap constraint partially destroys the self-containment of SPADE's prefix-based equivalence classes, which ensures high locality and low memory requirements. While each prefix class is mined, cSPADE also needs to maintain F2 in main memory, since it uses 2-suffixes to extend frequent (k − 1)-sequences.
3.5 CCSM
The reason behind the choice to use F2 for candidate generation in cSPADE is that F2 is usually smaller than Fk−1 for k > 3, so the idlists of frequent 2-sequences should fit in memory. However, even when this is true, the idlists of 2-sequences contain more elements than those of (k−1)-patterns, thus the average cost of an intersection is greater. In addition, the
number of intersections is generally larger. In fact, the generation of a candidate depends on finding a pair of patterns with a matching common part. Hence, when the match is required on just one item, as in the case of intersection with F2, the probability of generating a false positive (discarded candidate) is higher. On the other hand, since the suffixes of the processed candidates are in no particular order, using Fk−1 for the same purpose can be excessively memory demanding. CCSM, the algorithm we propose, avoids these issues by using a suitable traversal order of the search space and an improved bidirectional idlist intersection operation.
3.5.1 Overview
The candidate generation method adopted by CCSM was inspired by GSP [62], which is also based on the contiguous subsequence concept. We generate a candidate k-sequence α from a pair of frequent (k − 1)-sequences, which share with α either a (k − 2)-prefix or a (k − 2)-suffix. It is easy to see that both these frequent (k − 1)-sequences are contiguous subsequences of α. As we have already highlighted above, the candidates generated by cSPADE are more numerous than those generated by CCSM/GSP. We show this with an example. Suppose that A → B → C ∈ F3, and that the only frequent 3-sequence having prefix B → C is B → C → D. CCSM directly combines these two 3-sequences to obtain a single potentially frequent 4-sequence A → B → C → D. Conversely, cSPADE tries instead to extend A → B → C with all the 2-sequences in F2 that start with C. In this way, cSPADE might generate a lot of candidates, even if, due to our hypotheses, the only candidate that has a chance to be frequent is A → B → C → D.
3.5.2 The CCSM algorithm
Like GSP, CCSM visits level-wise and bottom-up the lattice of the frequent sequential
patterns, building at each iteration Fk , the set of all frequent k-sequences.
CCSM starts with a count-based phase that mines a horizontal database, and
extracts F1 and F2 . During this phase, the database is scanned, and each input
sequence is checked against a set of candidate sequences. If the input sequence
contains a candidate sequence, the counter associated with the candidate is incremented accordingly. At the end of this count-based phase, the pruned horizontal
database is transformed into a vertical one, so that our intersection-based phase can
start. Thereafter, when a candidate k-sequence is generated from a pair of frequent (k − 1)-patterns, its support is computed on-the-fly using item idlist intersections.
This happens by joining the atomic idlists (stored in the vertical database) that are
associated with the frequent items in F1 , as well as several previously computed
intermediate idlists that are found in a cache.
In order to describe how the intersection-based phase works, it is necessary to
discuss how candidates are generated, how idlists are represented and joined, and
how CCSM idlist cache is organized.
Candidate generation.
At iteration k, we generate the candidate k-sequences starting from the frequent
(k − 1)-sequences in Fk−1 . For each f ∈ Fk−1 , we generate candidate k-sequences
by merging f with every f′ ∈ Fk−1 such that (k − 2)-suffix(f) = (k − 2)-prefix(f′). For example, f : BD→B is extended with f′ : D→B→B to generate the candidate 4-sequence BD→B→B. Note that, by construction, f and f′ are contiguous subsequences of the new candidate.
To make more efficient the search in Fk−1 for pairs of sequences f and f′ that
share a common suffix/prefix, we aggregate and link the various groups of sequences
in Fk−1 .
Figure 3.2 illustrates the generation of the candidate 4-sequences starting from
F3 . On the left-hand and on the right-hand side of the figure two copies of the
3-sequences in F3 are shown. These sequences are lexicographically ordered either
with respect to their 2-suffixes or to their 2-prefixes. Moreover, sequences sharing
the same suffix/prefix are grouped (this is represented by circling each aggregation/partition with dotted boxes). For example, a partition appearing on the left
side is {BD → B, D → D → B}. If two partitions that appear on the opposite
sides share a common contiguous 2-subsequence (2-suffix = 2-prefix), they are also
linked together. For instance, two linked partitions are {BD → B, D → D → B}
(on the left), and {D → BD, D → B → B} (on the right). Due to the sharing of
suffix/prefix within and between linked partitions, we can obviously save memory
to represent F3 .
The linked partitions of frequent sequential patterns are the only ones we must
combine to generate all the candidates. In the middle of Figure 3.2, we show the
candidates generated for this example. Candidates that do not turn out to be frequent are shown in dashed boxes, while the frequent ones are indicated with solid-line boxes. Note that,
before passing to the next pair, we first generate all the candidates from the current
pair of linked partitions. The order in which candidates are generated enhances
temporal locality, because the same prefix/suffix is encountered several times in
consecutively generated candidates. Our caching system takes advantage of this
locality, storing and reusing intermediate idlist joins.
Idlist intersection.
To determine the support of a candidate k-sequence p, we have first to produce the
associated idlist L(p). Its support will correspond to the number of distinct sid
values contained in L(p).
To produce L(p), we have to join the idlists associated with two or more subsequences of p. If both L(p′1) and L(p′2) are available, where p′1 and p′2 are the two contiguous subsequences whose combination produces p, L(p) can be generated very efficiently through a 2-way intersection: L(p) = L(p′1) ∩ L(p′2). Otherwise, we have
to intersect idlists associated with smaller subsequences of p. The limit case is a
Figure 3.2: CCSM candidate generation.
k-way intersection, when we have to intersect atomic idlists associated with single
items.
As an example of a k-way intersection, consider the candidate 3-sequence
A→B→C. Our vertical database stores L(A), L(B) and L(C), which can be joined
to produce L(A→B→C). Each atomic list stores (sid, eid) pairs, i.e. the temporal
occurrences (eid) of the associated item within the original input sequences (sid).
When L(A), L(B) and L(C) are joined, we search for all occurrences of A followed
by an occurrence of B, and then, using the intermediate result L(A→B), for
occurrences of C after A→B. If a maximum or minimum gap constraint must be
satisfied, it is also checked on the associated timestamps (eids).
Note that in this case we have generated the pattern A→B→C by extending the
pattern from left to right. An important question regards what information has to
be stored along with the intermediate list L(A→B). We can simply show that, if we
extend the pattern from left to right, the only information needed for this operation is that related to the timestamps associated with the last item/event of the sequence.
With respect to L(A→B), this information consists in the list of (sid, eid) pairs
of the B event. Each pair indicates that an occurrence of the specified sequential
pattern occurs in the input sequence sid, ending at time eid.
On the other hand, if we generate the sequence by extending it from right to
left, the intermediate sequence should be B→C, but the information to store in
L(B→C) should be related to the first item/event of the sequence (B). In this
case, each (sid, eid) pair stored in the idlist should indicate that an occurrence of
the specified sequential pattern exists in input sequence sid, starting at time eid.
Consider now that we use a cache to store intermediate sequences and associated
idlists. In order to improve cache reuse, we want to exploit cached sequences to
extend other sequences from left to right and vice versa. Therefore, the lists of pairs
(sid, eid) should be replaced with lists of triples (sid, first_eid, last_eid), indicating that an occurrence of the specified sequential pattern occurs in input sequence sid, starting at time first_eid and ending at time last_eid.
Finally, note that two types of idlist join are possible: equality join (denoted as
∩e ) and temporal join (denoted as ∩t ). The first is the usual set-intersection, and is
used when we search for occurrences of one item appearing simultaneously with the
last item of the current sequence: for example, L(A→BC) = L(A→B) ∩e L(C).
Temporal join is instead an ordering-aware intersection operation, which may also
check whether the minimum and maximum gap constraints are satisfied. Consider
the join of the example above, i.e. L(A→B→C) = L(A→B) ∩t L(C). The result of this join is obtained from L(C) by discarding all its pairs (sid2, eid2) with a non-matching sid1 in the first idlist (L(A→B)), or with a matching sid1 that is not
associated with any eid1 smaller than eid2 .
More formal definitions of the two base cases (lists of pairs) for the equality join and the (minGap, maxGap) constraint-enforcing temporal join are shown below:

L1 ∩e L2 = {(sid2, eid2) ∈ L2 | ∃(sid1, eid1) ∈ L1 : sid1 = sid2 ∧ eid1 = eid2}

L1 ∩t L2 = {(sid2, eid2) ∈ L2 | ∃(sid1, eid1) ∈ L1 : sid1 = sid2 ∧ eid1 < eid2 ∧ minGap ≤ |eid2 − eid1| ≤ maxGap}
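The two definitions translate almost literally into code; the following Python sketch (ours) operates on idlists represented as plain lists of (sid, eid) pairs, and the example values are illustrative.

# Illustrative sketch of the equality join and the gap-constrained temporal join.
def equality_join(l1, l2):
    s1 = set(l1)
    return [(sid, eid) for (sid, eid) in l2 if (sid, eid) in s1]

def temporal_join(l1, l2, min_gap=0, max_gap=float("inf")):
    return [(sid2, eid2) for (sid2, eid2) in l2
            if any(sid1 == sid2 and eid1 < eid2 and min_gap <= eid2 - eid1 <= max_gap
                   for (sid1, eid1) in l1)]

L_AB = [(1, 2), (2, 5)]            # occurrences of A->B, identified by their last eid
L_C = [(1, 2), (1, 4), (2, 9)]     # occurrences of the item C
print(equality_join(L_AB, L_C))              # [(1, 2)]  -> supports A->BC
print(temporal_join(L_AB, L_C, max_gap=3))   # [(1, 4)]  -> supports A->B->C with gap <= 3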
    Cached Sequence    Cached Idlist
1   A                  L(A)
2   A→A                L(A) ∩t L(A)
3   A→A→B              [L(A) ∩t L(A)] ∩t L(B)
4   A→A→BC             [[L(A) ∩t L(A)] ∩t L(B)] ∩e L(C)
5   A→A→BC→D           [[[L(A) ∩t L(A)] ∩t L(B)] ∩e L(C)] ∩t L(D)

Figure 3.3: Example of cache usage.
Idlist caching.
Our k-way intersection method can be improved using a cache of k idlists. Figure 3.3
shows how our caching strategy works: the table represents the status of the cache
after the idlist associated with sequence A→A→BC→D has been computed. Each
cache entry is numbered and contains two values: a sequence and its idlist. Each
sequence entry i is obtained from entry (i − 1) by appending an item. In a similar
way, the associated idlist is the result of a join between the previous cached idlist and
the idlist associated with the last appended item. When a new sequence is generated,
the cache is searched for a common prefix and the associated idlist. If a common
prefix is found, CCSM reuses the associated idlist, and rewrites subsequent cache
lines. Considering the example of Figure 3.3, if the candidate A→A→BF is then
generated, the third cache line corresponding to the common prefix A→A→B will
be reused. In this way, the support of A→A→BF can be computed by performing
a single equality join between the idlist in line 3 and L(F ). The result of this join
is written in line 4 for future reuse.
Since the cache contains all the prefixes of the current sequence along with the
associated idlists, reuse is optimal when candidate sequences are generated in lexicographic order. Furthermore, since idlist length (and join cost) decreases as sequence length increases, the joins saved by exploiting the cached idlists are the most expensive ones.
Figure 3.4: CCSM idlist reuse.
The combined effect of cache use and candidate generation is illustrated in Figure
3.4. On the left-hand side, a fragment of the lists of the linked partitions sharing
a common infix is shown. The right-hand side of the Figure illustrates instead how
candidates are generated. First, we consider Partition(FG→A), i.e. the set of sequences sharing the prefix/suffix FG→A. L(FG→A) is processed first, using
the cache as described before. L(A) and L(B) are then joined left to right with
L(FG→A) to obtain L(FG→A→A) and L(FG→A→B). Finally, we join right to left the lists so obtained with L(A) and L(C) to produce the lists associated with all the possible candidates. When Partition(FG→A) has been processed, all the intermediate idlists except those stored in the cache are discarded, and the next Partition(FG→B) is processed. The cache currently contains L(FG→A) and all its intermediate idlists, so that L(FG) can be reused for computing L(FG→B).
Since partitions are ordered with respect to the common infix, similar reuses are
very frequent.
[Plot: number of intersection operations vs. pattern length for the 2-ways, cached k-ways (CCSM) and pure k-ways methods; left panel: dataset cs11, min support 0.30%, max-gap 8; right panel: dataset cs21, min support 0.40%, max-gap 8.]
Figure 3.5: Number of intersection operations actually performed using 2-ways, pure
k-ways and cached k-ways intersection methods while mining two synthetic datasets.
Figure 3.5 shows the efficacy of CCSM caching strategy. The plots report the
actual number of intersection operations performed using 2-ways, pure k-ways and
CCSM cached k-ways intersection methods while mining two synthetic datasets. As can be seen, our small cache is very effective, since it allows saving a lot of intersection operations over a pure k-ways method, while memory requirements are significantly lower than those deriving from the adoption of a pure 2-ways intersection method.
3.5.3 Experimental evaluation
In order to evaluate the performance of the CCSM algorithm, we conducted several tests on a Linux box equipped with a 450MHz Pentium II processor, 512MB of RAM and an IDE hard disk. The datasets used were CS11 and CS21, two synthetic datasets generated using the publicly available synthetic data generator code from the IBM Almaden Quest data mining project [7]. In particular, the datasets contain 100,000 customer sequences composed on average of 10 (CS11) and 20 (CS21)
transactions of average length 5. The other parameters Ns , Ni , N , I used to generate
the maximal sequences of average size S = 4 (CS11) and S = 8 (CS21), were set
to 5000, 25000, 10000 and 2.5, respectively. Note that these values are the same as
those used to generate the synthetic datasets in [62, 69, 70]. Figure 3.6 plots the
number of frequent sequences found in datasets CS11 and CS21 as a function of
the pattern length for different values of the maxGap constraint. As expected, the
number of frequent sequences is maximum when no maxGap constraint is imposed,
while it decreases rapidly for decreasing values of the maxGap constraint.
[Plot: number of frequent patterns vs. pattern length for different maxGap values (no constraint, 1, 2, 4, 8, 12); left panel: dataset cs11, min support 0.30%; right panel: dataset cs21, min support 0.40%.]
Figure 3.6: Number of frequent sequences in datasets CS11 (minsup=0.30) and
CS21 (minsup=0.40) as a function of the pattern length for different values of the
maxGap constraint.
In order to assess the relative performance of our algorithm, we compared its
running times with the ones obtained under the same testing conditions by cSPADE
(we acknowledge Prof. M.J. Zaki for kindly providing us cSPADE code) [69, 70].
Figure 3.7 reports the total execution times of CCSM and cSPADE on datasets
CS11 and CS21 as a function of the maxGap value. In the tests conducted with
cSPADE we tested different configurations of the command line options available to
specify the number of partitions into which the dataset has to be split (-e #, default
no partitioning), and the maximum amount of memory available to the application
(-m #, default 256MB).
From the plots, we can see that while on the CS11 dataset performances of the
two algorithms are comparable, on the CS21 dataset CCSM remarkably outperforms
cSPADE for large values of maxGap, while cSPADE is faster when maxGap is small.
This holds because for large values of maxGap, the actual number of frequent sequences is large (see Figure 3.6), and cSPADE has to perform a lot of intersections
between relatively long lists belonging to F2 . CCSM on the other hand, reuses in
this case several intersections found in the cache. Since execution times increase
rapidly for increasing values of maxGap, we think that the behavior of CCSM is in general preferable to that of cSPADE.
The same considerations can be made by looking at the plots reported in Figure 3.8, which report, for a fixed maxGap constraint (maxGap=8), the execution times of CCSM and cSPADE on datasets CS11 and CS21 as a function of the minimum support threshold. The CCSM and cSPADE execution times were very similar on the CS11 dataset, while on the CS21 dataset CCSM was, for maxGap=8, about twice as fast as cSPADE.
[Plots omitted: running time (s) vs. Max Gap for dataset cs11 (min support 0.30%) and dataset cs21 (min support 0.40%), comparing CCSM with cSPADE under several -e/-m option settings.]
Figure 3.7: Execution times of CCSM and cSPADE on datasets CS11 (minsup=0.30)
and CS21 (minsup=0.40) as a function of the maxGap value.
[Plots omitted: running time (s) vs. minimum support (%) for dataset cs11 and dataset cs21 with max gap 8, comparing CCSM with cSPADE under several -e/-m option settings.]
Figure 3.8: Execution times of CCSM and cSPADE on datasets CS11 and CS21 with
a fixed maxGap constraint (maxGap=8) as a function of the minimum support
threshold.
3.6 Related works
The problem has been initially introduced by Agrawal and Srikant in [7], where they present AprioriAll, a count-based algorithm for solving it. The same authors in [62] generalize the problem and introduce GSP, a new count-based algorithm characterized by a better counter management and candidate generation policy. Another algorithm very similar to GSP, but using more efficient data structures that exploit the presence of common suffixes shared by several frequent patterns, is PSP [37].
As in the association case, both intersection-based and projection-based algorithms exist. Two of the best algorithms in the first category are SPADE [65, 67, 70], which computes the support of candidates using list intersections, and SPAM [8], which performs the same operation using boolean vectors and bitwise operations. Two representatives of the second category are FreeSpan [24] and PrefixSpan [52].
Mannila, Toivonen and Verkamo [35] define a slightly different problem: instead of frequent patterns common to several input sequences, they search for episodes frequently appearing in a single long input sequence. The support of a subsequence is the number of temporal windows containing it. In a subsequent work [34, 36] the same authors introduce constraints on single items and on pairs of elements present inside episodes.
Generalizations introduced in GSP [62] are the usage of a taxonomy, the possibility to group together events contained in a specified temporal frame, and temporal constraints on the minimum and maximum allowable distance between two consecutive events (minGap/maxGap). The proposed algorithm, however, does not handle them efficiently. The performance of SPADE with constraint enforcement (cSPADE [69]) is widely better when no constraint is required on maxGap, but is limited, as for GSP, when it is enforced. CCSM (S. Orlando, R. Perego, C. Silvestri [46, 47]) has been specifically designed in order to overcome this limitation, using a candidate generation method that is not affected by the anti-monotonicity issues of the maxGap constraint. PrefixSpan has been extended in order to handle several kinds of constraints [53].
A further evolution of PrefixSpan is CloSpan [64], an algorithm that is able to detect all closed sequential patterns (see note 2 below), pruning early during the computation most patterns that are frequent but not closed. Closed sequential patterns, even if they are much more compact, exactly represent the whole set of frequent sequential patterns, and it is possible to switch from one representation to the other. Nevertheless, building the complete set of patterns and checking for inclusion is more expensive than in the case of associations. CloSpan was the first algorithm dealing with closed sequential patterns. More recently, J. Wang and J. Han proposed BIDE, a new algorithm that finds all and only the closed patterns, without false positives that need to be corrected with post-processing.
One of the first algorithms for incremental sequence mining is ISM [51], which uses a method similar to that used by SPADE and, in addition, maintains the set of infrequent candidates (negative border) in order to minimize recomputation. This entails a non-trivial resource usage for large datasets, in contrast with ISE [38, 39], which does not need additional data and uses just inference from already known patterns.
Note 2: Closed sequential patterns are those sequential patterns that are not contained in any other pattern having the same support. If B contains A, and both have the same support, then every input sequence containing A also contains B (the converse is always true).
3.7 Conclusions
In this chapter, we have presented CCSM, a new FSM algorithm that mines temporal
databases in the presence of user-defined constraints. CCSM searches for sequential
patterns level-wise, and adopts an intersection-based method to determine the support of candidate k-sequences. Each time a candidate k-sequence α is generated,
its support is determined on the fly by joining the k atomic idlists associated with
the frequent items (1-sequences) constituting the candidate. This k-way intersection is, however, a limit case of our method. In fact, our order of generation of candidates ensures high locality, so that with high probability successively generated
candidates share a common subsequence of α. A cache is thus used to store the
intermediate idlists associated with all the possible prefixes of α. When the idlist
of another candidate β has to be built, we reuse the idlist corresponding to the
common subsequence of maximal length. The exploitation of such a caching strategy entails a strong reduction in the number of join operations actually performed. Finally, CCSM is able to consider the very challenging maxGap constraint over the sequential patterns extracted. Preliminary experiments conducted on synthetically
generated datasets showed that CCSM remarkably outperforms cSPADE when the
selectivity of the gap constraint is not high. Since we are conscious that further
optimization can be pushed into the code, we consider these results as encouraging.
CCSM result sets are strictly ordered on (common part, prefix item, suffix item),
thus different result sets can be efficiently merged using a simple list merge. Since
the distributed and stream FIM algorithms that are presented in the second part
of this thesis make heavy use of result merging, CCSM can be used to efficiently extend them to the FSM problem. In the last chapter, we give some
more detail on this use of CCSM.
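As a hint of how such a merge can be implemented, the following Python sketch (ours, assuming each result set is a list of (pattern, support) pairs already sorted by pattern key) sums the supports of patterns appearing in both sets with a single linear scan.

def merge_results(a, b):
    # Merge two result sets sorted by pattern, summing supports of shared patterns.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            out.append((a[i][0], a[i][1] + b[j][1]))
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

r1 = [(('A',), 10), (('A', 'B'), 4), (('C',), 7)]
r2 = [(('A',), 6), (('B',), 3), (('C',), 2)]
merge_results(r1, r2)
# [(('A',), 16), (('A', 'B'), 4), (('B',), 3), (('C',), 9)]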
II
Second Part
4 Distributed datasets
In many real systems, data are naturally distributed, usually due to plural ownership or to a geographical distribution of the processes that produce the data. Moving all the data to one single location for processing could be impossible due to either policy or technical reasons. Furthermore, the communications between the entities owning parts of the data may not be particularly fast or immediate. In this context, the communication efficiency of an algorithm is often more important than the exactness of its results.
In this chapter, we will focus on distributed association mining. We will start by
characterizing the different ways data can be distributed, and describe some useful
techniques common to several distributed association mining algorithms. Then we
will introduce the frequent itemset mining problem for homogeneous distributed
datasets and present two novel communication efficient distributed algorithms for
approximate mining of frequent patterns from transactional databases. Both the
algorithms we propose locally compute frequent patterns, and then merge local
results. The first algorithm, APRed , adaptively reduces the support threshold used
in local computation in order to improve the accuracy of the result, whereas the
second one, APInterp , uses an effective method for inferring the local support of
locally infrequent itemsets. Both strategies give a good approximation of the set of
the globally frequent patterns and their supports for sparse datasets, but APInterp
is more resilient to data skew. In the last part of the chapter, we report the results
of part of the tests we have conducted on publicly available datasets. The goal
of these tests is to evaluate the similarity between the exact result set and the
approximate ones returned by our distributed algorithms in different cases, as well
as the scalability of APInterp .
4.1 Introduction
As suggested before, there are several cases in which data can be distributed among
different entities, that we will call nodes. In the case of cellular phone networks,
each cell or group of cells may have its separate database for performance and
resilience reasons. At the same time, other information about the customer that owns a device is available at the accounting department, and is kept separate due
to privacy reasons. Where a particular piece of data can be found influences the kind of
solutions a problem can have. Therefore, before describing any algorithm for a
particular data mining problem, we need to specify in which context it will be used.
Homogeneous and heterogeneous data distribution
The two above examples of distributed databases, related to the cellular phones
domain, fall in two distinct major classes of data distribution. In the first case, each
node has its own database, containing the log of the activities of a device in the area
controlled by the group of antennas. Every local database contains different data,
but the kind of information is the same for every node. This situation is indicated
as homogeneous data distribution. On the other hand, if we are also interested in
data about customers, nodes having different kinds of data need to cooperate. In
the example, the cell database could contain the information that a device stopped
for several hours in the same place, whereas the accounting department database
knows which customer is associated with that device and its home address. This
situation is indicated as heterogeneous data distribution. In this chapter, we will
focus on association mining on homogeneously distributed data.
Communication bandwidth and latency issues
A key factor in the implementation of distributed algorithms is the kind of communication infrastructure available. An algorithm suitable for nodes connected by
high-speed network links, can be of little use if nodes are connected by a modem
and the public telephone network. Furthermore, for an algorithm that entails several
blocking communications, a high latency is definitely a serious issue. Distributed
systems are usually characterized by links having low speed, or high latency, or
both. Hence, efficient algorithms need to exchange as few data as possible, and
avoid blocking situation in which the local computation cannot resume until some
remote feedback arrives.
Parallel vs Distributed
Parallel (PDM) and distributed (DDM) data mining are a natural evolution of data mining technologies, motivated by the need for scalable and high-performance systems, or by policy/logistic reasons. The main difference between these two approaches is that while in PDM data can be moved (centralized) to a tightly coupled parallel system before starting the computation, DDM algorithms must deal with limited
possibilities for data movement/replication, due either to specific policies or technical reasons like large network latencies. A good review of algorithms and issues in
distributed data mining is [48].
4.2. Approximated distributed frequent itemset mining
4.1.1
53
Frequent itemset mining
There exist algorithms for distributed frequent itemset mining (FIM) that usually operate in a homogeneous context, and algorithms able to cope with heterogeneous
data, linked by primary keys [27] as, for instance, the individual number in the
previously seen example about personal data. The two main parallel/distributed
approaches [66], in the homogeneous case, are Count Distribution, in which each
node computes the support for the same set of candidates on his own dataset, and
Candidate/Data Distribution, where each node computes the support of a part of
candidates, using also part of the dataset owned by other nodes.
More in detail, algorithms based on Count-distribution compute the support of
each pattern locally, and then exchange (or collect) and sum all the supports to
obtain the global support. On the other hand, in Data Distribution and Candidate
Distribution each processor handles a disjoint set of candidate patterns, and access
all the data partitions for computing global support. The difference between the two
approaches is that, in Data Distribution, candidates are partitioned merely to divide
the workload, and all data are accessed by all processors, whereas in Candidate
Distribution the candidates are partitioned in such a way that each processor can
proceed independently and data are selectively replicated. Since only the counters
are sent, Count Distribution minimizes the communications, making it suitable for
loosely coupled setting. The other two techniques, instead, are more appropriate for
parallel systems.
A first parallel version of Apriori is introduced in [5], while other more efficient
solutions are found in [44, 45, 58, 22, 13, 56, 41, 13, 22, 5, 27, 66]. The diversity
of possible use cases makes the selection of the best algorithm a hard task. Even
metrics used for comparison may be more or less appropriate according to the specific
system architecture. A good survey on parallel association mining algorithms is [66].
Most of these algorithms, however, are not suitable for loosely coupled settings.
Only a few papers discussing truly distributed FIM algorithms recently appeared in
the literature [56, 57, 63].
Nevertheless, as previously explained, there are several real world systems that are intrinsically distributed and loosely coupled. For this reason we have chosen to prefer DDM solutions, able to deal with such cases.
4.2 Approximated distributed frequent itemset mining
In this section, we will introduce two novel approximate algorithms for distributed frequent itemset mining. After a brief summary of the notation used for frequent itemsets, we will introduce the centralized algorithm that inspired our algorithms and its naïve distributed version. Then we will describe APRed and APInterp , the algorithms we propose, and the experimental results we have obtained. Finally, we
will draw some conclusions.
4.2.1 Overview
A dataset D is a collection of subsets of items I = {it1 , . . . , itm }. Each element of D is called a transaction. A pattern x is frequent in D with respect to a minimum support minsup, if its support is greater than σmin = minsup · |D|, i.e. the pattern occurs in at least σmin transactions, where |D| is the number of transactions in D. A k-pattern is a pattern composed of k items, Fk is the set of all frequent k-patterns, and F = ∪i Fi is the set of all frequent patterns. F1 is also called the set of frequent items.
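For instance, assuming the hypothetical dataset D = {{a,b,c}, {a,b}, {a,c}, {b,d}} and minsup = 40% (so σmin = 1.6), the frequent patterns are F1 = {{a}, {b}, {c}} and F2 = {{a,b}, {a,c}}, since every other pattern occurs in at most one transaction.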
In this section, we discuss two distributed algorithms for approximate mining of
frequent itemsets: APRed (Approximate Partition with dynamic minimum support
Reduction) and APInterp (Approximate Partition with Interpolation). Both exploit
DCI [44], a state-of-the-art algorithm for FIM, as the miner engine used for local
computations. The name "Approximate Partition" derives from the distributed computation method adopted, which is inspired by the Partition algorithm [55] and by its straightforward distributed version [41].
We assume that our dataset D is divided into several disjoint partitions Di , i ∈ {1, ..., n}, located on n collaborating entities, where each transaction completely belongs to one of the partitions. In particular, we consider that the dataset is already partitioned, according to some business rules, among geographically distributed systems. Collaborating entities are loosely coupled, and even if the available network bandwidth sometimes is not an issue, latency surely is. A fitting example is a set of insurance companies connected by the Internet that collaborate in order to detect frauds. In this kind of setting, we should avoid sending lots of messages with several barrier synchronizations. Thus, a small loss of accuracy is a fair trade-off for a reduced number of communications/synchronizations.
Both APRed and APInterp independently compute a local solution on each node and
then merge local results. Instead of making a second pass, as Distributed Partition
does, we propose other methods to be used during the merge phase in order to
improve the support count. To this end, the minimum support threshold used in
local computation is adaptively reduced in APRed , whereas an approximate support
inference heuristic is used in APInterp . Experimental tests show that the solutions
produced by both APRed and APInterp are good approximation of the exact global
result, and that APInterp is more efficient than APRed . Unfortunately, the APInterp
method may also generate a few false positives, whose approximate supports are usually very close to the exact ones. Therefore, the support of the rules extracted from these false positive patterns should not bother analysts. This is especially true when a positive result just indicates a case that needs the attention of the operator
for further investigation, as in the case of fraud detection: if a pattern with support
slightly higher than the threshold is interesting, probably a slightly lower one will
be interesting too. A single synchronization is required to compute and redistribute
the reduced support threshold, in APRed , and the knowledge of F2 , used by slaves for
global pruning in both algorithms. This is particularly important in the described
distributed setting, where the network latency is often a more critical factor than
the available bandwidth, and the reduced number of communications is worth a
small reduction in the accuracy of results. In APInterp , it is also possible to disable
local pruning; at the cost of a larger number of false positives, the algorithm becomes asynchronous and suitable for unidirectional communications.
4.2.2 The Distributed Partition algorithm
Our APInterp and APRed algorithms were inspired by Partition [55], a sequential algorithm that divides the dataset into several partitions processed independently. The
basic idea exploited by Partition is the following: each globally frequent pattern must
be locally frequent in at least one partition. This guarantees that the union of all local solutions is a superset of the global solution. However, one further pass over the
database is necessary to remove all false positives, i.e. patterns that result locally
frequent but globally infrequent.
Obviously, Partition can be straightforwardly implemented in a distributed setting with a master/slave paradigm [41]. Each slave becomes responsible for a local
partition, while the master performs the sum-reduction of local counters (first phase)
and orchestrates the slaves for computing the missing local supports for potentially
globally frequent patterns (second phase) to remove patterns having global support
less than minsup (false positive patterns collected during the first phase).
While the Distributed Partition algorithm gives the exact values for supports, it
has pros and cons with respect to other distributed algorithms. The pros are related to the number of communications/synchronizations: other methods, such as count-distribution [22, 68], require several communications/synchronizations, while the Distributed Partition algorithm only requires two communications from the slaves to the
master, one single message from the master to the slaves and one synchronization
after the first scan. The cons are the size of messages exchanged, and the possible
additional computation performed by the slaves when the first phase of the algorithm produces false positives. Consider that, when low absolute minimum supports
are used, it is likely to produce a lot of false positives due to data skew present in
the various dataset partitions [50]. This also has a large impact on the cost of the second phase of the algorithm: most of the slaves will participate in counting
the local supports of these false positives, thus wasting a lot of time.
One naïve work-around, that we will name Distributed One-pass Partition, consists in stopping Distributed Partition after the first pass. So in Distributed One-pass Partition each slave independently computes locally frequent patterns and sends them to
the master which sum-reduces the support for each pattern and writes in the result
set only patterns having the sum of the known supports greater than (or equal to)
minsup. Distributed One-pass Partition has obvious performance advantages vs. Distributed Partition. On the other hand, it yields a result that is approximate. Whereas
it is sure that at least the number of occurrences reported in the results exists for
each pattern, it is likely that some pattern has also occurrences in other partitions
in which it was not frequent.
This is formalized in the following lemma.
Lemma 10 (Bounds on support after first pass). Let P = {1, ..., N} be the set of the N partition indexes, and let fpart(x) = {j ∈ P | σj(x) > minsup · |Dj|} be the set of indexes of the partitions where the pattern x is frequent; its complement P \ fpart(x) is the set of partitions where x is not frequent. The support of a pattern x is greater than or equal to the support computed by the Distributed One-pass Partition algorithm:

σ(x)lower = Σj∈fpart(x) σj(x)

and is less than or equal to σ(x)lower plus the maximum support the same pattern can have in the partitions where it is not frequent:

σ(x)upper = σ(x)lower + Σj∈P\fpart(x) (minsup · |Dj| − 1)
Note that when a pattern does not result frequent in a partition, its actual local
support can be at most equal to the local minimum support threshold minus one.
We can easily transform the two absolute bounds defined above into the corresponding relative ones:

sup(x)upper = σ(x)upper / |D| ,   sup(x)lower = σ(x)lower / |D|
These bounds can be used to calculate the Average Support Range described in
appendix A (ASR(B), Definition 14). Any approximate algorithm based on Distributed One-pass Partition will yield results with at most this average error on all
the supports.
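The following Python sketch (our own illustration, not code from the thesis) makes these bounds operational: given the size of each partition and the supports of the patterns that turned out to be locally frequent there, it returns the One-pass estimate σ(x)lower together with the corresponding σ(x)upper.

def support_bounds(pattern, part_sizes, local_counts, minsup):
    # part_sizes[i] is |D_i|; local_counts[i] maps each locally frequent pattern
    # to its local support sigma_i(x) (patterns below the local threshold are absent).
    lower, upper = 0, 0
    for size, counts in zip(part_sizes, local_counts):
        if pattern in counts:              # partition where x was locally frequent
            lower += counts[pattern]
            upper += counts[pattern]
        else:                              # x unknown here: at most the local threshold - 1
            upper += int(minsup * size) - 1
    return lower, upper

part_sizes = [1000, 1000, 500]
local_counts = [{('a', 'b'): 15}, {('a', 'b'): 12}, {}]   # not frequent in the third partition
support_bounds(('a', 'b'), part_sizes, local_counts, minsup=0.01)
# (27, 31): the exact global support lies somewhere in [27, 31]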
The main issue with Distributed One-pass Partition is that for every pattern the
computed support is a very conservative estimate, since it always chooses the lower
bounds to approximate the results. The first method we propose, APRed , aim at
increasing this lower bound. This is obtained by mean of a reduction of the minimum
support used for local computation in order to increase the probability that globally
frequent patterns turn out to be locally frequent in most of the dataset partitions.
Generally, any algorithm returning a support value between the bounds will
have better chances of being more accurate. Following this idea, we devised another
algorithm based on Distributed One-pass Partition, APInterp , which uses a smart interpolation of support. Moreover, it is resilient to skewed item distributions.
4.2.3 The APRed algorithm
The key idea of APRed , our first approximate FIM algorithm, is to use a slightly reduced minimum support threshold (an adaptively selected one) for local elaborations. The APRed algorithm requires the same number of communications as the Partition one, and consists of two phases too. The first phase allows the master to compute a "good approximation" R′ of R = F1 ∪ F2 , where R′ ⊆ R, and a lower bound σ′(x) for the support σ(x) of any pattern x ∈ R. This knowledge of R′ is then
used by each slave for globally pruning the candidates during the second phase. This
should reduce the production of false positives on the various slaves. Moreover, at
the end of this first phase, the master also reduces the user-provided minsup, and
this new support threshold is adopted by all the slave for the rest of the computation.
The rationale of lowering minsup in local slave computation is to increase the probability that globally frequent patterns turn out to be locally frequent in most of the
dataset partitions. Note that when a pattern is locally frequent in all the partitions,
the master is able to determine exactly its support. At the end of second phase the
master collects the locally frequent patterns (with respect to the reduced minsup)
from the slaves, and simply builds the approximate sets {Fi |i > 2} by summing the
supports associated with corresponding locally frequent patterns. Obviously, even
if the local frequent patterns have been computed by lowering minsup, the master
considers a pattern frequent only if this sum is at least |D| · minsup.
The two points to clarify are:
• how the master arrives at a "good approximation" R′ of R = F1 ∪ F2 (at the end of the first phase);
• how the master decides the support reduction ratio r to be used for the rest of the computation (during the second phase).
A "good approximation" of the frequent patterns composed of at most two items is built using a significantly reduced minsup for local computation during the first phase. In our tests, this initial support threshold was set to minsup′ = minsup/2. In several cases F1 and F2 have many fewer elements than the following sets Fk , thus
using such a low minimum support during the very first part of the computation
could be reasonable for wide ranges of user-specified values of minsup and sparse
datasets. Nevertheless, R′ gives us an accurate knowledge of R = F1 ∪ F2 . However, minsup′ is usually too small, and cannot be used for the following iterations.
Before describing the criteria used for deciding the support to use during the
remaining iterations, we need to introduce a measure of similarity, which is used to
compare two different result sets A and B. The Sim(A, B) measure, described in detail in Appendix A, ranges from 0 to 1, and considers both false positives/negatives and non-matching support values.
The master chooses the new support threshold, minsup′′ ∈ [minsup′, minsup], in such a way that Sim(R′′, R) is high, where R′′ ⊆ R′ is introduced in the following.
Note that, since the correct result R is not available, we have to exploit the self-similarity between the best known approximation of R, i.e. R′, and a more relaxed one, R′′, obtained as if all the slaves had mined their patterns (composed of one or two items) using the support threshold minsup′′ ∈ [minsup′, minsup]. The idea is to arrive at determining a value for minsup′′ that is very close to minsup, thus entailing a small increase in the computational complexity. In practice the master chooses the highest minsup′′ value which ensures a self-similarity (above a specified threshold, 98% in our tests) between R′′ and R′.
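The selection of r′′ can be sketched as follows. The code is only illustrative and makes several assumptions: the sim argument stands in for the Sim() measure of Appendix A (here a crude Jaccard-style stand-in is shown), and the local counts of 1- and 2-patterns are assumed to be available as plain dictionaries.

def result_at_ratio(local_counts, part_sizes, minsup, r):
    # Global result as if every slave had used the reduced local threshold r * minsup.
    totals = {}
    for counts, size in zip(local_counts, part_sizes):
        for pat, sigma in counts.items():
            if sigma > r * minsup * size:        # locally frequent w.r.t. the reduced threshold
                totals[pat] = totals.get(pat, 0) + sigma
    total = sum(part_sizes)
    return {p: c for p, c in totals.items() if c > minsup * total}

def jaccard_sim(a, b):
    # Crude stand-in for Sim(): the real measure also weighs support differences.
    return len(set(a) & set(b)) / max(1, len(set(a) | set(b)))

def choose_reduction(local_counts, part_sizes, minsup, sim=jaccard_sim, gamma=0.98):
    reference = result_at_ratio(local_counts, part_sizes, minsup, 0.5)     # R'
    r = 1.0
    while r >= 0.5:
        if sim(reference, result_at_ratio(local_counts, part_sizes, minsup, r)) > gamma:
            return r * minsup        # minsup'' = r'' * minsup, with r'' maximal
        r = round(r - 0.05, 2)
    return 0.5 * minsup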
The pseudo-code of the algorithm is contained in Algorithms 2 and 3 for the slave and master parts respectively. In the pseudo-code, R′i , F′i1 , F′i2 and σi(x) are related to the partition Di assigned to slave i, while the corresponding symbols without i are related to global results and datasets. The truth function [[expr]], which is equal to 1 if expr is TRUE and 0 otherwise, is used to select only the frequent patterns with respect to the specified support threshold.
Algorithm 2: APRed - Slave i
1. Compute the local R′i = F′i1 ∪ F′i2 w.r.t. minsup′ = (1/2) · minsup;
2. Send the local partial result R′i to the master;
3. Receive the global approximation R′ of R;
4. Receive minsup′′;
5. Continue the computation w.r.t. minsup′′, using R′ for pruning candidates;
6. Send the local results to the master.
Algorithm 3: APRed - Master
1. Receive the local partial results R′i from all the slaves;
2. Compute R′ = {x ∈ ∪i R′i | Σi σi(x) > minsup · |D|};
3. Send R′ to all the slaves;
4. Compute r′′ = max{r ∈ [0.5, 1] | Sim(R′, R′′(r)) > γ}, where γ is a user-provided similarity threshold, R′′(r) = {x ∈ R′ | Σi σi^r(x) > minsup · |D|}, and σi^r(x) = [[σi(x) > r · minsup · |Di|]] · σi(x);
5. Send minsup′′ = r′′ · minsup to all the slaves;
6. Receive the local results R′′i from all the slaves;
7. Return R′ ∪ {x ∈ ∪i R′′i | Σi σi(x) > minsup · |D|}.
It is worth noting that the master discards already computed local results. In particular, the presence of patterns in R′i (see point 2) and R′′i (see point 7) that do not result globally frequent causes a waste of resources. This is a negative side effect and, in the experimental section, we will use this quantity as a measure of the efficiency of the proposed algorithm, in order to assess the impact on performance of lowering the minimum support threshold. We will see, however, that by exploiting
the approximate knowledge R′ of F1 ∪ F2 for candidate pruning we can effectively reduce this drawback.
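A minimal sketch of this pruning step, under the assumption that candidates are plain item tuples and that the approximate F′2 is available as a set of frozensets, is the following:

from itertools import combinations

def prune_with_f2(candidates, global_f2):
    # Keep only the candidates whose 2-item subsets all belong to the approximate F'2:
    # if any 2-subpattern is not (approximately) globally frequent, neither is the candidate.
    kept = []
    for cand in candidates:
        if all(frozenset(pair) in global_f2 for pair in combinations(cand, 2)):
            kept.append(cand)
    return kept

f2 = {frozenset(p) for p in [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd')]}
prune_with_f2([('a', 'b', 'c'), ('a', 'b', 'd')], f2)
# [('a', 'b', 'c')]  --  ('a', 'd') is not in F'2, so ('a', 'b', 'd') is pruned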
4.2.4 The APInterp algorithm
APInterp , the second distributed algorithm we propose in this chapter, tries to overcome some of the problems encountered by APRed and Distributed One-pass Partition
when the data skew between the data partitions is high.
The most evident one is that several false positives could be generated, increasing the resource utilization and the execution time of both Distributed Partition and Distributed One-pass Partition. Like APRed , APInterp addresses this issue by means of global pruning based on partial knowledge of F2 : each locally frequent pattern that contains a globally non-frequent 2-pattern is locally removed from the set of frequent patterns before sending it to the master and performing the next candidate
generation. Moreover this skew might cause a globally frequent pattern x to result
infrequent on a given partition Di only. In other words, since σi (x) < minsup · |Di |,
x will not be returned as a frequent pattern by the ith slave. As a consequence, the
master of Distributed One-pass Partition cannot count on the knowledge of σi (x), and
thus cannot exactly compute the global support of x. Unfortunately, in Distributed One-pass Partition the master might also deduce that x is not globally frequent, because Σj≠i σj(x) < minsup · |D|.
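As a concrete, purely hypothetical illustration: with two partitions of 1000 transactions each and minsup = 1%, a pattern x with σ1(x) = 15 and σ2(x) = 8 is globally frequent (15 + 8 = 23 > 20), but it is not locally frequent in D2 (8 < 10); the One-pass master therefore only sees σ1(x) = 15 < 20 and wrongly discards x.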
As explained in the previous section, APRed uses support reduction in order to
limit this issue. Unfortunately, this method exposes APRed to the combinatorial
explosion of the intermediate results, in case the reduced minsup is too small for
the processed dataset. APInterp , instead, allows the master to infer an approximate
value for this unknown σi (x) by exploiting an interpolation method. The master
bases its interpolation reasoning on the knowledge of:
• the exact support of each single item on all the partitions, and
• the average reduction of the support count of pattern x on all the partitions
where x resulted actually frequent (and thus returned to the master by the
slave), with respect to the support of the least frequent item contained in x:
avg_reduct(x) = ( Σj∈fpart(x) σj(x) / minitem∈x(σj(item)) ) / |fpart(x)|
where fpart(x) corresponds to the set of data partitions Dj where x actually
resulted frequent, i.e. where σj (x) ≥ minsup · |Dj |.
The master can thus deduce the unknown support σi (x) on the basis of avg reduct(x)
as follows:
σi(x)interp = minitem∈x ( σi(item) · avg_reduct(x) )
It is worth remarking that this method works if the support of larger itemsets decreases similarly in all the dataset partitions, so that an average reduction factor (different for each pattern) can be used to interpolate unknown values. Finally note that, as regards the interpolated value above, we expect the following inequality to hold:

σi(x)interp < minsup · |Di|     (4.1)
So, if we obtain that σi (x)interp ≥ minsup · |Di |, this interpolated result cannot be
accepted. If it was true, the exact value σi (x) should have already been returned
by the ith slave. Hence, in those few cases where the inequality (4.1) does not hold,
the interpolated value returned will be:
σi (x)interp = (minsup · |Di |) − 1
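The interpolation can be sketched in Python as follows; this is our own illustration, which assumes that the per-item local counts and the locally frequent patterns are available as plain dictionaries.

def interpolate_support(x, i, part_sizes, item_counts, local_counts, minsup):
    # item_counts[j][item]: exact local support of each single item in D_j.
    # local_counts[j]: supports of the patterns that were locally frequent in D_j.
    fpart = [j for j, counts in enumerate(local_counts) if x in counts]
    # Average reduction w.r.t. the least frequent item of x in each such partition.
    reductions = [local_counts[j][x] / min(item_counts[j][it] for it in x) for j in fpart]
    avg_reduct = sum(reductions) / len(reductions)
    # Interpolated support on partition i, capped just below the local threshold (Eq. 4.1).
    interp = min(item_counts[i][it] for it in x) * avg_reduct
    cap = minsup * part_sizes[i] - 1
    return min(interp, cap)

part_sizes = [1000, 1000]
item_counts = [{'a': 100, 'b': 80}, {'a': 60, 'b': 40}]
local_counts = [{('a', 'b'): 40}, {}]          # ('a','b') locally frequent only in D_0
interpolate_support(('a', 'b'), 1, part_sizes, item_counts, local_counts, minsup=0.03)
# avg_reduct = 40/80 = 0.5, so sigma_1 is estimated as min(60, 40) * 0.5 = 20.0 (below the cap of 29)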
The proposed interpolation schema yields a better approximation of exact results
than Distributed One-pass Partition. The support values computed by the latter algorithm are, in fact, always equal to the lower bounds of the intervals containing the
exact support of any particular pattern. Hence any kind of interpolation producing
an approximate result set, whose supports are between the interval bounds, should
be, generally, more accurate than always picking its lower bound.
Obviously, several other ways of computing a support interpolation could be devised. Some are really simple, such as the average of the bounds, while others are complex, such as counting inference, used in a different context in [43]. We chose this particular
kind of interpolation because it is simple to calculate, since it is based on data that
we already maintain for other purposes, and it is aware of the data partitioning
enough to allow for accurate handling of datasets characterized by heavy data-skew
on item distributions.
We can finally introduce the pseudo-code of APInterp (algorithms 4 and 5). As
in Distributed Partition, we have a master and several slaves, each in charge of a
horizontal partition Di of the original dataset. The slaves send information to the
master about the counts of single items and locally frequent 2-itemsets. Upon reception of all local results (synchronization), the master communicates to the slaves
an approximate global knowledge of F′2 , used by the slaves to prune candidates for
the rest of the mining process. Finally, once received information about all locally
frequent patterns, the master exploits the interpolation method sketched above for
inferring unknown support counts.
Note that when a pattern is locally frequent in all the partitions, the master is
able to determine exactly its support. Otherwise, an approximate inferred support
value is produced, along with an upper bound and a lower bound for that support.
In the pseudo-code, Fki denotes the set of frequent k-patterns in partition i (or globally when i is not present), F′k indicates an approximation of Fk , and Single_Countsi1 holds the supports of all 1-patterns in partition i.
For the sake of simplicity, some details of the algorithm have been altered in the pseudo-code.
Algorithm 4: APInterp - Slave i
1. Compute the local Single_Countsi1 and F2i;
2. Send the local partial results to the master;
3. Receive the global approximation F′2 of F2;
4. Continue the computation, using F′2 for pruning candidates;
5. Send the local results to the master. If the computation is over, send an empty set.
Algorithm 5: APInterp - Master
1. Receive the local partial results Single_Countsi1 and F2i from all the slaves;
2. Compute the exact F1 , on the basis of the local counts of single items;
3. Compute the approximate F′2 = {x ∈ ∪i F2i | Σi counti(x) > minsup · |D|}, where counti(x) is equal to σi(x) if x ∈ F2i, or to σi(x)interp otherwise;
4. Send F′2 to all the slaves;
5. Receive the local results from all the slaves (empty for slaves terminated before the third iteration);
6. Compute and return, for each k, the approximate F′k = {x ∈ ∪i Fki | Σi counti(x) > minsup · |D|}, where counti(x) is equal to σi(x) if x ∈ Fki, or to σi(x)interp otherwise.
In particular, points 4 and 5 of the slave pseudo-code are an over-simplification of the actual code: patterns are sent, asynchronously, as soon as they are available in order to optimize communication. Each slave terminates when, at iteration k, less than k + 1 patterns are frequent; this is equivalent to checking the emptiness of F′i(k+1), but more efficient. On the other hand, the master continuously collects results from still active slaves and processes them as soon as all the expected result sets of the same length arrive.
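To make the merge concrete, the sketch below (ours; the interpolate callback is a stand-in for the interpolation just described) sums exact and interpolated local counts and keeps the patterns above the global threshold, which is essentially what steps 3 and 6 of the master pseudo-code do.

def merge_level(local_counts, part_sizes, minsup, interpolate):
    # local_counts[i]: {pattern: sigma_i(x)} for the patterns locally frequent in D_i.
    # interpolate(x, i): estimate of sigma_i(x) when x was not returned by slave i.
    patterns = set().union(*[set(c) for c in local_counts])
    total_size = sum(part_sizes)
    merged = {}
    for x in patterns:
        count = sum(local_counts[i][x] if x in local_counts[i] else interpolate(x, i)
                    for i in range(len(local_counts)))
        if count > minsup * total_size:
            merged[x] = count
    return merged

# Tiny usage example with a trivially pessimistic stand-in for the interpolation:
locals_ = [{('a',): 30, ('b',): 25}, {('a',): 28}]
merge_level(locals_, [1000, 1000], 0.02, interpolate=lambda x, i: 0)
# {('a',): 58}: ('b',) totals only 25 < 40 and is discarded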
4.2.5 Experimental evaluation
In the following part of the section, we describe the behavior exhibited by our distributed approximate algorithms in our experiments. We have run the APRed and APInterp algorithms on several datasets using different parameters. The goal of these tests is to understand how the similarity of the results varies as the minimum support and the number of partitions change, and to assess scalability.
Similarity and Average Support Range. The method we are proposing yields
approximate results. In particular APInterp computes pattern supports which may
be slightly different from the exact ones, thus the result set may miss some frequent
patterns (false negatives) or include some infrequent patterns (false positives). In order to evaluate the accuracy of the results we use a widely used measure of similarity
between two pattern sets introduced in [50], and based on support difference. At the
same time, we have introduced a novel similarity measure, derived from the previous
one and used along with it in order to assess the quality of the algorithm output.
To the same end, we use the Average support Range (ASR), an intrinsic measure of
the correctness of the approximation introduced in [61]. An extensive description of these measures and a discussion of their use can be found in Appendix A.
Experimental environment
The experiments were performed on a cluster of seven high-end computers, each
equipped with an Intel Xeon 2 GHz, 1 GB of RAM memory and local storage. In
all our tests, we mapped a single process (either master or slave) to each node.
This system offers communications with good latency (a dedicated Fast Ethernet).
However, since APInterp requires just one synchronization, and all communications are pipelined, its communication pattern should be suitable even for a distributed
system characterized by a high latency network.
Experimental data
We performed several tests using datasets from the FIMI’03 contest [1]. We randomly partitioned each dataset and used the resulting partitions as input data for
different slaves.
During the tests for APRed , we used two different partitionings, briefly indicated with the suffixes P1 and P2 in plots and tables. In doing so, we tried to cover a number of different cases with respect to partition size and number of partitions. Table 4.1 shows a list of these datasets along with their cardinality, the number of partitions used in the tests, and the minimum and maximum sizes of the partitions. Each dataset is also identified by a short reference code.
Table 4.1: Datasets used in the APRed experimental evaluation. P1 and P2 in the dataset name refer to different partitionings of the same dataset.

Dataset (reference)        #Trans./1000   # Part.   Part. size /1000
accidents-P1 (A1)          340            10        13..56
accidents-P2 (A2)          340            10        15..55
kosarak-P1 (K1)            990            20        11..79
kosarak-P2 (K2)            990            20        21..78
mushroom-P1 (M1)           8              4         1..3
mushroom-P2 (M2)           8              10        0.5..1
retail-P1 (R1)             88             4         14..31
retail-P2 (R2)             88             4         10..31
T10I4D100K-P1 (T10-1)      100            10        2..17
T10I4D100K-P2 (T10-2)      100            10        8..16
T40I10D100K-P1 (T40-1)     100            10        3..19
T40I10D100K-P2 (T40-2)     100            10        5..13
In the APInterp tests, each dataset was divided into a number of partitions ranging from 1 to 6, both into partitions of similar size and into partitions of significantly different sizes. The first ones, the balanced partitioned datasets, were used in order to assess speedup for
the tests on our parallel test bed. Table 4.2 shows a list of these datasets along
with their cardinality and the minimum and maximum sizes of the partitions (for
the largest number of partitions). Each dataset is also identified by a short code,
starting with U in case the sizes of partitions differ significantly. The number of
partitions is not reported in this table, since it depends on the number of slaves
involved in the specific distributed test.
For each dataset, we computed the reference solution using DCI [44], an efficient
sequential algorithm for frequent itemset mining (FIM).
APRed experimental results
First we present the results obtained using APRed , for which we only used the most
strict Absolute Similarity measure (α = 1, see appendix A) for accuracy testing.
Table 4.3 shows a summary of computation results for all datasets, obtained by
using a self-similarity threshold γ = 0.98 to determine minsup0 = r · minsup, where
r ∈ [0.5, 1].
Table 4.2: Datasets used in the APInterp experimental evaluation. When a dataset is referenced by a keyword prefixed by U (see Reference column), this means that it was partitioned in an unbalanced way, with partitions of significantly different sizes.

Dataset             Reference   #Trans.   Part. size
accidents-bal       A           340183    55778..57789
accidents-unbal     UA          340183    3004..84011
kosarak-bal         K           990002    163593..166107
kosarak-unbal       UK          990002    112479..237866
mushroom-bal        M           8124      1337..1385
mushroom-unbal      UM          8124      328..1802
retail-bal          R           88162     14307..14888
retail-unbal        UR          88162     6365..23745
pumbs-bal           P           49046     8044..8289
pumbs-unbal         UP          49046     1207..12138
pumbs-star-bal      PS          49046     8034..8291
pumbs-star-unbal    UPS         49046     3156..12089
connect-bal         C           67557     11086..11439
We have reported the absolute similarity of the approximate results to the exact ones, the number of globally frequent patterns, and the number of distinct discarded patterns, i.e. the patterns that are locally frequent but are discarded at points 2 and 7 of the master pseudo-code because they are not globally frequent.
Figure 4.1 shows several plots comparing the self-similarity used during the computation, i.e. based on the similarity between R′ and R′′, with the exact similarity between the global approximate results and the exact ones for r ∈ [0.5, 1].
If we pick a particular value of r in the plot, corresponding to a value of self-similarity γ, we can graphically find the similarity of the whole approximate solution to the exact one when r is used for the second part of the computation.
We have found that in sparse datasets the similarity is usually nearly equal to (or greater than) the self-similarity, so the proposed empirical determination of r should yield good results, even when the selection is slightly misled by an excessively good partial result on R′′. This is the case of the Accidents P2 dataset. Table 4.3
shows that APRed for this dataset chooses a support reduction factor of 0.95, and the
similarity of the final result is 95%, which is a remarkably good result. Nevertheless,
in the bottom left plot in Figure 4.1, we can see that by using a slightly smaller
reduction factor (0.75), it was possible to boost the similarity of the final result close
to 100%.
Figure 4.2 shows the number of discarded patterns (points 2 and 7 of the master pseudo-code) as a function of r. In order to highlight the effectiveness of the pruning based on F′1 and F′2 , we report curves relative to different types of pruning. Pruning local patterns using an approximate knowledge of F1 and F2 is enough to obtain a good reduction in the number of discarded patterns in most of the sparse datasets.
Table 4.3: Test results for APRed , obtained using the empirically computed local minimum support (minsup′′ = r′′ · minsup) for patterns with more than 2 items (for self-similarity threshold γ = 0.98).

Dataset   Min. supp.   r′′     Simil.   # Freq   # Discarded
A1        40 %         0.95    0.95     29646    678289
A2        40 %         0.95    0.96     28675    633908
K1        0.6 %        0.85    0.97     1132     1968
K2        0.3 %        0.80    0.99     4997     12379
M1        40 %         0.50    0.44     413      366
M2        40 %         0.50    0.07     399      288
R1        0.2 %        0.55    0.92     2675     7492
R2        0.2 %        0.55    0.91     2682     7786
T10-1     0.2 %        0.60    0.93     13205    31353
T10-2     0.2 %        0.65    0.94     13173    17444
T40-1     2 %          0.80    0.92     2293     19220
T40-2     2 %          0.85    0.96     2293     18186
The APRed algorithm performed worse on dense datasets, such as Accidents,
where too many locally frequent patterns are discarded, and Mushroom, where
similarity of approximate results to exact results was really low. Large data skews
seem to be a big issue for APRed , since in these cases several frequent patterns are
not returned at all (lots of false negatives, and thus small values for both Recall and
Similarity).
APInterp experimental results
The experiments were run for several minimum support values and for different partitionings of each dataset. In particular, except when showing the effects of varying
the minimum support and the number of partitions, we reported results corresponding to three and six partitions and to the two smallest minimum support thresholds
used, usually characterized by a difference of about one order of magnitude in execution time.
Table 4.4 shows a summary of computation results for all datasets, obtained for
three and six partitions using two different minimum support values. The first four
columns contain the code of the dataset and the parameters of the test. The next
two columns contain the number of frequent patterns contained in the approximate
solution and the execution time. The average support range column contains the
average distance between the upper and lower bounds for the support of the various
patterns, expressed as a percentage of the number of transactions in the dataset
(see Definition 14). The following columns show the precision and recall metrics
and the number of false positives/negatives. As expected, there are really few false
negatives and consequently the value of Recall is close to 100%, but the Precision is slightly smaller.
[Plots omitted: similarity and self-similarity (%) vs. r for datasets T10I4D100K P2 (min supp 0.2%), Kosarak P2 (min supp 0.3%), Accidents P2 (min supp 40%) and Mushroom P1 (min supp 40%).]
Figure 4.1: Similarity between the approximate distributed result and the exact one
for APRed . The most strict value (α = 1) was used for support difference weight. This
means that patterns with different supports are considered as not matching. Self-similarity is a measure used for similarity estimation during the distributed elaboration,
when true results are not available.
Unfortunately, since these metrics do not take into account the support, a false positive having true support really close to the threshold has the same weight as one having a very small support. The last columns contain the
similarity measure for the approximate results introduced in Definitions 12 and 13.
The very high value of fpSim proves that the false positives have an exact support close to the support threshold (but smaller than it, so that they are actually not frequent). This behavior, i.e. a lot of false positives with a value of fpSim close to 100%, is particularly evident for datasets K and UK.
Figure 4.3 shows a plot of the fpSim measure obtained for different datasets
partitioned among a variable number of slaves. As expected, the similarity is higher
when the dataset is partitioned into fewer partitions. However, in most cases there is no significant decrease.
We have also compared the similarity of the approximate result obtained using support interpolation to the Distributed One-pass Partition one. The results are shown in Figure 4.4. The proposed heuristic for support interpolation does improve similarity, in particular for small minimum support values. Since no false positives are produced by Distributed One-pass Partition, in this case fpSim would be identical to Sim, and thus this measure is plotted only for the APInterp algorithm.
[Plots omitted: ratio of discarded local patterns to frequent patterns vs. r for datasets T10I4D100K P2 (min supp 0.2%), Kosarak P2 (min supp 0.3%), Accidents P2 (min supp 40%) and Mushroom P1 (min supp 40%), with no pruning, F1 pruning and F1+F2 pruning.]
Figure 4.2: Relative number of distinct locally frequent patterns that are not globally
frequent as a function of r for different pruning strategies for APRed . They are discarded at points 2 and 7 of the master pseudo-code. This is a measure of the waste
of resources due to both data-skewness and minimum support lowering. Accidents,
a dense dataset, causes a lot of trashed locally frequent patterns.
Finally, we have verified the speedup of the APInterp algorithm, using only uniformly sized partitions. Figure 4.5 shows the measured speedup when an increasing
number of slaves is exploited. Note that when more slaves are used, the dataset has
to be partitioned accordingly.
The APInterp algorithm performed worse on dense datasets, such as Connect,
where too many locally frequent patterns are discarded when we add slaves. On the
other hand, in some cases we also obtained superlinear speedups. This could be due to the approximate nature of our algorithm: the support of several patterns could be computed even if some slaves do not participate in the elaboration.
Acknowledgment
The datasets used during the experimental evaluation are some of those used for
the FIMI’03 (Frequent Itemset Mining Implementations) contest [1]. Thanks to
the owners of these data and people who made them available in current format.
In particular, Karolien Geurts [21] for Accidents, Ferenc Bodon for Kosarak,
[Plot omitted: fpSimilarity (%) vs. number of partitions (1 to 6) for datasets A, C, K, M, P, PS, R, UA, UK, UPS and UR at the minimum supports listed in the Figure 4.3 legend.]
Figure 4.3: fpSim of the APInterp results relative to datasets partitioned in different
ways.
Tom Brijs [10] for Retail, and Roberto Bayardo for the conversion of UCI datasets. Other
datasets were generated using the publicly available synthetic data generator code
from the IBM Almaden Quest data mining project [6].
4.3 Conclusions
In this chapter, we have discussed APRed and APInterp , two new distributed algorithms for approximate frequent itemset mining.
The key idea of APRed is that by using a reduced minimum support
(r · minsup, r ∈ [0.5, 1]) for distributed local elaboration on dataset partitions,
without modifying the support threshold for global evaluation of fetched results,
we can be confident that the final approximate results obtained will be quite accurate. Moreover, even if we lower the support threshold, APRed remains efficient, and the amount of data sent to the master by the local slaves is relatively
small. This is due to a strong pruning activity: locally frequent candidate patterns
are in fact pruned by using an approximate knowledge of F2 (often discarding more
than 90% of globally infrequent candidate patterns).
In our tests, APRed performs particularly well on sparse datasets.
[Plot omitted: similarity (%) vs. minimum support (%) on dataset Kosarak with 6 unbalanced partitions, comparing Distributed One-pass Partition Sim(), APInterp Sim() and APInterp fpSim().]
Figure 4.4: Comparison of Distributed One-pass Partition vs APInterp .
In several cases, an 80% reduction of minsup is enough to achieve a similarity close to 100%. On the other hand, on most dense datasets the number of missing and spurious patterns is definitely too high.
APInterp , instead, exploits a novel interpolation method to infer unknown counts
of some patterns, which are locally frequent only in some dataset partitions. Since no support reduction is involved, APInterp is able to mine dense datasets for values of minsup that are too small to be used with APRed . For the same reason, the issues related to a bad choice of the support reduction factor (see the Accidents dataset case in the APRed results) are also avoided.
For dataset partitioning characterized by high data skew, the APInterp approach
is able to strongly improve the accuracy of the approximate results. Our tests prove
that this method is particularly suitable for several (mainly sparse) datasets: it yields a good accuracy and scales nicely.
the various datasets were characterized by a similarity above 99%. Even if some
false positives are found, the high similarity value computed on the whole result
set proves that the exact supports of these false positives are actually close to the
support threshold, and thus of some interest to the analyst.
The accuracy of the results is better than in the Distributed One-pass Partition case.
[Plot omitted: speedup vs. number of partitions (1 to 6) for datasets A (minsupp 20%) and K (minsupp 0.1%).]
Figure 4.5: Speedup for two of the experimental datasets, Kosarak(K) and Accidents
(A), with balanced partitioning.
The main reason for this is that the Distributed One-pass Partition algorithm yields, for any pattern, a support value that is the lower bound of the interval in which
the exact support is included. Hence, the count estimated by our algorithm, which
falls between the lower and upper bounds, is generally closer to the exact count than
the lower bound. Furthermore, the proposed interpolation schema does not increase
significantly the overall space/time complexity and is resilient to heavy skew in the
distribution of items.
Finally, both in APInterp and APRed , synchronization occurs just once, as in a naïve distributed Partition, and, differently from Partition, slaves do not have to be
polled for specific pattern counts, thus limiting potential privacy breaches related
to low support patterns.
Table 4.4: Accuracy indicators for APInterp results obtained using the maximum number of partitions and the lowest support.

Dataset  #slaves  Minsup%  Minsup(count)  #freq    Time(s)  Avg.Sup.Range(%)  Precision%  Recall%  False pos%  False neg%  Sim%   fpSim%
A        3        20.00    68036          899740   51.92    0.289             98.87       99.96    1.13        0.04        98.83  99.81
A        3        30.00    102054         151065   7.45     0.378             98.95       99.97    1.05        0.03        98.92  99.76
A        6        20.00    68036          912519   27.72    0.574             97.51       99.99    2.49        0.01        97.51  99.58
A        6        30.00    102054         152873   4.57     0.768             97.80       100.00   2.20        0.00        97.80  99.44
C        3        70.00    47289          4239440  56.09    2.401             97.37       99.93    2.63        0.07        97.30  98.70
C        3        80.00    54045          546795   6.79     2.894             97.67       99.98    2.33        0.02        97.65  98.73
C        6        70.00    47289          4335664  93.50    4.093             95.24       99.97    4.76        0.03        95.20  97.17
C        6        80.00    54045          560499   10.73    5.191             95.24       99.93    4.76        0.07        95.17  96.74
K        3        0.10     990            852636   68.77    0.013             88.81       99.11    11.19       0.89        88.10  99.20
K        3        0.20     1980           42963    8.14     0.033             89.53       98.05    10.47       1.95        87.97  98.24
K        6        0.10     990            947486   31.94    0.024             80.56       99.89    19.44       0.11        80.49  99.90
K        6        0.20     1980           59601    5.15     0.077             65.93       99.80    34.07       0.20        65.84  99.81
M        3        5.00     406            3773538  41.14    0.542             99.50       99.97    0.50        0.03        99.45  99.94
M        3        8.00     649            864245   8.78     0.862             76.13       99.98    23.87       0.02        76.12  98.62
M        6        5.00     406            3888898  67.61    0.899             96.57       100.00   3.43        0.00        96.52  99.81
M        6        8.00     649            926827   15.49    1.182             71.00       100.00   29.00       0.00        71.00  97.98
P        3        70.00    34332          2858126  39.07    3.766             94.42       99.99    5.58        0.01        94.40  97.37
P        3        80.00    39236          145435   2.14     3.170             97.57       99.81    2.43        0.19        97.39  98.52
P        6        70.00    34332          2921763  58.23    6.068             92.36       99.99    7.64        0.01        92.34  95.50
P        6        80.00    39236          152855   2.62     7.020             92.99       99.97    7.01        0.03        92.96  95.28
PS       3        25.00    12261          2177124  29.31    1.672             94.80       99.93    5.20        0.07        94.73  99.05
PS       3        30.00    14713          441472   5.80     1.238             97.82       99.78    2.18        0.22        97.61  99.34
PS       6        25.00    12261          2227435  45.9     2.526             92.72       99.99    7.28        0.01        92.69  98.45
PS       6        30.00    14713          444542   9.06     2.261             96.98       99.61    3.02        0.39        96.59  98.85
R        3        0.05     44             17766    0.86     0.005             91.07       99.89    8.93        0.11        90.97  99.90
R        3        0.10     88             6105     0.53     0.009             93.39       99.84    6.61        0.16        93.25  99.85
R        3        0.20     176            1902     0.34     0.018             94.59       99.82    5.41        0.18        94.42  99.82
R        6        0.05     44             18372    0.69     0.006             88.63       99.92    11.37       0.08        88.57  99.88
R        6        0.10     88             6190     0.41     0.010             92.47       99.88    7.53        0.12        92.37  99.89
R        6        0.20     176            1967     0.30     0.024             92.63       99.96    7.37        0.04        92.60  99.95
UA       3        20.00    68036          901687   66.72    0.309             98.68       99.98    1.32        0.02        98.66  99.84
UA       3        30.00    102054         151268   10.07    0.440             98.81       99.96    1.19        0.04        98.77  99.77
UA       6        20.00    68036          916744   35.19    0.639             97.06       99.99    2.94        0.01        97.05  99.41
UA       6        30.00    102054         152942   5.55     0.782             97.75       99.98    2.25        0.02        97.73  99.31
UK       3        0.10     990            818017   121.46   0.011             92.62       99.17    7.38        0.83        91.91  99.23
UK       3        0.20     1980           52212    11.65    0.062             74.21       98.54    25.79       1.46        73.40  98.89
UK       6        0.10     990            922792   45.30    0.020             82.76       99.95    17.24       0.05        82.72  99.94
UK       6        0.20     1980           49420    5.54     0.050             79.27       99.69    20.73       0.31        79.08  99.72
UP       3        70.00    34332          2800681  38.14    3.217             96.30       99.94    3.70        0.06        96.24  98.02
UP       3        80.00    39236          149253   2.25     5.101             95.17       99.91    4.83        0.09        95.07  97.04
UP       6        70.00    34332          2879809  56.82    5.216             93.71       99.99    6.29        0.01        93.69  96.66
UP       6        80.00    39236          152124   2.69     6.777             93.46       100.00   6.54        0.00        93.44  96.05
UPS      3        25.00    12261          2207340  29.79    2.102             93.53       99.96    6.47        0.04        93.48  99.11
UPS      3        30.00    14713          455973   6.26     1.980             94.90       99.98    5.10        0.02        94.88  99.19
UPS      6        25.00    12261          2162459  44.49    1.976             95.51       99.99    4.49        0.01        95.49  98.92
UPS      6        30.00    14713          453334   9.02     2.359             95.46       99.99    4.54        0.01        95.43  98.70
UR       3        0.05     44             17654    0.96     0.005             91.56       99.91    8.44        0.09        91.49  99.92
UR       3        0.10     88             6185     0.57     0.010             92.48       99.83    7.52        0.17        92.34  99.84
UR       3        0.20     176            1896     0.36     0.019             94.75       99.78    5.24        0.22        94.55  99.78
UR       6        0.05     44             17901    0.80     0.005             90.56       99.95    9.44        0.05        90.52  99.94
UR       6        0.10     88             6390     0.43     0.012             90.33       99.91    9.67        0.09        90.25  99.91
UR       6        0.20     176            1968     0.29     0.025             92.56       99.93    7.44        0.07        92.50  99.92
5 Streaming data
Many critical applications require a nearly immediate result based on a continuous
and infinite stream of data. In our case, we are interested in mining all frequent
patterns and their supports from an infinite stream of transactions. We begin this
chapter by describing the peculiarities of streaming data, then we will introduce
the problem of finding the most frequent items and itemsets in a stream, along with
some state of the art algorithms for solving them. Finally, we will describe our
contribution: a streaming algorithm for approximate mining of frequent patterns.
5.1 Streaming data
Before introducing the notation used in this chapter, we briefly summarize the notation previously used for frequent itemsets and frequent items. A dataset D is a collection of subsets of items I = {it1, . . . , itm}. Each element of D is called a transaction. A pattern x is frequent in dataset D with respect to a minimum support minsup, if its support is greater than σmin = minsup · |D|, i.e., the pattern occurs in at least σmin transactions, where |D| is the number of transactions in D. A k-pattern is a pattern composed of k items, Fk is the set of all frequent k-patterns, and F = ∪i Fi is the set of all frequent patterns. If D contains just transactions of one item, then all of the frequent patterns are 1-patterns. These patterns are named frequent items.
Since the stream is infinite, new data arrive continuously and results change
continuously as well. Hence, we need a notation for indicating that a particular
dataset or result is referred to a particular time interval. To this end, we write the
interval as a subscript after the entity. Thus D[t0 ,t1 ) indicates the part of the stream
received since t0 and before t1 . For the sake of simplicity we will write just D instead
of D[1,t] , when referring to all data received until current time t, if this notation is
not ambiguous. As usual, a square bracket indicates that the bound is part of the
interval, whereas a parenthesis indicates that it is excluded.
A pattern x is frequent at time t in the stream D[1,t] , with respect to a minimum support minsup, if its support is greater than σmin[1,t] = minsup · |D[1,t] |, i.e.
the pattern occurs in at least σmin[1,t] transactions, where |D[1,t] | is the number of
transactions in the stream D until time t. A k-pattern is a pattern composed of k
items, Fk[1,t] is the set of all frequent k-patterns, and F[1,t] is the set of all frequent
patterns.
5.1.1 Issues
The infinite nature of these data sources is a serious obstacle to the use of most traditional methods, since the available computing resources are limited. One of the first effects is the need to process data as they arrive. The amount of previously received data is usually overwhelming, so they can either be dropped after processing or archived separately in secondary storage. In the first case access to past data is obviously impossible, whereas in the second case the cost of data retrieval is likely to be acceptable only for some "ad hoc" queries, especially when several scans of past data are needed to obtain just one result.
Other important differences with respect to having all data available for mining at the same time concern the obtained results. As previously explained, both the data and the results evolve continuously. Hence a result refers to a part of the stream and, in our case, to the whole part of the stream preceding a given time t. Obviously, an algorithm suitable for streaming data should be able to compute the 'next step' solution on-line, starting from the previously known D[1,t−1) and the current data D[t−1,t), if necessary with some additional information stored along with the current solution. In our case, this information is the count of a significant part of the frequent single items, and a transaction hash table used for improving the deterministic bounds on the supports returned by the algorithm, as we will explain later in this chapter.
5.2 Frequent items
Even the apparently simple discovery of frequent items in a stream is challenging, since its exact solution requires storing a counter for each distinct item received. Some items may initially appear in a sporadic way and then become frequent, thus the only way to compute their support exactly is to maintain a counter since their first appearance. This could be acceptable when the number of distinct items is reasonably bounded. If the stream contains a large and potentially unbounded number of spurious items, as in the case of data whose occurrence probabilities follow Zipf's law, like internet traffic data, this approach may lead to a huge waste of memory. Furthermore, the number of distinct items is potentially proportional to the length of the stream. The Top Frequent items problem is closely related to the frequent items one, except that the user does not directly decide the support threshold: the result set contains only a given number of items having the highest supports. In this case too the resource usage is unbounded. This issue has been addressed by several approximate algorithms, which sacrifice the exactness of the result in order to limit the space complexity. In this section, we will formally introduce the problem, and
then we will describe some representative approximate algorithms for finding the set
of most frequent items.
5.2.1 Problem
Let D[1,n] = s1 , s2 , . . . , sn be a data stream, where each position in the stream si
contains an element of the items I = it1 , . . . , itm . Let item iti occur σ[1,n] (iti ) times
in D[1,n] . The k items having the highest frequencies are named the top-k items
whereas items whose frequencies are greater than σmin = minsup · |D| are named
frequent items.
As explained before and in [12], the exact solution of this problem is highly memory intensive. Two relaxed versions of this problem have been introduced in [12]: FindCandidateTop(S, k, l) and FindApproxTop(S, k, ε). The first one is exact and consists in finding a list of l items containing the k most frequent, whereas the second one is approximate. Its goal is to find a list of items having a frequency greater than (1 − ε) · σ[1,n](itk), where itk is the k-th most frequent item. FindCandidateTop can be very hard to solve for some input distributions, in particular when the frequencies of the k-th and the (l + 1)-th items are similar. In such cases, the approximate problem is more practical to solve. Several variations of the top-k frequent items problem have been proposed. The Hot Items problem, described in [40] for large datasets and, several years later, adapted to data streams ([16, 30]), is essentially the top-k frequent items problem formalized in a slightly different way.
The techniques used for solving this family of problems can be classified into two
large categories: count-based techniques and sketch-based techniques. The first ones
monitor a limited set of potentially "interesting" items, using a counter for each one of them. In this case, an error arises when an item is erroneously kept out of the set or inserted too late. The second family provides a frequency estimation for every
item by using a hash-indexed vector of counters. In this case, the risk of completely
missing the occurrences of an item is avoided, at the cost of looser guarantees on
the computed frequencies.
5.2.2 Count-based algorithms
Count-based algorithms maintain a set of counters, each one associated with a specific item. When the number of distinct items is expected to be high, there might not be enough memory for allocating all the counters. In this case, it is necessary to limit our attention to a set of items compatible with the available memory. Only the items in the monitored set have an associated counter, which is incremented upon their arrival. Other items just have an opportunity to replace one of the monitored items. In fact, in most methods the set of monitored items varies during the computation. Each algorithm of this family is characterized by the data structure it uses for the efficient maintenance of counters and by its policy for replacing "old" counters.
The Frequent algorithm
This method was originally proposed in [40] for large datasets and is inspired by an algorithm for finding a majority element. More recently, two unrelated works ([16, 30]) have described new versions adapted to streams.
A well-known algorithm for discovering the most frequent item in a set containing repetitions of two distinct items consists in removing pairs of distinct items from the set while this is possible. The elements left are all identical and their identity is the solution. In case there are more than two distinct items this method will still work, provided that a majority exists, i.e., the most frequent element has more than n/2 occurrences, where n is the stream length.
Algorithm 6 shows the most efficient implementation of Majority. It requires just
two variables: one contains the value currently supposed to be the majority and
the other is a counter, indicating a lower bound for the advantage of the current
candidate against any other opponent. At the end of the scan of the data, the only
possible candidate is known. In case we are not dealing with streams, a second scan
over the data will give the definitive answer.
Algorithm 6: Majority
input : data[1]...data[n]
output: majority element, if any
  candidate ← data[1];
  C ← 1;
  for i ← 2 to n do
      if C = 0 then candidate ← data[i];
      if candidate = data[i] then C ← C + 1;
      else C ← C − 1;
  end
  C ← 0;
  for i ← 1 to n do
      if candidate = data[i] then C ← C + 1;
  end
  if C ≤ n/2 then return NULL;
  else return candidate
In order to efficiently discard pairs, items are examined only when they arrive.
The candidate variable keeps track of the item currently prevailing, and a counter C indicates the minimum number of different items required to reach a tie, in other words, the number of items having the prevailing identity that are waiting to be matched with a different item. If a majority exists, i.e. if an item has support greater than n/2, it will be found. This is guaranteed by the fact that an item is discarded
only when paired with a different item, and, for the majority element, this cannot
happen for all of its occurrences.
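As an illustration, the following is a minimal Python transcription of Algorithm 6 (the function and variable names are ours and the snippet is only a sketch): the first loop is the streaming pass, while the verification pass requires a second scan and is therefore only feasible when the data can be stored.

def majority(data):
    """Return the element occurring in more than half of data, or None."""
    candidate, count = None, 0
    for item in data:                      # single forward scan (streaming pass)
        if count == 0:
            candidate = item
        if item == candidate:
            count += 1
        else:
            count -= 1
    # verification pass: only possible if the data are stored
    if candidate is not None and \
       sum(1 for item in data if item == candidate) > len(data) / 2:
        return candidate
    return None

print(majority(list("aabacaa")))           # -> 'a' (5 occurrences out of 7)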
In case the most frequent item has a support smaller than n/2, the behavior of the Majority algorithm is unpredictable. Furthermore, we may be interested in finding more than one of the top frequent items. The Frequent algorithm (Algorithm 7) is thus a generalization of Majority, and is able to deal with these two cases. Its goal is to find a set of m items containing every item having a relative support strictly greater than 1/m. The key idea is to keep a limited number m of counters and, when a
new item arrives, decrement every counter, and replace one of the items having the
counter value equal to zero, if there is any. In this way an item is always discarded
Algorithm 7: Frequent
input : data[1]...data[n]
output: superset of the items having relative support greater than 1/m
  C ← {};
  for i ← 1 to n do
      if ∃f (data[i], f) ∈ C then
          replace (data[i], f) with (data[i], f + 1);
      else if ∃item (item, 0) ∈ C then
          replace (item, 0) with (data[i], 1) in C;
      else if |C| < m then
          insert (data[i], 1) in C;
      else
          foreach (item, f) ∈ C do
              replace (item, f) with (item, f − 1) in C;
          end
      end
  end
  return {item : ∃f (item, f) ∈ C}
together with m − 1 occurrences of other symbols, or m when the incoming symbol
is discarded too because no counter has reached zero, i.e. a total of m or m + 1
symbols are discarded. Hence, if a frequent symbol x is discarded d times, either before or after its insertion in the counter set, then a total of at most d · (m + 1) ≤ n stream positions will be discarded. Since x is frequent, σ(x) > n/m > n/(m+1) ≥ d. Thus, an item that is frequent in the first n positions of the stream will be in the set of counters after the processing of the n-th position.
In order to manage counters efficiently, a specifically designed data structure is
required. In particular, the insertion, update and removal of a counter, as well as the decrement of the whole set of counters, need to be optimized. Both [16] and [30] propose a data structure based on differential support encoding and a mix of hash tables and doubly linked lists, which grants O(1) worst-case amortized time complexity and an O(m) worst-case space bound.
This algorithm, in its original formulation, finds just a superset of the frequent items, with no indication of their supports and no guarantee on the absence of false positives. In the case of an ordinary dataset, both issues can be avoided with a second scan over the dataset but, on streaming data, this is not possible. However, if we are allowed to use some additional space, it is also possible to find an estimate of the actual support of each item, together with an upper bound. In order to reach this goal, we need to maintain an additional counter which is never decreased, corresponding to a lower bound on the support of each item, and a constant value indicating the maximum number of occurrences that may have preceded the insertion in the counter set. Since the Frequent algorithm is correct, this amount is σmin[1,t] − 1, the maximum integer smaller than the support threshold for the corresponding stream portion. Furthermore, it is possible to exclude from the result set every item having a support under a specified value by increasing the number of counters and applying a post-filter, as described in [29] for itemsets.
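The following Python sketch (our own simplified rendering, not the optimized structure of [16, 30]) puts together Algorithm 7 and the extension just described: counters is the decrement-based structure used for replacement decisions, while lower is never decremented and provides a lower bound on the support of each monitored item since its last insertion.

def frequent(stream, m):
    counters = {}                  # item -> replacement counter (Algorithm 7)
    lower = {}                     # item -> occurrences seen while monitored
    for item in stream:
        if item in counters:
            counters[item] += 1
            lower[item] += 1
        else:
            # reuse a counter that reached zero, if any
            zero = next((x for x, c in counters.items() if c == 0), None)
            if zero is not None:
                del counters[zero], lower[zero]
                counters[item], lower[item] = 1, 1
            elif len(counters) < m:
                counters[item], lower[item] = 1, 1
            else:
                for x in counters:         # discard the incoming item together with
                    counters[x] -= 1       # one occurrence of each monitored item
    return counters, lower

# With m = 2, every item whose relative support exceeds 1/2 is retained.
cnt, low = frequent(list("abacaadaa"), 2)
print(cnt, low)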
The Lossy count algorithm
The Lossy Count algorithm (Algorithm 8) was introduced in [33]. Its main advantages over the original formulation of Frequent are the presence of a constraint on false positives and the computation of an approximate support, similarly to the modified version of Frequent. Furthermore, it is easily extensible to frequent itemsets, as we will see later in this chapter. The kind of solution this algorithm finds is called an ε-deficient synopsis and consists of a result set containing every frequent item, but no item having relative support less than minsup − ε, along with a support approximation that is smaller than the exact relative support by at most ε.
The algorithm manages a set C of items, each associated with a counter and a bound on its error. When a new item x arrives and x is known, its counter is incremented. Otherwise a new entry (item, 1, bucket − 1) is inserted in C, where bucket is the number of blocks of w = 1/ε elements seen so far, and bucket − 1 is the maximum number of previously missed occurrences of item x. The algorithm is guaranteed to maintain the support correctly in the ε-deficient synopsis. Hence, at the beginning of a new block it is possible to delete every counter having a best-case estimated support less than the error it would have if reinserted from scratch, which is equal to bucket − 1. Since estimated frequencies are less than true frequencies by at most ε, in order to get every frequent item but no item having relative support less than minsup − ε it is enough to return only the items whose upper bound for the support, f + ∆, is at least (minsup − ε) · n.
The Sticky Sampling algorithm
Both [16] and [33] also propose some non-deterministic methods. The idea is to keep the most frequent counters and delete the others in order to free space for new, potentially frequent items. The way this is done, however, is different in the
Algorithm 8: Lossy Count
input : data[1]...data[n], minsup, ε
output: set containing every item having support greater than minsup · n and no item whose support is less than (minsup − ε) · n
  bcurrent ← 1;
  C ← {};
  for i ← 1 to n do
      if (∃f, ∆) (data[i], f, ∆) ∈ C then
          replace (data[i], f, ∆) with (data[i], f + 1, ∆);
      else
          insert (data[i], 1, bcurrent − 1) in C;
      end
      if i mod ⌈1/ε⌉ = 0 then
          bcurrent ← bcurrent + 1;
          foreach (item, f, ∆) ∈ C do
              if f + ∆ < bcurrent − 1 then
                  remove (item, f, ∆) from C;
              end
          end
      end
  end
  return {item : (∃f, ∆) (item, f, ∆) ∈ C ∧ f + ∆ > (minsup − ε) · n}
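A direct Python transcription of Algorithm 8 is sketched below (we assume a bucket width w = ⌈1/ε⌉ and keep the same pruning and output conditions as in the pseudocode above).

import math

def lossy_count(stream, minsup, eps):
    w = math.ceil(1 / eps)                 # bucket width
    bucket, n, C = 1, 0, {}                # C: item -> (f, delta)
    for item in stream:
        n += 1
        if item in C:
            f, delta = C[item]
            C[item] = (f + 1, delta)
        else:
            C[item] = (1, bucket - 1)
        if n % w == 0:                     # bucket boundary: prune the counters
            bucket += 1
            for x, (f, delta) in list(C.items()):
                if f + delta < bucket - 1:
                    del C[x]
    # items whose upper bound exceeds the relaxed threshold
    return {x: f for x, (f, delta) in C.items() if f + delta > (minsup - eps) * n}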
Algorithm 9: Sticky Sampling
input : data[1]...data[n], minsup, ε, δ
output: set containing every item having support greater than minsup · n and no item whose support is less than (minsup − ε) · n, with probability of failure δ
  C ← {};
  t ← (1/ε) · log(1/(minsup · δ));
  block_len ← 2 · t;
  rate ← 1;
  for i ← 1 to n do
      if i mod block_len = 0 then
          rate ← 2 · rate;
          block_len ← t · rate;
          // correct counters
          foreach (item, f) ∈ C do
              while binomial(1, 1/2) = 0 do replace (item, f) with (item, f − 1);
          end
      end
      if ∃f (data[i], f) ∈ C then
          replace (data[i], f) with (data[i], f + 1);
      else if binomial(1, 1/rate) = 1 then
          insert (data[i], 1) in C;
      end
  end
  return {item : ∃f (item, f) ∈ C ∧ f > (minsup − ε) · n}
two cases. Probabilistic-Inplace [16] discards one half of the counters every r received items and selects the first items found immediately after the discard occurs. Sticky Sampling [33] (Algorithm 9) uses, instead, a uniform sampling strategy over the whole stream. In order to keep the number of counters probabilistically bounded, the sampling rate is decreased for increasing stream lengths, and the previously known frequencies are corrected to reflect the new rate using a stochastic method.
5.2.3 Sketch-based algorithms
Like count-based algorithms, sketch-based ones also maintain a set of counters but,
instead of associating the counters with particular items, they are associated with
different overlapping groups of items. The analysis of the values of the counters for
the various groups containing an item allows us to give an estimate of its support.
In this approach there is no notion of monitored item, and the support estimation
is possible for any item. Algorithms included in this family, as in the case of the
Count-based family, share the same basic skeleton. The main differences are in
the management of the counters, the kind of other queries that can be answered
by using the same count-sketch and the exact function used for support estimation,
which directly influence the space requirements based on the user selected acceptable
error probability. In [15] G.Cormode and S.Muthukrishnan present their particularly
flexible Count-Min Sketch data structure as well as a good comparison to other state
of the art sketch techniques. We will adopt their unification framework in order to
describe a generic sketch based algorithm.
A sketch is a two dimensional array of dimension w by d. Let m be the number
of distinct items, h1 . . . hd be hash functions mapping {1 . . . m} into {1 . . . w} and
let g1 . . . gd be other hash functions defined on items. The (j, k) entry of the sketch
is defined to be
$$\sum_{i\,:\,h_k(i)=j} \sigma(i) \cdot g_k(i)$$
In other words, when an item i arrives, for each k ∈ {1 . . . d} the entry (hk (i), k)
is increased by the amount gk (i), which is algorithm dependent. Thus, the update
time complexity is O(d) and the space complexity is O(wd), provided that the hash
functions can be stored efficiently. The way the data structure is used in order to
answer a particular query, the required randomness and independence of the hash
functions, as well as the minimum size of the sketch array needed to guarantee the
fulfillment of probability of error constraints, are algorithm dependent.
A particularly simple count sketch is Count-Min [15]. In its case the values of
functions gk (item) are always 1, i.e. each counter is incremented by one each time
an item is transformed into its identifier by a hash function. The approximate value
is computed as the smallest of the counters associated with an item by any hash
function. Since several items can be hashed to the same location, the approximate value is always greater than or equal to the exact one. The two fragments of pseudo-code show
the simple updateSketch procedure and approxSupport function used by Count-Min
sketches.
Procedure updateSketch(sketch, item) - Count-Min sketch
  foreach k ∈ {1 . . . d} do
      sketch[hk(item), k] ← sketch[hk(item), k] + 1;
  end

Function approxSupport(sketch, item) - Count-Min sketch
  return min_{k ∈ {1...d}} sketch[hk(item), k];
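A compact Python version of a Count-Min sketch, following the two fragments above, is sketched here; the hash family is emulated with Python's built-in hash() applied to salted tuples, which is only a stand-in for the pairwise independent functions required by the analysis in [15].

import random

class CountMinSketch:
    def __init__(self, width, depth, seed=0):
        self.w, self.d = width, depth
        rnd = random.Random(seed)
        self.salts = [rnd.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, k):
        return hash((self.salts[k], item)) % self.w

    def update(self, item, count=1):
        for k in range(self.d):            # g_k(i) = 1 for Count-Min
            self.table[k][self._bucket(item, k)] += count

    def approx_support(self, item):
        # the minimum over the d rows never underestimates the true count
        return min(self.table[k][self._bucket(item, k)] for k in range(self.d))

cms = CountMinSketch(width=100, depth=4)
for x in "abracadabra":
    cms.update(x)
print(cms.approx_support("a"))             # >= 5, the exact count of 'a'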
Other count sketch based methods are described in [12, 17, 28, 14].
5.3 Frequent itemsets
In this section, we introduce a new algorithm for approximate mining of frequent
patterns from streams of transactions using a limited amount of memory. In most
cases, finding an exact solution is not compatible with the limited available resources and real-time constraints, but an approximation of the exact result is enough for most
purposes. The proposed algorithm consists in the computation of frequent itemsets
in recent data and an effective method for inferring the global support of previously
infrequent itemsets. Both upper and lower bounds on the support of each pattern
found are returned along with the interpolated support. Before introducing our
algorithm, we will shortly describe two other algorithms for approximate frequent
itemset mining. Then we will give an overview of APStream , our algorithm, followed
by a more detailed description and an extensive experimental evaluation showing
that APStream yields a good approximation of the exact global result considering
both the set of patterns found and their support.
5.3.1 Related work
The frequent itemset mining problem on streams of transactions (input itemsets)
poses additional memory and computational issues due to the exponential growth
of solution size with respect to the corresponding problem on streams of items. Here
we describe two representative approximate algorithms.
The Lossy Count algorithm for frequent itemsets
Manku and Motwani proposed in [33] an extension of their Lossy Count approximate
algorithm to the case of frequent itemsets. A straightforward conversion of Lossy Count, using the same data structure in order to store the support of patterns as the transactions arrive, is possible, but it would be highly inefficient. This is due to the exponential number of patterns supported by each transaction. Actually, it would be the same as computing the full set of itemsets with no support constraint and periodically removing the infrequent patterns. In order to avoid this issue, the authors process the transactions in blocks, so that the Apriori constraint may be applied.
The algorithm is very similar to the one previously described for items, so we will focus on the differences. The most notable one is that the transactions are processed in batches containing several buckets of size 1/ε. As many transactions as the available memory can fit are buffered and then mined, using the number of buckets β as minimum support. This is roughly equivalent to searching for patterns appearing at least once in each bucket, but more efficient. Every pattern x with support f in the transactions currently buffered is inserted in the set of counters as (x, f, bucket − β), where bucket indicates the last bucket contained in the buffer. At the same time the support of every pattern already in the counter set is checked in the current buckets, updating the counters if needed and removing the patterns that no longer satisfy the f + ∆ > bucket inequality. Clearly, in order to avoid the insertion of spurious patterns in the counter set, β should be a large number. Hence, a larger available memory increases the accuracy and reduces the running time.
The Frequent algorithm for frequent itemsets
In [29] R. Jin and G. Agrawal propose SARM, a new algorithm for frequent itemset mining based on Frequent [30]. Also in this case, the immediate extension of the base algorithm has serious shortcomings. This is mainly due to the potentially high number of frequent patterns. While in the frequent items case just 1/minsup counters are needed, for frequent itemsets one of the arguments used in the correctness proof is no longer true. In fact, in a stream of n transactions there can be more than n/minsup k-patterns having support greater than minsup. More precisely, there can be l · n/minsup frequent items, $\binom{l}{2}$ · n/minsup frequent pairs, and in general $\binom{l}{k}$ · n/minsup frequent k-patterns, where l is the length of the transactions. Since the maximum length of frequent patterns is unknown before the computation, the user would need to specify the maximal pattern length, maxlen, to use in order to correctly size the counter set. Thus the number of counters needed for the computation of frequent itemsets would be
$$\frac{1}{minsup} \sum_{k=1}^{maxlen} \binom{l}{k}$$
Furthermore, unless the transactions are processed in batches as in Lossy Count, all
the subpatterns of each incoming transaction need to be examined.
In order to avoid these side effects, the SARM algorithm maintains separate sets
Lk of potentially frequent itemsets, one for each different pattern length k. These
sets are updated using a hybrid approach: SARM updates L1 and L2 using the same
method proposed in Frequent, and at the same time buffers transactions for a level-wise batched processing. When a transaction t arrives, it is inserted in a buffer, and both L1 and L2 are updated, either by incrementing the count for already known patterns or by inserting the new ones. If the size of L2 exceeds the limit f · 1/(minsup · ε), where ε ∈ [0, 1] is a factor used for increasing the accuracy and f is the average number of 2-patterns per transaction, then the size of L2 is reduced by executing the CrossOver operation, which consists in decreasing every counter and removing, as in Frequent, the patterns having a count equal to zero. Every time this operation is performed, the transaction buffer is processed. For increasing values of k > 2, the k-patterns appearing in the buffer and having all subpatterns included in Lk−1 are used for updating Lk. Then the buffer is emptied and the CrossOver operation is applied to each Lk.
The ε ∈ [0, 1] factor can be used for enforcing a bound on the result accuracy. If ε < 1 then no itemset having relative support less than (1 − ε) · minsup will be in the result set. Thus F^minsup ⊆ L ⊆ F^((1−ε)·minsup), where L is the result set and F^s is the set of itemsets whose support exceeds s. When ε = 1 the SARM algorithm is not able to give
any guarantee on the accuracy, as the Frequent algorithm. Furthermore, both Lossy
Count for itemsets and SARM ignore previous potential occurrences of a pattern
when it is inserted into the set of frequent patterns. In the case of Lossy Count the
maximum number of neglected occurrences is returned along with the support, but
no other information available during the stream processing is exploited.
5.3.2 The APStream algorithm
In order to overcome these limitations APStream (Approximate Partition for Stream),
the algorithm we propose, uses the available knowledge on the support of other patterns to estimate a support for previously disregarded ones. The APStream algorithm
was inspired by Partition [55], a sequential algorithm that divides the dataset into
several partitions processed independently and then merges local solutions. The
adjectives global and local refer to temporal locality: they are used in conjunction with properties of, respectively, the whole stream and just a relatively small and contiguous part of the stream, hereinafter called a block of transactions. Furthermore, we suppose that each block corresponds to one time unit: hence, D[1,n) will indicate the first n−1 data blocks, and Dn the n-th block. This hypothesis allows us to adopt a lighter notation and causes no loss of generality.
The Streaming Partition algorithm. The basic idea exploited by Partition is the
following: if the dataset is divided into several partitions, then each globally frequent
pattern must be locally frequent in at least one partition. This guarantees that the
union of all local solutions is a superset of the global solution. However, one further
pass over the database is necessary to remove all false positives, i.e. patterns that
result locally frequent but globally infrequent.
In order to extend this approach to a stream setting, blocks of data received from
the stream are used as an infinite set of partitions. A block of data is processed as
soon as "enough" transactions are available, and the results are merged with the current approximate result, which refers to the past part of the stream. Unfortunately, in the stream case only recent raw data (transactions) can be kept available for processing due to memory limits, thus the usual Partition second pass is restricted to the accessible data. Only the partial results extracted so far from previous blocks, and some other additional information, are available for determining the global result set, i.e. the frequent patterns and their supports. One naïve work-around is to avoid the second pass and keep in the result set only the patterns for which the sum of the known supports, i.e. of the supports in the blocks mined so far where the pattern resulted locally frequent, is greater than (or equal to) minsup. We will name this algorithm Streaming Partition. The first time a pattern
x is reported, its support corresponds to the support computed in the current block.
In case it appeared previously, this means introducing an error. If j is the first block
where x is frequent, then this error can be at most σmin[1,j] − 1. This is formalized
in the following lemma.
Lemma 11 (Bounds on support after first pass). Let P = {1, ..., n} be the set of indexes of the n blocks received so far. Then let fpart(x) = {j ∈ P | σj(x) > minsup · |Dj|} be the set of indexes of the blocks where the pattern x is frequent, and let $\overline{fpart}(x) = P \setminus fpart(x)$ be its complement. The support of a pattern x is no less than the support computed by the Streaming Partition algorithm (σ(x)^lower) and is less than or equal to σ(x)^lower plus the maximum support the same pattern can have in the blocks where it is not frequent:
$$\sigma(x)^{lower} = \sum_{j \in fpart(x)} \sigma_j(x), \qquad \sigma(x)^{upper} = \sigma(x)^{lower} + \sum_{j \in \overline{fpart}(x)} \left( minsup \cdot |D_j| - 1 \right)$$
Note that when a pattern x is frequent in a block Dj , its local support is summed
to both the upper and lower bounds. Otherwise, its local support can range from 0
(no occurrence) to the local minimum support threshold minus one (i.e. minsup ·
|Dj | − 1), thus the lower bound remains the same, whereas the upper bound is
increased. We can easily transform the two absolute bounds defined above into the
corresponding relative ones, usable to calculate the Average Support Range, defined
in appendix A:
$$sup(x)^{upper} = \frac{\sigma(x)^{upper}}{|D|}, \qquad sup(x)^{lower} = \frac{\sigma(x)^{lower}}{|D|}, \qquad \text{where } |D| = \sum_{j=1}^{n} |D_j|$$
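As a concrete illustration of the lemma (with purely hypothetical numbers), the absolute bounds can be computed as follows; sigma[j] holds the local support of x in block j when x was locally frequent there, and None when its local count is unknown.

def partition_bounds(sigma, block_sizes, minsup):
    lower = sum(s for s in sigma if s is not None)
    upper = lower + sum(minsup * n - 1 for s, n in zip(sigma, block_sizes) if s is None)
    return lower, upper

# x locally frequent in blocks 1 and 3, unknown in block 2 (1000 transactions each)
print(partition_bounds([30, None, 45], [1000, 1000, 1000], minsup=0.02))
# -> (75, 94.0): the unknown block contributes at most 0.02 * 1000 - 1 = 19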
Streaming Partition has serious resource usage issues. In order to keep track of
frequent itemsets, a counter for each distinct pattern found to be frequent in at
least one block is needed. This obviously leads to an unacceptable memory usage in
most cases. The only way to overcome this limitation is introducing some kind of
forget policy: in the remainder of this chapter, when we refer to Streaming Partition we mean Streaming Partition with the deletion of the patterns that turn out to be globally infrequent after each block is processed. Another problem with Streaming Partition is that for every pattern the computed support is a very conservative estimate, since it always chooses the lower bound to approximate the result.
Generally, any algorithm returning a support value between the bounds will
have better chances of being more accurate. Following this idea, we devised a new
algorithm based on Streaming Partition that uses a smart interpolation of support.
Moreover, it is resilient to skewed item distributions.
The APStream algorithm.
The streaming algorithm we propose, APStream , tries to overcome some of the problems encountered by Streaming Partition and other similar algorithms for association
mining on streams when the data skew between different incoming blocks is high.
The most evident problem is that several globally infrequent patterns may be locally
frequent, increasing both resource utilization and execution time of these algorithms.
APStream addresses this issue by means of global pruning based on historical exact
(when available) or interpolated support: each locally frequent pattern that is not
globally frequent according to its interpolated support will be immediately removed
and will not produce any child candidate. Moreover this skew might cause a globally
frequent pattern x to result infrequent on a given data block Di . In other words,
since σi (x) < minsup·|Di |, x will not be found as a frequent pattern in the ith block.
As a consequence, we will not be able to count on the knowledge of σi (x), and thus we
cannot exactly compute the support of x. Unfortunately, Streaming Partition might also deduce that x is not globally frequent, because $\sum_{j \neq i} \sigma_j(x) < minsup \cdot |D|$.
Result merge and interpolation
When an input block Di is available for processing, APStream extracts its frequent itemsets using the DCI algorithm. Then, for each pattern x included either in the past combined results or in the recent FIM results, it computes the approximate global support σ[1,i](x)^interp in different ways, according to the specific situation. The approximate past support σ[1,i)(x)^interp was obtained by merging the FIM results of blocks D1 . . . Di−1 using the technique discussed here. σ[1,i)(x)^interp can be either known or not, depending on the presence of x in the past combined results. In the same way, σi(x) is known only if x is frequent in Di. The following table summarizes the possible cases and the action taken by APStream:

σ[1,i)(x)^interp | σi(x)   | Action
known            | known   | sum σi(x) to past support and bounds
known            | unknown | recount σi(x) on recent, still available, data
unknown          | known   | interpolate the past support σ[1,i)(x)^interp
The first case is the simplest to handle: the new support σ[1,i](x)^interp will be the
sum of σ[1,i) (x)interp and σi (x). Since σi (x) is exact, the width of the error interval
will remain the same. The second one is similar, except that we need to look at
recent data for computing σi (x). The key difference with Streaming Partition is the
handling of the last case. APStream , instead of supposing that x never appeared in
the past, tries to interpolate σ[1,i) (x).
The interpolation is based on the knowledge of:
• the exact support of each item (or optionally just the approximate support of
a fixed number of most frequent items)
• the reduction factors of the support count of subpatterns of x in current block
with respect to its interpolated support over the past part of the stream.
The algorithm will thus deduce the unknown support σ[1,i) (x) of itemset x on
the part of the stream preceding the ith block as follows:
$$\sigma_{[1,i)}(x)^{interp} = \sigma_i(x) \cdot \min_{item \in x} \left\{ \min\left( \frac{\sigma_{[1,i)}(item)}{\sigma_i(item)},\ \frac{\sigma_{[1,i)}(x \setminus item)^{interp}}{\sigma_i(x \setminus item)} \right) \right\}$$
In the previous formula, the result of the inner min is the minimum between the ratio of the past and recent supports of an item contained in pattern x and the same ratio computed for the itemset obtained from x by removing that item. Note that during the processing of recent data the search space is visited level-wise, and the merge of the results is also performed starting from the shorter patterns. Hence the interpolated supports σ[1,i)(x \ item)^interp of all the (k−1)-subpatterns of a k-pattern x are known. In fact, each support can be either known from the processing of the past part of the stream or computed during the previous iteration on recent data.
Example of interpolation. Suppose that we have received 440 transactions so
far, and that 40 of these are in the current block. The itemset {A, B, C}, briefly
indicated as ABC, is frequent locally whereas it was infrequent in previous data.
Table 5.1 reports the support of every subpattern involved in the computation.
The first column contains the patterns, the second and third columns contain the
supports of the patterns in the last received block and in the past part of the stream.
Finally, the last column shows the reduction ratio for each pattern.
x         | σi(x) | σ[1,i)(x)^interp | σ[1,i)(x)^interp / σi(x)
{A, B, C} | 6     | ?                | ?
{A, B}    | 8     | 50               | 6.2
{A, C}    | 12    | 30               | 2.5
{B, C}    | 10    | 100              | 10
{A}       | 17    | 160              | 9.4
{B}       | 14    | 140              | 10
{C}       | 18    | 160              | 8.9
{}        | 40    | 400              | -

Table 5.1: Sample supports and reduction ratios (σmin[1,t) = 20).

The algorithm examines the itemsets of size k − 1 (two in this simple example) and the single items, and chooses the one having the minimum ratio. In this case the minimum is 2.5, corresponding to the subpattern {A, C}. Since in recent data the support of the itemset x = {A, B, C} is σi(x) = 6, the interpolated support will be σ[1,i)(x)^interp = 6 · 2.5 = 15.
It is worth remarking that this method works if the support of larger itemsets decreases similarly in most parts of the stream, so that a reduction factor (different for each pattern) can be used to interpolate the unknown values. Finally, note that the interpolated value above should satisfy the inequality σ[1,i)(x)^interp < minsup · |D[1,i)|. If it is not satisfied, the interpolated result should not be accepted since, otherwise, the exact value σ[1,i)(x) would have already been found. Hence, in those few cases where the above inequality does not hold, the interpolated value is set to σ[1,i)(x)^interp = (minsup · |D[1,i)|) − 1. In the example described in Table 5.1 the interpolated support for {A, B, C} is 15 and the minimum support threshold for the past data is 20, so the bound is respected. Otherwise, the interpolated support would have been forced to 19.
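The interpolation step can be sketched in a few lines of Python (the names and data layout are ours); run on the values of Table 5.1 it reproduces the interpolated support of {A, B, C} computed above.

def interpolate(x, sigma_recent, sigma_past, past_minsup_count):
    ratios = []
    for item in x:
        ratios.append(sigma_past[frozenset([item])] / sigma_recent[frozenset([item])])
        sub = frozenset(x) - {item}
        ratios.append(sigma_past[sub] / sigma_recent[sub])
    interp = sigma_recent[frozenset(x)] * min(ratios)
    # the interpolated past support must stay below the past minsup threshold
    return min(interp, past_minsup_count - 1)

recent = {frozenset('A'): 17, frozenset('B'): 14, frozenset('C'): 18,
          frozenset('AB'): 8, frozenset('AC'): 12, frozenset('BC'): 10,
          frozenset('ABC'): 6}
past = {frozenset('A'): 160, frozenset('B'): 140, frozenset('C'): 160,
        frozenset('AB'): 50, frozenset('AC'): 30, frozenset('BC'): 100}
print(interpolate('ABC', recent, past, past_minsup_count=20))   # -> 15.0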
The proposed interpolation schema yields a better approximation of exact results than Streaming Partition, in particular with respect to the approximation of
the support of frequent patterns. The supports computed by the latter algorithm
are, in fact, always equal to the lower bounds of the intervals containing the exact
support of any particular pattern. Hence any kind of interpolation producing an
approximate result set, whose supports are between the interval bounds, should be,
generally, more accurate than picking always its lower bound. For the same reason
the computed support values should be also more accurate than those computed by
Lossy Count (Frequent does not return any support value). Obviously, several other ways of computing a support interpolation could be devised. Some are as simple as the average of the bounds, while others are as complex as counting inference, used in
a different context in [43]. We chose this particular kind of interpolation because it
is simple to calculate, since it is based on data that we already maintain for other
purposes, and it is aware of the underlying data enough to allow for accurate handling of datasets characterized by data-skew on item distributions among different
blocks.
We can finally introduce the pseudo-code of APStream . As in Streaming Partition
the transactions are received and buffered. DCI, the algorithm used for the local
computations, is able to exactly know the amount of memory required for mining a dataset during the intersection phase. Since frequent patterns are processed
sequentially and can be offloaded to disk, the memory needed for efficient computation of frequent patterns is just that used by the bitmap representing the vertical
dataset and can be computed knowing the number of transactions and the number
of frequent items.
Procedure processBlock(frequentItems, buffer, globFreq)
  locFreq[1] ← frequentItems;
  k ← 2;
  while locFreq[k − 1].size ≥ k do
      locFreq[k] ← computeFrequent(k, locFreq, globFreq);
      if k = 2 then VD ← fillVerticalDataset(buffer, frequentItems);
      commitInsert(VD, k, locFreq, globFreq);
      k ← k + 1;
  end

Procedure commitInsert(VertData, k, locFreq, globFreq)
  foreach pat ∈ globFreq[k] : pat ∉ locFreq[k] do
      compute the support of pat in VertData;
      if pat is frequent then
          pre-insert pat in globFreq[k];
      end
  end
  replace globFreq[k] with the sorted insert buffer;

Function computeFrequent(k, locFreq, globFreq)
  compute the locally frequent patterns;
  foreach locally frequent pattern pat do
      compute its global interpolated support and bounds;
      if pat is globally frequent then
          insert pat in locFreq[k];
          pre-insert pat in globFreq[k];
      end
  end
  return locFreq[k];
Thus, we can use this knowledge in order to maximize the size of the block of
transactions processed at once. For the sake of simplicity we will neglect the quite
obvious main loop with code related to buffering, concentrating on the processing
of each data block. The interpolation formula has been omitted too, in the pseudocode, for the same reason.
Each block is processed, visiting the search space level-wise, to discover frequent patterns. In this way, itemsets are sorted according to their length and the interpolated support of the frequent subpatterns is always available when required. The processing of the patterns of length k is performed in two steps. First the frequent patterns are computed in the current block, and then the actual insertion into the current set of frequent patterns is carried out. When a pattern is found to be frequent in the current block, its support on past data is immediately checked: if it was already known, then the local support is summed to the previous support and bounds. Otherwise, a support and a pair of bounds are inferred for the past data and summed to the support in the current block. In both cases, if the resulting support passes the support test, the pattern is queued for delayed insertion. After every locally frequent pattern of the current length k has been processed, the support of every previously known pattern that is not locally frequent is computed on recent data. Patterns passing the support test are queued for delayed insertion too. Then the set of pre-inserted itemsets is sorted and the actual insertion takes place.
Bounds on computed support errors
As a consequence of using an interpolation method to guess an approximate support
value in the past part of the stream, it is very important to establish some bounds
on the support found for each pattern. In the previous subsection, we have already
indicated a pair of really loose bounds: a support cannot be negative, and if a pattern was not frequent in a time interval then its interpolated support should be less than the minimum support threshold for the same interval. The lower bound is obviously always satisfied, whereas when a support value σ[1,i−1](x)^interp breaks its upper bound, it is forced to (minsup · |D[1,i−1]|) − 1, which is the greatest value compatible with the bound. This criterion is always valid for non-evolving distributed datasets (distributed frequent pattern mining) or for the first two data blocks of the stream. In the stream case, the upper bound is based on
previous approximate results, and could be inexact if the pattern corresponds to a
false negative. Nevertheless, it does represent a useful indication.
Bounds based on pattern subsets The first bounds that interpolated supports should obey derive from the Apriori property: no itemset can have a support greater than that of any of its subsets. Since recent results are merged level-wise with previously known ones, the interpolation can exploit the already interpolated subset supports. When a subpattern is missing during interpolation, this means that it has been examined during a previous level and discarded. In that case, all of its supersets may be discarded as well. The computed bound is thus affected by the approximation of past results: a pattern with an erroneous support will affect the bounds of each of its supersets. To avoid this issue it is possible to compute the upper bound for a pattern x simply using the upper bounds of its sub-patterns instead of their supports. In this way, the upper bounds will be weaker, but there will be fewer false negatives due to erroneous bound enforcement.
Bounds based on transaction hash In order to address the issue of error propagation in support bounds we need to devise some other kind of bounds that are
computed exclusively from received data and thus are independent of any previous results. Such bounds can be obtained using inverted transaction hashes. This
technique was first introduced in the algorithm IHP [26], an association mining algorithm where it was used for finding upper bounds for the support of candidates
in order to prune the infrequent ones. As we will show, this method can also be used for lower bounds. The key idea is that each item has an associated hashed set of counters that are accessed by using the transaction id as a key. In more detail, each array hcnt[item] associated with an item is an array of hsize counters initialized to zero. When the tid-th transaction t = {ti} is processed, a hash function transforms the tid value into an index to be used for the array of counters. Since tids are consecutive integer numbers, a trivial hash function such as h(tid) = tid mod hsize will
guarantee an equal repartition of transactions among all hash bins. For each item
ti ∈ t the counter at position h(tid) in the array hcnt[ti ] is incremented.
The hash function implicitly subdivides the transactions of the dataset. Each
partition corresponds to a position in the array of counters, while the value of
each counter represents the number of occurrences of an item in a given set of
transactions. These hashes are a sort of ”compressed” tid-list and can be intersected
to obtain deterministic bounds for the number of occurrences of a specified pattern.
Notably these arrays of counters have a fixed size, independent from the number of
transactions processed.
Let hsize = 1, let A and B be two items, and let hA = hcnt[A][0] and hB = hcnt[B][0] be the only counters contained in their respective hashes, i.e. hA and hB are the numbers of occurrences of items A and B in the whole dataset. According to the Apriori principle, the support σ({A, B}) of the pattern {A, B} can be at most equal to min(hA, hB). Furthermore, we are able to indicate a lower bound for the same support. Let n[i] be the number of transactions associated with the i-th hash position, which, in this case, corresponds to the total number of transactions n. We know from the inclusion/exclusion principle that σ({A, B}) must be greater than or equal to max(0, hA + hB − n). In fact, if n − hA transactions do not contain the item A, then at least hB − (n − hA) of the hB transactions containing B will also contain A. Suppose that n = 10, hA = 8, hB = 7. If we represent with an X each
transaction supporting a pattern and with a dot any other transaction we obtain
the following diagrams:
Best case (ub(AB) = 7)        Worst case (lb(AB) = 5)
 A:  XXXXXXXX..                XXXXXXXX..
 B:  XXXXXXX...                ...XXXXXXX
 AB: XXXXXXX...                ...XXXXX..
Then no more than 7 transactions will contain both A and B. At the same time
at least 8 + 7 − 10 = 5 transactions will satisfy that constraint. Since each counter represents a set of transactions, this operation is equivalent to the computation of the minimal and maximal intersections of the tid-lists associated with the single items.
Usually hsize will be larger than one. In that case, the previously explained computations are applied to each hash position, yielding an array of lower bounds and an array of upper bounds. The sums of their elements give the pair of bounds for the pattern {A, B}, as we show in the following example. Let hsize = 3, let h(tid) = tid mod hsize be the hash function, let A and B be two items, and let n[i] = 10 be the number of transactions associated with the i-th hash position. Suppose that
hcnt[A] = {8, 4, 6} and hcnt[B] = {7, 5, 6}. Using the same notation previously
introduced we obtain:
h(tid) = 0     Best case (supp = 7)    Worst case (supp = 5)
 A:            XXXXXXXX..              XXXXXXXX..
 B:            XXXXXXX...              ...XXXXXXX
 AB:           XXXXXXX...              ...XXXXX..

h(tid) = 1     Best case (supp = 4)    Worst case (supp = 0)
 A:            XXXX......              XXXX......
 B:            XXXXX.....              .....XXXXX
 AB:           XXXX......              ..........

h(tid) = 2     Best case (supp = 6)    Worst case (supp = 2)
 A:            XXXXXX....              XXXXXX....
 B:            XXXXXX....              ....XXXXXX
 AB:           XXXXXX....              ....XX....
Each pair of columns represents the transactions having a tid mapped into the
corresponding location by the hash function. The lower and upper bounds for the
support of pattern AB will be respectively 5 + 0 + 2 = 7 and 7 + 4 + 6 = 17.
Both the lower bound and the upper bound computations can be extended to larger itemsets by associativity: the bounds for the first two items are composed with the counters of the third item, and so on. The sums of the elements of the last pair of resulting arrays are the upper and lower bounds for the given pattern. This is possible since the reasoning previously explained still holds if we consider the occurrences of itemsets instead of those of single items. The lower bound computed in this way will often be equal to zero on sparse datasets. Conversely, on dense datasets this method proved to be effective in narrowing the two bounds.
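A small Python sketch of this bound computation (our own illustration, with hypothetical function names) is reported below; applied to the per-bucket counters of the example above it returns the bounds 7 and 17.

def hash_bounds(hcnt_x, hcnt_y, bucket_sizes):
    # per-bucket lower and upper bounds for the co-occurrence count
    lower = [max(0, a + b - n) for a, b, n in zip(hcnt_x, hcnt_y, bucket_sizes)]
    upper = [min(a, b) for a, b in zip(hcnt_x, hcnt_y)]
    return lower, upper

lo, up = hash_bounds([8, 4, 6], [7, 5, 6], [10, 10, 10])
print(sum(lo), sum(up))                    # -> 7 17
# For longer itemsets the per-bucket bound arrays can be composed with the
# counters of a further item, exactly as described above.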
Experimental evaluation
In the final part of this section, we study the behavior of the proposed method. We
have run the APStream algorithm on several datasets using different parameters. The
goal of these tests is to understand how the similarity of the results varies as the stream length increases, to assess the effectiveness of the hash-based pruning and, in general, to understand how dataset peculiarities and invocation parameters affect the accuracy of the results. Furthermore, we studied how the execution time evolves as the stream length increases.
Similarity and Average Support Range. The method we are proposing yields
approximate results. In particular APStream computes pattern supports which may
be slightly different from the exact ones, thus the result set may miss some frequent
patterns (false negatives) or include some infrequent patterns (false positives). In
order to evaluate the accuracy of the results we use a common measure of similarity between two pattern sets, introduced in [50] and based on support difference.
To the same end, we use the Average support Range (ASR), an intrinsic measure of
the correctness of the approximation introduced in [61]. An extensive description of
both measures and a discussion on their use can be found in the appendix A.
Experimental data. We performed several tests using both real world datasets,
mainly from the FIMI’03 contest [1], and synthetic datasets generated using the IBM
generator. We randomly shuffled each dataset and used the resulting datasets as
input streams.
Table 5.2 shows the list of these datasets along with their cardinality. The datasets whose names start with T are synthetic datasets, which mimic the behavior of market basket transactions. The sparse dataset family T20I8N5k has transactions composed, on average, of 20 items, chosen from 5000 distinct items, and includes maximal patterns whose average length is 8. The dataset family T30I30N1k was generated with the parameters synthetically indicated in its name and is composed of moderately dense datasets, since more than 10,000 frequent patterns can be extracted even with a minimum support of 30%. A description of all the other datasets can be found in [1]. Kosarak and Retail are really sparse datasets, whereas all the other real world datasets used in the experimental evaluation are dense. Table 5.2 also indicates for each dataset a short identifying code that will be used in our charts.
Dataset     | Reference | #Trans.
accidents   | A         | 340183
kosarak     | K         | 990002
retail      | R         | 88162
pumbs       | P         | 49046
pumbs-star  | PS        | 49046
connect     | C         | 67557
T20I8N5k    | S2..6     | 77302..3189338
T25I20N5k   | S7..11    | 89611..1433580
T30I30Nf1k  | D1..D9    | 50000..3189338

Table 5.2: Datasets used in experimental evaluation.
Experimental Results. For each dataset and several minimum support thresholds, we computed the exact reference solutions by using DCI [44], an efficient sequential algorithm for frequent pattern mining (FPM). Then we ran APStream for
different values of available memory and number of hash entries.
The first test is focused on catching the effect of used memory on the behaviour of
the algorithm when the block of transactions processed at once is sized dynamically
according to the available resources. In this case, data are buffered as long as all
the item counters, and the representation of the transactions included in the current
block fit into the available memory. Note that the size of all frequent itemsets, either
mined locally or globally, is not considered in our resource evaluation, since they can
be offloaded to disk if needed. The second test is somehow related to the previous
one. In this case, the amount of required memory is variable, since we determine
a-priori the number of transactions to include in a single block, independently of
the stream content. Since the datasets used in the tests are quite different, in both
cases we used really different ranges of parameters. Therefore, in order to fit all the
datasets in the same plot, the numbers reported in the horizontal axis are relative
quantities, corresponding to the block sizes actually used in each test. These relative
quantities are obtained by dividing the memory/block size used in the specific test
by the smallest one for that dataset. For example, the series 50KB, 100KB, 400KB
thus becomes 1,2,8.
The first plot in Figure 5.1 shows the results obtained in the fixed memory case,
while the second one refers to the case of a fixed number of transactions per block.
The relative quantities reported in the plots refer to different base values of either
memory or transactions per block; these values are reported in the legend of
each plot. In general, when we increase the number of transactions processed at
once, either statically or dynamically on the basis of the available memory, we also
improve the result similarity. Nevertheless, the variation is in most cases small, and
sometimes there is a slightly negative trend caused by the nonlinear relation between
used memory and transactions per block. In our tests we noted that choosing an
excessively low amount of available memory for some datasets leads to performance
degradation and sometimes also to similarity degradation. The last plot shows the
effectiveness of the hash-based bounds in reducing the Average Support Range (zero
corresponds to an exact result). As expected, the improvement is evident only on the
denser datasets.
The last batch of tests makes use of a family of synthetic datasets with homogeneous distribution parameters and varying lengths. These datasets are obtained
from the largest dataset of the series by truncating it, so as to simulate streams of different lengths. For each truncated dataset we computed the exact result set, used
as the reference in computing the similarity of the corresponding approximate result obtained by APStream. The first chart in Figure 5.2 plots both similarity and
ASR as the stream length increases. We can see that the similarity remains almost the
same, whereas the ASR decreases as an increasing amount of the stream is processed.
Finally, the last plot shows the evolution of the execution time as the stream length increases. The execution time increases linearly with the length of the stream, hence
the average time per transaction is constant if we fix the dataset and the execution
parameters.
[Figure: three plots. Left: Similarity (%) vs. relative available memory, legend: A (minsupp=30%, base mem=2MB), C (minsupp=70%, base mem=2MB), P (minsupp=70%, base mem=5MB), PS (minsupp=40%, base mem=5MB), R (minsupp=0.05%, base mem=5MB), K (minsupp=0.1%, base mem=16MB). Right: Similarity (%) vs. relative number of transactions per block, legend: A (minsupp=30%, base trans/block=10k), C (minsupp=70%, base trans/block=4k), K (minsupp=0.1%, base trans/block=20k), P (minsupp=70%, base trans/block=8k), PS (minsupp=40%, base trans/block=4k), R (minsupp=0.05%, base trans/block=2k). Bottom: Average support range (%) vs. number of hash entries, legend: A (minsupp=30%), C (minsupp=70%), K (minsupp=0.1%), P (minsupp=70%).]

Figure 5.1: Similarity and Average Support Range as a function of available memory,
number of transactions per block, and number of hash entries.
Acknowledgment
The datasets used during the experimental evaluation are some of those used for
the FIMI'03 (Frequent Itemset Mining Implementations) contest [1]. Thanks to
the owners of these data and to the people who made them available in their current format:
in particular, Karolien Geurts [21] for Accidents, Ferenc Bodon for Kosarak, Tom
Brijs [10] for Retail, and Roberto Bayardo for the conversion of the UCI datasets. Other
datasets were generated using the publicly available synthetic data generator
from the IBM Almaden Quest data mining project [6].
5.4 Conclusions
In this chapter we have discussed APStream, a new algorithm for approximate frequent pattern mining on streams, and described several related algorithms for frequent item and itemset mining. APStream exploits a novel interpolation method to
infer the unknown past counts of some patterns, which are frequent only in recent data. Since the support values computed by the algorithm are approximate,
we have also proposed a method for establishing a pair of upper and lower bounds for each interpolated value. These bounds are computed using the knowledge of
subpattern frequencies in past data and the intersection of a hash-based compressed
representation of past data.

[Figure: two plots for dataset T30I30N1k (min_supp=30%). Left: Similarity (%) and ASR (%) vs. stream length (/100k). Right: relative execution time vs. stream length (/100k).]

Figure 5.2: Similarity and Average Support Range as a function of different stream
lengths.
Experimental tests show that the solution produced by APStream is a good approximation of the exact global result. The comparisons with exact results consider
both the set of patterns found and their support. The metric used in order to assess
the quality of the algorithm output is the similarity measure introduced in [50].
The interpolation works particularly well for dense datasets, achieving a similarity
close to 100% in the best cases. The adaptive behavior of APStream allows us to limit
the amount of used memory. As expected, we have found that a larger amount of
available memory corresponds to a more accurate result. Furthermore, as the length
of the processed stream increases, the similarity with the exact result remains almost the same. At the same time, we observed a decrease in the average difference
between upper and lower bounds, which is an intrinsic measure of result accuracy.
This means that as the stream length increases, the relative bounds on the support
get closer. Finally, the time needed to process a block of transactions does not
depend on the stream length, hence the total execution time is linear with respect
to the stream length. In the future, we plan to improve the proposed method by
adding other stricter bounds on the approximate support and to extend it to closed
patterns.
Conclusions
The knowledge discovery process and, particularly, its data mining algorithmic part
have been extensively studied in the literature during the last twenty years, and the field is
still an active discipline. Several problems and analysis methods have been proposed,
and the extraction of valuable hidden knowledge from operational databases is
currently a strategic issue for most medium and large companies. Most of these organizations are geographically spread by nature, and distributed database systems
are widely adopted for logistic, failure-resilience, or performance reasons.
Banks, telecommunication companies, and wireless access providers are just some of the
users of distributed systems for the management of both historical and operational
data. Furthermore, in several cases, the data are produced and/or modified continuously and at a sustained rate. The use of data mining algorithms in distributed
and stream settings may introduce several challenging issues. Problems may be either technical, related to the network infrastructure and the huge amount of data,
or political, related to privacy, company interests, or data ownership. The issues to
solve, however, depend on the kind of knowledge we are interested in extracting from
the data.
In this thesis, we have analyzed in detail the issues related to Association
Rule Mining, and more precisely to its most computationally expensive phase, the
mining of frequent patterns in distributed datasets and data streams, where these
patterns can be either itemsets (FIM) or sequences (FSM). The core contribution
of this work is a general framework for adapting an exact Frequent Pattern Mining
algorithm to a distributed or streaming context. The resulting algorithms are able
to efficiently find an approximation of the exact results with a strong reduction of the
communication volume, in the distributed case, and of the memory usage, in the stream case.
In both cases, the approximate support of each pattern is returned along with an
interval containing the true value.
The proposed methods have been evaluated in both a distributed setting and a
stream setting, using several real world and synthetic datasets. The results of our tests
show that this framework gives a fairly accurate approximation of the exact results, even
when exploiting only simple and generic interpolation schemas such as those used in the tests.
In the distributed case, the interpolation-based method exhibits linear speedup as
the number of partitions increases. In the stream case, the time required to
process a block is on average constant, hence the total execution time is linear with
respect to the length of the data stream. At the same time, both the similarity to
the exact results and the absolute width of the error interval are almost constant.
Thus, the algorithm is suitable for mining unbounded amounts of data.
One further original contribution presented in this thesis is an algorithm for
frequent sequence mining with gap constraints. CCSM is a novel algorithm for
the discovery of frequent sequence patterns with a constraint on the maximum gap
between the occurrences of two parts of the sequence (maxGap). The proposed algorithm has been compared with cSPADE, a state of the art algorithm, obtaining
better performance results for significant values of the maxGap constraint. Thanks to
the particular traversal order of the search space exploited by CCSM, the intermediate results are highly reused, and the output is ordered. This is particularly
important and allows the CCSM algorithm to be efficiently integrated in the proposed
distributed/stream framework, as explained in the next section.
Future work
Frequent Sequence Mining on distributed/stream data
The methods presented for frequent itemset extraction can easily be extended to
the other kind of frequent pattern considered in this thesis: frequent sequences.
This only involves minor modifications of the algorithms: replacing the interpolation formula with one suitable for sequences, and the FIM algorithm with an FSM
algorithm. The CCSM algorithm is a suitable FSM candidate to be inserted in our
distributed and stream framework, since it is level-wise and returns ordered sets of
frequent sequences. This ordering allows the sequence patterns to be merged on the fly as they arrive, and the level-wise behavior makes more information available
to the interpolation schema, yielding a better approximation.
Furthermore, the on-the-fly merge reduces both the memory requirements and the computational cost of the merge phase.
As the overall framework remains exactly the same, all the improvements and
limits that we have discussed for frequent itemsets are still valid. The only differences are those originating from the intrinsic differences between frequent itemsets and
frequent sequences, which make the result of FSM potentially larger and more likely
to be affected by combinatorial explosion.
Frequent Itemset Mining on distributed stream data
The proposed merge/interpolation framework can be extended seamlessly to manage
distributed streams in several ways. The most straightforward one is based on
the composition of APInterp followed by APStream. Each slave is responsible for
extracting frequent itemsets from its local streams. The results of each processed
block are sent to the master and merged, first among them using APInterp, and then
with the past combined results as in APStream. The schema on the left of Figure C.3
illustrates this framework: Res_node,i is the FIM result on the i-th block of the node
stream, Res_i is the result of the merge of all the local i-th results, and Hist_Res_i
is the historical global result, i.e., the result from the beginning of the stream up to the i-th block. A sketch of this composition is given below.
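The following minimal Python sketch illustrates this composition for a single block index i. All names (local_fim, merge_distributed, merge_stream, node_streams) are hypothetical placeholders, passed in as parameters, standing for the FIM computation on a block, the APInterp-style distributed merge, and the APStream-style merge with past results; they do not correspond to actual code of the thesis.

    def process_block(i, node_streams, hist_res,
                      local_fim, merge_distributed, merge_stream):
        # Res_node,i: local FIM result on the i-th block of each node's stream
        local_results = [local_fim(stream.block(i)) for stream in node_streams]
        # Res_i: merge of all local i-th results (APInterp-style distributed merge)
        res_i = merge_distributed(local_results)
        # Hist_Res_i: merge of Res_i with the past combined results (APStream-style)
        return merge_stream(hist_res, res_i)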
Figure C.3: Distributed stream mining framework. On the left, distributed merge
followed by stream merge; on the right, local stream merge followed by distributed
merge.
A first improvement on this basic idea could be the replacement of the two cascaded merge phases, one related to distribution and the other to the stream, with
a single one. This would allow for better accuracy of the results and stricter bounds,
thanks to the reduction of the accumulated errors. Clearly, the recount step, used in
APStream for assessing the support of recently non-frequent itemsets that were frequent in past data, is impossible in both cases: since the merge is performed in
the master node, only the received locally frequent patterns are available. However,
this step proved to be effective in our preliminary tests on APStream, particularly for
dense datasets.
In order to introduce the local recount phase, it is necessary to move the stream
merge phase to the slave nodes. In this way, recent data are still available in the
reception buffer and can be used to improve the results. Each slave node then sends
its local results, related to the whole history of its streams, to the master node, which
simply merges them as in APInterp. Since these results are sent each time a block
is processed, it would be advisable to send only the differences in the results related
to the last processed block. This involves rethinking the central merge phase but,
in our opinion, it should yield good results. The schema on the right of Figure C.3
illustrates this framework. DCI result streams are directly processed by APStream,
yielding Hist_Res_node,i, i.e., the results on the whole node stream at time i. APInterp
collects these results and outputs the final result Hist_Res_i.
The last aspect to consider is synchronization. Each stream potentially evolves
at a different rate with respect to the other streams. This means that when the stream
reception buffer of a node is full, other nodes could still be collecting data. Thus, the
collect and merge framework should allow for an asynchronous and incremental result
merge, with some kind of forced periodic synchronization, if needed.
Limiting the combinatorial explosion of the output
It should be noted that, both in the distributed and in the stream setting, the actual
time needed to process a partition is mainly related to the statistical properties of
the data. This problem is not specific to our algorithms: rather, it is a peculiarity of the
frequent itemset/sequence mining problems, and is directly linked to the exponential size
of the result sets. Our goal was to find an approximate solution as close as possible
to the exact one, and this is exactly what we achieved. However, this means that when
the exact solution is huge, the approximate solution will be huge too. In this
case, if we want to ensure that data can be processed at a given rate, a
different approach is mandatory.
Two approaches can be devised. The first one is based on alternative representations of the results, such as closed/condensed/maximal frequent patterns. As briefly
explained in the related work of Chapter 2, both the result size and the information
on the support of patterns decrease from the first to the last of these three problems, but
the presence of a pattern in the results remains certain. The second one, instead,
aims at discovering only a useful subset of the result, as in the case of alignment
patterns [31]. We have done some preliminary work on approximate distributed
closed itemset mining [32], but the second approach will also be a matter of further
investigation. We believe it should be particularly effective in the sequence case,
which is more affected by the combinatorial explosion problem.
A Approximation assessment
The methods we are proposing yield approximate results. In particular, APInterp
computes pattern supports which may be slightly different from the exact ones;
thus the result set may miss some frequent patterns (false negatives) or include some
infrequent patterns (false positives). In order to evaluate the accuracy of the results
we need a measure of similarity between two pattern sets. A widely used one has
been introduced in [50], and is based on the support difference.
Definition 12 (Similarity). Let A and B respectively be the reference (correct) result
set and the approximate result set. sup_A(x) ∈ [0, 1] and sup_B(y) ∈ [0, 1], where
x ∈ A and y ∈ B, correspond to the relative supports found in A and B, respectively.
Note that since B corresponds to the frequent patterns found by the approximate
algorithm under observation, A − B corresponds to the set of false negatives,
while B − A is the set of false positives.
The Similarity is thus computed as

\[
\mathrm{Sim}_{\alpha}(A,B) = \frac{\sum_{x \in A \cap B} \max\{0,\ 1 - \alpha \cdot |sup_A(x) - sup_B(x)|\}}{|A \cup B|}
\]

where α ≥ 1 is a scaling parameter which increases the effect of the support dissimilarity. Moreover, 1/α indicates the maximum allowable error on the (relative) pattern
supports. We will use the notation Sim() to indicate the default case for α, i.e.,
α = 1.
In case absolute supports are used instead of relative ones, the parameter α
will be smaller than or equal to 1. We will name this measure Absolute Similarity,
indicated as Sim_ABS(A, B).
This measure of similarity is thus the sum of at most |A ∩ B| values in the range
[0, 1], divided by |A ∪ B|. Since |A ∩ B| ≤ |A ∪ B|, the similarity lies in [0, 1] too.
When a pattern appears in both sets and the difference between the two supports
is greater than 1/α, it does not improve the similarity; otherwise the similarity is increased
according to the scaled difference. If α = 20, then the maximum allowable error in
the relative support is 1/20 = 0.05 = 5%. Supposing that the support difference for
a particular pattern is 4%, the numerator of the similarity measure will be increased
by a small quantity: 1 − (20 · 0.04) = 0.2. When α is 1 (the default value), only patterns
whose support difference is at most 100% contribute to increasing the similarity. On the
other hand, when we set α to a very high value, only patterns with very similar
supports in both the approximate and reference sets will contribute to increasing the
similarity measure (which is roughly the same as using Absolute Similarity with
α close to 1).
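A minimal Python sketch of the Sim_α computation is reported below, assuming that each result set is represented as a dictionary mapping a pattern (e.g., a frozenset of items) to its relative support; the function name and the data representation are illustrative only, not taken from the actual implementation.

    def similarity(ref, approx, alpha=1.0):
        # ref (A) and approx (B): dict mapping pattern -> relative support in [0, 1]
        common = ref.keys() & approx.keys()            # A intersection B
        union_size = len(ref.keys() | approx.keys())   # |A union B|
        if union_size == 0:
            return 1.0                                 # two empty sets are identical
        num = sum(max(0.0, 1.0 - alpha * abs(ref[p] - approx[p])) for p in common)
        return num / union_size

    # Example from the text: with alpha = 20, a 4% support difference contributes
    # 1 - 20 * 0.04 = 0.2 to the numerator.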
It is worth noting that the presence of several false positives and false negatives in
the approximate result set B contributes to reducing our similarity measure, since this
entails an increase in A ∪ B (the denominator of the Sim_α formula) with respect to
A ∩ B. Moreover, if a pattern has an actual support that is slightly less than minsup
but an approximate support (sup_B) slightly greater than minsup, the similarity
is decreased even if the computed support was almost correct. This could be an
undesired behavior: while a false negative can constitute a big issue, because some
potentially important association rules will not be generated at all, a false positive
with a support very close to the exact one could be tolerated by an analyst.
In order to overcome this issue we propose a new similarity measure, fpSim
(where fp stands for false positives). Since this measure considers every pattern included in the approximate result set B (instead of A ∩ B), it can be used to
assess whether or not the false positives have an approximate support value close to the exact
one. A high value of fpSim compared with a smaller value of Sim simply
means that the approximate result set B contains several false positives with a
true support close to minsup.
Definition 13 (fpSimilarity). Let A and B respectively be the reference (correct)
result set and the approximate result set. sup_B(x) ∈ [0, 1], where x ∈ B, corresponds
to the support found in the result set B, while sup(x) ∈ [0, 1] is the actual support of
the same pattern. The fpSimilarity is thus computed as

\[
\mathrm{fpSim}_{\alpha}(A,B) = \frac{\sum_{x \in B} \max\{0,\ 1 - \alpha \cdot |sup(x) - sup_B(x)|\}}{|A \cup B|}
\]

where α ≥ 1 is a scaling parameter. We will use the notation fpSim() to indicate the
default case for α, i.e., α = 1.
Note that the numerator of this new measure considers all the patterns found
in the set B, thus also the false positives. Hence finding a pattern with a support close
to the true one is considered a "good" result in any case, even if this pattern is
not actually frequent. For example, suppose that the minimum support threshold is
50% and x is an infrequent pattern such that sup(x) = 49.9%. If sup_B(x) = 50%, x
turns out to be a false positive. However, since sup_B(x) is very close to the exact
support sup(x), the value of fpSim_α() is increased.
In Definition 13 we used sup(x) instead of sup_A(x) to indicate the actual support
of an itemset x, since it is possible, as in the example above, that a pattern is present in
B even if it is not frequent (and hence not present in A).
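A corresponding Python sketch for fpSim_α is shown below; besides the two result sets, it needs the actual support sup(x) of every pattern in B, which for false positives is not available in A (names and data representation are again illustrative only):

    def fp_similarity(ref, approx, actual, alpha=1.0):
        # ref (A), approx (B): dict pattern -> support; actual: dict pattern -> sup(x)
        union_size = len(ref.keys() | approx.keys())   # |A union B|
        if union_size == 0:
            return 1.0
        # every pattern of B contributes to the numerator, including false positives
        num = sum(max(0.0, 1.0 - alpha * abs(actual[p] - approx[p])) for p in approx)
        return num / union_size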
In both definitions above, we used sup(x) to indicate the (relative) support,
ranging from 0 to 1. In the remainder of this thesis, in particular in the algorithm
description, we will also use the notation σ(x) = sup(x) · |D| to indicate the support
count (absolute support), ranging from 0 to the total number of transactions.
When bounds on the support of each pattern are available, an intrinsic measure
of the correctness of the approximation is the average width of the interval between
the upper bound and the lower bound.
Definition 14 (Average support range). Let B be the approximate result set, sup(x)
the exact support for pattern x, and sup(x)_lower and sup(x)_upper the lower and upper
bounds on sup(x), respectively. The average support range is thus defined as:

\[
\mathrm{ASR}(B) = \frac{1}{|B|} \sum_{x \in B} \left( sup(x)_{upper} - sup(x)_{lower} \right)
\]
Note that, while this definition can be used for every approximate algorithm, how
to compute sup(x)_lower and sup(x)_upper is algorithm specific. In the next section, we
present a way of computing these bounds that is suitable for the class of algorithms
to which the ones we propose belong.
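The ASR can be computed directly from the per-pattern bounds, however they are obtained; a minimal Python sketch, assuming the bounds are given as a dictionary mapping each pattern to a (lower, upper) pair of relative supports:

    def average_support_range(bounds):
        # bounds: dict mapping pattern -> (sup_lower, sup_upper), both in [0, 1]
        if not bounds:
            return 0.0
        return sum(upper - lower for (lower, upper) in bounds.values()) / len(bounds)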
Other, less accurate, similarity measures can be borrowed from Information
Retrieval theory:
Definition 15 (Recall & Precision). Let A and B respectively be the reference (correct) result set and the approximate result set. Note that since B corresponds to the
frequent patterns found by the approximate algorithm under observation, A − B
corresponds to the set of false negatives, while B − A is the set of false positives.
Let P(A, B) ∈ [0, 1] be the Precision of the approximate result, defined as follows:

\[
P(A,B) = \frac{|B \cap A|}{|B|}
\]

Hence the Precision is maximal (P(A, B) = 1) iff B ∩ A = B, i.e., the approximate
result set B is completely contained in the exact one A, and no false positive occurs.
Let R(A, B) ∈ [0, 1] be the Recall of the approximate result, defined as follows:

\[
R(A,B) = \frac{|B \cap A|}{|A|}
\]

Hence the Recall is maximal (R(A, B) = 1) iff B ∩ A = A, i.e., the exact result set
A is completely contained in the approximate one B, and no false negative occurs.
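Both measures ignore the supports and only compare the two pattern sets; a minimal Python sketch, assuming the result sets are given as Python sets of patterns (the empty-set conventions are our own choice):

    def precision(ref, approx):
        # P(A, B) = |B ∩ A| / |B|; by convention 1.0 when B is empty
        return len(ref & approx) / len(approx) if approx else 1.0

    def recall(ref, approx):
        # R(A, B) = |B ∩ A| / |A|; by convention 1.0 when A is empty
        return len(ref & approx) / len(ref) if ref else 1.0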
According to our remarks above concerning the benefits of the fpSim measure
(Def. 13), a "good" approximate result should be characterized by a
very high Recall, where, however, the supports of the possible false positive patterns should be
very close to the exact ones. Conversely, in order to maximize the standard
measure of similarity (Def. 12), we need to maximize both Recall and Precision,
while keeping small the difference in the approximate supports of the frequent patterns.
Bibliography
[1] Workshop on frequent itemset mining implementations FIMI’03 in conjunction
with ICDM’03. In fimi.cs.helsinki.fi, 2003.
[2] R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm
for generation of frequent itemsets. Parallel and Distributed Computing, 2000.
[3] R. Agarwal, C. Aggarwal, and V.V.V. Prasad. Depth first generation of long
patterns. In KDD ’00: Proceedings of the sixth ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 108–118, New York,
NY, USA, 2000. ACM Press.
[4] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between
sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD
International Conference on Management of Data, pages 207–216, Washington,
D.C., 1993.
[5] R. Agrawal and J.C. Shafer. Parallel mining of association rules. In IEEE
Transaction On Knowledge and Data Engineering, 1996.
[6] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In
Proc. 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan
Kaufmann, 1994.
[7] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 11th Int. Conf.
Data Engineering, ICDE, pages 3–14. IEEE Press, 1995.
[8] J. Ayres, J. Gehrke, T. Yiu, and J. Flannick. Sequential pattern mining using
bitmaps. In Proceedings of the Eighth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, 2002.
[9] R. J. Bayardo Jr. Efficiently Mining Long Patterns from Databases. In Proc. of
the ACM SIGMOD Int. Conf. on Management of Data, pages 85–93, Seattle,
Washington, USA, 1998.
[10] T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for
product assortment decisions: A case study. In Knowledge Discovery and Data
Mining, pages 254–260, 1999.
[11] D. Burdick, M. Calimlim, and J. Gehrke. Mafia: a maximal frequent itemset algorithm
for transactional databases. In Proc. of the International Conference on Data
Engineering ICDE, pages 443–452. IEEE Computer Society, 2001.
[12] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data
streams. In ICALP ’02: Proceedings of the 29th International Colloquium on
Automata, Languages and Programming, pages 693–703, London, UK, 2002.
Springer-Verlag.
[13] D.W. Cheung, J. Han, V.T. Ng, A.W. Fu, and Y. Fu. A fast distributed
algorithm for mining association rules. In DIS ’96: Proceedings of the fourth
international conference on Parallel and distributed information systems,
pages 31–43, Washington, DC, USA, 1996. IEEE Computer Society.
[14] G. Cormode and S. Muthukrishnan. What’s hot and what’s not: tracking
most frequent items dynamically. In PODS ’03: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database
systems, pages 296–306. ACM Press, 2003.
[15] G. Cormode and S. Muthukrishnan. An improved data stream summary: the
count-min sketch and its applications. J. Algorithms, 55(1):58–75, 2005.
[16] E.D. Demaine, A. López-Ortiz, and J.I. Munro. Frequency estimation of internet packet streams with limited space. In ESA ’02: Proceedings of the 10th
Annual European Symposium on Algorithms, pages 348–360, London, UK, 2002.
Springer-Verlag.
[17] C. Estan and G. Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst.,
21(3):270–313, 2003.
[18] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors.
Advances in Knowledge Discovery and Data Mining. AAAI Press, 1998.
[19] V. Ganti, J. Gehrke, and R. Ramakrishnan. Mining Very Large Databases.
IEEE Computer, 32(8):38–45, 1999.
[20] M.N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential pattern mining
with regular expression constraints. In The VLDB Journal, pages 223–234,
1999.
[21] K. Geurts, G. Wets, T. Brijs, and K. Vanhoof. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual
Transportation Research Board, Washington DC. (USA), January 12-16, page
18pp, 2003.
[22] E-H.S. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for
association rules. In IEEE Transaction on Knowledge and Data Engineering,
2000.
[23] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann Publishers, 1st edition, 2000.
[24] J. Han, J. Pei, B. Mortazavi-Asi, Q. Chen, U. Dayal, and M.-C. Hsu. Freespan:
Frequent pattern-projected sequential pattern mining. In In Proc. ACM 6th
Int. Conf. on Knowledge Discovery and Data Mining, pages 355–359, 2000.
[25] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of the ACM SIGMOD Int. Conference on Management of Data,
2000.
[26] J.D. Holt and S.M. Chung. Mining association rules using inverted hashing and
pruning. Inf. Process. Lett., 83(4):211–220, 2002.
[27] V.C. Jensen and N. Soparkar. Frequent itemset counting across multiple tables.
In 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining,
2000.
[28] C. Jin, W. Qian, C. Sha, J.X. Yu, and A. Zhou. Dynamically maintaining
frequent items over a data stream. In CIKM ’03: Proceedings of the twelfth international conference on Information and knowledge management, pages 287–
294, New York, NY, USA, 2003. ACM Press.
[29] R. Jin and G. G. Agrawal. An algorithm for in-core frequent itemset mining
on streaming data. To appear in ICDM’05, 2005.
[30] R.M. Karp, S. Shenker, and C.H. Papadimitriou. A simple algorithm for finding
frequent elements in streams and bags. ACM Transactions on Database Systems
(TODS), 28(1):51–55, 2003.
[31] H. Kum, J. Pei, W. Wang, and D. Duncan. ApproxMAP: Approximate mining
of consensus sequential patterns. In Proceedings of the Third International
SIAM Conference on Data Mining, 2003.
[32] C. Lucchese, S. Orlando, R. Perego, and C. Silvestri. Mining frequent closed
itemsets from highly distributed repositories. In Proc. of the 1st CoreGRID
Workshop on Knowledge and Data Management in Grids in conjunction with
PPAM2005, September 2005.
[33] G. Manku and R. Motwani. Approximate frequency counts over data streams.
In Proceedings of the 28th International Conference on Very Large Data
Bases, August 2002.
[34] H. Mannila and H. Toivonen. Discovering generalized episodes using minimal
occurrences. In Knowledge Discovery and Data Mining, pages 146–151, 1996.
[35] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering Frequent Episodes in
Sequences. In Proceedings of the First International Conference on Knowledge
Discovery and Data Mining (KDD-95), Montreal, Canada, 1995. AAAI Press.
[36] H. Mannila, H. Toivonen, and A.I. Verkamo. Discovery of frequent episodes in
event sequences. Data Mining and Knowledge Discovery, 1(3):259–289, 1997.
[37] F. Masseglia, F. Cathala, and P. Poncelet. The PSP approach for mining
sequential patterns. In Principles of Data Mining and Knowledge Discovery,
pages 176–184, 1998.
[38] F. Masseglia, P. Poncelet, and M. Teisseire. Incremental mining of sequential
patterns in large databases. Technical report, LIRMM, France, January 2000.
[39] F. Masseglia, P. Poncelet, and M. Teisseire. Incremental mining of sequential
patterns in large databases. Data and Knowledge Engineering, 46(1):97–121,
2003.
[40] J. Misra and D. Gries. Finding repeated elements. Technical report, Ithaca,
NY, USA, 1982.
[41] A. Mueller. Fast sequential and parallel algorithms for association rules mining:
A comparison. Technical Report CS-TR-3515, Univ. of Maryland, 1995.
[42] S. Orlando, P. Palmerini, and R. Perego. Enhancing the apriori algorithm for
frequent set counting. In DaWaK ’01: Proceedings of the Third International
Conference on Data Warehousing and Knowledge Discovery, pages 71–82, London, UK, 2001. Springer-Verlag.
[43] S. Orlando, P. Palmerini, R. Perego, C. Lucchese, and F. Silvestri. kDCI: a
multi-strategy algorithm for mining frequent sets. In Proceedings of the workshop on Frequent Itemset Mining Implementations FIMI’03 in conjunction with
ICDM’03, 2003.
[44] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Adaptive and resource-aware mining of frequent sets. In Proc. of the 2002 IEEE International Conference on Data Mining, ICDM, 2002.
[45] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. An efficient parallel
and distributed algorithm for counting frequent sets. In Proc. of Int. Conf.
VECPAR 2002 - LNCS 2565, pages 197–204. Springer, 2002.
[46] S. Orlando, R. Perego, and C. Silvestri. CCSM: an efficient algorithm for
constrained sequence mining. In Proceedings of the 6th International Workshop
on High Performance Data Mining: Pervasive and Data Stream Mining, in
conjunction with Third International SIAM Conference on Data Mining, 2003.
[47] S. Orlando, R. Perego, and C. Silvestri. A new algorithm for gap constrained
sequence mining. To appear in Proceedings of ACM Symposium on Applied
Computing SAC - Data Mining track, Nicosia, Cyprus, March 2004.
[48] B. Park and H. Kargupta. Distributed Data Mining: Algorithms, Systems, and
Applications. In Data Mining Handbook, pages 341–358. IEA, 2002.
[49] J.S. Park, M.S. Chen, and P.S. Yu. An Effective Hash Based Algorithm for
Mining Association Rules. In Proceedings of 1995 ACM SIGMOD Int. Conf.
on Management of Data, pages 175–186.
[50] S. Parthasarathy. Efficient progressive sampling for association rules. In
Proceedings of the 2002 IEEE International Conference on Data Mining
(ICDM’02), page 354. IEEE Computer Society, 2002.
[51] S. Parthasarathy, M.J. Zaki, M. Ogihara, and S. Dwarkadas. Incremental and
interactive sequence mining. In CIKM, pages 251–258, 1999.
[52] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu.
Prefixspan: Mining sequential patterns efficiently by prefix-projected pattern
growth. In Proceedings of the 17th International Conference on Data Engineering, page 215. IEEE Computer Society, 2001.
[53] J. Pei, J. Han, and W. Wang. Mining sequential patterns with constraints in
large databases. In Proceedings of the 11th Int. Conf. on Information
and Knowledge Management (CIKM 02), pages 18–25, 2002.
[54] N. Ramakrishnan and A. Y. Grama. Data Mining: From Serendipity to Science.
IEEE Computer, 32(8):34–37, 1999.
[55] A. Savasere, E. Omiecinski, and S.B. Navathe. An efficient algorithm for mining
association rules in large databases. In VLDB’95, Proceedings of 21st International Conference on Very Large Data Bases, pages 432–444. Morgan Kaufmann, September 1995.
[56] A. Schuster and R. Wolff. Communication Efficient Distributed Mining of Association Rules. In ACM SIGMOD, Santa Barbara, CA, April 2001.
[57] A. Schuster, R. Wolff, and D. Trock. A High-Performance Distributed Algorithm for Mining Association Rules. In The Third IEEE International Conference on Data Mining (ICDM’03), Melbourne, FL, November 2003.
[58] T. Shintani and M. Kitsuregawa. Hash based parallel algorithms for mining
association rules. In PDIS: International Conference on Parallel and Distributed
Information Systems. IEEE Computer Society Technical Committee on Data
Engineering, and ACM SIGMOD, 1996.
[59] C. Silvestri and S. Orlando. Distributed association mining: an approximate
method. In Proceedings of 7th International Workshop on High Performance
and Distributed Mining, in conjunction with Fourth International S, April 2004.
[60] C. Silvestri and S. Orlando. Approximate mining of frequent patterns on
streams. In Proc. of the 2nd International Workshop on Knowledge Discovery from Data Streams in conjunction with PKDD2005, October 2005.
[61] C. Silvestri and S. Orlando. Distributed approximate mining of frequent patterns. In Proceedings of ACM Symposim on Applied Computing SAC - Data
Mining track, March 2005.
[62] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and
performance improvements. In Proc. 5th Int. Conf. Extending Database Technology, EDBT, volume 1057, pages 3–17. Springer-Verlag, 1996.
[63] R. Wolff and A. Schuster. Mining Association Rules in Peer-to-Peer Systems.
In The Third IEEE International Conference on Data Mining (ICDM’03), Melbourne, FL, November 2003.
[64] X. Yan, J. Han, and R. Afshar. Clospan: Mining closed sequential patterns in
large datasets. In Proc. 2003 SIAM Int.Conf. on Data Mining (SDM’03), 2003.
[65] M.J. Zaki. Fast mining of sequential patterns in very large databases. Technical
Report TR668, University of Rochester, Computer Science Department, 1997.
[66] M.J. Zaki. Parallel and distributed association mining: A survey. In IEEE
Concurrency, 1999.
[67] M.J. Zaki. Parallel sequence mining on shared-memory machines. In LargeScale Parallel Data Mining, pages 161–189, 1999.
[68] M.J. Zaki. Scalable algorithms for association mining. IEEE Transactions on
Knowledge and Data Engineering, 12:372–390, May/June 2000.
[69] M.J. Zaki. Sequence mining in categorical domains: incorporating constraints.
In Proceedings of the ninth international conference on Information and knowledge management, pages 422–429. ACM Press, 2000.
[70] M.J. Zaki. Spade: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1-2):31–60, 2001.
List of PhD Thesis
TD-2004-1 Moreno Marzolla
”Simulation-Based Performance Modeling of UML Software Architectures”
TD-2004-2 Paolo Palmerini
”On performance of data mining: from algorithms to management systems for
data exploration”
TD-2005-1 Chiara Braghin
”Static Analysis of Security Properties in Mobile Ambients”
TD-2006-1 Fabrizio Furano
”Large scale data access: architectures and performance”
TD-2006-2 Damiano Macedonio
”Logics for Distributed Resources”
TD-2006-3 Matteo Maffei
”Dynamic Typing for Security Protocols”
TD-2006-4 Claudio Silvestri
”Distributed and Stream Data Mining Algorithms for Frequent Pattern Discovery”