Università Ca’ Foscari di Venezia
Dipartimento di Informatica
Dottorato di Ricerca in Informatica
Ph.D. Thesis: TD-2006-4
Distributed and Stream Data Mining
Algorithms for Frequent Pattern Discovery
Claudio Silvestri
Supervisor
Prof. Salvatore Orlando
PhD Coordinator
Prof. Simonetta Balsamo
Author’s Web Page: http://www.dsi.unive.it/~claudio
Author’s e-mail: [email protected]
Author’s address:
Dipartimento di Informatica
Università Ca’ Foscari di Venezia
Via Torino, 155
30172 Venezia Mestre – Italia
tel. +39 041 2348411
fax. +39 041 2348419
web: http://www.dsi.unive.it
To my wife
Abstract
The use of distributed systems is continuously spreading in several application
domains. Extracting valuable knowledge from raw data produced by distributed
parties, in order to produce a unified global model, may present various challenges
related to the huge amount of managed data as well as to their physical location and
ownership. When data are continuously produced (streams) and their analysis must
be performed in real time, communication costs and resource usage are issues that
require careful attention in order to run the computation in the optimal location.
In this thesis, we examine in detail the problems related to Frequent Pattern
Mining (FPM) on distributed and stream data, and present a general framework for
adapting an exact FPM algorithm to a distributed or streaming context. The FPM
problems we consider are Frequent Itemset Mining (FIM) and Frequent Sequence
Mining (FSM). In the first case, the input data are sets of items and the frequent
patterns are those included in at least a user-specified number of input sets. The
second consists in finding frequent sequential patterns in a database of time-stamped
events. Since the proposed framework uses (exact) frequent pattern mining algorithms
as the building blocks of the approximate distributed/stream algorithms, we also
describe two efficient algorithms for FIM and FSM: DCI, introduced by Orlando et
al., and CCSM, which is one of the original contributions of this thesis.
The resulting algorithms for distributed and stream FIM have been tested with
real-world and synthetic datasets; they efficiently find a good approximation of the
exact results and scale gracefully. The framework for FSM is almost identical, but
has not been tested yet. The few differences are highlighted in the conclusion chapter.
Sommario
The adoption of distributed systems is continuously growing in several application
fields, and the extraction of non-obvious correlations from the raw data they produce
can be strategic for the organizations involved. This kind of operation is generally
non-trivial and, when the data are distributed, presents additional difficulties related
both to the amount of data involved and to their ownership and physical location.
When data are produced as continuous flows (streams) and need to be analyzed in
real time, the optimization of communication costs and of the required resources are
aspects that must be carefully taken into account.
This thesis analyzes in detail the problems related to frequent pattern mining
(FPM) on distributed data and data streams. In particular, it presents a general
method for obtaining, starting from any exact FPM algorithm, an approximate
algorithm for FPM on distributed data and data streams. The kinds of patterns
considered are frequent itemsets (FIM) and frequent sequences (FSM). In the first
case, the input data are sets of elements (transactions) and the frequent patterns are
in turn sets contained in at least a user-specified number of transactions. The second
consists instead in searching for frequent sequential patterns in a collection of
sequences of events associated with precise time instants. Since the proposed method
uses exact frequent pattern mining algorithms as the building blocks of the algorithms
for distributed data and data streams, two efficient algorithms for FIM and FSM are
also described: DCI, introduced by Orlando et al., and CCSM, which is one of the
original contributions of this thesis.
The algorithms obtained by applying the proposed method have been run on both
real and synthetic data in order to evaluate their effectiveness. The FIM algorithms
proved to be scalable and able to efficiently extract a good approximation of the
exact solution. The framework for FSM is almost identical, but has not yet been
verified experimentally. The few differences are highlighted in the final chapter.
Acknowledgments
I would like to thank Prof. Salvatore Orlando for his guidance and support during
my Ph.D. studies. I am also grateful to him for the opportunity to collaborate with
the ISTI-CNR High Performance Computing Lab. In this context, I would like to
thank Raffaele Perego, Fabrizio Silvestri, and Claudio Lucchese who co-authored
some of the papers I published and, in several ways, helped me in improving the
quality of my work.
I thank my referees, Prof. Hillol Kargupta and Prof. Rosa Meo, for their attention in reading this thesis and their valuable comments.
Most of this work has been carried out at the Dipartimento di Informatica,
Università Ca’ Foscari di Venezia. I would like to thank all the faculty and personnel
for their support and for making the department a friendly place for doing research.
Special thanks to Moreno and Matteo, for the long discussions on free software and
other interesting subjects, and to all the other (ex-)Ph.D. students for the pleasant
time spent together: Chiara, Claudio, Damiano, Fabrizio, Francesco, Giulio, Marco,
Matteo, Massimiliano, Ombretta, Paolo, Silvia, and Valentino.
In this last year I have been a guest at the Dipartimento di Informatica e Comunicazione, Università degli Studi di Milano. I am grateful to Maria Luisa Damiani,
for the opportunity of collaboration, and to the people working at the DB&SEC
Lab, for the friendly working environment.
This work was partially supported by the PRIN’04 Research Project entitled
”GeoPKDD - Geographic Privacy-aware Knowledge Discovery and Delivery”.
Finally, I would like to thank my extended family, who has never lost faith in
this long-term project, and all of my friends.
Contents

1 Introduction
  1.1 Data distribution
  1.2 Data evolution
  1.3 Applications
  1.4 Association Rules Mining
    1.4.1 Frequent Itemsets Mining
    1.4.2 Frequent Sequence Mining
    1.4.3 Taxonomy of Algorithms
  1.5 Contributions
  1.6 Thesis overview

I First Part

2 Frequent Itemset Mining
  2.1 The problem
    2.1.1 Related works
  2.2 DCI
    2.2.1 Candidate generation
    2.2.2 Counting phase
    2.2.3 Intersection phase
  2.3 Conclusions

3 Frequent Sequence Mining
  3.1 Introduction
  3.2 Sequential patterns mining
    3.2.1 Problem statement
    3.2.2 Apriori property and constraints
    3.2.3 Contiguous sequences
    3.2.4 Constraints enforcement
  3.3 GSP
    3.3.1 Candidate generation
    3.3.2 Counting
  3.4 SPADE
    3.4.1 Candidate generation
    3.4.2 Candidate support check
    3.4.3 cSPADE: managing constraints
  3.5 CCSM
    3.5.1 Overview
    3.5.2 The CCSM algorithm
    3.5.3 Experimental evaluation
  3.6 Related works
  3.7 Conclusions

II Second Part

4 Distributed datasets
  4.1 Introduction
    4.1.1 Frequent itemset mining
  4.2 Approximated distributed frequent itemset mining
    4.2.1 Overview
    4.2.2 The Distributed Partition algorithm
    4.2.3 The APRed algorithm
    4.2.4 The APInterp algorithm
    4.2.5 Experimental evaluation
  4.3 Conclusions

5 Streaming data
  5.1 Streaming data
    5.1.1 Issues
  5.2 Frequent items
    5.2.1 Problem
    5.2.2 Count-based algorithms
    5.2.3 Sketch-based algorithms
  5.3 Frequent itemsets
    5.3.1 Related work
    5.3.2 The APStream algorithm
  5.4 Conclusions

III Conclusions

A Approximation assessment

Bibliography
List of Figures

1.1 Incremental data mining
1.2 Data stream mining
1.3 Transaction dataset
1.4 Sequence dataset
1.5 Effect of maxGap constraint
1.6 Taxonomy of algorithms for frequent pattern mining
2.1 Set of itemsets compressed data structure
2.2 Example of cache usage
3.1 GSP candidate generation
3.2 CCSM candidate generation
3.3 Example of cache usage
3.4 CCSM idlist reuse
3.5 Number of intersections for different intersection methods
3.6 Number of frequent sequences in datasets CS11 and CS21
3.7 Execution times of CCSM and cSPADE - variable maxGap value
3.8 Execution times of CCSM and cSPADE - fixed maxGap value
4.1 Similarity of APRed approximate results
4.2 Number of spurious patterns as a function of the reduction factor r
4.3 fpSim of the APInterp results
4.4 Comparison of Distributed One-pass Partition vs. APInterp
4.5 Speedup for two of the experimental datasets
5.1 Similarity and ASR as a function of memory/transactions/hash entries
5.2 Similarity and ASR as a function of stream length
C.3 Distributed stream mining framework

List of Tables

1.1 Taxonomy of data mining environments
4.1 Datasets used in APRed experimental evaluation
4.2 Datasets used in APInterp experimental evaluation
4.3 Test results for APRed
4.4 Accuracy indicators for APInterp results
5.1 Sample supports and reduction ratios
5.2 Datasets used in experimental evaluation
1 Introduction
Data mining is, informally, the extraction of knowledge hidden in huge amounts of
data. However, if we are interested in a more detailed definition, several different
ones do exist [23]. Depending on the application domain (and on the author), data
mining could just mean the extraction of particular aggregate information from
somehow preprocessed data, or the whole process beginning with data cleaning and
integration and ending with the visual representation of results. From now on, we
will reserve the term Data Mining for the first meaning, using the more general KDD
(Knowledge Discovery in Databases) for the whole workflow needed to apply data
mining algorithms to real-world problems.
The kind of knowledge we are interested in, together with the organization of the
input data and the criteria used to discriminate between useful and useless information,
contributes to characterizing a specific data mining problem and its possible algorithmic
solutions. Common data mining tasks are the classification of new objects
according to a scheme learned from examples, the partitioning of a set of objects
into homogeneous subsets, and the extraction of association rules and numerical rules
from a database.
In several interesting application frameworks, such as wireless network analysis
and fraud detection, data are naturally distributed among several entities and/or
evolve continuously. In all of the above-indicated data mining tasks, dealing with
either of these peculiarities poses additional challenges. In this thesis we will focus
on the distribution and evolution issues related to the extraction of Association Rules
from transactional databases (ARM), one of the most important and common data
mining tasks, both for the immediate applicability of the knowledge extracted by this
kind of analysis, and for the wide range of application fields where it can be used.
Association Rules are rules relating the occurrence of distinct subsets of items in the
same set, e.g. "65% of market baskets containing steak and salad also contain
wine", or in the same collection of sets, e.g. "50% of customers that buy a CD player
will later buy CDs". In particular, we will concentrate our attention on the most
computationally expensive phase of ARM, the mining of frequent patterns from
distributed and stream data. These patterns can be either frequent itemsets (FIM)
or frequent sequences (FSM), i.e., respectively, subsets contained in at least a user-specified
number of input sets and subsequences contained in at least a user-specified number of
input sequences. Since we will use frequent pattern mining algorithms for static and
non-evolving datasets as the building block for our approximate algorithms, to be
exploited on distributed and stream data, we will also describe efficient algorithms
for FIM and FSM: DCI, introduced by Orlando et al. in [44], and CCSM, which is
one of the original contributions of this thesis.
This chapter introduces, without focusing on any particular data mining task, the
general issues concerning the evolution of data, and their distribution/partitioning
among several entities. Then it quickly introduces ARM and its core FIM/FSM
phase in centralized and non-evolving datasets. Both will be discussed in more
detail in the first part of the thesis, since they constitute the foundation for the
distributed and streaming FIM/FSM problems. We also present a taxonomy of
FIM and FSM algorithms, which will be useful in understanding the reasons that
led us to choose DCI and CCSM as the building blocks for our
distributed and stream algorithms. The chapter concludes with a summary of the
achievements of our research, and a description of the structure of the rest of the
thesis.
1.1 Data distribution
Reasons leading to data distribution. In many real systems, data are naturally
distributed, usually due to multiple ownership or to a geographical distribution of
the processes that produce the data. The logistic organization of the entities involved in
the data collection process, performance and storage constraints, as well as privacy
and company interests, may lead to the choice of using separate databases, instead
of a centralized one accessed by several remote locations.
The sales points of a large chain are a typical example: there is no need for a
central database to perform ordinary sale activities, and using one would make
the operations of each shop dependent on the reliability and bandwidth of the
communication infrastructure. Gathering all data at a single site, after they have been
produced, would be subject to the same ownership/privacy issues as using a centralized
database.
In other cases, data are produced locally in large volumes and immediately moved
to other storage and analysis locations, due to the impossibility of storing or processing
them with the resources available at a single site, as in the case of satellite image
analysis or high-energy physics experiments. In all of these cases, performing a data
mining task means coordinating the sites in a mix of partial data movement and
exchange of local knowledge, in order to obtain the required global model.
Homogeneous vs. heterogeneous data. Problems that are seemingly similar
may need significantly different solutions, if considered in different communication and
data localization settings. Data can be either homogeneous or heterogeneous. If data
are represented by tuples, in the first case all data present the same dimensions,
while in the second each node has its own schema. Let us consider two
examples: the sales data of a chain of shops and the personal data collected about us
by different departments of public administration. Sales data contain a representation
of the sale transactions and are maintained by the shop where the items were bought.
In this case, data are homogeneous: data collected at different shops are similar, but
related to different transactions. Personal data are also maintained at different
sites: the register office manages birth data, the tax register owns tax data, another
register collects data about our cars. In this case, data are heterogeneous, since for
each individual each register maintains a different kind of data.
Data localization is a key factor in characterizing data mining problems. Most
classical data mining algorithms expect all data to be grouped in a unique data
repository. Each data mining task presents different issues when it is considered
in a distributed environment, instead of a centralized one. However, it is possible
to identify a general requirement common to most distributed data mining system
architectures and tasks: careful attention should be paid to both communication
and computation resources, in order to use them in a nearly optimal way. Data
distribution issues and algorithms will be discussed in more detail in chapter 4,
with a focus on frequent pattern mining algorithms.
A good survey on distributed data mining algorithms and applications is [48].
1.2 Data evolution
In different application contexts, data mining can be applied either to past data,
as a one-time task, or repeatedly to evolving datasets. The classical data mining
algorithms refer to the first case: the full dataset is available and there will be no
data modification during the computation or between two consecutive computations.
This is enough to understand a phenomenon and make plans for the future in most
cases.
In several applications, like wireless network analysis, intrusion detection, stock
market analysis, sensor network data analysis, and, in general, any setting in which
all available information should be used to make an immediate decision, the approach
based on finite, statically stored datasets may not be satisfactory. These
cases demand new classes of algorithms, able to cope with the evolution of data.
In particular, two issues need to be addressed: the complexity of recomputing everything from scratch and the potential infiniteness of data. In case only the first
issue is present, the setting is referred to as Incremental/Evolving Data Mining;
otherwise, it is indicated as Stream Data Mining.
The presence and kind of data evolution is another key factor in characterizing
data mining problems.
Data localization:
  Centralized: a single entity can access all the data.
  Distributed: each node can access just a part of the data, and ...
    homogeneous: ... data related to the same entity (e.g. people) are owned by just one node.
    heterogeneous: ... data related to the same entity (e.g. people) may be spread among several nodes.

Data evolution:
  Static: data are definitively stored and invariable (e.g. related to some past and concluded event).
  Incremental: new data are inserted and access to past data is possible (e.g. related to an ongoing event).
  Evolving: the dataset is modified with updates, insertions or deletions, and access to past data is possible.
  Streaming: data arrive continuously and for an indefinite time; access to past data is restricted to a limited part of them or to summaries.

Table 1.1: Taxonomy of data mining environments.
Incremental and Evolving Data Mining.
In incremental data mining, new data are repeatedly inserted into the dataset. Some
algorithms also take care of deletions or modifications of previous data; this case is
referred to as evolving data mining. In a typical situation, we have a dataset D and
the results of the required data mining task on D. Then D is modified and the
system is asked for a new result set. Obviously, a way to obtain the new result is to
recompute everything from scratch, and this is possible since all past data are accessible.
However, this implies a computation time that in some cases may clash with near
real-time system response requirements, whereas in other cases it is just a waste of
resources, especially when the dataset gets bigger. Incremental/Evolving data mining
algorithms, instead, are able to update the solution according to dataset updates,
modifying just the part of the result set that is affected by the modifications of
the dataset. A fitting example could concern the sales data of a supermarket: at the
end of each day, the daily update is performed. The overall amount of data is still
reasonable for an ordinary computation; however, there is no point in reprocessing
several years of past sales data. A better approach would be to consider the past result
and the new data, and to query the past data only when a modification of the result
is expected. Figure 1.1 summarizes the simultaneous evolution of data and results
after each mining step in incremental data mining.

Figure 1.1: Incremental data mining: previous result availability allows for a reduction of the necessary computation.
Stream Data Mining.
An increasing number of applications require support for processing data that arrive
continuously and in huge volumes. This setting is referred to as Stream Data Mining.
The main difference with Incremental/Evolving Data Mining is the large and potentially
infinite amount of data, but the continuity aspect also deserves some attention.
The first consequence is that Stream Data Mining algorithms, since they deal with
unbounded data, cannot access every item received in the past, but just a limited
subset of them. In case of a sustained arrival rate, this means that each received item
can be read only a few times, often just once.
An algorithm dealing with data streams should require an amount of memory
that is not related to the (potentially infinite) amount of data analyzed. At the same
time, it should be able to cope with the data arrival rate, returning, if necessary, an
approximate solution in order to keep up with the stream.
Building a model based on every received item only when the user issues the
query is simply impossible in most cases, either for response time or for resource
constraints. Even the apparently trivial task of exactly counting the number of
items received so far potentially requires unbounded memory, since after N items
we need log2(N) bits to represent the counter. A solution that requires O(log N)
memory is, however, considered suitable for a streaming context, since for real data
streams "infinite" actually means "really long". However, if we slightly extend the
problem, asking for the number of distinct items, an exact answer is impossible
without using O(N) memory. For this reason, in data stream mining, approximate
algorithms are quite common. Another way to reduce the resource requirements is to
restrict the problem to a user-specified temporal window, e.g. the last week. This
approach is called the Window Model, whereas the previously introduced one is the
Landmark Model. Figure 1.2 summarizes these two different approaches.

Figure 1.2: Data stream mining: data are potentially infinite and accessible just on arrival. Results can refer to the whole stream or to a limited part of it.
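To make the difference between the two models concrete, the following minimal Python sketch (class names and the toy stream are ours, not part of the thesis) counts item frequencies over the whole stream, as in the landmark model, and over the most recent arrivals only, as in the window model.

from collections import Counter, deque

class LandmarkCounter:
    """Counts item occurrences over the whole stream (landmark model).
    Memory grows with the number of distinct items, not with the stream
    length; each counter only needs O(log N) bits."""
    def __init__(self):
        self.counts = Counter()

    def update(self, item):
        self.counts[item] += 1

class SlidingWindowCounter:
    """Counts item occurrences over the last `window_size` arrivals only
    (window model): expired items are forgotten, bounding memory usage."""
    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.counts = Counter()

    def update(self, item):
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

# Tiny usage example on a toy stream.
stream = ["a", "b", "a", "c", "a", "b"]
lm, win = LandmarkCounter(), SlidingWindowCounter(window_size=3)
for it in stream:
    lm.update(it)
    win.update(it)
print(lm.counts)   # whole stream: a=3, b=2, c=1
print(win.counts)  # last 3 items only: c=1, a=1, b=1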
1.3 Applications
The issues encountered when mining data originated by distributed sources may
be related to the quality of received data, the high data arrival rate, the kind of
communication infrastructure available between data sources and data sinks, or
the need to avoid privacy breaches. Let us examine three practical cases of distributed
systems and practical motivations that may lead to the use of distributed data
mining algorithms for the analysis of data, instead of collecting and processing
everything in a central repository.
Geographically distributed Web-farm. Popular web sites generate a lot of traffic
from the server to the clients. A viable solution to ensure high availability and
throughput is to use several geographically distributed replicas and let each client
connect to the closest one (e.g. www.tucows.com). This approach, even if really
practical for system availability and network response time, makes the analysis of
data for user behavior and intrusion detection more complex. In fact, while using
a single web-farm all access logs are available at the same site, in this case they are
partitioned among several farms, sometimes connected by high-latency links. A naïve
solution is to collect all data in a centralized database, either periodically or in real
time, and in most cases it is the best solution, at least if the data arrival rate is low
or we are not interested in recent data. However, this is not satisfying when log
data are huge and real-time analysis is required, as for intrusion detection.
Sensor network. The same kind of problems may arise, even worse, when the
sources of data streams are sensors connected by a network. Quite often the
communication links with sensors have a reduced bandwidth, for example in the case of
seismic sensors placed in inhabited places, far from computation infrastructures.
Financial network. Furthermore, data centralization may be unfeasible when
confidential information is handled and must not be shared with unauthorized
subjects, in order to protect privacy rights or company interests. A classical example
concerns credit card fraud detection. Let us suppose that a group of banks is
interested in automatically detecting possible frauds; each participating entity is
interested in making the resulting model accurate, and based on as much data as
possible, but banks cannot communicate the transactions of their customers to other
banks.
In all these cases, even if for different reasons, collecting all raw data into a
repository before analyzing them is unfeasible, and distributed techniques are needed in
order to process, at least partially, the data in place.
1.4 Association Rules Mining
As we have seen in the previous section, dealing with evolving and distributed data
presents several issues, independently of the particular targeted data mining task.
However, each data mining task has its peculiarities, and the issues in different cases
are not really the same, but just similar and related to the same aspect. In order
to analyze more thoroughly the issues and possible solutions, we have to focus on a
particular task or group of tasks. We have decided, in this thesis, to concentrate our
attention on Association Rules Mining, and more precisely on its most computationally
expensive phase, the mining of frequent patterns in distributed datasets and
data streams, where these patterns can be either itemsets (FIM) or sequences (FSM).
In this section we will quickly introduce Association Rule Mining (ARM), one
of the most popular DM tasks [4, 18, 19, 54], both for the immediate applicability of
the knowledge extracted by this kind of analysis and for the wide range of application
fields where it can be applied, from medical symptoms developed in a patient to
objects sold in a commercial transaction.
Here our goal is just to quickly introduce this topic, and its computationally
challenging Frequent Pattern Mining subproblem, by limiting our attention to the
centralized case. A more detailed description of the problem can be found in
chapters 2 and 3 for the centralized subproblems, in chapter 4 for the distributed one,
and in chapter 5 for the stream case.
The essence of association rules is the analysis of the co-occurrence of facts in a
collection of sets of facts. If, for instance, the data represent the objects bought in
the same shopping cart by the customers of a supermarket, then the goal will be
finding rules relating the fact that a market basket contains an item with the fact
that another item has been bought at the same time. One of these rules could be
”people who buy item A also buy item C in conf % cases”, but also the more complex
”people who buy item A and item B also buy item C in conf % cases” where conf %
is the confidence of the rule, i.e. a measure of how much that rule can be trusted.
Another interestingness measure, frequently used in conjunction with confidence, is
the support of a rule, which is defined as the number of records in the database
that confirm the rule. Generally, the user specifies minimum thresholds for both, so
an interesting rule should have both a high support and a high confidence, i.e. it
should be based on a significant number of cases to be useful, and at the same time,
there should be few cases in which it is not valid.
The combined use of support and confidence is the measure of interestingness
most commonly adopted in the literature, but in some cases it can be misleading if the
user does not look carefully at the big picture. Consider the following example: both A
and B appear in 80% of the input data, and in 60% of the cases they appear in the same
transaction. The rule "A implies B" has support 60% and confidence 60/80 = 75%,
thus apparently this is a good rule. However, if we analyze the full context, we can
see that the confidence is lower than the support of B, hence the actual meaning
of this rule is that A negatively influences B. The usage of other interestingness
measures has been widely discussed. However, there is no clear winner, and the
choice depends on the specific application field.
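As a concrete illustration of the two measures, the following minimal Python sketch (function names and the toy dataset are ours, chosen only to reproduce the numbers of the example above) computes the support and confidence of a rule from a list of transactions.

def support(pattern, transactions):
    """Fraction of transactions containing all the items of `pattern`."""
    pattern = set(pattern)
    return sum(pattern <= set(t) for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent union consequent) / support(antecedent)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

# A 10-transaction toy dataset reproducing the numbers of the example:
# A and B each appear in 8 transactions (80%), together in 6 (60%).
transactions = (
    [{"A", "B"}] * 6 +      # both A and B
    [{"A"}] * 2 +           # A without B
    [{"B"}] * 2             # B without A
)
print(support({"A", "B"}, transactions))      # 0.6  -> rule support 60%
print(confidence({"A"}, {"B"}, transactions)) # 0.75 -> confidence 60/80
print(support({"B"}, transactions))           # 0.8  -> higher than the confidence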
Sequential rules (or temporal association rules) are an extension of association
rules which also consider sequential relationships. In this case, the input data are
sequences of sets of facts, and the rules have to deal with both co-occurrences and
”followed by” relationships. Continuing with the previous example about market
basket analysis (MBA), this means considering each transaction as related to a
customer, identified by a fidelity card or something similar. So each input sequence
is the shopping history of a customer and a rule could be ”people who buy item A
and item B at the same time will also buy item C later in conf % cases” or ”people
who buy item A followed by item B within one week will also buy item C later in
conf % cases”.
The extraction of both association rules and sequential rules from a database is
typically composed of two phases. First, it is necessary to find the so-called frequent
patterns, i.e. patterns that occur in a significant number of records. Once such
patterns are determined, the actual association rules can be derived in the form
of logical implications: X ⇒ Y , which reads whenever X occurs in a transaction
(sequence), most likely also Y will occur (later). The computationally intensive part
is the determination of frequent patterns, more precisely of frequent itemsets for
association rules and frequent sequences for sequential rules.
1.4.1 Frequent Itemsets Mining
The Frequent Itemsets Mining (FIM) problem consists in the discovery of subsets
that are common to at least a user-defined number of input sets. Figure 1.3 shows a
small dataset related to the previous MBA example. There are eight transactions,
each containing a variable number of distinct items. If the user-chosen minimum
support is three, then the pair "scanner and speaker" is a frequent pattern, whereas
"scanner and telephone" is not a frequent one. Obviously, any larger pattern containing
both a scanner and a telephone cannot be frequent. This fact is known as
the apriori principle and, expressed in a more formal way, states that a pattern can be
frequent only if all its subsets are frequent too.
Figure 1.3: Transaction dataset.
The computational complexity of the FIM problem derives from the exponential
size of its search space P(M ), i.e. the power set of M , where M is the set of items
contained in the various transactions of a dataset D. In the example in Figure 1.3,
there are 8 distinct items and the largest transaction contains four items; this leads
to the sum of binomial coefficients C(8,k) for k = 1..4, i.e. 8 + 28 + 56 + 70 = 162
possible patterns to examine, considering all transactions of maximal length, and 48
considering the actual transaction lengths. However, the number of distinct patterns
is 29 and the number of frequent patterns is even smaller, e.g., there are just 7 items
and 4 pairs occurring more than once, but only 4 items contained in more than two
transactions.
Clearly, the naïve approach consisting in generating all subsets of every transaction
and updating a set of counters would be extremely inefficient. A way to prune
the search space is to consider only those patterns whose subsets are all frequent.
The correctness of this approach derives from the apriori principle, which grants
that it is impossible for a discarded pattern to be frequent. The Apriori algorithm [6]
and other derived algorithms [2, 9, 11, 25, 49, 50, 44, 55, 68] exploit exactly this
pruning technique.
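The following Python sketch illustrates this level-wise, apriori-based search on a toy dataset; the helper functions and item names are ours and only mimic the spirit of Figure 1.3, they are not the pseudo-code used later in the thesis.

from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining: candidates of length k are built
    from frequent (k-1)-itemsets and pruned with the apriori principle."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # Level 1: frequent single items.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: s for c, s in count(items).items() if s >= min_support}
    result, k = dict(frequent), 2
    while frequent:
        prev = list(frequent)
        # Join step: unions of two frequent (k-1)-itemsets having k items...
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # ...prune step: keep only those whose (k-1)-subsets are all frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = {c: s for c, s in count(candidates).items()
                    if s >= min_support}
        result.update(frequent)
        k += 1
    return result

# Toy dataset loosely in the spirit of Figure 1.3 (item names are ours).
data = [{"scanner", "speaker", "cd"}, {"scanner", "speaker"},
        {"scanner", "speaker", "telephone"}, {"telephone", "cd"}]
print(apriori(data, min_support=3))  # {scanner, speaker} has support 3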
1.4.2 Frequent Sequence Mining
Sequential pattern mining (FSM) [7] represents an evolution of Frequent Itemsets
Mining, allowing also for the discovery of before-after relationships between subsets
of input data. The patterns we are looking for are sequences of sets, indicating that
the elements of a set occurred at the same time and before the elements contained
in the following sets. The ”occurs after” relationship is indicated with an arrow,
e.g. {A, B} → {B} indicates an occurrence of both items A and B followed by an
occurrence of item B. Clearly, the inclusion relationship is more complex than in the
case of subsets, so it needs to be defined. Here we informally introduce this concept,
which we will define formally in chapter 3. For now, we consider that a sequence pattern Z
is supported by an input sequence IS if Z can be obtained by removing items and
sets from IS. As an example, the input sequence {A, B} → {C} → {A} supports the
sequential patterns {A, B}, {A} → {C}, and {A} → {A}, but not the pattern {A, C},
because the occurrences of A and C in the input sequence are not simultaneous.
We highlight that the ”occurs after” relationship is satisfied by {A} → {A}, since
anything between the two items can be removed.
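A minimal Python sketch of this informal containment test (a greedy left-to-right match; the function name and the example data are ours) could look as follows; the formal definition is given in chapter 3.

def supports(input_sequence, pattern):
    """True if the sequential pattern is supported by the input sequence,
    i.e. each set of the pattern is included in a distinct element of the
    input sequence, respecting the left-to-right ("occurs after") order."""
    pos = 0
    for element in pattern:
        while pos < len(input_sequence) and not set(element) <= set(input_sequence[pos]):
            pos += 1
        if pos == len(input_sequence):
            return False
        pos += 1   # the next pattern element must occur strictly later
    return True

# The example from the text: {A, B} -> {C} -> {A}
IS = [{"A", "B"}, {"C"}, {"A"}]
print(supports(IS, [{"A", "B"}]))     # True
print(supports(IS, [{"A"}, {"C"}]))   # True
print(supports(IS, [{"A"}, {"A"}]))   # True
print(supports(IS, [{"A", "C"}]))     # False: A and C never co-occur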
Figure 1.4: Sequence dataset.

Figure 1.4 shows a small dataset containing just three input sequences, each
associated with a customer according to the above example. For each transaction,
the date is printed, but for the moment we consider the time just as a key for sorting
transactions. If we set the minimum support to 50%, we can see that the pattern
"computer and camera followed by a speaker" is frequent and supported by
the behavior of two customers. We observe that the apriori principle still holds for
sequence patterns. If we define the containment relationship between patterns
analogously to the one defined between patterns and input sequences, we can state that
every subsequence of a frequent sequence is frequent. So we are sure that ”computer
followed by a speaker” is a frequent pattern without looking at the dataset, because
the above-mentioned pattern is frequent, and, at the same time, we know that every
pattern containing a ”lamp” is not frequent.
The computational complexity of FSM is higher than that of FIM, due to the
possible repetitions of items within each pattern. Thus, having a small number of
distinct items often does not help, unless the length of input sequences is small too.
However, since the apriori principle is still valid, several efficient algorithms for FSM
exist, based on the generation of frequent patterns from smaller frequent ones.
In several application contexts it is interesting to exploit the presence of a time
dimension in order to obtain more precise knowledge, and, in some cases, also to
transform an intractable problem into a tractable one by restricting our attention
only to the cases we are looking for. For example, if the data represent the failures
in a network infrastructure, when looking for congestion we are interested in short
time periods, and the failure of a piece of equipment a day after another one may not
be as significant as the same sequence of failures within a few seconds. In this case,
a domain expert can enforce a constraint on the maximum gap between
occurrences of events, thus obtaining a better focus on the actually important patterns
and a strong reduction in complexity. In the example in Figure 1.4, if we decide
to limit our search to occurrences having a maximum gap smaller than seven
months, the pattern "computer followed by a speaker" will be supported by just one
customer shopping sequence, since in the first one the gap between the occurrence
of the computer and the occurrence of the speaker is too large.

Figure 1.5: Effect of the maxGap constraint.

Figure 1.5 shows the effect of the maximum gap constraint on the support of some
of the patterns of the above example: the deleted ones simply disappear, because
their occurrences have inadequate gaps. This behavior poses serious problems to
some of the most efficient algorithms, as we will explain in chapter 3, since some of
their super-patterns may be frequent anyway. This is the case of the pattern "camera
followed by scanner followed by speaker", which has one occurrence with maximum
gap equal to seven months even if "camera followed by speaker" has no occurrence
at all.
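To show how the gap constraint changes the notion of occurrence, the following sketch (ours, with hypothetical timestamps expressed in months) extends the previous containment test to timestamped input sequences with a maxGap constraint between consecutive pattern elements.

def supports_maxgap(input_sequence, pattern, max_gap):
    """Backtracking check: `input_sequence` is a list of (time, itemset)
    pairs; consecutive pattern elements must be matched to elements whose
    time difference does not exceed `max_gap`."""
    def match(p_idx, s_idx, prev_time):
        if p_idx == len(pattern):
            return True
        for i in range(s_idx, len(input_sequence)):
            time, items = input_sequence[i]
            if set(pattern[p_idx]) <= set(items) and \
               (prev_time is None or time - prev_time <= max_gap):
                if match(p_idx + 1, i + 1, time):
                    return True
        return False
    return match(0, 0, None)

# Times in months, loosely inspired by the Figure 1.4 example.
IS = [(1, {"computer", "camera"}), (4, {"scanner"}), (11, {"speaker"})]
print(supports_maxgap(IS, [{"computer"}, {"speaker"}], max_gap=7))
# False: the only occurrence has a gap of 10 months
print(supports_maxgap(IS, [{"camera"}, {"scanner"}, {"speaker"}], max_gap=7))
# True: each consecutive gap (3 and 7 months) respects the constraint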
1.4.3 Taxonomy of Algorithms
The apriori principle states that no superset of an infrequent set can be frequent.
This determines a powerful pruning strategy that suggests a level-wise approach to
solve both FIM [4] and FSM [7] problems. Apriori is the first of a family of algorithms
based on this method. First, every frequent item is found, and then the focus is on
pairs composed of frequent items, and so on. Exploring the search space level-wise
grants that every time a new candidate is considered, the support of all its
sub-patterns is known. An alternative approach is the depth-first discovery of frequent
patterns: by enforcing the apriori constraint just on some of the sub-patterns, the
search space is explored in depth. This is usually done in an attempt to preserve
locality, examining similar patterns consecutively [25, 24, 52].
In both cases, the support of patterns is computed by updating a counter each
time an occurrence is found. Moreover, when all the data fit in main memory, a more
efficient approach based on intersection can be devised. Each item x is associated
with the IDs of all the transactions where x appears, and the support of a pattern is
the size of the intersection of the ID sets of its items. These sets of IDs can be
represented using either bitmaps [44] or tidlists [68]. In FSM, the technique is similar,
but needs a longer description; an exhaustive explanation can be found in chapter 3.
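A minimal Python illustration of this vertical, intersection-based support computation (plain Python sets play the role of tidlists; all names are ours) is the following. In practice, as discussed in the next paragraph, the tidlist associated with a common prefix is reused to compute the support of its extensions, instead of re-intersecting the item tidlists from scratch.

def build_tidlists(transactions):
    """Vertical layout: map each item to the set of IDs of the
    transactions in which it appears."""
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def support_by_intersection(itemset, tidlists):
    """The support of an itemset is the size of the intersection of the
    tidlists of its items."""
    items = list(itemset)
    tids = tidlists[items[0]]
    for item in items[1:]:
        tids = tids & tidlists[item]
    return len(tids)

data = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
tl = build_tidlists(data)
print(support_by_intersection({"a", "b"}, tl))       # 3
print(support_by_intersection({"a", "b", "c"}, tl))  # 2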
The use of intersection in depth-first algorithms is highly efficient, thanks to the
availability of partial intersection results related to shorter patterns. For example,
if we examine every pattern with a given prefix before moving to a different one,
then the list of occurrences associated with that prefix can be reused, with little
waste of memory, in the computations related to its descendants. However, the
results obtained are unsorted, and this can be a problem when the results are
to be merged with other ones, as in the case of mining distributed and streaming
data, since we are forced to wait for the end of the computation before being able to
merge the results. On the other hand, level-wise algorithms pose a strong obstacle
to the efficient reuse of partial intersection results, due to the limited locality in the
search space traversal. When the search space is not partitioned as in the depth-first
algorithms, it is impossible to exploit the partial intersection results computed
at level k − 1 in order to compute the intersections at level k, as partial results can
quickly become too large to be maintained in main memory.
To the best of our knowledge, the only two level-wise algorithms that solved this
issue, using a result cache and an efficient partial result reuse, are DCI for FIM and
CCSM for FSM. DCI was introduced in [44] and extended in [43] with an efficient
support inference optimization, whereas CCSM was introduced in [47].
Since these algorithms grant some ordering on the results, they have been chosen
as the basic building blocks of our distributed and streaming algorithms in the second
part of this thesis (in the future work chapter for the part concerning FSM), since
they make heavy use of result merging. Figure 1.6 summarizes this taxonomy of
FIM and FSM algorithms.

Figure 1.6: Taxonomy of algorithms for frequent pattern mining.
1.5 Contributions
In this thesis, we present original contributions in three related areas: frequent
sequence mining with gap constraints, approximate mining of frequent patterns on
distributed datasets, and approximate mining of frequent patterns on streaming data.
The original contribution in the sequence mining field is CCSM, a novel algorithm
for the discovery of frequent sequence patterns in collections of lists of temporally
annotated sets, with constraints on the maximum gap between the occurrences of
two parts of the sequence (maxGap). The proposed method consists in choosing an
ordering that improves locality and reduces the number of tests on pattern support
when the maxGap constraint is enforced, combined with an effective caching policy
for intermediate results. This work has been published in [46, 47].
Another original contribution, this one on approximate distributed frequent itemset
mining, deals with homogeneous distributed datasets: several entities cooperate,
and each one has its own dataset with exclusive access. The two proposed
algorithms [59, 61] allow for obtaining a good approximate solution, and need just one
synchronization in one case and none in the other. In APRed, the algorithm proposed
in [59], each node begins the computation with a reduced support threshold. After
a first phase, needed to understand the peculiarities of the dataset, the minimum
support is increased again to an intermediate value chosen according to the behavior
of the patterns computed during the first phase. Thereafter each node can continue
independently and send, at the end of the computation, its results to the master, which
reconstructs an approximation of the set of all globally frequent patterns. The goal
of the support reduction is to force infrequent patterns to be revealed in partitions
where they have nearly frequent support. The results obtained by this method are
close to the exact ones for several real-world datasets originated by shopping carts
and web navigation. To the best of our knowledge, this is the first algorithm for
approximate distributed FIM based on an adaptive support reduction scheme.
Similar accuracy in the results, but higher performance, thanks to the asynchronous
behavior and to the absence of support reduction, is achieved by the APInterp algorithm
that we introduced in [61]. It interpolates the unknown pattern supports on the
basis of the knowledge acquired from the other partitions.
The absence of synchronizations, and of any two-way communication between the
master and the worker nodes, makes APInterp suitable for streaming data, considering
each new incoming block of data as a partition, and the rest of the data as another one.
In this way the merge-and-interpolate task can be applied repeatedly. This is the
basic idea of APStream, the algorithm we presented in [60]. In our tests on real-world
datasets, the results are similar to the exact ones, and the algorithm processes the
stream in linear time.
The described interpolation framework can be easily extended to distributed
streams and to the FSM problem, using the CCSM algorithm locally. A more challenging
extension, due to the subsumption-related result merging issues, concerns the
approximate distributed computation of Frequent Closed Itemsets (FCI), described
in our preliminary work [32]. Furthermore, the heuristic used in the interpolation can
easily be substituted with another one, better fitted to the particular target application.
However, even the very simple and generic one used in our tests gives good
results.
To the best of our knowledge, the AP method is the first distributed approach
that requires just one-way communications (i.e., with the global pruning optimization
disabled, the worker nodes use only local information), tries to interpolate the missing
supports by exploiting the available knowledge, and is suitable for both distributed
and stream settings.
1.6 Thesis overview
This thesis is divided into self-contained chapters. Each chapter begins with a
short overview containing an informal introduction to the subject and a description
of the scope of the chapter. The first section in most chapters is usually a more
formal introduction to the problem, with definitions and references to related works.
When other algorithms are used, either to describe the proposal contained in the
core of the chapter or its improvements in relation to the state of the art, these
algorithms are described immediately after the introduction. The core part of the
chapter contains an in-depth description of the proposed method, followed by a
discussion of its pros and cons, and the description of the experimental setup and
results. For the sake of readability, since parts of the citations are common to several
chapters, the references are listed at the end of the thesis. For the same reason, the
measures used for evaluating the approximation of the solutions are described in
an appendix.
The first part of the thesis is made of two chapters that deal with algorithms that
we will use in the following chapters about distributed and streaming data mining, as
previously explained in the section about FIM and FSM algorithm taxonomy. The
first chapter introduces the frequent itemset mining problem and describes DCI [44],
a state of the art algorithm for frequent itemset mining that we will use extensively
in the rest of the thesis. The second chapter describes CCSM, a new algorithm for
gap constrained sequence mining that we presented in [47].
In the second part of the thesis, the third chapter deals with approximate frequent
itemset mining in homogeneous distributed datasets, and describes our two
novel approximate algorithms APRed and APInterp, based on support reduction and
interpolation. The fourth chapter extends the support interpolation method, introduced
in the previous chapter, to streaming data [60]. Finally, the last chapter
describes some future work and draws some conclusions. In particular, we describe
how to extend the proposed interpolation framework in order to deal with frequent
sequences, using CCSM for local computation. Moreover, we discuss how to combine
APInterp and APStream in an algorithm for the discovery of frequent itemsets on
distributed data streams.
I First Part
2 Frequent Itemset Mining
Each data mining task has its peculiarities and issues when dealing with evolving and
distributed data, as we have briefly outlined in the introduction. A more detailed
analysis requires focusing on a particular task. In this thesis, we have decided to
analyze in detail this problem by discussing Association Rules Mining (ARM) and
Sequential Association Rules Mining (SARM), two of the most popular DM task.
The crucial steps in ARM, and by far the most computationally challenging, is
the extraction of frequent subsets from an input database of sets of distinct items,
also known as Frequent Itemset Mining (FIM). In case the datasets is referred to the
activities of a shop, and data are sale transactions composed of several items, the
goal of FIM is to find the sets of items that are bought together, at least, in a user
specified number of transactions. The challenges in FIM derive from the large size of
its search space, which, in the worst case, corresponds to the power set of the set of
items, and thus is exponential in the number of distinct items. Restricting as much
as possible this space and efficiently performing computations on the remaining part
are key issues for FIM algorithms.
This chapter formally introduces the itemset mining problem and describes DCI
(Direct Count and Intersect), a hybrid level-wise algorithm, which dynamically
adapts its search strategy to the characteristics of the dataset and to the evolution of the computation. This algorithm was introduced in [44], and extended in
[43] with an efficient, key pattern based, support inference method.
With respect to the Frequent Pattern Mining algorithms taxonomy, presented
in the introduction, DCI is a level-wise algorithm, able to ensure an ordering of
the results, which uses an efficient hybrid counting strategy, switching to an in-core
intersection-based support computation as soon as there is enough memory available.
DCI has been chosen as the building block for our approximate algorithms for
distributed and stream settings due to its efficiency and to the ordering of its results,
which is particularly important when merging different result sets. Moreover, DCI
knows the exact amount of
memory needed for the whole intersection phase before starting it, and this has
been exploited in APStream , our stream algorithm, for dynamically choosing the size
of the block of transactions to process at once.
2.1 The problem
A dataset D is a collection of subsets of a set of items I = {it1, . . . , itm}. Each element
of D is called a transaction. A pattern x is frequent in dataset D, with respect to
a minimum support minsup, if its support is not lower than σmin = minsup · |D|,
i.e., the pattern occurs in at least σmin transactions, where |D| is the number of
transactions in D. A k-pattern is a pattern composed of k items, Fk is the set of
all frequent k-patterns, and F = ∪k Fk is the set of all frequent patterns. F1 is also
called the set of frequent items. The computational complexity of the FIM problem
derives from the exponential size of its search space P(M), i.e., the power set of M,
where M is the set of items contained in the various transactions of D.
2.1.1 Related works
A way to prune the search space P(M ), first introduced in the Apriori [6] algorithm,
is to restrict the search to itemsets whose subsets are all frequent. Apriori is a
level-wise algorithm, since it examines the k-patterns only when all the frequent patterns
of length k − 1 have been discovered. At each iteration k, a set of potentially frequent
patterns, having all of their subsets frequent, is generated starting from the
previous level results. Then the dataset is read sequentially, and the counters associated
with each candidate are updated according to the occurrences found. After
the database scan, only the candidates having a support greater than the threshold
are inserted in the result set and used for generating the candidates of the next iteration.
Several other algorithms based on the apriori principle have been proposed. Some
use the same level-wise approach, but introduce efficient optimizations, like a hybrid
count/intersection support computation [44] or the reduction of the number of
candidates using a hash-based technique [49]. Others use a depth-first approach,
either class-based [68] or projection-based [2, 25]. Still others use completely
different approaches, based on multiple independent computations on smaller parts
of the dataset [55, 50].
Related research topics are the discovery of maximal and closed frequent itemsets. The first ones are those frequent itemsets that are not included in any other
larger frequent itemset. As an example, consider the FIM result set F = {{A} :
4, {B} : 4, {C} : 3, {A, B} : 4, {A, C} : 3, {B, C} : 3, {A, B, C} : 3}, where the
notation set : count indicates frequent itemsets along with their supports. In this
case there is only one maximal frequent pattern Fmax = {{A, B, C} : 3}, since the
other itemsets are included in it. Clearly, the algorithms that are able to directly
mine the set of maximal patterns, like [9, 3, 11], are faster and produce a more
compact output than FIM algorithms. Unfortunately, the information contained in
the result set is not the same: in the above example, there is no way to deduce
the support of pattern {A} from Fmax. Frequent closed itemsets are those frequent
itemsets that are not set-included in any larger frequent itemset having the same
support. The group of patterns subsumed by the same closed itemset appears in
exactly the same set of transactions, and forms an equivalence class, whose
representative element is the largest. Considering again the previous example, the patterns {A}
and {B} are subsumed by the pattern {A, B}, whereas the patterns {C}, {A, C}
and {B, C} are subsumed by {A, B, C}. Thus, the set of frequent closed itemsets
is Fclosed = {{A, B} : 4, {A, B, C} : 3}. Note that in this case the support of any
frequent itemset can be deduced as the support of its smallest superset contained in
the result; thus the {A, C} pattern has support equal to 3, i.e., it has the same support
as the pattern {A, B, C}.
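Both notions can be checked mechanically against the example above. The following small Python sketch (ours) derives Fmax and Fclosed from a complete FIM result F.

def maximal_and_closed(F):
    """F maps frozenset itemsets to their supports (the FIM result).
    Returns the maximal and the closed frequent itemsets, as defined above:
    maximal = no frequent proper superset; closed = no frequent proper
    superset with the same support."""
    maximal, closed = {}, {}
    for itemset, supp in F.items():
        supersets = [s for s in F if itemset < s]
        if not supersets:
            maximal[itemset] = supp
        if not any(F[s] == supp for s in supersets):
            closed[itemset] = supp
    return maximal, closed

F = {frozenset("A"): 4, frozenset("B"): 4, frozenset("C"): 3,
     frozenset("AB"): 4, frozenset("AC"): 3, frozenset("BC"): 3,
     frozenset("ABC"): 3}
print(maximal_and_closed(F))
# maximal: {A,B,C}: 3    closed: {A,B}: 4 and {A,B,C}: 3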
2.2 DCI
The approximate algorithms that we will propose in the second part of the thesis
for distributed and stream data are built on traditional FPM algorithms, used for
local computations. The partial ordering of the results, the foreseeable resource usage,
and the ability to quickly recompute a pattern support using the in-core vertical
bitmap made DCI our algorithm of choice.
DCI is a multi-strategy algorithm that runs in two phases, both level-wise. During
its initial count-based phase, DCI exploits an out-of-core horizontal database, with
variable-length records. At the beginning of each iteration k, a set Ck of k-candidates
is generated, based on the frequent patterns contained in Fk−1 , then their number of
occurrences is verified during a database scan. At the end of the scan, the itemsets
in Ck having a support greater than the threshold σmin are inserted into Fk . As
the execution progresses, the dataset size is reduced by removing transactions and items
no longer needed for computation using a technique inspired by DHP [49]. As soon as
the pruned dataset becomes small enough to fit in memory, DCI adaptively changes
its behavior. It builds a vertical layout database in-core, and starts adopting an
intersection-based approach to determine frequent sets.
During this second phase DCI uses intersections to check the support of kcandidates, generated on the fly by composing all the pairs of (k − 1)-itemsets
that are included in Fk−1 and share a common (k − 2)-prefix. When a candidate
is found to be frequent, it is inserted into Fk . In order to ensure high spatial and
temporal locality, each Fi is maintained lexicographically ordered. This guarantees that (k−1)-patterns sharing a common prefix are stored contiguously in Fk−1 and, at the same time, that the candidates are considered in lexicographical order, thus ensuring the
ordering of the result. Furthermore, this allows accessing previous iteration results
from disk in a nearly sequential way and storing immediately each pattern as soon
as it is discovered to be frequent.
DCI uses several optimization techniques, such as support counting inference
based on key patterns [43] and heuristics to dynamically adapt to both dense and
sparse datasets. Here, however, we will focus only on candidate generation and the counting/intersection phases. Also in the pseudo-code, contained in Algorithm 1, the part related to optimizations has been removed.
2.2.1 Candidate generation
Candidates are generated in both phases, even if at different times. In the count-based phase, all the candidates are generated at the beginning of each iteration and then their supports are verified, whereas in the intersection-based one, the candidates are generated and their supports are checked on the fly. Another important difference concerns memory usage: during the first phase the candidates and the results are maintained in memory and the dataset is on disk, whereas during the second phase the candidates are generated on the fly, the results are immediately offloaded to disk, and the dataset is kept in main memory.
The generation of candidates of length k is based on the composition of patterns of k − 1 items sharing the same k − 2 long prefix. For example, if F2 =
{A, B}, {A, C}, {A, D}, {B, C} is the set of frequent 2-patterns, then the set of
candidates for the 3rd iteration will be C3 = {A, B, C}, {A, B, D}. DCI organizes
itemsets of length k in a compressed data structure, optimized for spatial locality
and fast access to groups of candidates sharing the same prefix, taking advantage
of lexicographical ordering. A first array contains the (k − 1)-prefixes and a second one
contains an index to the contiguous block of item suffixes contained in the third
array. Figure 2.1 shows the usage of these arrays. The patterns {A, B, D, M} and {A, B, D, I} are represented by the second prefix followed by the suffixes in positions from 7 to 8, i.e., from the index position to the position before the one associated with the next prefix. Generating the candidates using this data structure is straightforward, and simply consists of the generation of all the pairs for each block of suffixes. E.g., for the block corresponding to the prefix {A, B, C}, {A, B, C, G} is inserted in the candidate prefixes, with suffixes H, I and L, followed by {A, B, C, H} with suffixes I and L, followed by {A, B, C, I} with suffix L.
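To make the generation scheme concrete, the following Python sketch (ours, not DCI's actual code) enumerates the candidates produced by one block of suffixes; we assume here that each index entry marks the start of the suffix block of the corresponding prefix.

# Illustrative sketch: candidate generation from the compressed prefix/index/suffix
# arrays; every ordered pair of suffixes within a block yields one candidate.
def generate_candidates(prefixes, index, suffixes):
    candidates = []
    for b, prefix in enumerate(prefixes):
        start = index[b]
        end = index[b + 1] if b + 1 < len(prefixes) else len(suffixes)
        block = suffixes[start:end]          # suffixes sharing this prefix
        for i in range(len(block)):          # all ordered pairs within the block
            for j in range(i + 1, len(block)):
                candidates.append(tuple(prefix) + (block[i], block[j]))
    return candidates

# Block for the prefix {A, B, C} with suffixes G, H, I, L, as in the example above.
print(generate_candidates([("A", "B", "C")], [0], ["G", "H", "I", "L"]))
# [('A','B','C','G','H'), ('A','B','C','G','I'), ('A','B','C','G','L'),
#  ('A','B','C','H','I'), ('A','B','C','H','L'), ('A','B','C','I','L')]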
Not every generated candidate obeys the apriori principle, so we can observe that
the candidate pattern {A, B, D}, in the first example, cannot be frequent, since its
subpattern {B, D} is not frequent. When the candidates are stored in memory,
during the counting-based phase, the apriori principle is enforced before inserting
candidates into the candidate set. On the other hand, checking the presence of every
subset has a cost, which increases as the patterns get longer. If we also consider that
the relevant subpatterns are not in any particular order, this disrupts both spatial
and temporal locality in the access to the previous iteration results (Fk−1). For this reason, and given the low cost and high locality of intersection-based support checking, the authors have decided to limit the candidate pruning step to the count-based phase.
2.2.2 Counting phase
In the first iteration, similarly to all FSC algorithms, DCI exploits a vector of counters. In subsequent iterations, it uses a Direct Count technique, introduced by the
same authors in [42]. The goal of this technique is to make the access to the counters associated with candidates as fast as possible. So, instead of using a hash tree,
[Figure 2.1 shows three arrays: the prefix array (a b c, a b d, b d f), the index array (3, 7, 9), and the suffix array (d, e, f, g, h, i, l, m, i, n, at positions 0-9). The compressed representation uses 9 + 3 + 10 = 21 cells, against 4 × 10 = 40 cells for the non-compressed one.]
Figure 2.1: The compressed data structure used for the itemset collection can also improve candidate generation. This figure originally appeared in [43].
or other complex data structures, it extends the approach used for items. When
k = 2, each pair of (frequent) items is associated with a counter in an array through
an order-preserving perfect hash function. Since the order within pairs of items is not
significant, and the elements of a pair are distinct, the number of counters needed is m(m − 1)/2, where m is the number of frequent items.
When k > 2, using direct access to counters would require a large amount of
memory. In this case, the direct access prefix table contains a pointer to a contiguous
block of ordered candidates sharing the same 2-prefix. Note that the number of locations in the prefix table is mk(mk − 1)/2 ≤ m(m − 1)/2, where mk is the number of distinct items in the dataset during iteration k, which is less than or equal to m, thanks to pruning. Indeed, during the k-th count-based iteration, DCI removes from each generic transaction t every item that is not contained in at least k − 1 frequent itemsets of Fk−1 and k candidate itemsets of Ck.
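A common way to realize such an order-preserving perfect hash is to map each pair (i, j), i < j, of re-mapped frequent item identifiers onto a triangular array; the following Python sketch (ours, not necessarily the exact formula used by DCI) illustrates the idea.

# Illustrative sketch: order-preserving perfect hash for item pairs (i, j), i < j,
# with identifiers 0..m-1, mapped onto 0..m(m-1)/2 - 1; a plain array can then hold
# the counters (or the prefix-table pointers) with direct access.
def pair_index(i, j, m):
    assert 0 <= i < j < m
    # pairs starting with 0..i-1 come first, then the offset of j within "row" i
    return i * m - i * (i + 1) // 2 + (j - i - 1)

m = 5
counters = [0] * (m * (m - 1) // 2)
for i in range(m):
    for j in range(i + 1, m):
        counters[pair_index(i, j, m)] += 1    # touch every cell exactly once
assert all(c == 1 for c in counters)          # the mapping is perfect (no collisions)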
Clearly, as the execution progresses, the size of the dataset actually used in the computation decreases and, thanks to pruning, the whole dataset rapidly shrinks enough to fit in main memory for the final intersection-based phase. Even with large datasets and limited memory, this often happens after 3 or 4 iterations, thus limiting the drawbacks of the count-based phase, which becomes less efficient as k increases.
Algorithm 1: DCI
input: D, minsup

    // find the frequent itemsets
    F1 ← first_scan(D, minsup);
    // second and following scans on a temporary db D′
    F2 ← second_scan(D′, minsup);
    k ← 2;
    while D′.vertical_size() > memory_available() do
        k ← k + 1;
        Fk ← DCI_count(D′, minsup, k);
    end
    k ← k + 1;
    // count-based iteration and creation of the vertical database VD
    Fk ← DCI_count(D′, VD, minsup, k);
    while Fk ≠ ∅ do
        k ← k + 1;
        Fk ← DCI_intersect(VD, minsup, k);
    end
2.2.3 Intersection phase
The intersection-based phase uses a vertical database, in which each item α is paired
with a set of transactions tids(α) containing α, different from the horizontal one
used before, in which a set of items is associated with each transaction. Since a
transaction t supports pattern x iff x ⊆ t, the set of transactions supporting x can
be obtained by intersecting the sets of transactions (tidlists) associated with each item in x. Thus the support σ(x) of a pattern x will be

σ(x) = |∩_{α∈x} tids(α)|
In DCI the sets of transactions are represented as bit-vectors, where the i-th bit is equal to 1 when the i-th transaction contains the item and is equal to 0 otherwise. This representation allows for efficient intersections based on the bitwise AND operator.
The memory necessary to contain this bitmap-based vertical representation is mk ·nk
bits, where mk and nk are respectively the numbers of items and transactions in the
pruned database used at iteration k. As soon as this amount is less than the available
memory, the vertical dataset representation can be built on the fly in main memory
in order to begin the intersection-based phase of DCI.
During this phase, the candidates are generated on the fly in lexicographical
order, and their supports are checked using tidlist intersections. The above-described
method for support computation is indicated as k-way intersection. The k bit-vectors
associated with the items contained in a k-pattern are AND-intersected, and the support
is obtained as the number of 1’s present in the resulting bit-vector. If this value is
greater than the support threshold σmin , then the candidate is inserted into Fk .
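The following Python sketch (ours, using plain integers in place of DCI's bit-vectors) illustrates the k-way intersection just described: the tidlists of the k items are AND-ed together and the support is the number of set bits in the result.

# Illustrative sketch of the k-way intersection over bitmap tidlists: bit t of an
# item's tidlist is set iff transaction t contains the item.
N_TRANSACTIONS = 4                       # transactions are numbered 0..3
ALL_ONES = (1 << N_TRANSACTIONS) - 1

def support(candidate, tidlist):
    bits = ALL_ONES                      # start from the full transaction set
    for item in candidate:
        bits &= tidlist[item]            # bitwise AND with the item's tidlist
    return bin(bits).count("1")          # support = number of set bits

# transactions: 0:{A,B,C}, 1:{A,B}, 2:{A,B,C}, 3:{B,C}
tidlist = {"A": 0b0111, "B": 0b1111, "C": 0b1101}
print(support(("A", "B", "C"), tidlist))   # 2 (transactions 0 and 2)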
Since the candidates are generated on the fly, the set of candidates no longer needs to be maintained. Moreover, both Fk−1 and Fk can be kept on disk. Indeed, Fk−1 is lexicographically ordered and can be loaded in blocks having the same (k − 2)-prefix, and, thanks to the order of candidate generation, appending frequent patterns at the end of Fk preserves the lexicographic order.
The set intersection is a commutative and associative operation, thus the
operands can be intersected in any order and grouped in any way. A possible
method is intersecting the tidlists of the items pairwise, starting from the beginning, i.e.,
the first with the second, the result with the third, the result with the fourth and
so on. Since the candidates are lexicographically ordered, consecutive candidates
are likely to share a prefix of some length. Hence, the intersections related to this
prefix are pointlessly repeated for each candidate. In order to exploit this locality,
DCI uses an effective cache containing the intermediate results of intersections.
When the support of a candidate c is checked immediately after that of a candidate c′,
the tidlist associated with their common prefix can be obtained directly from the
cache.
    Cached Pattern    Cached tidList
1   {A}               tids(A)
2   {A, B}            tids(A) ∩ tids(B)
3   {A, B, C}         (tids(A) ∩ tids(B)) ∩ tids(C)
4   {A, B, C, D}      ((tids(A) ∩ tids(B)) ∩ tids(C)) ∩ tids(D)
Figure 2.2: Example of cache usage.
For example, after the computation of the support of the itemset {A, B, C, D}, the tidlists associated with all of its prefixes are present in the cache, as shown in Figure 2.2. Note that each cache position is obtained from the previous one by intersection with the tidlist of a single item. Hence, if the next candidate pattern is {A, B, C, G}, only the last position of the cache needs to be replaced, and this implies just one tidlist intersection, since the intersection of the tidlists of {A, B, C} can be retrieved from the third entry of the cache.
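The cache update can be sketched as follows (our illustration, not DCI's actual code, using the same bit-vector representation as the previous sketch): only the cache lines beyond the longest prefix shared with the previous candidate are recomputed, each with a single intersection.

# Illustrative sketch of the prefix cache used during the intersection phase:
# cache line i holds the tidlist of the first i+1 items of the last candidate.
N_TRANSACTIONS = 4
ALL_ONES = (1 << N_TRANSACTIONS) - 1

def cached_support(candidate, cache_items, cache_lists, tidlist):
    p = 0                                   # length of the shared prefix
    while p < len(cache_items) and p < len(candidate) and candidate[p] == cache_items[p]:
        p += 1
    del cache_items[p:], cache_lists[p:]    # drop stale cache lines
    for item in candidate[p:]:
        prev = cache_lists[-1] if cache_lists else ALL_ONES
        cache_lists.append(prev & tidlist[item])   # one intersection per new line
        cache_items.append(item)
    return bin(cache_lists[-1]).count("1")

tidlist = {"A": 0b0111, "B": 0b1111, "C": 0b1101, "D": 0b0011, "G": 0b0101}
items, lists = [], []
print(cached_support(("A", "B", "C", "D"), items, lists, tidlist))  # 4 intersections
print(cached_support(("A", "B", "C", "G"), items, lists, tidlist))  # 1 intersection (reuses {A,B,C})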
2.3 Conclusions
In this chapter, we have described the frequent itemset mining (FIM) problem, the
state of the art of FIM algorithms, and DCI, an efficient FIM algorithm, introduced
by Orlando et al. in [44]. We will use DCI in the second part of the thesis as a
building block for our approximate algorithms for distributed and stream data. DCI
has been chosen among the other FIM algorithms thanks to its efficiency and its result ordering, which is particularly important when merging different result sets. Moreover, we can predict the exact amount of memory needed by DCI for the whole intersection phase before starting it, and this has been exploited in APStream, our stream algorithm, for dynamically choosing the size of the block of transactions to process at the same time.
3 Frequent Sequence Mining
The previous chapter has introduced the Frequent Itemset Mining (FIM) problem, the most computationally challenging part of Association Rules Mining. This chapter deals with Sequential Association Rules Mining (SARM) and in particular with its Frequent Sequence Mining (FSM) phase. In this thesis work we have decided to focus on these two popular data mining tasks, with particular regard to the issues related to distributed and stream settings, and to the usage of approximate algorithms in order to overcome these problems. The algorithm proposed in this chapter can be used as a building block for the Frequent Sequence version of our approximate distributed and stream algorithms described in the second part of this thesis, thanks to its efficiency and its result ordering, which is particularly important when merging different result sets.
The frequent sequence mining (FSM) problem consists in finding frequent sequential patterns in a database of time-stamped events. Continuing with the supermarket example, market baskets are linked to a time-line and are no longer anonymous. An important extension to the base FSM problem is the introduction of time constraints.
For example, several application domains require limiting the maximum temporal
gap between events occurring in the input sequences. However pushing down this
constraint is critical for most sequence mining algorithms.
This chapter formally introduces the sequence mining problem and proposes
CCSM (Cache-based Constrained Sequence Miner), a new level-wise algorithm that
overcomes the troubles usually related to this kind of constraint. CCSM adopts an
innovative approach based on k-way intersections of idlists to compute the support
of candidate sequences. Our k-way intersection method is enhanced by the use
of an effective cache that stores intermediate idlists for future reuse inspired by
DCI [44] (see previous chapter). The reuse of intermediate results entails a surprising
reduction in the actual number of join operations performed on idlists.
CCSM has been experimentally compared with cSPADE [69], a state-of-the-art
algorithm, on several synthetically generated datasets, obtaining better or similar
results in most cases.
Since some concepts introduced in the GSP [62] and SPADE [70] algorithms are used to explain the CCSM algorithm, a quick description of these two follows the problem description. Other related works are discussed at the end of the chapter.
3.1 Introduction
The problem of mining frequent sequential patterns was introduced by Agrawal and Srikant in [7]. In a subsequent work, the same authors discussed the introduction of constraints on the mined sequences, and proposed GSP [62], a new algorithm dealing with them. In recent years, many innovative algorithms have been presented for solving the same problem, also under different user-provided constraints [69, 70, 53, 20, 52, 8].
We can think of Frequent Sequence Mining (FSM) as a generalization of Frequent Itemset Mining (FIM) to temporal databases. FIM algorithms aim to find patterns (itemsets) occurring with a given minimum support
within a transactional database D, whose transactions correspond to collections of
items. A pattern is frequent if its support is greater than (or equal to) a given
threshold s%, i.e. if it is set-included in at least s%·|D| input transactions, where |D|
is the total number of transactions in D. An input database D for the FSM problem
is instead composed of a collection of sequences. Each sequence corresponds to a
temporally ordered list of events, where each event is a collection of items (itemset)
occurring simultaneously. The temporal ordering among the events is induced from
the absolute timestamps associated with the events.
A sequential pattern is frequent if its support is greater than (or equal to) a
given threshold s%, i.e. if it is "contained" in (or it is a subsequence of) at least
s% · |D| input sequences, where |D| is the number of sequences included in D.
To make more intuitive both problem formulations, we may consider them within
the application context of the market basket analysis (MBA). In this context, each
transaction (itemset) occurring in a database D of the FIM problem corresponds to
the collection of items purchased by a customer during a single visit to the market.
The FIM problem for MBA consists in finding frequent associations among the items
purchased by customers. In the general case, we are thus not interested in the timestamp of each purchased basket, or in the identity of its customer, so the input
database does not need to store such information. Conversely, FSM problem for
MBA consists in predicting customer behaviors on the basis of their past purchases.
Thus, D has also to include information about timestamp and customer identity
of each basket. The sequences of events included in D correspond to sequences
of ”baskets” (transactions) purchased by the same customer during distinct visits
to the market, and the items of a sequential pattern can span a set of subsequent
transactions belonging to the same customer. Thus, while the FIM problem is
interested in finding intra-transaction patterns, the FSM problem determines inter-transaction sequential patterns.
Due to the similarities between the FIM and FSM problems, several FIM algorithms have been adapted for mining frequent sequential patterns as well. Like FIM algorithms, FSM ones can also adopt either a count-based or an intersection-based
approach for determining the support of frequent patterns. The GSP algorithm,
which is derived from Apriori [7], adopts a count-based approach, together with a
level-wise visit (Breadth-First) of the search space. At each iteration k, a set of
candidate k-sequences (sequences of length k) is generated, and the dataset, stored
in horizontal form, is scanned to count how many times each candidate is contained within the input sequences. The other approach, i.e. the intersection-based one,
within each input sequences. The other approach, i.e. the intersection-based one,
relies on a vertical-layout database, where for each item X appearing in the various input sequences we store an idlist L(X). The idlist contains information about
the identifiers of the input sequences (sid ) that include X, and the timestamps
(eid ) associated with each occurrence of X. Idlists are thus composed of pairs (sid,
eid ), and are considerably more complex than the lists of transaction identifiers
(tidlists) exploited by intersection-based FIM algorithms. Using an intersection-based method, the support of a candidate is determined by joining lists. In the FIM
case, tidlist joining is done by means of simple set-intersection operations. Conversely, idlist joining in FSM intersection-based algorithms exploits a more complex
temporal join operation. Zaki’s SPADE algorithm [70] is the best representative of
such intersection-based FSM algorithms.
Several real applications of FSM enforce specific constraints on the type of sequences extracted [62, 53]. For example, we might be interested in finding frequent
sequences of purchase events which contain a given subsequence (super pattern constraint), or where the average price of items purchased is over a given threshold
(aggregate constraint), or where the temporal interval between each pair of consecutive purchases is below a given threshold (maxGap constraint). Obviously, we
could solve this problem with a post-processing phase: first, we extract from the
database all the frequent sequences, and then we filter them on the basis of the
posed constraints. Unfortunately, when the constraint is not on the sequence itself
but on its occurrences (as in the case of the maxGap constraint), sequence filtering requires an additional scan of the database to verify whether a given frequent
pattern still has a minimum support under the constraint. In general, FSM algorithms that directly deal with user-provided constraints during the mining process are much more efficient, since constraints may allow an effective pruning of candidates, thus resulting in a strong reduction of the computational cost. Unfortunately,
the inclusion of some support-related constraints may require large modifications in
the code of an unconstrained FSM algorithm. For example, the introduction of the
maxGap constraint in the SPADE algorithm gave rise to cSPADE, a very different
algorithm [69].
All the FSM algorithms rely on the anti-monotonic property of sequence frequency: every subsequence of a frequent sequence is frequent as well. More precisely, most algorithms rely on a weaker property, restricted to a well-characterized subset of the subsequences. This property is used to generate candidate k-sequences from frequent (k − 1)-sequences. When an intersection-based approach is adopted, we can determine the support of any k-sequence by means of join operations performed [55] on the idlists associated with its subsequences. As a limit case, we could compute the support of a sequence by joining the atomic idlists associated with the single items included in the sequence, i.e., through a k-way join operation [44]. More efficiently,
we could compute the support of a sequence by joining the idlists associated with
two generating (k − 1)-subsequences, i.e., through a 2-way join operation. SPADE
[70] just adopts this 2-way intersection method, and computes the support of a k-sequence by joining two of its (k − 1)-subsequences that share a common suffix. Unfortunately, the adoption of 2-way intersections requires maintaining the idlists of all the (k − 1)-subsequences computed during the previous iteration. To limit memory requirements, SPADE subdivides the search space into small, manageable chunks. This is accomplished by exploiting suffix-based equivalence classes: two k-sequences are in the same class only if they share a common (k − 1)-suffix. Since all the generating subsequences of a given sequence belong to the same equivalence class, equivalence classes are used to partition the search space in a way that allows each class to be processed independently in memory. Unfortunately, the efficient method used by SPADE to generate candidates and join their idlists cannot be exploited when a maximum gap constraint is considered. Therefore, cSPADE is forced
to adopt a different and much more expensive way to generate sequences and join
idlists, also maintaining in memory F2 , the set of frequent 2-sequences.
This chapter discusses CCSM (Cache-based Constrained Sequence Miner), a new level-wise intersection-based FSM algorithm, dealing with the challenging maximum gap constraint. The main innovation of CCSM is the adoption of k-way intersections to compute the support of candidate sequences. Our k-way intersection method is enhanced by the use of an effective cache, which stores intermediate idlists. The idlist reuse allowed by our cache entails a surprising reduction in the actual number of join operations performed, so that the number of joins performed by CCSM approaches the number of joins performed when a pure 2-way intersection method is adopted, while requiring much less memory. In this context, it becomes interesting to compare the performance of CCSM with that achieved by cSPADE when a maximum gap constraint is enforced.
The rest of the chapter is organized as follows. Section 3.2 formally defines the
FSM problem, while Section 3.5.2 describes the CCSM algorithm. Section 3.5.3 presents some experimental results and a discussion about them. Finally, Section 5.4
presents some concluding remarks.
3.2 Sequential patterns mining
3.2.1 Problem statement
Definition 1. (Sequence of events) Let I = {i1 , ..., im } be a set of m distinct
items. An event (itemset) is a non-empty subset of I. A sequence is a temporally
ordered list of events. We denote an event as (j1 , . . . , jm ) and a sequence as (α1 →
. . . → αk ), where each ji is an item and each αi is an event (ji ∈ I and αi ⊆ I).
The symbol → denotes a happens-after relationship. The items that appear together
in an event happen simultaneously. The length |x| of a sequence x is the number of items contained in the sequence (|x| = Σi |αi|). A sequence of length k is called a k-sequence.
Even if an event represents a set of items occurring simultaneously, it is convenient to assume that there exists an ordering relationship R among them. Such an order makes unequivocal the way in which a sequence is written, e.g., we cannot write BA → DBF since the correct way is AB → BDF. This allows us to say, without ambiguity, that the sequence A → BD is a prefix of A → BDF → A, while DF → A is a suffix. A prefix/suffix of a given sequence α is a particular subsequence of α (see the definitions below).
Definition 2. (Subsequence) A sequence α = (α1 → . . . → αk) is contained in a sequence β = (β1 → ... → βm) (denoted as α ⪯ β) if there exist integers 1 ≤ i1 < ... < ik ≤ m such that α1 ⊆ βi1, ..., αk ⊆ βik. We also say that α is a subsequence of β, and that β is a super-sequence of α.
Definition 3. (Database) A temporal database is a collection of input sequences:
D = {α| α = (sid, α, eid)},
where sid is a sequence identifier, α = (α1 → . . . → αk ) is an event sequence, and
eid = (eid1 , . . . , eidk ) is a tuple of unique event identifiers, where each eidi is the
timestamp (occurring time) of event αi .
Definition 4. (Gap constrained occurrence of a sequence) Let β be a given input sequence, whose events (β1 → . . . → βm) are time-stamped with (eid1, . . . , eidm). The gap between two consecutive events βi and βi+1 is thus defined as (eidi+1 − eidi). A sequence α = (α1 → . . . → αk) occurs in β under the minimum gap and maximum gap constraints, denoted as α ⊑c β, if there exist integers 1 ≤ i1 < ... < ik ≤ m such that α1 ⊆ βi1, ..., αk ⊆ βik, and ∀j, 1 < j ≤ k, minGap ≤ (eidij − eidij−1) ≤ maxGap, where minGap and maxGap are user-specified thresholds.
When no constraints are specified, we denote the occurrence of α in β as α ⊑ β. This is a simpler case of sequence occurrence, since α ⊑ β holds simply if α ⪯ β holds.
Definition 5. (Support and constraints) The support of a sequence pattern α, denoted as σ(α), is the number of distinct input sequences β such that α ⊑ β. If a maximum/minimum gap constraint has to be satisfied, the occurrence relation that must hold is α ⊑c β.
Definition 6. (Sequential pattern mining) Given a sequential database and a
positive integer minsup (a user-specified threshold), the sequential mining problem
deals with finding all patterns α along with their corresponding supports, such that
σ(α) ≥ minsup.
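As an illustration of Definitions 4 and 5, the following Python sketch (ours) checks whether a sequence α occurs in a time-stamped input sequence β under the minGap/maxGap constraints; the support of α would then be the number of input sequences for which the check succeeds.

# Illustrative sketch of the occurrence test of Definition 4: alpha is a list of
# events (sets of items), beta a list of (eid, event) pairs with increasing eids.
def occurs(alpha, beta, min_gap=0, max_gap=float("inf")):
    def search(a_pos, prev_eid):
        if a_pos == len(alpha):                    # every event of alpha was matched
            return True
        for eid, event in beta:
            if prev_eid is not None:
                gap = eid - prev_eid
                if gap <= 0 or not (min_gap <= gap <= max_gap):
                    continue                       # wrong order or gap out of range
            if alpha[a_pos] <= event and search(a_pos + 1, eid):
                return True
        return False
    return search(0, None)

beta = [(1, {"A", "B"}), (3, {"C"}), (10, {"D"})]
print(occurs([{"A"}, {"C"}], beta, max_gap=4))         # True:  gap 3 - 1 = 2 <= 4
print(occurs([{"A"}, {"C"}, {"D"}], beta, max_gap=4))  # False: gap 10 - 3 = 7 > 4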
3.2.2 Apriori property and constraints
Also in the FSM problem the Apriori property holds: all the subsequences of a frequent sequence are frequent. An FSM constraint C is anti-monotone if and only if for any sequence β satisfying C, all the subsequences α of β satisfy C as well (or, equivalently, if α does not satisfy C, none of the super-sequences β of α can satisfy C). Note that the Apriori property is a particular anti-monotone constraint, since it can be restated as 'the constraint on minimum support is anti-monotone'.
In the problem statement above, we have already defined two new constraints
besides the minimum support one: given two consecutive events appearing in a
sequence, these constraints regard the maximum/minimum valid gap between the
occurrences of the two events in the various input database sequences.
Consider first the minGap constraint. Let δ be an input database sequence. If β ⊑c δ, then all its subsequences α, α ⪯ β, satisfy α ⊑c δ. This property holds because α ⪯ β implies that the gaps between the events of α are not shorter than the gaps relative to β. Hence, we can deduce that the minGap constraint is an anti-monotone constraint.
Conversely, if the maxGap constraint is considered and α ⪯ β ⊑c δ, we do not know whether α ⊑c δ holds or not. This is because α ⪯ β implies that the gaps between the events of α may be larger than the gaps relative to β. For example, if (A→B→C) ⊑c δ, the gaps relative to A→C (i.e. the gaps between the events A and C in δ) are surely larger than the gaps relative to A→B and B→C. Therefore, if the gap between the events B and C is exactly equal to maxGap, the maximum gap constraint cannot be satisfied by A→C, i.e. A→C ⊑c δ does not hold. Hence, we can conclude that, using this definition of sub/super-sequence based on ⪯, the maxGap constraint is not anti-monotone.
3.2.3 Contiguous sequences
We have shown that the property 'β satisfies the maxGap constraint' does not propagate to all subsequences α of β (α ⪯ β). Nevertheless, we can introduce a new definition of subsequence that allows such an inference to hold.
Definition 7. (Contiguous subsequence) Given a sequence β = (β1 → ... → βm) and a subsequence α = (α1 → ... → αn), α is a contiguous subsequence of β, denoted as α ⊴ β, if one of the following holds:
1. α is obtained from β by dropping an item from either β1 or βm ;
2. α is obtained from β by dropping an item from βi , where |βi | ≥ 2;
3. α is a contiguous subsequence of α′, and α′ is a contiguous subsequence of β.
Note that during the derivation of a contiguous subsequence α from β, middle
events of β cannot be removed, so that the gaps between events are preserved.
Therefore, if δ is an input database sequence, β ⊑c δ, and α ⊴ β, then α ⊑c δ is satisfied in the presence of the maxGap constraint.
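The following Python sketch (ours) enumerates the contiguous (k − 1)-subsequences of a k-sequence according to the one-step cases of Definition 7, showing that they may be fewer than the k generic (k − 1)-subsequences.

# Illustrative sketch: the contiguous (k-1)-subsequences of alpha are obtained by
# dropping one item from the first or the last event, or from any event with at
# least two items (cases 1 and 2 of Definition 7).
def contiguous_subsequences(alpha):
    result = []
    for i, event in enumerate(alpha):
        if not (i == 0 or i == len(alpha) - 1 or len(event) >= 2):
            continue                               # middle one-item events cannot be touched
        for item in sorted(event):
            rest = event - {item}
            sub = alpha[:i] + ([rest] if rest else []) + alpha[i + 1:]
            if sub not in result:
                result.append(sub)
    return result

alpha = [{"A"}, {"B"}, {"C"}, {"D"}]               # the 4-sequence A -> B -> C -> D
print(contiguous_subsequences(alpha))
# [[{'B'}, {'C'}, {'D'}], [{'A'}, {'B'}, {'C'}]] : only 2 contiguous 3-subsequences,
# against the 4 generic 3-subsequences obtainable by dropping any single item.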
Lemma 8. If we use the concept of contiguous subsequence (⊴), the maximum gap constraint becomes anti-monotone as well. Therefore, if β is a frequent sequential pattern that satisfies the maxGap constraint, then every α, α ⊴ β, is frequent and satisfies the same constraint.
Definition 9. (Prefix/Suffix subsequence) Given a sequence α = (α1 → ... → αn) of length k = |α|, let (k − 1)-prefix(α) ((k − 1)-suffix(α)) be the sequence obtained from α by removing the last (first) item of the event αn (α1). We can say that an item is the first/last one of an event without ambiguity, due to the lexicographic order of items within events. We can now recursively define a generic n-prefix(α) in terms of the (n + 1)-prefix(α): the n-prefix(α) is obtained by removing the last item of the last event appearing in the (n + 1)-prefix(α). A generic n-suffix(α) can be defined similarly, by removing the first item of the first event of the (n + 1)-suffix(α). It is worth noting that a prefix/suffix of a sequence α is a particular contiguous subsequence of α, i.e. n-prefix(α) ⊴ α and n-suffix(α) ⊴ α.
3.2.4 Constraints enforcement
Algorithms solving the FSM problem usually search for Fk exploiting in some way the knowledge of Fk−1. The enforcement of anti-monotone constraints can be pushed deep into the mining algorithm, since patterns not satisfying an anti-monotone constraint C can be discarded immediately, with no alteration to the algorithm completeness (since their super-patterns cannot satisfy C either). More importantly, the anti-monotone constraint C is used during the generation of candidates. Remember that, according to the Apriori definition, a k-sequence α can be a "candidate" to be included in Fk only if all of its (k − 1)-subsequences are included in Fk−1.
We will use the ⊴ relation to support the notion of subsequence, in order to ensure that all the contiguous (k − 1)-subsequences of α ∈ Fk will belong to Fk−1. Note that if we used the general notion of subsequence (⪯), the number of (k − 1)-subsequences of α would be k. Each of them could be obtained by removing a distinct item from one of the events of α. Conversely, since we have to use the contiguous subsequence relation (⊴), the number of contiguous (k − 1)-subsequences of α may be less than k: each of them can be obtained by removing a single item only from particular events in α, e.g. items belonging to the starting/ending event of α, or contained in events composed of more than one item. In practice, each candidate k-sequence can simply be generated by combining a single pair of its contiguous (k − 1)-subsequences in Fk−1.
3.3 GSP
The first algorithm that proposed this candidate generation method, based on pairs of contiguous sequences, was presented in [62] by Srikant and Agrawal. Their algorithm, GSP, is a level-wise algorithm that repeatedly scans the dataset and counts the occurrences of the candidate frequent patterns contained in a set, which is generated before the beginning of each iteration. Each k-candidate is generated by merging a pair of frequent (k − 1)-patterns that share a (k − 2)-long contiguous sub-sequence.
Figure 3.1: GSP candidate generation. The 3-patterns and 4-patterns are connected
with their generators using a thick line. Candidates discarded after support check
are not shown.
3.3.1 Candidate generation
During the k-candidate generation phase, GSP merges every pair of frequent (k − 1)-patterns α and β such that (k − 2)-suffix(α) = (k − 2)-prefix(β). The result of the merge is the pattern α concatenated with the last item contained in β, i.e., 1-suffix(β). This item is inserted as part of the last event, if this was the case in β, or as a new event otherwise. For example, the patterns A → B and B → C generate the candidate A → B → C, whereas the patterns A → B and BC generate the candidate A → BC. In case some of the (k − 1)-subsequences of the obtained candidate are not frequent, the candidate is discarded. In the above example, in case A → C is not frequent, A → BC can be safely discarded. However, the same is not true for A → B → C, since A → C is not one of its contiguous subsequences. Indeed, even in case A → C were not frequent due to the maxGap constraint, A → B → C could be frequent.
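The merge step just described can be sketched as follows (our illustration, not GSP's actual code); we represent events as sorted tuples of items and a sequence as a list of events, and merge α and β when dropping the first item of α gives the same sequence as dropping the last item of β.

# Illustrative sketch of GSP's merge of two (k-1)-patterns into a k-candidate.
def drop_first(seq):
    head = seq[0][1:]
    return ([head] if head else []) + seq[1:]

def drop_last(seq):
    tail = seq[-1][:-1]
    return seq[:-1] + ([tail] if tail else [])

def gsp_merge(alpha, beta):
    if drop_first(alpha) != drop_last(beta):
        return None                                 # no common contiguous part
    last_item = beta[-1][-1]
    if len(beta[-1]) > 1:                           # item belonged to beta's last event
        return alpha[:-1] + [alpha[-1] + (last_item,)]
    return alpha + [(last_item,)]                   # item was a one-item event in beta

print(gsp_merge([("A",), ("B",)], [("B",), ("C",)]))   # [('A',), ('B',), ('C',)]  = A->B->C
print(gsp_merge([("A",), ("B",)], [("B", "C")]))       # [('A',), ('B', 'C')]      = A->BC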
The set of candidates Ck is represented using a hash tree. Each node in the tree
is either a leaf node, containing sequences along with their counters, or an internal
node, containing pointers to other nodes. In order to find the counter for a pattern,
the tree is traversed starting from the root. The next branch to visit is chosen using
a hash function on the pth item in the sequence, where p is the depth of the node.
Figure 3.1 represents a lattice of frequent patterns. Each ellipse indicates a pattern, a line indicates the relationship includes/included by, and a thick line indicates
the ones exploited by GSP for the generation of candidates containing more than
two items.
3.3.2 Counting
As soon as GSP completes the generation of the set of candidates Ck, it starts reading the input sequences in the dataset one by one. When an input sequence d is processed, GSP searches the hash tree recursively, processing all the branches that are compatible with the time-stamps contained in d. Each time a leaf is reached, GSP checks if any of the sequence patterns in the leaf is supported by d, and, in case the time constraints are satisfied, it increments the associated counter. The inclusion check of a sequence pattern s in the input sequence d is performed using a vertical representation of d, i.e., each item in d is associated with a list of time-stamps corresponding to its occurrences in d. This representation enables GSP to efficiently align the pattern s with the input sequence d, starting from the first element and stretching gaps as long as the constraints are satisfied.
3.4 SPADE
A completely different approach was proposed by Zaki in SPADE [70]. SPADE is an intersection-based algorithm, i.e. each item is associated with a list of pairs (sid, eid) and the support of a pattern is obtained using intersections. The pair (sid, eid) corresponds to an occurrence of the item in an input sequence sid (sequence id) with time-stamp eid (event id). Since these lists are kept in memory, the candidates can be generated and checked on the fly, and there is no need to maintain the set of candidates in memory, or to scan the dataset multiple times. SPADE, like GSP, merges pairs of (k − 1)-patterns to obtain k-candidates; however, the pairs are chosen in a different way.
3.4.1 Candidate generation
SPADE generates a candidate k-sequence from a pair of frequent (k − 1)-subsequences that share a common (k − 2)-prefix (in some versions of the algorithm the author uses suffixes instead of prefixes; this is not relevant, however, unless we need to restrict the search space to patterns beginning/ending with some items or sequences of items). The generated candidate is composed of α
followed by the last element of β, either as a one-item event or as part of the last event, as we will explain later. For example, α = A→B→C→D is obtained by combining the two subsequences A→B→C and A→B→D, which share the 2-prefix A→B. Since the resulting candidates also share the same prefixes, a set of k-patterns sharing a common (k − 1)-prefix is closed with respect to candidate generation, and can be processed independently. The generation of 2-candidates is in some way an exception: every pair of frequent items can generate candidates, since they share a 0-prefix. For this reason, SPADE uses intersections for candidates containing at least 3 items, but uses a count-based approach for frequent items and 2-patterns.
In order to generate k-candidates, SPADE considers each pair of frequent (k − 1)-patterns sharing the same (k − 2)-prefix, including pairs containing the same pattern twice. Each pair can produce one, two, three or no candidates at all, depending on their last events. Let α and β be two frequent (k − 1)-patterns sharing a common prefix P and ending respectively with items X and Y. The last event in α may contain one or more items. The first case is indicated as α = P → X, the second
one as α = PX. Four cases may arise:
α = P → X, β = P → Y: P → XY, P → X → Y and P → Y → X are valid candidates. In case X = Y, P → X → X is the only candidate generated by α and β.
α = P → X, β = PY: PY → X is the only candidate generated by α and β.
α = PX, β = P → Y: PX → Y is the only candidate generated by α and β.
α = PX, β = PY: In case X < Y, the candidate is PXY. Otherwise the candidate is PYX, unless X = Y, in which case no candidate is generated.
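The four cases can be sketched as follows (our illustration, not SPADE's actual code); each (k − 1)-pattern is described by its last item and by a flag telling whether that item extends P's last event (PX) or forms a new event (P → X), and candidates are returned in textual form.

# Illustrative sketch of SPADE's four candidate-generation cases.
def spade_candidates(x, x_in_event, y, y_in_event):
    if not x_in_event and not y_in_event:            # alpha = P -> X, beta = P -> Y
        if x == y:
            return [f"P->{x}->{x}"]
        return [f"P->{min(x, y)}{max(x, y)}", f"P->{x}->{y}", f"P->{y}->{x}"]
    if not x_in_event and y_in_event:                 # alpha = P -> X, beta = PY
        return [f"P{y}->{x}"]
    if x_in_event and not y_in_event:                 # alpha = PX, beta = P -> Y
        return [f"P{x}->{y}"]
    if x == y:                                        # alpha = PX, beta = PY, X = Y
        return []
    return [f"P{min(x, y)}{max(x, y)}"]               # PXY, with the two items ordered

print(spade_candidates("X", False, "Y", False))   # ['P->XY', 'P->X->Y', 'P->Y->X']
print(spade_candidates("X", True, "Y", False))    # ['PX->Y']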
3.4.2 Candidate support check
Immediately after the generation of a candidate, SPADE checks its support using idlist intersections. An idlist is a sorted list of occurrences, i.e., pairs (sid, eid), where sid identifies a specific input sequence, and eid one of its events. The ordering is on sid, with eid as secondary key. In SPADE, an idlist can refer either to an item or to a pattern. In the first case, the list corresponds to the occurrences of the item, in the second case to the last position of each occurrence of the sequence. For example, if the only input sequence in the dataset is (sid = 1, {({A, B}, eid = 1), ({A, C}, eid = 2), ({C}, eid = 3)}), then idlist(A) = {(1, 1), (1, 2)}, idlist(AC) = {(1, 2)}, and idlist(A → C) = {(1, 2), (1, 3)}. Note
that it is not relevant that there are two distinct occurrences of A → C ending in
(1, 3).
Two kinds of intersections are possible: ordinary intersection, or equality join, and temporal intersection, or temporal join. The first one is used when the candidate is PXY = αY, or P → XY = αY, and exactly corresponds to the common set intersection of idlist(α) and idlist(Y): the result is the set of pairs (sid, eid) appearing in both idlists. The second one is slightly more complex and corresponds to the candidates α → Y (PX → Y and P → X → Y). In this case the result is the subset of idlist(Y) containing only those entries (sid, eid2) such that an entry (sid, eid1), with eid1 < eid2, exists in idlist(α). Thanks to the ordering of the idlists, both operations can be implemented efficiently. Furthermore, the idlist of α is available
from the previous level. Note that, thanks to the closure of common-prefix classes with respect to candidate generation, the search space can be traversed depth-first by recursively exploring each prefix class. Thus, the idlists of prefixes can be reused with limited memory requirements. SPADE can also be implemented in a strictly level-wise manner; however, it would be far less efficient.
3.4.3 cSPADE: managing constraints
In case the maxGap constraint is enforced, the solution found by the SPADE algorithm is no longer complete. For example, α = A→B→C→D is obtained by combining the two subsequences A→B→C and A→B→D, which share the 2-prefix A→B. Unfortunately, A→B→D is not a contiguous subsequence of α. This implies that, even if α is frequent and satisfies a given maxGap constraint, i.e. α ∈ F4, its subsequence A→B→D might not have been included in F3, since it may not satisfy the same maxGap constraint. In other words, SPADE might lose candidates and the related frequent sequences. cSPADE [69] overcomes this limit by using exactly the contiguous subsequence concept: α = A→B→C→D is now obtained from A→B→C and C→D, i.e. by combining the (k − 1)-prefix and the 2-suffix of α. It is straightforward to see that both the (k − 1)-prefix and the 2-suffix of α are contiguous subsequences of α. Unfortunately, the need for contiguous subsequences to guarantee anti-monotonicity under the maxGap constraint partially destroys the self-containment of SPADE's prefix-based equivalence classes, which ensures high locality and low memory requirements. While each prefix class is mined, cSPADE also needs to maintain F2 in main memory, since it uses 2-suffixes to extend frequent (k − 1)-sequences.
3.5 CCSM
The reason behind the choice to use F2 for candidate generation in cSPADE is that F2 is usually smaller than Fk−1 for k > 3, so the idlists of frequent 2-sequences should fit in memory. However, even when this is true, the idlists of 2-sequences contain more elements than those of (k−1)-patterns, thus the average cost of an intersection is greater. In addition, the
number of intersections is generally larger. In fact, the generation of a candidate depends on finding a pair of patterns with a matching common part. Hence, when the match is required on just one item, as in the case of intersection with F2, the probability of generating a false positive (discarded candidate) is higher. On the other hand, since the suffixes of the processed candidates are in no particular order, using Fk−1 for the same purpose can be excessively memory demanding. CCSM, the algorithm we propose, avoids these issues by using a suitable traversal order of the search space and an improved bidirectional idlist intersection operation.
3.5.1 Overview
The candidate generation method adopted by CCSM was inspired by GSP [62], which is also based on the contiguous subsequence concept. We generate a candidate k-sequence α from a pair of frequent (k − 1)-sequences, which share with α either a (k − 2)-prefix or a (k − 2)-suffix. It is easy to see that both these frequent (k − 1)-sequences are contiguous subsequences of α. As we have already highlighted above, the candidates generated by cSPADE are more numerous than those generated by CCSM/GSP. We show this with an example. Suppose that A → B → C ∈ F3, and that the only frequent 3-sequence having prefix B → C is B → C → D. CCSM directly combines these two 3-sequences to obtain a single potentially frequent 4-sequence A → B → C → D. Conversely, cSPADE tries instead to extend A → B → C with all the 2-sequences in F2 that start with C. In this way, cSPADE might generate a lot of candidates, even if, due to our hypotheses, the only candidate that has a chance to be frequent is A → B → C → D.
3.5.2 The CCSM algorithm
Like GSP, CCSM visits level-wise and bottom-up the lattice of the frequent sequential
patterns, building at each iteration Fk , the set of all frequent k-sequences.
CCSM starts with a count-based phase that mines a horizontal database, and
extracts F1 and F2 . During this phase, the database is scanned, and each input
sequence is checked against a set of candidate sequences. If the input sequence
contains a candidate sequence, the counter associated with the candidate is incremented accordingly. At the end of this count-based phase, the pruned horizontal
database is transformed into a vertical one, so that our intersection-based phase can
start. Thereafter, when a candidate k-sequence is generated from a pair of frequent (k − 1)-patterns, its support is computed on-the-fly using item idlist intersections.
This happens by joining the atomic idlists (stored in the vertical database) that are
associated with the frequent items in F1 , as well as several previously computed
intermediate idlists that are found in a cache.
In order to describe how the intersection-based phase works, it is necessary to
discuss how candidates are generated, how idlists are represented and joined, and
how CCSM idlist cache is organized.
Candidate generation.
At iteration k, we generate the candidate k-sequences starting from the frequent
(k − 1)-sequences in Fk−1 . For each f ∈ Fk−1 , we generate candidate k-sequences
by merging f with every f′ ∈ Fk−1 such that (k − 2)-suffix(f) = (k − 2)-prefix(f′). For example, f : BD→B is extended with f′ : D→B→B to generate the candidate 4-sequence BD→B→B. Note that, by construction, f and f′ are contiguous subsequences of the new candidate.
To make more efficient the search in Fk−1 for pairs of sequences f and f′ that
share a common suffix/prefix, we aggregate and link the various groups of sequences
in Fk−1 .
Figure 3.2 illustrates the generation of the candidate 4-sequences starting from
F3 . On the left-hand and on the right-hand side of the figure two copies of the
3-sequences in F3 are shown. These sequences are lexicographically ordered either
with respect to their 2-suffixes or to their 2-prefixes. Moreover, sequences sharing
the same suffix/prefix are grouped (this is represented by circling each aggregation/partition with dotted boxes). For example, a partition appearing on the left
side is {BD → B, D → D → B}. If two partitions that appear on the opposite
sides share a common contiguous 2-subsequence (2-suffix = 2-prefix), they are also
linked together. For instance, two linked partitions are {BD → B, D → D → B}
(on the left), and {D → BD, D → B → B} (on the right). Due to the sharing of
suffix/prefix within and between linked partitions, we can obviously save memory
to represent F3 .
The linked partitions of frequent sequential patterns are the only ones we must
combine to generate all the candidates. In the middle of Figure 3.2, we show the
candidates generated for this example. Candidates that do not turn out to be frequent are shown in dashed boxes, while the frequent ones are indicated with solid-line boxes. Note that,
before passing to the next pair, we first generate all the candidates from the current
pair of linked partitions. The order in which candidates are generated enhances
temporal locality, because the same prefix/suffix is encountered several times in
consecutively generated candidates. Our caching system takes advantage of this
locality, storing and reusing intermediate idlist joins.
Idlist intersection.
To determine the support of a candidate k-sequence p, we have first to produce the
associated idlist L(p). Its support will correspond to the number of distinct sid
values contained in L(p).
To produce L(p), we have to join the idlists associated with two or more subsequences of p. If both L(p′1) and L(p′2) are available, where p′1 and p′2 are the two contiguous subsequences whose combination produces p, L(p) can be generated very efficiently through a 2-way intersection: L(p) = L(p′1) ∩ L(p′2). Otherwise, we have
to intersect idlists associated with smaller subsequences of p. The limit case is a
Figure 3.2: CCSM candidate generation.
k-way intersection, when we have to intersect atomic idlists associated with single
items.
As an example of a k-way intersection, consider the candidate 3-sequence
A→B→C. Our vertical database stores L(A), L(B) and L(C), which can be joined
to produce L(A→B→C). Each atomic list stores (sid, eid) pairs, i.e. the temporal
occurrences (eid) of the associated item within the original input sequences (sid).
When L(A), L(B) and L(C) are joined, we search for all occurrences of A followed
by an occurrence of B, and then, using the intermediate result L(A→B), for
occurrences of C after A→B. If a maximum or minimum gap constraint must be
satisfied, it is also checked on the associated timestamps (eids).
Note that in this case we have generated the pattern A→B→C by extending the
pattern from left to right. An important question regards what information has to
be stored along with the intermediate list L(A→B). We can simply show that, if we
extend the pattern from left to right, the only information needed for this operation is that related to the timestamps associated with the last item/event of the sequence.
With respect to L(A→B), this information consists in the list of (sid, eid) pairs
of the B event. Each pair indicates that an occurrence of the specified sequential
pattern occurs in the input sequence sid, ending at time eid.
On the other hand, if we generate the sequence by extending it from right to
left, the intermediate sequence should be B→C, but the information to store in
L(B→C) should be related to the first item/event of the sequence (B). In this
case, each (sid, eid) pair stored in the idlist should indicate that an occurrence of
the specified sequential pattern exists in input sequence sid, starting at time eid.
Consider now that we use a cache to store intermediate sequences and associated
idlists. In order to improve cache reuse, we want to exploit cached sequences to
extend other sequences from left to right and vice versa. Therefore, the lists of pairs
(sid, eid) should be replaced with lists of triples (sid, first_eid, last_eid), indicating that an occurrence of the specified sequential pattern occurs in input sequence sid, starting at time first_eid and ending at time last_eid.
Finally, note that two types of idlist join are possible: equality join (denoted as
∩e ) and temporal join (denoted as ∩t ). The first is the usual set-intersection, and is
used when we search for occurrences of one item appearing simultaneously with the
last item of the current sequence: for example, L(A→BC) = L(A→B) ∩e L(C).
Temporal join is instead an ordering-aware intersection operation, which may also
check whether the minimum and maximum gap constraints are satisfied. Consider
the join of the example above, i.e. L(A→B→C) = L(A→B) ∩t L(C). The result of this join is obtained from L(C) by discarding all its pairs (sid2, eid2) with a non-matching sid1 in the first idlist (L(A→B)), or with a matching sid1 that is not
associated with any eid1 smaller than eid2 .
More formal definitions of the two base cases (lists of pairs) for the equality join and the (minGap, maxGap) constraint-enforcing temporal join are shown below:

L1 ∩e L2 = {(sid2, eid2) ∈ L2 | ∃(sid1, eid1) ∈ L1 : sid1 = sid2 ∧ eid1 = eid2}

L1 ∩t L2 = {(sid2, eid2) ∈ L2 | ∃(sid1, eid1) ∈ L1 : sid1 = sid2 ∧ eid1 < eid2 ∧ minGap ≤ |eid2 − eid1| ≤ maxGap}
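The two definitions translate almost literally into code; the following Python sketch (ours) operates on idlists represented as plain lists of (sid, eid) pairs, and the example values are illustrative.

# Illustrative sketch of the equality join and the gap-constrained temporal join.
def equality_join(l1, l2):
    s1 = set(l1)
    return [(sid, eid) for (sid, eid) in l2 if (sid, eid) in s1]

def temporal_join(l1, l2, min_gap=0, max_gap=float("inf")):
    return [(sid2, eid2) for (sid2, eid2) in l2
            if any(sid1 == sid2 and eid1 < eid2 and min_gap <= eid2 - eid1 <= max_gap
                   for (sid1, eid1) in l1)]

L_AB = [(1, 2), (2, 5)]            # occurrences of A->B, identified by their last eid
L_C = [(1, 2), (1, 4), (2, 9)]     # occurrences of the item C
print(equality_join(L_AB, L_C))              # [(1, 2)]  -> supports A->BC
print(temporal_join(L_AB, L_C, max_gap=3))   # [(1, 4)]  -> supports A->B->C with gap <= 3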
    Cached Sequence    Cached Idlist
1   A                  L(A)
2   A→A                L(A) ∩t L(A)
3   A→A→B              [L(A) ∩t L(A)] ∩t L(B)
4   A→A→BC             [[L(A) ∩t L(A)] ∩t L(B)] ∩e L(C)
5   A→A→BC→D           [[[L(A) ∩t L(A)] ∩t L(B)] ∩e L(C)] ∩t L(D)

Figure 3.3: Example of cache usage.
Idlist caching.
Our k-way intersection method can be improved using a cache of k idlists. Figure 3.3
shows how our caching strategy works: the table represents the status of the cache
after the idlist associated with sequence A→A→BC→D has been computed. Each
cache entry is numbered and contains two values: a sequence and its idlist. Each
sequence entry i is obtained from entry (i − 1) by appending an item. In a similar
way, the associated idlist is the result of a join between the previous cached idlist and
the idlist associated with the last appended item. When a new sequence is generated,
the cache is searched for a common prefix and the associated idlist. If a common
prefix is found, CCSM reuses the associated idlist, and rewrites subsequent cache
lines. Considering the example of Figure 3.3, if the candidate A→A→BF is then
generated, the third cache line corresponding to the common prefix A→A→B will
be reused. In this way, the support of A→A→BF can be computed by performing
a single equality join between the idlist in line 3 and L(F ). The result of this join
is written in line 4 for future reuse.
Since the cache contains all the prefixes of the current sequence along with the
associated idlists, reuse is optimal when candidate sequences are generated in lexicographic order. Furthermore, since idlist length (and join cost) decreases as sequence length increases, the joins saved by exploiting the cached idlists are the most expensive ones.
Figure 3.4: CCSM idlist reuse.
The combined effect of cache use and candidate generation is illustrated in Figure
3.4. On the left-hand side, a fragment of the lists of the linked partitions sharing
a common infix is shown. The right-hand side of the Figure illustrates instead how
candidates are generated. First, we consider Partition(FG→A), i.e. the set of sequences sharing the prefix/suffix FG→A. L(FG→A) is processed first, using
the cache as described before. L(A) and L(B) are then joined left to right with
L(FG→A) to obtain L(FG→A→A) and L(FG→A→B). Finally, we join right to left the lists so obtained with L(A) and L(C) to produce the lists associated with all the possible candidates. When Partition(FG→A) has been processed, all the intermediate idlists except those stored in the cache are discarded, and the next Partition(FG→B) is processed. The cache currently contains L(FG→A) and all its intermediate idlists, so that L(FG) can be reused for computing L(FG→B).
Since partitions are ordered with respect to the common infix, similar reuses are
very frequent.
[Plot: number of intersection operations vs. pattern length for the 2-ways, cached k-ways (CCSM) and pure k-ways methods; left panel: dataset cs11, min support 0.30%, max-gap 8; right panel: dataset cs21, min support 0.40%, max-gap 8.]
Figure 3.5: Number of intersection operations actually performed using 2-ways, pure
k-ways and cached k-ways intersection methods while mining two synthetic datasets.
Figure 3.5 shows the efficacy of CCSM caching strategy. The plots report the
actual number of intersection operations performed using 2-ways, pure k-ways and
CCSM cached k-ways intersection methods while mining two synthetic datasets. As can be seen, our small cache is very effective, since it allows saving a lot of intersection operations over a pure k-ways method, while memory requirements are significantly lower than those deriving from the adoption of a pure 2-ways intersection method.
3.5.3 Experimental evaluation
In order to evaluate the performance of the CCSM algorithm, we conducted several tests on a Linux box equipped with a 450MHz Pentium II processor, 512MB of RAM and an IDE hard disk. The datasets used were CS11 and CS21, two synthetic datasets generated using the publicly available synthetic data generator code from the IBM Almaden Quest data mining project [7]. In particular, the datasets contain 100,000 customer sequences composed on average of 10 (CS11) and 20 (CS21)
transactions of average length 5. The other parameters Ns , Ni , N , I used to generate
the maximal sequences of average size S = 4 (CS11) and S = 8 (CS21), were set
to 5000, 25000, 10000 and 2.5, respectively. Note that these values are the same as
those used to generate the synthetic datasets in [62, 69, 70]. Figure 3.6 plots the
number of frequent sequences found in datasets CS11 and CS21 as a function of
the pattern length for different values of the maxGap constraint. As expected, the
number of frequent sequences is maximum when no maxGap constraint is imposed,
while it decreases rapidly for decreasing values of the maxGap constraint.
[Plot: number of frequent patterns vs. pattern length for different maxGap values (no constraint, 1, 2, 4, 8, 12); left panel: dataset cs11, min support 0.30%; right panel: dataset cs21, min support 0.40%.]
Figure 3.6: Number of frequent sequences in datasets CS11 (minsup=0.30) and
CS21 (minsup=0.40) as a function of the pattern length for different values of the
maxGap constraint.
In order to assess the relative performance of our algorithm, we compared its
running times with the ones obtained under the same testing conditions by cSPADE
(we acknowledge Prof. M.J. Zaki for kindly providing us cSPADE code) [69, 70].
Figure 3.7 reports the total execution times of CCSM and cSPADE on datasets
CS11 and CS21 as a function of the maxGap value. In the tests conducted with
cSPADE we tested different configurations of the command line options available to
specify the number of partitions into which the dataset has to be split (-e #, default
no partitioning), and the maximum amount of memory available to the application
(-m #, default 256MB).
From the plots, we can see that while on the CS11 dataset performances of the
two algorithms are comparable, on the CS21 dataset CCSM remarkably outperforms
cSPADE for large values of maxGap, while cSPADE is faster when maxGap is small.
This holds because for large values of maxGap, the actual number of frequent sequences is large (see Figure 3.6), and cSPADE has to perform a lot of intersections
between relatively long lists belonging to F2 . CCSM on the other hand, reuses in
this case several intersections found in the cache. Since execution times increase
rapidly for increasing values of maxGap, we think that the behavior of CCSM is in general preferable to that of cSPADE.
The same considerations can be made by looking at the plots reported in Figure 3.8, which report, for a fixed maxGap constraint (maxGap=8), the execution times of CCSM and cSPADE on datasets CS11 and CS21 as a function of the minimum support threshold. The CCSM and cSPADE execution times were very similar on the CS11 dataset, while on the CS21 dataset CCSM was, for maxGap=8, about twice as fast as cSPADE.
[Plots omitted: running time (s) vs. Max Gap for dataset cs11 (min support 0.30%) and dataset cs21 (min support 0.40%), comparing CCSM with cSPADE under several -e/-m option settings.]
Figure 3.7: Execution times of CCSM and cSPADE on datasets CS11 (minsup=0.30)
and CS21 (minsup=0.40) as a function of the maxGap value.
[Plots omitted: running time (s) vs. minimum support (%) for dataset cs11 and dataset cs21 with max gap 8, comparing CCSM with cSPADE under several -e/-m option settings.]
Figure 3.8: Execution times of CCSM and cSPADE on datasets CS11 and CS21 with
a fixed maxGap constraint (maxGap=8) as a function of the minimum support
threshold.
3.6 Related works
The problem has been initially introduced by Agrawal and Srikant in [7], where they present AprioriAll, a count-based algorithm for solving it. The same authors in [62] generalize the problem and introduce GSP, a new count-based algorithm characterized by a better counter management and candidate generation policy. Another algorithm very similar to GSP, but using more efficient data structures that exploit the presence of common suffixes shared by several frequent patterns, is PSP [37].
As in the association case, both intersection-based and projection-based algorithms exist. Two of the best algorithms in the first category are SPADE [65, 67, 70], which computes the support of candidates using list intersections, and SPAM [8], which performs the same operation using boolean vectors and bitwise operations. Two representatives of the second category are FreeSpan [24] and PrefixSpan [52].
Mannila, Toivonen and Verkamo [35] define a slightly different problem: instead of frequent patterns common to several input sequences, they search for episodes frequently appearing in a single long input sequence. The support of a subsequence is the number of temporal windows containing it. In a subsequent work [34, 36] the same authors introduce constraints on single items and on pairs of elements present inside episodes.
Generalizations introduced in GSP [62] are the usage of a taxonomy, the possibility to group together events contained in a specified temporal frame, and temporal constraints on the minimum and maximum allowable distance between two consecutive events (minGap/maxGap). The proposed algorithm, however, does not handle them efficiently. The performance of SPADE with constraint enforcement (cSPADE [69]) is widely better when no constraint is required on maxGap, but is limited, as for GSP, when it is enforced. CCSM (S. Orlando, R. Perego, C. Silvestri [46, 47]) has been specifically designed in order to overcome this limitation, using a candidate generation method that is not affected by the anti-monotonicity issues of the maxGap constraint. PrefixSpan has been extended in order to handle several kinds of constraints [53].
A further evolution of PrefixSpan is CloSpan [64], an algorithm that is able to detect all closed sequential patterns (see note 2 below), pruning early during the computation most patterns that are frequent but not closed. Closed sequential patterns, even if they are much more compact, exactly represent the whole set of frequent sequential patterns, and it is possible to switch from one representation to the other. Nevertheless, building the complete set of patterns and checking for inclusion is more expensive than in the case of associations. CloSpan was the first algorithm dealing with closed sequential patterns. More recently, J. Wang and J. Han proposed BIDE, a new algorithm that finds all and only the closed patterns, without false positives that need to be corrected with post-processing.
One of the first algorithms for incremental sequence mining is ISM [51], which uses a method similar to that used by SPADE and, in addition, maintains the set of infrequent candidates (negative border) in order to minimize recomputation. This entails a non-trivial resource usage for large datasets, in contrast with ISE [38, 39], which does not need additional data and uses just inference from already known patterns.
Note 2: Closed sequential patterns are those sequential patterns that are not contained in any other pattern having the same support. If B contains A, and both have the same support, then every input sequence containing A also contains B (the converse is always true).
3.7 Conclusions
In this chapter, we have presented CCSM, a new FSM algorithm that mines temporal
databases in the presence of user-defined constraints. CCSM searches for sequential
patterns level-wise, and adopts an intersection-based method to determine the support of candidate k-sequences. Each time a candidate k-sequence α is generated,
its support is determined on the fly by joining the k atomic idlists associated with
the frequent items (1-sequences) constituting the candidate. This k-way intersection is, however, a limit case of our method. In fact, our order of generation of candidates ensures high locality, so that with high probability successively generated
candidates share a common subsequence of α. A cache is thus used to store the
intermediate idlists associated with all the possible prefixes of α. When the idlist
of another candidate β has to be built, we reuse the idlist corresponding to the
common subsequence of maximal length. The exploitation of such a caching strategy entails a strong reduction in the number of join operations actually performed. Finally, CCSM is able to consider the very challenging maxGap constraint over the sequential patterns extracted. Preliminary experiments conducted on synthetically
generated datasets showed that CCSM remarkably outperforms cSPADE when the
selectivity of the gap constraint is not high. Since we are conscious that further
optimization can be pushed into the code, we consider these results as encouraging.
CCSM result sets are strictly ordered on (common part, prefix item, suffix item),
thus different result sets can be efficiently merged using a simple list merge. Since
the distributed and stream FIM algorithms that are presented in the second part
of this thesis make heavy use of result merging, CCSM can be used to efficiently extend them to the FSM problem. In the last chapter, we give some
more detail on this use of CCSM.
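As a hint of how such a merge can be implemented, the following Python sketch (ours, assuming each result set is a list of (pattern, support) pairs already sorted by pattern key) sums the supports of patterns appearing in both sets with a single linear scan.

def merge_results(a, b):
    # Merge two result sets sorted by pattern, summing supports of shared patterns.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            out.append((a[i][0], a[i][1] + b[j][1]))
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

r1 = [(('A',), 10), (('A', 'B'), 4), (('C',), 7)]
r2 = [(('A',), 6), (('B',), 3), (('C',), 2)]
merge_results(r1, r2)
# [(('A',), 16), (('A', 'B'), 4), (('B',), 3), (('C',), 9)]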
II
Second Part
4 Distributed datasets
In many real systems, data are naturally distributed, usually due to plural ownership or to a geographical distribution of the processes that produce the data. Moving all the data to one single location for processing could be impossible due to either policy or technical reasons. Furthermore, the communications between the entities owning parts of the data may not be particularly fast or immediate. In this context, the communication efficiency of an algorithm is often more important than the exactness of its results.
In this chapter, we will focus on distributed association mining. We will start by
characterizing the different ways data can be distributed, and describe some useful
techniques common to several distributed association mining algorithms. Then we
will introduce the frequent itemset mining problem for homogeneous distributed
datasets and present two novel communication efficient distributed algorithms for
approximate mining of frequent patterns from transactional databases. Both the
algorithms we propose locally compute frequent patterns, and then merge local
results. The first algorithm, APRed , adaptively reduces the support threshold used
in local computation in order to improve the accuracy of the result, whereas the
second one, APInterp , uses an effective method for inferring the local support of
locally infrequent itemsets. Both strategies give a good approximation of the set of
the globally frequent patterns and their supports for sparse datasets, but APInterp
is more resilient to data skew. In the last part of the chapter, we report the results
of part of the tests we have conducted on publicly available datasets. The goal
of these tests is to evaluate the similarity between the exact result set and the
approximate ones returned by our distributed algorithms in different cases, as well
as the scalability of APInterp .
4.1 Introduction
As suggested before, there are several cases in which data can be distributed among
different entities, that we will call nodes. In the case of cellular phone networks,
each cell or group of cells may have its separate database for performance and
resilience reasons. At the same time, other information about the customer that owns a device is available at the accounting department, and is kept separate due
to privacy reasons. Where a particular piece of data can be found influences the kind of
solutions a problem can have. Therefore, before describing any algorithm for a
particular data mining problem, we need to specify in which context it will be used.
Homogeneous and heterogeneous data distribution
The two above examples of distributed databases, related to the cellular phones
domain, fall in two distinct major classes of data distribution. In the first case, each
node has its own database, containing the log of the activities of a device in the area
controlled by the group of antennas. Every local database contains different data,
but the kind of information is the same for every node. This situation is indicated
as homogeneous data distribution. On the other hand, if we are also interested in
data about customers, nodes having different kinds of data need to cooperate. In
the example, the cell database could contain the information that a device stopped
for several hours in the same place, whereas the accounting department database
knows which customer is associated with that device and its home address. This
situation is indicated as heterogeneous data distribution. In this chapter, we will
focus on association mining on homogeneously distributed data.
Communication bandwidth and latency issues
A key factor in the implementation of distributed algorithms is the kind of communication infrastructure available. An algorithm suitable for nodes connected by
high-speed network links, can be of little use if nodes are connected by a modem
and the public telephone network. Furthermore, for an algorithm that entails several
blocking communications, a high latency is definitely a serious issue. Distributed
systems are usually characterized by links having low speed, or high latency, or
both. Hence, efficient algorithms need to exchange as few data as possible, and
avoid blocking situation in which the local computation cannot resume until some
remote feedback arrives.
Parallel vs Distributed
Parallel (PDM) and distributed (DDM) data mining are a natural evolution of data mining technologies, motivated by the need for scalable and high-performance systems, or by policy/logistic reasons. The main difference between these two approaches is that while in PDM data can be moved (centralized) to a tightly coupled parallel system before starting the computation, DDM algorithms must deal with limited
possibilities for data movement/replication, due either to specific policies or technical reasons like large network latencies. A good review of algorithms and issues in
distributed data mining is [48].
4.2. Approximated distributed frequent itemset mining
4.1.1
53
Frequent itemset mining
There exist algorithms for distributed frequent itemset mining (FIM) that usually operate in a homogeneous context, and algorithms able to cope with heterogeneous
data, linked by primary keys [27] as, for instance, the individual number in the
previously seen example about personal data. The two main parallel/distributed
approaches [66], in the homogeneous case, are Count Distribution, in which each
node computes the support for the same set of candidates on his own dataset, and
Candidate/Data Distribution, where each node computes the support of a part of
candidates, using also part of the dataset owned by other nodes.
More in detail, algorithms based on Count-distribution compute the support of
each pattern locally, and then exchange (or collect) and sum all the supports to
obtain the global support. On the other hand, in Data Distribution and Candidate
Distribution each processor handles a disjoint set of candidate patterns, and access
all the data partitions for computing global support. The difference between the two
approaches is that, in Data Distribution, candidates are partitioned merely to divide
the workload, and all data are accessed by all processors, whereas in Candidate
Distribution the candidates are partitioned in such a way that each processor can
proceed independently and data are selectively replicated. Since only the counters
are sent, Count Distribution minimizes the communications, making it suitable for
loosely coupled setting. The other two techniques, instead, are more appropriate for
parallel systems.
A first parallel version of Apriori is introduced in [5], while other more efficient
solutions are found in [44, 45, 58, 22, 13, 56, 41, 13, 22, 5, 27, 66]. The diversity
of possible use cases makes the selection of the best algorithm a hard task. Even
metrics used for comparison may be more or less appropriate according to the specific
system architecture. A good survey on parallel association mining algorithms is [66].
Most of these algorithms, however, are not suitable for loosely coupled settings.
Only a few papers discussing truly distributed FIM algorithms recently appeared in
the literature [56, 57, 63].
Nevertheless, as previously explained, there are several real world systems that are intrinsically distributed and loosely coupled. For this reason we have chosen to prefer DDM solutions, able to deal with such cases.
4.2 Approximated distributed frequent itemset mining
In this section, we will introduce two novel approximate algorithms for distributed frequent itemset mining. After a brief summary of the notation used for frequent itemsets, we will introduce the centralized algorithm that inspired our algorithms and its naïve distributed version. Then we will describe APRed and APInterp , the algorithms we propose, and the experimental results we have obtained. Finally, we
will draw some conclusions.
4.2.1 Overview
A dataset D is a collection of subsets of items I = {it1 , . . . , itm }. Each element of D is called a transaction. A pattern x is frequent in D with respect to a minimum support minsup, if its support is greater than σmin = minsup · |D|, i.e. the pattern occurs in at least σmin transactions, where |D| is the number of transactions in D. A k-pattern is a pattern composed of k items, Fk is the set of all frequent k-patterns, and F = ∪i Fi is the set of all frequent patterns. F1 is also called the set of frequent items.
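For instance, assuming the hypothetical dataset D = {{a,b,c}, {a,b}, {a,c}, {b,d}} and minsup = 40% (so σmin = 1.6), the frequent patterns are F1 = {{a}, {b}, {c}} and F2 = {{a,b}, {a,c}}, since every other pattern occurs in at most one transaction.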
In this section, we discuss two distributed algorithms for approximate mining of
frequent itemsets: APRed (Approximate Partition with dynamic minimum support
Reduction) and APInterp (Approximate Partition with Interpolation). Both exploit
DCI [44], a state-of-the-art algorithm for FIM, as the miner engine used for local
computations. The name "Approximate Partition" derives from the distributed computation method adopted, which is inspired by the Partition algorithm [55] and by its straightforward distributed version [41].
We assume that our dataset D is divided into several disjoint partitions Di , i ∈ {1, ..., n}, located on n collaborating entities, where each transaction completely belongs to one of the partitions. In particular, we consider that the dataset is already partitioned, according to some business rules, among geographically distributed systems. Collaborating entities are loosely coupled, and even if the available network bandwidth sometimes is not an issue, latency surely is. A fitting example is a set of insurance companies connected by the Internet that collaborate in order to detect frauds. In this kind of setting, we should avoid sending lots of messages with several barrier synchronizations. Thus, a small loss of accuracy is a fair trade-off for a reduced number of communications/synchronizations.
Both APRed and APInterp independently compute a local solution on each node and
then merge local results. Instead of making a second pass, as Distributed Partition
does, we propose other methods to be used during the merge phase in order to
improve the support count. To this end, the minimum support threshold used in
local computation is adaptively reduced in APRed , whereas an approximate support
inference heuristic is used in APInterp . Experimental tests show that the solutions
produced by both APRed and APInterp are good approximation of the exact global
result, and that APInterp is more efficient than APRed . Unfortunately, the APInterp
method may also generate a few false positives, whose approximate supports are usually very close to the exact ones. Therefore, the support of the rules extracted from these false positive patterns should not bother analysts. This is especially true when a positive result just indicates a case that needs the attention of the operator
for further investigation, as in the case of fraud detection: if a pattern with support
slightly higher than the threshold is interesting, probably a slightly lower one will
be interesting too. A single synchronization is required to compute and redistribute
the reduced support threshold, in APRed , and the knowledge of F2 , used by slaves for
global pruning in both algorithms. This is particularly important in the described
distributed setting, where the network latency is often a more critical factor than
the available bandwidth, and the reduced number of communications is worth a
small reduction in the accuracy of results. In APInterp , it is also possible to disable
local pruning; at the cost of a larger number of false positives, the algorithm becomes asynchronous and suitable for unidirectional communications.
4.2.2 The Distributed Partition algorithm
Our APInterp and APRed algorithms were inspired by Partition [55], a sequential algorithm that divides the dataset into several partitions processed independently. The
basic idea exploited by Partition is the following: each globally frequent pattern must
be locally frequent in at least one partition. This guarantees that the union of all local solutions is a superset of the global solution. However, one further pass over the
database is necessary to remove all false positives, i.e. patterns that result locally
frequent but globally infrequent.
Obviously, Partition can be straightforwardly implemented in a distributed setting with a master/slave paradigm [41]. Each slave becomes responsible for a local
partition, while the master performs the sum-reduction of local counters (first phase)
and orchestrates the slaves for computing the missing local supports for potentially
globally frequent patterns (second phase) to remove patterns having global support
less than minsup (false positive patterns collected during the first phase).
While the Distributed Partition algorithm gives the exact values for supports, it
has pros and cons with respect to other distributed algorithms. The pros are related to the number of communications/synchronizations: other methods, such as count-distribution [22, 68], require several communications/synchronizations, while the Distributed Partition algorithm only requires two communications from the slaves to the
master, one single message from the master to the slaves and one synchronization
after the first scan. The cons are the size of messages exchanged, and the possible
additional computation performed by the slaves when the first phase of the algorithm produces false positives. Consider that, when low absolute minimum supports
are used, it is likely to produce a lot of false positives due to data skew present in
the various dataset partitions [50]. This also has a large impact on the cost of the second phase of the algorithm: most of the slaves will participate in counting
the local supports of these false positives, thus wasting a lot of time.
One naïve work-around, that we will name Distributed One-pass Partition, consists in stopping Distributed Partition after the first pass. So in Distributed One-pass Partition each slave independently computes locally frequent patterns and sends them to
the master which sum-reduces the support for each pattern and writes in the result
set only patterns having the sum of the known supports greater than (or equal to)
minsup. Distributed One-pass Partition has obvious performance advantages vs. Distributed Partition. On the other hand, it yields a result that is approximate. Whereas
it is sure that at least the number of occurrences reported in the results exists for
each pattern, it is likely that some pattern has also occurrences in other partitions
in which it was not frequent.
This is formalized in the following lemma.
Lemma 10 (Bounds on support after first pass). Let P = {1, ..., N} be the set of the N partition indexes, and let fpart(x) = {j ∈ P | σj(x) > minsup · |Dj|} be the set of indexes of the partitions where the pattern x is frequent; its complement P \ fpart(x) is the set of partitions where x is not frequent. The support of a pattern x is greater than or equal to the support computed by the Distributed One-pass Partition algorithm:

σ(x)lower = Σj∈fpart(x) σj(x)

and is less than or equal to σ(x)lower plus the maximum support the same pattern can have in the partitions where it is not frequent:

σ(x)upper = σ(x)lower + Σj∈P\fpart(x) (minsup · |Dj| − 1)
Note that when a pattern does not result frequent in a partition, its actual local
support can be at most equal to the local minimum support threshold minus one.
We can easily transform the two absolute bounds defined above into the corresponding relative ones:

sup(x)upper = σ(x)upper / |D| ,   sup(x)lower = σ(x)lower / |D|
These bounds can be used to calculate the Average Support Range described in
appendix A (ASR(B), Definition 14). Any approximate algorithm based on Distributed One-pass Partition will yield results with at most this average error on all
the supports.
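The following Python sketch (our own illustration, not code from the thesis) makes these bounds operational: given the size of each partition and the supports of the patterns that turned out to be locally frequent there, it returns the One-pass estimate σ(x)lower together with the corresponding σ(x)upper.

def support_bounds(pattern, part_sizes, local_counts, minsup):
    # part_sizes[i] is |D_i|; local_counts[i] maps each locally frequent pattern
    # to its local support sigma_i(x) (patterns below the local threshold are absent).
    lower, upper = 0, 0
    for size, counts in zip(part_sizes, local_counts):
        if pattern in counts:              # partition where x was locally frequent
            lower += counts[pattern]
            upper += counts[pattern]
        else:                              # x unknown here: at most the local threshold - 1
            upper += int(minsup * size) - 1
    return lower, upper

part_sizes = [1000, 1000, 500]
local_counts = [{('a', 'b'): 15}, {('a', 'b'): 12}, {}]   # not frequent in the third partition
support_bounds(('a', 'b'), part_sizes, local_counts, minsup=0.01)
# (27, 31): the exact global support lies somewhere in [27, 31]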
The main issue with Distributed One-pass Partition is that for every pattern the
computed support is a very conservative estimate, since it always chooses the lower
bounds to approximate the results. The first method we propose, APRed , aim at
increasing this lower bound. This is obtained by mean of a reduction of the minimum
support used for local computation in order to increase the probability that globally
frequent patterns turn out to be locally frequent in most of the dataset partitions.
Generally, any algorithm returning a support value between the bounds will
have better chances of being more accurate. Following this idea, we devised another
algorithm based on Distributed One-pass Partition, APInterp , which uses a smart interpolation of support. Moreover, it is resilient to skewed item distributions.
4.2.3 The APRed algorithm
The key idea of APRed , our first approximate FIM algorithm, is to use a slightly reduced minimum support threshold (an adaptively selected one) for local elaborations. The APRed algorithm requires the same number of communications as the Partition one, and consists of two phases too. The first phase allows the master to compute a "good approximation" R′ of R = F1 ∪ F2 , where R′ ⊆ R, and a lower bound σ′(x) for the support σ(x) of any pattern x ∈ R. This knowledge of R′ is then
used by each slave for globally pruning the candidates during the second phase. This
should reduce the production of false positives on the various slaves. Moreover, at
the end of this first phase, the master also reduces the user-provided minsup, and
this new support threshold is adopted by all the slave for the rest of the computation.
The rationale of lowering minsup in local slave computation is to increase the probability that globally frequent patterns turn out to be locally frequent in most of the
dataset partitions. Note that when a pattern is locally frequent in all the partitions,
the master is able to determine exactly its support. At the end of second phase the
master collects the locally frequent patterns (with respect to the reduced minsup)
from the slaves, and simply builds the approximate sets {Fi |i > 2} by summing the
supports associated with corresponding locally frequent patterns. Obviously, even
if the local frequent patterns have been computed by lowering minsup, the master
considers a pattern frequent only if this sum is at least |D| · minsup.
The two points to clarify are:
• how the master arrives at a "good approximation" R′ of R = F1 ∪ F2 (at the end of the first phase);
• how the master decides the support reduction ratio r to be used for the rest of the computation (during the second phase).
A "good approximation" of the frequent patterns composed of at most two items is built using a significantly reduced minsup for local computation during the first phase. In our tests, this initial support threshold was set to minsup′ = minsup/2. In several cases F1 and F2 have many fewer elements than the following sets Fk , thus
using such a low minimum support during the very first part of the computation
could be reasonable for wide ranges of user-specified values of minsup and sparse
datasets. Nevertheless, R′ gives us an accurate knowledge of R = F1 ∪ F2 . However, minsup′ is usually too small, and cannot be used for the following iterations.
Before describing the criteria used for deciding the support to use during the
remaining iterations, we need to introduce a measure of similarity, which is used to
compare two different result sets A and B. The Sim(A, B) measure, described in detail in Appendix A, ranges from 0 to 1, and considers both false positives/negatives and non-matching support values.
The master chooses the new support threshold, minsup′′ ∈ [minsup′, minsup], in such a way that Sim(R′′, R) is high, where R′′ ⊆ R′ is introduced in the following.
Note that, since the correct result R is not available, we have to exploit the self-similarity between the best known approximation of R, i.e. R′, and a more relaxed one, R′′, obtained as if all the slaves had mined their patterns (composed of one or two items) using the support threshold minsup′′ ∈ [minsup′, minsup]. The idea is to arrive at determining a value for minsup′′ that is very close to minsup, thus entailing a small increase in the computational complexity. In practice the master chooses the highest minsup′′ value which ensures a self-similarity (above a specified threshold, 98% in our tests) between R′′ and R′.
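The selection of r′′ can be sketched as follows. The code is only illustrative and makes several assumptions: the sim argument stands in for the Sim() measure of Appendix A (here a crude Jaccard-style stand-in is shown), and the local counts of 1- and 2-patterns are assumed to be available as plain dictionaries.

def result_at_ratio(local_counts, part_sizes, minsup, r):
    # Global result as if every slave had used the reduced local threshold r * minsup.
    totals = {}
    for counts, size in zip(local_counts, part_sizes):
        for pat, sigma in counts.items():
            if sigma > r * minsup * size:        # locally frequent w.r.t. the reduced threshold
                totals[pat] = totals.get(pat, 0) + sigma
    total = sum(part_sizes)
    return {p: c for p, c in totals.items() if c > minsup * total}

def jaccard_sim(a, b):
    # Crude stand-in for Sim(): the real measure also weighs support differences.
    return len(set(a) & set(b)) / max(1, len(set(a) | set(b)))

def choose_reduction(local_counts, part_sizes, minsup, sim=jaccard_sim, gamma=0.98):
    reference = result_at_ratio(local_counts, part_sizes, minsup, 0.5)     # R'
    r = 1.0
    while r >= 0.5:
        if sim(reference, result_at_ratio(local_counts, part_sizes, minsup, r)) > gamma:
            return r * minsup        # minsup'' = r'' * minsup, with r'' maximal
        r = round(r - 0.05, 2)
    return 0.5 * minsup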
The pseudo-code of the algorithm is contained in Algorithms 2 and 3 for the slave and master parts respectively. In the pseudo-code, R′i , F′i1 , F′i2 and σi(x) are related to the partition Di assigned to slave i, while the corresponding symbols without i are related to global results and datasets. The truth function [[expr]], which is equal to 1 if expr is TRUE and 0 otherwise, is used to select only the frequent patterns with respect to the specified support threshold.
Algorithm 2: APRed - Slave i
1. Compute the local R′i = F′i1 ∪ F′i2 w.r.t. minsup′ = (1/2) · minsup;
2. Send the local partial result R′i to the master;
3. Receive the global approximation R′ of R;
4. Receive minsup′′;
5. Continue the computation w.r.t. minsup′′, using R′ for pruning candidates;
6. Send the local results to the master.
Algorithm 3: APRed - Master
1. Receive the local partial results R′i from all the slaves;
2. Compute R′ = {x ∈ ∪i R′i | Σi σi(x) > minsup · |D|};
3. Send R′ to all the slaves;
4. Compute r′′ = max{r ∈ [0.5, 1] | Sim(R′, R′′(r)) > γ}, where γ is a user-provided similarity threshold, R′′(r) = {x ∈ R′ | Σi σi^r(x) > minsup · |D|}, and σi^r(x) = [[σi(x) > r · minsup · |Di|]] · σi(x);
5. Send minsup′′ = r′′ · minsup to all the slaves;
6. Receive the local results R′′i from all the slaves;
7. Return R′ ∪ {x ∈ ∪i R′′i | Σi σi(x) > minsup · |D|}.
It is worth noting that the master discards already computed local results. In particular, the presence of patterns in R′i (see point 2) and R′′i (see point 7) that do not result globally frequent causes a waste of resources. This is a negative side effect and, in the experimental section, we will use this quantity as a measure of the efficiency of the proposed algorithm, in order to assess the impact on performance of lowering the minimum support threshold. We will see, however, that by exploiting
the approximate knowledge R′ of F1 ∪ F2 for candidate pruning we can effectively reduce this drawback.
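A minimal sketch of this pruning step, under the assumption that candidates are plain item tuples and that the approximate F′2 is available as a set of frozensets, is the following:

from itertools import combinations

def prune_with_f2(candidates, global_f2):
    # Keep only the candidates whose 2-item subsets all belong to the approximate F'2:
    # if any 2-subpattern is not (approximately) globally frequent, neither is the candidate.
    kept = []
    for cand in candidates:
        if all(frozenset(pair) in global_f2 for pair in combinations(cand, 2)):
            kept.append(cand)
    return kept

f2 = {frozenset(p) for p in [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd')]}
prune_with_f2([('a', 'b', 'c'), ('a', 'b', 'd')], f2)
# [('a', 'b', 'c')]  --  ('a', 'd') is not in F'2, so ('a', 'b', 'd') is pruned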
4.2.4 The APInterp algorithm
APInterp , the second distributed algorithm we propose in this chapter, tries to overcome some of the problems encountered by APRed and Distributed One-pass Partition
when the data skew between the data partitions is high.
The most evident one is that several false positives could be generated, increasing the resource utilization and the execution time of both Distributed Partition and Distributed One-pass Partition. Like APRed , APInterp addresses this issue by means of global pruning based on partial knowledge of F2 : each locally frequent pattern that contains a globally non-frequent 2-pattern is locally removed from the set of frequent patterns before sending it to the master and performing the next candidate
generation. Moreover this skew might cause a globally frequent pattern x to result
infrequent on a given partition Di only. In other words, since σi (x) < minsup · |Di |,
x will not be returned as a frequent pattern by the ith slave. As a consequence, the
master of Distributed One-pass Partition cannot count on the knowledge of σi (x), and
thus cannot exactly compute the global support of x. Unfortunately, in Distributed One-pass Partition the master might also deduce that x is not globally frequent, because Σj≠i σj(x) < minsup · |D|.
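As a concrete, purely hypothetical illustration: with two partitions of 1000 transactions each and minsup = 1%, a pattern x with σ1(x) = 15 and σ2(x) = 8 is globally frequent (15 + 8 = 23 > 20), but it is not locally frequent in D2 (8 < 10); the One-pass master therefore only sees σ1(x) = 15 < 20 and wrongly discards x.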
As explained in the previous section, APRed uses support reduction in order to
limit this issue. Unfortunately, this method exposes APRed to the combinatorial
explosion of the intermediate results, in case the reduced minsup is too small for
the processed dataset. APInterp , instead, allows the master to infer an approximate
value for this unknown σi (x) by exploiting an interpolation method. The master
bases its interpolation reasoning on the knowledge of:
• the exact support of each single item on all the partitions, and
• the average reduction of the support count of pattern x on all the partitions
where x resulted actually frequent (and thus returned to the master by the
slave), with respect to the support of the least frequent item contained in x:
avg_reduct(x) = ( Σj∈fpart(x) σj(x) / minitem∈x(σj(item)) ) / |fpart(x)|
where fpart(x) corresponds to the set of data partitions Dj where x actually
resulted frequent, i.e. where σj (x) ≥ minsup · |Dj |.
The master can thus deduce the unknown support σi (x) on the basis of avg reduct(x)
as follows:
σi(x)interp = minitem∈x ( σi(item) · avg_reduct(x) )
It is worth remarking that this method works if the support of larger itemsets decreases similarly in all the dataset partitions, so that an average reduction factor (different for each pattern) can be used to interpolate unknown values. Finally note that, as regards the interpolated value above, we expect the following inequality to hold:

σi(x)interp < minsup · |Di|     (4.1)
So, if we obtain that σi (x)interp ≥ minsup · |Di |, this interpolated result cannot be
accepted. If it was true, the exact value σi (x) should have already been returned
by the ith slave. Hence, in those few cases where the inequality (4.1) does not hold,
the interpolated value returned will be:
σi (x)interp = (minsup · |Di |) − 1
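The interpolation can be sketched in Python as follows; this is our own illustration, which assumes that the per-item local counts and the locally frequent patterns are available as plain dictionaries.

def interpolate_support(x, i, part_sizes, item_counts, local_counts, minsup):
    # item_counts[j][item]: exact local support of each single item in D_j.
    # local_counts[j]: supports of the patterns that were locally frequent in D_j.
    fpart = [j for j, counts in enumerate(local_counts) if x in counts]
    # Average reduction w.r.t. the least frequent item of x in each such partition.
    reductions = [local_counts[j][x] / min(item_counts[j][it] for it in x) for j in fpart]
    avg_reduct = sum(reductions) / len(reductions)
    # Interpolated support on partition i, capped just below the local threshold (Eq. 4.1).
    interp = min(item_counts[i][it] for it in x) * avg_reduct
    cap = minsup * part_sizes[i] - 1
    return min(interp, cap)

part_sizes = [1000, 1000]
item_counts = [{'a': 100, 'b': 80}, {'a': 60, 'b': 40}]
local_counts = [{('a', 'b'): 40}, {}]          # ('a','b') locally frequent only in D_0
interpolate_support(('a', 'b'), 1, part_sizes, item_counts, local_counts, minsup=0.03)
# avg_reduct = 40/80 = 0.5, so sigma_1 is estimated as min(60, 40) * 0.5 = 20.0 (below the cap of 29)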
The proposed interpolation schema yields a better approximation of exact results
than Distributed One-pass Partition. The support values computed by the latter algorithm are, in fact, always equal to the lower bounds of the intervals containing the
exact support of any particular pattern. Hence any kind of interpolation producing
an approximate result set, whose supports are between the interval bounds, should
be, generally, more accurate than always picking its lower bound.
Obviously, several other ways of computing a support interpolation could be devised. Some are really simple, such as the average of the bounds, while others are complex, such as counting inference, used in a different context in [43]. We chose this particular
kind of interpolation because it is simple to calculate, since it is based on data that
we already maintain for other purposes, and it is aware of the data partitioning
enough to allow for accurate handling of datasets characterized by heavy data-skew
on item distributions.
We can finally introduce the pseudo-code of APInterp (algorithms 4 and 5). As
in Distributed Partition, we have a master and several slaves, each in charge of a
horizontal partition Di of the original dataset. The slaves send information to the
master about the counts of single items and locally frequent 2-itemsets. Upon reception of all local results (synchronization), the master communicates to the slaves
an approximate global knowledge of F′2 , used by the slaves to prune candidates for
the rest of the mining process. Finally, once received information about all locally
frequent patterns, the master exploits the interpolation method sketched above for
inferring unknown support counts.
Note that when a pattern is locally frequent in all the partitions, the master is
able to determine exactly its support. Otherwise, an approximate inferred support
value is produced, along with an upper bound and a lower bound for that support.
In the pseudo-code, Fki denotes the set of frequent k-patterns in partition i (or globally when i is not present), F′k indicates an approximation of Fk , and Single_Countsi1 holds the supports of all 1-patterns in partition i.
For the sake of simplicity, some details of the algorithm have been altered in the pseudo-code.
Algorithm 4: APInterp - Slave i
1. Compute the local Single_Countsi1 and F2i;
2. Send the local partial results to the master;
3. Receive the global approximation F′2 of F2;
4. Continue the computation, using F′2 for pruning candidates;
5. Send the local results to the master. If the computation is over, send an empty set.
Algorithm 5: APInterp - Master
1. Receive the local partial results Single_Countsi1 and F2i from all the slaves;
2. Compute the exact F1 , on the basis of the local counts of single items;
3. Compute the approximate F′2 = {x ∈ ∪i F2i | Σi counti(x) > minsup · |D|}, where counti(x) is equal to σi(x) if x ∈ F2i, or to σi(x)interp otherwise;
4. Send F′2 to all the slaves;
5. Receive the local results from all the slaves (empty for slaves terminated before the third iteration);
6. Compute and return, for each k, the approximate F′k = {x ∈ ∪i Fki | Σi counti(x) > minsup · |D|}, where counti(x) is equal to σi(x) if x ∈ Fki, or to σi(x)interp otherwise.
In particular, points 4 and 5 of the slave pseudo-code are an over-simplification of the actual code: patterns are sent, asynchronously, as soon as they are available in order to optimize communication. Each slave terminates when, at iteration k, less than k + 1 patterns are frequent; this is equivalent to checking the emptiness of F′i(k+1), but more efficient. On the other hand, the master continuously collects results from still active slaves and processes them as soon as all the expected result sets of the same length arrive.
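To make the merge concrete, the sketch below (ours; the interpolate callback is a stand-in for the interpolation just described) sums exact and interpolated local counts and keeps the patterns above the global threshold, which is essentially what steps 3 and 6 of the master pseudo-code do.

def merge_level(local_counts, part_sizes, minsup, interpolate):
    # local_counts[i]: {pattern: sigma_i(x)} for the patterns locally frequent in D_i.
    # interpolate(x, i): estimate of sigma_i(x) when x was not returned by slave i.
    patterns = set().union(*[set(c) for c in local_counts])
    total_size = sum(part_sizes)
    merged = {}
    for x in patterns:
        count = sum(local_counts[i][x] if x in local_counts[i] else interpolate(x, i)
                    for i in range(len(local_counts)))
        if count > minsup * total_size:
            merged[x] = count
    return merged

# Tiny usage example with a trivially pessimistic stand-in for the interpolation:
locals_ = [{('a',): 30, ('b',): 25}, {('a',): 28}]
merge_level(locals_, [1000, 1000], 0.02, interpolate=lambda x, i: 0)
# {('a',): 58}: ('b',) totals only 25 < 40 and is discarded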
4.2.5 Experimental evaluation
In the following part of the section, we describe the behavior exhibited by our distributed approximate algorithms in our experiments. We have run the APRed and APInterp algorithms on several datasets using different parameters. The goal of these tests is to understand how the similarity of the results varies as the minimum support and the number of partitions change, and to assess scalability.
Similarity and Average Support Range. The method we are proposing yields
approximate results. In particular APInterp computes pattern supports which may
be slightly different from the exact ones, thus the result set may miss some frequent
patterns (false negatives) or include some infrequent patterns (false positives). In order to evaluate the accuracy of the results we use a widely used measure of similarity
between two pattern sets introduced in [50], and based on support difference. At the
same time, we have introduced a novel similarity measure, derived from the previous
one and used along with it in order to assess the quality of the algorithm output.
To the same end, we use the Average support Range (ASR), an intrinsic measure of
the correctness of the approximation introduced in [61]. An extensive description of these measures and a discussion of their use can be found in Appendix A.
Experimental environment
The experiments were performed on a cluster of seven high-end computers, each
equipped with an Intel Xeon 2 GHz, 1 GB of RAM memory and local storage. In
all our tests, we mapped a single process (either master or slave) to each node.
This system offers communications with good latency (a dedicated Fast Ethernet).
However, since APInterp requires just one synchronization, and all communications are pipelined, its communication pattern should be suitable even for a distributed
system characterized by a high latency network.
Experimental data
We performed several tests using datasets from the FIMI’03 contest [1]. We randomly partitioned each dataset and used the resulting partitions as input data for
different slaves.
During the tests for APRed , we used two different partitionings, briefly indicated with the suffixes P1 and P2 in plots and tables. In doing so, we tried to cover a number of different cases with respect to partition size and number of partitions. Table 4.1 shows a list of these datasets along with their cardinality, the number of partitions used in the tests, and the minimum and maximum sizes of the partitions. Each dataset is also identified by a short reference code.
Table 4.1: Datasets used in the APRed experimental evaluation. P1 and P2 in the dataset name refer to different partitionings of the same dataset.

Dataset (reference)        #Trans./1000   # Part.   Part. size /1000
accidents-P1 (A1)          340            10        13..56
accidents-P2 (A2)          340            10        15..55
kosarak-P1 (K1)            990            20        11..79
kosarak-P2 (K2)            990            20        21..78
mushroom-P1 (M1)           8              4         1..3
mushroom-P2 (M2)           8              10        0.5..1
retail-P1 (R1)             88             4         14..31
retail-P2 (R2)             88             4         10..31
T10I4D100K-P1 (T10-1)      100            10        2..17
T10I4D100K-P2 (T10-2)      100            10        8..16
T40I10D100K-P1 (T40-1)     100            10        3..19
T40I10D100K-P2 (T40-2)     100            10        5..13
In the APInterp tests, each dataset was divided into a number of partitions ranging from 1 to 6, both into partitions of similar size and into partitions of significantly different sizes. The first ones, the balanced partitioned datasets, were used in order to assess speedup for
the tests on our parallel test bed. Table 4.2 shows a list of these datasets along
with their cardinality and the minimum and maximum sizes of the partitions (for
the largest number of partitions). Each dataset is also identified by a short code,
starting with U in case the sizes of partitions differ significantly. The number of
partitions is not reported in this table, since it depends on the number of slaves
involved in the specific distributed test.
For each dataset, we computed the reference solution using DCI [44], an efficient
sequential algorithm for frequent itemset mining (FIM).
APRed experimental results
First we present the results obtained using APRed , for which we only used the most
strict Absolute Similarity measure (α = 1, see appendix A) for accuracy testing.
Table 4.3 shows a summary of computation results for all datasets, obtained by
using a self-similarity threshold γ = 0.98 to determine minsup0 = r · minsup, where
r ∈ [0.5, 1].
Table 4.2: Datasets used in the APInterp experimental evaluation. When a dataset is referenced by a keyword prefixed by U (see Reference column), this means that it was partitioned in an unbalanced way, with partitions of significantly different sizes.

Dataset             Reference   #Trans.   Part. size
accidents-bal       A           340183    55778..57789
accidents-unbal     UA          340183    3004..84011
kosarak-bal         K           990002    163593..166107
kosarak-unbal       UK          990002    112479..237866
mushroom-bal        M           8124      1337..1385
mushroom-unbal      UM          8124      328..1802
retail-bal          R           88162     14307..14888
retail-unbal        UR          88162     6365..23745
pumbs-bal           P           49046     8044..8289
pumbs-unbal         UP          49046     1207..12138
pumbs-star-bal      PS          49046     8034..8291
pumbs-star-unbal    UPS         49046     3156..12089
connect-bal         C           67557     11086..11439
We have reported the absolute similarity of the approximate results to the exact ones, the number of globally frequent patterns, and the number of distinct discarded patterns, i.e. the patterns that are locally frequent but are discarded at points 2 and 7 of the master pseudo-code because they are not globally frequent.
Figure 4.1 shows several plots comparing the self-similarity used during the computation, i.e. based on the similarity between R′ and R′′, with the exact similarity between the global approximate results and the exact ones for r ∈ [0.5, 1].
If we pick a particular value of r in the plot, corresponding to a value of self-similarity γ, we can graphically find the similarity of the whole approximate solution to the exact one when r is used for the second part of the computation.
We have found that in sparse datasets the similarity is usually nearly equal to (or greater than) the self-similarity, so the proposed empirical determination of r should yield good results, even when the selection is slightly misled by an excessively good partial result on R′′. This is the case of the Accidents P2 dataset. Table 4.3
shows that APRed for this dataset chooses a support reduction factor of 0.95, and the
similarity of the final result is 95%, which is a remarkably good result. Nevertheless,
in the bottom left plot in Figure 4.1, we can see that by using a slightly smaller
reduction factor (0.75), it was possible to boost the similarity of the final result close
to 100%.
Figure 4.2 shows the number of discarded patterns (points 2 and 7 of the master pseudo-code) as a function of r. In order to highlight the effectiveness of the pruning based on F′1 and F′2 , we report curves relative to different types of pruning. Pruning local patterns using an approximate knowledge of F1 and F2 is enough to obtain a good reduction in the number of discarded patterns in most of the sparse datasets.
Table 4.3: Test results for APRed , obtained using the empirically computed local minimum support (minsup′′ = r′′ · minsup) for patterns with more than 2 items (for self-similarity threshold γ = 0.98).

Dataset   Min. supp.   r′′     Simil.   # Freq   # Discarded
A1        40 %         0.95    0.95     29646    678289
A2        40 %         0.95    0.96     28675    633908
K1        0.6 %        0.85    0.97     1132     1968
K2        0.3 %        0.80    0.99     4997     12379
M1        40 %         0.50    0.44     413      366
M2        40 %         0.50    0.07     399      288
R1        0.2 %        0.55    0.92     2675     7492
R2        0.2 %        0.55    0.91     2682     7786
T10-1     0.2 %        0.60    0.93     13205    31353
T10-2     0.2 %        0.65    0.94     13173    17444
T40-1     2 %          0.80    0.92     2293     19220
T40-2     2 %          0.85    0.96     2293     18186
The APRed algorithm performed worse on dense datasets, such as Accidents,
where too many locally frequent patterns are discarded, and Mushroom, where
similarity of approximate results to exact results was really low. Large data skews
seem to be a big issue for APRed , since in these cases several frequent patterns are
not returned at all (lots of false negatives, and thus small values for both Recall and
Similarity).
APInterp experimental results
The experiments were run for several minimum support values and for different partitionings of each dataset. In particular, except when showing the effects of varying
the minimum support and the number of partitions, we reported results corresponding to three and six partitions and to the two smallest minimum support thresholds
used, usually characterized by a difference of about one order of magnitude in execution time.
Table 4.4 shows a summary of computation results for all datasets, obtained for
three and six partitions using two different minimum support values. The first four
columns contain the code of the dataset and the parameters of the test. The next
two columns contain the number of frequent patterns contained in the approximate
solution and the execution time. The average support range column contains the
average distance between the upper and lower bounds for the support of the various
patterns, expressed as a percentage of the number of transactions in the dataset
(see Definition 14). The following columns show the precision and recall metrics
and the number of false positives/negatives. As expected, there are really few false
negatives and consequently the value of Recall is close to 100%, but the Precision is slightly smaller.
[Plots omitted: similarity and self-similarity (%) vs. r for datasets T10I4D100K P2 (min supp 0.2%), Kosarak P2 (min supp 0.3%), Accidents P2 (min supp 40%) and Mushroom P1 (min supp 40%).]
Figure 4.1: Similarity between the approximate distributed result and the exact one
for APRed . The most strict value (α = 1) was used for support difference weight. This
means that patterns with different supports are considered as not matching. Self-similarity is a measure used for similarity estimation during the distributed elaboration,
when true results are not available.
Unfortunately, since these metrics do not take into account the support, a false positive having true support really close to the threshold has the same weight as one having a very small support. The last columns contain the
similarity measure for the approximate results introduced in Definitions 12 and 13.
The very high value of fpSim proves that the false positives have an exact support close to the support threshold (but smaller than it, so that they are actually not frequent). This behavior, i.e. a lot of false positives with a value of fpSim close to 100%, is particularly evident for datasets K and UK.
Figure 4.3 shows a plot of the fpSim measure obtained for different datasets
partitioned among a variable number of slaves. As expected, the similarity is higher
when the dataset is partitioned into fewer partitions. However, in most cases there is no significant decrease.
We have also compared the similarity of the approximate result obtained using support interpolation to the Distributed One-pass Partition one. The results are shown in Figure 4.4. The proposed heuristic for support interpolation does improve similarity, in particular for small minimum support values. Since no false positives are produced by Distributed One-pass Partition, in this case fpSim would be identical to Sim, and thus this measure is plotted only for the APInterp algorithm.
[Plots omitted: ratio of discarded local patterns to frequent patterns vs. r for datasets T10I4D100K P2 (min supp 0.2%), Kosarak P2 (min supp 0.3%), Accidents P2 (min supp 40%) and Mushroom P1 (min supp 40%), with no pruning, F1 pruning and F1+F2 pruning.]
Figure 4.2: Relative number of distinct locally frequent patterns that are not globally
frequent as a function of r for different pruning strategies for APRed . They are discarded at points 2 and 7 of the master pseudo-code. This is a measure of the waste
of resources due to both data-skewness and minimum support lowering. Accidents,
a dense dataset, causes a lot of trashed locally frequent patterns.
Finally, we have verified the speedup of the APInterp algorithm, using only uniformly sized partitions. Figure 4.5 shows the measured speedup when an increasing
number of slaves is exploited. Note that when more slaves are used, the dataset has
to be partitioned accordingly.
The APInterp algorithm performed worse on dense datasets, such as Connect,
where too many locally frequent patterns are discarded when we add slaves. On the
other hand, in some cases we also obtained superlinear speedups. This could be due to the approximate nature of our algorithm: the support of several patterns could be computed even if some slaves do not participate in the elaboration.
Acknowledgment
The datasets used during the experimental evaluation are some of those used for
the FIMI’03 (Frequent Itemset Mining Implementations) contest [1]. Thanks to
the owners of these data and people who made them available in current format.
In particular, Karolien Geurts [21] for Accidents, Ferenc Bodon for Kosarak,
[Plot omitted: fpSimilarity (%) vs. number of partitions (1 to 6) for datasets A, C, K, M, P, PS, R, UA, UK, UPS and UR at the minimum supports listed in the Figure 4.3 legend.]
Figure 4.3: fpSim of the APInterp results relative to datasets partitioned in different
ways.
Tom Brijs [10] for Retail, and Roberto Bayardo for the conversion of UCI datasets. Other
datasets were generated using the publicly available synthetic data generator code
from the IBM Almaden Quest data mining project [6].
4.3 Conclusions
In this chapter, we have discussed APRed and APInterp , two new distributed algorithms for approximate frequent itemset mining.
The key idea of APRed is that by using a reduced minimum support
(r · minsup, r ∈ [0.5, 1]) for distributed local elaboration on dataset partitions,
without modifying the support threshold for global evaluation of fetched results,
we can be confident that the final approximate results obtained will be quite accurate. Moreover, even if we lower the support threshold, APRed remains efficient, and the amount of data sent to the master by the local slaves is relatively
small. This is due to a strong pruning activity: locally frequent candidate patterns
are in fact pruned by using an approximate knowledge of F2 (often discarding more
than 90% of globally infrequent candidate patterns).
In our tests, APRed performs particularly well on sparse datasets.
[Plot omitted: similarity (%) vs. minimum support (%) on dataset Kosarak with 6 unbalanced partitions, comparing Distributed One-pass Partition Sim(), APInterp Sim() and APInterp fpSim().]
Figure 4.4: Comparison of Distributed One-pass Partition vs APInterp .
In several cases, an 80% reduction of minsup is enough to achieve a similarity close to 100%. On the other hand, on most dense datasets the number of missing and spurious patterns is definitely too high.
APInterp , instead, exploits a novel interpolation method to infer unknown counts
of some patterns, which are locally frequent only in some dataset partitions. Since no support reduction is involved, APInterp is able to mine dense datasets for values of minsup that are too small to be used with APRed . For the same reason, the issues related to a bad choice of the support reduction factor (see the Accidents dataset case in the APRed results) are also avoided.
For dataset partitioning characterized by high data skew, the APInterp approach
is able to strongly improve the accuracy of the approximate results. Our tests prove
that this method is particularly suitable for several (mainly sparse) datasets: it yields a good accuracy and scales nicely.
the various datasets were characterized by a similarity above 99%. Even if some
false positives are found, the high similarity value computed on the whole result
set proves that the exact supports of these false positives are actually close to the
support threshold, and thus of some interest to the analyst.
The accuracy of the results is better than in the Distributed One-pass Partition case.
[Plot omitted: speedup vs. number of partitions (1 to 6) for datasets A (minsupp 20%) and K (minsupp 0.1%).]
Figure 4.5: Speedup for two of the experimental datasets, Kosarak(K) and Accidents
(A), with balanced partitioning.
The main reason for this is that the Distributed One-pass Partition algorithm yields, for any pattern, a support value that is the lower bound of the interval in which
the exact support is included. Hence, the count estimated by our algorithm, which
falls between the lower and upper bounds, is generally closer to the exact count than
the lower bound. Furthermore, the proposed interpolation schema does not increase
significantly the overall space/time complexity and is resilient to heavy skew in the
distribution of items.
Finally, both in APInterp and APRed , synchronization occurs just once, as in a naïve distributed Partition, and, differently from Partition, slaves do not have to be
polled for specific pattern counts, thus limiting potential privacy breaches related
to low support patterns.
Table 4.4: Accuracy indicators for APInterp results obtained using the maximum number of partitions and the lowest support.

Dataset  #slaves  Minsup%  Minsup(count)  #freq    Time(s)  Avg.Sup.Range(%)  Precision%  Recall%  False pos%  False neg%  Sim%   fpSim%
A        3        20.00    68036          899740   51.92    0.289             98.87       99.96    1.13        0.04        98.83  99.81
A        3        30.00    102054         151065   7.45     0.378             98.95       99.97    1.05        0.03        98.92  99.76
A        6        20.00    68036          912519   27.72    0.574             97.51       99.99    2.49        0.01        97.51  99.58
A        6        30.00    102054         152873   4.57     0.768             97.80       100.00   2.20        0.00        97.80  99.44
C        3        70.00    47289          4239440  56.09    2.401             97.37       99.93    2.63        0.07        97.30  98.70
C        3        80.00    54045          546795   6.79     2.894             97.67       99.98    2.33        0.02        97.65  98.73
C        6        70.00    47289          4335664  93.50    4.093             95.24       99.97    4.76        0.03        95.20  97.17
C        6        80.00    54045          560499   10.73    5.191             95.24       99.93    4.76        0.07        95.17  96.74
K        3        0.10     990            852636   68.77    0.013             88.81       99.11    11.19       0.89        88.10  99.20
K        3        0.20     1980           42963    8.14     0.033             89.53       98.05    10.47       1.95        87.97  98.24
K        6        0.10     990            947486   31.94    0.024             80.56       99.89    19.44       0.11        80.49  99.90
K        6        0.20     1980           59601    5.15     0.077             65.93       99.80    34.07       0.20        65.84  99.81
M        3        5.00     406            3773538  41.14    0.542             99.50       99.97    0.50        0.03        99.45  99.94
M        3        8.00     649            864245   8.78     0.862             76.13       99.98    23.87       0.02        76.12  98.62
M        6        5.00     406            3888898  67.61    0.899             96.57       100.00   3.43        0.00        96.52  99.81
M        6        8.00     649            926827   15.49    1.182             71.00       100.00   29.00       0.00        71.00  97.98
P        3        70.00    34332          2858126  39.07    3.766             94.42       99.99    5.58        0.01        94.40  97.37
P        3        80.00    39236          145435   2.14     3.170             97.57       99.81    2.43        0.19        97.39  98.52
P        6        70.00    34332          2921763  58.23    6.068             92.36       99.99    7.64        0.01        92.34  95.50
P        6        80.00    39236          152855   2.62     7.020             92.99       99.97    7.01        0.03        92.96  95.28
PS       3        25.00    12261          2177124  29.31    1.672             94.80       99.93    5.20        0.07        94.73  99.05
PS       3        30.00    14713          441472   5.80     1.238             97.82       99.78    2.18        0.22        97.61  99.34
PS       6        25.00    12261          2227435  45.9     2.526             92.72       99.99    7.28        0.01        92.69  98.45
PS       6        30.00    14713          444542   9.06     2.261             96.98       99.61    3.02        0.39        96.59  98.85
R        3        0.05     44             17766    0.86     0.005             91.07       99.89    8.93        0.11        90.97  99.90
R        3        0.10     88             6105     0.53     0.009             93.39       99.84    6.61        0.16        93.25  99.85
R        3        0.20     176            1902     0.34     0.018             94.59       99.82    5.41        0.18        94.42  99.82
R        6        0.05     44             18372    0.69     0.006             88.63       99.92    11.37       0.08        88.57  99.88
R        6        0.10     88             6190     0.41     0.010             92.47       99.88    7.53        0.12        92.37  99.89
R        6        0.20     176            1967     0.30     0.024             92.63       99.96    7.37        0.04        92.60  99.95
UA       3        20.00    68036          901687   66.72    0.309             98.68       99.98    1.32        0.02        98.66  99.84
UA       3        30.00    102054         151268   10.07    0.440             98.81       99.96    1.19        0.04        98.77  99.77
UA       6        20.00    68036          916744   35.19    0.639             97.06       99.99    2.94        0.01        97.05  99.41
UA       6        30.00    102054         152942   5.55     0.782             97.75       99.98    2.25        0.02        97.73  99.31
UK       3        0.10     990            818017   121.46   0.011             92.62       99.17    7.38        0.83        91.91  99.23
UK       3        0.20     1980           52212    11.65    0.062             74.21       98.54    25.79       1.46        73.40  98.89
UK       6        0.10     990            922792   45.30    0.020             82.76       99.95    17.24       0.05        82.72  99.94
UK       6        0.20     1980           49420    5.54     0.050             79.27       99.69    20.73       0.31        79.08  99.72
UP       3        70.00    34332          2800681  38.14    3.217             96.30       99.94    3.70        0.06        96.24  98.02
UP       3        80.00    39236          149253   2.25     5.101             95.17       99.91    4.83        0.09        95.07  97.04
UP       6        70.00    34332          2879809  56.82    5.216             93.71       99.99    6.29        0.01        93.69  96.66
UP       6        80.00    39236          152124   2.69     6.777             93.46       100.00   6.54        0.00        93.44  96.05
UPS      3        25.00    12261          2207340  29.79    2.102             93.53       99.96    6.47        0.04        93.48  99.11
UPS      3        30.00    14713          455973   6.26     1.980             94.90       99.98    5.10        0.02        94.88  99.19
UPS      6        25.00    12261          2162459  44.49    1.976             95.51       99.99    4.49        0.01        95.49  98.92
UPS      6        30.00    14713          453334   9.02     2.359             95.46       99.99    4.54        0.01        95.43  98.70
UR       3        0.05     44             17654    0.96     0.005             91.56       99.91    8.44        0.09        91.49  99.92
UR       3        0.10     88             6185     0.57     0.010             92.48       99.83    7.52        0.17        92.34  99.84
UR       3        0.20     176            1896     0.36     0.019             94.75       99.78    5.24        0.22        94.55  99.78
UR       6        0.05     44             17901    0.80     0.005             90.56       99.95    9.44        0.05        90.52  99.94
UR       6        0.10     88             6390     0.43     0.012             90.33       99.91    9.67        0.09        90.25  99.91
UR       6        0.20     176            1968     0.29     0.025             92.56       99.93    7.44        0.07        92.50  99.92
5 Streaming data
Many critical applications require a nearly immediate result based on a continuous
and infinite stream of data. In our case, we are interested in mining all frequent
patterns and their supports from an infinite stream of transactions. We begin this
chapter by describing the peculiarities of streaming data, then we will introduce
the problem of finding the most frequent items and itemsets in a stream, along with
some state of the art algorithms for solving them. Finally, we will describe our
contribution: a streaming algorithm for approximate mining of frequent patterns.
5.1 Streaming data
Before introducing the notation used in this chapter, we briefly summarize the notation previously used for frequent itemsets and frequent items. A dataset D is a collection of subsets of items I = {it1, . . . , itm}. Each element of D is called a transaction. A pattern x is frequent in dataset D with respect to a minimum support minsup, if its support is greater than σmin = minsup · |D|, i.e., the pattern occurs in at least σmin transactions, where |D| is the number of transactions in D. A k-pattern is a pattern composed of k items, Fk is the set of all frequent k-patterns, and F = ∪i Fi is the set of all frequent patterns. If D contains just transactions of one item, then all of the frequent patterns are 1-patterns. These patterns are named frequent items.
Since the stream is infinite, new data arrive continuously and results change
continuously as well. Hence, we need a notation for indicating that a particular
dataset or result is referred to a particular time interval. To this end, we write the
interval as a subscript after the entity. Thus D[t0 ,t1 ) indicates the part of the stream
received since t0 and before t1 . For the sake of simplicity we will write just D instead
of D[1,t] , when referring to all data received until current time t, if this notation is
not ambiguous. As usual, a square bracket indicates that the bound is part of the
interval, whereas a parenthesis indicates that it is excluded.
A pattern x is frequent at time t in the stream D[1,t] , with respect to a minimum support minsup, if its support is greater than σmin[1,t] = minsup · |D[1,t] |, i.e.
the pattern occurs in at least σmin[1,t] transactions, where |D[1,t] | is the number of
transactions in the stream D until time t. A k-pattern is a pattern composed of k
items, Fk[1,t] is the set of all frequent k-patterns, and F[1,t] is the set of all frequent
patterns.
5.1.1 Issues
The infinite nature of these data sources is a serious obstacle to the use of most traditional methods, since the available computing resources are limited. One of the first effects is the need to process data as they arrive. The amount of previously received data is usually overwhelming, so they can either be dropped after processing or archived separately in secondary storage. In the first case access to past data is obviously impossible, whereas in the second case the cost of data retrieval is likely to be acceptable only for some "ad hoc" queries, especially when several scans of past data are needed to obtain just one result.
Other important differences with respect to having all data available for mining at the same time concern the obtained results. As previously explained, both the data and the results evolve continuously. Hence a result refers to a part of the stream and, in our case, to the whole part of the stream preceding a given time t. Obviously, an algorithm suitable for streaming data should be able to compute the 'next step' solution on-line, starting from the previously known D[1,t−1) and the current data D[t−1,t), if necessary with some additional information stored along with the current solution. In our case, this information is the count of a significant part of the frequent single items, and a transaction hash table used for improving the deterministic bounds on the supports returned by the algorithm, as we will explain later in this chapter.
5.2 Frequent items
Even the apparently simple discovery of frequent items in a stream is challenging, since its exact solution requires storing a counter for each distinct item received. Some items may initially appear in a sporadic way and then become frequent, thus the only way to compute their support exactly is to maintain a counter since their first appearance. This could be acceptable when the number of distinct items is reasonably bounded. If the stream contains a large and potentially unbounded number of spurious items, as in the case of data whose occurrence probabilities follow Zipf's law, like internet traffic data, this approach may lead to a huge waste of memory. Furthermore, the number of distinct items is potentially proportional to the length of the stream. The Top Frequent items problem is closely related to the frequent items one, except that the user does not directly decide the support threshold: the result set contains only a given number of items having the highest supports. In this case too the resource usage is unbounded. This issue has been addressed by several approximate algorithms, which sacrifice the exactness of the result in order to limit the space complexity. In this section, we will formally introduce the problem, and
then we will describe some representative approximate algorithms for finding the set
of most frequent items.
5.2.1 Problem
Let D[1,n] = s1 , s2 , . . . , sn be a data stream, where each position in the stream si
contains an element of the items I = it1 , . . . , itm . Let item iti occur σ[1,n] (iti ) times
in D[1,n] . The k items having the highest frequencies are named the top-k items
whereas items whose frequencies are greater than σmin = minsup · |D| are named
frequent items.
As explained before and in [12], the exact solution of this problem is highly memory intensive. Two relaxed versions of this problem have been introduced in [12]: FindCandidateTop(S, k, l) and FindApproxTop(S, k, ε). The first one is exact and consists in finding a list of l items containing the k most frequent, whereas the second one is approximate. Its goal is to find a list of items having a frequency greater than (1 − ε) · σ[1,n](itk), where itk is the k-th most frequent item. FindCandidateTop can be very hard to solve for some input distributions, in particular when the frequencies of the k-th and the (l + 1)-th items are similar. In such cases, the approximate problem is more practical to solve. Several variations of the top-k frequent items problem have been proposed. The Hot Items problem, described in [40] for large datasets and, several years later, adapted to data streams ([16, 30]), is essentially the top-k frequent items problem formalized in a slightly different way.
The techniques used for solving this family of problems can be classified into two
large categories: count-based techniques and sketch-based techniques. The first ones
monitor a limited set of potentially "interesting" items, using a counter for each one of them. In this case, an error arises when an item is erroneously kept out of the set or inserted too late. The second family provides a frequency estimation for every
item by using a hash-indexed vector of counters. In this case, the risk of completely
missing the occurrences of an item is avoided, at the cost of looser guarantees on
the computed frequencies.
5.2.2 Count-based algorithms
Count-based algorithms maintain a set of counters, each one associated with a specific item. When the number of distinct items is expected to be high, there might not be enough memory for allocating all the counters. In this case, it is necessary to limit our attention to a set of items compatible with the available memory. Only the items in the monitored set have an associated counter, which is incremented upon their arrival. Other items just have an opportunity to replace one of the monitored items. In fact, in most methods the set of monitored items varies during the computation. Each algorithm of this family is characterized by the data structure it uses for the efficient maintenance of counters and by its policy for replacing "old" counters.
The Frequent algorithm
This method was originally proposed in [40] for large datasets and is inspired by an algorithm for finding a majority element. More recently, two unrelated works ([16, 30]) have described new versions adapted to streams.
A well-known algorithm for discovering the most frequent item in a set containing repetitions of two distinct items consists in removing pairs of distinct items from the set while this is possible. The elements left are all identical and their identity is the solution. In case there are more than two distinct items this method will still work, provided that a majority exists, i.e., the most frequent element has more than n/2 occurrences, where n is the stream length.
Algorithm 6 shows the most efficient implementation of Majority. It requires just
two variables: one contains the value currently supposed to be the majority and
the other is a counter, indicating a lower bound for the advantage of the current
candidate against any other opponent. At the end of the scan of the data, the only
possible candidate is known. In case we are not dealing with streams, a second scan
over the data will give the definitive answer.
Algorithm 6: Majority
input : data[1]...data[n]
output: majority element, if any
  candidate ← data[1];
  C ← 1;
  for i ← 2 to n do
      if C = 0 then candidate ← data[i];
      if candidate = data[i] then C ← C + 1;
      else C ← C − 1;
  end
  C ← 0;
  for i ← 1 to n do
      if candidate = data[i] then C ← C + 1;
  end
  if C ≤ n/2 then return NULL;
  else return candidate
In order to efficiently discard pairs, items are examined only when they arrive.
The candidate variable keeps track of the item currently prevailing, and a counter C indicates the minimum number of different items required to reach a tie, in other words, the number of items having the prevailing identity that are waiting to be matched with a different item. If a majority exists, i.e. if an item has support greater than n/2, it will be found. This is guaranteed by the fact that an item is discarded
only when paired with a different item, and, for the majority element, this cannot
happen for all of its occurrences.
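As an illustration, the following is a minimal Python transcription of Algorithm 6 (the function and variable names are ours and the snippet is only a sketch): the first loop is the streaming pass, while the verification pass requires a second scan and is therefore only feasible when the data can be stored.

def majority(data):
    """Return the element occurring in more than half of data, or None."""
    candidate, count = None, 0
    for item in data:                      # single forward scan (streaming pass)
        if count == 0:
            candidate = item
        if item == candidate:
            count += 1
        else:
            count -= 1
    # verification pass: only possible if the data are stored
    if candidate is not None and \
       sum(1 for item in data if item == candidate) > len(data) / 2:
        return candidate
    return None

print(majority(list("aabacaa")))           # -> 'a' (5 occurrences out of 7)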
In case the most frequent item has a support smaller than n/2, the behavior of the Majority algorithm is unpredictable. Furthermore, we may be interested in finding more than one of the top frequent items. The Frequent algorithm (Algorithm 7) is thus a generalization of Majority, and is able to deal with these two cases. Its goal is to find a set of m items containing every item having a relative support strictly greater than 1/m. The key idea is to keep a limited number m of counters and, when a
new item arrives, decrement every counter, and replace one of the items having the
counter value equal to zero, if there is any. In this way an item is always discarded
Algorithm 7: Frequent
input : data[1]...data[n]
output: superset of the items having relative support greater than 1/m
  C ← {};
  for i ← 1 to n do
      if ∃f (data[i], f) ∈ C then
          replace (data[i], f) with (data[i], f + 1);
      else if ∃item (item, 0) ∈ C then
          replace (item, 0) with (data[i], 1) in C;
      else if |C| < m then
          insert (data[i], 1) in C;
      else
          foreach (item, f) ∈ C do
              replace (item, f) with (item, f − 1) in C;
          end
      end
  end
  return {item : ∃f (item, f) ∈ C}
together with m − 1 occurrences of other symbols, or m when the incoming symbol
is discarded too because no counter has reached zero, i.e. a total of m or m + 1
symbols are discarded. Hence, if a frequent symbol x is discarded d times, either before or after its insertion in the counter set, then a total of at most d · (m + 1) ≤ n stream positions will be discarded. Since x is frequent, σ(x) > n/m > n/(m+1) ≥ d. Thus, an item that is frequent in the first n positions of the stream will be in the set of counters after the processing of the n-th position.
In order to manage counters efficiently, a specifically designed data structure is
required. In particular, the insertion, update and removal of a counter, as well as the decrement of the whole set of counters, need to be optimized. Both [16] and [30] propose a data structure based on differential support encoding and a mix of hash tables and doubly linked lists, which grants O(1) worst-case amortized time complexity and an O(m) worst-case space bound.
This algorithm, in its original formulation, finds just a superset of the frequent items, with no indication of their supports and no guarantee on the absence of false positives. In the case of an ordinary dataset, both issues can be avoided with a second scan over the dataset but, on streaming data, this is not possible. However, if we are allowed to use some additional space, it is also possible to find an estimate of the actual support of each item, together with an upper bound. In order to reach this goal, we need to maintain an additional counter which is never decreased, corresponding to a lower bound on the support of each item, and a constant value indicating the maximum number of occurrences that may have preceded the insertion in the counter set. Since the Frequent algorithm is correct, this amount is σmin[1,t] − 1, the maximum integer smaller than the support threshold for the corresponding stream portion. Furthermore, it is possible to exclude from the result set every item having a support under a specified value by increasing the number of counters and applying a post-filter, as described in [29] for itemsets.
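The following Python sketch (our own simplified rendering, not the optimized structure of [16, 30]) puts together Algorithm 7 and the extension just described: counters is the decrement-based structure used for replacement decisions, while lower is never decremented and provides a lower bound on the support of each monitored item since its last insertion.

def frequent(stream, m):
    counters = {}                  # item -> replacement counter (Algorithm 7)
    lower = {}                     # item -> occurrences seen while monitored
    for item in stream:
        if item in counters:
            counters[item] += 1
            lower[item] += 1
        else:
            # reuse a counter that reached zero, if any
            zero = next((x for x, c in counters.items() if c == 0), None)
            if zero is not None:
                del counters[zero], lower[zero]
                counters[item], lower[item] = 1, 1
            elif len(counters) < m:
                counters[item], lower[item] = 1, 1
            else:
                for x in counters:         # discard the incoming item together with
                    counters[x] -= 1       # one occurrence of each monitored item
    return counters, lower

# With m = 2, every item whose relative support exceeds 1/2 is retained.
cnt, low = frequent(list("abacaadaa"), 2)
print(cnt, low)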
The Lossy count algorithm
The Lossy Count algorithm (Algorithm 8) was introduced in [33]. Its main advantages over the original formulation of Frequent are the presence of a constraint on false positives and the computation of an approximate support, similarly to the modified version of Frequent. Furthermore, it is easily extensible to frequent itemsets, as we will see later in this chapter. The kind of solution this algorithm finds is called an ε-deficient synopsis and consists of a result set containing every frequent item, but no item having relative support less than minsup − ε, along with a support approximation that is smaller than the exact relative support by at most ε.
The algorithm manages a set C of items, each associated with a counter and a bound on its error. When a new item x arrives and x is known, its counter is incremented. Otherwise a new entry (item, 1, bucket − 1) is inserted in C, where bucket is the number of blocks of w = 1/ε elements seen so far, and bucket − 1 is the maximum number of previously missed occurrences of item x. The algorithm is guaranteed to maintain the support correctly in the ε-deficient synopsis. Hence, at the beginning of a new block it is possible to delete every counter having a best-case estimated support less than the error it would have if reinserted from scratch, which is equal to bucket − 1. Since estimated frequencies are less than true frequencies by at most ε, in order to get every frequent item but no item having relative support less than minsup − ε it is enough to return only the items whose upper bound for the support, f + ∆, is at least (minsup − ε) · n.
The Sticky Sampling algorithm
Both [16] and [33] also propose some non-deterministic methods. The idea is to keep the most frequent counters and delete the others in order to free space for new, potentially frequent items. The way this is done, however, is different in the
Algorithm 8: Lossy Count
input : data[1]...data[n], minsup, ε
output: set containing every item having support greater than minsup · n and no item whose support is less than (minsup − ε) · n
  bcurrent ← 1;
  C ← {};
  for i ← 1 to n do
      if (∃f, ∆) (data[i], f, ∆) ∈ C then
          replace (data[i], f, ∆) with (data[i], f + 1, ∆);
      else
          insert (data[i], 1, bcurrent − 1) in C;
      end
      if i mod ⌈1/ε⌉ = 0 then
          bcurrent ← bcurrent + 1;
          foreach (item, f, ∆) ∈ C do
              if f + ∆ < bcurrent − 1 then
                  remove (item, f, ∆) from C;
              end
          end
      end
  end
  return {item : (∃f, ∆) (item, f, ∆) ∈ C ∧ f + ∆ > (minsup − ε) · n}
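A direct Python transcription of Algorithm 8 is sketched below (we assume a bucket width w = ⌈1/ε⌉ and keep the same pruning and output conditions as in the pseudocode above).

import math

def lossy_count(stream, minsup, eps):
    w = math.ceil(1 / eps)                 # bucket width
    bucket, n, C = 1, 0, {}                # C: item -> (f, delta)
    for item in stream:
        n += 1
        if item in C:
            f, delta = C[item]
            C[item] = (f + 1, delta)
        else:
            C[item] = (1, bucket - 1)
        if n % w == 0:                     # bucket boundary: prune the counters
            bucket += 1
            for x, (f, delta) in list(C.items()):
                if f + delta < bucket - 1:
                    del C[x]
    # items whose upper bound exceeds the relaxed threshold
    return {x: f for x, (f, delta) in C.items() if f + delta > (minsup - eps) * n}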
Algorithm 9: Sticky Sampling
input : data[1]...data[n], minsup, ε, δ
output: set containing every item having support greater than minsup · n and no item whose support is less than (minsup − ε) · n, with probability of failure δ
  C ← {};
  t ← (1/ε) · log(1/(minsup · δ));
  block_len ← 2 · t;
  rate ← 1;
  for i ← 1 to n do
      if i mod block_len = 0 then
          rate ← 2 · rate;
          block_len ← t · rate;
          // correct counters
          foreach (item, f) ∈ C do
              while binomial(1, 1/2) = 0 do replace (item, f) with (item, f − 1);
          end
      end
      if ∃f (data[i], f) ∈ C then
          replace (data[i], f) with (data[i], f + 1);
      else if binomial(1, 1/rate) = 1 then
          insert (data[i], 1) in C;
      end
  end
  return {item : ∃f (item, f) ∈ C ∧ f > (minsup − ε) · n}
two cases. Probabilistic-Inplace [16] discards one half of the counters every r received items and selects the first items found immediately after the discard occurs. Sticky Sampling [33] (Algorithm 9) uses, instead, a uniform sampling strategy over the whole stream. In order to keep the number of counters probabilistically bounded, the sampling rate is decreased for increasing stream lengths, and the previously known frequencies are corrected to reflect the new rate using a stochastic method.
5.2.3 Sketch-based algorithms
Like count-based algorithms, sketch-based ones also maintain a set of counters but,
instead of associating the counters with particular items, they are associated with
different overlapping groups of items. The analysis of the values of the counters for
the various groups containing an item allows us to give an estimate of its support.
In this approach there is no notion of monitored item, and the support estimation
is possible for any item. Algorithms included in this family, as in the case of the
Count-based family, share the same basic skeleton. The main differences are in
the management of the counters, the kind of other queries that can be answered
by using the same count-sketch and the exact function used for support estimation,
which directly influence the space requirements based on the user selected acceptable
error probability. In [15] G.Cormode and S.Muthukrishnan present their particularly
flexible Count-Min Sketch data structure as well as a good comparison to other state
of the art sketch techniques. We will adopt their unification framework in order to
describe a generic sketch based algorithm.
A sketch is a two dimensional array of dimension w by d. Let m be the number
of distinct items, h1 . . . hd be hash functions mapping {1 . . . m} into {1 . . . w} and
let g1 . . . gd be other hash functions defined on items. The (j, k) entry of the sketch
is defined to be
$$\sum_{i\,:\,h_k(i)=j} \sigma(i) \cdot g_k(i)$$
In other words, when an item i arrives, for each k ∈ {1 . . . d} the entry (hk (i), k)
is increased by the amount gk (i), which is algorithm dependent. Thus, the update
time complexity is O(d) and the space complexity is O(wd), provided that the hash
functions can be stored efficiently. The way the data structure is used in order to
answer a particular query, the required randomness and independence of the hash
functions, as well as the minimum size of the sketch array needed to guarantee the
fulfillment of probability of error constraints, are algorithm dependent.
A particularly simple count sketch is Count-Min [15]. In its case the values of
functions gk (item) are always 1, i.e. each counter is incremented by one each time
an item is transformed into its identifier by a hash function. The approximate value
is computed as the smallest of the counters associated with an item by any hash
function. Since several items can be hashed to the same location, the approximate value is always greater than or equal to the exact one. The two fragments of pseudo-code show
the simple updateSketch procedure and approxSupport function used by Count-Min
sketches.
Procedure updateSketch(sketch, item) - Count-Min sketch
  foreach k ∈ {1 . . . d} do
      sketch[hk(item), k] ← sketch[hk(item), k] + 1;
  end

Function approxSupport(sketch, item) - Count-Min sketch
  return min_{k ∈ {1...d}} sketch[hk(item), k];
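A compact Python version of a Count-Min sketch, following the two fragments above, is sketched here; the hash family is emulated with Python's built-in hash() applied to salted tuples, which is only a stand-in for the pairwise independent functions required by the analysis in [15].

import random

class CountMinSketch:
    def __init__(self, width, depth, seed=0):
        self.w, self.d = width, depth
        rnd = random.Random(seed)
        self.salts = [rnd.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, k):
        return hash((self.salts[k], item)) % self.w

    def update(self, item, count=1):
        for k in range(self.d):            # g_k(i) = 1 for Count-Min
            self.table[k][self._bucket(item, k)] += count

    def approx_support(self, item):
        # the minimum over the d rows never underestimates the true count
        return min(self.table[k][self._bucket(item, k)] for k in range(self.d))

cms = CountMinSketch(width=100, depth=4)
for x in "abracadabra":
    cms.update(x)
print(cms.approx_support("a"))             # >= 5, the exact count of 'a'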
Other count sketch based methods are described in [12, 17, 28, 14].
5.3 Frequent itemsets
In this section, we introduce a new algorithm for approximate mining of frequent
patterns from streams of transactions using a limited amount of memory. In most
cases, finding an exact solution is not compatible with the limited available resources and real-time constraints, but an approximation of the exact result is enough for most
purposes. The proposed algorithm consists in the computation of frequent itemsets
in recent data and an effective method for inferring the global support of previously
infrequent itemsets. Both upper and lower bounds on the support of each pattern
found are returned along with the interpolated support. Before introducing our
algorithm, we will shortly describe two other algorithms for approximate frequent
itemset mining. Then we will give an overview of APStream , our algorithm, followed
by a more detailed description and an extensive experimental evaluation showing
that APStream yields a good approximation of the exact global result considering
both the set of patterns found and their support.
5.3.1 Related work
The frequent itemset mining problem on streams of transactions (input itemsets)
poses additional memory and computational issues due to the exponential growth
of solution size with respect to the corresponding problem on streams of items. Here
we describe two representative approximate algorithms.
The Lossy Count algorithm for frequent itemsets
Manku and Motwani proposed in [33] an extension of their Lossy Count approximate
algorithm to the case of frequent itemsets. A straightforward conversion of Lossy Count, using the same data structure in order to store the support of patterns as the transactions arrive, is possible, but it would be highly inefficient. This is due to the exponential number of patterns supported by each transaction. Actually, it would be the same as computing the full set of itemsets with no support constraint and periodically removing the infrequent patterns. In order to avoid this issue, the authors process the transactions in blocks, so that the Apriori constraint may be applied.
The algorithm is very similar to the one previously described for items, so we will focus on the differences. The most notable one is that the transactions are processed in batches containing several buckets of size 1/ε. As many transactions as the available memory can fit are buffered and then mined, using the number of buckets β as minimum support. This is roughly equivalent to searching for patterns appearing at least once in each bucket, but more efficient. Every pattern x with support f in the transactions currently buffered is inserted in the set of counters as (x, f, bucket − β), where bucket indicates the last bucket contained in the buffer. At the same time the support of every pattern already in the counter set is checked in the current buckets, updating the counters if needed and removing the patterns that no longer satisfy the f + ∆ > bucket inequality. Clearly, in order to avoid the insertion of spurious patterns in the counter set, β should be a large number. Hence, a larger available memory increases the accuracy and reduces the running time.
The Frequent algorithm for frequent itemsets
In [29] R. Jin and G. Agrawal propose SARM, a new algorithm for frequent itemset mining based on Frequent [30]. Also in this case, the immediate extension of the base algorithm has serious shortcomings. This is mainly due to the potentially high number of frequent patterns. While in the frequent items case just 1/minsup counters are needed, for frequent itemsets one of the arguments used in the correctness proof is no longer true. In fact, in a stream of n transactions there can be more than n/minsup k-patterns having support greater than minsup. More precisely, there can be l · n/minsup frequent items, $\binom{l}{2}$ · n/minsup frequent pairs, and in general $\binom{l}{k}$ · n/minsup frequent k-patterns, where l is the length of the transactions. Since the maximum length of frequent patterns is unknown before the computation, the user would need to specify the maximal pattern length, maxlen, to use in order to correctly size the counter set. Thus the number of counters needed for the computation of frequent itemsets would be
$$\frac{1}{minsup} \sum_{k=1}^{maxlen} \binom{l}{k}$$
Furthermore, unless the transactions are processed in batches as in Lossy Count, all
the subpatterns of each incoming transaction need to be examined.
In order to avoid these side effects, the SARM algorithm maintains separate sets
Lk of potentially frequent itemsets, one for each different pattern length k. These
sets are updated using a hybrid approach: SARM updates L1 and L2 using the same
method proposed in Frequent, and at the same time buffers transactions for a level-wise batched processing. When a transaction t arrives, it is inserted in a buffer, and both L1 and L2 are updated, either by incrementing the count for already known patterns or by inserting the new ones. If the size of L2 exceeds the limit f · 1/(minsup · ε), where ε ∈ [0, 1] is a factor used for increasing the accuracy and f is the average number of 2-patterns per transaction, then the size of L2 is reduced by executing the CrossOver operation, which consists in decreasing every counter and removing, as in Frequent, the patterns having a count equal to zero. Every time this operation is performed, the transaction buffer is processed. For increasing values of k > 2, the k-patterns appearing in the buffer and having all subpatterns included in Lk−1 are used for updating Lk. Then the buffer is emptied and the CrossOver operation is applied to each Lk.
The ε ∈ [0, 1] factor can be used for enforcing a bound on the result accuracy. If ε < 1 then no itemset having relative support less than (1 − ε) · minsup will be in the result set. Thus F^minsup ⊆ L ⊆ F^((1−ε)·minsup), where L is the result set and F^s is the set of itemsets whose support exceeds s. When ε = 1 the SARM algorithm is not able to give
any guarantee on the accuracy, as the Frequent algorithm. Furthermore, both Lossy
Count for itemsets and SARM ignore previous potential occurrences of a pattern
when it is inserted into the set of frequent patterns. In the case of Lossy Count the
maximum number of neglected occurrences is returned along with the support, but
no other information available during the stream processing is exploited.
5.3.2 The APStream algorithm
In order to overcome these limitations APStream (Approximate Partition for Stream),
the algorithm we propose, uses the available knowledge on the support of other patterns to estimate a support for previously disregarded ones. The APStream algorithm
was inspired by Partition [55], a sequential algorithm that divides the dataset into
several partitions processed independently and then merges local solutions. The
adjectives global and local refer to temporal locality: they are used in conjunction with properties of, respectively, the whole stream and just a relatively small and contiguous part of the stream, hereinafter called a block of transactions. Furthermore, we suppose that each block corresponds to one time unit: hence, D[1,n) will indicate the first n−1 data blocks, and Dn the n-th block. This hypothesis allows us to adopt a lighter notation and causes no loss of generality.
The Streaming Partition algorithm. The basic idea exploited by Partition is the
following: if the dataset is divided into several partitions, then each globally frequent
pattern must be locally frequent in at least one partition. This guarantees that the
union of all local solutions is a superset of the global solution. However, one further
pass over the database is necessary to remove all false positives, i.e. patterns that
result locally frequent but globally infrequent.
In order to extend this approach to a stream setting, blocks of data received from
the stream are used as an infinite set of partitions. A block of data is processed as
soon as "enough" transactions are available, and the results are merged with the current approximate result, which refers to the past part of the stream. Unfortunately, in the stream case only recent raw data (transactions) can be kept available for processing due to memory limits, thus the usual Partition second pass is restricted to the accessible data. Only the partial results extracted so far from previous blocks, and some other additional information, are available for determining the global result set, i.e. the frequent patterns and their supports. One naïve work-around is to avoid the second pass and keep in the result set only the patterns for which the sum of the known supports, i.e. of the supports in the blocks mined so far where the pattern resulted locally frequent, is greater than (or equal to) minsup. We will name this algorithm Streaming Partition. The first time a pattern
x is reported, its support corresponds to the support computed in the current block.
In case it appeared previously, this means introducing an error. If j is the first block
where x is frequent, then this error can be at most σmin[1,j] − 1. This is formalized
in the following lemma.
Lemma 11 (Bounds on support after first pass). Let P = {1, ..., n} be the set of indexes of the n blocks received so far. Then let fpart(x) = {j ∈ P | σj(x) > minsup · |Dj|} be the set of indexes of the blocks where the pattern x is frequent, and let $\overline{fpart}(x) = P \setminus fpart(x)$ be its complement. The support of a pattern x is no less than the support computed by the Streaming Partition algorithm (σ(x)^lower) and is less than or equal to σ(x)^lower plus the maximum support the same pattern can have in the blocks where it is not frequent:
$$\sigma(x)^{lower} = \sum_{j \in fpart(x)} \sigma_j(x), \qquad \sigma(x)^{upper} = \sigma(x)^{lower} + \sum_{j \in \overline{fpart}(x)} \left( minsup \cdot |D_j| - 1 \right)$$
Note that when a pattern x is frequent in a block Dj , its local support is summed
to both the upper and lower bounds. Otherwise, its local support can range from 0
(no occurrence) to the local minimum support threshold minus one (i.e. minsup ·
|Dj | − 1), thus the lower bound remains the same, whereas the upper bound is
increased. We can easily transform the two absolute bounds defined above into the
corresponding relative ones, usable to calculate the Average Support Range, defined
in appendix A:
$$sup(x)^{upper} = \frac{\sigma(x)^{upper}}{|D|}, \qquad sup(x)^{lower} = \frac{\sigma(x)^{lower}}{|D|}, \qquad \text{where } |D| = \sum_{j=1}^{n} |D_j|$$
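As a concrete illustration of the lemma (with purely hypothetical numbers), the absolute bounds can be computed as follows; sigma[j] holds the local support of x in block j when x was locally frequent there, and None when its local count is unknown.

def partition_bounds(sigma, block_sizes, minsup):
    lower = sum(s for s in sigma if s is not None)
    upper = lower + sum(minsup * n - 1 for s, n in zip(sigma, block_sizes) if s is None)
    return lower, upper

# x locally frequent in blocks 1 and 3, unknown in block 2 (1000 transactions each)
print(partition_bounds([30, None, 45], [1000, 1000, 1000], minsup=0.02))
# -> (75, 94.0): the unknown block contributes at most 0.02 * 1000 - 1 = 19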
Streaming Partition has serious resource usage issues. In order to keep track of
frequent itemsets, a counter for each distinct pattern found to be frequent in at
least one block is needed. This obviously leads to an unacceptable memory usage in
most cases. The only way to overcome this limitation is introducing some kind of
forget policy: in the remainder of this chapter, when we refer to Streaming Partition we mean Streaming Partition with the deletion of the patterns that turn out to be globally infrequent after each block is processed. Another problem with Streaming Partition is that for every pattern the computed support is a very conservative estimate, since it always chooses the lower bound to approximate the result.
Generally, any algorithm returning a support value between the bounds will
have better chances of being more accurate. Following this idea, we devised a new
algorithm based on Streaming Partition that uses a smart interpolation of support.
Moreover, it is resilient to skewed item distributions.
The APStream algorithm.
The streaming algorithm we propose, APStream , tries to overcome some of the problems encountered by Streaming Partition and other similar algorithms for association
mining on streams when the data skew between different incoming blocks is high.
The most evident problem is that several globally infrequent patterns may be locally
frequent, increasing both resource utilization and execution time of these algorithms.
APStream addresses this issue by means of global pruning based on historical exact
(when available) or interpolated support: each locally frequent pattern that is not
globally frequent according to its interpolated support will be immediately removed
and will not produce any child candidate. Moreover this skew might cause a globally
frequent pattern x to result infrequent on a given data block Di . In other words,
since σi (x) < minsup·|Di |, x will not be found as a frequent pattern in the ith block.
As a consequence, we will not be able to count on the knowledge of σi (x), and thus we
cannot exactly compute the support of x. Unfortunately, Streaming Partition might also deduce that x is not globally frequent, because $\sum_{j \neq i} \sigma_j(x) < minsup \cdot |D|$.
Result merge and interpolation
When an input block Di is available for processing, APStream extracts its frequent itemsets using the DCI algorithm. Then, for each pattern x included either in the past combined results or in the recent FIM results, it computes the approximate global support σ[1,i](x)^interp in different ways, according to the specific situation. The approximate past support σ[1,i)(x)^interp was obtained by merging the FIM results of blocks D1 . . . Di−1 using the technique discussed here. σ[1,i)(x)^interp can be either known or not, depending on the presence of x in the past combined results. In the same way, σi(x) is known only if x is frequent in Di. The following table summarizes the possible cases and the action taken by APStream:

σ[1,i)(x)^interp | σi(x)   | Action
known            | known   | sum σi(x) to past support and bounds
known            | unknown | recount σi(x) on recent, still available, data
unknown          | known   | interpolate the past support σ[1,i)(x)^interp
The first case is the simplest to handle: the new support σ[1,i](x)^interp will be the
sum of σ[1,i) (x)interp and σi (x). Since σi (x) is exact, the width of the error interval
will remain the same. The second one is similar, except that we need to look at
recent data for computing σi (x). The key difference with Streaming Partition is the
handling of the last case. APStream , instead of supposing that x never appeared in
the past, tries to interpolate σ[1,i) (x).
The interpolation is based on the knowledge of:
• the exact support of each item (or optionally just the approximate support of
a fixed number of most frequent items)
• the reduction factors of the support count of subpatterns of x in current block
with respect to its interpolated support over the past part of the stream.
The algorithm will thus deduce the unknown support σ[1,i) (x) of itemset x on
the part of the stream preceding the ith block as follows:
$$\sigma_{[1,i)}(x)^{interp} = \sigma_i(x) \cdot \min_{item \in x} \left\{ \min\left( \frac{\sigma_{[1,i)}(item)}{\sigma_i(item)},\ \frac{\sigma_{[1,i)}(x \setminus item)^{interp}}{\sigma_i(x \setminus item)} \right) \right\}$$
In the previous formula, the result of the inner min is the minimum between the ratio of the past and recent supports of an item contained in pattern x and the same ratio computed for the itemset obtained from x by removing that item. Note that during the processing of recent data the search space is visited level-wise, and the merge of the results is also performed starting from the shorter patterns. Hence the interpolated supports σ[1,i)(x \ item)^interp of all the (k−1)-subpatterns of a k-pattern x are known. In fact, each support can be either known from the processing of the past part of the stream or computed during the previous iteration on recent data.
Example of interpolation. Suppose that we have received 440 transactions so
far, and that 40 of these are in the current block. The itemset {A, B, C}, briefly
indicated as ABC, is frequent locally whereas it was infrequent in previous data.
Table 5.1 reports the support of every subpattern involved in the computation.
The first column contains the patterns, the second and third columns contain the
supports of the patterns in the last received block and in the past part of the stream.
Finally, the last column shows the reduction ratio for each pattern.
x         | σi(x) | σ[1,i)(x)^interp | σ[1,i)(x)^interp / σi(x)
{A, B, C} | 6     | ?                | ?
{A, B}    | 8     | 50               | 6.2
{A, C}    | 12    | 30               | 2.5
{B, C}    | 10    | 100              | 10
{A}       | 17    | 160              | 9.4
{B}       | 14    | 140              | 10
{C}       | 18    | 160              | 8.9
{}        | 40    | 400              | -

Table 5.1: Sample supports and reduction ratios (σmin[1,t) = 20).

The algorithm examines the itemsets of size k − 1 (two in this simple example) and the single items, and chooses the one having the minimum ratio. In this case the minimum is 2.5, corresponding to the subpattern {A, C}. Since in recent data the support of the itemset x = {A, B, C} is σi(x) = 6, the interpolated support will be σ[1,i)(x)^interp = 6 · 2.5 = 15.
It is worth remarking that this method works if the support of larger itemsets decreases similarly in most parts of the stream, so that a reduction factor (different for each pattern) can be used to interpolate the unknown values. Finally, note that the interpolated value above should satisfy the inequality σ[1,i)(x)^interp < minsup · |D[1,i)|. If it is not satisfied, the interpolated result should not be accepted since, otherwise, the exact value σ[1,i)(x) would have already been found. Hence, in those few cases where the above inequality does not hold, the interpolated value is set to σ[1,i)(x)^interp = (minsup · |D[1,i)|) − 1. In the example described in Table 5.1 the interpolated support for {A, B, C} is 15 and the minimum support threshold for the past data is 20, so the bound is respected. Otherwise, the interpolated support would have been forced to 19.
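The interpolation step can be sketched in a few lines of Python (the names and data layout are ours); run on the values of Table 5.1 it reproduces the interpolated support of {A, B, C} computed above.

def interpolate(x, sigma_recent, sigma_past, past_minsup_count):
    ratios = []
    for item in x:
        ratios.append(sigma_past[frozenset([item])] / sigma_recent[frozenset([item])])
        sub = frozenset(x) - {item}
        ratios.append(sigma_past[sub] / sigma_recent[sub])
    interp = sigma_recent[frozenset(x)] * min(ratios)
    # the interpolated past support must stay below the past minsup threshold
    return min(interp, past_minsup_count - 1)

recent = {frozenset('A'): 17, frozenset('B'): 14, frozenset('C'): 18,
          frozenset('AB'): 8, frozenset('AC'): 12, frozenset('BC'): 10,
          frozenset('ABC'): 6}
past = {frozenset('A'): 160, frozenset('B'): 140, frozenset('C'): 160,
        frozenset('AB'): 50, frozenset('AC'): 30, frozenset('BC'): 100}
print(interpolate('ABC', recent, past, past_minsup_count=20))   # -> 15.0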
The proposed interpolation schema yields a better approximation of exact results than Streaming Partition, in particular with respect to the approximation of
the support of frequent patterns. The supports computed by the latter algorithm
are, in fact, always equal to the lower bounds of the intervals containing the exact
support of any particular pattern. Hence any kind of interpolation producing an
approximate result set, whose supports are between the interval bounds, should be,
generally, more accurate than picking always its lower bound. For the same reason
the computed support values should be also more accurate than those computed by
Lossy Count (Frequent does not return any support value). Obviously, several other ways of computing a support interpolation could be devised. Some are as simple as the average of the bounds, while others are as complex as counting inference, used in
a different context in [43]. We chose this particular kind of interpolation because it
is simple to calculate, since it is based on data that we already maintain for other
purposes, and it is aware of the underlying data enough to allow for accurate handling of datasets characterized by data-skew on item distributions among different
blocks.
We can finally introduce the pseudo-code of APStream . As in Streaming Partition
the transactions are received and buffered. DCI, the algorithm used for the local
computations, is able to exactly know the amount of memory required for mining a dataset during the intersection phase. Since frequent patterns are processed
sequentially and can be offloaded to disk, the memory needed for efficient computation of frequent patterns is just that used by the bitmap representing the vertical
dataset and can be computed knowing the number of transactions and the number
of frequent items.
Procedure processBlock(frequentItems, buffer, globFreq)
  locFreq[1] ← frequentItems;
  k ← 2;
  while locFreq[k − 1].size ≥ k do
      locFreq[k] ← computeFrequent(k, locFreq, globFreq);
      if k = 2 then VD ← fillVerticalDataset(buffer, frequentItems);
      commitInsert(VD, k, locFreq, globFreq);
      k ← k + 1;
  end

Procedure commitInsert(VertData, k, locFreq, globFreq)
  foreach pat ∈ globFreq[k] : pat ∉ locFreq[k] do
      compute the support of pat in VertData;
      if pat is frequent then
          pre-insert pat in globFreq[k];
      end
  end
  replace globFreq[k] with the sorted insert buffer;

Function computeFrequent(k, locFreq, globFreq)
  compute the locally frequent patterns;
  foreach locally frequent pattern pat do
      compute its global interpolated support and bounds;
      if pat is globally frequent then
          insert pat in locFreq[k];
          pre-insert pat in globFreq[k];
      end
  end
  return locFreq[k];
Thus, we can use this knowledge in order to maximize the size of the block of
transactions processed at once. For the sake of simplicity we will neglect the quite
obvious main loop with code related to buffering, concentrating on the processing
of each data block. The interpolation formula has been omitted too, in the pseudocode, for the same reason.
Each block is processed, visiting the search space level-wise, to discover frequent patterns. In this way, itemsets are sorted according to their length and the interpolated support of the frequent subpatterns is always available when required. The processing of the patterns of length k is performed in two steps. First the frequent patterns are computed in the current block, and then the actual insertion into the current set of frequent patterns is carried out. When a pattern is found to be frequent in the current block, its support on past data is immediately checked: if it was already known, then the local support is summed to the previous support and bounds. Otherwise, a support and a pair of bounds are inferred for the past data and summed to the support in the current block. In both cases, if the resulting support passes the support test, the pattern is queued for delayed insertion. After every locally frequent pattern of the current length k has been processed, the support of every previously known pattern that is not locally frequent is computed on recent data. Patterns passing the support test are queued for delayed insertion too. Then the set of pre-inserted itemsets is sorted and the actual insertion takes place.
Bounds on computed support errors
As a consequence of using an interpolation method to guess an approximate support
value in the past part of the stream, it is very important to establish some bounds
on the support found for each pattern. In the previous subsection, we have already
indicated a pair of really loose bounds: a support cannot be negative, and if a pattern was not frequent in a time interval then its interpolated support should be less than the minimum support threshold for the same interval. The lower bound is obviously always satisfied, whereas when a support value σ[1,i−1](x)^interp breaks its upper bound, it is forced to (minsup · |D[1,i−1]|) − 1, which is the greatest value compatible with the bound. This criterion is always valid for non-evolving distributed datasets (distributed frequent pattern mining) or for the first two data blocks of the stream. In the stream case, the upper bound is based on
previous approximate results, and could be inexact if the pattern corresponds to a
false negative. Nevertheless, it does represent a useful indication.
Bounds based on pattern subsets The first bounds that interpolated supports should obey derive from the Apriori property: no itemset can have a support greater than that of any of its subsets. Since recent results are merged level-wise with previously known ones, the interpolation can exploit the already interpolated subset supports. When a subpattern is missing during interpolation, this means that it has been examined during a previous level and discarded. In that case, all of its supersets may be discarded as well. The computed bound is thus affected by the approximation of past results: a pattern with an erroneous support will affect the bounds of each of its supersets. To avoid this issue it is possible to compute the upper bound for a pattern x simply using the upper bounds of its sub-patterns instead of their supports. In this way, the upper bounds will be weaker, but there will be fewer false negatives due to erroneous bound enforcement.
Bounds based on transaction hash In order to address the issue of error propagation in support bounds we need to devise some other kind of bounds that are
computed exclusively from received data and thus are independent of any previous results. Such bounds can be obtained using inverted transaction hashes. This
technique was first introduced in the algorithm IHP [26], an association mining algorithm where it was used for finding upper bounds for the support of candidates
in order to prune the infrequent ones. As we will show, this method can also be used for lower bounds. The key idea is that each item has an associated hashed set of counters that are accessed by using the transaction id as a key. In more detail, each array hcnt[item] associated with an item is an array of hsize counters initialized to zero. When the tid-th transaction t = {ti} is processed, a hash function transforms the tid value into an index to be used for the array of counters. Since tids are consecutive integer numbers, a trivial hash function such as h(tid) = tid mod hsize will
guarantee an equal repartition of transactions among all hash bins. For each item
ti ∈ t the counter at position h(tid) in the array hcnt[ti ] is incremented.
The hash function implicitly subdivides the transactions of the dataset. Each
partition corresponds to a position in the array of counters, while the value of
each counter represents the number of occurrences of an item in a given set of
transactions. These hashes are a sort of ”compressed” tid-list and can be intersected
to obtain deterministic bounds for the number of occurrences of a specified pattern.
Notably these arrays of counters have a fixed size, independent from the number of
transactions processed.
Let hsize = 1, let A and B be two items, and let hA = hcnt[A][0] and hB = hcnt[B][0] be the only counters contained in their respective hashes, i.e. hA and hB are the numbers of occurrences of items A and B in the whole dataset. According to the Apriori principle, the support σ({A, B}) of the pattern {A, B} can be at most equal to min(hA, hB). Furthermore, we are able to indicate a lower bound for the same support. Let n[i] be the number of transactions associated with the i-th hash position, which, in this case, corresponds to the total number of transactions n. We know from the inclusion/exclusion principle that σ({A, B}) must be greater than or equal to max(0, hA + hB − n). In fact, if n − hA transactions do not contain the item A, then at least hB − (n − hA) of the hB transactions containing B will also contain A. Suppose that n = 10, hA = 8, hB = 7. If we represent with an X each
transaction supporting a pattern and with a dot any other transaction we obtain
the following diagrams:
Best case (ub(AB) = 7)        Worst case (lb(AB) = 5)
 A:  XXXXXXXX..                XXXXXXXX..
 B:  XXXXXXX...                ...XXXXXXX
 AB: XXXXXXX...                ...XXXXX..
Then no more than 7 transactions will contain both A and B. At the same time
at least 8 + 7 − 10 = 5 transactions will satisfy that constraint. Since each counter represents a set of transactions, this operation is equivalent to the computation of the minimal and maximal intersections of the tid-lists associated with the single items.
Usually hsize will be larger than one. In that case, the previously explained computations are applied to each hash position, yielding an array of lower bounds and an array of upper bounds. The sums of their elements give the pair of bounds for the pattern {A, B}, as we show in the following example. Let hsize = 3, let h(tid) = tid mod hsize be the hash function, let A and B be two items, and let n[i] = 10 be the number of transactions associated with the i-th hash position. Suppose that
hcnt[A] = {8, 4, 6} and hcnt[B] = {7, 5, 6}. Using the same notation previously
introduced we obtain:
h(tid) = 0     Best case (supp = 7)    Worst case (supp = 5)
 A:            XXXXXXXX..              XXXXXXXX..
 B:            XXXXXXX...              ...XXXXXXX
 AB:           XXXXXXX...              ...XXXXX..

h(tid) = 1     Best case (supp = 4)    Worst case (supp = 0)
 A:            XXXX......              XXXX......
 B:            XXXXX.....              .....XXXXX
 AB:           XXXX......              ..........

h(tid) = 2     Best case (supp = 6)    Worst case (supp = 2)
 A:            XXXXXX....              XXXXXX....
 B:            XXXXXX....              ....XXXXXX
 AB:           XXXXXX....              ....XX....
Each pair of columns represents the transactions having a tid mapped into the
corresponding location by the hash function. The lower and upper bounds for the
support of pattern AB will be respectively 5 + 0 + 2 = 7 and 7 + 4 + 6 = 17.
Both the lower bound and the upper bound computations can be extended to larger itemsets by associativity: the bounds for the first two items are composed with the counters of the third item, and so on. The sums of the elements of the last pair of resulting arrays are the upper and lower bounds for the given pattern. This is possible since the reasoning previously explained still holds if we consider the occurrences of itemsets instead of those of single items. The lower bound computed in this way will often be equal to zero on sparse datasets. Conversely, on dense datasets this method proved to be effective in narrowing the two bounds.
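A small Python sketch of this bound computation (our own illustration, with hypothetical function names) is reported below; applied to the per-bucket counters of the example above it returns the bounds 7 and 17.

def hash_bounds(hcnt_x, hcnt_y, bucket_sizes):
    # per-bucket lower and upper bounds for the co-occurrence count
    lower = [max(0, a + b - n) for a, b, n in zip(hcnt_x, hcnt_y, bucket_sizes)]
    upper = [min(a, b) for a, b in zip(hcnt_x, hcnt_y)]
    return lower, upper

lo, up = hash_bounds([8, 4, 6], [7, 5, 6], [10, 10, 10])
print(sum(lo), sum(up))                    # -> 7 17
# For longer itemsets the per-bucket bound arrays can be composed with the
# counters of a further item, exactly as described above.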
Experimental evaluation
In the final part of this section, we study the behavior of the proposed method. We
have run the APStream algorithm on several datasets using different parameters. The
goal of these tests is to understand how the similarity of the results varies as the stream length increases, to assess the effectiveness of the hash-based pruning and, in general, to understand how dataset peculiarities and invocation parameters affect the accuracy of the results. Furthermore, we studied how the execution time evolves as the stream length increases.
Similarity and Average Support Range. The method we are proposing yields
approximate results. In particular APStream computes pattern supports which may
be slightly different from the exact ones, thus the result set may miss some frequent
patterns (false negatives) or include some infrequent patterns (false positives). In
order to evaluate the accuracy of the results we use a common measure of similarity between two pattern sets, introduced in [50] and based on support difference.
To the same end, we use the Average support Range (ASR), an intrinsic measure of
the correctness of the approximation introduced in [61]. An extensive description of
both measures and a discussion on their use can be found in the appendix A.
Experimental data. We performed several tests using both real world datasets,
mainly from the FIMI’03 contest [1], and synthetic datasets generated using the IBM
generator. We randomly shuffled each dataset and used the resulting datasets as
input streams.
Table 5.2 shows the list of these datasets along with their cardinality. The datasets whose names start with T are synthetic datasets, which mimic the behavior of market basket transactions. The sparse dataset family T20I8N5k has transactions composed, on average, of 20 items, chosen from 5000 distinct items, and includes maximal patterns whose average length is 8. The dataset family T30I30N1k was generated with the parameters synthetically indicated in its name and is composed of moderately dense datasets, since more than 10,000 frequent patterns can be extracted even with a minimum support of 30%. A description of all the other datasets can be found in [1]. Kosarak and Retail are really sparse datasets, whereas all the other real world datasets used in the experimental evaluation are dense. Table 5.2 also indicates for each dataset a short identifying code that will be used in our charts.
Dataset     | Reference | #Trans.
accidents   | A         | 340183
kosarak     | K         | 990002
retail      | R         | 88162
pumbs       | P         | 49046
pumbs-star  | PS        | 49046
connect     | C         | 67557
T20I8N5k    | S2..6     | 77302..3189338
T25I20N5k   | S7..11    | 89611..1433580
T30I30Nf1k  | D1..D9    | 50000..3189338

Table 5.2: Datasets used in experimental evaluation.
Experimental Results. For each dataset and several minimum support thresholds, we computed the exact reference solutions by using DCI [44], an efficient sequential algorithm for frequent pattern mining (FPM). Then we ran APStream for
different values of available memory and number of hash entries.
The first test is focused on catching the effect of used memory on the behaviour of
the algorithm when the block of transactions processed at once is sized dynamically
according to the available resources. In this case, data are buffered as long as all
the item counters, and the representation of the transactions included in the current
block fit into the available memory. Note that the size of all frequent itemsets, either
mined locally or globally, is not considered in our resource evaluation, since they can
be offloaded to disk if needed. The second test is somehow related to the previous
one. In this case, the amount of required memory is variable, since we determine
a-priori the number of transactions to include in a single block, independently of
the stream content. Since the datasets used in the tests are quite different, in both
cases we used really different ranges of parameters. Therefore, in order to fit all the
datasets in the same plot, the numbers reported in the horizontal axis are relative
quantities, corresponding to the block sizes actually used in each test. These relative
quantities are obtained by dividing the memory/block size used in the specific test
by the smallest one for that dataset. For example, the series 50KB, 100KB, 400KB
thus becomes 1,2,8.
The first plot in Figure 5.1 shows the results obtained in the fixed memory case,
while the second one refers to the case of a fixed number of transactions per block.
The relative quantities reported in the plots refer to different base values of either
memory or transactions per block; these values are reported in the legend of
each plot. In general, when we increase the number of transactions processed at
once, either statically or dynamically on the basis of the available memory, we also
improve the result similarity. Nevertheless, the variation is in most cases small, and
sometimes there is a slightly negative trend caused by the nonlinear relation between
used memory and transactions per block. In our tests we noted that choosing an
excessively low amount of available memory for some datasets leads to performance
degradation and sometimes also to similarity degradation. The last plot shows the
effectiveness of the hash-based bounds in reducing the Average Support Range (zero
corresponds to an exact result). As expected, the improvement is evident only on the
denser datasets.
The last batch of tests makes use of a family of synthetic datasets with homogeneous distribution parameters and varying lengths. These datasets are obtained
from the largest dataset of the series by truncating it, so as to simulate streams of different lengths. For each truncated dataset we computed the exact result set, used
as the reference in computing the similarity of the corresponding approximate result obtained by APStream. The first chart in Figure 5.2 plots both similarity and
ASR as the stream length increases. We can see that the similarity remains almost the
same, whereas the ASR decreases as an increasing amount of the stream is processed.
Finally, the last plot shows the evolution of the execution time as the stream length increases. The execution time increases linearly with the length of the stream, hence
the average time per transaction is constant if we fix the dataset and the execution
parameters.
[Figure: three plots. Left: Similarity (%) vs. relative available memory, legend: A (minsupp=30%, base mem=2MB), C (minsupp=70%, base mem=2MB), P (minsupp=70%, base mem=5MB), PS (minsupp=40%, base mem=5MB), R (minsupp=0.05%, base mem=5MB), K (minsupp=0.1%, base mem=16MB). Right: Similarity (%) vs. relative number of transactions per block, legend: A (minsupp=30%, base trans/block=10k), C (minsupp=70%, base trans/block=4k), K (minsupp=0.1%, base trans/block=20k), P (minsupp=70%, base trans/block=8k), PS (minsupp=40%, base trans/block=4k), R (minsupp=0.05%, base trans/block=2k). Bottom: Average support range (%) vs. number of hash entries, legend: A (minsupp=30%), C (minsupp=70%), K (minsupp=0.1%), P (minsupp=70%).]

Figure 5.1: Similarity and Average Support Range as a function of available memory,
number of transactions per block, and number of hash entries.
Acknowledgment
The datasets used during the experimental evaluation are some of those used for
the FIMI'03 (Frequent Itemset Mining Implementations) contest [1]. Thanks to
the owners of these data and to the people who made them available in their current format:
in particular, Karolien Geurts [21] for Accidents, Ferenc Bodon for Kosarak, Tom
Brijs [10] for Retail, and Roberto Bayardo for the conversion of the UCI datasets. Other
datasets were generated using the publicly available synthetic data generator
from the IBM Almaden Quest data mining project [6].
5.4 Conclusions
In this chapter we have discussed APStream, a new algorithm for approximate frequent pattern mining on streams, and described several related algorithms for frequent item and itemset mining. APStream exploits a novel interpolation method to
infer the unknown past counts of some patterns, which are frequent only in recent data. Since the support values computed by the algorithm are approximate,
we have also proposed a method for establishing a pair of upper and lower bounds for each interpolated value. These bounds are computed using the knowledge of
subpattern frequencies in past data and the intersection of a hash-based compressed
representation of past data.

[Figure: two plots for dataset T30I30N1k (min_supp=30%). Left: Similarity (%) and ASR (%) vs. stream length (/100k). Right: relative execution time vs. stream length (/100k).]

Figure 5.2: Similarity and Average Support Range as a function of different stream
lengths.
Experimental tests show that the solution produced by APStream is a good approximation of the exact global result. The comparisons with exact results consider
both the set of patterns found and their support. The metric used in order to assess
the quality of the algorithm output is the similarity measure introduced in [50].
The interpolation works particularly well for dense datasets, achieving a similarity
close to 100% in the best cases. The adaptive behavior of APStream allows us to limit
the amount of used memory. As expected, we have found that a larger amount of
available memory corresponds to a more accurate result. Furthermore, as the length
of the processed stream increases, the similarity with the exact result remains almost the same. At the same time, we observed a decrease in the average difference
between upper and lower bounds, which is an intrinsic measure of result accuracy.
This means that as the stream length increases, the relative bounds on the support
get closer. Finally, the time needed to process a block of transactions does not
depend on the stream length, hence the total execution time is linear with respect
to the stream length. In the future, we plan to improve the proposed method by
adding other stricter bounds on the approximate support and to extend it to closed
patterns.
Conclusions
The knowledge discovery process and, particularly, its data mining algorithmic part
have been extensively studied in the literature during the last twenty years, and the field is
still an active discipline. Several problems and analysis methods have been proposed,
and the extraction of valuable hidden knowledge from operational databases is
currently a strategic issue for most medium and large companies. Most of these organizations are geographically spread by nature, and distributed database systems
are widely adopted for logistic, failure-resilience, or performance reasons.
Banks, telecommunication companies, and wireless access providers are just some of the
users of distributed systems for the management of both historical and operational
data. Furthermore, in several cases, the data are produced and/or modified continuously and at a sustained rate. The use of data mining algorithms in distributed
and stream settings may introduce several challenging issues. Problems may be either technical, related to the network infrastructure and the huge amount of data,
or political, related to privacy, company interests, or data ownership. The issues to
solve, however, depend on the kind of knowledge we are interested in extracting from
the data.
In this thesis, we have analyzed in detail the issues related to Association
Rule Mining, and more precisely to its most computationally expensive phase, the
mining of frequent patterns in distributed datasets and data streams, where these
patterns can be either itemsets (FIM) or sequences (FSM). The core contribution
of this work is a general framework for adapting an exact Frequent Pattern Mining
algorithm to a distributed or streaming context. The resulting algorithms are able
to efficiently find an approximation of the exact results with a strong reduction of the
communication volume, in the distributed case, and of the memory usage, in the stream case.
In both cases, the approximate support of each pattern is returned along with an
interval containing the true value.
The proposed methods have been evaluated in both a distributed setting and a
stream setting, using several real world and synthetic datasets. The results of our tests
show that this framework gives a fairly accurate approximation of the exact results, even
when exploiting only simple and generic interpolation schemas such as those used in the tests.
In the distributed case, the interpolation-based method exhibits linear speedup as
the number of partitions increases. In the stream case, the time required to
process a block is on average constant, hence the total execution time is linear with
respect to the length of the data stream. At the same time, both the similarity to
the exact results and the absolute width of the error interval are almost constant.
Thus, the algorithm is suitable for mining unbounded amounts of data.
One further original contribution presented in this thesis is an algorithm for
frequent sequence mining with gap constraints. CCSM is a novel algorithm for
the discovery of frequent sequence patterns with a constraint on the maximum gap
between the occurrences of two parts of the sequence (maxGap). The proposed algorithm has been compared with cSPADE, a state of the art algorithm, obtaining
better performance results for significant values of the maxGap constraint. Thanks to
the particular traversal order of the search space exploited by CCSM, the intermediate results are highly reused, and the output is ordered. This is particularly
important and allows the CCSM algorithm to be efficiently integrated in the proposed
distributed/stream framework, as explained in the next section.
Future work
Frequent Sequence Mining on distributed/stream data
The methods presented for frequent itemset extraction can easily be extended to
the other kind of frequent pattern considered in this thesis: frequent sequences.
This only involves minor modifications of the algorithms: replacing the interpolation formula with one suitable for sequences, and the FIM algorithm with an FSM
algorithm. The CCSM algorithm is a suitable FSM candidate to be inserted in our
distributed and stream framework, since it is level-wise and returns ordered sets of
frequent sequences. This ordering allows the sequence patterns to be merged on the fly as they arrive, and the level-wise behavior makes more information available
to the interpolation schema, yielding a better approximation.
Furthermore, the on-the-fly merge reduces both the memory requirements and the computational cost of the merge phase.
As the overall framework remains exactly the same, all the improvements and
limits that we have discussed for frequent itemsets are still valid. The only differences are those originating from the intrinsic differences between frequent itemsets and
frequent sequences, which make the result of FSM potentially larger and more likely
to be affected by combinatorial explosion.
Frequent Itemset Mining on distributed stream data
The proposed merge/interpolation framework can be extended seamlessly to manage
distributed streams in several ways. The most straightforward one is based on
the composition of APInterp followed by APStream. Each slave is responsible for
extracting frequent itemsets from its local streams. The results of each processed
block are sent to the master and merged, first among them using APInterp, and then
with the past combined results as in APStream. The schema on the left of Figure C.3
illustrates this framework: Res_node,i is the FIM result on the i-th block of the node
stream, Res_i is the result of the merge of all the local i-th results, and Hist_Res_i
is the historical global result, i.e., the result from the beginning of the stream up to the i-th block. A sketch of this composition is given below.
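The following minimal Python sketch illustrates this composition for a single block index i. All names (local_fim, merge_distributed, merge_stream, node_streams) are hypothetical placeholders, passed in as parameters, standing for the FIM computation on a block, the APInterp-style distributed merge, and the APStream-style merge with past results; they do not correspond to actual code of the thesis.

    def process_block(i, node_streams, hist_res,
                      local_fim, merge_distributed, merge_stream):
        # Res_node,i: local FIM result on the i-th block of each node's stream
        local_results = [local_fim(stream.block(i)) for stream in node_streams]
        # Res_i: merge of all local i-th results (APInterp-style distributed merge)
        res_i = merge_distributed(local_results)
        # Hist_Res_i: merge of Res_i with the past combined results (APStream-style)
        return merge_stream(hist_res, res_i)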
Figure C.3: Distributed stream mining framework. On the left, distributed merge
followed by stream merge; on the right, local stream merge followed by distributed
merge.
A first improvement on this basic idea could be the replacement of the two cascaded merge phases, one related to distribution and the other to the stream, with
a single one. This would allow for better accuracy of the results and stricter bounds,
thanks to the reduction of the accumulated errors. Clearly, the recount step, used in
APStream for assessing the support of recently non-frequent itemsets that were frequent in past data, is impossible in both cases: since the merge is performed in
the master node, only the received locally frequent patterns are available. However,
this step proved to be effective in our preliminary tests on APStream, particularly for
dense datasets.
In order to introduce the local recount phase, it is necessary to move the stream
merge phase to the slave nodes. In this way, recent data are still available in the
reception buffer and can be used to improve the results. Each slave node then sends
its local results, related to the whole history of its streams, to the master node, which
simply merges them as in APInterp. Since these results are sent each time a block
is processed, it would be advisable to send only the differences in the results related
to the last processed block. This involves rethinking the central merge phase but,
in our opinion, it should yield good results. The schema on the right of Figure C.3
illustrates this framework. DCI result streams are directly processed by APStream,
yielding Hist_Res_node,i, i.e., the results on the whole node stream at time i. APInterp
collects these results and outputs the final result Hist_Res_i.
The last aspect to consider is synchronization. Each stream potentially evolves
at a different rate with respect to the other streams. This means that when the stream
reception buffer of a node is full, other nodes could still be collecting data. Thus, the
collect and merge framework should allow for an asynchronous and incremental result
merge, with some kind of forced periodic synchronization, if needed.
Limiting the combinatorial explosion of the output
It should be noted that, both in the distributed and in the stream setting, the actual
time needed to process a partition is mainly related to the statistical properties of
the data. This problem is not specific to our algorithms: rather, it is a peculiarity of the
frequent itemset/sequence mining problems, and is directly linked to the exponential size
of the result sets. Our goal was to find an approximate solution as close as possible
to the exact one, and this is exactly what we achieved. However, this means that when
the exact solution is huge, the approximate solution will be huge too. In this
case, if we want to ensure that data can be processed at a given rate, a
different approach is mandatory.
Two approaches can be devised. The first one is based on alternative representations of the results, such as closed/condensed/maximal frequent patterns. As briefly
explained in the related work of Chapter 2, both the result size and the information
on the support of patterns decrease from the first to the last of these three problems, but
the presence of a pattern in the results remains certain. The second one, instead,
aims at discovering only a useful subset of the result, as in the case of alignment
patterns [31]. We have done some preliminary work on approximate distributed
closed itemset mining [32], but the second approach will also be a matter of further
investigation. We believe it should be particularly effective in the sequence case,
which is more affected by the combinatorial explosion problem.
A Approximation assessment
The methods we are proposing yield approximate results. In particular, APInterp
computes pattern supports which may be slightly different from the exact ones;
thus the result set may miss some frequent patterns (false negatives) or include some
infrequent patterns (false positives). In order to evaluate the accuracy of the results
we need a measure of similarity between two pattern sets. A widely used one has
been introduced in [50], and is based on the support difference.
Definition 12 (Similarity). Let A and B respectively be the reference (correct) result
set and the approximate result set. sup_A(x) ∈ [0, 1] and sup_B(y) ∈ [0, 1], where
x ∈ A and y ∈ B, correspond to the relative supports found in A and B, respectively.
Note that since B corresponds to the frequent patterns found by the approximate
algorithm under observation, A − B corresponds to the set of false negatives,
while B − A is the set of false positives.
The Similarity is thus computed as

\[
\mathrm{Sim}_{\alpha}(A,B) = \frac{\sum_{x \in A \cap B} \max\{0,\ 1 - \alpha \cdot |sup_A(x) - sup_B(x)|\}}{|A \cup B|}
\]

where α ≥ 1 is a scaling parameter which increases the effect of the support dissimilarity. Moreover, 1/α indicates the maximum allowable error on the (relative) pattern
supports. We will use the notation Sim() to indicate the default case for α, i.e.,
α = 1.
In case absolute supports are used instead of relative ones, the parameter α
will be smaller than or equal to 1. We will name this measure Absolute Similarity,
indicated as Sim_ABS(A, B).
This measure of similarity is thus the sum of at most |A ∩ B| values in the range
[0, 1], divided by |A ∪ B|. Since |A ∩ B| ≤ |A ∪ B|, the similarity lies in [0, 1] too.
When a pattern appears in both sets and the difference between the two supports
is greater than 1/α, it does not improve the similarity; otherwise the similarity is increased
according to the scaled difference. If α = 20, then the maximum allowable error in
the relative support is 1/20 = 0.05 = 5%. Supposing that the support difference for
a particular pattern is 4%, the numerator of the similarity measure will be increased
by a small quantity: 1 − (20 · 0.04) = 0.2. When α is 1 (the default value), only patterns
whose support difference is at most 100% contribute to increasing the similarity. On the
other hand, when we set α to a very high value, only patterns with very similar
supports in both the approximate and reference sets will contribute to increasing the
similarity measure (which is roughly the same as using Absolute Similarity with
α close to 1).
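A minimal Python sketch of the Sim_α computation is reported below, assuming that each result set is represented as a dictionary mapping a pattern (e.g., a frozenset of items) to its relative support; the function name and the data representation are illustrative only, not taken from the actual implementation.

    def similarity(ref, approx, alpha=1.0):
        # ref (A) and approx (B): dict mapping pattern -> relative support in [0, 1]
        common = ref.keys() & approx.keys()            # A intersection B
        union_size = len(ref.keys() | approx.keys())   # |A union B|
        if union_size == 0:
            return 1.0                                 # two empty sets are identical
        num = sum(max(0.0, 1.0 - alpha * abs(ref[p] - approx[p])) for p in common)
        return num / union_size

    # Example from the text: with alpha = 20, a 4% support difference contributes
    # 1 - 20 * 0.04 = 0.2 to the numerator.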
It is worth noting that the presence of several false positives and false negatives in
the approximate result set B contributes to reducing our similarity measure, since this
entails an increase in A ∪ B (the denominator of the Sim_α formula) with respect to
A ∩ B. Moreover, if a pattern has an actual support that is slightly less than minsup
but an approximate support (sup_B) slightly greater than minsup, the similarity
is decreased even if the computed support was almost correct. This could be an
undesired behavior: while a false negative can constitute a big issue, because some
potentially important association rules will not be generated at all, a false positive
with a support very close to the exact one could be tolerated by an analyst.
In order to overcome this issue we propose a new similarity measure, fpSim
(where fp stands for false positives). Since this measure considers every pattern included in the approximate result set B (instead of A ∩ B), it can be used to
assess whether or not the false positives have an approximate support value close to the exact
one. A high value of fpSim compared with a smaller value of Sim simply
means that the approximate result set B contains several false positives with a
true support close to minsup.
Definition 13 (fpSimilarity). Let A and B respectively be the reference (correct)
result set and the approximate result set. sup_B(x) ∈ [0, 1], where x ∈ B, corresponds
to the support found in the result set B, while sup(x) ∈ [0, 1] is the actual support of
the same pattern. The fpSimilarity is thus computed as

\[
\mathrm{fpSim}_{\alpha}(A,B) = \frac{\sum_{x \in B} \max\{0,\ 1 - \alpha \cdot |sup(x) - sup_B(x)|\}}{|A \cup B|}
\]

where α ≥ 1 is a scaling parameter. We will use the notation fpSim() to indicate the
default case for α, i.e., α = 1.
Note that the numerator of this new measure considers all the patterns found
in the set B, thus also the false positives. Hence finding a pattern with a support close
to the true one is considered a "good" result in any case, even if this pattern is
not actually frequent. For example, suppose that the minimum support threshold is
50% and x is an infrequent pattern such that sup(x) = 49.9%. If sup_B(x) = 50%, x
turns out to be a false positive. However, since sup_B(x) is very close to the exact
support sup(x), the value of fpSim_α() is increased.
In Definition 13 we used sup(x) instead of sup_A(x) to indicate the actual support
of an itemset x, since it is possible, as in the example above, that a pattern is present in
B even if it is not frequent (and hence not present in A).
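A corresponding Python sketch for fpSim_α is shown below; besides the two result sets, it needs the actual support sup(x) of every pattern in B, which for false positives is not available in A (names and data representation are again illustrative only):

    def fp_similarity(ref, approx, actual, alpha=1.0):
        # ref (A), approx (B): dict pattern -> support; actual: dict pattern -> sup(x)
        union_size = len(ref.keys() | approx.keys())   # |A union B|
        if union_size == 0:
            return 1.0
        # every pattern of B contributes to the numerator, including false positives
        num = sum(max(0.0, 1.0 - alpha * abs(actual[p] - approx[p])) for p in approx)
        return num / union_size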
In both definitions above, we used sup(x) to indicate the (relative) support,
ranging from 0 to 1. In the remainder of this thesis, in particular in the algorithm
description, we will also use the notation σ(x) = sup(x) · |D| to indicate the support
count (absolute support), ranging from 0 to the total number of transactions.
When bounds on the support of each pattern are available, an intrinsic measure
of the correctness of the approximation is the average width of the interval between
the upper bound and the lower bound.
Definition 14 (Average support range). Let B be the approximate result set, sup(x)
the exact support for pattern x, and sup(x)_lower and sup(x)_upper the lower and upper
bounds on sup(x), respectively. The average support range is thus defined as:

\[
\mathrm{ASR}(B) = \frac{1}{|B|} \sum_{x \in B} \left( sup(x)_{upper} - sup(x)_{lower} \right)
\]
Note that, while this definition can be used for every approximate algorithm, how
to compute sup(x)_lower and sup(x)_upper is algorithm specific. In the next section, we
present a way of computing these bounds that is suitable for the class of algorithms
to which the ones we propose belong.
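The ASR can be computed directly from the per-pattern bounds, however they are obtained; a minimal Python sketch, assuming the bounds are given as a dictionary mapping each pattern to a (lower, upper) pair of relative supports:

    def average_support_range(bounds):
        # bounds: dict mapping pattern -> (sup_lower, sup_upper), both in [0, 1]
        if not bounds:
            return 0.0
        return sum(upper - lower for (lower, upper) in bounds.values()) / len(bounds)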
Other, less accurate, similarity measures can be borrowed from Information
Retrieval theory:
Definition 15 (Recall & Precision). Let A and B respectively be the reference (correct) result set and the approximate result set. Note that since B corresponds to the
frequent patterns found by the approximate algorithm under observation, A − B
corresponds to the set of false negatives, while B − A is the set of false positives.
Let P(A, B) ∈ [0, 1] be the Precision of the approximate result, defined as follows:

\[
P(A,B) = \frac{|B \cap A|}{|B|}
\]

Hence the Precision is maximal (P(A, B) = 1) iff B ∩ A = B, i.e., the approximate
result set B is completely contained in the exact one A, and no false positive occurs.
Let R(A, B) ∈ [0, 1] be the Recall of the approximate result, defined as follows:

\[
R(A,B) = \frac{|B \cap A|}{|A|}
\]

Hence the Recall is maximal (R(A, B) = 1) iff B ∩ A = A, i.e., the exact result set
A is completely contained in the approximate one B, and no false negative occurs.
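Both measures ignore the supports and only compare the two pattern sets; a minimal Python sketch, assuming the result sets are given as Python sets of patterns (the empty-set conventions are our own choice):

    def precision(ref, approx):
        # P(A, B) = |B ∩ A| / |B|; by convention 1.0 when B is empty
        return len(ref & approx) / len(approx) if approx else 1.0

    def recall(ref, approx):
        # R(A, B) = |B ∩ A| / |A|; by convention 1.0 when A is empty
        return len(ref & approx) / len(ref) if ref else 1.0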
According to our remarks above concerning the benefits of the fpSim measure
(Def. 13), a "good" approximate result should be characterized by a
very high Recall, where, however, the supports of the possible false positive patterns should be
very close to the exact ones. Conversely, in order to maximize the standard
measure of similarity (Def. 12), we need to maximize both Recall and Precision,
while keeping small the difference in the approximate supports of the frequent patterns.
Bibliography
[1] Workshop on frequent itemset mining implementations FIMI’03 in conjunction
with ICDM’03. In fimi.cs.helsinki.fi, 2003.
[2] R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm
for generation of frequent itemsets. Parallel and Distributed Computing, 2000.
[3] R. Agarwal, C. Aggarwal, and V.V.V. Prasad. Depth first generation of long
patterns. In KDD ’00: Proceedings of the sixth ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 108–118, New York,
NY, USA, 2000. ACM Press.
[4] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between
sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD
International Conference on Management of Data, pages 207–216, Washington,
D.C., 1993.
[5] R. Agrawal and J.C. Shafer. Parallel mining of association rules. In IEEE
Transaction On Knowledge and Data Engineering, 1996.
[6] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In
Proc. 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan
Kaufmann, 1994.
[7] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 11th Int. Conf.
Data Engineering, ICDE, pages 3–14. IEEE Press, 1995.
[8] J. Ayres, J. Gehrke, T. Yiu, and J. Flannick. Sequential pattern mining using
bitmaps. In Proceedings of the Eighth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, 2002.
[9] R. J. Bayardo Jr. Efficiently Mining Long Patterns from Databases. In Proc. of
the ACM SIGMOD Int. Conf. on Management of Data, pages 85–93, Seattle,
Washington, USA, 1998.
[10] T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for
product assortment decisions: A case study. In Knowledge Discovery and Data
Mining, pages 254–260, 1999.
[11] D. Burdick, M. Calimlim, and J. Gehrke. Mafia: a maximal frequent itemset algorithm
for transactional databases. In Proc. of the International Conference on Data
Engineering ICDE, pages 443–452. IEEE Computer Society, 2001.
[12] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data
streams. In ICALP ’02: Proceedings of the 29th International Colloquium on
Automata, Languages and Programming, pages 693–703, London, UK, 2002.
Springer-Verlag.
[13] D.W. Cheung, J. Han, V.T. Ng, A.W. Fu, and Y. Fu. A fast distributed
algorithm for mining association rules. In DIS ’96: Proceedings of the fourth
international conference on Parallel and distributed information systems,
pages 31–43, Washington, DC, USA, 1996. IEEE Computer Society.
[14] G. Cormode and S. Muthukrishnan. What’s hot and what’s not: tracking
most frequent items dynamically. In PODS ’03: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database
systems, pages 296–306. ACM Press, 2003.
[15] G. Cormode and S. Muthukrishnan. An improved data stream summary: the
count-min sketch and its applications. J. Algorithms, 55(1):58–75, 2005.
[16] E.D. Demaine, A. López-Ortiz, and J.I. Munro. Frequency estimation of internet packet streams with limited space. In ESA ’02: Proceedings of the 10th
Annual European Symposium on Algorithms, pages 348–360, London, UK, 2002.
Springer-Verlag.
[17] C. Estan and G. Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst.,
21(3):270–313, 2003.
[18] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors.
Advances in Knowledge Discovery and Data Mining. AAAI Press, 1998.
[19] V. Ganti, J. Gehrke, and R. Ramakrishnan. Mining Very Large Databases.
IEEE Computer, 32(8):38–45, 1999.
[20] M.N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential pattern mining
with regular expression constraints. In The VLDB Journal, pages 223–234,
1999.
[21] K. Geurts, G. Wets, T. Brijs, and K. Vanhoof. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual
Transportation Research Board, Washington DC. (USA), January 12-16, page
18pp, 2003.
[22] E-H.S. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for
association rules. In IEEE Transaction on Knowledge and Data Engineering,
2000.
[23] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann Publishers, 1st edition, 2000.
[24] J. Han, J. Pei, B. Mortazavi-Asi, Q. Chen, U. Dayal, and M.-C. Hsu. Freespan:
Frequent pattern-projected sequential pattern mining. In In Proc. ACM 6th
Int. Conf. on Knowledge Discovery and Data Mining, pages 355–359, 2000.
[25] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of the ACM SIGMOD Int. Conference on Management of Data,
2000.
[26] J.D. Holt and S.M. Chung. Mining association rules using inverted hashing and
pruning. Inf. Process. Lett., 83(4):211–220, 2002.
[27] V.C. Jensen and N. Soparkar. Frequent itemset counting across multiple tables.
In 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining,
2000.
[28] C. Jin, W. Qian, C. Sha, J.X. Yu, and A. Zhou. Dynamically maintaining
frequent items over a data stream. In CIKM ’03: Proceedings of the twelfth international conference on Information and knowledge management, pages 287–
294, New York, NY, USA, 2003. ACM Press.
[29] R. Jin and G. G. Agrawal. An algorithm for in-core frequent itemset mining
on streaming data. To appear in ICDM’05, 2005.
[30] R.M. Karp, S. Shenker, and C.H. Papadimitriou. A simple algorithm for finding
frequent elements in streams and bags. ACM Transactions on Database Systems
(TODS), 28(1):51–55, 2003.
[31] H. Kum, J. Pei, W. Wang, and D. Duncan. ApproxMAP: Approximate mining
of consensus sequential patterns. In Proceedings of the Third International
SIAM Conference on Data Mining, 2003.
[32] C. Lucchese, S. Orlando, R. Perego, and C. Silvestri. Mining frequent closed
itemsets from highly distributed repositories. In Proc. of the 1st CoreGRID
Workshop on Knowledge and Data Management in Grids in conjunction with
PPAM2005, September 2005.
[33] G. Manku and R. Motwani. Approximate frequency counts over data streams.
In Proceedings of the 28th International Conference on Very Large Data
Bases, August 2002.
[34] H. Mannila and H. Toivonen. Discovering generalized episodes using minimal
occurrences. In Knowledge Discovery and Data Mining, pages 146–151, 1996.
[35] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering Frequent Episodes in
Sequences. In Proceedings of the First International Conference on Knowledge
Discovery and Data Mining (KDD-95), Montreal, Canada, 1995. AAAI Press.
[36] H. Mannila, H. Toivonen, and A.I. Verkamo. Discovery of frequent episodes in
event sequences. Data Mining and Knowledge Discovery, 1(3):259–289, 1997.
[37] F. Masseglia, F. Cathala, and P. Poncelet. The PSP approach for mining
sequential patterns. In Principles of Data Mining and Knowledge Discovery,
pages 176–184, 1998.
[38] F. Masseglia, P. Poncelet, and M. Teisseire. Incremental mining of sequential
patterns in large databases. Technical report, LIRMM, France, January 2000.
[39] F. Masseglia, P. Poncelet, and M. Teisseire. Incremental mining of sequential
patterns in large databases. Data and Knowledge Engineering, 46(1):97–121,
2003.
[40] J. Misra and D. Gries. Finding repeated elements. Technical report, Ithaca,
NY, USA, 1982.
[41] A. Mueller. Fast sequential and parallel algorithms for association rules mining:
A comparison. Technical Report CS-TR-3515, Univ. of Maryland, 1995.
[42] S. Orlando, P. Palmerini, and R. Perego. Enhancing the apriori algorithm for
frequent set counting. In DaWaK ’01: Proceedings of the Third International
Conference on Data Warehousing and Knowledge Discovery, pages 71–82, London, UK, 2001. Springer-Verlag.
[43] S. Orlando, P. Palmerini, R. Perego, C. Lucchese, and F. Silvestri. kDCI: a
multi-strategy algorithm for mining frequent sets. In Proceedings of the workshop on Frequent Itemset Mining Implementations FIMI’03 in conjunction with
ICDM’03, 2003.
[44] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Adaptive and resource-aware mining of frequent sets. In Proc. of the 2002 IEEE International Conference on Data Mining, ICDM, 2002.
[45] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. An efficient parallel
and distributed algorithm for counting frequent sets. In Proc. of Int. Conf.
VECPAR 2002 - LNCS 2565, pages 197–204. Springer, 2002.
[46] S. Orlando, R. Perego, and C. Silvestri. CCSM: an efficient algorithm for
constrained sequence mining. In Proceedings of the 6th International Workshop
on High Performance Data Mining: Pervasive and Data Stream Mining, in
conjunction with Third International SIAM Conference on Data Mining, 2003.
[47] S. Orlando, R. Perego, and C. Silvestri. A new algorithm for gap constrained
sequence mining. To appear in Proceedings of ACM Symposium on Applied
Computing SAC - Data Mining track, Nicosia, Cyprus, March 2004.
[48] B. Park and H. Kargupta. Distributed Data Mining: Algorithms, Systems, and
Applications. In Data Mining Handbook, pages 341–358. IEA, 2002.
[49] J.S. Park, M.S. Chen, and P.S. Yu. An Effective Hash Based Algorithm for
Mining Association Rules. In Proceedings of 1995 ACM SIGMOD Int. Conf.
on Management of Data, pages 175–186.
[50] S. Parthasarathy. Efficient progressive sampling for association rules. In
Proceedings of the 2002 IEEE International Conference on Data Mining
(ICDM’02), page 354. IEEE Computer Society, 2002.
[51] S. Parthasarathy, M.J. Zaki, M. Ogihara, and S. Dwarkadas. Incremental and
interactive sequence mining. In CIKM, pages 251–258, 1999.
[52] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu.
Prefixspan: Mining sequential patterns efficiently by prefix-projected pattern
growth. In Proceedings of the 17th International Conference on Data Engineering, page 215. IEEE Computer Society, 2001.
[53] J. Pei, J. Han, and W. Wang. Mining sequential patterns with constraints in
large databases. In Proceedings of the 11th Int. Conf. on Information
and Knowledge Management (CIKM 02), pages 18–25, 2002.
[54] N. Ramakrishnan and A. Y. Grama. Data Mining: From Serendipity to Science.
IEEE Computer, 32(8):34–37, 1999.
[55] A. Savasere, E. Omiecinski, and S.B. Navathe. An efficient algorithm for mining
association rules in large databases. In VLDB’95, Proceedings of 21st International Conference on Very Large Data Bases, pages 432–444. Morgan Kaufmann, September 1995.
[56] A. Schuster and R. Wolff. Communication Efficient Distributed Mining of Association Rules. In ACM SIGMOD, Santa Barbara, CA, April 2001.
[57] A. Schuster, R. Wolff, and D. Trock. A High-Performance Distributed Algorithm for Mining Association Rules. In The Third IEEE International Conference on Data Mining (ICDM’03), Melbourne, FL, November 2003.
[58] T. Shintani and M. Kitsuregawa. Hash based parallel algorithms for mining
association rules. In PDIS: International Conference on Parallel and Distributed
Information Systems. IEEE Computer Society Technical Committee on Data
Engineering, and ACM SIGMOD, 1996.
[59] C. Silvestri and S. Orlando. Distributed association mining: an approximate
method. In Proceedings of 7th International Workshop on High Performance
and Distributed Mining, in conjunction with Fourth International S, April 2004.
[60] C. Silvestri and S. Orlando. Approximate mining of frequent patterns on
streams. In Proc. of the 2nd International Workshop on Knowledge Discovery from Data Streams in conjunction with PKDD2005, October 2005.
[61] C. Silvestri and S. Orlando. Distributed approximate mining of frequent patterns. In Proceedings of ACM Symposim on Applied Computing SAC - Data
Mining track, March 2005.
[62] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and
performance improvements. In Proc. 5th Int. Conf. Extending Database Technology, EDBT, volume 1057, pages 3–17. Springer-Verlag, 1996.
[63] R. Wolff and A. Schuster. Mining Association Rules in Peer-to-Peer Systems.
In The Third IEEE International Conference on Data Mining (ICDM’03), Melbourne, FL, November 2003.
[64] X. Yan, J. Han, and R. Afshar. Clospan: Mining closed sequential patterns in
large datasets. In Proc. 2003 SIAM Int.Conf. on Data Mining (SDM’03), 2003.
[65] M.J. Zaki. Fast mining of sequential patterns in very large databases. Technical
Report TR668, University of Rochester, Computer Science Department, 1997.
[66] M.J. Zaki. Parallel and distributed association mining: A survey. In IEEE
Concurrency, 1999.
[67] M.J. Zaki. Parallel sequence mining on shared-memory machines. In LargeScale Parallel Data Mining, pages 161–189, 1999.
[68] M.J. Zaki. Scalable algorithms for association mining. IEEE Transactions on
Knowledge and Data Engineering, 12:372–390, May/June 2000.
[69] M.J. Zaki. Sequence mining in categorical domains: incorporating constraints.
In Proceedings of the ninth international conference on Information and knowledge management, pages 422–429. ACM Press, 2000.
[70] M.J. Zaki. Spade: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1-2):31–60, 2001.
List of PhD Thesis
TD-2004-1 Moreno Marzolla
”Simulation-Based Performance Modeling of UML Software Architectures”
TD-2004-2 Paolo Palmerini
”On performance of data mining: from algorithms to management systems for
data exploration”
TD-2005-1 Chiara Braghin
”Static Analysis of Security Properties in Mobile Ambients”
TD-2006-1 Fabrizio Furano
”Large scale data access: architectures and performance”
TD-2006-2 Damiano Macedonio
”Logics for Distributed Resources”
TD-2006-3 Matteo Maffei
”Dynamic Typing for Security Protocols”
TD-2006-4 Claudio Silvestri
”Distributed and Stream Data Mining Algorithms for Frequent Pattern Discovery”