Neural Networks as Artificial
Memories for Association Rule
Mining
Vicente Oswaldo Baez Monroy
Submitted for the degree of Doctor of Philosophy
Department of Computer Science
December, 2006
Abstract
The collection of data is a high-priority task in our daily life because humans are interested in understanding more about the variables of the diverse events happening around us. This analysis or understanding is derived from the use of techniques which aim to produce predictive or descriptive models from data. While the former generates models to predict new states of the data, the latter finds important patterns to describe it. In Computer Science, Data Mining is the multidisciplinary area responsible for producing these data-analysis techniques by developing algorithms which aim to form novel understandings of data. Among the descriptive data-mining techniques, the task of association rule mining stands out because of its simple but powerful rule format for representing how the attributes or items that form events or patterns in an environment associate with each other and how strong these associations are.

Based on the support-confidence framework proposed by Agrawal for association rule mining, the generation of these rules is typically achieved by first identifying the group of interesting or frequent itemsets in the data and then generating rules from the discovered itemsets. To determine whether or not an itemset is frequent, the calculation of its corresponding support property must be performed, since it defines the itemset's frequency of occurrence in the mined environment.
Although the number of approaches and strategies for association rule mining has been growing since 1993, there are very few proposals based on biologically-inspired technology. In particular, there are barely any neural-network-based approaches for the generation of this type of rule. To further the research in this field, we explore in this thesis how neural networks can be used for association rule mining.

Since it has been assumed that association rules are a type of knowledge that humans can generate mechanically, and considering that neural networks imitate human behavior, we have stated that an implicit neural-based framework may exist for this data-mining technique. In particular, we have followed the premise that association rules, similar to those generated by traditional algorithms like Apriori, can be derived from the knowledge learnt by a neural network.
In order to perform association rule mining with neural networks, we focus on investigating how to perform the counting of patterns or itemsets, which is normally produced by looking for the patterns by scanning the high-dimensional space defined by the original data environment, through decoding the knowledge embedded in an auto-associative memory and a self-organising map. That is, we have worked on the first stage of the proposed neural-based framework, which involves the building of artificial memories that are able to learn, store and recall itemset support after they have been trained with data defining associations. In particular, we analyse and decode the training process and the weight matrix of a self-organising map and an auto-associative memory to propose itemset-support extraction mechanisms through which they are able to recall itemset support when an itemset is presented as a stimulus to the trained networks.
Since data sources or environments are not static, any knowledge derived from them, like rules or itemsets, tends to become outdated as fast as new events occur; we have therefore also investigated how the itemset-support knowledge accumulated by a memory must be maintained throughout time. In particular, we propose how a self-organising-map-based memory can keep its knowledge of itemset support valid throughout time while it learns from a non-stationary environment.
Contents

1 Introduction
  1.1 The Role of Neural Networks in Data Mining
  1.2 Linking ANNs and ARM: The Motivations
  1.3 The Neural-Network Candidates
  1.4 The Research Questions
    1.4.1 Aims and Objectives
  1.5 Organisation

2 Association Rule Mining
  2.1 Introduction
  2.2 The Scope of ARM
    2.2.1 Formal Definition
  2.3 Frequent Itemset Mining
    2.3.1 The Calculation of Itemset Support
  2.4 Taxonomy of the FIMers
  2.5 Conclusions

3 Hypothetical Neural Network for Association Rule Mining
  3.1 Literature Review
  3.2 Hypothetical ARM Framework Based on ANNs
  3.3 A Formal Definition of the Problem
  3.4 Ideal ANN Characteristics for Building Memories for ARM
  3.5 Reasons for Studying an AAM and a SOM for ARM
  3.6 Similarities and Differences with Surveyed Approaches
  3.7 Conclusions

4 An Auto-Associative Memory for ARM
  4.1 Correlation Matrix Memory for ARM
    4.1.1 The Learning of Itemset Support by a CMM
    4.1.2 Recalling Itemset Support from the Weight Matrix of a CMM
    4.1.3 Complexity Analysis: CMM vs. Apriori
  4.2 Experiments
  4.3 Conclusions

5 Itemset Support Generation From a Self-Organising Map
  5.1 Considering a SOM for ARM: Principles
  5.2 A Probabilistic Itemset-support Estimation Mechanism
  5.3 Experiments and Results
  5.4 Conclusions

6 Incremental Training for Incremental ARM: A SOM Model
  6.1 Introduction
  6.2 Batch SOM for Non-stationary Environments
    6.2.1 The Problem Definition
    6.2.2 Interpretation by Node Influences of the Batch Training
    6.2.3 The Algorithm
    6.2.4 Experiments
  6.3 Itemset Support Maintenance by Incremental SOM Training
    6.3.1 Experiments
  6.4 Conclusions

7 Conclusions and Future Work
  7.1 Final Results
    7.1.1 Contributions
  7.2 Future Work
    7.2.1 For the Auto-associativity-based Memory
    7.2.2 For the Self-organising-map-based Memory
    7.2.3 For the Quality of the Itemset-support Estimation
    7.2.4 ANN-based Candidate Generation Procedures
    7.2.5 Distributed Association Rule Mining
    7.2.6 The Itemset Concept in Dynamic Data

A The Apriori Algorithm

B The Neural Network Candidate Algorithms
List of Figures

1.1 An illustration of ARM applied to the data source defined in (a). The aim is to generate rules with a minimal threshold of 20%. As the first part of ARM, the frequent itemsets have to be discovered, such as in (b), based on their support. Then, association rules, as in (c), are formed from them.

1.2 A shopping receipt as an example of a data source in which the associativity concept can be exploited for the extraction of knowledge.

1.3 Framework defined by the processes and strategies for the problem of association rule mining. Its conception is based on the support-confidence framework of Agrawal (Agrawal et al., 1993).

1.4 Neural-based framework for ARM.

2.1 Example of an itemset-search-space lattice. In this case, the data space is formed by 4 items. Indexes represent the lexicographic order of the itemsets in the space.

3.1 Hypothetical neural-based framework for ARM. In particular, this thesis focuses on developing an artificial memory for its purposes (coloured area).

3.2 General structure of a mapping neural network. This appears in (Ham and Kostanic, 2001).

3.3 Outline of the theoretical internal support model defined in (Gardner-Medwin and Barlow, 2001) to produce the counting of patterns with distributed representations in a group of neurons.

3.4 Theoretical projection models defined in (Gardner-Medwin and Barlow, 2001) to produce the counting of patterns with distributed representations in a group of neurons.

4.1 Illustration of the accumulation of knowledge by a CMM.

4.2 Illustration of the accumulation of knowledge by a weightless CMM which has been modified to collect frequency information. The dark matrix illustrates the new matrix, called the frequency matrix Mf, which contains the corresponding pattern frequencies.

4.3 75-by-75 frequency matrix formed by a CMM, from which itemset-support recalls about the Chess dataset will be made.

4.4 119-by-119 frequency matrix formed by a CMM, from which itemset-support recalls about the Mushroom dataset will be made.

4.5 129-by-129 frequency matrix formed by a CMM, from which itemset-support recalls about the Connect dataset will be made.

5.1 Maps resulting from training a SOM with artificial datasets describing associations. The red hexagons on the grey maps define the hits received from the input patterns during training. Cluster formations are presented with the coloured maps.

5.2 This figure illustrates the importance of the mean in the calculation of the support of an item from a trained SOM. Different numbers of transactions (n), composed of zeros and ones, have been used to form the bottom graphs. These graphs show that different concentrations (densities) of these bistate values captured in an item induce the distribution curve to tend towards the densest value (e.g., in the left graph the number of failures (zi = 0) is greater than the number of successes, therefore the highest point of the distribution tends to be placed at 0).

5.3 The figure on the left depicts the intersections of an event B with events A1, ..., A5 of a partition over S. The figure on the right depicts the concept of Voronoi regions which can be formed on the SOM (the dots represent the codewords while the stars represent the data points assigned to each Voronoi region).

5.4 Representation of the Probabilistic Itemset-support Estimation Mechanism (PISM) proposed in this chapter.

5.5 Results for the support values of 15 itemsets (top graph), 255 itemsets (centre graph) and 65535 itemsets (bottom graph) obtained after using PISM to satisfy the query -All- to the maps trained with the datasets Bin4, Bin8 and Bin16 respectively. For reference, the values corresponding to the same queries using an Apriori implementation are also plotted.

5.6 Intermediate results (the support values of 15 itemsets) generated from using PISM for the query -All- to the map being trained with the dataset Bin4x100. In both cases, the SOM needs five epochs to converge, but after the first epoch good estimations can be formed for the support of itemsets. The small difference in performance between these two exercises is due to the type of initialisation chosen.

5.7 Results (plots on the right) obtained after using PISM to satisfy the queries -1Itemset : 4Itemset- to the map trained with the Chess dataset. For reference, the values corresponding to the same queries (plots on the left) against the Chess dataset using an Apriori implementation are also plotted.

5.8 Results (plots on the right) obtained after using PISM to satisfy the queries -1Itemset : 3Itemset- to the map trained with the Mushroom dataset. For reference, the values corresponding to the same queries (plots on the left) against the Mushroom dataset using an Apriori implementation are also plotted.

5.9 Results (plots on the right) obtained after using PISM to satisfy the queries -1Itemset : 3Itemset- to the map trained with the Connect dataset. For reference, the values corresponding to the same queries (plots on the left) against the Connect dataset using an Apriori implementation are also plotted.

5.10 Distribution of the itemset-support estimations made by SOM via our method for the query -90to100Itemsets- for the Chess dataset. The corresponding errors are summarised in Table 5.8.

5.11 Generalisation errors for the results given by Table 5.8. While the x-axis represents the different itemset groups, the y-axis determines the calculated error.

5.12 Distribution of the itemset-support generalisation error made by SOM via our method for the queries 1Itemsets, 2Itemsets and 3Itemsets for the Chess dataset, when the size of the map, representing an itemset-support memory for ARM, increases.

5.13 Distribution of the itemset-support generalisation error made by SOM via our method for the query 45to100Itemsets for the Chess dataset, when the size of the map, representing an itemset-support memory for ARM, increases.

6.1 Algorithm for SOM training in batch mode.

6.2 Tendency of the final influence given by the node mj depending on the strength of the influences received from other nodes.

6.3 Incremental algorithm proposed for SOM in batch mode. While the first (top) function triggers the learning at each stage of a non-stationary environment, the second (bottom) function performs the training of the SOM with the current data chunk and the old information coming from the set of best matching units of the latest trained map.

6.4 Representation of a training data space describing a non-stationary environment.

6.5 Topologies formed by a SOM trained with our incremental batch approach through the six different phases defining the non-stationary environment represented in Figure 6.4. The black dots define the structure of the trained map. The data points used for the training of a SOM at each phase of the environment, according to the order in Table 6.1, are defined by the green and blue dots, which represent respectively the old knowledge (data extracted from the BMUs) and the current data chunk.

6.6 Differences between estimations (approximations) and calculations (real values) made respectively by a trained SOM and the Apriori algorithm for the group of 1-itemsets throughout the four phases of environment I. The first column from the left describes the type of estimations that can be produced with a SOM trained with only the current data chunk of the environment (Chunk-SOM). The second column represents the estimations that can be made from a SOM trained with our incremental approach (Bincremental-SOM). The last column shows the estimations made by a SOM always trained with all the data chunks available in the environment (Allchunks-SOM).

6.7 Differences between estimations (approximations) and calculations (real values) made respectively by a trained SOM and the Apriori algorithm for the group of 2-itemsets throughout the four phases of environment I. The first column from the left describes the type of estimations that can be produced with a SOM trained with only the current data chunk of the environment (Chunk-SOM). The second column represents the estimations that can be made from a SOM trained with our incremental approach (Bincremental-SOM). The last column shows the estimations made by a SOM always trained with all the data chunks available in the environment (Allchunks-SOM).

6.8 The RMS error during the phases of environment I for the group of 1-itemsets.

6.9 The RMS error during the phases of environment I for the group of 2-itemsets.

6.10 Error describing the quality of the quantization of the approaches tested for the data chunk describing the changes at each phase of environment I.

6.11 Error describing the quality of the quantization of the approaches tested for the data chunks describing the previous phases (history) at each phase of environment I.

6.12 RMS error during the phases of environment II for the group of 1-itemsets.

6.13 RMS error during the phases of environment II for the group of 2-itemsets.

6.14 Error describing the quality of the quantization of the approaches tested for the data chunk describing the changes at each phase of environment II.

6.15 Error describing the quality of the quantization of the approaches tested for the data chunks describing the previous phases (history) at each phase of environment II.

6.16 RMS error during the phases of environment III for the group of 1-itemsets.

6.17 RMS error during the phases of environment III for the group of 2-itemsets.

6.18 Error describing the quality of the quantization of the approaches tested for the data chunk describing the changes at each phase of environment III.

6.19 Error describing the quality of the quantization of the approaches tested for the data chunks describing the previous phases (history) at each phase of environment III.

6.20 Runtime for the three approaches evaluated on environment III.

7.1 Example of association rules which can be further generated from the knowledge learnt by a neural network about the Mushroom dataset defined in (D.J. Newman and Merz, 1998). All these rules, describing the associativity among the attributes of a dataset, have the format: if (list of items or attributes) then (list of items or attributes) with [support=% and confidence=%].

7.2 Triangular binary formations formed with the itemset search spaces formed with 3 and 4 items.

7.3 Function generated with the support of some groups of itemsets derived from the Chess dataset.

7.4 Incremental SOM-based approach for distributed ARM. The rules will be generated from the latest trained SOM.

7.5 Local SOM-based approach for distributed ARM. While the local maps are queried remotely in the model on the left, the trained maps are transmitted to be the source from which rules will be generated in the model on the right.

7.6 A neural-based approach for ARM for data streams.

A.1 The Apriori algorithm. This figure was extracted from the original paper of Agrawal (Agrawal and Srikant, 1994). The top pseudocode describes the main steps of Apriori. The bottom SQL query defines the way in which candidates are formed during a mining process.

B.1 An associative memory based on a CMM.
List of Tables

4.1 List of real-life binary training datasets used in the testing of an auto-associative memory for ARM. They are part of the datasets normally used for testing FIM algorithms or FIM benchmarks (Jr. et al., 2004; Goethals and Zaki, 2003).

4.2 Support constraint conditions used to form the groups of itemsets on which the memories will be tested.

4.3 Error results obtained in the experiments for the support recall for the groups of 3-itemsets, constrained as defined in Table 4.2, made by a CMM through our proposals.

4.4 Error results obtained in the experiments for the support recall for the groups of 4-itemsets, constrained as defined in Table 4.2, made by a CMM through our proposals.

4.5 Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Chess dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 90 and 100%.

4.6 Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Chess dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 45 and 100%.

4.7 Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Connect dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 75 and 100%.

4.8 Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Mushroom dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 45 and 100%.

5.1 List of binary artificial training datasets used in the experiments for analysing SOM properties for ARM. Each dataset contains all the possible n itemsets generated by m items.

5.2 List of real-life binary training datasets used in the testing of PISM for SOM. They have been used in FIM-algorithm benchmarks (Jr. et al., 2004; Goethals and Zaki, 2003).

5.3 List of queries used to form the groups of itemsets used for the testing of the itemset-support estimations from SOMs via PISM. In this case, k means the size of the itemsets and σ refers to the support used to form such itemset groups.

5.4 Generalisation errors produced by a trained SOM through PISM for the queries 1Itemsets, 2Itemsets and 3Itemsets derived from the Chess dataset.

5.5 Generalisation errors produced by a trained SOM through PISM for the queries 1Itemsets, 2Itemsets and 3Itemsets derived from the Mushroom dataset.

5.6 Generalisation errors produced by a trained SOM through PISM for the queries 1Itemsets, 2Itemsets and 3Itemsets derived from the Connect dataset.

5.7 Generalisation errors for the trained SOMs with linear initialisation.

5.8 Generalised errors for trained SOMs with linear initialisation on different ranges of groups of k-itemsets.

6.1 Data order followed in the incremental batch training.

6.2 Definition of the non-stationary environments using the Chess dataset. In the first two environments, each data chunk has the same (fixed) number of transactions, while in the third environment the number of transactions was chosen randomly (unfixed).

6.3 Results obtained by the approaches tested for environment I.

6.4 Results obtained by the approaches tested for environment II.

6.5 Results obtained by the approaches tested for environment III.

7.1 Characteristics of the two itemset-support memories based on the studied neural networks. m defines the total number of items from which the n transactions or itemsets of a training dataset D are derived. mb corresponds to the number of BMUs formed in a training epoch.

7.2 Comparison of the generalised errors between our approaches for SOM and CMM for the different numbers of tested itemsets, which resulted from a query on the Chess dataset with a minimal support between 90 and 100%.

7.3 Comparison of the generalised errors between our approaches for SOM and CMM for the different numbers of tested itemsets, which resulted from a query on the Chess dataset with a minimal support between 45 and 100%.
Acronyms and Symbols
Acknowledgement
I first thank my supervisor, Dr. Simon O'Keefe, for his time given throughout the development of this research. I must express my deep gratitude to CONACYT, the national council of science and technology in Mexico, and to the central bank of Mexico for the support offered, without which this thesis and the experience around it would not have been possible.

Throughout these years I have had the pleasure of meeting many nice people and becoming their friend. Among them, I wish to acknowledge Patricia and Michael Lee for their support and positive words. I also want to express many thanks to the staff of the Computer Science Department of the University of York. Moreover, I want to thank the anonymous conference reviewers who provided some valuable comments which helped me in the conclusion of this work.

Most importantly, I would like to express my thanks to Victoria Mueller, who has been an important person to me in one way or another. Finally, I must thank my family (Marcela, Vicente and Carlos) for all their love, encouragement and support given to me throughout all these years. In particular, I must express my gratitude to my mother for showing me that everything is possible in this life. My deepest gratitude I must reserve for myself, because we never gave up in the difficult times.
Declaration
I declare that the work presented in this thesis is solely my own, except where indicated, attributed or cited to other authors. Some of the material of this thesis has been previously published and presented at conferences. A complete list of publications is provided on the next page.
List of Publications
The following is a list of the publications that have been produced during the course of this research.
2005
• Vicente O. Baez-Monroy and Simon O'Keefe, "Modelling Incremental Learning With The Batch SOM Training Method", HIS '05: Proceedings of the Fifth International Conference on Hybrid Intelligent Systems, IEEE Computer Society, Rio, Brazil, pages 542-544, 2005.

• Vicente O. Baez-Monroy and Simon O'Keefe, "Principles of Employing a Self-organizing Map as a Frequent Itemset Miner", Artificial Neural Networks: Biological Inspirations - ICANN Proceedings, 15th International Conference, Warsaw, Poland, September 11-15, 2005.
2006
• Vicente O. Baez-Monroy and Simon O'Keefe, "The Identification and Extraction of Itemset Support Defined by the Weight Matrix of a Self-Organising Map", 2006 IEEE World Congress on Computational Intelligence, Proceedings of the 2006 International Joint Conference on Neural Networks (IJCNN), July 16-21, Vancouver, BC, Canada, 2006.
Chapter 1
Introduction
Data is a common term to denote the collection of facts or raw information describing processes or activities, or the status of events occurring in an environment. Data normally exists in a variety of forms; however, it is often categorised into the groups of numbers (numeric or quantitative data) and symbols (categorical or qualitative data), depending on the meaning of the variables or features of the events to be modelled. The collection of data is an important activity which has been performed throughout our history and has contributed to the development of our societies in one form or another. For instance, its collection has allowed human beings to develop methods for climate forecasting, medical diagnosis and treatments, highway and house planning, and so forth.
As we are surrounded by data everywhere and its collection has an important
significance for our daily life, some areas in Computer Science (CS) have been
particularly dedicated to the development of the necessary technology to collect,
store and model large amounts of data from almost every possible recordable
source of information. For example: bank operations, video, medical checks,
images, shopping activities, vehicles, web transactions, and many others.
Among the areas responsible for the management of data, a first example is the
database area which focuses on its modeling. Normally, the term database in
this area refers to an entity-relationship model environment, in which the data
changes periodically due to new operations (inserts, deletes, updates) represented
by transactions. A second example is the data warehouse area in which data is
seen as a large repository containing integrated information for efficient querying and analysis. The resources of the data modelling are based on a star-schema
concept (Kimball, 1996) rather than a transactional one.
The main ambition behind the data collection is to provide humans with precious resources of data regarding a specific event which can be converted into
the more valuable commodity called knowledge or information. Once data have
been collected, the search for interpretations begins, aiming to gain a novel or
better understanding of the variables modelled in the data. This search activity,
known as data analysis, has been a permanent motivation for researchers in the
development of new methods which can provide faster, better and new interpretations from diverse and unlimited data sources.
For the last decade, it has gradually become a rule that any project addressing the extraction of knowledge from a data source must follow the framework defined by the area of knowledge discovery. In this area, the term KDD[1], attributed to Piatetsky-Shapiro et al. (Piatetsky-Shapiro and Frawley, 1991; Piatetsky-Shapiro, 2000), has been coined to define that:

Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns[2].

[1] KDD stands for Knowledge Discovery in Databases.
[2] The term pattern is used here to define any final product (e.g., a series of clusters, a classification model, rules, etc.) resulting from a knowledge discovery process.
In the KDD process, while the first stages are dedicated to the management, cleaning and transformation of data, the last ones are devoted to the manipulation of the knowledge, focusing on presenting the results in the most profitable and understandable format for the human experts. Data-mining algorithms occupy the central activity and represent the core of the process. These algorithms are responsible for conducting the discovery of knowledge by performing a series of operations to form a model which satisfies the definition of the values assigned to the input parameters of the mining activity.
DM (Data Mining) (Han and Kamber, 2000; Hand et al., 2001; Cios, 2000)
is an area in Computer Science whose efforts are set on developing these new
algorithms for data analysis. The central focus of these algorithms is to build
either predictive models, in which some features or variables of the original data
are used to build a model which will predict unknown or future states of
the model variables, or descriptive models, which focus on describing the data
through finding hidden patterns or relationships in the data to gain some understanding of it. The quality of the models generated as well as the performance of
the DM algorithm for a specific problem are often affected by the nature of the
data, which can be redundant, imprecise, noisy, high-dimensional or dynamic.
Therefore, a successful DM approach is often considered an algorithm which
deals reasonably well with the characteristics defined by data.
The spectrum of techniques in DM can be classified (according to the type
of task to be performed) into clustering, classification, regression and prediction,
association rule mining and sequence analysis. Nevertheless, this DM classification is not unique and specialised extensions to each technique can also be found
in literature (e.g., spatial mining, web mining, outlier detection, frequent itemset
mining, text mining, bio mining, and so forth). Data mining is often defined as a
multi-disciplinary area which integrates miscellaneous techniques from different
fields such as Machine learning, Statistics, Neural networks, etc. Therefore, it is
rare to find a unique solution for a data-mining problem.
Among the descriptive DM techniques, ARM (Association Rule Mining)
stands out because of its comprehensive and powerful format to represent the
knowledge discovered from a data source. ARM, introduced by Agrawal et al
(Agrawal et al., 1993), is a technique to form rules defining the associativity
among the components, attributes or variables of a data source or environment
D. In its beginnings, ARM was conceived to analyse shopping-basket databases
to identify customer-behavior patterns. Nowadays, it has been applied to problems in which the associativity property of the variables defining a dataset can
be exploited (e.g., web mining, text mining, bio mining).
In ARM, the data source D (database, dataset), which is represented by n possibly different patterns x_1, ..., x_k, ..., x_n (transactions), can be modelled horizontally in two main forms. The first is as a collection of binary vectors of length m (m defines the number of components, elements or attributes representing the transactions), in which the absence of a value at some pattern element, for instance x_ki = 0, refers to the lack of participation of element i in the pattern (transaction or association) x_k, while a value of one (x_ki = 1) denotes the contrary event for the element (its participation). The other method of representation is as variable-sized integer vectors whose elements are the indexes i of the elements with participation (x_ki = 1) in the binary representation. Individual elements of the patterns are known as items I_i, and an itemset is defined as a grouping or combination of some items defining a pattern.
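To make these two horizontal representations concrete, the following is a minimal sketch using a hypothetical toy dataset (not one of the benchmark datasets discussed later), showing the same transactions held either as binary vectors over the m items or as lists of the participating item indexes:

```python
# Two equivalent horizontal representations of the same transactions
# over m = 5 items (indexes 1..5). Hypothetical toy data for illustration only.

m = 5

# Binary representation: x_ki = 1 if item i participates in transaction k.
binary = [
    [1, 0, 1, 1, 0],   # transaction 1 contains items {1, 3, 4}
    [0, 1, 1, 0, 1],   # transaction 2 contains items {2, 3, 5}
    [1, 1, 0, 0, 1],   # transaction 3 contains items {1, 2, 5}
]

# Integer (item-index) representation: only the participating items are listed.
def to_index_form(row):
    return [i + 1 for i, v in enumerate(row) if v == 1]

def to_binary_form(items, m):
    return [1 if i + 1 in items else 0 for i in range(m)]

index_form = [to_index_form(row) for row in binary]
print(index_form)                      # [[1, 3, 4], [2, 3, 5], [1, 2, 5]]
print(to_binary_form([2, 3, 5], m))    # [0, 1, 1, 0, 1]
```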
Rules with the format X ⇒ Y [supp, conf][3] are the aim of ARM for describing the participation (associations) of a total set of items, Γ = {I_1, ..., I_m},
in the transactions contained in D. To form such descriptions of D, according to
Agrawal (Agrawal et al., 1993), ARM can be divided into a two-stage problem.
This standpoint of ARM is known as the support-confidence framework which
basically defines that association rules can be formed by first finding the frequent
itemsets in D and second by generating the rules from such itemsets.
The former is a DM task known as FIM (Frequent Itemset Mining) (Goethals,
2003) and involves the formation and counting of the different possible k-itemsets
for all 1 ≤ k ≤ m to determine the relevant ones based on a minimal support
threshold σ. The latter, which defines the RG (Rule Generation) stage and has
been identified as a straightforward process, uses the information resulting from
FIM for the generation of rules.
[3] This is the simplest format of an association rule. X and Y are known as the antecedent (the "if" part) and consequent (the "then" part) of the rule. These two parts are called itemsets. Two basic metrics, supp and conf, defining the support and confidence of the rule, are used to define the interestingness of the rule.

To determine which k-itemsets are the important (frequent) ones in a mining exercise, the generation of candidates and the calculation of itemset support are performed. Candidate generation refers to the formation of new k-itemsets (combinations of k items) during the process, while the calculation of support refers to the action taken (counting) by the algorithm to measure the appearance of the candidates in the n transactions. The relevance of support for ARM, in particular for FIM, is high since it is the main metric by which an algorithm determines the status of an itemset against the threshold. That is, the support of a rule, for instance A → B, is equal to the support of its itemset; i.e., supp(A → B) = supp(AB). For the generation of rules, another metric, known as confidence, is used to determine the strength of the rules, which are derived from the final set of frequent itemsets along with their corresponding support.
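As a small illustrative sketch of these two metrics (the transactions and the rule below are hypothetical examples, not the thesis datasets), the support of an itemset can be obtained by counting the transactions that contain it, and the confidence of a rule X ⇒ Y then follows as supp(X ∪ Y) / supp(X):

```python
# Minimal sketch of support and confidence over a toy transaction database.

transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """supp(X ∪ Y) / supp(X) for the rule X => Y."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"milk", "bread"}, transactions))      # 0.5
print(confidence({"milk"}, {"bread"}, transactions))  # 0.666...
```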
To illustrate the concept of ARM, an example is given in Figure 1.1 in which
we show the type of knowledge that can be derived from performing ARM on a
set of transactions representing a shopping basket dataset. As commented above
and shown with our example, the generation of association rules is produced by
first performing FIM, and then using the discovered itemsets to produce the target rules. Association rules provide a description of data in the form of "if-then"
statements. Unlike the if-then rules of logic, association rules are probabilistic
in nature.
The challenge of the generation of association rules from a database is considerable, since its complexity, which is mainly defined by the complexity of
FIM, is exponential. That is, the exponential search space, formed by m items
forming the n transactions in a database, is of size 2^m. Therefore, according
to Goethals (Goethals, 2002), in many applications which involve hundreds of
items, it is easy to find a search space with a number of itemsets larger than the
number of atoms in the universe (≈ 10^79).
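To make this growth concrete, here is a short sketch (with arbitrary item counts chosen only for illustration) that enumerates the non-empty itemsets for a tiny m and simply evaluates 2^m − 1 for larger m:

```python
from itertools import combinations

def count_itemsets(m):
    """Number of non-empty itemsets that can be formed from m items: 2^m - 1."""
    return 2 ** m - 1

# Explicit enumeration of the itemset lattice is only feasible for tiny m.
items = ["A", "B", "C", "D"]
lattice = [set(c) for k in range(1, len(items) + 1)
           for c in combinations(items, k)]
print(len(lattice), count_itemsets(4))   # 15 15

# For item counts typical of real datasets, the space explodes.
for m in (20, 75, 100):
    print(m, count_itemsets(m))          # 2^75 - 1 is already ~3.8e22
```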
Even though other metrics have been proposed to define rule or itemset interestingness (Hilderman and Hamilton, 1999; Tan et al., 2002; Meo, 2003; Omiecinski, 2003) and some critiques have been made of the support-confidence framework (Brin et al., 1997; Aggarwal and Yu, 1998), the importance of the framework is not in question because its methodology has been adopted in one form or another by the majority of the successful and current state-of-the-art algorithms, for instance Apriori (Agrawal and Srikant, 1994; Mannila et al., 1994), Eclat (Zaki et al., 1997a) and FP-growth (Han et al., 2000a).

Figure 1.1: An illustration of ARM applied to the data source defined in (a). The aim is to generate rules with a minimal threshold of 20%. As the first part of ARM, the frequent itemsets have to be discovered, such as in (b), based on their support. Then, association rules, as in (c), are formed from them.
The diversity of the FIM algorithms, in terms of the knowledge area on which they are based, is not as vast as in other DM techniques such as pattern classification (Duda et al., 2000), in which approaches based on decision trees, neural networks, Bayesian-based methods, and many others, can be used to perform the task.

The reason for this is that the main work on ARM has focused on finding the best data structure to reduce the complexity of FIM (the counting of items and the selection of the frequent ones) rather than on trying to develop possible alternatives to tackle the problem. For instance, proposals involving biologically-based technologies, such as genetic algorithms, have rarely been undertaken (Vázquez et al., 2002; Yan et al., 2005b; Alatas and Akin, 2006). In particular, it seems that approaches inspired by the learning capabilities of biological systems are not considered acceptable or suitable for generating such descriptive associativity models, even though it is possible to relate the problem of ARM to the general notion of inference in the learning activity performed by humans, in which data-driven interactions with the environment constantly occur.
For instance, in a task of learning from a data source such as the one represented in Figure 1.2, a human is able to perform two classical types of inference, called induction and deduction, in order to obtain and use an abstract representation (model or knowledge) of the data source. With such a model formed, he is able to categorise (predict, i.e., deduce) the corresponding group for new items depending on their nature and purpose (e.g., garlic and olive oil can be grouped into the taxonomy of condiments and cooking ingredients). A third inference case, known as transduction[4], could also be performed if the task implied forming answers for particular points of interest directly from the data (training dataset) without having to build a global model of it. As stated in (Kantardzic, 2002), an important application of the latter is ARM. Moreover, this human could perform the counting of real items or the formation of new abstract representations of items by combining them (e.g., milk and eggs and vanilla = cake). The counting of either real or abstract items will be done by scanning the information defined by either his shopping basket or his receipt. As a result of performing a pass over the data (learning), he would have kept some knowledge in his memory (brain area, neurons) regarding single or collective, real or abstract itemsets, which can be used to answer questions about item appearance such as: How many trios involving diet coke, crisps and napkins have been bought? Or, have toothpaste and aspirin been bought together? It is very likely that his answers regarding the frequency (support) of either real or abstract items would not necessarily be exact at first; therefore, an error in his item-frequency recalls, defined by Error_recalling = Real_frequency − HumanApproximation_frequency, can be stated to exist. The accuracy of his answers will depend on how well he is able to remember (memory issues) the number of appearances (itemset support) for each of the different 2^m combinations that can be formed by his list of m individual items. In order to improve the accuracy of his answers, more passes over the data (training epochs) will have to be performed by him.

[4] This term refers to the act of directly drawing conclusions about data without constructing a model. Transduction was introduced in (Vapnik, 1998).

Figure 1.2: A shopping receipt as an example of a data source in which the associativity concept can be exploited for the extraction of knowledge.
If the current situation involved a large number of sources (multiple transactions) needing to be mined, then the problem would become more complicated
for the human. The complexity of this situation can be measured by O(mcn)
in which m, c and n define the number of items, the number of possible candidates (the possible itemset search space whose size grows exponentially) and
the number of transactions respectively. Even though the complexity grows as
a result of adding a new item or transaction, a human will always be able to
produce an answer, defining a generalisation of the frequency occurrence of the
queried itemsets, through a learning (scanning and counting) of the associations
existing in the data. For the time being, we can therefore conclude that ARM
is a mechanical activity performed naturally by humans, which involves factors
such as pattern association, pattern counting, and memory, in which the better
the answer, the larger the memory abilities and the number of passes over the
data source.
Additionally, it has been stated by A.R. Gardner-Medwin and H.B. Barlow (Gardner-Medwin and Barlow, 2001) that the concepts of learning, counting,
frequency, and association are all intimately related in the genetic mechanism of
learning as follows:
...Sensory inputs are graded in character and may provide weak
or strong evidence for identification of a discrete binary state of the
environment such as the presence or absence of a specific object.
Such classifications are the data on which much simple inference is
built and about which associations must be learned. Learning any
association requires a quantitative step in which the frequency of a
joint event is observed to be very different from the frequency predicted from the probabilities of its constituents. Without this step,
associations cannot be reliably recognized, and inappropriate behavior could result from attaching too much importance to chance
conjunctions or too little to genuine causal ones. Estimating a frequency depends in its turn on counting, using that word in the rather
general sense of marking when a discrete event occurs and forming
a measure of how many times it has occurred during some epoch.
Counting is thus a crucial prerequisite for all learning, but the form
in which sensory experiences are represented limits how accurately
it can be done...
The situation explained above is just a basic example of an associative real-life problem in which the computational power of the brain can be used to compute a correct answer, although in this case not as fast as we would wish. Nevertheless, in this digital era, artificial architectures such as an ANN (Artificial Neural Network) can be used to imitate the human brain since, as defined in (Haykin, 1999), an ANN is:

a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:

1. Knowledge is acquired by the network through a learning process.

2. Interneuron connection strengths known as synaptic weights are used to store its knowledge.
Hence, an ANN can be used to perform tasks in which the concepts of learning and generalisation are involved. Moreover, its use becomes more attractive when the problem at hand implies making use of its power to recall previous knowledge or its experience of the learnt environment.
Therefore, in order to contribute new biologically-inspired approaches that can be used for problems such as ARM, in which the counting of patterns representing associations (events) is significant, we investigate in this thesis the suitability of ANNs for ARM. In particular, we focus on determining whether association rules can be generated from the knowledge embedded in the synaptic weights of a neural network. Moreover, since the training data may represent a dynamic environment, we also conduct research on determining whether the synaptic weights of an ANN, which serve as the basis of the rules, can be updated throughout time with the changes occurring in the training environment.
1.1 The Role of Neural Networks in Data Mining
As mentioned previously, data mining is constituted by a large number of techniques coming from different CS fields which have been incorporated over time. This diversity of techniques opens up the possibility of addressing a data-mining problem with different proposals. For instance, a classification problem is often tackled by using decision trees; nevertheless, neural networks have already been shown to be promising alternatives to conventional methods (Zhang, 2000). Nowadays, among the techniques forming the data-mining toolbox, ANNs have emerged to be considered one of the most useful tools, to the point of having become part of current commercial KDD solutions such as Clementine (SPSS, 1968). Their inclusion in the data-mining framework has not been easy, since they were not initially considered suitable for these tasks due to two main criticisms. First, neural networks have been categorised as black boxes whose results lack a symbolic representation; therefore their verification, integration and interpretation are difficult. Second, the time consumed in training a neural network can be beaten by conventional methods.
As a response to these critiques, approaches such as (Lu et al., 1996; Craven and Shavlik, 1997; Zhang, 2000) have appeared to argue and support the reconsideration of neural networks for the data-mining framework. To argue for their inclusion, some points are outlined as follows:
• To tackle the incomprehensibility of the neural models, some work has
been produced for the extraction of rules and the visualization of the knowledge modeled by their weight matrices (Ultsch and Siemon, 1990; Hammer et al., 2002).
• Neural networks are data driven self-adaptive methods which adjust to the
input data. This adaptation is sometimes done without any explicit specification of functional or distributional form of the underlying model. For
instance, the SOM (Self-Organizing Map) (Kohonen, 1996).
• They are able to approximate any function with arbitrary accuracy.
• They have the capability of remembering past states of the input data
within their weight matrices; i.e., they resolve the stability-plasticity dilemma, as the ART network does (Carpenter and Grossberg, 1989).
• Neural networks are nonlinear models so that they can be more flexible in
modeling real world complex relationships.
• Statistical analysis can be conducted on the basis given by ANNs.
• New training algorithms have reduced the training time without sacrificing
the accuracy of the results.
• They form a more suitable inductive bias than the conventional algorithms.
Because of the type of results given by neural networks, they have also been considered an important part of the soft computing[5] framework for data mining (Mitra et al., 2002), which pursues the aim of producing methods of computation that give a reasonable solution at low cost by seeking an approximate solution to either imprecise or precise problems.

Considering the above characteristics of neural networks, we can conclude that they are on the right track of scaling up to problems involving large datasets and becoming important members of a new generation of algorithms in which human interaction to compute hypotheses will be relevant. Generally speaking, the number of contributions of neural-network approaches for DM problems such as classification, clustering and prediction, whose full description goes beyond the scope of this thesis, is already vast and constantly increasing. Nevertheless, for other problems, such as descriptive DM tasks, in particular ARM, the attention of the ANN research community remains passive and uncertain.

[5] Soft computing (Kecman, 2001) is not a clearly defined discipline at present, but it follows the premises that: 1) the real world is pervasively imprecise and uncertain, and 2) precision and certainty carry a cost. In other words, soft-computing methodologies aim to exploit the tolerance for imprecision, uncertainty, approximate reasoning, and partial truth in order to achieve tractability, robustness, and low-cost solutions.

1.2 Linking ANNs and ARM: The Motivations

The conception of the use of ANNs for ARM may at first appear inappropriate since it can be argued that in the task of ARM there is nothing to be classified, clustered or predicted. Nevertheless, the use of ANNs starts making sense if some factors are well thought out, as follows:

The Meaning of Association. In ARM, association is a concept defined implicitly by the appearance of m items in the n transactions of a data source D.
It is a concept exploited for the formation of knowledge formatted like
rules about some D and its discovery is the main purpose of this DM task.
Associations are represented physically by patterns formed through the
grouping of binary or numeric elements whose strengths of associativity
are mainly measured by the frequency (support) of the members.
Similarly, in the field of Neural Networks, the concept of associativity has
also been employed. In this case, it defines an explicit relationship that
exists between a pair of input patterns (input and target) presented to the
neural network under supervised training. This concept is mainly exploited
by neural architectures which imitate the concept of associative memory
present in our brain. In general terms, the explicit association among
the inputs is the target to be learnt by some networks to form knowledge
which can be used for pattern association or recognition tasks.
Hence, given that both technologies know and manage the concept of association for creating a better understanding of some environment D, why should we not assume that one of the best techniques for pattern recognition under association concepts can perform the counting of patterns or sub-patterns (itemsets) defining associativity among their elements?
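To make the associative-memory notion concrete, the following is a minimal sketch, under simplifying assumptions, of a correlation-style memory that stores input-target pattern pairs as a superposition of outer products and recalls the stored target when stimulated with its input; it is only an illustration of the general concept, not the specific CMM architecture developed later in this thesis:

```python
import numpy as np

# Minimal sketch of a binary hetero-associative (correlation-style) memory.
# The patterns below are hypothetical; this is not the thesis's CMM proposal.

def train(pairs, n_in, n_out):
    """Accumulate the associations as a superposition of outer products."""
    W = np.zeros((n_out, n_in), dtype=int)
    for x, y in pairs:
        W += np.outer(y, x)
    return W

def recall(W, x):
    """Stimulate the memory with x and threshold the weighted sums."""
    s = W @ x
    # Threshold at the number of active input units
    # (exact recall here because the stored inputs do not overlap).
    return (s >= x.sum()).astype(int)

# Two stored associations over 6 input units and 4 output units.
x1 = np.array([1, 0, 1, 0, 0, 1]); y1 = np.array([1, 0, 0, 1])
x2 = np.array([0, 1, 0, 1, 1, 0]); y2 = np.array([0, 1, 1, 0])

W = train([(x1, y1), (x2, y2)], 6, 4)
print(recall(W, x1))   # [1 0 0 1] -> recovers y1
print(recall(W, x2))   # [0 1 1 0] -> recovers y2
```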
Rules as Symbolic Representation of Knowledge. One of the fundamental considerations against using ANNs for data analysis tasks has been the comprehensibility of their results. Therefore, since the late 80's (see beginnings in (Towell and Shavlik, 1993; Craven and Shavlik, 1993; Andrews et al., 1995)), some efforts have been put into gaining interpretations of
these nonlinear systems. One approach to tackle this drawback has been
to develop methods of rule extraction (Benitez et al., 1997; Tickle et al.,
1998; Taha and Ghosh, 1999; Craven and Shavlik, 1999; Tsukimoto, 2000;
Browne A., 2003; Malone et al., 2006), which turn the conversion of the
real-valued knowledge formed in the weight matrix into symbolic rules,
describing the inside of the neural model in a global or local manner
(depending on the level of description). Rule-extraction mechanisms already exist for different neural architectures such as multi-layer perceptron (Duch et al., 1996; McGarry et al., 1999), recurrent networks (Jacobsson, 2005), self-organising networks (Hammer et al., 2002; Malone
et al., 2006), radial basis function networks (McGarry et al., 1999), and
even for hybrid architectures (Eggermont, 1998). They have focused on
the formation of if -then or logic, m-of-n (Setiono, 2000) and fuzzy rules,
in particular for classification tasks.
According to Krishnan (Krishnan et al., 1999), rule extraction is a search
problem because it requires exploring a space of candidate rules to test
each individual candidate against the network to determine if they are
valid. This validity refers to the rule property of describing what happens
inside of the neural network. In particular, the behavior of the network
for those instances (inputs) that will form the antecedents of such rules.
A rule-extraction method searches until all the maximally-general rules
have been found. Heuristics have also been employed to limit the combinatorics of the rule-space traversal, for instance by bounding the number of elements in the rule antecedent and/or by limiting the search to combinations of elements that occur in the training dataset. Generally speaking, these approaches have been efficient mechanisms for making neural networks understandable.
In ARM, rules are generated to explain the events occurring in a data
source D. Their generation is realised by performing a process which
resembles the counting of patterns in the high dimensional space formed
by D. During rule generation, an algorithm typically uses a data structure
(e.g., tries, hash trees, among others) to represent the findings (patterns and their properties) and to speed up the mining process. Moreover, heuristics have also been proposed to prune the exponential itemset search
space formed by the items of D.
Hence, based on the similarity between the extraction of rules from neural
networks for classification tasks, and the process for generating association rules from a database, it may be possible to extract association rules
from an ANN through a mechanism which can perform the correct interpretation of the knowledge embedded in the weight matrix.
Activities and Strategies in ARM. As a result of all the diversity of approaches
in ARM proposed since 1993, a framework, which we have summarised
and depicted in Figure 1.3, can be defined. Taking into account the current capabilities of neural networks, we could begin by assuming that they
may be seen as alternatives to the following activities constituting such a
framework:
[Figure 1.3: Framework defined by the processes and strategies for the problem of association rule mining. Its conception is based on the support-confidence framework of Agrawal (Agrawal et al., 1993).]
• Candidate-generation Strategies.
To this day, researchers do not know with certainty what decreases the complexity of ARM (Bodon, 2006). Nevertheless, it has
been stated that one possible factor is the total number of itemsets which have to be checked by an algorithm during the process in order to detect those which are interesting. To reduce this number (c), a phase called candidate generation has been included as part of the definition of some FIM approaches. This phase aims to generate the group of k-itemset candidates which may turn out to be frequent after the calculation of their support (e.g., by counting), based on the information provided, for instance, by the (k-1)th iteration regarding the frequent (k-1)-itemsets. The pruning of unnecessary itemsets has been made possible by using the heuristic known as the downward closure property of itemset support (Agrawal and Srikant, 1994): all subsets of a frequent itemset must also be frequent (a minimal code sketch of this step is given after this list).
Because the main purpose of this strategy is to determine as early as possible which itemsets will turn out to be infrequent, a neural network which incrementally learns information provided by a FIM algorithm during an ARM process could be used to predict which itemsets are not worth considering further in the mining process.
• Itemset-storage-structure Strategies.
Most of the work produced for FIM has focused on this type of
strategy since it has been proven that the use of different data structures for representing the itemsets and candidates found during FIM
can result in a reduction of time and memory costs (different data-structure-based algorithms are summarized in (Goethals, 2003)). These strategies aim to find the most compact and easy-to-traverse data
structure for representing the search space of FIM. Hence, we could
think of employing a type of neural network which can organise its
neural structure dynamically based on the incoming patterns (itemsets) in order to adapt it to form a representation of the search space
lattice in which the mapping of the itemsets can be performed. This
may be achieved by an unsupervised neural network.
• Input-data-layout Strategies.
One of the drawbacks of any DM technique is the dealing with data of
high dimensionality. This problem, better known as the curse of dimensionality, is a factor which directly affects the performance of any
algorithm. Moreover, the performance tends to get worse when the
problem concerned also involves scanning large datasets with dense
patterns. In order to tackle this nature of the data, solutions like,
for instance, PCA (Principal Components Analysis) (Jolliffe, 1986),
have been used to reduce the dimensionality of the original data, before it is mined or visualised. It is important to mention that the
representation of the input data plays an important role within the
data-mining framework, because it directly affects the result’s accuracy and the performance of the algorithm. Thus, a trade-off always
has to be made between performance and accuracy.
In ARM, proposals have been employing three different layouts for
the input dataset. They involve either the representation (vertical
or horizontal) in which the input data will be handled by the algorithm or the quantity (sampling techniques) of transactions that
will be employed to form the rules. It is possible that a neural network which has learnt interesting relationships inherent to the input
dataset can also provide such a good representation of the associations. That is, input-data-attribute relationships will be encoded in
the weight matrix, whose size is considerably smaller than that of the original dataset. This manner of tackling the problem matches the
activities performed in the extraction of knowledge from neural networks explained above. Moreover, this idea would lead to building
a form of artificial memory which would learn and remember associations from the input data (imitating what a human would do in
the same situation). Nevertheless, in order to form such memories, it
is crucial to investigate whether the counting of patterns can be performed by
ANNs.
• The Maintenance of Rules.
Since data does not only describe stationary environments, researchers have proposed the incorporation of the maintenance of rules as part of the general framework. For instance, this data characteristic was first taken into account in (Cheung et al., 1996b). To address the
maintenance, the frequent itemset-representation data structure is periodically updated with the changes occurring in the original dataset.
The major inconvenience of these approaches is that they are performed on top of, or derived from, the traditional FIM algorithms, which brings a new complexity to the entire framework. Moreover, the re-use of knowledge from past mining activities is needed to realise the maintenance, which results in keeping certain information of considerable size throughout time; this can turn into a new data-maintenance complexity for ARM. Hence, we believe it would not
be mistaken to assume that a neural network with the capability of
incremental learning can perform the maintenance of the itemsets by
incorporating the changes suffered in the environment of the original
data source. This idea can be interpreted to be very closely related to
the previous one (artificial memory for ARM) since the ANN may be
collecting new knowledge about the current state of the environment
infinitely throughout time.
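As a purely illustrative aid to the candidate-generation bullet above (the function and variable names are ours, hypothetical, and not taken from any algorithm cited in this thesis), the following minimal Python sketch joins frequent (k-1)-itemsets into k-itemset candidates and prunes every candidate that has an infrequent (k-1)-subset, i.e., it applies the downward closure property:

from itertools import combinations

def generate_candidates(frequent_prev, k):
    # Join frequent (k-1)-itemsets into k-itemset candidates and keep only
    # those whose (k-1)-subsets are all frequent (downward closure pruning).
    frequent_prev = set(frequent_prev)          # frozensets of size k-1
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) == k and all(
                    frozenset(s) in frequent_prev
                    for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

# Toy example: frequent 2-itemsets over the items {1, 2, 3, 5}.
f2 = {frozenset({1, 2}), frozenset({1, 3}), frozenset({1, 5}), frozenset({2, 3})}
print(generate_candidates(f2, 3))    # only {1, 2, 3} survives the pruning

Only the candidate whose every 2-subset is already frequent survives; candidates such as {1, 2, 5} are pruned because {2, 5} is not frequent.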
Other Factors. One of the advantages of neural networks compared to some
typical algorithms is that they can be implemented in hardware, which
means that we can think about having an ANN-based piece of hardware
in future, in which itemset support can be recalled and therefore FIM can
be performed. Moreover, ANNs are considered to be inherently parallel since they simulate the way their biological counterparts work.
Employing neural networks for data analysis brings the advantage of
re-using their weight matrices for diverse mining tasks. For instance, it
would no longer be necessary for analysts to run different environments in
order to create models for clustering and association rules, if good results
could be obtained from the same source of knowledge. Moreover, discovering that such re-use happens for different purposes (not only for predictive but also descriptive tasks) in some of the current models of ANNs
would reinforce the idea that this biologically-inspired technology models
some conditions happening in the human brain, in which different brain areas process or control different biological activities in response to stimuli
by exploiting the knowledge allocated by their neurons.
[Figure 1.4: Neural-based framework for ARM.]
As described in this section, there are links between ARM and ANNs, which can be interpreted as part of the motivations for this research, suggesting that the use of an ANN for problems like ARM is possible. Thus, this research proceeds with the hypothesis that the ARM framework can be transformed in such a way that neural networks constitute the core of the process. This conception is illustrated in Figure 1.4.
1.3 The Neural-Network Candidates
An introduction to the models considered in this work will herein be given.
Moreover, some of the reasons behind their inclusion within this study are described. Details of the justification for research on each neural network will be
stated in Section 3.5 in Chapter 3.
Auto-Associative Memory (AAM). This neural network, as its name says, is trained to capture associations between incoming pairs of patterns for future recognition. The input patterns can be expressed in the form of unipolar vectors. In learning tasks in which labels or targets yi do not exist to be associated with the training data xi, this ANN allows the formation of associations between a pattern xi and itself. The maximum size of the weight matrix formed can be expected to be m-by-m for an auto-associative memory, where m refers to the pattern dimensionality. It has the property of remembering the associativity expressed between the input patterns under supervised training. Thus, it has the faculty to retrieve from its memory (weight matrix) the pattern most closely associated with an input pattern, which is normally corrupted (a small illustrative sketch of this idea is given after the SOM description below).
Self-Organizing Map (SOM). This is one of the most commonly used neural
networks for data mining. The main qualities of this neural network are: i) it can organise its structure according to the description of the dataset; ii) it learns from data in an entirely unsupervised training environment; iii) the maximum size of the map is highly likely to be smaller than the size of the original dataset; and iv) a self-organising map has also inherited some properties of Vector Quantization, therefore it compresses the distribution of the m-dimensional data into the two-dimensional map.
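To illustrate why the auto-associative memory is a plausible candidate for the counting idea pursued in this thesis, the following minimal sketch (our own toy illustration, assuming an unthresholded, additive Hebbian outer-product update; it is not the exact training rule of the memories analysed in later chapters) builds an m-by-m weight matrix from binary transaction vectors. Under this simplified scheme, the off-diagonal entry W[i, j] ends up equal to the number of training patterns in which items i and j co-occur, i.e., the absolute support of the 2-itemset {i, j}:

import numpy as np

# Binary transactions over m = 4 items (each row is a pattern x).
X = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [1, 0, 1, 1],
              [0, 1, 1, 0]])

W = np.zeros((4, 4), dtype=int)
for x in X:
    W += np.outer(x, x)        # Hebbian accumulation: W += x x^T

# W[i, j] (i != j) counts the co-occurrences of items i and j,
# while the diagonal W[i, i] counts single-item occurrences.
print(W)
print("supp({0,1}) =", W[0, 1] / len(X))   # 2 / 4 = 0.5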
1.4 The Research Questions
Inspired by the critiques on the application of ANNs for DM, the research presented here aims to investigate whether artificial neural networks are suitable for
a data-mining technique which concerns neither clustering nor classification; in
contrast, it forms descriptions of the content of a database D by providing knowledge in a rule format: Rules which describe not only the associativity among the
elements or attributes of the patterns in a database, but also the frequency with which they occur; rules which normally are produced as a consequence of performing the counting of patterns registered in a database.
Although the course of this research could be directed towards the different possibilities laid out above, and discussed later in Chapter 3, the scope of this thesis has been confined to investigating whether the itemset property of support (the frequency of patterns) can be reproduced (discovered or calculated) from the knowledge embedded in the weight matrix of our candidate neural networks. In other words, we will be tackling the problem of determining whether our chosen ANN candidates can be biologically-based solutions for ARM. In particular, we focus on determining whether they are suitable (whether they have the information resources
and properties) for becoming artificial memories for ARM, which will be able
to count, store and recall itemset support after they have been trained with some
data modelling associations (binary patterns). Moreover, because data is dynamic, we also study the problem of the maintenance of the knowledge in the
memories through the concept of incremental learning with neural networks.
The reasons for moving the focus of the research in this direction can be summarized as follows:
The Description of Data by ANNs. Data is often described by the results given
through methods which form general models or identify local patterns.
ANNs have been used for the former, as an alternative to traditional algorithms, in particular when predictions of new states of the learnt variables are needed. For the latter, ANNs are able to describe data by forming clusters of the patterns existing in the data, using the information present in their learnt knowledge. Nevertheless, the idea of describing data from
such knowledge in the form of association rules, as humans do from the
knowledge captured by their brains, is still unexplored.
The Importance of Itemset Support for ARM. Even though the itemset property of support has been criticised for not being the best metric to measure
the interestingness of itemsets and therefore rules, its use and definition
has been a fundamental part in the evolution of the approaches for ARM.
In particular, it has an important role in the support-confidence framework
defined since the problem of ARM was introduced to the DM field. Therefore, in order to determine which patterns, associations or itemsets may be
statistically important for the generation of association rules, a neural network should at least be able to know or calculate the occurrence frequency
of the itemsets; i.e., their support property in the learnt environment.
Reproducing Counting Abilities with ANNs. Counting is an important activity in biological systems since it is a method to summarise what occurs in
an environment. It also takes part in a learning process (Gardner-Medwin
and Barlow, 2001). Moreover, results of counting are an important part of
our daily knowledge which often is utilised for decision making.
Reproducing counting abilities on artificial models is a challenging and
relevant task for the modelling and conception of what may be happening in real life (in the human brain). Moreover, performing counting with
ANNs can also be seen as the development of either interpretations of,
or knowledge extraction mechanisms from them (weight matrices) in order to reproduce values which describe the frequency of events (patterns,
itemsets) occurring within the data.
Building Artificial Memories for ARM. In order to develop the framework illustrated in Figure 1.4, it is important to start defining an ANN architecture
which can learn, recall, and store knowledge about its environment (the
frequency of patterns) autonomously. This would result in its hypotheses
and recalls being formed exclusively from its knowledge, which was previously captured directly from the environment, rather than depending on
results from the counting of patterns given by other proposals. In addition,
different networks could be incorporated into the system in order to make
more complex hypotheses from the knowledge stored in the ANN-based
memory for other purposes.
In detail, the questions which we are interested in answering with this thesis
are as follows:
• ANNs have been applied to form data models for classification and clustering, for instance, but can they be used for descriptive data-mining techniques in which the aim is to represent the data in the form of association
rules?
• The interpretation of the knowledge (weight matrix, mapping) generated
by ANNs has been a permanent task of research in the ANN field. In particular, mechanisms have been developed to describe hypotheses formed
by the nodes of a neural network into symbolic representations for classification problems. Nevertheless, would the knowledge learnt by an ANN
be useful for describing associations among the elements of a database?
• In the support-confidence framework of ARM, FIM (first stage of ARM)
calculates a metric called support to produce the raw material known as the
frequent itemsets for the generation of rules. Therefore, could it be possible that our chosen neural networks hold some knowledge describing the frequency of patterns distributed across their nodes? That is, could the results of counting be generated from our ANN candidates?
• For the last decade, an implicit framework has been built in ARM, illustrated in Figure 1.3, which is replete with processes and strategies in order
to break down the complexity of the data-mining task. Therefore, could
it be possible to have an itemset-support memory based on an ANN, as a
substitute for the original database, to take part in such a framework?
• If the implementation of such a neural-network itemset-support memory were feasible, could it continue accumulating knowledge of pattern frequency throughout time while the original data environment changes?
In general terms, this thesis focuses on the process of counting patterns and
the maintenance of the counters with ANNs in order to propose their inclusion
as itemset-support memories in frameworks which involve tasks like FIM. It is
important to note that the development of this topic, artificial neural networks for
association rule mining, as summarized in Chapter 3, has been rarely undertaken.
Additionally, to the best of our knowledge, the topic with which this thesis deals
has not yet been studied by other researchers. Therefore, this work primarily focuses on investigating which biologically-based capabilities, in conjunction with
the different learning paradigms (supervised, unsupervised and hybrid learning)
and architectures of the neural networks, can initially be used for the calculation
of the itemset support and consequently for its maintenance.
1.4.1 Aims and Objectives
Having defined the problem domain and focus of the thesis, and having selected two neural networks as candidates, this work aims to achieve the following:
• To gain some insight into the use of neural networks for techniques involving neither induction nor deduction (classification, prediction), but transduction (association rule mining) (Kantardzic, 2002).
– Normally, ANNs have been used to create global models by learning
the dependencies of some given data (induction) in order to use such
models for predicting outputs for future values (deduction). Therefore, it will be investigated if our ANN models can reproduce the
outputs which can normally be given by the application of a transduction approach, such as the process of association rule mining.
• To investigate if the counting of patterns, which normally is realised in a
high-dimensional space defined by the training data, can be reproduced
with the information defined within the low-dimensional space formed by
the weight matrix of a trained neural network.
– The exploration of counting with ANNs has been proposed in (Gardner-Medwin and Barlow, 2001). However, many assumptions on two theoretical models were made in order to make a proposal. In our case,
an analysis of two well-known neural networks, a self-organising
map and an auto-associative memory, will be made to determine if
these two models perform pattern counting (collect knowledge about
frequency) while they learn.
• To develop mechanisms of knowledge extraction for the weight matrix of
the ANN candidates in order to generate (calculate) itemset support.
– Once it has been stated that, as a result of training, pattern frequency
is embedded in the knowledge of a neural network, we will focus
on giving the right interpretation to the weight matrix in order to
recall the frequency (support) of patterns (itemsets), which can be
composed by different elements (items) of the learnt environment.
• To propose how the frequency pattern knowledge in a neural network can
be maintained while the environment is changing.
– In particular, a proposal on this topic will be given by exploiting the
incremental learning abilities of a self-organising map, since it is an
ANN which learns without supervision.
• To propose the development of a neural framework for tasks like ARM
by proving that itemset support can be taught, stored and recalled from
an artificial memory, based on either a self-organising map or an auto-associative memory.
– To consider our neural candidates as itemset-support memories, it
would be important to evaluate the accuracy of their results. Therefore, the comparison of the neural networks against the values given
by the counting process performed by Apriori will be realised.
Since ANNs have been successfully applied to problems involving classification, prediction and clustering, this work aims to complement that success by serving as evidence that the knowledge learnt by them under normal training conditions can be reused for data-mining tasks whose performance requires knowledge or information about the frequency of the training patterns.
1.5 Organisation
The remaining chapters of the thesis are organized as follows:
• We first provide the formal definition of ARM in Chapter 2. In particular,
we focus on describing the sub-task of FIM. Moreover, since the Apriori
algorithm, which is the gold standard and historically most important for
ARM, will be used to compare our results, its definition is given in Appendix A.
• The little literature related to our aims is summarised in Chapter 3. Details
regarding the characteristics, which are needed in a neural network to be
able to perform FIM, are given here. In this chapter, we justify the study
on our neural network candidates. Moreover, conceptions are given about
how the ARM problem can be understood from an ANN point of view.
• Chapter 4 describes our own interpretation on how the counting process
is realized by an auto-associative memory. In particular, a correlation matrix memory is studied, and two methods to estimate itemset support from
its weight matrix are proposed. Information about the CMM training is
defined in Appendix B.
• Chapter 5 explores the idea of using a self-organising map for the counting of patterns. Unlike other approaches of SOM for ARM, our approach
called PISM (Probabilistic Itemset-support eStimation Mechanism), which
is based on Probability theory, only uses the information coded by the
best-matching units of the map for the estimation (recall) of itemset support. Our novel perspective of the use of SOM for ARM alleviates the
drawbacks presented in similar approaches, which limit the use of SOM to
just clustering the input data before FIM is performed. Information about the
SOM training is defined in Appendix B.
• Chapter 6 develops the idea that the incremental learning capability of
neural networks can be exploited for the task of the maintenance of rules in
ARM. In particular, a study with the SOM is carried out to update itemset support on the map.
Since batch training has been considered unsuitable for the update of a
SOM in non-stationary environments, we have developed BincrementalSOM (Batch incremental training for SOM), which is a mechanism based
on using the knowledge captured by the best-matching units, to update
the information of a SOM regarding the itemset support occurring in these
environments.
• Finally, the conclusions drawn by this thesis, its limitations and future directions are all defined in Chapter 7.
Chapter 2
Association Rule Mining
2.1 Introduction
In the field of DM (Data Mining), the techniques responsible for the analysis
of data are often classified into predictive and descriptive. While the former
aims to build models from data for producing future inferences about it, the latter performs the discovery of hidden patterns in data. Among the descriptive
techniques, ARM (Association Rule Mining) stands out because of its simple manner of describing important patterns existing in data. Roughly speaking, it is a technique which searches for patterns describing relevant associations among
attributes, elements, or items of a given data source or environment.
When ARM was first introduced in (Agrawal et al., 1993), this
DM task was only proposed for the analysis of market-basket data, but today its
application has been extended to a variety of other domains.
Taking into account the direct manner in which ARM forms its knowledge from an environment, which does not involve the building of a model, ARM
has been classified as a transductive learning process (Kantardzic, 2002). The
knowledge or inferences generated by ARM are described by a group of rules
with the format of: X → Y, which symbolise associations of patterns occurring
in data. As the total number of possible rules generated from given data can be exponentially large, generating them all is considered infeasible; therefore, ARM proposals have focused on generating only the relevant rules, based on calculating metrics which define their interestingness. For instance, state-of-the-art algorithms, like Apriori, developed independently in (Agrawal and Srikant,
1994; Mannila et al., 1994), are based on the definition of the support-confidence
framework (Agrawal et al., 1993), which states that ARM must be tackled by
dividing it into the tasks of FIM (Frequent Itemset Mining) and RG (Rule Generation). That is, rules are derived from a form of knowledge known as frequent itemsets
discovered by FIM.
Overall, ARM aims for the discovery of interesting rules which satisfy minimal constraints respecting, for instance, the properties of itemset support and
rule confidence which are respectively involved in FIM and RG. ARM has been
addressed by a large variety of algorithms. A description of the most well-known
algorithms can be found in (Goethals, 2003; Bodon, 2006; Aaron Ceglar, 2006).
The strategies to break down the complexity of the problem, which the current
algorithms are based on, have involved:
• The usage of different data structures to represent the knowledge or findings during the mining process.
• The exploration of different layouts to represent the input data.
• The implementation of efficient strategies to traverse the search space.
• The definition of optimization methods for the reduction of I/O operations.
This has involved the definition of parallel and distributed approaches.
• The approximation of the rule properties to reduce the number of data
scans.
• The type of target knowledge from which rules will be derived.
In the history of ARM, the Apriori algorithm has been very important to
the extent that its philosophy has been utilised in one form or another in the
development of the state-of-the-art algorithms.
Although many approaches have claimed to reduce the complexity of the problem, the real causes that produce good performance in the algorithms are still
uncertain (Bodon, 2006), as well as the different sources from which these rules
can be generated. Hence, it is important to continue research in this area in order
to find answers to the current enigmas.
2.2 The Scope of ARM
As a technique which makes inferences from data, ARM can be seen as addressing the generation of rules R from an environment, represented by a dataset or
database D, through an algorithm A in charge of searching for rules which must
satisfy thresholds T set up for a mining exercise. Such a search is typically conducted by the usage of diverse strategies S. In the best theoretical scenario, ARM
should be tackled by an algorithm whose performance, represented by a function
FA, minimizes Equation 2.1. That is, R, satisfying T, should be produced by the execution of A which minimizes Ω, where Ω defines the complexity and computational resources utilized in a mining exercise over D, based on the use of S.
R = \min_{S} F_A(\Omega, D, T) \qquad (2.1)
2.2.1 Formal Definition
Let any symbol, literal or element ij from a set I = {i1 , i2 , . . . , im } be called
an item, and a grouping or formation of items X such that X ⊆ I be called an
itemset. In particular, an itemset X with k = |X| is called a k-itemset.
Let D be a set of n transactions, events or patterns grouped in a dataset or
database representing an environment. Each transaction ti is defined by a unique
identifier id together with an itemset Y, satisfying Y ⊆ I. A ti is said to hold or
support an itemset X iff X ⊆ ti .Y.
A basic association rule is an implication defined by A → B, in which the itemsets A, B ⊂ I and A ∩ B = ∅.
The support of an itemset X with respect to a D is defined by the fraction of
transactions supporting it. This can be defined as follows:
supp(X) = \frac{|\{ t_i \mid t_i \in D \wedge t_i.Y \supseteq X \}|}{|D|} \qquad (2.2)
In the case of a rule defined by A → B, its support is given by the support of A ∪ B; therefore, supp(A → B) = supp(A ∪ B) = supp(X) with X = A ∪ B. Since the support of X defines the occurrence frequency of X in D, it can also be understood as the probability of X, P(X).
The strength of a rule in D is defined by its confidence as follows:
conf(A \rightarrow B) = \frac{supp(A \cup B)}{supp(A)} \qquad (2.3)
Since it is impractical and not desirable to mine and generate the total space
of itemsets and rules, which grows exponentially in function of |I|, the challenge
of ARM has been to discover only those which are interesting. The interestingness of an itemset and a rule is determined by evaluating the properties of support
and confidence against the thresholds of minsupp and minconf respectively. For
instance, an itemset X is frequent iff supp(X) ≥ minsupp.
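As a concrete, toy-sized illustration of these definitions (the transactions below are invented for the example), the support of an itemset and the frequency test against minsupp follow directly from Equation 2.2:

D = [{1, 2, 4}, {1, 2, 3, 5}, {1, 3, 5}, {1, 2, 3, 5}]
minsupp = 0.5

def supp(itemset, transactions):
    # Fraction of transactions containing every item of the itemset (Eq. 2.2).
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

print(supp({1, 2}, D))             # 3 / 4 = 0.75
print(supp({2, 5}, D) >= minsupp)  # 0.5 >= 0.5 -> True, so {2, 5} is frequent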
Hence, based on the definitions above, which describe the support-confidence framework (Agrawal et al., 1993), the aim of ARM is then: 1) to discover a set F = {Xi ∈ D | supp(Xi) ≥ minsupp} representing all frequent itemsets out of a space defined by 2^m possible ones, and 2) to perform rule generation with the
information generated in F . In a simplified manner, association rule mining is
the result of performing frequent itemset mining and rule generation.
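A correspondingly small and purely illustrative sketch of the rule-generation step: every non-empty proper subset A of a frequent itemset X yields a candidate rule A → X \ A, which is kept if its confidence (Equation 2.3) reaches minconf. The data and thresholds are again toy values of our own.

from itertools import combinations

D = [{1, 2, 4}, {1, 2, 3, 5}, {1, 3, 5}, {1, 2, 3, 5}]

def supp(itemset):
    itemset = set(itemset)
    return sum(1 for t in D if itemset <= t) / len(D)

def rules_from_itemset(X, minconf):
    # Generate the rules A -> (X \ A) whose confidence supp(X)/supp(A) >= minconf.
    X = frozenset(X)
    rules = []
    for r in range(1, len(X)):
        for A in map(frozenset, combinations(X, r)):
            conf = supp(X) / supp(A)
            if conf >= minconf:
                rules.append((set(A), set(X - A), conf))
    return rules

print(rules_from_itemset({2, 3, 5}, minconf=0.9))
# [({2, 3}, {5}, 1.0), ({2, 5}, {3}, 1.0)]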
2.3 Frequent Itemset Mining
Frequent itemset mining has become an essential factor in the generation of association rules from data because it is in charge of seeking out the right raw
material, known as frequent itemsets, from which rules are derived.
Since its complexity mainly defines the complexity of ARM, it has been the
focus of attention of researchers who have looked at developing an algorithm
which can satisfy Equation 2.1. That is, an algorithm whose performance does
not deteriorate abruptly with the possible conditions presented by the data and
thresholds.
FIM is a combinatorial search problem which aims to form a set of itemsets F
through discovering the frequent ones from an itemset search space SI defined
by the m items of a data source D. The target set can then be stated as follows:
F = \{ X_i \in S_I \mid supp(X_i) > minsupp \} \qquad (2.4)
in which the search space SI, often represented by a lattice structure as in Figure 2.1, is defined by
S_I = \{ X \in C(m, k) \mid \forall k : k > 0 \wedge k \leq |I| \wedge m = |I| \} \qquad (2.5)
whose size satisfies
|S_I| = \sum_{k>0}^{m} \binom{m}{k} = 2^m - 1 \qquad (2.6)
which refers to the total number of combinations, associations, or itemsets that define SI and which could occur in D.
Figure 2.1: Example of an itemset-search-space lattice. In this case, the data space
is formed by 4 items. Indexes represent the lexicographic order of the itemsets in the
space.
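As a quick check of Equation 2.6 against the 4-item example of Figure 2.1, the whole itemset search space can be enumerated directly (an illustrative snippet with invented item labels):

from itertools import combinations

items = ['A', 'B', 'C', 'D']                       # m = 4, as in Figure 2.1
space = [frozenset(c) for k in range(1, len(items) + 1)
                      for c in combinations(items, k)]
print(len(space))                                  # 2**4 - 1 = 15 itemsets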
2.3.1 The Calculation of Itemset Support
In order to discover the set corresponding to the frequent itemsets from which
association rules will be derived, the itemset property of support must be calculated. This property not only defines the frequency or probability of occurrence
of an itemset within the mined data, but also represents a metric to determine
the importance or interestingness of an itemset in the mining process. Therefore, to achieve such a calculation, three approaches have been proposed by the
state-of-the-art algorithms as follows:
1. By occurrence counting. In this case, each itemset Xi under investigation
has associated a counter which is incremented when it is discovered that
a ti ⊇ Xi while the scanning of D is carried out. Since it is not feasible
to count all the possible itemsets of the defined search space, algorithms
in this category often make use of a procedure for CG (Candidate Generation) in order to focus the support calculation just on potential itemsets
called candidates. A CG procedure is a function which forms candidates
based on the frequent itemsets already discovered. At the beginning of
ARM, these potential itemsets were compared to each transaction of D to
determine their corresponding support; nevertheless, in order to reduce the
runtime of ARM, it has been proposed to project the transactions
into data structures representing the candidate itemsets. No error is produced in the itemset-support values generated from occurrence counting
since they define the real number of appearances of an investigated itemset in the data. Occurrence counting is normally utilised by algorithms
which perform a breadth-first search and/or make use of the horizontal layout for the input data. Important approaches based on occurrence counting
are the Apriori (Agrawal and Srikant, 1994; Mannila et al., 1994) and FP-growth (Han et al., 2000b) algorithms.
2. By set intersections. In this type of approaches, the typical horizontal layout of the input data D is replaced by a vertical one. Therefore, contrary to
traditional transactions, each item ij has associated a list, known as tidlist,
containing the identifiers of the transactions that support it. In this case,
a candidate C, which represents X∪Y, is formed by an intersection such
that C.tidlist = X.tidlist ∩ Y.tidlist. Therefore, the support of an itemset is determined by |C.tidlist|. A good representative algorithm based on set intersections is Eclat (Zaki, 2000). Similar to occurrence-counting-based values, the support values generated through set intersections are errorless (a small tidlist sketch is given after this list).
3. By estimation. As the itemset support calculated from the above approaches
involves scanning the original data in one way or another, it has been stated
that they calculate the real itemset support. Nevertheless, since the number of candidates, whose support needs to be determined, and the number of transactions or lists, representing the input data, can be large in
real-life, algorithms based on support estimation aim to infer this itemset
property without having to use the original data. That is, they propose using other information to make the estimations; for instance, information defined by the frequent itemsets discovered during the process. The advantage of this type of approach over the traditional ones is that the number
of candidates generated, as well as the number of data passes, are reduced.
Since the itemset-support produced is an estimation and not a calculation,
it is important to state that the total number of itemsets found, and consequently rules formed, can be affected as a result of the quality of the
estimations. One example of itemset-support estimation is the PASCAL
algorithm in (Bastide et al., 2000).
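The following minimal sketch illustrates the vertical-layout calculation of item 2 above (the tidlists are invented toy data): the support of an itemset is obtained by intersecting the tidlists of its items instead of re-scanning the transactions.

# Vertical layout: each item maps to the set of transaction ids containing it.
tidlists = {
    'A': {1, 2, 3, 4, 7},
    'B': {1, 2, 4, 7},
    'C': {2, 4, 5, 6},
}

def support(itemset, tidlists, n_transactions):
    # |intersection of the tidlists| / number of transactions
    tids = set.intersection(*(tidlists[i] for i in itemset))
    return len(tids) / n_transactions

print(support({'A', 'B'}, tidlists, 8))       # |{1, 2, 4, 7}| / 8 = 0.5
print(support({'A', 'B', 'C'}, tidlists, 8))  # |{2, 4}| / 8 = 0.25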
2.4 Taxonomy of the FIMers
There is no unique classification of the current algorithms for FIM. Therefore, in this section, we form a taxonomy based on the different strategies proposed for the benefit of the mining task.
Input Data Definition Strategy. This type of strategy bases its existence on either the amount of transactions needed to describe the tendency behavior of the items in the original data D satisfactorily, or the type of layout used
to define D. For instance, it has been proposed that transforming the natural horizontal layout of D into a vertical one, which is formed by sets
describing the behavior of each item, can provide advantages for the mining; cf., (Shenoy et al., 2000). Moreover, since D is often large in real life,
it has also been investigated whether the tendency of the items, defined by the n transactions of D, can be captured by just k of them. Therefore, the use of a sample d of D has already been studied and tested (Toivonen, 1996; Zaki et al., 1996). As FIM is performed over d rather than
D, discrepancies in the number of frequent itemsets discovered along with
their corresponding support can be found with respect to the traditional
approaches.
Itemset Support Calculation Strategy. These strategies refer to the manner in which the support of itemset candidates is calculated during a mining process. As described above in Section 2.3.1, there are three main approaches
utilised by the current algorithms.
Itemset Storage Structure Strategy. While an algorithm searches the space,
the knowledge already discovered, which is formed by the frequent itemsets and their respective support and often used for the generation of new
candidates, has to be stored or mapped into a data structure. Since the
size of this data structure directly influences the performance of a FIM algorithm, a large part of the research in the field has been done on finding
the most suitable data structure for ARM. This has involved looking for a
data structure that allows not only itemsets to be represented compactly,
but also fast access within the structure. In other words, this type of strategy aims to produce the most effective itemset storage structure which can
be successfully exploited by an algorithm. The data structures proposed
have involved: hash trees (Agrawal and Srikant, 1994), enumeration-set
trees (Agarwal et al., 2001; Coenen et al., 2004a), ad hoc trees (Han et al.,
2000c; Coenen et al., 2004b), matrices (El-Haji and Zaiane, 2003), arrays (Grahne and Zhu, 2005; Liu et al., 2002), tries (Woon et al., 2004)
and others (Goethals, 2003).
Search Itemset Space Strategy. Since FIM seeks the most interesting itemsets
within a search space that can be represented by a tree structure, approaches
have investigated efficient manners to traverse such a space. For instance,
algorithms based on Breadth-First Search (BFS) and Depth-First Search
(DFS) have been proposed. The former is also known as a level-wise
search since the algorithm moves level by level generating candidates and
discovering frequent k-itemsets until no more candidates exist. In contrast, a DFS-based or class-wise algorithm works recursively by checking
a family of k-itemsets before it moves towards a new one.
In order to perform the best searching, complementary techniques, involving heuristics and procedures, have also been proposed. The objective, in this case, is to lead the course of an algorithm to focus on certain areas of the space in which interesting itemsets are likely to be found. The most frequently used heuristic is defined by the downward closure property of itemset support, which establishes that all subsets of a frequent itemset must also be frequent. The latter is a consequence of the anti-monotonic relationship between the support σ and the number k of items defining an itemset within the space. That is, while k increases over the m levels of the search space, σ tends to decrease.
In the case of the procedures, candidate generation has been one of the most used, since it prunes false candidate itemsets as soon as possible during a mining exercise. Hence, the number of elements visited and checked in the space is reduced to just those itemsets which are likely to become frequent.
By considering the properties of the frequent itemsets, for instance their location in the search space, which often turns out to be at the top of the lattice, approaches such as (Mannila and Toivonen, 1997) and (Goethals, 2002) have respectively pointed out the existence of borders between the frequent and infrequent itemsets and exploited them in the generation of candidates.
Optimization Strategy. Approaches in this group have focused on finding the best way of administering the computational resources for an ARM process. That is, the aim of this category is to provide the best mining runtime through the best management of the hardware. For instance,
some researchers have worked on the parallelization of ARM (Zaki et al.,
1997b; Han et al., 1997; Joshi et al., 1999; Jin and Agrawal, 2002; Veloso,
2003). Since the formation of association rules can demand large amounts
of memory, proposals, for instance, in (Goethals, 2004), have considered
simple techniques to extend the capability of the algorithms.
Other approaches have identified that most of the time is spent when the algorithms output their results; therefore, fast output routines have been implemented (Rácz et al., 2005).
43
Target Knowledge Strategy. Although the target knowledge of ARM is defined
by a set of rules representing item associations, most of the approaches
have focused on undertaking only the discovery of the set of the frequent
itemsets F because the generation of rules can be produced from its elements.
Based on the fact that the set F can still be large in real life, other approaches have sought memberless set representations of F from which rules can be generated equally well. Therefore, attention has been
given to the concepts of frequent maximal (Gunopulos et al., 1997) and
closed (Pasquier et al., 1998) itemsets.
The MFI (Maximal Frequent Itemsets) form a set M whose elements have no frequent supersets. In other words, M is composed of the itemsets lying on the frequent border of the space. Even though MFI-based algorithms have helped to reduce the complexity of ARM and to develop new pruning and search strategies such as, for instance, top-down and/or bottom-up searches, the definition of M has been criticised because the support of the subsets of its itemsets, which are also frequent by definition, cannot be recovered from M alone.
Important algorithms with this conception are Maxminer (Roberto J. Bayardo, 1998), Mafia (Burdick et al., 2001), FPMax (Grahne and Zhu, 2003)
and GenMax (Gouda and Zaki, 2001).
In order to overcome the disadvantage of the set M , the discovery of CFI
(Closed Frequent Itemsets) has been proposed because the support for all
frequent itemsets can be generated from the closed ones. An itemset Y is
closed if no proper superset of Y exists that has the same support. Some
CFI-based algorithms are A-Close (Pasquier et al., 1999), CLOSET (Pei
et al., 2000) and CHARM (Zaki and Hsiao, 2002).
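As a toy illustration of these two condensed families (the supports below are invented), the following sketch checks, for a small table of frequent itemsets, which of them are maximal (no frequent proper superset) and which are closed (no proper superset with the same support):

frequent = {                         # frequent itemsets -> support (toy values)
    frozenset('A'): 0.6, frozenset('B'): 0.5, frozenset('AB'): 0.5,
    frozenset('C'): 0.4, frozenset('AC'): 0.3,
}

def is_maximal(X):
    return not any(X < Y for Y in frequent)   # no frequent proper superset

def is_closed(X):
    return not any(X < Y and frequent[Y] == frequent[X] for Y in frequent)

for X in frequent:
    print(sorted(X), "maximal:", is_maximal(X), "closed:", is_closed(X))
# e.g. {'B'} is not closed because {'A','B'} has the same support (0.5),
# while {'A','B'} and {'A','C'} are both maximal and closed.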
2.5 Conclusions
Since this thesis is investigating the suitability of two neural networks for association rule mining, we have given the relevant background on ARM in this
chapter. In particular, we have focused on providing information on the first
phase known as frequent itemset mining. Some concepts about this mining task
have been described above, because our main interest is to reproduce the itemset
support values calculated by the current algorithms, which in one way or another
have to perform pattern counting from data.
In spite of many algorithms, based on different strategies, having been proposed for ARM or FIM since 1993, biologically-inspired approaches have rarely been investigated. Therefore, in the next chapters we develop some ideas regarding the creation of ANN-based approaches for the generation of this type of rules by exploiting the knowledge allocated in their weight matrices after training.
Chapter 3
Hypothetical Neural Network for Association Rule Mining
Unlike other DM research topics, for which the literature is vast, the topic addressed here, ANNs for ARM, suffers from a lack of research. Nevertheless, a summary of the little literature available on the topic will be presented in this
chapter. We also justify the study on our neural network candidates for ARM.
Additionally, we define the stages of the proposed ANN-based framework for
ARM and a list of the properties that we consider as relevant in an ANN for
ARM.
As part of the conclusions of this chapter, we will point out the differences between our research and the related existing work.
3.1 Literature Review
To the best of our knowledge, the use of neural networks for association rule
mining was first explored by Sallans (Sallans, 1997). In this work, he made
use of the unsupervised techniques of FA (Factor Analysis) and MGM (Mixture
of Gaussian Models) for his study. This work linked ANNs and ARM due to
the fact that classification rules can be generated from ANNs for classification.
The series of transactions were simulated and based on some underlying patterns
(seed patterns), which might or might not be correlated, and the addition of some
noise. The conclusions of this study have reported that while the FA-based model
could never converge with the ARM data, and no clear reason was found for such
behavior, the MGM network showed that it is able to learn the seed patterns used
to generate the noisy transactions. The worst performance of MGM was shown
when the patterns were highly correlated and the data was noisy, while the best
performance resulted from uncorrelated patterns and clean data. This work does
not state anything regarding the calculation of support, nor on how the MGM model should be understood for the generation of itemsets or rules.
Gupta et al. (Gupta et al., 1999) undertook the problem of pruning or grouping final association rules for inspection and analysis. Even though the work focused
on proposing a new distance metric to group association rules, a SOM took part
in the proposed grouping methodology. The idea developed was initially to calculate the distance values among the rules through their metric. Then, MDS
(Multi-Dimensional Scaling) was employed to form a vectorial representation of
the distances which served as inputs to the SOM which clustered such vector
space in order to visualize the rules. Due to the orientation of the work, it can be
categorized as part of the techniques developed for the management and visualization of rules rather than their discovery, in which a neural network was used
to cluster the rule space.
In 2000, a frequent itemset algorithm based on the Hopfield neural network
was presented by Gaber et al. (Gaber et al., 2000b). The work was inspired by the
facts that the extraction of classification rules from trained neural networks has been demonstrated to be possible and that the Hopfield network has been used
for combinatorial optimization problems. The authors concluded that a Hopfield
network in an arrangement of n-by-p nodes with their proposed energy function
should be enough to map the maximal frequent itemsets (a maximal itemset is a variant of a frequent itemset, understood as a super-frequent itemset whose derived itemsets are all frequent; algorithms targeting this type of itemset were commented on in Chapter 2) of a given input set of
n transactions and p items. Even though the idea of ANNs for ARM makes sense,
the absence of experiments and results makes it difficult to consider this work as
a formal solution to the problem. Moreover, no indications have been given on
how to interpret the neural network after its training.
Even though the work of A.R. Gardner-Medwin and H.B. Barlow (Gardner-Medwin and Barlow, 2001) is not related to the topic of ANNs for ARM, it is
highly relevant for our purposes since it deals with the problem of counting in
distributed neural representation. Initially, it was stated that:
Learning about a causal or statistical association depends on
comparing frequencies of joint occurrence with frequencies expected
from separate occurrences, and to do this, events must somehow be
counted...
Hence, interested in how events can be counted by biological mechanisms,
Gardner-Medwin and Barlow defined two theoretical neural models to explore
the effects of counting. In this work, in the process of counting, each event E,
representing some stimulation, produces an activity representation (a representation or activity pattern is formed by the state activity [0,1] of the neurons) in the
network.
Two types of representations have been established to model the relation between events and their neuronal representations. First, a direct representation,
which would model the ideal state for counting, is one that has at least one of
the neurons active exclusively for the presented stimulation. Therefore, those
neurons are defined to have a one-to-one relationship between their activity and
the occurrence of the event. Second, a representation is known to be distributed
when its active neurons also participate in the representation of other events in a counting epoch. Therefore, the relationship between the active nodes supporting distributed representations and the events being counted can be stated to be many-to-many.
They followed the idea that distributed representation exists in the biological
models, therefore, interference that results from the overlap of the distributed
representations was defined as a problem which can be dealt with in two ways:
(1) direct representation must be generated or (2) the frequency of an event must
be estimated from the frequency of use of its individual active elements.
It was also noticed that although a distributed representation may increase the
variance of the estimated counts and impairs the speed and reliability of learning,
it is often regarded as a desirable feature of the brain since it brings the capacity
to distinguish a large number of events with a finite and relatively small number of neurons.
As a result of this work, two neural models were proposed. (a) The projection model is composed of an arrangement of Z binary neurons which get activated or deactivated in response to events during a counting epoch. The
frequency of the occurrence of a particular event Ec is estimated by the usage of
all the neurons that respond actively in its representation. In other words, the Z
synaptic weights, which are proportional to the usages of individual neurons, are
summed up into an accumulator neuron X to produce a frequency estimation.
(b) The internal support model employs Z^2 (full connectivity) synapses to count all possible pairings of activity. Its excitatory synapses acquire strengths proportional to the number of times that pre- and postsynaptic neurons have been activated together during events experienced in a counting epoch. In other words, internal excitatory synapses within the network measure the frequency of co-occurrences of activity in pairs of neurons by a Hebbian mechanism. Therefore,
the total internal activation, stabilizing the representation of an event Ec , is estimated by testing the effect of a diffuse inhibitory influence on the number of
active neurons.
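The following very rough sketch is only our reading of the projection model's idea and not the authors' formulation: every neuron accumulates a usage count, and the frequency of an event is then estimated from the usages of the neurons that are active in its representation (here we average the usages so that the estimate is exact for non-overlapping, direct representations). The inflated estimate for the first event illustrates the interference caused by overlapping, distributed representations.

import numpy as np

# Binary representations (rows) of four event presentations over Z = 5 neurons;
# events A and B share neuron 1, so the representation is distributed.
presentations = np.array([[1, 1, 0, 0, 0],   # event A
                          [1, 1, 0, 0, 0],   # event A again
                          [0, 1, 1, 0, 0],   # event B
                          [0, 0, 0, 1, 1]])  # event C

usage = presentations.sum(axis=0)            # per-neuron usage counts

def estimate_count(event_pattern):
    # Projection-model style estimate from the usages of the active neurons.
    active = np.flatnonzero(event_pattern)
    return usage[active].mean()

print(estimate_count(np.array([1, 1, 0, 0, 0])))  # 2.5 (true count 2, overlap with B)
print(estimate_count(np.array([0, 0, 0, 1, 1])))  # 1.0 (true count 1, no overlap)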
In general, this theoretical work addressed the problem of counting on distributed
representations by focusing on forming event frequency estimations taking into
account the usage of the neurons and the representation overlaps produced in a
counting epoch.
A procedure for mining association rules from a database for on-line recommendation is developed by Changchien and Lu (Changchien and Lu, 2001).
The implemented system depends on three different technologies: a star-schema database, a SOM, and RST (Rough Set Theory). In particular, the SOM is just used
to cluster the transformed and normalised transaction records. Since the authors
determined that SOM cannot explain the resulting clusters itself, they proposed
association rules to explain their meaning. Hence, RST is employed to derive
rules that explain the characteristics of each cluster and the attribute relationships among the different clusters. This work neither generates frequent itemsets nor calculates itemset support; instead, it uses a confidence metric based
on RST to form rules. It does not discuss the accuracy of the final rules at all,
but based on the explanation of the work, it can be classified as a soft-mining
solution. Moreover, a strong dependency between the SOM and the database is
presented and exploited for the performance of the entire system.
Yang and Zhang (Shangming Yang, 2004) have also proposed that a SOM
can be used for ARM because of its clustering properties, in particular due to the
fact that its use makes clustering a two-level approach (first, the data is clustered with a SOM and then the SOM itself is clustered), which gives the benefit that the computational load decreases (Vesanto and Alhoniemi, 2000). This work provides neither experimentation nor results. Roughly speaking, the authors
only expose the idea that the split of the binary data, representing associations,
into similar groups formed by a SOM may benefit the task of FIM, but no results are shown to support such an assumption.
To continue with the improvement of their approaches on PPI (Protein-Protein
Interaction) prediction methods, in which an ANN (Eom and Zhang, 2004) and
an adaptation of ARM (Eom et al., 2004) were used separately, Eom et al. decided to explore the idea of combining their past proposals in order to generate
association rules directly from the ANN weight matrix to improve accuracy in
the protein predictions (Eom and Zhang, 2005; Eom, 2006). Mainly, a supervised ART-1 network is used to classify vectors Xi , modeling PPI attributes, into
k different classes. After this network converges, a weight-to-rule decoding procedure is initialized to transform its weight matrix into a form of association
rules. To form rules between input attributes and their corresponding class k ,
the authors stated that a vector which maximizes the final value of the kth output
node must be calculated. Taking into account the characteristic of the problem,
Eom et al. stated that rule formation from an ANN can be addressed as a nonlinear integer optimization problem. Therefore, a GA (Genetic Algorithm) was
used to do the maximization of the objective function. The idea behind using
a GA is that it would look for the best chromosome which can maximize the
corresponding network output and give the best combination of input features as
a result. Once these chromosomes have been discovered, they are decoded into
a form of association rules called neural feature association rules for protein prediction. The few results presented show that the combination of an ANN and ARM provides better accuracy in the PPI prediction than its predecessors.
3.2 Hypothetical ARM Framework Based on ANNs
Because it has been defined that relevant concepts for ARM such as counting
and association take part in the process of learning performed by biological systems (Gardner-Medwin and Barlow, 2001) and knowing that such biological behaviors can be artificially imitated by using ANNs, we can state that an implicit
ANN-based framework exists for tasks like association rule mining.
This hypothetical framework, which is depicted in Figure 3.1, can be defined to be constituted by the following stages:
1. Environment Definition. Roughly speaking, this stage focuses on performing tasks for the transformation, collection and manipulation of the data describing a collection of events of some environment. In this thesis, we use the original representation of the events, in which the association among their elements is formatted in binary. Nonetheless, finding a more advantageous representation of the input patterns without losing either the hierarchical property among them or the associativity property of their elements
is a relevant direction for the improvement of their learning, in particular since patterns can be sparse and/or high dimensional by nature.
[Figure 3.1: Hypothetical Neural-based framework for ARM. In particular, this thesis focuses on developing an artificial memory for its purposes (colored area).]
2. Learning. This is a task mainly governed by the course of some learning algorithm responsible for modifying the neural-network architecture
(nodes or weight matrix) in charge of acting as the artificial memory which
accumulates the knowledge presented in the environment throughout time.
Therefore, the learning algorithm is principally determined by the type of
neural network used to learn the data coming from the environment. In
our particular case, the training algorithms of a self-organising map and an
auto-associative memory will be evaluated to fulfill this framework task.
3. Artificial Memory. This stage is important for the generation of rules because it forms the base knowledge of the framework from which the properties of the rules to be generated, and therefore the rules themselves, will be derived. The main quality of any ANN to become an artificial memory
for ARM is that its embedded knowledge can allow functions or methods F
to be defined to describe properties of the learnt associations. Initially, for
the generation of rules, itemset support must be calculated (recalled) from
the nodes of the chosen neural network. In other words, the neural network must be able to act as an artificial memory capable of counting patterns in order to identify their corresponding number of occurrences in the environment.
Another way of comprehending the knowledge formed at this stage is as a
new mapping or feature space in which the original associations or patterns
have been quantified and coded in a compact representation, limited by
the space formed with the nodes of the network for future usages. Therefore, to generate rules from the resulting space, it is important to define
the correct decoding of it in order for the frequency of the taught patterns
(support) to be determined as an estimation of what is normally discovered in the original space through the counting of the patterns. Moreover,
this embedded knowledge about the environment can be used to supply the
formation (training) of other neural architectures in order to create more
complex rules describing the environment.
In this thesis, we focus on developing this stage. Therefore, our two ANN candidates will be studied to establish whether they can become our desired itemset-support memory by determining if they can reproduce frequency knowledge about the patterns in the training data, which is usually generated by counting them.
4. Knowledge Sharing.
This is a task in charge of supplying or transferring knowledge from the main memory to procedures for the formation of new neural architectures (the training of other neural networks), so that the proposed framework can re-utilise the collected knowledge for other tasks, such as the prediction of behaviors of associations in the original environment.
5. Extracting-Traversing and Extracting-Querying.
Since the arrangement of nodes, forming the artificial memory, can be seen
as a data structure in which knowledge of an environment is distributed and
accumulated because of training, it is necessary to build techniques which
permit the exploration, recovery and exploitation of the information satisfactorily defined in the neural structure.
To accomplish these tasks, especially in the representation of the extracted
knowledge, techniques produced in the ARM field could be reused; nevertheless, a re-definition of them needs to be done to adjust their performance to the new space defined by the weight-matrix structure.
6. Task Logic.
This stage groups techniques and methodologies to lead the extraction of knowledge in order to solve a mining process. It concentrates some of the strategies to be followed in the extraction of knowledge from the memory in order to speed up the process. This stage will keep a direct relationship with the strategies defined to carry out the completion of the finding of association rules. Operation boundaries are not completely defined, as the aims of their existence can be supported by results given by other neural structures or other stages existing in the framework.
7. ARM Logic.
This is the stage which leads to the generation of rules from the knowledge
embedded in the neural network. For instance, it will be responsible for applying the correct strategies for planning the best logistics of the mining exercise. In other words, this stage would administer the total resources, data and processes, for the generation of association rules from a neural network trained with information about events that occurred in some environment.
3.3 A Formal Definition of the Problem
Let D be a finite collection of n binary events or discrete patterns X, called itemsets, which represent different formations of associations among the m elements (items) of a set in an environment.
Let Φ be a feature space or knowledge (a weight matrix) formed by the m nodes of a neural model m_ANN trained with D.

Under the original conditions of D, a value f (itemset support), representing the frequency of occurrence of an association or pattern X in D, is calculated by performing the counting process P of that pattern in the environment, so that f(X|D) = P(X, D).

Since m_ANN has already acquired knowledge about D, the aim is to reproduce an estimation of the frequency of occurrence of pattern X, known as f̂, through the application of a decoding procedure Θ on the feature space Φ, such that f(X|D) ≈ f̂(X|Φ) = Θ(X, Φ). Therefore, the problem can be set up as seeking a definition of Θ for a particular Φ in order to produce frequency estimations for patterns in D.
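To make this set-up concrete, the following minimal sketch (illustrative Python with hypothetical helper names, not part of the thesis implementation) contrasts the scan-based counting P(X, D) with a generic decoding procedure Θ applied to a learnt feature space Φ.

from typing import Any, Callable, FrozenSet, List

Itemset = FrozenSet[int]

def support_by_counting(x: Itemset, d: List[Itemset]) -> float:
    """f(X|D) = P(X, D): fraction of the n transactions in D that contain X."""
    return sum(1 for t in d if x <= t) / len(d)

# Theta is any decoding procedure applied to the feature space Phi
# (e.g. a trained weight matrix) instead of the raw data.
SupportDecoder = Callable[[Itemset, Any], float]

def estimate_support(x: Itemset, phi: Any, theta: SupportDecoder) -> float:
    """f_hat(X|Phi) = Theta(X, Phi): support recalled from the trained model."""
    return theta(x, phi)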
3.4 Ideal ANN Characteristics for Building Memories for ARM
Based on the problem defined above, we consider it important to state some properties of ANNs which might be relevant for tackling our current problem, as follows:
• It is important to consider ANNs whose mapping or weight matrix formed
is smaller than the size of the environment represented by patterns in D.
Therefore, quantization properties can be relevant for tackling the summarisation or compacting of the original environment.
• It is required that the neural-based model deals well with large n-dimensional
patterns, particularly with data that could be formatted as binary arrays.
• The neural-network model will have to be able to learn new patterns without forgetting past knowledge. This characteristic in neural networks is known as the stability-plasticity dilemma. Exploiting this characteristic of a neural network would avoid employing third-party processes for the maintenance of rules.
• Initially, neural networks that conduct unsupervised learning attract our attention because no information about the patterns or associations in the environment is known a priori. However, supervised ANNs can also be considered for our aims if they are able to learn the associations of the elements of the input pattern by supporting the definition of auto-associative inputs.
3.5 Reasons for Studying an AAM and a SOM for
ARM
The world of neural networks is vast in terms of the different paradigms of learning algorithms (e.g., error-correction learning, Hebbian learning, competitive learning, etc.) and architectures (recurrent and feed-forward networks).
It is the combination of learning algorithms and architectures which makes the
generation of solutions for learning tasks such as pattern association, pattern
recognition (classification), control, function approximation, filtering, and others possible.
Since this thesis concerns investigating the suitability of ANNs for association rule mining, we have chosen two different neural networks from the large list
of supervised and unsupervised candidates. The selection of an auto-associative
memory and a self-organising map to take part in our study is mainly because of
their properties, which are summarised as follows:
Auto-Associative Memory
Among the learning tasks, PA (Pattern Association) has caught our interest as
a viable alternative to finding answers to the research questions, stated in Section 1.4 in Chapter 1, because it is a task which can be performed by involving
concepts of learning, memory and association.
To achieve pattern association, according to Ham and Kostanic (Ham and
Kostanic, 2001), any neural network (e.g., feedforward multilayer perceptron
networks, counterpropagation networks, radial basis function networks and associative memory networks) from the group of the mapping networks, whose
general structure is shown in Figure 3.2, can be employed for this purpose.
Figure 3.2: General structure of a mapping neural network. This appears in (Ham and
Kostanic, 2001).
As a mapping network, in which the input patterns are coded and projected
into the synapse weights during training, an associative memory aims to imitate
the memory capabilities performed by the human brain, which has the ability
to retrieve and store information via the management of the association concept (Kohonen, 1978). In other words, this type of artificial neural network is
taught to associate (O’Keefe, 1995). It learns the relationships, established by
the pairs of input patterns, through storing them in a content addressable manner. This neural network learns knowledge from its environment by exploiting
the explicit associations defined by each of the input-pattern pairs {X, Y } utilized for its training. Moreover, the concept of association is presented and used
when information is retrieved from it. In other words, this neural network is able
to give a response to the environment when a stimulus is presented to its inputs
by using the concept of association between the stimulus and its weight matrix.
One of the main abilities of this neural network is to remember information
given in its training. This is possible because it forms a mapping (weight matrix) in which the target patterns (memorized patterns) and the associations are
organised and stored for future recalls, that can involve queries with corrupted or
noisy stimuli.
Like its biological counterpart (the human brain), this ANN also handles the
concept of memory in two stages: storage and recall of information. While
the former phase refers to the training of the network, the latter refers to the
extraction of information from the weight matrix in response to a stimulus.
These two phases resemble the operations performed by the human in our example, stated in Chapter 1, in which he scans the data defining some shopping baskets (shopping transactions) to memorise as much of the information as possible in order to answer queries about the contents of the baskets, for instance, queries regarding the associations among the items bought.
To recall information from its weight matrix, a stimulus is always required. Unlike conventional memories, associative memories have the ability to learn and generalise. Moreover, no exact location of the required information is needed for its recovery; instead, the recall is formed from the information distributed in its nodes.
Although a task of pattern association can be confused with a task of PR (Pattern Recognition), which can be understood as the process whereby a received
pattern is assigned to one of the prescribed classes (Haykin, 1999), a difference
in the definition of the targets yi between these two tasks can be pointed out.
Whereas an ANN for a pure PR task normally uses yi as a value to define the
class ci (ci = yi ) to which an input pattern xi belongs, an ANN for PA uses yi
as a vector, known as the memorised pattern, to represent a pattern to which an
input pattern, called the key pattern, will be associated.
For our problem, which involves the estimation of itemset support from a weight matrix, we believe that a neural network for PA is practically more appropriate than one for PR. The main reason lies in the definition of our input data, which can be summarised as a dataset D composed of itemsets (binary patterns or binary transactions): there is no information, apart from the itemsets (transactions) themselves, which could serve to fulfill the targets needed in a pure PR task. Therefore, it can be
stated that the target definition of the different learning tasks has been an important factor in leading this research towards the study of neural networks which
let data speak for themselves, especially in cases, such as ours, in which there
are no labels associated to the input data before they are learnt.
Since it has been stated that the target yi (memorized pattern) will be defined
by the corresponding key pattern xi of each input pair, our current research could
be confined to the study of the suitability of an AM (Auto-Associative Memory)
for ARM since this neural network holds the characteristic of yi = xi ∀xi in D
in its inputs.
We focus on studying an AM because it also satisfies the properties of the theoretical internal support model depicted in Figure 3.3, which was defined by A.R. Gardner-Medwin and H.B. Barlow (Gardner-Medwin and Barlow, 2001) to study the limits of counting. The latter is a very important task needed for the generation of association rules from databases.
Figure 3.3: Outline of the theoretical internal support model defined in (Gardner-Medwin and Barlow, 2001) to produce the counting of patterns with distributed representations in a group of neurons.
According to Austin (Austin, 1996) almost any ANN used for pattern classification can be used for building an associative memory. Nevertheless, factors
to be considered for real-life problems are the speed of training and the type of
inputs they handle (Ham and Kostanic, 2001).
The Hopfield neural network (Hopfield, 1982) can be regarded as a good candidate since it is an ANN which learns and recovers memories (patterns) in a manner similar to the human brain. For instance, a Hopfield network is able to recover (remember) a taught pattern with just partial information given as a stimulus to the network. Hence, it can be stated that a characteristic of this ANN is robustness.
A Hopfield network for ARM has already been proposed by Gaber et al. (Gaber et al., 2000a), from which we have concluded the following:
This neural network was not used to calculate the support of the itemsets from a dataset; instead, the identification of a variant of them, called maximal itemsets, was proposed. Due to the nature of the maximal itemsets, the main drawback of this work is that, to calculate support for all itemsets derived from the discovered maximal itemsets, an extra pass over the training data is needed. In other words, even if the network has learnt to detect the maximal itemsets from an environment D, it is not capable of providing information about the support of all possible frequent itemsets (associations) happening in it.
Taking Gaber’s work as an antecedent, and considering the options available for building an AM exposed, for instance, by Austin (Austin, 1996) and Ham and Kostanic (Ham and Kostanic, 2001), we have decided to study an AM based on a CMM (Correlation Matrix Memory), because a CMM is purely oriented to PA tasks, deals with binary and non-binary data, and has fast training and recall stages.
A CMM, whose training is defined in Appendix B, has been chosen because its operativity has been used to build more robust and complex ANN systems; cf. Kohonen’s book (Kohonen, 1978), or the ADAM (Advanced Distributed Associative Memory) and AURA (Advanced Uncertain Reasoning Architecture) systems, both developed by Austin et al. (Austin and Stonham, 1987; Austin, 1995; Austin et al., 1995). Moreover, as will be explained in the next chapter, it has a natural ability for ARM because it accumulates knowledge from its environment through a superposition of the memories representing the input pairs, which makes the estimation of itemset support possible.
Self-Organising Map
According to Kohonen (Kohonen, 1996), various human sensory impressions
are neurologically mapped into the brain so that spatial or other relations among
stimuli correspond to spatial relations among the neurons organised into a two-dimensional map. Hence, in order to build a SOM-based memory, our aim is to determine whether the counting of patterns occurring in a high-dimensional space can be reproduced with the knowledge embedded in the two-dimensional mapping produced by a SOM.
Our study is focused on a SOM in Chapter 5 because it is an outstanding model among the unsupervised neural networks, one which forms its knowledge from the resources expressed by the training data by letting the data speak for itself. Furthermore, the interpretation of its embedded knowledge has become an important
activity for data analysis. For instance, the inspection of correlations among the
learnt variables has been investigated by considering the similarities existing in
the planes formed by such variables (Vesanto and Ahola, 1999). The visualisation of the contribution of each variable in the formation of the map has also
been addressed (Kaski et al., 1998a). To facilitate visual inspection of the highdimensional data, visual techniques for the cluster results given by a SOM, based
on data projection methods, have been developed (Su and Chang, 2001). Also,
the extraction of logical rules from trained self-organising networks, in particular
for classification problems, has been explored to create different understandings
from them (Hammer et al., 2002; Malone et al., 2006).
More importantly, a SOM has the ability to undertake tasks such as data
clustering (Vesanto and Alhoniemi, 2000; Kiang, 2001) and vector quantization (Heskes, 2001) which make the formation of a compact representation of
the training environment possible.
Kohonen et al. (Kohonen et al., 2000) have tackled the problem of producing
massive maps under environments in which data naturally occur in large amounts
to perform their visual exploration. The usage of a SOM for combinatorial problems has already been explored, since the individual SOM neurons tend to learn the properties of the underlying distribution of the space in which they operate (Aras et al., 2003).
The inclusion of SOM technology in the process of generating association rules has been investigated in (Changchien and Lu, 2001; Shangming Yang, 2004). Nevertheless, as stated above, these proposals have the disadvantage that neither an interpretation of the trained map for ARM was proposed, nor was the reproduction of the counting process investigated. Instead, the proposals have based their implementation on limiting the SOM to clustering the input data space and creating strong dependencies between the SOM clusters and the original data for the generation of association rules.
An indirect consideration of SOMs for ARM can be interpreted from the
work of Heskes (Heskes, 2001), where the relationship between SOMs, VQ and
Mixture Modeling is explored. In this work, neither itemset support, nor rules
are generated. Nevertheless, in one experiment, Heskes uses basket-market data
in order to build a map to model the relationships between the items defining the
transactions of a dataset, following the assumption that items of similar groups
have similar co-occurrence frequencies with other items in the basket. In this
case, the training data is defined by a matrix with the relative frequencies (support) of the items, in order to calculate conditional probabilities to be used in the
distance metric proposed for the formation of the map.
Figure 3.4: Theoretical projection models defined in (Gardner-Medwin and Barlow, 2001) to produce the counting of patterns with distributed representations in a group of neurons.
Additionally, due to the way in which a SOM learns from its environment,
we have categorised it as a practical representation of the theoretical projection
model, depicted in Figure 3.4, defined by A.R. Gardner-Medwin and H.B. Barlow (Gardner-Medwin and Barlow, 2001), in which the knowledge about the
frequency of the pattern occurrences in an environment is distributed during
learning in the neural components of the networks. Therefore, we have assumed
that estimations about the occurrences of patterns, known as itemset support in
ARM, can be produced from a SOM by interpreting the local knowledge generated by the nodes of the map.
3.6 Similarities and Differences with Surveyed Approaches
Taking into account our aims and the current research on the topic of ANNs for ARM, we believe it is important to state how this research differs from the approaches summarized above, as follows:
Our work will not focus on detecting either seed patterns (Sallans, 1997) or
maximal itemsets (Gaber et al., 2000b) because even if an ANN was able to
detect or learn them correctly, it would still need to know at least one of their
properties, such as support, to measure the relevance of such patterns or associations in the environment to generate the desired rules. Therefore,
this thesis concentrates more on evaluating if our ANN candidates can learn
something about the support of the input associations in order to recall those values when some stimulus is presented to the network. Additionally, we believe
that, by capturing support with ANNs, the calculation of other metrics, such as rule confidence, which defines the conditional probability among the itemsets forming the body of the rules, can be done straightforwardly from the same embedded knowledge in the memory.
Similar to other approaches, a SOM will be used. Nevertheless, we do
not want to restrict its abilities to just clustering data for ARM as proposed
in (Changchien and Lu, 2001; Shangming Yang, 2004); instead, it is believed that, due to its unsupervised properties, the SOM is able to let the data speak for itself and that, therefore, properties such as the frequency of the training patterns must exist in the resulting mapping; decoding them is our target.
Because we believe the dependency between the proposed neural network
and its training data shown in (Gaber et al., 2000b; Changchien and Lu, 2001;
Shangming Yang, 2004) is a negative property for the future neural framework
for ARM, we are interested in breaking it down and alternatively decoding the
knowledge distributed within the nodes of the chosen ANNs in order to discover
or calculate itemset support from it.
Regarding our main interest in finding a pattern counting ability in our candidates, it can be stated that our work will produce applied neural models rather
than theoretical ones as in (Gardner-Medwin and Barlow, 2001). However, our
selected neural networks, a self-organising map and an auto-associative memory,
can theoretically be categorised into the proposed projection and internal support
models respectively. In addition, our studies will not ignore the associative statistical factor in the input patterns as it was done in (Gardner-Medwin and Barlow,
2001). Therefore, our proposal will endeavor to estimate support values for any
varied item combinations.
Probably the work most similar to ours, in the sense of generating association rules from an ANN and employing incremental training of ANNs for maintaining rules throughout time, is that of (Eom and Zhang, 2005; Eom, 2006); nevertheless, we focus on studying ANNs which can handle unsupervised tasks rather than supervised ones, because we believe that employing a supervised ANN would limit the operation of the desired framework to only data which could be classified a priori. The latter is not a feature of the problems often tackled by ARM.
It is important to state that our work is about establishing the baseline for the representation of a system in the format of association rules through the knowledge learnt by a neural network about such a system, and not about obtaining the optimal set of rules for that system, which we consider to be a further task once it has been shown that association rules can be produced from a trained neural network.
3.7 Conclusions
In this chapter, the literature involved in our prime objectives has been summarized. The proposed ANN-based framework for ARM has been explained in
more detail.
We stated some of the characteristics that we think are important for an ANN to have if it is to be used for developing a neural-based framework for ARM or a similar task involving the counting of patterns.
No less importantly, the reasons for studying an auto-associative memory and a self-organising map for association rule mining have been stated here. Additionally, we have specified the differences and affinities between this research and the current literature.
Chapter 4
An Auto-Associative Memory for
ARM
To provide answers to the research questions stated in the introduction, in particular, to determine if a trained ANN can have the resources (information) in
its weight matrix to answer queries regarding the support of the itemsets drawn
from the training dataset, we begin by studying the suitability of an associative memory for association rule mining, because this particular ANN bases its
operativity on the concept of association. This neural network exploits the associativity property among the components of the inputs not only for learning its
environment, but also for emulating human-memory operations in the recall of
information from the knowledge embedded in its weight matrix. In particular,
we focus on studying an auto-associative memory based on a Correlation Matrix
Memory (CMM).
After justifying the research on this type of neural network in Chapter 3, we first study and analyse here whether changes need to be applied to the training rule of this neural network in order for it to learn not only the associations defined by the input patterns, but also the information about the appearance frequency of the patterns that is needed for itemset-support calculations. Secondly, an interpretation of the resulting mapping (the weight matrix formed by supervised training) is proposed to perform itemset-support recalls when stimuli (queries about itemsets) are presented to the memory.
To evaluate the accuracy of the recalls made by the associative memory
through our proposals, we compare its results with the results obtained by the
Apriori algorithm, which is an important algorithm for the calculation of support
and consequently for the general process of association rule mining. Conclusions about the response of the associative memory for the calculation of itemset
support will be given at the end of this chapter.
4.1 Correlation Matrix Memory for ARM
Since a correlation matrix memory operates in two stages, learning and recalling, we will first point out the natural ability of the CMM to learn information about the number of appearances of the components describing the input patterns. Secondly, we will propose how the mapping resulting from a correlation matrix memory should be understood in order to retrieve information regarding the support of any itemset, defined in the training
patterns, when this memory is queried to recall such information.
4.1.1 The Learning of Itemset Support by a CMM
To begin, it is necessary to re-state that a CMM is a one-layer memory structure which is basically defined by a square array whose dimension is the number m of elements in the patterns (this statement on the memory size is only valid in the case of an auto-associative memory). Therefore, the corresponding memory matrix, resulting from supervised training, is represented as M ∈ ℝ^{m×m} or B^{m×m} for non-binary and binary models respectively.
In order to be trained in any D environment, both types of CMM, which have also
been identified by O’Keefe (O’Keefe, 1995) as weightless or weighted memories
respectively, require pairs of patterns representing associations with a form X →
Y, in which an input or key pattern is associated with an output or memorised pattern. In our case, the available data is signified by a group of n unipolar vectors (patterns or itemsets) in {0, 1}^m, in which a value of one assigned to some k elements or items defines their existence within the pattern, and zero is assigned otherwise. That is, each input is an m-vector defining a particular association among items from the set I = {i_1, i_2, ..., i_m}.
The corresponding associations to be presented to the memory may look as

{(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)}    (4.1)

in which both patterns, the key and the memorised, will obtain their values from the n transactions in D. In particular, each pattern of each pair (X_k ∈ {0,1}^{m×1}, Y_k ∈ {0,1}^{m×1}) will take the same value defined by a transaction t_k. Hence, a pair k satisfies:

X_k → Y_k = (X_k, Y_k),  X_k = {x_{1k}, x_{2k}, ..., x_{mk}},  Y_k = {y_{1k}, y_{2k}, ..., y_{mk}},  X_k = Y_k = t_k.x    (4.2)
To train a weighted CMM, the equation governing this procedure has been defined in the literature (Haykin, 1999; Ham and Kostanic, 2001) as follows:

M = \sum_{k=1}^{n} Y_k X_k^T    (4.3)

It is the term Y_k X_k^T, or outer product, which has been determined to be an estimation of the weight matrix W(k) of the neural network functioning as a linear associator (Ham and Kostanic, 2001). This matrix W(k), which associates or maps Y_k onto X_k, forms a mapping representing solely the association described by the k-th input pair in turn. Therefore, a resulting matrix M must be understood as a grouping or encoding (sum) of the n weighted matrices W. This summing or correlation among matrices can be illustrated as in Figure 4.1.
Figure 4.1: Illustration of the accumulation of knowledge by a CMM.
Individual weight values of the network, whose update resembles a generalisation of the Hebbian learning rule, can be expressed by

w_{ij} = \sum_{k=1}^{n} y_{ik} x_{jk}    (4.4)
As a consequence of using unipolar elements as inputs, the product y_{ik} x_{jk} for some k-th pattern will be in one of the two following states:

y_{ik} x_{jk} = 1 if the association ij exists in pattern k, and 0 otherwise    (4.5)
In particular, it can be noticed that a matrix k, representing a taught association between patterns X and Y, will show existence values in those components which satisfy:

w_{ij} = 1  for all {i, j} such that {i, j} ∈ P({j | i_j = 1 in X}, 2)    (4.6)

in which the pairs of indexes {i, j} derived from P, which represents the set of permutations with repetition among the indexes of the items present in the input pattern X, define the matrix elements that need to be updated to learn the corresponding auto-association of the pattern X.
Being conscious of the sum operation performed by the CMM to learn the incoming data, it can then be deduced from Equation 4.3 that, for the benefit of this thesis, the value at each component w_{ij} not only indicates the existence of an association between elements i and j, but also the number of times that such an association has occurred in the environment. Therefore, we can assert that a weighted CMM naturally builds a frequency matrix when its training involves unipolar inputs. Discovering this characteristic of the ANN is relevant because the values associated with its nodes can be used, as will be explained later, to recall itemset support.
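As a simple illustration of this natural counting behaviour, the sketch below (a simplified NumPy version under the auto-associative setting X_k = Y_k = t_k, not the implementation used in this thesis) accumulates the outer products of Equation 4.3 and shows that each w_ij ends up holding the number of co-occurrences of items i and j.

import numpy as np

def train_weighted_cmm(transactions: np.ndarray) -> np.ndarray:
    """Build the m-by-m frequency matrix M = sum_k Y_k X_k^T for an
    auto-associative CMM, where each row of `transactions` is a binary
    itemset over m items (X_k = Y_k = t_k)."""
    n, m = transactions.shape
    M = np.zeros((m, m))
    for t in transactions:
        M += np.outer(t, t)    # W(k) = Y_k X_k^T for one input pair
    return M                   # w_ij = number of co-occurrences of items i and j

# Example: three tiny transactions over m = 4 items
D = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 0, 1]])
M = train_weighted_cmm(D)
assert M[0, 0] == 2 and M[0, 1] == 2 and M[1, 3] == 1   # item and pair counts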
In the case of the weightless CMM, training is based on:

w_{ij} = \bigvee_{k=1}^{n} y_{ik} x_{jk}    (4.7)

in which the accumulation of knowledge is produced by performing a superposition, through a bitwise-OR operator, among the different W matrices resulting from the different associations described by the training pairs (Austin and Stonham, 1987). That is, this type of CMM has the characteristic of unifying the Ws rather than summing them up. Therefore, the final weight matrix W fires its elements w_{ij} in those cases in which the individual matrices W have active intersections, in order to record the associations. This W is defined by

W = \bigvee_{k=1}^{n} X_k Y_k^T    (4.8)

Figure 4.2: Illustration of the accumulation of knowledge by a weightless CMM which has been modified to collect frequency information. The dark matrix illustrates the new matrix, called the frequency matrix Mf, which contains the corresponding pattern frequencies.
Unfortunately, the weightless CMM, unlike its weighted counterpart, does not have the natural ability to collect knowledge regarding the frequencies of the pattern components. Therefore, to reproduce such an ability, we propose to build a matrix as in Figure 4.2 for gathering such information; nevertheless, a proper justification for the construction and maintenance of this data structure needs to be found, particularly given that the weighted CMM performs this process naturally.
So far it has been found that, due to the auto-associative property of a weighted CMM, this ANN is able to learn information about the occurrences of the components of its inputs naturally. That is, no change needed to be made to its traditional training to gather frequency knowledge about the input patterns. Hence, in the next section, we will explain how a trained CMM can be interpreted in order to estimate itemset support from it.
4.1.2 Recalling Itemset Support from The Weight Matrix of a CMM
One important characteristic of the frequency matrix formed by a CMM is that it is symmetric. This allows our desired itemset-support recall mechanism to focus on the knowledge of only m(m + 1)/2 nodes rather than the m^2 elements of the complete matrix. That is, the remaining m^2 − m(m + 1)/2 = m(m − 1)/2 nodes can be disposed of for our aims. The new resource matrix, from which itemset-support values will be estimated, is represented by the lower-triangular matrix

W = \begin{bmatrix} w_{11} & & & \\ w_{21} & w_{22} & & \\ \vdots & \vdots & \ddots & \\ w_{m1} & w_{m2} & \cdots & w_{mm} \end{bmatrix}    (4.9)
To use the embedded information in this triangular W matrix for our purposes, the right interpretation needs to be drawn in order to produce accurate
support values from it. Therefore, to achieve itemset-support recalls from W
when a stimulus is presented, we propose the following mechanism of interpretation:
Calculating The Support for 1-itemsets
In this case, the aim is to generate a recall of support for any of the individual items i_j ∈ I. This can be understood as the recall of a value among the elements of the matrix whose indexes satisfy i = j. Therefore, our recall action is focused on the elements of the main diagonal of the matrix {w_{11}, w_{22}, ..., w_{mm}}. Since it has already been stated that the number of occurrences of such elements is stored in the corresponding components w_{ii}, giving a recall for the support of these items only involves carrying out the calculation of P(w_{ii}), which defines the probability of the i-th item in the n patterns defining the training dataset. Therefore, the corresponding P(w_{ii}) is defined as follows:

supp(i) = P(w_{ii}) = freq(w_{ii}) / n    (4.10)
Calculating The Support for 2-itemsets
To calculate support for the group of 2-itemsets, which contains the C(m, 2) combinations derived from m items, we need to apply Equation 4.10 to the elements located off the main diagonal. In particular, to recall the support of a rule i_j → i_k, defined by supp(i_j → i_k) or simply supp(i_j i_k), it is only necessary to use the corresponding value w_{ij} from the matrix. Therefore, the recall of support for any itemset X = {ij} from this particular group can be defined as follows:

supp(X) = supp(ij) = P(w_{ij}) = freq(w_{ij}) / n    for all i ≠ j    (4.11)
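A minimal sketch of these two recall rules (Equations 4.10 and 4.11) is shown below, reusing the frequency matrix M and the transaction count n from the previous sketch; the function names are illustrative only.

def support_1item(M, n, i):
    """supp(i) = freq(w_ii) / n (Equation 4.10)."""
    return M[i, i] / n

def support_2itemset(M, n, i, j):
    """supp({i, j}) = freq(w_ij) / n for i != j (Equation 4.11)."""
    return M[i, j] / n

# With M and D from the previous sketch (n = 3 transactions):
# support_1item(M, 3, 1)       -> 1.0    (item 1 occurs in every transaction)
# support_2itemset(M, 3, 0, 1) -> 0.667  (items 0 and 1 co-occur in two of three)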
Estimating Support for The k-itemsets (2 < k ≤ m)
The calculation of support for the groups of k-itemsets when k > 2 is not as straightforward as in the two previous cases. That is, in this case, we need to define a mechanism which can combine the knowledge in the CMM to estimate itemset support. In order to achieve this new aim, two proposals, based on Definition 1, will be discussed below.
Definition 1 (Independent Random Variables) Let k random variables (X_1, X_2, ..., X_k) represent k events (i_1, i_2, ..., i_k). As such events do not influence one another in any form, it can be stated that they are independent from one another; therefore, the probability that they occur together can be specified by calculating their joint probability as follows:

Pr(X_1 = i_1, ..., X_k = i_k) = \prod_{i=1}^{k} Pr(X_i = i_i)    (4.12)
In our first proposal, which will be identified as method A, the support value will be estimated by employing the probabilities given by the elements of the main diagonal of the matrix. In other words, it will be assumed that the probability of a k-itemset X is given by the joint probability of the items associated within it. Therefore, a support value can be represented by

\hat{supp}(X) = P(X) = \prod_{i \in X} P(w_{ii})    (4.13)
For the second proposal, or method B, instead of using the k individual probabilities, the following is proposed. By using the associative property of multiplication over the elements of Equation 4.13, paired arrangements of products among the k elements can be established as follows:

P(X) = (P(w_{11}) * P(w_{22})) * ... * (P(w_{jj}) * P(w_{kk}))    (4.14)

Each paired product of probabilities can be stated to hold the following property,

P(A ∩ B) = P(AB) = P(A) P(B)    (4.15)

which establishes that the probability of the intersection of two events is defined by the product of their individual probabilities. Hence, P(X) in Equation 4.14 can be re-defined as

P(X) = (P(w_{12})) * ... * (P(w_{jk}))    (4.16)
Therefore, in order to give a recall of the support of a k-itemset where k > 2, the calculation needed is given by

\hat{supp}(X) = P(X) = \prod_{pairs {i,j} \in X, i > j}^{k/2} P(w_{ij})                          when k is even
\hat{supp}(X) = P(X) = ( \prod_{pairs {i,j} \in X, i > j}^{(k-1)/2} P(w_{ij}) ) * P(w_{kk})      when k is odd    (4.17)
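The two estimation methods can be sketched as follows (an illustrative reading of Equations 4.13 and 4.17, assuming the items of the queried itemset are given as a list of indices; in method B the items are consumed in consecutive pairs, with the left-over item handled through its diagonal entry when k is odd).

def support_method_a(M, n, items):
    """Method A: product of the individual item probabilities (Eq. 4.13)."""
    est = 1.0
    for i in items:
        est *= M[i, i] / n
    return est

def support_method_b(M, n, items):
    """Method B: product over pairwise probabilities taken from the matrix,
    times the diagonal probability of the left-over item when k is odd (Eq. 4.17)."""
    est = 1.0
    for a, b in zip(items[0::2], items[1::2]):   # consecutive pairs of items
        est *= M[a, b] / n
    if len(items) % 2 == 1:                      # odd k: one unpaired item
        last = items[-1]
        est *= M[last, last] / n
    return est

# e.g. support_method_b(M, 3, [0, 1, 2]) uses P(w_01) * P(w_22)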
4.1.3 Complexity Analysis: CMM vs. Apriori
Since the use of a CMM for ARM is totally novel, we have considered it relevant
to provide a theoretical basis for the complexity analysis between the Apriori
algorithm and a CMM for ARM. It is important to state that although different
Apriori implementations exist, the analysis conducted here is based on a breadth-first implementation, since it is the gold standard from which many other implementations have been derived.
Let m denote the total number of items, from which n transactions have been derived to define a dataset D. Therefore, the time and space
complexities can be expressed as follows:
Space Complexity
Since all the itemsets discovered so far have to be saved by Apriori while the mining is performed, the space required to represent all of them is O(2^m), assuming that each itemset is represented independently in each node of the chosen data structure. In the case of a CMM, this complexity is defined by its weight matrix, in which the input patterns are compressed, and is equal to O(m^2).
Time Complexity
In order to determine if a k-itemset (an itemset with k items) is frequent, the counting of its occurrences in D is done by Apriori. Therefore, the complexity of such a calculation is defined by O(nk), whilst an estimation, realised by our methods A and B, produces a complexity of O(k) and O(0.5k) respectively.
There is another complexity involved in the case of the CMM, which is produced by the process of learning D and is defined by O(nm^2).
4.2 Experiments
It has been stated above how the resources, needed for the calculation for itemset
support, are being captured (learnt) by a CMM during training. Additionally, it
has also been proposed how the auto-associative memory based on CMM should
be interpreted for retrieving itemset support from its weight matrix.
In order to determine the accuracy of the proposed mechanisms for itemset-support recalls from a trained CMM, some experiments with real-life datasets,
defined in Table 4.1, will be performed in this section.
Dataset     Number of Transactions (n)    Number of Items (m)
Chess       3196                          75
Connect     67557                         129 (avg. 43 per transaction)
Mushroom    8124                          119 (avg. 23 per transaction)

Table 4.1: List of real-life binary training datasets used in the testing of an auto-associative memory for ARM. They are part of the datasets normally used for testing FIM algorithms or FIM benchmarks (Jr. et al., 2004; Goethals and Zaki, 2003).
The comparison of the CMM recalls, defining itemset-support estimations, is
realised against the zero-error support values given by the Apriori algorithm. In
particular, we use the implementation of Borgelt (Borgelt, 2003) of the Apriori
algorithm which scans the dataset D in turn and builds a data structure to do the
counting of the support of itemsets.
It is very important to note that the results of the experiments presented here do not involve calculations for the groups of 1- and 2-itemsets. These groups were excluded because the corresponding support calculation is a straightforward procedure, as explained above, which produces zero-error support values just as the Apriori algorithm does. These errorless values are possible because of the CMM’s natural ability to learn and store this itemset property (support) of the input patterns while it learns the corresponding associations.
Therefore, the experimentation carried out here focuses on testing the proposed
procedures for the estimation of the support for k-itemsets when k >2. In these
experiments, while the method A, defined by Equation 4.13, involves exploiting
the independence property of the items for the calculation of the itemset support,
the method B, defined by Equation 4.17, uses the information of the support of
the 2-itemsets involved in the k-itemset to make a recall. In particular, our experiments consist of querying the associative memory to recall the support for
some groups of k-itemsets for testing its accuracy.
To determine the accuracy of the generalization given by this memory, an
error for each of the different itemset groups, involved in the queries, will be
measured. The calculated error will be represented by the well-known RMS
(Root-Mean-Square) error defined as follows:
E_RMS = \sqrt{ (1/n) \sum_{i=1}^{n} ( y_i(x_i; W) − t_i )^2 }    (4.18)

In which x_i represents a k-itemset of a group of n k-itemsets, whose support values y_i and t_i (fluctuating in the range of 0, it never occurs, to 100, it always occurs) have been calculated respectively by a recall made from the weight matrix W of a trained CMM and by a traditional FIM process performed by the Apriori algorithm for comparison.
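For completeness, the error measure of Equation 4.18 can be computed as in the following small sketch (hypothetical variable names; `recalled` holds the CMM estimations and `counted` the Apriori support values for the same group of queried itemsets).

import numpy as np

def rms_error(recalled: np.ndarray, counted: np.ndarray) -> float:
    """E_RMS over the n queried itemsets (Equation 4.18); supports lie in [0, 100]."""
    return float(np.sqrt(np.mean((recalled - counted) ** 2)))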
Since it is our aim to evaluate the accuracy of the results given by an autoassociative memory, it is necessary to define which itemsets will be employed for
testing. Although it would be ideal to check the response of the trained CMM
for the total itemset search space, the fact is that it is unfeasible because of the
search-space size. Therefore, some groups representing different support tendencies and sections of the itemset search space have been chosen to be created
as follows:
Group A The rare itemsets (low-frequent). This group is formed by elements which can have a support fluctuating between 0.001 and 1 percent of the total transactions of the dataset. This group represents itemsets which rarely appear in the dataset, therefore they can be considered to be outlier associations in the environment.

Group Type    Support constraint applied
A             Itemsets with support in the range of 0.001% – 1%
B             Itemsets with support in the range of 45% – 55%
C             Itemsets with support in the range of 90% – 100%

Table 4.2: Support constraint conditions used to form the groups of itemsets on which the memories will be tested.
Group B The regular itemsets (semi-frequent). The support values of this group
lay between 45 and 55 percent. These are itemsets which appear in a
moderate way in the dataset. The associations described by this group can
sometimes be interpreted as the obvious knowledge. Nevertheless, their
changes of support can be useful to detect abrupt tendencies in the dataset
throughout time.
Group C The constant itemsets (high-frequent). This is the group of certainty.
It contains itemsets which appear frequently in the dataset. An itemset
in this group can have a support in the range of 90 to 100 percent. This
group describes the common interests among the transactions defined by
an environment.
After defining the itemset groups on which our proposals will be tested, the
elements of each group have been determined by running the Apriori algorithm
over the original datasets with the corresponding support constraint conditions
defined in Table 4.2. Once the groups have been formed, each member is used as
stimulus to query the corresponding memories in order to compare the support
values.
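The construction of the test groups can be sketched as follows (an illustrative filter, assuming the frequent itemsets and their support percentages have already been produced by an Apriori run; the ranges correspond to Table 4.2).

# Support ranges (in %) from Table 4.2.
GROUP_RANGES = {"A": (0.001, 1.0), "B": (45.0, 55.0), "C": (90.0, 100.0)}

def build_test_groups(itemsets_with_support):
    """Split (itemset, support%) pairs into the groups A, B and C of Table 4.2."""
    groups = {name: [] for name in GROUP_RANGES}
    for itemset, support in itemsets_with_support:
        for name, (low, high) in GROUP_RANGES.items():
            if low <= support <= high:
                groups[name].append(itemset)
    return groups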
Before the results of the experiments are shown and discussed, we have plotted each of the final trained Ws, which have learnt the associations of the datasets of Table 4.1, in Figures 4.3, 4.4, and 4.5. Each of these matrices represents the available source of knowledge from which each memory will produce a recall (estimation) of itemset support for the queried itemsets (patterns) through our proposals.

Figure 4.3: 75-by-75 frequency matrix formed by a CMM, from which itemset-support recalls about the Chess dataset will be made.
The numbers describing the errors calculated for the chosen datasets corresponding to the recalls for 3- and 4-itemset groups are shown in Tables 4.3
and 4.4 respectively. It is important to comment that, due to the range of values
available to represent itemset support (in this case from 0 to 100%), the worst
error scenario in a recall will be determined by a value of 100.
In order to discover the accuracy of recalls made by the CMM with our proposals in situations which involve itemsets of varying sizes, we have set up some experiments with the three selected datasets and summarised the final results in Tables 4.5 and 4.6.
Figure 4.4: 129-by-129 frequency matrix formed by a CMM, from which itemset-support recalls about the Connect dataset will be made.
Figure 4.5: 119-by-119 frequency matrix formed by a CMM, from which itemset-support recalls about the Mushroom dataset will be made.
Dataset     Itemset Group    Itemsets in Group (n)    RMS Error, Method A    RMS Error, Method B
Chess       A                12,340                   0.790250               0.766740
Chess       B                1,871                    3.534600               2.769100
Chess       C                167                      0.617510               0.475190
Connect     A                113,013                  0.21098                0.20019
Connect     B                1,540                    2.9726                 2.3955
Connect     C                826                      0.67045                0.38234
Mushroom    A                21,961                   1.0606                 1.0402
Mushroom    B                74                       4.9631                 4.1033
Mushroom    C                -                        -                      -

Table 4.3: Error results obtained in the experiments for the support recall for the groups of 3-itemsets, constrained as defined in Table 4.2, made by a CMM through our proposals.
Dataset     Itemset Group    Itemsets in Group (n)    RMS Error, Method A    RMS Error, Method B
Chess       A                269,719                  0.724600               0.534530
Chess       B                12,359                   4.172900               3.185000
Chess       C                203                      0.775860               0.505150
Connect     A                2,776,831                0.19219                0.13916
Connect     B                14,743                   3.5693                 2.8307
Connect     C                2,451                    0.77525                0.38354
Mushroom    A                243,608                  0.76988                0.7246
Mushroom    B                72                       5.2328                 3.5672
Mushroom    C                -                        -                      -

Table 4.4: Error results obtained in the experiments for the support recall for the groups of 4-itemsets, constrained as defined in Table 4.2, made by a CMM through our proposals.
In general, these results show that differences exist between the zero-error support values counted by Apriori and the approximations recalled from the CMM. The best approximations were produced by the proposal which uses the support of the 2-itemsets for estimating the support of a k-itemset presented to the memory.
Group of Itemsets (k)    Number of Itemsets (n)    RMS Error, Method A    RMS Error, Method B
1                        -                         -                      -
2                        -                         -                      -
3                        167                       0.617510               0.47519
4                        203                       0.775860               0.50515
5                        128                       0.912940               0.86483
6                        39                        1.089600               0.9105
7                        4                         1.394400               1.4274

Table 4.5: Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Chess dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm through constraining the itemsets with a minimum support between 90 and 100%.
Group of Itemsets (k)    Number of Itemsets (n)    RMS Error, Method A    RMS Error, Method B
1                        -                         -                      -
2                        -                         -                      -
3                        4985                      2.9141                 2.285
4                        25,500                    3.593                  2.6836
5                        88,170                    4.2165                 3.6522
6                        217,705                   4.8524                 3.995
7                        397,947                   5.5205                 5.0014
8                        550,220                   6.2197                 5.2905
9                        581,647                   6.9495                 6.494
10                       471,908                   7.7125                 6.6983
11                       293,209                   8.5192                 8.1773
12                       138,294                   9.3857                 8.3329
13                       48,473                    10.32                  10.098
14                       12,023                    11.317                 10.258
15                       1,896                     12.416                 12.252
16                       152                       13.865                 12.791
17                       5                         14.903                 14.751

Table 4.6: Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Chess dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 45 and 100%.
Group of Itemsets (k)    Number of Itemsets (n)    RMS Error, Method A    RMS Error, Method B
1                        -                         -                      -
2                        -                         -                      -
3                        2,757                     1.6031                 0.98297
4                        13,218                    1.921                  1.0486
5                        44,721                    2.1543                 1.4329
6                        111,159                   2.3382                 1.3499
7                        208,126                   2.4889                 1.6243
8                        297,836                   2.6207                 1.4417
9                        327,797                   2.7472                 1.6749
10                       277,133                   2.8768                 1.4244
11                       178,389                   3.012                  1.6514
12                       85,839                    3.1502                 1.3619
13                       29,903                    3.2857                 1.5753
14                       7,135                     3.4064                 1.2622
15                       1,052                     3.4853                 1.4725
16                       76                        3.4506                 1.152

Table 4.7: Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Connect dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 75 and 100%.
Group of Itemsets (k)    Number of Itemsets (n)    RMS Error, Method A    RMS Error, Method B
1                        -                         -                      -
2                        -                         -                      -
3                        110                       4.2501                 3.5735
4                        91                        4.8628                 3.2724
5                        37                        5.3181                 5.153
6                        6                         5.7803                 4.8079

Table 4.8: Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Mushroom dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 45 and 100%.
4.3 Conclusions
This chapter has been the first step in the quest for a neural network whose embedded knowledge can be used accurately for the generation of association rules, as if such rules were generated from the original (training) data D. In order to generate such rules, the property of itemset support is necessary, since it is the parameter (metric) used to evaluate which itemsets (combinations), from a total space of 2^m itemsets (where m is the number of items or elements forming the itemsets, patterns, or transactions), are interesting for the process. Therefore, in this chapter, we have investigated whether itemset support can be generated from the weight matrix formed by an associative memory. This neural network has been chosen from the large taxonomy of ANNs because its operativity is governed mainly by the use of the concept of association among the input patterns.
In particular, it has been analysed if this memory, after training, has the
knowledge needed to assign a value defining itemset support to the stimulus
(queried itemsets) presented to the memory. As a result of our analysis, it has
been found that the weighted type of this memory has the natural ability to learn
information about the frequency of itemsets defined by the input associations.
After discovering that itemset frequencies are embedded in the weight matrix, we defined how it is possible to calculate the support for the group of the
1- and 2-itemsets directly from the weight matrix. These recalled support values
from the memory are as errorless as the values given by the well-known Apriori
algorithm, which scans and counts the dataset directly to calculate itemset support.
In the case of the support for the groups of k-itemsets when 2 < k ≤ m, two methods have been proposed to tackle the problem using the only information available in the matrix, namely the support information of the groups of 1- and 2-itemsets, to give a recall. Both methods, A and B, assume that items (their events of existence) are probabilistically independent of one another. While method A uses the individual item probabilities (the values defining support in the main diagonal of the matrix) as resources for the estimation, method B forms pairs of items, whose support values are defined within the matrix, from the queried itemsets in order to give an answer.
One of the advantages of using a CMM is that the space complexity generated by its weight matrix, from which itemset support will be recalled and which is defined by O(m^2), is much smaller than the one formed by its counterpart, the Apriori algorithm, which will be equal to O(2^m) in the worst scenario.
Errors in the recalls have resulted from employing both methods, although method B has shown slightly better results. In summary, it can be stated that this memory is suitable for the perfect (errorless) calculation of support for the 1- and 2-itemset groups. Nevertheless, improvements for the case of the support for itemsets larger than two items remain an open problem.
It is relevant to note that the usage of a CMM for ARM has been conceivable due to its training, which, as shown in Figure 4.1, is the result of a superposition of the CMMs defining the input pairs of patterns.
Chapter 5
Itemset Support Generation From a
Self-Organising Map
In the previous chapter, we concluded that although a supervised ANN, such as an auto-associative memory, has the ability to remember itemset support for the groups of 1- and 2-itemsets perfectly, it struggles in recalling support for larger itemsets, due to the overlaps produced by the distribution of the learnt associations in its weight matrix.
In contrast to the previous chapter, here we will be looking at an unsupervised
ANN rather than a supervised one for the task of building our desired memory.
First of all, we will focus on studying a Self-Organising Map, because it has been
successfully used for data mining tasks such as data visualisation, clustering and
modelling. Additionally, its usage for ARM has already been proposed, but with
some limitations.
To begin with, a study on the suitability of a SOM for ARM is presented.
With the conclusions drawn from the study and considering the biological SOM
properties, we will propose how to interpret a SOM trained with patterns representing associations in an environment, in order to extract itemset support from
its nodes. In other words, we propose how to reproduce the counting of patterns
from the knowledge embedded in the map by our proposal named PISM (Probabilistic Itemset-support eStimation Mechanism).
The accuracy of the itemset-support estimations, made by a SOM for either
real-life or artificial datasets, is tested versus itemset-support results calculated
by the Apriori algorithm. In order to improve the accuracy of the estimations, the
concept of emergent feature maps is also studied. Conclusions on the suitability
and considerations for the use of this neural network for ARM are given at the
end.
5.1 Considering a SOM for ARM: Principles
Motivated by the fact that pattern-occurrence counting values can be reproduced
from the knowledge formed within the weight matrix of an auto-associative
memory, based on a correlation matrix memory, our attention is now turned to
studying whether a property of a network with an unsupervised training, such as
the SOM (Self-Organising Map), can be utilised to count itemsets, and thus can
become our itemset-support memory for ARM.
The SOM is an outstanding neural network, due to its peculiar characteristics
for multi-dimensional data clustering and visualisation. As stated in Chapter 3,
its advantage is that it captures the knowledge from a dataset without supervision. To use a SOM for a data-mining task (Kaski et al., 1998b), it is necessary to
set up some initial parameters first (for example, the radius, the neighborhood function, and others). In the SOM training, an NNS (Nearest-Neighbor Search) (Yianilos, 1993) is at the core of the process. This NNS provides a mechanism for
the selection of the BMUs (Best-Matching Units) or winners in order to determine the corresponding group of nodes, which triggers the update of the map.
Through this update process, the map m re-organises itself iteratively in such
a manner that the state M can be reached, so that m → M when t → ∞. In
this final state M , the map can be declared to be a model, which has already
gained certain information from the input or training dataset in its weight matrix
(reference models, codebook), and from which some important clustering and
data-visualisation properties can be derived satisfactorily. However, due to the
nature of ARM, the model formed will be utilised for a task which concerns neither clustering nor visualisation.
Although the training of a SOM is typically performed in a sequential mode, in which the map is updated every time a single, randomly chosen input vector is presented, we have considered the use of the batch mode, in which the map is updated once when the whole dataset has been presented. Batch training has been used, for instance, to propose the creation of large maps (Song and Lee, 1998; Kohonen et al., 2000), the formation of maps for non-vectorial data (Kohonen and Somervuo, 2002) or string data (Kohonen and Somervuo, 1998), and the speedup of SOM training through its parallelisation (Lawrence et al., 1999a; Kohonen et al., 2000). Moreover, as will be shown later, batch training allows for an easy identification of the knowledge distributed across the neurons during a counting or training epoch. Therefore, a SOM can be updated in batch (Kohonen, 1996; Kohonen et al., 2000) by calculating the new values of the reference vectors associated with its neurons as follows:
$$ m_i(t+1) = \frac{\sum_j h_{ji}(t)\, S_j(t)}{\sum_j n_{V_j}(t)\, h_{ji}(t)} \qquad (5.1) $$
This new state of the map, m_i(t+1), can be understood as the spread of influences from each node m_j, generated by the corresponding data inputs represented by S_j(t) and weighted by a neighborhood kernel function h_ji(t). The term S_j, defined below in Equation 5.2, is the sum of the n_Vj inputs contained in the Voronoi region V_j = {x_i | ‖x_i − m_j‖ < ‖x_i − m_k‖ ∀k ≠ j} corresponding to node m_j.
$$ S_j(t) = \sum_{i=1}^{n_{V_j}} x_i \qquad (5.2) $$
In the case of the kernel function, this is normally defined by a Gaussian function
as follows:
$$ h_{ij}(t) = \exp\!\left( -\frac{\|r_i - r_j\|^2}{2\sigma^2(t)} \right) \qquad (5.3) $$
in which r_i and r_j represent the positions of nodes m_i and m_j on the SOM grid, and σ defines the neighborhood radius. One characteristic of this kernel is that it takes its highest value at the origin of the influence (at the winners) and decreases monotonically over the remaining nodes of the map.
Considering Equation 5.1, the update of the map can be interpreted as a process which places Gaussian functions close to the values S_j / n_Vj. Each of these values represents a data point which can be considered to be the mean µ_j of the n_Vj patterns x_i allocated at node m_j, or the centroid n_j of the Voronoi set V_j defined by node m_j, as follows:
$$ n_j = \frac{1}{n_{V_j}} \sum_{x_i \in V_j} x_i \qquad (5.4) $$
After such functions have been placed at the corresponding nodes (the BMUs), their influences need to be propagated to the rest of the nodes in the map, in order to produce a new state of the SOM. The strength of the influences differs and is determined by the number of data points allocated at the node and its position in the map. It is worth noting that this interpretation of the update resembles a process of density estimation (Silverman, 1986; Devroye, 1987), in which Gaussian functions are often placed at the input values and summed up to determine an estimation f̂ of the real density distribution f of the input data.
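To make the batch update concrete, the following minimal sketch (in Python with NumPy; function and variable names are illustrative assumptions, not part of the software used in this thesis) performs one batch epoch following Equations 5.1–5.3.

```python
import numpy as np

def batch_som_epoch(codebook, grid_pos, data, sigma):
    """One batch-SOM epoch: find BMUs, then update every reference vector as a
    neighborhood-weighted combination of the Voronoi sums S_j (Eq. 5.1-5.3)."""
    # Nearest-neighbour search: BMU index for every input pattern.
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    bmu = dists.argmin(axis=1)

    n_nodes = codebook.shape[0]
    # Voronoi sums S_j (Eq. 5.2) and hit counts n_Vj per node.
    S = np.zeros_like(codebook)
    n_V = np.zeros(n_nodes)
    for j in range(n_nodes):
        members = data[bmu == j]
        n_V[j] = len(members)
        if len(members):
            S[j] = members.sum(axis=0)

    # Gaussian neighborhood kernel h_ji on the map grid (Eq. 5.3).
    grid_d2 = np.linalg.norm(grid_pos[:, None, :] - grid_pos[None, :, :], axis=2) ** 2
    h = np.exp(-grid_d2 / (2.0 * sigma ** 2))

    # Batch update (Eq. 5.1): m_i(t+1) = sum_j h_ji S_j / sum_j n_Vj h_ji.
    den = np.maximum(h.T @ n_V, 1e-12)
    return (h.T @ S) / den[:, None]
```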
Dataset   Transactions (itemsets) n   Items m   Description
Bin4      15                          4         All itemsets derived from 4 items
Bin6      63                          6         All itemsets derived from 6 items
Bin8      255                         8         All itemsets derived from 8 items
Bin10     1023                        10        All itemsets derived from 10 items
Bin12     4095                        12        All itemsets derived from 12 items
Bin16     65535                       16        All itemsets derived from 16 items

Table 5.1: List of binary-artificial training datasets used in the experiments for analysing SOM properties for ARM. Each dataset contains all the possible n itemsets generated by m items.
To identify which SOM characteristics can be used for ARM, experiments were conducted with the datasets described in Table 5.1. It must be noted that these data are all artificial, because our interest is not only to find out how the SOM behaves when it is fed with a data distribution formed of discrete (binary) patterns describing associations, but also to observe the SOM's response when it is trained with datasets containing the whole itemset data space, which is defined by the 2^n − 1 itemsets derived from n items.
For our experiments, the following settings apply: i) a batch training method is used; ii) the Gaussian function is used as the neighborhood function because of its probabilistic properties; and iii) the radius is constant and set to one, because in that manner the BMUs will always spread their maximum influence to the entire map.
The maps in Figure 5.1 represent the results for this analytical exercise which
can be interpreted as follows:
Formation of Clusters. To make the clusters visible, the D-matrix[1] of each map is plotted and coloured using principal components for colour assignment (Vesanto et al., 2000). The size of the clusters, as well as the strength of the influences, is directly related to the number of patterns allocated to the BMUs; therefore, in the resulting maps, particularly those with more transactions, it is important to note that the clusters organised around the edges of the square grid are relatively homogeneous in size. That is, the hypercube whose vertices define the different binary patterns of a multi-dimensional space is projected (compressed) into the two-dimensional space bounded by the neurons of the map, forming a circular arrangement of clusters along with a communal cluster in the middle. A good example in which this cluster phenomenon is visible is the resulting map for 12 items in Figure 5.1.
This cluster-size repartition is due to the fact that the input data distribution is controlled by limiting the appearance of any item to 2^n/2 times in the dataset. Although this is rare in real-life problems, it helps to understand the theoretical scenario in which all itemsets exist in the training dataset.
[1] The D-matrix is a distance matrix, one of the most commonly used means for visualising a SOM; it examines the differences between adjacent nodes. In a D-matrix, the median of the distances computed between each node and its neighboring nodes is determined.
Figure 5.1: Maps resulting from training a SOM with artificial datasets describing
associations. The red hexagons on the gray maps define the hits received from the input
patterns during training. Cluster formations are presented with the coloured maps.
Another characteristic that also affects the number of clusters on the maps is the size of the map. The map size is often determined heuristically, which normally results in a codebook that is considerably smaller than the original dataset. In other words, the original dataset, with its hidden associations, is compressed and coded in a distributed manner within the nodes of the map in such a way that its interpretation remains unknown.
Identification of Winners. One important characteristic of this ANN is that the identification of BMUs is simple. For example, in all figures, the size of the red hexagons represents the number of hits received during a training epoch. The set of BMUs (the nodes with non-zero hits) is highly important, because these nodes trigger the update of the map at each epoch, and also because it is to these nodes that the input patterns are allocated. Thus, it can be assumed that any node m_j which has been hit m_j.# times has a high relevance for the calculation of itemset support iff m_j.# = |V_j| ≥ σ (the minimum support threshold). The latter initially leads to the conclusion that
if a cluster Ci contains strong (that is very dense, very populated or very
frequent) BMUs then it can also be considered strong. Using the language
of ARM, it is to be expected that if a cluster Ci is found to be frequent on
the map, then some of its members, called winners, may also be frequent.
It is important to note that the latter is only partially correct because even
though a winner is found to be frequent, that fact does not ensure that all
the members (items) of the patterns in it would also be considered frequent.
As we initially classified a SOM as a representative of the projection model defined in (Gardner-Medwin and Barlow, 2001), the number of hits to a node can be interpreted as a property which defines the usage of that node in the process of counting input patterns during an epoch. Therefore, estimations regarding the occurrence frequency of the patterns can be calculated by using this node property.
Due to the way in which a SOM clusters data, it can be expected that some BMUs might share similarities among the elements of their patterns, even though they will never share the same patterns (n_i.S ∩ n_j.S = ∅ for all i ≠ j). Therefore, it can be assumed that obtaining the true support value of an itemset will involve collecting the corresponding values from the nodes which share the itemset. This can be viewed as a soft clustering procedure in which a data point may have multiple memberships, but the weighted memberships may sum to one.
Dependency Amongst Clusters. As a consequence of training, a set S of patterns will have been formed in each BMU. To use a SOM for FIM, the SOM has to be able to provide the support for any itemset possibly formed by the input patterns. From this perspective, the dependency amongst the binary patterns is relevant. This dependency of the patterns at each BMU can be made visible if they are re-arranged into a hierarchical data structure, which helps the FIM algorithm to count support for each pattern. This pattern dependency is important since it can be part of a method to calculate the true support of a k-itemset, determining, for instance, whether the k-itemset ⊆ m-itemset. To determine whether a pattern is either a parent or a child of another pattern, a bitwise operator can be adopted as follows: let A and B be two binary patterns and ∨ be a bitwise OR operator, such that the operation A ∨ B can give the following dependencies (a short sketch of this check is given after the list):
• A and B are related directly only if (A ∨ B) gives either A or B as a result: if (A ∨ B) = A, then A is considered the parent of B, meaning that B ⊆ A; otherwise, if (A ∨ B) = B, A is considered the child of B, meaning that A ⊆ B.

• A partial or zero dependency is defined when (A ∨ B) gives neither A nor B as a result, meaning that a possible dependency exists among the clusters on the map.
Following the idea described above, local data structures can be built at each BMU in order to visualise the way in which a SOM splits the input-data-space lattice across the nodes. With the definition of the pattern dependencies, we might build a more complex structure involving all the patterns found and organised by a SOM. This idea would lead to the building of a data structure, for instance a tree or trie, whose nodes would contain the pattern definition and its corresponding frequency, which would have to be computed on the fly. The latter resembles the operation performed by a traditional FIM algorithm, so there would not be a good reason to use a SOM for ARM.
At this stage, it can be concluded that even if the separation of the patterns could reduce the computational complexity of the counting, and important sectors of the input-data space could be identified on the map, this neural network is still dependent on the original dataset. In order to overcome this disadvantage, we could look at two alternatives. The first would involve adopting either the construction of a hierarchical data structure with the input patterns, or their direct use for FIM as proposed in (Shangming Yang, 2004). Nevertheless, both of these proposals would limit the use of a SOM to just the formation of clusters, which, as commented previously, would not be a sufficient justification to claim that association rules can be generated directly from knowledge derived from this neural network.
The second approach is more challenging and interesting, and involves taking this research a step further by investigating a method of decoding the knowledge of the map in order to form an interpretation that eliminates the dependency between the SOM and the training dataset for FIM. Such an interpretation would bring a positive benefit for the use of a SOM for ARM, because the SOM would fulfill the role of a large artificial memory which learns the associations defined in an environment, so that the support value of any itemset can be recalled satisfactorily any time this memory is queried. Therefore, in the next section, we will develop a knowledge extraction mechanism called PISM (Probabilistic Itemset-support eStimation Mechanism) to decode or interpret a trained SOM in order to reproduce the process of counting patterns or itemsets from its neurons.
5.2 A Probabilistic Itemset-support Estimation Mechanism
To achieve our aim, which corresponds to the development of a SOM-based memory for ARM, we first need to find out how and where the associations are represented within the SOM. It has already been stated that the representations of the input associations are all distributed among the nodes of the map, but it is necessary to determine which group of nodes in the map will contribute to the itemset-support estimations. In other words, it is indispensable to select the nodes which will serve as the source of knowledge for itemset-support estimations.
Therefore, after observing in our experiments that the set of BMUs is the
group responsible for the changes occurring in the map during training, and by
following the recommendation made in (Alhoniemi et al., 1999), which states
that the information allocated in the BMUs is often an attractive source of information for the development of many applications, we will concentrate on defining an extraction method for the BMUs formed in a training epoch. In order to
achieve the selection of such nodes from the trained map, two initial definitions
are stated as follows:
Definition 2 (Set of Winners) A set W, defining the final winners, is formed by a number m_b of nodes from the final map M such that

$$ W = \{M_i \mid M_i.\#S > 0 \ \text{or}\ M_i.S \neq \emptyset\}, \qquad m_b = |W| \leq |M|, $$

where S is the set whose elements are the patterns that hit node i.
Definition 3 (Winning Vector) A vector Z = {z_1, ..., z_n} will be called a final winning reference vector if its respective node forms part of the set W. In addition, z_ji will be understood as the ith component of the winning vector of node j, which has acquired some information about the ith component of the input patterns (x_i) that, due to their similarities, have been allocated by the training process to the node associated with M_j.
Once W = {w_1, w_2, ..., w_mb} has been identified in a converged map M by using Definition 2, the next stage requires the application of some concepts from probability theory to the reference vector of each node in the set W, in order to evaluate whether it is feasible to obtain our desired knowledge, the support of itemsets, from these vectors. It is relevant to note that it is not strictly necessary to extract the set W from the converged map, but it can be assumed that more accurate itemset-support estimations can be made if the latest state of the map is utilised.
To define a value Pr that holds the probability that a node, for instance m_j, has become a BMU m_c for some input patterns after completing a training epoch, it can first be assumed that becoming a winner in the process is equally likely among all of the nodes in the map; therefore, the probability of each element contained in W can be specified as follows:

Definition 4 (Prior Probability: Becoming a Winner) In SOM training, where N defines the number of input patterns contained in D, the probability that a node m_i has become m_c during a training epoch is defined by the ratio between the number of times that m_c has been hit by D and the N input occurrences in the training:

$$ \Pr(m_i \to m_c) = \frac{m_i.\#S}{N} \qquad (5.5) $$

where #S is understood as the number of data points which belong to the Voronoi set V_i delimited by the winner m_i.
The previous definition is the prior probability that a node has become a BMU. This Pr(i) appeared in (Alhoniemi et al., 1999), where Alhoniemi et al. gave a probabilistic interpretation of the response of the nodes to a new data sample using Bayes' theorem. Initially, the prior probability values of the nodes are all zero, and some will change after the conclusion of each training epoch. In this work, the values given by the converged map will be used.
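As a short sketch of Definition 4 (assuming a NumPy codebook and binary training data; all names are illustrative), the prior of each node is simply its share of the hits received in an epoch.

```python
import numpy as np

def bmu_priors(codebook, data):
    """Pr(m_i -> m_c) from Definition 4: hits received by node i divided by N."""
    # Euclidean nearest-neighbour search gives the BMU of every pattern.
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    hits = np.bincount(dists.argmin(axis=1), minlength=len(codebook))
    return hits / len(data)  # zero for nodes that never become winners
```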
Before continuing to define our extraction mechanism, we consider it important to explain how the reference vector Z of a BMU i holds the corresponding probabilities (frequencies) of each pattern component x_i over the total number of patterns allocated to that node. The explanation is as follows:
As mentioned above, the input patterns representing the associations to be learnt are binary. Nevertheless, once they are presented to the SOM, they are handled as m-dimensional real vectors, due to: i) the use of a Euclidean distance metric for the BMU search, rather than a Hamming distance, which would be more appropriate for binary vectors; and ii) the vectors resulting from applying Equation 5.1 being formed by real, not binary, numbers. Therefore, the input patterns can be defined as real vectors whose components are bistate, since they hold zeros and ones, and the tendency to have fewer zeros than ones, or vice versa, depends upon the hidden associations defined by them. In other words, each variable or component z_i, representing an item's behaviour, can be seen as following a bimodal distribution whose peaks are located at the values of one and zero.
This bistate property of each z_i is important for the purpose of this work, as a distribution can be created from these values. The different concentrations or densities of these two values form a density distribution whose mean µ defines a value containing the percentage of success events (z_i = 1) for this variable or item.
In ARM, this percentage of success of z_i over a total number of events N is what the support of z_i represents. In other words, the µ of each component measures the relative frequency of occurrence of that component. Some examples describing this idea are depicted in Figure 5.2, in which the curve of the density distribution in each bottom graph represents an estimation of the real distribution formed by the n transactions used in each example. For this case, the plotted distributions have been formed by using the concept of a Gaussian-kernel estimator (Silverman, 1986; Devroye, 1987).
It has been mentioned above that the update of the SOM, particularly Equation 5.1, can be interpreted as an aggregate of influences coming from different BMUs on the map. Hence, if a BMU is about to be updated, it can be derived from Equation 5.1 that the final value of its codeword is very close to the mean of all the patterns accumulated in it. The difference between the real mean and the value given by the SOM is due to receiving influences from other nodes, weighted by the distances among the nodes and the radius used for the update of the map during training. It can therefore be concluded that the values of the components defined in each BMU are estimations of the real means, and could be used for the calculation of the support instead of the original set of patterns.
Reviewing Equation 5.1, which is used for conventional batch training, and analysing the training procedure, it can be concluded that the updating of all the components z_i in the map (codebook) is realised independently. The final value held in component z_i is never affected by the update of component z_j at any point during training, and vice versa. Moreover, it has been stated in (Hugh, 1997) that whenever there is a kind of physical independence between events or processes, it shall be assumed that they have mathematical independence. Therefore, assuming that each component z_j of a final winning reference vector Z is independent of the others, and that each of these components {z_1, z_2, ..., z_n} defines a probability of appearance of its corresponding element {x_1, x_2, ..., x_n} in the input patterns, we can proceed to define that:
Figure 5.2: This figure illustrates the importance of the mean in the calculation of the support of an item from a trained SOM. Different numbers of transactions (n), composed of zeros and ones, have been used to form the bottom graphs. These graphs show that different concentrations (densities) of these bistate values captured in an item make the distribution curve tend towards the densest value (e.g., in the left graph the number of failures (z_i = 0) is greater than the number of successes, therefore the highest point of the distribution tends to be placed at 0).
Definition 5 (Independence Amongst Components of a Winner) Let a SOM-training process be defined as an experiment whose possible outcomes S (S is a sample space in probability terms) are defined by a countably infinite set of data points in R^n (n is the dimensionality of the input data) which are allocated in D. Let two random components z_i and z_k of the winning vector m_j be represented by two discrete random variables, A and B respectively, which describe the probability of their corresponding events of occurrence. Assume that these variables A and B have no influence on each other. Thus, the probability that they both occur together can be described by the joint probability, the product of the probabilities of these two individual variables, i.e.

$$ \Pr(A = z_i, B = z_k) = \Pr(A = z_i) \cdot \Pr(B = z_k) $$

If the case is to study the joint distribution of n random variables X_1, ..., X_n, which can conveniently be represented by a random vector X = (X_1, ..., X_n) describing values from z_1, ..., z_n, then the corresponding joint probability can be obtained by

$$ \Pr(X_1 = z_1, \ldots, X_n = z_n) = \prod_{i=1}^{n} \Pr(X_i = z_i) \qquad (5.6) $$
Definition 5 is possible due to the fact that all components have been declared independent. It has also been stated in (DeGroot, 1975) that n random variables X_1, ..., X_n have a discrete joint distribution if the random vector (X_1, ..., X_n) can take only a finite number, or an infinite sequence, of different possible values (x_1, ..., x_n) in R^n. Then, the joint probability function of X_1, ..., X_n is defined to be a function f such that, for any point (x_1, ..., x_n) ∈ R^n,

$$ f(x_1, \ldots, x_n) = \Pr(X_1 = z_1, \ldots, X_n = z_n) $$
So far, it has been specified that every node in the map has a real value attached defining its prior probability. This prior probability must be zero for all nodes which are not elements of the set W. Consequently, Equation 5.6 defines a joint probability describing the fact that some random variables happen together at each winner. Thus, the next step is to define the manner of calculating the total or final probability Pr(E|M) that some event E, representing k variables happening together, occurs in the structure defined by a resulting trained map M. To define the value Pr(E|M), the concept of a probability partition found in (DeGroot, 1975) can be utilised as follows:
Definition 6 (Partitioned Data Space) Let S denote a sample space (input data or pattern space) of some experiment (training) and consider k events A_1, ..., A_k which are disjoint (they do not share elements). Thus, these k events are said to form a partition of S. If the k events A_1, ..., A_k form a partition of S, and if B is any other event in S, then the events A_1 B, A_2 B, ..., A_k B will consequently form a partition of B, as illustrated in Figure 5.3. Hence, it is possible to write

$$ B = (A_1 B) \cup (A_2 B) \cup \cdots \cup (A_k B) $$

Moreover, since the k events (on the right side of the equation) are disjoint,

$$ \Pr(B) = \sum_{j=1}^{k} \Pr(A_j B) $$

Finally, it is known that if Pr(A_j) > 0 for j = 1, 2, ..., k, then Pr(A_j B) = Pr(A_j) Pr(B | A_j). Thus, it follows that

$$ \Pr(B) = \sum_{j=1}^{k} \Pr(A_j)\,\Pr(B \mid A_j) $$

where Pr(B | A_j) defines the conditional probability of the event B occurring in the partition defined by the event A_j.
Figure 5.3: The figure on the left depicts the intersections of an event B with events A_1, ..., A_5 of a partition over S. The figure on the right depicts the concept of Voronoi regions which can be formed on the SOM (the dots represent the codewords while the stars represent the data points assigned to each Voronoi region).
For the purpose of this work, it can then be stated that the k events A_1, ..., A_k are the result of the "best-matching" process during training, and that they represent the group of nodes over which the input data space is split, forming k different events (partitions) which are disjoint (no pattern is shared among them). Similarly, the event B can be seen as any event that is likely to occur, defined by some of the elements (items) of the vectors allocated at each node. In summary, the final probability of an event E can be defined as follows:
Definition 7 (Total Probability: The Frequency of Occurrence of a Pattern) Having a final map M after training a SOM with D containing discrete (binary) patterns, it is possible to obtain the corresponding probability of the event E, describing the fact that k components or items x_1, ..., x_k of the input vectors appear together in the training environment D, by calculating the sum of the partial probabilities of the event E (z_1, ..., z_k) in those neurons, defined by the set W, in which the knowledge of D has been distributed. This calculation is represented by

$$ \Pr(E \mid D) \approx \Pr(E \mid M) = \sum_{i=1}^{m_b} \Pr(W_i)\,\Pr(E \mid W_i) \qquad (5.7) $$

in which Pr(W_i) represents the prior probability of each BMU neuron and Pr(E | W_i) defines the probability of the event E in that neuron. This final probability Pr(E | M), whose calculation is depicted in Figure 5.4, can also be seen as the estimation of the frequency of occurrence of the event E in D. It is important to clarify that Pr(E | M) is an estimation of the real value Pr(E | D), which is normally obtained by counting the event or pattern E in the whole dataset D. An equation summarising this definition is as follows:
$$ \Pr(E \mid M) = \sum_{i=1}^{m_b} \Pr(W_i) \prod_{\substack{j=1 \\ z_j \in W_i}}^{k} \Pr(z_j) \qquad (5.8) $$
To conclude the description of this method, it is essential to use the correct terminology in order to describe the proposed method in the language of frequent itemset mining. Therefore, it must first be defined that a k-itemset is a k-multi-event E occurring within the components of map M. Each possible k-itemset has an associated support value (probability of appearance) in D that can be calculated from M as follows:
Figure 5.4: Representation of the Probabilistic Itemset-support Estimation Mechanism (PISM) proposed in this chapter.

Definition 8 (Estimation of Itemset Support from a SOM) Let M be a map resulting from training a SOM with N binary patterns grouped in an environment D representing transactions involving m items. An estimation supp̂(X) of the support of an itemset X, defining a pattern or event E formed with k items, which is equal to an estimation of the frequency of occurrence of such a pattern in the training environment, can be calculated from the embedded and distributed knowledge in M by summing up the probabilities registered by E in the BMUs, defined by the set W for such a D, as follows:

$$ \widehat{\mathrm{supp}}(X) = \Pr(X \mid M) = \sum_{i=1}^{m_b} \Pr(W_i) \prod_{\substack{j=1 \\ z_j \in W_i}}^{k} \Pr(z_j) \qquad (5.9) $$
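A minimal sketch of PISM as summarised by Equation 5.9 (assuming NumPy arrays; the winning vectors and their priors would come from Definitions 2–4, and all names here are illustrative assumptions):

```python
import numpy as np

def pism_support(itemset, winners, priors):
    """Estimate supp^(X) = Pr(X | M) from the winning reference vectors.
    `itemset` lists the item (column) indices of X, `winners` holds one
    winning reference vector Z per BMU (components in [0, 1]), and `priors`
    holds the prior probabilities Pr(W_i) of Definition 4."""
    estimate = 0.0
    for prior, z in zip(priors, winners):
        # Joint probability of the k items at this winner, assuming the
        # components are independent (Definition 5).
        estimate += prior * np.prod(z[itemset])
    return estimate  # expressed as a fraction of the N training patterns

# Hypothetical usage: two winners over four items, queried for the itemset {0, 2}.
winners = np.array([[0.9, 0.1, 0.8, 0.2],
                    [0.2, 0.7, 0.3, 0.6]])
priors = np.array([0.6, 0.4])
print(pism_support([0, 2], winners, priors))  # 0.6*0.72 + 0.4*0.06 = 0.456
```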
5.3 Experiments and Results
In order to verify the accuracy of the knowledge extraction method described
above, which at this stage is only concerned with the estimation of the support
metric for any possible itemset derived from the patterns occurring in a training
dataset or environment, some datasets described in Tables 5.1 and 5.2 have been
used to test it. In all the experiments presented here, the SOM-training method is
batch, the radius is set to one, and the neighborhood function is Gaussian unless
otherwise stated.
Dataset     Transactions (m)   Items (n)
Chess       3196               75
Connect     67557              129 (Avg. 43)
Mushroom    8124               119 (Avg. 23)

Table 5.2: List of real-life binary training datasets used in the testing of PISM for SOM. They have been used in FIM-algorithm benchmarks (Jr. et al., 2004; Goethals and Zaki, 2003).
The experiments have been carried out in two stages, involving artificial and
real-life datasets. Here, the focus lies on testing how accurate the results given by
a SOM via our proposal are when this neural network is being queried to provide
the support value of some group of itemsets. A list with all the queries involved
in our experiments is presented in Table 5.3, which also defines the way in which
each trained map will be queried. To be congruent with the previous chapter, the
groups of itemsets for testing our extraction method on the trained SOMs have
also been formed by mining the artificial and real-life datasets with some constraints, concerning the support σ and/or the size k of the tested itemsets, through
the use of the Apriori implementation developed by Borgelt (Borgelt, 2003).
As in the previous chapter, in order to corroborate how close the estimations made by our proposal are to the real values, the itemset support given by the Apriori algorithm, as implemented by Borgelt, will likewise be used for comparison.
Query             Description                          Applied to Maps Trained With
All               All itemsets                         All artificial datasets
1Itemsets         All 1-itemsets                       All real-life datasets
2Itemsets         All 2-itemsets                       All real-life datasets
3Itemsets         All 3-itemsets                       All real-life datasets
4Itemsets         All 4-itemsets                       Chess dataset
1to3Itemsets      All k-itemsets where 1 ≤ k ≤ 3       All real-life datasets
45to100Itemsets   All k-itemsets where 45 ≤ σ ≤ 100    Chess, Mushroom datasets
75to100Itemsets   All k-itemsets where 75 ≤ σ ≤ 100    Connect dataset
90to100Itemsets   All k-itemsets where 90 ≤ σ ≤ 100    Chess dataset

Table 5.3: List of queries used to form the groups of itemsets used for the testing of the itemset-support estimations from SOMs via PISM. In this case, k means the size of the itemsets and σ refers to the support used to form such itemset groups.
Each experiment begins with the training of a SOM M with a dataset or environment D. Then, a group of itemsets L, along with their real support property, representing the itemsets satisfying some constraints C, is generated by mining D with the Apriori algorithm. The itemsets existing in L are then used as stimuli to make M recall their support or frequency-of-occurrence values, via our method, from the knowledge embedded in its group of BMUs. Once the real and estimated itemset-support values are collected, a generalisation error is calculated. To evaluate the effectiveness of a SOM in recalling itemset support, the RMS error, defined by Equation 5.10, has been applied.
$$ E_{RMS} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left\{ y(x_n; m^*) - t_n \right\}^2 } \qquad (5.10) $$
In this particular case, y stands for the method proposed here for calculating itemset support from a trained map m*, which is queried to recall the support of an itemset x. The value y is then compared against t, which holds the support of x generated by the Apriori algorithm over the original dataset. N stands for the number of itemsets contained in the requested itemset group. For example, N is equal to 75 if a SOM trained with the Chess dataset is queried to recall support values for the group of 1-itemsets defined for that dataset.
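A small sketch of this error calculation (Equation 5.10), assuming the two support lists are aligned itemset by itemset and that all values are hypothetical:

```python
import numpy as np

def rms_error(estimated, real):
    """RMS error between SOM/PISM support estimations and Apriori values."""
    estimated = np.asarray(estimated, dtype=float)
    real = np.asarray(real, dtype=float)
    return np.sqrt(np.mean((estimated - real) ** 2))

# Hypothetical support values (in %) for a small group of queried itemsets.
print(rms_error([61.2, 33.9, 12.5], [60.0, 35.0, 12.0]))
```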
In some of the graphs shown below, the itemsets for which the SOM is queried are arranged along the x-axis. Each itemset is plotted according to its order of appearance (from left to right, from bottom to top) in the lattice structure representing the complete data space formed by the k items. An example of a lattice showing this arrangement can be seen in Figure 2.1 in Chapter 2.
In the case of the artificial datasets, the maps are queried for the support values of all the possible itemsets (combinations) that can be formed by the n items or attributes of the dataset. This is to determine whether our events E, representing patterns or associations in the environment, whose final probability or frequency of occurrence is defined by the probabilities of the individual random variables representing the items, are independent or only pairwise independent. This concern is important because, if it were found that the components of the events are only pairwise independent, then the estimations made by this method for events involving the joint probability of more than 3 variables or items could present large differences from the real frequency counts. Therefore, it could not be claimed that a SOM works as an itemset-support memory for ARM.

In Figure 5.5, the results for the artificial datasets, obtained from applying PISM, in particular Equation 5.9, to make the SOM recall itemset support, are shown.
Figure 5.5: Results for the support values of 15 itemsets (top graph), 255 itemsets (centre graph) and 65535 itemsets (bottom graph), obtained after using PISM in order to satisfy the query -All- on the maps trained with the datasets Bin4, Bin8 and Bin16 respectively. For reference, the values corresponding to the same queries using an Apriori implementation are also plotted.
To provide an indication of the behavior of a SOM and how accurate the results can be during training, some temporal (intermediate) stages of a SOM before it converges are plotted in Figure 5.6. It should be noted from these two graphs that, even though the mappings derived from SOM training represent the same input environment, this does not imply that the same results will be generated by them. The variation in the results is the consequence of the map initialisation, which, for this work, is done randomly. To overcome this unstable situation, the map could instead have been initialised linearly. It is relevant to point out from Figure 5.6 that, from the very first training iteration (epoch), both SOMs are able to provide a good estimation of the real itemset support.
Figure 5.6: Intermediate results (the support values of 15 itemsets) generated from using PISM for the query -All- on the map being trained with the dataset Bin4x100. In both cases, the SOM needs five epochs to converge, but after the first epoch good estimations can already be formed for the support of itemsets. The small difference in performance between these two exercises is due to the type of initialisation chosen.
To give a taste of the quality of the estimations given by a SOM via the proposed method for some of the query cases for real-life datasets, the graphs in Figures 5.7, 5.8, and 5.9 are depicted. These resulting mappings of the trained SOMs, compared to the Apriori results, have, at first glance, a better accuracy than the artificial ones. An explanation for this improvement in the itemset-support mapping is that in the real-life datasets the distribution of the patterns is unbalanced; that is, finding only a single repetition of each possible different pattern (itemset) of the distribution is only remotely likely. The latter is a phenomenon occurring in the artificial datasets used in this experimentation.
Figure 5.7: Results (plots on the right) obtained after using PISM in order to satisfy the queries -1Itemset : 4Itemset- on the map trained with the dataset Chess. For reference, the values corresponding to the same queries (plots on the left) against the dataset Chess using an Apriori implementation are also plotted.
In order to assess the capability of a SOM to generalise the recall of the support of any itemset, a series of three experiments has initially been conducted with the Chess, Mushroom and Connect datasets for the groups of itemsets plotted in Figures 5.7, 5.8 and 5.9.
Figure 5.8: Results (plots on the right) obtained after using PISM in order to satisfy the queries -1Itemset : 3Itemset- on the map trained with the dataset Mushroom. For reference, the values corresponding to the same queries (plots on the left) against the dataset Mushroom using an Apriori implementation are also plotted.
Figure 5.9: Results (plots on the right) obtained after using PISM in order to satisfy the queries -1Itemset : 3Itemset- on the map trained with the dataset Connect. For reference, the values corresponding to the same queries (plots on the left) against the dataset Connect using an Apriori implementation are also plotted.
In these experiments, we are interested in evaluating the effectiveness of our method, applied to a SOM with random initialisation, in recalling itemset support for itemset groups whose members are formed by the same number of items. The corresponding generalisation errors for these experiments are summarised in Tables 5.4, 5.5, and 5.6. These numbers have been obtained by applying Equation 5.10 to the results given by a SOM with our method and to those given by the original dataset with the Apriori algorithm.
Experiment  Itemset Group   Rectangular (radius 1 | 0.5 | 0.001)     Hexagonal (radius 1 | 0.5 | 0.001)
1           1-itemsets      0.13781   0.14819   0.26277              0.15412   0.18899   0.15363
1           2-itemsets      1.3735    0.96662   0.8871               1.3771    1.0398    0.80162
1           3-itemsets      1.4338    1.0097    0.92083              1.4415    1.0841    0.84045
2           1-itemsets      0.27685   0.15956   0.16423              0.15149   0.12718   0.1863
2           2-itemsets      1.426     0.95871   0.75058              1.3789    0.99795   0.82116
2           3-itemsets      1.4803    1.0012    0.78645              1.4431    1.0454    0.85914
3           1-itemsets      0.21732   0.14176   0.14409              0.15006   0.16442   0.15158
3           2-itemsets      1.3974    0.99191   0.79192              1.3699    1.0199    0.78599
3           3-itemsets      1.4593    1.037     0.83168              1.4314    1.0659    0.82551

Table 5.4: Generalisation errors produced by a trained SOM through PISM for the queries 1Itemsets, 2Itemsets and 3Itemsets derived from the Chess dataset.
Experiment  Itemset Group   Rectangular (radius 1 | 0.5 | 0.001)     Hexagonal (radius 1 | 0.5 | 0.001)
1           1-itemsets      0.14403   0.14272   0.16962              0.14957   0.11507   0.13477
1           2-itemsets      0.42695   0.31001   0.32908              0.46244   0.28882   0.3464
1           3-itemsets      0.2484    0.18097   0.193                0.27029   0.17016   0.2034
2           1-itemsets      0.19336   0.095607  0.15078              0.1531    0.10954   0.15362
2           2-itemsets      0.44692   0.28158   0.30919              0.45812   0.29399   0.32803
2           3-itemsets      0.25762   0.16553   0.18251              0.26885   0.1724    0.1929
3           1-itemsets      0.21152   0.14741   0.18457              0.16683   0.11674   0.12932
3           2-itemsets      0.4413    0.29438   0.34385              0.4298    0.30569   0.31145
3           3-itemsets      0.25409   0.17267   0.20083              0.25081   0.18004   0.18481

Table 5.5: Generalisation errors produced by a trained SOM through PISM for the queries 1Itemsets, 2Itemsets and 3Itemsets derived from the Mushroom dataset.
Experiment  Itemset Group   Rectangular (radius 1 | 0.5 | 0.001)     Hexagonal (radius 1 | 0.5 | 0.001)
1           1-itemsets      0.35658   0.49537   0.78376              0.45956   0.36677   0.74751
1           2-itemsets      0.76868   0.68955   0.77088              0.80169   0.65929   0.75174
1           3-itemsets      0.6471    0.55741   0.57361              0.66536   0.5489    0.56305
2           1-itemsets      0.44143   0.38671   0.83169              0.38639   0.28075   0.82694
2           2-itemsets      0.80008   0.65154   0.80109              0.7705    0.63333   0.79409
2           3-itemsets      0.66575   0.53933   0.59207              0.64628   0.53553   0.58539
3           1-itemsets      0.39652   0.33373   0.8609               0.33847   0.78373   0.6665
3           2-itemsets      0.76256   0.63865   0.84214              0.76385   0.83437   0.71637
3           3-itemsets      0.63775   0.53422   0.6246               0.64553   0.6366    0.54759

Table 5.6: Generalisation errors produced by a trained SOM through PISM for the queries 1Itemsets, 2Itemsets and 3Itemsets derived from the Connect dataset.
It is important to note that errors for three different values of the radius and for two different map layouts have been calculated. The radius has been varied to determine the best parametric state of a SOM for FIM; this tuning is done since this parameter works as the regulator of the BMU influences exchanged during training, and its value can affect the accuracy of the final mapping. The way in which the nodes are distributed in the SOM, hexagonal or rectangular, has also been tested, since we have assumed that different organisations of neurons can also influence the final estimated itemset-support values.

Because the six maps of each experiment were initialised randomly, it can be stated that there is no relationship amongst their results. Nevertheless, the three values representing the errors for the three groups of itemsets are related, because they have been calculated from the same trained map in each experiment. By using such relationships, a tendency in the resulting errors for the tested groups of each dataset can be stated as follows:
• In the case of the Chess dataset (Table 5.4), the error tends to increase as k, which defines the size of the itemsets, increases, without exception for any of the radius values employed in the experiments. No clear tendency is present regarding which map layout provides the best results, so the layout can be stated to be irrelevant for this dataset in the estimation of itemset support.
• For the Mushroom dataset (Table 5.5), the error tends to form a distribution with a mode at the 2-itemsets, regardless of the type of radius. As in the previous case, no clear tendency for one of the two map layouts to give more accurate results is present either.
• In the last case, regarding the results for the Connect dataset (Table 5.6), the error also tends to form a distribution with a mode at the 2-itemsets, similarly to the previous case, but no defined order between the extreme error values is presented.
Since it was not possible to establish which radius is best for the estimation of itemset support with the results given above, the next experiments involve initialising all the maps, which utilise different radii, in the same manner. To produce such an initialisation of the maps, a linear initialisation is applied, because in this way the weight vectors are initialised along the linear subspace spanned by the two principal eigenvectors of the input dataset (Kohonen, 1996). Only one experiment is shown for each group of itemsets in Table 5.7, because the same results will always be obtained in all cases. That is, a map trained in batch with a linear initialisation will always converge to an identical final state, because its initial state is always the same.
Unlike our previous experiments, in this case the results maintain a relationship across the different radii and itemset sizes. Therefore, in general terms, it can be stated that as the radius used for training decreases, the generalisation error also decreases.
Dataset     Itemset Group   Rectangular (radius 1 | 0.5 | 0.001)     Hexagonal (radius 1 | 0.5 | 0.001)
Chess       1-itemsets      0.2074    0.13494   0.15624              0.26896   0.15379   0.15624
Chess       2-itemsets      1.3238    0.95034   0.5906               1.3455    1.0044    0.5906
Chess       3-itemsets      1.3837    0.99443   0.61271              1.4024    1.0498    0.61271
Mushroom    1-itemsets      0.24362   0.28344   0.15767              0.24715   0.29623   0.15767
Mushroom    2-itemsets      0.41104   0.35578   0.34597              0.42121   0.36697   0.34597
Mushroom    3-itemsets      0.23652   0.20278   0.20182              0.24215   0.20841   0.20182
Connect     1-itemsets      0.8835    0.67506   0.74965              0.87661   0.7407    0.74965
Connect     2-itemsets      0.95011   0.75157   0.66644              0.95628   0.79004   0.66644
Connect     3-itemsets      0.72913   0.58338   0.47813              0.73671   0.60596   0.47813

Table 5.7: Generalisation errors for the trained SOMs with linear initialisation.
It can also be noted that the error values for the different itemset groups, given by either the rectangular or the hexagonal maps with a radius set to 0.001, are respectively all the same. In the cases in which the radius is greater than 0.001, it is observed that the maps with a rectangular layout give better estimations than their hexagonal counterparts. In summary, based on the results in Table 5.7, some preliminary conclusions can be made as follows:

• It can preliminarily be determined that the value of the error tends to grow as k, the number of components (items) in the itemsets, grows as well.

• It can also be concluded that the generalisation error has a direct relationship with the amount of influence defined by the radius used for training.

• The best error values come from the maps whose radius parameter has been reduced to 0.001. Therefore, it can be assumed that performing no exchange of BMU influence, i.e., setting the radius to zero, which reduces the SOM to a Vector Quantization algorithm, could generate even better estimations for itemset support.
In order to find out whether the above conclusions hold for more realistic mining activities, in which support for groups of k-itemsets of different sizes is needed, some trained SOMs have also been tested on the queries defined in Table 5.3 that constrain by the support property of the itemsets in the dataset. To visualise the SOM behavior in this type of mining exercise, we have plotted in Figure 5.10 an example of the SOM estimations against the real support values in order to show their discrepancies. The complete results for these new experiments, involving different ranges of varied groups of k-itemsets, are summarised in Table 5.8.
[Figure 5.10 comprises five scatter panels (for the 3- to 7-itemset groups) plotting real against estimated itemset support for the hexagonal and rectangular SOMs, with the Apriori values as reference.]

Figure 5.10: Distribution of the itemset-support estimations made by the SOM via our method for the query -90to100Itemsets- on the Chess dataset. The corresponding errors are summarised in Table 5.8.
Dataset    Query            k    Hexagonal  Rectangular  Itemsets per group
Chess      90to100Itemsets  3    0.42918    0.3668       167
                            4    0.52008    0.4352       203
                            5    0.61963    0.50871      128
                            6    0.76948    0.62251      39
                            7    1.0274     0.81303      4
Chess      45to100Itemsets  3    0.92081    0.89465      4985
                            4    1.0748     1.0476       25500
                            5    1.2055     1.1724       88170
                            6    1.327      1.2829       217705
                            7    1.4407     1.3794       397947
                            8    1.5395     1.4548       550220
                            9    1.6126     1.4983       581647
                            10   1.6496     1.4996       471908
                            11   1.648      1.4559       293209
                            12   1.6202     1.3801       138294
                            13   1.5947     1.3037       48473
                            14   1.6045     1.2653       12023
                            15   1.6755     1.2926       1896
                            16   1.8418     1.4084       152
                            17   2.0514     1.5868       5
Mushroom   45to100Itemsets  3    0.47635    0.44559      100
                            4    0.54856    0.49902      91
                            5    0.62168    0.49366      37
                            6    0.71689    0.51841      6
Connect    75to100Itemsets  3    1.6103     1.6109       2757
                            4    1.6982     1.6883       13218
                            5    1.7487     1.7322       44721
                            6    1.7817     1.7626       111159
                            7    1.8063     1.7886       208126
                            8    1.8296     1.8168       297836
                            9    1.858      1.8532       327797
                            10   1.8969     1.9027       277133
                            11   1.9503     1.9686       178389
                            12   2.0217     2.0538       85839
                            13   2.1142     2.1612       29903
                            14   2.23       2.293        7135
                            15   2.3675     2.4485       1052
                            16   2.5059     2.6096       76
                            17   2.24       2.4146       1

Table 5.8: Generalised errors for trained SOMs with linear initialisation on different ranges of groups of k-itemsets.
Roughly speaking, it has been noted again that the rectangular maps produce better estimations than the hexagonal ones. Nevertheless, the differences between them are not considerable. The tendency of the error to grow as k increases is also present; however, in some cases it tends to stay steady or even to decrease, as shown in Figure 5.11.
[Figure 5.11 contains four panels showing the generalisation error per itemset group for the Chess dataset (queries 90to100Itemsets and 45to100Itemsets), the Mushroom dataset (45to100Itemsets) and the Connect dataset (75to100Itemsets), for hexagonal and rectangular SOMs.]

Figure 5.11: Generalisation errors for the results given in Table 5.8. The x-axis represents the different itemset groups, while the y-axis shows the calculated error.
Reviewing the numbers in the error tables, it can be determined that, even though the generalisation errors look relatively small for the experiments, some considerable differences between the estimated and real support values can be found for some itemsets. Although these differences tend to diminish during training, their final values could cause misleading support values during the detection of frequent itemsets in an ARM exercise, so it is important to minimise such misleading effects.
As an explanation for these discrepancies in the support results, two options may be considered: a) the type of metric used to organise the patterns on the map, or b) the assumption that the items (events) of the patterns are independent at each BMU. Moreover, both reasons could be directly influenced by the number of neurons contained in the map, since this characteristic of the map limits the node space available for any new incoming pattern. In other words, the larger the number of nodes available in the map, the better the distribution of the patterns along the map, and therefore the better the resolution of the outcome. The latter relates to the concept of emergent feature maps, whose structure is defined by a large number of neurons, introduced by Ultsch (Ultsch, 1999), which has been shown to have potential for data-mining tasks (e.g., classification and clustering). Emergence occurs in natural as well as artificial systems, and refers to the ability of a system to produce a phenomenon on a new, higher level as a result of the cooperation of many elementary processes (Ultsch, 1999). That is, new data features can emerge from structures formed by the cooperation of a large number of neurons. This is a characteristic that is not present in traditional SOMs, in which the number of neurons is controlled and limited (approximately equal to the number of clusters), so they cannot show emergence. Therefore, a series of experiments with the Chess dataset has been set up to evaluate whether the concept defined above could improve the response of a SOM for FIM. In these experiments, different sizes of SOM have been used, and the results are shown in Figures 5.12 and 5.13.
The sizes of the map involved in these experiments range from two nodes to three times the map size H, which defines the number of neurons in the map needed for mapping the n input patterns contained in the training dataset.
Figure 5.12: Distribution of the itemset-support generalisation error made by the SOM via our method for the queries 1Itemsets, 2Itemsets and 3Itemsets for the Chess dataset, when the size of the map, representing an itemset-support memory for ARM, increases.
H is normally calculated heuristically (Vesanto et al., 1999) by Equation 5.11. In this case, it has resulted in 289 neurons for the 3196 patterns existing in the Chess dataset.

$$ H = 5\sqrt{n} \qquad (5.11) $$
As first suspected, and then corroborated by the results shown in Figures 5.12 and 5.13, the error for the recall of the support of k-itemsets (where k is greater than 1) tends to decrease as the number of nodes in the map increases. The improvement is due to the fact that the input patterns develop a better organisation on the map, which also breaks down the dependency property among the items of the input patterns. That is, patterns which are truly related remain close on the map, but they do not share the same node.
Figure 5.13: Distribution of the itemset-support generalisation error made by the SOM via our method for the query 45to100Itemsets for the Chess dataset, when the size of the map, representing an itemset-support memory for ARM, increases.
The error for the group of 1-itemsets, however, does not follow the behaviour explained above. In this particular case, it starts with a value very close to zero, since there are just two nodes which accumulate all the input patterns, so the values representing the calculated support are placed close to their corresponding means because there are not many BMU influences to combine. The contrary effect occurs when the map size is increased: the error tends to increase, due to the fact that some BMU influences are present in the updating of the map. Nevertheless, the error stays steady and low throughout the experiments.
In a more realistic scenario, in which the support values for different itemset groups are extracted from a SOM, for instance the query 45to100Itemsets over the Chess dataset, it is evident that the error in the estimations tends to decrease as H increases in the experiments. This is due to the better distribution of the patterns among the nodes; therefore, it can be concluded that the use of the concept of emergent feature maps is advantageous for the quality of the itemset-support estimations made by this SOM-based memory for ARM.
5.4 Conclusions
In the discovery of association rules from datasets, the task of FIM has to be performed first, since it provides the raw material to form the possible rules. To measure which itemsets from the database are interesting for the data mining task, the support of the itemsets has to be calculated. Motivated by the results obtained from the knowledge formed within an auto-associative memory in the previous chapter, we focus here on exploring an unsupervised neural network. In particular, we investigate the use of a SOM for ARM. In this work, the exploration of a novel application for the SOM, which refers to using its codebook for the estimation of itemset support, has been undertaken.
Unlike the other proposals (Changchien and Lu, 2001; Shangming Yang, 2004) which form the literature of this topic, in this work the suitability of estimating itemset support from the weight matrix of this neural network has been investigated. In particular, the calculation of the support has been proposed through an extraction mechanism called PISM (Probabilistic Itemset-support eStimation Mechanism), which uses only the winners formed in the final map to form itemset-support estimations. Thus, the input dataset can be discarded after training, since the winners have gathered enough information to map the associations occurring in the multi-dimensional input patterns.
To validate the suitability of a SOM for this type of data mining task, the results of some experiments have also been compared against an implementation of the Apriori algorithm. Numeric comparisons have been made by using an error metric between the estimates extracted from a SOM and the values computed by a traditional FIM algorithm. The results of the experiments have shown that the suitability of a SOM for ARM is realistic, in particular if the concept of the emergent SOM is utilised.
In summary, we have satisfactorily tackled the problem of how to reproduce the counting of patterns, occurring in the high-dimensional space of an environment, with the learnt information or knowledge embedded in the two-dimensional space formed by a trained SOM. Therefore, it can be concluded that a SOM is a good candidate for our desired memory, due to its ability to learn information about the frequency of occurrence of patterns hidden in training associations.
Chapter 6
Incremental Training for
Incremental ARM: A SOM Model
Because itemset support is a very important metric for the generation of association rules, an extraction mechanism for SOMs, which decodes the formed
codebook, has been proposed to estimate itemset support in the previous chapter.
As a result of our proposal, the generation of association rules directly from the
knowledge of a trained SOM has become feasible; therefore, unlike other related
proposals (Shangming Yang, 2004; Changchien and Lu, 2001), we have stated
that the training data is no longer needed for ARM.
Nevertheless, a problem affecting the validity of the itemset-support knowledge embedded in the trained SOM emerges as soon as the database1 used for its training starts acquiring new transactions. That is, the itemset information in the SOM is no longer valid to describe the current state of the database. To tackle this new problem, which results from the dynamics of the data, incremental-training
1 Throughout this chapter, the term database will be used instead of the term dataset to refer to the data source, because we start from the fact that the data have been recorded and stored in a model beforehand. Moreover, we believe the use of database is more appropriate for the problem being tackled, since dataset is often used in the literature to describe a data file with a finite number of samples (transactions, patterns).
mechanisms can be used; nevertheless, their definitions have been based on non-batch procedures.
Because the concept of batch training is important for itemset-support estimation and recall from a trained SOM, we undertake in this chapter the task of developing an incremental batch training mechanism for SOMs called Bincremental-SOM. In particular, we propose how the itemset knowledge embedded in the map should be updated to keep it valid while the original environment changes periodically.
6.1 Introduction
Previously, in Chapter 5, we stated that itemset support can be estimated from a trained SOM, especially from its BMUs. These estimations are possible because the codebook is decoded probabilistically. As the codebook stores information about itemset support, we have considered the SOM to be an artificial memory from which the support of an itemset X can be recalled whenever it is needed. Therefore, we have proposed that the SOM can take the role of an itemset-support provider, instead of the original database, in the ARM framework. Additionally, the resulting trained SOM can also be seen as an abstract descriptive model Msom formed from the associations occurring in the training database D.
In the real world, data are rarely static, especially data used for analysis, which typically change their state as a result of events occurring in their environment. Such environments are called non-stationary, since new states appear as a result of changes over time. For instance, new shopping patterns may emerge in the transactions of a database as a result of the introduction or promotion of new products in the market.
Because of the changes or dynamics present in a database, any knowledge derived from it, whether it produces predictive or descriptive models, tends to lose accuracy in representing the current state of the underlying database.
Ignoring this inconvenient but realistic data characteristic in the development of data-mining approaches becomes a serious problem for data analysis. In particular, misleading and wrong decisions can be made by end users when the latest state of the data is not represented in their models. For example, a mortgage can be granted to a bad-credit client simply because the model in current use does not consider the most up-to-date tendencies in the market, and unprofitable selling policies can be applied to some items in a supermarket if the rules used do not capture the actual tendencies of customer shopping behaviour. Therefore, it has become important to develop algorithms which can cope with this undesirable data property in order to keep the models alive, updated and valid for as long as reasonably possible.
As part of the DM toolbox, any SOM-based approach, such as our itemset-support memory, suffers from the problem stated above because it is well known that a trained SOM only learns the state presented in the input database at the moment of its training. The simplest way of making a SOM learn a new state of D is to retrain the SOM with the latest information contained in D (history and changes together). Nevertheless, this solution is neither practical nor optimal because it requires keeping the entire D throughout time to perform the training and, consequently, no advantage can be taken from the past knowledge learnt by its nodes.
In the ANN world, this problem has been tackled by performing incremental training of neural networks, which allows them to update their knowledge with the current information of their environment without losing or forgetting the information already embedded in their weight matrix. When a neural network learns incrementally, it can also be said to address the elasticity-plasticity dilemma, since it adapts its internal structure to capture the essence of the new incoming inputs without forgetting past experiences.
In the past years, proposals related to SOM technology have appeared to tackle the lack of training mechanisms for non-stationary environments. For instance, GSOM (Alahakoon et al., 2000a) focuses on developing a self-organising and growing structure for continuous learning. This variant of the SOM is trained under a sequential mechanism in which the structure grows while it learns, using heuristics that determine whether the deletion or insertion of nodes into the structure is appropriate.
Since our aim is to update the model Msom, representing an artificial memory full of information about itemset support, and bearing in mind that the proposed itemset-support extraction mechanism assumes that Msom has been formed by batch training, we can restate the underlying problem as involving not only updating the map under non-stationary conditions but also doing so with a batch-incremental training. According to Hung et al (Hung S., 2004), the limitation of learning models in a non-stationary environment has been addressed by introducing the concept of non-batch learning. This concept involves techniques such as online learning, lifelong learning, incremental learning and knowledge transfer, which have all pointed out the limitation of employing batch learning for tasks concerning non-stationary environments. The main criticism of the batch methods refers to the need to keep the entire database for training, which is impractical for these environments.
Therefore, we focus here on defining a batch training mechanism for the SOM suitable for non-stationary environments. This is necessary because our interest lies in gaining some insight into the problem of incremental association rule mining, in which the maintenance of the frequent itemsets and their support, according to Cheung et al (Cheung et al., 1996b), is essential to keep the association rules up to date. In other words, the work proposed below should be considered as a step needed to perform incremental ARM from a neural-network standpoint.
The key to the maintenance of itemset support will be the exploitation of the incremental learning properties of neural networks. The idea is to study whether the dynamics derived from the presence of data chunks (defining inserts) from D can be incorporated incrementally into the model Msom(t), formed at time t, in order to produce Msom(t + k), which defines the latest state of D. Our approach must be interpreted as an incremental learning task for SOMs which makes the map retain past knowledge while the latest state of the itemset support in the environment is being learnt.
6.2 Batch SOM for Non-stationary Environments
Practically speaking, a database itself represents a non-stationary environment formed by different phases throughout time. These phases, defining groups of transactions or data chunks of varying size, can occur at different rates, which adds further complexity to the incremental problem. Nevertheless, we will assume here that between the different phases forming the environment (database) there is always a window of time during which the data can be buffered and subsequently presented to the SOM for learning.
6.2.1 The Problem Definition
Let D be a database and t a metric to measure time. Thus, a data distribution that represents D at time t can be expressed by D(t). Let D+ be a group of new data points, representing itemsets, that makes D pass from state t to t+x, such that D(t+x) = D(t) ∪ D+. Let Msom be a SOM-based model resulting from training a SOM with input data. Thereby, Msom(D(t)) defines a SOM describing some concepts, for instance, the topology, the distribution or the associativity of the attributes or items of D(t).
The problem is then to produce a model Msom(D(t + x)) without using traditional procedures, because they require the presence of the entire dataset D(t+x) for a solution. Instead, it is necessary to do it in batch, producing an approximation model M'som(D(t + x)) that is satisfactory in comparison to the traditional final model Msom(D(t + x)), so that M'som(D(t + x)) ≈ Msom(D(t + x)).
The main characteristic of the desired mechanism is that the old data chunks D(i), for all i < j, where j defines the index of the latest data chunk in the environment, will not be needed for generating such an approximation. However, the use of some knowledge K from Msom(D(j − 1)) can be considered.
In terms of the above definition, the algorithm we aim for can be defined as
follows:
$$M'_{som}(D(t)) = \gamma\big(\,K[M_{som}(D(t-1))]\,,\;D^{+}\big) \approx M_{som}(D(t)) \qquad (6.1)$$
where γ() is the target algorithm whose input parameters, K[Msom(D(t−1))] and D+, define respectively some knowledge of the SOM generated at phase (t−1) and the new group of transactions of D.
6.2.2 Interpretation by Node Influences of the Batch Training
SOM training can typically be realised under two modes: sequential or batch,
whose usage depends exclusively on the characteristics of the task to be tackled.
The differences between them are basically the manner in which they perform
the update of the map m. Overall, both modes always look for the best group of
reference vectors (weight matrix) which can map and quantize the distribution
and information of the training data accurately. The steps needed for doing the
SOM training in batch are summarized in detail in Figure 6.1.
Figure 6.1: Algorithm SOM training in batch.
In the search task for BMUs, a distance metric, for instance the Euclidean
distance, is needed to perform the comparison between the vectors of the map
and the inputs (step 4). A new state in the map at the epoch i will result from
using Equation 6.2 (step 6). A final map, representing the information of D, will
be created after the total number of epochs has been reached.
$$m_i(t+1) = \frac{\sum_j h_{ji}(t)\, S_j(t)}{\sum_j n_{V_j}(t)\, h_{ji}(t)} \qquad (6.2)$$
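To make the batch update concrete, the sketch below is a minimal NumPy illustration of Equation 6.2 and not the implementation used in this thesis; it assumes a codebook stored as an (n_nodes × dim) array, a precomputed matrix grid_dist2 of squared distances between node positions on the grid, and a Gaussian neighbourhood function.

import numpy as np

def batch_som_epoch(codebook, grid_dist2, data, sigma):
    # One batch epoch of Equation 6.2: find the BMU of every input (Euclidean
    # distance), accumulate the per-node sums S_j(t) and counts n_Vj(t), and
    # move every node to a neighbourhood-weighted average of those sums.
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    bmu = dists.argmin(axis=1)

    S = np.zeros_like(codebook)          # S_j(t): sum of the patterns hitting node j
    n_V = np.zeros(len(codebook))        # n_Vj(t): size of the Voronoi region of node j
    np.add.at(S, bmu, data)
    np.add.at(n_V, bmu, 1.0)

    h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))   # neighbourhood h_ji(t)
    return (h.T @ S) / (h.T @ n_V)[:, None]        # Equation 6.2

# usage sketch for a 10x10 map:
# coords = np.array([(r, c) for r in range(10) for c in range(10)], dtype=float)
# grid_dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)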
From Equation 6.2, it can be deduced that each node contributes to the modification of the map until it converges. In particular, it can be stated that each node
mi contributes with its own Si to the update of the map at each epoch. The term
S, which is defined below, is used to define the concentration of input patterns at
each node.
$$S_i(t) = \sum_{j=1}^{n_{v_i}} x_j \qquad (6.3)$$
Here xj refers to the n_vi different patterns that have chosen the node mi as their BMU. S can also be understood as the collection of the data points allocated to the Voronoi region Vi = {xi | ‖xi − mi‖ < ‖xi − mk‖, ∀ k ≠ i}. An expanded version of Equation 6.2, which governs training, can be derived as follows:
$$m_i(t+1) = \frac{h_{1i}(t)S_1(t) + \cdots + h_{ji}(t)S_j(t) + \cdots + h_{mi}(t)S_m(t)}{n_{V_1}(t)h_{1i}(t) + \cdots + n_{V_j}(t)h_{ji}(t) + \cdots + n_{V_m}(t)h_{mi}(t)} \qquad (6.4)$$
In this representation of the batch-training equation, the way in which each of the units contributes to the update of the map is even more evident. Although it seems that all the nodes take part in the update operation, in reality not all of them have the resources to contribute to the change. The lack of contribution from some nodes is due to the fact that they have not become BMUs; therefore, they do not have any data point in their regions which can be shared with the rest of the map. For instance, if inputs have chosen only nodes mj and mk (out of a total of m nodes) as BMUs during a training epoch, then the particular situation describing the update of node mi from nodes m1, mj and mk can be expressed by decomposing Equation 6.4 as follows:



$$
\begin{aligned}
\Delta_i(m_1(t)) &= \frac{h_{1i}(t)S_1(t)}{n_{V_1}(t)h_{1i}(t)+\cdots+n_{V_j}(t)h_{ji}(t)+\cdots+n_{V_k}(t)h_{ki}(t)+\cdots+n_{V_m}(t)h_{mi}(t)}, &\quad \Delta_i(V_1(t)) &= 0\\
\Delta_i(m_j(t)) &= \frac{h_{ji}(t)S_j(t)}{n_{V_1}(t)h_{1i}(t)+\cdots+n_{V_j}(t)h_{ji}(t)+\cdots+n_{V_k}(t)h_{ki}(t)+\cdots+n_{V_m}(t)h_{mi}(t)}, &\quad \Delta_i(V_j(t)) &= \Delta_i(bmu_j(t)) \neq 0\\
\Delta_i(m_k(t)) &= \frac{h_{ki}(t)S_k(t)}{n_{V_1}(t)h_{1i}(t)+\cdots+n_{V_j}(t)h_{ji}(t)+\cdots+n_{V_k}(t)h_{ki}(t)+\cdots+n_{V_m}(t)h_{mi}(t)}, &\quad \Delta_i(V_k(t)) &= \Delta_i(bmu_k(t)) \neq 0
\end{aligned}
\qquad (6.5)
$$
Here ∆k(mj(t)) defines the contribution or influence given to the node mk by the node mj, or by the Voronoi region Vj, at time t. In the particular case of the term ∆i(m1(t)) (top term), the influence becomes zero because this node was not able to allocate any input pattern in its region (|V1| = 0). Having observed that only some nodes contribute to the map formation, a new formulation of the batch training equation can be expressed by
$$m_i(t+1) = \sum_{m_j \in W} \Delta_i(m_j(t)) \qquad (6.6)$$
which represents the update of the map in terms of the influential factors generated by a set W containing the nodes satisfying the condition
$$W = \{\, m_i \mid |m_i.S| > 0 \ \text{or}\ m_i.S \neq \emptyset \,\} \qquad (6.7)$$
where S is the set of patterns that have hit the node mi.
In order to use Equation 6.6 for the training of the SOM, it is necessary to state mathematically how each node influence can be calculated. Therefore, if we take one of the non-zero influential components stated in Equation 6.5, for instance the term ∆i(mj(t)), and consider that two nodes become BMUs, then an expression defining the influence from node mj to mi can be stated as follows:
$$\Delta_i(m_j(t)) = \frac{h_{ji}(t)\,S_j(t)}{n_{V_j}(t)h_{ji}(t) + n_{V_k}(t)h_{ki}(t)}$$
Here it is important to note that the total influence generated from mj to mi depends not only on the state produced by mj but also on external influences formed at other BMUs. This external influence appears in the denominator of the expression and represents the number of data points allocated to those other BMUs, weighted by the corresponding neighbourhood function originating from them. Hence, the total influence generated from mj to mi at time t can be described by
$$\Delta_i(m_j(t)) = \frac{S_j(t)}{\,n_{V_j}(t) + \dfrac{n_{V_k}(t)\,h_{ki}(t)}{h_{ji}(t)}\,}$$
It is important to point out that the final response from node mj results in a value almost equal to the centroid of its Voronoi region Vj, defined by $n_j = \frac{1}{n_{v_j}} \sum_{x_i \in V_j} x_i$. The difference between these two terms can be regarded as being controlled by an influential coefficient β, which defines the ratio of the influences of the BMUs other than mj to the distance between mi and mj. Therefore, the influence generated at a node mj towards a node mi in batch SOM training can be defined by:
$$\Delta_i(m_j(t)) = \frac{S_j(t)}{n_{V_j}(t) + \beta_{ji}} \qquad (6.8)$$
in which βji is defined by
$$\beta_{ji} = \frac{\sum_{(k \in W) \wedge (k \neq j)} n_{V_k}(t)\, h_{ki}(t)}{h_{ji}(t)} \qquad (6.9)$$
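As an illustrative sketch only (using the notation above rather than any code from the thesis), the influence of a BMU mj on a node mi can be computed directly from the per-node sums and hit counts:

import numpy as np

def node_influence(i, j, h, S, n_V, winners):
    # Delta_i(m_j(t)) of Equation 6.8: the pull exerted on node i by the BMU j,
    # damped by beta_ji (Equation 6.9), the weighted influence of the other BMUs.
    # h is the neighbourhood matrix, S[j] the vector sum S_j(t), n_V[j] the size
    # of the Voronoi region V_j, and winners the set W of nodes that became BMUs.
    beta = sum(n_V[k] * h[k, i] for k in winners if k != j) / h[j, i]
    return S[j] / (n_V[j] + beta)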
The tendency of the possible differences between the resulting value ∆i (mj (t))
and the corresponding value of the centroid mj (t) is represented by the graph in
Figure 6.2.
[Plot: the final node influence ∆i(mj(t)) (y-axis), bounded above by Sj(t)/nVj(t), against the influence received from other nodes (x-axis).]
Figure 6.2: Tendency of the final influence given by the node mj depending on the
strength of the influences received from other nodes.
6.2.3 The Algorithm
In order to satisfy the requirements stated in Equation 6.1, we need to determine which information from an old trained map is relevant to keep for its maintenance throughout time. That is, we need to identify K, since it defines knowledge about the past of an environment. We have therefore proposed that K be defined by the last group of BMUs, since they triggered the last change of the available map. In other words, a historical knowledge K about a training environment D up to time t can be formed from a model Msom(D(t)) by extracting the following properties of each BMU:
• The reference vector, since it summarises all the patterns grouped at the node.
• The hit histogram, which also defines a prior probability of the node; that is, the number of times that the corresponding node has been hit by the inputs.
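For illustration only, this knowledge K could be collected from a converged map as in the following sketch; it assumes the per-node hit counts from the last batch epoch are available, and the function name is ours, not the thesis implementation's.

import numpy as np

def extract_bmu_knowledge(codebook, hits):
    # Keep only the nodes that became BMUs: their reference vectors summarise
    # the patterns they captured, and their hit counts form the hit histogram
    # (a prior probability for each node).
    bmu_idx = np.flatnonzero(hits > 0)
    return codebook[bmu_idx].copy(), hits[bmu_idx].copy()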
Once this knowledge has been obtained, the next step is to define a procedure to re-use it in a training process in which a SOM is about to learn the new state D(t + x) drawn from the original environment. This is necessary because otherwise the past knowledge would be forgotten as soon as the new training with the latest state of D begins. To re-use the past knowledge, we propose adding a new term to Equation 6.6, based on the fact that batch SOM training is the result of summing up neural influences. Therefore, we propose that the batch SOM training should be defined by
$$m_i(t+1) = \sum_{m_j \in W} \Delta_i(m_j(t)) + \sum \Delta_i\big(K[M_{SOM}(D(t))]\big) \qquad (6.10)$$
Here, the t in ∆i(K[MSOM(D(t))]) refers to the previous state defined in the non-stationary environment rather than to epoch t of the current training. The insertion of this new component into the traditional training of the SOM is proposed as a means of keeping track of the old data, because it defines not only the past organisation of the data (centroids) but also how strong or weak the pattern populations allocated to the BMUs were.
Based on the above, our desired γ() algorithm is defined as in Figure 6.3. The first function is responsible for triggering the learning of the data at each of the stages existing in the non-stationary environment. The second function, which is called whenever a new data chunk appears in the environment, is in charge not only of learning the new data chunk, but also of retaining the old knowledge captured by the previously trained map. In other words, both sources, the new patterns and the reference vectors of the BMUs of the previous SOM, serve as input vectors for the training of the new map (Step 4). Moreover, a type of linear initialisation occurs with our approach at the moment of learning a new data chunk, because the old map is also re-used to initialise the new one (Step 2).
Function Batch-Incremental ()
1) t = 0                              // t defines the current stage in the training environment
2) Mt = 0                             // it represents the dynamic SOM
3) while (t < endOfEnvironment)
4)    Mt+1 = UpdateSOMIncludingPast(Mt, D(t+1))
5)    t++
6) end

// this function returns a trained SOM with information defining the current and past data chunks
Function UpdateSOMIncludingPast (Mt, D(t+1))
1) K = extract-bmus-knowledge(Mt)     // knowledge about the past is captured
2) mi = initialisation(K)             // knowledge is also used for initialisation of a SOM
3) for (i = 1; i <= numEpochs; i++) do begin
4)    forall patterns p ∈ D(t+1) ∪ K do begin
5)       call LookingforBMU(p);
6)    end
7)    mi = update-map()               // the batch equation is employed
8) end
9) return (mi)                        // resulting map describing the data chunks of the environment up to time t+1
Figure 6.3: Incremental algorithm proposed for the batch SOM. While the first (top) function triggers the learning at each stage of a non-stationary environment, the second (bottom) function performs the training of the SOM with the current data chunk and the old information coming from the set of best matching units of the latest trained map.
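The sketch below indicates how the two functions of Figure 6.3 might be realised in NumPy. It is only an outline under stated assumptions: the reference vectors of the old BMUs are appended to the new chunk as extra training vectors, their hit counts are used to weight them in the per-node sums (one plausible reading of how the hit histogram in K can be re-used), and a Gaussian neighbourhood of fixed radius is applied. The helper names are illustrative, not the thesis implementation.

import numpy as np

def weighted_batch_epoch(codebook, grid_dist2, data, weights, sigma=1.0):
    # One batch epoch (Equation 6.2) with per-pattern weights, so that an old
    # BMU vector can count as many times as the patterns it summarises.
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    bmu = dists.argmin(axis=1)
    S = np.zeros_like(codebook)
    n_V = np.zeros(len(codebook))
    np.add.at(S, bmu, data * weights[:, None])
    np.add.at(n_V, bmu, weights)
    h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
    return (h.T @ S) / (h.T @ n_V)[:, None], n_V

def update_som_including_past(codebook, hits, grid_dist2, chunk, epochs=10):
    # UpdateSOMIncludingPast of Figure 6.3: learn the new data chunk together
    # with the knowledge K extracted from the BMUs of the previous map.
    past = np.flatnonzero(hits > 0)
    k_vectors, k_hits = codebook[past], hits[past]            # knowledge K
    data = np.vstack([chunk, k_vectors])                      # step 4: new patterns plus K
    weights = np.concatenate([np.ones(len(chunk)), k_hits])   # assumed weighting by hit counts
    m = codebook.copy()                                       # step 2: old map initialises the new one
    for _ in range(epochs):
        m, n_V = weighted_batch_epoch(m, grid_dist2, data, weights)
    return m, n_V                                             # updated map and its new hit histogram

# Batch-Incremental (the top function of Figure 6.3) then amounts to calling
# update_som_including_past once per data chunk as the environment evolves.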
6.2.4 Experiments
To test the approach defined above, we have re-created the experimental conditions presented in (Furao and Hasegawa, 2004), in which a neural-network training method is tested for non-stationary environments by using diverse data chunks to represent different data distributions. In our case, the training data space, which will be learnt incrementally by our approach, is shown in Figure 6.4. To simulate a non-stationary environment, a SOM will be fed with the data regions (A1, A2, A3, B, C and D, defining different topologies) in the order defined in Table 6.1.
Figure 6.4: Representation of a training data space describing a non-stationary environment.
Environment phase    Input chunk    Past knowledge
I                    A1             no
II                   A2             yes (A1)
III                  A3             yes (A2)
IV                   B              yes (A3)
V                    C              yes (B)
VI                   D              yes (C)
Table 6.1: Data order followed in the incremental batch training.
The aim of this experiment is to observe whether the internal structure of the SOM, trained incrementally by using the algorithm defined above, is able to map all of the different topologies represented in the training data. In other words, it is desirable that the presentation of a new data topology (data chunk) does not make the nodes forget the previous knowledge or mappings of other topologies.
The mappings formed at each of the six phases of training on the non-stationary environment by a SOM trained with our proposal are shown in Figure 6.5. In all phases apart from the first one, the SOM is trained with the corresponding data chunk (the changes in the data) along with the historical knowledge (the group of BMUs) describing the previous phases. It can be noted that in each of the phases the SOM not only covers the topology of the corresponding data chunk, but also uses some of its nodes to retain the knowledge learnt previously.
6.3 Itemset Support Maintenance by Incremental
SOM Training
Having defined a method which allows the batch-incremental training of a SOM, we will focus in this section on investigating whether such a proposal can be used to maintain the itemset-support knowledge embedded in a trained map throughout time. That is, in the next experiments we will evaluate whether a trained SOM, acting as an artificial itemset-support memory, is able to update its knowledge with the changes occurring in a non-stationary environment, simulated by partitioning a real-life FIM dataset.
6.3.1 Experiments
To evaluate the suitability of our batch-incremental-training algorithm for SOMs
as an approach for the maintenance of the knowledge (itemset support) learnt by
[Figure 6.5 panels: the maps formed at phases III to VI of the environment, covering the data regions A1, A2, A3, B, C and D.]
Figure 6.5: Topologies formed by a SOM trained with our incremental batch approach
through the six different phases defining the non-stationary environment represented in
Figure 6.4. The black dots define the structure of the trained map. The data points used
for the training of a SOM at each phase of the environment, according to the order in
Table 6.1, are defined by the green and blue dots, which represent respectively the old
knowledge (data extracted from the BMUs) and the current data chunk.
a SOM from dynamic data, a series of experiments will be conducted by simulating three different non-stationary environments derived from the Chess dataset (a real-life dataset with 3196 transactions defined by 75 items). In each environment, the dataset is separated into data chunks of different sizes to model the different phases of change in which the database can become involved as a result of the dynamics of the environment. It is important to state that the radius parameter of the batch training equation, which is responsible for shrinking the neighbourhoods and tuning the trained map for detecting fine data structures (Kohonen, 1996), will be kept at a constant value of 1 in our experiments, since we are interested in capturing in the final map the global tendencies of itemset support in the environment.
The general conditions describing each of the environments are summarized in
Table 6.2.
Environment    Data-chunk size                                     Number of phases
I              800 transactions per chunk (fixed)                  4
II             400 transactions per chunk (fixed)                  8
III            from 200 to 600 transactions per chunk (unfixed)    9
Table 6.2: Definition of the non-stationary environments using the Chess dataset. In
the first two environments, each data chunk has the same (fixed) number of transactions, while in the case of the third environment, the number of transactions was chosen
randomly (unfixed).
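A non-stationary environment of this kind can be simulated simply by slicing the transaction matrix into consecutive chunks, as in the hedged sketch below; the function name and the example sizes, which mirror environment I with the last chunk absorbing the remainder of the 3196 transactions, are ours and only illustrative.

import numpy as np

def make_phases(transactions, chunk_sizes):
    # Split an (n_transactions x n_items) binary matrix into consecutive
    # chunks, one per phase of the simulated non-stationary environment.
    boundaries = np.cumsum(chunk_sizes)[:-1]
    return np.split(transactions, boundaries)

# e.g. environment I over the Chess data:
# phases = make_phases(chess, [800, 800, 800, 796])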
The environments are set up as phases because the resulting SOM will be queried to estimate support for some groups of itemsets after its
training has been performed with the data chunk representing the corresponding
phase. In all experiments, the map size has been set up to a fixed number of
nodes (about 900 nodes) that represent about one quarter of the total number of
transactions in the Chess dataset.
To measure the efficiency of our method in updating a SOM for FIM purposes throughout time, two other approaches, representing traditional mechanisms for performing the update, are also tested for comparison.
The first approach (Chunk-SOM) corresponds to training a SOM with only the data chunk which defines the changes that occurred in the database at phase i of the environment. This approach can be understood as the worst possible scenario for addressing the maintenance of the map, because the accuracy of the results (itemset support) depends on how well the current data chunk (the current sample) represents the overall tendency of the associations in the entire D throughout time. Relying on this approach for itemset-support recall is risky, as the results below confirm: although all of the data chunks are drawn from the same data distribution D of the environment, their transactions do not necessarily exhibit the same behaviour.
The second approach to be evaluated is our batch incremental training for SOM (Bincremental-SOM), which represents the use of a batch-incremental training of SOMs for the maintenance of itemset support. This method is characterised not only by learning the current data chunk at each phase k, but also by utilising some knowledge from the latest state of the SOM in the environment, since it has retained information about, for instance, the topology and distribution of the past phases in the training environment. The premise followed by our approach is that there is no reason to retain the old data chunks of the environment any longer once they have been learnt by a SOM.
The third approach (Allchunks-SOM) is a naive retraining of the SOM. It involves training the SOM with all the data chunks produced up to the current phase k in the environment. The reason for using this approach is that its results represent the best theoretical itemset-support approximations that can be generated from the environment. Its big disadvantage is that all the data chunks, representing the changes, have to be involved in the training of the map.
To make the comparison between our approach and the others, some error
readings will be calculated from the maps formed at each of the phases of the
environments. The metrics used to evaluate the accuracy and quality of the maps
are described as follows:
• The RMS (Root-Mean-Squared) error: This metric, defined below in Equation 6.11, is used to measure the accuracy of the generalisation of the support recall for the groups of 1- and 2-itemsets. To calculate this error, the target t is given by the results obtained from applying the Apriori algorithm to the data chunk in turn. It is important to note that N, which defines the number of itemsets in the recall, may vary during an environment, since data changes can also involve the appearance of new items throughout time (a small computational sketch of this metric, and of the quantization error below, is given after this list).
$$E_{RMS} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\big\{\,y(x_n; m^{*}) - t_n\,\big\}^{2}} \qquad (6.11)$$
• The average quantization error: This is a well-known metric used to measure the quality of a map based on the resolution of the mapping from a Vector Quantization point of view, in which the aim is to look for a group of vectors (a codebook) that represents the distribution of the input data source in the most suitable form. The error is defined by Equation 6.12, in which xi, mc and N define respectively an input vector from D, the BMU for xi and the total number of patterns in D.
In our case, as the SOM will be changing and accumulating new knowledge, this error is calculated to gauge how well the approaches summarise the past and current data chunks at each phase. Since we assume that transactions recently added to a dynamic database can be more interesting than those inserted long ago, in the sense that they reflect the current tendencies more closely, it is of particular interest to determine which data, the history or the latest state of the database (the added transactions), are better represented by the approaches tested.
$$E_q = \frac{1}{N}\sum_{i=1}^{N}\| x_i - m_c \| \qquad (6.12)$$
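For reference, both metrics reduce to a few lines; the sketch below is ours and assumes the SOM estimates and the Apriori supports are aligned arrays, and that data and codebook are NumPy arrays.

import numpy as np

def rms_error(estimated, target):
    # Equation 6.11: root-mean-squared difference between the SOM estimates
    # y(x_n; m*) and the Apriori supports t_n.
    estimated, target = np.asarray(estimated, float), np.asarray(target, float)
    return np.sqrt(np.mean((estimated - target) ** 2))

def quantization_error(data, codebook):
    # Equation 6.12: mean distance between each input vector and its BMU.
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()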
In addition to these two metrics, we have also recorded other values to observe the manner in which the nodes in the maps are being used along the phases
of the environments.
Figures 6.6 and 6.7 are shown to illustrate the response of a SOM regarding the support estimation for the groups of 1- and 2-itemsets during the four
phases defining the environment I. One of the characteristics that can be observed
in these results is the magnitude of the differences between the first approach
(Chunk-SOM) and Apriori, in comparison to the other two (Bincremental-SOM
and Allchunks-SOM). The first approach, as predicted above, is by far the worst
approximation under non-stationary circumstances; therefore, no reliable results
to describe the support of the itemsets can be derived from it. Moreover, the
differences tend to become larger along the phases of the environment.
[Figure 6.6 panels: difference plots for the 1-itemsets, one column per approach (Chunk-SOM, Bincremental-SOM, Allchunks-SOM) and one row per phase of environment I.]
Figure 6.6: Differences between estimations (approximations) and calculations (real
values) made respectively by a trained SOM and the Apriori algorithm for the group of
1-itemsets throughout the four phases of the environment I. The first column from the
left describes the type of estimations that can be produced with a SOM trained with only
the data chunk in turn in the environment (Chunk-SOM). The second column represents
the estimations that can be made from a SOM trained with our incremental approach
(Bincremental-SOM). The last column shows the estimation made by a SOM trained
always with all the data chunks available in the environment (Allchunks-SOM).
[Figure 6.7 panels: difference plots for the 2-itemsets, one column per approach (Chunk-SOM, Bincremental-SOM, Allchunks-SOM) and one row per phase of environment I.]
Figure 6.7: Differences between estimations (approximations) and calculations (real
values) made respectively by a trained SOM and the Apriori algorithm for the group of
2-itemsets throughout the four phases of the environment I. The first column from the
left describes the type of estimations that can be produced with a SOM trained with only
the data chunk in turn in the environment (Chunk-SOM). The second column represents
the estimations that can be made from a SOM trained with our incremental approach
(Bincremental-SOM). The last column shows the estimation made by a SOM trained
always with all the data chunks available in the environment (Allchunks-SOM).
The behaviour of the RMS error, measuring the generalisation of the support recall for the 1- and 2-itemsets in the first environment, is shown in Figures 6.8 and 6.9. In these figures, it can be observed that even though the approaches started under the same conditions (same initialisation, same inputs) and provided the same result after concluding the first phase, they all follow different tendencies, with the first approach standing out for giving the poorest results. In the case of our approach, even if it presents some differences with respect to the best case (a SOM always trained with all data chunks), its performance can be considered satisfactory, since it remains steady during the incorporation of new transactions, in particular if we consider that its training has involved just the new data chunk in turn, along with the knowledge extracted from the latest trained SOM.
Figure 6.8: The RMS error during the phases of the environment I for the group of the
1-itemsets.
In Figures 6.10 and 6.11, we show the values recorded on the quantization
Figure 6.9: The RMS error during the phases of the environment I for the group of the
2-itemsets.
Figure 6.10: Error describing the quality of the quantization of the approaches tested
for the data chunk describing the changes at each phase of the environment I.
Figure 6.11: Error describing the quality of the quantization of the approaches tested for
the data chunks describing the previous phases (history) at each phase of the environment
I.
error at each phase k for the k th data chunk and all the past ones.
These figures help in analysing the quantization abilities of each of the approaches tested. A perfect quantization of some data is defined by a quantization error equal to zero, and means that the vectors, defined in this case by the SOM nodes, describe the distribution of the input data perfectly.
In particular, the quantization of the maps produced for the data chunk provided at each phase is shown in Figure 6.10. Initially, it was expected that the first approach (Chunk-SOM) would generate the best results for this task, since the input data for its training at each phase are the same data used for the measurement of the error. In other words, the best results were expected from the first approach because its nodes (vectors) do not get distracted by any other inputs during training, unlike the maps of the other two approaches; the nodes can therefore concentrate on forming the mapping of the current data chunk. Nevertheless, the results show that the converged map based on our approach (Bincremental-SOM) provides the best mapping resolution for the group of transactions of the corresponding data chunk at each phase in the environment. A reason for the decrease in the quantization error in our approach is that our batch-incremental trained SOM takes advantage of the past knowledge. The key to the improvement is that a map formed with our proposal at phase k already has an advantage in its initial state, since its vectors already map some regions of the data space defined by the database; therefore, it can be claimed that our proposal achieves the re-use of past knowledge. Hence, the maps towards the end of the environment are created faster and produce a better approximation of the perfect quantization model.
Although it could be argued that the improvement observed in our approach might be replicated by using linear initialisation in the formation of the map, the fact is that in the best of the approaches (Allchunks-SOM), which was initialised linearly, this positive characteristic is not present. In fact, in this task the worst quantization for the latest chunk at each phase was given by the SOM trained with all data chunks, which leads us to conclude that the vectors of that map got distracted from mapping the most recent data chunks by the existence of other (past) data chunks that also had to be mapped during training.
A reduction of the quantization error under non-stationary conditions is a very positive characteristic of our approach, because we can state that the organisation of the map via our algorithm forms a stronger mapping of the newest data chunk than of the old ones. Hence, we can be confident that, in the recall of support, the latest tendencies of the environment will be considered.
In the case of the quantization of the old data chunks at each phase, the results are shown in Figure 6.11. Here all approaches start with a zero error, since in the initial phase there is no history to remember. In the subsequent phases, the best organisation of vectors is given by the third approach, since its training always involves the presence of the past and current chunks. For this task, the results of our approach seem positive if we bear in mind that no past data were present in its training; in other words, our approach still remembers the past phases of the database satisfactorily.
The response of a SOM trained with our approach under the conditions set up for environment II, which has been formed by splitting the data chunks of environment I in half, is shown in Figures 6.12 and 6.13, which plot the RMS error of the itemset-support generalisation for the groups of 1- and 2-itemsets respectively.
Figure 6.12: RMS error during the phases of environment II for the group of the 1-itemsets.
Figure 6.13: RMS error during the phases of environment II for the group of the 2-itemsets.
Overall, the accuracy of the recalls of the approaches was similar to that observed for the first environment. The good characteristic of our approach to be pointed out in this case is that, even though there are more phases in the environment, the error does not tend to increase substantially.
Encouragingly, the property of mapping the latest data chunk better than the history is still observable in the results for the quantization errors shown in Figures 6.14 and 6.15.
In environment III, the size of the data chunks has been varied and selected randomly, resulting in a nine-phase environment. The corresponding results for the RMS and quantization errors are plotted in Figures 6.16, 6.17, 6.18, and 6.19.
Figure 6.14: Error describing the quality of the quantization of the approaches tested
for the data chunk describing the changes at each phase of the environment II.
Figure 6.15: Error describing the quality of the quantization of the approaches tested for
the data chunks describing the previous phases (history) at each phase of the environment
II.
Figure 6.16: RMS Error during the phases of environment III for the group of the
1-itemsets.
Figure 6.17: RMS Error during the phases of environment III for the group of the
2-itemsets.
Figure 6.18: Error describing the quality of the quantization of the approaches tested
for the data chunk describing the changes at each phase of the environment III.
Figure 6.19: Error describing the quality of the quantization of the approaches tested for
the data chunks describing the previous phases (history) at each phase of the environment
III.
In general, we can state that the three approaches maintain the same behaviour as in the environments with fixed-size data chunks. The results of all the experiments for the three environments are summarized in Tables 6.3 to 6.5.
Approach  Phase  RMS 1-itemsets  RMS 2-itemsets  BMU Present  BMU History  BMU Shared  QE History  QE Current chunk
First     1      0.11048         0.70959         484          0            0           0           1.1426
First     2      9.7047          8.8469          435          274          157         2.3581      1.1681
First     3      10.817          9.6593          397          398          174         2.1766      1.3377
First     4      12.546          10.687          393          521          243         2.5064      1.3793
Second    1      0.11048         0.70959         484          0            0           0           1.1426
Second    2      0.11357         0.91807         526          303          245         1.4863      1.4542
Second    3      0.12896         1.0835          546          502          402         1.6577      1.1557
Second    4      0.17268         1.1946          545          577          436         1.8495      0.92327
Third     1      0.11048         0.70959         484          0            0           0           1.1426
Third     2      0.10551         0.81733         593          326          326         1.3588      1.4247
Third     3      0.070201        0.90658         615          507          507         1.4937      1.7444
Third     4      0.062413        0.9667          649          569          569         1.6426      1.8763
Table 6.3: Results obtained for the approaches tested in Environment I.
Some records regarding the use of the BMUs in the maps during the training phases have also been added to the result tables, in order to analyse the usage and organisation of the mappings formed by the different approaches. The column BMU-Present gives the total number of nodes which ended up becoming BMUs for the corresponding inputs. The column BMU-History specifies the number of BMUs which have been used to map the old data chunks. The column BMU-Shared gives the number of BMUs which are shared in the SOM to map both the most recent and the old data chunks.
Taking the results of Table 6.5 for discussion, since the settings of the environment from which they have been derived are closest to a real-life environment, it can be noted that our approach (the second approach) keeps using fewer nodes than the third approach (theoretically the best approach, since the map
Approach  Phase  RMS 1-itemsets  RMS 2-itemsets  BMU Present  BMU History  BMU Shared  QE History  QE Current chunk
First     1      0.10288         0.63257         325          0            0           0           0.89182
First     2      7.8216          7.5049          319          189          106         2.0268      0.88066
First     3      10.973          10.156          296          310          165         2.3137      0.92952
First     4      15.424          14.1            317          372          196         2.5348      0.89395
First     5      14.226          12.628          291          365          107         2.3648      1.0259
First     6      16.862          14.779          281          395          151         2.5568      1.0009
First     7      17.325          14.748          286          468          145         2.6447      1.0125
First     8      14.449          12.382          247          482          161         2.7223      1.0748
Second    1      0.10288         0.63257         325          0            0           0           0.89182
Second    2      0.13455         0.80306         448          261          221         1.2076      1.2426
Second    3      0.14465         0.93041         439          339          258         1.4575      0.9156
Second    4      0.14651         1.0392          445          402          288         1.6167      0.72646
Second    5      0.15682         1.1311          489          526          398         1.6942      0.66307
Second    6      0.16921         1.2214          478          555          379         1.8229      0.57516
Second    7      0.17049         1.2807          493          586          405         1.8978      0.54156
Second    8      0.1813          1.342           497          613          416         1.98787     0.49316
Third     1      0.10288         0.63257         325          0            0           0           0.89182
Third     2      0.11048         0.70959         484          264          264         1.1315      1.1536
Third     3      0.062258        0.7758          529          377          377         1.2825      1.3432
Third     4      0.10551         0.81733         593          465          465         1.392       1.3908
Third     5      0.076753        0.8352          613          556          556         1.4305      1.6856
Third     6      0.070201        0.90658         615          559          559         1.5432      1.7477
Third     7      0.10098         0.93237         651          604          604         1.6021      1.8078
Third     8      0.062413        0.9667          649          610          610         1.6702      1.9172
Table 6.4: Results obtained for the approaches tested in Environment II.
Approach  Phase  RMS 1-itemsets  RMS 2-itemsets  BMU Present  BMU History  BMU Shared  QE History  QE Current chunk
First     1      0.091058        0.42505         189          0            0           0           0.53331
First     2      5.1857          5.0553          364          107          67          1.8826      0.99159
First     3      11.573          10.909          243          209          109         2.3294      0.79141
First     4      16.689          15.39           223          337          165         2.4964      0.67263
First     5      13.922          12.775          312          358          182         2.4863      0.90183
First     6      13.733          12.242          284          390          128         2.3275      1.0423
First     7      16.585          14.597          258          404          141         2.5412      0.92153
First     8      15.307          12.985          322          458          172         2.6158      1.1486
First     9      15.566          13.375          238          466          148         2.79        0.87699
Second    1      0.091058        0.42505         189          0            0           0           0.53331
Second    2      0.16707         0.76713         430          161          151         1.0576      1.1491
Second    3      0.23229         0.8775          435          343          275         1.397       0.8237
Second    4      0.26526         1.0197          416          406          285         1.535       0.62643
Second    5      0.25449         1.0779          440          430          294         1.6422      0.71565
Second    6      0.21993         1.1506          472          495          359         1.7038      0.66279
Second    7      0.22947         1.2419          462          527          357         1.8217      0.52506
Second    8      0.24148         1.2712          471          529          356         1.8959      0.61216
Second    9      0.27288         1.3688          469          582          381         1.9895      0.41183
Third     1      0.091058        0.42505         189          0            0           0           0.53331
Third     2      0.10355         0.71514         435          126          126         1.1129      1.1107
Third     3      0.1079          0.73557         500          367          367         1.2155      1.2916
Third     4      0.061232        0.79409         513          427          427         1.3085      1.3741
Third     5      0.077156        0.79983         606          483          483         1.3954      1.423
Third     6      0.078521        0.84692         619          576          576         1.4519      1.7477
Third     7      0.070201        0.90658         615          573          573         1.5472      1.7535
Third     8      0.10605         0.92097         628          578          578         1.6186      1.8095
Third     9      0.062413        0.9667          649          624          624         1.6792      1.9126
Table 6.5: Results obtained for the approaches tested in Environment III.
is always built with the whole database) to build the mapping of the data chunks throughout the environment. This can be understood as a good feature of our approach, given that the mappings formed by these two approaches (Bincremental-SOM and Allchunks-SOM) give similar results for itemset-support recall while involving a different number of nodes in the answer. Furthermore, as depicted in Figure 6.20, which shows the runtime of the evaluated methods for the experiment on the non-stationary environment with unfixed data chunks, our batch-incremental approach converges in the same or less time than the best approach (Allchunks-SOM) during the learning of the non-stationary environment.
Figure 6.20: Runtime for the three approaches evaluated on the environment III.
A seemingly contradictory effect regarding the mapping of the old data chunks occurs with our approach, since the results show that it tends to gradually lose the reference to the past as a consequence of employing an incremental procedure. For instance, maps formed after phase five need more BMUs than the ones obtained in their training to produce the best mapping of the past chunks. Even if this could be seen as a negative feature, it is actually a positive one, because the map pays more attention to the mapping of the latest chunk than to the historical ones; therefore, the recall will be influenced by the latest tendencies of the changes rather than the very old ones. As a consequence, we can also observe that the number of BMUs shared between the past and present data at each phase does not tend to coincide, as it does in the third approach, in which the sharing is complete. Therefore, we can assume that some BMUs would need to be combined with others in our approach to achieve some reduction of the recall (RMS) error. Nevertheless, promoting such a node fusion could result in updating nodes (vectors) which already satisfactorily map the most current state of the database. In conclusion, a trade-off has to be made between a better generalization and a better quantization of the data chunks.
6.4 Conclusions
In the previous chapters we have proposed how two different neural networks can
be used for ARM by building itemset-support memories based on them. Since
these artificial memories may be learning from non-stationary environments, we
have stated that the update of the knowledge in them is necessary and relevant for
the development of the neural-based framework for ARM. Hence, in this chapter
we have looked at tackling the maintenance of the knowledge in a memory based
on a self-organising map.
To achieve our goal, we have investigated how to perform such an update in
the memory throughout time by the usage of incremental training. Since the concept of batch training has been important for the definition of a method for the
recall of itemset-support from a SOM, and knowing that this type of training has
been stated as unsuitable for non-stationary environments, we have proposed an
incremental training method for SOM in batch called Bincremental-SOM which
allows the new associations to be learnt without forgetting the previous knowledge.
To evaluate the accuracy of the proposed method, we set up experiments
which simulate some non-stationary situations with the information contained in
a dataset used for FIM studies. To compare the results given by our approach, we
also trained two other maps with the latest state and all of the partial states occurring in the environment. It can be concluded from our results, shown above, that
a SOM-based memory is able to keep its knowledge updated by performing its
training with our method. It was also observed that the quantization generated by
the SOM trained with our approach, for the newest group of inserts in the environment, is even better than the one produced with the map trained only with the
corresponding data chunk. Therefore, it can be stated that the hypotheses generated, which in this case represent itemset-support estimations, will be produced
with the latest itemset tendencies in the environment, but without forgetting the
past ones.
Chapter 7
Conclusions and Future Work
In this thesis, we have looked at the suitability of ANNs (Artificial Neural Networks) for the descriptive data-mining task of ARM (Association Rule Mining). In particular, we have followed the premise that a neural-based framework can be built for tasks like association rule mining because the concepts implicated in the generation of these rules, such as association, frequency, counting, and information storage, are also involved in the learning process performed by humans, which artificial neural networks aim to imitate.
In order to begin the development of such a neural-based framework, and considering the importance of itemset support for the generation of this type of rules, we have focused on the development of the memory stage. This stage is in charge of learning, storing, maintaining, and recalling the itemset-support property defined by the learnt associations, in order to supply frequency-statistical (support) information about the itemsets to other stages of the framework, for instance, the processes controlling the generation of rules (in control of the FIM logic).
In the quest for the most suitable neural network to become an itemset-support (pattern-frequency) memory, we have studied two neural networks: a
self-organising map and an auto-associative memory. Our studies have focused
on determining if these two ANNs have the ability to learn frequency-statistical
information from the taught associations in order to make estimations about the
support of the itemsets of an environment. In other words, we have proposed how
these two neural networks can reproduce the values which normally result from
the counting of discrete patterns describing associations (itemsets). Additionally,
since data often describe events occurring in a non-stationary environment, we
have also proposed how the itemset-support knowledge embedded in the weight
matrix, generated by a self-organising map, can be updated while the environment changes throughout time.
Therefore, in order to complete this thesis, the final conclusions of the work proposed are presented in this chapter. Answers to our research questions stated in Section 1.4 are also given. Moreover, we establish some links to other pieces of research and guidelines for future work which can contribute to the continuation of (1) the quest for an understanding of the counting of patterns with artificial neural networks, and (2) the development of an ANN-based framework for ARM or similar tasks.
7.1 Final Results
Neural network technology has been successfully used to tackle problems involving prediction and clustering. Nevertheless, its use for tasks like ARM is still unclear and uncertain because the field lacks research on this topic. For this reason, we have studied the usage of ANNs for ARM.
After establishing that ARM is a mechanical process whose realisation involves some of the concepts present in the learning process of biological systems (Gardner-Medwin and Barlow, 2001), and motivated by the aim of imitating human behaviour in the generation of association rules with ANNs, we have conducted research on the suitability of ANNs for ARM in this thesis. In particular, we have followed the hypothesis that the embedded knowledge formed by an ANN, as a result of its training with some environment defining associations, can be used for the generation of association rules, mainly because:
• We have created the baseline for the generation of such symbolic representations describing associations from the knowledge of two neural networks. That is, association rules, for instance like those in Figure 7.1, can in future be generated from the knowledge embedded in the ANNs studied here, because we have proposed in this thesis mechanisms through which the support of itemsets, the raw material needed for the rules, can be estimated from the weight matrix of a CMM and a SOM. Nevertheless, as stated later, the definition of the logic of the rule-generation mechanism remains open.
• We have pointed out that both technologies, ANNs and ARM, handle the
concept of association in one form or another to form knowledge from
data.
• We have stated all the different alternatives in which a neural network can
participate within the current ARM framework.
• We have categorized ARM as a mechanical inference task which can be
performed satisfactorily by humans through the counting and the making
of associations among the existing elements of data. Therefore, it was
assumed that ANNs can be used for ARM only if they are able to reproduce
the pattern counting done by their biological counterparts.
Figure 7.1: Example of association rules which can be further generated from the
knowledge learnt by a neural network about the Mushroom dataset defined in (D.J. Newman and Merz, 1998). All these rules, describing the associativity among the attributes
of a dataset, have the format of: if (list of items or attributes) then (list of items or
attributes) with [support=% and confidence=%].
A detailed explanation about the links between ARM and ANNs has been
given in Section 1.2 as a part of the motivations for this work.
In order to gain some insight into the use of ANNs for ARM, we have also
proposed the idea that the current ARM framework could be transformed into
a framework whose core can be represented by a neural network architecture
(defined in Chapter 3). As a first stage of the neural framework for ARM, we
have proposed to build an ANN-based artificial memory, which can learn, store,
maintain and recall knowledge about the occurrence of patterns in the training
dataset to support other stages of the framework with this knowledge. In other
words, we have looked at building an ANN-based memory which can provide a
good estimation about itemset support when it is queried.
Our research has been carried out principally to establish extraction mechanisms through which an auto-associative memory (a correlation matrix memory) and a self-organising map can infer itemset support from the knowledge formed after their training. Our main aim has been to reproduce the results given by a traditional pattern counting process, for instance the itemset-support values given by the Apriori algorithm in an association rule mining process, by using only the knowledge embedded in the mapping or weight matrix of our two ANN candidates.
We have used the Apriori algorithm for comparison because, as discussed in previous sections of the thesis, Apriori's itemset-support calculations are error-free, since it counts the itemset occurrences directly within the data. Moreover, it is the gold standard and historically the most important algorithm for ARM, from which many other algorithms have been derived in one way or another. In other words, approximation methods, like the estimations produced here, should be compared against the real values, such as the ones generated by Apriori.
Discovering and emulating counting abilities with our ANN candidates has been established as relevant for the success of this thesis for the following reasons:
1. We identified the relevance of itemset support for the support-confidence
framework in ARM and therefore for the generation of rules. Hence, we
stated that in order to produce rules from ANNs, it is first necessary to
reproduce itemset support from them since it is the metric to determine the
frequent itemsets from which rules can be derived.
2. The achievement and viability of building itemset-support memories are directly related to the counting of patterns with ANNs. Moreover, the reproduction of counting with ANNs is challenging as well as important since
it takes part in the learning process performed by biological systems.
3. In order to have a mainly neural-based framework for ARM or similar
tasks, we defined that the first stage would incorporate an ANN, acting as
a memory, because it conveys autonomy to the framework in the sense that
its hypotheses or rules will depend exclusively on the knowledge accumulated by the artificial neurons. The latter is an activity occurring within the
brain, which stores knowledge (experiences) throughout time to be employed for different purposes. Moreover, we have also defined that the
existence of a mapping (feature knowledge space) formed with the associations would make the framework independent of retaining the learnt
environment when providing answers after learning it. Such retention is a drawback
present in most of the related work summarised in Chapter 3.
Since the counting of patterns as well as the learning of an environment
can occur under totally dynamic conditions (non-stationary), we also looked at
proposing how to prepare the knowledge in the memory for these environments.
Therefore, we have proposed a method to update the knowledge embedded in a
self-organising map which learns from an environment with dynamic conditions.
Because our work involves reproducing counting properties on two well-known neural models rather than theoretical ones, it can also be understood as a continuation of the theoretical work in (Gardner-Medwin and Barlow,
2001). Moreover, our approaches have taken into account the combinatorics of
patterns described by the associations among their elements. It is important to
state that our work does not claim that our interpretations, of how the counting
of patterns is performed by the two neural networks studied here, happen in biological systems. Nevertheless, they serve to form a better understanding of what
may be happening in the brain. Specific conclusions of each piece of research
can be summarized as follows:
Auto-Associative Memory Model (Chapter 4)
In this chapter, it has been studied whether a weighted auto-associative memory, based on a CMM (Correlation Matrix Memory), could represent our desired
itemset-support memory. This memory was investigated because it exploits the
concept of association among the input patterns to learn about its environment.
As a result of our analysis, it was found that a weighted m-by-m correlation matrix memory, in which m defines the number of items or attributes forming the
patterns, has the natural ability to learn the statistical frequency of the discrete
patterns used for its training.
After discovering such ability, we also pointed out that, because of the symmetry of its weight matrix, a CMM only needs around half of the elements to
represent the different learnt occurrence frequencies in the training data. Since
it has been advised in (Gardner-Medwin and Barlow, 2001) that it is convenient
for the counting of patterns to have direct representations, one-to-one relationships, between patterns (itemsets) and neurons rather than distributed ones, we
focused on looking for those direct representations on the m^2 nodes defined by
this memory.
It was possible to identify one-to-one relationships between nodes
and some of the sought associations or itemsets. We concluded that frequency values,
and therefore itemset support, can be calculated directly for m(m + 1)/2 patterns
out of the 2^m patterns that, ideally, we wanted this memory to represent (if each
node mapped a different pattern). In particular, we have stated that the support values for the groups of 1- and 2-itemsets given by this memory are as errorless as the
values given by the well-known Apriori algorithm, which scans and counts the
patterns directly in the original data to calculate their itemset support.
In the case of the remaining k-itemsets, for 3 ≤ k ≤ m, we looked at defining
a distributed method to infer the frequencies needed for their support calculations.
Hence, two methods, A and B, have been proposed to estimate support ($\hat{supp}$)
by using the distributed knowledge embedded within the m(m + 1)/2 nodes of the
memory which directly define single- and pairwise-event frequencies. In the
definition of both methods, it was assumed that the elements defining a pattern
or itemset are independent; therefore, an estimation of its support can be
generated through the calculation of the joint probability among the elements $x_i$
of an itemset X, such that

$$\hat{supp}(X) = \prod_{x_i \in X} supp(x_i).$$
In our method A, we proposed that the support of a k-itemset can be estimated by using the k individual item probabilities (values defining support in the
main diagonal of the matrix) as the resources for the estimation. In method
B, we estimated itemset support by first forming approximately k/2 pairwise
events (2-itemsets), whose support values are directly defined within the matrix,
with the elements of the queried k-itemset. Therefore, in this case, we have employed elements lying off the main diagonal for itemset-support estimation.
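To make these two decoding methods concrete, the following minimal sketch builds the weighted CMM as a sum of outer products and queries it with both strategies. The helper names and the consecutive pairing used in method B are our own illustrative choices rather than the exact procedures evaluated in Chapter 4.

```python
import numpy as np

# Hedged sketch: build_cmm, estimate_support_A/B are illustrative names only.
# The weighted auto-associative CMM is assumed to be the m-by-m sum of outer
# products of the binary transactions with themselves.

def build_cmm(transactions):
    """Train a weighted auto-associative CMM from binary transaction vectors."""
    X = np.asarray(transactions, dtype=float)      # shape (n, m)
    return X.T @ X                                 # W[i][j] = co-occurrence count of items i and j

def estimate_support_A(W, itemset, n):
    """Method A: product of the individual item supports (main diagonal)."""
    return float(np.prod([W[i, i] / n for i in itemset]))

def estimate_support_B(W, itemset, n):
    """Method B: product of pairwise supports (off-diagonal), pairing the items."""
    items, probs = list(itemset), []
    for chunk in (items[i:i + 2] for i in range(0, len(items), 2)):
        if len(chunk) == 2:
            probs.append(W[chunk[0], chunk[1]] / n)   # 2-itemset support stored directly
        else:
            probs.append(W[chunk[0], chunk[0]] / n)   # leftover single item
    return float(np.prod(probs))

# Tiny usage example over items {0, 1, 2, 3}
data = [[1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 1]]
W, n = build_cmm(data), len(data)
print(estimate_support_A(W, (0, 1, 2), n), estimate_support_B(W, (0, 1, 2), n))
```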
After testing our proposed methods with real-life datasets used in frequent
itemset mining benchmarks, it was noticed that the itemset-support estimations
made with the knowledge embedded in the weight matrix of this memory and
our two methods present discrepancies with the real support values calculated
by the Apriori algorithm. Overlaps of itemset support, occurring among the
nodes forming this memory, have been identified as the cause of these
differences. Thus, we strongly believe that better estimations can be obtained if and only if
the neurons of this memory are arranged in a different way.
Self-Organising Map Model (Chapter 5)
In contrast with Chapter 4, in which a supervised trained ANN was studied, we
focused here on using an unsupervised trained neural network for our needs.
In particular, the suitability of a SOM (Self-Organising Map) to become our
itemset-support memory was investigated.
This work involved analysing the SOM training procedure in order to identify
how its embedded knowledge about the high-dimensional input data can be
interpreted for producing itemset-support estimations. The idea of estimating
itemset support directly from the trained map was developed because other approaches (Changchien and Lu, 2001; Shangming Yang, 2004), which have also
explored the use of this ANN for ARM, suffer from the drawback of limiting
SOM usage to only cluster the data in order to form links between the original
data and the SOM clusters for ARM. Moreover, we wanted to evaluate the abil-
ity of a SOM to reproduce the counting of patterns of a high-dimensional space
with the coded information about it distributed in the two-dimensional space described by its neurons.
To define our mechanism for itemset support estimation, the following concepts were important:
• The identification of the best matching units because they control the update of the map during training.
• The prior probability associated with each node because it defines the number
of times that a node has been hit by input patterns.
• The components of the codewords or reference vectors associated with the
BMUs because each of them represents a random variable whose value is
an approximation of the expected value, or mean, of the represented item.
• The concept of batch training.
• The concept of the partition of events which has been stated to take part in
the training of a SOM.
• The concept of independence among the items. Thus, the corresponding
support of an itemset has been calculated by finding the joint probability
among the corresponding variables (a minimal sketch of this calculation is given right after this list).
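A minimal sketch of this recall mechanism, under the assumptions above, is given below; the function and variable names are ours, and the codebook and prior probabilities are taken as given rather than produced by an actual SOM training run.

```python
import numpy as np

# Hedged sketch of the Chapter 5 estimation idea.  It assumes a batch-trained
# SOM where `codebook[j]` is the reference vector of BMU j (each component
# approximating the mean of a binary item) and `priors[j]` is the fraction of
# training patterns whose BMU was node j.

def estimate_support_som(codebook, priors, itemset):
    """Estimate supp(X) as sum_j P(m_j) * prod_{x_i in X} codebook[j][x_i]."""
    codebook = np.asarray(codebook, dtype=float)             # shape (mb, m)
    priors = np.asarray(priors, dtype=float)                 # shape (mb,)
    per_bmu = np.prod(codebook[:, list(itemset)], axis=1)    # joint probability per BMU
    return float(np.sum(priors * per_bmu))

# Toy usage: two BMUs summarising four items
codebook = [[0.9, 0.8, 0.1, 0.2],
            [0.2, 0.3, 0.9, 0.7]]
priors = [0.6, 0.4]
print(estimate_support_som(codebook, priors, (0, 1)))        # estimate for itemset {0, 1}
```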
The estimations derived from a trained SOM via our proposed mechanism
were also tested against their counterparts provided by the Apriori algorithm.
Since we have also stated the SOM to be a representative of the theoretical
projection model defined in (Gardner-Medwin and Barlow, 2001), we investigated representations between input patterns and neurons which are not direct, as in the case of the auto-associative memory, but distributed.
In other words, our mechanism uses the information of the reference vectors associated with the BMUs to estimate itemset support.
In our experiments, we also analysed other factors which can affect the accuracy of the SOM estimations. For instance, we performed some experiments,
involving different values of the radius and map layouts, to determine the best
parameters for SOM for ARM. According to our results, we have concluded
that rectangular SOMs provide more accurate estimations than the hexagonal
maps. The best estimations, in terms of the radius utilized, were given by the
map formed with the minimum value; therefore, we have concluded that the best
scenario will be reached if the SOM algorithm is reduced to a pure Vector Quantization algorithm in which influence among the nodes is not propagated. It
is important to state that the best estimations also occur when the initialisation
of the maps was done linearly, which is not a characteristic of biological
systems; hence, to be more congruent with what happens in real life, it can be
stated that a value, such as, for instance, 0.5 for the radius, which controls the
dispersion of the influence during training, is enough to have satisfactory values
of itemset support.
Since we wanted to improve the results obtained by a SOM, we investigated
likewise if the concept of emergent feature maps, proposed by Ultsch (Ultsch,
1999), could be used for this aim. Our results with this concept, which involves
creating large maps for mining tasks, showed an improvement in the itemset-support estimations. The reason for this improvement was that, since there were
more nodes in the memory, the input patterns could be distributed better
along the map. Thus, it has been concluded that the better the distribution of
the input associations in the map, the better the estimations that will be produced.
Comparing The Proposed Memories
By summarizing in Table 7.1 some of the characteristics of the studied neural
networks, conclusions regarding their use for ARM, involving training data
with n transactions derived from an itemset space defined by m items, can be
drawn as follows:
Characteristic                                   AAM-based Memory                              SOM-based Memory
Total nodes in the network                       m^2                                           5√n (Vesanto et al., 2000)
Type of training                                 supervised                                    unsupervised
Epochs for training                              1                                             several
Nodes updated for learning                       k^2                                           1
Nodes utilized for support recalling             1, k or k/2                                   mb ≤ 5√n
Pattern representation for support recalling     direct (for k < 3), distributed (for k > 2)   always distributed
Main usage                                       pattern association, pattern matching         clustering and data visualisation
Table 7.1: Characteristics of the two itemset-support memories based on the studied
neural networks. m defines the total number of items from which the n transactions
or itemsets of the training data D are derived. mb corresponds to the number of BMUs
formed in a training epoch.
In the case of an AAM-based memory, the maximum number of nodes needed
to define this neural network depends directly on the number m of items forming the training data rather than, as for a SOM-based memory, on the number n of
itemsets in the training dataset D.
The training of an AAM-based memory is supervised, while the training of
a SOM-based memory is unsupervised. When they are both learning from an
environment, an input k-itemset provokes k^2 nodes in an AAM-based memory
to be updated, while only one node is directly hit by the pattern in the case of
a SOM-based memory; nevertheless, in practice, this node has to propagate its
influence to the rest of the map. Therefore, it can be stated that the presence of
an input itemset updates all nodes forming the SOM-based memory.
The training of the AAM-based memory requires only one pass over the environment to collect knowledge, while in the case of the SOM-based memory,
even if it can produce estimations after the first training epoch, it needs more
passes to converge and therefore to provide more accurate results.
Once these memories have been trained, recalling or estimating the support of
a queried k-itemset from an AAM-based memory requires: one node for the 1- and 2-itemsets, since direct pattern representations were found and defined, and
k or k/2 nodes when k > 2, through our proposed methods which use individual- or paired-itemset probabilities respectively. In the case of a SOM-based memory, since the knowledge has been distributed along the BMUs in the map, mb
nodes are used, from which k components of their reference vectors are
taken into account for an estimation.
Based on the results shown in Tables 4.6 and 4.5 in Chapter 4, and 5.8 in
Chapter 5, which have been summarised here in Tables 7.2 and 7.3, it can be
concluded that even if our AAM-based memory gives errorless itemset support
for the group of the 1- and 2-itemsets, its performance is beaten generally by
its SOM-based counterpart, whose estimations depend on a good
distribution of the associations within its nodes. Nevertheless, in terms of the
steadiness of the itemset-support estimations, the AAM-based memory has an
advantage over a SOM-based memory since the latter depends on initial factors
for the distribution of its knowledge and therefore for the itemset-support recalls.
Itemset Size k    SOM Hexagonal    SOM Rectangular    CMM Method A    CMM Method B    Itemsets Tested
3                 0.42918          0.3668             0.617510        0.47519         167
4                 0.52008          0.4352             0.775860        0.50515         203
5                 0.61963          0.50871            0.912940        0.86483         128
6                 0.76948          0.62251            1.089600        0.9105          39
7                 1.0274           0.81303            1.394400        1.4274          4
Table 7.2: Comparison of the generalised errors between our approaches for SOM and
CMM for the different numbers of tested itemsets, which resulted from querying the Chess
dataset with a minimal support between 90 and 100%.
Itemset Size k    SOM Hexagonal    SOM Rectangular    CMM Method A    CMM Method B    Itemsets per group
3                 0.92081          0.89465            2.9141          2.285           4985
4                 1.0748           1.0476             3.593           2.6836          25500
5                 1.2055           1.1724             4.2165          3.6522          88170
6                 1.327            1.2829             4.8524          3.995           217705
7                 1.4407           1.3794             5.5205          5.0014          397947
8                 1.5395           1.4548             6.2197          5.2905          550220
9                 1.6126           1.4983             6.9495          6.494           581647
10                1.6496           1.4996             7.7125          6.6983          471908
11                1.648            1.4559             8.5192          8.1773          293209
12                1.6202           1.3801             9.3857          8.3329          138294
13                1.5947           1.3037             10.32           10.098          48473
14                1.6045           1.2653             11.317          10.258          12023
15                1.6755           1.2926             12.416          12.252          1896
16                1.8418           1.4084             13.865          12.791          152
17                2.0514           1.5868             14.903          14.751          5
Table 7.3: Comparison of the generalised errors between our approaches for SOM and
CMM for the different numbers of tested itemsets, which resulted from querying the Chess
dataset with a minimal support between 45 and 100%.
Maintaining Pattern Frequency Throughout Time with a SOM (Chapter 6)
Since the knowledge of a SOM-based memory can become invalid as a result
of changes occurring in the training environment, we have looked at proposing
a mechanism to maintain it. In particular, we tackled the problem of training a
SOM incrementally in batch because:
• In order to generate association rules from the weight matrix of a trained
SOM, the procedure to estimate itemset support assumes that the SOM has been
trained in batch mode.
• Batch mechanisms have not been considered suitable for non-stationary
environments.
Therefore, we have proposed an incremental batch mechanism for SOM
which operates by using the new data occurring in the environment together with knowledge generated from the last state of the SOM-based memory, since the latter represents the old states of the data in the environment.
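One plausible reading of this mechanism can be sketched as follows; the idea of replaying the previous reference vectors, weighted by their accumulated hit counts, as pseudo-patterns next to the new chunk is our own illustration and may differ in detail from the procedure evaluated in Chapter 6.

```python
import numpy as np

# Hedged illustration only: the previous reference vectors, weighted by how
# many patterns they absorbed, are replayed as pseudo-patterns next to the new
# data chunk, and an ordinary batch SOM trainer is then run on the combined
# set.  All names are ours.

def incremental_batch_input(new_chunk, old_codebook, old_hits):
    """Combine the new chunk with weighted replays of the old codebook."""
    new_chunk = np.asarray(new_chunk, dtype=float)
    replay = np.repeat(np.asarray(old_codebook, dtype=float),
                       np.asarray(old_hits, dtype=int), axis=0)   # one copy per past hit
    return np.vstack([new_chunk, replay])

# Toy usage: 2 old reference vectors (with 3 and 1 past hits) plus a new chunk
old_codebook, old_hits = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]], [3, 1]
combined = incremental_batch_input([[1, 1, 0], [1, 0, 1]], old_codebook, old_hits)
print(combined.shape)   # (2 new patterns + 4 replayed pseudo-patterns, 3 items)
```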
To test our approach, we set a non-stationary environment artificially with
the associations of the Chess dataset by forming data chunks to represent the
different phases of it. We compared the itemset-support estimations made by
a SOM-based memory trained with our approach to results given by memories
trained with two traditional approaches: retraining the SOM with all available data, and making estimations with only the most recent data chunk in the environment.
Our experiments have shown that, with our method, the update of the map
throughout time is possible. Even though we were not able to improve the results
given by the memory, which was always trained with all data in the environment,
interesting properties were identified in our results as follows:
• It was noticed that the error describing the quality of the itemset-support
recalls generally grows linearly.
• The quantization of the latest data chunk in the environment produced by
a memory trained with our approach is better than the one formed with a
memory which just learns that data chunk. This positive property of our
method arises because we have exploited the re-use of knowledge.
• Since a SOM trained with our approach is better at quantizing the present
data chunks than the old ones, it has been stated that the rules derived from
this trained SOM will positively reflect the latest itemset tendencies rather
than the old ones.
Therefore, we have concluded that batch training mechanisms should be considered as a real alternative for the update of a SOM for non-stationary environments. Moreover, we have moved another step forward in the development of the
neural-based framework for ARM since we have given the property of learning
incrementally to the artificial memory. The latter is important and necessary because it represents the basis to perform incremental ARM (Cheung et al., 1996b)
whose aim is to maintain the rules while the environment changes.
Answering Our Research Questions
With the conclusions drawn above, our research questions, as
stated in the introduction, can be answered as follows:
• Can they be used for descriptive-data-mining techniques in which the aim
is to represent the data in the form of association rules?
Yes, they can. One manner of describing data is by producing association
rules from it. To form such rules it is first necessary to determine the
basic but important itemset property of support. With the results of the
experiments conducted in this thesis on two different neural networks, it
can then be concluded that this itemset property can be estimated from the
knowledge embedded in their weight matrix.
Therefore, in order to produce association rules from data, the next step
would be to define a FIM logic (a version of the Apriori algorithm to
mine the feature space defined by the weight matrix formed by our neural
networks) which would aim to traverse the structure defined in the ANN
in the best way.
• Would the knowledge learnt by an ANN be useful for describing associations among the elements of a database?
Yes, it would be useful to describe associations among the elements of
events from an environment. With this work, we have proven that the support value associated with those associations or itemsets can be either calculated directly from the weight matrix in some cases (the case of an AAM-based memory), or estimated indirectly through the use of the methods
proposed here, which decode the distributed knowledge in order to form
itemset-support estimations.
• Could it be possible that our chosen neural networks can have some knowledge describing the frequency of patterns distributed along their nodes?
I.e., Could the results of counting be generated from our ANN candidates?
Yes, our chosen ANN candidates have shown that they count patterns implicitly while they learn them. This has been concluded since we discovered that they keep the knowledge needed to reproduce values, defining the
occurrence of patterns in a group of events, which are normally obtained
by scanning the events of the original environment for the count of patterns. Nevertheless, it has also been found out that the organization of the
learnt patterns within the networks can produce discrepancies between the
estimation and the real frequency of the patterns. Therefore, new internal
organizations should be investigated in future.
• Could it be possible to have an itemset-support memory based on an ANN,
as a substitute for the original database, to take part in such a framework?
The reason for retaining the original data in the ARM framework is that it is the only source from which the itemset support can be calculated
throughout time. Nevertheless, our results show that a trained neural network is able to reproduce those values sufficiently well. Therefore, it can
be stated that these ANNs, acting as memories which can learn, retain and
recall knowledge about itemset support, can take the role of the original
data source in the ARM framework.
• Could it continue accumulating knowledge of the pattern frequency throughout time while the original data environment changes?
Yes, it could. The counting of patterns under a non-stationary environment has been investigated with a self-organising map, and the results have
shown that the knowledge embedded in the map can be updated while the original
environment adds new events to its definition. Additionally, results of our
experiments have also highlighted a positive property of the recalls made
by a SOM, since they tend to be highly influenced by the latest state of the
environment rather than the old ones.
7.1.1 Contributions
• We have investigated the usage of ANNs for the transduction task of association rule mining by initially stating that the generation of rules can
be done by humans, and we have aimed to reproduce this human behavior with ANNs. Specifically, we have investigated whether association rules can
be formed from the knowledge formed by two ANNs: an auto-associative
memory and a self-organising map.
• We have worked on making the use of ANNs for ARM possible by developing itemset-support memories based on an auto-associative memory and a
self-organising map which are able to reproduce an estimation of the itemset support after they have learnt associations describing an environment.
• We have reproduced the counting of patterns with the knowledge embedded in the ANNs rather than using the original dataset. In other words, we
have studied and proposed methods, decoding the created weight matrix,
through which the two neural networks studied can reproduce the counting
of patterns, which is traditionally done by scanning the original data.
• We have contributed to the work on the counting of patterns with neural
networks, since we have studied two well-known neural networks rather
than forming theoretical ones. Moreover, our studies have included the
combinatoric aspect of patterns which in this case is represented by the
term of itemsets.
• Since data is dynamic, we have also proposed how a memory based on
a self-organising map can keep its knowledge updated while the original
environment changes as a result of the incorporation of new patterns.
7.2 Future Work
Based on the results, findings, and conclusions drawn by this thesis, we define
the future work in this section, organised into different research paths, that can
be pursued for the development of the topic.
7.2.1 For The Auto-associativity-based Memory
It has been found that overlaps create differences in the estimation of support
since all patterns are piled up together in the same number of nodes. It happens
that while this ANN is learning, input patterns, such as, for instance, X and Y
defined respectively by 0011 and 1011 are stored in the same group of m^2 nodes;
therefore, when the memory is queried to recall support for the pattern of 1010,
the estimation produces a difference with the real value because of the pattern
overlaps. To tackle this drawback, we can consider that patterns with similar
prefix can be grouped together in different branches of a neural network tree
structure. Nevertheless, this way of solving the problem will cross with some of
the work in the ARM field which has already proposed their split or organization
into tree- or trie-based structures for their counting. Therefore, we propose to
look for other alternatives to decode the knowledge of this ANN, or combining it
with other neural networks, to provide more accurate results for itemsets whose
size is larger than 2 items. We have also discovered that the mapping, formed by
the auto-associative memory based on a correlation matrix memory, results in a
data structure which is similar to the matrix structure proposed in (Agarwal et al.,
2001) to perform frequent itemset mining by using lexicographic tree structures.
These triangular matrices have also been used by other approaches, for instance
in (Bodon, 2003), for the specific discovery of the 2-itemsets because of their compact memory representation.
7.2.2 For The Self-organising-map-based Memory
Our experiments and results have shown that the itemset-support estimations
made by this ANN are directly influenced by the way in which the knowledge,
about the input associations, is spread in the map. Therefore, a better organisation of the knowledge has to be found and applied in order to improve the results
given by this thesis. One way of doing it is by exploring the use of a more convenient distance metric for binary data in the training of this ANN because this
metric is responsible for allocating the input patterns to the nodes of the map
during training. To begin with this research path, a revision of the work proposed in (Leisch et al., 1998; Fernando Lourenco and Bacao, 2004) is necessary
since other metrics, different from the traditional Euclidean distance, have been
investigated for the formation of the map. Moreover, the study of the work of
Lebbah et al. (Lebbah et al., 2000), in which a batch version of the Kohonen
SOM algorithm dedicated to binary data is proposed, is important.
Another alternative is to propose a new distance metric from scratch which
we believe should consider the following:
Such a metric should produce formations of triangular binary matrices at
each BMU with the input itemsets. Therefore, it will be a metric which compares the training itemsets based not only on their item differences but also on
their hierarchical property. These matrices aim to organise the knowledge
of the SOM in a manner such that more accurate itemset-support estimations can be
produced. Some examples of these matrices are given in Figure 7.2.
Once these matrices have been formed and the corresponding codebook of
the map calculated during training, the estimation of itemset support should be
obtained as follows:
$$\hat{supp}(X) = \sum_{j=1}^{m_b} P(m_j) \cdot \min_{x_i \in X} supp(x_i) \qquad (7.1)$$
In which the probability of the event, representing an association or an itemset X, is defined by the minimal support among the items xi belonging to X rather
188
Figure 7.2: Triangular binary formations formed with the itemset search space formed
with 3 and 4 items.
than by the product defined by the joint probability of them which was defined
in Chapter 5 and re-stated here in Equation 7.2. It is important to note that the
prior probability P(mj ) of the mb number of BMUs will still be necessary for
itemset-support estimation.
$$\hat{supp}(X) = \sum_{j=1}^{m_b} \left( P(m_j) \prod_{x_i \in X} x_i \right) \qquad (7.2)$$
Equation 7.1 can also be seen as a simple optimization of the currently proposed estimation method of itemset support, because it requires fewer
operations to make an estimation. Moreover, the quality of the estimations generated can be expected to be better than our current results since input patterns
are better organised in the map. For instance, let Mjt be a binary triangular matrix allocated at the BMU mj representing input itemsets involving four items
(a,b,c,d). Because of the hierarchy existing among the input itemsets used to
build Mjt , which will be summarised by the reference vector formed in mj , an
ordering, based on the support property, appears in the items of the matrix such
that it is defined by either a ≥ b ≥ c ≥ d or a ≤ b ≤ c ≤ d. Therefore, if the
map is queried to recall the support of an itemset, for instance X = abc, the answer, representing itemset-support estimation of X, would involve determining
the support of either a or c depending on the ordering formed in the BMUs of
the map.
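A small sketch may help contrast the two estimators; the function names and toy values are ours, and the codebook plays the same role as in the earlier SOM sketch.

```python
import numpy as np

# Hedged sketch contrasting Equation 7.2 (product of the item components, as
# used in Chapter 5) with the proposed Equation 7.1 (minimum over the item
# components).  All names are illustrative only.

def estimate_product(codebook, priors, itemset):            # Equation 7.2
    comp = np.asarray(codebook, float)[:, list(itemset)]
    return float(np.sum(np.asarray(priors, float) * np.prod(comp, axis=1)))

def estimate_min(codebook, priors, itemset):                 # Equation 7.1
    comp = np.asarray(codebook, float)[:, list(itemset)]
    return float(np.sum(np.asarray(priors, float) * np.min(comp, axis=1)))

codebook = [[0.9, 0.8, 0.1, 0.2], [0.2, 0.3, 0.9, 0.7]]      # items a, b, c, d
priors = [0.6, 0.4]
X = (0, 1, 2)                                                # the itemset abc
print(estimate_product(codebook, priors, X), estimate_min(codebook, priors, X))
```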
Since we have noticed that the quality of the estimations made by this neural
network is highly dependent on its initial state and the radius utilised for its training, it can be thought of evaluating approaches which cope with this instability
property of the SOM which produces under- or over-representation of the training data. Therefore, we propose to investigate alternative initilisation schemes
for SOM, for instance those proposed in (Su et al., 1999; Salem et al., 2003;
Attik et al., 2005), in order to improve the results given by a random initilised
SOM and to avoid the use of linear initialisation which has the inconvenience
that requires a lot of computations for the initial weights when the input data is
large.
In this thesis, we have initially explored the generation of association rules
known as categorical association rules because they only describe facts referring to the associations defined among items without considering any quantitative
aspect of the variables they represent. To overcome this drawback of the categorical rules, another type of rules, known as quantitative association rules (Srikant
and Agrawal, 1996; Wijsen and Meersman, 1998; Aumann and Lindell, 2003),
have been already proposed for the description of data. Since the formation of
these rules involves the usage of statistics to define them, it can be assumed that
SOM also has the natural ability to generate them since the information about the
distribution of the variables learnt is coded in the nodes. Therefore, the development of mechanisms for this type of rules is needed. One way of tackling this
aim is by considering the work in (Giraudel and Lek, 2001) in which the statistical properties of SOM have been investigated, or by taking into consideration the
use of a SOM as a probability density estimator for classification problems (Yin
and Allinson, 2001). Hence, this idea would involve approximating the distributions of the variables modeled by the input data in order to determine how
strongly or weakly they participate in the original environment and therefore in
the future rules.
Once it has been demonstrated that quantitative association rules can be generated from the knowledge embedded in a trained SOM, it would be worth exploring
the work of Lebbah et al. (Lebbah et al., 2005), in which the creation of a
mapping for the analysis of mixed (numerical and binary) data is proposed. This
proposal may be useful for the generation of mixed association rules which will
be composed by both quantitative and qualitative components.
Taking into account that the formation of rules from a SOM is produced by
decoding the information of its codebook or mapping, which represents a summary of the input data formed by the ability of SOM for vector quantization, it
can be concluded that the final vectors in the map can be understood as a group
of representatives of the itemset patterns existing in the training environment.
Hence, it is possible to create a link between the generation of association rules
from a neural network and the work proposed by Yan et al. (Yan et al., 2005a),
which examines how to summarise a collection of itemsets using a finite number
of K representatives. In our case, this desired itemset summary will be given by
the vectors created at each BMU in the trained map.
Since the calculation of itemset support and the generation of rules throughout time implies developing mechanisms for non-stationary environments, the
idea of using self-growing approaches derived from the traditional SOM, which
control the size of the map during training, can be considered to study whether these
can be trained in batch mode. To begin with this idea, the work of (Alahakoon
et al., 2000b) should be analysed.
7.2.3 For the Quality of the Itemset-support Estimation
One important aspect, regarding the itemset-support estimations made by any
neural network, is to define a manner in which the quality of its answers (estimations) can be measured. This is, although in this thesis we have used the
RMS error for measuring the quality of the estimations, it has been necessary to
know the real support values to perform such an error calculation; therefore, we
propose that it should be investigated a manner on how to measure such approximations or estimations with considering that the real value is totally unknown.
7.2.4 ANNs-based Candidate Generation Procedures
Another topic which is worth investigating is whether itemset support can be
inferred or predicted with neural networks. Itemset-support inference is a topic
which has not been studied extensively in the ARM field. For instance, an algorithm called PASCAL has been proposed in (Bastide et al., 2000) as an optimization of the Apriori algorithm based on counting inference. That is, some candidates during the mining process derive their support from the frequent itemsets
already discovered rather than performing the typical counting over the original
dataset. Although this approach reduces the number of candidates by inferring
the support for some of them, it can still happen that the rest of them, whose support needs to be checked by the algorithms, may turn out to be infrequent. Therefore, it
is important to have candidate procedures which return exactly those itemsets that will certainly turn out frequent, in order not to waste time checking fake
candidates. Moreover, candidate generation is an important topic in ARM since
it is necessary to determine how it influences the complexity of the discovery of
frequent itemsets.
Based on the above, and knowing a priori that the support of itemsets forms a function f(x) as in Figure 7.3, in which x represents itemsets from a
search space in a lexical order, we propose to investigate if an approximation of
that function could be generated with a neural network by interpolating a series
of n points, representing some itemsets from the space with their corresponding
support.
To approach this goal, we can start from the fact that a negative correlation exists between the size of the itemsets and their corresponding support. In
other words, the support tends to decrease as the number of items in a pattern increases. Hence, we propose to train an RBF (Radial Basis Function)
neural network (Bishop, 1995; Buhmann, 2003) in order to investigate if a candidate procedure can be defined with it. This goal can also be understood as a
task which aims for the prediction of itemset-support having as input knowledge
some information about the total itemset search space.
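A hedged starting point for this research path could look as follows; the binary encoding of itemsets, the Gaussian kernel and every name are our own illustrative choices rather than a procedure developed in this thesis.

```python
import numpy as np

# Hedged sketch: fit a Gaussian RBF interpolant to a few (itemset, support)
# samples and predict the support of unseen itemsets.  Itemsets over m items
# are encoded as binary vectors; all names are illustrative only.

def fit_rbf(X, y, sigma=1.0, ridge=1e-8):
    """Solve for RBF weights w such that K w ≈ y."""
    X = np.asarray(X, float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2))
    w = np.linalg.solve(K + ridge * np.eye(len(X)), np.asarray(y, float))
    return X, w, sigma

def predict_rbf(model, Xq):
    X, w, sigma = model
    Xq = np.asarray(Xq, float)
    d2 = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2)) @ w

# Toy usage over 4 items: known supports for a few itemsets ...
X_known = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]]
s_known = [0.9, 0.8, 0.7, 0.4]
model = fit_rbf(X_known, s_known)
# ... and a predicted support for an unseen candidate itemset
print(predict_rbf(model, [[1, 0, 1, 0]]))
```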
7.2.5 Distributed Association Rule Mining
The collection of data is often realized in a distributed manner in real life. That is,
databases, representing and collecting huge amounts of local data, are located at
different sites in order to manage them in a strategic and satisfactory manner.
Figure 7.3: Function generated with the support of some groups of itemsets derived
from the Chess dataset. Each panel plots support (%) against the dataset's 1-, 2-, 3- and 4-itemsets (75, 2775, 67525 and 1215450 itemsets respectively).
Nevertheless, as the need to analyse them as a whole arises, the development of distributed mining algorithms becomes crucial.
In ARM, this problem has been introduced in (Cheung et al., 1996a). In
order to tackle distributed ARM, a distributed database has been treated as a
large database formed by partitions representing each of the local sites in the
system. The common problem that approaches on this category have had to deal
with refers to the amount of data that should be transmitted amongst the sites to
achieve the generation of rules. For instance, as a first attempt, a solution could
involve moving either the original data partition or the corresponding set of the
frequent itemsets at each site; however, both solutions result in producing an
overutilisation of the channel because they can be large and contain data which
will turn out to be irrelevant.
Figure 7.4: Incremental SOM-based approach for distributed ARM. The rules
will be generated from the latest trained SOM.
Based on the above, our work could be extended as follows:
As a first approach, depicted in Figure 7.4, the usage of a SOM moving from site to site and learning incrementally, as defined in Chapter 6, can be investigated. That is, a SOM would learn the local data distribution at each site
until it reaches a site in which the generation of rules will be performed by using
the mechanism proposed in Chapter 5. Since new tendencies can occur in the
different sites, it will be important for the approach to integrate a self-growing node
structure rather than a traditional SOM which makes use of a fixed structure.
A second approach, given in Figure 7.5, would involve having local SOMs at
each site to learn the local data distributions, which will then be queried remotely
from, or moved to, a central location in which ARM will take place.
Figure 7.5: Local SOM-based approach for distributed ARM. While the local
maps are queried remotely in the model on the left, the trained maps are transmitted to be the source from which rules will be generated in the model on the right.
It is also important to notice that the development of distributed algorithms
has a strong relationship with parallel-based algorithms (Zaki, 1999); therefore, a SOM can be stated to be a strong candidate for this type of mining
tasks since its training has already been performed in parallel based on its batch
mode (Lawrence et al., 1999b; Porrmann et al., 2003).
7.2.6 The Itemset Concept in Dynamic Data
ARM over Data Streams
In this thesis, we have started gaining some insight into the topic of generating
association rules from neural networks. Nevertheless, not all data is static in real
life as is assumed by traditional ARM. That is, data, such as data streams, can
arrive indefinitely and continuously; therefore, its mining becomes a complex
task since its availability often occurs in short periods of time.
Because research involving data streams is still at an early stage in the field
of ARM (Jiang and Gruenwald, 2006), we suggest investigating whether
a neural network is able to perform such a mining task. The idea, which is depicted in Figure 7.6, consists of developing a two-stage proposal. In other words,
the data stream will first be learnt in a way similar to that used with the static
data in this thesis, and then rules will be derived from the ANN.
Figure 7.6: A neural-based approach for ARM for data streams.
The analysis of this type of data has also been considered with approaches
based on ANNs. For instance, the usage of a self-organising map and an ART
has been considered in (Laerhoven et al., 2001; Laerhoven, 2001) and (Rajaraman and Tan, 2001).
As this type of data comes constantly, we could consider using the incremental proposal made for SOM in this thesis for its learning. Nevertheless, these
data can change so fast that there may not be time to buffer them. Hence, the
sequential training of SOM should be considered necessary. The problem with
that sequential training of SOM is that the identification of the BMUs is not as
direct and easy as in the batch training since the map is updated every time that
a new pattern is presented. One way of identifying these nodes is by projecting
the training patterns into the map once the training has been finished. However,
as the data is not kept, this option is not feasible. Hence, one first problem would
be to find a method to keep track of the BMUs while the training is done in order
to use our itemset-support estimation method for the generation of association
rules.
Sequential Pattern Mining
The task of sequential pattern mining, introduced in (Agrawal and Srikant, 1995),
is similar to the traditional problem of ARM; nevertheless, its definition considers the concept of time in the generation of knowledge. That is, the transactions,
representing the input data, are defined by a customer Id, an itemset, and a time
stamp. As in ARM, the aim is to identify interesting patterns; however, in this
particular case, the target is to discover sequences, which are ordered lists of
itemsets, satisfying a minimum support constraint from the given data.
Since in this thesis, we have defined how to estimate itemset support from
two trained ANNs, our approaches can be used as a basis for a proposal for this
type of descriptive data-mining task. However, the main issue to be tackled is
the way in which the input data will be presented to the ANN. In other words, a
transformation method for the data is needed in order to form the corresponding
vectorial patterns required to feed the neural network.
Appendix A
The Apriori Algorithm
Since we have used the well-known Apriori Algorithm to evaluate and compare
the itemset support estimations derived from our ANN candidates through our
proposals, such an algorithm will be defined in this section.
The Apriori Algorithm was independently proposed in (Agrawal and Srikant,
1994; Mannila et al., 1994). This algorithm performs the generation of association rules by following the stages established in the support-confidence framework (Agrawal et al., 1993). That is, first, the frequent itemsets are generated,
and then the corresponding association rules are formed from such frequent itemsets. Apriori tackles the complexity of the discovery of frequent itemsets by
performing:
• The identification of only those itemsets whose support satisfy a minimum
threshold.
• The formation of new itemsets, known as candidates, by joining the already discovered frequent itemsets. That is, candidates representing the
set of itemsets of length k are derived from combinations of the set of
frequent (k-1)-itemsets.
• The calculation of the candidate support by counting the number of occurrences of the itemset in turn directly from the high dimensional space
generated by the input data.
• The traversal of the itemset space in a BFS (Breadth-First Search) way.
• The pruning of fake candidates by considering the anti-monotonic property of itemset support: all subsets of a frequent itemset must also be frequent (Agrawal and Srikant, 1994).
A complete specification of the algorithm is given in Figure A.1.
Figure A.1: The Apriori algorithm. This figure was extracted from the original paper of
Agrawal (Agrawal and Srikant, 1994). The top pseudocode describes the main steps of
Apriori. The bottom SQL query defines the way in which candidates are formed during
a mining process.
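As the figure itself is reproduced from the original paper, the following minimal sketch of the steps listed above may serve as a complementary, runnable illustration; it is not the pseudocode of Figure A.1 and all names are ours.

```python
from itertools import combinations

# Hedged sketch of the Apriori steps described above: generate candidates of
# size k from the frequent (k-1)-itemsets, prune with the anti-monotonic
# property, and count support with one pass per level (BFS over itemset sizes).

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    n = float(len(transactions))
    support = lambda s: sum(1 for t in transactions if s <= t) / n
    # Level 1: frequent single items
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Join step: unions of frequent (k-1)-itemsets that give size-k candidates
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

# Toy usage
print(sorted(map(sorted, apriori([{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}], 0.6))))
```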
It is important to state that this algorithm is historically important for ARM,
because it has influenced the development of new ARM algorithms in one way
or another since its introduction in 1994.
Appendix B
The Neural Network Candidate
Algorithms
Since two well-known neural networks are part of our studies, we define below
their corresponding training algorithms employed in this thesis.
Auto-Associative Memory: Correlation Matrix Memory
A CMM (Correlation Matrix Memory), as shown in Figure B.1, is a single-layer
memory whose size depends on the problem to be tackled. For instance, while
a grid of n-by-m nodes is used to define a hetero-associative memory, a grid
of m-by-m nodes defines an auto-associative memory. Moreover, depending on
the format of its inputs, a CMM can have binary (Austin and Stonham, 1987) or
real synapses (Kohonen, 1978; Haykin, 1999; Ham and Kostanic, 2001). This
supervised memory is trained as follows:
1. A pair of input patterns X,Y is presented to the memory.
2. Its weight matrix is updated by a supervised Hebbian rule with the information defined by the input pair.
3. Steps 1 and 2 are repeated until no more pairs are available.
Figure B.1: An Associative memory based on a CMM.
In the case of a binary memory, its training is commanded by Equation B.1,
while the training of its numeric or weighted counterpart is defined by Equation B.2.
$$w_{ij} = \bigvee_{k=1}^{m} y_{ik}\, x_{jk} \qquad (B.1)$$

$$w_{ij} = \sum_{k=1}^{m} y_{ik}\, x_{jk} \qquad (B.2)$$
The training in Equations B.1 and B.2 is defined respectively by a superposition and a sum of the m matrices derived from the m training pairs.
The idea behind its training is to accumulate information of the input associations
in such a way that when a stimulus is presented, this memory is able to remember
the corresponding associated patterns.
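A minimal sketch of these two rules for the auto-associative case (each pattern paired with itself) is given below; the function names are our own illustrative choices.

```python
import numpy as np

# Hedged sketch of the two training rules above for an auto-associative CMM
# (X paired with itself): Equation B.1 uses a logical OR superposition and
# Equation B.2 a sum of outer products.

def train_cmm_binary(patterns):
    """Binary CMM (Eq. B.1): w_ij = OR_k y_ik * x_jk."""
    X = np.asarray(patterns, dtype=bool)
    W = np.zeros((X.shape[1], X.shape[1]), dtype=bool)
    for x in X:
        W |= np.outer(x, x)                 # superpose the outer product of each pair
    return W

def train_cmm_weighted(patterns):
    """Weighted CMM (Eq. B.2): w_ij = sum_k y_ik * x_jk."""
    X = np.asarray(patterns, dtype=float)
    return X.T @ X                          # sum of the outer products

patterns = [[0, 0, 1, 1], [1, 0, 1, 1]]
print(train_cmm_weighted(patterns))         # diagonal holds item counts, off-diagonal pairwise counts
```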
Self-Organising Map
The self-organising map (Kohonen, 1996) is one of the most utilized neural networks for data-mining problems because of its unsupervised property to learn
knowledge from data. The typical structure of a SOM is a two-dimensional ar-
rangement of neurons which re-organises itself during training.
Each neuron or node has an associated vector which models and summarises input
patterns coming from the training data.
Its training is based on a search mechanism and an update mechanism. The former is
responsible for allocating the input patterns to the neurons which map them
best. Each of these neurons m_c is called a winner or BMU (Best Matching
Unit). A BMU m_c satisfies, for some inputs x, the following condition:

$$\forall i, \quad \| x(t) - m_c(t) \| < \| x(t) - m_i(t) \|.$$
The update mechanism establishes when and how the map will be updated
in order to learn the input data. Two modes, sequential and batch, are often used
to accomplish such a task (Kohonen, 1996; Kohonen, 1998). Both modes make
this ANN converge through an iterative data presentation. Nevertheless, they
differ from each other in when the update of the network will take place. That is,
while in a sequential or incremental mode the map is updated every time that a
new stimulus is presented, in a batch mode the map is modified after all training
vectors have been propagated. In general, the steps involved in SOM training can be
summarised as follows:
• Initialisation of the network. This means assigning either random or linear
values to the nodes of the map.
• The presentation of a new input and the determination of its BMU. That is,
a search process, which compares the current input to the nodes in order
to discover the best match for it, is needed.
• The update of the map. In a sequential mode, this is done by Equation B.3,
while Equation B.4 governs the corresponding batch mode.

$$m_i(t + 1) = m_i(t) + h_{c(x),i}\,\big(x(t) - m_i(t)\big) \qquad (B.3)$$

$$m_i(t + 1) = \frac{\sum_j h_{ji}(t)\, S_j(t)}{\sum_j n_{V_j}(t)\, h_{ji}(t)} \qquad (B.4)$$

Both equations use a neighbourhood or kernel function h(), defined in Equation B.5, to spread the influence of the current input(s) along the map. α is
known as the learning-rate factor, which holds a value between 0 and 1 and
decreases monotonically in the sequential mode. On the other hand, this
factor is constant and equal to one in the batch mode. r_i and r_j (r_c) ∈ ℝ²
define the grid positions of a node receiving and producing stimulation
respectively.

$$h_{c(x),i} = h_{ji}(t) = \alpha(t)\, \exp\!\left( -\frac{\| r_i - r_j \|^2}{2\sigma^2(t)} \right) \qquad (B.5)$$

In the case of Equation B.4, the term S_j, described by Equation B.6, represents the concentration of the n_{V_j} input patterns allocated to each node. In
other words, it is the sum of all data points contained in the Voronoi region
V_j = { x_i | ‖x_i − m_j‖ < ‖x_i − m_k‖ ∀k ≠ j } defined by a node m_j.

$$S_j(t) = \sum_{i=1}^{n_{V_j}} x_i \qquad (B.6)$$
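The batch mode can be sketched as follows; this is an illustrative implementation of Equations B.4 to B.6 on a small rectangular grid, with names chosen by us, and not the code used for the experiments in this thesis.

```python
import numpy as np

# Hedged sketch of one batch training epoch following Equations B.4-B.6, with a
# Gaussian kernel (Eq. B.5, alpha = 1 in batch mode).  All names are illustrative.

def batch_som_epoch(codebook, grid, data, sigma):
    codebook = np.asarray(codebook, float)   # (K, m) reference vectors
    grid = np.asarray(grid, float)           # (K, 2) node positions
    data = np.asarray(data, float)           # (n, m) training patterns
    # BMU search: nearest reference vector for every pattern
    d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    bmu = d2.argmin(axis=1)
    # Voronoi sums S_j (Eq. B.6) and counts n_{V_j}
    S = np.zeros_like(codebook)
    np.add.at(S, bmu, data)
    n = np.bincount(bmu, minlength=len(codebook)).astype(float)
    # Gaussian neighbourhood h_ji between all pairs of nodes (Eq. B.5)
    g2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    h = np.exp(-g2 / (2 * sigma ** 2))
    # Batch update (Eq. B.4)
    return (h @ S) / np.maximum((h @ n)[:, None], 1e-12)

# Toy usage: a 2x2 map learning four binary patterns
grid = [[0, 0], [0, 1], [1, 0], [1, 1]]
codebook = np.random.rand(4, 3)
data = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
for _ in range(10):
    codebook = batch_som_epoch(codebook, grid, data, sigma=0.5)
print(codebook.round(2))
```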
This neural network has been employed for a large variety of problem domains, which will not be summarised here but can be consulted, for instance, in (Oja et al.,
2003). Nevertheless, for the case of data mining, a SOM has been an
effective neural architecture for tasks like the visualization of high-dimensional
data (Vesanto, 1999; Flexer, 1999), classifiers (Yin and Allinson, 2001), the de-
scription of data through clustering (Vesanto and Alhoniemi, 2000), the generation of rules through the interpretation of its knowledge (Malone et al., 2006),
among others. Furthermore, a SOM has also served as the basis to build more
complex neural architectures. For instance, networks whose structure is not
only organised, but also grows autonomously during training (Alahakoon et al.,
2000a) for the modeling of dynamic data.
Bibliography
Aaron Ceglar, J. F. R. (2006). Association mining. ACM Computing Surveys,
38(2):1–42.
Agarwal, R. C., Aggarwal, C. C., and Prasad, V. V. V. (2001). A tree projection algorithm for generation of frequent item sets. Journal of Parallel and
Distributed Computing, 61(3):350–371.
Aggarwal, C. C. and Yu, P. S. (1998). A new framework for itemset generation.
In PODS, pages 18–24. ACM Press.
Agrawal, R., Imielinski, T., and Swami, A. N. (1993). Mining association rules
between sets of items in large databases. In Buneman, P. and Jajodia, S.,
editors, SIGMOD Conference, pages 207–216. ACM Press.
Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules
in large databases. In Bocca, J. B., Jarke, M., and Zaniolo, C., editors,
VLDB, pages 487–499. Morgan Kaufmann.
Agrawal, R. and Srikant, R. (1995). Mining sequential patterns. In Yu, P. S. and
Chen, A. S. P., editors, Eleventh International Conference on Data Engineering, pages 3–14, Taipei, Taiwan. IEEE Computer Society Press.
Alahakoon, D., Halgamuge, S. K., and Srinivasan, B. (2000a). Dynamic self-
organizing maps with controlled growth for knowledge discovery. IEEE
Transactions on Neural Networks, 11(3):601–614.
Alahakoon, D., Halgamuge, S. K., and Srinivasan, B. (2000b). Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Transactions on Neural Networks, 11(3):601–614.
Alatas, B. and Akin, E. (2006). An efficient genetic algorithm for automated
mining of both positive and negative quantitative association rules. Soft
Comput, 10(3):230–237.
Alhoniemi, E., Himberg, J., and Vesanto, J. (1999). Probabilistic measures
for responses of self-organizing map units.
In Proc. of International
ICSC Congress on Computational Intelligence Methods and Applications
(CIMA’99), pages 286–290, Rochester, N.Y., USA. ICSC Academic Press.
Andrews, R., Diederich, J., and Tickle, A. B. (1995). Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge Based Systems (UK), 8(6):378–389.
Aras, N., Altinel, I. K., and Oommen, J. (2003). A Kohonen-like decomposition method for the Euclidean traveling salesman problem: KNIES_DECOMPOSE. IEEE Transactions on Neural Networks, 14:869–890.
Attik, M., Bougrain, L., and Alexandre, F. (2005). Self-organizing map initialization. In Duch, W., Kacprzyk, J., Oja, E., and Zadrozny, S., editors,
ICANN (1), volume 3696 of Lecture Notes in Computer Science, pages 357–
362. Springer.
Aumann, Y. and Lindell, Y. (2003). A statistical theory for quantitative association rules. J. Intell. Inf. Syst, 20(3):255–283.
Austin, J. (1995). Distributed associative memories for high speed symbolic
reasoning. International Journal on Fuzzy Sets and Systems, 82:223–233.
Austin, J. (1996). Associative memories and the application of neural networks
to vision. In Handbook of Neural Computation. Institute of Physics and
Oxford University Press.
Austin, J., Kennedy, J., and Lees, K. (1995). The advanced uncertain reasoning
architecture. Weightless Neural Network Workshop.
Austin, J. and Stonham, T. (1987). An associative memory for use in image
recognition and occlusion analysis. Image and Vision Computing, 5(4):251–
261.
Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., and Lakhal, L. (2000). Mining
frequent patterns with counting inference. SIGKDD Explorations, 2(2):66–
75.
Benitez, J. M., Castro, J. L., and Requena, I. (1997). Are artificial neural networks black boxes? IEEE Transactions on Neural Networks, 8(5):1156–
1164.
Bishop, C. M. (1995). Neural networks for Pattern Recognition. Oxford University Press.
Bodon, F. (2003). A fast apriori implementation. In Goethals, B. and Zaki,
M. J., editors, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI’03), volume 90 of CEUR Workshop Proceedings, Melbourne, Florida, USA.
Bodon, F. (2006). A survey on frequent itemset mining. Technical report, Budapest University of Technology and Economics.
Borgelt, C. (2003). Efficient implementations of apriori and eclat.
Brin, S., Motwani, R., and Silverstein, C. (1997). Beyond market baskets: generalizing association rules to correlations. In SIGMOD ’97: Proceedings of
the 1997 ACM SIGMOD international conference on Management of data,
pages 265–276, New York, NY, USA. ACM Press.
Browne A., Hudson B., W. D. P. P. (2003). Knowledge extraction from neural networks. In Proceedings of the 29th Annual Conference of the IEEE
Industrial Electronics Society, Roanoke, Virginia, USA, pages 1909–1913.
Buhmann, M. D. (2003). Radial Basis Functions Theory and Implementations.
Cambridge Monographs on Applied and Computational Mathematics (No.
12). Cambridge.
Burdick, D., Calimlim, M., and Gehrke, J. (2001). Mafia: A maximal frequent
itemset algorithm for transactional databases. In Proceedings of the 17th International Conference on Data Engineering, pages 443–452, Washington,
DC, USA. IEEE Computer Society.
Carpenter, G. A. and Grossberg, S. (1989). Search mechanisms for adaptive
resonance theory (ART) architectures. In IEEE International Joint Conference on Neural Networks (3rd IJCNN’89), volume I, pages I–201–I–205,
Washington DC. IEEE. Boston U.
Changchien, S. W. and Lu, T.-C. (2001). Mining association rules procedure to
support on-line recommendation by customers and products fragmentation.
Expert Systems with Applications, 20(4):325–335.
Cheung, Han, Ng, Fu, and Fu (1996a). A fast distributed algorithm for mining association rules. In PDIS: International Conference on Parallel and
Distributed Information Systems. IEEE Computer Society Technical Committee on Data Engineering, and ACM SIGMOD.
Cheung, D. W.-L., Han, J., Ng, V., and Wong, C. Y. (1996b). Maintenance of
discovered association rules in large databases: An incremental updating
technique. In ICDE, pages 106–114.
Cios, K. (2000). Data Mining Methods for Knowledge Discovery. Kluwer Academic.
Coenen, F., Goulbourne, G., and Leng, P. (2004a). Tree structures for mining
association rules. Data Mining and Knowledge Discovery, 8(1):25–51.
Coenen, F., Leng, P., and Ahmed, S. (2004b). Data structure for association
rule mining: T-trees and p-trees. Knowledge and Data Engineering, IEEE
Transactions on, 16(6):774–778.
Craven, M. and Shavlik, J. (1999). Rule extraction: Where do we go from here?
Craven, M. and Shavlik, J. W. (1993). Learning symbolic rules using artificial
neural networks. In ICML, pages 73–80.
Craven, M. W. and Shavlik, J. W. (1997). Using neural networks for data mining.
Future Generation Computer Systems, 13(2–3):211–229.
DeGroot, M. (1975). Probability and Statistics. Addison-Wesley.
Devroye, L. (1987). A Course in Density Estimation. Birkhauser, Boston.
D.J. Newman, S. Hettich, C. B. and Merz, C. (1998). UCI repository of machine
learning databases.
Duch, W., Adamczak, R., and Grabczewski, K. (1996). Extraction of logical
rules from training data using backpropagation networks. In First Polish
Conference on Theory and Applications of Artificial Intelligence.
Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern Classification (2nd
Edition). Wiley-Interscience.
Eggermont, J. (1998). Rule-extraction and learning in the BP-SOM architecture.
El-Haji, M. and Zaiane, O. (2003). Inverted matrix: Efficient discovery of frequent items in large datasets in the context of interactive mining.
Eom, J.-H. (2006). Neural feature association rule mining for protein interaction
prediction. In Wang, J., Yi, Z., Zurada, J. M., Lu, B.-L., and Yin, H., editors,
ISNN (2), volume 3973 of Lecture Notes in Computer Science, pages 690–
695. Springer.
Eom, J.-H., Chang, J. H., and Zhang, B.-T. (2004). Prediction of implicit protein-protein interaction by optimal associative feature mining. In Yang, Z. R.,
Everson, R. M., and Yin, H., editors, IDEAL, volume 3177 of Lecture Notes
in Computer Science, pages 85–91. Springer.
Eom, J.-H. and Zhang, B.-T. (2004). Adaptive neural network-based clustering
of yeast protein-protein interactions. In Das, G. and Gulati, V. P., editors,
CIT, volume 3356 of Lecture Notes in Computer Science, pages 49–57.
Springer.
Eom, J.-H. and Zhang, B.-T. (2005). Prediction of yeast protein-protein interactions by neural feature association rule. In Duch, W., Kacprzyk, J., Oja,
E., and Zadrozny, S., editors, ICANN (2), volume 3697 of Lecture Notes in
Computer Science, pages 491–496. Springer.
Lourenço, F., Lobo, V., and Bação, F. (2004). Binary-based similarity measures for categorical data and their application in self-organizing maps. In
JOCLAD 2004 - XI Jornadas de Classificacao e Analise de Dados.
Flexer, A. (1999). On the use of self-organizing maps for clustering and visualization. In Principles of Data Mining and Knowledge Discovery, pages
80–88.
Furao, S. and Hasegawa, O. (2004). An incremental neural network for nonstationary unsupervised learning. In ICONIP, pages 641–646.
Gaber, J., Bahi, J. M., and El-Ghazawi, T. A. (2000a). Parallel mining of association rules with a Hopfield type neural network. In ICTAI, page 90. IEEE
Computer Society.
Gaber, K., Bahi, J., and El-Ghazawi, T. (2000b). Parallel mining of association rules with a Hopfield type neural network. In 12th IEEE International
Conference on Tools with Artificial Intelligence (ICTAI 2000), pages 90–93,
Vancouver, Canada.
Gardner-Medwin, A. R. and Barlow, H. B. (2001). The limits of counting accuracy in distributed neural representations. Neural Comput., 13(3):477–504.
Giraudel, J. L. and Lek, S. (2001). A comparison of self-organizing map algorithm and some conventional statistical methods for ecological community
ordination. Ecological Modelling, 146(1–3):329–339.
Goethals, B. (2002). Efficient Frequent Pattern Mining. PhD thesis, University
of Limburg, Belgium.
Goethals, B. (2003). Frequent itemset mining implementations repository.
Goethals, B. (2004). Memory issues in frequent itemset mining. In Haddad, H.,
Omicini, A., Wainwright, R. L., and Liebrock, L. M., editors, SAC, pages
530–534. ACM.
Goethals, B. and Zaki, M. J., editors (2003). FIMI ’03, Frequent Itemset Mining
Implementations, Proceedings of the ICDM 2003 Workshop on Frequent
Itemset Mining Implementations, 19 December 2003, Melbourne, Florida,
USA, volume 90 of CEUR Workshop Proceedings. CEUR-WS.org.
Gouda, K. and Zaki, M. J. (2001). Efficiently mining maximal frequent itemsets.
In ICDM, pages 163–170.
Grahne, G. and Zhu, J. (2003). Efficiently using prefix-trees in mining frequent itemsets. In Proceedings of the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03). http://www.cs.concordia.ca/db/dbdm/dm.html.
Grahne, G. and Zhu, J. (2005). Fast algorithms for frequent itemset mining
using FP-trees. IEEE Transactions on Knowledge and Data Engineering,
17(10):1347–1362.
Gunopulos, D., Mannila, H., and Saluja, S. (1997). Discovering all most specific
sentences by randomized algorithms. In Afrati, F. N. and Kolaitis, P. G.,
editors, Database Theory - ICDT ’97, 6th International Conference, Delphi,
Greece, January 8-10, 1997, Proceedings, volume 1186 of Lecture Notes in
Computer Science, pages 215–229. Springer.
Gupta, G., Strehl, A., and Ghosh, J. (1999). Distance based clustering of association rules.
Ham, F. M. and Kostanic, I. (2001). Principles of neurocomputing for science
and engineering. McGraw-Hill.
Hammer, B., Rechtien, A., Strickert, M., and Villmann, T. (2002). Rule extraction from self-organizing networks. In Dorronsoro, J. R., editor, ICANN, volume 2415 of Lecture Notes in Computer Science, pages 877–883. Springer.
Han, E.-H., Karypis, G., and Kumar, V. (1997). Scalable parallel data mining for
association rules. pages 277–288.
Han, J. and Kamber, M. (2000). Data Mining: Concepts and Techniques. Morgan Kaufmann.
Han, J., Pei, J., and Yin, Y. (2000a). Mining frequent patterns without candidate
generation. In Chen, W., Naughton, J. F., and Bernstein, P. A., editors,
SIGMOD Conference, pages 1–12. ACM.
Han, J., Pei, J., and Yin, Y. (2000b). Mining frequent patterns without candidate
generation. In Chen, W., Naughton, J., and Bernstein, P. A., editors, 2000
ACM SIGMOD Intl. Conference on Management of Data, pages 1–12. ACM
Press.
Han, J., Pei, J., and Yin, Y. (2000c). Mining frequent patterns without candidate generation. In SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD
international conference on Management of data, pages 1–12, New York,
NY, USA. ACM Press.
Hand, D., Mannila, H., and Smyth, P. (2001). Principles of Data Mining. The
MIT Press, Cambridge, Massachusetts.
Haykin, S. (1999). Neural Networks: A Comprehensive Foundation. Prentice-Hall, New York.
Heskes, T. (2001). Self-organizing maps, vector quantization, and mixture modeling. IEEE-NN, 12:1299–1305.
Hilderman, R. and Hamilton, H. (1999). Knowledge discovery and interestingness measures: A survey.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent
collective computational abilities. Proceedings National Academy Science
USA, 79:2554–2558.
Gordon, H. (1997). Discrete Probability. Springer, New York.
Hung, C. and Wermter, S. (2004). A time-based self-organising model for document clustering. In The International Joint Conference on Neural Networks.
Jacobsson, H. (2005). Rule extraction from recurrent neural networks: A taxonomy and review. Neural Comput., 17(6):1223–1263.
Jiang, N. and Gruenwald, L. (2006). Research issues in data stream association
rule mining. SIGMOD Rec., 35(1):14–19.
Jin, R. and Agrawal, G. (2002). Shared memory parallelization of data mining
algorithms: Techniques.
Jolliffe, I. T. (1986). Principal Component Analysis. Springer Series in Statistics. Springer-Verlag.
Joshi, M. V., Han, E.-H., Karypis, G., and Kumar, V. (1999). Efficient parallel
algorithms for mining associations. In Large-Scale Parallel Data Mining,
pages 83–126.
Bayardo Jr., R. J., Goethals, B., and Zaki, M. J., editors (2004). FIMI ’04, Proceedings
of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations,
Brighton, UK, November 1, 2004, volume 126 of CEUR Workshop Proceedings. CEUR-WS.org.
Kantardzic, M. (2002). Data Mining: Concepts, Models, Methods, and Algorithms. IEEE Press and John Wiley.
Kaski, S., Nikkilä, J., and Kohonen, T. (1998a). Methods for interpreting a selforganized map in data analysis. In ESANN, pages 185–190.
Kaski, S., Nikkilä, J., and Kohonen, T. (1998b). Methods for interpreting a
self-organized map in data analysis. In Verleysen, M., editor, Proceedings of ESANN’98, 6th European Symposium on Artificial Neural Networks,
Bruges, April 22–24, pages 185–190. D-Facto, Brussels, Belgium.
Kecman, V. (2001). Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models. Complex Adaptive Systems. MIT Press, Cambridge, MA.
Kiang, M. Y. (2001). Extending the Kohonen self-organizing map networks for clustering analysis. Computational Statistics & Data Analysis, 38(2):161–180.
Kimball, R. (1996). The Data Warehouse Toolkit: Practical Techniques for
Building Dimensional Data Warehouses. John Wiley.
Kohonen, T. (1978). Associative memory. Springer-Verlag, Berlin.
Kohonen, T. (1996). Self-Organizing Maps. Springer-Verlag.
Kohonen, T. (1998). The self-organizing map. Neurocomputing, 21(1-3):1–6.
Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., and Saarela, A. (2000). Self-organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3):574–585.
Kohonen, T. and Somervuo, P. (1998). Self-organizing maps of symbol strings.
Neurocomputing, 21(1–3):19–30.
Kohonen, T. and Somervuo, P. (2002). How to make large self-organizing maps
for nonvectorial data. Neural Networks, 15(8–9):945–952.
Krishnan, R., Sivakumar, G., and Bhattacharya, P. (1999). A search technique for
rule extraction from trained neural networks. Pattern Recognition Letters,
20(3):273–280.
Laerhoven, K. V. (2001). Combining the self-organizing map and K-means clustering for on-line classification of sensor data. In Dorffner, G., Bischof, H.,
and Hornik, K., editors, ICANN, volume 2130 of Lecture Notes in Computer
Science, pages 464–469. Springer.
Laerhoven, K. V., Aidoo, K. A., and Lowette, S. (2001). Real-time analysis of
data from many sensors with neural networks. In ISWC, pages 115–122.
IEEE Computer Society.
Lawrence, R. D., Almasi, G. S., and Rushmeier, H. E. (1999a). A scalable
parallel algorithm for self-organizing maps with applications to sparse data
mining problems. Data Min. Knowl. Discov., 3(2):171–195.
Lawrence, R. D., Almasi, G. S., and Rushmeier, H. E. (1999b). A scalable
parallel algorithm for self-organizing maps with applications to sparse data
mining problems. Data Min. Knowl. Discov, 3(2):171–195.
Lebbah, M., Badran, F., and Thiria, S. (2000). Topological map for binary data.
In ESANN, pages 267–272.
Lebbah, M., Chazottes, A., Badran, F., and Thiria, S. (2005). Mixed topological
map. In ESANN, pages 357–362.
Leisch, F., Weingessel, A., and Dimitriadou, E. (1998). Competitive learning for
binary valued data. In Niklasson, L., Bodén, M., and Ziemke, T., editors,
Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN 98), volume 2, pages 779–784, Skövde, Sweden. Springer.
Liu, J., Pan, Y., Wang, K., and Han, J. (2002). Mining frequent item sets by
opportunistic projection.
Lu, H. J., Setiono, R., and Liu, H. (1996). Effective data mining using neural
networks. IEEE Transactions on Knowledge and Data Engineering, 8:957–961.
Malone, J., McGarry, K., Wermter, S., and Bowerman, C. (2006). Data mining
using rule extraction from Kohonen self-organising maps. Neural Computing and Applications, 15(1):9–17.
Mannila, H. and Toivonen, H. (1997). Levelwise search and borders of theories
in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241–
258.
Mannila, H., Toivonen, H., and Verkamo, A. I. (1994). Efficient algorithms
for discovering association rules. In Fayyad, U. M. and Uthurusamy, R.,
editors, AAAI Workshop on Knowledge Discovery in Databases (KDD-94),
pages 181–192, Seattle, Washington. AAAI Press.
McGarry, K. J., Wermter, S., and MacIntyre, J. (1999). Knowledge extraction
from radial basis function networks and multi-layer perceptrons. In IEEE
International Joint Conference on Neural Networks (IJCNN’99), volume IV,
pages 2494–2497, Washington DC. IEEE.
Meo, R. (2003). Replacing support in association rule mining. Technical Report
RT70-2003, Università degli Studi di Torino.
Mitra, S., Pal, S. K., and Mitra, P. (2002). Data mining in soft computing framework: a survey. IEEE-NN, 13:3–14.
Oja, M., Kaski, S., and Kohonen, T. (2003). Bibliography of self-organizing map
(SOM) papers: 1998–2001 addendum. Neural Computing Surveys, 3:1–156.
O’Keefe, S. (1995). Neural Networks for FAX Image Analysis. PhD thesis,
University of York.
Omiecinski, E. (2003). Alternative interest measures for mining associations in
databases. IEEE Trans. Knowl. Data Eng., 15(1):57–69.
Pasquier, N., Bastide, Y., Taouil, R., and Lakhal, L. (1998). Pruning closed
itemset lattices for association rules. Proceedings of the BDA French Conference on Advanced Databases, October 1998.
Pasquier, N., Bastide, Y., Taouil, R., and Lakhal, L. (1999). Discovering frequent
closed itemsets for association rules. Lecture Notes in Computer Science,
1540:398–416.
Pei, J., Han, J., and Mao, R. (2000). CLOSET: An efficient algorithm for mining
frequent closed itemsets. In ACM SIGMOD Workshop on Research Issues
in Data Mining and Knowledge Discovery, pages 21–30.
Piatetsky-Shapiro, G. (2000). Knowledge discovery in databases: 10 years after.
Piatetsky-Shapiro, G. and Frawley, W. J., editors (1991). Knowledge Discovery
in Databases. AAAI/MIT Press.
Porrmann, M., Witkowski, U., and Ruckert, U. (2003). A massively parallel
architecture for self-organizing feature maps. IEEE-NN, 14:1110–1121.
Rácz, B., Bodon, F., and Schmidt-Thieme, L. (2005). Benchmarking frequent
itemset mining algorithms: from measurement to analysis. In Goethals,
B., Nijssen, S., and Zaki, M. J., editors, Proceedings of ACM SIGKDD
International Workshop on Open Source Data Mining (OSDM’05), pages
36–45, Chicago, IL, USA.
Rajaraman, K. and Tan, A.-H. (2001). Topic detection, tracking, and trend analysis using self-organizing neural networks. In Cheung, D. W.-L., Williams,
G. J., and Li, Q., editors, PAKDD, volume 2035 of Lecture Notes in Computer Science, pages 102–107. Springer.
Bayardo Jr., R. J. (1998). Efficiently mining long patterns from databases.
SIGMOD Rec., 27(2):85–93.
Salem, A.-B. M., Syiam, M. M., and Ayad, A. F. (2003). Improving self-organizing feature map (SOFM) training algorithm using K-means initialization. In ICEIS (2), pages 399–405.
Sallans, B. (1997). Data mining for association rules with unsupervised neural
networks: Csc final project.
Setiono, R. (2000). Extracting M-of-N rules from trained
neural networks. IEEE-NN, 11(2):512.
Yang, S. and Zhang, Y. (2004). Self-organizing feature map based data mining. In ISNN, volume 3173 of Lecture Notes in Computer Science, pages 193–198. Springer.
Shenoy, P., Haritsa, J. R., Sudarshan, S., Bhalotia, G., Bawa, M., and Shah, D.
(2000). Turbo-charging vertical mining of large databases. pages 22–33.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis.
Chapman and Hall, New York.
Song, H.-H. and Lee, S.-W. (1998). A self-organizing neural tree for large-set
pattern classification. IEEE Transactions on Neural Networks, 9(3):369–380.
SPSS (1968). Clementine.
Srikant, R. and Agrawal, R. (1996). Mining quantitative association rules in large
relational tables. In Jagadish, H. V. and Mumick, I. S., editors, Proceedings
of the 1996 ACM SIGMOD International Conference on Management of
Data, pages 1–12, Montreal, Quebec, Canada.
Su, M.-C. and Chang, H.-T. (2001). A new model of self-organizing neural
networks and its application in data projection. IEEE-NN, 12:153–158.
Su, M.-C., Liu, T.-K., and Chang, H.-T. (1999). An efficient initialization scheme
for the self-organizing feature map algorithm. In IEEE International Joint Conference on Neural Networks (IJCNN’99), volume III, pages 1906–1910,
Washington DC. IEEE.
Taha, I. A. and Ghosh, J. (1999). Symbolic interpretation of artificial neural
networks. IEEE Transactions on Knowledge and Data Engineering, 11(3):448–463.
Tan, P.-N., Kumar, V., and Srivastava, J. (2002). Selecting the right interestingness measure for association patterns. In KDD ’02: Proceedings of
the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 32–41, New York, NY, USA. ACM Press.
Tickle, A. B., Andrews, R., Golea, M., and Diederich, J. (1998). The truth
will come to light: Directions and challenges in extracting the knowledge
embedded within trained artificial neural networks. IEEE-NN, 9(6):1057.
Toivonen, H. (1996). Sampling large databases for association rules. In Vijayaraman, T. M., Buchmann, A. P., Mohan, C., and Sarda, N. L., editors, Proc. 1996 Int. Conf. Very Large Data Bases, pages 134–145. Morgan Kaufmann.
Towell, G. G. and Shavlik, J. W. (1993). Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71–101.
Tsukimoto, H. (2000). Extracting rules from trained neural networks. IEEE-NN,
11(2):377.
Ultsch, A. (1999). Data mining and knowledge discovery with emergent self-organizing feature maps for multivariate time series.
Ultsch, A. and Siemon, H. P. (1990). Kohonen’s self organizing feature maps for
exploratory data analysis. In INNC Paris 90, pages 305–308. Universität Dortmund.
Vapnik, V. (1998). Statistical Learning Theory. Wiley.
Vázquez, J. M., Macías, J. L. Á., and Santos, J. C. R. (2002). Discovering numeric association rules via evolutionary algorithm. In Cheng, M.-S., Yu,
P. S., and Liu, B., editors, PAKDD, volume 2336 of Lecture Notes in Computer Science, pages 40–51. Springer.
Veloso, A. (2003). New parallel algorithms for frequent itemset mining in large
databases.
Vesanto, J. (1999). SOM-based data visualization methods. Intelligent Data Analysis, 3:111–126.
Vesanto, J. and Ahola, J. (1999). Hunting for correlations in data using the selforganizing map. In Proc. of International ICSC Congress on Computational
Intelligence Methods and Applications (CIMA’99), Rochester, New York,
USA, June 22–25, pages 279–285. ICSC Academic Press.
Vesanto, J. and Alhoniemi, E. (2000). Clustering of the self-organizing map.
IEEE Transactions on Neural Networks, 11(3):586–600.
Vesanto, J., Himberg, J., Alhoniemi, E., and Parhankangas, J. (1999). Selforganizing map in matlab: the SOM toolbox. In Proc. of Matlab DSP Conference 1999, Espoo, Finland, November 16–17, pages 35–40.
Vesanto, J., Himberg, J., Alhoniemi, E., and Parhankangas, J. (2000). SOM
toolbox for Matlab 5. Technical report.
Wijsen, J. and Meersman, R. (1998). On the complexity of mining quantitative
association rules. Data Min. Knowl. Discov, 2(3):263–281.
Woon, Y.-K., Ng, W.-K., and Lim, E.-P. (2004). A support-ordered trie for fast
frequent itemset discovery. IEEE Transactions on Knowledge and Data
Engineering, 16(7):875–879.
Yan, X., Cheng, H., Han, J., and Xin, D. (2005a). Summarizing itemset patterns:
a profile-based approach. In Grossman, R., Bayardo, R., and Bennett, K. P.,
editors, KDD, pages 314–323. ACM.
Yan, X., Zhang, C., and Zhang, S. (2005b). ARMGA: Identifying interesting
association rules with genetic algorithms. Applied Artificial Intelligence,
19(7):677–689.
Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in
general metric spaces. In SODA: ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete
Algorithms).
Yin, H. and Allinson, N. M. (2001). Self-organizing mixture networks for probability density estimation. IEEE-NN, 12:405–411.
Zaki, M. J. (1999). Parallel and distributed association mining: A survey. IEEE
Concurrency, 7(4):14–25.
Zaki, M. J. (2000). Scalable algorithms for association mining. IEEE Trans.
Knowl. Data Eng, 12(2):372–390.
Zaki, M. J. and Hsiao, C.-J. (2002). Charm: An efficient algorithm for closed
itemset mining. In Grossman, R. L., Han, J., Kumar, V., Mannila, H., and
Motwani, R., editors, SDM. SIAM.
Zaki, M. J., Parthasarathy, S., Li, W., and Ogihara, M. (1996). Evaluation of
sampling for data mining of association rules. Technical Report TR617.
Zaki, M. J., Parthasarathy, S., Ogihara, M., and Li, W. (1997a). New algorithms for fast discovery of association rules. In Heckerman, D., Mannila, H., Pregibon, D., and Uthurusamy, R., editors, 3rd Intl. Conf. on Knowledge Discovery and Data Mining, pages 283–296. AAAI Press, Menlo Park.
Zaki, M. J., Parthasarathy, S., Ogihara, M., and Li, W. (1997b). Parallel algorithms for discovery of association rules. Data Mining and Knowledge
Discovery, 1(4):343–373.
Zhang, G. P. (2000). Neural networks for classification: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 30(4):451–462.