Neural Networks as Artificial Memories for Association Rule Mining

Vicente Oswaldo Baez Monroy

Submitted for the degree of Doctor of Philosophy
Department of Computer Science
December, 2006

Abstract

The collection of data is a high-priority task in our daily life because humans are interested in understanding more about the variables of the diverse events happening around us. This understanding is derived from the use of techniques which aim to produce predictive or descriptive models from data. While the former generates models to predict new states of the data, the latter finds important patterns to describe it. In Computer Science, Data Mining is a multidisciplinary area responsible for producing these data-analysis techniques by developing algorithms which aim to form novel understandings of data. Among the descriptive data-mining techniques, the task of association rule mining stands out because of its simple but powerful rule format for representing how the attributes or items, which form events or patterns in an environment, associate with each other and how strong these associations are. Based on the support-confidence framework proposed by Agrawal for association rule mining, the generation of these rules is typically achieved by first identifying the group of interesting or frequent itemsets in the data and then generating rules from the discovered itemsets. To determine whether an itemset is frequent or not, its corresponding support property must be calculated, since it defines the itemset's frequency of occurrence in the mined environment.

Although the number of approaches and strategies for association rule mining has been growing since 1993, there are very few proposals based on biologically-inspired technology. In particular, there are barely any neural-network-based approaches for the generation of this type of rule.
To further the research in this field, we explore in this thesis how neural networks can be used for association rule mining. Since it has been assumed that association rules are a type of knowledge that humans can generate mechanically, and considering that neural networks imitate human behaviour, we have posited that an implicit neural-based framework may exist for this data-mining technique. In particular, we have followed the premise that association rules, similar to those generated by traditional algorithms like Apriori, can be derived from the knowledge learnt by a neural network. In order to perform association rule mining with neural networks, we focus on investigating how the counting of patterns or itemsets, which is normally produced by scanning the high-dimensional space defined by the original data environment, can instead be performed by decoding the knowledge embedded in an auto-associative memory and a self-organising map. That is, we have worked on the first stage of the proposed neural-based framework, which involves the building of artificial memories that are able to learn, store and recall itemset support after they have been trained with data defining associations. Specifically, we analyse and decode the training process and the weight matrix of a self-organising map and an auto-associative memory to propose itemset-support extraction mechanisms through which they are able to recall itemset support when an itemset is presented as a stimulus to the trained networks.

Since data sources or environments are not static, and any knowledge derived from them, such as rules or itemsets, tends to become outdated as fast as new events occur, we have also investigated how the itemset-support knowledge accumulated by a memory must be maintained over time. In particular, we propose how a self-organising-map-based memory can keep its knowledge of itemset support valid over time while it learns from a non-stationary environment.
Contents

1 Introduction
  1.1 The Role of Neural Networks in Data Mining
  1.2 Linking ANNs and ARM: The Motivations
  1.3 The Neural-Network Candidates
  1.4 The Research Questions
    1.4.1 Aims and Objectives
  1.5 Organisation
2 Association Rule Mining
  2.1 Introduction
  2.2 The Scope of ARM
    2.2.1 Formal Definition
  2.3 Frequent Itemset Mining
    2.3.1 The Calculation of Itemset Support
  2.4 Taxonomy of the FIMers
  2.5 Conclusions
3 Hypothetical Neural Network for Association Rule Mining
  3.1 Literature Review
  3.2 Hypothetical ARM Framework Based on ANNs
  3.3 A Formal Definition of the Problem
  3.4 Ideal ANN Characteristics for Building Memories for ARM
  3.5 Reasons for Studying an AAM and a SOM for ARM
  3.6 Similarities and Differences with Surveyed Approaches
  3.7 Conclusions
4 An Auto-Associative Memory for ARM
  4.1 Correlation Matrix Memory for ARM
    4.1.1 The Learning of Itemset Support by a CMM
    4.1.2 Recalling Itemset Support from the Weight Matrix of a CMM
    4.1.3 Complexity Analysis: CMM vs. Apriori
  4.2 Experiments
  4.3 Conclusions
5 Itemset Support Generation From a Self-Organising Map
  5.1 Considering a SOM for ARM: Principles
  5.2 A Probabilistic Itemset-support Estimation Mechanism
  5.3 Experiments and Results
  5.4 Conclusions
6 Incremental Training for Incremental ARM: A SOM Model
  6.1 Introduction
  6.2 Batch SOM for Non-stationary Environments
    6.2.1 The Problem Definition
    6.2.2 Interpretation by Node Influences of the Batch Training
    6.2.3 The Algorithm
    6.2.4 Experiments
  6.3 Itemset Support Maintenance by Incremental SOM Training
    6.3.1 Experiments
  6.4 Conclusions
7 Conclusions and Future Work
  7.1 Final Results
    7.1.1 Contributions
  7.2 Future Work
    7.2.1 For The Auto-associativity-based Memory
    7.2.2 For The Self-organising-map-based Memory
    7.2.3 For the Quality of the Itemset-support Estimation
    7.2.4 ANNs-based Candidate Generation Procedures
    7.2.5 Distributed Association Rule Mining
    7.2.6 The Itemset Concept in Dynamic Data
A The Apriori Algorithm
B The Neural Network Candidate Algorithms

List of Figures

1.1 An illustration of ARM applied to the data source defined in (a). The aim is to generate rules with a minimal threshold of 20%. As the first part of ARM, the frequent itemsets have to be discovered, such as in (b), based on their support. Then, association rules, as in (c), are formed from them.
1.2 A shopping receipt as an example of a data source in which the associativity concept can be exploited for the extraction of knowledge.
1.3 Framework defined by the processes and strategies for the problem of association rule mining. Its conception is based on the support-confidence framework of Agrawal (Agrawal et al., 1993).
1.4 Neural-based framework for ARM.
2.1 Example of an itemset-search-space lattice. In this case, the data space is formed by 4 items. Indexes represent the lexicographic order of the itemsets in the space.
3.1 Hypothetical neural-based framework for ARM. In particular, this thesis focuses on developing an artificial memory for its purposes (coloured area).
3.2 General structure of a mapping neural network. This appears in (Ham and Kostanic, 2001).
3.3 Outline of the theoretical internal support model defined in (Gardner-Medwin and Barlow, 2001) to produce the counting of patterns with distributed representations in a group of neurons.
3.4 Theoretical projection models defined in (Gardner-Medwin and Barlow, 2001) to produce the counting of patterns with distributed representations in a group of neurons.
4.1 Illustration of the accumulation of knowledge by a CMM.
4.2 Illustration of the accumulation of knowledge by a weightless CMM which has been modified to collect frequency information. The dark matrix illustrates the new matrix, called the frequency matrix Mf, which contains the corresponding pattern frequencies.
4.3 75-by-75 frequency matrix formed by a CMM, from which itemset-support recalls about the Chess dataset will be made.
4.4 119-by-119 frequency matrix formed by a CMM, from which itemset-support recalls about the Mushroom dataset will be made.
4.5 129-by-129 frequency matrix formed by a CMM, from which itemset-support recalls about the Connect dataset will be made.
5.1 Maps resulting from training a SOM with artificial datasets describing associations. The red hexagons on the grey maps define the hits received from the input patterns during training. Cluster formations are presented with the coloured maps.
5.2 Illustration of the importance of the mean in the calculation of the support of an item from a trained SOM. Different numbers of transactions (n) composed of zeros and ones have been used to form the bottom graphs. These graphs show that different concentrations (densities) of these bistate values captured in an item induce the tendency of the distribution curve to approach the densest value (e.g., in the left graph the number of failures (zi = 0) is greater than the number of successes, therefore the highest point of the distribution tends to be placed at 0).
5.3 The figure on the left depicts the intersections of an event B with events A1, ..., A5 of a partition over S. The figure on the right depicts the concept of Voronoi regions which can be formed on the SOM (the dots represent the codewords while the stars represent the data points assigned to each Voronoi region).
5.4 Representation of the Probabilistic Itemset-support Estimation Mechanism (PISM) proposed in this chapter.
5.5 Results for the support value of 15 itemsets (top graph), 255 itemsets (centre graph) and 65535 itemsets (bottom graph) obtained after using PISM in order to satisfy the query -All- to the map trained with the datasets Bin4, Bin8 and Bin16 respectively. For reference, the values corresponding to the same queries using an Apriori implementation are also plotted.
5.6 Intermediate results (the support values of 15 itemsets) generated from using PISM for the query -All- to the map being trained with dataset Bin4x100. In both cases, the SOM needs five epochs to converge, but after the first epoch good estimations can already be formed for the support of itemsets. The small difference in performance between these two exercises is due to the type of initialisation chosen.
5.7 Results (plots on the right) obtained after using PISM in order to satisfy the queries -1Itemset : 4Itemset- to the map trained with the dataset Chess. For reference, the values corresponding to the same queries (plots on the left) against the dataset Chess using an Apriori implementation are also plotted.
5.8 Results (plots on the right) obtained after using PISM in order to satisfy the queries -1Itemset : 3Itemset- to the map trained with the dataset Mushroom. For reference, the values corresponding to the same queries (plots on the left) against the dataset Mushroom using an Apriori implementation are also plotted.
5.9 Results (plots on the right) obtained after using PISM in order to satisfy the queries -1Itemset : 3Itemset- to the map trained with the dataset Connect. For reference, the values corresponding to the same queries (plots on the left) against the dataset Connect using an Apriori implementation are also plotted.
5.10 Distribution of the itemset-support estimations made by SOM via our method for the query -90to100Itemsets- for the Chess dataset. The corresponding errors are summarised in Table 5.8.
5.11 Generalisation errors for the results given by Table 5.8. While the x-axis represents the different itemset groups, the y-axis determines the calculated error.
5.12 Distribution of the itemset-support generalisation error made by SOM via our method for the queries 1Itemsets, 2Itemsets and 3Itemsets for the Chess dataset, when the size of the map, representing an itemset-support memory for ARM, increases.
5.13 Distribution of the itemset-support generalisation error made by SOM via our method for the query 45to100Itemsets for the Chess dataset, when the size of the map, representing an itemset-support memory for ARM, increases.
6.1 Algorithm for SOM training in batch.
6.2 Tendency of the final influence given by the node mj depending on the strength of the influences received from other nodes.
6.3 Incremental algorithm proposed for SOM in batch. While the first (top) function triggers the learning at each stage of a non-stationary environment, the second (bottom) function performs the training of the SOM with the current data chunk and the old information coming from the set of best matching units of the latest trained map.
6.4 Representation of a training data space describing a non-stationary environment.
6.5 Topologies formed by a SOM trained with our incremental batch approach through the six different phases defining the non-stationary environment represented in Figure 6.4. The black dots define the structure of the trained map. The data points used for the training of a SOM at each phase of the environment, according to the order in Table 6.1, are defined by the green and blue dots, which represent respectively the old knowledge (data extracted from the BMUs) and the current data chunk.
6.6 Differences between estimations (approximations) and calculations (real values) made respectively by a trained SOM and the Apriori algorithm for the group of 1-itemsets throughout the four phases of environment I. The first column from the left describes the type of estimations that can be produced with a SOM trained with only the current data chunk in the environment (Chunk-SOM). The second column represents the estimations that can be made from a SOM trained with our incremental approach (Bincremental-SOM). The last column shows the estimations made by a SOM always trained with all the data chunks available in the environment (Allchunks-SOM).
6.7 Differences between estimations (approximations) and calculations (real values) made respectively by a trained SOM and the Apriori algorithm for the group of 2-itemsets throughout the four phases of environment I. The first column from the left describes the type of estimations that can be produced with a SOM trained with only the current data chunk in the environment (Chunk-SOM). The second column represents the estimations that can be made from a SOM trained with our incremental approach (Bincremental-SOM). The last column shows the estimations made by a SOM always trained with all the data chunks available in the environment (Allchunks-SOM).
6.8 The RMS error during the phases of environment I for the group of the 1-itemsets.
6.9 The RMS error during the phases of environment I for the group of the 2-itemsets.
6.10 Error describing the quality of the quantization of the approaches tested for the data chunk describing the changes at each phase of environment I.
6.11 Error describing the quality of the quantization of the approaches tested for the data chunks describing the previous phases (history) at each phase of environment I.
6.12 RMS error during the phases of environment II for the group of the 1-itemsets.
6.13 RMS error during the phases of environment II for the group of the 2-itemsets.
6.14 Error describing the quality of the quantization of the approaches tested for the data chunk describing the changes at each phase of environment II.
6.15 Error describing the quality of the quantization of the approaches tested for the data chunks describing the previous phases (history) at each phase of environment II.
6.16 RMS error during the phases of environment III for the group of the 1-itemsets.
6.17 RMS error during the phases of environment III for the group of the 2-itemsets.
6.18 Error describing the quality of the quantization of the approaches tested for the data chunk describing the changes at each phase of environment III.
6.19 Error describing the quality of the quantization of the approaches tested for the data chunks describing the previous phases (history) at each phase of environment III.
6.20 Runtime for the three approaches evaluated on environment III.
7.1 Example of association rules which can be further generated from the knowledge learnt by a neural network about the Mushroom dataset defined in (D.J. Newman and Merz, 1998). All these rules, describing the associativity among the attributes of a dataset, have the format: if (list of items or attributes) then (list of items or attributes) with [support=% and confidence=%].
7.2 Triangular binary formations formed with the itemset search spaces formed with 3 and 4 items.
7.3 Function generated with the support of some groups of itemsets derived from the Chess dataset.
7.4 Incremental SOM-based approach for distributed ARM. The rules will be generated from the latest trained SOM.
7.5 Local SOM-based approach for distributed ARM. While the local maps are queried remotely in the model on the left, in the model on the right the trained maps are transmitted to be the source from which rules will be generated.
7.6 A neural-based approach for ARM for data streams.
A.1 The Apriori algorithm. This figure was extracted from the original paper of Agrawal (Agrawal and Srikant, 1994). The top pseudocode describes the main steps of Apriori. The bottom SQL query defines the way in which candidates are formed during a mining process.
B.1 An associative memory based on a CMM.

List of Tables

4.1 List of real-life binary training datasets used in the testing of an auto-associative memory for ARM. They are part of the datasets normally used for testing FIM algorithms or FIM benchmarks (Jr. et al., 2004; Goethals and Zaki, 2003).
4.2 Support constraint conditions used to form the groups of itemsets on which the memories will be tested.
4.3 Error results obtained in the experiments for the support recall for the groups of 3-itemsets, constrained as defined in Table 4.2, made by a CMM through our proposals.
4.4 Error results obtained in the experiments for the support recall for the groups of 4-itemsets, constrained as defined in Table 4.2, made by a CMM through our proposals.
4.5 Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Chess dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 90 and 100%.
4.6 Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Chess dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 45 and 100%.
4.7 Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Connect dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 75 and 100%.
4.8 Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Mushroom dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 45 and 100%.
5.1 List of binary-artificial training datasets used in the experiments for analysing SOM properties for ARM. Each dataset contains all the possible n itemsets generated by m items.
5.2 List of real-life binary training datasets used in the testing of PISM for SOM. They have been used in FIM-algorithm benchmarks (Jr. et al., 2004; Goethals and Zaki, 2003).
5.3 List of queries used to form the groups of itemsets used for the testing of the itemset-support estimations from SOMs via PISM. In this case, k means the size of the itemsets and σ refers to the support used to form such itemset groups.
5.4 Generalisation errors produced by a trained SOM through PISM for the queries 1Itemsets, 2Itemsets and 3Itemsets derived from the Chess dataset.
5.5 Generalisation errors produced by a trained SOM through PISM for the queries 1Itemsets, 2Itemsets and 3Itemsets derived from the Mushroom dataset.
5.6 Generalisation errors produced by a trained SOM through PISM for the queries 1Itemsets, 2Itemsets and 3Itemsets derived from the Connect dataset.
5.7 Generalisation errors for the trained SOMs with linear initialisation.
5.8 Generalisation errors for trained SOMs with linear initialisation on different ranges of groups of k-itemsets.
6.1 Data order followed in the incremental batch training.
6.2 Definition of the non-stationary environments using the Chess dataset. In the first two environments, each data chunk has the same (fixed) number of transactions, while in the case of the third environment, the number of transactions was chosen randomly (unfixed).
6.3 Results obtained from the approaches tested for environment I.
6.4 Results obtained from the approaches tested for environment II.
6.5 Results obtained from the approaches tested for environment III.
7.1 Characteristics of the two itemset-support memories based on the studied neural networks. m defines the total number of items from which the n transactions or itemsets of a training dataset D are derived. mb corresponds to the number of BMUs formed in a training epoch.
7.2 Comparison of the generalisation errors between our approaches for SOM and CMM for the different numbers of tested itemsets, which resulted from a query on the Chess dataset with a minimal support between 90 and 100%.
7.3 Comparison of the generalisation errors between our approaches for SOM and CMM for the different numbers of tested itemsets, which resulted from a query on the Chess dataset with a minimal support between 45 and 100%.

Acronyms and Symbols

Acknowledgements

I first thank my supervisor, Dr Simon O'Keefe, for the time he has given throughout the development of this research. I must express my deep gratitude to CONACYT, the national council of science and technology in Mexico, and to the central bank of Mexico for the support offered, without which this thesis and the experience around it would not have been possible. Throughout these years I have had the pleasure of meeting many nice people and becoming their friend. Among them, I wish to acknowledge Patricia and Michael Lee for their support and positive words. I also want to express many thanks to the staff of the Computer Science Department of the University of York. Moreover, I want to thank the anonymous conference reviewers who provided valuable comments which helped me in the conclusion of this work.
Most importantly, I would like to express my thanks to Victoria Mueller, who has been an important person to me in one way or another. Finally, I must thank my family (Marcela, Vicente and Carlos) for all the love, encouragement and support given to me throughout all these years. In particular, I must express my gratitude to my mother for showing me that everything is possible in this life. My deepest gratitude I must reserve for myself, because we never gave up in the difficult times.

Declaration

I declare that the work presented in this thesis is solely my own, except where indicated, attributed or cited to other authors. Some of the material of this thesis has been previously published and presented at conferences. A complete list of publications is provided on the next page.

List of Publications

The following is a list of the publications produced during the course of this research.

2005

- Vicente O. Baez-Monroy and Simon O'Keefe, "Modelling Incremental Learning With The Batch SOM Training Method", HIS '05: Proceedings of the Fifth International Conference on Hybrid Intelligent Systems, IEEE Computer Society, Rio, Brazil, pages 542-544, 2005.
- Vicente O. Baez-Monroy and Simon O'Keefe, "Principles of Employing a Self-organizing Map as a Frequent Itemset Miner", Artificial Neural Networks: Biological Inspirations - ICANN Proceedings, 15th International Conference, Warsaw, Poland, September 11-15, 2005.

2006

- Vicente O. Baez-Monroy and Simon O'Keefe, "The Identification and Extraction of Itemset Support Defined by the Weight Matrix of a Self-Organising Map", 2006 IEEE World Congress on Computational Intelligence, Proceedings of the International Joint Conference on Neural Networks (IJCNN), July 16-21, Vancouver, BC, Canada, 2006.

Chapter 1

Introduction

Data is a common term denoting the collection of facts or raw information describing processes or activities, or the status of events occurring in an environment.
Data normally exists in a variety of forms; however, it is often categorised into the groups of numbers (numeric or quantitative data) and symbols (categorical or qualitative data), depending on the meaning of the variables or features of the events to be modelled. The collection of data is an important activity which has been performed throughout our history and has contributed to the development of our societies in one form or another. For instance, it has allowed human beings to develop methods for climate forecasting, medical diagnosis and treatment, highway and house planning, and so forth. As we are surrounded by data everywhere and its collection has an important significance for our daily life, some areas in Computer Science (CS) have been particularly dedicated to developing the technology necessary to collect, store and model large amounts of data from almost every possible recordable source of information, for example: bank operations, video, medical checks, images, shopping activities, vehicles, web transactions, and many others. Among the areas responsible for the management of data, a first example is the database area, which focuses on data modelling. Normally, the term database in this area refers to an entity-relationship model environment, in which the data changes periodically due to new operations (inserts, deletes, updates) represented by transactions. A second example is the data warehouse area, in which data is seen as a large repository containing integrated information for efficient querying and analysis. Here, data modelling is based on a star-schema concept (Kimball, 1996) rather than a transactional one. The main ambition behind data collection is to provide humans with rich resources of data regarding a specific event, which can be converted into the more valuable commodity called knowledge or information.
Once data have been collected, the search for interpretations begins, aiming to gain a novel or better understanding of the variables modelled in the data. This search activity, known as data analysis, has been a permanent motivation for researchers in the development of new methods which can provide faster, better and new interpretations from diverse and unlimited data sources. Over the last decade, it has gradually become the rule that any project addressing the extraction of knowledge from a data source must follow the framework defined by the area of knowledge discovery. In this area, the term KDD (Knowledge Discovery in Databases), attributed to Piatetsky-Shapiro et al. (Piatetsky-Shapiro and Frawley, 1991; Piatetsky-Shapiro, 2000), has been coined to define that:

Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns.

(The term pattern is used here to denote any final product, e.g., a series of clusters, a classification model, rules, etc., resulting from a knowledge discovery process.)

In the KDD process, while the first stages are dedicated to the management, cleaning and transformation of data, the last ones are devoted to the manipulation of the discovered knowledge, focusing on presenting the results in the most profitable and understandable format for the human experts. Occupying the central activity and representing the core of the process are the data-mining algorithms. These algorithms are responsible for conducting the discovery of knowledge by performing a series of operations to form a model which satisfies the values assigned to the input parameters of the mining activity. DM (Data Mining) (Han and Kamber, 2000; Hand et al., 2001; Cios, 2000) is an area in Computer Science whose efforts are directed at developing these new algorithms for data analysis.
The central focus of these algorithms is to build either predictive models, in which some features or variables of the original data are used to build a model which will predict unknown or future states of the model variables, or descriptive models, which focus on describing the data by finding hidden patterns or relationships in it in order to gain some understanding of it. The quality of the models generated, as well as the performance of a DM algorithm on a specific problem, is often affected by the nature of the data, which can be redundant, imprecise, noisy, high-dimensional or dynamic. Therefore, a successful DM approach is often considered to be an algorithm which deals reasonably well with the characteristics defined by the data. The spectrum of techniques in DM can be classified (according to the type of task to be performed) into clustering, classification, regression and prediction, association rule mining, and sequence analysis. Nevertheless, this DM classification is not unique, and specialised extensions of each technique can also be found in the literature (e.g., spatial mining, web mining, outlier detection, frequent itemset mining, text mining, bio mining, and so forth). Data mining is often defined as a multi-disciplinary area which integrates miscellaneous techniques from different fields such as machine learning, statistics, neural networks, etc. Therefore, it is rare to find a unique solution for a data-mining problem. Among the descriptive DM techniques, ARM (Association Rule Mining) stands out because of its comprehensive and powerful format for representing the knowledge discovered from a data source. ARM, introduced by Agrawal et al (Agrawal et al., 1993), is a technique for forming rules defining the associativity among the components, attributes or variables of a data source or environment D. In its beginnings, ARM was conceived to analyse shopping-basket databases in order to identify customer-behaviour patterns.
Nowadays, it has been applied to problems in which the associativity property of the variables defining a dataset can be exploited (e.g., web mining, text mining, bio mining). In ARM, the data source D (database, dataset), which is represented by n probably different patterns x1, ..., xk, ..., xn (transactions), can be modelled horizontally in two main forms. The first is as a collection of binary vectors of length m (where m defines the number of components, elements or attributes representing the transactions), in which the absence of a value at some pattern element, for instance xki = 0, refers to the lack of participation of element i in the pattern (transaction or association) xk, while a value of one (xki = 1) denotes its participation. The other method of representation is as integer vectors of varying size, whose elements are the indexes i of the elements with participation (xki = 1) in the binary representation. Individual elements of the patterns are known as items Ii, and an itemset is defined as a grouping or combination of some of the items defining a pattern. Rules with the format X ⇒ Y [supp, conf] are the aim of ARM for describing the participation (associations) of a total set of items, Γ = {I1, ..., Im}, in the transactions contained in D. To form such descriptions of D, according to Agrawal (Agrawal et al., 1993), ARM can be divided into a two-stage problem. This view of ARM is known as the support-confidence framework, which basically defines that association rules can be formed by first finding the frequent itemsets in D and second generating the rules from such itemsets. The former is a DM task known as FIM (Frequent Itemset Mining) (Goethals, 2003) and involves the formation and counting of the different possible k-itemsets, for all 1 ≤ k ≤ m, to determine the relevant ones based on a minimal support threshold σ.
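To make the two horizontal layouts concrete, the following is a minimal sketch; a third, vertical (tidlist) layout, used by algorithms such as Eclat, is also shown. The toy database and variable names are ours, purely for illustration:

```python
# Three equivalent layouts for the same database D of n = 3 transactions
# over m = 5 items (item indexes 1..5).

# Horizontal binary layout: x_k is a length-m 0/1 vector; x_ki = 1 means
# item i participates in transaction (pattern) x_k.
binary = [
    [1, 1, 0, 1, 0],   # x1 = {1, 2, 4}
    [0, 1, 1, 0, 1],   # x2 = {2, 3, 5}
    [1, 1, 0, 0, 1],   # x3 = {1, 2, 5}
]

# Horizontal integer (size-varied) layout: each vector lists the indexes i
# with x_ki = 1 in the binary representation.
integer = [[i + 1 for i, v in enumerate(row) if v == 1] for row in binary]
print(integer)       # [[1, 2, 4], [2, 3, 5], [1, 2, 5]]

# Vertical (tidlist) layout: for each item, the ids of the transactions
# containing it.
tidlists = {i + 1: [t + 1 for t, row in enumerate(binary) if row[i] == 1]
            for i in range(5)}
print(tidlists[2])   # item 2 appears in transactions [1, 2, 3]
```

In the binary layout an itemset's support is the number of rows whose entries are 1 in all of the itemset's positions; in the vertical layout it is the size of the intersection of the items' tidlists.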
The latter, which defines the RG (Rule Generation) stage and has been identified as a straightforward process, uses the information resulting from FIM for the generation of rules. To determine which k-itemsets are the important (frequent) ones in a mining exercise, the generation of candidates and the calculation of itemset support are performed. Candidate generation refers to the formation of new k-itemsets (combinations of k items) during the process, while the calculation of support refers to the action taken (counting) by the algorithm to measure the appearance of the candidates in the n transactions. The relevance of support for ARM, and in particular for FIM, is high, since it is the main metric by which an algorithm determines the status of an itemset against the threshold. That is, the support of a rule, for instance A → B, is equal to the support of its itemset; i.e., supp(A → B) = supp(AB). For the generation of rules, another metric, known as confidence, is used to determine the strength of the rules which are derived from the final set of frequent itemsets along with their corresponding support. (Recall that X ⇒ Y [supp, conf] is the simplest format of an association rule: X and Y are known as the antecedent (the "if" part) and the consequent (the "then" part) of the rule, and both parts are itemsets; the two basic metrics, supp and conf, defining the support and confidence of the rule, are used to define its interestingness.) To illustrate the concept of ARM, an example is given in Figure 1.1, in which we show the type of knowledge that can be derived from performing ARM on a set of transactions representing a shopping-basket dataset. As commented above and shown in our example, the generation of association rules is produced by first performing FIM and then using the discovered itemsets to produce the target rules. Association rules provide a description of data in the form of "if-then" statements.
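The two metrics can be computed directly from the definitions above. The following sketch uses a hypothetical five-transaction database of our own; conf(X ⇒ Y) is taken, as is standard in the support-confidence framework, to be supp(XY)/supp(X):

```python
# A hypothetical toy database D of n = 5 transactions over items A, B, C.
D = [{'A', 'B', 'C'}, {'A', 'B'}, {'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}]

def count(itemset, D):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in D if itemset <= t)

def support(itemset, D):
    """Frequency of occurrence of the itemset in D: count / n."""
    return count(itemset, D) / len(D)

# The support of a rule equals the support of its itemset:
# supp(A -> B) = supp(AB).
print(support({'A', 'B'}, D))                    # 0.6 (3 of 5 transactions)

# The confidence of A -> B measures the strength of the rule:
# conf(A -> B) = supp(AB) / supp(A) = count(AB) / count(A).
print(count({'A', 'B'}, D) / count({'A'}, D))    # 0.75
```

With a minimal support threshold of, say, 50%, the itemset {A, B} would be declared frequent (support 0.6), and the rule A ⇒ B would then be generated with confidence 0.75.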
Unlike the if-then rules of logic, association rules are probabilistic in nature. The challenge of generating association rules from a database is considerable, since its complexity, which is mainly defined by the complexity of FIM, is exponential. That is, the search space formed by the m items conforming the n transactions in a database is of size 2^m. Therefore, according to Goethals (Goethals, 2002), in many applications involving hundreds of items, it is easy to find a search space with a number of itemsets larger than the number of atoms in the universe (≈ 10^79). Even though other metrics have been proposed to define rule or itemset interestingness (Hilderman and Hamilton, 1999; Tan et al., 2002; Meo, 2003; Omiecinski, 2003), and some critiques have been made of the support-confidence framework (Brin et al., 1997; Aggarwal and Yu, 1998), the importance of the framework is not in question, because its methodology has been adopted in one form or another by the majority of the successful, current state-of-the-art algorithms; for instance, Apriori (Agrawal and Srikant, 1994; Mannila et al., 1994), Eclat (Zaki et al., 1997a) and FP-growth (Han et al., 2000a).

[Figure 1.1: An illustration of ARM applied to the data source defined in (a). The aim is to generate rules with a minimal threshold of 20%. As the first part of ARM, the frequent itemsets have to be discovered, as in (b), based on their support. Then association rules, as in (c), are formed from them.]

The diversity of FIM algorithms, in terms of the knowledge areas on which they are based, is not as vast as in other DM techniques. In pattern classification (Duda et al., 2000), for example, approaches based on decision trees, neural networks, Bayesian methods and many others can be used to perform the task.
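As a rough illustration of how such algorithms contain the 2^m search space, the following is a stripped-down, level-wise, Apriori-style miner of our own (not the published algorithm's optimised form): k-itemset candidates are joined from the frequent (k-1)-itemsets, and any candidate with an infrequent subset is pruned before its support is ever counted, following the downward closure property discussed later in this chapter:

```python
from itertools import combinations

def apriori(D, sigma):
    """Simplified level-wise FIM over a list of transaction sets D,
    with minimal support threshold sigma (a fraction)."""
    n = len(D)
    def supp(s):
        return sum(1 for t in D if s <= t) / n
    items = sorted({i for t in D for i in t})
    # Level 1: frequent single items.
    level = {frozenset([i]) for i in items if supp(frozenset([i])) >= sigma}
    frequent = {s: supp(s) for s in level}
    k = 2
    while level:
        # Join: form k-itemset candidates from pairs of frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune (downward closure): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Count: keep only the candidates meeting the threshold.
        level = {c for c in candidates if supp(c) >= sigma}
        frequent.update({c: supp(c) for c in level})
        k += 1
    return frequent

D = [{'A', 'B', 'C'}, {'A', 'B'}, {'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}]
F = apriori(D, 0.4)
print(F[frozenset({'A', 'B'})])   # 0.6
```

On this toy database the full lattice has 2^3 - 1 = 7 non-empty itemsets and all of them happen to be frequent at σ = 0.4; on sparse real data the pruning step is what keeps the number of counted candidates far below 2^m.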
The reason for this limited diversity is that the main work on ARM has focused on finding the best data structure to reduce the complexity of FIM (the counting of items and the selection of the frequent ones), rather than on developing possible alternatives to tackle the problem. For instance, proposals involving biologically-based technologies, such as genetic algorithms, have rarely been undertaken (Vázquez et al., 2002; Yan et al., 2005b; Alatas and Akin, 2006). In particular, it seems that the concept of approaches inspired by the learning capabilities of biological systems is not considered acceptable or suitable for generating such associativity-describing data models, even though it is possible to relate the problem of ARM to the general notion of inference in the learning activity performed by humans, in which data-driven interactions with the environment constantly occur. For instance, in a task of learning from a data source such as the one represented in Figure 1.2, a human is able to perform two classical types of inference, called induction and deduction, in order to obtain and use an abstract representation (model or knowledge) of the data source. Once such a model has been formed (induction), he is able to use it (deduction) to categorise (predict) the corresponding group for new items depending on their nature and purpose (e.g., garlic and olive oil can be grouped into the taxonomy of condiments and cooking ingredients). A third case of inference, known as transduction (a term introduced in (Vapnik, 1998) to refer to drawing conclusions about data directly, without constructing a model), could also be performed if the task implied forming answers for particular points of interest directly from the data (training dataset) without having to build a global model of it. As stated in (Kantardzic, 2002), an important application of the latter is ARM.

[Figure 1.2: A shopping receipt as an example of a data source in which the associativity concept can be exploited for the extraction of knowledge.]

Moreover, this human could perform the counting of real items or the formation of new abstract representations of items by combining them (e.g., milk and eggs and vanilla = cake). The counting of either real or abstract items would be done by scanning the information defined by either his shopping basket or his receipt. As a result of performing a pass over the data (learning), he would have kept some knowledge in his memory (brain areas, neurons) regarding single or collective, real or abstract itemsets, which can be used to answer questions about item appearances, such as: How many trios involving diet coke, crisps and napkins have been bought? Have toothpaste and aspirin been bought together? It is very likely that his answers regarding the frequency (support) of either real or abstract items would not be exact at first; therefore an error in his item-frequency recalls, defined by Error_recalling = Real_frequency - HumanApproximation_frequency, can be stated to exist. The accuracy of his answers will depend on how well he is able to remember (memory issues) the number of appearances (itemset support) for each of the different 2^m combinations that can be formed from his list of m individual items. In order to raise the accuracy of his answers, more passes over the data (training epochs) would have to be performed by him. If the current situation involved a large number of sources (multiple transactions) needing to be mined, then the problem would become more complicated for the human. The complexity of this situation can be measured by O(mcn), in which m, c and n define the number of items, the number of possible candidates (the possible itemset search space, whose size grows exponentially) and the number of transactions, respectively.
Even though the complexity grows as a result of adding a new item or transaction, a human will always be able to produce an answer, defining a generalisation of the frequency of occurrence of the queried itemsets, through learning (scanning and counting) the associations existing in the data. For the time being, we can therefore conclude that ARM is a mechanical activity performed naturally by humans, involving factors such as pattern association, pattern counting and memory, in which the better the answer, the greater the memory abilities and the number of passes over the data source required. Additionally, it has been stated by A.R. Gardner-Medwin and H.B. Barlow (Gardner-Medwin and Barlow, 2001) that the concepts of learning, counting, frequency and association are all intimately related in the genetic mechanism of learning, as follows: ...Sensory inputs are graded in character and may provide weak or strong evidence for identification of a discrete binary state of the environment such as the presence or absence of a specific object. Such classifications are the data on which much simple inference is built and about which associations must be learned. Learning any association requires a quantitative step in which the frequency of a joint event is observed to be very different from the frequency predicted from the probabilities of its constituents. Without this step, associations cannot be reliably recognized, and inappropriate behavior could result from attaching too much importance to chance conjunctions or too little to genuine causal ones. Estimating a frequency depends in its turn on counting, using that word in the rather general sense of marking when a discrete event occurs and forming a measure of how many times it has occurred during some epoch. Counting is thus a crucial prerequisite for all learning, but the form in which sensory experiences are represented limits how accurately it can be done...
The situation explained above is just a basic example of an associative real-life problem in which the computational power of the brain can be used to process a correct answer, which in this case cannot be produced as fast as we might wish. Nevertheless, in this digital era, artificial architectures such as an ANN (Artificial Neural Network) can be used to imitate the human brain since, as defined in (Haykin, 1999), an ANN is: a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: 1. Knowledge is acquired by the network through a learning process. 2. Interneuron connection strengths known as synaptic weights are used to store its knowledge. Hence, an ANN can be used to perform tasks in which the concepts of learning and generalisation are involved. Moreover, its use becomes more attractive when the problem at hand implies making use of its power to recall previous knowledge, or its experience of the environment learnt. Therefore, in order to contribute new biologically-inspired approaches that can be used for problems such as ARM, in which the counting of patterns representing associations (events) is significant, we investigate in this thesis the suitability of ANNs for ARM. In particular, we focus on determining whether association rules can be generated from the knowledge embedded in the synaptic weights of a neural network. Moreover, since the training data may represent a dynamic environment, we also conduct research on determining whether the synaptic weights of an ANN, which serve as the basis of the rules, can be updated over time with the changes occurring in the training environment.

1.1 The Role of Neural Networks in Data Mining

As mentioned previously, data mining is constituted by a large number of techniques coming from different CS fields which have been incorporated over time.
The diversity in techniques opens up the possibility of addressing a data-mining problem with different proposals. For instance, a classification problem is often tackled by using decision trees; nevertheless, neural networks have already been shown to be promising alternatives to conventional methods (Zhang, 2000). Nowadays, among the techniques forming the data-mining toolbox, ANNs have come to be considered one of the most useful tools, to the point of having become part of current commercial KDD solutions such as Clementine (SPSS, 1968). Their inclusion in the data-mining framework has not been easy, since they were not initially considered suitable for these tasks, the main criticism of neural networks lying in the following aspects. First, neural networks have been categorised as black boxes whose results lack symbolic representation; therefore their verification, integration and interpretation are difficult. Second, the time used up in training a neural network can be beaten by conventional methods. As a response to these critiques, approaches such as (Lu et al., 1996; Craven and Shavlik, 1997; Zhang, 2000) have appeared to argue and support the reconsideration of neural networks for the data-mining framework. To argue for their inclusion, some points are outlined as follows: • To tackle the incomprehensibility of neural models, some work has been produced on the extraction of rules and the visualization of the knowledge modelled by their weight matrices (Ultsch and Siemon, 1990; Hammer et al., 2002). • Neural networks are data-driven, self-adaptive methods which adjust to the input data. This adaptation is sometimes done without any explicit specification of the functional or distributional form of the underlying model; for instance, the SOM (Self-Organizing Map) (Kohonen, 1996). • They are able to approximate any function with arbitrary accuracy.
• They have the capability of remembering past states of the input data within their weight matrices; i.e., they resolve the stability-plasticity dilemma, as does the ART network (Carpenter and Grossberg, 1989). • Neural networks are nonlinear models, so they can be more flexible in modelling real-world complex relationships. • Statistical analysis can be conducted on the basis given by ANNs. • New training algorithms have reduced the training time without sacrificing the accuracy of the results. • They form a more suitable inductive bias than the conventional algorithms. Because of the type of results given by neural networks, they have also been considered an important part of the soft-computing framework for data mining (Mitra et al., 2002), which pursues the aim of producing methods of computation that provide a reasonable solution at low cost by seeking an approximate solution to either imprecise or precise problems. Considering the above characteristics of neural networks, we can conclude that they are on the right track to scaling up to problems involving large datasets and to becoming important members of a new generation of algorithms in which human interaction to compute hypotheses will be relevant. Generally speaking, the number of contributions of neural-network approaches to DM problems such as classification, clustering and prediction, whose full description goes beyond the scope of this thesis, is already vast and constantly increasing. Nevertheless, for other problems, such as descriptive DM tasks, and in particular ARM, the attention of the ANN research community remains passive and uncertain.

1.2 Linking ANNs and ARM: The Motivations

The conception of the use of ANNs for ARM may at first appear inappropriate, since it can be argued that in the task of ARM there is nothing to be classified, clustered or predicted. Nevertheless, the use of ANNs starts making sense if the following factors are well thought out. The Meaning of Association.
(On soft computing: as noted in (Kecman, 2001), it is not a clearly defined discipline at present, but it follows the premises that 1) the real world is pervasively imprecise and uncertain, and 2) precision and certainty carry a cost. In other words, soft-computing methodologies aim to exploit the tolerance for imprecision, uncertainty, approximate reasoning and partial truth in order to achieve tractability, robustness and low-cost solutions.) In ARM, association is a concept defined implicitly by the appearance of the m items in the n transactions of a data source D. It is a concept exploited for the formation of rule-formatted knowledge about some D, and its discovery is the main purpose of this DM task. Associations are represented physically by patterns formed through the grouping of binary or numeric elements, whose strengths of associativity are mainly measured by the frequency (support) of the members. Similarly, in the field of neural networks, the concept of associativity has also been employed. In this case, it defines an explicit relationship that exists between a pair of patterns (input and target) presented to the neural network under supervised training. This concept is mainly exploited by neural architectures which imitate the concept of associative memory present in our brain. In general terms, the explicit association among the inputs is the target to be learnt by some networks, forming knowledge which can be used for pattern-association or recognition tasks. Hence, based on the fact that both technologies know and manage the concept of association for creating a better understanding of some environment D, why should we not assume that one of the best techniques for pattern recognition under association concepts can perform the counting of the patterns or sub-patterns (itemsets) defining associativity among their elements? Rules as Symbolic Representation of Knowledge.
One of the fundamental considerations against using ANNs for data-analysis tasks has been the comprehensibility of their results. Therefore, since the late 80's (see the beginnings in (Towell and Shavlik, 1993; Craven and Shavlik, 1993; Andrews et al., 1995)), some efforts have been put into gaining interpretations of these nonlinear systems. One approach to tackling this drawback has been to develop methods of rule extraction (Benitez et al., 1997; Tickle et al., 1998; Taha and Ghosh, 1999; Craven and Shavlik, 1999; Tsukimoto, 2000; Browne A., 2003; Malone et al., 2006), which perform the conversion of the real-valued knowledge formed in the weight matrix into symbolic rules describing the inside of the neural model in a global or local manner (depending on the level of description). Rule-extraction mechanisms already exist for different neural architectures, such as the multi-layer perceptron (Duch et al., 1996; McGarry et al., 1999), recurrent networks (Jacobsson, 2005), self-organising networks (Hammer et al., 2002; Malone et al., 2006), radial basis function networks (McGarry et al., 1999), and even hybrid architectures (Eggermont, 1998). They have focused on the formation of if-then or logic rules, m-of-n rules (Setiono, 2000) and fuzzy rules, in particular for classification tasks. According to Krishnan (Krishnan et al., 1999), rule extraction is a search problem, because it requires exploring a space of candidate rules and testing each individual candidate against the network to determine whether it is valid. This validity refers to the property of a rule of describing what happens inside the neural network; in particular, the behaviour of the network for those instances (inputs) that will form the antecedents of such rules. A rule-extraction method searches until all the maximally-general rules have been found.
Heuristics have also been employed to limit the combinatorics of the rule search; for instance, the number of elements in the rule antecedent has been bounded, and/or the search has been limited to combinations of elements that actually occur in the training dataset. Generally speaking, these approaches have been efficient mechanisms for making neural networks understandable. In ARM, rules are generated to explain the events occurring in a data source D. Their generation is realised by performing a process which resembles the counting of patterns in the high-dimensional space formed by D. During rule generation, an algorithm typically uses a data structure (e.g., tries, hash trees, and many others) to represent the findings (patterns and their properties) and to speed up the mining process. Moreover, heuristics have also been proposed to prune the exponential itemset search space formed by the items of D. Hence, based on the similarity between the extraction of rules from neural networks for classification tasks and the process for generating association rules from a database, it may be possible to extract association rules from an ANN through a mechanism which can perform the correct interpretation of the knowledge embedded in the weight matrix. Activities and Strategies in ARM. As a result of all the diversity of approaches to ARM proposed since 1993, a framework, which we have summarised and depicted in Figure 1.3, can be defined. Taking into account the current capabilities of neural networks, we could begin by assuming that they may be seen as alternatives for the following activities constituting such a framework:

[Figure 1.3: Framework defined by the processes and strategies for the problem of association rule mining. Its conception is based on the support-confidence framework of Agrawal (Agrawal et al., 1993). The figure covers Data Definition (DD): data representation (binary or integer item-index) and data format (horizontal, vertical tidlist, sampling); Frequent Itemset Mining (FIM): candidate generation, itemset representation (tries, hash-trees, PATRICIA), traversing the search space (breadth-first or depth-first), pruning itemsets via the itemset-support monotonicity property, and the counting of support (counting, intersecting) against a minimal support threshold, yielding frequent, closed or maximal itemsets; and Rule Generation (RG): the calculation of confidence against a minimal confidence threshold, for which the frequent itemsets and their supports are needed.]

• Candidate-generation Strategies. To this day, researchers do not have a certain idea about what decreases the complexity of ARM (Bodon, 2006). Nevertheless, it has been stated that one possibility may be defined by the total number of itemsets which have to be checked by an algorithm during the process in order to detect those which are interesting. To reduce such a number (c), a phase called candidate generation has been included as part of the definition of some FIM approaches. This phase aims to generate the corresponding group of k-itemset candidates, which may turn out to be frequent after the calculation of their support (e.g., by counting), based on the information provided, for instance, by the (k-1)th iteration regarding the frequent (k-1)-itemsets.
The pruning of unnecessary itemsets has been made possible by a heuristic called the downward closure property of itemset support (Agrawal and Srikant, 1994): all subsets of a frequent itemset must also be frequent. Because the main purpose of this strategy is to determine, as early as possible in the mining process, which itemsets will turn out to be infrequent, a neural network which incrementally learns information provided by a FIM algorithm during an ARM process could be employed to predict which itemsets are not worth considering in the mining process. • Itemset-storage-structure Strategies. Most of the work produced for FIM has focused on this type of strategy, since it has been proven that the use of different data structures for representing the itemsets and candidates found during FIM can result in a reduction of time and memory costs (different data-structure-based algorithms are summarised in (Goethals, 2003)). These strategies aim to find the most compact and easiest-to-traverse data structure for representing the search space of FIM. Hence, we could think of employing a type of neural network which can organise its neural structure dynamically, based on the incoming patterns (itemsets), in order to form a representation of the search-space lattice onto which the itemsets can be mapped. This may be achieved by an unsupervised neural network. • Input-data-layout Strategies. One of the drawbacks of any DM technique is dealing with data of high dimensionality. This problem, better known as the curse of dimensionality, is a factor which directly affects the performance of any algorithm. Moreover, the performance tends to get worse when the problem concerned also involves scanning large datasets with dense patterns.
In order to tackle this nature of the data, solutions such as PCA (Principal Components Analysis) (Jolliffe, 1986) have been used to reduce the dimensionality of the original data before it is mined or visualised. It is important to mention that the representation of the input data plays an important role within the data-mining framework, because it directly affects the accuracy of the results and the performance of the algorithm. Thus, a trade-off always has to be made between performance and accuracy. In ARM, proposals have employed different layouts for the input dataset, involving either the representation (vertical or horizontal) in which the input data will be handled by the algorithm, or the quantity (sampling techniques) of transactions that will be employed to form the rules. It is possible that a neural network which has learnt interesting relationships inherent to the input dataset can also provide such a good representation of the associations; that is, input-data-attribute relationships will be encoded in a weight matrix whose size is considerably smaller than the original dataset. This manner of tackling the problem matches the activities performed in the extraction of knowledge from neural networks explained above. Moreover, this idea would lead to building a form of artificial memory which would learn and remember associations from the input data (imitating what a human would do in the same situation). Nevertheless, in order to form such memories, it is crucial to investigate whether the counting of patterns can be performed by ANNs. • The Maintenance of Rules. Since data does not only describe stationary environments, researchers have proposed incorporating the maintenance of rules as part of the general framework. For instance, this characteristic of data was first taken into account in (Cheung et al., 1996b).
To address the maintenance, the frequent-itemset-representation data structure is periodically updated with the changes occurring in the original dataset. The major inconvenience of these approaches is that they are performed on top of, or derived from, the traditional FIM algorithms, which brings a new complexity to the entire framework. Moreover, the re-use of knowledge from past mining activities is needed to realise the maintenance, which results in keeping certain information of considerable size throughout time and can turn into a new data-maintenance complexity for ARM. Hence, we believe it would not be mistaken to assume that a neural network with the capability of incremental learning can perform the maintenance of the itemsets by incorporating the changes suffered by the environment of the original data source. This idea can be interpreted as very closely related to the previous one (an artificial memory for ARM), since the ANN may be collecting new knowledge about the current state of the environment indefinitely throughout time. Other Factors. One of the advantages of neural networks over some typical algorithms is that they can be implemented in hardware, which means that we can think about having an ANN-based piece of hardware in the future, from which itemset support can be recalled and therefore FIM can be performed. Moreover, ANNs are considered to be parallel, since they simulate the way their biological counterparts work. Employing neural networks for data analysis brings the advantage of re-using their weight matrices for diverse mining tasks. For instance, it would no longer be necessary for analysts to run different environments in order to create models for clustering and association rules, if good results could be obtained from the same source of knowledge.
Moreover, discovering that such re-use happens for different purposes (not only for predictive but also descriptive tasks) in some of the current models of ANNs would reinforce the idea that this biologically-inspired technology models some conditions occurring in the human brain, in which different brain areas process or control different biological activities in response to stimuli by exploiting the knowledge allocated by their neurons.

[Figure 1.4: Neural-based framework for ARM. A neural network, trained (counting) on the changing environment D(t) ... D(t+k), acts as an itemset-support (frequent-pattern) memory; itemset support is queried, recalled and decoded from it by the FIM/ARM logic to produce rules.]

As described in this section, there are links between ARM and ANNs, which can be interpreted as part of the motivations for this research, suggesting that the use of an ANN for problems like ARM is possible. Thus, this research proceeds with the hypothesis that the ARM framework can be transformed in such a way that neural networks constitute the core of the process. This conception of the idea is illustrated in Figure 1.4.

1.3 The Neural-Network Candidates

An introduction to the models considered in this work is given here, together with some of the reasons behind their inclusion in this study. Details of the justification for research on each neural network are stated in Section 3.5 in Chapter 3.

Auto-Associative Memory (AAM). This neural network, as its name says, is trained to capture associations between incoming pairs of patterns for future recognition. The input patterns can be expressed in the form of unipolar vectors. In learning tasks in which labels or targets yi do not exist to be associated with the training data xi, this ANN allows the formation of associations between a pattern xi and itself.
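As a loose illustration of this auto-associative idea (a sketch of ours, not the correlation matrix memory studied later in the thesis), Hebbian outer-product training on binary patterns accumulates in each weight w[i][j] the number of training patterns in which items i and j co-occur:

```python
# Sketch: Hebbian (outer-product) training of an auto-associative memory
# on binary patterns. After training, w[i][j] equals the number of
# patterns in which items i and j were both active, i.e. a co-occurrence
# count. Names and data are illustrative only.

def train_aam(patterns, m):
    """Accumulate the outer product x x^T of each binary pattern x."""
    w = [[0] * m for _ in range(m)]
    for x in patterns:
        for i in range(m):
            for j in range(m):
                w[i][j] += x[i] * x[j]
    return w

# Three transactions over items {A, B, C, D} encoded as unipolar vectors.
patterns = [
    [1, 1, 0, 0],   # {A, B}
    [1, 1, 1, 0],   # {A, B, C}
    [0, 1, 0, 1],   # {B, D}
]
w = train_aam(patterns, m=4)
print(w[0][1])  # co-occurrences of A and B -> 2
print(w[1][1])  # occurrences of B alone (diagonal) -> 3
```

Nothing here exceeds standard Hebbian learning; the point is simply that frequency information can end up embedded in a weight matrix as a side effect of training.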
The maximum size of the weight matrix formed can be expected to be m-by-m for an auto-associative memory, where m refers to the pattern dimensionality. Under supervised training, it has the property of remembering the associativity expressed between the input patterns. Thus, it has the faculty of retrieving from its memory (weight matrix) the pattern most closely associated with an input pattern, which is normally corrupted.

Self-Organizing Map (SOM). This is one of the most commonly used neural networks for data mining. The main qualities of this neural network are: i) it can organise its structure according to the description of the dataset; ii) it learns from data in an entirely unsupervised training environment; iii) the maximum size of the map is highly likely to be smaller than the size of the original dataset; iv) a self-organising map has also inherited some properties of Vector Quantization, and therefore it compresses the distributions of the m-dimensional data into a two-dimensional map.

1.4 The Research Questions

Inspired by the critiques on the application of ANNs for DM, the research presented here aims to investigate whether artificial neural networks are suitable for a data-mining technique which concerns neither clustering nor classification; instead, it forms descriptions of the content of a database D by providing knowledge in a rule format: rules which describe not only the associativity among the elements or attributes of the patterns in a database, but also the frequency with which they occur; rules which are normally produced as a consequence of performing the counting of patterns registered in a database.
Although the course of this research could be directed towards the different possibilities laid out above, which are discussed further in Chapter 3, the scope of this thesis has been confined to investigating whether the itemset property of support (the frequency of patterns) can be reproduced (discovered or calculated) from the knowledge embedded in the weight matrix of our candidate neural networks. In other words, we tackle the problem of determining whether our chosen ANN candidates can be biologically-based solutions for ARM. In particular, we focus on determining whether they are suitable (whether they have the information resources and properties) for becoming artificial memories for ARM, able to count, store and recall itemset support after they have been trained with data modelling associations (binary patterns). Moreover, because data is dynamic, we also study the problem of the maintenance of the knowledge in the memories through the concept of incremental learning with neural networks. The reasons for moving the focus of the research in this direction can be summarized as follows:

The Description of Data by ANNs. Data is often described by the results given through methods which form general models or identify local patterns. ANNs have been used for the former, as an alternative to the traditional algorithms, in particular when predictions of new states of the learnt variables are needed. For the latter, ANNs are able to describe data by forming clusters of the patterns existing in the data with the information present in their learnt knowledge. Nevertheless, the idea of describing data from such knowledge in the form of association rules, as humans do from the knowledge captured by their brains, is still unexplored.

The Importance of Itemset Support for ARM.
Even though the itemset property of support has been criticised for not being the best metric to measure the interestingness of itemsets and therefore rules, its use and definition have been a fundamental part of the evolution of the approaches for ARM. In particular, it has an important role in the support-confidence framework, defined since the problem of ARM was introduced to the DM field. Therefore, in order to determine which patterns, associations or itemsets may be statistically important for the generation of association rules, a neural network should at least be able to know or calculate the occurrence frequency of the itemsets, i.e., their support in the learnt environment.

Reproducing Counting Abilities with ANNs. Counting is an important activity in biological systems since it is a method of summarising what occurs in an environment. It also takes part in the learning process (Gardner-Medwin and Barlow, 2001). Moreover, results of counting are an important part of our daily knowledge, which is often utilised for decision making. Reproducing counting abilities in artificial models is a challenging and relevant task for the modelling and conception of what may be happening in real life (in the human brain). Moreover, performing counting with ANNs can also be seen as the development of interpretations of, or knowledge-extraction mechanisms for, their weight matrices in order to reproduce values which describe the frequency of events (patterns, itemsets) occurring within the data.

Building Artificial Memories for ARM. In order to develop the framework illustrated in Figure 1.4, it is important to start by defining an ANN architecture which can learn, recall, and store knowledge about its environment (the frequency of patterns) autonomously.
This would result in its hypotheses and recalls being formed exclusively from its own knowledge, previously captured directly from the environment, rather than depending on results from the counting of patterns given by other proposals. In addition, different networks could be incorporated into the system in order to make more complex hypotheses from the knowledge stored in the ANN-based memory for other purposes.

In detail, the questions which we are interested in answering with this thesis are as follows:

• ANNs have been applied to form data models for classification and clustering, for instance, but can they be used for descriptive data-mining techniques in which the aim is to represent the data in the form of association rules?

• The interpretation of the knowledge (weight matrix, mapping) generated by ANNs has been a permanent research task in the ANN field. In particular, mechanisms have been developed to translate hypotheses formed by the nodes of a neural network into symbolic representations for classification problems. Nevertheless, would the knowledge learnt by an ANN be useful for describing associations among the elements of a database?

• In the support-confidence framework of ARM, FIM (the first stage of ARM) calculates a metric called support to produce the raw material, known as the frequent itemsets, for the generation of rules. Could it be that our chosen neural networks hold some knowledge describing the frequency of patterns distributed along their nodes? That is, could the results of counting be generated from our ANN candidates?

• Over the last decade, an implicit framework has been built in ARM, illustrated in Figure 1.3, which is replete with processes and strategies designed to break down the complexity of the data-mining task. Could an itemset-support memory based on an ANN take part in such a framework as a substitute for the original database?
• If the implementation of such a neural-network itemset-support memory were feasible, could it continue accumulating knowledge of the pattern frequency throughout time while the original data environment changes?

In general terms, this thesis focuses on the process of counting patterns and the maintenance of the counters with ANNs, in order to propose their inclusion as itemset-support memories in frameworks which involve tasks like FIM. It is important to note that the development of this topic, artificial neural networks for association rule mining, as summarized in Chapter 3, has rarely been undertaken. Additionally, to the best of our knowledge, the specific problem with which this thesis deals has not yet been studied by other researchers. Therefore, this work primarily focuses on investigating which biologically-based capabilities, in conjunction with the different learning paradigms (supervised, unsupervised and hybrid learning) and architectures of the neural networks, can be used initially for the calculation of itemset support and subsequently for its maintenance.

1.4.1 Aims and Objectives

Having defined the problem domain and focus of the thesis, and having selected two neural networks as candidates, this work aims to achieve the following:

• To gain some insight into the use of neural networks for techniques involving neither induction nor deduction (classification, prediction), but transduction (association rule mining) (Kantardzic, 2002).

– Normally, ANNs have been used to create global models by learning the dependencies of some given data (induction) in order to use such models for predicting outputs for future values (deduction). Therefore, it will be investigated whether our ANN models can reproduce the outputs normally given by the application of a transduction approach, such as the process of association rule mining.
• To investigate whether the counting of patterns, which is normally realised in the high-dimensional space defined by the training data, can be reproduced from the information defined within the low-dimensional space formed by the weight matrix of a trained neural network.

– The exploration of counting with ANNs has been proposed in (Gardner-Medwin and Barlow, 2001). However, many assumptions on two theoretical models were made in order to make that proposal. In our case, an analysis of two well-known neural networks, a self-organising map and an auto-associative memory, will be made to determine whether these two models perform pattern counting (collect knowledge about frequency) while they learn.

• To develop mechanisms of knowledge extraction for the weight matrix of the ANN candidates in order to generate (calculate) itemset support.

– Once it has been established that, as a result of training, pattern frequency is embedded in the knowledge of a neural network, we will focus on giving the right interpretation to the weight matrix in order to recall the frequency (support) of patterns (itemsets), which can be composed of different elements (items) of the learnt environment.

• To propose how the pattern-frequency knowledge in a neural network can be maintained while the environment is changing.

– In particular, a proposal on this topic will be given by exploiting the incremental learning abilities of a self-organising map, since it is an ANN which learns without supervision.

• To propose the development of a neural framework for tasks like ARM by proving that itemset support can be taught, stored and recalled from an artificial memory based on either a self-organising map or an auto-associative memory.

– To consider our neural candidates as itemset-support memories, it is important to evaluate the accuracy of their results. Therefore, a comparison of the neural networks against the values given by the counting process performed by Apriori will be carried out.
Since ANNs have been successfully applied to problems involving classification, prediction and clustering, this work aims to complement that success by serving as evidence that the knowledge learnt by them under normal training conditions can be re-used for data-mining tasks whose performance requires knowledge or information about the frequency of the training patterns.

1.5 Organisation

The remaining chapters of the thesis are organized as follows:

• We first provide the formal definition of ARM in Chapter 2. In particular, we focus on describing the sub-task of FIM. Moreover, since the Apriori algorithm, which is the gold standard and historically the most important algorithm for ARM, will be used to compare our results, its definition is given in Appendix A.

• The limited literature related to our aims is summarised in Chapter 3. Details regarding the characteristics needed in a neural network to be able to perform FIM are given there, and we justify the study of our neural-network candidates. Moreover, conceptions are given of how the ARM problem can be understood from an ANN point of view.

• Chapter 4 describes our own interpretation of how the counting process is realized by an auto-associative memory. In particular, a correlation matrix memory is studied, and two methods to estimate itemset support from its weight matrix are proposed. Information about the CMM training is given in Appendix B.

• Chapter 5 explores the idea of using a self-organising map for the counting of patterns. Unlike other approaches applying SOM to ARM, our approach, called PISM (Probabilistic Itemset-support eStimation Mechanism), which is based on probability theory, only uses the information coded by the best-matching units of the map for the estimation (recall) of itemset support.
Our novel perspective on the use of SOM for ARM alleviates the drawbacks presented by similar approaches, which limit the use of SOM to merely clustering the input data before FIM is performed. Information about the SOM training is given in Appendix B.

• Chapter 6 develops the idea that the incremental learning capability of neural networks can be exploited for the task of the maintenance of rules in ARM. In particular, a study with the SOM is carried out to update itemset support on the map. Since batch training has been considered unsuitable for the update of a SOM in non-stationary environments, we have developed BincrementalSOM (Batch incremental training for SOM), a mechanism based on using the knowledge captured by the best-matching units to update the information of a SOM regarding the itemset support occurring in these environments.

• Finally, the conclusions drawn by this thesis, its limitations and future directions are all presented in Chapter 7.

Chapter 2

Association Rule Mining

2.1 Introduction

In the field of DM (Data Mining), the techniques responsible for the analysis of data are often classified into predictive and descriptive. While the former aims to build models from data for producing future inferences about it, the latter performs the discovery of hidden patterns in data. Among the descriptive techniques, ARM (Association Rule Mining) excels because of its simple manner of describing important patterns existing in data. Roughly speaking, it is a technique which searches for patterns describing relevant associations among attributes, elements, or items of a given data source or environment. When ARM was introduced in (Agrawal et al., 1993), this DM task was only proposed for the analysis of market-basket data, but today its application has been extended to a variety of other domains.
Taking into account the direct manner in which ARM forms its knowledge from an environment, which does not involve the building of a model, ARM has been classified as a transductive learning process (Kantardzic, 2002). The knowledge or inferences generated by ARM are described by a group of rules with the format X → Y, which symbolise associations of patterns occurring in data. As the total number of possible rules generated from given data can be exponentially large, their complete generation is considered infeasible; therefore, ARM proposals have focused on generating only the relevant rules, based on calculating metrics which define their interestingness. For instance, state-of-the-art algorithms like Apriori, developed independently in (Agrawal and Srikant, 1994; Mannila et al., 1994), are based on the definition of the support-confidence framework (Agrawal et al., 1993), which states that ARM must be tackled by dividing it into the tasks of FIM (Frequent Itemset Mining) and RG (Rule Generation). That is, rules are derived from knowledge known as frequent itemsets, discovered by FIM. Overall, ARM aims for the discovery of interesting rules which satisfy minimum constraints on, for instance, the properties of itemset support and rule confidence, which are respectively involved in FIM and RG.

ARM has been addressed by a large variety of algorithms. A description of the most well-known algorithms can be found in (Goethals, 2003; Bodon, 2006; Aaron Ceglar, 2006). The strategies to break down the complexity of the problem, on which the current algorithms are based, have involved:

• The usage of different data structures to represent the knowledge or findings during the mining process.

• The exploration of different layouts to represent the input data.

• The implementation of efficient strategies to traverse the search space.

• The definition of optimization methods for the reduction of I/O operations.
This has involved the definition of parallel and distributed approaches.

• The approximation of the rule properties to reduce the number of data scans.

• The type of target knowledge from which rules will be derived.

In the history of ARM, the Apriori algorithm has been so important that its philosophy has been utilised, in one form or another, in the development of the state-of-the-art algorithms. Although many approaches have claimed to reduce the complexity of the problem, the real causes that produce good performance in the algorithms are still uncertain (Bodon, 2006), as are the different sources from which these rules can be generated. Hence, it is important to continue research in this area in order to find answers to the current enigmas.

2.2 The Scope of ARM

As a technique which makes inferences from data, ARM can be seen as addressing the generation of rules R from an environment, represented by a dataset or database D, through an algorithm A in charge of searching for rules which must satisfy thresholds T set up for a mining exercise. Such a search is typically conducted by the usage of diverse strategies S. In the best theoretical scenario, ARM should be tackled by an algorithm whose performance, represented by a function FA, minimizes Equation 2.1. That is, R, satisfying T, should be produced by the execution of A, which minimizes Ω, defining the complexity and computational resources utilized in a mining exercise over D, based on the use of S.

R = min_S FA(Ω, D, T)    (2.1)

2.2.1 Formal Definition

Let any symbol, literal or element ij from a set I = {i1, i2, ..., im} be called an item, and a grouping of items X such that X ⊆ I be called an itemset. In particular, an itemset X with k = |X| is called a k-itemset. Let D be a set of n transactions, events or patterns grouped in a dataset or database representing an environment.
Each transaction ti is defined by a unique identifier id together with an itemset Y, satisfying Y ⊆ I. A transaction ti is said to hold or support an itemset X iff X ⊆ ti.Y. A basic association rule is an implication A → B in which the itemsets A, B ⊂ I and A ∩ B = ∅.

The support of an itemset X with respect to D is defined as the fraction of transactions supporting it:

supp(X) = |{ti | ti ∈ D ∧ ti.Y ⊇ X}| / |D|    (2.2)

In the case of a rule A → B, its support is given by the support of A ∪ B; that is, supp(A → B) = supp(A ∪ B). Since the support of X defines the occurrence frequency of X in D, it can also be understood as the probability of X, P(X). The strength of a rule in D is defined by its confidence:

conf(A → B) = supp(A ∪ B) / supp(A)    (2.3)

Since it is impractical and undesirable to mine and generate the total space of itemsets and rules, which grows exponentially as a function of |I|, the challenge of ARM has been to discover only those which are interesting. The interestingness of an itemset and a rule is determined by evaluating the properties of support and confidence against the thresholds minsupp and minconf respectively. For instance, an itemset X is frequent iff supp(X) ≥ minsupp. Hence, based on the above, which describes the support-confidence framework (Agrawal et al., 1993), the aim of ARM is: 1) to discover a set F = {Xi ∈ D | supp(Xi) ≥ minsupp} representing all frequent itemsets out of a space of 2^m possible ones, and 2) to perform rule generation with the information in F. In a simplified manner, association rule mining is the result of performing frequent itemset mining and rule generation.

2.3 Frequent Itemset Mining

Frequent itemset mining has become an essential factor in the generation of association rules from data because it is in charge of seeking out the right raw material, known as frequent itemsets, from which rules are derived.
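A direct reading of the definitions above (support, confidence and the frequent set F) can be sketched as follows; the toy database and names are our own illustration, and the exhaustive enumeration is only feasible because m is tiny:

```python
# Sketch: support (Eq. 2.2), confidence (Eq. 2.3) and the frequent set F
# (Eq. 2.4), computed naively over a toy database D of four transactions.
from itertools import combinations

D = [frozenset("AB"), frozenset("ABC"), frozenset("AC"), frozenset("BC")]
I = sorted(set().union(*D))                      # the m items

def supp(X):
    """Fraction of transactions whose itemset contains X."""
    return sum(1 for t in D if X <= t) / len(D)

def conf(A, B):
    """conf(A -> B) = supp(A u B) / supp(A)."""
    return supp(A | B) / supp(A)

# The full search space SI has 2^m - 1 itemsets (Eq. 2.6).
SI = [frozenset(c) for k in range(1, len(I) + 1)
      for c in combinations(I, k)]
assert len(SI) == 2 ** len(I) - 1

minsupp = 0.5
F = {X for X in SI if supp(X) >= minsupp}        # the frequent itemsets
print(supp(frozenset("A")))                      # 0.75
print(conf(frozenset("A"), frozenset("B")))      # 0.5 / 0.75
print(sorted("".join(sorted(X)) for X in F))     # ['A','AB','AC','B','BC','C']
```

For realistic m, exhaustively enumerating SI is exactly what FIM algorithms must avoid, as the rest of this chapter discusses.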
Since its complexity mainly defines the complexity of ARM, it has been the focus of attention of researchers who have sought to develop an algorithm satisfying Equation 2.1; that is, an algorithm whose performance does not deteriorate abruptly under the possible conditions presented by the data and thresholds. FIM is a combinatorial search problem which aims to form a set of itemsets F by discovering the frequent ones from an itemset search space SI defined by the m items of a data source D. The target set can then be stated as follows:

F = {Xi ∈ SI | supp(Xi) ≥ minsupp}    (2.4)

in which the search space SI, often represented by a lattice structure as in Figure 2.1, is defined by

SI = {X ∈ C(m, k) | ∀k : k > 0 ∧ k ≤ |I| ∧ m = |I|}    (2.5)

whose size satisfies

|SI| = Σ_{k=1}^{m} C(m, k) = 2^m − 1    (2.6)

which is the total number of combinations, associations, or itemsets that define SI and which could occur in D.

Figure 2.1: Example of an itemset-search-space lattice. In this case, the data space is formed by 4 items. Indexes represent the lexicographic order of the itemsets in the space.

2.3.1 The Calculation of Itemset Support

In order to discover the set of frequent itemsets from which association rules will be derived, the itemset property of support must be calculated. This property not only defines the frequency or probability of occurrence of an itemset within the mined data, but also represents a metric to determine the importance or interestingness of an itemset in the mining process. To achieve this calculation, three approaches have been proposed by the state-of-the-art algorithms:

1. By occurrence counting. In this case, each itemset Xi under investigation has an associated counter which is incremented whenever it is discovered that ti ⊇ Xi while the scanning of D is carried out.
Since it is not feasible to count all the possible itemsets of the defined search space, algorithms in this category often make use of a procedure for CG (Candidate Generation) in order to focus the support calculation just on potential itemsets called candidates. A CG procedure is a function which forms candidates based on the frequent itemsets already discovered. At the beginning of ARM, these potential itemsets were compared to each transaction of D to determine their corresponding support; nevertheless, in order to reduce the runtime of ARM, the projection of the transactions into data structures representing the candidate itemsets has been proposed. No error is produced in the itemset-support values generated from occurrence counting, since they define the real number of appearances of an investigated itemset in the data. Occurrence counting is normally utilised by algorithms which perform a breadth-first search and/or make use of the horizontal layout for the input data. Important approaches based on occurrence counting are the Apriori (Agrawal and Srikant, 1994; Mannila et al., 1994) and FP-growth (Han et al., 2000b) algorithms.

2. By set intersections. In this type of approach, the typical horizontal layout of the input data D is replaced by a vertical one. Therefore, contrary to traditional transactions, each item ij has an associated list, known as a tidlist, containing the identifiers of the transactions that support it. In this case, a candidate C, which represents X ∪ Y, is formed by an intersection such that C.tidlist = X.tidlist ∩ Y.tidlist. Therefore, the support of an itemset is determined by |C.tidlist|. A good representative algorithm based on set intersections is Eclat (Zaki, 2000). Similar to occurrence-counting-based values, the support values generated through set intersections are errorless.

3. By estimation.
As the itemset support calculated by the above approaches involves scanning the original data in one way or another, they are said to calculate the real itemset support. Nevertheless, since the number of candidates whose support needs to be determined, and the number of transactions or lists representing the input data, can be large in real life, algorithms based on support estimation aim to infer this itemset property without having to use the original data. That is, they propose using other information to make the estimations; for instance, information defined by the frequent itemsets discovered during the process. The advantage of this type of approach over the traditional ones is that the number of candidates generated, as well as the number of data passes, is reduced. Since the itemset support produced is an estimation and not a calculation, the total number of itemsets found, and consequently the rules formed, can be affected by the quality of the estimations. One example of itemset-support estimation is the PASCAL algorithm in (Bastide et al., 2000).

2.4 Taxonomy of the FIMers

There is no unique classification of the current algorithms for FIM. Therefore, in this section, we form a taxonomy based on the different strategies proposed for the benefit of the mining task.

Input Data Definition Strategy. This type of strategy concerns either the number of transactions needed to describe the tendencies of the items in the original data D satisfactorily, or the type of layout used to define D. For instance, it has been proposed that transforming the natural horizontal layout of D into a vertical one, formed by sets describing the behavior of each item, can provide advantages for the mining; cf. (Shenoy et al., 2000).
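As a toy illustration of this vertical layout (code and data are our own, in the spirit of the set intersections of Section 2.3.1):

```python
# Sketch: vertical layout and support by tidlist intersection.
# Each item maps to the set of identifiers of transactions supporting it.

D = [frozenset("AB"), frozenset("ABC"), frozenset("AC"), frozenset("BC")]

tidlist = {}
for tid, t in enumerate(D):
    for item in t:
        tidlist.setdefault(item, set()).add(tid)

# supp({A, C}) without rescanning D: intersect the two tidlists.
ac = tidlist["A"] & tidlist["C"]
print(sorted(ac))         # transactions containing both A and C
print(len(ac) / len(D))   # supp({A, C}) = 0.5
```

Once the vertical layout is built, the support of any candidate follows from intersections alone, which is the basis of Eclat-style algorithms.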
Moreover, since D is often large in real life, it has also been investigated whether the tendency of the items, defined in the n transactions of D, can be captured by just k of them. Therefore, the use of a sample d to define D has already been studied and tested (Toivonen, 1996; Zaki et al., 1996). As FIM is performed over d rather than D, discrepancies in the number of frequent itemsets discovered, along with their corresponding support, can be found with respect to the traditional approaches.

Itemset Support Calculation Strategy. These strategies refer to the manner in which the support of itemset candidates is calculated during a mining process. As described above in Section 2.3.1, there are three main approaches utilised by the current algorithms.

Itemset Storage Structure Strategy. While an algorithm searches the space, the knowledge already discovered, which is formed by the frequent itemsets and their respective support and is often used for the generation of new candidates, has to be stored or mapped into a data structure. Since the size of this data structure directly influences the performance of a FIM algorithm, a large part of the research in the field has been devoted to finding the most suitable data structure for ARM. This has involved looking for a data structure that allows not only itemsets to be represented compactly, but also fast access within the structure. In other words, this type of strategy aims to produce the most effective itemset storage structure which can be successfully exploited by an algorithm. The data structures proposed have involved: hash trees (Agrawal and Srikant, 1994), enumeration-set trees (Agarwal et al., 2001; Coenen et al., 2004a), ad-hoc trees (Han et al., 2000c; Coenen et al., 2004b), matrices (El-Haji and Zaiane, 2003), arrays (Grahne and Zhu, 2005; Liu et al., 2002), tries (Woon et al., 2004) and others (Goethals, 2003).

Search Itemset Space Strategy.
Since FIM seeks the most interesting itemsets within a search space that can be represented by a tree structure, approaches have investigated efficient manners of traversing such a space. For instance, algorithms based on Breadth-First Search (BFS) and Depth-First Search (DFS) have been proposed. The former is also known as a level-wise search, since the algorithm moves level by level, generating candidates and discovering frequent k-itemsets until no more candidates exist. In contrast, a DFS-based or class-wise algorithm works recursively by checking a family of k-itemsets before it moves on to a new one. In order to perform the best search, complementary techniques, involving heuristics and procedures, have also been proposed. The objective, in this case, is to lead the course of an algorithm to focus on certain areas of the space in which interesting itemsets are likely to be found. The most frequently used heuristic is defined by the downward closure property of itemset support, which establishes that all subsets of a frequent itemset must also be frequent. This is a consequence of the anti-monotonic relationship between the support σ and the number k of items defining an itemset within the space; that is, while k increases over the m levels of a search space, σ tends to decrease. Among the procedures, candidate generation has been one of the most used, since it prunes unpromising candidate itemsets as soon as possible during a mining exercise. Hence, the number of elements visited and checked in the space is reduced to just those itemsets which are likely to become frequent. By considering the properties of the frequent itemsets, for instance their location in the search space, which often turns out to be at the top of the lattice, approaches such as (Mannila and Toivonen, 1997) and (Goethals, 2002) have respectively pointed out the existence of borders between the frequent and infrequent itemsets and improved the generation of candidates.
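The downward-closure pruning described above can be sketched as follows (an illustrative simplification of ours, not Apriori's exact join step):

```python
# Sketch: level-wise candidate generation with downward-closure pruning.
# Fk is a set of frequent k-itemsets; candidates of size k+1 are kept
# only if every one of their k-subsets is itself frequent.
from itertools import combinations

def gen_candidates(Fk):
    cands = set()
    for X in Fk:
        for Y in Fk:
            U = X | Y
            if len(U) == len(X) + 1:   # join two k-itemsets sharing k-1 items
                cands.add(U)
    return {C for C in cands
            if all(frozenset(s) in Fk for s in combinations(C, len(C) - 1))}

F2 = {frozenset("AB"), frozenset("AC"), frozenset("BC")}
print(gen_candidates(F2))        # candidate {A, B, C} survives: all 2-subsets frequent

F2 = {frozenset("AB"), frozenset("AC")}  # here {B, C} is not frequent
print(gen_candidates(F2))        # {A, B, C} is pruned: empty set
```

The second call shows the heuristic at work: one infrequent subset is enough to discard a candidate without ever touching the data.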
Optimization Strategy. Approaches in this group have focused on finding the best way of administering the computational resources for an ARM process. That is, the aim of this category is to provide the best mining runtime by performing the best management of the hardware. For instance, some researchers have worked on the parallelization of ARM (Zaki et al., 1997b; Han et al., 1997; Joshi et al., 1999; Jin and Agrawal, 2002; Veloso, 2003). Since the formation of association rules can demand large amounts of memory, proposals such as (Goethals, 2004) have considered simple techniques to extend the capability of the algorithms. Other approaches have identified that most of the time is spent when the algorithms output their results; therefore, fast output routines have been developed (Rácz et al., 2005).

Target Knowledge Strategy. Although the target knowledge of ARM is defined by a set of rules representing item associations, most of the approaches have focused on undertaking only the discovery of the set of frequent itemsets F, because the rules can be generated from its elements. Based on the fact that the set F can still be large in real life, other approaches have investigated condensed representations of F with fewer members, from which rules can equally be generated. Therefore, attention has been given to the concepts of frequent maximal (Gunopulos et al., 1997) and closed (Pasquier et al., 1998) itemsets. The MFI (Maximal Frequent Itemsets) form a set M whose elements have no frequent supersets. In other words, M is composed of the itemsets lying on the boundary of the frequent space.
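The set M just defined can be illustrated with a short sketch (the toy set of frequent itemsets below is invented for illustration):

```python
def maximal(frequent):
    """MFI: keep only the itemsets that have no frequent proper superset."""
    return {x for x in frequent if not any(x < y for y in frequent)}

# Toy frequent set F: {A}, {B}, {C}, {A,B} and {A,C} are frequent.
F = {frozenset(s) for s in ("A", "B", "C", "AB", "AC")}
M = maximal(F)
print(sorted("".join(sorted(s)) for s in M))
```

Here M = {{A, B}, {A, C}}: every other frequent itemset is a subset of one of them, which is why M is a compact summary of F, and also why the supports of those subsets cannot be read back from M alone.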
Even though MFI-based algorithms have helped to reduce the complexity of ARM and to develop new pruning and search strategies, such as, for instance, top-down and/or bottom-up searches, the definition of M has been criticised because it is not possible to recover from it the support of the itemset subsets, which are also frequent by definition. Important algorithms with this conception are Maxminer (Roberto J. Bayardo, 1998), Mafia (Burdick et al., 2001), FPMax (Grahne and Zhu, 2003) and GenMax (Gouda and Zaki, 2001). In order to overcome this disadvantage of the set M, the discovery of CFI (Closed Frequent Itemsets) has been proposed, because the support of all frequent itemsets can be generated from the closed ones. An itemset Y is closed if no proper superset of Y exists that has the same support. Some CFI-based algorithms are A-Close (Pasquier et al., 1999), CLOSET (Pei et al., 2000) and CHARM (Zaki and Hsiao, 2002).

2.5 Conclusions

Since this thesis is investigating the suitability of two neural networks for association rule mining, we have given the relevant background on ARM in this chapter. In particular, we have focused on providing information on the first phase, known as frequent itemset mining. Some concepts about this mining task have been described above, because our main interest is to reproduce the itemset support values calculated by the current algorithms, which in one way or another have to perform pattern counting from data. In spite of the many algorithms, based on different strategies, that have been proposed for ARM or FIM since 1993, biologically-inspired approaches have rarely been investigated. Therefore, in the next chapters we develop some ideas regarding the creation of ANN-based approaches for the generation of this type of rules by exploiting the knowledge encoded in their weight matrix after training.
Chapter 3

Hypothetical Neural Network for Association Rule Mining

Unlike other DM research topics, for which the literature is vast, the topic of ANNs for ARM is characterised by a lack of research. Nevertheless, a summary of the little literature available on the topic will be presented in this chapter. We also justify the study of our neural network candidates for ARM. Additionally, we define the stages of the proposed ANN-based framework for ARM and list the properties that we consider relevant in an ANN for ARM. As part of the conclusions of this chapter, we will point out the differences between our research and the related existing work.

3.1 Literature Review

To the best of our knowledge, the use of neural networks for association rule mining was first explored by Sallans (Sallans, 1997). In this work, he made use of the unsupervised techniques of FA (Factor Analysis) and MGM (Mixture of Gaussian Models) for his study. This work linked ANNs and ARM due to the fact that classification rules can be generated from ANNs for classification. The series of transactions were simulated, based on some underlying patterns (seed patterns), which might or might not be correlated, and the addition of some noise. The conclusions of this study report that while the FA-based model could never converge on the ARM data, and no clear reason was found for this behavior, the MGM network showed that it is able to learn the seed patterns used to generate the noisy transactions. The worst performance of the MGM was observed when the patterns were highly correlated and the data was noisy, while the best performance resulted from uncorrelated patterns and clean data. This work states nothing regarding the calculation of support, nor on how the MGM model should be interpreted for the generation of itemsets or rules. Gupta et al.
(Gupta et al., 1999) undertook the problem of pruning or grouping final association rules for inspection and analysis. Although the work focused on proposing a new distance metric to group association rules, a SOM took part in the proposed grouping methodology. The idea developed was initially to calculate the distance values among the rules through their metric. Then, MDS (Multi-Dimensional Scaling) was employed to form a vectorial representation of the distances, which served as inputs to the SOM, which clustered such a vector space in order to visualize the rules. Due to the orientation of the work, it can be categorized as part of the techniques developed for the management and visualization of rules rather than their discovery, in which a neural network was used to cluster the rule space. In 2000, a frequent itemset algorithm based on the Hopfield neural network was presented by Gaber et al. (Gaber et al., 2000b). The work was inspired by the facts that the extraction of classification rules from trained neural networks has been demonstrated to be possible and that the Hopfield network has been used for combinatorial optimization problems. The authors concluded that a Hopfield network in an arrangement of n-by-p nodes with their proposed energy function should be enough to map the maximal frequent itemsets¹ of a given input set of n transactions and p items. Even if the idea of ANNs for ARM makes sense, the absence of experiments and results makes it difficult to consider this work as a formal solution to the problem. Moreover, no indications are given on how to interpret the neural network after its training. Even though the work of A.R. Gardner-Medwin and H.B. Barlow (Gardner-Medwin and Barlow, 2001) is not related to the topic of ANNs for ARM, it is highly relevant for our purposes since it deals with the problem of counting in distributed neural representations.
Initially, it was stated that: Learning about a causal or statistical association depends on comparing frequencies of joint occurrence with frequencies expected from separate occurrences, and to do this, events must somehow be counted... Hence, interested in how events can be counted by biological mechanisms, Gardner-Medwin and Barlow defined two theoretical neural models to explore the effects of counting. In this work, in the process of counting, each event E, representing some stimulation, produces an activity representation in the network (a representation or activity pattern is formed by the state activity [0,1] of the neurons).

¹ A maximal itemset is a variant of a frequent itemset. It can be understood as a super-frequent itemset whose derived itemsets are all frequent. Algorithms targeting this type of itemset were discussed in Chapter 2.

Two types of representations have been established to model the relation between events and their neuronal representations. First, a direct representation, which would model the ideal state for counting, is one that has at least one of the neurons active exclusively for the presented stimulation. Therefore, those neurons are defined to have a one-to-one relationship between their activity and the occurrence of the event. Second, a representation is known to be distributed when its active neurons also participate in the representation of other events in a counting epoch. Therefore, the relationship between active nodes, supporting distributed representations, and event frequencies can be stated to be many-to-many. They followed the idea that distributed representations exist in the biological models; therefore, the interference that results from the overlap of the distributed representations was identified as a problem which can be dealt with in two ways: (1) a direct representation must be generated, or (2) the frequency of an event must be estimated from the frequency of use of its individual active elements.
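A minimal numeric sketch in the spirit of remedy (2) follows; it counts pairwise co-activity with a Hebbian rule rather than individual usage, and the network size and toy events are our own invention, not the authors' exact model:

```python
import numpy as np

Z = 4                                    # number of binary neurons
W = np.zeros((Z, Z))                     # synaptic co-activity counts

events = [np.array(e) for e in ([1, 1, 0, 0],   # event E1, presented twice
                                [1, 1, 0, 0],
                                [0, 1, 1, 0])]  # event E2, overlapping E1
for x in events:
    W += np.outer(x, x)                  # Hebbian update: co-active pairs count

# Estimate how often E1 occurred from the pairwise counts of its pattern.
x = np.array([1, 1, 0, 0])
pairs = x.sum() * (x.sum() - 1)          # ordered active pairs, self excluded
estimate = (x @ W @ x - np.diag(W) @ x) / pairs
print(estimate)
```

In this toy case the pairwise estimate for E1 is exactly 2, because the pair (neuron 0, neuron 1) is unique to E1; estimating instead from the individual usage of the active neurons (the diagonal of W) would give (2 + 3)/2 = 2.5, illustrating the interference that overlapping distributed representations introduce.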
It was also noticed that although a distributed representation may increase the variance of the estimated counts and impair the speed and reliability of learning, it is often regarded as a desirable feature of the brain, since it brings the capacity to distinguish a large number of events with a finite number of neurons. As a result of this work, two neural models were proposed: (a) the projection model is composed of an arrangement of Z binary neurons which become activated or deactivated in response to events during a counting epoch. The frequency of occurrence of a particular event Ec is estimated from the usage of all the neurons that respond actively in its representation. In other words, the Z synaptic weights, which are proportional to the usages of the individual neurons, are summed into an accumulator neuron X to produce a frequency estimation. (b) the internal support model employs Z² synapses (full connectivity) to count all possible pairings of activity. Its excitatory synapses acquire strengths proportional to the number of times that pre- and postsynaptic neurons have been activated together during the events experienced in a counting epoch. In other words, the internal excitatory synapses within the network measure the frequency of co-occurrences of activity in pairs of neurons by a Hebbian mechanism. Therefore, the total internal activation, stabilizing the representation of an event Ec, is estimated by testing the effect of a diffuse inhibitory influence on the number of active neurons. In general, this theoretical work addressed the problem of counting on distributed representations by focusing on forming event frequency estimations, taking into account the usage of the neurons and the representation overlaps produced in a counting epoch. A procedure for mining association rules from a database for on-line recommendation is developed by Changchien and Lu (Changchien and Lu, 2001).
The implemented system depends on three different technologies: a star schema database, a SOM, and RST (Rough Set Theory). In particular, the SOM is used only to cluster the transformed and normalised transaction records. Since the authors determined that the SOM cannot itself explain the resulting clusters, they proposed association rules to explain their meaning. Hence, RST is employed to derive rules that explain the characteristics of each cluster and the attribute relationships among the different clusters. This work neither generates frequent itemsets nor calculates itemset support; instead, it uses a confidence metric based on RST to form rules. It does not discuss the accuracy of the final rules at all but, based on the explanation of the work, it can be classified as a soft-mining solution. Moreover, a strong dependency between the SOM and the database is present and is exploited for the performance of the entire system. Yang and Zhang (Shangming Yang, 2004) have also proposed that a SOM can be used for ARM because of its clustering properties, in particular due to the fact that its use makes clustering a two-level approach (first, the data is clustered with a SOM and then the SOM itself is clustered), which gives the benefit that the computational load decreases (Vesanto and Alhoniemi, 2000). This work provides neither any type of experimentation nor results. Roughly speaking, the authors only expose the idea that the splitting of the binary data, representing associations, into similar groups formed by a SOM may benefit the task of FIM, but no results are shown to support such an assumption. To continue with the improvement of their approaches to PPI (Protein-Protein Interaction) prediction methods, in which an ANN (Eom and Zhang, 2004) and an adaptation of ARM (Eom et al., 2004) were used separately, Eom et al.
decided to explore the idea of combining their past proposals in order to generate association rules directly from the ANN weight matrix to improve the accuracy of the protein predictions (Eom and Zhang, 2005; Eom, 2006). Mainly, a supervised ART-1 network is used to classify vectors Xi, modeling PPI attributes, into k different classes. After this network converges, a weight-to-rule decoding procedure is initialized to transform its weight matrix into a form of association rules. To form rules between input attributes and their corresponding class k, the authors state that a vector which maximizes the final value of the kth output node must be calculated. Taking into account the characteristics of the problem, Eom et al. stated that rule formation from an ANN can be addressed as a nonlinear integer optimization problem. Therefore, a GA (Genetic Algorithm) was used to perform the maximization of the objective function. The idea behind using a GA is that it looks for the best chromosome which can maximize the corresponding network output and, as a result, give the best combination of input features. Once these chromosomes have been discovered, they are decoded into a form of association rules called neural feature association rules for protein prediction. The few results presented show that the combination of an ANN and ARM provides better accuracy in the PPI prediction than its antecedents.

3.2 Hypothetical ARM Framework Based on ANNs

Because it has been established that concepts relevant for ARM, such as counting and association, take part in the process of learning performed by biological systems (Gardner-Medwin and Barlow, 2001), and knowing that such biological behaviors can be artificially imitated by using ANNs, we can state that an implicit ANN-based framework exists for tasks like association rule mining. This hypothetical framework, which is depicted in Figure 3.1, can be defined to be constituted by the following stages:

1. Environment Definition.
Roughly speaking, this stage focuses on performing tasks for the transformation, collection and manipulation of the data describing a collection of events of some environment. In this thesis, we use the original representation of the events, in which the association among their elements is formatted in binary. Nonetheless, finding a more advantageous representation of the input patterns, without losing either the hierarchical property among them or the associativity property of their elements, is a relevant clue for the improvement of their learning, in particular since patterns can be sparse and/or high-dimensional by nature.

[Figure 3.1: Hypothetical neural-based framework for ARM, showing the stages Environment, Learning, Artificial Memory, Knowledge Sharing, Extracting-Traversing, Extracting-Querying (incremental learning/counting), Task Logic and ARM Logic, the latter producing rules such as A->B (100%, 90%). In particular, this thesis focuses on developing an artificial memory for its purposes (colored area).]

2. Learning. This is a task mainly governed by the course of some learning algorithm responsible for modifying the neural-network architecture (nodes or weight matrix) in charge of acting as the artificial memory which accumulates the knowledge presented in the environment throughout time. Therefore, the learning algorithm is principally determined by the type of neural network used to learn the data coming from the environment. In our particular case, the training algorithms of a self-organising map and an auto-associative memory will be evaluated to fulfill this framework task.

3. Artificial Memory.
The purpose of this stage is important for the generation of rules because it forms the base knowledge of the framework from which the properties of the rules to be generated, and therefore the rules themselves, will be derived. The main quality that any ANN needs in order to become an artificial memory for ARM is that its embedded knowledge allows functions or methods F to be defined to describe properties of the learnt associations. Initially, for the generation of rules, itemset support must be calculated (recalled) from the nodes of the chosen neural network. In other words, it is necessary for the neural network to be able to act as an artificial memory which has the ability to produce the counting of patterns in order to identify their corresponding frequency of occurrence in the environment. Another way of comprehending the knowledge formed at this stage is as a new mapping or feature space in which the original associations or patterns have been quantified and coded in a compact representation, limited by the space formed by the nodes of the network, for future usage. Therefore, to generate rules from the resulting space, it is important to define the correct decoding of it, so that the frequency of the taught patterns (support) can be determined as an estimation of what is normally discovered in the original space through the counting of the patterns. Moreover, this embedded knowledge about the environment can be used to supply the formation (training) of other neural architectures in order to create more complex rules describing the environment. In this thesis, we focus on developing this section. Therefore, our two ANN candidates will be studied to establish whether they can become our desired itemset-support memory, by determining whether they can reproduce frequency knowledge about the patterns in the training data, which is usually generated by counting them.

4. Knowledge Sharing.
This is a task in charge of supplying or transferring knowledge from the main memory to procedures for the formation of new neural architectures (the training of other neural networks), so that the proposed framework can re-utilise the collected knowledge for other tasks, such as, for instance, the prediction of the behavior of associations in the original environment.

5. Extracting-Traversing and Extracting-Querying. Since the arrangement of nodes forming the artificial memory can be seen as a data structure in which knowledge of an environment is distributed and accumulated through training, it is necessary to build techniques which permit the satisfactory exploration, recovery and exploitation of the information defined in the neural structure. To accomplish these tasks, especially in the representation of the extracted knowledge, techniques produced in the ARM field could be reused; nevertheless, a re-definition of them needs to be done to adjust their performance to the new space defined by the weight-matrix structure.

6. Task Logic. This stage groups techniques and methodologies to lead the extraction of knowledge in order to solve a mining process. It concentrates some of the strategies to be followed in the extraction of knowledge from the memory in order to speed up the process. This stage will keep a direct relationship with the strategies defined to carry out the task of finding association rules. Its operational boundaries are not completely defined, as its aims can be supported by results given by other neural structures or by other stages existing in the framework.

7. ARM Logic. This is the stage which leads to the generation of rules from the knowledge embedded in the neural network. For instance, it will be responsible for applying the correct strategies for planning the best logistics of the mining exercise.
In other words, this stage would administer the total resources, data and processes, for the generation of association rules from a trained neural network holding information about events that occurred in some environment.

3.3 A Formal Definition of the Problem

Let D be a finite collection of n binary events or discrete patterns X, called itemsets, which represent different formations of associations among the m elements (items) of a set in an environment. Let Φ be a feature space or knowledge (weight matrix) formed by the m nodes of a neural model mANN trained with D. Under the original conditions of D, a value f (the itemset support), representing the frequency of occurrence of an association or pattern X in D, is calculated by performing the counting process P of that pattern in the environment, so that f(X|D) = P(X, D). Since mANN has already acquired knowledge about D, the aim is to reproduce an estimation f̂ of the frequency of occurrence of pattern X through the application of a decoding procedure Θ on the feature space Φ formed, such that f(X|D) ≈ f̂(X|Φ) = Θ(X, Φ). Therefore, the problem can be set up as seeking a definition of Θ for a particular Φ in order to produce frequency estimations for patterns in D.

3.4 Ideal ANN Characteristics for Building Memories for ARM

Based on the problem defined above, we consider it important to state some properties of ANNs which might be relevant for tackling our current problem, as follows:

• It is important to consider ANNs whose mapping or weight matrix is smaller than the size of the environment represented by the patterns in D. Therefore, quantization properties can be relevant for tackling the summarisation or compacting of the original environment.

• It is required that the neural-based model deals well with large n-dimensional patterns, particularly with data that can be formatted as binary arrays.

• The neural-network model will have to be able to learn new patterns without forgetting past knowledge.
This characteristic of neural networks is known as the stability-plasticity dilemma. Exploiting this characteristic of a neural network would avoid employing third-party processes for the maintenance of rules.

• Initially, neural networks that conduct unsupervised learning attract our attention because no information about the patterns or associations in the environment is known a priori. However, ANNs trained in a supervised manner can also be considered for our aims if they are able to learn the associations of the elements of the input pattern by supporting the definition of auto-associative inputs.

3.5 Reasons for Studying an AAM and a SOM for ARM

The world of neural networks is vast in terms of the different paradigms of learning algorithms (e.g., error-correction learning, Hebbian learning, competitive learning, etc.) and architectures (recurrent and feed-forward networks). It is the combination of learning algorithms and architectures which makes the generation of solutions for learning tasks such as pattern association, pattern recognition (classification), control, function approximation, filtering, and others possible. Since this thesis concerns investigating the suitability of ANNs for association rule mining, we have chosen two different neural networks from the large list of supervised and unsupervised candidates. The selection of an auto-associative memory and a self-organising map to take part in our study is mainly because of their properties, which are summarised as follows:

Auto-Associative Memory

Among the learning tasks, PA (Pattern Association) has caught our interest as a viable alternative for finding answers to the research questions, stated in Section 1.4 in Chapter 1, because it is a task which can be performed by involving the concepts of learning, memory and association.
To achieve pattern association, according to Ham and Kostanic (Ham and Kostanic, 2001), any neural network (e.g., feedforward multilayer perceptron networks, counterpropagation networks, radial basis function networks and associative memory networks) from the group of the mapping networks, whose general structure is shown in Figure 3.2, can be employed for this purpose.

Figure 3.2: General structure of a mapping neural network. This appears in (Ham and Kostanic, 2001).

As a mapping network, in which the input patterns are coded and projected into the synapse weights during training, an associative memory aims to imitate the memory capabilities of the human brain, which has the ability to retrieve and store information via the management of the association concept (Kohonen, 1978). In other words, this type of artificial neural network is taught to associate (O'Keefe, 1995). It learns the relationships, established by the pairs of input patterns, by storing them in a content-addressable manner. This neural network learns knowledge from its environment by exploiting the explicit associations defined by each of the input-pattern pairs {X, Y} utilized for its training. Moreover, the concept of association is present and used when information is retrieved from it. In other words, this neural network is able to give a response to the environment when a stimulus is presented to its inputs, by using the concept of association between the stimulus and its weight matrix. One of the main abilities of this neural network is to remember information given in its training. This is possible because it forms a mapping (weight matrix) in which the target patterns (memorized patterns) and the associations are organised and stored for future recalls, which can involve queries with corrupted or noisy stimuli. Like its biological counterpart (the human brain), this ANN also handles the concept of memory in two stages: the storage and the recall of information.
While the former phase refers to the training of the network, the latter refers to the extraction of information from the weight matrix in response to a stimulus. These two phases resemble the operations performed by the human in our example, stated in Chapter 1, who scans the data defining some shopping baskets (shopping transactions) to memorise most of the information in order to answer queries about the contents of the baskets, for instance, queries regarding the associations among the items bought. To recall information from its weight matrix, a stimulus is always required. Unlike conventional memories, associative memories have the ability to learn and generalise. Moreover, no exact location of the required information is needed for its recovery; instead, the recall is formed from the information distributed over its nodes. Although a task of pattern association can be confused with a task of PR (Pattern Recognition), which can be understood as the process whereby a received pattern is assigned to one of a set of prescribed classes (Haykin, 1999), a difference in the definition of the targets yi between these two tasks can be pointed out. Whereas an ANN for a pure PR task normally uses yi as a value to define the class ci (ci = yi) to which an input pattern xi belongs, an ANN for PA uses yi as a vector, known as the memorised pattern, to represent a pattern with which an input pattern, called the key pattern, will be associated. For our problem, which involves the estimation of itemset support from a weight matrix, we believe that a neural network for PA is practically more appropriate than one for PR, mainly because, in the definition of our problem, in particular in the part concerning the definition of the input data, which can be summarised as a dataset D composed of itemsets (binary patterns or binary transactions), there is no other information apart from the itemsets (transactions) themselves which could serve to fulfill the targets needed in a pure PR task.
Therefore, it can be stated that the target definition of the different learning tasks has been an important factor in leading this research towards the study of neural networks which let the data speak for themselves, especially in cases, such as ours, in which there are no labels associated with the input data before they are learnt. Since it has been stated that the target yi (memorised pattern) will be defined by the corresponding key pattern xi of each input pair, our current research can be confined to the study of the suitability of an AM (Auto-Associative Memory) for ARM, since this neural network holds the characteristic yi = xi ∀xi in D in its inputs. We focus on studying an AM because it also satisfies the properties of the theoretical internal support model depicted in Figure 3.3, which was defined by A.R. Gardner-Medwin and H.B. Barlow (Gardner-Medwin and Barlow, 2001) to study the limits of counting, the latter being a very important task needed for the generation of association rules from databases.

Figure 3.3: Outline of the theoretical internal support model defined in (Gardner-Medwin and Barlow, 2001) to produce the counting of patterns with distributed representations in a group of neurons.

According to Austin (Austin, 1996), almost any ANN used for pattern classification can be used for building an associative memory. Nevertheless, factors to be considered for real-life problems are the speed of training and the type of inputs they handle (Ham and Kostanic, 2001). The Hopfield neural network (Hopfield, 1982) can be regarded as a good candidate, since it is an ANN which learns and recovers memories (patterns) in a manner similar to the human brain. For instance, a Hopfield network is able to recover (remember) a taught pattern with just partial information given as a stimulus to the network. Hence, it can be stated that a characteristic of this ANN is robustness.
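This pattern-completion ability can be demonstrated with a minimal sketch (the bipolar toy pattern, the single stored memory and the number of update steps are our own choices):

```python
import numpy as np

p = np.array([1, -1, 1, -1, 1, -1])            # the taught bipolar pattern
W = np.outer(p, p) - np.eye(len(p))            # Hebbian weights, zero diagonal

probe = p.copy()
probe[0] = -1                                  # corrupt one element
for _ in range(5):                             # synchronous updates to settle
    probe = np.sign(W @ probe).astype(int)

print((probe == p).all())                      # the taught pattern is recovered
```

The corrupted stimulus falls into the basin of attraction of the stored memory, so the network state settles back onto the taught pattern, which is the robustness referred to above.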
A Hopfield network for ARM has already been proposed by Gaber et al. (Gaber et al., 2000a), from which we have concluded the following: this neural network was not used to calculate the support of the itemsets from a dataset; instead, the identification of a variant of them, called maximal itemsets, was proposed. Due to the nature of the maximal itemsets, the main drawback of this work is that, to calculate the support of all the itemsets derived from the discovered maximal itemsets, an extra pass over the training data is needed. In other words, even if the network has learnt to detect the maximal itemsets from an environment D, it is not capable of providing information about the support of all the possible frequent itemsets (associations) occurring in it. Taking Gaber's work as an antecedent, and considering the options available for building an AM presented, for instance, by Austin (Austin, 1996) and Ham and Kostanic (Ham and Kostanic, 2001), we have decided to study an AM based on a CMM (Correlation Matrix Memory), because a CMM is purely oriented to PA tasks, deals with binary and non-binary data, and has fast training and recall stages. A CMM, whose training is defined in Appendix B, has been chosen because its operation has been used to build more robust and complex ANN systems; cf. Kohonen's book (Kohonen, 1978), or the ADAM (Advanced Distributed Associative Memory) and AURA (Advanced Uncertain Reasoning Architecture) systems, both developed by Austin et al. (Austin and Stonham, 1987; Austin, 1995; Austin et al., 1995). Moreover, as will be explained in the next chapter, it has a natural ability for ARM, because it accumulates knowledge from its environment through a superposition of the memories representing the input pairs, which makes the estimation of itemset support possible.
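A rough numeric sketch of why superposition helps (our own illustration, not the procedure developed in the next chapter): when each binary transaction is auto-associated with itself and the outer products are summed, the weight matrix accumulates item and pair co-occurrence counts.

```python
import numpy as np

# Four binary transactions over m = 3 items (rows of D, our toy data).
D = np.array([[1, 1, 0], [1, 1, 1], [1, 0, 1], [0, 1, 0]])

W = np.zeros((3, 3), dtype=int)
for x in D:
    W += np.outer(x, x)        # superpose the memory of each auto-association

# Decoding: W[i, i] counts item i, and W[i, j] counts the 2-itemset {i, j}.
print(W[0, 0], W[0, 1])        # item 0 occurs 3 times; {0, 1} occurs twice
```

Note that a classical binary CMM stores the logical OR of the outer products, in which case the entries saturate at 1; the additive variant sketched here is what would make frequency information recoverable from the matrix.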
Self-Organising Map

According to Kohonen (Kohonen, 1996), various human sensory impressions are neurologically mapped into the brain so that spatial or other relations among stimuli correspond to spatial relations among the neurons organised into a two-dimensional map. Hence, in order to build a SOM-based memory, our aim is to determine whether the counting of patterns occurring in a high-dimensional space can be reproduced from the knowledge embedded in the two-dimensional mapping produced by a SOM.

Our study focuses on a SOM in Chapter 5 because it is an outstanding model among the unsupervised neural networks, forming its knowledge from the resources expressed by the training data by letting the data speak for themselves. Furthermore, the interpretation of its embedded knowledge has become an important activity for data analysis. For instance, the inspection of correlations among the learnt variables has been investigated by considering the similarities existing in the planes formed by such variables (Vesanto and Ahola, 1999). The visualisation of the contribution of each variable in the formation of the map has also been addressed (Kaski et al., 1998a). To facilitate visual inspection of the high-dimensional data, visual techniques for the cluster results given by a SOM, based on data projection methods, have been developed (Su and Chang, 2001). Also, the extraction of logical rules from trained self-organising networks, in particular for classification problems, has been explored to create different understandings from them (Hammer et al., 2002; Malone et al., 2006). More importantly, a SOM has the abilities to undertake tasks such as data clustering (Vesanto and Alhoniemi, 2000; Kiang, 2001) and vector quantization (Heskes, 2001), which make the formation of a compact representation of the training environment possible. Kohonen et al.
(Kohonen et al., 2000) have tackled the problem of producing massive maps in environments in which data naturally occur in large amounts, in order to perform their visual exploration. The usage of a SOM for combinatorial problems has already been explored, since the individual SOM neurons tend to learn the properties of the underlying distribution of the space in which they operate (Aras et al., 2003). The inclusion of SOM technology in the process of the generation of association rules has been investigated in (Changchien and Lu, 2001; Shangming Yang, 2004). Nevertheless, as stated above, these proposals have the disadvantage that neither work on the interpretation of the trained map for ARM was proposed, nor was the reproduction of the counting process investigated. On the contrary, these proposals have based their implementations on limiting the SOM to clustering the input data space, creating strong dependencies between the SOM clusters and the original data for the generation of association rules.

An indirect consideration of SOMs for ARM can be interpreted from the work of Heskes (Heskes, 2001), where the relationship between SOMs, VQ and Mixture Modelling is explored. In this work, neither itemset support nor rules are generated. Nevertheless, in one experiment, Heskes uses market-basket data in order to build a map to model the relationships between the items defining the transactions of a dataset, following the assumption that items of similar groups have similar co-occurrence frequencies with other items in the basket. In this case, the training data is defined by a matrix with the relative frequencies (support) of the items, in order to calculate conditional probabilities to be used in the distance metric proposed for the formation of the map.

Figure 3.4: Theoretical projection model defined in (Gardner-Medwin and Barlow, 2001) to produce the counting of patterns with distributed representations in a group of neurons.
Additionally, due to the way in which a SOM learns from its environment, we have categorised it as a practical representation of the theoretical projection model depicted in Figure 3.4, defined by A.R. Gardner-Medwin and H.B. Barlow (Gardner-Medwin and Barlow, 2001), in which the knowledge about the frequency of the pattern occurrences in an environment is distributed during learning across the neural components of the network. Therefore, we have assumed that estimations about the occurrences of patterns, known as itemset support in ARM, can be produced from a SOM by interpreting the local knowledge generated by the nodes of the map.

3.6 Similarities and Differences with Surveyed Approaches

Taking into account our aims and the current pieces of research on the topic of ANNs for ARM, we believe it is convenient to state how this research differs from the approaches summarised above:

Our work will not focus on detecting either seed patterns (Sallans, 1997) or maximal itemsets (Gaber et al., 2000b), because even if an ANN were able to detect or learn them correctly, it would still need to know at least one of their properties, such as, for instance, support, to measure the relevance of such patterns or associations in the environment in order to generate the desired rules. Therefore, this thesis concentrates on evaluating whether our ANN candidates can learn something about the support of the input associations in order to recall those values when a stimulus is presented to the network. Additionally, we believe that by capturing support with ANNs, the calculation of other metrics, for instance rule confidence, which defines the conditional probability among the itemsets forming the body of the rules, can be done straightforwardly from the same knowledge embedded in the memory.

Similar to other approaches, a SOM will be used.
Nevertheless, we do not want to restrict its abilities to just clustering data for ARM as proposed in (Changchien and Lu, 2001; Shangming Yang, 2004); instead, it is believed that, due to its unsupervised properties, the SOM is able to let the data speak for themselves, and therefore properties like the frequency of the training patterns must exist in the resultant mapping; its decoding is our target.

Because we believe the dependency between the proposed neural network and its training data shown in (Gaber et al., 2000b; Changchien and Lu, 2001; Shangming Yang, 2004) is a negative property for the future neural framework for ARM, we are interested in breaking it down and, alternatively, decoding the knowledge distributed within the nodes of the chosen ANNs in order to discover or calculate itemset support from it.

Regarding our main interest in finding a pattern-counting ability in our candidates, it can be stated that our work will produce applied neural models rather than theoretical ones as in (Gardner-Medwin and Barlow, 2001). However, our selected neural networks, a self-organising map and an auto-associative memory, can theoretically be categorised into the proposed projection and internal support models respectively. In addition, our studies will not ignore the statistical association factor in the input patterns, as was done in (Gardner-Medwin and Barlow, 2001). Therefore, our proposal will endeavour to estimate support values for arbitrary item combinations.

Probably the work most similar to ours, in the sense of generating association rules from an ANN and employing incremental training of ANNs for maintaining rules over time, is that of (Eom and Zhang, 2005; Eom, 2006); nevertheless, we focus on studying ANNs which can handle unsupervised tasks rather than supervised ones, because we believe that employing a supervised ANN would limit the operativity of the desired framework to only data which could be classified a priori.
The latter is not a feature of the problems often tackled by ARM. It is important to state that our work is about laying the baseline for the representation of a system in the format of association rules through the knowledge learnt by a neural network about such a system, and not about obtaining the optimal set of rules for that system, which we consider to be a further task once it has been shown that association rules can be produced from a trained neural network.

3.7 Conclusions

In this chapter, the literature involved in our prime objectives has been summarised. The proposed ANN-based framework for ARM has been explained in more detail. We stated some of the characteristics that we think are important for an ANN to have if it is to be used for developing a neural-based framework for ARM or similar tasks involving the counting of patterns. No less important, the reasons for studying an auto-associative memory and a self-organising map for association rule mining have been stated here. Additionally, we have specified the differences and affinities between this research and the current literature.

Chapter 4

An Auto-Associative Memory for ARM

To provide answers to the research questions stated in the introduction, in particular, to determine if a trained ANN can have the resources (information) in its weight matrix to answer queries regarding the support of the itemsets drawn from the training dataset, we begin by studying the suitability of an associative memory for association rule mining, because this particular ANN bases its operativity on the concept of association. This neural network exploits the associativity property among the components of the inputs not only for learning its environment, but also for emulating human-memory operations in the recall of information from the knowledge embedded in its weight matrix. In particular, we focus on studying an auto-associative memory based on a Correlation Matrix Memory (CMM).
After justifying the research on this type of neural network in Chapter 3, it will first be studied and analysed here whether changes need to be applied to the training rule of this neural network in order to learn not only the associations defined by the input patterns, but also information about the frequency of appearance of the patterns, which is needed for itemset-support calculations. Secondly, an interpretation of the resulting mapping (the weight matrix formed by supervised training) is proposed to perform itemset-support recalls when stimuli (queries about itemsets) are presented to the memory. To evaluate the accuracy of the recalls made by the associative memory through our proposals, we compare its results with the results obtained by the Apriori algorithm, which is an important algorithm for the calculation of support and consequently for the general process of association rule mining. Conclusions about the response of the associative memory for the calculation of itemset support will be given at the end of this chapter.

4.1 Correlation Matrix Memory for ARM

Since the operativity of a correlation matrix memory is known to comprise two stages, learning and recalling, we will first point out the natural ability of the CMM to learn information about the number of appearances of the components describing the input patterns. Secondly, we will propose how the mapping resulting from a correlation matrix memory should be understood in order to retrieve information regarding the support of any itemset, defined in the training patterns, when this memory is queried to recall such information.

4.1.1 The Learning of Itemset Support by a CMM

To begin, it is necessary to re-state that a CMM is a one-layer memory structure which is basically defined by a square array whose dimension is the number m of elements in the patterns (this statement on the memory size is only valid in the case of an auto-associative memory).
Therefore, the corresponding memory matrix, resulting from supervised training, is represented as M \in \mathbb{R}^{m \times m} or M \in \mathbb{B}^{m \times m} for non-binary and binary models respectively. In order to be trained in any environment D, both types of CMM, which have also been identified by O'Keefe (O'Keefe, 1995) as weighted or weightless memories respectively, require pairs of patterns representing associations of the form X → Y, in which an input or key pattern is associated with an output or memorised pattern.

In our case, the available data is signified by a group of n unipolar vectors (patterns or itemsets) in {0, 1}^m, in which a value of one assigned to some k elements or items is used to define their existence within the pattern; otherwise zero is assigned. That is, each input is an m-vector defining a particular association among items from the set I = {i_1, i_2, ..., i_m}. The corresponding associations to be presented to the memory may look like

\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}    (4.1)

in which both patterns, the key and the memorised, obtain their values from the n transactions in D. In particular, each pattern of each pair (X_k \in \{0,1\}^{m \times 1}, Y_k \in \{0,1\}^{m \times 1}) will take the same value defined by a transaction t_k. Hence, a pair k satisfies:

X_k \rightarrow Y_k = (X_k, Y_k)
X_k = \{x_{1k}, x_{2k}, \ldots, x_{mk}\} ; \quad Y_k = \{y_{1k}, y_{2k}, \ldots, y_{mk}\}    (4.2)
X_k = Y_k = t_k.x

To train a weighted CMM, the equation governing this procedure has been defined in the literature (Haykin, 1999; Ham and Kostanic, 2001) as follows:

M = \sum_{k=1}^{n} Y_k X_k^T    (4.3)

It is the term Y_k X_k^T, or outer product, which has been determined to be an estimation of the weight matrix W(k) of the neural network functioning as a linear associator (Ham and Kostanic, 2001). This matrix W(k), which associates or maps Y_k onto X_k, forms a mapping representing solely the association described by the k-th input pair in turn. Therefore, a resulting matrix M must be understood as a grouping or encoding (sum) of the n weighted matrices W.
This summing or correlation among matrices can be illustrated as in Figure 4.1.

Figure 4.1: Illustration of the accumulation of knowledge by a CMM.

Individual weight values of the network, whose update resembles a generalisation of the Hebbian learning rule, can be expressed by

w_{ij} = \sum_{k=1}^{n} y_{ik} x_{jk}    (4.4)

As a consequence of using unipolar elements as inputs, the product y_{ik} x_{jk} of some k-th pattern will be in one of the two following states:

y_{ik} x_{jk} = \begin{cases} 1 & \text{existence of association } ij \text{ in } k \\ 0 & \text{otherwise} \end{cases}    (4.5)

In particular, it can be noticed that a matrix k, representing a taught association between patterns X and Y, will show existence values in those components which satisfy:

w_{ij} = 1 \;\; \forall \{i, j\} \mid \{i, j\} \in P(\{j \mid i_j = 1 \in X\}, 2)    (4.6)

in which the pairs of indexes {i, j} derived from P, which represents the set of permutations with repetition, taken two at a time, among the indexes of the items that exist in the input pattern X, define the matrix elements that need to be updated to learn the corresponding auto-association of the pattern X.

Being conscious of the sum operation performed by the CMM to learn the incoming data, it can then be deduced from Equation 4.3 that, for the benefit of this thesis, the value at each component w_{ij} not only indicates the existence of an association between elements i and j, but also the number of times that such an association has occurred in the environment. Therefore, we can assert that a weighted CMM naturally builds a frequency matrix when its training involves unipolar inputs. Discovering this characteristic in this ANN has been relevant because the values associated with its nodes can be used, as will be explained further later, to recall itemset support.
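As a minimal sketch of this accumulation, the auto-associative training of Equations 4.3 and 4.4 (with Y_k = X_k) can be written as follows; the function name and the toy transactions are our illustrative assumptions, not part of the thesis implementation:

```python
import numpy as np

def train_weighted_cmm(transactions, m):
    """Auto-associative weighted CMM training (Equation 4.3 with Y_k = X_k).

    Each transaction is learnt as the outer product x x^T, so the entry
    w_ij accumulates the number of co-occurrences of items i and j, and
    the diagonal entry w_ii counts the occurrences of item i itself.
    """
    M = np.zeros((m, m), dtype=int)
    for t in transactions:
        x = np.zeros(m, dtype=int)
        x[list(t)] = 1            # unipolar encoding of the itemset
        M += np.outer(x, x)       # Hebbian-style accumulation (sum, not OR)
    return M

# Toy environment D: four transactions over m = 3 items {0, 1, 2}
D = [{0, 1}, {0, 1}, {0, 2}, {1}]
M = train_weighted_cmm(D, 3)
# M[0, 0] counts item 0 (occurs in 3 transactions);
# M[0, 1] counts the pair {0, 1} (co-occurs in 2 transactions)
```

The resulting matrix is exactly the frequency matrix discussed above: no change to the traditional training rule is needed to gather the counts.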
In the case of the weightless CMM, training is based on:

w_{ij} = \bigvee_{k=1}^{n} y_{ik} x_{jk}    (4.7)

in which the accumulation of knowledge is produced by performing a superposition, through a bitwise-OR operator, among the different W matrices resulting from the different associations described by the training pairs (Austin and Stonham, 1987). That is, this type of CMM has the characteristic of unifying the Ws rather than summing them up. Therefore, the final weight matrix W fires its elements w_{ij} in those cases in which the individual matrices W have active intersections, in order to record the associations. This W is defined by

W = \bigvee_{k=1}^{n} X_k Y_k^T    (4.8)

Figure 4.2: Illustration of the accumulation of knowledge by a weightless CMM which has been modified to collect frequency information. The dark matrix illustrates the new matrix, called the frequency matrix Mf, which contains the corresponding pattern frequencies.

Unfortunately, the weightless CMM does not have the natural ability of its weighted counterpart to collect knowledge regarding the frequencies of the pattern components. Therefore, to reproduce such an ability, we propose to build a matrix as in Figure 4.2 for the gathering of such information; nevertheless, a proper justification for the construction and maintenance of this data structure needs to be found, particularly when it has been concluded that the weighted CMM performs this process naturally.

So far it has been found that, due to the auto-associative property of a weighted CMM, this ANN is able to learn information about the occurrences of the components of its inputs naturally. That is, no change was made to its traditional training to gather frequency knowledge about the input patterns. Hence, in the next section, we will explain how a trained CMM can be understood in order to estimate itemset support from it.
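For contrast, a sketch of the weightless variant together with the auxiliary frequency matrix Mf of Figure 4.2 can look as follows; the function name and toy data are our illustrative assumptions:

```python
import numpy as np

def train_weightless_cmm(transactions, m):
    """Weightless CMM training (Equation 4.8) plus the auxiliary
    frequency matrix Mf proposed in Figure 4.2, since the bitwise-OR
    superposition alone records existence but loses the counts."""
    W = np.zeros((m, m), dtype=bool)
    Mf = np.zeros((m, m), dtype=int)
    for t in transactions:
        x = np.zeros(m, dtype=bool)
        x[list(t)] = True
        assoc = np.outer(x, x)    # W(k) of the k-th training pair
        W |= assoc                # unify the Ws rather than sum them
        Mf += assoc.astype(int)   # extra structure gathers the frequencies
    return W, Mf

W, Mf = train_weightless_cmm([{0, 1}, {0, 1}, {0, 2}], 3)
# W[0, 1] is True (the association exists) but carries no count;
# Mf[0, 1] records that the pair {0, 1} occurred twice
```

The pair of structures makes explicit why the weightless memory needs the extra matrix while the weighted memory does not.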
4.1.2 Recalling Itemset Support from The Weight Matrix of a CMM

One important characteristic of the frequency matrix formed by a CMM is that it is symmetric. This allows our desired itemset-support recall mechanism to focus on the knowledge of only m(m + 1)/2 nodes rather than the m^2 elements of the complete matrix. That is, almost half of the nodes, m^2 - m(m + 1)/2 of them, can be disposed of for our aims. The new resource matrix, from which itemset-support values will be estimated, is the lower-triangular matrix

W' = \begin{pmatrix} w_{11} & & & \\ w_{21} & w_{22} & & \\ \vdots & \vdots & \ddots & \\ w_{m1} & w_{m2} & \cdots & w_{mm} \end{pmatrix}    (4.9)

To use the information embedded in this triangular matrix W' for our purposes, the right interpretation needs to be drawn in order to produce accurate support values from it. Therefore, to achieve itemset-support recalls from W' when a stimulus is presented, we propose the following mechanism of interpretation:

Calculating The Support for 1-itemsets

In this case, the aim is to generate a recall of support for any of the individual items i_j in I. This can be understood as the recall of a value among the elements of the matrix whose indexes satisfy i = j. Therefore, our recall action is focused on the elements of the main diagonal of the matrix {w_{11}, w_{22}, ..., w_{mm}}. Since the number of occurrences of such elements is recorded in the corresponding elements w_{ii}, giving a recall for the support of these items only involves carrying out the calculation of P(w_{ii}), which defines the probability of the i-th item in the n patterns defining the training dataset. Therefore, the corresponding P(w_{ii}) is defined as follows:

supp(i) = P(w_{ii}) = \frac{freq(w_{ii})}{n}    (4.10)

Calculating The Support for 2-itemsets

To calculate the support of the group of 2-itemsets, which contains the C(m, 2) combinations derived from m items, we need to apply Equation 4.10 to each of the elements located off the main diagonal.
In particular, to recall the support of a rule i_j → i_k, defined by supp(i_j → i_k) or simply supp(i_j i_k), it is only necessary to use the corresponding value w_{jk} from the matrix. Therefore, the recall of support for any itemset X = {ij} from this particular group can be defined as follows:

supp(X) = supp(ij) = P(w_{ij}) = \frac{freq(w_{ij})}{n} \quad \text{for all } i \neq j    (4.11)

Estimating Support for The k-itemsets (2 < k ≤ m)

The calculation of support for the groups of k-itemsets when k > 2 is not as straightforward as the two previous cases. That is, in this case, we need to define a mechanism which can combine the knowledge in the CMM to estimate itemset support. In order to achieve this new aim, two proposals, based on Definition 1, will be discussed below.

Definition 1 (Independent Random Variables) Let k random variables (X_1, X_2, ..., X_k) represent k events (i_1, i_2, ..., i_k). As such events do not influence one another in any form, it can be stated that they are independent; therefore, the probability that they occur together can be specified by calculating their joint probability as follows:

\Pr(X_1 = i_1, \ldots, X_k = i_k) = \prod_{i=1}^{k} \Pr(X_i = i_i)    (4.12)

In our first proposal, which will be identified as method A, the support value will be estimated by employing the probabilities given by the elements of the matrix diagonal. In other words, it will be assumed that the probability of a k-itemset X is held by the joint probability of the items associated within it.
Therefore, a support value can be represented by

\hat{supp}(X) = P(X) = \prod_{i \in X} P(w_{ii})    (4.13)

For the second proposal, or method B, instead of using the k individual probabilities, the following is proposed. By using the associative property of multiplication over the elements of Equation 4.13, paired arrangements of products among the k elements can be established as follows:

P(X) = (P(w_{11}) \cdot P(w_{22})) \cdots (P(w_{jj}) \cdot P(w_{kk}))    (4.14)

in which each paired product of probabilities can be stated to hold the following property:

P(A \cap B) = P(AB) = P(A)P(B)    (4.15)

which establishes that the probability of the intersection between two events is defined by the product of their single probabilities. Hence, P(X) in Equation 4.14 can be re-defined as

P(X) = P(w_{12}) \cdots P(w_{jk})    (4.16)

Therefore, in order to give a recall of the support of a k-itemset where k > 2, the calculation needed is given by

\hat{supp}(X) = P(X) = \begin{cases} \prod_{\text{pairs } \{i,j\} \in X,\; i > j} P(w_{ij}) & \text{when } k \text{ is even (} k/2 \text{ pairs)} \\ \left( \prod_{\text{pairs } \{i,j\} \in X,\; i > j} P(w_{ij}) \right) \cdot P(w_{kk}) & \text{when } k \text{ is odd (} (k-1)/2 \text{ pairs)} \end{cases}    (4.17)

4.1.3 Complexity Analysis: CMM vs. Apriori

Since the use of a CMM for ARM is totally novel, we have considered it relevant to provide a theoretical basis for the complexity comparison between the Apriori algorithm and a CMM for ARM. It is important to state that, although different Apriori implementations exist, the analysis conducted here is based on a breadth-first implementation, since it is the gold standard from which many other implementations have been derived. Let m denote the total number of items, from which a number n of transactions have been derived to define a dataset D.
Therefore, the time and space complexities can be expressed as follows:

Space Complexity

Since all the itemsets discovered so far have to be saved by Apriori while the mining is performed, the space required to represent all of them is O(2^m), assuming that each itemset is represented independently in each node of the chosen data structure. In the case of a CMM, this complexity is defined by its weight matrix, in which the input patterns are compressed, and is equal to O(m^2).

Time Complexity

In order to determine whether a k-itemset (an itemset with k items) is frequent, the counting of its occurrences in D is done by Apriori. Therefore, the complexity of such a calculation is O(nk), whilst an estimation realised by our methods A and B produces a complexity of O(k) and O(k/2) respectively. There is another complexity involved in the case of the CMM, produced by the process of learning D, and defined by O(nm^2).

4.2 Experiments

It has been stated above how the resources needed for the calculation of itemset support are captured (learnt) by a CMM during training. Additionally, it has also been proposed how the auto-associative memory based on a CMM should be interpreted for retrieving itemset support from its weight matrix. In order to determine the accuracy of the proposed mechanisms for itemset-support recalls from a trained CMM, some experiments with the real-life datasets defined in Table 4.1 will be performed in this section.

Dataset  | Number of Transactions (n) | Number of Items (m)
Chess    | 3196                       | 75
Connect  | 67557                      | 129 (avg. 43 per transaction)
Mushroom | 8124                       | 119 (avg. 23 per transaction)

Table 4.1: List of real-life binary training datasets used in the testing of an auto-associative memory for ARM. They are part of the datasets normally used for testing FIM algorithms or FIM benchmarks (Jr. et al., 2004; Goethals and Zaki, 2003).
The comparison of the CMM recalls, defining itemset-support estimations, is realised against the zero-error support values given by the Apriori algorithm. In particular, we use Borgelt's implementation (Borgelt, 2003) of the Apriori algorithm, which scans the dataset D in turn and builds a data structure to do the counting of the support of itemsets.

It is very important to note that the results of the experiments presented here do not involve calculations for the groups of 1- and 2-itemsets. This is because the corresponding support calculation for such itemset groups is a straightforward procedure, as explained above, which for the benefit of this thesis produces zero-error support values just as the Apriori algorithm does. These errorless values are possible because of the CMM's natural ability to learn and store this itemset property (support) of the input patterns while it learns the corresponding associations. Therefore, the experimentation carried out here focuses on testing the proposed procedures for the estimation of the support of k-itemsets when k > 2.

In these experiments, while method A, defined by Equation 4.13, involves exploiting the independence property of the items for the calculation of the itemset support, method B, defined by Equation 4.17, uses the information about the support of the 2-itemsets involved in the k-itemset to make a recall. In particular, our experiments consist of querying the associative memory to recall the support for some groups of k-itemsets in order to test its accuracy. To determine the accuracy of the generalisation given by this memory, an error for each of the different itemset groups involved in the queries will be measured.
The calculated error will be represented by the well-known RMS (Root-Mean-Square) error, defined as follows:

E_{RMS} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \{ y_i(x_i; W) - t_i \}^2 }    (4.18)

in which x_i represents a k-itemset of a group of n k-itemsets, whose support values y_i and t_i (fluctuating in the range of 0, it never occurs, to 100, it always occurs) have been calculated respectively by a recall made from the weight matrix W of a trained CMM and by a traditional FIM process performed by the Apriori algorithm for comparison.

Since it is our aim to evaluate the accuracy of the results given by an auto-associative memory, it is necessary to define which itemsets will be employed for testing. Although it would be ideal to check the response of the trained CMM over the total itemset search space, this is unfeasible because of the search-space size. Therefore, some groups representing different support tendencies and sections of the itemset search space have been chosen, created as follows:

Group A The rare itemsets (low-frequency). This group is formed by elements which can have a support fluctuating between 0.001 and 1 percent of the total transactions of the dataset. This group represents itemsets which rarely appear in the dataset; therefore, they can be considered outlier associations in the environment.

Group Type | Support constraint applied
A          | Itemsets with support in the range of 0.001% – 1%
B          | Itemsets with support in the range of 45% – 55%
C          | Itemsets with support in the range of 90% – 100%

Table 4.2: Support constraint conditions used to form the groups of itemsets on which the memories will be tested.

Group B The regular itemsets (semi-frequent). The support values of this group lie between 45 and 55 percent. These are itemsets which appear in a moderate way in the dataset. The associations described by this group can sometimes be interpreted as the obvious knowledge.
Nevertheless, their changes of support can be useful to detect abrupt tendencies in the dataset over time.

Group C The constant itemsets (high-frequency). This is the group of certainty. It contains itemsets which appear frequently in the dataset. An itemset in this group can have a support in the range of 90 to 100 percent. This group describes the common interests among the transactions defined by an environment.

After defining the itemset groups on which our proposals will be tested, the elements of each group have been determined by running the Apriori algorithm over the original datasets with the corresponding support constraint conditions defined in Table 4.2. Once the groups have been formed, each member is used as a stimulus to query the corresponding memories in order to compare the support values.

Before the results of the experiments are shown and discussed, we have plotted each of the final trained Ws, which have learnt the associations of the datasets of Table 4.1, in Figures 4.3, 4.4, and 4.5.

Figure 4.3: 75-by-75 frequency matrix formed by a CMM, from which itemset-support recalls about the Chess dataset will be made.

Each of these matrices represents the available source of knowledge from which each memory will produce a recall (estimation) of itemset support for the queried itemsets (patterns) through our proposals. The numbers describing the errors calculated for the chosen datasets, corresponding to the recalls for the 3- and 4-itemset groups, are shown in Tables 4.3 and 4.4 respectively. It is important to comment that, due to the range of values available to represent itemset support (in this case from 0 to 100%), the worst error scenario in a recall is determined by a value of 100.
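The error measure of Equation 4.18 can be sketched as follows; the function name and the example support values are our illustrative assumptions:

```python
import math

def rms_error(recalled, exact):
    """RMS error of Equation 4.18 over a group of n queried itemsets.

    `recalled` holds the support values y_i estimated from the CMM weight
    matrix; `exact` holds the zero-error values t_i counted by Apriori.
    Both are percentages in the range 0 (never occurs) to 100 (always).
    """
    n = len(recalled)
    return math.sqrt(sum((y - t) ** 2 for y, t in zip(recalled, exact)) / n)

# Hypothetical recalls vs. Apriori counts for three queried itemsets
e = rms_error([48.0, 52.0, 50.0], [50.0, 50.0, 50.0])
# e = sqrt((4 + 4 + 0) / 3), roughly 1.63; the worst case is 100
```

Since support is bounded by 100, the RMS error is also bounded by 100, which is the worst-case scenario mentioned above.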
In order to discover the accuracy of recalls made by the CMM through our proposals in situations which involve itemsets of different sizes, we have set up some experiments with the three selected datasets and summarised the final results in Tables 4.5 and 4.6.

Figure 4.4: 129-by-129 frequency matrix formed by a CMM, from which itemset-support recalls about the Connect dataset will be made.

Figure 4.5: 119-by-119 frequency matrix formed by a CMM, from which itemset-support recalls about the Mushroom dataset will be made.

Dataset  | Itemset Group | Itemsets in Group (n) | RMS Error Method A | RMS Error Method B
Chess    | A | 12,340  | 0.790250 | 0.766740
Chess    | B | 1,871   | 3.534600 | 2.769100
Chess    | C | 167     | 0.617510 | 0.475190
Connect  | A | 113,013 | 0.21098  | 0.20019
Connect  | B | 1,540   | 2.9726   | 2.3955
Connect  | C | 826     | 0.67045  | 0.38234
Mushroom | A | 21,961  | 1.0606   | 1.0402
Mushroom | B | 74      | 4.9631   | 4.1033
Mushroom | C | -       | -        | -

Table 4.3: Error results obtained in the experiments for the support recall for the groups of 3-itemsets, constrained as defined in Table 4.2, made by a CMM through our proposals.

Dataset  | Itemset Group | Itemsets in Group (n) | RMS Error Method A | RMS Error Method B
Chess    | A | 269,719   | 0.724600 | 0.534530
Chess    | B | 12,359    | 4.172900 | 3.185000
Chess    | C | 203       | 0.775860 | 0.505150
Connect  | A | 2,776,831 | 0.19219  | 0.13916
Connect  | B | 14,743    | 3.5693   | 2.8307
Connect  | C | 2,451     | 0.77525  | 0.38354
Mushroom | A | 243,608   | 0.76988  | 0.7246
Mushroom | B | 72        | 5.2328   | 3.5672
Mushroom | C | -         | -        | -

Table 4.4: Error results obtained in the experiments for the support recall for the groups of 4-itemsets, constrained as defined in Table 4.2, made by a CMM through our proposals.

In general, these results show that differences exist between the zero-error counts made by Apriori and the approximations recalled from the CMM. The best approximations were given by the proposal which uses the support of the 2-itemsets for estimating the support of a k-itemset presented to the memory.
Group of Itemsets (k) | Number of Itemsets (n) | RMS Error Method A | RMS Error Method B
3 | 167 | 0.617510 | 0.47519
4 | 203 | 0.775860 | 0.50515
5 | 128 | 0.912940 | 0.86483
6 | 39  | 1.089600 | 0.9105
7 | 4   | 1.394400 | 1.4274

Table 4.5: Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Chess dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm through constraining the itemsets with a minimum support between 90 and 100%.

Group of Itemsets (k) | Number of Itemsets (n) | RMS Error Method A | RMS Error Method B
3  | 4985    | 2.9141 | 2.285
4  | 25,500  | 3.593  | 2.6836
5  | 88,170  | 4.2165 | 3.6522
6  | 217,705 | 4.8524 | 3.995
7  | 397,947 | 5.5205 | 5.0014
8  | 550,220 | 6.2197 | 5.2905
9  | 581,647 | 6.9495 | 6.494
10 | 471,908 | 7.7125 | 6.6983
11 | 293,209 | 8.5192 | 8.1773
12 | 138,294 | 9.3857 | 8.3329
13 | 48,473  | 10.32  | 10.098
14 | 12,023  | 11.317 | 10.258
15 | 1,896   | 12.416 | 12.252
16 | 152     | 13.865 | 12.791
17 | 5       | 14.903 | 14.751

Table 4.6: Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Chess dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 45 and 100%.

Group of Itemsets (k) | Number of Itemsets (n) | RMS Error Method A | RMS Error Method B
3  | 2,757   | 1.6031 | 0.98297
4  | 13,218  | 1.921  | 1.0486
5  | 44,721  | 2.1543 | 1.4329
6  | 111,159 | 2.3382 | 1.3499
7  | 208,126 | 2.4889 | 1.6243
8  | 297,836 | 2.6207 | 1.4417
9  | 327,797 | 2.7472 | 1.6749
10 | 277,133 | 2.8768 | 1.4244
11 | 178,389 | 3.012  | 1.6514
12 | 85,839  | 3.1502 | 1.3619
13 | 29,903  | 3.2857 | 1.5753
14 | 7,135   | 3.4064 | 1.2622
15 | 1,052   | 3.4853 | 1.4725
16 | 76      | 3.4506 | 1.152

Table 4.7: Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Connect dataset made by a CMM through our proposals.
The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 75 and 100%.

Group of Itemsets (k)   Number of Itemsets (n)   RMS Error, Method A   RMS Error, Method B
1                       -                        -                     -
2                       -                        -                     -
3                       110                      4.2501                3.5735
4                       91                       4.8628                3.2724
5                       37                       5.3181                5.153
6                       6                        5.7803                4.8079

Table 4.8: Error results obtained in the experiments for the support recall for groups of different k-itemsets of the Mushroom dataset made by a CMM through our proposals. The elements (itemsets) of the groups used to make the CMM recall were defined by the Apriori algorithm by constraining the itemsets with a minimum support between 45 and 100%.

4.3 Conclusions

This chapter has been the first step in the quest for a neural network whose embedded knowledge can be used accurately for the generation of association rules, as if such rules were generated from the original (training) data D. In order to generate such rules, the itemset-support property is necessary, since it is the parameter (metric) used to evaluate which itemsets (combinations), out of a total space of 2^n itemsets (n being the number of items or elements forming the itemsets, patterns, or transactions), are interesting for the process. Therefore, in this chapter, we have investigated whether itemset support can be generated from the weight matrix formed by an associative memory. This neural network has been chosen from the large taxonomy of ANNs because its operation is governed mainly by the use of the concept of association among the input patterns. In particular, it has been analysed whether this memory, after training, holds the knowledge needed to assign a value defining itemset support to the stimuli (queried itemsets) presented to it. As a result of our analysis, it has been found that the weighted type of this memory has the natural ability to learn information about the frequency of the itemsets defined by the input associations.
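The RMS error columns in the tables above compare the supports recalled from the CMM against the exact counts produced by Apriori. The precise error formula is not restated in this section, so the sketch below simply assumes the standard root-mean-square definition over a group of n itemsets:

```python
import math

def rms_error(recalled, apriori):
    """Root-mean-square error between the supports recalled from the CMM
    and the reference supports counted by Apriori for one group of
    itemsets (assumed standard RMS; both inputs have equal length)."""
    n = len(recalled)
    return math.sqrt(sum((r - a) ** 2 for r, a in zip(recalled, apriori)) / n)
```

A group with identical recalled and counted supports yields an error of zero, matching the "zero-error" baseline of Apriori.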
After discovering that itemset frequencies are embedded in the weight matrix, we defined how the support for the groups of 1- and 2-itemsets can be calculated directly from the weight matrix. These support values recalled from the memory are as errorless as the values given by the well-known Apriori algorithm, which scans and counts the dataset directly to calculate itemset support. In the case of the support for the groups of k-itemsets when 2 < k ≤ m, two methods have been proposed to tackle the problem using the only information available in the matrix, the support information of the groups of 1- and 2-itemsets, to give a recall. Both methods, A and B, assume that the items (their events of existence) are probabilistically independent of one another. While method A uses the individual item probabilities (the values defining support on the main diagonal of the matrix) as the resources for the estimation, method B forms pairwise items, whose support values are defined within the matrix, from the queried itemsets in order to give an answer. One of the advantages of using a CMM is that the space complexity of its weight matrix, from which itemset support will be recalled, is O(m^2), which is much smaller than that of its counterpart, the Apriori algorithm, whose candidate space reaches O(2^m) in the worst case. Errors in the recalls have resulted from employing both methods. Nevertheless, method B has shown slightly better results. In summary, it can be stated that this memory is suitable for the perfect (errorless) calculation of support for the 1- and 2-itemset groups. Nevertheless, improvements for the case of the support for itemsets larger than two items remain an open problem. It is relevant to note that the usage of a CMM for ARM has been conceivable due to its training, which, as shown in Figure 4.1, is the result of a superposition of the CMMs defining the input pairs of patterns.
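To make the two estimation strategies concrete, the following sketch rebuilds a weighted correlation (frequency) matrix by superposing the input patterns and recalls support from it. The matrix entries follow the description above: the diagonal counts individual items and the off-diagonal entries count pairs. Method A follows the stated description (product of individual item supports from the diagonal); the exact closed form of method B is not restated here, so the pairwise combination shown is an illustrative assumption only, chosen so that it coincides with method A when the items really are independent:

```python
import numpy as np
from itertools import combinations

def train_cmm(transactions, m):
    """Superpose the outer products of binary transaction vectors.
    W[i, j] then counts co-occurrences of items i and j; the main
    diagonal W[i, i] counts individual item occurrences."""
    W = np.zeros((m, m), dtype=int)
    for t in transactions:
        v = np.zeros(m, dtype=int)
        v[list(t)] = 1
        W += np.outer(v, v)
    return W

def support_method_a(W, itemset, n):
    # Method A: assume item independence and multiply the individual
    # item supports taken from the main diagonal.
    p = 1.0
    for i in itemset:
        p *= W[i, i] / n
    return p

def support_method_b(W, itemset, n):
    # Hypothetical pairwise variant (our assumption, k >= 2): combine
    # the 2-itemset supports stored off the diagonal; it reduces to
    # method A when the items are independent.
    prod = 1.0
    for i, j in combinations(sorted(itemset), 2):
        prod *= W[i, j] / n
    return prod ** (1.0 / (len(itemset) - 1))
```

For 1- and 2-itemsets the diagonal and off-diagonal counts are exact, matching the errorless recall reported above; only for k > 2 do the independence assumptions introduce error.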
Chapter 5

Itemset Support Generation From a Self-Organising Map

In the previous chapter we concluded that although a supervised ANN, such as an auto-associative memory, has the ability to remember itemset support for the groups of 1- and 2-itemsets perfectly, it struggles to recall support for larger itemsets, due to overlaps produced by the distribution of the learnt associations in its weight matrix. In contrast to the previous chapter, here we will be looking at an unsupervised ANN rather than a supervised one for the task of building our desired memory. First of all, we will focus on studying the Self-Organising Map (SOM), because it has been successfully used for data-mining tasks such as data visualisation, clustering and modelling. Additionally, its usage for ARM has already been proposed, but with some limitations. To begin with, a study on the suitability of a SOM for ARM is presented. With the conclusions drawn from the study, and considering the biological properties of the SOM, we will propose how to interpret a SOM trained with patterns representing associations in an environment, in order to extract itemset support from its nodes. In other words, we propose how to reproduce the counting of patterns from the knowledge embedded in the map through our proposal, named PISM (Probabilistic Itemset-support eStimation Mechanism). The accuracy of the itemset-support estimations made by a SOM, for either real-life or artificial datasets, is tested against the itemset-support results calculated by the Apriori algorithm. In order to improve the accuracy of the estimations, the concept of emergent feature maps is also studied. Conclusions on the suitability of, and considerations for, the use of this neural network for ARM are given at the end.
5.1 Considering a SOM for ARM: Principles

Motivated by the fact that pattern-occurrence counts can be reproduced from the knowledge formed within the weight matrix of an auto-associative memory based on a correlation matrix memory, our attention now turns to studying whether a property of a network with unsupervised training, such as the SOM (Self-Organising Map), can be utilised to count itemsets, and can thus become our itemset-support memory for ARM. The SOM is an outstanding neural network, due to its peculiar characteristics for multi-dimensional data clustering and visualisation. As stated in Chapter 3, its advantage is that it captures the knowledge from a dataset without supervision. To use a SOM for a data-mining task (Kaski et al., 1998b), it is first necessary to set up some initial parameters (for example, the radius, the neighbourhood function, and others). In the SOM training, an NNS (Nearest-Neighbor Search) (Yianilos, 1993) is at the core of the process. This NNS provides a mechanism for the selection of the BMUs (Best-Matching Units), or winners, in order to determine the corresponding group of nodes which triggers the update of the map. Through this update process, the map m re-organises itself iteratively in such a manner that the state M can be reached, so that m → M when t → ∞. In this final state M, the map can be declared to be a model which has already gained certain information from the input or training dataset in its weight matrix (reference models, codebook), and from which some important clustering and data-visualisation properties can be derived satisfactorily. However, due to the nature of ARM, the model formed will be utilised for a task which concerns neither clustering nor visualisation.
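The BMU selection at the core of the training loop is a plain nearest-neighbour search over the codebook. A minimal sketch, assuming Euclidean distance and a NumPy array of reference vectors (the function name is ours):

```python
import numpy as np

def find_bmu(codebook, x):
    """Nearest-neighbour search: the BMU is the node whose reference
    vector minimises the Euclidean distance to the input vector x."""
    distances = np.linalg.norm(codebook - x, axis=1)
    return int(distances.argmin())
```

During an epoch, each input is routed to its BMU, and it is the set of BMUs that triggers the update of the map.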
Although the training of a SOM is typically performed in a sequential mode, in which the map is updated every time a single, randomly chosen input vector is presented, we have considered the use of the batch mode, in which the map is updated once the whole dataset has been presented. The batch mode has been used to propose, for instance, the creation of large maps (Song and Lee, 1998; Kohonen et al., 2000), the formation of maps for non-vectorial data (Kohonen and Somervuo, 2002) or string data (Kohonen and Somervuo, 1998), and the speed-up of SOM training through its parallelisation (Lawrence et al., 1999a; Kohonen et al., 2000). Moreover, as will be shown later, batch training allows for an easy identification of the knowledge distributed over the neurons during a counting or training epoch. Therefore, a SOM can be updated in batch (Kohonen, 1996; Kohonen et al., 2000) by calculating the new values of the reference vectors associated with its neurons as follows:

m_i(t + 1) = \frac{\sum_j h_{ji}(t)\, S_j(t)}{\sum_j n_{V_j}(t)\, h_{ji}(t)}    (5.1)

This new state of the map (m_i(t + 1)) can be understood as the spread of the influences from each node m_j, generated by the corresponding data inputs represented by S_j(t), weighted by a neighbourhood kernel function h_{ji}(t). The term S_j, defined below in Equation 5.2, is the sum of the n_{V_j} inputs contained in the Voronoi region V_j = \{x_i \mid \|x_i - m_j\| < \|x_i - m_k\| \; \forall k \neq j\} corresponding to node m_j.

S_i(t) = \sum_{j=1}^{n_{V_i}} x_j    (5.2)

In the case of the kernel function, this is normally defined by a Gaussian function as follows:

h_{ij}(t) = \exp\left(-\frac{\|r_i - r_j\|^2}{2\sigma^2(t)}\right)    (5.3)

in which r_i and r_j represent the positions of nodes m_i and m_j on the SOM grid, and σ defines the neighbourhood radius. One characteristic of this kernel is that it takes its highest value at the origin of the influence (at the winners) and decreases monotonically over the remaining nodes of the map.
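One batch epoch in the spirit of Equations 5.1–5.3 can be sketched as follows; the variable names and the guard for nodes that receive no influence are our own assumptions:

```python
import numpy as np

def batch_epoch(codebook, grid, X, sigma):
    """One batch update: allocate every input to its BMU, form the
    Voronoi sums S_j and hit counts n_Vj, then blend them over the map
    with a Gaussian neighbourhood kernel h (Equations 5.1-5.3)."""
    # BMU (nearest-neighbour) search for every input vector
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    bmu = d2.argmin(axis=1)
    S = np.zeros_like(codebook)          # Voronoi sums S_j (Eq. 5.2)
    counts = np.zeros(len(codebook))     # hit counts n_Vj
    for x, j in zip(X, bmu):
        S[j] += x
        counts[j] += 1
    # Gaussian neighbourhood kernel over the grid positions (Eq. 5.3)
    g2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    h = np.exp(-g2 / (2.0 * sigma ** 2))
    denom = h @ counts                   # sum_j n_Vj h_ji (Eq. 5.1)
    new = codebook.copy()                # nodes with no influence keep their value
    mask = denom > 0
    new[mask] = (h @ S)[mask] / denom[mask, None]
    return new, bmu, counts
```

With a very small radius the kernel approaches the identity, and each hit node converges to the mean of the patterns allocated to it, which is the density-estimation reading of the update discussed next.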
Considering Equation 5.1, the update of the map can be interpreted as a process which places Gaussian functions close to the values S_j / n_{V_j}. Each of these values represents a data point which can be considered to be the mean μ_j of the n_{V_j} patterns allocated to node m_j, or the centroid n_j of the Voronoi set V_j defined by node m_j, as follows:

n_j = \frac{1}{n_{V_j}} \sum_{x_i \in V_j} x_i    (5.4)

After such functions have been placed at the corresponding nodes (the BMUs), their influences need to be propagated to the rest of the nodes in the map, in order to produce a new state of the SOM. The strength of the influences differs and is determined by the number of data points allocated to the node and by its position in the map. It is worth noting that this interpretation of the update resembles a process of density estimation (Silverman, 1986; Devroye, 1987), in which Gaussian functions are often placed at the input values and summed up to determine an estimation \hat{f} of the real density distribution f of the input data.

Dataset   Transactions (n)   Items (m)   Description
Bin4      15                 4           All itemsets derived from 4 items
Bin6      63                 6           All itemsets derived from 6 items
Bin8      255                8           All itemsets derived from 8 items
Bin10     1023               10          All itemsets derived from 10 items
Bin12     4095               12          All itemsets derived from 12 items
Bin16     65535              16          All itemsets derived from 16 items

Table 5.1: List of binary-artificial training datasets used in the experiments for analysing SOM properties for ARM. Each dataset contains all the possible n itemsets generated by m items.

To identify which SOM characteristics can be used for ARM, experiments were conducted with the datasets described in Table 5.1.
It must be noted that these data are all artificial, because our interest is not only to find out how the SOM behaves when it is fed with a data distribution formed of discrete (binary) patterns describing associations, but also to observe the SOM's response when it is trained with datasets containing the whole itemset data space, which is defined by the 2^n − 1 itemsets derived from n items. For our experiments, the following points apply: i) a batch training method is used; ii) the Gaussian function is used as the neighbourhood function because of its probabilistic properties; and iii) the size of the radius is constant and set to one, because in that manner the BMUs will always spread their maximum influence over the entire map. The maps in Figure 5.1 represent the results of this analytical exercise, which can be interpreted as follows:

Formation of Clusters. To make the clusters visible, the D-matrix [1] of each map is plotted and coloured using principal components for colour assignment (Vesanto et al., 2000). The size of the clusters, as well as the strength of the influences, is also directly related to the number of patterns allocated to the BMUs; therefore, in the resulting maps, particularly those with more transactions, it is important to note that the clusters organised around the edges of the square grid are relatively homogeneous in size. That is, the hypercube, whose vertices define the different binary patterns of a multi-dimensional space, is projected (compressed) into the two-dimensional space bounded by the neurons of the map, forming a circular arrangement of clusters along with a communal cluster in the middle. A good example, in which this cluster phenomenon is visible, is the resulting map for 12 items in Figure 5.1. This cluster-size repartition is due to the fact that the input data distribution is controlled by limiting the appearance of any item to 2^n / 2 times in the dataset.
Although this is rare in real-life problems, it helps in understanding the theoretical scenario in which all itemsets exist in the training dataset.

[1] The D-matrix is a distance matrix, among the most commonly used means for visualising a SOM; it examines the differences between adjacent nodes. In a D-matrix, the median of the computed distances between each node and its neighbouring nodes is determined.

Figure 5.1: Maps resulting from training a SOM with artificial datasets describing associations. The red hexagons on the grey maps define the hits received from the input patterns during training. Cluster formations are presented with the coloured maps.

Another characteristic that affects the number of clusters on the maps is their size. The size of the map is often determined heuristically, which normally results in a codebook which can always be expected to be considerably smaller than the size of the original dataset. In other words, the original dataset with its hidden associations is compressed and coded in a distributed manner within the nodes of the map, in such a way that its interpretation remains unknown.

Identification of Winners. One important characteristic of this ANN is that the identification of the BMUs is simple. For example, in all the figures, the size of the red hexagons represents the number of hits received during a training epoch. The importance of the set of BMUs (nodes with non-zero hits) is high, because they trigger the update of the map at each epoch, and also because it is to these nodes that the input patterns are allocated. Thus, it can be assumed that any node m_j which has been hit m_j.#S times has a high relevance for the calculation of itemset support iff m_j.#S = |V_j| ≥ σ (the minimum support threshold). The latter initially leads to the conclusion that if a cluster C_i contains strong (that is, very dense, very populated or very frequent) BMUs, then it can also be considered strong.
Using the language of ARM, it is to be expected that if a cluster C_i is found to be frequent on the map, then some of its members, called winners, may also be frequent. It is important to note that the latter is only partially correct, because even though a winner is found to be frequent, that fact does not ensure that all the members (items) of the patterns in it would also be considered frequent. As we have initially classified a SOM as a representative of the projection model defined in (Gardner-Medwin and Barlow, 2001), the number of hits to a node can be interpreted as a property which defines the usage of that node in the process of counting input patterns during an epoch. Therefore, estimations regarding the occurrence frequency of the patterns can be calculated by using such a node property. Due to the way in which a SOM clusters data, it can be expected that some BMUs might share similarities among the elements of their patterns, even though they will never share the same patterns (n_i.S ∩ n_j.S = ∅ for all i ≠ j). Therefore, it can be assumed that the true support value of an itemset will involve the collection of the corresponding values from the nodes which share that itemset. This can be viewed as a soft-clustering procedure, in which a data point may have multiple memberships, but the weighted memberships sum to one.

Dependency Amongst Clusters. As a consequence of training, a set S of patterns will have been formed in each BMU. To use a SOM for FIM, it has to be able to provide the support for any itemset possibly formed by the input patterns. From this perspective, the dependency amongst the binary patterns is relevant. This dependency of the patterns at each BMU can be seen if they are re-arranged into a hierarchical data structure, in order to help the FIM algorithm with counting support from each pattern.
This pattern dependency is important since it can be part of a method to calculate the true support of a k-itemset, determining for instance whether the k-itemset ⊆ m-itemset. To determine whether a pattern is either a parent or a child of another pattern, a bitwise operator can be adopted as follows: Let A and B be two binary patterns and ∨ be a bitwise OR operator, such that the operation A ∨ B can give the following dependencies:

• A and B are directly related only if (A ∨ B) gives either A or B as a result: if (A ∨ B) = A, then A is considered the parent of B, meaning that B ⊆ A; otherwise, if (A ∨ B) = B, A is considered the child of B, meaning that A ⊆ B.

• A partial or zero dependency is defined when (A ∨ B) gives neither A nor B as a result, meaning that a possible dependency exists among the clusters on the map.

Following the idea described above, local data structures can be built at each BMU in order to visualise the way in which a SOM splits the data-input-space lattice across the nodes. With the definition of the pattern dependencies, we might build a more complex structure involving all the patterns found and organised by a SOM. This idea would lead to the building of a data structure, for instance a tree or trie, whose nodes would contain the pattern definition and its corresponding frequency, which would have to be computed on the fly. The latter resembles the operation performed by a traditional FIM algorithm, so there would be no good reason to use a SOM for ARM. At this stage, it can be concluded that even if the separation of the patterns could reduce the computational complexity of the counting, and important sectors of the input-data space could be detected on the map, this neural network would still be dependent on the original dataset.
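Encoding itemsets as integer bitmasks, the OR-based parent/child test above can be sketched directly; the function name and the return labels are our own:

```python
def dependency(a: int, b: int) -> str:
    """Classify two itemsets encoded as bitmasks using the bitwise-OR
    rule: (A | B) == A means B is contained in A, and vice versa."""
    union = a | b
    if union == a:
        return "A is parent of B"       # B is a subset of A
    if union == b:
        return "A is child of B"        # A is a subset of B
    return "partial or zero dependency"
```

Applying this pairwise to the patterns allocated at a BMU yields the local hierarchical structure described above.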
In order to overcome this disadvantage, we could look at two alternatives. The first would involve adopting either the construction of a hierarchical data structure from the input patterns, or their direct use for FIM as proposed in (Shangming Yang, 2004). Nevertheless, both of these proposals would limit the use of a SOM to just the formation of clusters, which, as commented previously, would not be a sufficient justification to claim that association rules can be generated directly from knowledge derived from this neural network. The second approach is more challenging and interesting, and involves taking this research a step further by investigating a method of decoding the knowledge of the map in order to form an interpretation that eliminates the dependency between the SOM and the training dataset for FIM. Such an interpretation would bring a positive benefit for the use of a SOM for ARM, because the SOM would fulfil the role of a large artificial memory which learns associations defined in an environment, so that recalls about the support value of any itemset can be made satisfactorily any time this memory is queried. Therefore, in the next section, we will develop a knowledge-extraction mechanism called PISM (Probabilistic Itemset-support eStimation Mechanism) to decode or interpret a trained SOM in order to reproduce the process of counting patterns or itemsets from its neurons.

5.2 A Probabilistic Itemset-support Estimation Mechanism

To achieve our aim, which corresponds to the development of a SOM-based memory for ARM, we first need to find out how and where the associations are represented within the SOM. It has already been stated that the representations of the input associations are all distributed among the nodes in the map, but it is necessary to determine which group of nodes in the map will contribute to the itemset-support estimations.
In other words, it is indispensable to select the nodes which will serve as the source of knowledge for the itemset-support estimations. Therefore, after observing in our experiments that the set of BMUs is the group responsible for the changes occurring in the map during training, and following the recommendation made in (Alhoniemi et al., 1999), which states that the information allocated in the BMUs is often an attractive source of information for the development of many applications, we will concentrate on defining an extraction method for the BMUs formed in a training epoch. In order to achieve the selection of such nodes from the trained map, two initial definitions are stated as follows:

Definition 2 (Set of Winners) A set W, defining the final winners, is formed by a number of nodes m_b from the final map M such that

W = \{M_i \mid M_i.\#S > 0 \text{ or } M_i.S \neq \emptyset\}, \quad m_b = |W| \leq |M|

where S is a set whose elements are the patterns that hit node i.

Definition 3 (Winning Vector) A vector Z = \{z_1, \ldots, z_n\} will be called a final winning reference vector if its respective node forms part of the set W. In addition, z_{ji} will be understood as the ith component of Z that has acquired some information about the ith component of some input patterns (x_i) which, due to their similarities, have been allocated by the training process to the node associated with M_j.

Once W = \{w_1, w_2, \ldots, w_{m_b}\} has been identified in a converged map M by using Definition 2, the next stage requires the application of some concepts from probability theory to the reference vector of each node in the set W, in order to evaluate whether it is feasible to obtain our desired knowledge, the support of itemsets, from these vectors. It is relevant to note that it is not strictly necessary to extract the set W from the converged map, but it can be assumed that more accurate itemset-support estimations can be made if the latest state of a map is utilised.
To define a value Pr that holds the probability that a node, for instance m_j, has become a BMU m_c for some input patterns after completing a training epoch, it can first be assumed that becoming a winner in the process is equally likely among all the nodes in the map; therefore, the probability of each element contained in W can be specified as follows:

Definition 4 (Prior Probability: Becoming a Winner) In SOM training, where N defines the number of input patterns contained in D, the probability that a node m_i has become m_c during a training epoch is defined by the ratio between the number of times that m_c has been hit by D and the number N of input occurrences in the training:

Pr(m_i \rightarrow m_c) = \frac{m_i.\#S}{N}    (5.5)

where #S is understood as the number of data points which belong to the Voronoi set V_i delimited by the winner m_i.

The previous definition is the prior probability that a node has become a BMU. This Pr(i) appeared in (Alhoniemi et al., 1999), where Alhoniemi et al. gave a probabilistic interpretation of the response of the nodes to a new data sample using Bayes' theorem. Initially, the prior probability values of the nodes are all zero, and some will change after the conclusion of each training epoch. In this work, the values given by the converged map will be used. Before continuing to define our extraction mechanism, we consider it important to explain how the reference vector Z of a BMU i holds the corresponding probabilities (frequencies) of each pattern component x_i over the total number of patterns allocated to that node. The explanation is as follows: As mentioned above, the input patterns, representing the associations to be learnt, are in binary format.
Nevertheless, once they are presented to the SOM, they are handled as m-dimensional real vectors, due to: i) the use of a Euclidean distance metric for the BMU search, rather than a Hamming distance, which could be more appropriate for binary vectors; and ii) the fact that the vectors resulting from applying Equation 5.1 are formed of real, not binary, numbers. Therefore, the input patterns can be defined as real vectors whose components are bistate, since they hold zeros and ones, and the tendency of having fewer zeros than ones, or vice versa, will depend upon the hidden associations defined by them. In other words, each variable or component z_i, representing an item's behaviour, can be defined as a bimodal distribution whose peaks lie at the values of one and zero. This bistate property of each z_i is important for the purpose of this work, as a normal distribution can be created from these values. The different concentrations or densities of these two values will form a density distribution whose mean μ, which is the highest point of the distribution, defines a value containing the percentage of success events (z_i = 1) for this variable or item. In ARM, this percentage of successes of z_i over a total number of events N is what the support of z_i represents. In other words, the μ of each component measures the number of occurrences of that component. Some examples describing this idea are depicted in Figure 5.2, in which the curve of the density distribution of each bottom graph represents an estimation of the real distribution formed by the n transactions used in each example. For this case, the plotted distributions have been formed by using the concept of a Gaussian-kernel estimator (Silverman, 1986; Devroye, 1987). It has been mentioned above that the update of the SOM, particularly Equation 5.1, can be interpreted as an aggregate of influences coming from different BMUs on the map.
Hence, if a BMU is about to be updated, it can be derived from Equation 5.1 that the final value of its codeword is very close to the mean of all the patterns accumulated in it. The difference between the real mean and the value given by the SOM is due to receiving influences from other nodes, weighted by the distances among the nodes and the radius used for the update of the map during training. It can then be concluded that the values of the components defined in each BMU are estimations of the real means, and could therefore be used for the calculation of support instead of the original set of patterns. Reviewing Equation 5.1, which is used for conventional batch training, and analysing the training procedure, it can be concluded that the updating of all the components z_i in the map (codebook) is realised independently. The final value held in the component z_i is never affected by the updating process of the component z_j at any point during training, and vice versa. Meanwhile, it has been stated in (Hugh, 1997) that whenever there is a kind of physical independence between events or processes, it shall be assumed that they have mathematical independence. Therefore, assuming that each component z_j of a final winning reference vector Z is independent of the others, and that each of these components \{z_1, z_2, \ldots, z_n\} defines a probability of appearance of their corresponding elements \{x_1, x_2, \ldots, x_n\} in the input patterns, we can proceed to define that:

Figure 5.2: This figure illustrates the importance of the mean in the calculation of the support of an item from a trained SOM. Different numbers of transactions (n), composed of zeros and ones, have been used to form the bottom graphs.
These graphs show that the different concentrations (densities) of the bistate values captured for an item cause the distribution curve to tend towards the densest value (e.g., in the left graph the number of failures (z_i = 0) is greater than the number of successes; therefore, the highest point of the distribution tends to be placed at 0).

Definition 5 (Independence Amongst Components of a Winner) Let a SOM-training process be defined as an experiment where the possible outcomes S (S is a sample space in probability terms) are defined by a countably infinite set of data points in R^n (n is the dimensionality of the input data) which are allocated in D. Let two random components z_i and z_k of the winning vector m_j be represented by two discrete random variables, A and B respectively, which describe the probability of their corresponding events of occurrence. Assume that these variables A and B have no influence on each other. Thus, the probability that they both occur together can be described by the joint probability, the product of the probabilities of these two individual variables, i.e.

Pr(A = z_i, B = z_k) = Pr(A = z_i) \cdot Pr(B = z_k)

If the case is to study the joint distribution of n random variables X_1, \ldots, X_n, which can conveniently be represented by a random vector X = (X_1, \ldots, X_n) describing values from z_1, \ldots, z_n, then the corresponding joint probability can be obtained by

Pr(X_1 = z_1, \ldots, X_n = z_n) = \prod_{i=1}^{n} Pr(X_i = z_i)    (5.6)

Definition 5 is possible due to the fact that all components have been declared independent. It has also been said in (DeGroot, 1975) that n random variables X_1, \ldots, X_n have a discrete joint distribution if the random vector (X_1, \ldots, X_n) can take only a finite number, or an infinite sequence, of different possible values (x_1, \ldots, x_n) in R^n. Then, the joint probability function of X_1, \ldots, X_n is defined to be a function f so that for any point (x_1, \ldots, x_n) \in R^n,

f(x_1, \ldots, x_n) = Pr(X_1 = z_1, \ldots, X_n = z_n)

So far, it has been specified that every node in the map has an attached real value defining its prior probability. This prior probability must be zero for all nodes which are not elements of the set W. Consequently, Equation 5.6 defines a joint probability describing the fact that some random variables happen together at each winner. Thus, the next step is to define the manner of calculating the total or final probability Pr(E|M) that some event E, which represents k variables happening together, occurs in the structure defined by a resulting trained map M. To define the value Pr(E|M), the concept of a probability partition found in (DeGroot, 1975) can be utilised as follows:

Definition 6 (Partitioned Data Space) Let S denote a sample space (input data or pattern space) of some experiment (training) and consider k events A_1, \ldots, A_k such that they are disjoint (they do not share elements). Thus, it is said that these k events form a partition of S. If the k events A_1, \ldots, A_k form a partition of S, and if B is any other event in S, then the events A_1 B, A_2 B, \ldots, A_k B will consequently form a partition of B, as illustrated in Figure 5.3. Hence, it is possible to write

B = (A_1 B) \cup (A_2 B) \cup \cdots \cup (A_k B)

Moreover, since the k events on the right side of the equation are disjoint,

Pr(B) = \sum_{j=1}^{k} Pr(A_j B)

Finally, it is known that if Pr(A_j) > 0 for j = 1, 2, \ldots, k, then Pr(A_j B) = Pr(A_j) Pr(B \mid A_j). Thus, it follows that

Pr(B) = \sum_{j=1}^{k} Pr(A_j) Pr(B \mid A_j)

where Pr(B \mid A_j) defines the conditional probability of the event B occurring in the partition defined by the event A_j.

Figure 5.3: The figure on the left depicts the intersections of an event B with events A_1, \ldots, A_5 of a partition over S. The figure on the right depicts the concept of Voronoi regions which can be formed on the SOM (the dots represent the codewords, while the stars represent the data points assigned to each Voronoi region).
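As a toy numerical check of the partition identity Pr(B) = Σ_j Pr(A_j) Pr(B|A_j), which is exactly the form applied to the map below, the numbers here are invented purely for illustration:

```python
# Law of total probability over a 3-event partition A1, A2, A3
# (illustrative numbers only; the priors must sum to one).
priors = [0.5, 0.3, 0.2]          # Pr(A_j)
conditionals = [0.4, 0.1, 0.9]    # Pr(B | A_j)
pr_b = sum(p * c for p, c in zip(priors, conditionals))
# 0.5*0.4 + 0.3*0.1 + 0.2*0.9 = 0.41
```

In the SOM setting, the A_j become the Voronoi partitions induced by the winners and B becomes the event that a queried itemset occurs.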
For the purpose of this work, it can then be stated that the k events A_1, \ldots, A_k are the result of the "best-matching" process during training, and that they represent the group of nodes over which the input data space is split, forming k different events (partitions) which are disjoint (no pattern is shared among them). Similarly, the event B can be seen as any event that is likely to occur, defined by some of the elements (items) of the vectors allocated at each node. In summary, the final probability of an event E can be defined as follows:

Definition 7 (Total Probability: The Frequency of Occurrence of a Pattern) Having a final map M after training a SOM with D containing discrete (binary) patterns, it is possible to obtain the corresponding probability of the event E, describing the fact that k components or items x_1, \ldots, x_k of the input vectors appear together in the training environment D, by calculating the sum of the partial probabilities regarding the event E (z_1, \ldots, z_k) in those neurons, defined by the set W, in which the knowledge of D has been distributed. This calculation is represented by:

Pr(E|D) \approx Pr(E|M) = \sum_{i=1}^{m_b} Pr(W_i) \, Pr(E|W_i)    (5.7)

in which Pr(W_i) represents the prior probability of each BMU neuron and Pr(E|W_i) defines the probability of the event E in that neuron. This final probability Pr(E|M), whose calculation is depicted in Figure 5.4, can also be seen as the estimation of the frequency of occurrence of the event E in D. It is important to clarify that Pr(E|M) is an estimation of the real value Pr(E|D), which is normally calculated by counting the event or pattern E in the whole dataset D. An equation summarising this definition is as follows:

Pr(E|M) = \sum_{i=1}^{m_b} Pr(W_i) \prod_{j=1}^{k} Pr(z_j), \quad z_j \in W_i    (5.8)

To conclude the description of this method, it is essential to use the correct terminology in order to describe the proposed method in the frequent-itemset-mining language.
Therefore, it must first be defined that a k-itemset is a k-multi event E occurring within the components of map M. Each possible k-itemset has an associated support value (probability of appearance) in D that can be calculated from M by,

Definition 8 (Estimation of Itemset Support from a SOM) Let M be a map resulting from training a SOM with N binary patterns grouped in an environment D representing transactions involving m items. An estimation of the support (\hat{supp}) of an itemset X, defining a pattern or event E formed with k items, which is equal to an estimation of the frequency of occurrence of such a pattern in the training environment, can be calculated from the embedded and distributed knowledge in M by summing up the probabilities registered by E in the BMUs, defined by a set W, for such D, as follows:

\hat{supp}(X) = Pr(X|M) = \sum_{i=1}^{m_b} Pr(W_i) \prod_{j=1 (z_j ∈ W_i)}^{k} Pr(z_j)    (5.9)

Figure 5.4: Representation of the Probabilistic Itemset-support Estimation Mechanism (PISM) proposed in this chapter.

5.3 Experiments and Results

In order to verify the accuracy of the knowledge extraction method described above, which at this stage is only concerned with the estimation of the support metric for any possible itemset derived from the patterns occurring in a training dataset or environment, some datasets described in Tables 5.1 and 5.2 have been used to test it. In all the experiments presented here, the SOM-training method is batch, the radius is set to one, and the neighborhood function is Gaussian unless otherwise stated.

Dataset     Transactions (m)   Items (n)
Chess       3196               75
Connect     67557              129 (Avg. 43)
Mushroom    8124               119 (Avg. 23)

Table 5.2: List of real-life binary training datasets used in the testing of PISM for SOM. They have been used in FIM-algorithm benchmarks (Jr. et al., 2004; Goethals and Zaki, 2003).

The experiments have been carried out in two stages, involving artificial and real-life datasets.
Here, the focus lies on testing how accurate the results given by a SOM via our proposal are when this neural network is queried to provide the support value of some group of itemsets. A list with all the queries involved in our experiments is presented in Table 5.3, which also defines the way in which each trained map will be queried. To be congruent with the previous chapter, the groups of itemsets for testing our extraction method on the trained SOMs have also been formed by mining the artificial and real-life datasets with some constraints, concerning the support σ and/or the size k of the tested itemsets, through the use of the Apriori implementation developed by Borgelt (Borgelt, 2003). As in the previous chapter, in order to corroborate how far or close the estimations made by our proposal are from the real values, the itemset support given by the Apriori algorithm, implemented by Borgelt, will likewise be used for comparison.

Query             Description                          Applied to Maps Trained With
All               All itemsets                         All artificial datasets
1Itemsets         All 1-itemsets                       All real-life datasets
2Itemsets         All 2-itemsets                       All real-life datasets
3Itemsets         All 3-itemsets                       All real-life datasets
4Itemsets         All 4-itemsets                       Chess dataset
1to3Itemsets      All k-itemsets where 1 ≤ k ≤ 3       All real-life datasets
45to100Itemsets   All k-itemsets where 45 ≤ σ ≤ 100    Chess, Mushroom datasets
75to100Itemsets   All k-itemsets where 75 ≤ σ ≤ 100    Connect dataset
90to100Itemsets   All k-itemsets where 90 ≤ σ ≤ 100    Chess dataset

Table 5.3: List of queries used to form the groups of itemsets used for the testing of the itemset-support estimations from SOMs via PISM. In this case, k means the size of the itemsets and σ refers to the support used to form such itemset groups.

Each experiment begins with the training of a SOM M with a dataset or environment D.
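For reference, the "real" support values used in these comparisons amount to exhaustive counting over the transactions, which is what an exact FIM algorithm such as Apriori computes. A minimal sketch, with a hypothetical four-item dataset:

```python
# The "real" support used for comparison can be obtained by exhaustively
# counting over the binary transactions (what Apriori computes exactly).
from itertools import combinations

def real_support(itemset, transactions):
    """Percentage of transactions containing every item of `itemset`."""
    hits = sum(1 for t in transactions if all(t[i] for i in itemset))
    return 100.0 * hits / len(transactions)

# Hypothetical 4-item dataset (rows are binary transactions).
data = [(1, 1, 0, 1), (1, 0, 0, 1), (0, 1, 1, 1), (1, 1, 1, 1)]
# Enumerate every k-itemset with support >= 50%, in the spirit of
# the constrained queries of Table 5.3.
frequent = [
    (items, real_support(items, data))
    for k in range(1, 5)
    for items in combinations(range(4), k)
    if real_support(items, data) >= 50.0
]
assert ((0,), 75.0) in frequent and ((0, 3), 75.0) in frequent
```

Brute-force enumeration is only feasible for tiny item counts; this is precisely why the artificial datasets (with few items) are queried exhaustively while the real-life datasets are queried through constrained itemset groups.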
Consequently, a group of itemsets L along with their real support property, representing the itemsets satisfying some constraints C, is generated by mining D with the Apriori algorithm. The itemsets existing in L are then used as stimuli to make M recall their support or frequency-occurrence values via our method from the knowledge embedded in its group of BMUs. Once the real and estimated itemset-support values are collected, a generalization error is calculated. To evaluate the effectiveness of a SOM in recalling itemset support, the RMS error, defined by Equation 5.10, has been applied.

E_{RMS} = \sqrt{ \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n; m^*) - t_n \}^2 }    (5.10)

In this particular case, y stands for the method proposed here for calculating itemset support from a trained map m^*, which is queried to recall the support of an itemset x. The value of y is then compared against t, which holds the support of x generated by the Apriori algorithm over the original dataset. N stands for the number of itemsets contained in the queried group. For example, N will be equal to 75 if a SOM, trained with the Chess dataset, is queried to recall support values for the group of 1-itemsets defined for such a dataset.

In some of the graphs shown below, the itemsets for which the SOM is queried are arranged along the x-axis. Each itemset is plotted based on its order of appearance (from left to right, from bottom to top) in the lattice structure representing the complete data space formed by the k items. An example of a lattice showing this arrangement can be seen in Figure 2.1 in Chapter 2. In the case of the artificial datasets, the maps are queried for the support values of all of the possible itemsets (combinations) that can be formed by the n items or attributes of the dataset.
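Equation 5.10 is a plain root-mean-square error over the queried group; a short sketch with hypothetical support values:

```python
# The generalisation error of Equation 5.10: an RMS error between the
# estimated supports y(x_n; m*) and the Apriori supports t_n.
from math import sqrt

def rms_error(estimated, real):
    n = len(estimated)
    return sqrt(sum((y - t) ** 2 for y, t in zip(estimated, real)) / n)

# Hypothetical supports (in %) for four queried itemsets:
# squared differences 1 + 1 + 0 + 4 over N = 4 give sqrt(1.5).
assert abs(rms_error([50.0, 30.0, 10.0, 5.0],
                     [49.0, 31.0, 10.0, 7.0]) - sqrt(1.5)) < 1e-12
```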
This is to determine if our events E, representing patterns or associations in the environment, whose final probability or frequency of occurrence is defined by the probabilities of the individual random variables representing the items, are independent or only pairwise independent. This concern is important because if it were found that the components of the events are only pairwise independent, then the estimations made by this method for events involving the joint probability of three or more variables or items could present large differences to the real frequency counters. Therefore, it could not be claimed that a SOM works as an itemset-support memory for ARM.

In Figure 5.5, the results for the artificial datasets obtained from applying PISM, in particular from applying Equation 5.9, to make the SOM recall itemset support are shown.

Figure 5.5: Results for the support value of 15 itemsets (top graph), 255 itemsets (centre graph) and 65535 itemsets (bottom graph) obtained after using PISM in order to satisfy the query -All- to the maps trained with the datasets Bin4, Bin8 and Bin16 respectively. For reference, the values corresponding to the same queries using an Apriori implementation are also plotted.

To provide an indication of the behavior of a SOM and how accurate the results can be during training, some temporal (intermediate) stages of a SOM before it converges are plotted in Figure 5.6. It should be noted in these two figures that even though the mappings derived from a SOM training represent the same input environment, it does not imply that the same results will be generated by them. The variation in the results is the consequence of the map initialisation, which, for this work, is done randomly.
To overcome this unstable situation, the map could then have been initialised linearly. It is relevant to point out from Figure 5.6 that from the very first training iteration (epoch) both SOMs are able to provide a good estimation for the real itemset support.

Figure 5.6: Intermediate results (the support values of 15 itemsets) generated from using PISM for the query -All- to the map being trained with dataset Bin4x100. In both cases, the SOM needs five epochs to converge, but after the first epoch, good estimations can be formed for the support of itemsets. The small difference in the performance between these two exercises is due to the type of initialisation chosen.

To give a taste of the quality of the estimations given by a SOM via the proposed method for some of the query cases for real-life datasets, the graphs in Figures 5.7, 5.8, and 5.9 are depicted. These resulting mappings of the trained SOMs compared to the Apriori results have, at first glance, a better accuracy than the artificial ones. An explanation for this improvement in the itemset-support mapping is that in the real-life datasets the distribution of the patterns is unbalanced; that is, finding only a single repetition for each possible different pattern (itemset) of the distribution is only remotely likely to happen. The latter is a phenomenon happening in the artificial datasets used in this experimentation.
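The independence concern raised earlier (pairwise versus mutual independence of the items) can be illustrated with the classic XOR counterexample; the variables below are illustrative, not drawn from the thesis datasets.

```python
# Why pairwise independence alone would be a problem: with X, Y uniform
# bits and Z = X XOR Y, every pair of variables is independent, but the
# triple is not, so a product of marginals misestimates the joint.
from itertools import product

outcomes = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
p = 1.0 / len(outcomes)  # each (x, y) pair equally likely

pr_all_ones = sum(p for o in outcomes if o == (1, 1, 1))   # true joint
pr_product = (sum(p for o in outcomes if o[0] == 1)
              * sum(p for o in outcomes if o[1] == 1)
              * sum(p for o in outcomes if o[2] == 1))     # marginal product
assert pr_all_ones == 0.0 and abs(pr_product - 0.125) < 1e-12
```

The event (1, 1, 1) never occurs, yet the product of marginals assigns it 12.5%; this is exactly the kind of discrepancy the queries on itemsets of size three and above are designed to detect.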
Figure 5.7: Results (plots on the right) obtained after using PISM in order to satisfy the queries -1Itemset : 4Itemset- to the map trained with the dataset Chess. For reference, the values corresponding to the same queries (plots on the left) against the dataset Chess using an Apriori implementation are also plotted.

In order to assess the capability of a SOM to generalise the recall of the support of any itemset, a series of three experiments has initially been conducted with the Chess, Mushroom and Connect datasets for the groups of the itemsets

Figure 5.8: Results (plots on the right) obtained after using PISM in order to satisfy the queries -1Itemset : 3Itemset- to the map trained with the dataset Mushroom. For reference, the values corresponding to the same queries (plots on the left) against the dataset Mushroom using an Apriori implementation are also plotted.
Figure 5.9: Results (plots on the right) obtained after using PISM in order to satisfy the queries -1Itemset : 3Itemset- to the map trained with the dataset Connect. For reference, the values corresponding to the same queries (plots on the left) against the dataset Connect using an Apriori implementation are also plotted.

plotted in Figures 5.7, 5.8 and 5.9. In these experiments, we are interested in evaluating the effectiveness of our method, applied to a SOM with random initialisation, in recalling itemset support for itemset groups whose components are formed by the same number of items. The corresponding generalisation errors for these experiments are summarised in Tables 5.4, 5.5, and 5.6. These numbers have been obtained by applying Equation 5.10 to the results given by a SOM via our method and to those given by the Apriori algorithm over the original dataset.

                           Rectangular Radius             Hexagonal Radius
Experiment  Itemset Group  1        0.5      0.001        1        0.5      0.001
1           1-itemsets     0.13781  0.14819  0.26277      0.15412  0.18899  0.15363
            2-itemsets     1.3735   0.96662  0.8871       1.3771   1.0398   0.80162
            3-itemsets     1.4338   1.0097   0.92083      1.4415   1.0841   0.84045
2           1-itemsets     0.27685  0.15956  0.16423      0.15149  0.12718  0.1863
            2-itemsets     1.426    0.95871  0.75058      1.3789   0.99795  0.82116
            3-itemsets     1.4803   1.0012   0.78645      1.4431   1.0454   0.85914
3           1-itemsets     0.21732  0.14176  0.14409      0.15006  0.16442  0.15158
            2-itemsets     1.3974   0.99191  0.79192      1.3699   1.0199   0.78599
            3-itemsets     1.4593   1.037    0.83168      1.4314   1.0659   0.82551

Table 5.4: Generalisation errors produced by a trained SOM through PISM for the queries 1Itemsets, 2Itemsets and 3Itemsets derived from the Chess dataset.
                           Rectangular Radius               Hexagonal Radius
Experiment  Itemset Group  1        0.5       0.001         1        0.5      0.001
1           1-itemsets     0.14403  0.14272   0.16962       0.14957  0.11507  0.13477
            2-itemsets     0.42695  0.31001   0.32908       0.46244  0.28882  0.3464
            3-itemsets     0.2484   0.18097   0.193         0.27029  0.17016  0.2034
2           1-itemsets     0.19336  0.095607  0.15078       0.1531   0.10954  0.15362
            2-itemsets     0.44692  0.28158   0.30919       0.45812  0.29399  0.32803
            3-itemsets     0.25762  0.16553   0.18251       0.26885  0.1724   0.1929
3           1-itemsets     0.21152  0.14741   0.18457       0.16683  0.11674  0.12932
            2-itemsets     0.4413   0.29438   0.34385       0.4298   0.30569  0.31145
            3-itemsets     0.25409  0.17267   0.20083       0.25081  0.18004  0.18481

Table 5.5: Generalisation errors produced by a trained SOM through PISM for the queries 1Itemsets, 2Itemsets and 3Itemsets derived from the Mushroom dataset.

                           Rectangular Radius             Hexagonal Radius
Experiment  Itemset Group  1        0.5      0.001        1        0.5      0.001
1           1-itemsets     0.35658  0.49537  0.78376      0.45956  0.36677  0.74751
            2-itemsets     0.76868  0.68955  0.77088      0.80169  0.65929  0.75174
            3-itemsets     0.6471   0.55741  0.57361      0.66536  0.5489   0.56305
2           1-itemsets     0.44143  0.38671  0.83169      0.38639  0.28075  0.82694
            2-itemsets     0.80008  0.65154  0.80109      0.7705   0.63333  0.79409
            3-itemsets     0.66575  0.53933  0.59207      0.64628  0.53553  0.58539
3           1-itemsets     0.39652  0.33373  0.8609       0.33847  0.78373  0.6665
            2-itemsets     0.76256  0.63865  0.84214      0.76385  0.83437  0.71637
            3-itemsets     0.63775  0.53422  0.6246       0.64553  0.6366   0.54759

Table 5.6: Generalisation errors produced by a trained SOM through PISM for the queries 1Itemsets, 2Itemsets and 3Itemsets derived from the Connect dataset.

It is important to note that errors for three different values of radius for different map layouts have been calculated. The radius has been varied to determine the best parametric state of a SOM for FIM. The tuning of the radius is done since this parameter can work as the regulator of the BMU influences exchanged during training, and its definition can affect the accuracy of the final mapping.

The way in which the nodes are distributed in the SOM, hexagonal or rectangular, has also been tested, since we have assumed that different organisations of neurons can also influence the final estimated itemset-support values. Because the six maps of each experiment were initialised randomly, it can be stated that there is no relationship amongst their results. Nevertheless, the three values representing the errors for the three groups of itemsets are related, because they have been calculated from the same trained map in each experiment. By using such relationships, a tendency in the resulting errors for the tested groups for each dataset can be stated as follows:

• In the case of the Chess dataset, Table 5.4, the error tends to increase as k, which defines the size of the itemsets, increases, regardless of the type of radius employed in the experiments. No clear tendency is present regarding which map layout provides the best results, so the layout can be stated to be irrelevant for this dataset in the estimation of itemset support.

• For the case of the Mushroom dataset, Table 5.5, the error tends to form a distribution with a mode at the 2-itemsets, regardless of the type of radius. As in the above case, a tendency to give more accurate results amongst the two map layouts is not clear either.

• In the last case, regarding the results for the Connect dataset, Table 5.6, the error also tends to form a distribution with a mode at the 2-itemsets, similar to the previous case, but no defined order between the extreme error values is present.

Since it was not possible to establish which radius is the best for the estimation of itemset support with the results given above, the next experiments will involve initialising all the maps, utilising different radii, in the same manner.
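The role of the radius as a regulator of BMU influence can be made concrete with a minimal batch-SOM sketch; this is an illustrative toy, not the thesis implementation, and the tiny-radius case shows the reduction to plain vector quantization.

```python
# Sketch of one batch-SOM epoch: each codeword becomes a neighborhood-
# weighted mean of the patterns, so the radius sigma directly regulates
# how much BMU influence is exchanged between nodes; sigma -> 0 reduces
# the update to vector quantization (each node averages only its wins).
from math import exp

def batch_epoch(codebook, grid, patterns, sigma):
    dim = len(codebook[0])
    num = [[0.0] * dim for _ in codebook]
    den = [0.0] * len(codebook)
    for x in patterns:
        # best-matching unit by squared Euclidean distance
        bmu = min(range(len(codebook)),
                  key=lambda i: sum((codebook[i][d] - x[d]) ** 2
                                    for d in range(dim)))
        for i in range(len(codebook)):
            d2 = sum((grid[i][a] - grid[bmu][a]) ** 2 for a in range(2))
            h = exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
            den[i] += h
            for d in range(dim):
                num[i][d] += h * x[d]
    return [[num[i][d] / den[i] if den[i] else codebook[i][d]
             for d in range(dim)] for i in range(len(codebook))]

grid = [(0, 0), (1, 0)]                 # two-node rectangular layout
codebook = [[0.1, 0.1], [0.9, 0.9]]
patterns = [[0, 0], [0, 1], [1, 1]]
# A tiny radius means virtually no BMU influence is exchanged, so each
# codeword collapses to the mean of the binary patterns it wins.
updated = batch_epoch(codebook, grid, patterns, sigma=0.001)
assert abs(updated[0][1] - 0.5) < 1e-9 and abs(updated[1][0] - 1.0) < 1e-9
```

With sigma = 0.001 the first node averages the two patterns it wins (giving the component mean 0.5), which is why, for binary data, the codewords can be read as per-node item frequencies.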
To produce such initialisation in the maps, a linear initialisation will be applied, because in this way the weight vectors are initialised along the linear subspace spanned by the two principal eigenvectors of the input dataset (Kohonen, 1996). Only one experiment is shown for each group of itemsets in Table 5.7, because the same results will always be obtained for all cases. That is, a map trained in batch with a linear initialisation will converge to a final state in an identical manner because its initial state is always the same.

Unlike our previous experiments, in this case the results maintain a relationship for the different radii and itemset sizes. Therefore, in general terms, it can be stated that as the radius used for training decreases, the generalisation error also decreases.

                           Rectangular Radius             Hexagonal Radius
Dataset   Itemset Group    1        0.5      0.001        1        0.5      0.001
Chess     1-itemsets       0.2074   0.13494  0.15624      0.26896  0.15379  0.15624
          2-itemsets       1.3238   0.95034  0.5906       1.3455   1.0044   0.5906
          3-itemsets       1.3837   0.99443  0.61271      1.4024   1.0498   0.61271
Mushroom  1-itemsets       0.24362  0.28344  0.15767      0.24715  0.29623  0.15767
          2-itemsets       0.41104  0.35578  0.34597      0.42121  0.36697  0.34597
          3-itemsets       0.23652  0.20278  0.20182      0.24215  0.20841  0.20182
Connect   1-itemsets       0.8835   0.67506  0.74965      0.87661  0.7407   0.74965
          2-itemsets       0.95011  0.75157  0.66644      0.95628  0.79004  0.66644
          3-itemsets       0.72913  0.58338  0.47813      0.73671  0.60596  0.47813

Table 5.7: Generalisation errors for the trained SOMs with linear initialisation.

It can also be noted that the error values for the different itemset groups, given by either the rectangular or hexagonal maps with a radius set to 0.001, are identical. In the cases in which the radius is greater than 0.001, it is observed that the maps with a rectangular layout give better estimations than their hexagonal counterparts.
In summary, based on the results in Table 5.7, some tentative conclusions can be made as follows:

• It can tentatively be determined that the value of the error tends to grow as k, the number of components (items) in the itemsets, grows as well.

• It can also be concluded that the generalisation error has a direct relationship with the amount of influence defined by the radius used for training.

• The best error values come from the maps whose radius parameter has been reduced to 0.001. Therefore, it can be assumed that performing no exchange of BMU influence, i.e., setting the radius to zero, which reduces the SOM to a Vector Quantization algorithm, could generate even better estimations for itemset support.

In order to find out if the above conclusions hold in more realistic mining activities, in which support for different groups of k-itemsets with different sizes is needed, some trained SOMs have also been tested on the queries, constrained by the support property of the itemsets in the dataset, defined in Table 5.3. To visualise the SOM behavior in this type of mining exercise, we have plotted, in Figure 5.10, an example of the SOM estimations against the real support values to model their discrepancies. The complete results for these new experiments, involving different ranges of varied groups of k-itemsets, are summarized in Table 5.8.
Figure 5.10: Distribution of the itemset-support estimations made by SOM via our method for the query -90to100Itemsets- for the Chess dataset (one panel per k-itemset group, 3 ≤ k ≤ 7, comparing the hexagonal SOM, the rectangular SOM and Apriori). The corresponding errors are summarized in Table 5.8.

                                         Map Layout
Dataset   Query            k    Hexagonal  Rectangular  Itemsets per group
Chess     90to100Itemsets  3    0.42918    0.3668       167
                           4    0.52008    0.4352       203
                           5    0.61963    0.50871      128
                           6    0.76948    0.62251      39
                           7    1.0274     0.81303      4
Chess     45to100Itemsets  3    0.92081    0.89465      4985
                           4    1.0748     1.0476       25500
                           5    1.2055     1.1724       88170
                           6    1.327      1.2829       217705
                           7    1.4407     1.3794       397947
                           8    1.5395     1.4548       550220
                           9    1.6126     1.4983       581647
                           10   1.6496     1.4996       471908
                           11   1.648      1.4559       293209
                           12   1.6202     1.3801       138294
                           13   1.5947     1.3037       48473
                           14   1.6045     1.2653       12023
                           15   1.6755     1.2926       1896
                           16   1.8418     1.4084       152
                           17   2.0514     1.5868       5
Mushroom  45to100Itemsets  3    0.47635    0.44559      100
                           4    0.54856    0.49902      91
                           5    0.62168    0.49366      37
                           6    0.71689    0.51841      6
Connect   75to100Itemsets  3    1.6103     1.6109       2757
                           4    1.6982     1.6883       13218
                           5    1.7487     1.7322       44721
                           6    1.7817     1.7626       111159
                           7    1.8063     1.7886       208126
                           8    1.8296     1.8168       297836
                           9    1.858      1.8532       327797
                           10   1.8969     1.9027       277133
                           11   1.9503     1.9686       178389
                           12   2.0217     2.0538       85839
                           13   2.1142     2.1612       29903
                           14   2.23       2.293        7135
                           15   2.3675     2.4485       1052
                           16   2.5059     2.6096       76
                           17   2.24       2.4146       1

Table 5.8: Generalised errors for trained SOMs with linear initialisation on different ranges of groups of k-itemsets.
Roughly speaking, it has been noted again that the rectangular maps produce better estimations than the hexagonal ones. Nevertheless, the differences established between them are not considerable. The tendency of the error to grow as k increases is also present. However, in some cases, it tends to stay steady or even to decrease, as shown in Figure 5.11.

Figure 5.11: Generalisation errors for the results given in Table 5.8 (one panel per dataset/query combination: Chess with the 90to100Itemsets and 45to100Itemsets queries, Mushroom with 45to100Itemsets, and Connect with 75to100Itemsets). While the x-axis represents the different itemset groups, the y-axis determines the calculated error.

Reviewing the numbers defined by the error tables, it can be determined that even though the generalisation errors look relatively small for the experiments, some considerable differences between the estimated and real support values can be found for some itemsets. Although these differences tend to diminish during training, their final values could cause misleading support values during the detection of frequent itemsets in an ARM exercise, so it is important to minimize such misleading effects. As an explanation for these discrepancies in the support results, two options may be considered: a) the type of metric used to organise the patterns on the map, or b) the assumption that the items (events) of the patterns are independent at each BMU.
Moreover, both reasons could be directly influenced by the number of neurons contained in the map, since this characteristic of the map is responsible for limiting the node space available for any new incoming pattern. In other words, the larger the number of nodes available in the map, the better the distribution of the patterns along the map, and therefore the better the resolution of the outcome. The latter is the concept known as emergent feature maps, introduced by Ultsch (Ultsch, 1999), whose structure is defined by a large number of neurons and which has been shown to have potential for data mining tasks (e.g., classification and clustering). Emergence occurs in natural as well as artificial systems, and refers to the ability of a system to produce a phenomenon on a new, higher level, as a result of the cooperation of many elementary processes (Ultsch, 1999). That is, new data features can emerge from structures formed by the cooperation of a large number of neurons. This is a characteristic that is not present in traditional SOMs, in which the number of neurons is controlled and limited (approximately equal to the number of clusters), which is too small to show emergence. Therefore, a series of experiments with the Chess dataset has been set up to evaluate whether the concept defined above could improve the response of a SOM for FIM. In these experiments, different sizes of a SOM have been used and the results are shown in Figures 5.12 and 5.13.
The sizes of the map involved in these experiments range from two nodes to three times the map size H, which defines the number of neurons in the map needed for mapping the n input patterns contained in the training dataset.

Figure 5.12: Distribution of the itemset-support generalisation error made by SOM via our method for the queries 1Itemsets, 2Itemsets and 3Itemsets for the Chess dataset, when the size of the map, representing an itemset-support memory for ARM, increases (the x-axis ranges from 2 nodes to 3H).

H is normally calculated heuristically (Vesanto et al., 1999) by Equation 5.11. In this case, it has resulted in being 289 for the 3196 patterns existing in the Chess dataset.

H = 5 \sqrt{n}    (5.11)

First suspected and then corroborated by the results shown in Figures 5.12 and 5.13, the error for the recall of the support of k-itemsets (where k is greater than 1) tends to decrease as the number of nodes in the map increases. The improvement is due to the fact that the input patterns develop a better organization on the map, which also breaks down the dependency property among the items of the input patterns. That is, patterns which are truly related remain closer on the map but they do not share the same node.
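The heuristic of Equation 5.11 reproduces the 289-node map if the result is rounded up to a square grid side; the rounding step is our assumption, as the exact rounding is not spelled out here.

```python
# Heuristic map size of Equation 5.11 (Vesanto et al.): H = 5 * sqrt(n),
# rounded up here to a square grid side (an assumed rounding convention).
from math import ceil, sqrt

def heuristic_map_size(n_patterns):
    h = 5.0 * sqrt(n_patterns)   # 5 * sqrt(3196) ~ 282.7 for Chess
    side = ceil(sqrt(h))         # smallest square grid holding H nodes
    return side * side

assert heuristic_map_size(3196) == 289   # a 17 x 17 map
```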
Figure 5.13: Distribution of the itemset-support generalisation error made by SOM via our method for the query 45to100Itemsets for the Chess dataset (one curve per k-itemset group, 3 ≤ k ≤ 17), when the size of the map, representing an itemset-support memory for ARM, increases.

The improvement in the error for the group of the 1-itemsets does not follow the behaviour explained above. In this particular case, it starts with a value very close to zero, since there are just two nodes which accumulate all the input patterns, so that the values representing the calculated support are placed closer to their corresponding means because there are not many BMU influences to combine. The contrary effect occurs when the map size is increased; this has the result that the error tends to increase, due to the fact that some BMU influences are present in the updating of the map. Nevertheless, the error stays steady and low throughout the experiments.

In a more realistic scenario, in which the support for different itemset groups is extracted from a SOM, for instance the query 45to100Itemsets over the Chess dataset, it is evident that the error in the estimations tends to decrease as H increases in the experiments. This is due to the better distribution of the patterns along the nodes; therefore, it can be concluded that the use of the concept of emergent feature maps is advantageous for the quality of the itemset-support estimations made by this SOM-based memory for ARM.

5.4 Conclusions

In the discovery of association rules from datasets, the task of FIM has to be performed first, since it provides the raw material to form the possible rules.
To measure which itemsets from the database are interesting for the data mining task, the support of the itemsets has to be calculated. Motivated by the results obtained from the knowledge formed with an auto-associative memory in the previous chapter, we have focused on exploring an unsupervised neural network. In particular, we have investigated the use of a SOM for ARM.

In this work, the exploration of a novel application for SOM, which refers to using its codebook for the estimation of itemset support, has been undertaken. Unlike other proposals (Changchien and Lu, 2001; Shangming Yang, 2004), which form the literature on this topic, in this work the suitability of estimating "itemset support" from the weight matrix of this neural network has been investigated. In particular, the calculation of the support has been proposed via an extraction mechanism called PISM (Probabilistic Itemset-support eStimation Mechanism), which uses only the winners formed in the final map for forming itemset-support estimations. Thus, the input dataset can be discarded after training, since the winners have gathered enough capability (information) to map the associations occurring in the multi-dimensional input patterns.

To validate the suitability of a SOM for this type of data mining task, the results of some experiments have been given and compared against an implementation of the Apriori algorithm. Numeric comparisons have also been made by using an error metric between the estimated values extracted from a SOM and those of a traditional FIM algorithm. The results of the experiments have shown that the suitability of a SOM for ARM is realistic, in particular if the concept of emergent SOM is utilized. In summary, we have satisfactorily tackled the problem of how to reproduce the counting of patterns, occurring in the high-dimensional space of an environment, with the learnt information or knowledge embedded in the two-dimensional space formed by a trained SOM.
Therefore, it can be concluded that a SOM can be seen as a good candidate for our desired memory, due to its ability to learn information about the frequency of occurrence of patterns hidden in training associations.

Chapter 6

Incremental Training for Incremental ARM: A SOM Model

Because itemset support is a very important metric for the generation of association rules, an extraction mechanism for SOMs, which decodes the formed codebook, has been proposed to estimate itemset support in the previous chapter. As a result of our proposal, the generation of association rules directly from the knowledge of a trained SOM has become feasible; therefore, unlike other related proposals (Shangming Yang, 2004; Changchien and Lu, 2001), we have stated that the training data is no longer needed for ARM.

Nevertheless, a problem affecting the validity of the itemset-support knowledge embedded in the trained SOM emerges as soon as the database[1] used for its training starts acquiring new transactions. That is, the itemset information in the SOM is no longer valid to describe the current state of the database. To tackle this new problem, which results from the dynamics of the data, incremental-training mechanisms can be used; nevertheless, their definitions have been based on non-batch procedures.

[1] Throughout this chapter, the term database will be used instead of the term dataset to refer to the data source because we start from the fact that the data have been recorded and stored in a model beforehand. Moreover, we believe the use of database is more appropriate for the problem at hand, since dataset is often used in the literature to describe a data file with a finite number of samples (transactions, patterns).

Because the concept of batch training is important for itemset-support estimation and recall from a trained SOM, we undertake in this chapter the task of developing an incremental batch-training mechanism for SOMs called Bincremental-SOM.
In particular, we will propose how the itemset knowledge embedded in the map should be updated to keep it valid while the original environment changes periodically.

6.1 Introduction

Previously, in Chapter 5, we stated that itemset support can be estimated from a trained SOM, especially from its BMUs. These estimations have been possible because the codebook has been decoded probabilistically. As the codebook stores information about itemset support, we have considered the SOM to be an artificial memory from which the support of an itemset X can be recalled any time it is needed. Therefore, we have proposed that the SOM can take the role of an itemset-support provider, instead of the original database, in the ARM framework. Additionally, this resulting trained SOM can also be seen as an abstract descriptive model Msom formed from the associations occurring in the training database D. In the real world, data are almost never static, especially data used for analysis, which typically change their state as a result of events occurring in their environment. Such environments are called non-stationary, since new states appear as a result of changes throughout time. For instance, new shopping patterns may emerge in the transactions of a database as a result of the introduction or promotion of new products in the market. Because of all the changes or dynamics present in a database, any knowledge derived from it, whether producing predictive or descriptive models, tends to lose accuracy in representing the current state of the underlying database. Ignoring this inconvenient-but-realistic data characteristic in the development of data-mining approaches becomes a serious problem for data analysis. In particular, misleading and wrong decisions can be made by end users when the latest state of the data is not represented in their models.
For example, a mortgage could be granted to a bad-credit client simply because the model in current use does not consider the most up-to-date tendencies in the market. Unprofitable selling policies can be applied to some items in a supermarket if the rules used do not capture the actual tendency of the customers' shopping behaviour. Therefore, it has become important to consider developing algorithms which can cope with this undesirable data property in order to keep the models alive, updated, or valid for as long as reasonably possible. As part of the DM toolbox, any SOM-based approach, such as our itemset-support memory, suffers from the problem stated above, because it is well known that a trained SOM only learns the state presented in the input database at the moment of its training. One way of making a SOM learn a new state of D is to employ the simplest solution to the problem, which involves retraining the SOM with the latest information contained in D (history and changes together). Nevertheless, this solution is neither practical nor optimal, because it requires keeping the entire D throughout time to perform the training, and consequently no advantage can be taken of the past knowledge learnt by its nodes. In the ANN world, this problem has been tackled by performing incremental training of neural networks, which allows them to update their knowledge with the current information of their environment without losing or forgetting the information already embedded in their weight matrix. When a neural network learns incrementally, it can also be said to address the stability-plasticity dilemma, since it adapts its internal structure in order to capture the basis of the new incoming inputs without forgetting past experiences. Over the past years, proposals related to SOM technology have appeared to tackle the lack of training under non-stationary environments.
For instance, GSOM (Alahakoon et al., 2000a) focuses on developing a self-organising and growing structure for continuous learning. This variant of the SOM is trained with a sequential mechanism and grows while it learns, using heuristics which determine whether the deletion or insertion of nodes into the structure is suitable. Since our aim is to update the model Msom, representing an artificial memory full of information about itemset support, and bearing in mind that the itemset-support extraction mechanism proposed assumes that Msom has been formed by batch training, we can restate the underlying problem as involving not only updating the map under non-stationary conditions but also doing so by batch-incremental training. According to Hung et al. (Hung S., 2004), the limitation of learning models in a non-stationary environment has been addressed by introducing the concept of non-batch learning. This concept involves techniques such as online learning, lifelong learning, incremental learning and knowledge transfer, which have all pointed out the limitation of employing batch learning for tasks concerning non-stationary environments. The main criticism of batch methods refers to the need to keep the entire database for training, which is impractical for these environments. Therefore, we focus here on defining a batch-training mechanism for SOMs suitable for non-stationary environments. In particular, the latter is necessary because our interest lies in gaining some insight into the problem of incremental association rule mining, in which the maintenance of the frequent itemsets and their support, according to Cheung et al. (Cheung et al., 1996b), is relevant to keeping the association rules up to date. In other words, the work proposed below should be considered a proposal needed to perform incremental ARM from a neural-network standpoint.
The key insight is that the maintenance of itemset support will result from exploiting the incremental-learning properties of neural networks. The idea will then involve studying whether the dynamics, derived from the presence of data chunks (defining inserts) from D, can be incorporated incrementally into the model Msom(t), formed at time t, in order to produce Msom(t + k), which defines the latest state of D. Our approach must be interpreted as an incremental learning task for SOMs which makes the map retain past knowledge while the latest state of the support of the itemsets in the environment is learned.

6.2 Batch SOM for Non-stationary Environments

Practically speaking, we can state that a database itself represents a non-stationary environment formed by different phases throughout time. These phases, defining size-varied groups of transactions or data chunks, can occur at different speed rates, which adds a new complexity to the incremental problem. Nevertheless, we will assume in this case that among the different phases forming the environment (database), there is always a time window in which the data can be buffered in order to be subsequently presented to the SOM for learning.

6.2.1 The Problem Definition

Let D be a database and t a metric to measure time. Thus, a data distribution that represents D at time t can be expressed by D(t). Let D+ be a group of new data points, representing itemsets, that makes D pass from the state t to t + x, such that D(t + x) = D(t) ∪ D+. Let Msom be a SOM-based model resulting from training a SOM with input data. Thus, Msom(D(t)) defines a SOM describing some concepts of D(t), for instance its topology, distribution, or the associativity of its attributes or items. The problem is then to produce a model Msom(D(t + x)) without using traditional procedures, because they require the presence of the entire dataset D(t + x) for a solution.
Instead, it is necessary to do it in batch, producing an approximation model M′som(D(t + x)) that is satisfactory in comparison to the final model Msom(D(t + x)) that a traditional procedure would produce, so that M′som(D(t + x)) ≈ Msom(D(t + x)). The main characteristic of the wanted mechanism is that the old data chunks D(i), for all i < j, where j defines the index of the latest data chunk in the environment, will not be needed for generating such an approximation. However, the use of some knowledge K from Msom(D(j − 1)) can be considered. In terms of the above definition, the algorithm we aim for can be defined as follows:

$$ M'_{som}(D(t)) = \gamma\big(\, K[M_{som}(D(t-1))] \,,\, D^{+} \big) \approx M_{som}(D(t)) \qquad (6.1) $$

where γ() is the target algorithm with input parameters K[Msom(D(t − 1))] and D+, which define some knowledge of the SOM generated at phase (t − 1) and the new group of transactions of D, respectively.

6.2.2 Interpretation by Node Influences of the Batch Training

SOM training can typically be realised in two modes, sequential or batch, whose usage depends exclusively on the characteristics of the task to be tackled. The difference between them is basically the manner in which they perform the update of the map m. Overall, both modes look for the best group of reference vectors (weight matrix) which can map and quantize the distribution and information of the training data accurately. The steps needed for performing SOM training in batch are summarized in detail in Figure 6.1.

Figure 6.1: Algorithm SOM training in batch.

In the search for BMUs, a distance metric, for instance the Euclidean distance, is needed to perform the comparison between the vectors of the map and the inputs (step 4). A new state of the map at epoch i results from using Equation 6.2 (step 6). A final map, representing the information of D, is created after the total number of epochs has been reached.
$$ m_i(t+1) = \frac{\sum_j h_{ji}(t)\, S_j(t)}{\sum_j n_{V_j}(t)\, h_{ji}(t)} \qquad (6.2) $$

From Equation 6.2, it can be deduced that each node contributes to the modification of the map until it converges. In particular, it can be stated that each node m_i contributes with its own S_i to the update of the map at each epoch. The term S, which is defined below, denotes the concentration of input patterns at each node:

$$ S_i(t) = \sum_{j=1}^{n_{V_i}} x_j \qquad (6.3) $$

Here the x_j are the n_{V_i} different patterns that have chosen the node m_i as their BMU. S can also be understood as the collection of the different data points allocated to the Voronoi region $V_i = \{x \mid \|x - m_i\| < \|x - m_k\| \;\forall k \neq i\}$. An expanded version of Equation 6.2, which governs training, can be derived as follows:

$$ m_i(t+1) = \frac{h_{1i}(t)S_1(t) + \cdots + h_{ji}(t)S_j(t) + \cdots + h_{mi}(t)S_m(t)}{n_{V_1}(t)h_{1i}(t) + \cdots + n_{V_j}(t)h_{ji}(t) + \cdots + n_{V_m}(t)h_{mi}(t)} \qquad (6.4) $$

In this representation of the batch-training equation, the way in which each of the units contributes to the update of the map is even more evident. Although it seems that all the nodes take part in the update operation, in reality not all of them have the resources to exert an influence on such a change. The lack of contribution from some nodes is due to the fact that they have not become BMUs; therefore, they do not have any data points in their regions which can be shared with the rest of the map.
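As a concrete illustration, one epoch of the batch update of Equation 6.2 can be sketched in Python. This is a hedged sketch: the Gaussian neighbourhood kernel and the lattice representation are our assumptions, not the thesis' exact implementation.

```python
import numpy as np

def batch_som_epoch(M, X, sigma, grid):
    """One epoch of batch SOM training following Equation 6.2 (a sketch).

    M     : (m, d) codebook, one reference vector per node.
    X     : (n, d) input patterns.
    sigma : neighbourhood radius of the Gaussian kernel h_ji (assumption).
    grid  : (m, 2) node coordinates on the 2-D map lattice.
    """
    # step 4: find the BMU of every pattern (squared Euclidean distance)
    bmu = np.argmin(((X[:, None, :] - M[None, :, :]) ** 2).sum(-1), axis=1)

    # S_j and n_Vj: sum and count of the patterns captured by each node
    m = len(M)
    S = np.zeros_like(M)
    n = np.zeros(m)
    for j in range(m):
        mask = bmu == j
        n[j] = mask.sum()
        S[j] = X[mask].sum(axis=0)

    # h_ji: Gaussian neighbourhood between nodes j and i on the lattice
    d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    h = np.exp(-d2 / (2 * sigma ** 2))

    # step 6: Equation 6.2 -- each node moves to the h-weighted
    # combination of all Voronoi sums S_j
    num = h.T @ S            # sum_j h_ji(t) S_j(t)
    den = h.T @ n            # sum_j n_Vj(t) h_ji(t)
    return num / den[:, None]
```

With a single node the update reduces, as expected, to the centroid of all inputs.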
For instance, if the inputs have chosen only nodes m_j and m_k (out of a total of m nodes) as BMUs during a training epoch, then the particular situation describing the update of node m_i from nodes m_1, m_j and m_k can be expressed by decomposing Equation 6.4 as follows:

$$ \Delta_i(m_1(t)) = \Delta_i(V_1(t)) = 0 $$
$$ \Delta_i(m_j(t)) = \frac{h_{ji}(t)S_j(t)}{n_{V_1}(t)h_{1i}(t) + \cdots + n_{V_j}(t)h_{ji}(t) + \cdots + n_{V_k}(t)h_{ki}(t) + \cdots + n_{V_m}(t)h_{mi}(t)} = \Delta_i(bmu_j(t)) \neq 0 $$
$$ \Delta_i(m_k(t)) = \frac{h_{ki}(t)S_k(t)}{n_{V_1}(t)h_{1i}(t) + \cdots + n_{V_j}(t)h_{ji}(t) + \cdots + n_{V_k}(t)h_{ki}(t) + \cdots + n_{V_m}(t)h_{mi}(t)} = \Delta_i(bmu_k(t)) \neq 0 \qquad (6.5) $$

where $\Delta_k(m_j(t))$ defines the contribution or influence given to the node m_k by the node m_j, or by the Voronoi region V_j, at time t. In the particular case of the term $\Delta_i(m_1(t))$ (top term), the influence becomes zero because this node was not able to allocate any input pattern to its region ($|V_1| = 0$). Having observed that only some nodes contribute to the formation of the map, a new formulation of the batch-training equation can be expressed by

$$ m_i(t+1) = \sum_{m_j \in W} \Delta_i(m_j(t)) \qquad (6.6) $$

which represents the update of the map in terms of the influential factors generated by the set W containing the nodes satisfying the condition

$$ W = \{ m_i \mid |S_i| > 0, \text{ i.e. } S_i \neq \emptyset \} \qquad (6.7) $$

where S_i is the set whose elements are the patterns that have hit the node m_i. In order to use Equation 6.6 for the training of the SOM, it is necessary to state mathematically how each of the node influences can be calculated.
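The decomposition of Equations 6.5 to 6.7 can be sketched directly in code: only nodes in W contribute, and the update of a node is the sum of the per-node influences. The array layout below is our assumption, not the thesis' implementation.

```python
import numpy as np

def node_influences(S, n, h, i):
    """Per-node influences Delta_i(m_j) of Equations 6.5-6.7 (a sketch).

    S : (m, d) Voronoi sums S_j
    n : (m,)   pattern counts n_Vj
    h : (m, m) neighbourhood kernel, h[j, i] between nodes j and i
    i : index of the node m_i being updated
    """
    W = n > 0                                  # Equation 6.7: only BMUs count
    den = (n[W] * h[W, i]).sum()               # shared denominator of Eq. 6.5
    delta = np.zeros_like(S, dtype=float)
    delta[W] = h[W, i, None] * S[W] / den      # Delta_i(m_j); zero outside W
    return delta

# Equation 6.6 is then simply the sum of the influences over all nodes:
# m_i(t+1) = node_influences(S, n, h, i).sum(axis=0)
```

Nodes that captured no patterns contribute exactly zero, matching the top term of Equation 6.5.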
Therefore, if we take one of the non-zero influential components stated in Equation 6.5, for instance the term $\Delta_i(m_j(t))$, and consider that only the two nodes m_j and m_k have become BMUs, then an expression defining the influence from node m_j on m_i can be stated as follows:

$$ \Delta_i(m_j(t)) = \frac{h_{ji}(t)S_j(t)}{n_{V_j}(t)h_{ji}(t) + n_{V_k}(t)h_{ki}(t)} $$

Here it is important to note that the total influence generated from m_j on m_i depends not only on the state produced by m_j but also on external influences formed at other BMUs. This external influence is represented in the denominator of the expression, and corresponds to the number of data points allocated to those other BMUs, weighted by the corresponding neighbourhood function originating from them. Hence, the total influence generated from m_j on m_i at time t can be described by

$$ \Delta_i(m_j(t)) = \frac{S_j(t)}{n_{V_j}(t) + \frac{n_{V_k}(t)\,h_{ki}(t)}{h_{ji}(t)}} $$

It is important to point out that the final response from node m_j results in a value close to the centroid of its Voronoi region V_j, defined by $\bar{n}_j = \frac{1}{n_{V_j}} \sum_{x_i \in V_j} x_i$. The difference between these two terms can be stated to be controlled by an influence coefficient β, which defines the ratio of the influences of the BMUs other than m_j to the neighbourhood weight between m_i and m_j. Therefore, it can be stated that the influence generated at a node m_j towards a node m_i in batch SOM training can be defined by

$$ \Delta_i(m_j(t)) = \frac{S_j(t)}{n_{V_j}(t) + \beta_{ji}} \qquad (6.8) $$

in which β_{ji} is defined by

$$ \beta_{ji} = \frac{\sum_{(k \in W) \wedge (k \neq j)} n_{V_k}(t)\,h_{ki}(t)}{h_{ji}(t)} \qquad (6.9) $$

The tendency of the possible differences between the resulting value $\Delta_i(m_j(t))$ and the corresponding centroid value of m_j(t) is represented by the graph in Figure 6.2.

Figure 6.2: Tendency of the final influence given by the node m_j depending on the strength of the influences received from other nodes. (Axes: x, influence from other nodes; y, final node influence, starting at S_j(t)/n_{V_j}(t) and decaying towards 0.)
6.2.3 The Algorithm

In order to satisfy the requirements stated in Equation 6.1, we need to determine which information from an old trained map is relevant to keep for its maintenance throughout time. That is, we need to identify K, since it defines knowledge about the past of an environment. We have proposed K to be defined by the last group of BMUs, since they triggered the last change of the available map. In other words, historical knowledge K about a training environment D up to time t can be formed from a model Msom(D(t)) by extracting the following properties of each BMU:

• The reference vector, since it summarises all the patterns grouped at the node.

• The hit histogram, which also defines a prior probability of the node; that is, the number of times that the corresponding node has been hit by the inputs.

Once this knowledge has been obtained, the next step is to define a procedure to re-use it in a training process in which a SOM is about to learn the new state D(t + x) drawn from the original environment. The latter is necessary, since otherwise the past knowledge will be forgotten as soon as the new training with the latest state of D begins. To perform the re-use of the past knowledge, we propose adding a new term to Equation 6.6, based on the fact that batch SOM training is the result of summing up neural influences. Therefore, we propose that batch SOM training should be defined by

$$ m_i(t+1) = \sum_{m_j \in W} \Delta_i(m_j(t)) + \sum \Delta_i(K[M_{SOM}(D(t))]) \qquad (6.10) $$

in which the t of $\Delta_i(K[M_{SOM}(D(t))])$ refers to the previous state defined in the non-stationary environment rather than to the epoch t of the current training. The insertion of this new component into the traditional training of the SOM is proposed as a means of keeping track of the old data, because it defines not only the past organization of the data (centroids) but also how strong or weak the pattern populations allocated to the BMUs were.
Based on the above, our desired γ() algorithm is defined as in Figure 6.3. The first function is responsible for triggering the learning of the data at each of the stages existing in the non-stationary environment. The second function, which is called as a result of the appearance of a new data chunk in the environment, is in charge not only of learning the new data chunk, but also of retaining the old knowledge captured by the previously trained map. In other words, both sources, the new patterns and the reference vectors of the BMUs of the previous SOM, serve as input vectors for the training of the new map (Step 4). Moreover, a type of linear initialisation occurs with our approach at the moment of learning a new data chunk, because the old map is also re-used to initialise the new one (Step 2).

Function Batch-Incremental ()
1) t = 0                       // t defines the current stage in the training environment
2) Mt = 0                      // it represents the dynamic SOM
3) while (t < endOfEnvironment)
4)    Mt+1 = UpdateSOMIncludingPast(Mt, D(t+1))
5)    t++
6) end
// this function returns a trained SOM with information defining the current and past data chunks

Function UpdateSOMIncludingPast (Mt, D(t+1))
1) K = extract-bmus-knowledge(Mt)   // knowledge about the past is captured
2) mi = initialisation(K)           // knowledge is also used for the initialisation of the SOM
3) for (i = 1; i <= numepochs; i++) do begin
4)    forall patterns p ∈ D(t+1) ∪ K do begin
5)       call LookingforBMU(p)
6)    end
7)    mi = update-map()             // the batch equation is employed
8) end
9) return (mi)  // resulting map describing the data chunks of the environment up to time t+1

Figure 6.3: Incremental algorithm proposed for the SOM in batch. While the first (top) function triggers the learning at each stage of a non-stationary environment, the second (bottom) function performs the training of the SOM with the current data chunk and the old information coming from the set of best matching units of the latest trained map.
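One phase of the loop in Figure 6.3 can be sketched as follows. This is a hedged sketch: `train_batch` stands for any batch-SOM trainer returning the new map and its hit counts, and the replication of old BMU vectors by their hit histogram is our reading of Step 4, not necessarily the thesis' exact procedure.

```python
import numpy as np

def extract_bmu_knowledge(M, hits):
    """Step 1: keep only the BMUs -- reference vectors plus hit histogram."""
    idx = hits > 0
    return M[idx], hits[idx]

def bincremental_step(M_prev, hits_prev, chunk, train_batch):
    """One phase of the Bincremental-SOM loop (Figure 6.3, sketched)."""
    K_vecs, K_hits = extract_bmu_knowledge(M_prev, hits_prev)     # step 1
    M0 = K_vecs if len(K_vecs) else M_prev                        # step 2
    # step 4: the old BMU vectors join the new chunk as training inputs,
    # replicated according to their hit histogram (prior probability)
    X = np.vstack([chunk, np.repeat(K_vecs, K_hits, axis=0)]) \
        if len(K_vecs) else np.asarray(chunk)
    return train_batch(M0, X)                                     # steps 3-8
```

The driver loop of the top function is then a fold of `bincremental_step` over the stream of data chunks, with only the BMU knowledge, never the old chunks, carried between phases.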
6.2.4 Experiments

To test the approach defined above, we have re-created the experimental conditions presented in (Furao and Hasegawa, 2004), in which neural-network training is tested on non-stationary environments by using diverse data chunks to represent different data distributions. In our case, the training data space, which will be learnt incrementally by our approach, is shown in Figure 6.4. To simulate a non-stationary environment, a SOM will be fed with these data regions (A1, A2, A3, B, C and D, defining different topologies) in the order defined in Table 6.1.

Figure 6.4: Representation of a training data space describing a non-stationary environment.

Environment phase | Input chunk | Past knowledge
I                 | A1          | no
II                | A2          | yes (A1)
III               | A3          | yes (A2)
IV                | B           | yes (A3)
V                 | C           | yes (B)
VI                | D           | yes (C)

Table 6.1: Data order followed in the incremental batch training.

The aim of this experiment is to observe whether the internal structure of the SOM, trained incrementally using the algorithm defined above, is able to map all of the different topologies represented in the training data. In other words, it is desirable that the presentation of a new data topology (data chunk) does not make the nodes forget previous knowledge or mappings of other topologies. The mappings formed at each of the six phases of the non-stationary environment by a SOM trained with our proposal are shown in Figure 6.5. In all the phases apart from the first one, the SOM is trained with the corresponding data chunk (the changes in the data) along with the historical knowledge (the group of BMUs) describing the previous phases. It can be noted that in each of the different phases the SOM not only covers the topology of the corresponding data chunk for the phase, but also uses some of its nodes to retain the knowledge learnt previously.
Figure 6.5: Topologies formed by a SOM trained with our incremental batch approach through the six different phases defining the non-stationary environment represented in Figure 6.4. The black dots define the structure of the trained map. The data points used for the training of the SOM at each phase of the environment, according to the order in Table 6.1, are defined by the green and blue dots, which represent respectively the old knowledge (data extracted from the BMUs) and the current data chunk.

6.3 Itemset Support Maintenance by Incremental SOM Training

Having defined a method which allows the batch-incremental training of a SOM, we will focus in this section on investigating whether such a proposal can be used to maintain the itemset-support knowledge embedded in a trained map throughout time. That is, in the next experiments we will evaluate whether a trained SOM, acting as an artificial itemset-support memory, is able to update its knowledge with the changes occurring in a non-stationary environment, which will be simulated by partitioning a real-life FIM dataset.

6.3.1 Experiments

To evaluate the suitability of our batch-incremental-training algorithm for SOMs as an approach for the maintenance of the knowledge (itemset support) learnt by a SOM from dynamic data, a series of experiments will be conducted by simulating three different non-stationary environments derived from the Chess dataset (a real-life dataset with 3196 transactions defined by 75 items). In each environment, the dataset is separated into data chunks of different sizes to model the different phases of change in which the database can become involved as a result of the dynamics of the environment.
It is important to state that the radius parameter of the batch-training equation, which is responsible for shrinking the neighbourhoods and tuning the trained map to detect fine data structures (Kohonen, 1996), will be kept at a constant value of 1 in our experiments, since we are interested in capturing in the final map the global tendencies of itemset support in the environment. The general conditions describing each of the environments are summarized in Table 6.2.

Environment | Data-chunk size                                  | Number of phases
I           | 800 transactions per chunk (fixed)               | 4
II          | 400 transactions per chunk (fixed)               | 8
III         | From 200 to 600 transactions per chunk (unfixed) | 9

Table 6.2: Definition of the non-stationary environments using the Chess dataset.

In the first two environments, each data chunk has the same (fixed) number of transactions, while in the third environment the number of transactions per chunk was chosen randomly (unfixed). The idea behind setting up the environments as phases is that the resulting SOM will be queried to estimate the support of some groups of itemsets after its training has been performed with the data chunk representing the corresponding phase. In all experiments, the map size has been set to a fixed number of nodes (about 900) that represents about one quarter of the total number of transactions in the Chess dataset. To measure the efficiency of our method in updating a SOM for FIM purposes throughout time, two other approaches, representing traditional mechanisms to perform the update, are going to be tested for comparison. The first approach (Chunk-SOM) corresponds to the training of a SOM with only the data chunk which defines the changes that occurred in the database at phase i of the environment.
This approach can be understood as the worst possible scenario for addressing the maintenance of the map, because the accuracy of the results (itemset support) will depend on how well the current data chunk (current sample) represents the overall tendency of the associations in the entire D throughout time. Relying on this approach for itemset-support recall, as the results further prove, is risky: although all of the data chunks are drawn from the same data distribution D in the environment, they do not necessarily exhibit the same behaviour in their transactions. The second approach to be evaluated is our batch-incremental training for SOMs (Bincremental-SOM) applied to the maintenance of itemset support. This method is characterized not only by learning the current data chunk at each phase k, but also by utilizing some knowledge from the latest state of the SOM in the environment, because that state retains information about, for instance, the topology and distribution of the past phases of the training environment. The premise followed by our approach is that there is no reason to retain the old data chunks in the environment once they have been learnt by a SOM. As the third approach (Allchunks-SOM), a naive retraining of the SOM will be used to update the map. This approach involves training the SOM with all the data chunks produced up to the current phase k of the environment. The reason for using this approach is that its results represent the best theoretical itemset-support approximations that can be generated from the environment. It has the big disadvantage that all the data chunks, representing the changes, have to be involved in the training of the map. To make the comparison between our approach and the others, some error readings will be calculated from the maps formed at each of the phases of the environments.
The metrics used to evaluate the accuracy and quality of the maps are described as follows:

• The RMS (Root-Mean-Squared) error: This error metric, defined below in Equation 6.11, is used to measure the accuracy of the generalization of the support recall for the groups of 1- and 2-itemsets. To calculate this error, the target t is represented by the results obtained from applying the Apriori algorithm to the data chunk in turn. It is important to note that N, which defines the number of itemsets in the recall, may vary during an environment, since data changes can also involve the appearance of new items throughout time.

$$ E_{RMS} = \sqrt{ \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n; m^{*}) - t_n \}^2 } \qquad (6.11) $$

• The average quantization error: This is a well-known error metric used to measure the quality of a map based on the resolution of the mapping from a Vector Quantization point of view, in which the aim is to look for a group of vectors (a codebook) that can represent the distribution of the input data source in the most suitable form. The error is defined by Equation 6.12, in which x_i, m_c and N define respectively an input vector from D, the BMU for x_i, and the total number of patterns in D. In our case, as the SOM will be changing and accumulating new knowledge, this error is going to be calculated to give an idea of how well the approaches summarise the past and current data chunks at each phase. As we assume that recently added transactions in a dynamic database can be more interesting than those inserted long ago, in the sense that they reflect the current tendencies more closely, it will be very interesting to determine which data, between the history and the latest state of the database (added transactions), are better represented by the approaches tested.

$$ E_q = \frac{1}{N} \sum_{i=1}^{N} \| x_i - m_c \| \qquad (6.12) $$

In addition to these two metrics, we have also recorded other values to observe the manner in which the nodes of the maps are used along the phases of the environments.
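Both metrics are straightforward to compute; the sketch below follows Equations 6.11 and 6.12 directly (the function names are ours).

```python
import numpy as np

def rms_error(estimates, targets):
    """Equation 6.11: RMS error between the SOM's support estimates
    y(x_n; m*) and the Apriori targets t_n for a group of N itemsets."""
    estimates, targets = np.asarray(estimates), np.asarray(targets)
    return float(np.sqrt(np.mean((estimates - targets) ** 2)))

def quantization_error(X, M):
    """Equation 6.12: average distance from each input x_i to its BMU m_c."""
    X, M = np.asarray(X), np.asarray(M)
    dists = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())
```

A quantization error of zero corresponds to the "perfect quantization" case discussed below, in which every input coincides with its BMU.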
Figures 6.6 and 6.7 illustrate the response of a SOM regarding the support estimation for the groups of 1- and 2-itemsets during the four phases defining environment I. One of the characteristics that can be observed in these results is the magnitude of the differences between the first approach (Chunk-SOM) and Apriori, in comparison to the other two (Bincremental-SOM and Allchunks-SOM). The first approach, as predicted above, is by far the worst approximation under non-stationary circumstances; therefore, no reliable results describing the support of the itemsets can be derived from it. Moreover, the differences tend to become larger along the phases of the environment.

Figure 6.6: Differences between estimations (approximations) and calculations (real values) made respectively by a trained SOM and the Apriori algorithm for the group of 1-itemsets throughout the four phases of environment I. The first column from the left describes the type of estimation that can be produced with a SOM trained with only the data chunk in turn in the environment (Chunk-SOM). The second column represents the estimations that can be made from a SOM trained with our incremental approach (Bincremental-SOM). The last column shows the estimations made by a SOM always trained with all the data chunks available in the environment (Allchunks-SOM).
Figure 6.7: Differences between estimations (approximations) and calculations (real values) made respectively by a trained SOM and the Apriori algorithm for the group of 2-itemsets throughout the four phases of environment I. The first column from the left describes the type of estimation that can be produced with a SOM trained with only the data chunk in turn in the environment (Chunk-SOM). The second column represents the estimations that can be made from a SOM trained with our incremental approach (Bincremental-SOM). The last column shows the estimations made by a SOM always trained with all the data chunks available in the environment (Allchunks-SOM).

The behaviour of the RMS error, measuring the generalisation of the support recall for the 1- and 2-itemsets in the first environment, is shown in Figures 6.8 and 6.9. In these figures, it can be observed that even though the approaches started from the same conditions (same initialisation, same inputs) and provided the same result after concluding the first phase, they all follow different tendencies, with the first approach standing out for giving the poorest results. In the case of our approach, even if it presents some differences with respect to the best case (a SOM always trained with all data chunks), its performance can be stated to be satisfactory, since it remains steady during the incorporation of new transactions, in particular if we consider that its training involved just the new data chunk in turn, along with the knowledge extracted from the latest trained SOM.
Figure 6.8: The RMS error during the phases of environment I for the group of 1-itemsets.

Figure 6.9: The RMS error during the phases of environment I for the group of 2-itemsets.

Figure 6.10: Error describing the quality of the quantization of the approaches tested for the data chunk describing the changes at each phase of environment I.

Figure 6.11: Error describing the quality of the quantization of the approaches tested for the data chunks describing the previous phases (history) at each phase of environment I.

In Figures 6.10 and 6.11, we show the values recorded for the quantization error at each phase k, for the k-th data chunk and for all the past ones. These figures help in analysing the quantization abilities of each of the approaches tested. A perfect quantization of some data is defined by a quantization error equal to zero, and refers to the fact that the vectors, defined in this case by the SOM nodes, describe the distribution of the input data perfectly. In particular, the quantization of the maps produced for the data chunk provided at each phase is shown in Figure 6.10. Initially, it was expected that the first approach (Chunk-SOM) would generate the best results for this task, since the input data for its training at each phase is the same data used for the measurement of the error. In other words, the best results were expected from the first approach because its nodes (vectors) do not get distracted by any other inputs during training, unlike the maps of the other two approaches; therefore, the nodes can focus exclusively on forming the mapping of the current data chunk.
Nevertheless, the results show that the converged map based on our approach (Bincremental-SOM) provides the best mapping resolution for the group of transactions of the corresponding data chunk at each phase of the environment. A reason for the decrease in the quantization error in our approach is that the batch-incrementally trained SOM takes advantage of past knowledge. The key to the improvement is that a map formed with our proposal at phase k already has an advantage in its initial state, since its vectors already map some regions of the data space defined by the database; therefore, it can be claimed that our proposal re-uses past knowledge. Hence, the maps towards the end of the environment are created faster and produce a better approximation of the perfect quantization model. Although it could be argued that the improvement observed in our approach could be replicated by using linear initialisation in the formation of the map, the best of the approaches (Allchunks-SOM), which was initialised linearly, does not present this positive characteristic. In fact, in this task the worst quantization for the latest chunk at each phase was given by the SOM trained with all data chunks, which leads us to conclude that the vectors of that map got distracted in the mapping of the most recent data chunks by the existence of others (past data chunks to be mapped too) in the training. A reduction of the quantization error under non-stationary conditions is a very positive characteristic for our approach, because the organisation of the map via our algorithm forms a stronger mapping of the newest data chunk than of the old ones. Hence, we can be sure that, in the recall of support, the latest tendencies of the environment are going to be considered. In the case of the quantization for the old data chunks at each phase, the results are shown in Figure 6.11.
In this case, all approaches start with a zero error, since in the initial phase there is no history to remember. In the subsequent phases, the best organisation of vectors is given by the third approach, since its training always involves the presence of past and current chunks. For this task, the results of our approach seem positive if we keep in mind that no past data were used for its training. In other words, our approach still remembers the past phases of the database satisfactorily. The response of a SOM trained with our approach under the conditions set up for environment II, which was formed by splitting the data chunks of environment I in half, is shown, in terms of the RMS error of the itemset-support generalisation for the groups of 1- and 2-itemsets, in Figures 6.12 and 6.13 respectively.

Figure 6.12: RMS error during the phases of environment II for the group of 1-itemsets.

Figure 6.13: RMS error during the phases of environment II for the group of 2-itemsets.

Overall, the accuracy of the recalls of the approaches was noticed to be similar to that presented for the first environment. A good characteristic of our approach to be pointed out in this case is that, even though there are more phases in the environment, the error does not tend to increase substantially. Satisfyingly, the property of mapping the latest data chunk better than the history is still observable in the results for the quantization errors shown in Figures 6.14 and 6.15. In environment III, the size of the data chunks was varied and selected randomly, resulting in a nine-phase environment. The corresponding results for the RMS and quantization errors are plotted in Figures 6.16, 6.17, 6.18, and 6.19.
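The phasing of the database into environments I and II can be pictured with a small sketch; the procedure below is an assumed reconstruction (the thesis does not give code), using a list of integers as a stand-in for transactions.

```python
# Illustrative sketch (assumed procedure): environment I phases a
# transaction database as fixed-size data chunks; environment II is
# obtained by splitting each of those chunks in half, doubling the
# number of phases. The "database" here is a stand-in list.

def fixed_chunks(transactions, size):
    return [transactions[i:i + size] for i in range(0, len(transactions), size)]

def halved_chunks(chunks):
    halves = []
    for c in chunks:
        mid = len(c) // 2
        halves.extend([c[:mid], c[mid:]])
    return halves

db = list(range(100))            # stand-in for 100 transactions
env1 = fixed_chunks(db, 25)      # environment I: 4 phases
env2 = halved_chunks(env1)       # environment II: 8 phases
print(len(env1), len(env2))
```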
Figure 6.14: Error describing the quality of the quantization of the approaches tested for the data chunk describing the changes at each phase of environment II.

Figure 6.15: Error describing the quality of the quantization of the approaches tested for the data chunks describing the previous phases (history) at each phase of environment II.

Figure 6.16: RMS error during the phases of environment III for the group of 1-itemsets.

Figure 6.17: RMS error during the phases of environment III for the group of 2-itemsets.

Figure 6.18: Error describing the quality of the quantization of the approaches tested for the data chunk describing the changes at each phase of environment III.

Figure 6.19: Error describing the quality of the quantization of the approaches tested for the data chunks describing the previous phases (history) at each phase of environment III.

In general, we can state that the three approaches maintain the same behaviour as in the environments with fixed-size data chunks. The results of all the experiments for the three environments are summarised in Tables 6.3 to 6.5.
Approach  Phase  RMS Error               BMU                        Quantization Error
                 1-itemsets  2-itemsets  Present  History  Shared   History  Current chunk
First     1      0.11048     0.70959     484      0        0        0        1.1426
First     2      9.7047      8.8469      435      274      157      2.3581   1.1681
First     3      10.817      9.6593      397      398      174      2.1766   1.3377
First     4      12.546      10.687      393      521      243      2.5064   1.3793
Second    1      0.11048     0.70959     484      0        0        0        1.1426
Second    2      0.11357     0.91807     526      303      245      1.4863   1.4542
Second    3      0.12896     1.0835      546      502      402      1.6577   1.1557
Second    4      0.17268     1.1946      545      577      436      1.8495   0.92327
Third     1      0.11048     0.70959     484      0        0        0        1.1426
Third     2      0.10551     0.81733     593      326      326      1.3588   1.4247
Third     3      0.070201    0.90658     615      507      507      1.4937   1.7444
Third     4      0.062413    0.9667      649      569      569      1.6426   1.8763

Table 6.3: Results obtained by the approaches tested for Environment I.

Some records, regarding the behaviour of the use of the BMUs on the maps during the training phases, have also been added to our result tables in order to analyse the usage and organization of the mappings formed by the different approaches. The column BMU-Present defines the total number of nodes which ended up becoming BMUs for the corresponding inputs. The column BMU-History specifies the number of BMUs which have been used to map the old data chunks. The column BMU-Shared represents the number of BMUs which have been shared in the SOM to map both the most current and old data chunks.
Taking the results of Table 6.5 for discussion, since the settings of the environment from which they were derived are closest to a real-life environment, it can be noted that our approach (the second approach) keeps using fewer nodes than the third approach (theoretically the best approach, since the map

Approach  Phase  RMS Error               BMU                        Quantization Error
                 1-itemsets  2-itemsets  Present  History  Shared   History  Current chunk
First     1      0.10288     0.63257     325      0        0        0        0.89182
First     2      7.8216      7.5049      319      189      106      2.0268   0.88066
First     3      10.973      10.156      296      310      165      2.3137   0.92952
First     4      15.424      14.1        317      372      196      2.5348   0.89395
First     5      14.226      12.628      291      365      107      2.3648   1.0259
First     6      16.862      14.779      281      395      151      2.5568   1.0009
First     7      17.325      14.748      286      468      145      2.6447   1.0125
First     8      14.449      12.382      247      482      161      2.7223   1.0748
Second    1      0.10288     0.63257     325      0        0        0        0.89182
Second    2      0.13455     0.80306     448      261      221      1.2076   1.2426
Second    3      0.14465     0.93041     439      339      258      1.4575   0.9156
Second    4      0.14651     1.0392      445      402      288      1.6167   0.72646
Second    5      0.15682     1.1311      489      526      398      1.6942   0.66307
Second    6      0.16921     1.2214      478      555      379      1.8229   0.57516
Second    7      0.17049     1.2807      493      586      405      1.8978   0.54156
Second    8      0.1813      1.342       497      613      416      1.98787  0.49316
Third     1      0.10288     0.63257     325      0        0        0        0.89182
Third     2      0.11048     0.70959     484      264      264      1.1315   1.1536
Third     3      0.062258    0.7758      529      377      377      1.2825   1.3432
Third     4      0.10551     0.81733     593      465      465      1.392    1.3908
Third     5      0.076753    0.8352      613      556      556      1.4305   1.6856
Third     6      0.070201    0.90658     615      559      559      1.5432   1.7477
Third     7      0.10098     0.93237     651      604      604      1.6021   1.8078
Third     8      0.062413    0.9667      649      610      610      1.6702   1.9172

Table 6.4: Results obtained by the approaches tested for Environment II.
Approach  Phase  RMS Error               BMU                        Quantization Error
                 1-itemsets  2-itemsets  Present  History  Shared   History  Current chunk
First     1      0.091058    0.42505     189      0        0        0        0.53331
First     2      5.1857      5.0553      364      107      67       1.8826   0.99159
First     3      11.573      10.909      243      209      109      2.3294   0.79141
First     4      16.689      15.39       223      337      165      2.4964   0.67263
First     5      13.922      12.775      312      358      182      2.4863   0.90183
First     6      13.733      12.242      284      390      128      2.3275   1.0423
First     7      16.585      14.597      258      404      141      2.5412   0.92153
First     8      15.307      12.985      322      458      172      2.6158   1.1486
First     9      15.566      13.375      238      466      148      2.79     0.87699
Second    1      0.091058    0.42505     189      0        0        0        0.53331
Second    2      0.16707     0.76713     430      161      151      1.0576   1.1491
Second    3      0.23229     0.8775      435      343      275      1.397    0.8237
Second    4      0.26526     1.0197      416      406      285      1.535    0.62643
Second    5      0.25449     1.0779      440      430      294      1.6422   0.71565
Second    6      0.21993     1.1506      472      495      359      1.7038   0.66279
Second    7      0.22947     1.2419      462      527      357      1.8217   0.52506
Second    8      0.24148     1.2712      471      529      356      1.8959   0.61216
Second    9      0.27288     1.3688      469      582      381      1.9895   0.41183
Third     1      0.091058    0.42505     189      0        0        0        0.53331
Third     2      0.10355     0.71514     435      126      126      1.1129   1.1107
Third     3      0.1079      0.73557     500      367      367      1.2155   1.2916
Third     4      0.061232    0.79409     513      427      427      1.3085   1.3741
Third     5      0.077156    0.79983     606      483      483      1.3954   1.423
Third     6      0.078521    0.84692     619      576      576      1.4519   1.7477
Third     7      0.070201    0.90658     615      573      573      1.5472   1.7535
Third     8      0.10605     0.92097     628      578      578      1.6186   1.8095
Third     9      0.062413    0.9667      649      624      624      1.6792   1.9126

Table 6.5: Results obtained by the approaches tested for Environment III.

is always built with the whole database) for building the mapping for the data chunks during the environment. The latter can be understood as a good feature for our approach, since we start from the fact that the mappings formed by these two approaches (Bincremental-SOM and Allchunks-SOM) give similar results for itemset-support recalls while involving a different number of nodes in the answer.
Furthermore, as depicted in Figure 6.20, which shows the runtime of the evaluated methods for the experiment with the non-stationary environment with unfixed data chunks, our batch-incremental approach converges by consuming the same or less time than the best approach (Allchunks-SOM) during the learning of the non-stationary environment.

Figure 6.20: Runtime for the three approaches evaluated on environment III.

A contradictory event regarding the mapping of the old data chunks seems to occur with our approach, since the results show that it tends to gradually forget the reference to the past due to employing an incremental procedure. For instance, the maps formed after phase five need more BMUs than the ones obtained in their training for producing the best mapping for the past chunks. Even if the latter could be seen as a negative feature, it is actually a positive one, because the map pays more attention to the mapping of the latest chunk than to the historical ones; therefore, the recall will be influenced by the latest tendencies of the changes rather than the very old ones. As a consequence, we can also observe that the number of shared BMUs between the past and present data at each phase does not tend to agree, as it does in the third case, in which the sharing is complete. Therefore, we can assume that some BMUs would need to be combined with others in our approach to achieve some reduction of the recall error (RMS). Nevertheless, promoting such a node fusion could result in updating some nodes (vectors) which already satisfactorily map the most current state of the database. In conclusion, it can be stated that a trade-off has to be made between a better generalisation and a better quantization of the data chunks.
6.4 Conclusions

In the previous chapters we have proposed how two different neural networks can be used for ARM by building itemset-support memories based on them. Since these artificial memories may be learning from non-stationary environments, we have stated that updating the knowledge in them is necessary and relevant for the development of the neural-based framework for ARM. Hence, in this chapter we have tackled the maintenance of the knowledge in a memory based on a self-organising map. To achieve our goal, we have investigated how to perform such an update of the memory over time by the usage of incremental training. Since the concept of batch training has been important for the definition of a method for the recall of itemset support from a SOM, and knowing that this type of training has been stated to be unsuitable for non-stationary environments, we have proposed an incremental batch training method for the SOM, called Bincremental-SOM, which allows new associations to be learnt without forgetting the previous knowledge. To evaluate the accuracy of the proposed method, we set up experiments which simulate some non-stationary situations with the information contained in a dataset used for FIM studies. To compare with the results given by our approach, we also trained two other maps with the latest state and with all of the partial states occurring in the environment. It can be concluded from our results, shown above, that a SOM-based memory is able to keep its knowledge updated by performing its training with our method. It was also observed that the quantization generated by the SOM trained with our approach, for the newest group of inserts in the environment, is even better than the one produced by the map trained only with the corresponding data chunk.
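The warm-start idea behind the incremental training can be pictured with a minimal runnable sketch. This is a simplification of our own, not the thesis's algorithm: the batch update is reduced to pure vector quantization (zero neighbourhood radius), whereas the actual Bincremental-SOM uses a full batch SOM update with a neighbourhood function.

```python
# Minimal sketch of the batch-incremental idea: at each phase the map is
# trained only on the newest data chunk, but it starts from the codebook
# of the previous phase, so past knowledge is re-used instead of discarded.
# Simplifying assumption: the batch step is pure VQ (radius = 0).
import math

def batch_step(codebook, chunk):
    # Assign each input to its BMU, then move each codeword to the mean
    # of the inputs assigned to it (batch update, no neighbourhood).
    buckets = {i: [] for i in range(len(codebook))}
    for x in chunk:
        bmu = min(range(len(codebook)), key=lambda i: math.dist(x, codebook[i]))
        buckets[bmu].append(x)
    new = []
    for i, w in enumerate(codebook):
        xs = buckets[i]
        new.append(tuple(sum(c) / len(xs) for c in zip(*xs)) if xs else w)
    return new

def bincremental_som(chunks, codebook, epochs=5):
    for chunk in chunks:              # one chunk per phase
        for _ in range(epochs):       # warm start from the previous map
            codebook = batch_step(codebook, chunk)
    return codebook

chunks = [[(0.0,), (0.2,)], [(1.0,), (1.2,)]]
print(bincremental_som(chunks, [(0.0,), (1.0,)]))
```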
Therefore, it can be stated that the hypotheses generated, which in this case represent itemset-support estimations, will be produced with the latest itemset tendencies of the environment, but without forgetting the past ones.

Chapter 7

Conclusions and Future Work

In this thesis, we have looked at the suitability of ANNs (Artificial Neural Networks) for the descriptive data-mining task of ARM (Association Rule Mining). In particular, we have followed the premise that a neural-based framework can be built for tasks like association rule mining because concepts implicated in the generation of these rules, such as association, frequency, counting, and information storage, are also involved in the learning process performed by humans, which artificial neural networks aim to imitate. In order to begin the development of such a neural-based framework, and considering the importance of itemset support for the generation of this type of rules, we have focused on the development of the memory stage. This stage is in charge of the learning, storing, maintaining, and recalling of the itemset-support property defined by the learnt associations, in order to supply frequency-statistical (support) information about the itemsets to other stages of the framework; for instance, the processes controlling the generation of rules (in control of the FIM logic). In the quest for the most suitable neural network to become an itemset-support (pattern-frequency) memory, we have studied two neural networks: a self-organising map and an auto-associative memory. Our studies have focused on determining whether these two ANNs have the ability to learn frequency-statistical information from the taught associations, in order to make estimations about the support of the itemsets of an environment.
In other words, we have proposed how these two neural networks can reproduce the values which normally result from the counting of discrete patterns describing associations (itemsets). Additionally, since data often describe events occurring in a non-stationary environment, we have also proposed how the itemset-support knowledge embedded in the weight matrix generated by a self-organising map can be updated while the environment changes over time. Therefore, in order to complete this thesis, the final conclusions of the proposed work are presented in this chapter. Answers to the research questions stated in Section 1.4 are also given. Moreover, we establish some links to other pieces of research and give guidelines for future work which can contribute to the continuation of (1) the quest for an understanding of the counting of patterns with artificial neural networks, and (2) the development of an ANN-based framework for ARM or similar tasks.

7.1 Final Results

Neural network technology has been used successfully to tackle problems aimed at prediction and clustering. Nevertheless, its use for tasks like ARM is still unclear and uncertain because the field lacks research on this topic. For this reason, we have studied the usage of ANNs for ARM. After establishing that ARM is a mechanical process whose realisation involves some of the concepts present in the learning process of biological systems (Gardner-Medwin and Barlow, 2001), and motivated by imitating the human behaviour in the generation of association rules with ANNs, we have conducted research on the suitability of ANNs for ARM in this thesis.
In particular, we have followed the hypothesis that the embedded knowledge formed by an ANN, as a result of its training with some environment defining associations, can be used for the generation of association rules, mainly because:

• We have created the baseline for the generation of such symbolic representations describing associations from the knowledge of two neural networks. That is, association rules like those in Figure 7.1 can in future be generated from the knowledge embedded in our studied ANNs, because we have proposed in this thesis mechanisms through which the support of itemsets, the raw material needed for the rules, can be estimated from the weight matrix of a CMM and of a SOM. Nevertheless, as stated later, the definition of the logic of the rule generation mechanism remains open.

• We have pointed out that both technologies, ANNs and ARM, handle the concept of association in one form or another to form knowledge from data.

• We have stated all the different alternatives in which a neural network can participate within the current ARM framework.

• We have categorised ARM as a mechanical inference task which can be performed satisfactorily by humans through the counting and making of associations among the existing elements of data. Therefore, it was assumed that ANNs can be used for ARM only if they are able to reproduce the pattern counting done by their biological counterparts.

Figure 7.1: Example of association rules which can be further generated from the knowledge learnt by a neural network about the Mushroom dataset defined in (D.J. Newman and Merz, 1998). All these rules, describing the associativity among the attributes of a dataset, have the format: if (list of items or attributes) then (list of items or attributes) with [support=% and confidence=%].

A detailed explanation of the links between ARM and ANNs has been given in Section 1.2 as a part of the motivations for this work.
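The rule format described above can be illustrated with a short sketch under the support-confidence framework: however the supports were obtained (counted by Apriori or estimated by a neural memory), a rule X => Y gets confidence supp(X u Y)/supp(X). The attribute names and support values below are invented for illustration only.

```python
# Hedged illustration of the support-confidence rule format:
# rule X => Y has support supp(X u Y) and confidence supp(X u Y)/supp(X).
# The itemsets and percentages are made up, not results from the thesis.

def make_rule(supports, antecedent, consequent):
    union = antecedent | consequent
    support = supports[union]
    confidence = support / supports[antecedent] * 100
    return (sorted(antecedent), sorted(consequent), support, round(confidence, 1))

supports = {
    frozenset({"odor=none"}): 43.0,
    frozenset({"odor=none", "edible"}): 41.0,
}
print(make_rule(supports, frozenset({"odor=none"}), frozenset({"edible"})))
```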
In order to gain some insight into the use of ANNs for ARM, we have also proposed the idea that the current ARM framework could be transformed into a framework whose core is represented by a neural network architecture (defined in Chapter 3). As a first stage of the neural framework for ARM, we have proposed to build an ANN-based artificial memory which can learn, store, maintain and recall knowledge about the occurrence of patterns in the training dataset, to support other stages of the framework with this knowledge. In other words, we have looked at building an ANN-based memory which can provide a good estimation of itemset support when it is queried. Our research has been carried out principally to establish extraction mechanisms through which an auto-associative memory (correlation matrix memory) and a self-organising map can infer itemset support from the knowledge formed during their training. Our main aim has been to reproduce the results given by a traditional pattern-counting process, such as the itemset-support values given by the Apriori algorithm in an association rule mining process, by using only the knowledge embedded in the mapping or weight matrix of our two ANN candidates. We have used the Apriori algorithm for comparison because, as commented in previous sections of the thesis, Apriori's itemset-support calculations are errorless, since it counts the itemset occurrences directly within the data. Moreover, it is the gold standard and historically the most important algorithm for ARM, from which many other algorithms have been derived in one way or another. In other words, approximation methods, like the estimations produced here, should be compared with the real values, like those generated by Apriori. Discovering and emulating counting abilities with our ANN candidates has been established as relevant for the success of this thesis for the following reasons:

1.
We identified the relevance of itemset support for the support-confidence framework in ARM and therefore for the generation of rules. Hence, we stated that in order to produce rules from ANNs, it is first necessary to reproduce itemset support from them, since it is the metric used to determine the frequent itemsets from which rules can be derived.

2. The achievability and viability of building itemset-support memories is directly related to the counting of patterns with ANNs. Moreover, the reproduction of counting with ANNs is challenging as well as important, since it takes part in the learning process performed by biological systems.

3. In order to have a mainly neural-based framework for ARM or similar tasks, we defined that the first stage would incorporate an ANN acting as a memory, because it conveys autonomy to the framework, in the sense that its hypotheses or rules will depend exclusively on the knowledge accumulated by the artificial neurons. The latter is an activity occurring within the brain, which stores knowledge (experiences) over time to be employed for different purposes. Moreover, we have also defined that the existence of a mapping (feature knowledge space) formed with the associations would make the framework independent of retaining the learnt environment to provide answers after learning it. The latter is a drawback present in most of the related work summarised in Chapter 3.

Since the counting of patterns, as well as the learning of an environment, can occur under totally dynamic (non-stationary) conditions, we also looked at proposing how to prepare the knowledge in the memory for these environments. Therefore, we have proposed a method to update the knowledge embedded in a self-organising map which learns from an environment with dynamic conditions.
Because our work involves reproducing counting properties in two well-known neural models rather than theoretical ones, it can also be understood as a continuation of the theoretical work in (Gardner-Medwin and Barlow, 2001). Moreover, our approaches have taken into account the combinatorics of patterns described by the associations among their elements. It is important to state that our work does not claim that our interpretations of how the counting of patterns is performed by the two neural networks studied here happen in biological systems. Nevertheless, they serve to form a better understanding of what may be happening in the brain. The specific conclusions of each piece of research can be summarised as follows:

Auto-Associative Memory Model (Chapter 4)

In this chapter, it was studied whether a weighted auto-associative memory based on a CMM (Correlation Matrix Memory) could represent our desired itemset-support memory. This memory was investigated because it exploits the concept of association among the input patterns to learn about its environment. As a result of our analysis, it was found that a weighted m-by-m correlation matrix memory, in which m defines the number of items or attributes forming the patterns, has the natural ability to learn statistical-frequency information about the discrete patterns used for its training. After discovering this ability, we also pointed out that, because of the symmetry of its weight matrix, a CMM only needs around half of its elements to represent the different occurrence frequencies learnt from the training data. Since it has been advised in (Gardner-Medwin and Barlow, 2001) that it is convenient for the counting of patterns to have direct, one-to-one relationships between patterns (itemsets) and neurons rather than distributed ones, we focused on looking for those direct representations on the m^2 nodes defined by this memory.
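The counting effect described above can be sketched in a few lines; this is an assumed formulation of ours (a Hebbian-style weighted update, W += x x^T on binary transaction vectors), shown only to illustrate how item and pair frequencies accumulate on and off the main diagonal.

```python
# Sketch (assumed formulation): training a weighted m-by-m correlation
# matrix memory with binary transaction vectors via W += x x^T leaves
# item frequencies on the main diagonal and pairwise frequencies off it;
# the resulting matrix is symmetric.

def train_cmm(transactions, m):
    W = [[0] * m for _ in range(m)]
    for x in transactions:                  # x is a binary vector of length m
        for i in range(m):
            for j in range(m):
                W[i][j] += x[i] * x[j]      # Hebbian-style weighted update
    return W

# items: 0=a, 1=b, 2=c; transactions {a,b}, {a,b,c}, {a}
T = [(1, 1, 0), (1, 1, 1), (1, 0, 0)]
W = train_cmm(T, 3)
print(W[0][0], W[1][1], W[0][1])  # freq(a)=3, freq(b)=2, freq(a,b)=2
```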
It was indeed possible to identify one-to-one relationships between nodes and some of the sought associations or itemsets. We concluded that frequency values, and therefore itemset support, can be calculated directly for m(m + 1)/2 patterns out of the 2^m patterns that, ideally, we wanted this memory to represent (if each node mapped a different pattern). In particular, we have stated that the support values for the groups of 1- and 2-itemsets given by this memory are as errorless as the values given by the well-known Apriori algorithm, which scans and counts the patterns directly in the original data to calculate their itemset support. In the case of the remaining k-itemsets, for 3 ≤ k ≤ m, we looked at defining a distributed method to infer the frequencies needed for their support calculations. Hence, two methods, A and B, have been proposed to estimate support (denoted supp̂) by using the distributed knowledge embedded within the m(m + 1)/2 nodes of the memory which directly define single- and pairwise-event frequencies. In the definition of both methods, it was assumed that the elements defining a pattern or itemset are independent; therefore, an estimation of its support can be generated through the calculation of the joint probability among the elements x_i of an itemset X, such that supp̂(X) = ∏_{x_i ∈ X} supp(x_i). In our method A, we proposed that the support of a k-itemset can be estimated by using the k individual item probabilities (the support values on the main diagonal of the matrix) as the resources for the estimation. In method B, we estimated itemset support by first forming approximately k/2 pairwise events (2-itemsets), whose support values are directly defined within the matrix, from the elements of the queried k-itemset. Therefore, in this case, we have employed elements lying off the main diagonal for itemset-support estimation.
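The two estimation methods can be sketched as follows, under the stated independence assumption; the function names and the toy matrix are ours. Method A multiplies the k individual supports read from the main diagonal, while method B pairs the items up and multiplies the supports of the resulting 2-itemsets read off the diagonal, falling back to one single-item support when k is odd.

```python
# Hedged sketch of methods A and B for estimating the support of a
# k-itemset from a trained CMM weight matrix W over n transactions,
# assuming the items of the itemset are independent.

def supp_a(items, W, n):
    est = 1.0
    for i in items:
        est *= W[i][i] / n        # single-item support from the main diagonal
    return est

def supp_b(items, W, n):
    items = list(items)
    est = 1.0
    while len(items) >= 2:
        i, j = items.pop(), items.pop()
        est *= W[i][j] / n        # pairwise support, off the main diagonal
    if items:                     # odd k: one item left over
        est *= W[items[0]][items[0]] / n
    return est

W = [[3, 2, 1], [2, 2, 1], [1, 1, 1]]   # toy CMM for n = 3 transactions
print(round(supp_a((0, 1, 2), W, 3), 4), round(supp_b((0, 1, 2), W, 3), 4))
```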
After testing our proposed methods with real-life datasets used in frequent itemset mining benchmarks, it was noticed that the itemset-support estimations made with the knowledge embedded in the weight matrix of this memory and our two methods present discrepancies with the real support values calculated by the Apriori algorithm. Overlaps of itemset support occurring among the nodes forming this memory were identified as the cause of these support differences. Thus, we strongly believe that better estimations can be obtained if and only if the neurons of this memory are arranged in a different way.

Self-Organising Map Model (Chapter 5)

In contrast with Chapter 4, in which a supervised trained ANN was studied, here we focused on using an unsupervised trained neural network for our needs. In particular, the suitability of a SOM (Self-Organising Map) to become our itemset-support memory was investigated. This work involved analysing the SOM training procedure, in order to identify how its embedded knowledge about the high-dimensional input data can be interpreted for producing itemset-support estimations. The idea of estimating itemset support directly from the trained map was developed because other approaches (Changchien and Lu, 2001; Shangming Yang, 2004), which have also explored the use of this ANN for ARM, suffer from the drawback of limiting the SOM usage to clustering the data in order to form links between the original data and the SOM clusters for ARM. Moreover, we wanted to evaluate the ability of a SOM to reproduce the counting of patterns of a high-dimensional space with the coded information about it distributed in the two-dimensional space described by its neurons. To define our mechanism for itemset-support estimation, the following concepts were important:

• The identification of the best matching units (BMUs), because they control the update of the map during training.
• The prior probability associated with each node, because it defines the number of times that a node has been hit by input patterns.

• The components of the codewords or reference vectors associated with the BMUs, because each of them represents a random variable whose value is an approximation of the expected value, or mean, of the represented item.

• The concept of batch training.

• The concept of the partition of events, which has been stated to take part in the training of a SOM.

• The concept of independence among the items. Thus, the support of an itemset has been calculated by finding the joint probability among the corresponding variables.

The estimations derived from a trained SOM via our proposed mechanism were also tested against their counterparts provided by the Apriori algorithm. As we have also stated the SOM to be a representative of the theoretical projection model defined in (Gardner-Medwin and Barlow, 2001), it was investigated to produce representations between input patterns and neurons which are not direct, as in the case of the auto-associative memory, but distributed. In other words, our mechanism uses the information of the reference vectors associated with the BMUs to estimate itemset support. In our experiments, we also analysed other factors which can affect the accuracy of the SOM estimations. For instance, we performed some experiments involving different values of the radius and different map layouts, to determine the best SOM parameters for ARM. According to our results, we have concluded that rectangular SOMs provide more accurate estimations than hexagonal maps. The best estimations, in terms of the radius utilised, were given by the map formed with the minimum value; therefore, we have concluded that the best scenario is reached if the SOM algorithm is reduced to a pure Vector Quantization algorithm in which influences among the nodes are not propagated.
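The concepts listed above can be combined in a small sketch of the recall mechanism; the details below are an assumed reconstruction (names and the toy map are ours): each item's support is approximated as the expectation of its codeword component over the BMUs, weighted by each BMU's prior probability (hit count over n), and, assuming item independence, itemset support is the product of these per-item values.

```python
# Hedged sketch of itemset-support recall from a trained SOM, using only
# the BMUs' prior probabilities (hits / n) and their reference vectors,
# under the independence assumption stated above.

def som_support(itemset, bmus, n):
    """bmus: list of (hits, reference_vector) pairs for the BMUs of the map."""
    est = 1.0
    for item in itemset:
        # E[x_item]: expectation of the item's component over the BMUs
        est *= sum(hits / n * w[item] for hits, w in bmus)
    return est

# Toy map: two BMUs over n = 10 transactions
bmus = [(6, [1.0, 0.5]), (4, [0.0, 1.0])]
print(som_support((0,), bmus, 10))  # item 0: 0.6*1.0 + 0.4*0.0 = 0.6
```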
It is important to state that the best estimations also occurred when the maps were initialised linearly, which is not a characteristic of biological systems; hence, to be more congruent with what happens in real life, it can be stated that a value such as 0.5 for the radius, which controls the dispersion of the influence during training, is enough to obtain satisfactory values of itemset support. Since we wanted to improve the results obtained by a SOM, we likewise investigated whether the concept of emergent feature maps, proposed by Ultsch (Ultsch, 1999), could be used for this aim. Our results with this concept, which involves creating large maps for mining tasks, showed an improvement in the itemset-support estimations. The reason for this improvement was that, since there were more nodes in the memory, the input patterns could be better distributed along the map. Thus, it has been concluded that, given a better distribution of the input associations in the map, a better estimation will certainly be produced.

Comparing The Proposed Memories

By summarising in Table 7.1 some of the characteristics of the studied neural networks, conclusions regarding their use for ARM, involving training data with n transactions derived from an itemset space defined by m items, can be drawn as follows:

Characteristic                                AAM-based Memory                      SOM-based Memory
Total nodes in the network                    m^2                                   5√n (Vesanto et al., 2000)
Type of training                              supervised                            unsupervised
Epochs for training                           1                                     several
Nodes updated for learning                    k^2                                   1
Nodes utilized for support recalling          1, k or k/2                           m_b ≤ 5√n
Pattern representation for support recalling  direct (k < 3), distributed (k > 2)   always distributed
Main usage                                    pattern association,                  clustering and
                                              pattern matching                      data visualisation

Table 7.1: Characteristics of the two itemset-support memories based on the studied neural networks.
m defines the total number of items from which the n transactions or itemsets of a training dataset D are derived. mb corresponds to the number of BMUs formed in a training epoch.

In the case of an AAM-based memory, the maximum number of nodes needed to define the neural network depends directly on the number m of items forming the training data, rather than, as in a SOM-based memory, on the number n of itemsets in the training dataset D. The training of an AAM-based memory is supervised, while the training of a SOM-based memory is unsupervised. When they are both learning from an environment, an input k-itemset causes k² nodes in an AAM-based memory to be updated, while only one node is directly hit by the pattern in the case of a SOM-based memory; nevertheless, in practice, this node has to propagate its influence to the rest of the map. Therefore, it can be stated that the presence of an input itemset updates all nodes forming the SOM-based memory. The training of the AAM-based memory requires only one pass over the environment to collect knowledge, while the SOM-based memory, even if it can produce estimations after the first training epoch, needs more passes to converge and therefore to provide more accurate results.

Once these memories have been trained, recalling or estimating the support of a queried k-itemset from an AAM-based memory requires: one node for the 1- and 2-itemsets, since direct pattern representations were found and defined, and k or k/2 nodes when k > 2, through our proposed methods, which use individual- or paired-itemset probabilities respectively. In the case of a SOM-based memory, since the knowledge has been distributed along the BMUs in the map, mb nodes are used, from which k components of their reference vectors are taken into account for an estimation.
Based on the results shown in Tables 4.6 and 4.5 in Chapter 4, and Table 5.8 in Chapter 5, which have been summarised here in Tables 7.2 and 7.3, it can be concluded that, even if our AAM-based memory gives errorless itemset support for the group of the 1- and 2-itemsets, its performance is generally beaten by its SOM-based counterpart, whose estimations depend on a good distribution of the associations within its nodes. Nevertheless, in terms of the steadiness of the itemset-support estimations, the AAM-based memory has an advantage over a SOM-based memory, since the latter depends on initial factors for the distribution of its knowledge and therefore for the itemset-support recalls.

Itemset   SOM                          CMM                        Itemsets
Size k    Hexagonal    Rectangular     Method A     Method B      Tested
3         0.42918      0.3668          0.617510     0.47519       167
4         0.52008      0.4352          0.775860     0.50515       203
5         0.61963      0.50871         0.912940     0.86483       128
6         0.76948      0.62251         1.089600     0.9105        39
7         1.0274       0.81303         1.394400     1.4274        4

Table 7.2: Comparison of the generalised errors between our approaches for SOM and CMM for the different numbers of tested itemsets, which resulted from a query over the Chess dataset with a minimal support between 90 and 100%.
Itemset   SOM                          CMM                        Itemsets
Size k    Hexagonal    Rectangular     Method A     Method B      per group
3         0.92081      0.89465         2.9141       2.285         4985
4         1.0748       1.0476          3.593        2.6836        25500
5         1.2055       1.1724          4.2165       3.6522        88170
6         1.327        1.2829          4.8524       3.995         217705
7         1.4407       1.3794          5.5205       5.0014        397947
8         1.5395       1.4548          6.2197       5.2905        550220
9         1.6126       1.4983          6.9495       6.494         581647
10        1.6496       1.4996          7.7125       6.6983        471908
11        1.648        1.4559          8.5192       8.1773        293209
12        1.6202       1.3801          9.3857       8.3329        138294
13        1.5947       1.3037          10.32        10.098        48473
14        1.6045       1.2653          11.317       10.258        12023
15        1.6755       1.2926          12.416       12.252        1896
16        1.8418       1.4084          13.865       12.791        152
17        2.0514       1.5868          14.903       14.751        5

Table 7.3: Comparison of the generalised errors between our approaches for SOM and CMM for the different numbers of tested itemsets, which resulted from a query over the Chess dataset with a minimal support between 45 and 100%.

Maintaining Pattern Frequency Throughout Time with a SOM (Chapter 6)

Since the knowledge of a SOM-based memory can become invalid as a result of changes occurring in the training environment, we have looked at proposing a mechanism to maintain it. In particular, we tackled the problem of training a SOM incrementally in batch because:

• In order to generate association rules from the weight matrix of a trained SOM, the procedure to estimate itemset support assumes that the SOM has been trained in batch.
• Batch mechanisms have not been considered suitable for non-stationary environments.

Therefore, we have proposed an incremental batch mechanism for SOM which uses the new data occurring in the environment together with the knowledge generated from the last state of the SOM-based memory, since the latter represents the old states of the data in the environment. To test our approach, we artificially set up a non-stationary environment with the associations of the Chess dataset by forming data chunks to represent its different phases.
We compared the itemset-support estimations made by a SOM-based memory trained with our approach against results given by memories trained with two traditional approaches, which involve retraining the SOM with all available data and making estimations with only the latest data chunk occurring in the environment. Our experiments have shown that, with our method, updating the map throughout time is possible. Even though we were not able to improve on the results given by the memory which was always retrained with all data in the environment, interesting properties were identified in our results, as follows:

• It was noticed that the error describing the quality of the itemset-support recalls generally grows linearly.
• The quantization of the latest data chunk in the environment produced by a memory trained with our approach is better than the one formed by a memory which learns only that data chunk. This positive property of our method arises because we have exploited the re-use of knowledge.
• Since a SOM trained with our approach is better at quantizing the present data chunks than the old ones, it has been stated that the rules derived from this trained SOM will reflect the latest itemset tendencies rather than the old ones.

Therefore, we have concluded that batch training mechanisms should be considered a real alternative for updating a SOM in non-stationary environments. Moreover, we have moved another step forward in the development of the neural-based framework for ARM, since we have given the artificial memory the property of learning incrementally. The latter is important and necessary because it represents the basis for performing incremental ARM (Cheung et al., 1996b), whose aim is to maintain the rules while the environment changes.
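A minimal sketch of this incremental batch idea, under our own simplifying assumptions (plain Euclidean batch step, Gaussian kernel with fixed sigma; the names `batch_step` and `incremental_update` are illustrative): the previous codebook, weighted by its hit counts, stands in for the old data and is replayed together with the new chunk.

```python
import numpy as np

def batch_step(codebook, data, sigma, grid):
    """One batch-SOM pass: assign each pattern to its BMU, then move every
    node to the neighbourhood-weighted mean of the patterns."""
    d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    bmu = d2.argmin(1)                                   # winner per pattern
    gd2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    h = np.exp(-gd2 / (2 * sigma ** 2))                  # neighbourhood kernel
    w = h[bmu]                                           # influence of each pattern on each node
    return (w.T @ data) / np.maximum(w.sum(0)[:, None], 1e-12)

def incremental_update(codebook, hits, new_chunk, sigma, grid):
    """Incremental batch training: replay the last codebook, weighted by its
    hit counts, as a stand-in for the old data, alongside the new chunk."""
    replay = np.repeat(codebook, hits.astype(int), axis=0)
    data = np.vstack([replay, new_chunk])
    return batch_step(codebook, data, sigma, grid)
```

Because the replayed vectors carry the hit counts of the previous state, old tendencies fade gradually as new chunks arrive, which matches the behaviour observed in our experiments.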
Answering Our Research Questions

With the conclusions drawn above, our research questions, as stated in the introduction, can be answered as follows:

• Can ANNs be used for descriptive data-mining techniques in which the aim is to represent the data in the form of association rules?

Yes, they can. One manner of describing data is by producing association rules from it. To form such rules, it is first necessary to determine the basic but important itemset property of support. With the results of the experiments conducted in this thesis on two different neural networks, it can be concluded that this itemset property can be estimated from the knowledge embedded in their weight matrices. Therefore, in order to produce association rules from data, the next step would be to define a FIM logic (a version of the Apriori algorithm to mine the feature space defined by the weight matrix formed by our neural networks) which would aim to traverse the structure defined in the ANN in the best way.

• Would the knowledge learnt by an ANN be useful for describing associations among the elements of a database?

Yes, it would be useful for describing associations among the elements of events from an environment. With this work, we have proven that the support value associated with those associations or itemsets can either be calculated directly from the weight matrix in some cases (the case of an AAM-based memory), or estimated indirectly through the methods proposed here, which decode the distributed knowledge in order to form itemset-support estimations.

• Could it be possible that our chosen neural networks have some knowledge describing the frequency of patterns distributed along their nodes? That is, could the results of counting be generated from our ANN candidates?

Yes, our chosen ANN candidates have shown that they count patterns implicitly while they learn them.
This has been concluded since we discovered that they keep the knowledge needed to reproduce values defining the occurrence of patterns in a group of events, values which are normally obtained by scanning the events of the original environment to count the patterns. Nevertheless, it has also been found that the organisation of the learnt patterns within the networks can produce discrepancies between the estimated and the real frequency of the patterns. Therefore, new internal organisations should be investigated in future.

• Could it be possible to have an itemset-support memory based on an ANN, as a substitute for the original database, to take part in such a framework?

The reason for retaining the original data in the ARM framework is that it is the only source from which the itemset support can be calculated throughout time. Nevertheless, our results show that a trained neural network is able to reproduce those values sufficiently well. Therefore, it can be stated that these ANNs, acting as memories which can learn, retain and recall knowledge about itemset support, can take the role of the original data source in the ARM framework.

• Could it continue accumulating knowledge of the pattern frequency throughout time while the original data environment changes?

Yes, it could. The counting of patterns in a non-stationary environment has been investigated with a self-organising map, and the results have shown that the knowledge embedded in the map can be updated while the original environment adds new events to its definition. Additionally, the results of our experiments have also highlighted a positive property of the recalls made by a SOM, since they tend to be highly influenced by the latest states of the environment rather than the old ones.
7.1.1 Contributions

• We have investigated the usage of ANNs for the transduction task of association rule mining by initially stating that the generation of rules can be done by humans, and we have aimed to reproduce this human behaviour with ANNs. Specifically, we have investigated whether association rules can be formed from the knowledge formed by two ANNs: an auto-associative memory and a self-organising map.
• We have worked on making the use of ANNs for ARM possible by developing itemset-support memories, based on an auto-associative memory and a self-organising map, which are able to reproduce an estimation of the itemset support after they have learnt the associations describing an environment.
• We have reproduced the counting of patterns with the knowledge embedded in the ANNs rather than using the original dataset. In other words, we have studied and proposed methods, decoding the created weight matrix, through which the two neural networks studied can reproduce the counting of patterns, which is traditionally done by scanning the original data.
• We have contributed to the work on the counting of patterns with neural networks, since we have studied two well-known neural networks rather than forming theoretical ones. Moreover, our studies have included the combinatorial aspect of patterns, which in this case is represented by the term itemset.
• Since data is dynamic, we have also proposed how a memory based on a self-organising map can keep its knowledge updated while the original environment changes as a result of the incorporation of new patterns.

7.2 Future Work

Based on the results, findings, and conclusions drawn by this thesis, we define in this section the future work, organised into different research paths, that can be pursued for the development of the topic.
7.2.1 For The Auto-associativity-based Memory

It has been found that overlaps create differences in the estimation of support, since all patterns are piled up together in the same group of nodes. While this ANN is learning, input patterns such as X and Y, defined respectively by 0011 and 1011, are stored in the same group of nodes; therefore, when the memory is queried to recall the support of the pattern 1010, the estimation differs from the real value because of the pattern overlaps. To tackle this drawback, we can consider that patterns with a similar prefix could be grouped together in different branches of a neural-network tree structure. Nevertheless, this way of solving the problem would overlap with work in the ARM field which has already proposed the split or organisation of patterns into tree- or trie-based structures for their counting. Therefore, we propose looking for other alternatives to decode the knowledge of this ANN, or combining it with other neural networks, in order to provide more accurate results for itemsets whose size is larger than 2 items.

We have also discovered that the mapping formed by the auto-associative memory based on a correlation matrix memory results in a data structure which is similar to the matrix structure proposed in (Agarwal et al., 2001) to perform frequent itemset mining using lexicographic tree structures. These triangular matrices have also been used by other approaches, for instance in (Bodon, 2003), for the specific discovery of the 2-itemsets because of their compact memory representation.

7.2.2 For The Self-organising-map-based Memory

Our experiments and results have shown that the itemset-support estimations made by this ANN are directly influenced by the way in which the knowledge about the input associations is spread in the map. Therefore, a better organisation of the knowledge has to be found and applied in order to improve the results given by this thesis.
One way of doing this is by exploring the use of a distance metric more suitable for binary data in the training of this ANN, because this metric is responsible for allocating the input patterns to the nodes of the map during training. To begin with this research path, a revision of the work proposed in (Leisch et al., 1998; Fernando Lourenco and Bacao, 2004) is necessary, since metrics other than the traditional Euclidean distance have been investigated there for the formation of the map. Moreover, the study of the work of Lebbah et al. (Lebbah et al., 2000), in which a batch version of the Kohonen SOM algorithm dedicated to binary data is proposed, is important.

Another alternative is to propose a new distance metric from scratch, which we believe should consider the following: such a metric should produce formations of triangular binary matrices at each BMU with the input itemsets. Therefore, it would be a metric which compares the training itemsets based not only on their item differences but also on their hierarchical property. These matrices aim to organise the knowledge of the SOM in a manner in which more accurate itemset-support estimations can be produced. Some examples of these matrices are given in Figure 7.2.

Figure 7.2: Triangular binary formations formed with the itemset search spaces defined by 3 and 4 items.

Once these matrices have been formed and the corresponding codebook of the map has been calculated during training, the estimation of itemset support should be obtained as follows:

supp̂(X) = Σ_{j=1}^{m_b} P(m_j) · min_{x_i ∈ X} supp(x_i)   (7.1)

In this equation, the probability of the event representing an association or itemset X is defined by the minimal support among the items x_i belonging to X, rather than by the product defined by their joint probability, which was defined in Chapter 5 and is restated here in Equation 7.2.
It is important to note that the prior probability P(m_j) of the m_b BMUs will still be necessary for itemset-support estimation.

supp̂(X) = Σ_{j=1}^{m_b} ( P(m_j) · Π_{x_i ∈ X} x_i )   (7.2)

Equation 7.1 can also be seen as a simple optimisation of the currently proposed estimation method of itemset support, because it requires fewer operations to make an estimation. Moreover, the quality of the estimations generated can be expected to be better than our current results, since the input patterns are better organised in the map. For instance, let M_j^t be a binary triangular matrix allocated at the BMU m_j representing input itemsets involving four items (a, b, c, d). Because of the hierarchy existing among the input itemsets used to build M_j^t, which will be summarised by the reference vector formed in m_j, an ordering based on the support property appears among the items of the matrix, such that it is defined by either a ≥ b ≥ c ≥ d or a ≤ b ≤ c ≤ d. Therefore, if the map is queried to recall the support of an itemset, for instance X = abc, the answer, representing the itemset-support estimation of X, would involve determining the support of either a or c, depending on the ordering formed in the BMUs of the map.

Since we have noticed that the quality of the estimations made by this neural network is highly dependent on its initial state and on the radius utilised for its training, approaches which cope with this instability property of the SOM, which produces under- or over-representation of the training data, can be evaluated. Therefore, we propose investigating alternative initialisation schemes for SOM, for instance those proposed in (Su et al., 1999; Salem et al., 2003; Attik et al., 2005), in order to improve the results given by a randomly initialised SOM and to avoid the use of linear initialisation, which has the inconvenience of requiring many computations for the initial weights when the input data is large.
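As an illustration, the current product-based estimator (Equation 7.2) and the proposed min-based estimator (Equation 7.1) can be written side by side. This is a hedged sketch with hypothetical function names, assuming the BMU reference-vector components approximate the individual item supports at each unit.

```python
import numpy as np

def support_product(itemset, codebook, priors):
    """Current estimator (Eq. 7.2): the joint probability at each BMU is the
    product of the reference-vector components of the queried items."""
    return float(priors @ np.prod(codebook[:, list(itemset)], axis=1))

def support_min(itemset, codebook, priors):
    """Proposed estimator (Eq. 7.1): take the minimal item support at each
    BMU, as suggested by the ordering induced by the triangular matrices."""
    return float(priors @ np.min(codebook[:, list(itemset)], axis=1))
```

For a single BMU with components (0.9, 0.5, 0.2), the product estimator gives 0.45 for the first two items while the min estimator gives 0.5, which also illustrates why the latter needs fewer multiplications per BMU.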
In this thesis, we have initially explored the generation of the association rules known as categorical association rules, because they only describe facts referring to the associations defined among items, without considering any quantitative aspect of the variables they represent. To overcome this drawback of categorical rules, another type of rule, known as quantitative association rules (Srikant and Agrawal, 1996; Wijsen and Meersman, 1998; Aumann and Lindell, 2003), has already been proposed for the description of data. Since the formation of these rules involves the usage of statistics, it can be assumed that SOM also has the natural ability to generate them, since information about the distribution of the learnt variables is coded in the nodes. Therefore, the development of mechanisms for this type of rule is needed. One way of tackling this aim is by considering the work in (Giraudel and Lek, 2001), in which the statistical properties of SOM have been investigated, or by taking into consideration the use of a SOM as a probability density estimator for classification problems (Yin and Allinson, 2001). Hence, this idea would involve approximating the distributions of the variables modelled by the input data in order to determine how strongly or weakly they participate in the original environment, and therefore in the future rules.

Once it has been demonstrated that quantitative association rules can be generated from the knowledge embedded in a trained SOM, it can be considered to explore the work of Lebbah et al. (Lebbah et al., 2005), in which the creation of a mapping for the analysis of mixed (numerical and binary) data is proposed. This proposal may be useful for the generation of mixed association rules, which would be composed of both quantitative and qualitative components.
Taking into account that the formation of rules from a SOM is produced by decoding the information of its codebook or mapping, which represents a summary of the input data formed by the ability of SOM for vector quantization, it can be concluded that the final vectors in the map can be understood as a group of representatives of the itemset patterns existing in the training environment. Hence, it is possible to create a link between the generation of association rules from a neural network and the work proposed by Yan et al. (Yan et al., 2005a), which examines how to summarise a collection of itemsets using a finite number K of representatives. In our case, this wanted itemset summary would be given by the vectors created at each BMU in the trained map.

Since the calculation of itemset support and the generation of rules throughout time imply developing mechanisms for non-stationary environments, the idea of using self-growing approaches derived from the traditional SOM, which control the size of the map during training, can be considered, studying whether these can be trained in batch mode. To begin with this idea, the work of (Alahakoon et al., 2000b) should be analysed.

7.2.3 For the Quality of the Itemset-support Estimation

One important aspect regarding the itemset-support estimations made by any neural network is to define a manner in which the quality of its answers (estimations) can be measured. That is, although in this thesis we have used the RMS error for measuring the quality of the estimations, it has been necessary to know the real support values to perform such an error calculation; therefore, we propose investigating how to measure such approximations or estimations when the real value is totally unknown.

7.2.4 ANN-based Candidate Generation Procedures

Another topic worth investigating is whether itemset support can be inferred or predicted with neural networks.
Itemset-support inference is a topic which has not been studied extensively in the ARM field. For instance, an algorithm called PASCAL has been proposed in (Bastide et al., 2000) as an optimisation of the Apriori algorithm based on counting inference. That is, some candidates during the mining process derive their support from the frequent itemsets already discovered, rather than performing the typical counting over the original dataset. Although this approach reduces the number of candidates by inferring the support for some of them, it can still happen that the rest of them, whose support needs to be checked by the algorithm, turn out to be infrequent. Therefore, it is important to have candidate generation procedures which return exactly those itemsets that will certainly turn out to be frequent, so as not to waste time checking false candidates. Moreover, candidate generation is an important topic in ARM, since it is necessary to determine how it influences the complexity of the discovery of frequent itemsets.

Based on the above, and knowing a priori that the support of itemsets forms a function f(x), as in Figure 7.3, in which x represents itemsets from a search space in lexical order, we propose to investigate whether an approximation of that function could be generated with a neural network by interpolating a series of n points representing some itemsets from the space together with their corresponding support. To approach this goal, we can start from the fact that a negative correlation exists between the size of the itemsets and their corresponding support. In other words, the support tends to decrease as the number of items in a pattern increases. Hence, we propose to train an RBF (Radial Basis Function) neural network (Bishop, 1995; Buhmann, 2003) in order to investigate whether a candidate generation procedure can be defined with it.
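A small sketch of this interpolation idea with Gaussian radial basis functions in plain NumPy (our own illustrative code, not a full RBF network): fit an interpolant through n sampled (itemset position, support) points and evaluate it elsewhere. The names `rbf_fit` and `rbf_eval` are assumptions of ours.

```python
import numpy as np

def rbf_fit(x, y, width):
    """Solve for one weight per centre so that the Gaussian RBF interpolant
    passes exactly through the n sample points (x_i, y_i)."""
    phi = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * width ** 2))
    return np.linalg.solve(phi, y)

def rbf_eval(xq, x, weights, width):
    """Evaluate the fitted interpolant at the query positions xq."""
    phi = np.exp(-((xq[:, None] - x[None, :]) ** 2) / (2 * width ** 2))
    return phi @ weights
```

A candidate procedure could then query `rbf_eval` at unseen positions of the lexically ordered itemset space and keep only those itemsets whose predicted support exceeds the minimum threshold.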
This goal can also be understood as a task which aims at the prediction of itemset support, having as input some knowledge about the total itemset search space.

7.2.5 Distributed Association Rule Mining

The collection of data is often realised in a distributed manner in real life. That is, databases, representing and collecting huge amounts of local data, are set up at different sites in order to manage them in a strategic and satisfactory manner.

Figure 7.3: Function generated with the support of some groups of itemsets derived from the Chess dataset. [Four panels plot support (%) against the 1-, 2-, 3- and 4-itemsets of the search space.]

Nevertheless, as the need to analyse them as a grand total arises, the development of distributed mining algorithms becomes crucial. In ARM, this problem was introduced in (Cheung et al., 1996a). In order to tackle distributed ARM, a distributed database has been treated as a large database formed by partitions representing each of the local sites in the system. The common problem that approaches in this category have had to deal with is the amount of data that must be transmitted amongst the sites to achieve the generation of rules. For instance, as a first attempt, a solution could involve moving either the original data partition or the corresponding set of frequent itemsets from each site; however, both solutions over-utilise the channel, because these can be large and contain data which will turn out to be irrelevant.

Figure 7.4: Incremental SOM-based approach for distributed ARM. The rules will be generated from the latest trained SOM.
Based on the above, our work can be extended as follows. As a first approach, depicted in Figure 7.4, the usage of a SOM moving from site to site and learning incrementally, as defined in Chapter 6, can be investigated. That is, a SOM would learn the local data distribution at each site until it reaches a site at which the generation of rules is performed using the mechanism proposed in Chapter 5. Since new tendencies can occur at the different sites, it will be important for the approach to integrate a self-growing node structure rather than a traditional SOM with a fixed structure. A second approach, given in Figure 7.5, would involve having local SOMs at each site to learn the local data distributions; these would then be queried remotely from, or moved to, a central location at which ARM would take place.

Figure 7.5: Local SOM-based approach for distributed ARM. While the local maps are queried remotely in the model on the left, the trained maps in the model on the right are transmitted to become the source from which rules will be generated.

It is also important to notice that the development of distributed algorithms has a strong relationship with parallel algorithms (Zaki, 1999); therefore, a SOM can be stated to be a strong candidate for this type of mining task, since its training has already been parallelised based on its batch mode (Lawrence et al., 1999b; Porrmann et al., 2003).

7.2.6 The Itemset Concept in Dynamic Data

ARM over Data Streams

In this thesis, we have started gaining some insight into the topic of generating association rules from neural networks. Nevertheless, not all data is static in real life, as assumed by traditional ARM. That is, data, like data streams, can arrive indefinitely and continuously; therefore, its mining becomes a complex task, since its availability often occurs only for short periods of time.
Because research involving data streams is still at an early stage in the field of ARM (Jiang and Gruenwald, 2006), we suggest investigating whether a neural network is able to perform such a mining task. The idea, depicted in Figure 7.6, consists of developing a two-stage proposal. In other words, the data stream will first be learnt in a way similar to that used with the static data in this thesis, and then rules will be derived from the ANN.

Figure 7.6: A neural-based approach to ARM for data streams.

The analysis of this type of data has also been considered with approaches based on ANNs. For instance, the usage of a self-organising map and an ART network has been considered in (Laerhoven et al., 2001; Laerhoven, 2001) and (Rajaraman and Tan, 2001). As this type of data arrives constantly, we could consider using the incremental proposal made for SOM in this thesis for its learning. Nevertheless, such data can change so fast that there may not be time to buffer it. Hence, the sequential training of SOM should be considered necessary. The problem with the sequential training of SOM is that the identification of the BMUs is not as direct and easy as in batch training, since the map is updated every time a new pattern is presented. One way of identifying these nodes is by projecting the training patterns onto the map once training has finished. However, as the data is not kept, this option is not feasible. Hence, a first problem would be to find a method of keeping track of the BMUs while training is performed, in order to use our itemset-support estimation method for the generation of association rules.

Sequential Pattern Mining

The task of sequential pattern mining, introduced in (Agrawal and Srikant, 1995), is similar to the traditional problem of ARM; nevertheless, its definition considers the concept of time in the generation of knowledge.
That is, the transactions representing the input data are defined by a customer id, an itemset, and a time stamp. As in ARM, the aim is to identify interesting patterns; however, in this particular case, the target is to discover sequences, which are ordered lists of itemsets, satisfying a minimum support constraint over the given data. Since in this thesis we have defined how to estimate itemset support from two trained ANNs, our approaches can be used as the basis of a proposal for this type of descriptive data-mining task. However, the main issue to be tackled is the way in which the input data will be presented to the ANN. In other words, a transformation method for the data is needed in order to form the corresponding vectorial patterns required to feed the neural network.

Appendix A

The Apriori Algorithm

Since we have used the well-known Apriori algorithm to evaluate and compare the itemset-support estimations derived from our ANN candidates through our proposals, the algorithm is defined in this section. The Apriori algorithm was independently proposed in (Agrawal and Srikant, 1994; Mannila et al., 1994). It performs the generation of association rules by following the stages established in the support-confidence framework (Agrawal et al., 1993). That is, first the frequent itemsets are generated, and then the corresponding association rules are formed from those frequent itemsets. Apriori tackles the complexity of the discovery of frequent itemsets by performing:

• The identification of only those itemsets whose support satisfies a minimum threshold.
• The formation of new itemsets, known as candidates, by joining the already discovered frequent itemsets. That is, the candidates representing the set of itemsets of length k are derived from combinations of the set of frequent (k-1)-itemsets.
• The calculation of the candidate support by counting the occurrences of the itemset in question directly in the high-dimensional space generated by the input data.
• The traversal of the itemset space in a BFS (Breadth-First Search) manner.
• The pruning of false candidates by considering the anti-monotone property of itemset support: all subsets of a frequent itemset must also be frequent (Agrawal and Srikant, 1994).

A complete specification of the algorithm is given in Figure A.1.

Figure A.1: The Apriori algorithm. This figure was extracted from the original paper of Agrawal (Agrawal and Srikant, 1994). The top pseudocode describes the main steps of Apriori. The bottom SQL query defines the way in which candidates are formed during a mining process.

It is important to state that this algorithm is historically important for ARM, because it has influenced the development of new ARM algorithms in one way or another since its introduction in 1994.

Appendix B

The Neural Network Candidate Algorithms

Since two well-known neural networks are part of our studies, we define below their corresponding training algorithms as employed in this thesis.

Auto-Associative Memory: Correlation Matrix Memory

A CMM (Correlation Matrix Memory), as shown in Figure B.1, is a single-layer memory whose size depends on the problem to be tackled. For instance, while a grid of n-by-m nodes is used to define a hetero-associative memory, a grid of m-by-m nodes defines an auto-associative memory. Moreover, depending on the format of its inputs, a CMM can have binary (Austin and Stonham, 1987) or real synapses (Kohonen, 1978; Haykin, 1999; Ham and Kostanic, 2001). This supervised memory is trained as follows:

1. A pair of input patterns X, Y is presented to the memory.
2. Its weight matrix is updated by a supervised Hebbian rule with the information defined by the input pair.
3. Steps 1 and 2 are repeated until no more pairs are available.
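The three training steps above can be sketched as follows. This is our own minimal illustration, where `binary=True` corresponds to the superposition used by a binary CMM and `binary=False` to the weighted sum, as detailed by Equations B.1 and B.2 below; the function names are assumptions.

```python
import numpy as np

def train_cmm(pairs, m, n, binary=True):
    """Train an n-by-m correlation matrix memory on (x, y) pattern pairs.

    Binary training superposes the outer products with a logical OR (Eq. B.1);
    weighted training sums them, so w_ij counts co-occurrences (Eq. B.2)."""
    W = np.zeros((n, m))
    for x, y in pairs:
        outer = np.outer(y, x)                  # Hebbian update for one pair
        W = np.logical_or(W, outer).astype(float) if binary else W + outer
    return W

def recall(W, x):
    """Stimulate the memory with x and read out the associated response."""
    return W @ x
```

For an auto-associative memory, the same vector is presented as both x and y, so the weighted matrix accumulates the co-occurrence counts of item pairs, which is what our support recalls exploit.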
Figure B.1: An associative memory based on a CMM.

In the case of a binary memory, its training is governed by Equation B.1, while the training of its numeric or weighted counterpart is defined by Equation B.2.

w_{ij} = \bigvee_{k=1}^{m} y_i^k x_j^k    (B.1)

w_{ij} = \sum_{k=1}^{m} y_i^k x_j^k    (B.2)

The training in Equations B.1 and B.2 is defined respectively by a superposition and a sum of the m matrices derived from the m training pairs. The idea behind this training is to accumulate the information of the input associations in such a way that, when a stimulus is presented, the memory is able to recall the corresponding associated patterns.

Self-Organising Map

The self-organising map (Kohonen, 1996) is one of the most utilised neural networks for data-mining problems because of its unsupervised ability to learn knowledge from data. The typical structure of a SOM is a two-dimensional arrangement of neurons which re-organises itself during training. Each neuron or node has an associated vector which models and summarises input patterns coming from the training data. Its training is based on search and update mechanisms. The former is responsible for allocating the input patterns to the neurons which map them best. Each such neuron m_c is called a winner or BMU (Best Matching Unit). A BMU m_c satisfies, for some input x, the following condition: \forall i, \|x(t) - m_c(t)\| < \|x(t) - m_i(t)\|. The update mechanism establishes when and how the map will be updated in order to learn the input data. Two modes, sequential and batch, are often used to accomplish this task (Kohonen, 1996; Kohonen, 1998). Both modes make this ANN converge through an iterative data presentation. Nevertheless, they differ in when the update of the network takes place. That is, while in the sequential or incremental mode the map is updated every time a new stimulus is presented, in the batch mode the map is modified only after all training vectors have been propagated.
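The winner search just described amounts to a nearest-node lookup over the codebook vectors. A minimal sketch, assuming NumPy and Euclidean distance (the function name is illustrative, not from the thesis):

```python
import numpy as np

def find_bmu(x, nodes):
    """Return the index of the best matching unit (BMU) for input x.

    The BMU m_c is the node minimising ||x - m_i|| over all nodes,
    i.e. the winner condition stated above.
    """
    distances = np.linalg.norm(np.asarray(nodes) - np.asarray(x), axis=1)
    return int(np.argmin(distances))
```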
In general, the steps involved in SOM training can be summarised as follows:

• Initialisation of the network. This means assigning either random or linear values to the nodes of the map.

• The presentation of a new input and the determination of its BMU. That is, a search process which compares the current input to the nodes in order to discover the best match for it.

• The update of the map. In the sequential mode, this is done by Equation B.3, while Equation B.4 governs the corresponding batch mode.

m_i(t+1) = m_i(t) + h_{c(x),i}(t) [x(t) - m_i(t)]    (B.3)

m_i(t+1) = \frac{\sum_j h_{ji}(t) S_j(t)}{\sum_j n_{V_j}(t) h_{ji}(t)}    (B.4)

Both equations use a neighbourhood or kernel function h, defined in Equation B.5, to spread the influence of the current input(s) along the map. α is known as the learning-rate factor, which holds a value between 0 and 1 and decreases monotonically in the sequential mode; in the batch mode this factor is constant and equal to one. r_i and r_j (r_c) \in \mathbb{R}^2 define the grid positions of a node receiving and producing stimulation, respectively.

h_{c(x),i} = h_{ji}(t) = \alpha(t) \exp\left(-\frac{\|r_i - r_j\|^2}{2\sigma^2(t)}\right)    (B.5)

In the case of Equation B.4, the term S_j, described by Equation B.6, represents the concentration of the n_{V_i} input patterns allocated to each node. In other words, it is the sum of all data points contained in the Voronoi region V_j = \{x_i \mid \|x_i - m_j\| < \|x_i - m_k\| \ \forall k \neq j\} defined by a node m_j.

S_i(t) = \sum_{j=1}^{n_{V_i}} x_j    (B.6)

This neural network has been employed for a large variety of problem domains, which will not be summarised here, but which can be consulted in (Oja et al., 2003), for instance.
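One sequential training step (Equation B.3 with the Gaussian kernel of Equation B.5) can be sketched as follows. This is a minimal illustrative sketch under our own simplifying assumptions (NumPy, fixed α and σ for a single step, Euclidean BMU search); it is not the thesis implementation:

```python
import numpy as np

def neighborhood(alpha, sigma, r_i, r_c):
    """Gaussian kernel of Equation B.5: alpha * exp(-||r_i - r_c||^2 / (2 sigma^2))."""
    d2 = np.sum((np.asarray(r_i, dtype=float) - np.asarray(r_c, dtype=float)) ** 2)
    return alpha * np.exp(-d2 / (2.0 * sigma ** 2))

def sequential_step(nodes, grid, x, alpha=0.5, sigma=1.0):
    """One sequential SOM update (Equation B.3).

    nodes: codebook vectors m_i; grid: their 2-D positions r_i.
    The BMU is found first; then every node moves towards x in
    proportion to the neighbourhood kernel centred on the BMU.
    """
    nodes = np.asarray(nodes, dtype=float)
    x = np.asarray(x, dtype=float)
    c = int(np.argmin(np.linalg.norm(nodes - x, axis=1)))   # BMU search
    for i in range(len(nodes)):
        h = neighborhood(alpha, sigma, grid[i], grid[c])
        nodes[i] += h * (x - nodes[i])                      # m_i += h (x - m_i)
    return nodes
```

The batch mode of Equation B.4 would instead accumulate the terms S_j and n_{V_j} over all inputs before modifying the map once.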
Nevertheless, for the case of data mining, a SOM has been an effective neural architecture for tasks such as the visualisation of high-dimensional data (Vesanto, 1999; Flexer, 1999), classification (Yin and Allinson, 2001), the description of data through clustering (Vesanto and Alhoniemi, 2000), and the generation of rules through the interpretation of its knowledge (Malone et al., 2006), among others. Furthermore, a SOM has also served as the basis for building more complex neural architectures, for instance, networks whose structure is not only organised but also grows autonomously during training (Alahakoon et al., 2000a) for the modelling of dynamic data.

Bibliography

Ceglar, A. and Roddick, J. F. (2006). Association mining. ACM Computing Surveys, 38(2):1–42.

Agarwal, R. C., Aggarwal, C. C., and Prasad, V. V. V. (2001). A tree projection algorithm for generation of frequent item sets. Journal of Parallel and Distributed Computing, 61(3):350–371.

Aggarwal, C. C. and Yu, P. S. (1998). A new framework for itemset generation. In PODS, pages 18–24. ACM Press.

Agrawal, R., Imielinski, T., and Swami, A. N. (1993). Mining association rules between sets of items in large databases. In Buneman, P. and Jajodia, S., editors, SIGMOD Conference, pages 207–216. ACM Press.

Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Bocca, J. B., Jarke, M., and Zaniolo, C., editors, VLDB, pages 487–499. Morgan Kaufmann.

Agrawal, R. and Srikant, R. (1995). Mining sequential patterns. In Yu, P. S. and Chen, A. S. P., editors, Eleventh International Conference on Data Engineering, pages 3–14, Taipei, Taiwan. IEEE Computer Society Press.

Alahakoon, D., Halgamuge, S. K., and Srinivasan, B. (2000a). Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Transactions on Neural Networks, 11(3):601–614.

Alahakoon, D., Halgamuge, S. K., and Srinivasan, B. (2000b).
Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Transactions on Neural Networks, 11(3):601–614.

Alatas, B. and Akin, E. (2006). An efficient genetic algorithm for automated mining of both positive and negative quantitative association rules. Soft Computing, 10(3):230–237.

Alhoniemi, E., Himberg, J., and Vesanto, J. (1999). Probabilistic measures for responses of self-organizing map units. In Proc. of International ICSC Congress on Computational Intelligence Methods and Applications (CIMA'99), pages 286–290, Rochester, N.Y., USA. ICSC Academic Press.

Andrews, R., Diederich, J., and Tickle, A. B. (1995). Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge Based Systems (UK), 8(6):378–389.

Aras, N., Altinel, I. K., and Oommen, J. (2003). A Kohonen-like decomposition method for the Euclidean traveling salesman problem: KNIES_DECOMPOSE. IEEE Transactions on Neural Networks, 14:869–890.

Attik, M., Bougrain, L., and Alexandre, F. (2005). Self-organizing map initialization. In Duch, W., Kacprzyk, J., Oja, E., and Zadrozny, S., editors, ICANN (1), volume 3696 of Lecture Notes in Computer Science, pages 357–362. Springer.

Aumann, Y. and Lindell, Y. (2003). A statistical theory for quantitative association rules. J. Intell. Inf. Syst, 20(3):255–283.

Austin, J. (1995). Distributed associative memories for high speed symbolic reasoning. International Journal on Fuzzy Sets and Systems, 82:223–233.

Austin, J. (1996). Associative memories and the application of neural networks to vision. In Handbook of Neural Computation. Institute of Physics and Oxford University Press.

Austin, J., Kennedy, J., and Lees, K. (1995). The advanced uncertain reasoning architecture. Weightless Neural Network Workshop.

Austin, J. and Stonham, T. (1987). An associative memory for use in image recognition and occlusion analysis. Image and Vision Computing, 5(4):251–261.

Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., and Lakhal, L. (2000).
Mining frequent patterns with counting inference. SIGKDD Explorations, 2(2):66–75.

Benitez, J. M., Castro, J. L., and Requena, I. (1997). Are artificial neural networks black boxes? IEEE Transactions on Neural Networks, 8(5):1156–1164.

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Bodon, F. (2003). A fast apriori implementation. In Goethals, B. and Zaki, M. J., editors, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), volume 90 of CEUR Workshop Proceedings, Melbourne, Florida, USA.

Bodon, F. (2006). A survey on frequent itemset mining. Technical report, Budapest University of Technology and Economics.

Borgelt, C. (2003). Efficient implementations of apriori and eclat.

Brin, S., Motwani, R., and Silverstein, C. (1997). Beyond market baskets: generalizing association rules to correlations. In SIGMOD '97: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 265–276, New York, NY, USA. ACM Press.

Browne, A., Hudson, B., Whitley, D., and Picton, P. (2003). Knowledge extraction from neural networks. In Proceedings of the 29th Annual Conference of the IEEE Industrial Electronics Society, Roanoke, Virginia, USA, pages 1909–1913.

Buhmann, M. D. (2003). Radial Basis Functions: Theory and Implementations. Cambridge Monographs on Applied and Computational Mathematics (No. 12). Cambridge.

Burdick, D., Calimlim, M., and Gehrke, J. (2001). Mafia: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the 17th International Conference on Data Engineering, pages 443–452, Washington, DC, USA. IEEE Computer Society.

Carpenter, G. A. and Grossberg, S. (1989). Search mechanisms for adaptive resonance theory (ART) architectures. In IEEE International Joint Conference on Neural Networks (3rd IJCNN'89), volume I, pages 201–205, Washington DC. IEEE.

Changchien, S. W. and Lu, T.-C. (2001).
Mining association rules procedure to support on-line recommendation by customers and products fragmentation. Expert Systems with Applications, 20(4):325–335.

Cheung, D. W., Han, J., Ng, V., Fu, A., and Fu, Y. (1996a). A fast distributed algorithm for mining association rules. In PDIS: International Conference on Parallel and Distributed Information Systems. IEEE Computer Society Technical Committee on Data Engineering, and ACM SIGMOD.

Cheung, D. W.-L., Han, J., Ng, V., and Wong, C. Y. (1996b). Maintenance of discovered association rules in large databases: An incremental updating technique. In ICDE, pages 106–114.

Cios, K. (2000). Data Mining Methods for Knowledge Discovery. Kluwer Academic.

Coenen, F., Goulbourne, G., and Leng, P. (2004a). Tree structures for mining association rules. Data Mining and Knowledge Discovery, 8(1):25–51.

Coenen, F., Leng, P., and Ahmed, S. (2004b). Data structure for association rule mining: T-trees and p-trees. IEEE Transactions on Knowledge and Data Engineering, 16(6):774–778.

Craven, M. and Shavlik, J. (1999). Rule extraction: Where do we go from here?

Craven, M. and Shavlik, J. W. (1993). Learning symbolic rules using artificial neural networks. In ICML, pages 73–80.

Craven, M. W. and Shavlik, J. W. (1997). Using neural networks for data mining. Future Generation Computer Systems, 13(2–3):211–229.

DeGroot, M. (1975). Probability and Statistics. Addison-Wesley.

Devroye, L. (1987). A Course in Density Estimation. Birkhauser, Boston.

Newman, D. J., Hettich, S., Blake, C. L., and Merz, C. J. (1998). UCI repository of machine learning databases.

Duch, W., Adamczak, R., and Grabczewski, K. (1996). Extraction of logical rules from training data using backpropagation networks. In First Polish Conference on Theory and Applications of Artificial Intelligence.

Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern Classification (2nd Edition). Wiley-Interscience.

Eggermont, J. (1998). Rule-extraction and learning in the BP-SOM architecture.

El-Hajj, M.
and Zaiane, O. (2003). Inverted matrix: Efficient discovery of frequent items in large datasets in the context of interactive mining.

Eom, J.-H. (2006). Neural feature association rule mining for protein interaction prediction. In Wang, J., Yi, Z., Zurada, J. M., Lu, B.-L., and Yin, H., editors, ISNN (2), volume 3973 of Lecture Notes in Computer Science, pages 690–695. Springer.

Eom, J.-H., Chang, J. H., and Zhang, B.-T. (2004). Prediction of implicit protein-protein interaction by optimal associative feature mining. In Yang, Z. R., Everson, R. M., and Yin, H., editors, IDEAL, volume 3177 of Lecture Notes in Computer Science, pages 85–91. Springer.

Eom, J.-H. and Zhang, B.-T. (2004). Adaptive neural network-based clustering of yeast protein-protein interactions. In Das, G. and Gulati, V. P., editors, CIT, volume 3356 of Lecture Notes in Computer Science, pages 49–57. Springer.

Eom, J.-H. and Zhang, B.-T. (2005). Prediction of yeast protein-protein interactions by neural feature association rule. In Duch, W., Kacprzyk, J., Oja, E., and Zadrozny, S., editors, ICANN (2), volume 3697 of Lecture Notes in Computer Science, pages 491–496. Springer.

Lourenço, F., Lobo, V., and Bação, F. (2004). Binary-based similarity measures for categorical data and their application in self-organizing maps. In JOCLAD 2004 - XI Jornadas de Classificação e Análise de Dados.

Flexer, A. (1999). On the use of self-organizing maps for clustering and visualization. In Principles of Data Mining and Knowledge Discovery, pages 80–88.

Furao, S. and Hasegawa, O. (2004). An incremental neural network for nonstationary unsupervised learning. In ICONIP, pages 641–646.

Gaber, J., Bahi, J. M., and El-Ghazawi, T. A. (2000a). Parallel mining of association rules with a hopfield type neural network. In ICTAI, page 90. IEEE Computer Society.

Gaber, K., Bahi, J., and El-Ghazawi, T. (2000b). Parallel mining of association rules with a hopfield type neural network.
In 12th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2000), pages 90–93, Vancouver, Canada.

Gardner-Medwin, A. R. and Barlow, H. B. (2001). The limits of counting accuracy in distributed neural representations. Neural Comput., 13(3):477–504.

Giraudel, J. L. and Lek, S. (2001). A comparison of self-organizing map algorithm and some conventional statistical methods for ecological community ordination. Ecological Modelling, 146(1–3):329–339.

Goethals, B. (2002). Efficient Frequent Pattern Mining. PhD thesis, University of Limburg, Belgium.

Goethals, B. (2003). Frequent itemset mining implementations repository.

Goethals, B. (2004). Memory issues in frequent itemset mining. In Haddad, H., Omicini, A., Wainwright, R. L., and Liebrock, L. M., editors, SAC, pages 530–534. ACM.

Goethals, B. and Zaki, M. J., editors (2003). FIMI '03, Frequent Itemset Mining Implementations, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, 19 December 2003, Melbourne, Florida, USA, volume 90 of CEUR Workshop Proceedings. CEUR-WS.org.

Gouda, K. and Zaki, M. J. (2001). Efficiently mining maximal frequent itemsets. In ICDM, pages 163–170.

Grahne, G. and Zhu, J. (2003). Efficiently using prefix-trees in mining frequent itemsets. In Proceedings of the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03). http://www.cs.concordia.ca/db/dbdm/dm.html.

Grahne, G. and Zhu, J. (2005). Fast algorithms for frequent itemset mining using fp-trees. IEEE Transactions on Knowledge and Data Engineering, 17(10):1347–1362.

Gunopulos, D., Mannila, H., and Saluja, S. (1997). Discovering all most specific sentences by randomized algorithms. In Afrati, F. N. and Kolaitis, P. G., editors, Database Theory - ICDT '97, 6th International Conference, Delphi, Greece, January 8-10, 1997, Proceedings, volume 1186 of Lecture Notes in Computer Science, pages 215–229. Springer.

Gupta, G., Strehl, A., and Ghosh, J. (1999).
Distance based clustering of association rules.

Ham, F. M. and Kostanic, I. (2001). Principles of Neurocomputing for Science and Engineering. McGraw-Hill.

Hammer, B., Rechtien, A., Strickert, M., and Villmann, T. (2002). Rule extraction from self-organizing networks. In Dorronsoro, J. R., editor, ICANN, volume 2415 of Lecture Notes in Computer Science, pages 877–883. Springer.

Han, E.-H., Karypis, G., and Kumar, V. (1997). Scalable parallel data mining for association rules. pages 277–288.

Han, J. and Kamber, M. (2000). Data Mining: Concepts and Techniques. Morgan Kaufmann.

Han, J., Pei, J., and Yin, Y. (2000a). Mining frequent patterns without candidate generation. In Chen, W., Naughton, J. F., and Bernstein, P. A., editors, SIGMOD Conference, pages 1–12. ACM.

Han, J., Pei, J., and Yin, Y. (2000b). Mining frequent patterns without candidate generation. In Chen, W., Naughton, J., and Bernstein, P. A., editors, 2000 ACM SIGMOD Intl. Conference on Management of Data, pages 1–12. ACM Press.

Han, J., Pei, J., and Yin, Y. (2000c). Mining frequent patterns without candidate generation. In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 1–12, New York, NY, USA. ACM Press.

Hand, D., Mannila, H., and Smyth, P. (2001). Principles of Data Mining. The MIT Press, Cambridge, Massachusetts.

Haykin, S. (1999). Neural Networks: A Comprehensive Foundation. Prentice-Hall, New York.

Heskes, T. (2001). Self-organizing maps, vector quantization, and mixture modeling. IEEE Transactions on Neural Networks, 12:1299–1305.

Hilderman, R. and Hamilton, H. (1999). Knowledge discovery and interestingness measures: A survey.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79:2554–2558.

Gordon, H. (1997). Discrete Probability. Springer, New York.

Hung, S. and Wermter, S. (2004). A time-based self-organising model for document clustering.
In The International Joint Conference on Neural Networks.

Jacobsson, H. (2005). Rule extraction from recurrent neural networks: A taxonomy and review. Neural Comput., 17(6):1223–1263.

Jiang, N. and Gruenwald, L. (2006). Research issues in data stream association rule mining. SIGMOD Rec., 35(1):14–19.

Jin, R. and Agrawal, G. (2002). Shared memory parallelization of data mining algorithms: Techniques.

Jolliffe, I. T. (1986). Principal Component Analysis. Series in Statistics. Springer-Verlag.

Joshi, M. V., Han, E.-H., Karypis, G., and Kumar, V. (1999). Efficient parallel algorithms for mining associations. In Large-Scale Parallel Data Mining, pages 83–126.

Bayardo Jr., R. J., Goethals, B., and Zaki, M. J., editors (2004). FIMI '04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, November 1, 2004, volume 126 of CEUR Workshop Proceedings. CEUR-WS.org.

Kantardzic, M. (2002). Data Mining: Concepts, Models, Methods, and Algorithms. IEEE Press and John Wiley.

Kaski, S., Nikkilä, J., and Kohonen, T. (1998a). Methods for interpreting a self-organized map in data analysis. In ESANN, pages 185–190.

Kaski, S., Nikkilä, J., and Kohonen, T. (1998b). Methods for interpreting a self-organized map in data analysis. In Verleysen, M., editor, Proceedings of ESANN'98, 6th European Symposium on Artificial Neural Networks, Bruges, April 22–24, pages 185–190. D-Facto, Brussels, Belgium.

Kecman, V. (2001). Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models. Complex Adaptive Systems. MIT Press, Cambridge, MA.

Kiang, M. Y. (2001). Extending the kohonen self-organizing map networks for clustering analysis. Computational Statistics & Data Analysis, 38(2):161–180.

Kimball, R. (1996). The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. John Wiley.

Kohonen, T. (1978). Associative Memory. Springer-Verlag, Berlin.

Kohonen, T. (1996). Self-Organizing Maps. Springer-Verlag.
Kohonen, T. (1998). The self-organizing map. Neurocomputing, 21(1–3):1–6.

Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., and Saarela, A. (2000). Self-organization of a massive document collection. IEEE Transactions on Neural Networks, 11:574–585.

Kohonen, T. and Somervuo, P. (1998). Self-organizing maps of symbol strings. Neurocomputing, 21(1–3):19–30.

Kohonen, T. and Somervuo, P. (2002). How to make large self-organizing maps for nonvectorial data. Neural Networks, 15(8–9):945–952.

Krishnan, R., Sivakumar, G., and Bhattacharya, P. (1999). A search technique for rule extraction from trained neural networks. Pattern Recognition Letters, 20(3):273–280.

Laerhoven, K. V. (2001). Combining the self-organizing map and K-means clustering for on-line classification of sensor data. In Dorffner, G., Bischof, H., and Hornik, K., editors, ICANN, volume 2130 of Lecture Notes in Computer Science, pages 464–469. Springer.

Laerhoven, K. V., Aidoo, K. A., and Lowette, S. (2001). Real-time analysis of data from many sensors with neural networks. In ISWC, pages 115–122. IEEE Computer Society.

Lawrence, R. D., Almasi, G. S., and Rushmeier, H. E. (1999a). A scalable parallel algorithm for self-organizing maps with applications to sparse data mining problems. Data Min. Knowl. Discov., 3(2):171–195.

Lawrence, R. D., Almasi, G. S., and Rushmeier, H. E. (1999b). A scalable parallel algorithm for self-organizing maps with applications to sparse data mining problems. Data Min. Knowl. Discov., 3(2):171–195.

Lebbah, M., Badran, F., and Thiria, S. (2000). Topological map for binary data. In ESANN, pages 267–272.

Lebbah, M., Chazottes, A., Badran, F., and Thiria, S. (2005). Mixed topological map. In ESANN, pages 357–362.

Leisch, F., Weingessel, A., and Dimitriadou, E. (1998). Competitive learning for binary valued data.
In Niklasson, L., Bodén, M., and Ziemke, T., editors, Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN 98), volume 2, pages 779–784, Skövde, Sweden. Springer.

Liu, J., Pan, Y., Wang, K., and Han, J. (2002). Mining frequent item sets by opportunistic projection.

Lu, H. J., Setiono, R., and Liu, H. (1996). Effective data mining using neural networks. IEEE Transactions on Knowledge and Data Engineering, 8:957–961.

Malone, J., McGarry, K., Wermter, S., and Bowerman, C. (2006). Data mining using rule extraction from kohonen self-organising maps. Neural Computing and Applications, 15(1):9–17.

Mannila, H. and Toivonen, H. (1997). Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241–258.

Mannila, H., Toivonen, H., and Verkamo, A. I. (1994). Efficient algorithms for discovering association rules. In Fayyad, U. M. and Uthurusamy, R., editors, AAAI Workshop on Knowledge Discovery in Databases (KDD-94), pages 181–192, Seattle, Washington. AAAI Press.

McGarry, K. J., Wermter, S., and MacIntyre, J. (1999). Knowledge extraction from radial basis function networks and multi-layer perceptrons. In IEEE International Conference on Neural Networks (IJCNN'99), volume IV, pages 2494–2497, Washington DC. IEEE.

Meo, R. (2003). Replacing support in association rule mining. Technical Report RT70-2003, Universita degli Studi di Torino.

Mitra, S., Pal, S. K., and Mitra, P. (2002). Data mining in soft computing framework: a survey. IEEE Transactions on Neural Networks, 13:3–14.

Oja, M., Kaski, S., and Kohonen, T. (2003). Bibliography of self-organizing map (som) papers: 1998–2001 addendum. Neural Computing Surveys, 3:1–156.

O'Keefe, S. (1995). Neural Networks for FAX Image Analysis. PhD thesis, University of York.

Omiecinski, E. (2003). Alternative interest measures for mining associations in databases. IEEE Trans. Knowl. Data Eng., 15(1):57–69.

Pasquier, N., Bastide, Y., Taouil, R., and Lakhal, L. (1998).
Pruning closed itemset lattices for association rules. In Proceedings of the BDA French Conference on Advanced Databases, October 1998.

Pasquier, N., Bastide, Y., Taouil, R., and Lakhal, L. (1999). Discovering frequent closed itemsets for association rules. Lecture Notes in Computer Science, 1540:398–416.

Pei, J., Han, J., and Mao, R. (2000). CLOSET: An efficient algorithm for mining frequent closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 21–30.

Piatetsky-Shapiro, G. (2000). Knowledge discovery in databases: 10 years after.

Piatetsky-Shapiro, G. and Frawley, W. J., editors (1991). Knowledge Discovery in Databases. AAAI/MIT Press.

Porrmann, M., Witkowski, U., and Ruckert, U. (2003). A massively parallel architecture for self-organizing feature maps. IEEE Transactions on Neural Networks, 14:1110–1121.

Rácz, B., Bodon, F., and Schmidt-Thieme, L. (2005). Benchmarking frequent itemset mining algorithms: from measurement to analysis. In Goethals, B., Nijssen, S., and Zaki, M. J., editors, Proceedings of ACM SIGKDD International Workshop on Open Source Data Mining (OSDM'05), pages 36–45, Chicago, IL, USA.

Rajaraman, K. and Tan, A.-H. (2001). Topic detection, tracking, and trend analysis using self-organizing neural networks. In Cheung, D. W.-L., Williams, G. J., and Li, Q., editors, PAKDD, volume 2035 of Lecture Notes in Computer Science, pages 102–107. Springer.

Bayardo Jr., R. J. (1998). Efficiently mining long patterns from databases. SIGMOD Rec., 27(2):85–93.

Salem, A.-B. M., Syiam, M. M., and Ayad, A. F. (2003). Improving self-organizing feature map (sofm) training algorithm using k-means initialization. In ICEIS (2), pages 399–405.

Sallans, B. (1997). Data mining for association rules with unsupervised neural networks: Csc final project.

Setiono, R. (2000). Extracting M-of-N rules from trained neural networks. IEEE Transactions on Neural Networks, 11(2):512.

Yang, S. and Zhang, Y. (2004).
Self-organizing feature map based data mining. volume 3173 of Lecture Notes in Computer Science, pages 193–198.

Shenoy, P., Haritsa, J. R., Sudarshan, S., Bhalotia, G., Bawa, M., and Shah, D. (2000). Turbo-charging vertical mining of large databases. pages 22–33.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York.

Song, H.-H. and Lee, S.-W. (1998). A self-organizing neural tree for large-set pattern classification. IEEE Transactions on Neural Networks, 9(3):369–380.

SPSS (1968). Clementine.

Srikant, R. and Agrawal, R. (1996). Mining quantitative association rules in large relational tables. In Jagadish, H. V. and Mumick, I. S., editors, Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 1–12, Montreal, Quebec, Canada.

Su, M.-C. and Chang, H.-T. (2001). A new model of self-organizing neural networks and its application in data projection. IEEE Transactions on Neural Networks, 12:153–158.

Su, M.-C., Liu, T.-K., and Chang, H.-T. (1999). An efficient initialization scheme for the self-organizing feature map algorithm. In IEEE International Conference on Neural Networks (IJCNN'99), volume III, pages 1906–1910, Washington DC. IEEE.

Taha, I. A. and Ghosh, J. (1999). Symbolic interpretation of artificial neural networks. Knowledge and Data Engineering, 11(3):448–463.

Tan, P.-N., Kumar, V., and Srivastava, J. (2002). Selecting the right interestingness measure for association patterns. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 32–41, New York, NY, USA. ACM Press.

Tickle, A. B., Andrews, R., Golea, M., and Diederich, J. (1998). The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9(6):1057.

Toivonen, H. (1996). Sampling large databases for association rules. In Vijayaraman, T. M., Buchmann, A. P., Mohan, C., and Sarda, N. L., editors, In Proc. 1996 Int. Conf.
Very Large Data Bases, pages 134–145. Morgan Kaufmann.

Towell, G. G. and Shavlik, J. W. (1993). Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71–101.

Tsukimoto, H. (2000). Extracting rules from trained neural networks. IEEE Transactions on Neural Networks, 11(2):377.

Ultsch, A. (1999). Data mining and knowledge discovery with emergent self-organizing feature maps for multivariate time series.

Ultsch, A. and Siemon, H. P. (1990). Kohonen's self organizing feature maps for exploratory data analysis. In INNC Paris 90, pages 305–308. Universität Dortmund.

Vapnik, V. (1998). Statistical Learning Theory. Wiley.

Vázquez, J. M., Macías, J. L. Á., and Santos, J. C. R. (2002). Discovering numeric association rules via evolutionary algorithm. In Cheng, M.-S., Yu, P. S., and Liu, B., editors, PAKDD, volume 2336 of Lecture Notes in Computer Science, pages 40–51. Springer.

Veloso, A. (2003). New parallel algorithms for frequent itemset mining in large databases.

Vesanto, J. (1999). SOM-based data visualization methods. Intelligent Data Analysis, 3:111–126.

Vesanto, J. and Ahola, J. (1999). Hunting for correlations in data using the self-organizing map. In Proc. of International ICSC Congress on Computational Intelligence Methods and Applications (CIMA'99), Rochester, New York, USA, June 22–25, pages 279–285. ICSC Academic Press.

Vesanto, J. and Alhoniemi, E. (2000). Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3):586–600.

Vesanto, J., Himberg, J., Alhoniemi, E., and Parhankangas, J. (1999). Self-organizing map in matlab: the SOM toolbox. In Proc. of Matlab DSP Conference 1999, Espoo, Finland, November 16–17, pages 35–40.

Vesanto, J., Himberg, J., Alhoniemi, E., and Parhankangas, J. (2000). SOM toolbox for matlab 5. Technical report.

Wijsen, J. and Meersman, R. (1998). On the complexity of mining quantitative association rules. Data Min. Knowl. Discov, 2(3):263–281.

Woon, Y.-K., Ng, W.-K., and Lim, E.-P. (2004).
A support-ordered trie for fast frequent itemset discovery. IEEE Transactions on Knowledge and Data Engineering, 16(7):875–879.

Yan, X., Cheng, H., Han, J., and Xin, D. (2005a). Summarizing itemset patterns: a profile-based approach. In Grossman, R., Bayardo, R., and Bennett, K. P., editors, KDD, pages 314–323. ACM.

Yan, X., Zhang, C., and Zhang, S. (2005b). Armga: Identifying interesting association rules with genetic algorithms. Applied Artificial Intelligence, 19(7):677–689.

Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA: ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms).

Yin, H. and Allinson, N. M. (2001). Self-organizing mixture networks for probability density estimation. IEEE Transactions on Neural Networks, 12:405–411.

Zaki, M. J. (1999). Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4):14–25.

Zaki, M. J. (2000). Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng, 12(2):372–390.

Zaki, M. J. and Hsiao, C.-J. (2002). Charm: An efficient algorithm for closed itemset mining. In Grossman, R. L., Han, J., Kumar, V., Mannila, H., and Motwani, R., editors, SDM. SIAM.

Zaki, M. J., Parthasarathy, S., Li, W., and Ogihara, M. (1996). Evaluation of sampling for data mining of association rules. Technical Report TR617.

Zaki, M. J., Parthasarathy, S., Ogihara, M., and Li, W. (1997a). New algorithms for fast discovery of association rules. In Heckerman, D., Mannila, H., Pregibon, D., and Uthurusamy, R., editors, 3rd Intl. Conf. on Knowledge Discovery and Data Mining, pages 283–296. AAAI Press.

Zaki, M. J., Parthasarathy, S., Ogihara, M., and Li, W. (1997b). Parallel algorithms for discovery of association rules. Data Mining and Knowledge Discovery, 1(4):343–373.

Zhang, G. P. (2000). Neural networks for classification: A survey. IEEE Transactions on Systems, Man, and Cybernetics, 30:451–462.