2nd Vaagdevi International Conference On Information Technology For Real World Problems

SOM-based Generation of Association Rules

Peddi Kishor 1, K. Yohan 2, V. Kishore 3
1 Sr. Asst. Professor, Department of CSE & IT, Sree Chaitanya Institute of Technological Sciences, Karimnagar, A.P., India
2 Asst. Professor, Department of CSE, Vaageswari Engineering College, Karimnagar, A.P., India
3 Asst. Professor, Department of CSE, Vaageswari College of Engineering, Karimnagar, A.P., India
[email protected], [email protected], [email protected]

Abstract— This work investigates whether a cluster set obtained from a clustering technique can correspond to a set of frequent patterns obtained through the Apriori approach. The clustering technique can then be used to detect frequent patterns in a top-down manner, as opposed to the traditional approach that employs a bottom-up lattice search. Furthermore, we would like to find the necessary restrictions or limitations on the clustering technique so that the extracted patterns satisfy the specified support constraints. Frequent pattern discovery is an essential step in association rule mining. Most of the developed algorithms are variants of Apriori and are based on the downward closure lemma within the support framework. Although the problem is approached in different ways, a commonality in most current algorithms is that all possible candidate combinations are first enumerated, and their frequency is then determined by scanning the transactional database. These two steps, candidate enumeration and testing, are known to be the major bottlenecks of Apriori-like approaches, which has inspired work on methods that avoid this performance bottleneck. Depending on the aim of the application, rules may be generated from the discovered frequent patterns, and the support and confidence of each rule can be reported. Rules with high confidence often make up only a small fraction of the total number of rules generated, which makes their isolation more challenging and costly.
Index Terms— Association Rule Mining, Data Mining, Frequent Patterns, Self-Organizing Map.

1 INTRODUCTION

Data mining, or knowledge discovery in databases, aims to find new knowledge in data whose dimensionality, complexity, or volume is prohibitively large for manual analysis. With huge amounts of data stored in databases, it is increasingly important to develop powerful tools for mining interesting knowledge from them. In recent years, the computational efficiency of modern computer technology has made such mining fast and precise. One of the most interesting developments in this area is the application of neural computation.

The Self-Organizing Map (SOM) (Kohonen, 1990) is a type of neural network that uses the principles of competitive, or unsupervised, learning. In unsupervised learning there is no information about a desired output, as there is in supervised learning. This unsupervised learning approach forms abstractions by a topology-preserving mapping of high-dimensional input patterns onto a lower-dimensional set of output clusters (Sestito & Dillon, 1994). These clusters correspond to frequently occurring patterns of features among the input data. Due to its simple structure and learning mechanism, SOM has been used successfully in various applications and has proven to be one of the most effective clustering techniques.
The useful properties of SOM mentioned above have motivated us to investigate whether the method can be applied to the problem of frequent pattern discovery in the association rule framework. More specifically, it needs to be determined whether the notion of finding frequent patterns can be substituted with the notion of finding clusters of frequently occurring patterns in the data. There is a relationship between the task of separating frequent patterns from infrequent patterns and the task of finding clusters in data, since a cluster could represent a particular frequently occurring pattern. A cluster would in this case correspond to a pattern, and hence it is interesting to see whether a cluster set obtained from a clustering technique can correspond to a set of frequent patterns obtained through the Apriori approach.

The self-organizing map (SOM) [5] is an unsupervised neural network algorithm. It has been widely applied to problems such as pattern recognition, financial data analysis, image analysis, process monitoring, and fault diagnosis [3, 6]. Below, some major components of data mining are discussed and applications of the SOM to association rule mining are proposed.

2 ASSOCIATION ANALYSES

Association rule mining aims to find association rules that satisfy a predefined minimum support and confidence in a given database. The problem is usually decomposed into two sub-problems. The first is to find those itemsets whose occurrence exceeds a predefined threshold in the database; these itemsets are called frequent or large itemsets. The second is to generate association rules from those large itemsets under the constraint of minimal confidence. Since the second sub-problem is quite straightforward, most research focuses on the first. The first sub-problem can be further divided into a candidate large itemset generation process and a frequent itemset generation process. Itemsets whose support exceeds the support threshold are called large or frequent itemsets, while itemsets that are expected, or hoped, to be large or frequent are called candidate itemsets.

A. Association Rule Mining Steps

Let I = {I1, I2, ..., Im} be a set of items. Let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items such that T ⊆ I. Each transaction is associated with an identifier, called a TID. Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A => B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule A => B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., the union of sets A and B, or both A and B). This is taken to be the probability P(A ∪ B). The rule A => B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B|A). That is,

    support(A => B) = P(A ∪ B) ............... (2.2)
    confidence(A => B) = P(B|A) ............... (2.3)

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. By convention, we write support and confidence values as percentages between 0% and 100% rather than as values between 0 and 1.0. A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset; the set {computer, antivirus software} is a 2-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known simply as the frequency, support count, or count of the itemset. Note that the itemset support defined in Equation (2.2) is sometimes referred to as relative support, whereas the occurrence frequency is called the absolute support. If the relative support of an itemset I satisfies a pre-specified minimum support threshold (i.e., the absolute support of I satisfies the corresponding minimum support count threshold), then I is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk. From Equation (2.3), we have

    confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A) ............... (2.4)

Equation (2.4) shows that the confidence of rule A => B can be easily derived from the support counts of A and A ∪ B. That is, once the support counts of A, B, and A ∪ B are found, it is straightforward to derive the corresponding association rules A => B and B => A and check whether they are strong. Thus the problem of mining association rules can be reduced to that of mining frequent itemsets. In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.

3 SELF ORGANIZING MAP

A. Introduction

SOM (Kohonen, 1990) is an unsupervised neural network that effectively creates spatially organized "internal representations" of the features and abstractions detected in the input space. It consists of an input layer and an output layer in the form of a map (see Figure 1). SOM is based on competition among the cells of the map for the best match against a presented input pattern. Each node in the map has an associated weight vector, consisting of the weights on the links emanating from the input layer to that particular node. When an input pattern is imposed on the network, a node is selected from among all the output nodes as having the best response according to some criterion.

Figure 1: SOM consisting of two input nodes and a 3 x 3 map

This output node is declared the "winner" and is usually the cell having the smallest Euclidean distance between its weight vector and the presented input vector. The winner and its neighboring cells are then updated to match the presented input pattern more closely. The neighborhood size and the magnitude of the update shrink as training proceeds. After the learning phase, cells that respond in a similar manner to the presented input patterns are located close to each other, so clusters can form in the map. Existing similarities in the input space are revealed through the ordered, topology-preserving mapping of high-dimensional input patterns onto a lower-dimensional set of output clusters.

When used for classification purposes, SOM is commonly integrated with a type of supervised learning in order to assign appropriate class labels to the clusters. After the learning phase is complete, the weights on the links can be analyzed in order to represent the learned knowledge in symbolic form. One such method is used in "Unsupervised BRAINNE" (Sestito & Dillon, 1994), which extracts a set of symbolic knowledge structures, in the form of concepts and concept hierarchies, from a trained neural network. A similar method is used in the current work. After the supervised learning is complete, each cluster has a rule or pattern associated with it, which determines which data objects are covered by that cluster.

B. Self-Organizing Map in Association Rules Mining

In association rules mining [12], a set of items is referred to as an itemset, and an itemset that contains k items is a k-itemset. Suppose I = {i1, i2, ..., im} is a set of items and D is a set of database transactions, where each transaction T is a set of items such that T ⊆ I. Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A => B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. It has the following two significant properties:

Support: support(A => B) = P(A ∪ B), the probability that a transaction in D contains both A and B.
Confidence: confidence(A => B) = P(B|A), the probability that a transaction in D containing A also contains B.

Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong. Association rule mining generates the rules whose support and confidence are greater than or equal to the minimum support and confidence. An itemset that satisfies minimum support is called a frequent itemset, and the set of frequent k-itemsets is commonly denoted by Lk.

The notion of competitive learning implies that the way knowledge is learned by SOM is competitive, based on certain criteria. Therefore, with a smaller output space the level of competition increases. Patterns exposed more frequently to SOM in the training phase are more likely to be learned, while those exposed less frequently are more likely to be disregarded. The number of nodes in the output space, width x height, is normally set less than or equal to 2n, where n denotes the number of features in the input space, i.e. the dimension of the input space.

C. SOM Clustering

Transactions in a database, as in Table 1, can be modeled as data vectors: each transaction is converted to an input vector in which the i-th component is 1 if item i appears in the transaction and 0 otherwise. This data modeling can be done at the time the transactions are extracted from the database. To train the input vectors, a SOM or GHSOM [10] neural network can be initialized to generate map units.
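The 0/1 input transformation just described, together with the support and confidence measures of Section 2, can be sketched in a few lines. The item names and transactions below are illustrative only (not the actual data of Table 1), and the function names are our own:

```python
# Illustrative items and transactions (not the paper's Table 1 data).
ITEMS = ["I1", "I2", "I3", "I4", "I5"]

transactions = [
    {"I1", "I2", "I5"},
    {"I2", "I4"},
    {"I2", "I3"},
    {"I1", "I2", "I4"},
    {"I1", "I3"},
]

def to_vector(transaction, items=ITEMS):
    """Convert one transaction into a 0/1 input vector for the SOM."""
    return [1 if item in transaction else 0 for item in items]

def support_count(itemset, db):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

def support(itemset, db):
    """Relative support, Equation (2.2), as a fraction."""
    return support_count(itemset, db) / len(db)

def confidence(a, b, db):
    """Confidence, Equation (2.4): count(A u B) / count(A)."""
    return support_count(a | b, db) / support_count(a, db)

vectors = [to_vector(t) for t in transactions]
print(vectors[0])                                # [1, 1, 0, 0, 1]
print(support({"I1", "I2"}, transactions))       # 0.4
print(confidence({"I1"}, {"I2"}, transactions))  # 2/3, i.e. about 0.667
```

A rule such as I1 => I2 is strong here only if 0.4 and 0.667 meet the chosen min_sup and min_conf thresholds.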
Unlike the usual neural networks used for other pattern tasks, in this particular training the network has only one vector in each row, so each neuron has neighbors only in different rows. From the map units, one can easily find the relationships between different items. The frequently occurring patterns (or rules) influence the organization of the map to the highest degree, and hence after training the resulting clusters correspond to the most frequently occurring patterns. Patterns that are not covered by any cluster have not occurred often enough to have an impact on the self-organization of the map and are often masked out by frequent patterns.

Input Transformation: A transactional record, R, can be represented as a tuple over the n items {I1, I2, ..., In}. A transactional database, T, of size s consists of records {R1, ..., Rs}. For a record Ri, if an item Ij is part of the record then the corresponding component has value 1, and 0 otherwise. Representing transactional records in this way is suitable for the SOM's input layer.

Table 1. Transaction data for Allen's Electronics Branch

Training the SOM: SOM needs to be trained with a large set of training data until it reaches a terminating condition. The terminating condition is reached when either the number of training iterations (epochs) has reached its maximum, or the mean square error (mse) has reached its minimum or a pre-specified limit. The mse value should be chosen small enough that the necessary correlations can be learned; on the other hand, it should be large enough to avoid over-fitting, which results in a decrease of the generalization capability (Sestito & Dillon, 1994). Transactions in Table 1 are converted to SOM input samples in this way.

The SOM algorithm has two phases [5]. The first is the search phase, during which each node i computes the Euclidean distance between its weight vector wi(t) and the input vector T(t):

    di(t) = || T(t) - wi(t) ||

It then chooses the closest neuron i0 such that

    || T(t) - wi0(t) || = min over i of || T(t) - wi(t) ||

where neuron i0 is called the winner or winning cell. The second phase is the update phase, during which a small number of neurons within a neighborhood around the winning cell, including the winning cell itself, are updated by

    wi(t+1) = wi(t) + η(t) [ T(t) - wi(t) ]

where η(t) is the learning constant.

Figure 2. Outputs of 5000 iterations of SOM trained on the transactions

D. Association Rules Mining on SOM Clusters

The benefit of this approach is that a large database may yield thousands of transactions for analysis, but we are interested only in the data that appear repeatedly; items that appear in only a few transactions may not be important for the mining. The purpose of using SOM to train the data set is to obtain a visualization of the structure of the transactions and to reduce the association rule mining time. After iterations of SOM training, if two or more neurons are in the same cluster, these neurons have one or more similar or identical components in their weight vectors; therefore, transactions that share some, or even exactly the same, items will fall in the same cluster. The more transactions there are in a cluster, the higher the support of the itemsets in that cluster. Using the SOM-trained outputs, we have the following two approaches to generate association rules for the database:

1. Obtain the association rules by observing the SOM-trained map units. Figure 2 shows the map units after 5000 iterations of SOM training on the transactions of Allen's Electronics Branch [1]. From the three classes in this figure, we can draw immediate conclusions about the association rules. For example, the itemsets {I5, I6}, {I3, I5, I6}, {I2, I5, I6}, {I5, I6, I10}, and {I1, I4, I9} and their subsets have larger support counts than any other itemsets, so these items must have some associated rules.

2. Generate association rules for a given minimum support count. The support count of a 1-itemset is the total count of 1s in the corresponding column of the map units (Figure 2). If the support count of a 1-itemset is less than the minimum support count, then the support count of any k-itemset containing that 1-itemset is also less than the minimum support count. To save time, we remove those columns whose total counts in the map units are less than the minimum support count, and then use the Apriori algorithm to generate strong association rules for the remaining items. This takes less scanning time but generates the same association rules for the database. In Figure 2, if the minimum support count is 4, then column 7 and column 13 are removed from the map units because their support counts are 3 and 2, respectively. Table 2 shows the remaining items and their 1-itemset support counts; the association rules are generated from these items.

Table 2. Items used to generate association rules for the database

In Apriori association rule mining, the computation of the frequent k-itemsets Lk (k = 1, 2, ...) and their candidates Ck is very tedious. We use SOM clustering to classify the transaction data: it organizes data into neighborhoods when they have similar properties (the same items appear in many different transactions). The structure of the data is therefore visualized, and from the structured map we can remove some columns and choose data for association rule mining.
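The column-pruning step of approach 2 can be sketched as follows. The 0/1 matrix below is illustrative (rows are map units, columns are items), not the actual data of Figure 2, and the function names are our own:

```python
def column_counts(matrix):
    """Support count of each 1-itemset: the total count of 1s per column."""
    return [sum(col) for col in zip(*matrix)]

def prune_columns(matrix, min_count):
    """Remove columns whose 1-itemset support count is below min_count.

    Returns the indices of the kept columns and the reduced matrix,
    which would then be handed to Apriori for rule generation."""
    counts = column_counts(matrix)
    keep = [j for j, c in enumerate(counts) if c >= min_count]
    reduced = [[row[j] for j in keep] for row in matrix]
    return keep, reduced

# Illustrative map units: column support counts are 4, 2, 1, 4.
matrix = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
]

keep, reduced = prune_columns(matrix, min_count=2)
print(keep)  # [0, 1, 3] -- the column with count 1 is removed
```

Any k-itemset using a pruned column is guaranteed to miss the minimum support count, which is why dropping these columns cannot change the rules Apriori finds.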
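The two SOM phases of Section 3 (the search for the winning cell, then the neighborhood update) can also be sketched directly on binary transaction vectors. The learning rate, neighborhood radius, map size, and 1-D layout below are illustrative choices, not parameters specified in the paper:

```python
import math
import random

def train_som(vectors, n_units, epochs=100, eta=0.5, radius=1, seed=0):
    """Minimal 1-D SOM sketch: search phase (winner = smallest Euclidean
    distance), then update phase on the winner's neighborhood."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for _ in range(epochs):
        for x in vectors:
            # Search phase: winning cell i0 minimizes ||x - w_i||.
            dists = [math.dist(x, w) for w in weights]
            i0 = dists.index(min(dists))
            # Update phase: move the winner and its neighbors toward x.
            for i in range(max(0, i0 - radius), min(n_units, i0 + radius + 1)):
                weights[i] = [w + eta * (xj - w)
                              for w, xj in zip(weights[i], x)]
    return weights

def winner(x, weights):
    """Index of the winning cell for input vector x."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))

# Two repeated binary patterns stand in for frequently occurring transactions.
vectors = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
weights = train_som(vectors, n_units=4)
```

In a full implementation, both eta and the neighborhood radius would shrink over time, as described in Section 3.A; they are held constant here for brevity.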
4 CONCLUSIONS AND FUTURE WORK

Many algorithms have been proposed that use SOM for data mining. The Apriori algorithm is a very efficient utility when the database is relatively small; however, it takes too much time to repeatedly scan a large-scale database. The combination of SOM training and the Apriori algorithm can take advantage of both. In this algorithm, each input vector represents a transaction in a database, and a Kohonen Self-Organizing Map is used to train the input vectors. By the principles of the SOM, a similarity-preserving neighborhood neural network, we obtain a visualization of the relationships between items in the database. The simulation experiment shows that the feature map can be formed in a very short time. By reviewing the SOM outputs, one can easily determine the positive and negative association rules in the database. Clearly, the proposed algorithm improves the data mining significantly.

REFERENCES

1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proc. 1994 Int. Conf. on Very Large Data Bases (VLDB'94) (1994).
2. Agrawal, R., Srikant, R.: Mining Sequential Patterns. In: Proc. Int. Conf. on Data Engineering (ICDE), Taipei, Taiwan (1995).
3. Deboeck, G., Kohonen, T.: Visual Explorations in Finance Using Self-Organizing Maps. Springer-Verlag, London (1998).
4. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Higher Education Press, Beijing, China (2001).
5. Kohonen, T., Kaski, S., Lagus, K., Salojarvi, J., Honkela, J., Paatero, V., Saarela, A.: Self-Organization of a Massive Document Collection. IEEE Transactions on Neural Networks, Vol. 11, No. 3 (2000) 574-585.
6. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Berlin and Heidelberg, Germany (1995).
7. Kohonen, T.: The Self-Organizing Map. Proceedings of the IEEE, Vol. 78, No. 9 (1990) 1464-1480.
8. Simula, O., Kangas, J.: Process Monitoring and Visualization Using Self-Organizing Maps. In: Neural Networks for Chemical Engineers, Computer-Aided Chemical Engineering, Vol. 6, Chapter 14. Elsevier, Amsterdam (1995).
9. Pyle, D.: Data Preparation for Data Mining. Morgan Kaufmann Publishers, San Francisco (1999).
10. Rauber, A., Merkl, D., Dittenbach, M.: The Growing Hierarchical Self-Organizing Map: Exploratory Analysis of High-Dimensional Data. IEEE Transactions on Neural Networks, Vol. 13, No. 6 (2002) 1331-1341.
11. Sestito, S., Dillon, T.S.: Automated Knowledge Acquisition. Prentice Hall of Australia, Sydney (1994).
12. Yang, S., Zhang, Y.: Self-Organizing Feature Map Based Data Mining. University of Electronic Science and Technology of China, China.
13. Kohonen, T.: Self-Organizing Maps. Springer, Berlin, Heidelberg (1995).
14. Kohonen, T., Oja, E., Simula, O., Visa, A., Kangas, J.: Engineering Applications of the Self-Organizing Map. Proceedings of the IEEE, Vol. 84, No. 10 (1996).
15. Vesanto, J., Alhoniemi, E.: Clustering of the Self-Organizing Map. IEEE Transactions on Neural Networks, Vol. 11, No. 3 (2000) 586-600.
16. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. (2006).