2nd Vaagdevi International Conference On Information Technology For Real World Problems
SOM-based Generating of Association Rules
Peddi Kishor¹, K. Yohan², V. Kishore³
¹ Sr. Asst. Professor, Department of CSE & IT, Sree Chaitanya Institute of Technological Sciences, Karimnagar, A.P., India
² Asst. Professor, Department of CSE, Vaageswari Engineering College, Karimnagar, A.P., India
³ Asst. Professor, Department of CSE, Vaageswari College of Engineering, Karimnagar, A.P., India
¹ [email protected], ² [email protected], ³ [email protected]
Abstract—We investigate whether a cluster set obtained from a clustering technique can correspond to a set of frequent patterns obtained through the Apriori approach. The clustering technique can be used to detect frequent patterns in a top-down manner, as opposed to the traditional approach that employs a bottom-up lattice search. Furthermore, we would like to find the necessary restrictions or limitations on the clustering technique so that the extracted patterns satisfy the specified support constraints.
Frequent pattern discovery is an essential step in
association rule mining. Most of the developed algorithms are variants of Apriori, based on the downward closure lemma within the support framework. Although the problem is approached differently, the commonality that remains in most current algorithms is that all possible candidate combinations are first enumerated, and then their frequencies are determined by scanning the transactional database. These two steps, candidate enumeration and testing, are known to be the major bottlenecks of Apriori-like approaches, which
inspired some work towards methods that avoid this
performance bottleneck. Depending on the aim of the
application, rules may be generated from the frequent
patterns discovered and the support and confidence of each
rule can be indicated. Rules with high confidence often
cover only small fractions of the total number of rules
generated, which makes their isolation more challenging
and costly.
Index Terms—Association Rule Mining, Data Mining, Frequent Patterns, Self-Organizing Map.
1 INTRODUCTION
Data mining, or knowledge discovery in databases, aims to find new knowledge in data. However, the dimensionality, complexity, or amount of the data is often prohibitively large for manual analysis. With huge amounts of data stored in databases, it is increasingly important to develop powerful tools for mining interesting knowledge from them. In recent years, the computational power of modern computer technology has made such mining fast and precise. One of the most interesting developments in this area is the application of neural computation.
Self-Organizing Map (SOM) (Kohonen, 1990) is a type of
neural network that uses the principles of competitive or
unsupervised learning. In unsupervised learning there is no
information about a desired output as is the case in
supervised learning. This unsupervised learning approach
forms abstractions by a topology preserving mapping of
high dimensional input patterns into a lower-dimensional
set of output clusters (Sestito & Dillon, 1994). These
clusters correspond to frequently occurring patterns of
features among the input data.
Due to its simple structure and learning mechanism, SOM has been used successfully in various applications and has proven to be an effective clustering technique. These useful properties motivated us to investigate whether the method can be applied to the problem of frequent pattern discovery in the association rule framework. More specifically, it needs to be determined whether the notion of finding frequent patterns can be substituted with the notion of finding clusters of frequently occurring patterns in the data.
There is a relationship between the task of separating
frequent patterns from infrequent patterns in data and
the task of finding clusters in data, as a cluster could
represent a particular frequently occurring pattern in the
data. A cluster would in this case correspond to a pattern, and hence it would be interesting to see whether a cluster set obtained from a clustering technique can correspond to a set of frequent patterns obtained through the Apriori approach.
The self-organizing map (SOM) [5] is an unsupervised
neural network algorithm. It has been widely applied to
solve problems such as pattern recognition, financial data
analysis, image analysis, process monitoring, and fault
diagnosis [3, 6]. Some major components of data mining are discussed, and the application of the SOM to association rule mining is proposed.

2 ASSOCIATION ANALYSES

Association rule mining finds the association rules that satisfy a predefined minimum support and confidence in a given database. The problem is usually decomposed into two sub-problems. The first is to find those itemsets whose occurrences exceed a predefined threshold in the database; those itemsets are called frequent or large itemsets. The second is to generate association rules from those large itemsets under the constraint of minimal confidence. Since the second sub-problem is quite straightforward, most research focuses on the first. The first sub-problem can be further divided into two parts: the candidate large itemset generation process and the frequent itemset generation process. We call those itemsets whose support exceeds the support threshold large or frequent itemsets, and those itemsets that are expected, or have the hope, to be large or frequent candidate itemsets.

A. Association Rule Mining Steps

Let I = {I1, I2, …, Im} be a set of items. Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ I. Each transaction is associated with an identifier, called a TID. Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., the union of sets A and B, or, say, both A and B). This is taken to be the probability P(A ∪ B). The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B|A). That is,

support(A ⇒ B) = P(A ∪ B) … (2.2)

confidence(A ⇒ B) = P(B|A) … (2.3)

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. By convention, we write support and confidence values so that they lie between 0% and 100% rather than between 0 and 1.0.

A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The set {computer, antivirus software} is a 2-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known simply as the frequency, support count, or count of the itemset. Note that the itemset support defined in Equation (2.2) is sometimes referred to as relative support, whereas the occurrence frequency is called the absolute support. If the relative support of an itemset I satisfies a pre-specified minimum support threshold (i.e., the absolute support of I satisfies the corresponding minimum support count threshold), then I is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk.

From Equation (2.3), we have

confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A) … (2.4)

Equation (2.4) shows that the confidence of rule A ⇒ B can easily be derived from the support counts of A and A ∪ B. That is, once the support counts of A, B, and A ∪ B are found, it is straightforward to derive the corresponding association rules A ⇒ B and B ⇒ A and check whether they are strong. Thus the problem of mining association rules can be reduced to that of mining frequent itemsets. In general, association rule mining can be viewed as a two-step process:

1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup.

2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.
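To make the two measures concrete, the following minimal Python sketch computes the support and confidence of a candidate rule directly from Equations (2.2)–(2.4); the transaction data and item names are illustrative placeholders, not taken from the paper:

    # Toy transaction database; each transaction is a set of items.
    D = [
        {"computer", "antivirus"},
        {"computer", "printer"},
        {"computer", "antivirus", "printer"},
        {"printer"},
    ]

    def support_count(itemset, transactions):
        # Absolute support: number of transactions containing the itemset.
        return sum(1 for t in transactions if itemset <= t)

    def support(itemset, transactions):
        # Relative support, Equation (2.2): P(itemset).
        return support_count(itemset, transactions) / len(transactions)

    def confidence(A, B, transactions):
        # Equation (2.4): support_count(A u B) / support_count(A).
        return support_count(A | B, transactions) / support_count(A, transactions)

    A, B = {"computer"}, {"antivirus"}
    print(support(A | B, D))    # 0.5, i.e., support(A => B) = 50%
    print(confidence(A, B, D))  # 0.666..., i.e., confidence(A => B) ~ 66.7%

With, say, min_sup = 50% and min_conf = 60%, the rule computer ⇒ antivirus would be reported as strong.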
3 SELF ORGANIZING MAP

A. Introduction

SOM (Kohonen, 1990) is an unsupervised neural network that effectively creates spatially organized "internal representations" of the features and abstractions detected in the input space. It consists of an input layer and an output layer in the form of a map (see Figure 1). SOM is based on competition among the cells in the map for the best match against a presented input pattern. Each node in the map has a weight vector associated with it, namely the weights on the links emanating from the input layer to that particular node. When an input pattern is imposed on the network, a node is selected from among all the output nodes as having the best response according to some criterion.
Figure 1: SOM consisting of two input nodes and a 3 × 3 map
This output node is declared the “winner” and is usually the
cell having the smallest Euclidean distance between its
weight vector and the presented input vector. The winner
and its neighboring cells are then updated to match the
presented input pattern more closely. Neighborhood size
and the magnitude of update shrink as the training proceeds.
After the learning phase, cells that respond in a similar
manner to the presented input patterns are located close to
each other, and so clusters can be formed in the map.
Existing similarities in the input space are revealed through
the ordered or topology preserving mapping of high
dimensional input patterns into a lower-dimensional set of
output clusters.
When used for classification purposes, SOM is commonly integrated with a type of supervised learning in order to assign appropriate class labels to the clusters. After the learning phase has come to completion, the weights on the links can be analyzed in order to represent the learned knowledge in a symbolic form. One such method is used in "Unsupervised BRAINNE" (Sestito & Dillon, 1994), which extracts a set of symbolic knowledge structures in the form of concepts and concept hierarchies from a trained neural network. A similar method is used in the current work. After the supervised learning is complete, each cluster has a rule or pattern associated with it, which determines which data objects are covered by that cluster.

The notion of competitive learning hints that the way knowledge is learnt by SOM is a competition based on certain criteria. Therefore, with a smaller output space the level of competition increases. Patterns exposed to SOM more frequently in the training phase are more likely to be learnt, while those exposed less frequently are more likely to be disregarded. The number of nodes in the output space, width × height, is normally set to at most 2n, where n denotes the number of features in the input space, i.e., the dimension of the input space.

B. Self-Organizing Map in Association Rules Mining

In association rules mining [12], a set of items is referred to as an itemset, and an itemset that contains k items is a k-itemset. Suppose I = {i1, i2, ..., im} is a set of items and D is a set of database transactions, where each transaction T is a set of items such that T ⊆ I. Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. It has the following two significant properties:

Support: support(A ⇒ B) = P(A ∪ B), the probability that a transaction in D contains both A and B.

Confidence: confidence(A ⇒ B) = P(B|A), the probability that a transaction in D that contains A also contains B.

Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong. Association rule mining generates those rules whose support and confidence are greater than or equal to the minimum support and confidence. An itemset that satisfies minimum support is called a frequent itemset; the set of frequent k-itemsets is commonly denoted by Lk.
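Since Apriori is the baseline against which the SOM approach is compared, the following minimal Python sketch shows how the sets Lk are built level by level; the toy transactions and threshold are illustrative only:

    from itertools import combinations

    # Toy transactions and an illustrative absolute support threshold.
    D = [{"I1", "I4"}, {"I2", "I5"}, {"I2", "I3", "I5"}, {"I1", "I2", "I5"}]
    min_count = 2

    def frequent_itemsets(transactions, min_count):
        items = sorted({i for t in transactions for i in t})
        L, k = {}, 1
        candidates = [frozenset([i]) for i in items]
        while candidates:
            # Test step: count each candidate by scanning the transactions.
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            Lk = {c for c, n in counts.items() if n >= min_count}
            if not Lk:
                break
            L[k] = Lk
            k += 1
            # Join step: unions of frequent (k-1)-itemsets that form k-itemsets.
            joined = {a | b for a in Lk for b in Lk if len(a | b) == k}
            # Prune step (downward closure): every (k-1)-subset must be frequent.
            candidates = [c for c in joined
                          if all(frozenset(s) in Lk for s in combinations(c, k - 1))]
        return L

    print(frequent_itemsets(D, min_count))
    # For the toy data: L1 = {I1}, {I2}, {I5}; L2 = {I2, I5}

The candidate enumeration and counting loop above is exactly the bottleneck discussed in the abstract; the SOM approach described next aims to reduce the data this loop must scan.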
C. SOM Clustering
Transactions in a database, such as those in Table 1, can be modeled as data vectors: each transaction is converted to an input vector whose i-th component is 1 if item i occurs in the transaction and 0 otherwise. This data modeling can be done at the time the transactions are extracted from the database. To train on the input vectors, a SOM or GHSOM [3] neural network is initialized to generate map units. Unlike the usual networks used for other pattern tasks, in this particular training the network has only one vector in each row, and each neuron only has neighbors in different rows. From the map units, one can easily find the relationships between different items.
The frequently occurring patterns (or rules) will
influence the organization of the map to the highest degree,
and hence after training the resulting clusters correspond to
the most frequently occurring patterns. The patterns that are
not covered by any clusters have not occurred often enough
to have an impact on the self-organization of the map and
are often masked out by frequent patterns.
Input Transformation: A transactional record, R, can be represented as a tuple over the n items {I1, I2, …, In}. A transactional database, T, of size s consists of records {R1, …, Rs}. For a record Ri, if an item Ij is part of the record, the corresponding component has the value 1, and 0 otherwise. For example, with n = 4 items and Ri = {I1, I3}, the representation is (1, 0, 1, 0). Representing transactional records in this way is suitable for the SOM's input layer.
Table 1. Transaction data for Allen's Electronics Branch
Training the SOM: The SOM needs to be trained with a large set of training data until it reaches a terminating condition. The terminating condition is reached when either the number of training iterations (epochs) has reached its maximum or the mean square error (mse) has reached its minimum or a pre-specified limit. The mse limit should be chosen small enough that the necessary correlations can be learned; on the other hand, it should be large enough to avoid over-fitting, which results in a decrease of the generalization capability (Sestito & Dillon, 1994).
Transactions in Table 1 are converted to SOM input samples as follows:
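A minimal Python sketch of this conversion is given below; since the Table 1 data is not reproduced here, the item list and transactions are illustrative placeholders:

    # Global item list I1..In (illustrative; Table 1's actual items not shown).
    items = ["I1", "I2", "I3", "I4", "I5"]

    # Hypothetical transactions, one itemset per TID.
    transactions = [
        {"I1", "I4"},
        {"I2", "I5"},
        {"I3", "I5"},
    ]

    def to_input_vector(transaction, items):
        # The i-th component is 1 if item i occurs in the transaction, else 0.
        return [1 if item in transaction else 0 for item in items]

    samples = [to_input_vector(t, items) for t in transactions]
    # e.g. {"I1", "I4"} -> [1, 0, 0, 1, 0]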
The SOM algorithm has two phases [5]. The first is the search phase, during which each node i computes the Euclidean distance between its weight vector wi(t) and the input vector T(t):

di(t) = ||T(t) − wi(t)||

It then chooses the closest neuron i0 such that

||T(t) − wi0(t)|| = min_i ||T(t) − wi(t)||

where neuron i0 is called the winner or winning cell. The second phase is the update phase, during which a small number of neurons within a neighborhood around the winning cell, including the winning cell itself, are updated by

wi(t + 1) = wi(t) + η(t)[T(t) − wi(t)]

where η(t) is the learning constant.
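As an illustration, here is a minimal NumPy sketch of the two phases; the 3 × 3 map geometry matches Figure 1, but the learning-rate schedule and neighborhood radius are assumptions rather than the paper's settings:

    import numpy as np

    rng = np.random.default_rng(0)
    n_items, rows, cols = 5, 3, 3              # input dimension and map size
    W = rng.random((rows * cols, n_items))     # one weight vector per map node
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])

    def train(samples, epochs=100, eta0=0.5, radius0=1.5):
        for t in range(epochs):
            eta = eta0 * (1 - t / epochs)                  # eta(t) decays
            radius = max(radius0 * (1 - t / epochs), 0.5)  # neighborhood shrinks
            for x in samples:
                x = np.asarray(x, dtype=float)
                # Search phase: the winner i0 minimises the Euclidean distance.
                i0 = np.argmin(np.linalg.norm(W - x, axis=1))
                # Update phase: move the winner and its map neighbours toward x.
                grid_dist = np.linalg.norm(coords - coords[i0], axis=1)
                hood = grid_dist <= radius
                W[hood] += eta * (x - W[hood])

    train([[1, 0, 0, 1, 0], [0, 1, 0, 0, 1], [0, 0, 1, 0, 1]])

After training, nodes whose weight vectors respond to similar input vectors lie close together on the map, which is the clustering effect exploited below.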
Figure 2. Outputs of 5000 iterations of SOM training on the transactions
D. Association Rules Mining on SOM Clusters
The purpose of using SOM to train the data set is to obtain a visualization of the structure of the transactions and to reduce the association rule mining time. After iterations of SOM training, if two or more neurons are in the same cluster, these neurons have one or more similar or identical components in their weight vectors; therefore, transactions that share some, or even exactly the same, items will fall in the same cluster. The more transactions there are in one cluster, the higher the support of the itemsets in that cluster. Using the SOM-trained outputs, we have the following two approaches to generate association rules for the database:
1. Obtain the association rules by observing the SOM-trained map units. Fig. 2 shows the map units after 5000 iterations of SOM training on the transactions of Allen's Electronics Branch [1]. From the three classes in this figure, we can draw immediate conclusions about the association rules for this mining task. For example, the itemsets {I5, I6}, {I3, I5, I6}, {I2, I5, I6}, {I5, I6, I10}, and {I1, I4, I9} and their subsets have larger support counts than any other itemsets, so these items must have some associated rules.
Table 2. Items used to generate association rules for the database
2. Generate association rules for a given minimum support count. For a k-itemset in Fig. 2, the support count of a 1-itemset is the total count of 1s in the corresponding column. If the support count of a 1-itemset is less than the minimum support count, then the support count of any k-itemset that contains this 1-itemset will also be less than the minimum support count. To save time, we remove those columns whose total counts in the map units are less than the minimum support count, and then use the Apriori algorithm to generate strong association rules for the remaining items. This takes less scanning time but generates the same association rules for the database. In Fig. 2, if the minimum support count is 4, then column 7 and column 13 will be removed from the map units because their support counts are 3 and 2, respectively. Table 2 shows the remaining items and their 1-itemset support counts; the association rules will be generated based on these items.
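A minimal Python sketch of this column-pruning step is given below; the binary map-unit rows are illustrative placeholders standing in for the SOM-trained map units of Fig. 2:

    min_support_count = 4

    # Hypothetical binary map-unit rows; columns correspond to items I1..I5.
    map_units = [
        [1, 0, 1, 1, 0],
        [1, 1, 0, 1, 0],
        [1, 1, 1, 1, 1],
        [1, 1, 0, 1, 0],
        [0, 1, 0, 0, 1],
    ]

    # 1-itemset support count = total count of 1s in each column.
    col_counts = [sum(row[j] for row in map_units)
                  for j in range(len(map_units[0]))]

    # Keep only columns whose 1-itemset count meets the minimum support count;
    # no k-itemset containing a dropped item can be frequent (downward closure).
    kept = [j for j, c in enumerate(col_counts) if c >= min_support_count]
    pruned = [[row[j] for j in kept] for row in map_units]
    # Apriori then runs on the pruned columns and yields the same strong rules.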
The benefit of this approach is that a large database may generate thousands of transactions for analysis, but we are only interested in the data that appear repeatedly. Items that appear in only a few transactions may not be important for the mining.

In Apriori-style association rule mining, the computation of the frequent k-itemsets Lk (k = 1, 2, ...) and their candidates Ck can be very tedious. We use SOM clustering to classify the transaction data, since it organizes data items into neighborhoods when they have some similar properties (the same items appear in many different transactions). The structures of those data are thereby visualized, and from the structured map we can remove some columns and choose the data for association rule mining.
4 CONCLUSIONS AND FUTURE WORK
Many algorithms have been proposed that use SOM for data mining. The Apriori algorithm is a very efficient utility when the database is relatively small; however, it takes too much time to repeatedly scan a large-scale database. The combination of SOM training and the Apriori algorithm can take advantage of both. In this algorithm, each input vector represents a transaction in a database, and a Kohonen Self-Organizing Map is used to train the input vectors. Because the SOM maps similar inputs to neighboring neurons, it provides a visualization of the relationships between items in the database. The simulation experiment shows that the feature map can be formed in a very short time. By reviewing the SOM outputs, one can easily determine the positive and negative association rules in the database. Clearly, the proposed algorithm improves the data mining significantly.
REFERENCES

1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proc. 1994 Int. Conf. on Very Large Data Bases (VLDB'94) (1994).

2. Agrawal, R., Srikant, R.: Mining Sequential Patterns. In: Proc. Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan (1995).

3. Deboeck, G., Kohonen, T.: Visual Explorations in Finance Using Self-Organizing Maps. Springer-Verlag, London (1998).

4. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Higher Education Press, Beijing, China (2001).

5. Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., Saarela, A.: Self-Organization of a Massive Document Collection. IEEE Transactions on Neural Networks, Vol. 11, No. 3 (2000) 574–585.

6. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Berlin and Heidelberg, Germany (1995).

7. Kohonen, T.: The Self-Organizing Map. Proceedings of the IEEE, Vol. 78, No. 9 (1990) 1464–1480.

8. Simula, O., Kangas, J.: Process Monitoring and Visualization Using Self-Organizing Maps. In: Neural Networks for Chemical Engineers, Computer-Aided Chemical Engineering, Vol. 6, Chapter 14. Elsevier, Amsterdam (1995).

9. Pyle, D.: Data Preparation for Data Mining. Morgan Kaufmann Publishers, San Francisco (1999).

10. Rauber, A., Merkl, D., Dittenbach, M.: The Growing Hierarchical Self-Organizing Map: Exploratory Analysis of High-Dimensional Data. IEEE Transactions on Neural Networks, Vol. 13, No. 6 (2002) 1331–1341.

11. Sestito, S., Dillon, T.S.: Automated Knowledge Acquisition. Prentice Hall of Australia, Sydney (1994).

12. Yang, S., Zhang, Y.: Self-Organizing Feature Map Based Data Mining. University of Electronic Science and Technology of China, China.

13. Kohonen, T.: Self-Organizing Maps. Springer, Berlin, Heidelberg (1995).

14. Kohonen, T., Oja, E., Simula, O., Visa, A., Kangas, J.: Engineering Applications of the Self-Organizing Map. Proceedings of the IEEE, Vol. 84, No. 10 (1996).

15. Vesanto, J., Alhoniemi, E.: Clustering of the Self-Organizing Map. IEEE Transactions on Neural Networks, Vol. 11, No. 3 (2000) 586–600.

16. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd Edition (2006).