International Journal of Computer Trends and Technology (IJCTT) – Volume 10 Number 4 – Apr 2014 – ISSN: 2231-2803 – http://www.ijcttjournal.org
Minimizing Spurious Patterns Using Association Rule Mining

Ruchi Goel, M.Tech (CSE), Department of Computer Science, Jamia Hamdard University, New Delhi, India
Dr. Parul Agarwal, Assistant Professor, Department of Computer Science, Jamia Hamdard University, New Delhi, India
ABSTRACT
Most clustering algorithms extract patterns that are of little interest. Such patterns contain data items drawn from widely different support levels, and items at different support levels have only weak associations between them, so the resulting patterns are spurious. The root of the problem is that existing algorithms have no knowledge of the co-occurrence relationships between data items; they cannot even incorporate such knowledge, because it would conflict with the goal of the algorithm. I propose a solution that extracts highly correlated, interesting patterns, called maximal intensive patterns, using a confidence measure. In this framework the mining operation is performed not directly on the data set but on the highly correlated intensive patterns. This strategy also minimizes the effect of cross-support patterns. A minimum threshold value is used to regulate the intensive patterns.
Keywords: Asymmetric data set, Co-occurrence relation, Intensive patterns, Minimum threshold, Spurious patterns.
I. INTRODUCTION
Data sets normally consist of asymmetric data items. For example, a departmental store may carry a wide range of commodities at the same price whose significance nevertheless varies from one commodity to another: some belong to the same support level while others belong to different support levels. Conventional clustering algorithms are ineffective at clustering such asymmetric data sets. At a low threshold value they produce a large number of spurious patterns, i.e. patterns of weakly correlated data items. This calls for a measure that works even at low support values and removes spurious patterns, leaving only intensive patterns.
For example, a shopping mall may stock a large range of commodities whose prices vary significantly from one commodity to the next, while some commodities share the same price level. In other words, a shopping mall offers a wide range of commodities belonging to different support levels, with only a few belonging to the same support level.
On such data sets, conventional clustering algorithms for mining associated patterns are not effective. Most clustering algorithms defined so far rely purely on a support-based pruning strategy, and this strategy proves ineffective on highly asymmetric data sets for the following two reasons:
1. If the minimum threshold is set very low, the proportion of spurious patterns among the extracted patterns increases. Such spurious patterns contain data items belonging to different support levels; they are called cross-support patterns, and the data items they contain are weakly correlated with most of the other items in the pattern. For example, {chips, shampoo} is a cross-support pattern: chips is a data item with high support while shampoo has quite low support compared to chips. Such weakly correlated data items make the patterns containing them spurious. Besides this, a low minimum threshold also increases the computational and memory requirements substantially.
2. If, on the other hand, the minimum threshold is set very high, many interesting patterns with support below the threshold value, such as {chips, cold drinks}, may be missed.
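As a quick illustration of the cross-support idea, the sketch below computes item supports over a small hypothetical transaction database (the item names and counts are invented for illustration, not taken from the paper) and flags a pair whose support ratio falls below a threshold:

```python
# Hypothetical transaction database for illustration only.
transactions = [
    {"chips", "bread"}, {"chips", "milk"}, {"chips"},
    {"chips", "shampoo"}, {"chips", "bread"}, {"bread", "milk"},
    {"chips", "milk"}, {"chips", "bread", "milk"},
]

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def is_cross_support(pair, db, tc):
    """A pair is cross-support w.r.t. threshold tc when the ratio of its
    items' supports falls below tc, i.e. the supports differ too widely."""
    a, b = (support({x}, db) for x in pair)
    return min(a, b) / max(a, b) < tc

# chips occurs in 7/8 transactions, shampoo in 1/8; their ratio 1/7 < 0.3
print(is_cross_support(("chips", "shampoo"), transactions, 0.3))  # True
```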
II. OBJECTIVE
My aim is to reduce spurious patterns in an asymmetric data set using association rules. The spurious patterns should be minimized so that we obtain optimized patterns on which decision-making can be performed.
So far, clustering algorithms cannot discover co-occurrence relationships among activities performed by a specific group or object, or between data items. They simply reduce the number of data items by removing items that contribute little power to classify instances.
Consequently, decision-making on clusters containing asymmetric data items becomes tedious.
My solution is to extract useful patterns at a low support value while removing spurious patterns during mining. This generates patterns with a strong co-occurrence relationship between data items.
III. APPROACH FOR MINIMIZING SPURIOUS PATTERNS
Clustering is the process of assigning objects to clusters such that the objects, or data items, within a cluster have maximum similarity with one another. Clustering thus yields patterns for discovering knowledge that can be used for further decision-making. If the clustering algorithm being used moves some data items from one cluster to another, it becomes very tedious to gather knowledge from the resulting clusters. Moreover, it is much easier to gather knowledge from well-understood patterns than to interpret the data items directly.
Intensive clustering is an approach to this problem. In this approach, patterns are preserved so that the data items belonging to a particular pattern always belong to a particular cluster. Intensive patterns [3] contain data items that have a high similarity, or co-occurrence relation, with each other. A high co-occurrence relation means that the presence of any data item of an intensive pattern strongly implies the presence of every other data item of that same pattern.
The intensive confidence (i-confidence) of an itemset D = {d1, d2, d3, …, dm}, denoted I-conf(D), is a measure that reflects the overall co-occurrence relation among the items of the itemset. It is defined as
I-conf(D) = min{ conf(d1 → d2, d3, …, dm), conf(d2 → d1, d3, …, dm), …, conf(dm → d1, d2, …, dm−1) },
where conf is the conventional definition of association rule confidence.
The scope of intensive confidence can be understood with the following example. Consider an itemset D = {desktop, printer, antivirus}, and assume supp({desktop}) = 0.1, supp({printer}) = 0.1, supp({antivirus}) = 0.06, and supp({desktop, printer, antivirus}) = 0.06, where supp denotes the support of an itemset. Then
conf(desktop → printer, antivirus) = supp({desktop, printer, antivirus}) / supp({desktop}) = 0.6
conf(printer → desktop, antivirus) = supp({desktop, printer, antivirus}) / supp({printer}) = 0.6
conf(antivirus → desktop, printer) = supp({desktop, printer, antivirus}) / supp({antivirus}) = 1
Hence I-conf(D) = min{conf(desktop → printer, antivirus), conf(printer → desktop, antivirus), conf(antivirus → desktop, printer)} = 0.6.
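The worked example above can be reproduced with a few lines of Python. The dictionary of supports simply restates the assumed values, expressed as counts out of a hypothetical 100 transactions to keep the arithmetic exact:

```python
# Supports from the desktop/printer/antivirus example, as counts out of 100.
supp = {
    frozenset({"desktop"}): 10,
    frozenset({"printer"}): 10,
    frozenset({"antivirus"}): 6,
    frozenset({"desktop", "printer", "antivirus"}): 6,
}

def i_confidence(itemset, supp):
    """I-conf(D) = min over items d of conf(d -> D - {d}),
    which reduces to supp(D) / max over items d of supp({d})."""
    joint = supp[frozenset(itemset)]
    return min(joint / supp[frozenset({d})] for d in itemset)

print(i_confidence({"desktop", "printer", "antivirus"}, supp))  # 0.6
```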
A candidate pattern drawn from the itemset D is an intensive pattern if and only if i-conf(D) ≥ Tc, where Tc is the minimum confidence threshold provided by the user. Further, if any intensive pattern has subsets that are themselves intensive patterns, those subset patterns should be removed from the set of all intensive patterns. This follows from the property of all-confidence [2].
Properties of I-confidence Measure
The I-confidence measure has four important properties,
namely the anti-monotone property, the cross-support
property, the strong co-occurrence relation property and the
all-confidence property.
1. Anti-Monotone
The I-confidence measure possesses the anti-monotone property: if the I-confidence of a pattern P is greater than the threshold value Tc, then the I-confidence of every subset of P is also greater than Tc. Equivalently, once a pattern falls below the threshold, all of its supersets fall below it too, so they can be pruned.
How does the I-confidence measure use the anti-monotone property? Consider the following example. Suppose supp({desktop}) = 0.2, supp({printer}) = 0.6 and supp({desktop, printer}) = 0.3, and the minimum i-confidence threshold is 0.6. The i-confidence of the candidate pattern {desktop, printer} is supp({desktop, printer}) / max{supp({desktop}), supp({printer})} = 0.3/0.6 = 0.5, which is less than the minimum i-confidence of 0.6. Thus the candidate pattern {desktop, printer} is not an intensive pattern. Moreover, all candidate patterns having {desktop, printer} as a subset are pruned; for example, {desktop, printer, TV} is not an intensive pattern. Note that the pruning here is done on the basis of the I-confidence threshold: if the threshold were reduced to 0.45, {desktop, printer} would be an intensive pattern.
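A minimal sketch of this pruning step, using the hypothetical supports from the example. The set `failed` collects patterns that miss the threshold, so that any superset is skipped outright without computing its support:

```python
def i_conf_pair(joint, supports):
    """I-confidence of a candidate: joint support over max item support."""
    return joint / max(supports)

threshold = 0.6
failed = set()

# {desktop, printer}: 0.3 / max(0.2, 0.6) = 0.5 < 0.6, so it fails.
if i_conf_pair(0.3, [0.2, 0.6]) < threshold:
    failed.add(frozenset({"desktop", "printer"}))

def prune(candidate, failed):
    """Anti-monotone pruning: a candidate containing a failed pattern as a
    subset cannot reach the threshold either, so it is skipped."""
    return any(f <= candidate for f in failed)

print(prune(frozenset({"desktop", "printer", "TV"}), failed))  # True
```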
2. Strong Co-occurrence Relation
The I-confidence measure also possesses the property of a strong co-occurrence relation: it ensures that all the data items contained in an intensive pattern have a strong co-occurrence relation, that is, a strong association with each other. Consider the following illustration. Suppose the I-confidence of an itemset D is 90%. Then whenever any data item of D occurs in a transaction, there is at least a 90% chance that the remaining data items of D also occur in the same transaction.

3. Cross-Support Patterns
I-confidence helps minimize cross-support patterns, which are precisely the spurious patterns. It is always very difficult to choose the right threshold value when mining a large collection of data. If we set a very high threshold, we risk missing many interesting patterns. Conversely, a very low threshold also makes it hard to find the interesting associated patterns, for two reasons: first, the computational and memory requirements of existing analysis algorithms increase considerably; second, the number of extracted patterns also increases substantially. I-confidence helps eliminate patterns consisting of uninteresting data items, and it involves no extra computational cost, since it depends only on the support values of the individual data items and their combinations.
Formally, let Tc be the given threshold and P = {p1, p2, …, pn} a pattern. P is a cross-support pattern with respect to Tc if there exist two data items p1 and p2 in P such that supp({p1}) / supp({p2}) < Tc, where 0 < Tc < 1.

4. All Confidence
Omiecinski proposed the concept of all-confidence [2] as an alternative to support. All-confidence represents the minimum confidence of all the association rules extracted from an itemset, and it possesses the desirable anti-monotone property. The all-confidence measure of an itemset P = {p1, p2, …, pm} is
all-conf(P) = min{ conf(A → B) | A, B ⊂ P, A ∪ B = P, A ∩ B = Ø } = supp({p1, p2, …, pm}) / max1≤k≤m{supp({pk})}.

IV. ALGORITHM FOR MINIMIZING SPURIOUS PATTERNS

Input:
I: item set stored in the database, containing the list of transactions with their items and corresponding supports
Min_threshold: minimum threshold value of i-confidence (provided by the user)

Variables:
Intensive: intensive pattern set
Max_Intensive: maximal intensive pattern set

Functions:
Intensive_Pattern_Evaluation(): evaluates the intensive pattern set
Max_Intensive_Pattern_Evaluation(): evaluates the maximal intensive pattern set

Method: extracting the maximal intensive patterns

Intensive = Intensive_Pattern_Evaluation(I, Min_threshold)
{
1. Access the support value of each element of I.
2. Create candidate patterns with items belonging to different levels of support.
3. Prune candidate patterns on the basis of the anti-monotone property.
4. Prune candidate patterns on the basis of the cross-support property.
5. Retain the intensive patterns (Intensive) with I-confidence ≥ Min_threshold.
}

Max_Intensive = Max_Intensive_Pattern_Evaluation(Intensive)
{
1. Find an intensive pattern X′ such that X′ is a subset of Y, with both X′, Y ∈ Intensive.
2. Intensive = Intensive − X′
3. Repeat steps 1–2 until no such X′ remains.
4. Set Max_Intensive = Intensive.
}
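The two functions above can be sketched in Python as follows. This is an illustrative brute-force version under stated assumptions, not the paper's implementation: plain candidate enumeration stands in for the anti-monotone and cross-support pruning described in the text, so it is suitable for small examples only.

```python
from itertools import combinations

def intensive_patterns(transactions, min_threshold):
    """Sketch of Intensive_Pattern_Evaluation: enumerate candidate
    itemsets and keep those whose i-confidence meets the threshold."""
    items = sorted(set().union(*transactions))
    n = len(transactions)

    def supp(itemset):
        # Fraction of transactions containing every item of the itemset.
        s = set(itemset)
        return sum(s <= t for t in transactions) / n

    result = []
    for k in range(2, len(items) + 1):
        for cand in combinations(items, k):
            joint = supp(cand)
            if joint == 0:
                continue
            # i-conf = joint support / max individual item support
            i_conf = joint / max(supp([d]) for d in cand)
            if i_conf >= min_threshold:
                result.append(frozenset(cand))
    return result

def maximal_intensive_patterns(intensive):
    """Sketch of Max_Intensive_Pattern_Evaluation: drop every pattern
    that is a proper subset of another intensive pattern."""
    return [x for x in intensive
            if not any(x < y for y in intensive)]
```

On a toy database such as [{"bread", "butter"}, {"bread", "butter", "milk"}, {"coffee"}] with threshold 0.5, the subset patterns {bread, butter}, {bread, milk} and {butter, milk} are found intensive but then removed, leaving only the maximal pattern {bread, butter, milk}.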
V. RESULT AND DISCUSSION

Performance comparison of maximal intensive patterns and hyperclique patterns

To compare the performance of maximal intensive patterns and hyperclique patterns, we take a data set of transactions. Both algorithms are run on the same set of transactions, with the same minimum threshold value in both cases. On execution, the number of interesting patterns generated as maximal intensive patterns is smaller than the number generated by the hyperclique pattern algorithm [4]. The reason is that the maximal intensive pattern algorithm uses the anti-monotone and all-confidence properties together with the strong co-occurrence and cross-support properties.

Consider Table 1, which lists various transactions and the items involved in each transaction.

Transaction_Id | Items
1  | Bread, Butter
2  | Butter
3  | Coffee, Butter, Bread
4  | Coffee, Milk
5  | Bread, Butter, Milk
6  | Coffee
7  | Bread, Cookie
8  | Coffee, Pickle
9  | Bread, Sugar
10 | Ketchup, Juice, Coffee, Egg
11 | Bread, Juice, Pickle
12 | Milk
13 | Milk, Coffee, Sugar
14 | Cookie, Chocolate
15 | Chocolate, Milk
16 | Biscuit, Milk
17 | Bread, Biscuit, Milk
18 | Milk, Coffee, Sugar
19 | Cookie, Chocolate
20 | Milk

Table 1. List of transactions with the items involved in each transaction.

Table 2 shows the interesting patterns generated as hyperclique patterns and as maximal intensive patterns. The minimum threshold confidence taken for each measure is 0.02. The number of interesting patterns generated as hyperclique patterns is 16, while the number generated as maximal intensive patterns is 11. Looking at interesting patterns number 6 and 13 generated as hyperclique patterns, both values are the same. The reason is that when generating hyperclique patterns the only requirement is that the h-confidence of a pattern exceed the minimum threshold confidence, even if the same candidate pattern occurs multiple times. In contrast, among the patterns generated as maximal intensive patterns, none has a subset present in the pattern list; this is due to the all-confidence property.

Interested Patterns | Hyperclique Patterns | Maximal Optimized Patterns
1  | {ketchup, juice, coffee, egg} | {ketchup, juice, coffee, egg}
2  | {bread, juice, pickle}        | {bread, juice, pickle}
3  | {chocolate, milk}             | {chocolate, milk}
4  | {bread, biscuit, milk}        | {bread, biscuit, milk}
5  | {milk, coffee, sugar}         | {milk, coffee, sugar}
6  | {cookie, chocolate}           | {cookie, chocolate}
7  | {coffee, butter, bread}       | {coffee, butter, bread}
8  | {bread, butter, milk}         | {bread, butter, milk}
9  | {bread, cookie}               | {bread, cookie}
10 | {coffee, pickle}              | {coffee, pickle}
11 | {milk, bread, sugar}          | {milk, bread, sugar}
12 | {bread, butter}               |
13 | {cookie, chocolate}           |
14 | {biscuit, milk}               |
15 | {coffee, milk}                |
16 | {milk, coffee, sugar}         |

Table 2. List of interesting patterns generated as hyperclique patterns and maximal intensive patterns.

VI. CONCLUSION

I conclude that this algorithm is able to reduce the spurious patterns and generate the maximal intensive patterns, which have a high co-occurrence relation among their items, at the threshold value given by the user, on the basis of the properties described above. On these intensive patterns, the clustering process becomes far more efficient than with existing mining algorithms.

REFERENCES

[1] R.A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, vol. 7, 1936, pp. 179–188.
[2] Edward R. Omiecinski, "Alternative Interest Measures for Mining Associations in Databases," IEEE Transactions on Knowledge and Data Engineering, 15(1):57–69, Jan/Feb 2003.
[3] Syed Zubair Ahmad Shah, "Preceding Clustering by Pattern Preservation," VSRD-IJCSIT, vol. 2 (8), 2012.
[4] H. Xiong, P. Tan, and V. Kumar, "Mining Hyperclique Patterns with Confidence Pruning," Technical Report 03-006, Computer Science, Univ. of Minnesota, Jan 2003.