ISSN 2348–2370
Vol.07,Issue.12,
August-2015,
Pages:2199-2204
www.ijatir.org
New Approach for Classification Based Association Rule Mining
K. CH. PRAVALLIKA1, CH. N. S. PRIYANKA2, B. V. BALAJI3
1 PG Scholar, Dept of MCA, Sri Vasavi Engineering College, Tadepalligudem, West Godavari, AP, India, E-mail: [email protected].
2 PG Scholar, Dept of MCA, Sri Vasavi Engineering College, Tadepalligudem, West Godavari, AP, India, E-mail: [email protected].
3 Assistant Professor, Dept of MCA, Sri Vasavi Engineering College, Tadepalligudem, West Godavari, AP, India, E-mail: [email protected].
Abstract: Recent studies in data mining have proposed associative classification, which, according to several reports, achieves higher classification accuracy than C4.5. In this paper, we propose to integrate two mining techniques. The integration is done by focusing on mining a special subset of association rules, called class association rules (CARs). This paper presents a new approach for building a classifier, based on an extended association rule mining technique in the perspective of classification. The distinctiveness of this approach is threefold: first, applying the information gain measure to the generation of candidate item sets; second, integrating the process of frequent item set generation with the process of rule generation; third, integrating strategies for avoiding rule redundancy and conflicts into the mining process. The corresponding mining algorithms proposed, namely CPAR (Classification based on Predictive Association Rules) and CARM (Classification Association Rule Mining) using decision tree information gain, produce a classifier with satisfactory classification accuracy compared with other classifiers; CARM can filter out many candidate item sets during the generation process.
Keywords: Data Mining, Association Rule Mining, Frequent Item Set, Electronic Commerce, Information Gain, Decision Tree, FOIL, PRM, CPAR, CARM.
I. INTRODUCTION
Classification and association rule mining are two major
areas of research and applications nowadays in knowledge
discovery. An association rule (AR) is of the form X → Y,
where X and Y are sets of data items. The goal of
association rule mining is to generate certain associative
relationships between data items with the degrees of
confidence and support greater than
user specified
thresholds. The Apriori algorithm is a well known
algorithm in this field. A typical association rule
application is market basket analysis, describing, for
example, the customers’ buying behavior such as “Fruit =>
Meat” meaning that customers who bought fruit also
tended to buy meat, which reflects association between
occurrences of data items. Classification is used to find a
logical description, namely a classifier, which results from
training datasets with predetermined targets, and could group
unlabeled datasets. Existing research efforts have proposed a
number of approaches and systems. A noteworthy type of approach is classification based on association rules, aimed
at building a classifier by discovering a small set of rules to
form a so-called associative classifier. Classification Rule
Mining (CRM) is a well known Data Mining technique for the
extraction of hidden Classification Rules (CRs) from a given
database that is coupled with a set of pre-defined class labels,
the objective being to build a classifier to classify “unseen”
data records. One recent approach to CRM is to employ
Association Rule Mining (ARM) methods to identify the
desired CRs, i.e. Classification Association Rule Mining
(CARM).
CARM aims to mine a set of Classification Association
Rules (CARs) from a class-transaction database, where a CAR
describes an implicative co-occurring relationship between a
set of binary-valued data attributes (items) and a pre-defined
class, expressed in the form of an “antecedent ⇒ class” rule. CARM seems to offer greater accuracy, in many cases, than other classification methods such as decision trees, rule induction and probabilistic approaches. In the past
decade, a number of CARM approaches have been developed
that include: TFPC (Total From Partial Classification) , CBA
(Classification Based Associations), CMAR (Classification
based on Multiple Association Rules), CPAR (Classification
based on Predictive Association Rules) etc. Although these
CARM approaches employ different ARM techniques to
extract CARs from a given class-transaction database, a
similar set of CARs is always generated, based on a pair of
specific values for both support and confidence thresholds.
Regardless of which particular CARM method is utilized, a classifier is usually presented as an ordered list of CARs, based on a selected rule-ordering strategy. Hence, the key to producing a more accurate CARM classifier is to develop a better (more rational) rule-ordering approach.
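For illustration, the support and confidence measures underlying a rule such as “Fruit => Meat” can be computed directly from a transaction set. The following is an illustrative Python sketch with toy data, not part of the proposed system:

```python
# Toy illustration: support and confidence of a candidate rule X -> Y
# over a small in-memory transaction database.
transactions = [
    {"fruit", "meat", "milk"},
    {"fruit", "meat"},
    {"fruit", "bread"},
    {"meat", "bread"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """confidence(X -> Y) = support(X u Y) / support(X)."""
    return support(x | y, db) / support(x, db)

# "Fruit => Meat": both items occur together in 2 of 4 transactions.
print(support({"fruit", "meat"}, transactions))   # 0.5
print(confidence({"fruit"}, {"meat"}, transactions))
```

A rule is kept only when both values exceed the user-specified thresholds.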
II. RELATED WORK
The data analysis algorithms (or data mining algorithms, as they are more popularly known nowadays) can be divided into three major categories based on the nature of their information extraction: clustering (also called segmentation or unsupervised learning), predictive modeling (also called classification or supervised learning), and frequent pattern extraction.

Clustering is the major class of data mining algorithms. The goal of the search process used by these algorithms is to identify all sets of similar examples in the data, in some optimal fashion. One of the oldest algorithms for clustering is k-means. The two disadvantages of this algorithm are the initialization problem and the requirement that clusters be linearly separable. To deal with the initialization problem, global k-means has been proposed, an incremental-deterministic algorithm that employs k-means as a local search procedure. The kernel k-means algorithm avoids the limitation of linearly separable clusters: it maps the data points from the input space to a higher-dimensional feature space through a nonlinear transformation Ø, and k-means is then applied in the feature space. Global kernel k-means is an algorithm which maps data points from the input space to a higher-dimensional feature space through the use of a kernel function and optimizes the clustering error in the feature space by locating a near-optimal solution. Its deterministic nature makes it independent of the initialization problem and able to identify nonlinearly separable clusters in the input space, so the global kernel k-means algorithm combines the advantages of both global k-means and kernel k-means. Another approach for clustering data is hierarchical clustering based on the Hungarian method; the computational complexity of that algorithm is O(n²).

The important classification algorithms are decision trees, the Naive Bayes classifier and statistical methods. They use heuristic and greedy search techniques to find the subsets of rules that form the classifiers. C4.5 and CART are the most well-known decision tree algorithms.

The final class of data mining algorithms is frequent pattern extraction. For large databases, the Apriori algorithm generates all significant association rules between items in the database. The algorithm makes multiple passes over the database. The frontier set for a pass consists of those item sets that are extended during the pass. In each pass, the support for candidate item sets, which are derived from the tuples in the database and the item sets contained in the frontier set, is measured. Initially the frontier set consists of only one element, the empty set. At the end of a pass, the support for a candidate item set is compared with the minimum support; at the same time it is determined whether the item set should be added to the frontier set for the next pass. The algorithm terminates when the frontier set is empty. After finding all the item sets that satisfy the minimum support threshold, association rules are generated from those item sets.

Bing Liu et al. proposed the Classification Based on Associations (CBA) algorithm, which discovers Class Association Rules (CARs). It consists of two parts: a rule generator, called CBA-RG, which is based on the Apriori algorithm for finding the association rules, and a classifier builder, called CBA-CB. The Apriori algorithm works on item sets (sets of items), while CBA-RG works on rule items, each consisting of a condset (a set of items) and a class. The classifier that CBA builds from Class Association Rules is more accurate than the C4.5 algorithm, but CBA needs to rank the rules before it can create a classifier; ranking depends on the support and confidence of each rule.

The goal of classification is to build a model from classified objects in order to classify previously unseen objects as accurately as possible. There are many classification approaches for extracting knowledge from data, such as divide-and-conquer, separate-and-conquer, covering and statistical approaches. The divide-and-conquer approach starts by selecting an attribute as a root node, and then makes a branch for each possible value of that attribute. This splits the training instances into subsets, one for each possible value of the attribute. The same process is repeated until all instances that fall in one branch have the same classification or the remaining instances cannot be split any further. The separate-and-conquer approach, on the other hand, builds up the rules in greedy fashion (one by one). After a rule is found, all instances covered by the rule are deleted. The same process is repeated until the best rule found has a large error rate. Statistical approaches such as Naive Bayes use probabilistic measures, i.e. likelihood, to classify test objects. Finally, the covering approach selects each of the available classes in turn and looks for a way of covering most of the training objects of that class in order to come up with maximum-accuracy rules. Numerous algorithms have been derived from these approaches, such as decision trees, PART, RIPPER and Prism. While single-label classification, which assigns each rule in the classifier to the most obvious label, has been widely studied, little work has been done on multi-label classification. Most of the previous research to date on multi-label classification is related to text categorization. In this paper, only traditional classification algorithms that generate rules with a single class will be considered.

III. SYSTEM DESIGN AND IMPLEMENTATION
The overall system design of Classification Based Association Rule Mining is described in Fig.1 below. The system is divided into 4 modules:
 Data Source/Data Base Module
 Classification Module
 Association Rule Generation Module
 Performance Analysis Module

Data Source/Data Base Module: This module maintains data in the form of data sets. Here we have a data set of several attribute values in the form of transaction records, and a data set that contains the schema of the data set. This schema is useful for classifying the data.

Copyright @ 2015 IJATIR. All rights reserved.
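The multi-pass frontier-set procedure described above can be sketched as follows. This is an illustrative in-memory Python sketch under simplifying assumptions (transactions held as sets, `min_support` given as a fraction), not the paper's implementation:

```python
# Sketch of multi-pass frequent-itemset generation: each pass extends
# the frontier itemsets by one item into candidates, counts candidate
# support over the database, and keeps candidates meeting min_support;
# survivors seed the frontier for the next pass.
def frequent_itemsets(db, min_support):
    items = sorted({i for t in db for i in t})
    frontier = [frozenset()]          # the first pass extends the empty set
    frequent = {}
    while frontier:                   # terminate when the frontier is empty
        candidates = {f | {i} for f in frontier for i in items if i not in f}
        counts = {c: sum(c <= t for t in db) for c in candidates}
        passed = {c: n for c, n in counts.items() if n / len(db) >= min_support}
        frequent.update(passed)
        frontier = list(passed)       # survivors are extended next pass
    return frequent

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(sorted(tuple(sorted(s)) for s in frequent_itemsets(db, 0.5)))
```

Association rules are then generated only from the itemsets returned, which is what keeps the rule-generation step tractable.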
International Journal of Advanced Technology and Innovative Research
Volume.07, IssueNo.12, August-2015, Pages: 2199-2204
Classification Module: This module reads the data from the data set, performs the classification operation and generates classes.

Association Rule Generation Module: This module uses the classes, performs association rule mining, generates frequent item sets and generates association rules.

Performance Analysis Module: This module computes the time complexity, space complexity, accuracy and number of association rules for each execution, based on the number of classes, for different algorithms such as CARM using Information Gain, CARM using Random Gain, FOIL and PRM. It then compares their values and identifies the efficient algorithms.

Fig.1. System Architecture.

Example: An example of how CPAR generates rules. After the first literal (A1 = 2) is selected, two literals (A2 = 1) and (A3 = 1) are found to have similar gain, which is higher than that of the other literals. Literal (A2 = 1) is selected first and a rule is generated along this direction. After that, the rule (A1 = 2; A3 = 1) is taken as the current rule, as shown in Fig.2. Again two literals with similar gain, (A4 = 2) and (A2 = 1), are selected and a rule is generated along each of the two directions. In this way, three rules are generated:
(A1 = 2; A2 = 1; A4 = 1).
(A1 = 2; A3 = 1; A4 = 2; A2 = 3).
(A1 = 2; A3 = 1; A2 = 1).

Fig.2. Some rules generated by CPAR.

IV. PROPOSED WORK (CPAR and CARM Using Decision Tree Info Gain)
A. CPAR
CPAR (Classification based on Predictive Association Rules) combines the advantages of both associative classification and traditional rule-based classification. Instead of generating a large number of candidate rules as in associative classification, CPAR adopts a greedy algorithm to generate rules directly from training data. Moreover, CPAR generates and tests more rules than traditional rule-based classifiers to avoid missing important rules. To avoid overfitting, CPAR uses expected accuracy to evaluate each rule and uses the best k rules in prediction. CPAR stands in the middle between exhaustive and greedy algorithms and combines the advantages of both. CPAR builds rules by adding literals one by one, which is similar to PRM. However, instead of ignoring all literals except the best one, CPAR keeps all close-to-the-best literals during the rule-building process. By doing so, CPAR can select more than one literal at the same time and build several rules simultaneously. The following is a detailed description of the rule generation algorithm of CPAR. Suppose at a certain step in the process of building a rule, after finding the best literal p, another literal q is found whose gain is similar to that of p (e.g., differing by at most 1%). Besides continuing to build the rule by appending p to the current rule r, q is also appended to r to create a new rule r′, which is pushed into the queue. Each time a new rule is to be built, the queue is first checked. If it is not empty, a rule is extracted from it and taken as the current rule. This forms the depth-first search in rule generation.
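A heavily simplified sketch of this queue-based rule growth is given below in Python, using FOIL gain as the literal-quality measure (as in PRM/CPAR). The data, literal representation and thresholds are hypothetical illustrations, not the paper's code:

```python
# Sketch of CPAR-style rule generation: grow a rule by appending the
# best literal by FOIL gain, but when a runner-up literal's gain is
# within GAIN_SIMILARITY of the best, queue the alternative partial
# rule and expand it later (depth-first).
import math
from collections import deque

GAIN_SIMILARITY = 0.01   # treat gains within 1% of the best as "similar"

def covers(rule, example):
    """A rule is a list of (attribute, value) literals; all must match."""
    return all(example.get(a) == v for a, v in rule)

def foil_gain(rule, literal, pos, neg):
    """FOIL gain of appending `literal` to `rule`."""
    p0 = sum(covers(rule, e) for e in pos)
    n0 = sum(covers(rule, e) for e in neg)
    p1 = sum(covers(rule + [literal], e) for e in pos)
    n1 = sum(covers(rule + [literal], e) for e in neg)
    if p1 == 0:
        return float("-inf")
    return p1 * (math.log(p1 / (p1 + n1)) - math.log(p0 / (p0 + n0)))

def generate_rules(pos, neg, literals, min_gain=1e-6):
    rules, queue = [], deque([[]])
    while queue:
        rule = queue.pop()                       # depth-first expansion
        gains = sorted(((foil_gain(rule, l, pos, neg), l)
                        for l in literals if l not in rule), reverse=True)
        if not gains or gains[0][0] <= min_gain:
            if rule:
                rules.append(rule)               # no literal helps: emit rule
            continue
        best_gain, best = gains[0]
        for g, l in gains[1:]:                   # queue close-to-best paths
            if best_gain - g <= GAIN_SIMILARITY * abs(best_gain):
                queue.append(rule + [l])
        queue.append(rule + [best])
    return rules
```

On a small training set where two literals tie in gain, the sketch produces one rule per direction, mirroring the branching behaviour described above.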
B. CARM
Classification Association Rule Mining (CARM) is a
recent Classification Rule Mining approach that builds an
Association Rule Mining based classifier using Classification
Association Rules (CARs). Regardless of which particular
CARM algorithm is used, a similar set of CARs is always
generated from data, and a Classifier is usually presented as
an ordered CAR list, based on a selected rule ordering
strategy.
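For illustration, classifying with such an ordered CAR list amounts to firing the first rule whose antecedent matches the record, with a default class for unmatched records. The rules and class names below are hypothetical:

```python
# Sketch: classification with an ordered list of CARs. Rules are
# (antecedent item set, class) pairs, already sorted by the chosen
# rule-ordering strategy (e.g. confidence, then support).
def classify(record, ordered_cars, default_class):
    for antecedent, cls in ordered_cars:
        if antecedent <= record:        # all antecedent items present
            return cls
    return default_class                # no rule fires

cars = [({"fruit", "meat"}, "family"), ({"fruit"}, "single")]
print(classify({"fruit", "meat", "milk"}, cars, "unknown"))  # family
print(classify({"bread"}, cars, "unknown"))                  # unknown
```

Because only the first matching rule fires, the ordering strategy directly determines the classifier's accuracy, which is the point made above.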
Fig.3. Decision Tree.
Decision Tree: A decision tree is a structure that includes a
root node, branches, and leaf nodes. Each internal node
denotes a test on an attribute, each branch denotes the
outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node. The decision tree in Fig.3 is for the concept buy computer; it indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute and each leaf node represents a class.
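The test attribute at each internal node is typically chosen with an entropy-based measure such as information gain (formalised below). A minimal Python sketch on toy "buy computer"-style data (the attribute and labels are illustrative assumptions):

```python
# Minimal sketch of entropy-based information gain for picking the
# test attribute at a decision tree node.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    labels = [label for _, label in examples]
    gain = entropy(labels)
    for v in {ex[attr] for ex, _ in examples}:
        subset = [label for ex, label in examples if ex[attr] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

# Toy data: (attribute dict, class label).
data = [({"age": "young"}, "no"), ({"age": "young"}, "no"),
        ({"age": "mid"}, "yes"), ({"age": "senior"}, "yes")]
print(info_gain(data, "age"))  # 1.0: "age" separates the classes perfectly
```

The attribute with the highest gain is placed at the node, exactly the selection step the proposed CARM variant reuses for candidate item sets.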
General definition of Info Gain: In general terms, the expected information gain is the change in information entropy from a prior state to a state that takes some information as given:

IG(T, a) = H(T) - H(T \mid a)   (1)

Information Gain: We now return to the problem of trying to determine the best attribute to choose for a particular node in a tree. The following measure calculates a numerical value for a given attribute, A, with respect to a set of examples, S. Note that the values of attribute A range over a set of possibilities which we call Values(A), and that, for a particular value v from that set, we write S_v for the set of examples which have value v for attribute A. The information gain of attribute A, relative to a collection of examples S, is calculated as:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)   (2)

The information gain of an attribute can be seen as the expected reduction in entropy caused by knowing the value of attribute A.

V. EXPERIMENTAL RESULTS
We have conducted an extensive performance study to evaluate the accuracy and efficiency of CPAR and CARM using decision tree info gain, and to compare them with FOIL and PRM. We validated our approach by means of a large set of experiments addressing the following issues:
 Performance of the classification and association rules, in terms of execution time and memory usage.
 Performance of the classification and association rules, in terms of classes and accuracy.
 Performance of the classification and association rules, in terms of classes and number of rules generated.
 Scalability of the approach.

All the experiments were performed on a Pentium IV with 2GB main memory, running Microsoft Windows XP. The following Fig.4 shows the comparison of time complexity between the algorithms FOIL, PRM, CPAR and CARM using a line chart.

Fig.4. Time Complexity comparison of algorithms.

The following Fig.5 shows the comparison of space complexity between the algorithms FOIL, PRM, CPAR and CARM using a line chart.

Fig.5. Space Complexity comparison of algorithms.

The following Fig.6 shows the comparison of accuracy between the algorithms FOIL, PRM, CPAR and CARM using a line chart.

Fig.6. Accuracy comparison of algorithms.
The following Fig.7 shows the comparison of the number of rules generated between the algorithms FOIL, PRM, CPAR and CARM using a line chart.

Fig.7. No. of Rules comparison of algorithms.
The following Fig.8 shows the comparison of time complexity between the algorithms FOIL, PRM, CPAR and CARM using a bar chart.

Fig.8. Time Complexity comparison of algorithms.

The following Fig.9 shows the comparison of space complexity between the algorithms FOIL, PRM, CPAR and CARM using a bar chart.

Fig.9. Space Complexity comparison of algorithms.

The following Fig.10 shows the comparison of accuracy between the algorithms FOIL, PRM, CPAR and CARM using a bar chart.

Fig.10. Accuracy comparison of algorithms.

The following Fig.11 shows the comparison of the number of rules between the algorithms FOIL, PRM, CPAR and CARM using a bar chart.

Fig.11. No. of Rules comparison of algorithms.

VI. CONCLUSIONS AND FUTURE WORK
In this paper, we examined two major challenges in associative classification: (1) efficiency in handling the huge number of mined association rules, and (2) effectiveness in predicting new class labels with high classification accuracy. We proposed two novel associative classification methods: CARM using info gain and CPAR (Classification based on Predictive Association Rules). Our experiments show that both CARM using info gain and CPAR achieve better efficiency than FOIL and PRM.

VII. REFERENCES
[1] Devasri Rai, A.S. Thoke and Keshri Verma, "Enhancement of Associative Rule based FOIL and PRM Algorithms", Proc., 2012.
[2] Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA; P. Clark and T. Niblett, "The CN2 induction algorithm", Machine Learning, 3:261-283, 1989.
[3] M.J.A. Berry and G.S. Linoff, Data Mining Techniques (memory based reasoning), 2004.
[4] R. Andrews, J. Diederich and A. Tickle, "A survey and critique of techniques for extracting rules from trained artificial neural networks", Knowledge-Based Systems, pp. 373-389, 1995.
[5] P. Langley, W. Iba and K. Thompson, "An analysis of Bayesian classifiers", in National Conference on Artificial Intelligence (1992), pp. 223-228.
[6] Liu B., Hsu W. and Ma Y., "Integrating Classification and Association Rule Mining", in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD-98), pages 80-86, New York, USA, August 1998. The AAAI Press.
[7] J.R. Quinlan and R.M. Cameron-Jones, "FOIL: A midterm report", in Proc. 1993 European Conf. Machine Learning, pp. 3-20, Vienna, Austria, 1993.
[8] X. Yin and J. Han, "CPAR: Classification Based on Predictive Association Rules", Proceedings of the SIAM International Conference on Data Mining, 2003, pp. 331-335.
[9] Prafulla Gupta and Durga Toshniwal, "Performance Comparison of Rule Based Classification Algorithms", International Journal of Computer Science & Informatics, Volume 1, Issue II, 2011, pp. 37-42.
[10] W. Li, J. Han and J. Pei, "CMAR: Accurate and efficient classification based on multiple class-association rules", in ICDM'01, pp. 369-376, San Jose, CA, Nov. 2001.
[11] Thabtah F., Cowling P. and Peng Y.H. (2004), "MMAC: A New Multi-Class, Multi-Label Associative Classification Approach", Fourth IEEE International Conference on Data Mining (ICDM'04).
[12] "Classification Based on Predictive Association Rule", available online: http://www.csc.liv.ac.uk/~frans/KDD/Software/FOIL_PRM_CPAR/cpar.html.
[13] Blake C.L. and Merz C.J. (1998), UCI repository of machine learning databases, www.ics.uci.edu/~mlearn/MLRepository.html.
[14] Coenen F. (2003), The LUCS-KDD discretised/normalised ARM and CARM data library, http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN/, Department of Computer Science, The University of Liverpool, UK.
[15] Zuoliang Chen and Guoqing Chen, "Building an Associative Classifier Based on Fuzzy Association Rules", International Journal of Computational Intelligence Systems, Vol. 1, No. 3 (August 2008), 262-273.
[16] Xin Lu, Barbara Di Eugenio and Stellan Ohlsson, "Learning Tutorial Rules Using Classification Based on Associations", Computer Science (M/C 152), University of Illinois at Chicago, 851 S Morgan St., Chicago IL, 60607, USA. Email: [email protected].
[17] "Alaa Al Deen" Mustafa Nofal and Sulieman Bani-Ahmad, "Classification based on association-rule mining techniques: a general survey and empirical comparative evaluation", Ubiquitous Computing and Communication Journal, Volume 5, Number 3, www.ubicc.org.
[18] A. Zemirline, L. Lecornu, B. Solaiman and A. Ech-Cherif, "An Efficient Association Rule Mining Algorithm for Classification", L. Rutkowski et al. (Eds.): ICAISC 2008, LNAI 5097, pp. 717-728, 2008. © Springer-Verlag Berlin Heidelberg 2008.
[19] Bing Liu, Wynne Hsu and Yiming Ma, "Integrating Classification and Association Rule Mining", appeared in KDD-98, New York, Aug 27-31, 1998.

Author's Profile:
K.CH. Pravallika is pursuing the Master of Computer Applications in Sri Vasavi Engineering College, Tadepalligudem, affiliated to JNTUK.

CH.N.S. Priyanka is studying the Master of Computer Applications in Sri Vasavi Engineering College, Tadepalligudem, affiliated to JNTUK.

Mr. B.V. Balaji completed his M.Tech (Computer Science Engineering) from JNTUK University, Kakinada. Currently, he is working as Assistant Professor in the MCA department at Sri Vasavi Engineering College, Tadepalligudem.