Discovering Association Rules and Classification for Biological Data using Data Mining Methods
J. Tsiligaridis, M. Pagela
Math & Computer Science Department, Heritage University, Toppenish, USA
Abstract - This project presents a set of algorithms, and their efficiency, for discovering association rules and classification using Genetic Algorithms (GA), Decision Trees (DT), and Neural Networks (NN). A GA generates a large set of possible solutions to a given problem. Apriori is the basic algorithm for association rules. A GA is developed for finding the frequent conditions. The proposed GA, based on an encoding and generation construction method (GA_EN), can mine association rules with improved performance using an appropriate generation of the rules. In the GA classification algorithm (GA_CL), rules are classified using predefined constraints. A Decision Tree algorithm (DTA) is created from data using probabilities; the goal is to create, on demand, an accurate decision tree (DT). Based on the rules produced by GA_CL, a Neural Network classifier (NNC_GA) is created. For learning, a backpropagation algorithm is used to adjust the weights. Simulation results are provided.
Keywords: Genetic Algorithm, Decision Trees, Neural
Network, Data Mining
1  Introduction
GA is based on the biological principles of natural selection. The key idea of Apriori is to find the frequent conditions constructed by the possible values of attributes in any data set. The GA finds all the possible associations between conditions constructed by attribute values under given constraints (e.g., support and confidence). Association rule mining, integrated with classification, creates associative classification [1],[2],[3]. The objective of Classification Association Rules (CAR) is to generate a set of class association rules that satisfy the minimum support (msup) and minimum confidence (mconf) constraints, and to build a classifier from the class association rule set. The GA_EN, based on an encoding method and the construction of generations, has advantages over Apriori because it includes GA mining techniques that improve performance. The GA_CL discovers rules with at least the minimum class support (mcsupp) and less than the maximum classification error (maxclerror).
There are two types of DT [3]: the complete and the incomplete. In the incomplete one there are subtrees where repetition and replication occur. A DT represents a procedure for classifying objects based on their attributes; the rule set can be created by traversing the tree. Decision trees are used to find predictive rules combining numeric and categorical attributes, and the splitting process is recursively repeated until the end of the data. The DTA creates a DT using the criterion of maximum probability, in two phases. In order to avoid repetition or replication of subtrees, a new criterion, the criterion of elimination of a branch (CEB), is applied; it eliminates redundant branches. The pruning of decision rules is also examined, with its consequences on accuracy.
A NN is a collection of units connected in some pattern to allow communication between the units. The backpropagation algorithm is in wide use because it learns by adapting its weights using the generalized delta rule, which attempts to minimize the squared error between the desired network output and the actual network output. The NNC_GA is a classifier that uses the NN methodology and takes as input the rules created by GA_CL.
The article is organized as follows. Sections 2 and 3 introduce definitions for association rules and CAR. Section 4 deals with GA_EN and GA_CL. Section 5 includes the description of the DTA. Section 6 contains the NN and the NNC_GA. Simulation results appear in Section 7.
2  Association rules
Let D = {T1, T2, ..., Tn} be a set of n transactions and let I be a set of items, I = {i1, i2, ..., im}. Each transaction [5] is a set of items, i.e. Ti ⊆ I. An association rule is an implication of the form X ⇒ Y, where X, Y ⊂ I and X ∩ Y = ∅; X is called the antecedent and Y the consequent of the rule. In general, a set of items, such as X or Y, is called an itemset. For an itemset X ⊆ I, support(X) is defined as the fraction of transactions Ti ∈ D such that X ⊆ Ti; that is, P(X) = support(X). The support of a rule X ⇒ Y is defined as support(X ⇒ Y) = P(X ∪ Y). An association rule X ⇒ Y has a measure of reliability called confidence(X ⇒ Y), defined as P(Y|X) = P(X ∪ Y)/P(X) = support(X ∪ Y)/support(X). For the CAR it is supposed that data samples are given with n attributes (A1, A2, ..., An) and that each sample has a class label C (C ∈ {c1, c2, ..., cm}). A pattern P (P = {a1, a2, ..., ak}) is a set of attribute values for different attributes (1 ≤ k ≤ n). For a rule R: P ⇒ c, the number of data samples matching pattern P with class label c is called the support of rule R. The ratio of the number of samples matching pattern P and having class label c to the total number of samples matching pattern P is called the confidence of R.
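These definitions translate directly into counting. The following minimal Python sketch computes the support and confidence of a rule X ⇒ Y; the items and transactions are illustrative only, not the paper's data:

# Support and confidence by direct counting (illustrative transactions).
def support(itemset, transactions):
    """Fraction of transactions Ti containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """support(X u Y) / support(X), an estimate of P(Y | X)."""
    return support(X | Y, transactions) / support(X, transactions)

D = [{"i1", "i2", "i3"}, {"i1", "i2"}, {"i2", "i3"}, {"i1", "i3"}]
X, Y = {"i1"}, {"i2"}
print(support(X | Y, D))    # support(X => Y) = 2/4 = 0.5
print(confidence(X, Y, D))  # 0.5 / 0.75 = 0.667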
3  CAR
The CAR [3] framework contains methods for associative classification; CBA is one of these methods. It uses an iterative approach to frequent itemset mining which is similar to Apriori. CBA constructs the classifier by ordering the rules according to a precedence based on rule confidence and support. More details appear in [3]. The precedence ordering is sketched below.
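As a minimal sketch of this precedence ordering, rules can be sorted by confidence and then support, both descending; the rule records here are hypothetical stand-ins, not CBA's actual data structures:

# CBA-style precedence: sort rules by confidence, then support, descending.
rules = [
    {"rule": "a1 -> c1", "conf": 0.90, "supp": 0.10},
    {"rule": "a2 -> c2", "conf": 0.90, "supp": 0.25},
    {"rule": "a3 -> c1", "conf": 0.80, "supp": 0.40},
]
ordered = sorted(rules, key=lambda r: (r["conf"], r["supp"]), reverse=True)
for r in ordered:
    print(r["rule"], r["conf"], r["supp"])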
4  GA_EN, GA_CL
A GA is an iterative procedure which works with a population of individuals represented by finite strings of characters or binary values. A traditional method usually searches a very large space for the solution using a large number of iterations, whereas a GA narrows the search by adopting a fitness function. Each iteration produces an evolved population of genomes, i.e. a new generation. There are three operations: selection, crossover, and mutation. For GA_EN, the number of conditions determines the construction of the chromosome and the population size. The next generations can be created from either one or two previous generations using the 'or' operation for crossover of chromosomes. This requires fewer memory operations and produces fewer offspring chromosomes. The msup and mconf are the constraints set by the user. The GA_CL, working with the next generations under these conditions and the crossover of the chromosomes, and considering the predefined classification error, can discover the classification rules. The conditions msup and maxclerror reduce the number of undesired rules in the mining process. For the GA_CL, rules are extracted only from chromosomes whose classification error does not surpass maxclerror. The next generation initially includes the chromosomes with support greater than msup (gr_mins_chrom); the 'or' operation among gr_mins_chrom creates the chromosomes of the next generation. Chromosomes with class error less than maxclerror can provide classification rules. There are no classification rules for an attribute if there are no chromosomes with an acceptable classification error. The GA mines the rules of interest as defined by msup and mconf; with loose constraints the number of discovered association rules becomes extremely large. A sketch of the generation construction follows.
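The following Python sketch illustrates the generation construction under stated assumptions: chromosomes are bit vectors over single conditions, generation 0 keeps those whose support reaches msup, and the next generation is formed by the 'or' operation on pairs of survivors. The items, transactions, and threshold are illustrative, not the authors' implementation:

# GA_EN-style generation construction (illustrative sketch).
from itertools import combinations

ITEMS = ["i1", "i2", "i3", "i4"]          # the condition set (assumed)
D = [{"i1", "i2"}, {"i1", "i2", "i3"}, {"i2", "i3"}, {"i1", "i4"}]
MSUP = 0.5                                # minimum support threshold

def items_of(chrom):
    return {it for it, bit in zip(ITEMS, chrom) if bit}

def supp(chrom):
    s = items_of(chrom)
    return sum(1 for t in D if s <= t) / len(D)

def unit(i):                              # chromosome with exactly one condition set
    return tuple(1 if j == i else 0 for j in range(len(ITEMS)))

# Generation 0: single-condition chromosomes with support >= msup.
gen = [unit(i) for i in range(len(ITEMS)) if supp(unit(i)) >= MSUP]
# Next generation: 'or' crossover of pairs of surviving chromosomes,
# again filtered by support (the gr_mins_chrom of the text above).
nxt = {tuple(a | b for a, b in zip(c1, c2)) for c1, c2 in combinations(gen, 2)}
nxt = [c for c in nxt if supp(c) >= MSUP]
print([sorted(items_of(c)) for c in nxt])  # surviving condition sets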
5  DTA
In decision trees [4],[5] the input data set has one attribute called the class C, which takes a value from K discrete values 1, ..., K, and a set of numeric and categorical attributes A1, ..., Ap. The goal is to predict C given A1, ..., Ap. Decision tree algorithms automatically split numeric attributes Ai into two ranges and split categorical attributes Aj into two subsets at each node; the records are split based on an attribute test that optimizes a certain criterion (greedy strategy). A multi-way split is used, with as many partitions as there are distinct values. Nodes with a homogeneous class distribution are preferred. In the incomplete DT there are subtrees where repetition and replication occur: repetition is when an attribute is repeatedly tested along a given branch of the tree (e.g. age), and replication is when duplicate subtrees exist within the tree, such as the subtree headed by the node "credit_rating". The DTA uses the criterion of maximum probability and creates the DT in the following phases:
Phase 1: Discover the root (from all the attributes):
P(E_Ai) = Σ_Ai Σ_Ci p(Ai) * p(Ci | Ai)
where Ai are the attributes of the tuples and Ci the classes (attribute test), and
MP = max(P(E_Ai))   // maximum attribute test criterion
Phase 2: Split the data into smaller subsets, so that each partition is as pure as possible, using the same formula. The measure of node impurity is MP. Continue until the end of the attributes. A node stops expanding when all of its records belong to the same class.
DTA: Input: training data; Output: decision tree
1. define the root node (phase 1)
2. discover the branches from the root
3. while (not end of attributes) { split on the next attribute (phase 2) }
Example (10 tuples; three are shown):
Weather  Parents  Money  Decision
Sunny    yes      rich   cinema
Sunny    no       rich   Tennis
Windy    no       rich   cinema
...
Parents: P(E) = (5/10)*(5/10) + (5/10)*(1/10) = 0.3 (phase 1)
Weather: P(E) = (3/10)*(1/10) + (4/10)*(3/10) + (3/10)*(2/10) = 0.21
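The exact term selection in P(E) is ambiguous as extracted, so the sketch below uses a closely related stand-in criterion, named plainly: score(A) = Σ_v p(v) * max_c p(c|v), the expected accuracy of splitting on A and predicting the majority class in each partition. The tiny data set is an illustrative stand-in for the table above, not the paper's data:

# Phase 1 root selection with a stand-in purity score (see lead-in).
from collections import Counter, defaultdict

rows = [                                   # (weather, parents, money) -> decision
    ("sunny", "yes", "rich", "cinema"),
    ("sunny", "no",  "rich", "tennis"),
    ("windy", "no",  "rich", "tennis"),
    ("rainy", "yes", "poor", "cinema"),
]
ATTRS = ["weather", "parents", "money"]

def score(attr_idx):
    parts = defaultdict(Counter)           # class counts per attribute value
    for r in rows:
        parts[r[attr_idx]][r[-1]] += 1
    n = len(rows)
    return sum((sum(c.values()) / n)                          # p(v)
               * (c.most_common(1)[0][1] / sum(c.values()))   # max_c p(c|v)
               for c in parts.values())

root = max(range(len(ATTRS)), key=score)   # attribute maximizing the score
print(ATTRS[root], score(root))            # -> parents 1.0 on this toy data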
The CEB is used to eliminate redundant branches; it is a prepruning approach [3]. For an attribute attr1 with value v1, if the tuples of another attribute attr2 take all their values in relation with v1 (of attr1), then attr2 is called a "don't care" attribute. The criterion uses the quantity
P_CEB = P(A1 = a1, ..., A|A| = a|A| | C = cj) = ∏_{i=1..|A|} p(Ai = ai | C = cj).
Criterion of Elimination of a Branch (CEB): if P_CEB ≠ 0 between two attributes (A1, A2), then A2 is a "don't care" attribute and the branch is eliminated. The purpose of CEB is to develop the DT so as to avoid repetition and replication.
Theorem: The CEB criterion can determine the existence of a small DT with the best accuracy (100%, i.e. complete), avoiding repetitions and replications. Proof sketch: whenever the CEB criterion holds, the branch that would introduce a repetition or replication is eliminated.
Example:
Age     Has_job  Own_house  Credit_rating  Class
Young   false    false      fair           No
Young   false    false      good           No
Young   true     false      good           Yes
Young   true     true       fair           Yes
Middle  true     true       good           Yes
Old     false    true       excellent      Yes
...
In the DT with own_house as the root, it is not necessary to extend further with the attribute age for all of its possible partitions ("young", "middle", "old"), since
P_CEB = P(age = young, own_house = "y" | C = "yes") = P(age = young | C = "yes") * P(own_house = "y" | C = "yes") = 2/5 * 6/9 ≠ 0.
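A minimal sketch of the P_CEB computation by counting follows; the tuples are patterned on the loan example, but the counts (and therefore the probabilities) are illustrative assumptions:

# P_CEB as a product of conditional probabilities (illustrative counts).
def p_ceb(assignment, cls, rows):
    """Product over (attr, value) pairs of p(attr = value | class = cls)."""
    in_class = [r for r in rows if r["class"] == cls]
    prob = 1.0
    for attr, value in assignment.items():
        prob *= sum(1 for r in in_class if r[attr] == value) / len(in_class)
    return prob

rows = [
    {"age": "young",  "own_house": "y", "class": "yes"},
    {"age": "young",  "own_house": "y", "class": "yes"},
    {"age": "middle", "own_house": "y", "class": "yes"},
    {"age": "old",    "own_house": "y", "class": "yes"},
    {"age": "old",    "own_house": "n", "class": "yes"},
]
p = p_ceb({"age": "young", "own_house": "y"}, "yes", rows)
print(p)   # (2/5) * (4/5) = 0.32 != 0, so the age branch is eliminated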
DTs provide fewer rules than the CAR. A DT can find rules with very low support (such as medical rules). CAR requires discrete attributes, whereas DT learning handles both discrete and continuous attributes. ID3 is a method for discovering DTs that uses information gain as its attribute selection measure [3].
6  Neural Network
For supervised learning it is necessary to have data with a known classification, sufficient data to represent all aspects of the problem being solved, and sufficient data to allow for testing. The backpropagation algorithm learns by adapting its weights using the generalized delta rule, which attempts to minimize the squared error between the desired network output and the actual network output. During learning it continually cycles through the data until the error is at a low enough value for the task to be considered solved. There are two activation functions, one from the input layer to the hidden layer and the other from the hidden layer to the output layer; both are logistic functions.
Example: for tennis participation with tuples "outlook, temperature, humidity, windy, class". Distributed coding has been used. The logistic function is applied to the hidden layer and the output layer. A set of five input processing units and five hidden-layer units has been used; the weight matrix is of dimension 5x5, with bias input z0 = 1. The parameters are: m, the number of input vectors of length 5 (including the bias input); the array x[i] of input values; and d[i], the desired output for each input x[i]. Two activation (logistic) functions, fh (input layer to hidden layer) and fo (hidden layer to output layer), are assumed. The test of convergence checks whether the magnitude of the output error function is below some given threshold. A sketch of this setup follows.
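The following sketch implements this setup under stated assumptions: logistic activations on both layers, the generalized delta rule, a 5-input/5-hidden-unit network with bias input z0 = 1, and a convergence test on the squared output error. The two training pairs and the learning rate are illustrative, not the tennis data:

# Backpropagation with logistic activations on both layers (sketch).
import numpy as np

rng = np.random.default_rng(0)
f = lambda z: 1.0 / (1.0 + np.exp(-z))          # logistic activation

X = np.array([[1, 0, 1, 0, 1],                  # 5 inputs; x[0] is the bias z0 = 1
              [1, 1, 0, 1, 0]], dtype=float)
d = np.array([[1.0], [0.0]])                    # desired outputs d[i]

W1 = rng.normal(scale=0.5, size=(5, 5))         # input -> hidden (5 hidden units)
W2 = rng.normal(scale=0.5, size=(5, 1))         # hidden -> output
eta = 0.5                                       # learning rate n

for epoch in range(10000):
    h = f(X @ W1)                               # hidden-layer activations (fh)
    y = f(h @ W2)                               # network output (fo)
    err = d - y
    if np.mean(err ** 2) < 1e-3:                # convergence: error below threshold
        break
    # Generalized delta rule: propagate the error back through both layers.
    delta_o = err * y * (1 - y)
    delta_h = (delta_o @ W2.T) * h * (1 - h)
    W2 += eta * h.T @ delta_o
    W1 += eta * X.T @ delta_h
print(epoch, y.ravel())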
Some disadvantages of a NN: (a) it does not explain how the solution is derived, it needs more examples, and it needs appropriate examples that match the real-world situation; (b) training a NN with a large number of high-dimensional training examples can be a lengthy and computationally expensive process. An advantage of a DT over a NN is that it requires much less training time. If we compare decision trees and neural networks we can see that their advantages and drawbacks are almost complementary: humans easily understand the knowledge representation of decision trees, which is not the case for neural networks; decision trees have trouble dealing with noise in the training data, which again is not the case for neural networks; decision trees learn very fast while neural networks learn relatively slowly; and so on.
The classification rules created by GA_CL are used as input for the construction of the NNC_GA. The NNC_GA architecture has three layers. First, the input layer has input nodes, each representing one characteristic of the rules (an attribute value appearing in a rule is called a characteristic). Second, in the hidden layer each node is connected with the characteristics of one rule; the number of rules equals the number of hidden nodes. Third, the output layer has the class nodes (i.e. c1, ..., cn). For learning in the NNC_GA, the input vector is created in binary format, with '1' for an activated input node and '0' for a non-activated one. Gradient descent proceeds in infinitesimal steps along the direction established by the gradient, and the learning rate n is selected large enough to make the network converge quickly without oscillations. The wiring is sketched below.
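The wiring of such a classifier can be sketched as follows; the rules are hypothetical stand-ins for GA_CL output, and the 0/1 weights only encode the topology (training would still proceed by backpropagation as in the previous sketch):

# NNC_GA topology: input node per characteristic, hidden node per rule,
# output node per class, binary input vectors (illustrative sketch).
import numpy as np

rules = [  # (characteristics, class), as GA_CL might emit them (assumed)
    (("outlook=sunny", "humidity=high"), "no"),
    (("outlook=overcast",), "yes"),
    (("windy=false", "humidity=normal"), "yes"),
]
feats = sorted({c for chars, _ in rules for c in chars})   # input layer
classes = sorted({cls for _, cls in rules})                # output layer

W_ih = np.zeros((len(feats), len(rules)))                  # input -> hidden
W_ho = np.zeros((len(rules), len(classes)))                # hidden -> output
for j, (chars, cls) in enumerate(rules):
    for c in chars:
        W_ih[feats.index(c), j] = 1.0    # connect each characteristic to its rule
    W_ho[j, classes.index(cls)] = 1.0    # connect each rule to its class

x = np.array([1.0 if f in {"outlook=sunny", "humidity=high"} else 0.0
              for f in feats])           # binary input vector
print((x @ W_ih) @ W_ho)                 # activation per class before training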
7  Simulation
There are three scenarios for the simulation.
1. Apriori vs GA_EN: The GA_EN, using its particular way of producing the next generations, has a better running time than Apriori for mining the rules of the "heart" data set.
Fig. 1 GA_EN vs Apriori (running time)
2. CBA vs NNC_GA: Using the "hepatitis" data set, the NNC_GA achieves better accuracy because it follows the NN method for learning and adjusting the weights.
Fig. 2 CBA vs NNC_GA (accuracy)
3. ID3 vs DTA: Using the "iris" data set, the DTA results are slightly better than those of ID3.
Fig. 3 ID3 vs DTA (running time)
8  Conclusions
In this project a new framework of algorithms is developed, based on the ability of Genetic Algorithms to discover association rules and classification on biological data. Certain advantages apply depending on each algorithm's particular way of operation. Future work will focus on NN.
9  References
[1] B. Bringmann, S. Nijssen and A. Zimmermann, "Pattern-based classification: a unifying perspective", in Proceedings of the ECML-PKDD Workshop on From Local Patterns to Global Models, pp. 36-50, 2009.
[2] W. Li, J. Han and J. Pei, "CMAR: Accurate and efficient classification based on multiple class-association rules", in Proceedings of the IEEE International Conference on Data Mining (ICDM '01), pp. 369-376, Nov. 2001.
[3] J. Han, M. Kamber and J. Pei, "Data Mining: Concepts and Techniques", 3rd ed., Morgan Kaufmann, 2012.
[4] U. Fayyad and G. Piatetsky-Shapiro, "From Data Mining to Knowledge Discovery", MIT Press, 1995.
[5] C. Ordonez, "Comparing Association Rules and Decision Trees for Disease Prediction", HIKM 2006, Nov. 11, 2006, Virginia.
[6] M. Kantardzic, "Data Mining: Concepts, Models, Methods, and Algorithms", IEEE Press, 2003.