International Journal of Computer Applications (0975 – 8887)
Volume 31 – No. 6, October 2011

Association Rule Mining based Decision Tree Induction for Efficient Detection of Cancerous Masses in Mammograms

S. Pitchumani Angayarkanni
Asst. Professor, Lady Doak College, Madurai

Dr. Nadira Banu Kamal
Director, Department of M.C.A, TBAK College, Kilakarai
ABSTRACT
Breast cancer is one of the most common forms of cancer in women. To reduce the death rate, early detection of cancerous regions in mammogram images is needed. Existing systems are neither sufficiently accurate nor fast. The system we propose applies data mining concepts for early, fast and accurate detection of cancerous masses in mammogram images. It consists of a preprocessing phase, a phase for segmenting normal, benign and malignant regions, a phase for mining the resulting database, and a final phase that organizes the resulting association-rule-based decision tree induction into a classification model. The experimental results show that the method performs well, reaching over 99% accuracy. The aim is to increase the level of diagnostic confidence and to provide an immediate second opinion for the physician.

Keywords
Preprocessing, Gabor Filter, Decision Tree Induction, SOM and ANN

1. INTRODUCTION
Breast cancer is one of the major causes of increased mortality among women. A tumor is a mass of tissue that grows outside the normal forces that regulate growth. Early diagnosis requires an accurate and reliable procedure that allows physicians to distinguish benign tumors from malignant ones without resorting to surgical biopsy. The prediction must classify each patient as either benign or malignant.

Many different algorithms have been proposed for automatic detection of breast cancer in mammograms. Features extracted from mammogram images can be used for detection of cancer. Many proposed algorithms did not analyze the texture component of the image in detail. The proposed algorithm performs classification based on the analysis of nine texture parameters; further, Knowledge Discovery in Databases (KDD), which includes data mining techniques, is incorporated to generate test cases for the two-layered ANN. The logic of a decision tree is very easy to discern. The proposed algorithm produces an accuracy of 99.99% with a minimum error rate of 0.001%. The main objective of this paper is to develop a CAD system for automatic detection of breast cancer through MRI [1]. The proposed system can provide a valuable outlook and improve the accuracy of early breast cancer detection. It consists of four stages:

1) Preprocessing and enhancement
2) Feature extraction, selection and classification
3) Segmentation of suspicious regions
4) Deriving the decision rule using the Apriori algorithm and formulating the decision tree

The objective is to classify a tumor as either benign or malignant based on cell descriptions gathered by microscopic examination. The classification performance of the C5.0 DT is evaluated and compared to that achieved by a radial basis function kernel support vector machine (RBF-SVM). The dataset has been partitioned in the ratio 70:30 into training and test subsets respectively. Experimental results show that the generalization of the C5.0 DT increases radically using boosting, winnowing and tree pruning. The C5.0 DT model together with association-rule-based clustering achieves a remarkable performance, with 99.95% classification accuracy on the training subset and 100% on the test subset.

2. PRE-PROCESSING
The mammogram images for this study are taken from MIAS, a UK research organization concerned with breast cancer investigation. As mammograms are difficult to interpret, preprocessing is necessary to improve the quality of the image and make the feature extraction phase easier and more reliable [2]. The tumor is surrounded by breast tissue that masks the calcifications and prevents accurate detection. A preprocessing step, usually noise reduction, is therefore used to improve image and calcification contrast.
In this work an efficient filter, referred to as the Gabor with low-pass filter, was applied to the image; it maintains calcifications while suppressing unimportant image features. Fig. 2 shows a representative output image of the filter for the image cluster in Fig. 1. Comparing the two images, we observe that the background mammographic structures are removed while the calcifications are preserved. This simplifies the subsequent tumor detection step. To diminish the effect of over-brightness and over-darkness in images, and at the same time bring out image features, we applied histogram equalization, a widely used technique that increases the contrast of the image. Removal of the pectoral muscle helps in predicting the tumor more accurately [15].
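The histogram equalization step described above can be sketched in a few lines of numpy; this is a minimal illustration, not the paper's implementation, and the function name is ours:

```python
import numpy as np

def equalize_histogram(img):
    """Histogram equalization for an 8-bit grayscale image: spread
    the intensity distribution so that over-bright and over-dark
    regions use the full dynamic range."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    # Map each grey level through the normalized cumulative histogram.
    lut = np.clip(cdf - cdf_min, 0, None) / (cdf[-1] - cdf_min) * 255
    lut = np.round(lut).astype(np.uint8)
    return lut[img]

# Low-contrast test image: values clustered in a narrow band.
img = np.random.randint(100, 120, size=(64, 64), dtype=np.uint8)
eq = equalize_histogram(img)
```

After equalization the narrow 100-119 band is stretched across the full 0-255 range.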
The Gabor filter is a linear filter used for edge detection. The frequency and orientation representations of the Gabor filter are similar to those of the human visual system, and it has been found to be particularly appropriate for texture representation and discrimination [3]. In the spatial domain, a 2D Gabor filter is a Gaussian kernel function modulated by a sinusoidal plane wave; its impulse response is defined by a harmonic function multiplied by a Gaussian function. Because of the multiplication-convolution property (convolution theorem), the Fourier transform of a Gabor filter's impulse response is the convolution of the Fourier transform of the harmonic function and the Fourier transform of the Gaussian function. The filter has a real and an imaginary component representing orthogonal directions.
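The impulse response just described, a Gaussian envelope multiplied by a complex sinusoid, can be sketched directly; parameter values are illustrative, not the paper's settings:

```python
import numpy as np

def gabor_kernel(size, sigma, theta, wavelength):
    """2-D Gabor kernel: Gaussian envelope modulated by a sinusoid.

    theta      -- orientation of the sinusoidal plane wave (radians)
    wavelength -- wavelength of the sinusoid (pixels)
    The real part responds to symmetric features, the imaginary part
    to antisymmetric ones (the two orthogonal components)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates to the filter orientation.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + y_t ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(1j * 2.0 * np.pi * x_t / wavelength)
    return envelope * carrier

kernel = gabor_kernel(size=15, sigma=3.0, theta=0.0, wavelength=8.0)
```

A filter bank is obtained by repeating this over several scales (sigma, wavelength) and orientations (theta), as the paper does with 3 scales and 4 orientations.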
Gabor wavelet filters smooth the image by blocking detail information. Mass detection aims to extract the edge of the tumor from the surrounding normal tissue and background. The PSNR, RMS, MSE, NSD and ENL values calculated for each of the 121 pairs of mammogram images clearly show that the Gabor wavelet filter, when applied to mammogram images, leads to the best image quality [4]. The orientation and scale can be changed to extract texture information; here 3 scales and 4 orientations were used.
The GLCM image is divided into 3×3 matrices and the texture features are calculated [3]. The texture features are: cluster prominence, energy, entropy, homogeneity, difference variance, difference entropy, information measure and normalized correlation.
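Several of the texture features listed above can be computed directly from a normalized gray-level co-occurrence matrix; a minimal sketch covering a subset of them (function name and feature subset are ours):

```python
import numpy as np

def glcm_features(p):
    """Energy, entropy, homogeneity and correlation computed from a
    gray-level co-occurrence matrix p (normalized to probabilities)."""
    p = p / p.sum()
    i, j = np.indices(p.shape)
    nz = p[p > 0]                      # avoid log2(0) in the entropy term
    mu_i, mu_j = (i * p).sum(), (j * p).sum()
    sd_i = np.sqrt((((i - mu_i) ** 2) * p).sum())
    sd_j = np.sqrt((((j - mu_j) ** 2) * p).sum())
    return {
        "energy": float((p ** 2).sum()),
        "entropy": float(-(nz * np.log2(nz)).sum()),
        "homogeneity": float((p / (1.0 + np.abs(i - j))).sum()),
        "correlation": float(((i - mu_i) * (j - mu_j) * p).sum() / (sd_i * sd_j)),
    }

# Uniform 4x4 co-occurrence matrix as a sanity check.
f = glcm_features(np.ones((4, 4)))
```

For the uniform matrix, energy is 1/16, entropy is 4 bits, and correlation is 0 (row and column indices are independent).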
3.1 Texture Feature Analysis (Self-Organizing Map)
A self-organizing map (SOM) is a type of artificial neural network that is trained using unsupervised learning to produce a two-dimensional representation of the training samples, called a map (Kohonen map). A SOM operates in two modes: training and mapping. Training builds the map using input samples (vector quantization); mapping automatically classifies a new input vector [4]. The map consists of neurons/nodes located on a regular map grid. The lattice of the grid can be either hexagonal or rectangular.
Figure 2: Training
The blue blob is the distribution of the training data, and the small white disc is the current training sample drawn from that distribution. The SOM Toolbox is designed to perform the training and mapping functions. We determine which texture feature is useful for detecting tumors in the mammograms [4]. By analyzing different cases of mammograms we find that only variable 8 (information measure) differs.
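The training mode described above, pulling the best-matching unit and its lattice neighbours toward each sample, can be sketched as follows; the grid size, epoch count and decay schedules are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=200, seed=0):
    """Minimal rectangular-lattice SOM: each node holds a weight
    vector; the best-matching unit (BMU) and its lattice neighbours
    are pulled toward each training sample."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h, w, data.shape[1]))
    coords = np.stack(np.mgrid[0:h, 0:w], axis=-1)   # lattice positions
    for t in range(epochs):
        lr = 0.5 * (1.0 - t / epochs)                # decaying learning rate
        radius = max(1.0, (max(h, w) / 2) * (1.0 - t / epochs))
        x = data[rng.integers(len(data))]
        # BMU: node whose weight vector is closest to the sample.
        bmu = np.unravel_index(
            np.argmin(((weights - x) ** 2).sum(axis=2)), (h, w))
        # Gaussian neighbourhood on the lattice, shrinking over time.
        d2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
        nb = np.exp(-d2 / (2.0 * radius ** 2))[:, :, None]
        weights += lr * nb * (x - weights)
    return weights

data = np.random.default_rng(1).random((100, 9))   # e.g. 9 texture features
som = train_som(data)
```

Each map node ends up as a prototype vector; mapping a new feature vector just means finding its BMU on the trained grid.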
Figure 1: Preprocessing and Image Enhancement
Table 1: Signal-to-noise ratio calculation

PSNR    RMS    NSD    ENL    MSE
87.65   2.97   4.55   89.89  8.83

3. TEXTURE ANALYSIS
Texture-based segmentation is implemented because when a person is affected by cancer the texture of the skin becomes smooth [3]. This segmentation method segments the calcification pattern and the other suspicious regions in the mammograms. Using the GLCM (gray-level co-occurrence matrix) technique we determine how often different combinations of brightness values occur in an image.
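Counting those co-occurrences for one pixel offset can be done in a few lines; a minimal sketch for a non-negative (dy, dx) offset, with the level count as an assumption:

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Count how often gray level i occurs at offset (dy, dx) from
    gray level j: the co-occurrence matrix described in the text.
    Assumes dx, dy >= 0 and img values already quantized to `levels`."""
    h, w = img.shape
    src = img[0:h - dy, 0:w - dx]    # reference pixels
    dst = img[dy:h, dx:w]            # pixels at the chosen offset
    g = np.zeros((levels, levels), dtype=np.int64)
    np.add.at(g, (src, dst), 1)      # accumulate pair counts
    return g

g = glcm(np.array([[0, 1], [2, 3]]), dx=1, dy=0, levels=4)
```

Normalizing the returned counts by their sum yields the probability matrix from which the texture features are computed.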
Figure 6: SOM-based visualization for benign case
Figure 3: Segmentation and GLCM texture approach
4. WATERSHED SEGMENTATION
- The detected region can be segmented using watershed segmentation [5].
- The watershed transform usually works well only with images of bubbles or metallographic pictures, but when combined with filtering techniques it works well here.
- The detected image is superimposed on the original image to obtain the output image.
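The superimposition step in the last point above can be sketched as a simple alpha blend of the detection mask onto the original image; the blending factor is an illustrative assumption:

```python
import numpy as np

def superimpose(original, mask, alpha=0.6):
    """Overlay a binary detection mask on a grayscale image,
    brightening masked pixels so the segmented region stands out."""
    out = original.astype(float).copy()
    # Blend masked pixels toward white by the chosen alpha.
    out[mask] = (1.0 - alpha) * out[mask] + alpha * 255.0
    return np.rint(out).astype(np.uint8)

img = np.full((4, 4), 100, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
overlay = superimpose(img, mask)
```

Pixels outside the mask are left untouched, so the anatomical context stays visible around the highlighted region.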
Figure 4:Texture Feature Extraction
Figure 7: Segmentation of LEFT image – benign case
5. ASSOCIATION RULE MINING USING THE APRIORI ALGORITHM

Figure 5: Excel sheet representation of texture features

Association rule mining is mainly used to search for interesting relationships among items in a given data set.

Algorithm: Apriori. Find frequent itemsets using an iterative, level-wise approach based on candidate generation.
Input: Excel file of texture parameters D; minimum support threshold min_sup.
Output: L, the frequent itemsets in D.
Method:
Pass 1:
  1. Generate the candidate itemsets in C1.
  2. Save the frequent itemsets in L1.

Pass k:
  1. Generate the candidate itemsets in Ck from the frequent itemsets in Lk-1:
     a. Join Lk-1 p with Lk-1 q, as follows:
        insert into Ck
        select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
        from Lk-1 p, Lk-1 q
        where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
     b. Generate all (k-1)-subsets of the candidate itemsets in Ck.
     c. Prune every candidate itemset from Ck for which some (k-1)-subset is not in the frequent itemset Lk-1.
  2. Scan the transaction database to determine the support for each candidate itemset in Ck.
  3. Save the frequent itemsets in Lk.
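The level-wise passes above, join, subset-based prune, then one counting scan per level, can be sketched compactly; this is an illustrative implementation with an absolute support count, not the system's code:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent-itemset mining: generate candidates C_k
    from L_{k-1}, prune candidates with an infrequent (k-1)-subset,
    then count support for the survivors in one scan."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # Pass 1: frequent 1-itemsets.
    L = {frozenset([i]) for i in items
         if sum(i in t for t in transactions) >= min_sup}
    result = set(L)
    k = 2
    while L:
        # Join step: unions of (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in result
                             for s in combinations(c, k - 1))}
        # Scan step: count support for the surviving candidates.
        L = {c for c in candidates
             if sum(c <= t for t in transactions) >= min_sup}
        result |= L
        k += 1
    return result

txns = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
freq = apriori(txns, min_sup=3)
```

On the toy transactions, all 1- and 2-itemsets are frequent at support 3, while {A, B, C} (support 2) is pruned by the counting scan.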
Figure 8: Attribute Selection and Classification

Minimum support: 0.25 (25 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 15

Generated sets of large itemsets:
Size of set of large itemsets L(1): 20
Size of set of large itemsets L(2): 92
Size of set of large itemsets L(3): 19

Best rules found:
1. Node3=Value2 Node10=Value1 31 ==> Node2=Value1 30 <conf:(0.97)> lift:(1.67) lev:(0.12) [12] conv:(6.51)
2. Node3=Value2 Node9=Value1 30 ==> Node2=Value1 29 <conf:(0.97)> lift:(1.67) lev:(0.12) [11] conv:(6.3)
3. Node4=Value2 Node10=Value2 27 ==> Node7=Value1 26 <conf:(0.96)> lift:(1.5) lev:(0.09) [8] conv:(4.86)
4. class=Value1 Node3=Value2 26 ==> Node2=Value1 25 <conf:(0.96)> lift:(1.66) lev:(0.1) [9] conv:(5.46)
5. Node3=Value2 Node8=Value1 33 ==> Node2=Value1 31 <conf:(0.94)> lift:(1.62) lev:(0.12) [11] conv:(4.62)
6. Node2=Value1 Node10=Value1 32 ==> Node3=Value2 30 <conf:(0.94)> lift:(1.74) lev:(0.13) [12] conv:(4.91)
7. Node3=Value2 Node5=Value2 31 ==> Node2=Value1 29 <conf:(0.94)> lift:(1.61) lev:(0.11) [11] conv:(4.34)
8. Node2=Value2 Node8=Value2 31 ==> Node3=Value1 29 <conf:(0.94)> lift:(2.03) lev:(0.15) [14] conv:(5.58)
9. Node3=Value2 Node4=Value2 29 ==> Node2=Value1 27 <conf:(0.93)> lift:(1.61) lev:(0.1) [10] conv:(4.06)
10. Node3=Value1 Node6=Value2 29 ==> Node8=Value2 27 <conf:(0.93)> lift:(1.69) lev:(0.11) [11] conv:(4.35)

Figure 9: Apriori rule generation

Cross-validation (10-fold):

Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
Benign         0.706    0.498    0.586      0.706   0.641      0.59
Malignant      0.502    0.294    0.63       0.502   0.559      0.59
Weighted Avg.  0.604    0.396    0.608      0.604   0.6        0.59

=== Confusion Matrix ===
   a   b   <-- classified as
 204  85 | a = Benign
 144 145 | b = Malignant
6. DECISION TREE INDUCTION METHOD: THE J48 ALGORITHM
The J48 decision tree classifier follows a simple algorithm. To classify a new item, it first needs to create a decision tree based on the attribute values of the available training data. Whenever it encounters a training set, it identifies the attribute that discriminates the various instances most clearly [6, 7, 8]. This feature, the one that tells us most about the data instances so that we can classify them best, is said to have the highest information gain. Among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category all have the same value for the target variable, then we terminate that branch and assign it the target value we have obtained.
For the other cases, we look for another attribute that gives the highest information gain, and we continue in this manner until we either get a clear decision about which combination of attributes gives a particular target value, or we run out of attributes. If we run out of attributes, or cannot get an unambiguous result from the available information, we assign the branch the target value possessed by the majority of the items under it. Once we have the decision tree, we follow the order of attribute selection obtained for the tree; by checking each attribute and its values against those in the decision tree model, we can predict the target value of a new instance. The above description becomes clearer with an example, so let us look at an example of J48 decision tree classification.
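The information-gain criterion described above can be sketched directly (J48 actually normalizes this by split information to get the gain ratio; plain gain is shown for brevity, and the toy data are ours):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """Entropy reduction from splitting on a discrete feature: the
    quantity used to pick the most discriminating attribute."""
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# Toy data: the first feature perfectly separates the two classes.
feature = np.array([0, 0, 1, 1])
labels = np.array(["benign", "benign", "malignant", "malignant"])
```

A perfectly separating feature yields gain 1 bit here, while a feature independent of the class yields gain 0, so the tree builder would branch on the former.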
Specificity = t-neg/neg, sensitivity = t-pos/pos, precision = t-pos/(t-pos + f-pos), and accuracy = sensitivity × (pos/(pos + neg)) + specificity × (neg/(pos + neg)). From the above algorithm the accuracy was found to be 99.9%. The very good accuracy obtained by the proposed method in a minimum period of time shows the efficiency of the algorithm.
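Applying these formulas to the 10-fold cross-validation confusion matrix reported earlier, with benign taken as the positive class (our assumption), reproduces the table's figures:

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, precision and accuracy from a
    confusion matrix, following the formulas in the text."""
    pos, neg = tp + fn, fp + tn
    sens = tp / pos
    spec = tn / neg
    return {
        "sensitivity": sens,
        "specificity": spec,
        "precision": tp / (tp + fp),
        # Prevalence-weighted combination of the two rates.
        "accuracy": sens * pos / (pos + neg) + spec * neg / (pos + neg),
    }

# Cross-validation confusion matrix from the document
# (204/85 benign row, 144/145 malignant row).
m = diagnostic_metrics(tp=204, fn=85, fp=144, tn=145)
```

The computed values round to 0.706 sensitivity, 0.502 specificity, 0.586 precision and 0.604 accuracy, matching the reported cross-validation table.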
=== Classifier model (full training set) ===

J48 pruned tree
---------------

Node7 = Value1
| Node5 = Value1
| | Node8 = Value1: Value1 (15.0/3.0)
| | Node8 = Value2
| | | Node2 = Value1: Value1 (7.0/2.0)
| | | Node2 = Value2: Value2 (12.0/2.0)
| Node5 = Value2
| | Node8 = Value1: Value2 (15.0)
| | Node8 = Value2
| | | Node4 = Value1
| | | | class = Value1: Value2 (3.0/1.0)
| | | | class = Value2: Value1 (6.0/1.0)
| | | Node4 = Value2: Value2 (6.0/1.0)
Node7 = Value2
| Node4 = Value1
| | Node5 = Value1
| | | Node9 = Value1: Value2 (4.0)
| | | Node9 = Value2: Value1 (4.0/1.0)
| | Node5 = Value2
| | | class = Value1
| | | | Node2 = Value1: Value2 (2.0)
| | | | Node2 = Value2: Value1 (3.0/1.0)
| | | class = Value2: Value1 (11.0/1.0)
| Node4 = Value2: Value1 (12.0/1.0)

Number of Leaves: 13
Size of the tree: 25
Time taken to build model: 0.03 seconds

7. CONCLUSION
The rules generated using the decision tree induction method clearly show that benign and malignant cases are classified in just 0.03 seconds, and the accuracy was 99.9%.

8. REFERENCES
[1] Serhat Ozekes, A. Yilmaz Camurcu: Computer Aided Detection of Mammographic Masses on CAD Digital Mammograms. Istanbul Ticaret Üniversitesi Fen Bilimleri (2005), pp. 87-97.
[2] Ruchaneewan Susomboon, Daniela Stan Raicu, Jacob Furst: Pixel-Based Texture Classification of Tissues in Computed Tomography. Literature review (2007).
[3] A. K. Jain, F. Farrokhnia: "Unsupervised Texture Segmentation Using Gabor Filters," Pattern Recognition, pp. 1167-1186, 1991.
[4] M. Unser: "Texture Classification and Segmentation Using Wavelet Frames," IEEE Trans. Image Proc., pp. 1549-1560, 1995.
[5] Mammographic Image Analysis Society: Mini Mammography Database, 2003. www.wiau.man.ac.uk/services/MIAS/MIASmini
[6] Beucher, S., Lantuejoul, C.: "Use of Watersheds in Contour Detection." In Proc. International Workshop on Image Processing, Real-Time Edge and Motion Detection/Estimation, Rennes, September 1979.
[7] Laila Elfangary, Walid Adly Atteya: Mining Medical Databases Using a Proposed Incremental Association Rules Algorithm (PIA). Second International Conference on the Digital Society, IEEE Computer Society (2008).
[8] Pudi, V., Haritsa, J.: On the Optimality of Association Rule Mining Algorithms. Technical Report TR-2001-01, DSL, Indian Institute of Science (2001).
[9] Höppner, F., Klawonn, F., Kruse, R., Runkler, T.: Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. Wiley, New York (1999).
[10] Delgado, M., Marin, N., Sanchez, D., Vila, M. A.: Fuzzy Association Rules: General Model and Applications. IEEE Transactions on Fuzzy Systems 11, 214-225 (2003).