International Journal of Computer Applications (0975 – 8887), Volume 31, No. 6, October 2011

Association Rule Mining based Decision Tree Induction for Efficient Detection of Cancerous Masses in Mammograms

S. Pitchumani Angayarkanni, Asst. Professor, Lady Doak College, Madurai
Dr. Nadira Banu Kamal, Director, Department of M.C.A, TBAK College, Kilakarai

ABSTRACT
Breast cancer is one of the most common forms of cancer in women. To reduce the death rate, early detection of cancerous regions in mammogram images is needed. Existing systems are neither sufficiently accurate nor fast. The system we propose applies data mining concepts for early, fast and accurate detection of cancerous masses in mammogram images. It consists of four stages: 1) preprocessing and enhancement; 2) feature extraction, selection and classification; 3) segmentation of suspicious regions; 4) derivation of the decision rule using the Apriori algorithm and formulation of the decision tree. The objective is to classify a tumor as either benign or malignant based on cell descriptions gathered by microscopic examination. The classification performance of the C5.0 decision tree (DT) is evaluated and compared with that achieved by a radial basis function kernel support vector machine (RBF-SVM), and with SOM and ANN accuracy, for earlier breast cancer detection. The dataset was partitioned in the ratio 70:30 into training and test subsets respectively. Experimental results show that the generalization of the C5.0 DT is increased radically by boosting, winnowing and tree pruning. The C5.0 DT model together with association rule based clustering achieves a remarkable performance, with 99.95% classification accuracy on the training subset and 100% on the test subset.
The system we propose consists of a preprocessing phase, a phase for segmenting normal, benign and malignant regions, a phase for mining the resulting transactional database, and a final phase that organizes the resulting association rule based decision tree induction into a classification model. The experimental results show that the method performs well, reaching over 99% accuracy. The aim is to increase the level of diagnostic confidence and to provide an immediate second opinion for the physician.

Keywords: Preprocessing, Gabor Filter, Decision Tree Induction

1. INTRODUCTION
Breast cancer is one of the major causes of the increase in mortality among women. A tumor is a mass of tissue that grows outside the normal forces that regulate growth. Early diagnosis requires an accurate and reliable procedure that allows physicians to distinguish benign from malignant tumors without resorting to surgical biopsy. The prediction must classify each patient as either benign or malignant. Many different algorithms have been proposed for automatic detection of breast cancer in mammograms. Features extracted from mammogram images can be used for detection of cancer, but many proposed algorithms do not analyze the texture component of the image in detail. The proposed algorithm performs classification based on the analysis of nine texture parameters; further, Knowledge Discovery in Databases (KDD), which includes data mining techniques, is incorporated to generate test cases for a two-layered ANN. The logic of a decision tree is easy to discern. The proposed algorithm produces an accuracy of 99.99% with a minimum error rate of 0.001%. The main objective of this paper is to develop a CAD system for automatic detection of breast cancer through MRI [1]. The proposed system can provide a valuable outlook for the physician. The mammogram images for this study are taken from MIAS, a UK research organization engaged in breast cancer investigation.
2. PRE-PROCESSING
As mammograms are difficult to interpret, preprocessing is necessary to improve image quality and make the subsequent feature extraction phase more reliable [2]. A tumor is typically surrounded by breast tissue that masks the calcifications and prevents accurate detection. A preprocessing step, usually noise reduction, is therefore used to improve image and calcification contrast. In this work an efficient filter, the Gabor filter combined with a low pass filter, was applied to each image; it preserves calcifications while suppressing unimportant image features. Fig. 2 shows a representative output image of the filter for the image cluster in Fig. 1. Comparing the two images, we observe that the background mammographic structures are removed while calcifications are preserved, which simplifies the subsequent tumor detection step. To diminish the effect of over-brightness and over-darkness in images and at the same time preserve image features, we applied histogram equalization, a widely used technique that increases the contrast of the image. Removal of the pectoral muscle helps in predicting the tumor more accurately [15].

The Gabor filter is a linear filter used for edge detection. Frequency and orientation representations of the Gabor filter are similar to those of the human visual system, and it has been found to be particularly appropriate for texture representation and discrimination [3]. In the spatial domain, a 2D Gabor filter is a Gaussian kernel function modulated by a sinusoidal plane wave: its impulse response is defined by a harmonic function multiplied by a Gaussian function.
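As a concrete sketch of this definition, the real part of a 2D Gabor kernel can be generated with NumPy; the size, sigma, wavelength and orientation values below are illustrative assumptions, not the parameters used in the paper:

```python
import numpy as np

def gabor_kernel(size=21, sigma=4.0, theta=0.0, lam=10.0, psi=0.0):
    """Real part of a 2D Gabor kernel: a Gaussian envelope
    modulated by a sinusoidal plane wave of wavelength lam,
    rotated by orientation theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate the coordinate frame by theta
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + y_t ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_t / lam + psi)
    return envelope * carrier

kernel = gabor_kernel()
```

Convolving the mammogram with a bank of such kernels at several scales and orientations (the paper later uses 3 scales and 4 orientations) yields the filtered images used for texture analysis.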
Because of the multiplication-convolution property (convolution theorem), the Fourier transform of a Gabor filter's impulse response is the convolution of the Fourier transform of the harmonic function and the Fourier transform of the Gaussian function. The filter has a real and an imaginary component representing orthogonal directions. Gabor wavelet filters smooth the image by blocking detail information. Mass detection aims to extract the edge of the tumor from the surrounding normal tissue and background. The PSNR, RMS, MSE, NSD and ENL values calculated for each of the 121 pairs of mammogram images clearly show that the Gabor wavelet filter, when applied to a mammogram image, leads to the best image quality [4]. The orientation and scale can be changed in this program to extract texture information; here 3 scales and 4 orientations were used. The GLCM image is divided into 3x3 matrices and the texture features are calculated [3]. The texture features are: cluster prominence, energy, entropy, homogeneity, difference variance, difference entropy, information measure, normalized correlation.

3.1 Texture Feature Analysis (Self-Organizing Map)
A self-organizing map (SOM) is a type of artificial neural network that is trained using unsupervised learning to produce a two-dimensional representation of the training samples, called a map (Kohonen map). A SOM operates in two modes: training and mapping. Training builds the map using input samples (vector quantization); mapping automatically classifies a new input vector [4]. The map consists of neurons/nodes located on a regular map grid, whose lattice can be either hexagonal or rectangular.

Figure 2: Training. The blue blob is the distribution of the training data, and the small white disc is the current training sample drawn from that distribution.
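The GLCM texture features listed above can be sketched in NumPy; the 8-level quantization, the horizontal pixel offset, and the random test image are illustrative assumptions, and only three of the features (energy, entropy, homogeneity) are shown:

```python
import numpy as np

def glcm(image, levels=8):
    """Gray-level co-occurrence matrix for horizontally adjacent
    pixel pairs (offset (0, 1)), normalized to probabilities."""
    m = np.zeros((levels, levels))
    for a, b in zip(image[:, :-1].ravel(), image[:, 1:].ravel()):
        m[a, b] += 1
    return m / m.sum()

def texture_features(p):
    """Energy, entropy and homogeneity of a normalized
    co-occurrence matrix p."""
    i, j = np.indices(p.shape)
    energy = np.sum(p ** 2)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    homogeneity = np.sum(p / (1.0 + (i - j) ** 2))
    return energy, entropy, homogeneity

img = np.random.randint(0, 8, size=(32, 32))  # stand-in for a mammogram patch
p = glcm(img)
energy, entropy, homogeneity = texture_features(p)
```

In practice the same quantities can be obtained from library routines such as scikit-image's GLCM utilities; the hand-rolled version is shown only to make the definitions explicit.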
The SOM Toolbox is designed to perform the training and mapping functions. We use it to find which texture feature is useful for detecting tumors in the mammograms [4]. By analyzing different cases of mammograms we find that only variable 8 (information measure) differs.

Figure 1: Preprocessing and image enhancement

Table 1: Signal-to-noise ratio calculation
PSNR   RMS   NSD   ENL    MSE
87.65  2.97  4.55  89.89  8.83

3. TEXTURE ANALYSIS
Texture based segmentation is implemented because when a person is affected by cancer the texture of the skin becomes smooth [3]. This segmentation method segments the calcification pattern and the other suspicious regions in the mammograms. Using the GLCM (gray level co-occurrence matrix) technique, we measure how often different combinations of brightness values occur in an image.

Figure 6: SOM based visualization for a benign case
Figure 3: Segmentation and GLCM texture approach

4. WATERSHED SEGMENTATION
The detected region can be segmented using watershed segmentation [5]. The watershed transform usually works well only with images of bubbles or metallographic pictures, but combined with filtering techniques it works well here. The detected image is superimposed on the original image to obtain the output image.

Figure 4: Texture feature extraction
Figure 7: Segmentation of LEFT image, benign case

5. ASSOCIATION RULE MINING USING THE APRIORI ALGORITHM

Figure 5: Excel sheet representation of texture features

Association rule mining is mainly used to search for interesting relationships among items in a given data set.

Algorithm: Apriori. Find frequent itemsets using an iterative, level-wise approach based on candidate generation.
Input: Excel file of texture parameters D; minimum support threshold min_sup.
Output: L, the frequent itemsets in D.
Method:
Pass 1
1. Generate the candidate itemsets in C1.
2. Save the frequent itemsets in L1.

Pass k
1. Generate the candidate itemsets in Ck from the frequent itemsets in Lk-1:
   1. Join Lk-1 p with Lk-1 q, as follows:
      insert into Ck
      select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
      from Lk-1 p, Lk-1 q
      where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
   2. Generate all (k-1)-subsets of the candidate itemsets in Ck.
   3. Prune from Ck every candidate itemset for which some (k-1)-subset is not in the frequent itemset Lk-1.
2. Scan the transaction database to determine the support of each candidate itemset in Ck.
3. Save the frequent itemsets in Lk.

Figure 8: Attribute selection and classification
Figure 9: Apriori rule generation

Minimum support: 0.25 (25 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 15

Generated sets of large itemsets:
Size of set of large itemsets L(1): 20
Size of set of large itemsets L(2): 92
Size of set of large itemsets L(3): 19

Best rules found:
1. Node3=Value2 Node10=Value1 31 ==> Node2=Value1 30 <conf:(0.97)> lift:(1.67) lev:(0.12) [12] conv:(6.51)
2. Node3=Value2 Node9=Value1 30 ==> Node2=Value1 29 <conf:(0.97)> lift:(1.67) lev:(0.12) [11] conv:(6.3)
3. Node4=Value2 Node10=Value2 27 ==> Node7=Value1 26 <conf:(0.96)> lift:(1.5) lev:(0.09) [8] conv:(4.86)
4. class=Value1 Node3=Value2 26 ==> Node2=Value1 25 <conf:(0.96)> lift:(1.66) lev:(0.1) [9] conv:(5.46)
5. Node3=Value2 Node8=Value1 33 ==> Node2=Value1 31 <conf:(0.94)> lift:(1.62) lev:(0.12) [11] conv:(4.62)
6. Node2=Value1 Node10=Value1 32 ==> Node3=Value2 30 <conf:(0.94)> lift:(1.74) lev:(0.13) [12] conv:(4.91)
7. Node3=Value2 Node5=Value2 31 ==> Node2=Value1 29 <conf:(0.94)> lift:(1.61) lev:(0.11) [11] conv:(4.34)
8. Node2=Value2 Node8=Value2 31 ==> Node3=Value1 29 <conf:(0.94)> lift:(2.03) lev:(0.15) [14] conv:(5.58)
9. Node3=Value2 Node4=Value2 29 ==> Node2=Value1 27 <conf:(0.93)> lift:(1.61) lev:(0.1) [10] conv:(4.06)
10. Node3=Value1 Node6=Value2 29 ==> Node8=Value2 27 <conf:(0.93)> lift:(1.69) lev:(0.11) [11] conv:(4.35)

10-fold cross validation:
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
Benign          0.706    0.498    0.586     0.706    0.641      0.59
Malignant       0.502    0.294    0.630     0.502    0.559      0.59
Weighted Avg.   0.604    0.396    0.608     0.604    0.600      0.59

=== Confusion Matrix ===
   a    b   <-- classified as
 204   85 |  a = Benign
 144  145 |  b = Malignant

6. DECISION TREE INDUCTION METHOD: THE J48 ALGORITHM
The J48 decision tree classifier follows a simple algorithm. To classify a new item, it first creates a decision tree from the attribute values of the available training data: whenever it encounters a training set, it identifies the attribute that discriminates the instances most clearly [6,7,8]. This feature, the one that tells us the most about the data instances so that we can classify them best, is said to have the highest information gain. Among the possible values of this feature, if there is a value for which there is no ambiguity, that is, for which all data instances in its category share the same value of the target variable, we terminate that branch and assign to it the target value obtained. For the other cases, we then look for another attribute that gives the highest information gain, and we continue in this manner until we either get a clear decision about which combination of attributes gives a particular target value, or we run out of attributes.
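The level-wise Apriori procedure described above can be sketched in pure Python; the toy transactions below stand in for the discretized texture database, which is an illustrative assumption:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: join frequent (k-1)-itemsets to form
    candidate k-itemsets, prune candidates with an infrequent
    (k-1)-subset, then scan the transactions to count support."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Pass 1: candidate 1-itemsets; keep the frequent ones
    items = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in items if support(s) >= min_sup}
    frequent = set(level)

    # Pass k: join, prune, count
    k = 2
    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_sup}
        frequent |= level
        k += 1
    return frequent

toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
freq = apriori(toy, min_sup=3)
```

On the toy data, all singletons and pairs are frequent at support 3, while {a, b, c} (support 2) is pruned, illustrating the count-then-prune cycle behind the Weka output above.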
In the event that we run out of attributes, or if we cannot get an unambiguous result from the available information, we assign to the branch the target value that the majority of the items under it possess. Once the decision tree is built, we follow the order of attribute selection obtained for the tree: by checking the respective attributes and their values against those in the decision tree model, we can predict the target value of a new instance. The description above becomes clearer with an example, so consider the following J48 decision tree classification.

=== Classifier model (full training set) ===

J48 pruned tree
---------------

Node7 = Value1
| Node5 = Value1
| | Node8 = Value1: Value1 (15.0/3.0)
| | Node8 = Value2
| | | Node2 = Value1: Value1 (7.0/2.0)
| | | Node2 = Value2: Value2 (12.0/2.0)
| Node5 = Value2
| | Node8 = Value1: Value2 (15.0)
| | Node8 = Value2
| | | Node4 = Value1
| | | | class = Value1: Value2 (3.0/1.0)
| | | | class = Value2: Value1 (6.0/1.0)
| | | Node4 = Value2: Value2 (6.0/1.0)
Node7 = Value2
| Node4 = Value1
| | Node5 = Value1
| | | Node9 = Value1: Value2 (4.0)
| | | Node9 = Value2: Value1 (4.0/1.0)
| | Node5 = Value2
| | | class = Value1
| | | | Node2 = Value1: Value2 (2.0)
| | | | Node2 = Value2: Value1 (3.0/1.0)
| | | class = Value2: Value1 (11.0/1.0)
| Node4 = Value2: Value1 (12.0/1.0)

Number of leaves: 13
Size of the tree: 25
Time taken to build model: 0.03 seconds

7. CONCLUSION
The rules generated using the decision tree induction method clearly show that the time taken to classify benign and malignant cases is just 0.03 seconds, and the accuracy was 99.9%. The specificity is t_neg/neg, the sensitivity is t_pos/pos, the precision is t_pos/(t_pos + f_pos), and the accuracy is sensitivity * (pos/(pos + neg)) + specificity * (neg/(pos + neg)). From the above algorithm the accuracy was found to be 99.9%. That the proposed method yields very good accuracy in a minimal period of time shows the efficiency of the algorithm.

8. REFERENCES
[1] Serhat Ozekes, A. Yilmez Camurcu: Computer aided detection of mammographic masses on CAD digital mammograms. Istanbul Ticaret Üniversitesi Fen Bilimleri (2005), pp. 87-97.
[2] Ruchaneewan Susomboon, Daniela Stan Raicu, Jacob Furst: Pixel-based texture classification of tissues in computed tomography. Literature review (2007).
[3] A. K. Jain, F. Farrokhnia: "Unsupervised texture segmentation using Gabor filters," Pattern Recognition, pp. 1167-1186, 1991.
[4] M. Unser: "Texture classification and segmentation using wavelet frames," IEEE Trans. Image Proc., pp. 1549-1560, 1995.
[5] Mammographic Image Analysis Society: Mini Mammography Database, 2003. www.wiau.man.ac.uk/services/MIAS/MIASmini
[6] Beucher, S., Lantuejoul, C.: "Use of watersheds in contour detection." In Proc. International Workshop on Image Processing, Real-Time Edge and Motion Detection/Estimation, Rennes, September 1979.
[7] Laila Elfangary, Walid Adly Atteya: Mining medical databases using a proposed incremental association rules algorithm (PIA). Second International Conference on the Digital Society, IEEE Computer Society (2008).
[8] Pudi, V., Haritsa, J.: On the optimality of association rule mining algorithms. Technical Report TR-2001-01, DSL, Indian Institute of Science (2001).
[9] Höppner, F., Klawonn, F., Kruse, R., Runkler, T.: Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. Wiley, New York (1999).
[10] Delgado, M., Marin, N., Sanchez, D., Vila, M.A.: Fuzzy association rules: general model and applications. IEEE Transactions on Fuzzy Systems 11, 214-225 (2003).
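As a check on the metric definitions given in the conclusion, they can be applied to the 10-fold cross-validation confusion matrix reported earlier; taking malignant as the positive class is an assumption of this sketch:

```python
def metrics(tp, fn, fp, tn):
    """Specificity, sensitivity, precision and accuracy as defined
    in the conclusion. Note the accuracy expression reduces to
    (tp + tn) / (pos + neg)."""
    pos, neg = tp + fn, tn + fp
    sensitivity = tp / pos
    specificity = tn / neg
    precision = tp / (tp + fp)
    accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
    return specificity, sensitivity, precision, accuracy

# Confusion matrix from the 10-fold cross validation: 204 benign
# classified correctly, 85 as malignant; 144 malignant classified
# as benign, 145 correctly (malignant = positive, by assumption).
spec, sens, prec, acc = metrics(tp=145, fn=144, fp=85, tn=204)
# sens ~ 0.502, spec ~ 0.706, prec ~ 0.630, acc ~ 0.604
```

These values reproduce the malignant TP rate (0.502), precision (0.63) and weighted-average recall (0.604) in the cross-validation table, confirming the formulas are consistent with the reported matrix.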