Advances in Natural and Applied Sciences, 9(3) March 2015, Pages: 65-70
AENSI Journals
Advances in Natural and Applied Sciences
ISSN:1995-0772 EISSN: 1998-1090
Journal home page: www.aensiweb.com/ANAS
Breast Cancer Classification Using Various Machine Learning Techniques
1V. Palaniammal and 2Dr. RM. Chandrasekaran
1Research Scholar, Manonmaniam Sundaranar University, Tirunelveli – 627 012, Tamil Nadu, India.
2Professor, Department of Computer Science & Engineering, Annamalai University, Annamalainagar – 608 002, Tamil Nadu, India.
ARTICLE INFO
Article history:
Received 4 September 2014
Received in revised form 24 November 2014
Accepted 8 December 2014
Available online 10 January 2015
ABSTRACT
In this paper we present various machine learning techniques for breast cancer
classification. WBC (Wisconsin Breast Cancer) data are used for classification. We
have investigated three data mining techniques: they are k-Nearest Neighbor, Support
Vector Machine and Auto associative Neural Network. Several experiments were
conducted using these algorithms. The achieved prediction performances are
comparable to existing techniques. However, we found out that Auto associative Neural
Network has a much better performance than the other two techniques.
Keywords:
Breast cancer, K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Auto Associative Neural Network (AANN).
© 2015 AENSI Publisher All rights reserved.
To Cite This Article: V. Palaniammal and Dr. RM. Chandrasekaran, Breast Cancer Classification Using Various Machine Learning Techniques. Adv. in Nat. Appl. Sci., 9(3): 65-70, 2015.
INTRODUCTION
Breast cancer is the most common cancer among women. Breast cancer occurs when a malignant
(cancerous) tumor originates in the breast. As breast cancer tumors mature, they may metastasize (spread) to
other parts of the body. The primary route of metastasis is the lymphatic system which, ironically enough, is
also the body's primary system for producing and transporting white blood cells and other cancer-fighting
immune system cells throughout the body. Metastasized cancer cells that aren't destroyed by the lymphatic
system's white blood cells move through the lymphatic vessels and settle in remote body locations, forming new
tumors and perpetuating the disease process.
Breast cancer is fairly common. Because of its well-publicized nature and potential for lethality, breast
cancer is arguably the most frightening type of cancer diagnosis someone can receive. However, it is important
to keep in mind that, if identified and properly treated while still in its early stages, breast cancer can be cured.
The causes of breast cancer are not yet definitively known. However, extensive research efforts have
uncovered various risk factors that are associated with increased incidence of breast cancer in women. Though
some of these risk factors are unavoidable and uncontrollable, some of them are very avoidable, making it
possible for people to take action so as to minimize their cancer risk. Even a minimized risk is still a risk,
however. There is no way to pre-determine whether a person will get breast cancer until they have either been
diagnosed with it or they have lived a breast-cancer-free lifetime. Detection and diagnosis of the disease at an
earlier stage can ensure long survival (Jyotirmay Gadewadikar, 2010; Belciug, S., 2010). The symptoms of
breast cancer include a mass and changes in the shape and dimension of the breast (Mojarad, S.A., 2010).
Various diagnostic tests and procedures are available for detecting the presence of breast cancer.
2. Related work:
A literature survey showed that there have been several studies on the prediction problem using various
techniques (Dursun Delen, 2009; Vapnik, V.N., 1990). However, we could only find a few studies related to
medical diagnosis using data mining approaches (Shelly Gupta, 2011).
Fatima et al. (2012) used an approach for classifying breast cancer from the Wisconsin breast cancer
diagnosis (WBCD) database using an adaptive neuro-fuzzy inference system (ANFIS). Using the ANFIS
classifier they achieved an accuracy of 98.25% at the tissue level. Karabatak et al. (2009) proposed an
automatic diagnosis-based pattern recognition system for detecting breast cancer based on association rules
Corresponding Author: V. Palaniammal, Research Scholar, Manonmaniam Sundaranar University, Tirunelveli – 627 012, Tamil Nadu, India. E-mail: [email protected]
(AR) and an artificial neural network (ANN). In that study, they used the AR method for reducing the dimension
of the breast cancer database and an ANN for intelligent classification. The performance of the proposed system,
i.e. the combination of AR and ANN, was compared with that of an ANN-only model. The dimension of the input
feature space was reduced from nine to four by using AR. In the testing stage, they used 3-fold cross validation
on the WBCD to evaluate the proposed pattern recognition system's performance. The correct classification rate
obtained from the AR + ANN system was 95.6%. Akay proposed another pattern recognition method by
considering an SVM classifier to classify benign and malignant masses. They used the WBCD breast cancer
database and achieved an accuracy of 98% (Akay, M.F., 2009).
Proposed System:
Early diagnosis needs an accurate and reliable diagnostic procedure that physicians can use to distinguish
benign breast tumors from malignant ones without resorting to surgical biopsy (Fuzzy Inference System for
Breast Cancer Diagnoses, 2010a; Vazirani, H., 2010). The objective of these predictions is to assign patients to
one of two groups: either "benign", that is noncancerous, or "malignant", that is cancerous. In this paper, we
apply data mining techniques, namely k-Nearest Neighbor, Support Vector Machine and Auto Associative
Neural Network. Fig. 1 shows the block diagram of the proposed system.
[Figure: WBC Database → Extracted Features → Classification using KNN, SVM and AANN → Output (Normal/Abnormal)]
Fig. 1: Block Diagram of Proposed System.
WBC Database:
The WBC (Wisconsin Breast Cancer) database was created by William H. Wolberg at the University of Wisconsin.
This database is used to distinguish malignant (cancerous) from benign (non-cancerous) samples. It
contains 200 observations, of which 117 are benign cases and 83 are malignant cases. These instances are
described by 10 attributes.
Extracted Features:
Feature extraction is the process of defining a set of features that most efficiently and meaningfully
represent the information that is important for analysis and classification.
Classification:
Auto Associative Neural Network (AANN):
A special kind of back propagation neural network called auto associative neural network (AANN) is used
to capture the distribution of feature vectors in the feature space. We use AANNs with 5 layers as shown in Fig.
2. This architecture consists of three non-linear hidden layers between the linear input and output layers.
The second hidden layer contains fewer nodes than the input layer, and is known as the compression layer.
Counting all five layers of this network, the second and fourth layers have more units than the input layer, while
the third (compression) layer has fewer units than the first or fifth. The processing units in the first and third
hidden layers are nonlinear, and the units in the second (compression) hidden layer can be linear or nonlinear.
As the error between the actual and the desired output vectors is minimized, the cluster of points in the input
space determines the shape of the hypersurface obtained by the projection onto the lower-dimensional space.
The network's weights are initially set to small random values. Training is the process of adapting or
modifying the weights in response to the training input patterns presented at the input layer. The training
algorithm controls how the weights respond as learning proceeds. During the training process, the weights
gradually converge towards values that map the input feature vector to the target. From the input vector, the
network produces an output vector, which is then compared to the target output vector. If there is no difference
between the produced output and the target output vectors, learning stops. Otherwise the weights are changed to
reduce the difference. The weights are adapted using a recursive algorithm which starts at the output nodes and
works back to the hidden layers.
Testing is the application mode in which the network processes a test pattern presented at its input layer and
creates a response at the output layer. The vector values are arranged in matrix format to make them feasible to
work with in the AANN. The values are trained using the AANN, which uses the back propagation algorithm.
The weight values are adjusted to reproduce the input features at the output. The obtained features are tested to
find the normalized squared error between the input and the output features. The error is transformed into a
confidence score which shows the similarity between the input and the output.
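The error-to-confidence mapping described above can be sketched as follows. The exponential form exp(-error) is a common choice for AANN-based scoring, used here as an assumption; the paper does not specify the exact transformation:

```python
import numpy as np

def confidence_score(x_in, x_out):
    """Normalized squared error between the input and reconstructed
    feature vectors, mapped to a confidence score in (0, 1]."""
    x_in = np.asarray(x_in, dtype=float)
    x_out = np.asarray(x_out, dtype=float)
    # Squared reconstruction error, normalized by the input energy.
    error = np.sum((x_in - x_out) ** 2) / np.sum(x_in ** 2)
    # Smaller error -> confidence closer to 1 (a perfect match gives 1.0).
    return np.exp(-error)

# A perfect reconstruction yields the maximum confidence of 1.0.
print(confidence_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # -> 1.0
```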
[Figure: five-layer network with a linear input layer, three non-linear hidden layers including a central compression layer, and a linear output layer]
Fig. 2: Auto Associative Neural Network.
AANN Training Algorithm:
1. Select and apply the training pair from the training set to the network.
2. Calculate the output of the network.
3. Calculate the error between the output of the network and the input.
4. Adjust the weights (V, W matrix) in such a way that the error is minimized.
5. Repeat steps 1 to 4 for all the training pairs.
6. Repeat steps 1 to 5 until the network recognizes the training set or for certain epochs.
Testing is the application mode in which the network processes a test input pattern presented at its input
layer and creates a response at the output layer. The main application areas include pattern recognition, voice
recognition, bioinformatics, signal validation, and net clustering.
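As a hedged sketch, the auto associative idea above can be approximated with a multilayer perceptron trained to reproduce its own input. The layer sizes (6, 3, 6), the synthetic data, and the use of scikit-learn's MLPRegressor are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Illustrative "normal" training data: 9 features, echoing WBC-style records.
X_train = rng.normal(0.0, 1.0, size=(200, 9))

# Autoassociative network: three hidden layers with a small compression
# layer in the middle; the training target equals the input itself.
aann = MLPRegressor(hidden_layer_sizes=(6, 3, 6), activation="tanh",
                    max_iter=2000, random_state=0)
aann.fit(X_train, X_train)

def reconstruction_error(model, X):
    """Mean squared error between input vectors and their reconstructions."""
    return np.mean((X - model.predict(X)) ** 2, axis=1)

# One reconstruction error per sample; low values indicate the sample
# matches the distribution the network was trained on.
errors = reconstruction_error(aann, X_train)
print(errors.shape)
```

In a classifier built this way, a test sample whose reconstruction error exceeds a chosen threshold would be labeled abnormal.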
K-Nearest Neighbor (KNN):
The k-nearest neighbor algorithm is a technique for classifying data based on the closest training examples
in the feature space (Bremner, D., 2005). First the dataset is divided into a testing and a training set. For each
row in the testing set, the K nearest training set objects are found, and the classification of the test data is
determined by majority vote, with ties broken at random. If there are ties for the Kth nearest vector then all the
tied instances are included in the vote.
The KNN classifier works by first calculating the distances between the testing data vector and all of the
training vectors using a particular distance measure. In two dimensions, the Euclidean distance between points
(x1, y1) and (x2, y2) is:

Euclidean Distance = sqrt((x2 - x1)^2 + (y2 - y1)^2)

The KNN classifier takes the test instance x, finds its K nearest neighbors in the training data, and assigns x
to the class occurring most often among the K neighbors.
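A minimal sketch of this majority-vote classification, using scikit-learn's KNeighborsClassifier on illustrative two-class toy data; the actual WBC features and the value of K used in the paper are assumptions here:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Illustrative training set: two well-separated clusters standing in
# for "benign" (0) and "malignant" (1) cases.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Classify by majority vote among the K=3 nearest (Euclidean) neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[1.1, 0.9]]))  # -> [0]
print(knn.predict([[5.1, 5.0]]))  # -> [1]
```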
Support Vector Machine (SVM):
After feature extraction, the extracted features are given as inputs to the SVM. The SVM is used to classify
each sample as either affected or normal depending on the feature values. Support vector machines (SVMs)
are a set of related supervised learning methods used for classification and regression (Smola, A.J., 1998;
Cortes, C., V.N. Vapnik, 1995). As shown in Fig. 3, a support vector machine constructs a hyperplane or set of
hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression or other
tasks.
[Figure: support vectors lying on the margin either side of the maximum-margin decision hyperplane]
Fig. 3: Optimal Hyperplane, maximizing margin and support vectors.
The support vectors are the (transformed) training patterns. The support vectors are equally close to the
hyperplane. The support vectors are the training samples that define the optimal separating hyperplane and are the
most difficult patterns to classify. SVM constructs a linear model to estimate the decision function using
non-linear class boundaries based on support vectors (Rejani, Y.I.A., S.T. Selvi, 2009). If the data are linearly
separable, SVM trains a linear machine for an optimal hyperplane that separates the data without error and with
the maximum distance between the hyperplane and the closest training points.
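The maximum-margin idea can be sketched with scikit-learn's SVC. The linear kernel and the linearly separable toy data below are illustrative assumptions, not the kernel or data reported in the paper:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters standing in for the two classes.
X_train = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.5],
                    [3.0, 3.0], [3.2, 2.8], [2.8, 3.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM finds the maximum-margin separating hyperplane;
# the support vectors are the training points closest to it.
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)

print(svm.predict([[0.1, 0.1]]))  # -> [0]
print(svm.predict([[3.1, 3.1]]))  # -> [1]
print(len(svm.support_vectors_))  # number of support vectors found
```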
Experimental Results:
The experiment was performed on the WBC database using various machine learning techniques. The
results show that the accuracy of KNN is 97%, of SVM is 99% and of AANN is 99.5%.
Performance Measures:
In order to evaluate the performance of the detector, and to allow comparisons, several ratios have been
taken into account.
True Positive (TP): The detector found normal data when normal data is present.
True Negative (TN): The detector found abnormal data when abnormal data is present.
False Positive (FP): The detector found abnormal data when a normal sample is present.
False Negative (FN): The detector found normal data when abnormal data is present.
Sensitivity is the percentage of abnormal data classified as abnormal by the procedure:

Sensitivity (SE) = TP / (TP + FN) × 100

Specificity is the percentage of normal data classified as normal by the procedure:

Specificity (SP) = TN / (TN + FP) × 100

From these quantities, the overall accuracy is calculated as:

Accuracy (A) = (TP + TN) / (TP + TN + FP + FN) × 100
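These measures can be computed directly from the confusion counts. Using the first row of Table 1 (TP=30, TN=19, FP=1, FN=0) reproduces the tabulated 100 / 95 / 98 values:

```python
def sensitivity(tp, fn):
    # Percentage of positive cases the detector identified correctly.
    return 100.0 * tp / (tp + fn)

def specificity(tn, fp):
    # Percentage of negative cases the detector identified correctly.
    return 100.0 * tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    # Percentage of all cases classified correctly.
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)

# First row of Table 1 (KNN on the WBC database).
tp, tn, fp, fn = 30, 19, 1, 0
print(sensitivity(tp, fn))       # -> 100.0
print(specificity(tn, fp))       # -> 95.0
print(accuracy(tp, tn, fp, fn))  # -> 98.0
```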
Table 1 shows the performance of WBC database using KNN. Table 2 and 3 shows the performance of
SVM and AANN
Table 1: Performance of WBC database using KNN.
No.   TP   TN   FP   FN   Sensitivity (%)   Specificity (%)   Accuracy (%)
1     30   19   1    0    100               95                98
2     29   19   2    0    100               93                96
3     24   25   1    0    100               95                98
4     28   20   0    2    93                100               96
Table 2: Performance of WBC database using SVM.
No.   TP   TN   FP   FN   Sensitivity (%)   Specificity (%)   Accuracy (%)
F1    31   18   0    1    96.88             100               98
F2    31   18   0    1    96.88             100               98
F3    25   25   0    0    100               100               100
F4    30   20   0    0    100               100               100
Table 3: Performance of WBC database using AANN.
No.   TP   TN   FP   FN   Sensitivity (%)   Specificity (%)   Accuracy (%)
F1    31   19   0    0    100               100               100
F2    31   18   0    1    96.88             100               98
F3    25   25   0    0    100               100               100
F4    30   20   0    0    100               100               100
Table 4: Average performance of KNN, SVM and AANN.
Classifiers   Sensitivity (%)   Specificity (%)   Accuracy (%)
KNN           98.25             95                97
SVM           98.32             100               99
AANN          99.15             100               99.5
The average performance on the WBC database using KNN, SVM and AANN is shown in Table 4.
Performance Graph:
Three different techniques, KNN, SVM and AANN, have been used for classification. When these three
techniques are compared, it is clear that the performance of AANN is the highest, with an accuracy of 99.5%.
Figs. 4, 5 and 6 show the performance of KNN, SVM and AANN. The average performance of KNN, SVM and
AANN is shown in Fig. 7.
Conclusion:
The last decade has seen major advancements in methods for the diagnosis of breast cancer. Only recently
have soft computing techniques been used. In this paper we have discussed some effective techniques that can be
used for breast cancer classification. Among the various data mining classifiers, the Auto Associative Neural
Network is found to be the best predictor, with 99.5% accuracy on the WBC (Wisconsin Breast Cancer)
database from the UCI machine learning repository.
Fig. 4: Average Performance of KNN.
Fig. 5: Average Performance of SVM.
Fig. 6: Average Performance of AANN.
Fig. 7: Average performance of KNN, SVM and AANN.
REFERENCES
Fatima, B., C.M. Amine, 2012. “A Neuro-Fuzzy inference model for breast cancer recognition”,
International Journal of Computer Science & Information Technology.
Shelly Gupta, Dharminder Kumar, Anand Sharma, 2011. “Data Mining Classification Techniques Applied
for Breast Cancer Diagnosis And Prognosis”, IEEE, 2.
Jyotirmay Gadewadikar, Yufeng Zheng, Ping Zhang, 2010. “Exploring Bayesian networks for medical
decision support in breast cancer detection”, African Journal of Mathematics and Computer Science Research,
pp: 225-231.
Mojarad, S.A., S.S. Dlay, W.L. Woo, G.V. Sherbet, 2010. “Breast Cancer prediction and cross
validation using multilayer perceptron neural networks”, IEEE, pp: 1-8.
Belciug, S., F. Gorunescu, M. Gorunescu, A.B.M. Salem, 2010. “Assessing Performances of Unsupervised
and Supervised Neural Networks in Breast Cancer Detection”, Proceeding 7th International Conference on
Informatics and Systems, pp: 1-8.
Fuzzy Inference System for Breast Cancer Diagnoses, 2010a. IEEE, pp: 911-915.
Vazirani, H., R. Kala, A. Shukla, R. Tiwari, 2010. “Diagnosis of Breast Cancer by Modular Neural
Network”, IEEE, pp: 115-119.
Akay, M.F., 2009. “Support vector machines combined with feature selection for breast cancer diagnosis.”,
Expert Systems with Applications, pp: 3240-3247.
Karabatak, M., M.C. Ince, 2009. “An expert system for detection of breast cancer based on association
rules and neural network”, Expert Systems with Applications, pp: 3465-3469.
Dursun Delen, 2009. “Analysis of cancer data: a data mining approach”, Expert Systems: The Journal of
Knowledge Engineering, 26.
Rejani, Y.I.A., S.T. Selvi, 2009. “Early detection of breast cancer using SVM classifier technique”,
International Journal on Computer Science and Engineering, pp: 127-130.
Sudhir, D., A. Ghatol Ashok, P. Pande Amol, 2006. “Neural Network aided Breast Cancer Detection and
Diagnosis.”, WSEAS International Conference on Neural Networks.
Bremner, D., E.J. Demaine, J. Erickson, J. Iacono, S. Langerman, P. Morin and G. Toussaint, 2005.
“Output-sensitive algorithms for computing nearest-neighbour decision boundaries”, Discrete and
Computational Geometry.
Vapnik, V.N., 1999. “An overview of statistical learning theory”, IEEE Trans. Neural Networks New
York, 10: 998- 999.
Smola, A.J., B. Scholkopf and K.R. Muller, 1998. “The connection between regularization operators and
support vector kernels”, Neural Networks New York, 11: 637- 649.
Cortes, C., V.N. Vapnik, 1995. “Support vector networks”, Machine learning Boston, 3: 273-297.