Advances in Natural and Applied Sciences, 9(3) March 2015, Pages: 65-70
AENSI Journals
Advances in Natural and Applied Sciences
ISSN:1995-0772 EISSN: 1998-1090
Journal home page: www.aensiweb.com/ANAS
Breast Cancer Classification Using Various Machine Learning Techniques
1V. Palaniammal and 2Dr. RM. Chandrasekaran
1Research Scholar, Manonmaniam Sundaranar University, Tirunelveli – 627 012, Tamil Nadu, India.
2Professor, Department of Computer Science & Engineering, Annamalai University, Annamalainagar – 608 002, Tamil Nadu, India.
ARTICLE INFO
Article history:
Received 4 September 2014
Received in revised form 24 November 2014
Accepted 8 December 2014
Available online 10 January 2015
ABSTRACT
In this paper we present various machine learning techniques for breast cancer
classification. WBC (Wisconsin Breast Cancer) data are used for classification. We
have investigated three data mining techniques: they are k-Nearest Neighbor, Support
Vector Machine and Auto associative Neural Network. Several experiments were
conducted using these algorithms. The achieved prediction performances are
comparable to existing techniques. However, we found out that Auto associative Neural
Network has a much better performance than the other two techniques.
Keywords:
Breast cancer, K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Auto Associative Neural Network (AANN).
© 2015 AENSI Publisher All rights reserved.
To Cite This Article: V. Palaniammal and Dr. RM. Chandrasekaran, Breast Cancer Classification Using Various Machine Learning Techniques. Adv. in Nat. Appl. Sci., 9(3): 65-70, 2015.
INTRODUCTION
Breast cancer is the most common cancer among women. Breast cancer occurs when a malignant
(cancerous) tumor originates in the breast. As breast cancer tumors mature, they may metastasize (spread) to
other parts of the body. The primary route of metastasis is the lymphatic system which, ironically enough, is
also the body's primary system for producing and transporting white blood cells and other cancer-fighting
immune system cells throughout the body. Metastasized cancer cells that aren't destroyed by the lymphatic
system's white blood cells move through the lymphatic vessels and settle in remote body locations, forming new
tumors and perpetuating the disease process.
Breast cancer is fairly common. Because of its well-publicized nature and potential for lethality, breast
cancer is arguably the most frightening type of cancer diagnosis someone can receive. However, it is important
to keep in mind that, if identified and properly treated while still in its early stages, breast cancer can be cured.
The causes of breast cancer are not yet definitively known. However, extensive research efforts have
uncovered various risk factors that are associated with increased incidence of breast cancer in women. Though
some of these risk factors are unavoidable and uncontrollable, some of them are very avoidable, making it
possible for people to take action so as to minimize their cancer risk. Even a minimized risk is still a risk,
however. There is no way to pre-determine whether a person will get breast cancer until they have either been
diagnosed with it or they have lived a breast-cancer-free lifetime. Detection and diagnosis of the disease at an
earlier stage can ensure long survival (Jyotirmay Gadewadikar, 2010; Belciug, S., 2010). The symptoms of
breast cancer include a mass and changes in the shape and dimension of the breast (Mojarad, S.A., 2010).
Various diagnostic tests and procedures are available for detecting the presence of breast cancer.
2. Related work:
A literature survey showed that there have been several studies on the prediction problem using various
techniques (Dursun Delen, 2009; Vapnik, V.N., 1990). However, we could only find a few studies related to
medical diagnosis using data mining approaches (Shelly Gupta, 2011).
Fatima et al. (2012) used an approach for classifying breast cancer from the Wisconsin breast cancer
diagnosis (WBCD) database using an adaptive neuro-fuzzy inference system (ANFIS). Using the ANFIS
classifier they achieved an accuracy of 98.25% at the tissue level. Karabatak et al. (2009) proposed an
automatic diagnosis-based pattern recognition system for detecting breast cancer based on association rules
Corresponding Author: V. Palaniammal, Research Scholar, Manonmaniam Sundaranar University, Tirunelveli – 627 012, Tamil Nadu, India. E-mail: [email protected]
(AR) and an artificial neural network (ANN). In that study, they used the AR method for reducing the dimension
of the breast cancer database and an ANN for intelligent classification. The performance of the proposed system,
i.e. the combination of AR and ANN, was compared with that of an ANN-only model. The dimension of the input
feature space was reduced from nine to four by using AR. In the testing stage, they used 3-fold cross validation
on the WBCD to evaluate the proposed pattern recognition system's performance. The correct classification rate
obtained from the AR + ANN system was 95.6%. Akay proposed another pattern recognition method by
considering an SVM classifier to classify benign and malignant masses. They used the WBCD breast cancer
database and achieved an accuracy of 98% (Akay, M.F., 2009).
Proposed System:
Early diagnosis needs an accurate and reliable diagnostic procedure that physicians can use to distinguish
benign breast tumors from malignant ones without resorting to surgical biopsy (Fuzzy Inference System for
Breast Cancer Diagnoses, 2010a; Vazirani, H., 2010). The objective of these predictions is to assign patients to
one of two groups: either "benign", that is noncancerous, or "malignant", that is cancerous. In this paper, we
apply data mining techniques, namely k-Nearest Neighbor, Support Vector Machine and Auto Associative
Neural Network. Fig. 1 shows the block diagram of the proposed system.
[Figure: WBC Database → Extracted Features → Classification using KNN, SVM and AANN → Output (Normal/Abnormal)]
Fig. 1: Block Diagram of Proposed System.
WBC Database:
The WBC (Wisconsin Breast Cancer) database was created by William H. Wolberg at the University of Wisconsin.
This database is used to distinguish malignant (cancerous) from benign (non-cancerous) samples. It
contains 200 observations, of which 117 are benign cases and 83 are malignant cases. These instances are
described by 10 attributes.
Extracted Features:
Feature extraction is the process of defining a set of features that most efficiently and meaningfully
represent the information that is important for analysis and classification.
Classification:
Auto Associative Neural Network (AANN):
A special kind of back propagation neural network called auto associative neural network (AANN) is used
to capture the distribution of feature vectors in the feature space. We use AANNs with 5 layers as shown in Fig.
2. This architecture consists of three non-linear hidden layers between the linear input and output layers.
The second hidden layer contains fewer nodes than the input layer, and is known as the compression layer.
Counting all five layers of this network, the second and fourth layers have more units than the input layer, while
the third (compression) layer has fewer units than the first or fifth. The processing units in the first and third
hidden layers are nonlinear, and the units in the second (compression) hidden layer can be linear or nonlinear.
As the error between the actual and the desired output vectors is minimized, the cluster of points in the input
space determines the shape of the hypersurface obtained by the projection onto the lower-dimensional space.
The network's weights are initially set to small random values. Training is the process of adapting or
modifying the weights in response to the training input patterns presented at the input layer. The training
algorithm controls how the weights respond as learning proceeds. During the training process, the weights
gradually converge towards values that map the input feature vector to the target. From the input vector, the
network produces an output vector, which is then compared to the target output vector. If there is no difference
between the produced output and the target output vectors, learning stops. Otherwise the weights are changed to
reduce the difference. The weights are adapted using a recursive algorithm which starts at the output nodes and
works back to the hidden layers.
Testing is the application mode in which the network processes a test pattern presented at its input layer and
creates a response at the output layer. The vector values are arranged in matrix format to make them feasible to
work with in the AANN. The values are trained using the AANN, which uses the back propagation algorithm.
The weight values are adjusted to reproduce the input features at the output. The obtained features are tested to
find the normalized squared error between the input and the output features. The error is transformed into a
confidence score which shows the similarity between the input and the output.
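The error-to-confidence mapping described above can be sketched as follows. The exponential form exp(-error) is a common choice for AANN-based scoring, used here as an assumption; the paper does not specify the exact transformation:

```python
import numpy as np

def confidence_score(x_in, x_out):
    """Normalized squared error between the input and reconstructed
    feature vectors, mapped to a confidence score in (0, 1]."""
    x_in = np.asarray(x_in, dtype=float)
    x_out = np.asarray(x_out, dtype=float)
    # Squared reconstruction error, normalized by the input energy.
    error = np.sum((x_in - x_out) ** 2) / np.sum(x_in ** 2)
    # Smaller error -> confidence closer to 1 (a perfect match gives 1.0).
    return np.exp(-error)

# A perfect reconstruction yields the maximum confidence of 1.0.
print(confidence_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # -> 1.0
```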
[Figure: five-layer network with a linear input layer, three non-linear hidden layers including a central compression layer, and a linear output layer]
Fig. 2: Auto Associative Neural Network.
AANN Training Algorithm:
1. Select and apply the training pair from the training set to the network.
2. Calculate the output of the network.
3. Calculate the error between the output of the network and the input.
4. Adjust the weights (V, W matrix) in such a way that the error is minimized.
5. Repeat steps 1 to 4 for all the training pairs.
6. Repeat steps 1 to 5 until the network recognizes the training set or for certain epochs.
Testing is the application mode in which the network processes a test input pattern presented at its input
layer and creates a response at the output layer. The main application areas include pattern recognition, voice
recognition, bioinformatics, signal validation, and net clustering.
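As a hedged sketch, the auto associative idea above can be approximated with a multilayer perceptron trained to reproduce its own input. The layer sizes (6, 3, 6), the synthetic data, and the use of scikit-learn's MLPRegressor are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Illustrative "normal" training data: 9 features, echoing WBC-style records.
X_train = rng.normal(0.0, 1.0, size=(200, 9))

# Autoassociative network: three hidden layers with a small compression
# layer in the middle; the training target equals the input itself.
aann = MLPRegressor(hidden_layer_sizes=(6, 3, 6), activation="tanh",
                    max_iter=2000, random_state=0)
aann.fit(X_train, X_train)

def reconstruction_error(model, X):
    """Mean squared error between input vectors and their reconstructions."""
    return np.mean((X - model.predict(X)) ** 2, axis=1)

# One reconstruction error per sample; low values indicate the sample
# matches the distribution the network was trained on.
errors = reconstruction_error(aann, X_train)
print(errors.shape)
```

In a classifier built this way, a test sample whose reconstruction error exceeds a chosen threshold would be labeled abnormal.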
K-Nearest Neighbor (KNN):
The k-nearest neighbor algorithm is a technique for classifying data based on the closest training examples
in the feature space (Bremner, D., 2005). First the dataset is divided into a testing and a training set. For each
row in the testing set, the K nearest training set objects are found, and the classification of the test data is
determined by majority vote, with ties broken at random. If there are ties for the Kth nearest vector then all the
tied instances are included in the vote.
The KNN classifier works by first calculating the distances between the testing data vector and all of the
training vectors using a particular distance measure. In two dimensions, the Euclidean distance between points
(x1, y1) and (x2, y2) is:

Euclidean Distance = sqrt((x2 - x1)^2 + (y2 - y1)^2)

The KNN classifier takes the test instance x, finds its K nearest neighbors in the training data, and assigns x
to the class occurring most often among the K neighbors.
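A minimal sketch of this majority-vote classification, using scikit-learn's KNeighborsClassifier on illustrative two-class toy data; the actual WBC features and the value of K used in the paper are assumptions here:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Illustrative training set: two well-separated clusters standing in
# for "benign" (0) and "malignant" (1) cases.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Classify by majority vote among the K=3 nearest (Euclidean) neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[1.1, 0.9]]))  # -> [0]
print(knn.predict([[5.1, 5.0]]))  # -> [1]
```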
Support Vector Machine (SVM):
After feature extraction, the extracted features are given as inputs to the SVM. The SVM is used to classify
each sample as either affected or normal depending on the feature values. Support vector machines (SVMs)
are a set of related supervised learning methods used for classification and regression (Smola, A.J., 1998;
Cortes, C., V.N. Vapnik, 1995). As shown in Fig. 3, a support vector machine constructs a hyperplane or set of
hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression or other
tasks.
[Figure: support vectors lying on the margin either side of the maximum-margin decision hyperplane]
Fig. 3: Optimal Hyperplane, maximizing margin and support vectors.
The support vectors are the (transformed) training patterns. The support vectors are equally close to the
hyperplane. The support vectors are the training samples that define the optimal separating hyperplane and are the
most difficult patterns to classify. SVM constructs a linear model to estimate the decision function using
non-linear class boundaries based on support vectors (Rejani, Y.I.A., S.T. Selvi, 2009). If the data are linearly
separable, SVM trains a linear machine for an optimal hyperplane that separates the data without error and with
the maximum distance between the hyperplane and the closest training points.
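The maximum-margin idea can be sketched with scikit-learn's SVC. The linear kernel and the linearly separable toy data below are illustrative assumptions, not the kernel or data reported in the paper:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters standing in for the two classes.
X_train = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.5],
                    [3.0, 3.0], [3.2, 2.8], [2.8, 3.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM finds the maximum-margin separating hyperplane;
# the support vectors are the training points closest to it.
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)

print(svm.predict([[0.1, 0.1]]))  # -> [0]
print(svm.predict([[3.1, 3.1]]))  # -> [1]
print(len(svm.support_vectors_))  # number of support vectors found
```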
Experimental Results:
The experiment was performed on the WBC database using various machine learning techniques. The
results show that the accuracy of KNN is 97%, of SVM is 99% and of AANN is 99.5%.
Performance Measures:
In order to evaluate the performance of the detector, and to allow comparisons, several ratios have been
taken into account.
True Positive (TP): The detector found normal data when normal data is present.
True Negative (TN): The detector found abnormal data when abnormal data is present.
False Positive (FP): The detector found abnormal data when a normal sample is present.
False Negative (FN): The detector found normal data when abnormal data is present.
Sensitivity is the percentage of abnormal data classified as abnormal by the procedure:

Sensitivity (SE) = TP / (TP + FN) × 100

Specificity is the percentage of normal data classified as normal by the procedure:

Specificity (SP) = TN / (TN + FP) × 100

From these quantities, the overall accuracy is calculated as:

Accuracy (A) = (TP + TN) / (TP + TN + FP + FN) × 100
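These measures can be computed directly from the confusion counts. Using the first row of Table 1 (TP=30, TN=19, FP=1, FN=0) reproduces the tabulated 100 / 95 / 98 values:

```python
def sensitivity(tp, fn):
    # Percentage of positive cases the detector identified correctly.
    return 100.0 * tp / (tp + fn)

def specificity(tn, fp):
    # Percentage of negative cases the detector identified correctly.
    return 100.0 * tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    # Percentage of all cases classified correctly.
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)

# First row of Table 1 (KNN on the WBC database).
tp, tn, fp, fn = 30, 19, 1, 0
print(sensitivity(tp, fn))       # -> 100.0
print(specificity(tn, fp))       # -> 95.0
print(accuracy(tp, tn, fp, fn))  # -> 98.0
```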
Table 1 shows the performance of WBC database using KNN. Table 2 and 3 shows the performance of
SVM and AANN
Table 1: Performance of WBC database using KNN.
No.   TP   TN   FP   FN   Sensitivity (%)   Specificity (%)   Accuracy (%)
1     30   19   1    0    100               95                98
2     29   19   2    0    100               93                96
3     24   25   1    0    100               95                98
4     28   20   0    2    93                100               96
Table 2: Performance of WBC database using SVM.
No.   TP   TN   FP   FN   Sensitivity (%)   Specificity (%)   Accuracy (%)
F1    31   18   0    1    96.88             100               98
F2    31   18   0    1    96.88             100               98
F3    25   25   0    0    100               100               100
F4    30   20   0    0    100               100               100
Table 3: Performance of WBC database using AANN.
No.   TP   TN   FP   FN   Sensitivity (%)   Specificity (%)   Accuracy (%)
F1    31   19   0    0    100               100               100
F2    31   18   0    1    96.88             100               98
F3    25   25   0    0    100               100               100
F4    30   20   0    0    100               100               100
Table 4: Average performance of KNN, SVM and AANN.
Classifiers   Sensitivity (%)   Specificity (%)   Accuracy (%)
KNN           98.25             95                97
SVM           98.32             100               99
AANN          99.15             100               99.5
The average performance on the WBC database using KNN, SVM and AANN is shown in Table 4.
Performance Graph:
Three different techniques, KNN, SVM and AANN, have been used for classification. When these three
techniques are compared, it is clear that the performance of AANN is the highest, with an accuracy of 99.5%.
Figs. 4, 5 and 6 show the performance of KNN, SVM and AANN. The average performance of KNN, SVM and
AANN is shown in Fig. 7.
Conclusion:
The last decade has seen major advancements in methods for the diagnosis of breast cancer. Only recently
have soft computing techniques been used. In this paper we have discussed some effective techniques that can be
used for breast cancer classification. Among the various data mining classifiers, the Auto Associative Neural
Network is found to be the best predictor, with 99.5% accuracy on the WBC (Wisconsin Breast Cancer)
database from the UCI machine learning repository.
Fig. 4: Average Performance of KNN.
Fig. 5: Average Performance of SVM.
Fig. 6: Average Performance of AANN.
Fig. 7: Average performance of KNN, SVM and AANN.
REFERENCES
Fatima, B., C.M. Amine, 2012. “A Neuro-Fuzzy inference model for breast cancer recognition”,
International Journal of Computer Science & Information Technology.
Shelly Gupta, Dharminder Kumar, Anand Sharma, 2011. “Data Mining Classification Techniques Applied
for Breast Cancer Diagnosis And Prognosis”, IEEE, 2.
Jyotirmay Gadewadikar, Yufeng Zheng, Ping Zhang, 2010. “Exploring Bayesian networks for medical
decision support in breast cancer detection”, African Journal of Mathematics and Computer Science Research,
pp: 225-231.
Mojarad, S.A., S.S. Dlay, W.L. Woo, G.V. Sherbet, 2010. “Breast Cancer prediction and cross
validation using multilayer perceptron neural networks”, IEEE, pp: 1-8.
Belciug, S., F. Gorunescu, M. Gorunescu, A.B.M. Salem, 2010. “Assessing Performances of Unsupervised
and Supervised Neural Networks in Breast Cancer Detection”, Proceeding 7th International Conference on
Informatics and Systems, pp: 1-8.
Fuzzy Inference System for Breast Cancer Diagnoses, 2010a. IEEE, pp: 911-915.
Vazirani, H., R. Kala, A. Shukla, R. Tiwari, 2010. “Diagnosis of Breast Cancer by Modular Neural
Network”, IEEE, pp: 115-119.
Akay, M.F., 2009. “Support vector machines combined with feature selection for breast cancer diagnosis.”,
Expert Systems with Applications, pp: 3240-3247.
Karabatak, M., M.C. Ince, 2009. “An expert system for detection of breast cancer based on association
rules and neural network”, Expert Systems with Applications, pp: 3465-3469.
Dursun Delen, 2009. “Analysis of cancer data: a data mining approach”, Expert Systems: The Journal of
Knowledge Engineering, 26.
Rejani, Y.I.A., S.T. Selvi, 2009. “Early detection of breast cancer using SVM classifier technique”,
International Journal on Computer Science and Engineering, pp: 127-130.
Sudhir, D., A. Ghatol Ashok, P. Pande Amol, 2006. “Neural Network aided Breast Cancer Detection and
Diagnosis.”, WSEAS International Conference on Neural Networks.
Bremner, D., E.J. Demaine, J. Erickson, J. Iacono, S. Langerman, P. Morin and G. Toussaint, 2005.
“Output-sensitive algorithms for computing nearest-neighbour decision boundaries”, Discrete and
Computational Geometry.
Vapnik, V.N., 1999. “An overview of statistical learning theory”, IEEE Trans. Neural Networks New
York, 10: 998- 999.
Smola, A.J., B. Scholkopf and K.R. Muller, 1998. “The connection between regularization operators and
support vector kernels”, Neural Networks New York, 11: 637- 649.
Cortes, C., V.N. Vapnik, 1995. “Support vector networks”, Machine learning Boston, 3: 273-297.