ISSN : 2229-4333 (Print) | ISSN : 0976-8491 (Online)
International Journal of Computer Science and Technology (IJCST), Vol. 2, Issue 3, September 2011
www.ijcst.com

Breast Cancer Assessment and Diagnosis using Particle Swarm Optimization

1 Aswini Kumar Mohanty, 2 Swasati Sahoo, 3 Arati Pradhan, 4 Dr. Saroj Kumar Lenka
1 SOA University, Bhubaneswar, Orissa, India
2 Biju Patnaik University of Technology, Bhubaneswar, Orissa, India
3 Dept. of Computer Science, Gandhi Engineering College, Bhubaneswar, Orissa, India
4 Dept. of Computer Science, Modi University, Lakshmangarh, Rajasthan, India

Abstract
A binary/discrete Particle Swarm Optimization (BPSO/DPSO) is proposed and successfully applied to classifying risk in the Wisconsin breast cancer data set. Breast cancer is one of the leading causes of death among women in many parts of the world. In 2007, approximately 178,480 women in the United States were expected to be diagnosed with invasive breast cancer. However, advances in medical technology have reduced breast cancer mortality over the past decade, owing to earlier diagnosis and improved treatment. Hence, the purpose of this study is to separate patients who have breast cancer from those who do not. This study proposes a data mining methodology, called Discrete PSO, whose fundamental concept is based on the standard PSO. In this novel PSO, each particle is coded in positive integers and has a feasible system structure. Using two rules, our approach achieved an accuracy of 96.995%, a sensitivity of 100% and a specificity of 95.83%. Compared with previous research, the results show improved accuracy for the same number of rules. The high-quality results obtained in this research can serve as a reference for hospital decision making and for researchers.

Keywords
Discrete Particle Swarm Optimization, Breast Cancer, Data Mining.

I. Introduction
In the 21st century, cancer remains a leading cause of death worldwide. According to World Health Organization (WHO) statistics, more than 10 million people are diagnosed with cancer and 6 million die of it. In the United States, for example, cancer is the second leading cause of death, accounting for approximately 25% of all mortalities. Furthermore, medical resource allocation and utilization are of particular interest in the case of cancer, which costs 263.3 billion per year [1, 2].

Breast cancer is the most common cancer in women, with over 1 million new cases diagnosed annually. It is estimated that approximately 500,000 women die of breast cancer each year, making it the second leading cause of cancer death in women, with a lifetime risk of the order of 1/10. Knowledge of the molecular events relating to breast cancer biology and pathogenesis has greatly increased over the last decade [3, 4]. Moreover, advances in medical technology have reduced breast cancer mortality in the past decade, owing to earlier diagnosis and improved treatment [2, 5, 6]. Hence, it is important to improve the accuracy of early diagnosis.

Data mining is a sophisticated tool for classification tasks. Health care related data mining, however, is one of the most rewarding and most challenging areas of application in data mining and knowledge discovery. The challenges arise because the data sets are large, complex, heterogeneous, hierarchical, time-dependent and of varying quality. The available healthcare data sets are also fragmented and distributed in nature, which makes data integration a highly challenging task [7]. Moreover, data classification methods have been applied to problems in medicine, social science, management, and engineering [8].
Therefore, this study proposes a data classification method using Binary/Discrete Particle Swarm Optimization (BPSO/DPSO), and adopts classification rules to diagnose whether a patient's tumor is malignant or not.

II. Data mining for classification tasks
With the rapid growth of databases, data mining has become an increasingly important approach to data analysis. In past years, research in molecular biology and molecular medicine has accumulated enormous amounts of data. Such large amounts of information must be thoroughly analysed to gain a better understanding of the underlying biological processes, and methods of knowledge discovery and data mining are the best candidates for this challenging task. The operations research community has contributed significantly to this field, especially through the formulation and solution of numerous data mining problems as optimization problems; conversely, several operations research applications can be addressed using data mining methods.

One of the important tasks in data mining is classification. In classification, there is a target variable which is partitioned into predefined groups or classes. The classification system takes labeled data instances and generates a model that determines the target variable of new data instances. The discovered knowledge is usually represented in the form of if-then prediction rules, which have the advantage of being a high-level, symbolic knowledge representation and so contribute to the comprehensibility of the discovered knowledge.

III. Introduction of the PSO
The Particle Swarm Optimization (PSO) technique is a population-based stochastic optimization technique first introduced by Eberhart and Kennedy [9]. It belongs to the category of Swarm Intelligence methods; it is also an evolutionary computation method inspired by the metaphor of social interaction and communication, such as bird flocking and fish schooling.
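The update that drives each particle can be written compactly. The following is a sketch of the usual inertia-weight form of PSO, assumed here to correspond to Eqs. (1) and (2) referenced in this section. The symbols are notation chosen for illustration: $x_{ij}^{t}$ is the position of particle $i$ in dimension $j$ at iteration $t$, $v_{ij}^{t}$ its velocity, $p_{ij}$ the particle's own best position (pbest), $g_{j}$ the best position found by the swarm (gbest), and $\rho_1, \rho_2$ are taken to be uniform random numbers in [0, 1].

```latex
\begin{align}
v_{ij}^{t} &= w\,v_{ij}^{t-1}
            + c_1 \rho_1 \left(p_{ij}^{t-1} - x_{ij}^{t-1}\right)
            + c_2 \rho_2 \left(g_{j}^{t-1} - x_{ij}^{t-1}\right) \tag{1} \\
x_{ij}^{t} &= x_{ij}^{t-1} + v_{ij}^{t} \tag{2}
\end{align}
```

The first term of Eq. (1) carries over the previous velocity, while the second and third terms pull the particle toward pbest and gbest respectively.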
The values $c_1\rho_1$ and $c_2\rho_2$ determine the weights of the two parts, and $c_1 + c_2$ is usually limited to 4 [9]. To apply PSO, several parameters must be properly determined, including the population size (m), the cognition learning factor ($c_1$), the social learning factor ($c_2$), the inertia weight (w), and the number of iterations or CPU time. We conducted preliminary experiments, and the complete computational procedure of the PSO algorithm can be summarized as follows.
1) Initialize: Initialize the parameters and the population with random positions and velocities.
2) Evaluation: Evaluate the fitness value (the desired objective function) for each particle.
3) Find the pbest: If the fitness value of particle i is better than its best fitness value (pbest) in history, then set the current fitness value as the new pbest of particle i.
4) Find the gbest: If any pbest is updated and it is better than the current gbest, then set gbest to the current value.
5) Update velocity and position: Update the velocity and move to the next position according to Eqs. (1) and (2).
6) Stopping criterion: If the limit on the number of iterations or CPU time is reached, then stop; otherwise go back to step 2.

Since PSO was first introduced by Kennedy and Eberhart [9] to optimize various continuous nonlinear functions, it has been successfully applied to a wide range of applications [9, 10]. However, the major obstacle to applying PSO successfully is its continuous nature. To remedy this drawback, Kennedy and Eberhart [9, 10] developed a discrete version of PSO (BPSO). In BPSO, the particle is characterized by a binary solution representation, and the velocity must be transformed into the probability that each binary dimension takes a value of one. Basically, BPSO updates the velocity according to Eq. (1), but particles are represented by binary variables and w is not used. Furthermore, the velocity is mapped to the interval [0, 1] by the following sigmoid transformation:

$s(v_{ij}^{t}) = \dfrac{1}{1 + e^{-v_{ij}^{t}}}$ (3)

where $s(v_{ij}^{t})$ denotes the probability of bit $x_{ij}^{t}$ taking the value 1. To keep $s(v_{ij}^{t})$ from approaching 0 or 1, a constant $V_{max}$ is used to limit the range of the velocity. Typically, $V_{max}$ is set at 4, such that

$v_{ij}^{t} \in [-4, 4]$ and, consequently, $s(v_{ij}^{t}) \in [0.018, 0.982]$ (4)

At each time step, each bit of a particle changes its current position according to Eq. (5) instead of Eq. (2), based on Eq. (3), as follows [10]:

$x_{ij}^{t} = \begin{cases} 1, & \text{if } \rho < s(v_{ij}^{t}) \\ 0, & \text{otherwise} \end{cases}$ (5)

where $\rho$ is a uniform random number in [0, 1].

This paper aims at creating a novel PSO in which each particle is coded in positive integers and has a feasible system structure. The proposed DPSO for data mining can overcome the drawbacks of the GAs proposed in [11]. The details are presented below.

IV. Study Method
This study developed a data mining process, called Discrete PSO, whose fundamental concept is based on the standard PSO. The process was tested on the Wisconsin breast cancer data set. The data set includes 9 features and 1 class variable; missing values were replaced with the most frequent value of the corresponding feature. Apart from the class variable, each of the 9 features takes a value between 1 and 10, with a higher value corresponding to a more abnormal state of the tumour, as shown in Table 1. The data set contains 699 records, of which 458 were diagnosed as benign (class = 2) and 241 as malignant (class = 4). We randomly divided the original data set into a training data set of 466 patient records and a validation data set of 233 patient records. The training data set is used to learn the breast cancer pattern and to generate the decision rule(s); the validation data set, which is not used to develop the system, is used to validate the results. Fig. 1 shows the flowchart of this study.
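The data preparation described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function names `impute_mode` and `split_records`, the use of `None` as the missing-value marker, the fixed seeds, and the synthetic records are all invented for the example. Only the 9-feature layout, the most-frequent-value substitution, and the 466/233 random split come from the text.

```python
import random
from collections import Counter


def impute_mode(records, missing=None):
    """Return a copy of `records` in which every missing feature value is
    replaced by the most frequent observed value of that feature.
    The last column is treated as the class label and left untouched.
    Assumes every feature has at least one observed value."""
    n_features = len(records[0]) - 1
    filled = [list(r) for r in records]
    for j in range(n_features):
        observed = [r[j] for r in records if r[j] is not missing]
        mode = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[j] is missing:
                r[j] = mode
    return filled


def split_records(records, n_train, seed=0):
    """Shuffle the records and split them into training and validation sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Synthetic stand-in for the data set: 9 features in 1..10, class 2 or 4.
rng = random.Random(42)
data = [[rng.randint(1, 10) for _ in range(9)] + [rng.choice([2, 4])]
        for _ in range(699)]
data[0][5] = None  # simulate one missing feature value

clean = impute_mode(data)
train, valid = split_records(clean, n_train=466)
print(len(train), len(valid))  # 466 233
```

Fixing the seed makes the split reproducible; a different seed gives a different, equally valid random partition.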
[Fig. 1: The flowchart of this study. Elements: Original data set; Training data set; Validation data set; Discrete-PSO; Classification rule; DPSO operations (particle update); Meets the predefined accuracy? (Yes/No); Data that cannot be classified correctly; Update the training data set; Discovery of decision multi-rules.]

Table 1: The feature variables of the data set
Feature variable               Domain
Clump Thickness                1~10
Uniformity of Cell Size        1~10
Uniformity of Cell Shape       1~10
Marginal Adhesion              1~10
Single Epithelial Cell Size    1~10
Bare Nuclei                    1~10
Bland Chromatin                1~10
Normal Nucleoli                1~10
Mitoses                        1~10
Class                          2, 4 (2: benign, 4: malignant)

A. Discrete PSO
This study proposes a data mining methodology, called Discrete PSO (DPSO), whose fundamental concept is based on the standard PSO. It aims at creating a novel PSO in which each particle is coded in positive integers and has a feasible system structure. The algorithmic process of DPSO is introduced below.

1. Encoding
The encoding concept follows [12]; however, we revised the original process to make it more systematic and effective for solving this problem. Fig. 2 shows the form of encoding used in this study, where the number of selected feature variables is assumed to be m.

2. Fitness Function
For each individual, we calculate the accuracy, which serves as the fitness value of that individual, as illustrated in Table 2. Following [6, 12], we define the TP, TN, FP and FN parameters shown in Table 3. Accuracy is the number of malignant cases selected correctly plus the number of benign cases correctly not selected, divided by the total number of records. Besides accuracy, we simultaneously consider two other performance measures, sensitivity and specificity. The definitions of accuracy, sensitivity and specificity are given below.

$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}, \quad \text{Sensitivity} = \dfrac{TP}{TP + FN}, \quad \text{Specificity} = \dfrac{TN}{TN + FP}$
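These three measures follow directly from the TP, TN, FP and FN counts of Table 3. The sketch below shows one way to compute them; the function names, the choice of class 4 (malignant) as the positive class, and the example counts are assumptions made for illustration and are not results from this study.

```python
def confusion_counts(actual, predicted, positive=4):
    """Count (TP, TN, FP, FN), treating `positive` as the 'true' class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p != positive:
            tn += 1
        elif a != positive and p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity).
    Assumes at least one positive and one negative case, so no denominator is zero."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # share of actual positives that were detected
    specificity = tn / (tn + fp)   # share of actual negatives that were cleared
    return accuracy, sensitivity, specificity


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
print(classification_metrics(tp=45, tn=45, fp=5, fn=5))  # (0.9, 0.9, 0.9)
```

Using accuracy as the fitness value rewards both kinds of correct decision equally; sensitivity and specificity show how the errors are distributed between missed malignant cases and false alarms.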
[Fig. 2: The form of encoding. Fields: No. of features; Available variable 1; Sign (>, = or <); Threshold.]

Table 2: The fitness value of each individual
Individual        No. of variables   Variable   Sign    Threshold   Fitness
1st individual    A1                 iA11       iA12    iA13        ζA1
2nd individual    A2                 iA21       iA22    iA23        ζA2
…                 …                  …          …       …           …
Kth individual    AK                 iAK1       iAK2    iAK3        ζAK

Table 3: The definition of the TP, TN, FP and FN parameters
Actual state                    Predicted: classified as "true" (Positive)   Predicted: classified as "false" (Negative)
Class is "true" (Positive)      TP                                           FN
Class is "false" (Negative)     FP                                           TN

3. Individual update
The underlying principle of the traditional PSO is that the next position of each particle is a compromise between its current position, the best position in its history so far, and the best position among all existing particles. Eq. (2) decides the next position easily and efficiently for problems with continuous variables, but it is neither trivial nor well defined for problems with discrete variables or for sequencing problems. To overcome this drawback of PSO for discrete variables, a novel method of implementing the PSO procedure is proposed. It is based on the following update rule, applied once $C_w$, $C_p$ and $C_g$ are given, where $C_w = c_w$, $C_p = C_w + c_p$ and $C_g = C_p + c_g$.

The update process is as follows:
• Generate a random number ξ between 0 and 1.
• If 0 ≤ ξ < $C_w$, the original individual is kept; else if $C_w$ ≤ ξ < $C_p$, the original individual is replaced by Pbest; else if $C_p$ ≤ ξ < $C_g$, the original individual is replaced by Gbest; else if $C_g$ ≤ ξ ≤ 1, the original individual is replaced by a new, randomly generated individual.
• Repeat the above process until the termination criterion is met.

Updating the velocity and position is the most important part of PSO. It plays an important role in exchanging information among particles, leads to an effective combination of the partial solutions held by other particles, and speeds up the search early in the run. In the traditional PSO, each particle needs two equations, three random numbers, five multiplications and three summations to move to its next position. Thus, the time complexity is O(8nm) for the traditional PSO. In the proposed DPSO, however, the velocity is not needed: once $C_w$, $C_p$ and $C_g$ are given, only one random number, two multiplications and one comparison are required. Therefore, the proposed DPSO is more efficient than other PSOs. Fig. 3 shows the process of individual update.

[Fig. 3: The flowchart of individual update. Steps: Generate initial solution randomly; Recollect the Gbest and Pbest; Generate random variables; 0 ≤ ξ < Cw? (Yes: original individual is kept); Cw ≤ ξ < Cp? (Yes: replaced by Pbest); Cp ≤ ξ < Cg? (Yes: replaced by Gbest; No: generate new individual randomly); Stop? (Yes: algorithm end).]

Table 4: Comparison of DPSO and GAs
Method     No.  Decision rule                                                                              Training accuracy   Validation accuracy
DPSO       1    if (Uniformity of Cell Shape > 2 and Bland Chromatin > 2) then Malignant                   0.9356              0.9528 (Rule 1)
DPSO       2    if (Uniformity of Cell Shape < 10 and Bland Chromatin = 2) then Malignant                  0.9505              0.96995 (Rule 1 + Rule 2)
GAs [11]   1    if (Uniformity of Cell Size > 2.4467 and Uniformity of Cell Shape > 2.5096) then Malignant 0.9335              0.9614 (Rule 1)
GAs [11]   2    if (Bland Chromatin > 3.0526 and Clump Thickness > 3.1710) then Malignant                  0.9539              0.9654 (Rule 1 + Rule 2)
GAs [11]   3    if (Bare Nuclei > 3.0899 and Uniformity of Cell Size = 2) then Malignant                   0.956               0.96995 (Rule 1 + Rule 2 + Rule 3)

Table 5: The relative rate of experiment

V. Experimental results
For rule 1, the best accuracy obtained was 0.9528, with the classification rule “Uniformity of Cell Shape > 2 and Bland Chromatin > 2”.
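A rule of this kind is a list of (variable, sign, threshold) conditions, which is also the form an individual takes in the encoding. The sketch below shows, under stated assumptions, how such a rule could be scored by accuracy and how one DPSO update step chooses between keeping an individual, copying Pbest, copying Gbest, or generating a new individual. It is illustrative only and not the authors' implementation: the function names, the dictionary record layout, the sample records, and the threshold values 0.1, 0.4 and 0.9 used for Cw, Cp and Cg are all hypothetical.

```python
import operator
import random

# Comparison signs allowed in the rule encoding (variable, sign, threshold).
OPS = {">": operator.gt, "=": operator.eq, "<": operator.lt}

# Rule 1 as reported in the text.
RULE_1 = [("Uniformity of Cell Shape", ">", 2), ("Bland Chromatin", ">", 2)]


def rule_fires(rule, record):
    """True when every (variable, sign, threshold) condition holds."""
    return all(OPS[sign](record[var], thr) for var, sign, thr in rule)


def fitness(rule, records, malignant=4, benign=2):
    """Accuracy of 'if rule then malignant, else benign' over the records."""
    hits = sum((malignant if rule_fires(rule, r) else benign) == r["Class"]
               for r in records)
    return hits / len(records)


def update_individual(current, pbest, gbest, make_random, c_w, c_p, c_g, xi=None):
    """One update step: compare a random xi in [0, 1) against Cw < Cp < Cg."""
    if xi is None:
        xi = random.random()
    if xi < c_w:
        return current        # 0 <= xi < Cw: keep the original individual
    if xi < c_p:
        return pbest          # Cw <= xi < Cp: replace by Pbest
    if xi < c_g:
        return gbest          # Cp <= xi < Cg: replace by Gbest
    return make_random()      # Cg <= xi: replace by a new random individual


# Hypothetical records, not taken from the data set.
records = [
    {"Uniformity of Cell Shape": 7, "Bland Chromatin": 5, "Class": 4},
    {"Uniformity of Cell Shape": 1, "Bland Chromatin": 3, "Class": 2},
    {"Uniformity of Cell Shape": 3, "Bland Chromatin": 2, "Class": 4},
    {"Uniformity of Cell Shape": 4, "Bland Chromatin": 3, "Class": 2},
]
print(fitness(RULE_1, records))  # 0.5
```

The optional `xi` argument exists only so that each branch of the update can be exercised deterministically; in normal use it is left out and drawn at random.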
Because some data could not be classified correctly by rule 1, we adopted the method of [11], in which a new decision rule is explored. In this process, both the selected features of the training data that were not classified correctly and all the unselected features of the data are preserved for mining an additional rule. After repeating the process, we found Rule 2 to be “Uniformity of Cell Shape < 10 and Bland Chromatin = 2”. With these two rules, the accuracy improved to 96.995%. Table 4 shows the comparison with the GAs. We found that the proposed DPSO can enhance the accuracy by 1.28%. Tables 5 and 6 show that the Type I error of the GAs and of the DPSO is equivalent. According to these results, the proposed DPSO is better than the GAs, improving the Type II error by 4.67%. Moreover, we needed only two rules to reach the same accuracy that the GAs reached with three rules.

Accuracy   Sensitivity   Specificity
96.995%    100%          95.83%
TP=65, FN=0, FP=3, TN=165

Table 6: The results of diagnosis
Actual class                    Method                          Classified as I (without breast cancer)   Classified as II (with breast cancer)
I (without breast cancer)       DPSO                            165 (95.83%)                              7 (4.67%)
I (without breast cancer)       GAs (T.C. Chen et al., 2006)    146 (95.42%)                              7 (4.58%)
II (with breast cancer)         DPSO                            0 (0%)                                    65 (100%)
II (with breast cancer)         GAs (T.C. Chen et al., 2006)    0 (0%)                                    80 (100%)

VI. Conclusion
The best way to improve a breast cancer patient's chance of long-term survival is to detect the disease as early as possible. Data mining is the search for valuable information in large volumes of data [13]. Hence, the DPSO was proposed and successfully applied to classifying risk in the Wisconsin breast cancer data set. Based on the obtained results, our research used two rules to improve the accuracy to 96.995%, sensitivity to 100% and specificity to 95.83%.
Compared with previous research, we used only two rules where the GAs used three, and the resulting accuracy was equivalent. The high-quality results obtained in this research can serve as a reference for hospital decision making and for researchers. In future research, we will continue to improve the data mining process and apply it to other domains, with the aim of improving the quality of medical care.

References
[1] Gloria Phillips-Wren, Phoebe Sharkey, Sydney Morss Dy, “Mining lung cancer patient data to assess healthcare resource utilization”, Expert Systems with Applications, Available online 14 September 2007.
[2] Shital Shah, Andrew Kusiak, “Cancer gene search with data-mining and genetic algorithms”, Computers in Biology and Medicine, Vol. 37, Issue 2, February 2007, pp. 251-261.
[3] Sigurdur Ingvarsson, “Breast cancer: introduction”, Seminars in Cancer Biology, Vol. 11, Issue 5, October 2001, pp. 323-326.
[4] Hamid Mohamadi, Jafar Habibi, Mohammad Saniee Abadeh, Hamid Saadi, “Data mining with a simulated annealing based fuzzy classification system”, Pattern Recognition, Vol. 41, Issue 5, May 2008, pp. 1824-1833.
[5] S. Kaye, “New paradigms in the treatment of breast and colorectal cancer - an introduction”, European Journal of Cancer, Vol. 38, Supplement 2, February 2002, pp. 1-2.
[6] Dursun Delen, Glenn Walker, Amit Kadam, “Predicting breast cancer survivability: a comparison of three data mining methods”, Artificial Intelligence in Medicine, Vol. 34, Issue 2, June 2005, pp. 113-127.
[7] D. Delen, N. Patil, “Knowledge Extraction from Prostate Cancer Data”, Proceedings of the 39th Annual Hawaii International Conference on System Sciences, Vol. 5, Jan. 2006, pp. 92b-92b.
[8] Young U. Ryu, R. Chandrasekaran, Varghese S. Jacob, “Breast cancer prediction using the isotonic separation technique”, European Journal of Operational Research, Vol. 181, Issue 2, 1 September 2007, pp. 842-854.
[9] J. Kennedy, R.C. Eberhart, “Particle swarm optimization”, Proceedings of IEEE International Conference on Neural Networks, Piscataway, NJ, USA, 1995, pp. 1942-1948.
[10] J. Kennedy, R.C. Eberhart, “A discrete binary version of the particle swarm algorithm”, IEEE International Conference on Systems, Man, and Cybernetics: Computational Cybernetics and Simulation, Vol. 5, 12-15 October 1997, pp. 4104-4108.
[11] Ta-Cheng Chen, Tung-Chou Hsu, “A GAs based approach for mining breast cancer pattern”, Expert Systems with Applications, Vol. 30, Issue 4, May 2006, pp. 674-681.
[12] T.F. Sousa, A.P. Silva, A.F. Silva, “Particle swarm based data mining algorithms for classification tasks”, Parallel Computing, Vol. 30, No. 5-6, 2004, pp. 767-783.
[13] Xiangchun Xiong, Yangon Kim, Yuncheol Baek, Dae Wong Rhee, Soo-Hong Kim, “Analysis of breast cancer using data mining & statistical techniques”, Proceedings of the Sixth International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing and First ACIS International Workshop on Self-Assembling Wireless Networks, May 2005, pp. 82-87.

Aswini Kumar Mohanty received his B.E. in Computer Science from Marathwada University in 1991 and his M.Tech in Computer Science from Kalinga University, Chhattisgarh, in 2005. He is presently working as an Associate Professor in the CSE department of Gandhi Engineering College, Bhubaneswar. He has more than twenty years of experience in both teaching and industry. His research areas are image mining, data mining, soft computing and image processing.

Mrs. Swasati Sahoo received her B.E. in Computer Science from Utkal University and is pursuing an M.Tech in Computer Science under BPUT. She is currently working as an Asst. Professor in the CSE department of Gandhi Engineering College, BBSR. Her research interests are data mining, image processing and soft computing.

Mrs. Arati Pradhan obtained her MCA degree from Sambalpur University and then received her M.Tech in Computer Science from Fakir Mohan University. She has served as a faculty member in various institutes. She is presently with Gandhi Engineering College, Bhubaneswar, as an Asst. Professor, and is continuing her research work on the application of soft computing in data mining.

Dr. Saroj Kumar Lenka received his B.E. in CSE from Utkal University in 1994 and his M.Tech in 2005. He obtained his PhD from the Department of Computer Science, Berhampur University, in 2008. He is currently working as a Professor in the CSE department at MODI University, Rajasthan. His research areas are image processing, data mining and computer architecture.