Download Breast Cancer Assessment and Diagnosis using Particle Swarm

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Data analysis wikipedia , lookup

Forecasting wikipedia , lookup

Data mining wikipedia , lookup

Transcript
ISSN : 2229-4333(Print) | ISSN : 0976-8491(Online)
IJCST Vol. 2, Issue 3, September 2011
Breast Cancer Assessment and Diagnosis using Particle
Swarm Optimization
1
Aswini Kumar Mohanty, 2 Swasati Sahoo, 3Arati Pradhan, 4Dr. Saroj Kumar Lenka
SOA University, Bhubaneswar, Orissa, India
Biju Patnaik University of Technology, Bhubaneswar,Orissa,India
3
Dept. of Computer Science, Gandhi Engineering College, Bhubaneswar.Orissa,India
4
Dept. of Computer Science, Modi Univesity, Lakshmangarh, Rajasthan, India
1
2
Abstract
A binary Discrete Particle Swarm Optimization;BPSO/DPSO
was proposed and successfully applied to the classification risk
of Wisconsin-breast-cancer data set. Breast cancer is one of the
leading causes of death among the women in many parts of the
world. In 2007, approximately 178,480 women in the United
States will be found to have invasive breast cancer. However, the
medical technology has been improved and causing declination
of the mortality in breast cancer in the past decade. This has
been possible owing to earlier diagnosis and improved treatment.
Hence, the purpose of this study was to separate from a population
of patients who had and had not breast cancer. This study proposed
the methodology for data mining that the fundamental of concept
was in terms of the standard PSO called Discrete PSO. The
novel PSO in which each particle was coded in positive integer
numbers and has a feasible system structure. Based on the obtained
results, our research used the two rules to improve the accuracy
to 96.995%, sensitivity to 100% and specificity to 95.83%. The
results compared with the previous research that showed the
improvement of accuracy at the same number of rules. In this
research we have got high quality results which can be used as
reference for hospital decision making and research workers.
Keywords
Discrete Particle Swarm Optimization, Breast Cancer, Data
Mining.
I. Introduction
Until 21 century the cancer is still the main cause of death in all
worlds. According to statistics of World Health Organization;
WHO, we can find more than 10 million human to be diagnosed
the cancer and 6 million will die. For example, in the United States,
cancer is the second cause of death; it leads to approximately 25%
of all mortalities. Furthermore, the medical resource allocation and
utilization of particular interest in the case of cancer, it’s cost 263.3
billion per year [1, 2]. Breast cancer is the most common cancer
in women, with over 1 million new cases diagnosed annually.
It is estimated that approximately 500,000 women will die of
breast cancer each year, making this the second leading cause
of death from cancer in women, with a lifetime risk of the order
of 1/10. The molecular events relating to breast cancer biology
and pathogenesis had greatly increased over the last decade [3,
4]. However, the development of medicine technology makes
the mortality in breast cancer has declined in the past decade,
due to earlier diagnosis and improved treatment [2, 5, 6]. Hence,
it is important to improve the accuracy of earlier diagnosis.
The data mining is a sophisticated tool for classification tasks.
Nevertheless, health care related data mining is one of the most
rewarding and challenging area of application in data mining
and knowledge discovery. The challenges are due to the data
sets which are large, complex, heterogeneous, hierarchical, time
series and of varying quality. The available healthcare data sets are
fragmented and distributed in nature, thereby making the process
w w w. i j c s t. c o m
of data integration a highly challenging task [7]. Moreover, data
classification method has been applied in problems of medicine,
social science, management, and engineering [8]. Therefore,
this study proposed the data classification method using Binary/
Discrete Particle Swarm Optimization called BPSO/DPSO, and
adopted the classification rules to diagnosis whether the tumor of
patient is malignant or not.
II. Data mining for classification tasks
With the rapid growth of databases data mining has become
an increasingly important approach for data analysis. In past
years, research in molecular biology and molecular medicine
has accumulated enormous amounts of data. Such large amount
of information must be thoroughly analysed to gain a better
understanding of the underlying biological processes. Methods
of knowledge discovery and data mining are the best candidates
for this challenging task. The operations research community
has contributed significantly to this field, especially through the
formulation and solution of numerous data mining problems
as optimization problems, and several operations research
applications can also be addressed using data mining methods.
One of the important tasks in data mining is classification. In
classification, there is a target variable which is partitioned into
predefined groups or classes. The classification system takes
labeled data instances and generates a model that determines the
target variable of new data instances. The discovered knowledge is
usually represented in the form of if–then prediction rules, which
have the advantage of being a high level, symbolic knowledge
representation, contributing to the comprehensibility of the
discovered knowledge.
III. Introduction of the PSO
The Particle Swarm Optimization (PSO) technique is a population
based stochastic optimization technique first introduced by Eberhart
and Kennedy [9]. It belongs to the category of Swarm Intelligence
methods; it is also an evolutionary computation method inspired by
the metaphor of social interaction and communication such as bird
flocking and fish schooling. The values c1 ρ1 and c2 ρ 2 determine
the weights of the two parts, and c1 + c2 is usually limited to 4
[9]. To apply PSO, several parameters including the number of
population (m), cognition learning factor (c1), social learning
factor (c2), inertia weight (w), and the number of iterations or CPU
time should be properly determined. We conducted the preliminary
experiments, and the complete computational procedure of the
PSO algorithm can be summarized as in the following.
1) Initialize: Initialize parameters and population with random
position and velocities.
2) Evaluation: Evaluate the fitness value (the desired objective
function) for each particle.
3) Find the pbest: If the fitness value of particle i is better than
its best fitness value (pbest) in history, then set current fitness
value as the new pbest to particle i.
4) Find the gbest: If any pbest is updated and it is better than
International Journal of Computer Science and Technology 37
IJCST Vol. 2, Issue 3, September 2011
ISSN : 2229-4333(Print) | ISSN : 0976-8491(Online)
the current gbest, then set gbest to the current value.
5) Update velocity and position: Update velocity and move to
the next position according to Eqs. (1) and (2).
6) Stopping criterion: If the number of iterations or CPU time
are met, then stop; otherwise go back to step 2.
Since PSO was first introduced to optimize various continuous
nonlinear functions by Kennedy and Eberhart [9], it has been
successfully applied to a wide range of applications [9, 10].
However, the major obstacle of successfully applying a PSO is
its continuous nature. To remedy this drawback, Kennedy and
Eberhart [9, 10] developed a discrete version of PSO (BPSO).
In BPSO, the particle is characterized by a binary. solution
representation and the velocity must be transformed into the
change of probability for each binary dimension to take a value
of one. Basically, BPSO updates the velocity according to Eq.
(1) but particles are represented by binary variables and without
using w. Furthermore, the velocity is constrained to the interval
[0, 1] by using the following sigmoid transformation:
(1)
(2)
(3)
Where
denotes the probability of bit taking 1. To avoid
approaching 0 or 1, a constant Vmax is used to limit the
range of velocity. Typically, Vmax is often set at 4 such that
v t ∈ − 4,4 and
ij
[
]
(4)
Each bit of particles, at each time step, changes its current position
according to Eq.(5) instead of Eq.(2) based on Eq.(3) as follows
[10]:
(5)
This paper aims at creating a novel PSO in which each particle
is coded in positive integer numbers and has a feasible system
structure. The proposed DPSO for data mining can overcome the
drawbacks of GAs that [11] proposed. The details are presented
below.
IV. Study Method
This study developed a process for data mining which we adopted
the methodology that the fundamental of concept was in terms of
the standard PSO called Discrete PSO. In the case of the Wisconsin
breast cancer data set tested in this study. The data set included 9
features and 1 class variable, and we filled the values which appear
the most frequently in that feature to substitute for the miss data.
Besides the class variable, the value of 9 features is between 1 and
10, with a higher value corresponding to a more unusual situation
of the tumour such as Table 1. The data set contains 699 points,
458 were diagnosed to be benign (class=2) and 241 malignant
(class=4). Further, we divided the training data set which contain
466 patients’ records and validation data set which contain 233
patients’ records from original data set randomly. The training
38 International Journal of Computer Science and Technology
data set are used for learning the breast cancer pattern and then
generating the decision rule(s), and the validation data set which
have not been used to develop the system are used to validate the
results. Fig. 1 shows the flowchart of this study.
Original data set
Training data set
Update the training data set
Validation data set
Discrete-PSO
Classification rule
DPSO operations
(Particle update)
Meet the accuracy
predefined
No
The data of unable
classify correctly
Yes
Discovery of
decision multirules
Fig.1: The flowchart of this study
Table 1: The feature variable of data set
Feature variable
Domain
Clump Thickness
1~10
Uniformity of Cell Size
1~10
Uniformity of Cell Shape
1~10
Marginal Adhesion
1~10
Single Epithelial Cell Size
1~10
Bare Nuclei
1~10
Bland Chromatin
1~10
Normal Nucleoli
1~10
Mitoses
1~10
Class
2,4
2:benign, 4:malignant
A. Discrete PSO
This study proposed the methodology for data mining that the
fundamental of concept was in terms of the standard PSO called
Discrete PSO;DPSO. This paper aims at creating a novel PSO in
which each particle is coded in positive integer numbers and has
a feasible system structure. We would introduce the algorithm
process of DPSO:
1. Encoding
The concept of encoding was according to [12]; however we
revised the original process to be more systematized and effective
in the light of solving the problem. Further, Fig. 2 show the form
of encoding in this study, we assume that the number of selected
feature variables to be m.
w w w. i j c s t. c o m
IJCST Vol. 2, Issue 3, September 2011
ISSN : 2229-4333(Print) | ISSN : 0976-8491(Online)
2. Fitness Function
According to the scenario of the individual, we can calculate
the accuracy that represents the fitness value of this individual
such as Table 2. In terms of relative reference, we define the TP,
TN, FP and FN rate parameters to show in the Table 3 [6, 12].
The calculation of accuracy is the amount of the malignant being
select correctly plus the amount of the benign being not select
that divide the amount of data. Besides accuracy, we consider
the other two performance measures including sensitivity and
specificity simultaneously. The definition of accuracy, sensitivity
and specificity are showed below.
where
C w = cw
C p = Cw + c p
Cg = C p + cg
No of features
Available
Variable 1
> or =
or <
Threshold
w
w
g
g
Table 2.The fitness value of each individual
NO. of variable Variable Sign Threshold Fitness
1st individual
A1
iA11
iA12
iA13
ζA1
2nd individual
A2
iA21
iA22
iA23
ζA2
…
…
…
…
…
…
Kth individual
AK
iAK1
iAK2
iAK3
ζAK
Table 3.The definition of TP, TN, FP and FN rate parameters
Predicted patient state
Actual state
p
p
Fig.2: The form of encoding
The process of update would describe below:
• Generate a random variable called î that is between 0 and
1.
• If 0 ≤ î < C is true then the original individual will be kept, else
if C ≤ î < C is true then the original individual will be replaced
by Pbest, else if C ≤ î < C is true then the original individual will
be replaced by Gbest, else if C ≤ î ≤ 1 is true then the original
individual will be replaced by new individual which generate
randomly.
• Repeat the above process until the terminative criteria has
been meeting.
Classified as
“true”(Positive)
Classified as
“false”(Negative)
Class is “true”
(Positive)
TP
FN
Class is “false”
(Negative)
FP
TN
Updating the velocity and positions are the most important parts
of PSO. They play an important role in exchanging information
among particles. It leads to an effective combination of partial
solutions in other particles and speeds up the search procedure
early in the generation. In the traditional PSO, each particle needs
to use more two equations, generate three random numbers, five
multiplications, and three summations to move to its next position.
Thus, the time complexity is O (8nm) for the traditional PSO.
However, there is no need to use the velocity, and one random, two
multiplications, and one comparison are needed in the proposed
DPSO after C w , C p and C are given. Therefore, the proposed DPSO
is more efficient than other PSOs. Fig. 3 shows the process of
individual update.
g
3. Individual update
The underlying principle of the traditional PSO is that the next
position of each particle is a compromise of its current position,
the best position in its history so far, and the best position among
all existing particles. Eq. (2) is very easy and efficient to decide
next positions for the problems with continuous variables, but not
trivial and well-defined for the problems with discrete variables
and sequencing problems. To overcome the drawback of PSO for
discrete variables, a novel method to implement PSO procedure
is proposed based on the following equation after C w , C , and C g
are given:
p
w w w. i j c s t. c o m
International Journal of Computer Science and Technology 39
IJCST Vol. 2, Issue 3, September 2011
ISSN : 2229-4333(Print) | ISSN : 0976-8491(Online)
Table 4 :Comparison of DPSO and Gas
Generate initial
solution randomly
Recollect the Gbest
and Pbest
0 ≤ ξ < Cw
Yes
No
Cw ≤ ξ < C p
Yes
0.9356
Rule1
0.9528
2
if (Uniformity of Cell
Shape <10 and Bland
Chromatin =2) then
Malignant
0.9505
Rule
1+Rule 2
0.96995
1
if (Uniformity of
Cell Size>2.4467 and
Uniformity of Cell
Shape>2.5096) then
Malignant
0.9335
Rule1
0.9614
2
if (Bland
Chromatin>3.0526
and Clump
Thickness>3.1710)
then Malignant
0.9539
Rule
1+Rule 2
0.9654
3
if (Bare
Nuclei>3.0899
and Uniformity of
Cell Size=2) then
Malignant
0.956
Rule
1+Rule
2 +Rule3
0.96995
Replaced by Pbest
GAs
[11]
No
C p ≤ ξ < Cg
1
Original individual
will be kept
No
Yes
Replaced by Gbest
No
Training Validation
if (Uniformity of Cell
Shape >2 and Bland
Chromatin >2) then
Malignant
DPSO
Generate random
variables
Accuracy
No. Decision rules
Generate new
individual randomly
Table 5 : The relative rate of experiment
Stop
Yes Algorithm end
Fig.3.The flowchart of individual update
V. Experimental results
In rule 1, we could derive the best accuracy to be 0.9528 that the
classification rule was “Uniformity of Cell Shape >2 and Bland
Chromatin >2”. In our study, the data could not be classified
correctly by rule 1, we adopted the method of [11] that new decision
rule is to be explored. In this process, both the selected feature of
training data not being classified correctly and all the unselected
feature of data are preserved for mining in an additional rule. After
the repeated process, we found that Rule 2 is “Uniformity of Cell
Shape <10 and Bland Chromatin =2”. So far, this study utilized
two rules to improve the accuracy to 96.995%. Table 4 showed the
comparison results with GAs. We found that the proposed DPSO
can enhance the accuracy by 1.28%. Table 5 and 6 showed that the
performance of Type I error in GAs and DPSO to be equivalent.
According to the above results, the proposed DPSO had shown
to be better than the GAs in enhancing the performance of Type
II error by 4.67%. However, we merely used two rules that had
the same accuracy for the GAs that used three rules.
Accuracy
Sensitivity
Specificity
96.995%
100%
95.83%
TP=65, FN=0, FP=3, TN=165
Table 6 : The results of diagnose
Classified class
I(without
breast
cancer)
II(with
breast
cancer)
DPSO
165(95.83%)
7(4.67%)
GAs
(T.C.Chen et
al., 2006)
146(95.42%)
7(4.58%)
DPSO
0(0%)
65(100%)
GAs
(T.C.Chen et
al., 2006)
0(0%)
80(100%)
Actual class
I(without
breast cancer)
II(with breast
cancer)
VI. Conclusion
The best way to improve a breast cancer victim’s chance of longterm survival is to detect it as early as possible. Data mining
is the search for valuable information in large volumes of data
[13]. Hence, the DPSO was proposed and successfully applied to
40 International Journal of Computer Science and Technology
w w w. i j c s t. c o m
IJCST Vol. 2, Issue 3, September 2011
ISSN : 2229-4333(Print) | ISSN : 0976-8491(Online)
the classification risk of Wisconsin-breast-cancer data set. Based
on the obtained results, our research used two rules to improve
the accuracy to 96.995%, sensitivity to 100% and specificity to
95.83%. The results compared with the previous research to show
that we merely used two rules to compare with the GAs that used
three rules, and the result of accuracy was equivalent. In this
research we have got high quality results which can be used as
reference for hospital decision making and research workers. In
future research, we not only continued to ameliorate the process of
data mining but applied to the various domains so that improved
the medical quality.
References
[1] Gloria Phillips-Wren, Phoebe Sharkey, Sydney Morss Dy,
“Mining lung cancer patient data to assess healthcare resource
utilization”, Expert Systems with Applications, Available
online 14 September 2007.
[2] Shital Shah, Andrew Kusiak, “Cancer gene search with datamining and genetic algorithms”, Computers in Biology and
Medicine, Vol. 37, Issue 2, February 2007, pp. 251-261.
[3] Sigurdur Ingvarsson, “Breast cancer: introduction”, Seminars
in Cancer Biology, Vol. 11, Issue 5, October 2001, pp. 323326.
[4] Hamid Mohamadi, Jafar Habibi, Mohammad Saniee Abadeh,
Hamid Saadi, “Data mining with a simulated annealing based
fuzzy classification system”, Pattern Recognition, Volume
41, Issue 5, May 2008, pp. 1824-1833.
[5] S. Kaye, “New paradigms in the treatment of breast and
colorectal cancer—an introduction”, European Journal of
Cancer, Vol. 38, Supplement 2, February 2002, pp. 1-2.
[6] Dursun Delen, Glenn Walker, Amit Kadam, “Predicting
breast cancer survivability: a comparison of three data mining
methods”, Artificial Intelligence in Medicine, Vol. 34, Issue
2, June 2005, pp. 113-127.
[7] Delen, D., Patil, N., “Knowledge Extraction from Prostate
Cancer Data”, Proceedings of the 39th Annual Hawaii
International Conference on System Sciences, Vol. 5, Jan.
2006 pp. 92b - 92b.
[8] Young U. Ryu, R. Chandrasekaran, Varghese S. Jacob, “Breast
cancer prediction using the isotonic separation technique”,
European Journal of Operational Research, Vol. 181, Issue
2, 1 September 2007, pp. 842-854.
[9] J. Kennedy, R.C. Eberhard, “Particle swarm optimization”,
Proceedings of IEEE International Conference on Neural
Networks, Piscataway, NJ, USA, 1995, pp. 1942-1948.
[10]J. Kennedy, R.C. Eberhart, “A discrete binary version of the
particle swarm algorithm”, Systems, Man, and Cybernetics,
1997 Computational Cybernetics and Simulation, IEEE
International Conference, Vol. 5, No. 12-15, 1997/10, pp.
4104-4108.
[11] Ta-Cheng Chen, Tung-Chou Hsu, “A GAs based approach
for mining breast cancer pattern”, Expert Systems with
Applications, Vol. 30, Issue 4, May 2006, Pages 674-681.
[12]T.F. Sousa, A.P. Silva, A.F. Silva, “Particle swarm based
data mining algorithms for classification tasks”, Parallel
Computing, Vol. 30, No. 5-6, 2004, pp. 767-783.
[13]Xiangchun Xiong, Yangon Kim, Yuncheol Baek, Dae
Wong Rhee, Soo-Hong Kim, “Analysis of breast cancer
using data mining & statistical techniques”, Proceedings
Sixth International Conference on Software Engineering,
Artificial Intelligence, Networking and Parallel/Distributed
Computing and First ACIS International Workshop on SelfAssembling Wireless Networks, May 2005 pp. 82 – 87.
w w w. i j c s t. c o m
Aswini Kumar mohanty has received BEin Comp.
Sc, from Marathawada Uni- versity,in 1991 and
M.tech in Comp.Sc. from Kalinga University ,
Chhatis -hgarh in 2005. Presently he is .working
as associate Professor in Gandhi engineering
college,CSE deptt. bhubaneswar He has more than
twenty years of experience in both teaching and Industry. His
research area is image mining ,data mining, soft computing and
image processing..
Mrs. Swasati Sahoo has passed B.E. in Comp.
Science from Utal Univ-ersity and pursuing
M.Tech in Computer Science under BPUT .She
is currently working as Asst. Professor in deptt.
Of CSE of Gandhi engineering college BBSR.
Research interest is data mining, image processing
and soft computing.
Mrs. Arati Pradhan obtained her MCA degree from
Sambal-pur University. Then she received her
M.Tech in Comp Sc. from Fakir Mohan University.
She has served as faculty member in various
institutes. Presently she is with Gandhi Engineering
College, Bhubaneswar, as Asst. Professor. Currently
she is continuing her research work on Application of Soft
Computing in Data Mining.
Dr. Saroj kumar lenka Passed his B.E. CSE in
1994 from Utkal University and M.tech in 2005.
He obtained his PHD from Berhampur University
in 2008 from deptt of Computer Scien- ce.Currently
he is working as a professor in deptt. of CSE at
MODI University, Rajstan. His area of research is
image processing, data mining and coputer architecture.
.
International Journal of Computer Science and Technology 41