Vol 04, Special Issue01; 2013
http://ijpaper.com/
PUBLICATIONS OF PROBLEMS & APPLICATION IN ENGINEERING RESEARCH - PAPER
CSEA2012
ISSN: 2230-8547; e-ISSN: 2230-8555
CLASSIFICATION AND CLUSTERING MEDICAL
DATASETS BY USING ARTIFICIAL NEURAL NETWORK
MODELS
B.V.S DHEERAJ REDDY1, MOUNIKA BOOREDDY2
1 Department of Computer Science and Engineering, Sastra University
2 Department of Information and Communication Technology, Sastra University
[email protected], [email protected]
1. INTRODUCTION
An Artificial Neural Network (ANN) is an information-processing paradigm inspired by the way the human brain
processes information. Artificial neural networks are collections of mathematical models that represent some of
the observed properties of biological nervous systems and draw on the analogies of adaptive biological learning.
The key element of an ANN is its topology. An ANN consists of a large number of highly interconnected processing
elements (nodes) that are tied together with weighted connections (links). Learning in biological systems
involves adjustments to the synaptic connections that exist between the neurons. This is true for ANNs as well.
Learning typically occurs by example, through training or exposure to a set of input/output data (patterns), where
the training algorithm adjusts the link weights. The link weights store the knowledge necessary to solve specific
problems.
Neural networks provide a new suite of nonlinear algorithms for feature extraction (using hidden layers) and
classification (e.g. multilayer perceptrons). Their main characteristics are the ability to learn complex
nonlinear input-output relationships, the use of training procedures, and the capacity to adapt themselves to
the data. The power of artificial neural networks resides in their capacity to generate decision regions of
arbitrary shape. The main goal of research in the field of artificial neural networks is to understand and emulate the
working principles of biological neural systems.
Some of the benefits of neural networks are as follows:
• Ability to process massive input data in parallel, owing to the parallel architecture
• Simulation of diffuse medical reasoning
• Higher performance when compared with statistical approaches
• Self-organizing ability and learning capability
• Easy knowledge base updating
2. NEURAL NETWORK IN MEDICAL FIELD
Neural networks are known to produce highly accurate results in practical applications. They have
been successfully applied to a variety of real-world classification tasks in industry, business and science. They
have also been applied to various areas of medicine, such as diagnostic aids, biochemical analysis,
image analysis and drug development, and they are used in the analysis of medical images from a variety of imaging
modalities. Earlier works in Clinical Diagnosis, Image Analysis and Signal Analysis are presented in the
following sections.
2.1 Clinical Diagnosis
A research group at University Hospital, Lund, Sweden tested whether neural networks trained to detect acute
myocardial infarction could lower the diagnostic error rate. They trained the network using ECG measurements from 1120
patients who had suffered a heart attack and 10,452 healthy persons with no history of heart attack. The
performance of the neural networks was then compared with that of a widely used ECG interpretation program
and that of an experienced cardiologist.
An Entropy Maximization Network (EMN) has been applied to the prediction of metastases in breast cancer
patients [1]. The authors used the EMN to construct discrete models that predict the occurrence of axillary lymph node
metastases in breast cancer patients, based on characteristics of the primary tumor alone.
An artificial neural network has been used to predict the occurrence of coronary artery disease. Serum lipid
profile and clinical events of 162 patients over a period of 10 years served as input data to the network [2].
In [3], the authors carried out a study to investigate the effectiveness of radial basis function networks as an
alternative data-driven diagnostic technique for myocardial infarction. The study included clinical data from 500
cases.
A Bayesian posterior probability distribution is used in neural network input selection. The network is
designed to assist inexperienced gynecologists in the pre-operative discrimination between benign and
malignant ovarian tumors [4].
Serum electrophoresis is used as a standard laboratory medical test for the diagnosis of several pathological
conditions such as liver cirrhosis or nephritic syndrome. A multilayer perceptron trained using the Back-Propagation
learning algorithm and a Radial Basis Function network were used to implement an effective
diagnostic aid system [5].
2010-2013 - IJPAPER
Indexing in Process - EMBASE, EmCARE, Electronics & Communication Abstracts, SCIRUS, SPARC, GOOGLE Database, EBSCO, NewJour, Worldcat,
DOAJ, and other major databases etc.,
2.2 Image Analysis
In [6], the authors presented examples of filtering, segmentation and edge detection techniques using cellular
neural networks to improve resolution in brain tomographies, and improve global frequency correction for the
detection of micro calcifications in mammograms.
In [7], the authors trained different neural networks to recognize regions of interest (ROIs) corresponding to
specific organs within electrical impedance tomography images (EIT) of the thorax. The network allows
automatic selection of optimal pixels based on the number of images, over a sample period, in which each pixel
is classified as belonging to a particular organ.
In [8], the authors compared neural networks (cascade correlation) and fuzzy clustering techniques for
segmentation of MRI of the brain. Both approaches were applied to intelligent diagnosis.
In [9], the authors implemented a self-organizing network multilayer adaptive resonance architecture (MARA)
for the segmentation of CT images of the heart. Similarly, [10] implemented a two layer neural network for
segmentation of CT images of the abdomen.
2.3 Signal Analysis
A knowledge-based neural network (KBANN) is implemented for classification of phosphorus (31P) magnetic
resonance spectra (MRS) from normal and cancerous breast tissues [11].
In [12], the authors reported the results from the application of tools for synthesizing, optimizing and analyzing
neural networks to an Electrocardiogram (ECG) Patient Monitoring task. A neural network was synthesized
from a rule-based classifier and optimized over a set of normal and abnormal heart beats.
In [13], the purpose of the study was to identify and characterize clusters in a heterogeneous breast cancer
computer-aided diagnosis database. Identification of subgroups within the database could help elucidate clinical
trends and facilitate future model building. A self-organizing map (SOM) was used to identify clusters in a large
(2258 cases), heterogeneous computer-aided diagnosis database based on mammographic findings (BI-RADS™) and patient age.
Analysis of neural networks as ECG analyzers also shows that they are capable of dealing with the ambiguous
nature of the ECG signal [14]. Silipo and Marchesi use static and recurrent neural network (RNN) architectures for
classification tasks in ECG analysis for arrhythmia, myocardial ischemia and chronic alterations.
3. PROPOSED MODEL
The aim of this paper is to present the application of ANNs in medical diagnosis using three datasets:
breast cancer, heart disease and diabetes. These data sets are obtained from the UCI ML repository,
http://www.ics.uci.edu. Two important techniques, classification and clustering, are applied to these three
medical datasets by constructing ANN models.
3.1 Neural Networks Training
The phases involved in solving classification and clustering problems with neural network techniques are
designing, training and testing. The three aspects involved in the construction of neural networks are:
1. Structure: The architecture and topology of the neural network.
2. Encoding: The method of changing weights (training).
3. Recall: The method and capacity to retrieve information.
The structure relates to how many layers a network should contain and what their functions are in relation to
input, output, or feature extraction. Encoding refers to the paradigm used for determining and changing the
weights on the connections between neurons. The performance of the network can be analyzed by using recall.
3.2 Classification on Medical Datasets
Medical datasets are rich with hidden information that can be used for making intelligent medical diagnoses.
Classification and prediction are two forms of data analysis that can be used to extract models describing
important data classes or to predict future data trends. Whereas classification predicts categorical labels,
prediction models continuous-valued functions. Data classification is a two-step process.
In the first step, a model is built describing a predetermined set of data classes or concepts. The model is
constructed by analyzing data samples described by attributes. Each sample is assumed to belong to a predefined
class, as determined by one of the attributes, called the class label attribute. In the context of classification, these
data samples are also referred to as training samples and are randomly selected from the sample population.
Classification is often referred to as supervised learning because the classes are determined before examining
the data. Classification algorithms require that the classes be defined based on data attribute values. They often
describe these classes by looking at the characteristics of data already known to belong to the classes. Pattern
recognition is a type of classification where an input pattern is classified into one of several classes based on its
similarity to these predefined classes.
In the second step, the model is used for classification. First, the predictive accuracy of the model (or classifier) is
estimated. The holdout method is a simple technique that uses a test set of class-labeled samples. These samples
are kept separate from the training samples. The accuracy
of a model on a given test set is the percentage of test set samples that are correctly classified by the model.
For each test sample, the known class label is compared with the learned model’s class prediction for that
sample.
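The holdout estimate described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names `holdout_split` and `accuracy`, the 30% test fraction, and the toy threshold classifier are all assumptions made for the example.

```python
import random

def holdout_split(samples, test_fraction=0.3, seed=0):
    """Randomly split class-labeled samples into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(classifier, test_set):
    """Fraction of test samples whose predicted label equals the known label."""
    correct = sum(1 for features, label in test_set if classifier(features) == label)
    return correct / len(test_set)

def threshold_rule(features):
    """Hypothetical stand-in for a trained model: class 1 when the feature exceeds 5."""
    return 1 if features[0] > 5 else 0

# Toy data (assumed): one feature in 1..10, labeled 1 when the feature exceeds 5.
samples = [([x], 1 if x > 5 else 0) for x in range(1, 11)]
train, test = holdout_split(samples)
print(len(train), len(test), accuracy(threshold_rule, test))
```

Multiplying the returned fraction by 100 gives the percentage of correctly classified test samples.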
3.3 Feed Forward Neural Networks with Back-Propagation
The invention of the Back-Propagation algorithm has played a large part in the resurgence of interest in ANNs.
Back-Propagation is a systematic method for training multilayer ANNs, and the feed-forward network is a very popular
network model. The Back-Propagation learning algorithm consists of two passes, namely a forward pass and a
backward pass.
In the forward pass, an input vector is applied to the network and the output of each neuron in the output layer is
calculated. During this pass the synaptic weights of the network are all fixed. The error is the difference between
the calculated output and the desired output. In the backward pass, the synaptic
weights are all adjusted in accordance with the error-correction rule: the error is back-propagated through the
network against the direction of the synaptic connections. Momentum and variable learning rates are used to
improve the Back-Propagation algorithm. Momentum allows the network to respond not only to the local gradient but
also to recent trends in the error surface. With momentum, the network can ignore small
features in the error surface.
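The two passes and the momentum term can be illustrated with a small sketch. It is not the implementation used in this paper: the XOR toy data, the network size (three hidden neurons), the sigmoid activation, the learning rate of 0.5 and the momentum of 0.9 are all assumptions chosen for illustration. Each weight change is the learning rate times the current gradient term plus the momentum times the previous change of that weight.

```python
import math
import random

def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Forward pass: weights stay fixed; returns hidden activations and the output."""
    xb = x + [1.0]  # append a constant bias input
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output

def mean_squared_error(data, w_hidden, w_out):
    return sum((t - forward(x, w_hidden, w_out)[1]) ** 2 for x, t in data) / len(data)

def train(data, n_hidden=3, lr=0.5, momentum=0.9, epochs=3000, seed=1):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    v_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]  # previous weight changes
    v_out = [0.0] * (n_hidden + 1)
    for _ in range(epochs):
        for x, target in data:
            hidden, out = forward(x, w_hidden, w_out)
            # Backward pass: output error signal, then propagated back to hidden neurons.
            delta_out = (target - out) * out * (1.0 - out)
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(n_hidden)]
            # Weight change = lr * gradient term + momentum * previous change.
            for j, h in enumerate(hidden + [1.0]):
                v_out[j] = lr * delta_out * h + momentum * v_out[j]
                w_out[j] += v_out[j]
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    v_hidden[j][i] = lr * delta_hidden[j] * v + momentum * v_hidden[j][i]
                    w_hidden[j][i] += v_hidden[j][i]
    return w_hidden, w_out

xor_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
w_hidden, w_out = train(xor_data)
print("mean squared error after training:", mean_squared_error(xor_data, w_hidden, w_out))
```

Calling `train` with `epochs=0` returns the untrained weights for the same seed, which makes it easy to compare the error before and after training.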
3.3.1 Classification of Breast Cancer Data Set
The Wisconsin Breast Cancer dataset was initially created to conduct experiments intended to demonstrate the
usefulness of automating fine needle aspiration cytological diagnosis. It contains 699 instances of cytological
analysis of fine needle aspirates from breast tumors. Each case comprises 11 attributes: a case ID, nine cytology
attributes (normalized, with values in the range 1-10) and a benign/malignant class attribute. The attribute information is
given in Table 1.
The values are normalized in the form of zeros and ones.
Table 1. Breast Cancer Data Set Attributes
No.   Attribute                      Domain
1.    Sample code number             Id number
2.    Clump thickness                1 – 10
3.    Uniformity of cell size        1 – 10
4.    Uniformity of cell shape       1 – 10
5.    Marginal adhesion              1 – 10
6.    Single epithelial cell size    1 – 10
7.    Bare nuclei                    1 – 10
8.    Bland chromatin                1 – 10
9.    Normal nucleoli                1 – 10
10.   Mitoses                        1 – 10
11.   Class                          2, 4 (2 for benign, 4 for malignant)
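The text does not spell out how the values are normalized before they are presented to the network. One plausible reading, shown here purely as an assumption, is min-max scaling of each cytology value from the range 1-10 to the interval [0, 1], with the class codes 2 and 4 mapped to 0 and 1 and the sample code number dropped. The function name `normalize_record` and the example record are hypothetical.

```python
def normalize_record(record):
    """record: [sample code, nine cytology values in 1..10, class code 2 or 4]."""
    # Assumed min-max scaling: 1 maps to 0.0 and 10 maps to 1.0.
    features = [(value - 1) / 9.0 for value in record[1:10]]
    # Assumed label encoding: benign (2) -> 0, malignant (4) -> 1.
    label = 0 if record[10] == 2 else 1
    return features, label

# Hypothetical record laid out in the attribute order of Table 1.
example = [1, 10, 1, 4, 1, 1, 1, 1, 1, 1, 4]
features, label = normalize_record(example)
print(features, label)
```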
3.3.2 Heart Disease Data Set
The heart disease data set concerns the diagnosis of whether or not a person has heart disease. It contains 414 instances, 13
attributes and a class attribute. A class value of 0 indicates a normal person, a value of 1 indicates a first stroke, a
value of 2 indicates a second stroke, and a value of 3 indicates end of life. The attribute description of this data set
is given in Table 2.
Table 2. Heart Disease Data Set Attributes
S.No.  Attribute  Description                                            Range
1.     Age        Age in years                                           Continuous
2.     Sex        (1=male; 0=female)                                     0,1
3.     Cp         --value 1: typical angina                              1,2,3,4
                  --value 2: atypical angina
                  --value 3: non-anginal pain
                  --value 4: asymptomatic
4.     Trestbps   Resting blood pressure (in mm Hg)                      Continuous
5.     Chol       Serum cholesterol in mg/dl                             Continuous
6.     Fbs        (Fasting blood sugar >120 mg/dl)                       0,1
                  (1=true; 0=false)
7.     Restecg    Resting electrocardiographic results                   0,1,2
                  --value 0: normal
                  --value 1: having ST-T wave abnormality (T wave
                    inversions and/or ST elevation or depression
                    of >0.05 mV)
                  --value 2: showing probable or definite left
                    ventricular hypertrophy by Estes’ criteria
8.     Thalach    Maximum heart rate achieved                            Continuous
9.     Exang      Exercise induced angina (1=yes; 0=no)                  0,1
10.    Oldpeak    ST depression induced by exercise relative to rest     Continuous
11.    Slope      The slope of the peak exercise ST segment              1,2,3
                  --value 1: up sloping
                  --value 2: flat
                  --value 3: down sloping
12.    Ca         Number of major vessels (0-3) colored by fluoroscopy   Continuous
13.    Thal       Normal, fixed defect, reversible defect                3,6,7
14.    Class      0 for normal person, 1 for first stroke,               0,1,2,3
                  2 for second stroke, 3 for end of life
3.3.3 Diabetes Data Set
The data set consists of female patients, all at least 21 years old and of Pima Indian heritage. The diabetes
data set concerns the diagnosis of whether or not a person is diabetic. It contains 768 instances, 8 attributes and a class
attribute. A class value of 0 indicates a non-diabetic person and a value of 1 indicates a diabetic person. The attribute
description of this data set is given in Table 3.
Table 3. Diabetes Data Set Attributes
S. No.  Attribute
1.      Number of times pregnant
2.      Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3.      Diastolic blood pressure (mm Hg)
4.      Triceps skin fold thickness (mm)
5.      2-Hour serum insulin (mu U/ml)
6.      Body mass index (weight in kg/(height in m)^2)
7.      Diabetes pedigree function
8.      Age (years)
9.      Class variable (0 or 1)
3.3.4 Experimental Results from Classification
Classification accuracy is higher for the multilayer network than for the single-layer network
because the hidden-layer neurons act as feature extractors. The experimental results for the three
data sets are given in Table 4.
Table 4. Results of Classification Experiments on Three Data Sets
                                                  Accuracy of Classification
Dataset          Attributes  Instances  Classes   Single layer  Multi layer
Breast Cancer    9           699        2         72%           80.3%
Heart Disease    13          414        4         70%           81.4%
Diabetes         8           768        2         69.4%         78.2%
4. CLUSTERING OF MEDICAL DATA SETS
Clustering is a multivariate analysis technique widely adopted in medical diagnosis studies and pattern
recognition. By examining the underlying structure of a dataset, cluster analysis aims to group data into
separate clusters according to their characteristics. The clustering is performed such that samples held within a
cluster are as similar as possible, and those found in different clusters are as dissimilar as possible.
In machine learning, clustering is an example of unsupervised learning. Unlike classification, clustering and
unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason,
clustering is a form of learning by observation, rather than learning by examples.
One of the important applications of neural networks is the clustering of medical data for clinical diagnosis. In this
paper, the neural network model used for clustering is the Kohonen Self-Organizing Map.
4.1 Self-Organizing Maps
Dimensionality reduction concomitant with preservation of topological information is common in normal
human subconscious information processing. We routinely compress information by extracting relevant facts
and thereby develop reduced representations of impinging information while retaining essential knowledge. A
good example is biological vision, where three-dimensional visual images are routinely mapped onto a
two-dimensional retina and information is preserved in a way that permits perfect visualization of a
three-dimensional world. The self-organizing feature map is a neural network model based on Kohonen’s
discovery that topological information prevalent in high-dimensional input data can be transformed onto a one-
or two-dimensional layer of neurons.
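A minimal sketch of a one-dimensional self-organizing map illustrates the idea. It is not the network used in the experiments: the number of neurons, the Gaussian neighborhood function, the linearly decaying learning rate and radius, and the two toy clusters are assumptions made for the example. For each input, the closest neuron (the best matching unit) is found, and that neuron and its neighbors on the line are moved toward the input.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x (squared Euclidean distance)."""
    return min(range(len(weights)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)))

def train_som(data, n_units=4, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays over time
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighborhood shrinks over time
        for x in data:
            bmu = best_matching_unit(weights, x)
            for j in range(n_units):
                # Neurons close to the winner on the 1-D map are updated more strongly.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                weights[j] = [w + lr * h * (v - w) for w, v in zip(weights[j], x)]
    return weights

# Two toy clusters in a 2-D input space (assumed data).
cluster_a = [[0.1, 0.1], [0.15, 0.2], [0.2, 0.1]]
cluster_b = [[0.9, 0.9], [0.85, 0.8], [0.8, 0.9]]
weights = train_som(cluster_a + cluster_b)
for x in cluster_a + cluster_b:
    print(x, "-> neuron", best_matching_unit(weights, x))
```

After training, inputs that are close in the input space tend to be mapped to the same or to neighboring neurons, which is what allows the map to be read as a clustering.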
4.3 Experimental Results from Cluster Analysis
Experiments were conducted on the three medical datasets described above using the self-organizing neural
network model for cluster analysis. The instances in each dataset are divided into training vectors and
test vectors, and the results are shown in the following tables.
Table 5. Results of the Cluster Analysis for Breast Cancer Data Set (total instances: 699)
Training Vectors  Test Vectors  Time (sec)  Recognized Vectors  Efficiency
139               560           0.156       450                 80.35%
279               419           0.129       352                 84.0%
419               279           0.099       240                 86.02%
560               139           0.046       126                 90.6%
Table 6. Results of the Cluster Analysis for Heart Disease Data Set (total instances: 414)
Training Vectors  Test Vectors  Time (sec)  Recognized Vectors  Efficiency
90                324           0.133       261                 80.5%
180               294           0.104       245                 83.3%
262               152           0.082       135                 88.8%
341               73            0.034       67                  91.7%
Table 7. Results of the Cluster Analysis for Diabetes Data Set (total instances: 768)
Training Vectors  Test Vectors  Time (sec)  Recognized Vectors  Efficiency
155               613           0.152       500                 81.5%
307               460           0.117       400                 86.95%
460               307           0.092       275                 89.57%
614               155           0.059       140                 90.32%
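The Efficiency column is not defined explicitly in the text. The reported values are consistent, to within rounding, with the ratio of recognized vectors to test vectors expressed as a percentage. The sketch below assumes that definition and checks it against the rows of Table 5; the function name `efficiency` is introduced here for illustration.

```python
def efficiency(recognized, test_vectors):
    """Percentage of test vectors that were recognized (assumed definition)."""
    return 100.0 * recognized / test_vectors

# (test vectors, recognized vectors, reported efficiency in %) taken from Table 5
table5_rows = [(560, 450, 80.35), (419, 352, 84.0), (279, 240, 86.02), (139, 126, 90.6)]
for test_vectors, recognized, reported in table5_rows:
    computed = efficiency(recognized, test_vectors)
    print(f"{recognized}/{test_vectors} = {computed:.2f}% (reported {reported}%)")
```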
CONCLUSION
This study was carried out to develop a system for performing classification and clustering tasks on three
medical data sets (Breast Cancer, Heart Disease and Diabetes) using neural network techniques
for diagnostic purposes.
Two types of neural network models, the Feed Forward Neural Network (FFNN) with Back-Propagation and the
Self-Organizing network, are considered in this paper. Experiments were conducted with single-layer and
multilayer FFNNs, and the experimental results showed that classification accuracy is higher for the multilayer network
than for the single-layer network.
An unsupervised Self-Organizing network was designed to perform cluster analysis on the three medical
data sets. Cluster experiments were performed with different percentages of samples used in the training process. The
experimental results showed that performance improves when more sample data is used in the
training process.
References
[1] Choong P. L., DeSeilva C.J.S., “Breast Cancer Prognosis using EMN Architecture”, Proceedings of IEEE
International Conference on Neural Networks, June, 1994.
[2] Lapuerta P., Azen S.P., and LaBree L., “Use of Neural Networks in Predicting the Risk of Coronary Artery
Disease”, Computers and Biomedical Research, 28, 1995, pp. 38-52.
[3] Fraser H., Pugh R., Kennedy R., Ross P., and Harrison R., “A comparison of Backpropagation and Radial
Basis Functions, in the Diagnosis of Myocardial Infarction”, In Ifeachor E., and Rosen K. (Eds.), International
Conference on Neural Networks and Expert Systems in Medicine and Healthcare, 1994, pp. 76-84.
[4] Verrelst H., Vandewalle J., and De Moor B., “Bayesian Input Selection for Neural Network Classifiers”, In
Ifeachor E., Sperduti A., and Starita A. (Eds.), Third International Conference on Neural Networks and Expert
Systems in Medicine and Healthcare, 1998, pp. 125-132. World Scientific.
[5] Costa A., Cabestany J., Moreno J., and Calvet M., “Neuroserum: An Artificial Neural Net Based Diagnostic
Aid Tool for Serum Electrophoresis”, In Ifeachor E., Sperduti A., and Starita A. (Eds.), Third International
Conference on Neural Networks and Expert Systems in Medicine and Healthcare, 1998, pp. 34-43. World
Scientific.
[6] Aizenberg I., Aizenberg N., Hiltner J., “Cellular neural networks and computational intelligence in medical
image processing”, Image and Vision Computing, 19(4), 2001, pp. 177-183.
[7] Miller A., Blott B., and Hames T., “Review of Neural Network Applications in Medical Imaging and Signal
Processing”, Medical and Biological Engineering and Computing, 30(5), 1992, pp. 449-464.
[8] Hall L., Bensaid A., Clarke L., Velthuizen R., Silbiger M., and Bezdek J., “A Comparison of Neural Network
and Fuzzy Clustering Techniques in Segmenting Magnetic Resonance Images of the Brain”, IEEE Transactions
on Neural Networks, 3(5), 1992, pp. 672-682.
[9] Rajapakse J., and Acharya R., “Medical Image Segmentation with MARA”, In International Joint Conference
on Neural Networks, Vol. 2, 1990, pp. 965-972.
[10] Daschlein R., Waschulzik T., Brauer W., “Computer Aided Analysis of Lung Parenchyma Lesions in
Standard Chest Radiography”, In Ifeachor E., and Rosen K. (Eds.), International Conference on Neural
Networks and Expert Systems in Medicine and Healthcare, 1994, pp. 174-180.
[11] Sarle W. S., “Neural Networks and Statistical Models”, Proceedings of the Nineteenth Annual SAS Users
Group International Conference, April, 1994.
[12] Waltrus R.L., Towell G., and Glassman M.S., “Synthesize, Optimize, Analyze, Repeat (SOAR):
Application of Neural Network Tools to ECG Patient Monitoring”.
[13] Mia K. Markey, Joseph Y. Lo, Georgia D. Tourassi, Carey E. Floyd Jr., “Self-organizing map for cluster
analysis of a breast cancer database”, Artificial Intelligence in Medicine, Vol. 27, 2003, pp. 113-127.
[14] Silipo R., and Marchesi C., “Artificial Neural Networks for automatic ECG analysis”, IEEE Transactions
on Signal Processing, Vol. 46, n. 5, 1998, pp. 1417-1425.