Mining Health Data for Breast Cancer
Diagnosis Using Machine Learning
Mohammad Ashraf Bani Ahmad
A thesis submitted for the requirements of the
Degree of Doctor of Philosophy
Faculty of Education, Science, Technology & Mathematics
December 2013
In the name of Allah, the Most Merciful, the Most Compassionate. Recite and
your Lord is the most Generous - Who taught by the pen - Taught man that
which he knew not. Surat Al-`Alaq (The Clot), the Holy Quran.
Abstract
The recent advancements in computer technologies and storage capabilities have
produced an incredible amount of data and information from many sources such as
social networks, online databases, and health information systems. Nowadays, many countries around the world are changing the way health care is delivered to patients by utilising advances in computer and communication technologies through electronic health.
Electronic health (eHealth) is the process of using emerging information and
communication technologies in health care for the benefit of humans. eHealth
includes a range of components such as electronic health records, electronic
prescriptions, and electronic and mobile treatments for patients. In Australia, the majority of medical and health coverage is provided by the government, and due to shortages of medical personnel and appropriate supportive technologies, many people suffer long waiting times and limited medical resources. Therefore, the Australian federal, state, and territory governments promoted the inclusion of eHealth technologies in the health care system, to cope with the increased demand for health services and to help solve some of the problems that face traditional health systems.
This initiative produced the National eHealth Transition Authority Limited (known as NEHTA). The main purpose of NEHTA is to develop better ways of electronically collecting and securely exchanging health information across Australia. Since July 2012, anyone seeking healthcare in Australia can register for a personally controlled electronic health record. This can lead to a huge repository of Australian health care records.
This huge amount of data can be turned into knowledge and a more useful form of data by using computing and machine learning tools. It is believed that engineering this amount of data can aid in developing expert systems for decision support that can assist physicians in diagnosing and predicting debilitating, life threatening diseases such as breast cancer. Expert systems for decision support can reduce cost and waiting time, free human experts (physicians) for more research, and reduce the errors and mistakes that humans can make due to fatigue. However, the process of utilising health data effectively involves many challenges, such as the problem of missing feature values, the curse of dimensionality due to a large number of features (attributes), and the course of action to determine the features that lead to more accurate results (a more accurate diagnosis). Effective machine learning tools can assist in early detection of diseases such as breast cancer. The current work in this thesis focuses on investigating novel approaches to diagnose breast cancer based on machine learning tools, and involves developing new techniques to construct and process missing feature values, investigating different feature selection methods, and determining how to employ them in the diagnosis process.
It is believed that the adoption of electronic health systems into the health care system requires comprehensive design and development, which may need several stages to make it more useful for people and governments. For example, storing health records and electronically exchanging them across the country are not the only aims of eHealth. Treating health records as an important information resource, and probing the data with automated approaches to extract useful diagnostic and disease related intelligence (including the most significant features), may lead to new tools for examining new cases (patients) based on previous and similar cases, using machine learning and computer intelligence. This process of mapping existing data onto new, unseen scenarios and settings can lead to an increased understanding of disease related information, such as early onset of disease, and better monitoring of different disease stages. It adds value to health care technologies through enhanced quality of service to patients, better assistance to doctors (an electronic consultant for doctors, for example), and easier cross validation of standard disease diagnostic procedures. The thesis proposes several approaches to make this vision a reality. The main findings of this research can be categorised as follows:

• The thesis proposed a new approach for diagnosing breast cancer by reducing the number of features to the optimal number using the information gain algorithm, and then applying the reduced feature dataset to the Adaptive Neuro Fuzzy Inference System (ANFIS). The accuracy of the proposed approach is 98.24%, significantly better than comparable previous approaches. The promising results may lead to further attempts to utilise and exploit information technology for diagnosing patients, and to provide decision support to physicians.

• The thesis proposed a new approach for constructing missing feature values based on iterative k nearest neighbours and distance functions. The approach iterates until it finds the most suitable feature values, that is, those that satisfy the classification accuracy. The proposed approach showed an improvement of 0.005 in classification accuracy on the constructed dataset over the original dataset with both the Euclidean and Minkowski distance functions. The study found that the Manhattan, Chebychev, and Canberra distance metrics produced lower classification accuracy on the new dataset than on the original dataset. The study also noticed that classification accuracy depends greatly on the number of neighbours (k). The experimental evaluation showed that fewer neighbours may lead to higher accuracy; the reason, in my opinion, is the amount of noise produced by conflicting neighbours. Finally, the maximum classification accuracy, 0.9698, was obtained at k = 1.

• Different sets of experiments were performed to evaluate benchmark attribute selection methods on a well-known, publicly available dataset from the UCI machine learning repository, the Wisconsin Breast Cancer dataset (WBC). Naïve Bayes performed best in terms of classification accuracy. k-NN and Decision Tree performed only slightly better on the dataset after applying feature selection methods. In general, feature selection methods can improve the performance of learning algorithms. However, no single feature selection method best satisfies all datasets and learning algorithms.

• In regard to classification fusion of three well-known machine learning classifiers on the breast cancer datasets, the study confirms the argument that the best combination of a set of classifiers depends on the application and on the classifier characteristics. In addition, there is no best combination of classifiers that suits all datasets. However, in the current experiments, Naïve Bayes and k-NN produced better results when combined as one classifier, with the maximum classification accuracy obtained on the WBC dataset (0.9642).
Acknowledgements
I would like to thank everyone who has helped me to complete this thesis. Special, deep, and honest thanks to the supervision panel, chiefly Dr Girija Chetty, for the guidance, smile, and advice through this research. Big thanks to all faculty staff and employees; I'm pleased to have been a small part of such a great place for more than three years, principally Professor Dharmendra Sharma, A/Prof. Dat Tran for his support provided through my journey, and Professor George Cho for his comments and suggestions.
My distinctive thanks to my wife; you were a complete package of family and love that gave me strength, support, and love during tough times. My son Hashim and daughter Yarra, you are my world, my words, my strength, and the reason for my life. Thank you.
Words can't express my thanks to my family, the source of power and strength; my parents for raising me up, especially my Mum, thanks aren't enough and wouldn't be enough, and sorry for being away from your warm and kind bosom, but I'm back, hopefully soon. My Dad, thank you for everything you have done for me, for the advice, encouragement, and support throughout my life. My brothers and sisters, thank you.
This thesis is dedicated to Mum, Dad, wife Fayha, son Hashim, and daughter Yarra.
Table of Contents
Abstract ...................................................................................................................... iii
Acknowledgements .................................................................................................... ix
Table of Contents ...................................................................................................... xi
List of Figures .......................................................................................................... xiv
List of Tables ........................................................................................................... xvi
Acronyms ................................................................................................................ xvii
Chapter One: Introduction ....................................................................................... 1
1.1 Overview .......................................................................................................... 1
1.2 Research Motivation ........................................................................................ 3
1.3 Research Objectives ......................................................................................... 7
1.4 Research Contribution .................................................................................... 11
1.5 Research Methodology ................................................................................... 14
1.6 Thesis Road Map ............................................................................................ 15
1.7 Chapter Summary ........................................................................................... 17
Chapter Two: Background Study and Literature Review ................................... 18
2.1 Overview ........................................................................................................ 18
2.2 Background Study .......................................................................................... 18
2.3 Classification .................................................................................................. 19
2.3.1 k-Nearest Neighbors algorithm .............................................................. 21
2.3.2 Artificial Neural Network ...................................................................... 27
2.3.3 Decision Tree ......................................................................................... 31
2.3.4 Naïve Bayes Classifier ........................................................................... 34
2.4 Data Mining in Healthcare ............................................................................. 36
2.4.1 Treatment Effectiveness ......................................................................... 36
2.4.2 Healthcare Management ......................................................................... 37
2.4.3 Customer Relationship Management ..................................................... 37
2.4.4 Fraud and Abuse ..................................................................................... 37
2.4.5 Computer Aided Diagnosis .................................................................... 38
2.4.6 Ethical, Legal, and Social Issues ............................................................ 40
2.4.7 Challenges of Data Mining in Healthcare .............................................. 43
2.4.8 Electronic Health Record ....................................................................... 45
2.5 Related Work on Breast Cancer Diagnosis .................................................... 46
2.6 Feature Selection Techniques ........................................................................ 47
2.6.1 Wrapper Feature Selection Technique ................................................... 49
2.6.2 Filters Feature Selection Techniques ..................................................... 51
2.6.3 Embedded Feature Selection Techniques .............................................. 52
2.6.4 Feature Selection Techniques Used in Current Work ............................ 53
2.6.5 Related Work on Feature Selection Techniques .................................... 56
2.7 Missing Features Values ................................................................................ 58
2.7.1 Types of Missing Values ........................................................................ 59
2.7.2 Handling missing data ............................................................................ 60
2.8 Chapter Summary ........................................................................................... 65
Chapter Three: Research Methodology ................................................................. 66
3.1 Introduction .................................................................................................... 66
3.2 Data Mining Methodology ............................................................................. 68
3.2.1 Data Collection ....................................................................................... 70
3.2.2 Data Selection ........................................................................................ 72
3.2.3 Data Pre-Processing ............................................................................... 72
3.2.4 Applying Data Mining Methods ............................................................ 73
3.2.5 Evaluation ............................................................................................... 75
3.2.6 Machine Learning Software Development Tools .................................. 76
3.2.7 Results Visualization .............................................................................. 76
3.3 Chapter Summary ........................................................................................... 77
Chapter Four: Breast Cancer Diagnosis Based on Information Gain and Adaptive Neuro Fuzzy Inference System ............................................................... 78
4.1 Overview ........................................................................................................ 78
4.2 Adaptive Neural Fuzzy Inference System (ANFIS) ...................................... 78
4.2.1 ANFIS Structure ..................................................................................... 79
4.2.2 ANFIS Learning ..................................................................................... 81
4.3 Information Gain ............................................................................................ 82
4.4 The Proposed IG-ANFIS Approach ............................................................... 83
4.5 The Experimental Results .............................................................................. 84
4.6 Summary and Discussion ............................................................................... 91
Chapter Five: Iterative Weighted k-NN for Constructing Missing Feature Values in Wisconsin Breast Cancer Dataset .......................................................... 92
5.1 Overview ........................................................................................................ 92
5.2 Missing Feature Values .................................................................................. 92
5.3 The Proposed Method .................................................................................... 95
5.4 The Experimental Results .............................................................................. 98
5.5 Summary and Discussion ............................................................................. 100
Chapter Six: Diagnosing Breast Cancer Based on Feature Selection and Naïve Bayes ........................................................................................................................ 102
6.1 Overview ...................................................................................................... 102
6.2 Feature Selection Techniques ...................................................................... 103
6.3 Feature Selection Techniques used in this Chapter ..................................... 104
6.4 The Experiment Methodology ..................................................................... 105
6.5 The Experimental Results ............................................................................ 107
6.6 Summary and Discussion ............................................................................. 113
Chapter Seven: Fusion of Heterogeneous Classifiers for Breast Cancer Diagnosis ................................................................................................................. 114
7.1 Overview ...................................................................................................... 114
7.2 Multi-Classification Approach ..................................................................... 115
7.2.1 Classifier Selection ............................................................................... 115
7.2.2 Fusion Classifier ................................................................................... 115
7.3 Classifiers Combination Strategies .............................................................. 116
7.4 Experimental Methodology ......................................................................... 117
7.5 Experimental Results ................................................................................... 118
7.6 Summary and Discussion ............................................................................. 121
Chapter Eight: Discussion and Future Work ...................................................... 122
References ............................................................................................................... 130
List of Figures
Figure 1: Medical Doctors per 1000 population in selected countries in Organization for Economic Cooperation and Development (OECD) Countries, 2009. ..................................... 6
Figure 2: Number of MRI units per one million population in selected countries in Organization for Economic Cooperation and Development (OECD) Countries, 2003. .......... 7
Figure 3: Updated eHealth Architecture Including the proposed integrated intelligent system [12] ......................................................................................................................................... 10
Figure 4: General approach for building a classification model ........................................... 20
Figure 5: Example of k-NN [16] ........................................................................................... 22
Figure 6: k-NN characteristics in regards to some learning features. ................................... 26
Figure 7: Human neuron [33] ............................................................................................... 28
Figure 8: Artificial Neuron ................................................................................................... 29
Figure 9: Simplified neuron operation .................................................................................. 29
Figure 10: ANN architecture ................................................................................................ 30
Figure 11: ANN characteristics in regards to some learning features. ................................. 30
Figure 12: Simple Decision Tree .......................................................................................... 31
Figure 13: Decision Tree characteristics in regards to some learning features. ................... 33
Figure 14: Bayesian classifier characteristics in regards to some learning features. ............ 35
Figure 15: The Wrapper approach for features subset selection [65] ................................... 50
Figure 16: The filter approach [56] ....................................................................................... 51
Figure 17: Research Method Overview ................................................................................ 69
Figure 18: (a) Fuzzy Reasoning (b) Equivalent ANFIS Structure [89]. ............................... 82
Figure 19: The general structure for the proposed approach ................................................ 84
Figure 20: Information Gain Ranking on WBC ................................................................... 86
Figure 21: Sugeno Fuzzy Inference System with four features input and single output ...... 87
Figure 22: Input Membership Function for the feature “Uniformity of Cell Size” .............. 88
Figure 23: The structure for the proposed approach (IG-ANFIS) ........................................ 89
Figure 24: ANFIS Structure on MATLAB ........................................................................... 89
Figure 25: Comparison of classification accuracy between IG-ANFIS and some previous work ....................................................................................................................................... 90
Figure 26: The Flowchart for the proposed method (Constructing Missing Features Values) ............................................................................................................................................... 97
Figure 27: A comparison of classification accuracy for the proposed method through Euclidean/k-NN ...................................................................................................................... 99
Figure 28: A comparison of classification accuracy for the proposed method through Minkowski/k-NN .................................................................................................................. 100
Figure 29: Hybrid method of feature selection technique and a learning algorithm ........... 106
Figure 30: Attributes selection methods with Naïve Bayes ................................................ 108
Figure 31: Results for attributes selection methods with k-NN .......................................... 110
Figure 32: Results for attributes selection methods with Decision Tree ............................ 112
Figure 33: Hybrid method of feature selection technique and a learning algorithm ........... 113
Figure 34: Single Classifier on three datasets WBC, WDBC, and WPBC. ........................ 119
Figure 35: Two Classifiers on three datasets WBC, WDBC, and WPBC. ......................... 120
Figure 36: The Fusion of three classifiers on three datasets WBC, WDBC, and WPBC. .. 121
Figure 37: Results for attributes selection methods with Naïve Bayes on three databases (Thyroid, Hepatitis, and Breast Cancer) .............................................................................. 127
Figure 38: Results for Attributes Selection Methods with k-NN on three databases (Thyroid, Hepatitis, and Breast Cancer) ............................................................................................... 128
Figure 39: Results for Attributes Selection Methods with Decision Tree on three databases (Thyroid, Hepatitis, and Breast Cancer) .............................................................................. 129
List of Tables
Table 1: The confusion matrix for classifier c(x) on matrix X that contains 160 records. ... 21
Table 2: Examples, advantages, and disadvantages of wrapper feature selection [63] ......... 50
Table 3: Examples, advantages, and disadvantages of filter feature selection [63] .............. 52
Table 4: Examples, advantages, and disadvantages of embedded feature selection [63] ..... 53
Table 5: Extract of data to demonstrate Expectation Maximization [83] ............................. 62
Table 6: The calculations of mean, variance, and covariance for the features depression, age, height, and weight. ................................................................................................................. 63
Table 7: The final data set after performing Expectation Maximization method. ................ 64
Table 8: Selection of research paradigms and research methods [85] .................................. 67
Table 9: Sample of Wisconsin Breast Cancer Diagnosis dataset .......................................... 71
Table 10: Information Gain Ranking Using WEKA on WBC.............................................. 85
Table 11: Comparison of classification accuracy between IG-ANFIS and some previous work ....................................................................................................................................... 90
Table 12: Results for Attributes Selection Methods with Naïve Bayes. ............................. 107
Table 13: Results for Attributes Selection Methods with k-NN ......................................... 109
Table 14: Results for Attributes Selection Methods with Decision Tree ............................ 111
Table 15: Statistics of Breast Cancer Datasets .................................................................... 118
Table 16: Results for attributes selection methods with Naïve Bayes on three databases (Thyroid, Hepatitis, and Breast Cancer) .............................................................................. 126
Table 17: Results for Attributes Selection Methods with k-NN on three databases (Thyroid, Hepatitis, and Breast Cancer) ............................................................................................... 128
Table 18: Results for Attributes Selection Methods with Decision Tree on three databases (Thyroid, Hepatitis, and Breast Cancer) .............................................................................. 129
Acronyms
ADALINE: Adaptive Linear Element
ANFIS: Adaptive Neuro-Fuzzy Inference System
ANN: Artificial Neural Network
CAD: Computer Aided Diagnosis
CART: Classification and Regression Tree
CES: Consistency Based Subset Evaluation
CFS: Correlation Based Feature Selection
DM: Data Mining
eHealth: Electronic Health
EHR: Electronic Health Record
ERR: Error Rate
FIS: Fuzzy Inference System
GA: Genetic Algorithm
HIS: Hybrid Intelligent System
IG: Information Gain
IG-ANFIS: Information Gain and Adaptive Neuro-Fuzzy Inference System
k-NN: k Nearest Neighbors
LSE: Least Square Estimate
MAR: Missing At Random
MCAR: Missing Completely At Random
ML: Machine learning
MNAR: Missing Not At Random
NEHTA: National Electronic Health Transition Authority
OECD: Organization for Economic Cooperation and Development
PCA: Principal Components Analysis
R: Relief
RT: Random Tree
SBFS: Sequential Backward Floating Search
SFFS: Sequential Forward Floating Search
SFS: Sequential Forward Search
SU: Symmetrical Uncertainty
UCI Machine Learning Repository: University of California Irvine Machine Learning Repository
WBC: Wisconsin Breast Cancer Dataset
WEKA: Waikato Environment for Knowledge Analysis
CHAPTER ONE
Introduction
1.1 Overview
The advancement of information technology, software development, and system integration techniques has produced a new generation of complex computer systems. These systems have presented challenges to information technology researchers, including compatibility between heterogeneous systems, security and privacy issues, systems management, sharing of data, and reusing and benefiting from existing resources and data.
An example of such complex systems is the healthcare system. Recently, there has been increased interest in utilising advances in communication and data mining technologies in healthcare systems. Many countries are also moving towards a nationwide healthcare system by setting healthcare standards for communication and building electronic healthcare records.
The Electronic Health Record (EHR) is a systematic collection of electronic health data about individual patients or populations. It is capable of being shared across healthcare providers in a certain state or country [1]. Health records may include a range of data, including general medical records, patient examinations, patient treatments, medical history, allergies, immunization status, laboratory results, radiology images, and other useful information for examination. This rich information may help researchers in examining and diagnosing diseases using computer techniques. Using EHRs may help in reducing the cost of legacy systems, improving the quality of care, and increasing the mobility of records. However, issues of privacy and security in such models have been a concern for patients and governments.
The existence of EHRs has encouraged researchers to pursue the idea of an electronic healthcare system, where the components of legacy healthcare systems (facilities, workforce, providers of therapeutics, and education and research institutions) come together to electronically share and transfer patient information across the country's public infrastructure.
Australia is moving fast toward electronic health care information systems across the country. This movement will produce a huge EHR repository for the Australian population and healthcare providers, and this health related information and data can be a valuable asset. Therefore, the aim of the current work is to investigate aspects of utilising health data for the benefit of humans by using novel machine learning and data mining techniques.
The idea is to propose an automated method for diagnosing diseases based on previous data and information. However, there are several problems associated with effectively utilising this previously acquired patient data, which can make any electronic healthcare system less efficient. These problems are: missing values and how to process them; the large number of features (attributes) and how to select the most beneficial ones; and the extraction of accurate diagnostic markers that can predict the early onset of the disease and support the monitoring of its different stages. This thesis investigates these issues and proposes methods for breast cancer as an example, based on the power of automated technologies and previous evidence or data. The scope of the thesis is limited to the problems outlined above, and does not include other equally important
issues such as privacy and security. In this research, EHRs are used as data sources for developing automatic data mining and machine learning techniques, so as to produce useful patterns and decision support logic for automatic computer aided diagnosis. To pursue the investigations in this project, the study used well-known datasets available publicly for research purposes. It is envisaged that the novel algorithms and techniques developed and validated on these datasets can be extended to real clinical environments by integrating them into clinical computer aided diagnosis and decision support systems. These datasets serve as trial data before the proposed methods are integrated into real clinical environments.
1.2 Research Motivation
In Australia and all over the world, people suffer from limited medical resources and long waiting times to receive medical services. According to the World Health Organization (WHO), Australia ranked 32nd out of 190 countries in the field of health care systems [2]. A study shows that Australia had fewer practising physicians and care beds per one thousand people than the median of some countries in the Organization for Economic Cooperation and Development (OECD) [3].
The increasing population of Australia, the ageing population, the modern lifestyle, climate change, and newly emerging diseases have presented challenges for Australian health organisations and state governments in setting procedures and plans to manage the available medical resources and infrastructure, and to deliver decent healthcare services for residents despite the shortages in medical personnel and equipment. In addition, medical services are essential for all individuals, and it is the nation's responsibility to develop and sustain the medical infrastructure and services for all residents and citizens. On top of the shortages in medical personnel and technology, incidents of prescription errors have been increasingly causing minor to major problems for patients. For example, serious health problems may occur because of Adverse Drug Effects (ADE). ADEs are caused by mistaken prescriptions, errors in dosage, miscommunication between physicians and pharmacies, errors in dispensing and administering drugs, and inappropriate numbers of drugs taken [4]. A study [5] shows that ADE may rank as the sixth leading cause of death in the United States, after heart diseases, cancer, stroke, pulmonary diseases, and road accidents. In Australia, the Australian Department of Health and Ageing estimated that around 140,000 hospital admissions every year are due to ADE incidents [6]. These problems may be avoided by systematic information transfer between different health care providers (hospitals, medical centres, pharmacies, pathologies, etc.).
Another issue facing countries, including Australia, is the shortage of medical doctors. Figure 1 shows a comparison between Australia and selected countries in the Organization for Economic Cooperation and Development (OECD) in terms of the number of medical doctors. The figure shows 1.43 general practitioners, 1.35 specialists, and a total of 2.81 physicians per 1000 population for Australia. The availability of innovative eHealth technologies, such as the one proposed in this research, can help alleviate this shortage.
Breast cancer has become a common disease around the world. Every year, millions of women suffer from this debilitating, life threatening disease, making it the second most common non-skin cancer after lung cancer, and the fifth most common cause of cancer death in the world [7]. Discovering the disease in its early stages may reduce the breast cancer tragedy. Computing technologies and machine learning tools can be used to assist physicians in diagnosing and predicting the disease so they can provide the necessary treatment and prevent its impact, including the possibility of death. More specifically, breast cancer causes about 22.9% of all cancers in women, excluding skin cancers [8]. For example, breast cancer caused 458,503 deaths worldwide in 2008 [8]. Breast cancer targets women about 100 times more often than men, although men tend to have poorer outcomes due to delays in diagnosis [9]. Survival rates for breast cancer vary greatly depending on the cancer type, stage, treatment, and geographical location of the patient. For instance, survival rates in the Western world are high; in developing countries, however, survival rates are much poorer. It is therefore hoped that this research and the related future work make contributions that can help in a better diagnosis of breast cancer for men and women worldwide, especially in countries with poor health services.
In Australia, breast cancer is the most common cancer in women (excluding two types of non-reportable skin cancer), representing over a quarter (28%) of all reported cancer cases in women in 2006. A total of 12,614 breast cancer cases were diagnosed in women in 2006, the largest number recorded up to the time of the study (2009). More than two-thirds (69%) of these cases were in women aged 40 to 69 years. In the same year, 102 cases of breast cancer were diagnosed in men, accounting for 0.8% of breast cancer cases [10].
While breast cancer is the most commonly reported cancer in non-Indigenous women in the four jurisdictions for which data were available, Indigenous women were significantly less likely to be diagnosed with breast cancer than non-Indigenous women in 2002 to 2006 (69 and 103 new cases per 100,000 women, respectively) [10]. Worldwide, breast cancer was the sixth leading cause of burden of disease for women in 2003, and it accounted for 7% of all years of life lost due to premature mortality [10].
Figure 1: Medical Doctors per 1000 population in selected countries in Organization for Economic
Cooperation and Development (OECD) Countries, 2009.
Technology availability is also a challenge facing countries. Figure 2 demonstrates the availability of Magnetic Resonance Imaging (MRI) units in selected countries in the Organisation for Economic Cooperation and Development (OECD). The figure shows 3.7 MRI units per one million population in Australia. These shortages in medical resources drive researchers to look for more effective solutions for the benefit of society. Computer scientists can utilize the latest technologies in machine learning to produce models and methods that can assist physicians in the process of examination and treatment.
Figure 2: Number of MRI units per one million population in selected countries in Organization for Economic Cooperation and Development (OECD) Countries, 2003.
1.3 Research Objectives
Computing and machine learning tools can significantly help in solving health care problems through expert systems that assist physicians in diagnosing and predicting diseases in their early stages. These systems can reduce cost and waiting time, free human experts for more research, and reduce the errors and mistakes made by medical personnel [11]. Computer Aided Diagnosis (CAD) and medical expert systems and tools have become one of the foremost research areas in the field of medical diagnosis. The aim of CAD is to design an expert system that combines human expertise and technological intelligence to achieve more accurate diagnosis effectively [11]. CAD can be used to assist physicians in diagnosing and predicting diseases. Accordingly, physicians can provide the necessary treatment promptly to prevent loss, including the possibility of death.
In Australia, the National Electronic Health Transition Authority (NEHTA) was established by the Australian state and territory governments to develop an electronic and secure exchange of health information between healthcare providers. The project was expected to be completed by the end of 2012 [12]. From July 2012, anyone seeking healthcare in Australia can register for a personally controlled electronic health record. The result will be a huge database of Australian health care records.
This database can be utilised for research purposes after applying the Australian privacy and information use standards and policies. The outcomes of the present thesis work may fit as a component in the Australian National E-health System.
The aims of this research work are:

To utilize patient’s histories, health information, and databases for
discovering and diagnosing diseases, and provide decision support to
medical professionals. The research is expected to establish some models
that can assist physicians in diagnosing diseases and grouping patients in
useful patterns based on different risk factors, and how machine learning
8
Chapter 1: Introduction
techniques can identify such patterns. This can help in detecting early
onset of the disease, identification of disease stages and treatment plans.

To address an important issue related to missing values that can play an
important role in determining the performance improvements achieved by
data mining and machine learning algorithms.

To work with large number of features and attributes in the dataset, and
identify the significance of some features over others. Large number of
features can lead to curse of dimensionality, and can render a machine
learning algorithm or technique limited in terms of accuracy, precision
and specificity
Therefore, this thesis proposes new methods for constructing missing feature values,
investigate feature selection techniques and develop new machine learning
algorithms for providing automatic computer aided diagnosis and decision support
system for breast cancer disease diagnosis. The aim is to develop an integrated
system with a principled workflow (constructing missing features values, feature
selections, and classification algorithms).
This work envisages that the outcome of this research, an integrated computer aided decision support and diagnosis system with a principled workflow of algorithmic techniques for dealing with missing feature values, selecting significant features, and performing machine learning based classification, can enhance the accuracy with which benign and malignant forms of the disease are identified, and can contribute as a component in the Australian National E-health System. Figure 3 shows the proposed architecture for the eHealth system along with the proposed integrated intelligent system. Figure 3 is an updated architecture from NEHTA [12], and includes the outcomes of this research as a possible component of the system for future adoption.
Figure 3: Updated eHealth Architecture Including the proposed integrated intelligent system [12]
To conclude, the objective of this research is to utilize patients' histories, health information, and databases from the national EHRs (Electronic Health Records) for discovering markers for early diagnosis and management of breast cancer, with an integrated intelligent approach consisting of processing missing feature values, significant feature selection, and learning based classification. The research is expected to establish models and tools that can assist physicians in diagnosing diseases. The aim is to design an expert system that combines human expertise and technological intelligence to achieve a more accurate diagnosis. Such a system may assist physicians in decision making, double-check the physician's assessment (evidence based diagnosis), and monitor the disease by grouping patients into related health patterns for better and more effective treatment plans.
To address the above mentioned research objectives, the research problem is formulated in terms of the following questions:
Question 1:
Does hybridization of existing machine learning algorithms produce better approaches for medical diagnosis of breast cancer in terms of classification accuracy and tolerance to noise and missing values?
Question 2:
How can discriminating dataset features improve prediction in the context of medical diagnosis, with breast cancer as an example?
Question 3:
How can the diagnostic features that best describe the data be identified, for the purpose of differentiating malignant and benign forms of breast cancer using learning based classification and data mining techniques?
1.4 Research Contribution
The present research aims to contribute to the national interest in an electronic healthcare system (eHealth). The aim of the current thesis is to analyse the large amounts of data obtained from eHealth systems using data mining and machine learning algorithms. The process of analysing this data includes some novel algorithmic techniques, such as constructing missing feature values, investigating and developing better feature selection techniques, and proposing new machine learning based approaches for diagnosing the disease based on previous history obtained from patients.

• In regard to constructing missing feature values, the study proposed a new approach based on iterative nearest neighbourhood and distance metrics. The proposed approach employs the k nearest neighbours algorithm and iterates until the classification accuracy reaches a certain threshold. The proposed method showed an improvement in classification accuracy of around 0.5% on the constructed dataset over the original dataset, which contained some missing feature values. The maximum classification accuracy was 0.9698. This work has been published in a peer reviewed journal (Ashraf, M., et al., "A New Approach for Constructing Missing Features Values," International Journal of Intelligent Information Processing, Vol. 3, Issue 1, pp. 110-118, March 2012).

• In terms of feature selection techniques, the research focused on feature selection as a method to obtain high quality attributes that enhance the mining process. Feature selection techniques touch all disciplines that require knowledge discovery from large data. In this part of the research, I examined different benchmark feature selection methods on the publicly available Wisconsin Breast Cancer Dataset (WBC) with three well-recognised machine learning algorithms. The study found that feature selection methods are capable of improving the performance of learning algorithms; however, no single feature selection method can best satisfy all datasets and learning algorithms. Therefore, machine learning researchers should understand the nature of the datasets and the characteristics of the learning algorithms in order to obtain better outcomes. Overall, consistency based subset evaluation (CES) performed better than information gain, symmetrical uncertainty, Relief (R), correlation based feature selection (CFS), and principal components analysis (PCA). The results have been accepted by the International Journal on Data Mining and Intelligent Information Technology Applications and will appear in the March 2013 edition (Ashraf, M., et al., "Features Selections Techniques on Thyroid, Hepatitis, and Breast Cancer", International Journal of Intelligent Information Processing, accepted, to appear in March 2013).

• In regard to diagnosis approaches, this work proposed two approaches for diagnosing breast cancer based on machine intelligence and previous history. The first approach presented a new method for breast cancer diagnosis using a combination of the Adaptive Network based Fuzzy Inference System (ANFIS) and the information gain method. In this approach, the purpose of ANFIS is to build an input-output mapping using both human knowledge and machine learning ability, while the purpose of the information gain method is to reduce the number of input features to ANFIS. The experimental validation shows 98.23% accuracy, which underlines the capability of the proposed algorithm. The second approach utilised a feature selection technique and the Bayes learning algorithm, as feature selection techniques have become an essential component of automatic data mining systems dealing with large data, and an obvious tool to aid researchers in computer science and many other fields of science. Whether the target research is in medicine, agriculture, business, or industry, the need to analyse large amounts of data is present. In addition, finding the feature selection technique that best suits a certain learning algorithm could benefit researchers. Therefore, the current work proposed a new method for diagnosis based on a combination of a learning algorithm and a feature selection technique. The idea is to obtain a hybrid integrated approach that combines the best performing learning algorithm and the best performing feature selection technique, with an experimental evaluation on the Wisconsin Breast Cancer Dataset (WBC); an illustrative sketch of this kind of hybrid workflow is given after this list. Experimental results showed that coordination between the consistency based subset evaluation method and the Naïve Bayes learning algorithm can produce promising results. The results appeared in the 19th International Conference on Neural Information Processing in 2012 (Ashraf, M., et al. (2012), "Hybrid Approach for Diagnosing Thyroid, Hepatitis, and Breast Cancer Based on Correlation Based Feature Selection and Naïve Bayes", Neural Information Processing, T. Huang, Z. Zeng, C. Li and C. Leung (eds.), Springer Berlin Heidelberg, 7666: 272-280).

• A new set of experiments was performed to evaluate the fusion of multiple classifiers. Based on the experiments, the best combination of a set of classifiers depends on the application and on the classifier characteristics. In addition, there is no best combination of classifiers that suits all datasets. However, the experiments showed that Naïve Bayes and k-NN produced better results when combined as one classifier, with the maximum classification accuracy obtained on the WBC dataset (0.9642).
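To make the hybrid feature-selection-plus-classifier idea concrete, here is a minimal Python sketch in the spirit of the approach described above: rank the features, keep the top ones, and classify with Naïve Bayes. It is illustrative only, not the thesis code; the thesis experiments used WEKA, the original WBC file, and consistency based subset evaluation, whereas this sketch substitutes scikit-learn's bundled Wisconsin Diagnostic breast cancer data and mutual information, an information gain analogue suited to continuous features.

```python
# Illustrative sketch only, not the thesis implementation.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# scikit-learn ships the Wisconsin Diagnostic (WDBC) data, not the
# original 9-feature WBC file used in the thesis experiments.
X, y = load_breast_cancer(return_X_y=True)

# Keep the 9 highest-scoring features (mutual information stands in for
# information gain here), then classify with Naive Bayes.
model = make_pipeline(SelectKBest(mutual_info_classif, k=9), GaussianNB())
scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation
print(f"Mean 10-fold accuracy: {scores.mean():.4f}")
```

The pipeline fits the feature selector inside each cross-validation fold, which avoids leaking test information into the selection step.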
1.5 Research Methodology
Knowledge discovery from databases, or data mining, refers to extracting useful relationships and patterns from large databases. Because of the amount of data involved, a systematic method must be applied to obtain useful outcomes. Quality data yields more accurate outcomes than dirty data; "dirty data" is a common term in data mining that describes unwanted data characteristics such as incompleteness, noise, and inconsistency. In this research, the proposed method involves several data mining processes, starting with appropriate data collection (in our case, online datasets), followed by data selection, data pre-processing, applying learning based classifier methods, evaluation, and finally visualisation and evaluation of results in terms of tables and diagrams. Details of each stage are described in Chapter 3.
1.6 Thesis Road Map
The thesis is organised into eight chapters:
• Chapter 1: Introduction
• Chapter 2: Background Study
• Chapter 3: Research Methodology
• Chapter 4: Breast Cancer Diagnosis based on ANFIS and Information Gain
• Chapter 5: Constructing Missing Features Values
• Chapter 6: Breast Cancer Diagnosis based on Naïve Bayes and Feature Selection Techniques
• Chapter 7: Fusion of Heterogeneous Classifiers for Breast Cancer Diagnosis
• Chapter 8: Discussion and Future Work
This introductory chapter presents the problem description, the motivation and objectives of this work, the contribution to scientific knowledge, the methodology, and the road map of the thesis.
Chapter two provides a review of canonical machine learning and data mining techniques, feature selection methods, and data mining algorithms in the field of healthcare. This work combines them into one chapter because they are strongly related as background study. Chapter three describes the data mining methodology used in this work. Chapter four presents a new approach for breast cancer diagnosis using data mining techniques and the power of the neuro fuzzy inference technique; the approach relies on a combination of the Adaptive Network based Fuzzy Inference System (ANFIS) and the information gain method. Chapter five presents a new approach for constructing missing feature values based on iterative nearest neighbours and distance metrics; the proposed approach employs a weighted k nearest neighbours algorithm. Chapter six presents a new method for diagnosing breast cancer based on a combination of learning algorithms and feature selection techniques; the idea is to obtain a hybrid approach that combines the best performing learning algorithms and the best performing feature selection techniques. Chapter seven presents a set of experiments to evaluate the idea of classification fusion. Finally, chapter eight presents some of the conclusions drawn from this work and the scope for future work.
1.7 Chapter Summary
This chapter presented an overview of the thesis, the motivation and objectives of the proposed research, the need for the current research to help address the shortage of automated computer aided decision support technologies, the contribution of the current research, and the thesis road map. The next chapter presents a background study on machine learning, data mining in healthcare, feature selection techniques, and missing feature values.
CHAPTER TWO
Background Study and Literature Review
2.1 Overview
Machine learning (ML) can be interpreted as a group of topics that emphasise making and testing algorithms to assist the processes of classification, prediction, and pattern recognition, using computer models obtained from existing (previous) data. Machine learning can produce classifiers to be used on the available resources. In addition, machine learning does not involve much human interaction; the objective behind limited human involvement is that the use of automatic, pre-programmed methods can reduce human biases. The process of proposing an algorithm, and its functionality in classifying objects or predicting new cases, must be built on solid and reliable data [13].
2.2 Background Study
In general, machine learning can be defined as a scientific domain that aims to design and develop algorithms that allow computers to learn and to solve real world problems based on previous data or under certain instructions and rules. Machine learning has many applications; data mining is the most widely used [14]. Data mining is the science of discovering knowledge from databases. A database contains a collection of instances (records or cases), and each instance used by machine learning and data mining algorithms is formatted using the same set of fields (features, attributes, inputs, or variables). When the instances contain the correct output (class label), the learning process is called supervised learning. On the other hand, the process of machine learning without knowing the class labels of instances is called unsupervised learning. Clustering is a common unsupervised learning method (some clustering models can serve both purposes). The goal of clustering is to describe data, whereas classification and regression are predictive methods. In the current research, my focus is on supervised machine learning [14].
2.3 Classification
Classification and regression are common models in supervised learning. The current research concentrates on classification; however, it is useful to distinguish between the two. Regression algorithms attempt to map the input to domain values (which can be real values). For example, a regressor can forecast the sales of certain goods by considering the goods' features. Classifiers, by contrast, map the input space onto pre-defined classes. For example, a classifier can predict whether a new patient case is benign (healthy) or malignant (suffering from a certain disease) [15].
Classification is the process of learning the target function that maps between a set of features (inputs) and predefined class labels (output). The input data for classification is a set of instances, each of which is a record of the form (𝒙, 𝒚), where 𝒙 is the feature set and 𝒚 is the target variable (class label). A classification model is a tool used either to describe data (a descriptive model) or to predict the target variable for new instances (a predictive model). Examples of classification models are decision trees, artificial neural networks, Naïve Bayes, and the nearest neighbour classifier [16].
The general approach for solving a classification problem is shown in Figure 4. The training data consists of instances whose class labels are known, and the classification model is built from the training data. The model can then be evaluated and tested using the testing data, which contains records whose class labels are to be predicted. The evaluation of model performance is based on the number of testing instances that are correctly forecast [16]. The result of applying the model to the testing data produces the confusion matrix.
Figure 4: General approach for building a classification model
Suppose the goal is to classify some objects 𝒊 = 𝟏, … , 𝒏 into 𝒌 predefined classes, where 𝒌 represents the number of classes. For example, if the aim of classification is to diagnose whether or not a patient suffers from breast cancer, then the value of 𝒌 will be 𝟐, corresponding to either benign or malignant. The database (available data) can be organised as an 𝒏 × 𝒑 matrix 𝑿, where 𝒙𝒊𝒋 represents the value of feature 𝒋 in record 𝒊. Every row of the matrix 𝑿 is represented by a vector 𝒙𝒊 with 𝒑 features and a class label 𝒚𝒊. The classifier can be denoted as 𝒄(𝒙). One method to evaluate the classifier
20
Chapter 2: Background Study
is by calculating the error estimation based on the confusion matrix. To explain the
error estimation, let us consider an example. Suppose the aim of a certain classifier
𝒄(𝒙) is to train and test input vectors 𝒙 into two possible classes benign and
malignant. Suppose the result of classification of the classifier 𝒄(𝒙) on vectors 𝒙 is as
shown in the confusion matrix in Table 1.
Table 1: The confusion matrix for classifier c(x) on matrix X that contains 160 records.

                    Predicted
    True         benign    malignant
    benign         60         15
    malignant       5         80
The error rate (Er) of the algorithm is the total number of incorrectly classified samples divided by the total number of records in the matrix X. In the example above, Er = (15 + 5) / 160 = 0.125. Correspondingly, the classification accuracy of the model is Acc = 1 − Er = 0.875.
2.3.1 k-Nearest Neighbour Algorithm
The k-Nearest Neighbour algorithm (k-NN) is an instance-based machine learning algorithm. k-NN is very simple to understand but works remarkably well [17]. The idea behind the k-NN method is to classify objects based on the closest training cases in the feature space. k-NN finds the k closest instances to a given query instance and decides its class label by identifying the most frequent class label among the training instances with the minimum distance to the query instance. The distance is determined by a distance metric; ideally, the metric minimises the distance between similar instances and maximises the distance between dissimilar instances. Examples of approaches to defining the distance are the Euclidean and Manhattan methods, and Figure 5 shows an example of k-NN. The following sketch (the original pseudo-code [18], made runnable here in Python) illustrates a k-NN implementation.
    from collections import Counter

    def knn_classify(query, train_X, train_y, k, distance):
        # find the k nearest training instances according to a distance
        # metric (e.g. Euclidean distance or Manhattan distance)
        nearest = sorted(range(len(train_X)),
                         key=lambda i: distance(query, train_X[i]))[:k]
        # resulting class = most frequent class label of the k nearest instances
        return Counter(train_y[i] for i in nearest).most_common(1)[0][0]
Figure 5: Example of k-NN [16]
Distance Functions

1. Euclidean distance

The most commonly used metric for computing the distance between data points is the Euclidean distance: the square root of the sum of the squared differences between two points. For n-dimensional data, the distance is given by formula (1), where d denotes the distance, x and y are two different cases in the dataset, and n is the number of features (the dimensionality of the data) [19].

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (1)
2. Manhattan distance

The Manhattan distance is another well-known distance measure, calculated by summing the absolute differences between the coordinates of the data points. The Manhattan distance is less costly to compute than the Euclidean distance. It is given by formula (2), where d denotes the distance, x and y are two different cases in the dataset, and n is the number of features [20].

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|    (2)
3. Minkowski distance

The Minkowski function is a geometric distance between two points that uses a scaling factor r. Its main use is to measure the similarity between objects. When r = 2 it becomes the Euclidean distance; when r = 1 it becomes the Manhattan distance. The distance is given by formula (3), where d denotes the distance, x and y are two different cases in the dataset, and n is the number of features [20].

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^r \right)^{1/r}    (3)
4. Chebyshev distance

The Chebyshev distance function takes the maximum of the absolute differences between the coordinates of two points. A common application of the Chebyshev distance is Fuzzy C-means clustering [112]. Here d denotes the distance and x and y are two different cases in the dataset.

d(x, y) = \max_i |x_i - y_i|    (4)
5. Canberra distance

The Canberra distance is the sum of the absolute differences between coordinates divided by the sum of their absolute values, and can thus be seen as a weighted version of the Manhattan distance. Here d denotes the distance, x and y are two different cases in the dataset, and n is the number of features [113].

d(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}    (5)
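For illustration only (a minimal sketch, not from the referenced sources; it assumes dense numeric vectors of equal length), the five distance functions above can be written in Python as:

    import math

    def euclidean(x, y):      # formula (1)
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def manhattan(x, y):      # formula (2)
        return sum(abs(a - b) for a, b in zip(x, y))

    def minkowski(x, y, r):   # formula (3); r = 2 gives Euclidean, r = 1 gives Manhattan
        return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

    def chebyshev(x, y):      # formula (4)
        return max(abs(a - b) for a, b in zip(x, y))

    def canberra(x, y):       # formula (5); zero-denominator terms are skipped by convention
        return sum(abs(a - b) / (abs(a) + abs(b))
                   for a, b in zip(x, y) if abs(a) + abs(b) > 0)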
Before using k-NN, it is good practice to list its advantages and disadvantages to ensure that k-NN is appropriate for the dataset and the learning process.

k-NN advantages:
- A very efficient pattern recognition method that is easy to implement [21].
- Simple to use [22].
- Robust to noisy data [22].
- Can be used for large and small datasets [22].
- Suitable for linear and nonlinear functions [22].
- New instances can be added without retraining on the dataset [23].
- Weights can be used to measure feature significance [23].
- Missing values can be easily imputed [24].
- Flexible (nonparametric model, except for the value of k) [25].

k-NN disadvantages:
- The distance between the query instance and all other instances must be computed [24].
- It requires a large amount of memory [24].
- It is less useful for high-dimensional datasets because of a high error rate [24].
- The choice among many distance functions may lead to different accuracy levels [24].
Figure 6 shows the learning characteristics of the k-Nearest Neighbour algorithm [27]:
Figure 6: k-NN characteristics in regards to some learning features.
2.3.2 Artificial Neural Network
Artificial Neural Networks date back to the nineteenth century, when William James and Alexander Bain contemplated the possibility of constructing a man-made system based on neural models [28]. In the middle of the twentieth century, McCulloch and Pitts showed that groups of neurons are capable of learning, and Donald Hebb developed a tuning method showing how neurons use reinforcement to strengthen the connections from important inputs. In the 1950s, building on Hebb's methodology, Farley and Clark established the first artificial neural networks, in which neurons were randomly connected; this was followed by Frank Rosenblatt's development of the perceptron for pattern classification. Regrettably, the system was not able to perform complex classification, and the research stalled in the 1960s [28]. During that era, the Adaptive Linear Element (ADALINE) was developed by Widrow and Hoff and was ultimately used to eliminate echoes in telephone systems based on adaptive signal processing [29].
Despite the limited research on neural networks during the 1970s, some researchers developed a self-organising neural model, based on physiological studies of the nervous system, called adaptive resonance theory (ART) [30]. In 1974, Paul Werbos developed a learning rule based on an error-minimisation approach in which the error is propagated backwards by adjusting the weights using gradient descent. Werbos's technique is the back-propagation algorithm, the most widely used artificial neural network model, which spread widely in the mid 1980s through the work of a group of researchers [28].
During the 1980s and 1990s, computers became roughly a hundred times faster than at the beginning of this research area, academic programs appeared, new courses were introduced, and funding became available. All of these factors encouraged researchers to concentrate on neural network applications, development, and new approaches for prediction, forecasting, and diagnosis. For example, several studies [31, 32] demonstrated the potential applications of ANNs for clinical decision making. Nowadays, major developments in neural networks attract funding for further research in many fields, such as hybrid neural networks and how to combine them with other technologies.
The artificial neuron is a computer-simulated model inspired by natural neurons. Natural neurons receive signals through synapses located on the surface of the neuron. The neuron activates and sends a signal through the axon once the incoming signal exceeds a certain threshold. This signal then transfers to other neurons and may reach the control unit (the brain) for an appropriate action. Figure 7 shows what a human neuron looks like [33].
Figure 7: Human neuron [33]
The Artificial Neuron (AN) simulates the functionality of a real neuron. An AN has a set of inputs associated with weights. The inputs and weights are combined by a mathematical equation that controls when the AN is activated. An ANN is a combination of artificial neurons that process information [33]. Figure 8 shows a simple artificial neuron.
Figure 8: Artificial Neuron
In general, the operation of the artificial neuron is modelled by the data flow diagram in Figure 9.
Figure 9: Simplified neuron operation
Having briefly described the artificial neuron, the ANN is described next. An ANN is a set of connected artificial neurons. The most used ANN model is the feed-forward network; Figure 10 shows a three-layer topology of a feed-forward network. The outcome of an ANN depends on the inputs and the values of the weights [35].
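As a hedged sketch of the operation in Figures 8 to 10 (not taken from the thesis; the sigmoid activation and the layer layout are assumptions made for illustration), a single feed-forward pass can be written as:

    import math

    def sigmoid(z):
        # assumed activation function; the text does not fix a particular one
        return 1.0 / (1.0 + math.exp(-z))

    def neuron(inputs, weights, bias):
        # weighted sum of the inputs plus a bias, passed through the activation
        return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

    def feed_forward(inputs, layers):
        # layers: a list of (weight_vectors, biases) pairs, one pair per layer;
        # each neuron in a layer has its own weight vector and bias
        for weight_vectors, biases in layers:
            inputs = [neuron(inputs, w, b) for w, b in zip(weight_vectors, biases)]
        return inputs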
Figure 10: ANN architecture
Figure 11 shows the learning characteristics of the Artificial Neural Network [27]:
Figure 11: ANN characteristics in regards to some learning features.
2.3.3 Decision Tree
A decision tree is a classification method whose structure contains nodes, branches, and leaves. The first (top) node of the tree is called the root node. Each node in the tree is connected to one or more nodes through branches; a node with no outgoing branches is called a leaf node. The leaf node indicates termination and holds the outcome value [16] [36]. Figure 12 shows an example of a simple decision tree.
Figure 12: Simple Decision Tree
Figure 12 shows how to solve a problem by posing questions about the attributes of the testing records and following the answers. The idea of this classification method is to keep asking questions until a conclusion is reached. The set of questions and answers forms a decision tree with a set of nodes. The tree can contain three types of nodes [16]:
- Root node: has zero or more outgoing edges and no incoming edges, and contains the test condition that separates records.
- Internal (normal) nodes: each has exactly one incoming edge and two or more outgoing edges, and also contains a test condition that separates records.
- Leaf nodes: hold the class labels, have no outgoing edges, and have exactly one incoming edge.
2.3.3.1 Building a Decision Tree
Building an optimal decision tree is a difficult task because many different decision trees can be built from a given set of attributes, and constructing an optimal tree is computationally costly [37]. Generally speaking, methods for constructing decision trees can be grouped into two types, top-down and bottom-up, with the literature favouring the first group [37]. There are many top-down decision tree algorithms, for example CART, C4.5, and ID3.
2.3.3.2 ID3
ID3 is a top-down decision tree algorithm proposed by Quinlan in 1986. ID3 is notable for its simplicity among classifiers; it uses information gain to split instances and build the tree. ID3 is simple to apply; however, it handles neither missing values nor pruning procedures [15].
2.3.3.3 C4.5 decision tree
C4.5 is an improved version of ID3, introduced by the same author in 1993. The purpose of C4.5 is to overcome the disadvantages of the earlier version. Instances are split using the gain ratio or the information gain, and the algorithm runs as long as the number of instances to be split exceeds a predefined threshold. Unlike ID3, the newer version is capable of treating missing values and can handle numeric attributes [15].
2.3.3.4 CART
CART, proposed by Breiman in 1984, is short for Classification and Regression Tree. CART has become a common method for constructing decision tree models due to its ability to deal with different data types, handle missing values, and produce rules that are understandable by humans. CART may be called a binary tree method because the tree is constructed by splitting each internal node into exactly two child nodes, with exactly two outgoing edges. The splits are selected using the twoing criterion, which represents the quality of the connection between parent and child decision nodes [15].
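As an illustrative sketch only (scikit-learn is one common library whose tree learner is based on an optimised version of CART; the toy data below are hypothetical):

    from sklearn.tree import DecisionTreeClassifier

    # toy data: each row is a feature vector, y holds the class labels
    X = [[5, 1], [3, 0], [8, 1], [1, 0]]
    y = ["malignant", "benign", "malignant", "benign"]

    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
    tree.fit(X, y)
    print(tree.predict([[4, 1]]))   # class label predicted for a new case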
Figure 13 shows the learning characteristics of the decision tree.
Figure 13: Decision Tree characteristics in regards to some learning features.
2.3.4 Naïve Bayes Classifier
The Naïve Bayes classifier is a probabilistic classifier based on Bayes' theorem together with an independence assumption. The Naïve Bayes classifier adopts the idea that the presence of a certain feature of an object is unrelated to the presence of any other feature, given the class variable. For example, an animal may be considered a cat if it hunts, plays with kids, has four legs, has a head, and weighs about 3 kilograms. The Naïve Bayes algorithm treats all of these features independently when predicting that the animal is a cat; no feature depends on the values of the other features [38].
The Naïve Bayes algorithm is a significant classifier: it is easy to construct, does not require complicated parameter estimation, and is easy to interpret. Therefore, Naïve Bayes can be applied by both expert and inexpert data mining developers. Finally, Naïve Bayes generally performs well in comparison with other data mining methods [39].
The literature shows two types of Naïve Bayes: the multinomial model and the multivariate Bernoulli model. In these models, classification is performed by the following Naïve rule [40, 41]:

P(c_j \mid x_i) = \frac{P(c_j)\,P(x_i \mid c_j)}{P(x_i)}    (6)

where c_j is the instance class label, x_i is the test attribute, P(c_j | x_i) is the posterior probability of the class label c_j given the attribute x_i, P(c_j) is the prior probability of the class label c_j, and P(x_i | c_j) is the likelihood, i.e. the probability of the attribute x_i given the class label c_j. Assuming that each attribute x_i is conditionally independent of every other attribute x_j, the conditional distribution over the class variable c becomes:

P(c \mid x_1, \ldots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)
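A minimal sketch of the Naïve rule above for categorical attributes (a hypothetical helper, not from the cited work; add-one smoothing is an assumption used here to avoid zero probabilities):

    import math
    from collections import Counter

    def naive_bayes_predict(x, train_X, train_y):
        class_counts = Counter(train_y)          # class frequencies give P(c)
        n = len(train_y)
        best_class, best_score = None, float("-inf")
        for c, count in class_counts.items():
            score = math.log(count / n)          # log P(c)
            # add log P(x_i | c) for each attribute (independence assumption)
            for i, value in enumerate(x):
                matches = sum(1 for row, label in zip(train_X, train_y)
                              if label == c and row[i] == value)
                score += math.log((matches + 1) / (count + 2))
            if score > best_score:
                best_class, best_score = c, score
        return best_class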
The advantage of the Bayesian classifier over other classification methods is the opportunity to take prior information about a given problem into account. The main disadvantages of the Bayesian classifier are (1) numerical attributes require discretisation in most cases, and (2) it is not suitable for large datasets that contain many attributes (time and space issues) [27]. Figure 14 shows the learning characteristics of the Bayesian classifier.
Figure 14: Bayesian classifier characteristics in regards to some learning features.
2.4 Data Mining in Healthcare
Data mining methods have been commonly used in healthcare and diagnostic applications because of their ability to predict new cases. The main features of data mining algorithms are the capability to learn from previous cases and the ability to produce a prediction model. The resulting model is used to predict newly arriving cases, to draw conclusions (knowledge) from a large amount of data, or to classify the data into useful patterns. There are enormous benefits of data mining applications in the healthcare sector. In the main, these benefits can be categorised as follows: treatment effectiveness, healthcare management, customer relationship management, fraud and abuse detection, and medical diagnosis (computer-aided diagnosis) [42]. The focus of the current research is on computer-aided diagnosis; however, the other data mining applications are described briefly below.
2.4.1 Treatment Effectiveness
Data mining applications can be used to assess the effectiveness of medical treatments. The aim is to develop a model that relates symptoms, causes, and treatment procedures. Ultimately, a data mining model may analyse all treatment procedures to produce knowledge about best-practice procedures. For example, a data mining model can find the best-performing drug for a certain disease, or group the side effects of a drug and show how to reduce its risks to patients. Finally, data mining may play a role in identifying successful treatments and making them a standard approach among healthcare providers [42].
2.4.2 Healthcare Management
In regards to healthcare management, a good design of data mining can benefit to
better classify and track some dangerous diseases and some patients who may infect
others, design appropriate intervention, and reduce the number of hospital
admissions and claims. For example, investigating readmission cases in a certain
hospital and comparing its data with current scientific literature can make an efficient
utilization of medical resources. As an another data mining example in healthcare
management, data mining can be used to decrease patient length of stay (compare a
certain case with previous cases, the length of stay should not exceed the average
stay of previous cases, of course the final decision is to be taken by the doctor.
However, data mining models can provide an approximate length of stay), provide
information to physicians (data mining model may use as a second opinion for
physicians or as a consultant), and to develop best practices [42, 43].
2.4.3 Customer Relationship Management
As in other industries, data mining applications can be used to improve customer satisfaction [44]. A study by Milley [45] stated that mining patient survey data can help to determine the waiting times a patient can expect before being seen by a physician, show how to improve customer service for patients, and provide healthcare providers with knowledge about patient expectations. Hallick [46] also suggested that customer relationship management in healthcare can help promote disease education, prevention, and wellness services.
2.4.4 Fraud and Abuse
Data mining applications can help governments (Medicare, for instance) and healthcare insurance companies to control and reduce fraud committed by some healthcare providers. The idea is to establish a model that recognises unusual claims made by customers (patients), physicians, pharmacies, labs, or other healthcare providers [42]. Once the model flags a suspected fraud case, the fraud control department may investigate the case further before action is taken against the violating healthcare provider.
2.4.5 Computer Aided Diagnosis
Computer Aided Diagnosis (CAD) is an assistive method for diagnosing diseases and estimating the severity of illness. CAD can be categorised as an expert system that utilises human knowledge and experience to solve problems automatically or with little support from human experts. The use of CAD systems does not replace the role of medical personnel; rather, CAD systems act as a second opinion or assist in decision support during the diagnosis process. The final decision is made by the physicians [11].

Types of CAD Systems

There are two types of CAD systems: CADe, for computer-aided detection, and CADx, for computer-aided diagnosis. CADe involves the use of computer analyses to indicate whether or not a certain case suffers from a target disease, such as breast cancer. CADx, on the other hand, evaluates and characterises what has been detected. In both types, the final diagnosis and patient management are performed by the physician or the medical personnel [47].
CAD Systems Characteristics

Computer Aided Diagnosis (CAD), medical expert systems, and related tools have become one of the main areas of research in the field of medical diagnosis. The aim of CAD is to design an expert system that combines human expertise and machine intelligence to achieve more accurate diagnoses effectively. In addition, it can speed up diagnosis, reduce the errors and mistakes made by human beings, and free human experts for further research. CAD can be used to assist physicians in diagnosing and predicting diseases, so that physicians can provide the necessary treatment promptly and prevent harm, including the possibility of death. In general, computer-aided diagnosis and expert systems have a number of attractive advantages [11, 48]:
- Fast response and reduced cost. The cost of providing an expert system per user is lower than providing a human expert. In addition, some cases require an immediate response, especially in emergencies; in such cases, a diagnosis can be obtained from an expert system to produce an approximate and reliable assessment of the situation.
- Increased availability and reliability. Expertise is available 24/7 on a suitable computer with internet connectivity for e-expert systems. Medical expert systems also increase confidence in the decisions made by physicians and may help settle disagreements between human experts in case of differing opinions.
- Steady, unemotional, and complete responses at all times.
- Human experts are not permanent, whereas expert systems last indefinitely.
- The knowledge in expert systems can be examined and corrected, since it is explicitly represented rather than implicit in the minds of human experts.
- Justification and warranting. Medical expert systems can explain the steps and reasons behind a decision in more detail than a human being. Moreover, expert systems confirm that the knowledge has been correctly applied.
2.4.6 Ethical, Legal, and Social Issues
About three-quarters of a billion people in North America, Europe, Asia, and recently Australia have their medical information collected in electronic form. Therefore, there must be some form of protection for human data. This work discusses some ethical, legal, and social limitations on data collection and distribution that constrain researchers and industry when utilising human data, in order to prevent the abuse of patients and the use of their data for commercial purposes. The main ethical, legal, and social issues in mining medical data may be organised into the following categories [49]:

Data Ownership

Cios and Moore [49] discussed the term data ownership and raised some questions that require the assistance of legal professionals to answer. In legal theory, ownership is determined by who is entitled to sell a particular item of property [50].
The problem of identifying the ownership of data is largely unresolved, because human data and tissues are not supposed to be sold for any reason; therefore the legal theory above cannot be applied to human data. At the same time, human medical data are in some cases available for data mining without prior consent. The question of who owns patients' information is still unsettled, and further investigation by legal specialists and researchers is needed to resolve it. Examples of questions that need to be answered are: (1) Who owns the data: the patients, the government, or healthcare providers? (2) Does the medical doctor own the data? (3) Do insurance providers own the data? (4) If insurance providers do not own their customers' data, can they refuse to pay for the collection and storage of the data? (5) Who organises and mines the data? Hence, data ownership has brought about a debate over who owns the data and who is legally allowed to authorise data usage in scientific fields such as data mining.
Fear of Lawsuits

An important aspect of medical data mining is the fear of lawsuits against medical practitioners and health care providers. Some medical practitioners and health care providers are reluctant to hand over data to data miners, because providing individuals' data for data mining may lead to lawsuits. Therefore, health care providers and governments should work together to give patients the right to decide whether or not their data may be used for research purposes. In addition, patients should be able to choose the field of research involved and the location where their data are stored. In my opinion, this could facilitate the process of mining data and may avoid lawsuits; moreover, it could reduce the effort and time spent during the mining process [49] [50].
Privacy and Security of Human Data

Medical data obtained from individuals by healthcare providers and medical practitioners may contain private and confidential information. Individuals' data have to be handled with enormous care to protect people's privacy and confidentiality. To meet these requirements, there are four forms of patient data identification [49]:

- Anonymous data: individual identification is deleted during data collection. Therefore, there is no way to recover the patient's identity in the future.
- Anonymized data: individual identification is recorded initially during data collection and then removed. In this type of identification, there is a chance of re-identifying the patient, because patient information was recorded at some stage.
- De-identified data: individual identification is recorded initially during data collection and subsequently encoded or encrypted. There is still some chance of identifying the person using computer technology, subject to the country's privacy law and guidelines. This method can help researchers remove duplicate records that relate to the same patient; however, it remains technically possible to identify the patient in the future by decrypting the identification field.
- Identified data: individual identification is recorded initially during data collection. This method requires receiving written consent from the patient to be identified, and should adhere to the country's privacy law and guidelines.
Expected Benefits

The World Wide Web and the internet are convenient ways to share and store data, accessible from almost anywhere, and they can help researchers who have legitimate reasons to access private information; for example, researchers who hold reasonable claims to mine data because the data are rare and they could not mount the financial and administrative resources to collect such data themselves. The use of individuals' data must be justified to the authorities. In addition, researchers who want to apply methods to the data must show some expected benefit for science or for society [49] [50].
Administrative Issues

Researchers and data miners are not the only people dealing with private data. Therefore, some countries, including the United States, have specified administrative guidelines for patient privacy. The guidelines include [49] [50]:
1. The establishment of security measures and policies to ensure privacy and security are in place in research centres and in all institutions and organisations that hold or have access to people's information.
2. There must be a legal agreement between the healthcare providers and the researchers (or institutions) that use patients' medical information. The agreement should oblige researchers and institutions to protect patients' data.
3. There must be up-to-date plans to protect patients' information against natural disasters, including disaster plans and data backup.
4. There must be an authorisation and identification scheme for employees to limit access to authorised personnel only.
5. There must be an ongoing internal review of authorisation and privilege procedures to ensure that the right people have access to patients' data.
6. There must be training sessions for employees on security and privacy issues. The training should be regular, to keep pace with technological advancements in security and privacy.
7. There must be daily updates to the security infrastructure, including anti-virus and internet security software.
2.4.7 Challenges of Data Mining in Healthcare
Data mining and the advancement of computer technology can help the healthcare industry in many applications, as mentioned earlier in this chapter. However, utilising data mining in healthcare has some limitations. The first limitation is the type of data in healthcare databases. The data types in a healthcare database are heterogeneous: some patients' examination results are in numeric form, others in text form or as images. Mining such mixed data types presents a challenge to developers. The sources of data also differ, including laboratories, medical centres, physicians, and more; therefore, data collection and integration are time consuming. To overcome this problem, some authors recommend that a data warehouse be built before the data mining process. However, this can be time consuming and may not be reliable for historical data [42].
Secondly, the data tend to be unorganised (unprocessed); this includes missing feature values, corrupted files, and inconsistencies with patient or family history. The problem of missing feature values can be addressed by constructing or estimating the missing values; however, the mining process is more efficient with complete data.
Thirdly, mining data that contain a large number of cases and attributes may lead to patterns that are random rather than real [51]. For this reason, not all significant patterns are necessarily useful.
Fourthly, mining healthcare data requires expertise that combines knowledge in data mining and knowledge discovery with knowledge in medical science. Since it is quite uncommon to find experts with knowledge of both domains, mining healthcare data may require collaboration between experts in data mining and experts in medical science [42].
Finally, resources for developing data mining applications, including budget, time, effort, and expertise, should be allocated by healthcare organisations. Data mining developments can produce negative outcomes for several reasons, such as lack of management, limited support, and lack of cooperation between mining and medical experts [42].
2.4.8 Electronic Health Record
An Electronic Health Record (EHR) is a systematic collection of electronic health information about patients or populations. The format of an EHR is digital, and it can be shared across different locations, such as healthcare providers, through a network. An EHR is capable of storing a range of data, including patients' medical history, medications and allergies, immunisation status, laboratory test results, radiology images, vital signs, personal information such as age and weight, and billing information [1].
The concept of an electronic health record instead of a paper one is not a new technology. In the 1960s, more information needed to be gathered and stored for patients because medical care was becoming more complex, and physicians feared that some patient information might be lost or unavailable. The availability of complete patient health information whenever it is needed is the main idea behind storing patients' health information in electronic form [52]. Between the 1970s and 1980s, electronic health records evolved to store more information about patients in order to improve patient care. For example, drug dosages, side effects, allergies, and drug interactions became available electronically to healthcare providers, enabling that information to be stored in a systematic way to produce electronic healthcare systems. Recently, some universities, research centres, and healthcare providers have developed computerised health records for research purposes and to track patient treatment. Overall, innovation and advancements in EHRs have enhanced the quality of patients' healthcare [52].
Many countries, including Australia, are adopting the EHR concept to bring better healthcare to their people, including better clinical information and accessibility, patient safety, better patient care, and efficiency and savings [52].
“Our recovery plan will invest in electronic health records and new technology that
will reduce errors, bring down costs, ensure privacy, and save lives,” President
Obama stated in his speech to Congress in February 2009.
2.5 Related Work on Breast Cancer Diagnosis
In this section some of the related prior work on data mining methods for breast
cancer diagnosis is discussed.
Song et al. [53] presented a new approach for automatic breast cancer diagnosis based on artificial intelligence technology. They focused on obtaining a hybrid system for diagnosing new breast cancer cases by combining a Genetic Algorithm (GA) with a Fuzzy Neural Network. They also showed that input reduction (feature selection) can be used for many other problems with high complexity, strong non-linearity, and huge amounts of data to be analysed.
Arulampalam and Bouzerdoum [54] proposed a method for diagnosing breast cancer called Shunting Inhibitory Artificial Neural Networks (SIANNs). A SIANN is a neural network inspired by biological neural networks, in which the neurons interact with each other via a nonlinear mechanism called shunting inhibition. Feed-forward SIANNs have been applied to several medical diagnosis problems, and the results were more favourable than those obtained using Multilayer Perceptrons (MLPs). In addition, a reduction in the number of inputs was investigated.
Setiono [55] proposed a method to extract classification rules from trained neural networks and discussed its application to breast cancer diagnosis. He also explained how pre-processing of the dataset can improve the accuracy of the neural network and of the rules, since some rules may be derived from human experience and may be erroneous. The data pre-processing involves selecting significant attributes and eliminating records with missing attribute values from the Wisconsin Breast Cancer Diagnosis dataset. The rules generated by Setiono's method were more concise and accurate than those generated by other methods mentioned in the literature.
Meesad and Yen [56] proposed a Hybrid Intelligent System (HIS) that integrates an Incremental Learning Fuzzy Network (ILFN) with linguistic knowledge representations. The linguistic rules were determined based on the knowledge embedded in the trained ILFN or extracted from real experts. In addition, the method utilised a Genetic Algorithm (GA) to reduce the number of linguistic rules while sustaining high accuracy and consistency. Once the system is completely constructed, it can incrementally learn new information in both numerical and linguistic forms. The proposed method was evaluated on the Wisconsin Breast Cancer (WBC) dataset. The results showed that the proposed HIS performs better than some well-known methods.
2.6 Feature Selection Techniques
Nowadays, the capacity for collecting and generating data is greater than ever before. Contributing factors include the steady progress of computer hardware technology for storing data and the computerisation of business, scientific, and government transactions. In addition, the use of the internet as a worldwide information system has flooded us with an incredible amount of data and information.
Data mining has attracted significant attention from information systems researchers in recent years, due to the wide availability of large amounts of data and the need to turn such data into knowledge and useful patterns. The gained knowledge and patterns can be used in many fields, such as marketing, business analysis, and health information systems [57].
The quality of the data, the sheer volume of data, and the existence of low-quality, unreliable, redundant, and noisy artefacts and outliers all affect the process of extracting knowledge and useful patterns, making knowledge discovery during the training phase more difficult.
Experts in machine learning and data mining have shown that classification performance (such as accuracy) decreases when the dataset contains many features that are not relevant to the prediction task. For example, running the C4.5 decision tree on the Monk1 problem produced an error rate of 24.3% because of three irrelevant features; the error rate dropped to 11.1% when the irrelevant features were ignored [58]. The k-nearest neighbour algorithm likewise degrades in the presence of irrelevant attributes, and the training set size required for a given accuracy level grows exponentially with the number of irrelevant attributes [59].
Therefore, researchers have felt the necessity of producing more reliable data from large numbers of records, for example by using feature selection methods. Feature selection, or attribute subset selection, is the process of identifying and utilising the most relevant attributes and removing as many redundant and irrelevant attributes as possible [60, 61].
The selection of variables, features, inputs, or attributes has become the focus of much research in areas where the numbers of cases and attributes are huge. The purpose of feature selection is to obtain a smaller number of features than in the original dataset in order to: (1) enhance prediction accuracy, (2) obtain a quicker classifier, (3) ignore less important or irrelevant features, (4) improve data quality, (5) avoid overfitting, and (6) help solve the problem of the incredible amount of available data and how to utilise it effectively [62].
The literature shows that feature selection techniques can be divided according to the induction algorithm and how it interacts with the feature selection search. On that basis, feature selection techniques fall into three types: filter methods, wrapper methods, and embedded methods [63].
2.6.1 Wrapper Feature Selection Technique
The wrapper approach was proposed by Kohavi and Pfleger in 1994 at the Stanford University AI lab [64]. In wrapper methods, the feature selection algorithm is located as a wrapper around the learning algorithm. The process starts with a search for a relevant subset of attributes, and the learning algorithm itself is used to evaluate each feature subset obtained by the search.
Figure 15 illustrates how the wrapper approach operates on the training set and how the evaluation proceeds. The learning algorithm is treated as a black box, with no modification to the learning algorithm itself. The learning algorithm assesses the subsets of features obtained by the search method and forms a hypothesis about the quality and relevance of each feature subset. The feature subset with the highest estimated value is chosen as the final set on which to run the learning algorithm. The final step is to evaluate the model on a new dataset (not used by the search) to ensure independence between the training and testing processes. The result is an estimated accuracy obtained by using the most relevant feature subset with the desired learning algorithm [65].
Figure 15: The Wrapper approach for features subset selection [65]
Table 2 shows the main advantages and disadvantages of using a wrapper as a feature selection method, as well as examples of existing methods that utilise the wrapper approach [63].
Table 2: Examples, advantages, and disadvantages of wrapper feature selection [63]
2.6.2 Filter Feature Selection Techniques
Filter techniques examine the significance of features by investigating intrinsic characteristics of the data. In most cases a feature ranking is calculated, and low-ranking features are ignored during the learning process. Afterwards, the high-ranking subset of features is used as the training set for the classification algorithm [63]. The main difference between the filter and the wrapper is that the filter ignores the learning algorithm during the feature subset search. Figure 16 shows the filter approach; feature subset extraction is totally independent of the learning classifier.
Figure 16: The filter approach [56]
Advantages of filter techniques include that they can be applied to large databases containing many attributes and cases, that they are computationally simple and fast compared to wrapper and embedded methods, and that they are independent of the classification algorithm. The benefit of the independence between filters and the learning classifier is that feature selection needs to be performed only once, after which different classifiers can be used to evaluate the subset. On the other hand, this same independence between filter methods and learning algorithms may result in lower classification accuracy [63].
Table 3 summarises the main advantages and challenges of filter methods, with some examples of popular filter methods.
Table 3: Examples, advantages, and disadvantages of filter feature selection [63]
2.6.3 Embedded Feature Selection Techniques
Embedded Methods (EM) differ from other feature selection methods in how the classification method and the feature selection cooperate. In filter methods, there is no cooperation between the learning classifier and the feature selection. In wrapper methods, the learning classifier is used to measure the quality of feature subsets without intervening in the structure of the classifier. In contrast to the filter and wrapper approaches, in embedded methods the feature selection and the learning process cannot be separated [66]: the search for the optimal subset of features is built into the classifier construction. The computational cost of EM is lower than that of wrapper methods, owing to the interaction between the classifier and the feature selection [63].
Table 4 shows some advantages and disadvantages of using such methods, along with examples.
Table 4: Examples, advantages, and disadvantages of embedded feature selection [63]
2.6.4 Feature Selection Techniques Used in Current Work
Information Gain

The information gain method approximates the quality of each attribute using entropy, by estimating the difference between the prior entropy and the posterior entropy [67]. It is one of the simplest attribute ranking methods and is often used in text categorisation. If x is an attribute and c is the class, the following equation gives the entropy of the class before observing the attribute:

H(c) = -\sum_{c} P(c) \log_2 P(c)    (7)
where P(c) is the probability function of the variable c. The conditional entropy of c given x (the posterior entropy) is given by:

H(c \mid x) = -\sum_{x} P(x) \sum_{c} P(c \mid x) \log_2 P(c \mid x)    (8)

The information gain (the difference between the prior entropy and the posterior entropy) is given by the following equations:

IG(c, x) = H(c) - H(c \mid x)    (9)

IG(c, x) = -\sum_{c} P(c) \log_2 P(c) + \sum_{x} P(x) \sum_{c} P(c \mid x) \log_2 P(c \mid x)    (10)
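A sketch of equations (7) to (10) for a discrete attribute (hypothetical helper functions, written here for illustration):

    import math
    from collections import Counter

    def entropy(labels):
        # H(c) = -sum P(c) log2 P(c), equation (7)
        n = len(labels)
        return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

    def information_gain(attribute_values, labels):
        # IG(c, x) = H(c) - H(c|x), equations (9) and (10)
        n = len(labels)
        posterior = 0.0
        for v in set(attribute_values):
            subset = [c for a, c in zip(attribute_values, labels) if a == v]
            posterior += (len(subset) / n) * entropy(subset)   # equation (8)
        return entropy(labels) - posterior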
Correlation based feature selection (CFS)

CFS is among the simplest feature selection methods. CFS ranks features and discovers the merit of a feature or subset of features according to correlations between features. The main aim of CFS is to find a subset of the feature space that is highly correlated with the class label. CFS usually removes uncorrelated features and redundancy. CFS's feature subset evaluation function is as follows [68]:

Merit_s = \frac{k \, \overline{r}_{cf}}{\sqrt{k + k(k-1) \, \overline{r}_{ff}}}    (11)

where Merit_s is the worth of the feature subset s that contains k features, \overline{r}_{cf} is the average feature-to-class correlation, and \overline{r}_{ff} is the average feature-to-feature correlation. To estimate the correlations between features in this equation, CFS uses a modified information gain measure called symmetrical uncertainty, which compensates for information gain's bias towards attributes with more values, as follows [69]:

SU = 2 \left[ \frac{H(x_i) + H(x_j) - H(x_i, x_j)}{H(x_i) + H(x_j)} \right]    (12)
Relief

Relief ranks among the well-known feature selection techniques [70]. Its aim is to rank the quality of features by their ability to distinguish instances of different classes. Relief uses instance-based learning (lazy learning, such as k-Nearest Neighbour) to assign a grade to each feature. Features are ranked by weight, and those that exceed a threshold (determined by the user) are selected to form the promising subset. For each sampled instance, the closest neighbouring instance of the same class (the nearest hit) and the closest instance of a different class (the nearest miss) are selected. The following equation updates the weight using the averaged distances to the nearest hit and the nearest miss [70]:

W_x = W_x - \frac{\mathrm{diff}(x, r, h)^2}{m} + \frac{\mathrm{diff}(x, r, h')^2}{m}    (13)

where W_x is the grade for the attribute x, r is a randomly sampled instance, h is the nearest hit, h' is the nearest miss, and m is the number of samples.
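A hedged sketch of the Relief update in equation (13), assuming numeric features and a squared-difference diff; it also assumes every class occurs at least twice, so that a nearest hit and a nearest miss always exist:

    import random

    def relief(X, y, m):
        # X: feature vectors, y: class labels, m: number of sampled instances
        n_features = len(X[0])
        w = [0.0] * n_features
        sq_dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
        for _ in range(m):
            i = random.randrange(len(X))
            r, label = X[i], y[i]
            # nearest hit: closest instance of the same class (excluding r itself)
            hit = min((x for x, c in zip(X, y) if c == label and x is not r),
                      key=lambda x: sq_dist(r, x))
            # nearest miss: closest instance of a different class
            miss = min((x for x, c in zip(X, y) if c != label),
                       key=lambda x: sq_dist(r, x))
            for j in range(n_features):
                w[j] += -((r[j] - hit[j]) ** 2) / m + ((r[j] - miss[j]) ** 2) / m
        return w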
Principal Components Analysis (PCA)

The purpose of Principal Components Analysis (PCA) is to reduce the dimensionality of a dataset that contains a large number of correlated attributes, by transforming the original attribute space into a new space in which the attributes are uncorrelated. The algorithm then ranks the transformed attributes by how much of the original dataset's variation they capture. The transformed attributes with the most variation are kept, while the rest are discarded. It is also worth mentioning that PCA is applicable to unsupervised datasets because it does not take the class label into account [71].
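For illustration (scikit-learn shown as one common implementation; the data matrix and component count below are arbitrary stand-ins):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(100, 9)             # stand-in for an n x p feature matrix
    pca = PCA(n_components=2)              # keep the two highest-variance components
    X_reduced = pca.fit_transform(X)
    print(pca.explained_variance_ratio_)   # variance retained by each component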
Consistency based Subset Evaluation (CSE)

Consistency based Subset Evaluation (CSE) adopts the class consistency rate as its evaluation measure. The idea is to obtain a set of attributes that divides the original dataset into subsets dominated by one majority class [60]. One well-known consistency-based feature selection measure is the consistency metric [72] proposed by Liu and Setiono:

\mathrm{Consistency}_s = 1 - \frac{\sum_{j=0}^{k} (D_j - M_j)}{N}    (14)

where s is the feature subset, k is the number of distinct attribute-value combinations on s, D_j is the number of occurrences of the j-th attribute-value combination, M_j is the cardinality of the majority class for the j-th combination, and N is the total number of instances in the dataset. For continuous values, Chi2 can be used [73]; Chi2 automatically discretises continuous feature values and removes irrelevant continuous attributes.
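A minimal sketch of the consistency rate in equation (14) for a candidate subset (hypothetical helper; N is taken here as the number of instances):

    from collections import Counter

    def consistency(X, y, subset):
        # group instances by their value combination on the selected subset
        groups = {}
        for row, label in zip(X, y):
            key = tuple(row[j] for j in subset)
            groups.setdefault(key, []).append(label)
        # inconsistency of a group: its size minus its majority-class count
        inconsistent = sum(len(g) - Counter(g).most_common(1)[0][1]
                           for g in groups.values())
        return 1 - inconsistent / len(y)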
2.6.5 Related Work on Feature Selection Techniques
Numerous feature selection methods have been used broadly across different domains. Xie and others [74] proposed hybrid feature selection algorithms for building efficient diagnostic models, based on a new accuracy criterion, the generalised F-score (GF), and SVM. The hybrid algorithms adopt Sequential Forward Search (SFS), Sequential Forward Floating Search (SFFS), and Sequential Backward Floating Search (SBFS), respectively, together with SVM, to accomplish hybrid feature selection with the new accuracy criterion guiding the procedure. These were called GFSFS, GFSFFS, and GFSBFS, respectively. The hybrid methods combine the advantages of filters and wrappers to select the optimal feature subset from the original dataset and build efficient classifiers. Experimental results showed that the proposed hybrid methods construct efficient diagnosis classifiers with high average accuracy compared with traditional algorithms.
Liao and others [75] proposed a hybrid feature selection method used together with k-NN and SVM. They aimed to identify the most significant genes, those demonstrating the highest capability to discriminate between the classes of samples. They first used a filter method to rank the genes in terms of their expression differences, and then a clustering method based on k-NN principles to cluster the gene expression data. A support vector machine was applied to validate the classification performance of the candidate genes. Their experimental results demonstrated the effectiveness of their method in addressing the problem.
Vijayasankari and Ramar [76] also proposed a novel hybrid feature selection method that selects relevant features and discards irrelevant and redundant features from the original dataset using C4.5 and the Naïve Bayes classifier. The efficiency and effectiveness of the proposed method were demonstrated through extensive comparisons with other methods using real-world data of high dimensionality. Experimental results on these datasets revealed that the proposed algorithm increases classifier accuracy with a lower error rate.
Hall and Holmes [60] presented a benchmark comparison of several attribute selection methods for supervised classification. Attribute selection was performed by cross-validating the attribute rankings with respect to the classification learners C4.5 and Naïve Bayes. The results showed that feature selection methods can enhance the performance of some learning algorithms, and that the correlation-based feature selection method produced the best results among the six feature selection methods compared.
Saeys et al. [63] reviewed the importance of the feature selection approach in a set of well-known bioinformatics applications. They focused on two main issues: large input dimensionality and small sample sizes. The authors found that feature selection methods could help researchers address these issues, and they anticipated that feature selection will become fundamental in dealing with high-dimensional applications.
The literature also describes two broad categories of feature selection: wrapper and filter. The wrapper evaluates and selects attributes based on accuracy estimates from the target learning algorithm. Using a given learning algorithm, the wrapper essentially searches the feature space by omitting some features and testing the impact of their omission on the prediction metrics; a feature whose omission makes a significant difference to the learning process matters and should be considered a high-quality feature. The filter, on the other hand, uses the general characteristics of the data itself and works separately from the learning algorithm. Specifically, the filter uses the statistical correlation between a set of features and the target feature; the amount of correlation between a feature and the target variable determines its importance [77]. A further category sorts attributes using algorithms that rank a feature or set of features according to its contribution to a subset of attributes [57].
2.7 Missing Feature Values
Missing feature values are common in many medical databases, for various reasons. Some feature values are not recorded because they are simply unavailable, for example when diagnosing a patient without a blood test result. Another reason is that attribute values might be forgotten, mistakenly erased, or left unfilled. Moreover, some interviewees decline to provide private information such as their income or age [78].
2.7.1 Types of Missing Values
Donald Rubin classified missing feature values into three types: missing completely at random, missing at random, and missing not at random [79].

Missing Completely At Random

Missing Completely At Random (MCAR) describes missingness in which the probability that a feature value is missing is unrelated both to that feature's value and to the values of any other features in the dataset. For example, data may be missing because equipment malfunctioned, the weather was terrible and the observation could not be recorded on a certain day, people got sick, or the data were not entered correctly [80].
Notice that the main concern is the value of the feature, not the missingness itself. For instance, a person who refuses to report personal income is also likely to refuse to report family income; the data obtained are still considered MCAR as long as the reasons for refusal have no relation to the income value itself [80].
Missing At Random

Missing At Random (MAR) is the case where the occurrence of a missing feature value does not depend on the feature value itself but may depend on the values of other features in the dataset. For example, a depressed person is more likely not to report income, so non-reporting of income may be explained by depression [80].

Missing Not At Random

Missing Not At Random (MNAR) is the case where the missing feature value is neither missing at random nor missing completely at random. For example, if a person suffers from depression, and a person who suffers from depression is less likely to report their mental status, then the data are not missing at random. Similarly, if a person refuses to disclose their age because of the age itself, the missing data are not random [80].
In data mining and machine learning applications, feature values that matter but are missing create a challenge for researchers. Handling unknown attribute values by substituting the most appropriate values is a common concern in data mining and knowledge discovery. The process of reconstructing missing values is vital in most supervised and unsupervised data mining research because it may affect the quality of learning and the performance of classification algorithms [81].
In general, classification accuracy is particularly affected by the presence of missing feature values, because most learning classifiers, such as neural networks, do not account for the possibility of missing feature values and cannot deal with them automatically [81].
2.7.2 Handling Missing Data
As mentioned earlier, handling missing feature values is a vital process in most supervised and unsupervised data mining research, because missingness may affect the quality of the data itself, which in turn may affect classifier performance. The literature shows many attempts to treat missing feature values. The most popular methods for dealing with missingness are omitting instances, imputation, and expectation maximisation. All of these methods can be applied in conjunction with any classifier that operates on complete data [81].
Omitting Instances

In the omitting-instances method, any data record that contains missing feature values is deleted from the dataset, and the classification process is then run on the remaining instances. The main disadvantage of this method is that it can discard important information, so it is not a common method; however, it can be used when the amount of missing data is small [81].
Features Imputation

Feature imputation is a well-known method for constructing missing feature values in datasets for learning purposes. Imputation methods can be divided into two major types: single imputation and multiple imputation [81].
In single imputation, missing feature values are substituted by values derived from the corresponding feature according to a certain rule, such as the feature's mean, mode, or median, or a learning algorithm. For example, mean imputation calculates the mean of feature f over the records that contain values; this mean is then used to fill in the records where feature f is missing. Another example is regression imputation, which deals with missing feature values by building regression models that reconstruct missing values from the observed features (features that contain values); the regression models are then used to predict the values of the missing attributes [81].
The procedure for constructing missing feature values in multiple imputation is similar to that of single imputation; however, multiple imputation uses more than one value to fill in missing feature values, such as the mean of the observed feature values, the mode of the observed feature values, and a regression method.
The main drawback of the multiple-imputation approach is that its computational cost is higher than that of single imputation; however, its classification performance (accuracy) is also higher [81].
• Expectation Maximisation
The two most important methods for dealing with missing feature values in the recent literature are expectation maximisation and multiple imputation [79]. Expectation maximisation is one of the most effective methods for handling missing data [82]. To demonstrate expectation maximisation, consider the data shown in Table 5, where the features with missing values are depression, age, and height [83].
Table 5: Extract of data to demonstrate Expectation Maximization [83]
To perform expectation maximisation, firstly, the mean, variance, and covariance are estimated from the instances whose data are complete, such as row number 4 in Table 5. In particular, expectation maximisation calculates the following values, as shown in Table 6:
• The means of depression, age, height, and wage are 4.71, 37.50, 183.21, and 45504.43 respectively.
• The variances of depression, age, height, and wage are 3.55, 9.43, 194.43, and 14403.12 respectively, and appear on the diagonal.
• The other cells are the covariance values between each pair of variables.
Table 6: The calculations of mean, variance, and covariance for the features depression, age, height, and wage.
Secondly, expectation maximisation uses maximum likelihood procedures to estimate regression equations that capture the relationships between the variables. For example, the maximum likelihood algorithm may produce the following equations:

• Depression = −15.3 + 0.01 × age + 0.004 × height + 0.0005 × wage
• Age = 7.3 + 0.34 × depression + 0.002 × height + 0.0003 × wage
• Height = 19.2 + 0.53 × depression + 0.021 × age + 0.0004 × wage
• Wage = 7.3 + 0.44 × depression + 0.031 × age + 0.0021 × height

The purpose of maximum likelihood is to ensure that these equations predict the means, variances, and covariances more accurately than any other equations [79, 82]. Thirdly,
these equations can be used to estimate the missing values. The procedure for estimating a missing feature value is shown below:

• Consider the equation Depression = −15.3 + 0.01 × age + 0.004 × height + 0.0005 × wage.
• This equation can be used to estimate depression for individuals who did not provide it.
• For the second case, the observed values 17, 173, and 31600 are substituted into this equation: −15.3 + 0.01 × 17 + 0.004 × 173 + 0.0005 × 31600 = 1.362.
• Depression for this person would therefore be 1.362.

For other missing feature values, the same procedure is used with the appropriate equation. The constructed missing feature values are shown in bold in Table 7.
Table 7: The final data set after performing Expectation Maximization method.
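The following Python sketch illustrates the spirit of the procedure above: missing cells are initialised with column means and then iteratively re-estimated by regressing each incomplete column on the others. It is a simplified EM-style loop rather than the full covariance-based algorithm, and the data values are invented for illustration:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def em_style_impute(df, n_iter=20):
    """Start from column means, then repeatedly regress each incomplete
    column on the others and replace its missing cells with predictions."""
    filled = df.fillna(df.mean())
    na_mask = df.isna()
    for _ in range(n_iter):
        for col in df.columns[na_mask.any()]:
            rows = na_mask[col]                      # rows missing this column
            others = [c for c in df.columns if c != col]
            reg = LinearRegression().fit(filled.loc[~rows, others],
                                         filled.loc[~rows, col])
            filled.loc[rows, col] = reg.predict(filled.loc[rows, others])
    return filled

# Toy data mirroring the structure of the worked example (depression, age,
# height, wage); all values here are made up.
data = pd.DataFrame({
    "depression": [np.nan, np.nan, 3.0, 5.0, 6.0],
    "age":        [40.0, 17.0, np.nan, 38.0, 35.0],
    "height":     [180.0, 173.0, 190.0, 185.0, np.nan],
    "wage":       [50000.0, 31600.0, 42000.0, 47000.0, 52000.0],
})
print(em_style_impute(data).round(2))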
2.8 Chapter Summary
This chapter presented a background study of the main machine learning and data mining technologies used in the present research. It also presented data mining in the field of healthcare, and described related prior work on data mining techniques, missing feature values, and feature selection techniques, as well as the techniques used in this thesis.
The next chapter will present the research methodology used in the current research and the details of the datasets used.
CHAPTER THREE
Research Methodology
3.1 Introduction
Two major research paradigms have been identified in the Western tradition of science: positivist (also called scientific) and interpretive (known as anti-positivist) [84]. However, Dash [85] described three major research paradigms: positivism, anti-positivism, and critical theory. The positivism paradigm is based on observation and reasoning as tools for understanding a certain problem or behaviour. This paradigm usually involves the manipulation of variables and predictions on the basis of previous observations or previous history. Positivist researchers are concerned with what has caused a particular relationship and what the effects of this relationship are; they also prefer quantitative data, which can be transformed into numbers and statistics.
Anti-positivism, or the qualitative research approach, concentrates on a subjectivist approach to studying social phenomena and draws on a range of research techniques. Anti-positivist researchers criticise positivists because they believe that statistics and numbers alone are not informative about human behaviour. Similarly, the critical theory research approach describes critique and action research as research methods for investigating a certain problem [85].
Despite the fact that each research tradition has its own approaches and research methods, a researcher may adopt research methods cutting across research traditions to solve a problem or to answer research questions [85]. Table 8 shows the research approaches, research methods, and examples for each research tradition. This research can be placed as positivist research, utilising the principles of prediction from previous history and data manipulation (the term manipulation in this regard does not involve changing the data structure or values; rather, it refers to processes such as filling missing feature values, treating noisy data, and data normalisation).
Table 8: Selection of research paradigms and research methods [85]
Research
paradigms
Positivism
Research approach

Quantitative
Research methods
Examples

Surveys,
- Attitude of

longitudinal,
distance learners

cross-sectional,
towards online
correlation,
based education.

experimental,
- Relationship

quasi-experimental,
between students’
and
motivation and
ex-post facto research
their academic

achievement.
- Effect of
intelligence on the
academic
performances of
primary school
learners.
67
Chapter 3: Research Methodology
Anti-

Qualitative
positivism

Biographical,
- A study of

Phenomenological,
autobiography of a

Ethnographical, and
great statesman.

case study
- A study of
dropout among the
female students
- A case study of
an open distance
learning
Institution in a
country.
Critical
theory

Critical and

Ideology critique, and
- A study of
action-oriented

action research
development of
education during
the British rule in
India
- Absenteeism
among standard
five students of a
primary school
3.2 Data Mining Methodology
Knowledge discovery from databases, or data mining, refers to extracting useful relationships and patterns from large databases. Because of the amount of data involved, a systematic method must be applied to obtain useful outcomes. It is well established that quality data yield more accurate outcomes than dirty data; dirty data is a common term in data mining that describes unwanted data characteristics such as incompleteness, noise, and inconsistency. In this research, our method involves the data mining processes shown in Figure 17:
Figure 17: Research Method Overview
3.2.1 Data Collection
It is very important to acquire high-quality data, which depends heavily on the quality of the data collection process. The research data were originally proposed to be collected from Canberra Hospital and some healthcare providers in the Australian Capital Territory (ACT). However, access to patient data could not be gained due to the privacy policies of healthcare providers in Australia. The second option was to collect data from overseas; this option failed due to the cost involved. Therefore, this research relied on a third option, which utilises online databases. Online databases are publicly available, are collected from clinical environments, have undergone proper organisational ethics approval processes, and are freely available for research purposes. The advantage of using online databases is the ability to compare our methods with existing methods on the same databases.
One of the most popular machine learning repositories is the UCI machine learning repository. UCI is a collection of databases, domain theories, and data generators that are used by machine learning researchers to train and test machine learning algorithms. The repository was created in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning databases [86]. This research has used the Wisconsin Breast Cancer (WBC) dataset from the UCI repository. WBC contains 699 records; each record consists of 9 features plus the class attribute. Table 9 shows a sample of the WBC dataset. In addition to WBC, two other versions of the breast cancer data exist: Wisconsin Diagnosis Breast Cancer (WDBC) and Wisconsin Prognosis Breast Cancer (WPBC). WDBC contains 569 instances, 32 attributes, and 2 class labels, while WPBC has 198 instances, 34 attributes, and 2 class labels.
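As an illustration of working with this dataset, the following Python sketch loads the raw WBC file from the UCI repository with pandas; the URL and column order follow the repository's documentation at the time of writing and are assumptions that may change:

import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "breast-cancer-wisconsin/breast-cancer-wisconsin.data")
cols = ["id", "clump_thickness", "uniformity_cell_size",
        "uniformity_cell_shape", "marginal_adhesion",
        "single_epithelial_cell_size", "bare_nuclei", "bland_chromatin",
        "normal_nucleoli", "mitoses", "class"]   # class: 2 = benign, 4 = malignant

# Missing feature values are encoded as "?" in the raw file.
wbc = pd.read_csv(URL, names=cols, na_values="?")
print(wbc.shape)                          # expected: (699, 11)
print(wbc["bare_nuclei"].isna().sum())    # 16 records with missing values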
Table 9: Sample of the Wisconsin Breast Cancer dataset (CT = Clump Thickness, UCSize = Uniformity of Cell Size, UCShape = Uniformity of Cell Shape, MA = Marginal Adhesion, SECS = Single Epithelial Cell Size, BN = Bare Nuclei, BC = Bland Chromatin, NN = Normal Nucleoli, MI = Mitoses; "?" denotes a missing value)

CT   UCSize  UCShape  MA   SECS  BN   BC   NN   MI   Class
 5     1       1       1     2    1    3    1    1     2
 5     4       4       5     7   10    3    2    1     2
 3     1       1       1     2    2    3    1    1     2
 6     8       8       1     3    4    3    7    1     2
 4     1       1       3     2    1    3    1    1     2
 8    10      10       8     7   10    9    7    1     4
 1     1       1       1     2   10    3    1    1     2
 2     1       2       1     2    1    3    1    1     2
 2     1       1       1     2    1    1    1    5     2
 4     2       1       1     2    1    2    1    1     2
 1     1       1       1     1    1    3    1    1     2
 2     1       1       1     2    1    2    1    1     2
 5     3       3       3     2    3    4    4    1     4
 1     1       1       1     2    3    3    1    1     2
 8     7       5      10     7    9    5    5    4     4
 7     4       6       4     6    1    4    3    1     4
 4     1       1       1     2    1    2    1    1     2
 .     .       .       .     .    .    .    .    .     .
 8     4       5       1     2    ?    7    3    1     4
 1     1       1       1     2    1    3    1    1     2
 5     2       3       4     2    7    3    6    1     4
 3     2       1       1     1    1    2    1    1     2
 5     1       1       1     2    1    2    1    1     2
 2     1       1       1     2    1    2    1    1     2
 1     1       3       1     2    1    1    1    1     2
 3     1       1       1     1    1    2    1    1     2
 2     1       1       1     2    1    3    1    1     2
10     7       7       3     8    5    7    4    3     4
 2     1       1       2     2    1    3    1    1     2
 3     1       2       1     2    1    2    1    1     2
3.2.2 Data Selection
Data selection, or feature selection, has been an active research area in pattern recognition, statistics, and data mining. The aim of feature selection is to select a subset of the variables in the records by ignoring features that carry little or no important information. For example, a physician can decide, based on a subset of features, whether a dangerous surgery is necessary for treatment. In the current research, feature selection methods have been used to minimise the number of features in the dataset before commencing the mining process.
3.2.3 Data Pre-Processing
The data collection phase may produce a dataset that contains incomplete, inaccurate, and inconsistent data. Inaccurate data contain incorrect attribute values; this may be due to data entry errors, faults in data collection tools, errors in data transmission, or users submitting incorrect values just to fill mandatory fields during surveying [57]. Incomplete data can occur for many reasons; for example, some attribute values were not considered important during data entry, and some attribute values were not always available. Inconsistency occurs when a record is in conflict with other records in the dataset [57].
Completeness, accuracy, and consistency are the elements that define data quality. Data pre-processing is an important step in the data mining process to satisfy these data quality elements. Therefore, the current research utilises data pre-processing tasks to ensure the dataset is ready for the mining process and to produce results that are as accurate as possible. The study has proposed a new approach for constructing missing feature values to satisfy the completeness element; a comparison has also been made between feature selection methods to find the method that best suits the datasets; and some techniques were applied to eliminate noise and outliers. At the end of this phase, the data should be ready for the mining process.
3.2.4 Applying Data Mining Methods
At this stage, the data are ready for the mining process with little or no further pre-processing. The processed data are used to evaluate the proposed methods. This work proposed a method called Information Gain and Adaptive Neuro-Fuzzy Inference System for Breast Cancer Diagnosis (IG-ANFIS). IG-ANFIS is a new approach for breast cancer diagnosis that uses the advantages of the Adaptive Network-based Fuzzy Inference System (ANFIS) and the information gain method. In this approach, ANFIS builds an input-output mapping using both human knowledge and machine learning ability, while the information gain method reduces the number of input features to ANFIS. The experimental results showed 98.24% classification accuracy, which underlines the capability of the proposed algorithm. The method and experimental results were presented at the AICIT conference in Seoul in 2010.
During the design and training of the IG-ANFIS method, the study showed how important it is to have a complete dataset during the mining process and when applying automatic methods. Therefore, this work proposed a new approach for constructing missing feature values, based on iterative nearest neighbours and distance metrics. The proposed approach employed a weighted k-nearest neighbours algorithm and propagated the classification accuracy to a certain threshold. The proposed method showed a classification accuracy improvement of 0.005 on the constructed dataset over the original dataset, which contains some missing feature values. The maximum classification accuracy was 0.9698 at k=1. Although the improvement in classification accuracy is not large, it does indicate that there is some room for improvement by estimating the values of missing features. This work appeared in the 3rd International Conference on Data Mining and Intelligent Information Technology Applications (2011).
Data mining is a comprehensive approach that branches into many areas. In order to cover more aspects of data mining, it was decided to make a comparison between several feature selection approaches. This work has been accepted in the International Journal on Data Mining and Intelligent Information Technology Applications and shall appear in the 2013 edition. Last century, the challenge was to develop new technologies to store large amounts of data. Recently, the challenge has been to effectively utilise this incredible amount of data and to obtain knowledge that benefits business, scientific, and government transactions by using a subset of the features rather than all the features in a dataset. Therefore, the study has focused on feature selection techniques as a method to obtain high-quality attributes that enhance the mining process.
Feature selection techniques touch all disciplines that require knowledge discovery from large data. In our study, a comparison was made between benchmark feature selection methods using the well-known Wisconsin Breast Cancer (WBC) dataset and three well-recognised machine learning algorithms. The study found that feature selection methods can significantly improve the performance of learning algorithms. However, no single feature selection method best satisfies all datasets and learning algorithms. Therefore, machine learning researchers should understand the nature of their datasets and the characteristics of their learning algorithms in order to obtain better outcomes. Overall, Correlation-based Feature Selection (CFS) and Consistency-based Subset Evaluation performed better than Information Gain, Symmetrical Uncertainty, Relief, and Principal Components Analysis.
Based on the feature selection study that was carried out, it was found that identifying the feature selection technique that best suits a certain learning algorithm could benefit researchers. Therefore, this work proposed a new method based on a combination of learning algorithms and feature selection techniques. The idea is to obtain a hybrid approach that combines the best-performing learning algorithms with the best-performing feature selection techniques with regard to three well-known datasets. The experimental results showed that combining correlation-based feature selection with the Naïve Bayes learning algorithm can produce promising results. The results were presented at ICONIP 2012 (the 19th International Conference on Neural Information Processing) in Qatar.
3.2.5 Evaluation
The evaluation phase is an important part of the data mining process. In this phase, the aim of the data mining experts is to test and assess the proposed model. If the model does not satisfy expectations, the data mining experts usually rebuild the model by changing its parameters until the desired results are achieved.
In this study, the evaluation of the proposed methods is performed by comparing the model results with the real data values (class features), from which the classification accuracy and error rate are calculated. The error rate (Err) of the classifier is defined as the number of misclassified samples divided by the total number of records in the dataset; the classification accuracy of the model is then one minus the error rate. If the classification accuracy is less than a certain threshold, say 80% for example, then changes have to be made to the method, the feature selection, or the pre-processing phase until satisfying outcomes are obtained. Another approach to evaluating the results is to compare the results obtained by the proposed methods with previous methods in the literature. In most cases, the dataset used with the proposed method should be the same dataset used by the other methods in the literature, to ensure that a competitive method has been obtained.
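As a minimal sketch of the evaluation measure described above, the following Python function computes the error rate and the corresponding classification accuracy from true and predicted class labels (the label vectors here are made up for illustration):

def error_rate(y_true, y_pred):
    """Number of misclassified samples divided by the total number of records."""
    misclassified = sum(t != p for t, p in zip(y_true, y_pred))
    return misclassified / len(y_true)

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
err = error_rate(y_true, y_pred)   # 1/5 = 0.2
accuracy = 1 - err                 # 0.8
print(err, accuracy)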
3.2.6 Machine Learning Software Development Tools
The current work has used two well-known machine learning tools: WEKA and MATLAB. WEKA stands for Waikato Environment for Knowledge Analysis (the weka is also a flightless bird found only in New Zealand). WEKA is open-source machine learning software written in the Java language. WEKA provides the environment to calculate the information gain and contains data mining and machine learning methods for data pre-processing, classification, regression, clustering, association rules, and visualisation [87].
MATLAB is a fourth-generation language and interactive environment for numerical computation, visualisation, and programming. MATLAB is used to analyse data, develop algorithms, and create models and applications; its users therefore come from various backgrounds in engineering, science, and economics. MATLAB is widely used in academic and research institutions as well as industrial enterprises [88].
3.2.7 Results Visualization
At the end of the evaluation phase, data mining experts decide how to present the data mining results. The aim of data visualisation is to let the end user view and utilise the obtained results. Since data mining usually involves extracting previously unknown information from a database, end users may raise questions about the source of the information and how to utilise it, whereas in conventional database use the end users expect the information to already reside in the database. This research has not investigated data visualisation in depth, because the current study is for research purposes and is not business oriented. However, tables, scatter charts, bar charts, and figures have been used to demonstrate the obtained results.
3.3 Chapter Summary
This chapter presented the research methodology used in the current research and the source of the dataset used. It also described the main data mining methodology. The next chapter will describe a new approach for diagnosing breast cancer based on the combination of information gain and an adaptive neuro-fuzzy inference technique.
CHAPTER FOUR
Breast Cancer Diagnosis Based on Information
Gain and Adaptive Neuro Fuzzy Inference System
4.1 Overview
In this chapter, the details of a data mining approach based on ANFIS and the information gain method are discussed. First, a brief description of ANFIS is given; the details of the proposed approach are described in Section 4.4; the experimental validation and discussion are presented in Section 4.5; and the summary of the findings using this approach is given in Section 4.6.
4.2 Adaptive Neural Fuzzy Inference System (ANFIS)
Adaptive Neural Fuzzy Inference System (ANFIS), proposed by Jang in 1993, is a
combination of two machine learning approaches: Neural Network (NN) and Fuzzy
Inference System (FIS) [89].
Some of the earlier work on ANFIS was done by Übeyli [90], who applied an adaptive neuro-fuzzy inference system to breast cancer diagnosis. The author used a database of patients with known diagnoses (i.e. supervised learning). The ANFIS classifier was trained with a set of records of nine examined features for breast cancer and was then used to diagnose new cases. The system combined the learning ability of neural networks with the fuzzy modelling approach. Übeyli's ANFIS-based model showed promising results and underlined its capability to diagnose the disease with 98% classification accuracy.
Motivated by this work, I tried to adapt the ANFIS-based data mining technique with a pre-processing stage involving the information gain (IG) method, with the expectation that this would enhance the classification accuracy on breast cancer datasets. The details of ANFIS, the IG structure, and the experimental validation are described in the next few sections.
4.2.1 ANFIS Structure
ANFIS exploits the advantages of NN and FIS by combining human expert knowledge (FIS rules) with the ability to adapt and learn (NN) [89]. For a simple illustration, suppose the fuzzy system contains two Sugeno fuzzy rules:

Rule 1: IF $x$ is $A_1$ AND $y$ is $B_1$, THEN $f_1 = p_1 x + q_1 y + r_1$
Rule 2: IF $x$ is $A_2$ AND $y$ is $B_2$, THEN $f_2 = p_2 x + q_2 y + r_2$
Figure 18 (a) shows the fuzzy reasoning and Figure 18 (b) shows the corresponding structure of ANFIS. In Figure 18 (b), the node function in each layer is as follows [89]:

Layer 1:
Each node $i$ (represented by a square) in this layer accepts an input and computes the membership $\mu_{A_i}(x)$:

$$O_i^1 = \mu_{A_i}(x) \tag{15}$$
where $x$ is the input to node $i$, and $A_i$ is the linguistic label (small, large, etc.) associated with this node. In other words, $O_i^1$ is the membership function of $A_i$ and it specifies the degree to which the given $x$ satisfies the quantifier $A_i$. Usually $\mu_{A_i}(x)$ is chosen to be bell-shaped with values between 0 and 1, such as the generalised bell function:

$$\mu_{A_i}(x) = \exp\left(-\left(\frac{x - c_i}{a_i}\right)^2\right) \tag{16}$$

where $a_i$ and $c_i$ are two parameters called premise parameters.
Layer 2:
Every node in this layer (represented by a circle) takes the corresponding outputs from Layer 1 and multiplies them to generate a weight:

$$w_i = \mu_{A_i}(x) \times \mu_{B_i}(y), \quad i = 1, 2 \tag{17}$$

The output of this node represents the firing strength of the rule.
Layer 3:
Every node in this layer is a circle node labelled N. This layer normalises the weight of each node relative to the sum of the other nodes' weights (the ratio of weights), which is then used to compute the contribution of each rule's output:

$$\bar{w}_i = \frac{w_i}{\sum_{j} w_j}, \quad i = 1, 2 \tag{18}$$
Layer 4:
Every node in this layer is illustrated with a square. Based on the Sugeno inference system, the output of a rule can be written in the following linear form:

$$O_i^4 = \bar{w}_i f_i = \bar{w}_i (p_i x + q_i y + r_i) \tag{19}$$

where $p_i$ and $q_i$ are the consequent parameters and $r_i$ is the bias.
Layer 5:
This layer, called the aggregation layer, computes the summation of the rule outputs and produces a single output (the centroid):

$$O^5 = \text{final output} = \sum_i \bar{w}_i f_i = \frac{\sum_i w_i f_i}{\sum_i w_i} \tag{20}$$
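To make the five layers concrete, the following Python sketch implements a single forward pass of the two-rule Sugeno ANFIS above, with arbitrary premise and consequent parameters (training by the hybrid learning rule of Section 4.2.2 is not shown):

import numpy as np

def gauss_mf(x, a, c):
    """Bell-shaped membership from Eq. (16): exp(-((x - c) / a)^2)."""
    return np.exp(-((x - c) / a) ** 2)

def anfis_forward(x, y, premises, consequents):
    """Forward pass of the two-rule Sugeno ANFIS of Section 4.2.1.
    premises[i] = ((a_Ai, c_Ai), (a_Bi, c_Bi)); consequents[i] = (p, q, r)."""
    # Layers 1-2: membership degrees and rule firing strengths w_i (Eq. 17).
    w = np.array([gauss_mf(x, *prem_a) * gauss_mf(y, *prem_b)
                  for prem_a, prem_b in premises])
    # Layer 3: normalised firing strengths (Eq. 18).
    w_bar = w / w.sum()
    # Layer 4: rule outputs f_i = p*x + q*y + r weighted by w_bar (Eq. 19).
    f = np.array([p * x + q * y + r for p, q, r in consequents])
    # Layer 5: aggregation into a single output (Eq. 20).
    return np.dot(w_bar, f)

premises = [((2.0, 1.0), (2.0, 1.0)),    # rule 1: (a, c) pairs for A1, B1
            ((2.0, 5.0), (2.0, 5.0))]    # rule 2: (a, c) pairs for A2, B2
consequents = [(0.1, 0.2, 0.0),          # rule 1: p1, q1, r1
               (0.4, 0.1, 1.0)]          # rule 2: p2, q2, r2
print(anfis_forward(3.0, 4.0, premises, consequents))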
4.2.2 ANFIS Learning
The method used to train ANFIS is the hybrid learning algorithm, which combines the gradient descent method and the Least Squares Estimate (LSE). Each cycle of hybrid learning consists of a forward pass and a backward pass. In the forward pass, the signal travels forward to Layer 4 and the consequent parameters are identified using the LSE method. In the backward pass, the errors are propagated backward and the premise parameters are updated by gradient descent. The process is repeated until the lowest error or a predefined threshold is reached [89].
Figure 18: (a) Fuzzy Reasoning (b) Equivalent ANFIS Structure [89].
4.3 Information Gain
The information gain method was proposed to approximate the quality of each attribute using entropy, by estimating the difference between the prior entropy and the post entropy [67]. It is one of the simplest attribute ranking methods and is often used in text categorisation. If $x$ is an attribute and $c$ is the class, the following equation gives the entropy of the class before observing the attribute:

$$H(c) = -\sum_{c} P(c)\log_2 P(c) \tag{21}$$

where $P(c)$ is the probability function of the variable $c$. The conditional entropy of $c$ given $x$ (the post entropy) is given by:

$$H(c|x) = -\sum_{x} P(x) \sum_{c} P(c|x)\log_2 P(c|x) \tag{22}$$

The information gain (the difference between the prior entropy and the post entropy) is given by the following equations:

$$H(c, x) = H(c) - H(c|x) \tag{23}$$

$$H(c, x) = -\sum_{c} P(c)\log_2 P(c) + \sum_{x} P(x)\sum_{c} P(c|x)\log_2 P(c|x) \tag{24}$$
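A minimal Python sketch of Equations (21)-(24) for a discrete attribute is given below; the tiny dataset is invented purely to illustrate the calculation:

import numpy as np
import pandas as pd

def entropy(series):
    """H = -sum(P * log2 P) over the values of a discrete variable."""
    p = series.value_counts(normalize=True)
    return -(p * np.log2(p)).sum()

def information_gain(df, attribute, target):
    """IG = H(c) - H(c|x), following Eqs. (21)-(24)."""
    prior = entropy(df[target])
    # H(c|x): entropy of the class within each attribute value, weighted
    # by the probability of that value.
    post = sum(len(g) / len(df) * entropy(g[target])
               for _, g in df.groupby(attribute))
    return prior - post

data = pd.DataFrame({"clump_thickness": [1, 1, 5, 5, 8, 8],
                     "class":           [2, 2, 2, 4, 4, 4]})
print(information_gain(data, "clump_thickness", "class"))   # about 0.667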
4.4 The Proposed IG-ANFIS Approach
The proposed approach combines the information gain method and the ANFIS method for diagnosing diseases (in this case, breast cancer). The information gain is used to assess the quality of the attributes: the output of applying the information gain method is a set of features with high ranking values, and this set of high-ranked features becomes the input to ANFIS, which is then trained and tested. The structure of the proposed approach is shown in Figure 19, where $X = \{x_1, x_2, \ldots, x_n\}$ are the original features in the dataset, $Y = \{y_1, y_2, \ldots, y_k\}$ are the features remaining after applying the information gain (feature selection), and $Z$ denotes the final output after applying $Y$ to ANFIS (the diagnosis).
Figure 19: The general structure for the proposed approach
4.5 The Experimental Results
The Wisconsin Breast Cancer (WBC) dataset was created by William Wolberg et al. [86] at the University of Wisconsin-Madison, USA. The attributes were collected from digitised fine needle aspirates (FNA) of breast masses. WBC contains 699 records; each record consists of 9 features plus the class attribute.
In our experiment, the database was divided into training and testing datasets: 341 records were used for training and 342 records for testing. The records which contain missing values (16 records) were ignored. The class attribute was normalised to [0 = Benign, 1 = Malignant]. The information gain method was used to assess the quality of the attributes. Table 10 shows the ranking of the attributes after applying the attribute evaluator InfoGainAttributeEval with the Ranker search method in WEKA on the WBC dataset.
Table 10: Information gain ranking using WEKA on WBC

Attribute Name                          Rank
Uniformity of Cell Size (UCSize)        0.636
Uniformity of Cell Shape (UCShape)      0.633
Normal Nucleoli (NN)                    0.555
Bare Nuclei (BN)                        0.538
Single Epithelial Cell Size (SECS)      0.421
Clump Thickness (CT)                    0.411
Marginal Adhesion (MA)                  0.394
Bland Chromatin (BC)                    0.316
Mitoses (MI)                            0.278
It is very important to determine the number of features used in the experiment. Therefore, the proposed approach selects a certain number of features based on the feature ranks and the point where the rank drops most significantly.
Figure 20 plots the values in Table 10. The most significant change in the graph (the slope point) indicates that the four top-ranking features located above the slope point should be used as inputs to ANFIS. The graph shows the biggest drop just after feature number 4 (BN). Therefore, the features Uniformity of Cell Size (UCSize), Uniformity of Cell Shape (UCShape), Normal Nucleoli (NN), and Bare Nuclei (BN) are selected to train and test the model.
At this stage, the attributes have been reduced and the recommended number of features has been set to 4.
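A small Python sketch of this "largest drop" selection rule is shown below, using the ranks from Table 10:

# Ranks from Table 10 (feature abbreviation -> information gain score).
ranks = {"UCSize": 0.636, "UCShape": 0.633, "NN": 0.555, "BN": 0.538,
         "SECS": 0.421, "CT": 0.411, "MA": 0.394, "BC": 0.316, "MI": 0.278}

ordered = sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)
scores = [s for _, s in ordered]

# Cut the ranking at the largest drop between consecutive scores.
drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
cut = drops.index(max(drops)) + 1
selected = [name for name, _ in ordered[:cut]]
print(selected)   # ['UCSize', 'UCShape', 'NN', 'BN'] -> the four features used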
Figure 20: Information Gain Ranking on WBC
The first stage was to select the most important features that may lead to more accurate results, as mentioned earlier. The second stage is to construct the fuzzy inference system (FIS). The best-known fuzzy inference systems are Mamdani-FIS and Sugeno-FIS. The Mamdani-FIS method is widely used to capture expert knowledge, as it allows users to describe their expertise in a way that resembles real life and nature; however, it is computationally expensive. The Sugeno-FIS method, on the other hand, is computationally efficient and works well with optimisation and adaptive techniques.
In our proposed approach, a Sugeno fuzzy inference system is used to map features to feature membership functions, feature membership functions to rules, rules to a set of outputs, outputs to output membership functions, and the output membership function to a single-valued output, as shown in Figure 21. The membership function maps each input to a membership value, as shown in Figure 22.
Figure 21: Sugeno Fuzzy Inference System with four features input and single output
In addition to the membership functions, the FIS contains the rules that add human reasoning capabilities to machine intelligence, which is usually based on Boolean logic. In our proposed approach, the rules have been defined from the real data. The rules express the weight of each feature by giving higher priority to features with higher ranks. The proposed approach contains 81 rules (the number of rules is $x^y$, where $x$ is the number of membership functions and $y$ is the number of features, i.e. $3^4 = 81$ rules). The following are two examples of rules used in the proposed approach:

IF AND(UniformityCellSize is poor, UniformityCellShape is avg, BareNuclei is poor, NormalNucleoli is poor) THEN (output is OK)

IF AND(UniformityCellSize is poor, UniformityCellShape is high, BareNuclei is poor, NormalNucleoli is avg) THEN (output is NOT_OK)
Figure 22 is a visual representation of the membership functions for the feature UniformityCellSize. It contains three membership functions: poor, average, and high.
Figure 22: Input Membership Function for the feature “Uniformity of Cell Size”
In the third and final stage, the constructed fuzzy inference system and the new feature set were loaded into ANFIS, which trains and tests the proposed approach as shown in Figure 23. The structure of ANFIS in MATLAB is shown in Figure 24.
Figure 23: The structure for the proposed approach (IG-ANFIS)
Figure 24: ANFIS Structure on MATLAB
Applying ANFIS to the features selected using the information gain on the WBC dataset gave 98.24% accuracy. The results of previous work (using the same dataset) are shown in Table 11 and Figure 25:
Table 11: Comparison of classification accuracy between IG-ANFIS and some previous work

The approach              Accuracy
AdaBoost                  57.60%
ANFIS                     59.90%
SANFIS                    96.07%
FUZZY                     96.71%
FUZZY-GENETIC             97.07%
ILFN                      97.23%
NNs                       97.95%
ILFN&FUZZY                98.13%
IG-ANFIS (our method)     98.24%
SIANN                     100.00%
Figure 25: Comparison of classification accuracy between IG-ANFIS and some previous work
4.6 Summary and Discussion
This chapter proposed a new approach for diagnosing breast cancer by reducing the number of features to an optimal number using the information gain and then applying the reduced dataset to the Adaptive Neuro-Fuzzy Inference System (ANFIS). The study found that the accuracy of the proposed approach is 98.24%, which compares favourably with other methods. The proposed approach showed very promising results, which may lead to further attempts to utilise information technology for diagnosing patients.
The next chapter will present a new approach for constructing missing feature values based on k-NN and the distance between cases; examples of the distance functions used are the Euclidean and Minkowski distance functions.
CHAPTER FIVE
Iterative Weighted k-NN for Constructing Missing
Feature Values in Wisconsin Breast Cancer
Dataset
5.1 Overview
This chapter presents a new approach for constructing missing feature values based on iterative nearest neighbours and distance metrics. The proposed approach employs a weighted k-nearest neighbours algorithm. The main idea is to propagate the classification accuracy to a certain threshold, set by the researchers and users. The proposed method showed a slight improvement of 0.005 in classification accuracy on the constructed dataset (the new dataset with no missing values) compared with the original dataset, which contains some missing feature values. The approach was also evaluated for k=1 to k=5; the maximum classification accuracy was 0.9698 when k=1.
5.2 Missing Feature Values
Data mining and knowledge discovery tools have become one of the foremost research areas in the field of medical diagnosis. The aim is to classify large datasets into patterns that can be used to extract useful knowledge; for example, data mining techniques can utilise patient databases for automated medical diagnosis. The purpose is to achieve more accurate findings, speed up diagnosis, and reduce the errors and mistakes made by humans [91]. However, incomplete datasets with missing feature values may affect data mining findings.
The problem of missing feature values is common in many applications, particularly in medical databases, for many reasons: some feature values are not specified because they are not available at the time of data collection, and attribute values might be forgotten, mistakenly erased, or not filled in during data entry. In some cases, interviewees decline to disclose private information such as income or age [78]. In data mining and machine learning applications, feature values that matter but are missing create a challenge for researchers. Therefore, treating unknown attribute values with the most appropriate values is a common concern in data mining and knowledge discovery. The process of constructing missing values is a vital step in most supervised and unsupervised data mining studies because it may affect the quality of learning and the performance of classification algorithms.
The literature shows a variety of methods for treating missing attribute values. These methods may be divided into sequential and parallel methods. In sequential methods, missing attribute values are replaced by known values, and the knowledge is then acquired from a dataset with all attribute values known. Examples of sequential methods are deleting the records (cases) that contain missing values, substituting missing feature values with the most common value of an attribute, assigning all possible attribute values to missing feature values, and replacing missing feature values with the mean of the feature values [78].
In parallel methods, knowledge is acquired directly from the original datasets. An example of a parallel method is rule induction, in which a rule learning algorithm learns directly from the original dataset to find rules for how to treat or construct missing feature values where they exist [78].
The following briefly reviews attempts to treat missing feature values using the above two types of methods. White [92] proposed the simplest method of dealing with missing values: simply ignoring the cases which contain unknown attribute values. Kononenko et al. [93] proposed a method to infer missing feature values from other attributes; they used the class label to determine the missing attribute values, assigning the most probable attribute value $a_i$ to the missing attribute $a_j$ that best satisfies a class C. Another method, proposed by Quinlan [94], was to employ a decision tree to estimate the missing values. The approach takes a subset $T'$ from the training set $T$ in which the values corresponding to the missing attribute are known. In the subset cases, the missing attribute becomes the class label and vice versa. Using $T'$, a decision tree can be built to determine the value of the missing attribute, which has (temporarily) been converted into the class label. This method uses the class label to determine the missing attribute values and utilises all the information in the case (dataset instance); however, it is only valid when there is a single missing attribute value. Quinlan also proposed another method for handling missing attribute values: considering the missing value "unknown" as an actual value of the attribute. However, this solution is not valid in all cases, because the value "unknown" may represent many meanings, such as a value too large or too small to be recorded, or a value not recorded by mistake; hence, this method may introduce uncertainty.
Meng and Schenker [95] showed that likelihood techniques can be used to deal with missing data. However, likelihood methods require specific programs which may not be easily available. An alternative to likelihood techniques is imputing the missing data. Multiple imputation is a method of generating multiple simulated values for each incomplete datum, and then iteratively analysing the datasets with each simulated value substituted. The intention of this method is to generate estimates that better reflect the true variability and uncertainty in data that contain missing values [96]. There are different ways to perform multiple imputation, but most approaches assume the data to be missing at random (MAR). Missing at random is a circumstance in which missing values are randomly distributed within one or more subsamples rather than across the whole dataset; for example, values may be missing more often among malignant cases than benign ones, but at random within each class [97].
Santhakumaran [98] successfully used an ANN to treat missing feature values in WBC. The author used the back-propagation algorithm to train the network and compared four missing-value replacement methods (successive iteration, mean, median, and mode); among these four methods, the median method produced the most promising results.
5.3 The Proposed Method
The proposed method integrates the weighted k-nearest neighbours algorithm with propagation of the classification accuracy to a certain threshold. The k-NN is used to find the closest neighbours $(n_1, \ldots, n_k)$ for a certain instance $x_i$ that contains missing feature values, using the Euclidean, Manhattan, Minkowski, and other distance functions; this work finally focused on two distance functions, Euclidean and Minkowski. The proposed approach then finds the most similar instance to $x_i$ from $(n_1, \ldots, n_k)$ using the following formula, by computing the distance-weighted values $cn_i$:

$$cn_i = \frac{\sum_{j=1}^{k} \dfrac{n_{ij}}{d(x_j, n_j)}}{\sum_{j=1}^{k} \dfrac{1}{d(x_j, n_j)}} \tag{25}$$

where $cn_i$ denotes the closest-neighbour estimate for the instance $x_i$, $d(x_j, n_j)$ is the distance between the instance $x_j$ and the neighbour $n_j$, and $n_{ij}$ is feature $i$ of the neighbour $n_j$. After finding the closest neighbour (the smallest value of $cn_i$), call it $cn'$, the missing feature values in $x_i$ are filled by the equivalent feature values in the neighbour $n_i$ that lies at distance $cn'$ from $x_i$.
The process of filling missing feature values produces a new training dataset (NT) that contains no missing feature values. To verify the accuracy of the constructed missing feature values, the new training dataset is applied to k-NN and the accuracy is recorded. If the classification accuracy is less than a threshold, the algorithm steps back and refills the missing feature values until the desired classification accuracy is reached. Figure 26 shows the flowchart for the proposed method.
Figure 26: The Flowchart for the proposed method (Constructing Missing Features Values)
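A minimal Python sketch of the distance-weighted estimate of Equation (25) is given below; a small constant guards against division by zero when a neighbour coincides with the instance, and the toy values are invented for illustration:

def minkowski(u, v, r=2.0, skip=()):
    """Minkowski distance over the features not listed in `skip`
    (r=2 gives Euclidean, r=1 Manhattan)."""
    idx = [i for i in range(len(u)) if i not in skip]
    return sum(abs(u[i] - v[i]) ** r for i in idx) ** (1.0 / r)

def impute_feature(x, neighbours, feat, r=2.0):
    """Distance-weighted estimate of feature `feat` of instance x from its
    k nearest complete neighbours, following Eq. (25)."""
    eps = 1e-9   # avoid division by zero for identical neighbours
    dists = [minkowski(x, n, r, skip=(feat,)) + eps for n in neighbours]
    num = sum(n[feat] / d for n, d in zip(neighbours, dists))
    den = sum(1.0 / d for d in dists)
    return num / den

x = [5.0, None, 3.0]                                  # feature 1 is missing
neighbours = [[5.0, 2.0, 3.0], [4.0, 3.0, 2.0], [6.0, 1.0, 4.0]]
x[1] = impute_feature(x, neighbours, feat=1)
print(x)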
5.4 The Experimental Results
In data mining and statistical research, the split-sample approach is a commonly used design in studies with large datasets. This design divides the dataset into a training set and a testing set to estimate classification accuracy. The classifier is designed and developed based on the theory and then trained using the training dataset; after training, it is applied to each case in the testing sample. In practice, dividing the data is important to avoid large bias in estimating the classifier accuracy [99]. Therefore, the WBC dataset was divided into two parts, a training dataset and a testing dataset. The separation was random to avoid unfairness. The training dataset contains 500 cases, 16 of which contain missing feature values; the rest of the original dataset formed the testing set (199 cases). After preparing the datasets, a classifier was developed using the proposed method. The development tool was Microsoft Visual Studio 2010 and the selected language was C#.
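The split design described above can be sketched as follows (shown in Python with scikit-learn for brevity, although the thesis implementation itself was written in C#; the local file name is an assumption):

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the raw WBC file ("?" marks missing values), then split it at random
# into 500 training cases and the remaining 199 testing cases.
wbc = pd.read_csv("breast-cancer-wisconsin.data", header=None, na_values="?")
train, test = train_test_split(wbc, train_size=500, shuffle=True,
                               random_state=42)
print(len(train), len(test))   # 500 199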
In the implementation, this work used several metrics to compute the distance, including the Euclidean and Minkowski functions, as a component of the iterative k-NN classifier. Constructing the missing feature values using the proposed method through the iterative k-NN classifier with the Euclidean distance function showed a classification accuracy enhancement of 0.005 when k=3 from the first iteration, and a maximum classification accuracy of 0.9648. Figure 27 compares the classification accuracy when the missing feature values were not treated and when they were treated; it also shows how the classification accuracy depends on the number of neighbours (k) in k-NN. The experiment shows that more neighbours reduce the classification accuracy; the reason may be noise introduced by the additional neighbours.
Figure 27: A comparison of classification accuracy for the proposed method through Euclidean/k-NN
The experiment of constructing the missing feature values using the proposed method through the k-NN classifier with the Minkowski distance function showed a classification accuracy enhancement of 0.005 when k=3 and r=1.5 from the first iteration, and a maximum classification accuracy of 0.9698. Figure 28 compares the classification accuracy when the missing feature values were not treated and when they were treated.
The experiment also showed that the Manhattan, Chebychev, and Canberra distance metrics are not suitable for constructing the missing attribute values in this experiment, because the classification accuracy after treating the missing values remained lower than the classification accuracy on the original dataset.
Figure 28: A comparison of classification accuracy for the proposed method through Minkowski/k-NN
5.5 Summary and Discussion
This chapter proposed a new approach for constructing missing feature values based on iterative k-nearest neighbours and distance functions. The approach iterates until it finds the most suitable feature values, i.e. those that satisfy the target classification accuracy. The proposed approach showed an improvement of 0.005 in classification accuracy on the constructed dataset over the original dataset with both the Euclidean and Minkowski distance functions, while the Manhattan, Chebychev, and Canberra distance metrics produced lower classification accuracy on the new dataset than on the original dataset. This work also noted that classification accuracy depends greatly on the number of neighbours (k): the experiment showed that fewer neighbours may lead to higher accuracy, the reason being, in my opinion, the amount of noise produced by conflicting neighbours. Finally, the maximum classification accuracy, 0.9698, was obtained at k=1.
The next chapter will describe a new approach for diagnosing breast cancer and present a comparison study between some well-known feature selection techniques.
CHAPTER SIX
Diagnosing Breast Cancer Based on Feature
Selection and Naïve Bayes
6.1 Overview
Feature selection techniques have become an obvious need for researchers in computer science and many other fields of science. Whether the target research is in medicine, agriculture, business, or industry, the need to analyse large amounts of data is present. In addition, finding the feature selection technique that best suits a certain learning algorithm can benefit both the research and the researchers. Therefore, a new method has been proposed for diagnosing breast cancer based on a combination of learning algorithms and feature selection techniques. The idea is to obtain a hybrid approach that combines the best-performing learning algorithms with the best-performing feature selection techniques. The experimental results show that combining the correlation-based feature selection method with the Naïve Bayes learning algorithm can produce promising results. However, no single feature selection method best satisfies all datasets and learning algorithms; machine learning researchers should therefore understand the nature of their datasets and the characteristics of their learning algorithms in order to obtain better outcomes. Overall, consistency-based subset evaluation performed better than information gain, symmetrical uncertainty, Relief, correlation-based feature selection, and principal components analysis.
6.2 Feature Selection Techniques
The advancement of information technology, the growing number of social networking websites, electronic health information systems, and other factors have flooded the internet with data, and the amount of data posted daily is increasing. At the same time, not all data are important or even needed. Therefore, data mining researchers have increasingly used the terms feature selection or data selection. Feature selection, or attribute subset selection, is the process of identifying and utilising the most relevant attributes and removing as many redundant and irrelevant attributes as possible [60, 61]. Feature selection mechanisms do not alter the original representation of the data in any way; they simply select an optimal, useful subset. Recently, the motivation for applying feature selection techniques in machine learning has shifted from a theoretical exercise to a standard step in model building. Many attribute selection methods treat the task as a search problem, where each state in the search space represents a distinct subset of the possible attributes [100]. Since the space is exponential in the number of attributes, producing a very large number of possible subsets, a heuristic search procedure is required for all datasets. The search procedure is combined with an attribute utility estimator in order to evaluate the relative merit of alternative subsets of attributes [60]. This large number of possible subsets and the computational cost involved necessitate benchmarking feature selection methods to find those that produce the best possible subsets, in terms of both more accurate results and low computational overhead.
Feature selection techniques can perform better if the researcher chooses the right learning algorithm. Therefore, a new approach is proposed which combines a promising feature selection technique with a well-known learning algorithm. In the current work, we have focused on a publicly available disease dataset (breast cancer) to evaluate the proposed approach. Every year, millions of women around the world suffer from breast cancer, making it the second most common non-skin cancer after lung cancer and the fifth most common cause of cancer death in the world [7]. Thyroid disorders in women are much more common than thyroid problems in men and may lead to thyroid cancer [101]. Hepatitis can be caused by chemicals, drugs, drinking too much alcohol, or by different kinds of viruses, and may lead to liver problems [102]. This chapter begins with a brief review of related work, then a description of the benchmark feature selection methods, a description of the methodology, and the results obtained. Finally, a brief discussion and future work are presented.
6.3 Feature Selection Techniques Used in this Chapter
The literature shows many methods for selecting subsets of features. In this chapter, I will concentrate on Correlation-based Feature Selection (CFS), Information Gain (IG), Relief (R), Principal Components Analysis (PCA), Consistency-based Subset Evaluation (CSE), and Symmetrical Uncertainty (SU). These feature subset methods were described in Chapter 2. CFS aims to find subsets containing features that are highly correlated with the class and uncorrelated with each other [103]. IG is one of the simplest attribute ranking methods; it ranks the quality of an attribute according to the difference between the prior and post entropy [104]. The objective of Relief is to measure the quality of attributes according to how well their values distinguish instances of different classes [70]. PCA is probably the oldest feature selection method; its aim is to reduce the dimensionality of a dataset containing a large number of correlated features while keeping the uncorrelated components present in the dataset [71]. CSE tries to obtain a set of attributes that divide the original dataset into subsets containing a strong single-class majority [60], while SU is a modified information gain method that compensates for the bias of information gain [69].
6.4 The Experiment Methodology
Different sets of experiments were performed to evaluate the benchmark attribute selection methods on a well-known, publicly available dataset from the UCI machine learning repository: the Wisconsin Breast Cancer dataset (WBC) [25].
To obtain as fair a judgement as possible between the feature selection methods, this work considered three machine learning algorithms from three categories of learning methods. The first algorithm is k-nearest neighbours (k-NN), from the lazy learning category. k-NN is an instance-based classifier in which the class of a test instance is based on the classes of the training instances most similar to it; distance functions, such as the Euclidean and Manhattan functions, are commonly used to measure the similarity between instances [18].
The second algorithm is the Naïve Bayes classifier (NB), from the Bayes category. NB is a simple probabilistic classifier based on applying Bayes' theorem. NB is one of the most efficient and effective learning algorithms for machine learning and data mining because of its conditional independence assumption (no attribute depends on another) [105].
The last machine learning algorithm is the Random Tree (RT), or classification tree. RT is used to classify an instance into a predefined set of classes based on its attribute values, and is frequently used in many fields such as engineering, marketing, and medicine [37].
After applying the feature selection techniques and the learning algorithms to the dataset and obtaining classification accuracy results, a hybrid method is constructed that combines the advantages of the best-performing feature selection technique and the best-performing learning algorithm, as shown in Figure 29.
Figure 29: Hybrid method of feature selection technique and a learning algorithm
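A minimal sketch of such a hybrid pipeline is shown below in Python with scikit-learn, where the mutual-information scorer stands in for the WEKA attribute evaluators used in the thesis and the local file name is an assumption:

import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Load WBC, drop the 16 incomplete records, and separate the 9 features
# (columns 1-9, skipping the id) from the class label (column 10).
df = pd.read_csv("breast-cancer-wisconsin.data", header=None,
                 na_values="?").dropna()
X, y = df.iloc[:, 1:10], df.iloc[:, 10]

# Hybrid method: a feature selector feeding a Naive Bayes classifier.
hybrid = make_pipeline(SelectKBest(mutual_info_classif, k=4), GaussianNB())
scores = cross_val_score(hybrid, X, y, cv=10)
print(scores.mean())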
The software package used in the present chapter is the Waikato Environment for Knowledge Analysis (WEKA). WEKA provides the environment to perform many machine learning algorithms and feature selection methods. It is open-source machine learning software written in the Java language, and contains data mining and machine learning methods for data pre-processing, classification, regression, clustering, association rules, and visualisation [106].
6.5 The Experimental Results
The notations "+", "-", and "=" are used to show the classification performance of the feature selection methods compared with the original dataset (before performing feature selection), where "+" denotes improvement, "-" denotes degradation, and "=" denotes no change. The experimental results of using Naïve Bayes (NB) as the machine learning algorithm on the WBC dataset are shown in Table 12.
Table 12: Results for attribute selection methods with Naïve Bayes

Method                                       WBC
Original Dataset                             95.99%
Correlation-based Feature Selection (CFS)    95.99% =
Information Gain (IG)                        95.99% =
Relief (R)                                   95.99% =
Principal Components Analysis (PCA)          96.14% +
Consistency-based Subset Evaluation (CSE)    96.28% +
Symmetrical Uncertainty (SU)                 95.99% =
Table 12 shows the results of applying the WBC dataset to the Naïve Bayes learning method with the various feature selection techniques. The classification accuracy of Naïve Bayes on the original WBC dataset is 95.99%, with improvements observed after applying Principal Components Analysis and Consistency-based Subset Evaluation. The best result, about 96.28% classification accuracy, was achieved by the Consistency-based Subset Evaluation technique, while the classification accuracy stayed the same using Correlation-based Feature Selection, Information Gain, Relief, and Symmetrical Uncertainty. Figure 30 illustrates the results in Table 12.
Figure 30: Attributes selection methods with Naïve Bayes
The second machine learning classifier used for testing the feature selection methods is k-NN. The experimental results of using k-NN as the machine learning algorithm on WBC are shown in Table 13.
Table 13: Results for attribute selection methods with k-NN

Method                                       WBC
Original Dataset                             95.42%
Correlation-based Feature Selection (CFS)    95.42% =
Information Gain (IG)                        95.42% =
Relief (R)                                   95.42% =
Principal Components Analysis (PCA)          95.42% =
Consistency-based Subset Evaluation (CSE)    95.85% +
Symmetrical Uncertainty (SU)                 95.42% =
Table 13 shows that the classification accuracy of k-NN on the original WBC dataset is 95.42%, with an improvement observed after applying the Consistency-based Subset Evaluation (CSE) method. The other feature selection methods produced the same classification accuracy as the original dataset. Figure 31 illustrates the results in Table 13.
Figure 31: Results for attribute selection methods with k-NN
The last machine learning classifier in our experiment is the Decision Tree (DT). The experimental results of using DT as the machine learning algorithm on WBC are shown in Table 14.
Table 14: Results for Attribute Selection Methods with Decision Tree

Method                                        WBC
Original Dataset                              94.56%
Correlation-based Feature Selection (CFS)     94.56% =
Information Gain (IG)                         94.56% =
Relief (R)                                    94.56% =
Principal Components Analysis (PCA)           94.85% +
Consistency-based Subset Evaluation (CSE)     93.56% -
Symmetrical Uncertainty (SU)                  94.56% =
Table 14 shows an improvement in classification accuracy when the PCA feature selection method is applied. There is a decline in classification accuracy with CSE, while the classification accuracy is unchanged with CFS, IG, R, and SU. Figure 32 illustrates the results in Table 14.
Figure 32: Results for attribute selection methods with Decision Tree
6.6 Summary and Discussion
Figure 33: Hybrid method of a feature selection technique and a learning algorithm.
According to the results obtained by the current work on WBC, Naïve Bayes performed best with regard to classification accuracy, while k-NN and DT performed slightly better after applying the feature selection methods than on the original dataset. In general, feature selection methods can improve the performance of learning algorithms. However, no single feature selection method best satisfies all datasets and learning algorithms. Therefore, machine learning researchers should understand the nature of their datasets and the characteristics of their learning algorithms in order to obtain the best possible outcomes. Overall, the CSE feature selection method performed better than IG, SU, R, CFS, and PC. The study also found that IG and SU performed identically, which is expected because SU is a modified version of IG.
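One concrete way to realise the hybrid shown in Figure 33 within WEKA is the FilteredClassifier meta-learner, which applies an attribute selection filter inside each training fold so that test data never influence the chosen subset. The sketch below is illustrative only: CFS with best-first search stands in for the subset evaluator, since the consistency-based evaluator used in this chapter has a version-dependent class location in WEKA.

import java.util.Random;

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.supervised.attribute.AttributeSelection;

public class HybridNaiveBayes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Attribute selection filter: CFS evaluator with best-first search.
        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new CfsSubsetEval());
        filter.setSearch(new BestFirst());

        // The meta-classifier re-selects features inside every training fold,
        // so the held-out fold never leaks into the selected subset.
        FilteredClassifier hybrid = new FilteredClassifier();
        hybrid.setFilter(filter);
        hybrid.setClassifier(new NaiveBayes());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(hybrid, data, 10, new Random(1));
        System.out.printf("Hybrid accuracy: %.2f%%%n", eval.pctCorrect());
    }
}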
CHAPTER SEVEN
Fusion of Heterogeneous Classifiers for Breast
Cancer Diagnosis
7.1 Overview
In the twenty-first century, we are “flooded in data, but starved of information” [107]. New technologies have produced a huge variety of data that is beyond human capability to analyse, and intelligent systems have therefore been developed to help humans organise and utilise these data effectively [107]. To obtain the full benefit of intelligent systems, it has become common in machine learning to identify the advantages of individual systems and to mix and match them into a new approach that maximises those advantages while minimising the overhead. Fusion (or hybrid) intelligent systems, in which two or more machine learning algorithms are combined into one new approach, are often effective and can overcome the limitations of the individual approaches [108]. Classification is one branch of machine learning, and the process of integrating two or more classifiers is usually referred to as multi-classification or classification fusion. There are two main paradigms for combining different classification algorithms: classifier selection and classifier fusion [109]. Classifier selection uses a single model to predict each new case, whereas classifier fusion merges the outputs of all models to produce a single output. The present chapter introduces different types of classification fusion applied to three well-known classifiers on breast cancer datasets. Integrating two or more classifiers enhanced the classification accuracy in some cases; however, there is no single combination that suits all datasets.
7.2 Multi-Classification Approach
The process of combining two or more classifiers is called the multi-classification approach. Multi-classification is motivated by the argument that no single classifier suits all learning problems [109]. It can be divided into two types: classifier selection and classifier fusion.
7.2.1 Classifier Selection
Classifier selection is one of the simplest methods for combining learning algorithms or classifiers. The idea is to evaluate two or more classifiers on the training dataset and then use the best-performing classifier on the testing dataset. This method is simple and straightforward, requires no combination of outputs, and performs well compared to more complex schemes [110].
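A minimal sketch of this selection step with the WEKA API follows; the candidate pool shown (with IBk's default k) is an assumption based on the classifiers used later in this chapter, and cross-validation on the training data is taken as the selection criterion.

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.RandomTree;
import weka.core.Instances;

public class ClassifierSelection {
    // Return the classifier with the highest cross-validated accuracy
    // on the training data; only that one is applied to the test data.
    public static Classifier selectBest(Instances train) throws Exception {
        Classifier[] candidates = { new NaiveBayes(), new IBk(), new RandomTree() };
        Classifier best = null;
        double bestAccuracy = -1.0;
        for (Classifier c : candidates) {
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(c, train, 10, new Random(1));
            if (eval.pctCorrect() > bestAccuracy) {
                bestAccuracy = eval.pctCorrect();
                best = c;
            }
        }
        best.buildClassifier(train); // train the winner on all training data
        return best;
    }
}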
7.2.2 Fusion Classifier
A fusion classifier is a group of classifiers whose individual predictions are combined in some way (highest average ranking, average probability, or voting) to classify new cases. It has become one of the active areas of research in supervised learning, studying new ways of constructing classifiers for more accurate outcomes. Voting is the simplest method of multi-classification for both heterogeneous and homogeneous models. The voting method is divided into two types: weighted voting and un-weighted voting. In un-weighted voting, all classifiers are treated equally, with no classifier given priority over the others; each classifier outputs a class value, and the class with the most votes is the final outcome of the multi-classifier. Note that this type of voting is in fact called plurality voting, in contrast to the frequently used term majority voting, as majority voting implies that at least 50% + 1 of the votes (the majority) should belong to the winning class. In weighted voting, classifiers may be given different weights according to the user's confidence in each classifier's performance and classification accuracy. For example, the user can put more weight on k-NN (60%) than on Naïve Bayes (40%) in a certain multi-classifier problem. Weighted voting thus discriminates between classifiers based on their reputation and ranking relative to the other classifiers [110].
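WEKA implements this un-weighted scheme in its Vote meta-classifier. The sketch below combines the three classifiers used in this chapter under plurality ("majority") voting; the dataset file name and IBk's default k are assumptions, and Vote also offers probability-averaging and other combination rules.

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.RandomTree;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class FusionByVoting {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Un-weighted plurality voting over three heterogeneous classifiers.
        Vote fusion = new Vote();
        fusion.setClassifiers(new Classifier[] {
            new NaiveBayes(), new IBk(), new RandomTree()
        });
        fusion.setCombinationRule(
            new SelectedTag(Vote.MAJORITY_VOTING_RULE, Vote.TAGS_RULES));

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fusion, data, 10, new Random(1));
        System.out.printf("Fusion accuracy: %.2f%%%n", eval.pctCorrect());
    }
}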
7.3 Classifier Combination Strategies
Machine learning and data mining are rich in classification tools and algorithms. In the context of combining classifiers, it is uncertain which classifier works best with which others. Kuncheva and Whitaker [111] stated that the best combination of a set of classifiers depends on the application and on the characteristics of the classifiers; there is no universally best combination. One approach is to generate a large number of classifiers and then select the best combinations based on classification accuracy or other criteria set by the researcher. However, this approach is costly and time-consuming, since n classifiers may be combined in 2^n ways, and running this number of experiments is feasible only in simple or restricted circumstances. To address this, several search techniques have been used to find the best combination of classifiers, including forward and backward search, Tabu search, and genetic algorithms [107]. In general, a more powerful technique for finding the best possible combination of classifiers is needed.
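To make the 2^n figure concrete, the exhaustive search can be expressed as a bitmask enumeration over the pool of base classifiers, as in the following sketch (class and method names are illustrative); with n = 20 base classifiers this already yields more than a million candidate ensembles, each of which would then need its own cross-validation run.

import java.util.ArrayList;
import java.util.List;

import weka.classifiers.Classifier;

public class ExhaustiveCombinationSearch {
    // Enumerate every non-empty subset of the classifier pool: 2^n - 1 subsets.
    public static List<Classifier[]> allCombinations(Classifier[] pool) {
        List<Classifier[]> combinations = new ArrayList<>();
        int n = pool.length;
        for (int mask = 1; mask < (1 << n); mask++) {
            List<Classifier> subset = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                // Bit i of the mask decides whether classifier i is included.
                if ((mask & (1 << i)) != 0) {
                    subset.add(pool[i]);
                }
            }
            combinations.add(subset.toArray(new Classifier[0]));
        }
        return combinations;
    }
}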
7.4 Experimental Methodology
Different sets of experiments were performed to evaluate the multi-classification approach on well-known, publicly available breast cancer datasets from the UCI machine learning repository [25]. Three versions of the breast cancer diagnosis data are used and introduced in this chapter: Wisconsin Breast Cancer (Original), Wisconsin Diagnosis Breast Cancer (WDBC), and Wisconsin Prognosis Breast Cancer (WPBC). Table 15 shows the statistical details of the datasets. The three datasets differ in size: the smallest contains 11 attributes and the largest contains 34 attributes, the number of instances ranges from 198 to 699, and all datasets contain 2 classes. The study considered heterogeneous classifiers from three machine learning categories. The first algorithm is k-nearest neighbours (k-NN) from the lazy learning category. k-NN is an instance-based classifier in which the class of a test instance is based upon the classes of the training instances most similar to it; distance functions, such as the Euclidean and Manhattan distances, are commonly used to measure the similarity between instances [18]. The second algorithm is the Naïve Bayes classifier (NB) from the Bayes category. NB is a simple probabilistic classifier based on applying Bayes' theorem, and is one of the most efficient and effective learning algorithms for machine learning and data mining because of its conditional independence assumption (no attribute depends on any other) [105]. The last algorithm is the Random Tree (RT) from the tree classification category. RT assigns an instance to one of a predefined set of classes based on its attribute values, and is frequently used in many fields such as engineering, marketing, and medicine [37]. The study used k-fold cross-validation with k = 10 to separate the training set from the test set. The experimental environment is the well-known machine learning software WEKA.
Table 15: Statistics of Breast Cancer Datasets

Dataset                                     Number of Attributes   Number of Instances   Number of Classes
Wisconsin Breast Cancer (Original)          11                     699                   2
Wisconsin Diagnosis Breast Cancer (WDBC)    32                     569                   2
Wisconsin Prognosis Breast Cancer (WPBC)    34                     198                   2
In this experiment, the confusion matrix was used to measure classifier performance. Classification accuracy is the main criterion for estimating the effectiveness of a classification model, based on the numbers of correctly and incorrectly classified cases.
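For a two-class problem, with TP, TN, FP, and FN denoting the true positives, true negatives, false positives, and false negatives recorded in the confusion matrix, this criterion is the standard ratio

Accuracy = (TP + TN) / (TP + TN + FP + FN)

so that, for example, 670 correctly classified cases out of the 699 WBC instances correspond to an accuracy of approximately 0.9585.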
7.5 Experimental Results
Three experiments were performed on the three breast cancer datasets. The first experiment used single-classifier models to set a baseline of classification accuracy against which enhancements can be measured. The second experiment used a combination of two classifiers, while the last experiment used a fusion of three classifiers.
Figure 34: Single Classifier on three datasets WBC, WDBC, and WPBC.
Figure 34 shows the results of running the three single classifiers on the three datasets. Naïve Bayes performed best in terms of classification accuracy on WBC (0.9599), while k-NN and Random Tree performed best on WDBC and WPBC, respectively.
Figure 35 shows the results of combining two classifiers (Naïve Bayes and k-NN, Naïve Bayes and Random Tree, and k-NN and Random Tree). The results indicate that the fusion of Naïve Bayes and k-NN produced the best classification accuracy (0.9642 on WBC, 0.9508 on WDBC, and 0.6869 on WPBC). This suggests that Naïve Bayes and k-NN may produce better results when combined.
Figure 35: Two Classifiers on three datasets WBC, WDBC, and WPBC.
Figure 36 shows the results of fusing the three classifiers (Naïve Bayes, k-NN, and Random Tree). Combining the three classifiers maintained satisfactory classification accuracy on WBC (0.9585) and WDBC (0.9473), while producing a significant improvement in classification accuracy on the WPBC dataset (0.7323).
Figure 36: The Fusion of three classifiers on three datasets WBC, WDBC, and WPBC.
7.6 Summary and Discussion
This chapter introduced classification fusion using three well-known machine learning classifiers on the breast cancer datasets. This work confirms the argument that the best combination of a set of classifiers depends on the application and on the characteristics of the classifiers. In addition, there is no best combination of classifiers that suits all datasets. In the current experiments, however, Naïve Bayes and k-NN produced better results when combined as one classifier, with the maximum classification accuracy obtained on the WBC dataset (0.9642).
CHAPTER EIGHT
Discussion and Future Work
The main purpose of the current research is to contribute to efforts to enhance the quality of healthcare services by proposing technology as one solution to the problem of medical shortages, both in staff and in supportive technology. This thesis presented the challenges that face many countries in the field of healthcare services: population growth, cultural change, climate change, and other factors have driven greater demand for healthcare services. Australia is one of the countries that has recently started to utilise technology to help meet the large demand for health services. Accordingly, the state, territory, and federal governments in Australia established NEHTA to drive the national interest in eHealth and technology-supported healthcare, and to help provide decent healthcare services. However, utilising technology in healthcare services is a comprehensive process involving many stages and steps, and it is very important to discuss all related issues in order to arrive at a new system that delivers the expected services.
The eHealth project in Australia will deliver a huge repository of patient information. This will create new challenges that need further investigation in order to achieve the desired goals. In addition, the data and information are valuable and may be used to discover new trends, learn methods for treating similar cases, and find useful patterns. The current research focused on common issues related to huge databases: missing feature values and feature selection methods, and how they can be used to diagnose and predict the outcome of examinations for new cases.
Chapter 4 showed how the information gain method, a feature selection technique, can be used in combination with adaptive neuro-fuzzy inference systems in diagnosing new patient cases. The combination created a new approach for diagnosing breast cancer by reducing the number of features to the optimal number using information gain and then applying the reduced dataset to the adaptive neuro-fuzzy inference system (ANFIS). The study found that the accuracy of the proposed approach is 98.24%, which compares favourably with other methods. The proposed approach showed very promising results, which may lead to further attempts to utilise information technology for diagnosing patients for breast cancer.
Missing feature values are a concern when dealing with databases, especially large ones. Therefore, an approach for constructing missing feature values based on iterative k-nearest neighbours and distance functions was proposed. The approach iterates until it finds the most suitable feature values, as judged by classification accuracy. The proposed approach improved classification accuracy on the constructed dataset by 0.005 over the original dataset with both the Euclidean and Minkowski distance functions. The study found that the Manhattan, Chebychev, and Canberra distance metrics produced lower classification accuracy on the new dataset than on the original dataset. The study also noticed that classification accuracy depends greatly on the number of neighbours (k); the experiments showed that fewer neighbours may lead to higher accuracy. The reason for this, in my opinion, is the amount of noise produced by conflicting neighbours. Finally, the maximum classification accuracy, 0.9698, was obtained at k = 1.
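As a rough illustration of the core of such an imputation step (not the exact algorithm proposed earlier in this thesis, which also iterates and validates candidate values against classification accuracy), the following sketch fills a single missing numeric value with the mean of that attribute over the k nearest complete instances under Euclidean distance; the class and method names are illustrative only.

import java.util.Arrays;

public class KnnImputation {
    // Fill the missing value at column `col` of row `target` with the mean
    // of that column over the k nearest complete rows in `data`, where
    // nearness is Euclidean distance over the remaining columns.
    public static double impute(double[][] data, double[] target, int col, int k) {
        double[] distances = new double[data.length];
        Integer[] order = new Integer[data.length];
        for (int r = 0; r < data.length; r++) {
            double sum = 0.0;
            for (int c = 0; c < target.length; c++) {
                if (c == col) continue; // skip the missing column
                double d = data[r][c] - target[c];
                sum += d * d;
            }
            distances[r] = Math.sqrt(sum);
            order[r] = r;
        }
        // Sort row indices by distance and average the k closest values.
        Arrays.sort(order, (a, b) -> Double.compare(distances[a], distances[b]));
        double mean = 0.0;
        for (int i = 0; i < k; i++) {
            mean += data[order[i]][col];
        }
        return mean / k;
    }
}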
Another important issue when dealing with databases, and health databases in particular, is feature selection: how to determine the most important features that lead to a more accurate diagnosis. The investigation showed that no single feature selection method best satisfies all datasets and learning algorithms. Therefore, machine learning researchers should understand the nature of their datasets and the characteristics of their learning algorithms in order to obtain the best possible outcomes. Overall, the CSE feature selection method performed better than IG, SU, R, CFS, and PC. This work also found that IG and SU performed identically, which may be due to the fact that SU is a modified version of IG. According to the results obtained by the current work on WBC, Naïve Bayes performed best with regard to classification accuracy, while k-NN and DT performed slightly better on the dataset after applying feature selection methods than on the original dataset with no feature selection. In general, feature selection methods can improve the performance of learning algorithms.
Following the investigation of feature selection techniques and three well-known machine learning algorithms, this work proposed a hybrid approach for diagnosing breast cancer based on the best-performing machine learning algorithm and the best-performing feature selection method: the Naïve Bayes learning algorithm combined with Consistency-based Subset Evaluation. The hybrid approach achieved a classification accuracy of 0.9628, which underlines its capability on the Wisconsin Breast Cancer dataset (WBC).
The research also introduced classification fusion using three well-known machine learning classifiers on the breast cancer datasets. Classification fusion has become a hot topic in the field of machine learning due to its capability of combining the advantages of several algorithms in a single one. This study confirms that the classification fusion approach can improve classification accuracy; however, its benefit depends on the characteristics of the classifiers involved, and there is no best combination of classifiers that suits all datasets. In the current experiments, Naïve Bayes and k-NN produced better results when combined as one classifier, with the maximum classification accuracy obtained on the WBC dataset (0.9642).
Based on the experiments with different machine learning algorithms and the Wisconsin Breast Cancer dataset, the study concludes that hybridisation of existing machine learning algorithms can produce better approaches for medical diagnosis. The idea is to combine the advantages of different algorithms in one approach.
The study also found that feature selection techniques (feature discrimination) can help improve prediction in the context of medical diagnosis (breast cancer diagnosis in this study). However, no specific feature selection method suits all machine learning tools.
Future work can be described as follows. The current research relied mainly on classification accuracy as the criterion for measuring the performance of the proposed approaches; future work will consider other criteria such as classification speed and computational cost. Future work can also broaden the disease options, which has already begun with some attempts on thyroid and hepatitis data (Figures 37-39 and Tables 16-18). In addition, the breast cancer datasets used in this study have binary outcomes (class labels); clinical practice, however, is often more complex, and outcomes may take different forms. It is envisaged that future work can contribute to knowledge and improve the accuracy and reliability of the established systems by broadening the databases and expanding the criteria for measuring their performance.
We also aim to contact the National eHealth Transition Authority to obtain real datasets and to discuss the integration of the proposed approaches with the eHealth system.
Table 16: Results for attribute selection methods with Naïve Bayes on three databases (Thyroid, Hepatitis, and Breast Cancer)

Method   Thyroid    Hepatitis   Breast Cancer
NB       92.60%     84.52%      95.99%
CFS      96.53% +   87.74% +    95.99% =
IG       93.88% +   85.16% +    95.99% =
R        92.60% =   84.52% =    95.99% =
PC       94.30% +   84.52% =    96.14% +
CB       94.59% +   84.52% =    96.28% +
SU       93.88% +   85.16% +    95.99% =
Figure 37: Results for attribute selection methods with Naïve Bayes on three databases (Thyroid, Hepatitis, and Breast Cancer)
Table 17: Results for attribute selection methods with k-NN on three databases (Thyroid, Hepatitis, and Breast Cancer)

Method   Thyroid    Hepatitis   Breast Cancer
k-NN     95.92%     81.94%      95.42%
CFS      96.10% +   84.52% +    95.42% =
IG       96.50% +   81.29% -    95.42% =
R        95.92% +   81.94% =    95.42% =
PC       95.78% -   81.29% -    95.42% =
CB       96.37% +   81.94% =    95.85% +
SU       96.50% +   81.29% -    95.42% =
Figure 38: Results for attribute selection methods with k-NN on three databases (Thyroid, Hepatitis, and Breast Cancer)
Table 18: Results for attribute selection methods with Decision Tree on three databases (Thyroid, Hepatitis, and Breast Cancer)

Method   Thyroid    Hepatitis   Breast Cancer
RT       96.92%     76.77%      94.56%
CFS      96.29% -   77.42% +    94.56% =
IG       96.63% -   74.19% -    94.56% =
R        97.22% +   76.77% =    94.56% =
PC       97.03% +   76.13% -    94.85% +
CB       97.16% +   80.65% +    93.56% -
SU       96.63% -   74.19% -    94.56% =
Figure 39: Results for attribute selection methods with Decision Tree on three databases (Thyroid, Hepatitis, and Breast Cancer)
References
1. Gunter, D.T. and P.N. Terry, The Emergence of National Electronic Health Record Architectures in the United States and Australia: Models, Costs, and Questions. J Med Internet Res, 2005. 7(1).
2. World Health Organization Assesses the World's Health Systems. World Health Organization [cited 2010 01/09/2010]; Available from: http://www.who.int/whr/2000/media_centre/press_release/en/index.html.
3. Gerard, A., et al., Health Care Spending and Use of Information Technology in OECD Countries. Health Affairs, 2006. 25(3).
4. Sorwar, G. and S. Murugesan, Electronic medical prescription: an overview of current status and issues, in Biomedical Knowledge Management: Infrastructures and Processes for E-Health Systems, M. Cooper and M. Gururajan (Editors). 2010, IGI Global: Hershey, PA. p. 61-81.
5. Lazarou, J., B. Pomeranz, and P. Corey, Incidence of adverse drug reactions in hospitalized patients: A meta-analysis of prospective studies. Journal of the American Medical Association, 1998. 279(15).
6. Medication Safety in the Community: A Review of the Literature. 2009 [cited 2010 01/09/2010]; Available from: www.safetyandquality...con/$File/25953-MS-NPS-LitReview2009.PDF.
7. Mammography Screening Can Reduce Deaths from Breast Cancer. 2002 [cited 2011 20/05/2011]; Available from: http://www.iarc.fr/en/mediacentre/pr/2002/pr139.html.
8. Most Frequent Cancers in Men and Women. 2008 [cited 2012 20/01/2012]; Available from: http://globocan.iarc.fr/factsheets/populations/factsheet.asp?uno=900.
9. General Information About Male Breast Cancer. 2012 [cited 2012 30/12/2012]; Available from: http://www.cancer.gov/cancertopics/pdq/treatment/malebreast/Patient/page1.
10. Breast Cancer in Australia: An Overview. Australian Institute of Health and Welfare, 2012.
11. Giarratano, J. and G. Riley, Expert Systems: Principles and Programming. 2nd ed. Vol. 1. 1994, Boston: PWS Publishing Company.
12. NEHTA Blueprint. 2010 [cited 2010 01/10/2010]; Available from: http://www.nehta.gov.au.
13. Tarca, A.L., et al., Machine Learning and Its Applications to Biology. PLoS Comput Biol, 2007. 3(6).
14. Rokach, L., Data Mining with Decision Trees: Theory and Applications. Vol. 69. 2007: World Scientific.
15. Rokach, L. and O. Maimon, eds. Data Mining and Knowledge Discovery Handbook. 2nd ed. 2010, Springer Science and Business.
16. Tan, P.-N., M. Steinbach, and V. Kumar, Introduction to Data Mining. 2006: Addison-Wesley.
17. Thirumuruganathan, S., A Detailed Introduction to k-Nearest Neighbor (k-NN) Algorithm. 2010.
18. Pevsner, J., Bioinformatics and Functional Genomics. 2nd ed. 2009, New York: Wiley-Blackwell.
19. Weisstein, E.W., Euclidean Metric. [cited 2011 19/08/2011]; Available from: http://mathworld.wolfram.com/EuclideanMetric.html.
20. Young, M., et al., Distance Metrics Overview. 2004 [cited 2011 03/08/2011]; Available from: http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Distance_Metrics_Overview.htm.
21. Mei, Z., Q. Shen, and B. Ye, Hybridized k-NN and SVM for gene expression data classification. Life Science Journal, 2009. 6(1).
22. Parvin, H., H. Alizadeh, and B. B, MKNN: Modified k-Nearest Neighbor, in Proceedings of the World Congress on Engineering and Computer Science. 2008: USA.
23. Cedeno, W. and D. Agrafiotis, Using particle swarms for the development of QSAR models based on k-nearest neighbor and kernel regression. Journal of Computer-Aided Molecular Design, 2003. 17(2-4).
24. Crookston, N. and A. Finley, yaImpute: An R Package for k-NN Imputation. Journal of Statistical Software, 2007. 23(10).
25. Wolberg, W. and L. Mangasarian, Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, 1990. 87: p. 9193-9196.
26. Li, S., E. Harner, and D. Adjeroh, Random k-NN feature selection - a fast and stable alternative to Random Forests. BMC Bioinformatics, 2011. 12(450).
27. Kotsiantis, S., Supervised Machine Learning: A Review of Classification Techniques. Informatica, 2007. 31: p. 249-268.
28. Priddy, K. and P. Keller, Artificial Neural Networks: An Introduction. 2005, Washington: SPIE.
29. Widrow, B. and M. Hoff, Adaptive Switching Circuits, in WESCON Conference Record. 1989. p. 709-717.
30. Grossberg, S., Adaptive Resonance Theory, in Encyclopedia of Cognitive Science. 2006, John Wiley & Sons, Ltd.
31. Javed, K., et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 2001. 7: p. 673-679.
32. Baxt, W.G., Use of an artificial neural network for data analysis in clinical decision-making: the diagnosis of acute coronary occlusion. Neural Comput., 1990. 2(4): p. 480-489.
33. Gershenson, C., Artificial neural networks for beginners. arXiv preprint cs/0308031, 2003.
34. Neuro AI, Neural networks: A requirement for intelligent systems. Available from: http://www.learnartificialneuralnetworks.com/.
35. Deligiorgi, D. and K. Philippopoulos, Spatial Interpolation Methodologies in Urban Air Pollution Modeling: Application for the Greater Area of Metropolitan Athens, Greece. Advanced Air Pollution. 2011.
36. Larose, D., Discovering Knowledge in Data: An Introduction to Data Mining. 2005, New Jersey: John Wiley & Sons, Inc.
37. Rokach, L. and O. Maimon, eds. Data Mining with Decision Trees. 2008, World Scientific Publishing.
38. Mitchell, T.M., Machine Learning. 2005: McGraw Hill.
39. Arzucan, O., Supervised and Unsupervised Machine Learning Techniques for Text Document Categorization, in Graduate Program in Computer Engineering. 2004, Bogazici University.
40. Caruana, R. and A. Niculescu-Mizil, An empirical comparison of supervised learning algorithms, in Proceedings of the 23rd International Conference on Machine Learning. 2006, ACM: Pittsburgh, Pennsylvania. p. 161-168.
41. Wu, X., et al., Top 10 algorithms in data mining. Knowl. Inf. Syst., 2007. 14(1): p. 1-37.
42. Hian, K. and T. Gerald, Data Mining Applications in Healthcare. Journal of Healthcare Information Management, 2005. 19(2): p. 64-72.
43. Dakins, D., Center takes data tracking to heart. Health Data Management, 2001. 9(1): p. 32-36.
44. Biafore, S., Predictive solutions bring more power to decision makers. Health Management Technology, 1999. 20(10): p. 12-14.
45. Milley, A., Healthcare and data mining. Health Management Technology, 2000. 21(8): p. 44-47.
46. Hallick, J., Analytics and the data warehouse. Health Management Technology, 2001. 22(6): p. 24-25.
47. Feng, D.D., Biomedical Information Technology. 2008, Amsterdam; Boston: Elsevier/Academic Press.
48. Ennis, R.L., et al., A continuous real-time expert system for computer operations. IBM J. Res. Dev., 1986. 30(1): p. 14-28.
49. Cios, K.J. and G.W. Moore, Uniqueness of medical data mining. Artif. Intell. Med., 2002. 26(1-2): p. 1-24.
50. Moore, G.W., et al., A prototype Internet autopsy database: 1625 consecutive fetal and neonatal autopsy facesheets spanning 20 years. Arch Pathol, 1996. 120(8): p. 782-785.
51. Hand, D.J., Data mining: statistics and more. The American Statistician, 1999. 52(2): p. 112-118.
52. Sanderson, S., Electronic Health Records for Allied Health Careers. 2009: McGraw-Hill.
53. Song, H., et al., New methodology of computer aided diagnostic system on breast cancer, in Proceedings of the Second International Conference on Advances in Neural Networks - Volume Part III. 2005, Springer-Verlag: Chongqing, China. p. 780-789.
54. Arulampalam, G. and A. Bouzerdoum, Application of shunting inhibitory artificial neural networks to medical diagnosis, in Intelligent Information Systems Conference, The Seventh Australian and New Zealand. 2001.
55. Setiono, R., Generating Concise and Accurate Classification Rules for Breast Cancer Diagnosis. Artificial Intelligence in Medicine, 2000. 18(3): p. 205-219.
56. Meesad, P. and G. Yen, Combined numerical and linguistic knowledge representation and its application to medical diagnosis. Component and Systems Diagnostics, Prognostics, and Health Management II, 2003. 4733: p. 98-109.
57. Han, J. and M. Kamber, Data Mining: Concepts and Techniques. 3rd ed. 2011, San Francisco: Morgan Kaufmann.
58. Thrun, S.B., et al., The MONK's Problems: A Performance Comparison of Different Learning Algorithms. 1991, Carnegie Mellon University: Pittsburgh, PA.
59. Langley, P. and S. Sage, Induction of Selective Bayesian Classifiers, in Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI 1994). 1994: p. 399-406.
60. Hall, M.A. and G. Holmes, Benchmarking Attribute Selection Techniques for Discrete Class Data Mining. IEEE Transactions on Knowledge and Data Engineering, 2003. 15(3).
61. Ashraf, M., et al., A New Approach for Constructing Missing Features Values. International Journal of Intelligent Information Processing, 2012. 3(1): p. 110-118.
62. Guyon, I., et al., An introduction to variable and feature selection. J. Mach. Learn. Res., 2003. 3: p. 1157-1182.
63. Saeys, Y., I. Inza, and P. Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics, 2007. 23(19): p. 2507-2517.
64. Kohavi, R., D. Sommerfield, and J. Dougherty, Data Mining using MLC++, a Machine Learning Library in C++. IEEE, 1996.
65. Kohavi, R. and G.H. John, Wrappers for feature subset selection. Artif. Intell., 1997. 97(1-2): p. 273-324.
66. Lal, T., et al., Embedded Methods, in Feature Extraction, I. Guyon, et al. (Editors). 2006, Springer Berlin Heidelberg. p. 137-165.
67. Kononenko, I., Estimating attributes: Analysis and extensions of RELIEF, in Machine Learning: ECML-94. 1994.
68. Hall, M.A., Correlation-based Feature Selection for Machine Learning, in Department of Computer Science. 1999, The University of Waikato: Hamilton.
69. Rutkowski, L., et al., eds. Artificial Intelligence and Soft Computing, Part I. Lecture Notes in Computer Science 6113. Vol. 1. 2010, Springer: Poland. p. 487-498.
70. Guyon, I. and A. Elisseeff, An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 2003. 3: p. 1157-1182.
71. Jolliffe, I.T., Principal Component Analysis. 2002, Springer: NY.
72. Liu, H. and R. Setiono, A probabilistic approach to feature selection: A filter solution, in Proceedings of the 13th International Conference on Machine Learning. 1996. Morgan Kaufmann.
73. Liu, H. and R. Setiono, Chi2: Feature selection and discretization of numeric attributes, in Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence. 1995.
74. Xie, J., et al., Novel Hybrid Feature Selection Algorithms for Diagnosing Erythemato-Squamous Diseases, J. He, et al. (Editors). 2012, Springer Berlin Heidelberg. p. 173-185.
75. Liao, B., et al., A Novel Hybrid Method for Gene Selection of Microarray Data. Journal of Computational and Theoretical Nanoscience, 2012. 9(1): p. 5-9.
76. Vijayasankari, S. and K. Ramar, Enhancing Classifier Performance Via Hybrid Feature Selection and Numeric Class Handling - A Comparative Study. International Journal of Computer Applications, 2012. 41(17): p. 30-36.
77. Leach, M., Parallelising Feature Selection Algorithms. 2012, University of Manchester: Manchester.
78. Grzymala-Busse, J.W. and W.J. Grzymala-Busse, Handling Missing Attribute Values, in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach (Editors). 2010, Springer US. p. 33-51.
79. Rubin, D.B., Inference and missing data. Biometrika, 1976. 63(3): p. 581-592.
80. Howell, D., Treatment of Missing Data. 2009.
81. Marlin, B., Missing Data Problems in Machine Learning, in Department of Computer Science. 2008, University of Toronto: Canada.
82. Dempster, A., N. Laird, and D. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 1977. 39(1): p. 1-39.
83. Moss, S., Expectation maximization - to manage missing data. 2009.
84. Galliers, R., Choosing Information Systems Research Approaches, in Information Systems Research: Issues, Methods and Practical Guidelines, R. Galliers (Editor). 1992, Alfred Waller: Henley-on-Thames. p. 144-162.
85. Dash, N.K., Module: Selection of the Research Paradigm and Methodology. 2005.
86. UCI Machine Learning Repository. [cited 2010]; Available from: http://archive.ics.uci.edu/ml/about.html.
87. Witten, I. and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2000: Morgan Kaufmann.
88. Mathworks, Matlab Overview. 1994 [cited 01/09/2012]; Available from: http://www.mathworks.com.au/products/matlab/.
89. Jang, J.-S.R., ANFIS: Adaptive-Network-based Fuzzy Inference System. IEEE Transactions on Systems, Man, and Cybernetics, 1993. 23(3): p. 665-685.
90. Übeyli, E.D., Adaptive Neuro-Fuzzy Inference Systems for Automatic Detection of Breast Cancer. J. Med. Syst., 2009. 33(5): p. 353-358.
91. Grzymala-Busse, J., Data with Missing Attribute Values: Generalization of Indiscernibility Relation and Rule Induction, in Transactions on Rough Sets I, J. Peters, et al. (Editors). 2004, Springer Berlin/Heidelberg. p. 78-95.
92. White, A., Probabilistic induction by dynamic path generation in virtual trees, in Annual Technical Conference of the British Computer Society Specialist Group on Expert Systems. 1986: Brighton (UK). p. 35-46.
93. Kononenko, I., I. Bratko, and E. Roskar, Experiments in automatic learning of medical diagnostic rules, in International School for the Synthesis of Expert's Knowledge Workshop, Bled, Slovenia. 1984.
94. Quinlan, J.R., Induction of decision trees. Machine Learning, 1986. 1(1): p. 81-106.
95. Meng, X. and N. Schenker, Maximum likelihood estimation for linear regression models with right censored outcomes and missing predictors. Computational Statistics & Data Analysis, 1999. 29(4): p. 471-483.
96. Rubin, D.B., Multiple Imputation for Nonresponse in Surveys. 2004, New York: Wiley & Sons.
97. Marshall, A., et al., Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Medical Research Methodology, 2010. 10(1): p. 7.
98. Santhakumaran, F.P., An Algorithm to Reconstruct the Missing Values for Diagnosing the Breast Cancer. Global Journal of Computer Science and Technology, 2010. 10(2): p. 25-28.
99. Simon, R., et al., Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification. Journal of the National Cancer Institute, 2003. 95(1): p. 14-18.
100. Blum, A.L. and P. Langley, Selection of relevant features and examples in machine learning. Artificial Intelligence, 1997. 97(1-2): p. 245-271.
101. Lee, S.L., Thyroid Problems. 2012 [cited 2012 01/03/2012]; Available from: http://www.emedicinehealth.com/thyroid_problems/article_em.htm.
102. Introducing Hepatitis C. 2012; Available from: http://www.hep.org.au/.
103. Hall, M.A. and L.A. Smith, Feature subset selection: a correlation based filter approach. 1997.
104. Forman, G., An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 2003. 3: p. 1289-1305.
105. Zhang, H. and J. Su, Naïve Bayes for optimal ranking. Journal of Experimental & Theoretical Artificial Intelligence, 2008. 20(2): p. 79-93.
106. Ashraf, M., K. Le, and X. Huang, Information Gain and Adaptive Neuro-Fuzzy Inference System for Breast Cancer Diagnoses, in International Conference on Computer Sciences and Convergence Information Technology (ICCIT). 2010, IEEE: Seoul. p. 911-915.
107. Buxton, B.F., W.B. Langdon, and S.J. Barrett, Data Fusion by Intelligent Classifier Combination. Measurement and Control, 2001. 34(8): p. 229-234.
108. Goonatilake, S. and S. Khebbal, Intelligent Hybrid Systems. 1994: John Wiley & Sons, Inc.
109. Tsoumakas, G., L. Angelis, and I. Vlahavas, Selective fusion of heterogeneous classifiers. Intelligent Data Analysis, 2005. 9(6): p. 511-525.
110. Džeroski, S. and B. Ženko, Is combining classifiers with stacking better than selecting the best one? Machine Learning, 2004. 54(3): p. 255-273.
111. Kuncheva, L. and C. Whitaker, Feature Subsets for Classifier Combination: An Enumerative Experiment, in Multiple Classifier Systems, J. Kittler and F. Roli (Editors). 2001, Springer Berlin Heidelberg. p. 228-237.
112. Grabusts, P., The Choice of Metrics for Clustering Algorithms, in Proceedings of the 8th International Scientific and Practical Conference, Volume II. 2011. p. 70-76.
113. Jurman, G., S. Riccadonna, R. Visintainer, and C. Furlanello, Canberra distance on ranked lists, in Proceedings, Advances in Ranking - NIPS 09 Workshop. 2009. p. 22-27.
Workshop , 2009, p. 22-27.
137