Mining Health Data for Breast Cancer Diagnosis Using Machine Learning

Mohammad Ashraf Bani Ahmad

A thesis submitted for the requirements of the Degree of Doctor of Philosophy

Faculty of Education, Science, Technology & Mathematics

December 2013

In the name of Allah, the Most Merciful, the Most Compassionate. "Recite and your Lord is the most Generous - Who taught by the pen - Taught man that which he knew not." Surat Al-`Alaq (The Clot), the Holy Quran.

Abstract

The recent advancements in computer technologies and storage capabilities have produced an incredible amount of data and information from many sources, such as social networks, online databases, and health information systems. Nowadays, many countries around the world are changing the way health care is delivered to patients by utilising advances in computer and communication technologies through electronic health. Electronic health (eHealth) is the process of using emerging information and communication technologies in health care for the benefit of humans. eHealth includes a range of components such as electronic health records, electronic prescriptions, and electronic and mobile treatments for patients. In Australia, the majority of medical and health coverage is provided by the government, and due to a shortage of medical personnel and appropriate supportive technologies, many people suffer long waiting times and limited medical resources. Therefore, the Australian federal, territory, and state governments promoted the inclusion of eHealth technologies in the health care system, to cope with the increased demand on health services and to help solve some of the problems facing traditional health systems. This initiative produced the National eHealth Transition Authority Limited (known as NEHTA). The main purpose of NEHTA is to develop better ways of electronically collecting and securely exchanging health information across Australia.
Since July 2012, anyone seeking healthcare in Australia can register for a personally controlled electronic health record. This can lead to a huge repository of Australian health care records. This huge amount of data can be turned into knowledge and more useful forms of data by using computing and machine learning tools. It is believed that engineering this amount of data can aid in developing expert systems for decision support that can assist physicians in diagnosing and predicting some debilitating, life-threatening diseases such as breast cancer. Expert systems for decision support can reduce cost and waiting times, free human experts (physicians) for more research, and reduce the errors and mistakes that humans can make due to fatigue. However, the process of utilising health data effectively involves many challenges, such as the problem of missing feature values, the curse of dimensionality due to a large number of features (attributes), and the choice of features that can lead to more accurate results (a more accurate diagnosis). Effective machine learning tools can assist in the early detection of diseases such as breast cancer, and the current work in this thesis focuses on investigating novel approaches to diagnosing breast cancer based on machine learning tools. It involves the development of new techniques to construct and process missing feature values, the investigation of different feature selection methods, and how to employ them in the diagnosis process. It is believed that the adoption of electronic health systems into the health care system requires comprehensive design and development, which may need several stages to make it more useful for humans and governments. For example, storing health records and electronically exchanging them across the country are not the only aims of eHealth.
Treating health records as an important information resource, and probing the data to extract useful diagnostic and disease-related intelligence using automated approaches (based on the most significant features, for example), may lead to new tools and approaches for examining new cases (patients) based on previous, similar cases, using machine learning and computer intelligence. It is this process of mapping existing data onto new, unseen scenarios and settings that can increase understanding of disease-related information, such as the early onset of disease and better monitoring of its different stages. This adds value to health care technologies: enhanced quality of service to patients, better assistance to doctors (for example, an electronic consultant for doctors), and easier cross-validation of standard disease diagnostic procedures. The thesis proposes several approaches to make this vision a reality. The main findings of this research can be categorised as follows:

• The thesis proposed a new approach for diagnosing breast cancer by reducing the number of features to an optimal number using the information gain algorithm, and then applying the reduced-feature dataset to the Adaptive Neuro-Fuzzy Inference System (ANFIS). The accuracy of the proposed approach is 98.24%, which compares favourably with previous work. These promising results may lead to further attempts to utilise and exploit information technology for diagnosing patients and providing decision support to physicians.

• The thesis proposed a new approach for constructing missing feature values based on iterative k nearest neighbours and distance functions. The approach iterates until it finds the most suitable feature values that satisfy classification accuracy.
The proposed approach showed an improvement of 0.005 in classification accuracy on the constructed dataset over the original dataset with both the Euclidean and Minkowski distance functions. The study found that the Manhattan, Chebychev, and Canberra distance metrics produced lower classification accuracy on the new dataset than on the original dataset. The study also noticed that classification accuracy depends greatly on the number of neighbours (k). The experimental evaluation showed that fewer neighbours may lead to higher accuracy; the reason for this, in my opinion, is the amount of noise produced by conflicting neighbours. Finally, the maximum classification accuracy, 0.9698, was obtained at k=1.

• Different sets of experiments were performed to evaluate benchmark attribute selection methods on a well-known, publicly available dataset from the UCI machine learning repository, the Wisconsin Breast Cancer dataset (WBC). Naïve Bayes performed best in terms of classification accuracy. k-NN and Decision Tree performed only slightly better on the dataset after applying feature selection methods. In general, feature selection methods can improve the performance of learning algorithms; however, no single feature selection method best satisfies all datasets and learning algorithms.

• In regard to classification fusion of three well-known machine learning classifiers on breast cancer datasets, the study confirms the argument that the best combination of a set of classifiers depends on the application and on the classifier characteristics. In addition, there is no best combination of classifiers that suits all datasets. However, in the current experiments, Naïve Bayes and k-NN produced better results when combined as one classifier, with the maximum classification accuracy obtained on the WBC dataset (0.9642).

Acknowledgements

I would like to thank everyone who has helped me to complete this thesis.
Special, deep, and honest thanks to the supervision panel, chiefly Dr Girija Chetty, for the guidance, the smile, and her advice throughout this research. Big thanks to all faculty staff and employees; I'm pleased to have been a small part of such a great place for more than three years, principally Professor Dharmendra Sharma, A/Prof. Dat Tran for the support provided throughout my journey, and Professor George Cho for his comments and suggestions. My distinctive thanks to my wife; you were a complete package of family and love that gave me strength, support, and love during tough times. My son Hashim and daughter Yarra, you are my world, my words, my strength, and the reason for my life. Thank you. Words can't express my thanks to my family, the source of power and strength; my parents for raising me, especially my Mum, thanks aren't enough and wouldn't be enough, and sorry for being away from your warm and kind bosom, but I'm back, hopefully soon. My Dad, thank you for everything you have done for me, for the advice, encouragement, and support throughout my life. My brothers and sisters, thank you.

This thesis is dedicated to Mum, Dad, wife Fayha, son Hashim, and daughter Yarra.

Table of Contents

Abstract ........ iii
Acknowledgements ........ ix
Table of Contents ........ xi
List of Figures ........ xiv
List of Tables ........ xvi
Acronyms ........ xvii
Chapter One: Introduction ........ 1
1.1 Overview ........ 1
1.2 Research Motivation ........ 3
1.3 Research Objectives ........ 7
1.4 Research Contribution ........ 11
1.5 Research Methodology ........ 14
1.6 Thesis Road Map ........ 15
1.7 Chapter Summary ........ 17
Chapter Two: Background Study and Literature Review ........ 18
2.1 Overview ........ 18
2.2 Background Study ........ 18
2.3 Classification ........ 19
2.3.1 k-Nearest Neighbors algorithm ........ 21
2.3.2 Artificial Neural Network ........ 27
2.3.3 Decision Tree ........ 31
2.3.4 Naïve Bayes Classifier ........ 34
2.4 Data Mining in Healthcare ........ 36
2.4.1 Treatment Effectiveness ........ 36
2.4.2 Healthcare Management ........ 37
2.4.3 Customer Relationship Management ........ 37
2.4.4 Fraud and Abuse ........ 37
2.4.5 Computer Aided Diagnosis ........ 38
2.4.6 Ethical, Legal, and Social Issues ........ 40
2.4.7 Challenges of Data Mining in Healthcare ........ 43
2.4.8 Electronic Health Record ........ 45
2.5 Related Work on Breast Cancer Diagnosis ........ 46
2.6 Feature Selection Techniques ........ 47
2.6.1 Wrapper Feature Selection Technique ........ 49
2.6.2 Filters Feature Selection Techniques ........ 51
2.6.3 Embedded Feature Selection Techniques ........ 52
2.6.4 Feature Selection Techniques Used in Current Work ........ 53
2.6.5 Related Work on Feature Selection Techniques ........ 56
2.7 Missing Feature Values ........ 58
2.7.1 Types of Missing Values ........ 59
2.7.2 Handling Missing Data ........ 60
2.8 Chapter Summary ........ 65
Chapter Three: Research Methodology ........ 66
3.1 Introduction ........ 66
3.2 Data Mining Methodology ........ 68
3.2.1 Data Collection ........ 70
3.2.2 Data Selection ........ 72
3.2.3 Data Pre-Processing ........ 72
3.2.4 Applying Data Mining Methods ........ 73
3.2.5 Evaluation ........ 75
3.2.6 Machine Learning Software Development Tools ........ 76
3.2.7 Results Visualization ........ 76
3.3 Chapter Summary ........ 77
Chapter Four: Breast Cancer Diagnosis Based on Information Gain and Adaptive Neuro Fuzzy Inference System ........ 78
4.1 Overview ........ 78
4.2 Adaptive Neural Fuzzy Inference System (ANFIS) ........ 78
4.2.1 ANFIS Structure ........ 79
4.2.2 ANFIS Learning ........ 81
4.3 Information Gain ........ 82
4.4 The Proposed IG-ANFIS Approach ........ 83
4.5 The Experimental Results ........ 84
4.6 Summary and Discussion ........ 91
Chapter Five: Iterative Weighted k-NN for Constructing Missing Feature Values in Wisconsin Breast Cancer Dataset ........ 92
5.1 Overview ........ 92
5.2 Missing Feature Values ........ 92
5.3 The Proposed Method ........ 95
5.4 The Experimental Results ........ 98
5.5 Summary and Discussion ........ 100
Chapter Six: Diagnosing Breast Cancer Based on Feature Selection and Naïve Bayes ........ 102
6.1 Overview ........ 102
6.2 Feature Selection Techniques ........ 103
6.3 Feature Selection Techniques used in this Chapter ........ 104
6.4 The Experiment Methodology ........ 105
6.5 The Experimental Results ........ 107
6.6 Summary and Discussion ........ 113
Chapter Seven: Fusion of Heterogeneous Classifiers for Breast Cancer Diagnosis ........ 114
7.1 Overview ........ 114
7.2 Multi-Classification Approach ........ 115
7.2.1 Classifier Selection ........ 115
7.2.2 Fusion Classifier ........ 115
7.3 Classifiers Combination Strategies ........ 116
7.4 Experimental Methodology ........ 117
7.5 Experimental Results ........ 118
7.6 Summary and Discussion ........ 121
Chapter Eight: Discussion and Future Work ........ 122
References ........ 130

List of Figures

Figure 1: Medical Doctors per 1000 population in selected countries in Organization for Economic Cooperation and Development (OECD) Countries, 2009. ........ 6
Figure 2: Number of MRI units per one million population in selected countries in Organization for Economic Cooperation and Development (OECD) Countries, 2003. ........ 7
Figure 3: Updated eHealth Architecture Including the proposed integrated intelligent system [12] ........ 10
Figure 4: General approach for building a classification model ........ 20
Figure 5: Example of k-NN [16] ........ 22
Figure 6: k-NN characteristics in regard to some learning features. ........ 26
Figure 7: Human neuron [33] ........ 28
Figure 8: Artificial Neuron ........ 29
Figure 9: Simplified neuron operation ........ 29
Figure 10: ANN architecture ........ 30
Figure 11: ANN characteristics in regard to some learning features. ........ 30
Figure 12: Simple Decision Tree ........ 31
Figure 13: Decision Tree characteristics in regard to some learning features. ........ 33
Figure 14: Bayesian classifier characteristics in regard to some learning features. ........ 35
Figure 15: The Wrapper approach for features subset selection [65] ........ 50
Figure 16: The filter approach [56] ........ 51
Figure 17: Research Method Overview ........ 69
Figure 18: (a) Fuzzy Reasoning (b) Equivalent ANFIS Structure [89]. ........ 82
Figure 19: The general structure for the proposed approach ........ 84
Figure 20: Information Gain Ranking on WBC ........ 86
Figure 21: Sugeno Fuzzy Inference System with four features input and single output ........ 87
Figure 22: Input Membership Function for the feature "Uniformity of Cell Size" ........ 88
Figure 23: The structure for the proposed approach (IG-ANFIS) ........ 89
Figure 24: ANFIS Structure on MATLAB ........ 89
Figure 25: Comparison of classification accuracy between IG-ANFIS and some previous work ........ 90
Figure 26: The Flowchart for the proposed method (Constructing Missing Feature Values) ........ 97
Figure 27: A comparison of classification accuracy for the proposed method through Euclidean/k-NN ........ 99
Figure 28: A comparison of classification accuracy for the proposed method through Minkowski/k-NN ........ 100
Figure 29: Hybrid method of feature selection technique and a learning algorithm ........ 106
Figure 30: Attributes selection methods with Naïve Bayes ........ 108
Figure 31: Results for attributes selection methods with k-NN ........ 110
Figure 32: Results for attributes selection methods with Decision Tree ........ 112
Figure 33: Hybrid method of feature selection technique and a learning algorithm ........ 113
Figure 34: Single Classifier on three datasets WBC, WDBC, and WPBC. ........ 119
Figure 35: Two Classifiers on three datasets WBC, WDBC, and WPBC. ........ 120
Figure 36: The Fusion of three classifiers on three datasets WBC, WDBC, and WPBC. ........ 121
Figure 37: Results for attributes selection methods with Naïve Bayes on three databases (Thyroid, Hepatitis, and Breast Cancer) ........ 127
Figure 38: Results for Attributes Selection Methods with k-NN on three databases (Thyroid, Hepatitis, and Breast Cancer) ........ 128
Figure 39: Results for Attributes Selection Methods with Decision Tree on three databases (Thyroid, Hepatitis, and Breast Cancer) ........ 129

List of Tables

Table 1: The confusion matrix for classifier c(x) on matrix X that contains 160 records. ........ 21
Table 2: Examples, advantages, and disadvantages of wrapper feature selection [63] ........ 50
Table 3: Examples, advantages, and disadvantages of filter feature selection [63] ........ 52
Table 4: Examples, advantages, and disadvantages of embedded feature selection [63] ........ 53
Table 5: Extract of data to demonstrate Expectation Maximization [83] ........ 62
Table 6: The calculations of mean, variance, and covariance for the features depression, age, height, and weight. ........ 63
Table 7: The final data set after performing the Expectation Maximization method. ........ 64
Table 8: Selection of research paradigms and research methods [85] ........ 67
Table 9: Sample of Wisconsin Breast Cancer Diagnosis dataset ........ 71
Table 10: Information Gain Ranking Using WEKA on WBC ........ 85
Table 11: Comparison of classification accuracy between IG-ANFIS and some previous work ........ 90
Table 12: Results for Attributes Selection Methods with Naïve Bayes. ........ 107
Table 13: Results for Attributes Selection Methods with k-NN ........ 109
Table 14: Results for Attributes Selection Methods with Decision Tree ........ 111
Table 15: Statistics of Breast Cancer Datasets ........ 118
Table 16: Results for attributes selection methods with Naïve Bayes on three databases (Thyroid, Hepatitis, and Breast Cancer) ........ 126
Table 17: Results for Attributes Selection Methods with k-NN on three databases (Thyroid, Hepatitis, and Breast Cancer) ........ 128
Table 18: Results for Attributes Selection Methods with Decision Tree on three databases (Thyroid, Hepatitis, and Breast Cancer) ........ 129

Acronyms

ADALINE: Adaptive Linear Element
ANFIS: Adaptive Neuro-Fuzzy Inference System
ANN: Artificial Neural Network
CAD: Computer Aided Diagnosis
CART: Classification and Regression Tree
CES: Consistency Based Subset Evaluation
CFS: Correlation Based Feature Selection
DM: Data Mining
eHealth: Electronic Health
EHR: Electronic Health Record
ERR: Error Rate
FIS: Fuzzy Inference System
GA: Genetic Algorithm
HIS: Hybrid Intelligent System
IG: Information Gain
IGANFIS: Information Gain and Adaptive Neuro-Fuzzy Inference System
k-NN: k Nearest Neighbors
LSE: Least Square Estimate
MAR: Missing At Random
MCAR: Missing Completely At Random
ML: Machine Learning
MNAR: Missing Not At Random
NEHTA: National Electronic Health Transition Authority
OECD: Organization for Economic Cooperation and Development
PCA: Principal Component Analysis
R: Relief
RT: Random Tree
SBFS: Sequential Backward Floating Search
SFFS: Sequential Forward Floating Search
SFS: Sequential Forward Search
SU: Symmetrical Uncertainty
UCI Machine Learning Repository: University of California Irvine Machine Learning Repository
WBC: Wisconsin Breast Cancer Dataset
WEKA: Waikato Environment for Knowledge Analysis

CHAPTER ONE
Introduction

1.1 Overview

The advancement of information technology, software development, and system integration techniques has produced a new generation of complex computer systems. These systems have presented challenges to information technology researchers, including compatibility between heterogeneous systems, security and privacy issues, systems management, sharing of data, and reusing and benefiting from existing resources and data. An example of a complex system is the healthcare system. Recently, there has been increased interest in utilising advances in communication and data mining technologies in healthcare systems.
Also, many countries are changing the way they conduct healthcare, moving towards nationwide healthcare systems by setting standards for healthcare communication and building electronic healthcare records. The Electronic Health Record (EHR) is a systematic collection of electronic health data about individual patients or populations, capable of being shared across healthcare providers within a state or country [1]. Health records may include a range of data, including general medical records, patient examinations, patient treatments, medical history, allergies, immunization status, laboratory results, radiology images, and other information useful for examination. This rich information may help researchers in examining and diagnosing diseases using computer techniques. Using EHRs may help in reducing the cost of legacy systems, improving the quality of care, and increasing the mobility of records. However, issues of privacy and security in such models have been a concern for patients and governments.

The existence of EHRs led researchers to the idea of an electronic healthcare system in which the components of the legacy healthcare system (facilities, workforce, providers of therapeutics, and education and research institutions) come together and electronically share and transfer patient information over public infrastructure across the country. Australia is moving fast towards electronic health care information systems across the country. This movement will produce a huge EHR repository for the Australian population and healthcare providers, and this health-related information and data can be a valuable asset. Therefore, the aim of the current work is to investigate ways of utilising health data for the benefit of humans by using novel machine learning and data mining techniques. The idea is to propose an automated method for diagnosing diseases based on previous data and information.
However, there are several problems associated with effectively utilising this previously acquired patient data, which can make any electronic healthcare system less efficient. These problems are: missing values and how to process them; the large number of features (attributes) and how to select the most beneficial ones; and the extraction of accurate diagnostic markers that can predict the early onset of disease and support the monitoring of its different stages. This thesis investigates these issues and proposes methods for breast cancer as an example, based on the power of automated technologies and previous evidence or data. The scope of the thesis is limited to the problems outlined above, and does not include other equally important issues such as privacy and security. In this research, EHRs will be used as data sources for developing automatic data mining and machine learning techniques, so as to produce useful patterns and decision support logic for automatic computer aided diagnosis. For the investigations in this project, the study used well-known datasets that are publicly available for research purposes. It is envisaged that the novel algorithms and techniques developed and validated on these datasets can be extended to real clinical environments by integrating them into clinical computer aided diagnosis and decision support systems; they serve as trial data before the proposed methods are integrated into real clinical environments.

1.2 Research Motivation

In Australia and all over the world, people suffer from limited medical resources and long waiting times to receive medical services. According to the World Health Organization (WHO), Australia was ranked 32nd out of 190 countries in the field of health care systems [2].
A study shows that Australia had fewer practicing physicians and fewer care beds per one thousand people than the median of countries in the Organisation for Economic Co-operation and Development (OECD) [3]. The increasing and ageing population of Australia, the modern lifestyle, climate change, and newly emerging diseases have challenged Australian health organisations and state governments to set procedures and plans to manage the available medical resources and infrastructure, and to deliver decent healthcare services for residents despite the shortages in medical personnel and equipment. In addition, medical services are essential for all individuals, and it is the nation's responsibility to develop and sustain the medical infrastructure and services for all residents and citizens. In addition to the shortages in medical personnel and technology, incidents of prescription errors have increasingly caused minor to major problems for patients. For example, serious health problems may occur because of Adverse Drug Effects (ADE). ADEs are caused by mistaken prescriptions, errors in dosage, miscommunication between physicians and pharmacies, errors in dispensing and administering drugs, and inappropriate numbers of drugs taken [4]. For example, one study [5] shows that ADEs may rank as the sixth leading cause of death in the United States, after heart diseases, cancer, stroke, pulmonary diseases, and road accidents. In Australia, the Department of Health and Ageing estimated that around 140,000 hospital admissions every year are due to ADE incidents [6]. These problems may be avoided by systematic information transfer between different healthcare providers (hospitals, medical centres, pharmacies, pathology laboratories, etc.). Another issue that confronts countries including Australia is the shortage of medical doctors.
Figure 1 shows a comparison between Australia and selected countries in the Organisation for Economic Co-operation and Development (OECD) in terms of the number of medical doctors. The figure shows 1.43 general practitioners, 1.35 specialists, and a total of 2.81 physicians per 1,000 population. The availability of innovative eHealth technologies, such as the one proposed in this research, can help alleviate this shortage. Breast cancer has become a common disease around the world. Every year, millions of women suffer from this debilitating, life-threatening disease, making it the second most common non-skin cancer after lung cancer and the fifth leading cause of death among cancers in the world [7]. Discovering the disease in its early stages may reduce the breast cancer tragedy. Computing technologies and machine learning tools can be used to assist physicians in diagnosing and predicting the disease so they can provide the necessary treatment and prevent its impact, including the possibility of death. More specifically, breast cancer accounts for about 22.9% of all cancers in women, excluding skin cancers [8]. For example, breast cancer caused 458,503 deaths worldwide in 2008 [8]. Breast cancer affects women about 100 times more often than men, although men tend to have poorer outcomes due to delays in diagnosis [9]. Survival rates for breast cancer vary greatly depending on the cancer type, stage, treatment, and geographical location of the patient. For instance, survival rates in the Western world are high; in developing countries, however, survival rates are much poorer. Therefore, this work offers the hope that this research and the related future work will make contributions that can help in better diagnosis of breast cancer for men and women worldwide, especially in countries with poor health services.
In Australia, breast cancer is the most common cancer in women (excluding two types of non-reportable skin cancer), representing over a quarter (28%) of all reported cancer cases in women in 2006. A total of 12,614 breast cancer cases were diagnosed in women in 2006, the largest number recorded to the date of the study (until 2009). More than two-thirds (69%) of these cases were in women aged 40 to 69 years. In the same year, 102 cases of breast cancer were diagnosed in men, accounting for 0.8% of breast cancer cases [10]. While breast cancer is the most commonly reported cancer in non-Indigenous women in the four jurisdictions for which data were available, Indigenous women were significantly less likely to be diagnosed with breast cancer than non-Indigenous women from 2002 to 2006 (69 and 103 new cases per 100,000 women, respectively) [10]. Worldwide, breast cancer was the sixth leading cause of burden of disease for women in 2003, and it accounted for 7% of all years of life lost due to premature mortality [10].

Figure 1: Medical doctors per 1,000 population in selected Organisation for Economic Co-operation and Development (OECD) countries, 2009.

Technology availability is also a challenge for countries. Figure 2 demonstrates the availability of Magnetic Resonance Imaging (MRI) in selected OECD countries. The figure shows 3.7 MRI units per one million population in Australia. Therefore, these shortages in medical resources drive researchers to look for more effective solutions for the benefit of society.
Computer scientists can utilise the latest technologies in machine learning to produce models and methods that can assist physicians in the process of examination and treatment.

Figure 2: Number of MRI units per one million population in selected Organisation for Economic Co-operation and Development (OECD) countries, 2003.

1.3 Research Objectives

Computing and machine learning tools can significantly help in solving healthcare problems through expert systems that assist physicians in diagnosing and predicting diseases in their early stages. These systems can reduce costs and waiting times, free human experts for more research, and reduce the errors and mistakes made by medical personnel [11]. Computer Aided Diagnosis (CAD), medical expert systems, and related tools have become one of the foremost research areas in the field of medical diagnosis. The aim of CAD is to design an expert system that combines human expertise and machine intelligence to achieve more accurate diagnosis effectively [11]. CAD can be used to assist physicians in diagnosing and predicting diseases. Accordingly, physicians can provide the necessary treatment promptly to prevent loss, including the possibility of death. In Australia, the National E-Health Transition Authority (NEHTA) was established by the Australian state and territory governments to develop electronic and secure exchange of health information between healthcare providers. The project was expected to be complete by the end of 2012 [12]. From July 2012, anyone seeking healthcare in Australia can register for a personally controlled electronic health record. The result will be a huge database of Australian health care records. This database can be utilised for research purposes after applying Australian privacy and information-use standards and policies.
The outcomes of the present thesis work may fit as a component in the Australian national eHealth system. The aims of this research work are:

• To utilise patients' histories, health information, and databases for discovering and diagnosing diseases, and to provide decision support to medical professionals. The research is expected to establish models that can assist physicians in diagnosing diseases and grouping patients into useful patterns based on different risk factors, and to show how machine learning techniques can identify such patterns. This can help in detecting the early onset of a disease, identifying disease stages, and planning treatment.

• To address the important issue of missing values, which can play an important role in determining the performance improvements achieved by data mining and machine learning algorithms.

• To work with large numbers of features and attributes in a dataset and identify the significance of some features over others. A large number of features can lead to the curse of dimensionality and can render a machine learning algorithm or technique limited in terms of accuracy, precision, and specificity.

Therefore, this thesis proposes new methods for constructing missing feature values, investigates feature selection techniques, and develops new machine learning algorithms for providing an automatic computer-aided diagnosis and decision support system for breast cancer diagnosis. The aim is to develop an integrated system with a principled workflow (constructing missing feature values, feature selection, and classification algorithms).
This work envisages that the outcomes of this research, in the form of an integrated computer-aided decision support and diagnosis system with a principled workflow of algorithmic techniques for dealing with missing feature values, selecting significant features, and applying machine learning based classification, can enhance the accuracy with which benign and malignant forms of the disease are identified, and can contribute as a component in the Australian national eHealth system. Figure 3 shows the proposed architecture for the eHealth system along with the proposed integrated intelligent system. Figure 3 is an updated architecture from NEHTA [12] and includes the outcomes of this research as a possible component of the system for future adoption.

Figure 3: Updated eHealth architecture including the proposed integrated intelligent system [12].

To conclude, the objective of this research is to utilise patients' histories, health information, and databases from the national Electronic Health Records (EHRs) to discover markers for the early diagnosis and management of breast cancer with an integrated intelligent approach consisting of processing missing feature values, significant feature selection, and learning-based classification. The research is expected to establish models and tools that can assist physicians in diagnosing diseases. The aim is to design an expert system that combines human expertise and machine intelligence to achieve more accurate diagnosis. Such a system may assist physicians in decision making and double-check a physician's assessment (evidence-based diagnosis), and may support monitoring of the disease by grouping patients into related health patterns for better and more effective treatment plans.
To address the above-mentioned research objectives, the research problem is formulated in terms of the following questions:

Question 1: Does hybridisation of existing machine learning algorithms produce better approaches for the medical diagnosis of breast cancer in terms of classification accuracy and tolerance to noise and missing values?

Question 2: How can discriminating dataset features improve prediction in the context of medical diagnosis, with breast cancer as an example?

Question 3: How can the diagnostic features that best describe the data be identified for the purpose of differentiating malignant and benign forms of breast cancer using learning-based classification and data mining techniques?

1.4 Research Contribution

The present research aims to contribute to the national interest in an electronic healthcare system (eHealth). The aim of the current thesis is to analyse large data obtained from eHealth systems using data mining and machine learning algorithms. The process of analysing large amounts of data includes some novel algorithmic techniques, such as constructing missing feature values, investigating and developing better feature selection techniques, and proposing new machine learning based approaches for diagnosing disease based on previous history obtained from patients.

• In regard to constructing missing feature values, the study proposed a new approach based on iterative nearest neighbourhood and a distance metric. The proposed approach employs the k nearest neighbours algorithm and propagates the classification accuracy to a certain threshold. The proposed method improved classification accuracy by around 0.5% on the constructed dataset compared with the original dataset, which contained some missing feature values. The maximum classification accuracy was 0.9698.
This work has been published in a peer-reviewed journal (Ashraf, M., et al., "A New Approach for Constructing Missing Features Values," International Journal of Intelligent Information Processing, Vol. 3, Issue 1, pp. 110-118, March 2012).

• In terms of feature selection techniques, the research focused on feature selection as a method to obtain high-quality attributes that enhance the mining process. Feature selection techniques touch all disciplines that require knowledge discovery from large data. In this part of the research, I examined different benchmark feature selection methods on the publicly available Wisconsin Breast Cancer (WBC) dataset with three well-recognised machine learning algorithms. The study found that feature selection methods are capable of improving the performance of learning algorithms. However, no single feature selection method best satisfies all datasets and learning algorithms. Therefore, machine learning researchers should understand the nature of their datasets and the characteristics of their learning algorithms in order to obtain better outcomes. Overall, consistency-based subset evaluation (CES) performed better than information gain, symmetrical uncertainty, Relief (R), correlation-based feature selection (CFS), and principal components analysis (PCA). The results have been accepted by the International Journal on Data Mining and Intelligent Information Technology Applications and will appear in the March 2013 edition (Ashraf, M., et al., "Features Selections Techniques on Thyroid, Hepatitis, and Breast Cancer," International Journal of Intelligent Information Processing, accepted, to appear March 2013).

• In regard to diagnosis approaches, this work proposed two approaches for diagnosing breast cancer based on machine intelligence and previous history.
The first approach presented a new method for breast cancer diagnosis using a combination of an Adaptive Network-based Fuzzy Inference System (ANFIS) and the information gain method. In this approach, the purpose of ANFIS is to build an input-output mapping using both human knowledge and machine learning ability, and the purpose of the information gain method is to reduce the number of input features to ANFIS. The experimental validation shows 98.23% accuracy, which underlines the capability of the proposed algorithm. The second approach utilised a feature selection technique and the Bayes learning algorithm, as feature selection techniques have become an essential component of automatic data mining systems dealing with large data and an obvious tool to aid researchers in computer science and many other fields of science. Whether the target research is in medicine, agriculture, business, or industry, the need to analyse large amounts of data exists. In addition, finding the feature selection technique that best satisfies a certain learning algorithm can benefit researchers. Therefore, the current work proposed a new method for diagnosis based on a combination of a learning algorithm and a feature selection technique. The idea is to obtain a hybrid integrated approach that combines the best-performing learning algorithm and the best-performing feature selection technique, with an experimental evaluation on the Wisconsin Breast Cancer (WBC) dataset. Experimental results showed that coordination between the consistency-based subset evaluation method and the Naïve Bayes learning algorithm can produce promising results. The results appeared in the 19th International Conference on Neural Information Processing in 2012 (Ashraf, M., et al.
(2012), "Hybrid Approach for Diagnosing Thyroid, Hepatitis, and Breast Cancer Based on Correlation Based Feature Selection and Naïve Bayes," Neural Information Processing, T. Huang, Z. Zeng, C. Li and C. Leung (eds.), Springer Berlin Heidelberg, 7666: 272-280).

• A new set of experiments was performed to evaluate the fusion of multiple classifiers. Based on the experiments, the best combination of a set of classifiers depends on the application and on the classifiers' characteristics. In addition, there is no best combination of classifiers that suits all datasets. However, the experiments showed that Naïve Bayes and k-NN produced better results when combined as one classifier, with a maximum classification accuracy on the WBC dataset of 0.9642.

1.5 Research Methodology

Knowledge discovery from databases, or data mining, refers to extracting useful relationships and patterns from large databases. Owing to the amount of data, a systematic method must be applied in order to obtain useful outcomes. It is well established that quality data yield more accurate outcomes than dirty data. "Dirty data" is a common term in data mining describing unwanted data characteristics such as incompleteness, noise, and inconsistency. In this research, the proposed method involves a sequence of data mining processes: appropriate data collection (in our case, online datasets), data selection, data pre-processing, applying learning-based classifier methods, evaluation, and finally visualisation and presentation of results in tables and diagrams. Details of each stage are described in Chapter 3.
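The stage sequence just listed (collection, selection, pre-processing, classification, evaluation) can be sketched end to end. Everything below is an illustrative toy, not the thesis's actual pipeline: the data values, the min-max scaling choice, and the 1-nearest-neighbour classifier are all assumptions made for the example.

```python
import math

# A hypothetical miniature dataset standing in for the online datasets mentioned
# above: each record is ([feature values], class label). Values are made up.
raw_data = [
    ([2.0, 1.0], "benign"), ([1.5, 1.2], "benign"), ([1.8, 0.9], "benign"),
    ([6.0, 5.5], "malignant"), ([5.8, 6.1], "malignant"), ([6.2, 5.9], "malignant"),
]

def preprocess(records):
    """Data pre-processing stage: here, min-max scale each feature to [0, 1]."""
    cols = list(zip(*[x for x, _ in records]))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    scale = lambda x: [(v - l) / (h - l) for v, l, h in zip(x, lo, hi)]
    return [(scale(x), y) for x, y in records]

def classify_1nn(train, query):
    """Learning-based classification stage: a 1-nearest-neighbour classifier."""
    return min(train, key=lambda r: math.dist(r[0], query))[1]

def evaluate(train, test):
    """Evaluation stage: fraction of test records classified correctly."""
    hits = sum(classify_1nn(train, x) == y for x, y in test)
    return hits / len(test)

data = preprocess(raw_data)
train, test = data[:4], data[4:]
print(evaluate(train, test))  # 1.0 on this tiny toy split
```

A real workflow would substitute a benchmark dataset such as WBC and cross-validated evaluation, but the division into stages is the same.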
1.6 Thesis Road Map

The thesis is organised into eight chapters:

• Chapter 1: Introduction
• Chapter 2: Background Study
• Chapter 3: Research Methodology
• Chapter 4: Breast Cancer Diagnosis Based on ANFIS and Information Gain
• Chapter 5: Constructing Missing Features Values
• Chapter 6: Breast Cancer Diagnosis Based on Naïve Bayes and Feature Selection Techniques
• Chapter 7: Fusion of Heterogeneous Classifiers for Breast Cancer Diagnosis
• Chapter 8: Discussion and Future Work

This introductory chapter presents the problem description, the motivation and objectives of this work, the contribution to scientific knowledge, the methodology, and the road map of the thesis. Chapter 2 provides a review of canonical machine learning and data mining techniques, feature selection methods, and data mining algorithms in the field of healthcare. These topics are combined into one chapter because they are strongly related as background study. Chapter 3 describes the data mining methodology used in this work. Chapter 4 presents a new approach for breast cancer diagnosis using data mining techniques and the power of a new neuro-fuzzy inference technique; the approach relies on a combination of the Adaptive Network-based Fuzzy Inference System (ANFIS) and the information gain method. Chapter 5 presents a new approach for constructing missing feature values based on iterative nearest neighbours and distance metrics; the proposed approach employs a weighted k nearest neighbours algorithm. Chapter 6 presents a new method for diagnosing breast cancer based on a combination of learning algorithms and feature selection techniques; the idea is to obtain a hybrid approach that combines the best-performing learning algorithms and the best-performing feature selection techniques. Chapter 7 presents a set of experiments to evaluate the idea of classification fusion.
Finally, Chapter 8 presents the conclusions drawn from this work and the scope for future work.

1.7 Chapter Summary

This chapter presented an overview of the thesis, the motivation and objectives of the proposed research, the need for the current research to assist in addressing the shortage of automated computer-aided decision support technologies, the contribution of the current research, and the thesis road map. The next chapter presents a background study on machine learning, data mining in healthcare, feature selection techniques, and missing feature values.

CHAPTER TWO
Background Study and Literature Review

2.1 Overview

Machine learning (ML) can be interpreted as a group of topics that emphasise making and testing algorithms to assist the processes of classification, prediction, and pattern recognition, using computer models obtained from existing (previous) data. Machine learning can produce classifiers to be used on the available resources. In addition, machine learning does not involve much human interaction; the rationale behind limited human involvement is that the use of automatic pre-programmed methods can reduce human biases. The process of proposing an algorithm, and its ability to classify objects or predict new cases, must be built on solid and reliable data [13].

2.2 Background Study

In general, machine learning can be defined as a scientific domain that aims to design and develop algorithms that allow computers to learn and act to solve a real problem based on previous data or under certain instructions and rules. There are many applications of machine learning; data mining is the most widely used [14]. Data mining is the science of discovering knowledge from databases. A database contains a collection of instances (records or cases).
Each instance used by machine learning and data mining algorithms is formatted using the same set of fields (features, attributes, inputs, or variables). When the instances contain the correct output (class label), the learning process is called supervised learning. On the other hand, the process of machine learning without knowing the class labels of instances is called unsupervised learning. Clustering is a common unsupervised learning method (some clustering models serve both purposes). The goal of clustering is to describe data, whereas classification and regression are predictive methods. In the current research, the focus is on supervised machine learning [14].

2.3 Classification

Classification and regression are common models in supervised learning. The current research concentrates on classification; however, it is useful to distinguish between the two. Regression algorithms attempt to map inputs to a continuous domain of values (which can be real values). For example, a regressor can forecast sales of certain goods by considering the goods' features. By contrast, classifiers map the input space into predefined classes. For example, a classifier can predict whether a new patient case is benign (healthy) or malignant (suffering from a certain disease) [15]. Classification is the process of learning a target function that maps between a set of features (inputs) and predefined class labels (outputs). The input data for classification is a set of instances. Each instance is a record of data in the form (x, y), where x is the feature set and y is the target variable (class label). A classification model is a tool used to describe data (a descriptive model) or to predict the target variable for new instances (a predictive model). Examples of classification models are decision trees, artificial neural networks, Naïve Bayes, and the nearest neighbour classifier [16].
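To make the (x, y) instance representation concrete, the following sketch (with made-up toy values) encodes a handful of labelled instances and the simplest possible predictive model, a majority-class baseline:

```python
from collections import Counter

# Toy labelled instances in the (x, y) form described above:
# x is a tuple of feature values, y is the class label.
instances = [
    ((5.1, 2.0), "benign"),
    ((1.0, 0.5), "benign"),
    ((9.7, 4.2), "malignant"),
    ((8.3, 3.9), "malignant"),
    ((0.9, 0.7), "benign"),
]

def majority_class(data):
    """A trivial 'predictive model': always predict the most frequent label."""
    labels = [y for _, y in data]
    return Counter(labels).most_common(1)[0][0]

print(majority_class(instances))  # benign
```

Real classifiers such as those listed above differ only in how they use x to choose y; the input format is the same.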
The general approach for solving a classification problem is shown in Figure 4. The training data consist of instances whose class labels are known. The classification model is built from the training data, and then evaluated and tested using the testing data, which contains records whose class labels are withheld. The evaluation of model performance is based on the number of testing instances that are correctly predicted [16]. Applying the model to the testing data produces the confusion matrix.

Figure 4: General approach for building a classification model.

Suppose the goal is to classify objects i = 1, ..., n into K predefined classes, where K represents the number of classes. For example, if the aim of classification is to diagnose whether or not a patient suffers from breast cancer, then K = 2, corresponding to either benign or malignant. The database (available data) can be organised as an n x m matrix X, where x_ij represents the value of feature j in record i. Every row of the matrix X is a vector x_i with m features and a class label y_i. The classifier is denoted c(x). One method to evaluate the classifier is to calculate the error estimate from the confusion matrix. To explain the error estimate, consider an example. Suppose a certain classifier c(x) is trained and tested on input vectors x with two possible classes, benign and malignant, and suppose the result of classification is as shown in the confusion matrix in Table 1.

Table 1: The confusion matrix for classifier c(x) on a matrix X that contains 160 records.

                   Predicted
True        benign      malignant
benign        60           15
malignant      5           80

The error rate (Er) of the algorithm is the total number of incorrectly classified samples divided by the total number of records in the matrix X. In the example above, Er = (15 + 5) / 160 = 0.125.
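The worked example can be reproduced in a few lines; the sketch below hard-codes the counts from Table 1 and computes the error rate together with the complementary accuracy:

```python
# Confusion matrix counts from Table 1 (keys: (true class, predicted class)).
confusion = {
    ("benign", "benign"): 60,
    ("benign", "malignant"): 15,
    ("malignant", "benign"): 5,
    ("malignant", "malignant"): 80,
}

total = sum(confusion.values())  # 160 records in total
errors = sum(n for (true, pred), n in confusion.items() if true != pred)

error_rate = errors / total   # (15 + 5) / 160 = 0.125
accuracy = 1 - error_rate     # 0.875

print(error_rate, accuracy)  # 0.125 0.875
```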
Equivalently, the classification accuracy of the model can be calculated as Acc = 1 - Er = 0.875.

2.3.1 k-Nearest Neighbours Algorithm

The k nearest neighbours algorithm (k-NN) is an instance-based machine learning algorithm. k-NN is very simple to understand but works remarkably well [17]. The idea behind the k-NN method is to classify objects based on the closest training cases in the feature space. k-NN finds the k closest instances to a given instance and decides its class label by identifying the most frequent class label among the training instances that have the minimum distance to the query instance. The distance is determined by a distance metric; ideally, the distance metric minimises the distance between similar instances and maximises the distance between dissimilar instances. The following pseudo-code illustrates a k-NN implementation [18]. Examples of approaches to defining the distance are the Euclidean and Manhattan methods. Figure 5 shows an example of k-NN.

procedure K-NN-Learner(TestingDataSet)
    for each testing instance:
        find the k nearest instances of the training set
            according to a distance metric (e.g. Euclidean or Manhattan distance)
        resulting class = most frequent class label among the k nearest instances

Figure 5: Example of k-NN [16].

Distance Functions

1. Euclidean distance

The most regularly used metric to compute the distance between data points is the Euclidean distance, the square root of the sum of squared differences between two points. For n-dimensional data, the distance is given by formula (1), where d denotes the distance, x and y are two different cases in the dataset, and n is the number of features (dimensions) [19].

    d(x, y) = sqrt( sum_{i=1}^{n} (x_i - y_i)^2 )    (1)

2. Manhattan distance

The Manhattan distance is another well-known distance measure.
The Manhattan distance is calculated by summing the absolute values of the differences between the coordinates of the data points, and it is less costly to calculate than the Euclidean distance. The formula for the Manhattan distance is given by formula (2), where d denotes the distance, x and y are two different cases in the dataset, and n is the number of features [20].

    d(x, y) = sum_{i=1}^{n} |x_i - y_i|    (2)

3. Minkowski distance

The Minkowski function is a geometric distance between two points that uses a scaling factor r. Its main use is to find the similarity between objects. When r = 2 it becomes the Euclidean distance; when r = 1 it becomes the Manhattan distance. The distance is given by formula (3), where d denotes the distance, x and y are two different cases in the dataset, and n is the number of features [20].

    d(x, y) = ( sum_{i=1}^{n} |x_i - y_i|^r )^(1/r)    (3)

4. Chebyshev distance

The Chebyshev distance function takes the maximum of the absolute differences between the coordinates of two points, where d denotes the distance and x and y are two different cases in the dataset. A common application of the Chebyshev distance is fuzzy c-means clustering [112].

    d(x, y) = max_i |x_i - y_i|    (4)

5. Canberra distance

The Canberra distance is the sum of the absolute differences between coordinates divided by their sums; it is thus a weighted version of the Manhattan distance, where d denotes the distance, x and y are two different cases in the dataset, and n is the number of features [113].

    d(x, y) = sum_{i=1}^{n} |x_i - y_i| / (x_i + y_i)    (5)

Before using k-NN, it is good practice to weigh its advantages and disadvantages to ensure that k-NN is appropriate for the dataset and the learning task.

Advantages of k-NN:
• A very efficient pattern recognition method that can be easily implemented [21].
• Simple to use [22].
• Robust against noisy data [22].
• Can be used for large and small datasets [22].
• Suitable for linear and nonlinear problems [22].
• New instances can be added without retraining on the dataset [23].
• Weights can be used to measure feature significance [23].
• Missing values can be easily imputed [24].
• Flexibility (a nonparametric model, apart from the value of k) [25].

Disadvantages of k-NN:
• The distance between the query instance and all other instances must be calculated [24].
• Large memory requirements [24].
• Less useful for high-dimensional datasets because of the high error rate [24].
• The choice among many distance functions may lead to different accuracy levels [24].

Figure 6 shows the learning characteristics of the k nearest neighbours algorithm [27].

Figure 6: k-NN characteristics in regard to some learning features.

2.3.2 Artificial Neural Networks

Artificial neural networks date back to the nineteenth century, when William James and Alexander Bain contemplated the possibility of constructing a man-made system based on neural models [28]. In the middle of the twentieth century, McCulloch and Pitts established the computational capability of a group of neurons, and Donald Hebb developed a tuning method showing how neurons use reinforcement to strengthen the connections from important inputs. In the 1950s, building on Hebb's methodology, Farley and Clark established one of the first artificial neural networks, in which neurons were randomly connected; this was followed by Frank Rosenblatt's development of the perceptron for pattern classification. Unfortunately, the system could not perform complex classification, and the research stalled in the 1960s [28]. During that era, the Adaptive Linear Element (ADALINE) was developed by Widrow and Hoff and was ultimately used to eliminate echoes in telephone systems through adaptive signal processing [29].
Despite the limited research on neural networks during the 1970s, some researchers developed self-organising neural models based on physiological studies of the nervous system, known as adaptive resonance theory (ART) [30]. In 1974, Paul Werbos developed a learning rule based on an error-minimisation approach in which the error is propagated backwards by adjusting the weights using gradient descent. Werbos's technique is the error back-propagation algorithm, the most widely used artificial neural network model, which spread widely in the mid-1980s through the work of a group of researchers [28]. During the 1980s and 1990s, computers became roughly a hundred times faster than at the beginning of the research, academic programs appeared, new courses were introduced, and funding became available. All these factors encouraged researchers to concentrate on neural network applications, development, and new approaches for prediction, forecasting, and diagnosis. For example, several studies [31, 32] demonstrate the potential applications of ANNs for clinical decision making. Nowadays, major developments in neural networks attract funding for further research in many areas, such as hybrid neural networks and how to combine them with other technologies. The artificial neuron is a computer-simulated model inspired by natural neurons. Natural neurons receive signals through synapses located on the surface of the neuron. The neuron activates and sends a signal through the axon once the incoming signal exceeds a certain threshold. This signal then transfers to other neurons and may reach the control unit (the brain) for appropriate action. Figure 7 shows what a human neuron looks like [33].

Figure 7: Human neuron [33].

The artificial neuron (AN) simulates the functionality of a real neuron. An AN has a set of inputs associated with weights. The inputs and weights are combined by a mathematical function that controls when the AN is activated.
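The weighted-sum-and-threshold behaviour just described can be sketched in a few lines; the inputs, weights, and threshold below are made-up illustrative values, and the step activation is only one of several possible activation functions:

```python
def artificial_neuron(inputs, weights, threshold):
    """Fire (output 1) when the weighted sum of inputs reaches the threshold."""
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation >= threshold else 0

# Illustrative values only: two inputs whose importance differs via the weights.
print(artificial_neuron([1.0, 0.5], [0.8, 0.4], threshold=0.9))  # 1 (sum = 1.0 >= 0.9)
print(artificial_neuron([0.2, 0.1], [0.8, 0.4], threshold=0.9))  # 0 (sum = 0.2 < 0.9)
```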
An ANN is a combination of artificial neurons that process information [33]. Figure 8 shows a simple artificial neuron.

Figure 8: Artificial Neuron

In general, the operation of the artificial neuron is modelled by the data flow diagram in Figure 9.

Figure 9: Simplified neuron operation

After briefly describing the artificial neuron, the ANN is described next. An ANN is a set of connected artificial neurons. The most widely used ANN model is the feed-forward network. Figure 10 shows a three-layer topology of a feed-forward network. The outcome of an ANN depends on the inputs and the values of the weights [35].

Figure 10: ANN architecture

Figure 11 shows the learning characteristics of the Artificial Neural Network [27]:

Figure 11: ANN characteristics in regards to some learning features.

2.3.3 Decision Tree

A decision tree is a classification method which contains nodes, branches, and leaves. The first node of the tree, the top node, is called the root node. Each node in the tree is connected with one or more nodes using branches; a node that has no outgoing branches is called a leaf node. The leaf node indicates the termination point or the outcome value [16] [36]. Figure 12 shows an example of a simple decision tree.

Figure 12: Simple Decision Tree

Figure 12 shows how to solve a problem by asking and answering questions about the attributes of the test record. The idea behind this classification method is to keep asking questions until a conclusion is reached. The set of questions and answers forms a decision tree with a set of nodes. The tree can contain three types of nodes [16]:

• Root node: has zero or more outgoing edges and no incoming edges; it contains the test condition that separates the records.
• Internal (normal) nodes: each has exactly one incoming edge and two or more outgoing edges.
Each internal node also contains a test condition that separates the records.
• Leaf nodes: these nodes hold the class labels; each has no outgoing edges and exactly one incoming edge.

2.3.3.1 Building Decision Tree

Building an optimal decision tree is a difficult task because many decision trees can be built from a given set of attributes, and constructing an optimal tree is computationally costly [37]. Generally speaking, the methods of constructing decision trees can be grouped into two types, top-down and bottom-up, with a preference for the first group according to the literature [37]. There are many types of top-down decision trees, for example CART, C4.5, and ID3.

2.3.3.2 ID3

ID3 is a top-down decision tree algorithm proposed by Quinlan in 1986. ID3 is notable for its simplicity among classifiers; it uses information gain to split the instances and build the tree, and it is simple to implement. However, it does not handle missing values and has no pruning procedure [15].

2.3.3.3 C4.5 decision tree

C4.5 is an improved version of ID3, published by the author of ID3 in 1993. The purpose of C4.5 is to overcome the disadvantages of the earlier version. Instances are split using the gain ratio or the information gain, and the algorithm runs as long as the number of instances to be split exceeds a predefined threshold. Unlike ID3, the newer version is capable of treating missing values and can handle numeric attributes [15].

2.3.3.4 CART

CART, proposed by Breiman in 1984, is short for Classification and Regression Trees. CART has become a common method for constructing decision tree models due to its capability to deal with different data types, handle missing values, and produce rules that are understandable by humans. CART may be called a binary tree because the tree is constructed by splitting each node into exactly two child nodes, with exactly two outgoing edges from each internal node.
The splits are selected using the twoing criterion (which represents the quality of the connection between a parent and its child decision nodes) [15]. Figure 13 shows the learning characteristics of the decision tree.

Figure 13: Decision Tree characteristics in regards to some learning features.

2.3.4 Naïve Bayes Classifier

The Naïve Bayes classifier is a mathematical classifier based on independence and probability (Bayes' theorem). The Naïve Bayes classifier adopts the idea that the presence of a certain feature of an object is unrelated to the presence of any other feature, given the class variable. For example, an animal may be considered to be a cat if it hunts, plays with kids, has four legs, has a head, and weighs about 3 kilograms. The Naïve Bayes algorithm treats all of these features independently when predicting that the animal is a cat; no feature depends on the values of the other features [38]. Naïve Bayes is a significant classifier: it is easy to construct, does not require iterative parameter estimation, and is easy to interpret. Therefore, Naïve Bayes can be applied by both expert and inexpert data mining developers. Finally, Naïve Bayes generally performs well in comparison with other data mining methods [39]. The literature shows two types of Naïve Bayes model, the multinomial model and the multivariate Bernoulli model. In both models, classification is performed by the following naïve rule [40, 41]:

P(c_i | x_j) = P(c_i) P(x_j | c_i) / P(x_j)    (4)

where c_i is the instance class label, x_j is the test attribute, P(c_i | x_j) is the posterior probability of the class label c_i given the attribute x_j, P(c_i) is the prior probability of class label c_i, and P(x_j | c_i) is the likelihood, i.e. the probability of attribute x_j given the class label c_i.
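Equation (4) can be evaluated directly by frequency counting. The following minimal sketch estimates the prior, likelihood, and evidence from a hypothetical data set of (attribute value, class label) pairs; the animal data are purely illustrative.

```python
def posterior(data, x):
    """P(c | x) = P(c) * P(x | c) / P(x), estimated by counting as in
    equation (4). `data` is a list of (attribute value, class label)
    pairs; the values below are illustrative only."""
    n = len(data)
    p_x = sum(1 for v, _ in data if v == x) / n        # evidence P(x)
    result = {}
    for c in {label for _, label in data}:
        in_class = [v for v, label in data if label == c]
        prior = len(in_class) / n                      # P(c)
        likelihood = in_class.count(x) / len(in_class)  # P(x | c)
        result[c] = prior * likelihood / p_x
    return result

data = [("hunts", "cat"), ("hunts", "cat"), ("grazes", "cow"), ("hunts", "cow")]
print(posterior(data, "hunts"))
```

Here "hunts" is observed in both cats, so the posterior for "cat" (2/3) exceeds that for "cow" (1/3), and the posteriors sum to one, as required.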
Assuming that each attribute x_i is conditionally independent of every other attribute x_j, the conditional distribution over the class variable c is:

P(c | x_1, ..., x_n) ∝ P(c) ∏_{i=1}^{n} P(x_i | c)    (5)

The advantage of the Bayesian classifier over other classification methods is the opportunity to take prior information about a given problem into account. The main disadvantages of the Bayesian classifier are that (1) numerical attributes require discretisation in most cases, and (2) it is not suitable for large data sets which contain many attributes (time and space issues) [27]. Figure 14 shows the learning characteristics of the Bayesian classifier.

Figure 14: Bayesian classifier characteristics in regards to some learning features.

2.4 Data Mining in Healthcare

Data mining methods have been commonly used in healthcare and diagnostic applications because of their ability to predict new cases. The main feature of data mining algorithms is their capability to learn from previous cases and to produce a prediction model. The resulting model is used to predict newly arriving cases, to produce conclusions (knowledge) from a large amount of data, or to classify the data into useful patterns. There are enormous benefits of data mining applications in the healthcare sector. In the main, these benefits can be categorised as follows: treatment effectiveness, healthcare management, customer relationship management, fraud and abuse detection, and medical diagnosis (computer-aided diagnosis) [42]. The focus of the current research is on computer-aided diagnosis; however, the other data mining applications are described briefly.

2.4.1 Treatment Effectiveness

Data mining applications can be used to assess the effectiveness of medical treatments. The aim is to develop a model that compares the symptoms, causes, and treatment procedures.
Such a data mining model may ultimately produce an analysis of all treatment procedures and thus knowledge about best-practice procedures. For example, a data mining model can find the best performing drug for a certain disease, or group the side effects of a drug and determine how to reduce its risk to patients. Finally, data mining may play a role in identifying successful treatments and making them a standard approach among healthcare providers [42].

2.4.2 Healthcare Management

With regard to healthcare management, a well-designed data mining application can help to better classify and track dangerous diseases and patients who may infect others, design appropriate interventions, and reduce the number of hospital admissions and claims. For example, investigating readmission cases in a certain hospital and comparing its data with the current scientific literature can lead to more efficient utilisation of medical resources. As another example, data mining can be used to decrease patient length of stay (by comparing a certain case with previous cases, the length of stay should not exceed the average stay of previous cases; of course, the final decision is to be taken by the doctor, but data mining models can provide an approximate length of stay), to provide information to physicians (a data mining model may serve as a second opinion or as a consultant), and to develop best practices [42, 43].

2.4.3 Customer Relationship Management

As in other industries, data mining applications can be used to improve customer satisfaction [44]. A study by Milley [45] stated that mining patient survey data can help to determine the waiting time a patient can expect before being seen by a physician, how to improve customer service for patients, and can provide healthcare providers with knowledge about patient expectations.
Also, Hallick [46] suggested that customer relationship management in healthcare can help promote disease education, prevention, and wellness services.

2.4.4 Fraud and Abuse

Data mining applications can help governments (Medicare, for instance) and healthcare insurance companies to control and reduce the fraud committed by some healthcare providers. The idea is to establish a model that recognises anomalous claims made by customers (patients), physicians, pharmacies, labs, or other healthcare providers [42]. Once the model flags a case as possible fraud, the fraud control department may investigate the case further before action is taken against the offending healthcare provider.

2.4.5 Computer Aided Diagnosis

Computer Aided Diagnosis (CAD) is an assistive method for diagnosing diseases and estimating the severity of an illness. CAD can be categorised as an expert system which utilises human knowledge and experience to solve problems automatically or with little support from human experts. CAD systems do not replace the role of medical personnel; rather, they provide a second opinion and assist in decision support during the diagnostic process. The final decision is made by the physician [11].

Types of CAD Systems

There are two types of CAD system: CADe, for computer-aided detection, and CADx, for computer-aided diagnosis. CADe involves the use of computer analyses to indicate whether or not a certain case suffers from a target disease (such as breast cancer). CADx, on the other hand, evaluates the outcome of the detection process. In both types, the final diagnosis and patient management are performed by the physician or medical personnel [47].

CAD Systems Characteristics

Computer Aided Diagnosis and medical expert systems and tools have become one of the main areas of research in the field of medical diagnosis.
The aim of CAD is to design an expert system that combines human expertise and machine intelligence to achieve a more accurate diagnosis effectively. In addition, it can speed up diagnosis, reduce the errors and mistakes made by humans, and free human experts for further research. CAD can be used to assist physicians in diagnosing and predicting diseases so that physicians can provide the necessary treatment promptly and prevent harm, including the possibility of death. In general, computer-aided diagnosis and expert systems have a number of attractive advantages [11, 48]:

• Fast response and reduced cost. The cost of providing an expert system per user is lower than that of providing a human expert. In addition, some cases require an immediate response, especially in emergencies; in such cases, a diagnosis can be obtained from an expert system to produce an approximate and reliable assessment of the situation.
• Increased availability and reliability. Experts are available 24/7 on any suitable computer with internet connectivity for e-expert systems. Medical expert systems also increase confidence in the decisions made by physicians and may help settle arguments between human experts in cases of differing opinions.
• Steady, unemotional, and complete response at all times. Human experts are not permanent, whereas expert systems last indefinitely.
• The knowledge in expert systems can be examined and corrected, since it is explicitly known instead of being implicit in the minds of human experts.
• Justification and warranting. Medical expert systems can explain the steps and reasons behind a decision in more detail than a human expert can. Moreover, expert systems confirm that the knowledge has been correctly applied.
2.4.6 Ethical, Legal, and Social Issues

About three-quarters of a billion people in North America, Europe, Asia, and, recently, Australia have their medical information collected in electronic form. Therefore, there must be a form of protection for human data. This work discusses some ethical, legal, and social limitations on data collection and distribution that constrain researchers and industry when utilising human data, in order to prevent the abuse of patients and the use of their data for commercial purposes. The main ethical, legal, and social issues in mining medical data may be organised into the following categories [49]:

Data Ownership

Cios and Moore [49] discussed the notion of data ownership and raised some questions that require the assistance of legal professionals to answer. In legal theory, ownership is determined by who is entitled to sell a particular item of property [50]. The problem of identifying the ownership of data is typically unresolved because human data and tissues are not supposed to be sold for any reason; therefore, legal theory cannot be applied directly to human data. At the same time, human medical data are in some cases available for data mining without prior consent. The question of the ownership of patients' information is still unsettled, and further investigation by legal specialists and researchers is needed to resolve this issue. Examples of questions that need to be answered: (1) Who owns the data: the patients, the government, or healthcare providers? (2) Does the medical doctor own the data? (3) Do insurance providers own the data? (4) If insurance providers do not own their customers' data, can they refuse to pay for the collection and storage of the data? (5) Who may organise and mine the data? Hence, data ownership has brought about a debate over who owns the data and who is legally allowed to authorise its use in scientific fields such as data mining.
Fear of Lawsuits

An important aspect of medical data mining is the fear of lawsuits against medical practitioners and healthcare providers. Some medical practitioners and healthcare providers are reluctant to hand data over to data miners, because providing individuals' data for mining may lead to lawsuits. Therefore, healthcare providers and governments should work together to give patients the right to decide whether or not their data may be used for research purposes. In addition, patients should be able to choose the field of research involved and the location of their data storage. In my opinion, this could facilitate the process of mining data and may avoid lawsuits; moreover, it could reduce the effort and time spent during the mining process [49] [50].

Privacy and Security of Human Data

Medical data obtained by healthcare providers and medical practitioners from individuals may contain private and confidential information. Individuals' data have to be handled with enormous care to protect people's privacy and confidentiality. To meet these requirements, there are four forms of patient data identification [49]:

• Anonymous data: individual identification is deleted during data collection. Therefore, there is no way to recover the patient's identity in the future.
• Anonymised data: individual identification is recorded initially during data collection and then removed. In this form of identification, there is a chance of re-identifying the patient because the patient's information was recorded at some stage.
• De-identified data: individual identification is recorded initially during data collection and subsequently encoded or encrypted. There is still some chance of identifying the person using computer technology, subject to the country's privacy laws and guidelines. This method can help researchers to remove duplicate records that relate to the same patient.
However, it remains technically possible to identify the patient in the future by decrypting the identification field.

• Identified data: individual identification is recorded initially during data collection. This method requires receiving written consent from the patient to be identified, and it should adhere to the country's privacy laws and guidelines.

Expected Benefits

The World Wide Web and the internet are convenient ways to share and store data, accessible from almost everywhere, and they can help researchers who have legitimate reasons to access private information, for example researchers with a reasonable claim to mine data because the data are rare and they could not mount the financial and administrative resources to collect such data themselves. The use of individuals' data must be justified to the authorities. In addition, researchers who want to apply methods to the data must show some expected benefit for science or society [49] [50].

Administrative Issues

Researchers and data miners are not the only people dealing with private data. Therefore, some countries, including the United States, have specified administrative guidelines for patient privacy. The guidelines include [49] [50]:

1. The establishment of security measures and policies to ensure that privacy and security are in place in research centres and in all institutions and organisations that hold or have access to people's information.
2. There must be a legal agreement between the healthcare providers and the researchers (or institutions) that use patients' medical information. The agreement should oblige researchers and institutions to protect patients' data.
3. There must be up-to-date plans to protect patients' information against natural disasters, including disaster plans and data backups.
4. There must be an authorisation and identification scheme for employees to limit access to authorised personnel only.
5.
There must be an ongoing internal review of authorisation and privilege procedures to ensure that the right people have access to patients' data.
6. There must be training sessions for employees on security and privacy issues. The training should be regular, to keep pace with technological advancements in security and privacy.
7. There must be daily updates to the security infrastructure, including anti-virus and internet security software.

2.4.7 Challenges of Data Mining in Healthcare

Data mining and the advancement of computer technology can help the healthcare industry in many applications, as mentioned earlier in this chapter. However, utilising data mining in healthcare has some limitations. The first limitation is the type of data in healthcare databases. The data types in healthcare databases are heterogeneous: some patients' examination results are in numeric form, some in text form, and some as images. Mining such mixed data types presents a challenge to developers. The sources of the data also differ, including laboratories, medical centres, physicians, and more; therefore, data collection and integration is time consuming. To overcome this problem, some authors have recommended that a data warehouse be built before the data mining process. However, this can be time consuming and may not be reliable for historical data [42]. Secondly, the data tend to be unorganised (not processed); this includes missing feature values, corrupted files, and inconsistencies with the patient's history or family history. The problem of missing feature values can be solved by constructing or estimating the missing values; however, the mining process will be more efficient with complete data. Thirdly, mining data that contain a large number of cases and attributes may lead to patterns that are random and not real [51]. For this reason, not all significant patterns are necessarily useful.
Fourthly, mining healthcare data requires expertise that combines knowledge of data mining and knowledge discovery with knowledge of medical science. Since it is uncommon to find experts with knowledge of both domains, mining healthcare data may require collaboration between experts in data mining and experts in medical science [42]. Finally, resources for developing data mining applications should be allocated by healthcare organisations, including budget, time, effort, and expertise. Data mining developments can produce a negative outcome for several reasons, such as lack of management, limited support, and lack of cooperation between data mining and medical experts [42].

2.4.8 Electronic Health Record

An Electronic Health Record (EHR) is a systematic collection of electronic health information about patients or populations. The format of an EHR is digital, and it can be shared across different locations, such as healthcare providers, through a network. An EHR is capable of storing a range of data including the patient's medical history, medications and allergies, immunisation status, laboratory test results, radiology images, vital signs, personal information such as age and weight, and billing information [1]. The concept of an electronic health record instead of paper is not a new technology. In the 1960s, more information needed to be gathered and stored for patients because medical care had become more complex, and physicians feared that some patient information might be lost or unavailable. The availability of complete patient health information when needed is the main idea behind storing patients' health information in electronic form [52]. Between the 1970s and 1980s, electronic health records evolved to store more information about patients in order to improve patient care.
For example, drug dosages, side effects, allergies, and drug interactions became available electronically to healthcare providers, enabling that information to be stored in a systematic way to produce electronic healthcare systems. Recently, some universities, research centres, and healthcare providers have developed computerised health records for research purposes and to track patient treatment. Overall, the innovations and advancements in EHR have enhanced the quality of patients' healthcare [52]. Many countries, including Australia, are adopting the EHR concept to bring better healthcare to their people, including better clinical information and accessibility, patient safety, better patient care, and efficiency and savings [52].

"Our recovery plan will invest in electronic health records and new technology that will reduce errors, bring down costs, ensure privacy, and save lives," President Obama stated in his speech to Congress in February 2009.

2.5 Related Work on Breast Cancer Diagnosis

In this section, some of the related prior work on data mining methods for breast cancer diagnosis is discussed. Song et al. [53] presented a new approach to automatic breast cancer diagnosis based on artificial intelligence technology. They focused on obtaining a hybrid system for diagnosing new breast cancer cases by combining a Genetic Algorithm (GA) with a Fuzzy Neural Network. They also showed that input reduction (feature selection) can be applied to many other problems that have high complexity and strong non-linearity, with huge amounts of data to be analysed. Arulampalam and Bouzerdoum [54] proposed a method for diagnosing breast cancer called Shunting Inhibitory Artificial Neural Networks (SIANNs). A SIANN is a neural network inspired by biological neural networks, in which the neurons interact with each other via a nonlinear mechanism called shunting inhibition.
Feed-forward SIANNs have been applied to several medical diagnosis problems, and the results were more favourable than those obtained using Multilayer Perceptrons (MLPs). In addition, a reduction in the number of inputs was investigated. Setiono [55] proposed a method to extract classification rules from trained neural networks and discussed its application to breast cancer diagnosis. He also explained how pre-processing of the data set can improve the accuracy of the neural network and of the rules, because some rules may be extracted from human experience and may be erroneous. The data pre-processing involves the selection of significant attributes and the elimination of records with missing attribute values from the Wisconsin Breast Cancer Diagnosis data set. The rules generated by Setiono's method were more concise and accurate than those generated by other methods mentioned in the literature. Meesad and Yen [56] proposed a Hybrid Intelligent System (HIS) which integrates the Incremental Learning Fuzzy Network (ILFN) with linguistic knowledge representations. The linguistic rules are determined based on the knowledge embedded in the trained ILFN or extracted from real experts. In addition, the method utilises a Genetic Algorithm (GA) to reduce the number of linguistic rules while sustaining high accuracy and consistency. After the system has been completely constructed, it can incrementally learn new information in both numerical and linguistic forms. The proposed method was evaluated using the Wisconsin Breast Cancer (WBC) data set, and the results showed that the proposed HIS performs better than some well-known methods.

2.6 Feature Selection Techniques

Nowadays, the capability of collecting and generating data is greater than ever before.
Contributing factors include the steady progress of computer hardware technology for storing data and the computerisation of business, scientific, and government transactions. In addition, the use of the internet as a global information system has flooded us with an incredible amount of data and information.

Data mining has attracted considerable attention from information systems researchers in recent years, due to the wide availability of large amounts of data and the need to turn such data into knowledge and useful patterns. The gained knowledge and patterns can be used in many fields, such as marketing, business analysis, and health information systems [57]. The sheer amount of data and the existence of low-quality, unreliable, redundant, and noisy artefacts and outliers all affect the process of extracting knowledge and useful patterns, making knowledge discovery during the training phase more difficult. Experts in machine learning and data mining have observed that classification performance (such as accuracy) decreases when the data set contains many features that are not relevant to the prediction task. For example, applying the C4.5 decision tree to the Monk1 problem produced an error rate of 24.3% because of three irrelevant features; the error rate dropped to 11.1% when the irrelevant features were ignored [58]. The k-nearest neighbour algorithm likewise degrades in the presence of irrelevant attributes: the training set size required for a given accuracy level grows exponentially with the number of irrelevant attributes [59]. Therefore, researchers have felt the necessity of producing more reliable data from large numbers of records, for example by using feature selection methods. Feature selection, or attribute subset selection, is the process of identifying and utilising the most relevant attributes and removing as many redundant and irrelevant attributes as possible [60, 61].
The selection of variables, features, inputs, or attributes has become the focus of much research in areas where the numbers of cases and attributes are huge. The purpose of feature selection is to obtain a smaller number of features than in the original data set, in order to: (1) enhance prediction accuracy, (2) obtain a quicker classifier, (3) ignore less important or irrelevant features, (4) improve data quality, (5) avoid overfitting, and (6) help solve the problem of the incredible amount of available data and how to utilise it effectively [62]. The literature divides feature selection techniques according to how the induction algorithm interacts with the feature selection search. On this basis, feature selection techniques can be divided into three types: filter methods, wrapper methods, and embedded methods [63].

2.6.1 Wrapper Feature Selection Technique

The wrapper approach was proposed by Kohavi and Pfleger in 1994 in the Stanford University AI lab [64]. In wrapper methods, the feature selection algorithm is located as a wrapper around the learning algorithm. The process starts with a search for a relevant subset of attributes using the learning algorithm: the learning algorithm itself is used to evaluate each feature subset produced by the search. Figure 15 illustrates how the wrapper approach operates on the training set and how the evaluation proceeds. The learning algorithm is treated as a black box, with no modification to the learning algorithm itself. The learning algorithm assesses the subsets of features obtained by the search method and yields a hypothesis about the quality and relevance of a given feature subset. The feature subset with the highest estimated value is chosen as the final set on which to run the learning algorithm.
The final step is to evaluate the model on a new data set (not used by the search) to ensure independence between the training process and the testing process. The result is an estimated accuracy obtained by using the most relevant feature subset with the desired learning algorithm [65].

Figure 15: The Wrapper approach for features subset selection [65]

Table 2 shows the main advantages and disadvantages of using the wrapper as a feature selection method, as well as examples of existing methods that utilise the wrapper approach [63].

Table 2: Examples, advantages, and disadvantages of wrapper feature selection [63]

2.6.2 Filter Feature Selection Techniques

Filter techniques examine the significance of features by investigating the intrinsic characteristics of the data. In most cases a feature ranking is calculated, and low-ranking features are ignored during the learning process. Afterwards, the high-ranking subset of features is used as the training set for the classification algorithm [63]. The main difference between filters and wrappers is that filters ignore the learning algorithm during the feature subset search. Figure 16 shows the filter approach; it illustrates that feature subset extraction is totally independent of the learning classifier.

Figure 16: The filter approach [56]

Advantages of filter techniques include that they can be applied to large databases containing many attributes and cases, that the computation is simple and fast in comparison with wrapper and embedded methods, and that they are independent of the classification algorithm. The point of this independence between filters and the learning classifier is that feature selection needs to be performed only once, after which different classifiers can be used to evaluate the subset. On the other hand, the independence between filter methods and learning algorithms may lead to a lower level of classification accuracy [63].
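The independence just described can be made concrete: a minimal filter is a scoring pass over the features followed by a cut-off, with no classifier in the loop. In this sketch the scoring function is a deliberately naive stand-in (an assumption for illustration only) for a real relevance measure such as information gain.

```python
def filter_select(dataset, labels, score, k=2):
    """Rank every feature with `score(column, labels)` and keep the k
    best, with no involvement of the learning classifier (the defining
    property of filter methods)."""
    n_features = len(dataset[0])
    ranked = sorted(
        range(n_features),
        key=lambda j: score([row[j] for row in dataset], labels),
        reverse=True,
    )
    return ranked[:k]

# Toy score: fraction of instances whose feature value equals the class
# label (a hypothetical relevance measure, for illustration only).
def toy_score(column, labels):
    return sum(v == c for v, c in zip(column, labels)) / len(labels)

# Feature 2 equals the label on every row; features 0 and 1 only half the time.
dataset = [(1, 0, 1), (1, 0, 0), (0, 1, 1), (0, 1, 0)]
labels = [1, 0, 1, 0]
print(filter_select(dataset, labels, toy_score, k=1))  # -> [2]
```

Because the scorer never consults a classifier, the same ranked subset can then be handed to any learning algorithm, which is the efficiency argument made above.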
Table 3 summarises the main advantages and challenges of filter methods and lists some popular filter methods.

Table 3: Examples, advantages, and disadvantages of filter feature selection [63]

2.6.3 Embedded Feature Selection Techniques

Embedded methods (EM) differ from other feature selection methods in how the classification method and feature selection cooperate. In filter methods, there is no cooperation between the learning classifier and feature selection. In wrapper methods, the learning classifier is used to measure the quality of feature subsets without intervening in the structure of the classifier. In contrast to the filter and wrapper approaches, embedded feature selection and the learning process cannot be separated [66]: the search for the optimal subset of features is built into the construction of the classifier. The computational cost of EM is lower than that of wrapper methods, and there is direct interaction between the classifier and the feature selection [63]. Table 4 shows some advantages and disadvantages of such methods, along with examples.

Table 4: Examples, advantages, and disadvantages of embedded feature selection [63]

2.6.4 Feature Selection Techniques Used in Current Work

• Information Gain

The information gain method approximates the quality of each attribute using entropy, by estimating the difference between the prior entropy and the posterior entropy [67]. It is one of the simplest attribute ranking methods and is often used in text categorisation. If x is an attribute and C is the class, the entropy of the class before observing the attribute is

H(C) = -\sum_{c \in C} p(c) \log_2 p(c)    (7)

where p(c) is the probability function of variable c.
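The difference between the prior class entropy and the class entropy after observing an attribute can be computed directly for discrete values. A minimal sketch, with illustrative data rather than the WBC dataset:

```python
# Sketch of information gain for a discrete attribute: the prior class
# entropy minus the class entropy conditioned on the attribute's value.
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def info_gain(xs, cs):
    n = len(xs)
    h_cond = 0.0
    for v in set(xs):                      # each attribute value
        subset = [c for x, c in zip(xs, cs) if x == v]
        h_cond += len(subset) / n * entropy(subset)
    return entropy(cs) - h_cond

x = [1, 1, 2, 2, 3, 3]               # attribute values
c = ['b', 'b', 'b', 'm', 'm', 'm']   # class labels
print(round(info_gain(x, c), 3))     # 0.667
```

Attribute values 1 and 3 each determine the class completely, so only the mixed value 2 leaves residual entropy, giving a gain of 1 − 1/3 ≈ 0.667 bits.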
The conditional entropy of C given x (the posterior entropy) is

H(C|x) = -\sum_{x} p(x) \sum_{c \in C} p(c|x) \log_2 p(c|x)    (8)

The information gain (the difference between the prior and posterior entropy) is then given by

IG(C, x) = H(C) - H(C|x)    (9)

IG(C, x) = -\sum_{c} p(c) \log_2 p(c) + \sum_{x} p(x) \sum_{c} p(c|x) \log_2 p(c|x)    (10)

• Correlation-based Feature Selection (CFS)

CFS is among the simplest feature selection methods. CFS ranks features and measures the merit of a feature or subset of features according to correlations between features. Its main aim is to find a subset of the feature space that is highly correlated with the class label; CFS removes uncorrelated and redundant features. The CFS feature subset evaluation function is as follows [68]:

Merit_S = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1) \, \overline{r_{ff}}}}    (11)

where Merit_S is the worth of feature subset S containing k features, \overline{r_{cf}} is the average feature-to-class correlation, and \overline{r_{ff}} is the average feature-to-feature correlation. To estimate the correlations in this equation, CFS uses a modified information gain measure called symmetrical uncertainty, which compensates for the information gain bias towards attributes with more values [69]:

SU = \frac{2 \, [H(x_i) + H(x_j) - H(x_i, x_j)]}{H(x_i) + H(x_j)}    (12)

• Relief

Relief is among the best-known feature selection techniques [70]. Its aim is to rank feature quality according to the ability of features to distinguish instances of different classes. Relief uses instance-based learning (lazy learning, such as k-nearest neighbour) to assign a grade to each feature. Features are ranked by weight, and those that exceed a user-determined threshold are selected to form the promising subset. For each sampled instance, the closest neighbouring instance of the same class and the closest instance of a different class are selected.
The following equation accumulates, averaged over the samples, the squared distances to the nearest hit and the nearest miss [70]:

W_x = W_x - \frac{\mathrm{diff}(x, R, H)^2}{m} + \frac{\mathrm{diff}(x, R, M)^2}{m}    (13)

where W_x is the grade for attribute x, R is a randomly sampled instance, H is the nearest hit, M is the nearest miss, and m is the number of samples.

• Principal Components Analysis (PCA)

The purpose of Principal Components Analysis (PCA) is to reduce the dimension of a dataset containing a large number of correlated attributes by transforming the original attribute space into a new space in which the attributes are uncorrelated. The algorithm then ranks the transformed attributes by how much of the original variation they capture; the transformed attributes with the most variation are kept, and the rest are discarded. It is also worth noting that PCA is suitable for unsupervised datasets because it does not take the class label into account [71].

• Consistency-based Subset Evaluation (CSE)

Consistency-based Subset Evaluation (CSE) adopts the class consistency rate as its evaluation measure. The idea is to obtain a set of attributes that divide the original dataset into subsets, each containing a single majority class [60]. A well-known consistency-based measure is the consistency metric proposed by Liu and Setiono [72]:

\mathrm{Consistency}_S = 1 - \frac{\sum_{i=0}^{J} (|D_i| - |M_i|)}{N}    (14)

where S is the feature subset, J is the number of distinct value combinations of the features in S, D_i is the number of occurrences of the i-th attribute value combination, M_i is the cardinality of the majority class for the i-th attribute value combination, and N is the number of instances in the original dataset. For continuous values, Chi2 can be used [73]; Chi2 automatically discretises continuous feature values and removes irrelevant continuous attributes.

2.6.5 Related Work on Feature Selection Techniques

Numerous feature selection methods have been broadly used in different domains.
Xie and others [74] proposed hybrid feature selection algorithms to build efficient diagnostic models based on a new accuracy criterion, the generalised F-score (GF), together with SVM. The hybrid algorithms adopt Sequential Forward Search (SFS), Sequential Forward Floating Search (SFFS), and Sequential Backward Floating Search (SBFS), respectively, combined with SVM, using the new accuracy criterion to guide the procedure; the resulting methods are called GFSFS, GFSFFS, and GFSBFS, respectively. These hybrid methods combine the advantages of filters and wrappers to select the optimal feature subset from the original dataset and build efficient classifiers. Experimental results showed that the proposed hybrid methods construct efficient diagnostic classifiers with high average accuracy compared with traditional algorithms.

Liao and others [75] proposed a hybrid feature selection method used with k-NN and SVM. They aimed to identify the most significant genes, those demonstrating the highest capability to discriminate between the classes of samples. They first used a filter method to rank the genes in terms of their expression difference, and then a clustering method based on k-NN principles to cluster the gene expression data. A support vector machine was applied to validate the classification performance of the candidate genes. Their experimental results demonstrated the effectiveness of the method in addressing the problem.

Vijayasankari and Ramar [76] also proposed a novel hybrid feature selection method to select relevant features and cast away irrelevant and redundant features from the original dataset using C4.5 and a Naïve Bayes classifier. The efficiency and effectiveness of the proposed method was demonstrated through extensive comparisons with other methods using real-world data of high dimensionality.
Experimental results on these datasets revealed that the proposed algorithm increases classifier accuracy with a lower error rate.

Hall and Holmes [60] presented a benchmark comparison of several attribute selection methods for supervised classification. Attribute selection was performed by cross-validating the attribute rankings with respect to two classification learners, C4.5 and Naïve Bayes. The results concluded that feature selection methods can enhance the performance of some learning algorithms, and that the correlation-based feature selection method produced the best result among the six feature selection methods compared.

Saeys et al. [63] reviewed the importance of feature selection in a set of well-known bioinformatics applications. They focused on two main issues: large input dimensionality and small sample sizes. The authors found that feature selection methods could help researchers address these issues, and argued that the application of feature selection will become fundamental in dealing with high-dimensional applications.

The literature also distinguishes two broad categories of feature selection: wrappers and filters. A wrapper evaluates and selects attributes based on accuracy estimates from the target learning algorithm: using a given learning algorithm, the wrapper searches the feature space by omitting some features and testing the impact of their omission on the prediction metrics. A feature whose omission makes a significant difference to the learning process evidently matters and should be considered a high-quality feature. A filter, on the other hand, uses the general characteristics of the data itself and works separately from the learning algorithm; specifically, it uses the statistical correlation between a set of features and the target feature, where the amount of correlation determines a feature's importance [77].
A further category sorts attributes using ranking algorithms, in which attributes are ranked according to their contribution to a subset of attributes [57].

2.7 Missing Feature Values

Missing feature values are common in many medical databases, for various reasons. Some feature values are simply not available, for example when diagnosing a patient without a blood test result. Attribute values may also be forgotten, mistakenly erased, or never filled in. Moreover, some interviewees decline to provide private information such as income or age [78].

2.7.1 Types of Missing Values

Donald Rubin classified missing feature values into three types: missing completely at random, missing at random, and missing not at random [79].

• Missing Completely At Random

Missing Completely At Random (MCAR) describes missingness in which the probability that a feature value is missing is unrelated to that feature's value or to the value of any other feature in the dataset. For example, data may be missing because equipment malfunctioned, terrible weather prevented recording an observation on a certain day, people became ill, or the data were entered incorrectly [80]. Note that the main concern is the value of the feature, not the missingness itself. For instance, a person who refuses to report personal income is also likely to refuse to report family income; the data are still considered MCAR as long as the reasons are unrelated to the income value itself [80].

• Missing At Random

Missing At Random (MAR) is the case in which the probability of a missing feature value does not depend on the feature value itself but may depend on the values of other features in the dataset.
For example, a depressed person is more likely not to report income, so the failure to report income may be due to depression [80].

• Missing Not At Random

Missing Not At Random (MNAR) is the case in which the value is missing neither at random nor completely at random. For example, if a person suffers depression, and people who suffer depression are more likely not to report their mental status, then the mental-status data are not missing at random. Similarly, if a person refuses to state their age because of the age itself, the missing data are not random [80].

In data mining and machine learning applications, feature values that matter but are missing create a challenge for researchers. Handling unknown attribute values by substituting the most appropriate values is a common concern in data mining and knowledge discovery. Constructing missing values is a vital step in most supervised and unsupervised data mining research because it can affect the quality of learning and the performance of classification algorithms [81]. In general, classification accuracy is particularly affected by the presence of missing feature values because most learning classifiers, such as neural networks, do not model the possibility of missing feature values and cannot deal with them automatically [81].

2.7.2 Handling Missing Data

As mentioned earlier, handling missing feature values is a vital process in most supervised and unsupervised data mining research because missing values affect the quality of the data itself, which in turn may affect classifier performance. The literature contains many attempts to treat missing feature values. The most popular methods are omitting instances, imputation, and expectation maximisation. All of these methods can be applied in conjunction with any classifier that operates on complete data [81].
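The first of these methods, omitting instances (listwise deletion), is simple enough to sketch directly; missing values are marked None here for illustration.

```python
# Sketch of the omitting-instances approach (listwise deletion): any
# record containing a missing value (None) is dropped before learning.
def drop_incomplete(records):
    return [r for r in records if None not in r]

data = [
    [5, 1, 1, 2],
    [8, None, 5, 4],   # record with a missing feature value: dropped
    [3, 1, 1, 2],
]
print(drop_incomplete(data))   # [[5, 1, 1, 2], [3, 1, 1, 2]]
```

The cost is visible even in this toy case: a third of the records, including the only malignant-looking one, is discarded along with its observed values.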
• Omitting Instances

In the omitting instances method, any data record that contains missing feature values is deleted from the dataset, and the classification process is run on the remaining instances. The main disadvantage of this method is that it can discard important information, so it is not a common method; however, it may be used when the amount of missing data is small [81].

• Feature Imputation

Feature imputation is a well-known method for constructing missing feature values in datasets for learning purposes. Imputation methods can be divided into two major types: single imputation and multiple imputation [81]. In single imputation, each missing feature value is substituted by a value derived from the corresponding feature according to a rule such as the feature's mean, mode, or median, or by a learning algorithm. For example, mean imputation calculates the mean of feature f over the records in which it is present, and this mean value is then used to fill in the records where f is missing. Another example is regression imputation, which builds regression models from the observed features (features that contain values) and uses these models to predict the values of the missing attributes [81]. The scenario for constructing missing feature values in multiple imputation is similar to that of single imputation; however, multiple imputation uses more than one value to fill each missing feature value, for example the mean of the observed values, the mode of the observed values, and a regression estimate. A drawback of the multiple imputation approach is that its computational cost is higher than that of single imputation.
However, its classification performance (accuracy) is higher than that of single imputation [81].

• Expectation Maximisation

The two most prominent methods for dealing with missing feature values in the recent literature are expectation maximisation and multiple imputation [79]. Expectation maximisation is one of the most effective methods for handling missing data [82]. To demonstrate expectation maximisation, consider the data shown in Table 5, in which values are missing for depression, age, and height [83].

Table 5: Extract of data to demonstrate expectation maximisation [83]

To perform expectation maximisation, the means, variances, and covariances are first estimated from the instances whose data are complete, such as row 4 in Table 5. In particular, expectation maximisation calculates the following values, shown in Table 6:

• The means of depression, age, height, and weight: 4.71, 37.50, 183.21, and 45504.43 respectively.
• The variances of depression, age, height, and weight: 3.55, 9.43, 194.43, and 14403.12 respectively, which appear on the diagonal; the remaining cells hold the covariances between each pair of variables.

Table 6: The means, variances, and covariances of the features depression, age, height, and weight

Secondly, expectation maximisation uses maximum likelihood procedures to estimate regression equations that capture the relationships between the variables.
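Numerically, the subsequent imputation step amounts to substituting a case's observed values into an estimated regression equation; the coefficients and values below are the illustrative ones used in this worked example, not estimates from real data.

```python
# Sketch of the regression-imputation step in expectation maximisation:
# a missing depression score is estimated from the case's observed
# age, height, and wage. Coefficients are illustrative.
def predict_depression(age, height, wage):
    return -15.3 + 0.01 * age + 0.004 * height + 0.0005 * wage

# Case with depression missing but age 17, height 173, wage 31600 observed:
print(round(predict_depression(17, 173, 31600), 3))   # 1.362
```

The same pattern applies to the other variables: whichever feature is missing, the regression equation with that feature on the left-hand side is evaluated on the case's observed values.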
For example, the maximum likelihood algorithm may produce the following equations:

• Depression = -15.3 + 0.01 × age + 0.004 × height + 0.0005 × wage
• Age = 7.3 + 0.34 × depression + 0.002 × height + 0.0003 × wage
• Height = 19.2 + 0.53 × depression + 0.021 × age + 0.0004 × wage
• Wage = 7.3 + 0.44 × depression + 0.031 × age + 0.0021 × height

The purpose of maximum likelihood is to ensure that these equations predict the means, variances, and covariances more accurately than any other equations [79, 82]. Thirdly, these equations are used to estimate the missing values. The procedure for estimating a missing feature value is as follows:

• Consider the equation Depression = -15.3 + 0.01 × age + 0.004 × height + 0.0005 × wage.
• This equation can be used to estimate depression for an individual who did not provide it.
• For the second case, the observed values 17, 173, and 31600 are substituted into the equation.
• The estimated depression for this person is 1.362.

For other missing feature values, the same procedure is used with the appropriate equation. The constructed missing feature values are shown in bold in Table 7.

Table 7: The final dataset after performing the expectation maximisation method

2.8 Chapter Summary

This chapter presented a background study of the main machine learning and data mining technologies used in the present research, as well as data mining in the field of healthcare. It also described related prior work on data mining techniques, missing feature values, and feature selection techniques, along with the techniques used in this thesis. The next chapter presents the research methodology used in the current research and the details of the datasets used.
CHAPTER THREE
Research Methodology

3.1 Introduction

Two major research paradigms have been identified in the Western tradition of science: positivist (also called scientific) and interpretive (known as anti-positivist) [84]. Dash [85], however, identifies three major research paradigms: positivism, anti-positivism, and critical theory. The positivist paradigm is based on observation and reasoning as tools for understanding a problem or behaviour. This paradigm usually involves the manipulation of variables and predictions made on the basis of previous observation or history. Positivist researchers are concerned with what has caused a particular relationship and what the effects of that relationship are, and they prefer quantitative data that can be transformed into numbers and statistics. Anti-positivism, or the qualitative research approach, concentrates on a subjectivist approach to studying social phenomena and draws on a range of research techniques. Anti-positivist researchers criticise positivists because they believe that statistics and numbers are not informative about human behaviour. Similarly, the critical theory approach uses critique and action research as methods to investigate a problem [85].

Although each research tradition has its own approaches and research methods, a researcher may adopt methods cutting across research traditions to solve the problem or answer the research questions [85]. Table 8 shows research approaches, research methods, and examples for each research tradition. This research can be placed within positivism, as it utilises the principles of prediction from previous history together with data manipulation (the term manipulation in this regard does not involve changing the data structure or values;
rather, manipulating data here means processes such as filling missing feature values, treating noisy data, and data normalisation).

Table 8: Selection of research paradigms and research methods [85]

Positivism
- Research approach: quantitative
- Research methods: surveys (longitudinal, cross-sectional, correlational), experimental, quasi-experimental, and ex-post facto research
- Examples: attitude of distance learners towards online-based education; relationship between students' motivation and their academic achievement; effect of intelligence on the academic performance of primary school learners

Anti-positivism
- Research approach: qualitative
- Research methods: biographical, phenomenological, ethnographical, and case study
- Examples: a study of the autobiography of a great statesman; a study of dropout among female students; a case study of an open distance learning institution in a country

Critical theory
- Research approach: critical and action-oriented
- Research methods: ideology critique and action research
- Examples: a study of the development of education during British rule in India; absenteeism among standard five students of a primary school

3.2 Data Mining Methodology

Knowledge discovery from databases, or data mining, refers to extracting useful relationships and patterns from large databases. Because of the amount of data involved, a systematic method must be applied to obtain useful outcomes. It is well established that quality data yield more accurate outcomes than dirty data. Dirty data is a common term in data mining describing unwanted data characteristics such as incompleteness, noise, and inconsistency.
In this research, our method involves the data mining processes shown in Figure 17.

Figure 17: Research method overview

3.2.1 Data Collection

It is very important to acquire high-quality data, which depends heavily on the quality of the data collection process. The research data were originally proposed to be collected from Canberra Hospital and some healthcare providers in the Australian Capital Territory (ACT); however, access to patient data could not be gained because of the privacy policies of healthcare providers in Australia. The second option was to collect data from overseas, which failed due to the cost involved. This research therefore relied on a third option: online databases. Online databases are publicly available, are collected from clinical environments, have undergone proper organisational ethics approval processes, and are freely available for research purposes. An advantage of using online databases is the ability to compare our methods with existing methods on the same data. One of the most popular machine learning repositories is the UCI Machine Learning Repository, a collection of databases, domain theories, and data generators used by machine learning researchers to train and test machine learning algorithms. The repository was created in 1987 by David Aha and fellow graduate students at UC Irvine, and since then it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning databases [86]. This research used the Wisconsin Breast Cancer (WBC) dataset from the UCI repository. WBC contains 699 records, each consisting of 9 features plus the class attribute. Table 9 shows a sample of the WBC dataset.
In addition to WBC, I later found two other versions of the Wisconsin breast cancer data: Wisconsin Diagnostic Breast Cancer (WDBC) and Wisconsin Prognostic Breast Cancer (WPBC). WDBC contains 569 instances, 32 attributes, and 2 class labels, while WPBC contains 198 instances, 34 attributes, and 2 class labels.

Table 9: Sample of the Wisconsin Breast Cancer diagnosis dataset ('?' denotes a missing value; class 2 = benign, 4 = malignant)

Clump Thickness | Unif. Cell Size | Unif. Cell Shape | Marginal Adhesion | Single Epith. Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class
5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2
5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2
3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2
6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2
4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2
8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4
1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | 2
2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 2
2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | 2
4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2
1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 2
2 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2
5 | 3 | 3 | 3 | 2 | 3 | 4 | 4 | 1 | 4
1 | 1 | 1 | 1 | 2 | 3 | 3 | 1 | 1 | 2
8 | 7 | 5 | 10 | 7 | 9 | 5 | 5 | 4 | 4
7 | 4 | 6 | 4 | 6 | 1 | 4 | 3 | 1 | 4
4 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
8 | 4 | 5 | 1 | 2 | ? | 7 | 3 | 1 | 4
1 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2
5 | 2 | 3 | 4 | 2 | 7 | 3 | 6 | 1 | 4
3 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2
5 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2
2 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2
1 | 1 | 3 | 1 | 2 | 1 | 1 | 1 | 1 | 2
3 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2
2 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2
10 | 7 | 7 | 3 | 8 | 5 | 7 | 4 | 3 | 4
2 | 1 | 1 | 2 | 2 | 1 | 3 | 1 | 1 | 2
3 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 1 | 2

3.2.2 Data Selection

Data selection, or feature selection, has been an active research area in pattern recognition, statistics, and data mining. The aim of feature selection is to select a subset of the record variables by ignoring features that carry little or unimportant information. For example, a physician may decide on the basis of a few selected features whether a dangerous surgery is necessary for treatment. In the current research, feature selection methods have been used to minimise the number of features in the dataset before commencing the mining process.
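Before either selection or mining, the WBC records have to be read in. In the UCI repository the data file stores each record as a comma-separated line of a sample ID, the nine features scored 1-10, and the class code, with '?' marking missing values; a minimal parsing sketch under that assumed format (the inline sample lines are illustrative):

```python
# Sketch of parsing WBC-style records: sample ID, nine integer features
# (1-10), and a class code (2 = benign, 4 = malignant). '?' marks a
# missing value and is parsed as None.
def parse_wbc(lines):
    records = []
    for line in lines:
        fields = line.strip().split(',')
        sample_id = fields[0]
        values = [None if v == '?' else int(v) for v in fields[1:-1]]
        label = 'benign' if fields[-1] == '2' else 'malignant'
        records.append((sample_id, values, label))
    return records

sample = [
    '1000025,5,1,1,1,2,1,3,1,1,2',
    '1057013,8,4,5,1,2,?,7,3,1,4',
]
for rid, feats, label in parse_wbc(sample):
    print(rid, label, feats)
```

Parsing '?' into an explicit missing marker (None) makes the missing-value handling methods of Section 2.7 directly applicable to the loaded records.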
3.2.3 Data Pre-Processing

The data collection phase may produce a dataset that contains incomplete, inaccurate, and inconsistent data. Inaccurate data contain incorrect attribute values, which may arise from data entry errors, faulty data collection tools, errors in data transmission, or users submitting incorrect values just to fill mandatory fields during a survey [57]. Incomplete data can occur for many reasons; for example, some attribute values were not considered important during data entry, and some attribute values were not always available. Inconsistency occurs when a record conflicts with other records in the dataset [57]. Completeness, accuracy, and consistency are the elements that define data quality, and data pre-processing is an important step in the data mining process for satisfying them. The current research therefore applies data pre-processing tasks to ensure the dataset is ready for the mining process and produces results as accurate as possible. The study proposes a new approach for constructing missing feature values to satisfy the completeness element; it also compares feature selection methods to find the method that best suits the datasets, and applies techniques to eliminate noise and outliers. At the end of this phase, the data should be ready for the mining process.

3.2.4 Applying Data Mining Methods

At this stage, the data are ready for the mining process with little or no further pre-processing, and the processed data are used to evaluate the proposed methods. This work proposes a method called Information Gain and Adaptive Neuro-Fuzzy Inference System for Breast Cancer Diagnosis (IGANFIS). IGANFIS is a new approach to breast cancer diagnosis that exploits the advantages of the Adaptive Network-based Fuzzy Inference System (ANFIS) and the information gain method.
In this approach, ANFIS builds an input-output mapping using both human knowledge and machine learning ability, while the information gain method reduces the number of input features to ANFIS. The experimental results showed 98.23% classification accuracy, which underlines the capability of the proposed algorithm. The method and experimental results were presented at the AICIT conference in Seoul in 2010.

While designing and training the IGANFIS method, the study showed how important it is to have a complete dataset during the mining process and when applying automatic methods. Therefore, this work proposed a new approach for constructing missing feature values, based on iterative nearest neighbours and distance metrics. The proposed approach employs a weighted k-nearest neighbours algorithm and propagates the classification accuracy up to a certain threshold. The proposed method improved classification accuracy by 0.005 on the constructed dataset compared with the original dataset containing missing feature values; the maximum classification accuracy was 0.9698 at k=1. Though the amount of improvement is not large, it does indicate that there is room for improving classification accuracy by estimating the values of missing features. This work appeared at the 3rd International Conference on Data Mining and Intelligent Information Technology Applications (2011).

The data mining process is a comprehensive approach that branches into many areas. To cover further aspects of data mining, a comparison was made between several feature selection approaches. This work was accepted in the International Journal on Data Mining and Intelligent Information Technology Applications and was to appear in the 2013 edition.

In the last century, the challenge was to develop new technologies that could store large amounts of data.
Recently, the challenge has been to utilise this incredible amount of data effectively and to obtain knowledge that benefits business, scientific, and government transactions by using a subset of features rather than all the features in a dataset. The study therefore focused on feature selection techniques as a means of obtaining high-quality attributes to enhance the mining process; feature selection touches every discipline that requires knowledge discovery from large data. In our study, a comparison was made between benchmark feature selection methods using the well-known Wisconsin Breast Cancer (WBC) dataset and three well-recognised machine learning algorithms. The study found that feature selection methods can significantly improve the performance of learning algorithms, but that no single feature selection method best satisfies all datasets and learning algorithms. Machine learning researchers should therefore understand the nature of their datasets and the characteristics of their learning algorithms in order to obtain better outcomes. Overall, Correlation-based Feature Selection (CFS) and Consistency-based Subset Evaluation performed better than Information Gain, Symmetrical Uncertainty, Relief, and Principal Components Analysis.

Based on the feature selection study that was carried out, it was found that identifying the feature selection technique that best satisfies a certain learning algorithm could benefit researchers. Therefore, this work proposed a new method based on a combination of learning algorithms and feature selection techniques. The idea is to obtain a hybrid approach that combines the best-performing learning algorithms with the best-performing feature selection techniques across three well-known datasets.
The experimental results showed that the combination of the correlation-based feature selection method with the Naïve Bayes learning algorithm can produce promising results. The results were presented at ICONIP 2012 (the 19th International Conference on Neural Information Processing) in Qatar.

3.2.5 Evaluation

The evaluation phase is an important part of the data mining process. In this phase, the aim of the data mining experts is to test and assess the proposed model; if the model does not meet expectations, the model is usually rebuilt with different parameters until the desired results are achieved. In this study, the proposed methods are evaluated by comparing the model results with the real data values (class features), from which the classification accuracy and error rate are calculated. The error rate (Err) of a classifier is defined as the number of misclassified samples divided by the total number of records in the dataset, and the classification accuracy of the model is one minus the error rate. If the classification accuracy is below a certain threshold, say 80%, then changes must be made to the method, the feature selection, or the pre-processing phase until satisfactory outcomes are obtained. Another approach to evaluating the results is to compare the results obtained by the proposed methods with previous methods in the literature; in most cases the dataset used in the proposed method should be the same dataset used by the other methods, to ensure that a competitive method has been obtained.

3.2.6 Machine Learning Software Development Tools

The current work used two well-known machine learning tools: WEKA and MATLAB. WEKA stands for Waikato Environment for Knowledge Analysis (the weka is also a flightless bird found only on the islands of New Zealand).
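The evaluation measures defined in this section are straightforward to express in code; the predicted and actual labels below are illustrative.

```python
# Error rate: the fraction of misclassified samples.
# Classification accuracy: one minus the error rate.
def error_rate(predicted, actual):
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

predicted = ['m', 'b', 'b', 'm', 'b']   # classifier output
actual    = ['m', 'b', 'm', 'm', 'b']   # true class features
err = error_rate(predicted, actual)
print(err, 1 - err)   # error rate 0.2, accuracy 0.8
```

With an 80% accuracy threshold, this illustrative classifier would just pass; one more misclassification would send it back to the method, feature selection, or pre-processing phase for revision.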
WEKA is an open source machine learning software written in the Java language. WEKA provides the environment to calculate the information gain and contains data mining and machine learning methods for data pre-processing, classification, regression, clustering, association rules, and visualization [87]. MATLAB is a fourth generation language and interactive environment for numerical computation, visualization, and programming. MATLAB is used to analyse data, develop algorithms, and create models and applications. Therefore, its users come from various backgrounds in engineering, science, and economics. MATLAB is widely used in academic and research institutions as well as industrial enterprises [88].

3.2.7 Results Visualization

At the end of the evaluation phase, data mining experts decide how to present the data mining results. The aim of data visualization is to let the end user view and utilise the obtained results. Since data mining usually involves extracting previously unknown information from a database, the end users may raise questions about the information source and how to utilise it. In databases, by contrast, the end users expect the information to already reside in the database. This research has not investigated data visualization in depth, because the current study is for research purposes and is not business oriented. However, tables, scatter charts, bar charts, and figures have been used to demonstrate the obtained results.

3.3 Chapter Summary

This chapter presented the research methodology used in the current research and the source of the dataset used. It also described the main methodology of the data mining process. The next chapter will describe a new approach for diagnosing breast cancer based on the combination of information gain and an adaptive neuro fuzzy inference technique.
CHAPTER FOUR

Breast Cancer Diagnosis Based on Information Gain and Adaptive Neuro Fuzzy Inference System

4.1 Overview

In this chapter, the details of a data mining approach based on ANFIS and the Information Gain method are discussed. First, a brief description of ANFIS is given, and the proposed approach is described in detail in Section 4.4. The experimental validation and discussion are presented in Section 4.5, and the summary of the findings using this approach is given in Section 4.6.

4.2 Adaptive Neural Fuzzy Inference System (ANFIS)

Adaptive Neural Fuzzy Inference System (ANFIS), proposed by Jang in 1993, is a combination of two machine learning approaches: Neural Network (NN) and Fuzzy Inference System (FIS) [89]. Some of the earlier work on ANFIS was done by Übeyli [90], who aimed to apply an adaptive neural fuzzy inference system (ANFIS) to breast cancer diagnosis. The author used a database of patients with known diagnoses (i.e. supervised learning). The ANFIS classifier was trained with a set of records of nine examined features for breast cancer, and was then used to diagnose new cases. The system combined the neural network, with its ability to learn, and the fuzzy modelling approach. The performance of Übeyli's ANFIS-based model showed promising results and underlined its capability to diagnose the disease with 98% classification accuracy. Motivated by this work, I tried to adapt the ANFIS based data mining technique with a pre-processing stage involving the Information Gain method (IG), with the expectation that the method can enhance the classification accuracy for breast cancer datasets. The details of ANFIS, the IG structure, and the experimental validation are described in the next few sections.
4.2.1 ANFIS Structure

ANFIS exploits the advantages of NN and FIS by combining human expert knowledge (FIS rules) and the ability to adapt and learn (NN) [89]. For a simple illustration, suppose the fuzzy system contains two Sugeno fuzzy rules:

Rule 1: IF $x$ is $A_1$ AND $y$ is $B_1$, THEN $f_1 = p_1 x + q_1 y + r_1$

Rule 2: IF $x$ is $A_2$ AND $y$ is $B_2$, THEN $f_2 = p_2 x + q_2 y + r_2$

Figure 18 (a) shows the fuzzy reasoning and Figure 18 (b) shows the corresponding structure of ANFIS. In Figure 18 (b), the node function in each layer is as follows [89]:

Layer 1: Each node $i$ (represented by a square) in this layer accepts an input and computes the membership $\mu_{A_i}(x)$:

$O_i^1 = \mu_{A_i}(x)$   (15)

where $x$ is the input to node $i$, and $A_i$ is the linguistic label (small, large, etc.) associated with this node. In other words, $O_i^1$ is the membership function of $A_i$ and it specifies the degree to which the given $x$ satisfies the quantifier $A_i$. Usually $\mu_{A_i}(x)$ is chosen to be bell-shaped with values between 0 and 1, such as the Gaussian function:

$\mu_{A_i}(x) = \exp\left[-\left(\frac{x - c_i}{a_i}\right)^2\right]$   (16)

where $a_i$ and $c_i$ are two parameters called premise parameters.

Layer 2: Every node in this layer (represented by a circle) takes the corresponding outputs from Layer 1 and multiplies them to generate a weight:

$w_i = \mu_{A_i}(x) \times \mu_{B_i}(y), \quad i = 1, 2$   (17)

The output of this node represents the firing strength of the rule.

Layer 3: Every node in this layer is a circle node labelled N. This layer normalises the weight of a certain node with respect to the sum of the other nodes' weights (the ratio of weights) and then computes the implication of each output membership function:

$\bar{w}_i = \frac{w_i}{\sum_{s=1}^{2} w_s}, \quad i = 1, 2$   (18)

Layer 4: Every node in this layer is illustrated with a square.
Based on the Sugeno inference system, the output of a rule can be written in the following linear format:

$O_i^4 = \bar{w}_i f_i = \bar{w}_i (p_i x + q_i y + r_i)$   (19)

where $p_i$ and $q_i$ are the consequent parameters and $r_i$ is the bias.

Layer 5: This layer is called the aggregation layer, which computes the summation over the rules; the algorithm produces a single output (centroid):

$O^5 = \text{final output} = \sum_i \bar{w}_i f_i = \frac{\sum_i w_i f_i}{\sum_i w_i}$   (20)

4.2.2 ANFIS Learning

The method used to train ANFIS is the hybrid learning algorithm, which uses the gradient descent method and the Least Squares Estimate (LSE). Each cycle of the hybrid learning consists of a forward pass and a backward pass. In the forward pass the signal travels forward until Layer 4 and the consequent parameters are identified using the LSE method. In the backward pass the errors are propagated backward and the premise parameters are updated by gradient descent. The process is repeated until the lowest error or a predefined threshold is achieved [89].

Figure 18: (a) Fuzzy Reasoning (b) Equivalent ANFIS Structure [89].

4.3 Information Gain

The information gain method was proposed to approximate the quality of each attribute using entropy, by estimating the difference between the prior entropy and the post entropy [67]. This is one of the simplest attribute ranking methods and is often used in text categorization. If $x$ is an attribute and $C$ is the class, the following equation gives the entropy of the class before observing the attribute:

$H(C) = -\sum_{c} p(c) \log_2 p(c)$   (21)

where $p(c)$ is the probability of class value $c$.
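As an illustration of the five layers described above, a single forward pass through a two-rule Sugeno ANFIS can be sketched as follows (a Python sketch, not the MATLAB implementation used in the thesis; all membership and consequent parameters are illustrative values):

```python
import math

def gaussian_mf(x, c, a):
    """Gaussian membership function with premise parameters c (centre) and a (width)."""
    return math.exp(-((x - c) / a) ** 2)

def anfis_forward(x, y):
    # Layer 1: membership degrees for inputs x and y (illustrative premise parameters)
    mu_A1, mu_A2 = gaussian_mf(x, 0.0, 1.0), gaussian_mf(x, 1.0, 1.0)
    mu_B1, mu_B2 = gaussian_mf(y, 0.0, 1.0), gaussian_mf(y, 1.0, 1.0)
    # Layer 2: firing strengths, the product of the membership degrees
    w1, w2 = mu_A1 * mu_B1, mu_A2 * mu_B2
    # Layer 3: normalised firing strengths
    w1n, w2n = w1 / (w1 + w2), w2 / (w1 + w2)
    # Layer 4: linear rule consequents f_i = p_i*x + q_i*y + r_i (illustrative parameters)
    f1 = 1.0 * x + 1.0 * y + 0.0
    f2 = 2.0 * x + 2.0 * y + 1.0
    # Layer 5: aggregated single output (weighted average of the consequents)
    return w1n * f1 + w2n * f2

print(anfis_forward(0.5, 0.5))  # 2.0: both rules fire equally, so the output is (f1 + f2) / 2
```

In hybrid learning, the consequent parameters ($p_i$, $q_i$, $r_i$) would be fitted by least squares in the forward pass and the premise parameters ($c_i$, $a_i$) updated by gradient descent in the backward pass; the sketch above shows only the inference, not the training.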
The conditional entropy of $C$ given $x$ (post entropy) is given by:

$H(C|x) = -\sum_{x} p(x) \sum_{c \in C} p(c|x) \log_2 p(c|x)$   (22)

The information gain (the difference between the prior entropy and the post entropy) is given by the following equations:

$H(C, x) = H(C) - H(C|x)$   (23)

$H(C, x) = -\sum_{c} p(c) \log_2 p(c) + \sum_{x} p(x) \sum_{c} p(c|x) \log_2 p(c|x)$   (24)

4.4 The Proposed IG-ANFIS Approach

The proposed approach is to combine the information gain method and the ANFIS method for diagnosing diseases (in this case, breast cancer). The information gain is used for assessing the quality of attributes. The output of applying the information gain method is a set of features with high ranking values, and this set of high ranked features is the input for ANFIS. The selected features are applied to ANFIS to train and test the proposed approach. The structure of the proposed approach is shown in Figure 19, where $X = \{x_1, x_2, \ldots, x_n\}$ are the original features in the dataset, $Y = \{y_1, y_2, \ldots, y_k\}$ are the features after applying the information gain (feature selection), and $Z$ denotes the final output after applying $Y$ to ANFIS (the diagnosis).

Figure 19: The general structure for the proposed approach

4.5 The Experimental Results

The database, the Wisconsin Breast Cancer Dataset (WBC), was created by William Wolberg et al. [86] from the University of Wisconsin-Madison, USA. The database attributes were collected from digital fine needle aspirates (FNA) of breast masses. WBC contains 699 records. Each record consists of 9 features plus the class attribute. In our experiment, the database was divided into training and testing datasets: 341 records were used for training and 342 records for testing. The records which contain missing values (16 records) were ignored. The class attribute was normalized to [0 = Benign, 1 = Malignant].
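The entropy and information gain equations (21)-(24) can be computed directly from frequency counts; the following is a minimal Python sketch (the helper names are my own, and this is not the WEKA implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """H(C) = -sum over c of p(c) * log2 p(c)."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(C, x) = H(C) - H(C|x): prior entropy minus post (conditional) entropy."""
    n = len(labels)
    post = 0.0
    for v in set(feature_values):
        subset = [c for f, c in zip(feature_values, labels) if f == v]
        post += (len(subset) / n) * entropy(subset)  # p(x) * H(C | x = v)
    return entropy(labels) - post

# A feature that perfectly separates two balanced classes has gain H(C) = 1 bit
print(information_gain(['a', 'a', 'b', 'b'], [0, 0, 1, 1]))  # 1.0
# A feature independent of the class has zero gain
print(information_gain(['a', 'b', 'a', 'b'], [0, 0, 1, 1]))  # 0.0
```

Ranking all nine WBC attributes by this quantity is exactly what produces a table like Table 10 below.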
The information gain method was used to assess the quality of the attributes. Table 10 shows the ranking of attributes after applying the attribute evaluator InfoGainAttributeEval and the search method Ranker (threshold -1) in WEKA on the WBC dataset.

Table 10: Information Gain Ranking Using WEKA on WBC

Attribute Name / Rank
Uniformity of Cell Size (UCSize)  0.636
Uniformity of Cell Shape (UCShape)  0.633
Normal Nucleoli (NN)  0.555
Bare Nuclei (BN)  0.538
Single Epithelial Cell Size (SECS)  0.421
Clump Thickness (CT)  0.411
Marginal Adhesion (MA)  0.394
Bland Chromatin (BC)  0.316
Mitoses (MI)  0.278

It is very important to determine the number of features used in the experiment. Therefore, the proposed approach is to select a certain number of features based on the feature ranks and the point where the rank drops significantly. Figure 20 plots the values of Table 10. It shows the most significant change in the graph (the slope point), which gave us an indication to choose the top ranking features located above the slope point as the recommended features to be used later as inputs to ANFIS. The graph shows the biggest drop just after feature number 4 (BN). Therefore, the features Uniformity of Cell Size (UCSize), Uniformity of Cell Shape (UCShape), Normal Nucleoli (NN), and Bare Nuclei (BN) were selected to train and test the model. At this stage, the attributes have been reduced and the recommended number of features has been set to 4.

Figure 20: Information Gain Ranking on WBC

The first stage was to select the most important features that may lead to more accurate results, as mentioned earlier. The second stage is to construct the fuzzy inference system (FIS). The best known fuzzy inference systems are Mamdani-FIS and Sugeno-FIS.
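The "biggest drop" rule used above to cut the ranked list can be expressed programmatically. The sketch below (my own helper, fed with the ranks from Table 10) returns the four features located above the slope point:

```python
def select_top_features(ranked):
    """Keep the features above the largest drop in rank (the 'slope point').

    `ranked` is a list of (feature_name, rank) pairs sorted by rank, descending.
    """
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = drops.index(max(drops)) + 1   # keep everything before the biggest drop
    return [name for name, _ in ranked[:cut]]

# Information gain ranks from Table 10 (WBC dataset)
ranked = [('UCSize', 0.636), ('UCShape', 0.633), ('NN', 0.555), ('BN', 0.538),
          ('SECS', 0.421), ('CT', 0.411), ('MA', 0.394), ('BC', 0.316), ('MI', 0.278)]
print(select_top_features(ranked))  # ['UCSize', 'UCShape', 'NN', 'BN']
```

The largest gap in these ranks is 0.538 - 0.421 = 0.117, just after BN, so the rule recovers the same four features chosen in the text.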
The Mamdani-FIS method is widely used to capture expert knowledge. It allows users to describe the expertise in a way that resembles real life and nature. However, Mamdani-FIS is computationally expensive. On the other hand, the Sugeno-FIS method is computationally efficient and works well with optimization and adaptive techniques. In our proposed approach, the Sugeno fuzzy inference system has been used to map features to feature membership functions, feature membership functions to rules, rules to a set of outputs, outputs to output membership functions, and the output membership function to a single-valued output, as shown in Figure 21. The membership function maps an input to a membership value, as shown in Figure 22.

Figure 21: Sugeno Fuzzy Inference System with four features input and single output

In addition to the membership functions, the FIS contains the rules that add human reasoning capabilities to machine intelligence, which is usually based on Boolean logic. In our proposed approach, the rules have been defined from the real data. The rules express the weight of each feature by giving higher priority to features that have the highest rank. The proposed approach contains 81 rules (number of rules = $x^y$, where $x$ is the number of membership functions and $y$ is the number of features, i.e. $3^4 = 81$ rules). The following are two examples of rules used in the proposed approach:

IF AND(UniformityCellSize is poor, UniformityCellShape is avg, BareNuclei is poor, NormalNucleoli is poor) THEN (output is OK)

IF AND(UniformityCellSize is poor, UniformityCellShape is high, BareNuclei is poor, NormalNucleoli is avg) THEN (output is NOT_OK)

Figure 22 is a visual representation of the membership functions for the feature UniformityCellSize. It contains three membership functions: poor, average, and high.
Figure 22: Input Membership Function for the feature "Uniformity of Cell Size"

In the third and final stage, the constructed fuzzy inference system and the new feature set were loaded into ANFIS, which trains and tests the proposed approach as shown in Figure 23. The structure of ANFIS in MATLAB is shown in Figure 24.

Figure 23: The structure for the proposed approach (IG-ANFIS)

Figure 24: ANFIS Structure in MATLAB

The result of applying ANFIS to the features selected using the information gain on the WBC dataset showed 98.24% accuracy. The results of previous work (using the same dataset) are shown in Table 11 and Figure 25:

Table 11: Comparison of classification accuracy between IG-ANFIS and some previous work

Approach / Accuracy
AdaBoost  57.60%
ANFIS  59.90%
SANFIS  96.07%
FUZZY  96.71%
FUZZY-GENETIC  97.07%
ILFN  97.23%
NNs  97.95%
ILFN&FUZZY  98.13%
IG-ANFIS (our method)  98.24%
SIANN  100.00%

Figure 25: Comparison of classification accuracy between IG-ANFIS and some previous work

4.6 Summary and Discussion

This chapter proposed a new approach for diagnosing breast cancer by reducing the number of features to an optimal number using the information gain and then applying the new dataset to the Adaptive Neuro Fuzzy Inference System (ANFIS). The study found that the accuracy of the proposed approach is 98.24%, which compares well with other methods. The proposed approach showed very promising results, which may lead to further attempts to utilise information technology for diagnosing patients.
The next chapter will present a new approach for constructing missing feature values based on k-NN and the distance between cases. Examples of the distance functions used are the Euclidean and Minkowski distance functions.

CHAPTER FIVE

Iterative Weighted k-NN for Constructing Missing Feature Values in Wisconsin Breast Cancer Dataset

5.1 Overview

This chapter presents a new approach for constructing missing feature values based on iterative nearest neighbours and distance metrics. The proposed approach employs the weighted k-nearest neighbours algorithm. The main idea is to iterate until the classification accuracy reaches a certain threshold set by the researchers and users. The proposed method showed a slight improvement of 0.005 in classification accuracy on the constructed dataset (the new dataset with no missing values) over the original dataset, which contains some missing feature values. The approach also recorded the classification accuracy from k=1 to k=5; the maximum classification accuracy was 0.9698 when k=1.

5.2 Missing Feature Values

Data mining and knowledge discovery tools have become one of the foremost research areas in the field of medical diagnosis. The aim is to classify large datasets into patterns that can be used to extract useful knowledge. For example, data mining techniques can utilise patients' databases for automated medical diagnoses. The purpose is to achieve more accurate findings, speed up the diagnoses, and reduce the errors and mistakes made by humans [91]. However, incomplete datasets or missing feature values may affect data mining findings.
The problem of missing feature values is common in many applications, particularly in medical databases, for many reasons: some feature values are not specified because they are not available at the time of data collection, and attribute values might be forgotten, mistakenly erased, or not filled in during data entry. In some cases, interviewees decline to disclose private information such as income or age [78]. In data mining and machine learning applications, feature values that matter but are missing create a challenge for researchers. Therefore, the process of treating unknown attribute values with the most appropriate values is a common concern in data mining and knowledge discovery. The process of constructing missing values is a vital process in most supervised and unsupervised data mining research because it may affect the quality of learning and the performance of classification algorithms. The literature shows a variety of methods for treating missing attribute values. These methods may be categorised into sequential and parallel methods. In sequential methods, missing attribute values are replaced by known values, then the knowledge is acquired from a dataset with all attribute values known. Examples of sequential methods are deleting the records (cases) that contain missing values, substituting the most common value of an attribute for missing feature values, assigning all possible attribute values to missing feature values, and replacing missing feature values with the mean of the feature values [78]. In parallel methods, knowledge is acquired directly from the original datasets. An example of a parallel method is rule induction. In rule induction, the rule learning algorithm is used to learn directly from the original dataset to find rules on how to treat or construct missing feature values if they exist [78].
Let us now briefly review attempts at treating missing feature values using the above two types of methods. White [92] proposed the simplest method, which deals with missing values by simply ignoring the cases which contain unknown attribute values. Kononenko et al. [93] proposed a method to infer missing feature values from the other attributes. They used the class label to determine the missing attribute values; the plan was to assign to the missing attribute the most probable attribute value $v_i$ that best satisfies a class $C$. Another method, proposed by Quinlan [94], was to employ a decision tree to estimate the missing values. The approach takes a subset $T'$ from the training set $T$; the corresponding values for the missing values in the subset $T'$ must be known. In the subset cases, the missing attribute becomes the class label and vice versa. Using $T'$, a decision tree can be built to determine the value of the missing attribute, which is temporarily converted into a class label. This method uses the class label to determine the missing attribute values and utilises all the information in the case (dataset instance). However, this method is only valid when there is only one missing attribute value. Quinlan also proposed another method for handling missing attribute values, by considering the missing value "unknown" as an actual value of the attribute. However, this solution is not valid for all cases, because the value "unknown" may represent many meanings, such as the value being too large or too small to be recorded, or the value not being recorded by mistake; hence, this method may introduce uncertainty. Meng and Schenker [95] showed that likelihood techniques can be used to deal with missing data. However, likelihood methods require specific programs which may not be easily available. An alternative to likelihood techniques is imputing the missing data.
Multiple imputation is a method of generating multiple simulated values for each incomplete dataset, and then iteratively analysing the datasets with each simulated value substituted. The intention of this method is to generate estimates that better reflect the true variability and uncertainty in data that contains some missing values [96]. There are different ways to perform multiple imputation, but most approaches assume the missing data to be missing at random (MAR). Missing at random (MAR) is a circumstance in which missing values are randomly distributed within one or more subsamples rather than among the whole dataset; for example, values missing more often among malignant than benign cases, but at random within each class [97]. Santhakumaran [98] successfully used an ANN to treat missing feature values in WBC. The author used the back propagation algorithm to train the network and used four missing value replacement methods to replace the missing values in the dataset (successive iteration, mean, median, and mode). Among these four methods, the median method produced promising results.

5.3 The Proposed Method

The proposed method integrates the weighted k-nearest neighbours algorithm with iteration until the classification accuracy reaches a certain threshold. The k-NN step finds the closest neighbours $(n_1, \ldots, n_k)$ to a certain instance $x_i$ that contains missing feature values, using the Euclidean, Manhattan, Minkowski, or other distance functions; this work finally focused on two distance functions, Euclidean and Minkowski. The proposed approach then constructs feature $j$ of instance $x_i$ from the neighbours $(n_1, \ldots, n_k)$ using the distance values, as in Equation (25):

$y_{ij} = \frac{\sum_{s=1}^{k} n_{sj} / d(x_i, n_s)}{\sum_{s=1}^{k} 1 / d(x_i, n_s)}$   (25)

where $n_s$ denotes the $s$-th closest neighbour to the instance $x_i$, $d(x_i, n_s)$ is the distance between the instance $x_i$ and the neighbour $n_s$, and $n_{sj}$ is feature $j$ of the neighbour $n_s$.
After finding the closest neighbour (the one with the smallest distance value $d_{is}$), the missing feature values in $x_i$ are filled with the equivalent feature values of the neighbour $n_s$ that has this smallest distance to $x_i$. The process of filling missing feature values produces a new training dataset (NT) that contains no missing feature values. To verify the accuracy of the constructed missing feature values, the new training dataset is applied to k-NN and the accuracy is recorded. If the classification accuracy is less than a threshold, then the algorithm steps back to fill the missing feature values again until the desired classification accuracy is reached. Figure 26 shows the flowchart of the proposed method.

Figure 26: The Flowchart for the proposed method (Constructing Missing Features Values)

5.4 The Experimental Results

In data mining and statistical research, the split sample approach is a commonly used design in studies with a large dataset. This design divides the dataset into a training set and a testing set to approximate the classification accuracy. The classifier is designed and developed based on the theory and then trained using the training dataset. After training the classifier, it is applied to each case in the testing sample. In practice, splitting the data is important to avoid a large bias in estimating the classifier accuracy [99]. Therefore, the dataset (WBC) was divided into two parts, a training dataset and a testing dataset. The dataset separation was random to avoid unfairness. The training dataset contains 500 cases, 16 of which contain missing feature values. The rest of the original dataset (WBC) formed the testing cases (199 cases). After preparing the datasets, a classifier was developed using the proposed method. The development tool was Microsoft Visual Studio 2010.
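The construction step can be sketched as inverse-distance weighted k-NN imputation in the spirit of Equation (25). This is an illustrative Python sketch with toy data (the helper names and parameters are my own, not the C# implementation used in the experiments):

```python
def minkowski(a, b, r=2.0):
    """Minkowski distance; r=2 gives Euclidean, r=1 Manhattan."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)

def impute_feature(incomplete, missing_idx, complete_rows, k=3, r=2.0):
    """Fill one missing feature of `incomplete` with the inverse-distance
    weighted average of its k nearest complete neighbours, as in Eq. (25).
    Distances are computed over the non-missing features only."""
    idxs = [i for i in range(len(incomplete)) if i != missing_idx]
    def dist(row):
        return minkowski([incomplete[i] for i in idxs], [row[i] for i in idxs], r)
    neighbours = sorted(complete_rows, key=dist)[:k]
    weights = [1.0 / max(dist(n), 1e-12) for n in neighbours]  # guard against zero distance
    return sum(w * n[missing_idx] for w, n in zip(weights, neighbours)) / sum(weights)

# Toy data: two close neighbours (feature values 2.0 and 2.2) dominate the estimate
rows = [[1.0, 2.0], [1.1, 2.2], [9.0, 9.0]]
print(impute_feature([1.05, None], 1, rows, k=2))  # ~2.1
```

The iterative part of the method would wrap this in a loop: fill the missing values, run the k-NN classifier on the constructed dataset, and repeat until the accuracy threshold is reached.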
The selected language was the C# programming language. In the implementation, this work used several metrics to compute the distance, including the Euclidean and Minkowski functions, as components of the iterative k-NN classifier. Constructing the missing feature values using the proposed method through the iterative k-NN classifier with the Euclidean distance function showed a classification accuracy enhancement of 0.005 when k=3 from the first iteration, and a maximum classification accuracy of 0.9648. Figure 27 shows a comparison of classification accuracy when the missing feature values were not treated and when they were treated. The figure also shows different classification accuracies depending on the number of neighbours (k) in k-NN. The experiment shows that more neighbours reduce the classification accuracy; the reason may be noise introduced by the additional neighbours.

Figure 27: A comparison of classification accuracy for the proposed method through Euclidean/k-NN

The experiment of constructing the missing feature values using the proposed method through the k-NN classifier with the Minkowski distance function showed a classification accuracy enhancement of 0.005 when k=3 and r=1.5 from the first iteration, and a maximum classification accuracy of 0.9698. Figure 28 shows a comparison of classification accuracy when the missing feature values were not treated and when they were treated. The experiment also showed that the Manhattan, Chebychev, and Canberra distance metrics are not suitable for constructing the missing attribute values in this experiment, because the classification accuracy after treating the missing values remained lower than the classification accuracy for the original dataset.
Figure 28: A comparison of classification accuracy for the proposed method through Minkowski/k-NN

5.5 Summary and Discussion

This chapter proposed a new approach for constructing missing feature values based on iterative k-nearest neighbours and distance functions. The approach iterates until it finds the most suitable feature values, i.e. those that satisfy the classification accuracy threshold. The proposed approach showed an improvement of 0.005 in classification accuracy on the constructed dataset over the original dataset with both the Euclidean and Minkowski distance functions. The Manhattan, Chebychev, and Canberra distance metrics produced lower classification accuracy on the new dataset than on the original dataset. This work also noticed that the classification accuracy depends greatly on the number of neighbours (k). The experiment showed that fewer neighbours may lead to higher accuracy. The reason for that, in my opinion, is the amount of noise produced by conflicting neighbours. Finally, the maximum classification accuracy, obtained at k=1, was 0.9698. The next chapter will describe a new approach for diagnosing breast cancer and present a comparison study between some well-known feature selection techniques.

CHAPTER SIX

Diagnosing Breast Cancer Based on Feature Selection and Naïve Bayes

6.1 Overview

Feature selection techniques have become an obvious need for researchers in computer science and many fields of science. Whether the target research is in medicine, agriculture, business, or industry, large amounts of data need to be analysed. In addition, finding the feature selection technique that best satisfies a certain learning algorithm could benefit the research and researchers.
Therefore, a new method has been proposed for diagnosing breast cancer based on a combination of learning algorithms and feature selection techniques. The idea is to obtain a hybrid approach that combines the best performing learning algorithms and the best performing feature selection techniques. The experimental results show that combining the correlation based feature selection method with the Naïve Bayes learning algorithm can produce promising results. However, no single feature selection method best satisfies all datasets and learning algorithms. Therefore, machine learning researchers should understand the nature of datasets and the characteristics of learning algorithms in order to obtain better outcomes. Overall, consistency based subset evaluation performed better than information gain, symmetrical uncertainty, relief, correlation based feature selection, and principal components analysis.

6.2 Feature Selection Techniques

The advancement of information technology, the growing number of social network websites, electronic health information systems, and other factors have flooded the internet with data. The amount of data posted on the internet is increasing daily. At the same time, not all data are important or even needed. Therefore, data mining researchers have started using the terms feature selection or data selection more often. Feature selection, or attribute subset selection, is the process of identifying and utilising the most relevant attributes and removing as many redundant and irrelevant attributes as possible [60, 61]. In addition, feature selection mechanisms do not alter the original representation of the data in any way; they just select an optimal, useful subset. Recently, the motivation for applying feature selection techniques in machine learning has shifted from a theoretical exercise to one of the standard steps in model building.
Many attribute selection methods treat the task as a search problem, where each state in the search space identifies a distinct subset of the possible attributes [100]. Since the space is exponential in the number of attributes, which produces a very large number of possible subsets, this requires the use of a heuristic search procedure for all but trivial datasets. The search procedure is combined with an attribute utility estimator in order to evaluate the relative merit of alternative subsets of attributes [60]. This large number of possible subsets and the computational cost involved lead researchers to benchmark feature selection methods in search of the method that produces the best possible subset, with more accurate results as well as low computational overhead. Feature selection techniques can perform better if the researcher chooses the right learning algorithm. Therefore, a new approach is proposed which combines a promising feature selection technique with a well-known learning algorithm. In the current work, we have focused on publicly available disease datasets (breast cancer) to evaluate the proposed approach. Every year around the world, millions of women suffer from breast cancer, making it the second most common non-skin cancer after lung cancer, and the fifth cause of death among cancers in the world [7]. Thyroid disorders are much more common in women than in men and may lead to thyroid cancer [101]. Hepatitis can be caused by chemicals, drugs, drinking too much alcohol, or by different kinds of viruses, and may lead to liver problems [102]. This chapter begins with brief related work, followed by a description of the benchmark feature selection methods, a description of the methodology, and the results obtained using the three datasets. Finally, a brief discussion and future work are presented.
6.3 Feature Selection Techniques used in this Chapter

The literature offers many methods for selecting subsets of features. In this chapter, I will concentrate on Correlation based Feature Selection (CFS), Information Gain (IG), Relief (R), Principal Components Analysis (PCA), Consistency based Subset Evaluation (CSE), and Symmetrical Uncertainty (SU). These feature subset methods were described in Chapter 2. CFS aims to find subsets containing features that are highly correlated with the class yet uncorrelated with each other [103]. IG is one of the simplest attribute ranking methods; it ranks the quality of an attribute according to the difference between the prior and posterior entropy [104]. The objective of Relief is to measure the quality of attributes according to how well their values distinguish instances of different classes [70]. PCA is probably the oldest feature reduction method; its aim is to reduce the dimensionality of a dataset containing a large number of correlated features while preserving the uncorrelated variation present in the data [71]. CSE tries to obtain a set of attributes that divide the original dataset into subsets dominated by a single class [60], while SU is a modified information gain method that compensates for the bias of information gain [69].

6.4 The Experiment Methodology

Different sets of experiments were performed to evaluate benchmark attribute selection methods on a well-known, publicly available dataset from the UCI machine learning repository, the Wisconsin Breast Cancer dataset (WBC) [25]. To obtain as fair a judgment as possible between feature selection methods, this work considered three machine learning algorithms from three categories of learning methods. The first algorithm is k-nearest neighbours (k-NN), from the lazy learning category.
k-NN is an instance-based classifier where the class of a test instance is determined by the classes of the training instances most similar to it. Distance functions are commonly used to measure the similarity between instances; examples include the Euclidean and Manhattan distance functions [18]. The second algorithm is the Naïve Bayes classifier (NB), from the Bayes category. NB is a simple probabilistic classifier based on applying Bayes' theorem, and is one of the most efficient and effective learning algorithms for machine learning and data mining because of its conditional independence assumption (no attribute depends on any other) [105]. The last machine learning algorithm is the Random Tree (RT), or classification tree. RT classifies an instance into one of a predefined set of classes based on its attribute values, and is frequently used in many fields such as engineering, marketing, and medicine [37]. After applying the feature selection techniques and the learning algorithms to the dataset and obtaining classification accuracy results, a hybrid method is constructed that combines the advantages of the best performing feature selection technique with those of the best performing learning algorithm, as shown in Figure 29.

Figure 29: Hybrid method of feature selection technique and a learning algorithm

The software package used in the present work is the Waikato Environment for Knowledge Analysis (WEKA). WEKA is open source machine learning software written in Java that provides an environment for running many machine learning algorithms and feature selection methods, including tools for data pre-processing, classification, regression, clustering, association rules, and visualization [106].
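To make the classifier descriptions concrete, here is a minimal, self-contained sketch, in illustrative Python rather than the WEKA implementations used in the experiments, of two of the three learners: a k-NN predictor built on the Euclidean and Manhattan distances, and a categorical Naïve Bayes with Laplace smoothing. All data values are invented toy examples:

```python
import math
from collections import Counter, defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, query, k=3, dist=euclidean):
    """train is a list of (feature_vector, label); plurality vote of the k nearest."""
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

class NaiveBayes:
    """Categorical Naive Bayes with Laplace smoothing. The conditional
    independence assumption appears as a product of per-attribute
    likelihoods (a sum of logs below)."""
    def fit(self, rows, labels):
        self.priors = Counter(labels)
        self.total = len(labels)
        self.counts = defaultdict(Counter)   # (feature index, class) -> value counts
        self.values = defaultdict(set)       # feature index -> distinct values seen
        for row, y in zip(rows, labels):
            for j, v in enumerate(row):
                self.counts[(j, y)][v] += 1
                self.values[j].add(v)
        return self

    def predict(self, row):
        def log_posterior(y):
            lp = math.log(self.priors[y] / self.total)
            for j, v in enumerate(row):
                c = self.counts[(j, y)]
                lp += math.log((c[v] + 1) / (sum(c.values()) + len(self.values[j])))
            return lp
        return max(self.priors, key=log_posterior)

# toy numeric data for k-NN
points = [((1, 1), 'benign'), ((2, 1), 'benign'),
          ((8, 9), 'malignant'), ((9, 8), 'malignant'), ((8, 8), 'malignant')]
knn_label = knn_predict(points, (7, 7), k=3)

# toy categorical data for Naive Bayes
rows = [['high', 'yes'], ['high', 'yes'], ['low', 'no'], ['low', 'no']]
classes = ['malignant', 'malignant', 'benign', 'benign']
nb_label = NaiveBayes().fit(rows, classes).predict(['high', 'yes'])
```

Swapping `dist=manhattan` into `knn_predict` changes the similarity measure without touching the classifier, which is why the thesis can discuss distance functions independently of the k-NN rule itself.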
6.5 The Experimental Results

The notations "+", "-", and "=" are used to show the classification performance of each feature selection method compared with the original dataset (before applying feature selection): "+" denotes improvement, "-" denotes degradation, and "=" denotes no change. The experimental results of using Naïve Bayes (NB) as the machine learning algorithm on the WBC dataset are shown in Table 12.

Table 12: Results for Attribute Selection Methods with Naïve Bayes

    Method                                      WBC
    Original Dataset                            95.99%
    Correlation based Feature Selection (CFS)   95.99% =
    Information Gain (IG)                       95.99% =
    Relief (R)                                  95.99% =
    Principal Components Analysis (PCA)         96.14% +
    Consistency based Subset Evaluation (CSE)   96.28% +
    Symmetrical Uncertainty (SU)                95.99% =

Table 12 shows the results of applying the feature selection techniques with the Naïve Bayes learning method on the WBC dataset. The classification accuracy of Naïve Bayes on the original WBC dataset is 95.99%, and it improved after applying the Principal Components Analysis and Consistency based Subset Evaluation methods. The best result, about 96.28% classification accuracy, was achieved by the Consistency based Subset Evaluation technique, while classification accuracy remained unchanged using Correlation based Feature Selection, Information Gain, Relief, and Symmetrical Uncertainty. Figure 30 illustrates the results in Table 12.

Figure 30: Attribute selection methods with Naïve Bayes

The second machine learning classifier used for testing the feature selection methods is k-NN. The experimental results of using k-NN as the machine learning algorithm on WBC are shown in Table 13.
Table 13: Results for Attribute Selection Methods with k-NN

    Method                                      WBC
    Original Dataset                            95.42%
    Correlation based Feature Selection (CFS)   95.42% =
    Information Gain (IG)                       95.42% =
    Relief (R)                                  95.42% =
    Principal Components Analysis (PCA)         95.42% =
    Consistency based Subset Evaluation (CSE)   95.85% +
    Symmetrical Uncertainty (SU)                95.42% =

Table 13 shows that the classification accuracy of k-NN on the original WBC dataset is 95.42%, and that it improved after applying the Consistency based Subset Evaluation (CSE) method. The other feature selection methods produced the same classification accuracy as the original dataset. Figure 31 illustrates the results in Table 13.

Figure 31: Results for attribute selection methods with k-NN

The last machine learning classifier in our experiment is the Decision Tree (DT). The experimental results of using DT as the machine learning algorithm on WBC are shown in Table 14.

Table 14: Results for Attribute Selection Methods with Decision Tree

    Method                                      WBC
    Original Dataset                            94.56%
    Correlation based Feature Selection (CFS)   94.56% =
    Information Gain (IG)                       94.56% =
    Relief (R)                                  94.56% =
    Principal Components Analysis (PCA)         94.85% +
    Consistency based Subset Evaluation (CSE)   93.56% -
    Symmetrical Uncertainty (SU)                94.56% =

Table 14 shows an improvement in classification accuracy after applying PCA, a decline after applying CSE, and no change using CFS, IG, R, and SU. Figure 32 illustrates the results in Table 14.
Figure 32: Results for attribute selection methods with Decision Tree

6.6 Summary and Discussion

Figure 33: Hybrid method of feature selection technique and a learning algorithm

According to the results obtained on WBC, Naïve Bayes performed best in terms of classification accuracy, while k-NN and DT performed only slightly better after applying the feature selection methods. In general, feature selection methods can improve the performance of learning algorithms. However, no single feature selection method best satisfies all datasets and learning algorithms. Therefore, machine learning researchers should understand the nature of their datasets and the characteristics of their learning algorithms in order to obtain the best possible outcomes. Overall, the CSE feature selection method performed better than IG, SU, R, CFS, and PCA. The study also found that IG and SU performed identically, which is expected because SU is a modified version of IG.

CHAPTER SEVEN
Fusion of Heterogeneous Classifiers for Breast Cancer Diagnosis

7.1 Overview

In the twenty-first century, we are "flooded with data, but starved of information" [107]. New technologies produce a huge variety of data, far beyond human capacity for analysis. Therefore, intelligent systems were developed to help humans organise and utilise huge amounts of data effectively [107]. To obtain the full benefit of intelligent systems, it has become common in machine learning to identify the advantages of individual systems and mix and match them to produce a new approach that maximises the advantages and minimises the overhead.
Fusion or hybrid intelligent systems, in which two or more machine learning algorithms are combined into one new approach, are often effective and can overcome the limitations of the individual approaches [108]. Classification is one branch of machine learning, and the process of integrating two or more classifiers is usually referred to as multi-classification or classification fusion. There are two main paradigms for combining different classification algorithms: classifier selection and classifier fusion [109]. The classifier selection paradigm uses a single model to predict each new case, whereas classification fusion merges the outputs of two or more models to produce a single output. The present chapter introduces different types of classification fusion applied to three well-known classifiers on breast cancer datasets. The process of integrating two or more classifiers enhanced the classification accuracy in some cases; however, there is no single combination that suits all datasets.

7.2 Multi-Classification Approach

The process of combining two or more classifiers is called the multi-classification approach. The motivation for multi-classification is the argument that no single classifier suits all learning problems [109]. Multi-classification can be divided into two types: classifier selection and classifier fusion.

7.2.1 Classifier Selection

Classifier selection is one of the simplest methods for combining learning algorithms or classifiers. The idea is to evaluate two or more classifiers on the training dataset and then use the best performing classifier on the testing dataset. This method is simple and straightforward, requires no output combination, and performs well compared to more complex schemes [110].
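Classifier selection as described above can be sketched as follows: each candidate is scored by cross-validation on the training data, and the single best performer is kept. The candidates here, a majority-class baseline and a 1-nearest-neighbour rule, and the toy data are invented for illustration and are not the thesis's WEKA setup:

```python
from collections import Counter

def one_nn(train_rows, train_labels, query):
    """1-nearest-neighbour prediction by squared Euclidean distance."""
    i = min(range(len(train_rows)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(train_rows[i], query)))
    return train_labels[i]

def majority(train_rows, train_labels, query):
    """Baseline: always predict the most frequent training class."""
    return Counter(train_labels).most_common(1)[0][0]

def cross_val_accuracy(fit_predict, rows, labels, k=5):
    """Accuracy over k interleaved folds, each held out in turn."""
    n, correct = len(rows), 0
    for f in range(k):
        held_out = set(range(f, n, k))
        tr_rows = [rows[i] for i in range(n) if i not in held_out]
        tr_labels = [labels[i] for i in range(n) if i not in held_out]
        correct += sum(fit_predict(tr_rows, tr_labels, rows[i]) == labels[i]
                       for i in held_out)
    return correct / n

def select_classifier(candidates, rows, labels):
    """Classifier selection: keep only the candidate with the best estimate."""
    scores = {name: cross_val_accuracy(fp, rows, labels)
              for name, fp in candidates.items()}
    return max(scores, key=scores.get), scores

# two well-separated toy classes: 1-NN should dominate the baseline
rows = [(0,), (1,), (2,), (3,), (10,), (11,), (12,), (13,)]
labels = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
best, scores = select_classifier({'1-nn': one_nn, 'majority': majority}, rows, labels)
```

Only `best` is then used on new cases; unlike fusion, the losing models' outputs are discarded entirely.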
7.2.2 Fusion Classifier

A fusion classifier is a group of classifiers whose individual predictions are combined in some way (highest average ranking, average probability, or voting) to classify new cases. It has become an active area of research in supervised learning, studying new ways of constructing classifiers for more accurate outcomes. Voting is the simplest method for multi-classification with heterogeneous and homogeneous models. Voting methods divide into two types: weighted and unweighted. In unweighted voting, all classifiers are treated equally, with no classifier given priority over another; each classifier outputs a class value, and the class with the most votes is the final outcome of the multi-classifier. Note that this type of voting is in fact called plurality voting, in contrast to the frequently used term majority voting, since majority voting implies that at least 50% + 1 of the votes (a majority) must belong to the winning class. In weighted voting, classifiers may receive different weights according to the user's confidence in each classifier's performance and classification accuracy. For example, the user can put more weight on k-NN (60%) than on Naïve Bayes (40%) in a particular multi-classifier problem. Weighted voting typically discriminates between classifiers based on their reputation and ranking relative to the other classifiers [110].

7.3 Classifier Combination Strategies

Machine learning and data mining are rich in classification tools and algorithms. In the context of combining classifiers, there is uncertainty about which classifier works best with which others. Kuncheva and Whitaker [111] stated that the best combination of a set of classifiers depends on the application and on the characteristics of the classifiers; there is no universally best combination.
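The two voting schemes from Section 7.2.2 can be sketched directly. The weighting below mirrors the 60/40-style example given there, with a third classifier added so the weighted outcome can differ from the plurality one; the predictions are toy values, for illustration only:

```python
from collections import defaultdict

def vote(predictions, weights=None):
    """Plurality vote when weights is None; weighted vote otherwise.
    predictions: one class label per classifier."""
    if weights is None:
        weights = [1.0] * len(predictions)
    tally = defaultdict(float)
    for label, w in zip(predictions, weights):
        tally[label] += w
    return max(tally, key=tally.get)

preds = ['benign', 'benign', 'malignant']   # three classifiers' outputs
unweighted = vote(preds)                    # two of three agree: 'benign'
weighted = vote(preds, [0.2, 0.2, 0.6])     # the trusted third model wins: 'malignant'
```

The example shows why the weights matter: the same three predictions yield opposite diagnoses depending on how much the user trusts each classifier.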
One approach is to generate a large number of classifiers and then select the best combination to use, based on classification accuracy or other criteria set by the researcher. However, this approach is costly and time consuming, since n classifiers may be combined in 2^n ways, and it is difficult to run that many experiments except in simple or restricted circumstances. To address this, several search techniques have been used to find the best combination of classifiers, including forward and backward search, Tabu search, and genetic algorithms [107]. In general, a more powerful technique for finding the best possible combination of classifiers is still needed.

7.4 Experimental Methodology

Different sets of experiments were performed to evaluate the multi-classification approach on well-known, publicly available breast cancer datasets from the UCI machine learning repository [25]. Three versions of the breast cancer diagnosis data are used in this chapter: the Wisconsin Breast Cancer (Original) dataset (WBC), the Wisconsin Diagnosis Breast Cancer dataset (WDBC), and the Wisconsin Prognosis Breast Cancer dataset (WPBC). Table 15 shows the statistical details of the datasets: they differ in size, with the smallest containing 11 attributes and the largest 34; the number of instances ranges from 198 to 699, while all three datasets contain 2 classes. The study considered heterogeneous classifiers from three machine learning categories. The first algorithm is k-nearest neighbours (k-NN), from the lazy learning category. k-NN is an instance-based classifier where the class of a test instance is determined by the classes of the training instances most similar to it. Distance functions are commonly used to measure the similarity between instances.
Examples of distance functions are the Euclidean and Manhattan distance functions [18]. The second algorithm is the Naïve Bayes classifier (NB), from the Bayes category. NB is a simple probabilistic classifier based on applying Bayes' theorem, and is one of the most efficient and effective learning algorithms for machine learning and data mining because of its conditional independence assumption (no attribute depends on any other) [105]. The last machine learning algorithm is the Random Tree (RT), from the tree classification category. RT classifies an instance into one of a predefined set of classes based on its attribute values, and is frequently used in many fields such as engineering, marketing, and medicine [37]. The study used k-fold cross-validation to separate the training set from the test set, with k = 10. The experimental environment is the well-known machine learning software WEKA.

Table 15: Statistics of Breast Cancer Datasets

    Dataset                                     Attributes   Instances   Classes
    Wisconsin Breast Cancer (Original)          11           699         2
    Wisconsin Diagnosis Breast Cancer (WDBC)    32           569         2
    Wisconsin Prognosis Breast Cancer (WPBC)    34           198         2

In this experiment, the confusion matrix was used to measure classifier performance. Classification accuracy is the main criterion for estimating the effectiveness of a classification model, based on the numbers of correctly and incorrectly classified cases.

7.5 Experimental Results

Three experiments were performed on the three breast cancer datasets. The first experiment used single-classifier models to set a baseline of classification accuracy. The second experiment used combinations of two classifiers, while the last experiment fused all three classifiers.
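The confusion-matrix bookkeeping behind the accuracy figures can be sketched as follows; the labels below are toy values (the thesis experiments computed this inside WEKA):

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    m = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

def accuracy(matrix):
    """Correct cases are on the diagonal; accuracy = correct / total."""
    correct = sum(matrix[c][c] for c in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

actual    = ['malignant', 'malignant', 'benign', 'benign', 'benign']
predicted = ['malignant', 'benign',    'benign', 'benign', 'malignant']
m = confusion_matrix(actual, predicted, ['benign', 'malignant'])
acc = accuracy(m)   # 3 correct out of 5 -> 0.6
```

The off-diagonal cells distinguish the two kinds of error (a malignant case called benign versus the reverse), which a single accuracy number hides.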
Figure 34: Single classifiers on the three datasets WBC, WDBC, and WPBC

Figure 34 shows the results of running the three single classifiers on the three datasets. Naïve Bayes performed best in terms of classification accuracy on WBC (0.9599), while k-NN and Random Tree performed best on WDBC and WPBC respectively. Figure 35 shows the results of combining two classifiers (Naïve Bayes and k-NN, Naïve Bayes and Random Tree, and k-NN and Random Tree). The results indicate that the fusion of Naïve Bayes and k-NN produced the best classification accuracy (0.9642 on WBC, 0.9508 on WDBC, and 0.6869 on WPBC). This suggests that Naïve Bayes and k-NN may produce better results when combined.

Figure 35: Two classifiers on the three datasets WBC, WDBC, and WPBC

Figure 36 shows the result of fusing the three classifiers (Naïve Bayes, k-NN, and Random Tree). Combining the three classifiers maintained satisfactory classification accuracy on WBC (0.9585) and WDBC (0.9473), while producing a significant improvement in classification accuracy on the WPBC dataset (0.7323).

Figure 36: The fusion of three classifiers on the three datasets WBC, WDBC, and WPBC

7.6 Summary and Discussion

This chapter introduced classification fusion using three well-known machine learning classifiers on breast cancer datasets. This work confirms the argument that the best combination of a set of classifiers depends on the application and on the characteristics of the classifiers. In addition, there is no best combination of classifiers that suits all datasets.
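One way to see why searching for the best combination is expensive (Section 7.3) is to enumerate every non-empty subset of classifiers, 2^n - 1 of them, and score each. In the sketch below the single-classifier and NB+k-NN accuracies echo the WBC figures reported above, while the two remaining pair scores are hypothetical placeholders for illustration:

```python
from itertools import combinations

def best_combination(classifier_names, score):
    """Exhaustively score every non-empty subset: 2**n - 1 candidates,
    which is why heuristic search is needed once n grows."""
    best_subset, best_score = None, float('-inf')
    n_evaluated = 0
    for r in range(1, len(classifier_names) + 1):
        for subset in combinations(classifier_names, r):
            n_evaluated += 1
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score, n_evaluated

def score(subset):
    # WBC accuracies from this chapter for the singles, NB+k-NN, and the
    # triple; the ('NB','RT') and ('k-NN','RT') values are invented placeholders
    table = {('NB',): 0.9599, ('k-NN',): 0.9542, ('RT',): 0.9456,
             ('NB', 'k-NN'): 0.9642, ('NB', 'RT'): 0.9501,
             ('k-NN', 'RT'): 0.9510, ('NB', 'k-NN', 'RT'): 0.9585}
    return table[subset]

best, acc, n = best_combination(['NB', 'k-NN', 'RT'], score)   # NB + k-NN wins
```

With only three classifiers the seven evaluations are trivial, but the count doubles with every classifier added, which motivates the forward/backward, Tabu, and genetic search techniques cited earlier.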
However, in the current experiments, Naïve Bayes and k-NN produced better results when combined as one classifier, with the maximum classification accuracy obtained on the WBC dataset (0.9642).

CHAPTER EIGHT
Discussion and Future Work

The main purpose of the current research is to contribute to the effort of enhancing the quality of healthcare services, proposing technology as one solution to the shortages of medical staff and supporting technology. This thesis presented the challenges facing many countries in the field of healthcare services: population growth, cultural change, climate change, and other factors have driven increased demand for healthcare services. Australia is one of the countries that has recently started to utilise technology to help meet the high demand for health services. Therefore, the state, territory, and federal governments in Australia founded NEHTA to drive the national interest in eHealth and technology-supported healthcare services. However, the process of utilising technology in healthcare services is comprehensive and involves many stages and steps, and it is very important to discuss all related issues in order to arrive at a new system that delivers the expected services. The eHealth project in Australia will deliver a huge repository of patients' information. This will create new challenges that need further investigation in order to achieve the desired goals. In addition, these data are valuable and may be used to discover new trends, learn methods for treating similar cases, and find useful patterns. The current research focused on common issues related to huge databases: missing feature values and feature selection methods, and how they can be used to diagnose and predict outcomes for new cases.
Chapter 4 showed how the information gain feature selection technique can be used in collaboration with an adaptive neuro fuzzy inference system (ANFIS) in diagnosing new patient cases. The combination created a new approach for diagnosing breast cancer by reducing the number of features to the optimal number using information gain and then applying the reduced dataset to ANFIS. The study found that the accuracy of the proposed approach is 98.24%, which compares favourably with other methods. The proposed approach showed very promising results, which may lead to further attempts to utilise information technology in diagnosing patients for breast cancer.

Missing feature values are a concern when dealing with databases, especially large ones. Therefore, an approach for constructing missing feature values based on iterative k-nearest neighbours and distance functions has been proposed. The approach iterates until it finds the most suitable feature values, as judged by classification accuracy. The proposed approach improved classification accuracy by 0.005 on the constructed dataset compared with the original dataset, using both the Euclidean and Minkowski distance functions. The study found that the Manhattan, Chebyshev, and Canberra distance metrics produced lower classification accuracy on the new dataset than on the original dataset. The study also noticed that classification accuracy depends greatly on the number of neighbours (k); the experiments showed that fewer neighbours may lead to higher accuracy. The reason for this, in my opinion, is the amount of noise produced by conflicting neighbours. The maximum classification accuracy, 0.9698, was obtained at k = 1.

Another important issue when dealing with databases, and health databases in particular, is feature selection: how to determine the most important features that lead to a more accurate diagnosis.
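The k-NN construction of missing feature values summarised earlier in this chapter can be sketched as follows. This is a simplified, single-pass version under stated assumptions (missing entries marked `None`, a single numeric imputation per gap); the actual approach iterates until classification accuracy stabilises:

```python
import math

def knn_impute(rows, k=1):
    """Replace each missing entry (None) with the mean of that feature over
    the k rows nearest on the jointly observed features (Euclidean distance)."""
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float('inf')
        return math.sqrt(sum((x - y) ** 2 for x, y in shared))

    completed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is None:
                donors = [r for idx, r in enumerate(rows)
                          if idx != i and r[j] is not None]
                donors.sort(key=lambda r: dist(row, r))
                completed[i][j] = sum(r[j] for r in donors[:k]) / k
    return completed

rows = [[1.0, 2.0], [1.1, None], [9.0, 9.5]]
filled = knn_impute(rows, k=1)   # the gap is filled from the nearest row
```

With k = 1 each gap is copied from the single closest record, matching the observation above that smaller k reduced the noise contributed by conflicting neighbours.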
The investigation showed that no single feature selection method best satisfies all datasets and learning algorithms. Therefore, machine learning researchers should understand the nature of their datasets and the characteristics of their learning algorithms in order to obtain the best possible outcomes. Overall, the CSE feature selection method performed better than IG, SU, R, CFS, and PCA. This work also found that IG and SU performed identically; the reason may be that SU is a modified version of IG. According to the results obtained on WBC, Naïve Bayes performed best in terms of classification accuracy, while k-NN and DT performed slightly better after applying feature selection methods than on the original dataset without feature selection. In general, feature selection methods can improve the performance of learning algorithms. Based on the investigation of feature selection techniques and three well-known machine learning algorithms, this work proposed a hybrid approach for diagnosing breast cancer that combines the best performing machine learning algorithm with the best performing feature selection method: Naïve Bayes with consistency based subset evaluation. The hybrid approach achieved a classification accuracy of 0.9628 on the Wisconsin Breast Cancer dataset (WBC), which underlines its capability.

The research also introduced classification fusion using three well-known machine learning classifiers on the breast cancer datasets. Classification fusion has become a hot topic in the field of machine learning due to its capability of combining the advantages of several algorithms in a single approach. This study confirms that the classification fusion approach can improve classification accuracy; however, it depends on the characteristics of the classifiers involved.
In addition, there is no best combination of classifiers that suits all datasets. However, in the current experiments, Naïve Bayes and k-NN produced better results when combined as one classifier, with the maximum classification accuracy obtained on the WBC dataset (0.9642). Based on the experiments on different machine learning algorithms and the Wisconsin Breast Cancer dataset, the study concludes that hybridisation of existing machine learning algorithms can produce better approaches for medical diagnosis; the idea is to combine the advantages of different algorithms in one approach. The study also found that feature selection techniques (feature discrimination) can help improve prediction in the context of medical diagnosis (breast cancer diagnosis in this study). However, no specific feature selection method suits all machine learning tools.

Future work can be described as follows. The current research relied mainly on classification accuracy as the criterion for measuring the performance of the proposed approaches; future work will also consider other criteria such as classification speed and computational cost. Future work can also broaden the range of diseases, which has already begun with some attempts on thyroid and hepatitis data (Figures 37-39 and Tables 16-19). In addition, the breast cancer datasets used in this study have binary outcomes (class labels); clinical practice, however, is often more complex, and outcomes may take different forms. It is envisaged that future work can contribute to the knowledge and improve the accuracy and reliability of the established system by broadening the databases and expanding the criteria for measuring system performance. We also aim to contact the National eHealth Transition Authority to obtain real datasets and to discuss the integration of the proposed approaches with the eHealth system.
Table 16: Results for attribute selection methods with Naïve Bayes on three databases (Thyroid, Hepatitis, and Breast Cancer)

    Method   Thyroid    Hepatitis   Breast Cancer
    NB       92.60%     84.52%      95.99%
    CFS      96.53% +   87.74% +    95.99% =
    IG       93.88% +   85.16% +    95.99% =
    R        92.60% =   84.52% =    95.99% =
    PC       94.30% +   84.52% =    96.14% +
    CB       94.59% +   84.52% =    96.28% +
    SU       93.88% +   85.16% +    95.99% =

Figure 37: Results for attribute selection methods with Naïve Bayes on three databases (Thyroid, Hepatitis, and Breast Cancer)

Table 17: Results for attribute selection methods with k-NN on three databases (Thyroid, Hepatitis, and Breast Cancer)

    Method   Thyroid    Hepatitis   Breast Cancer
    k-NN     95.92%     81.94%      95.42%
    CFS      96.10% +   84.52% +    95.42% =
    IG       96.50% +   81.29% -    95.42% =
    R        95.92% +   81.94% =    95.42% =
    PC       95.78% -   81.29% -    95.42% =
    CB       96.37% +   81.94% =    95.85% +
    SU       96.50% +   81.29% -    95.42% =

Figure 38: Results for attribute selection methods with k-NN on three databases (Thyroid, Hepatitis, and Breast Cancer)

Table 18: Results for attribute selection methods with Decision Tree on three databases (Thyroid, Hepatitis, and Breast Cancer)

    Method   Thyroid    Hepatitis   Breast Cancer
    RT       96.92%     76.77%      94.56%
    CFS      96.29% -   77.42% +    94.56% =
    IG       96.63% -   74.19% -    94.56% =
    R        97.22% +   76.77% =    94.56% =
    PC       97.03% +   76.13% -    94.85% +
    CB       97.16% +   80.65% +    93.56% -
    SU       96.63% -   74.19% -    94.56% =

Figure 39: Results for attribute selection methods with Decision Tree on three databases (Thyroid, Hepatitis, and Breast Cancer)

References

1. Gunter, D.T. and P.N. Terry, The Emergence of National Electronic Health Record Architectures in the United States and Australia: Models, Costs, and Questions. J Med Internet Res, 2005. 7(1).
2. World Health Organization Assesses the World's Health Systems.
World Health Organization [cited 01/09/2010]; Available from: http://www.who.int/whr/2000/media_centre/press_release/en/index.html.
3. Gerard, A., et al., Health Care Spending and Use of Information Technology in OECD Countries. Health Affairs, 2006. 25(3).
4. Sorwar, G. and S. Murugesan, Electronic medical prescription: an overview of current status and issues, in Biomedical Knowledge Management: Infrastructures and Processes for e-Health Systems, M. Cooper and M. Gururajan (Editors). 2010, IGI Global: Hershey, PA. p. 61-81.
5. Lazarou, J., B. Pomeranz, and P. Corey, Incidence of adverse drug reactions in hospitalized patients: a meta-analysis of prospective studies. Journal of the American Medical Association, 1998. 279(15).
6. Medication Safety in the Community: A Review of the Literature. 2009 [cited 01/09/2010]; Available from: www.safetyandquality...con/$File/25953-MS-NPS-LitReview2009.PDF.
7. Mammography Screening Can Reduce Deaths from Breast Cancer. 2002 [cited 20/05/2011]; Available from: http://www.iarc.fr/en/mediacentre/pr/2002/pr139.html.
8. Most Frequent Cancers in Men and Women. 2008 [cited 20/01/2012]; Available from: http://globocan.iarc.fr/factsheets/populations/factsheet.asp?uno=900.
9. General Information About Male Breast Cancer. 2012 [cited 30/12/2012]; Available from: http://www.cancer.gov/cancertopics/pdq/treatment/malebreast/Patient/page1.
10. Breast Cancer in Australia: An Overview. Australian Institute of Health and Welfare, 2012.
11. Giarratano, J. and G. Riley, Expert Systems: Principles and Programming. 2nd ed. Vol. 1. 1994, Boston: PWS Publishing Company.
12. NEHTA Blueprint. 2010 [cited 01/10/2010]; Available from: http://www.nehta.gov.au.
13. Tarca, A.L., et al., Machine Learning and Its Applications to Biology. PLoS Comput Biol, 2007. 3(6).
14. Rokach, L., Data Mining with Decision Trees: Theory and Applications. Vol. 69. 2007: World Scientific.
15.
Rokach, L. and O. Maimon, eds. Data Mining and Knowledge Discovery Handbook. 2nd ed. 2010, Springer Science and Business Media.
16. Tan, P.-N., M. Steinbach, and V. Kumar, Introduction to Data Mining. 2006: Addison-Wesley.
17. Thirumuruganathan, S., A Detailed Introduction to k-Nearest Neighbor (kNN) Algorithm. 2010.
18. Pevsner, J., Bioinformatics and Functional Genomics. 2nd ed. 2009, New York: Wiley-Blackwell.
19. Weisstein, E.W., Euclidean Metric. [cited 19/08/2011]; Available from: http://mathworld.wolfram.com/EuclideanMetric.html.
20. Young, M., et al., Distance Metrics Overview. 2004 [cited 03/08/2011]; Available from: http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Distance_Metrics_Overview.htm.
21. Mei, Z., Q. Shen, and B. Ye, Hybridized k-NN and SVM for gene expression data classification. Life Science Journal, 2009. 6(1).
22. Parvin, H., H. Alizadeh, and B. B, MKNN: Modified k-Nearest Neighbor, in Proceedings of the World Congress on Engineering and Computer Science. 2008: USA.
23. Cedeno, W. and D. Agrafiotis, Using particle swarms for the development of QSAR models based on k-nearest neighbor and kernel regression. Journal of Computer-Aided Molecular Design, 2003. 17(2-4).
24. Crookston, N. and A. Finley, yaImpute: An R Package for kNN Imputation. Journal of Statistical Software, 2007. 23(10).
25. Wolberg, W. and L. Mangasarian, Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, 1990. 87: p. 9193-9196.
26. Li, S., E. Harner, and D. Adjeroh, Random k-NN feature selection - a fast and stable alternative to Random Forests. BMC Bioinformatics, 2011. 12(450).
27. Kotsiantis, S., Supervised Machine Learning: A Review of Classification Techniques. Informatica, 2007. 31: p. 249-268.
28. Priddy, K. and P. Keller, Artificial Neural Networks: An Introduction. 2005, Washington: SPIE.
29.
Widrow, B. and M. Hoff, Adaptive switching circuits, in WESCON Conference Record. 1989. p. 709-717.
30. Grossberg, S., Adaptive Resonance Theory, in Encyclopedia of Cognitive Science. 2006, John Wiley & Sons, Ltd.
31. Khan, J., et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 2001. 7: p. 673-679.
32. Baxt, W.G., Use of an artificial neural network for data analysis in clinical decision-making: the diagnosis of acute coronary occlusion. Neural Computation, 1990. 2(4): p. 480-489.
33. Gershenson, C., Artificial neural networks for beginners. arXiv preprint cs/0308031, 2003.
34. Neuro AI, Neural networks: a requirement for intelligent systems. Available from: http://www.learnartificialneuralnetworks.com/.
35. Deligiorgi, D. and K. Philippopoulos, Spatial interpolation methodologies in urban air pollution modeling: application for the greater area of metropolitan Athens, Greece, in Advanced Air Pollution. 2011.
36. Larose, D., Discovering Knowledge in Data: An Introduction to Data Mining. 2005, New Jersey: John Wiley & Sons, Inc.
37. Rokach, L. and O. Maimon, eds., Data Mining with Decision Trees. 2008, World Scientific Publishing.
38. Mitchell, T.M., Machine Learning. 2005: McGraw-Hill.
39. Arzucan, O., Supervised and Unsupervised Machine Learning Techniques for Text Document Categorization, in Graduate Program in Computer Engineering. 2004, Bogazici University.
40. Caruana, R. and A. Niculescu-Mizil, An empirical comparison of supervised learning algorithms, in Proceedings of the 23rd International Conference on Machine Learning. 2006, ACM: Pittsburgh, Pennsylvania. p. 161-168.
41. Wu, X., et al., Top 10 algorithms in data mining. Knowledge and Information Systems, 2007. 14(1): p. 1-37.
42. Hian, K. and T. Gerald, Data mining applications in healthcare. Journal of Healthcare Information Management, 2005. 19(2): p. 64-72.
43. Dakins, D., Center takes data tracking to heart.
Health Data Management, 2001. 9(1): p. 32-36.
44. Biafore, S., Predictive solutions bring more power to decision makers. Health Management Technology, 1999. 20(10): p. 12-14.
45. Milley, A., Healthcare and data mining. Health Management Technology, 2000. 21(8): p. 44-47.
46. Hallick, J., Analytics and the data warehouse. Health Management Technology, 2001. 22(6): p. 24-25.
47. Feng, D.D., Biomedical Information Technology. 2008, Amsterdam; Boston: Elsevier/Academic Press.
48. Ennis, R.L., et al., A continuous real-time expert system for computer operations. IBM Journal of Research and Development, 1986. 30(1): p. 14-28.
49. Cios, K.J. and G.W. Moore, Uniqueness of medical data mining. Artificial Intelligence in Medicine, 2002. 26(1-2): p. 1-24.
50. Moore, G.W., et al., A prototype Internet autopsy database: 1625 consecutive fetal and neonatal autopsy facesheets spanning 20 years. Archives of Pathology, 1996. 120(8): p. 782-785.
51. Hand, D.J., Data mining: statistics and more? The American Statistician, 1998. 52(2): p. 112-118.
52. Sanderson, S., Electronic Health Records for Allied Health Careers. 2009: McGraw-Hill.
53. Song, H., et al., New methodology of computer aided diagnostic system on breast cancer, in Proceedings of the Second International Conference on Advances in Neural Networks, Part III. 2005, Springer-Verlag: Chongqing, China. p. 780-789.
54. Arulampalam, G. and A. Bouzerdoum, Application of shunting inhibitory artificial neural networks to medical diagnosis, in The Seventh Australian and New Zealand Intelligent Information Systems Conference. 2001.
55. Setiono, R., Generating concise and accurate classification rules for breast cancer diagnosis. Artificial Intelligence in Medicine, 2000. 18(3): p. 205-219.
56. Meesad, P. and G. Yen, Combined numerical and linguistic knowledge representation and its application to medical diagnosis. Component and Systems Diagnostics, Prognostics, and Health Management II, 2003. 4733: p. 98-109.
57. Han, J.
and M. Kamber, Data Mining: Concepts and Techniques. 3rd ed. 2011, San Francisco: Morgan Kaufmann.
58. Thrun, S.B., et al., The MONK's Problems: A Performance Comparison of Different Learning Algorithms. 1991, Carnegie Mellon University: Pittsburgh, PA.
59. Langley, P. and S. Sage, Induction of selective Bayesian classifiers, in Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI 1994). 1994: p. 399-406.
60. Hall, M.A. and G. Holmes, Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering, 2003. 15(3).
61. Ashraf, M., et al., A new approach for constructing missing features values. International Journal of Intelligent Information Processing, 2012. 3(1): p. 110-118.
62. Guyon, I., et al., An introduction to variable and feature selection. Journal of Machine Learning Research, 2003. 3: p. 1157-1182.
63. Saeys, Y., I. Inza, and P. Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics, 2007. 23(19): p. 2507-2517.
64. Kohavi, R., D. Sommerfield, and J. Dougherty, Data mining using MLC++: a machine learning library in C++. IEEE, 1996.
65. Kohavi, R. and G.H. John, Wrappers for feature subset selection. Artificial Intelligence, 1997. 97(1-2): p. 273-324.
66. Lal, T., et al., Embedded methods, in Feature Extraction, I. Guyon, et al. (Editors). 2006, Springer Berlin Heidelberg. p. 137-165.
67. Kononenko, I., Estimating attributes: analysis and extensions of RELIEF, in Machine Learning: ECML-94. 1994.
68. Hall, M.A., Correlation-based Feature Selection for Machine Learning, in Department of Computer Science. 1999, The University of Waikato: Hamilton.
69. Rutkowski, L., et al., eds., Artificial Intelligence and Soft Computing, Part I. Lecture Notes in Computer Science, Vol. 6113. 2010, Springer: Poland. p. 487-498.
70. Guyon, I. and A. Elisseeff, An introduction to variable and feature selection. Journal of Machine Learning Research, 2003. 3: p.
1157-1182.
71. Jolliffe, I.T., Principal Component Analysis. 2002, Springer: NY.
72. Liu, H. and R. Setiono, A probabilistic approach to feature selection: a filter solution, in Proceedings of the 13th International Conference on Machine Learning. 1996. Morgan Kaufmann.
73. Liu, H. and R. Setiono, Chi2: feature selection and discretization of numeric attributes, in Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence. 1995.
74. Xie, J., et al., Novel hybrid feature selection algorithms for diagnosing erythemato-squamous diseases, J. He, et al. (Editors). 2012, Springer Berlin Heidelberg. p. 173-185.
75. Liao, B., et al., A novel hybrid method for gene selection of microarray data. Journal of Computational and Theoretical Nanoscience, 2012. 9(1): p. 5-9.
76. Vijayasankari, S. and K. Ramar, Enhancing classifier performance via hybrid feature selection and numeric class handling: a comparative study. International Journal of Computer Applications, 2012. 41(17): p. 30-36.
77. Leach, M., Parallelising Feature Selection Algorithms. 2012, University of Manchester: Manchester.
78. Grzymala-Busse, J.W. and W.J. Grzymala-Busse, Handling missing attribute values, in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach (Editors). 2010, Springer US. p. 33-51.
79. Rubin, D.B., Inference and missing data. Biometrika, 1976. 63(3): p. 581-592.
80. Howell, D., Treatment of Missing Data. 2009.
81. Marlin, B., Missing Data Problems in Machine Learning, in Department of Computer Science. 2008, University of Toronto: Canada.
82. Dempster, A., N. Laird, and D. Rubin, Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 1977. 39(1): p. 1-39.
83. Moss, S., Expectation maximization to manage missing data. 2009.
84. Galliers, R., Choosing information systems research approaches, in Information Systems Research: Issues, Methods and Practical Guidelines, R.
Galliers (Editor). 1992, Alfred Waller: Henley-on-Thames. p. 144-162.
85. Dash, N.K., Module: Selection of the Research Paradigm and Methodology. 2005.
86. UCI Machine Learning Repository. [cited 2010]; Available from: http://archive.ics.uci.edu/ml/about.html.
87. Witten, I.H. and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2000: Morgan Kaufmann.
88. Mathworks, Matlab overview. 1994 [cited 01/09/2012]; Available from: http://www.mathworks.com.au/products/matlab/.
89. Jang, J.-S.R., ANFIS: adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man, and Cybernetics, 1993. 23(3): p. 665-685.
90. Übeyli, E.D., Adaptive neuro-fuzzy inference systems for automatic detection of breast cancer. Journal of Medical Systems, 2009. 33(5): p. 353-358.
91. Grzymala-Busse, J., Data with missing attribute values: generalization of indiscernibility relation and rule induction, in Transactions on Rough Sets I, J. Peters, et al. (Editors). 2004, Springer Berlin/Heidelberg. p. 78-95.
92. White, A., Probabilistic induction by dynamic path generation in virtual trees, in Annual Technical Conference of the British Computer Society Specialist Group on Expert Systems. 1986: Brighton (UK). p. 35-46.
93. Kononenko, I., I. Bratko, and E. Roskar, Experiments in automatic learning of medical diagnostic rules, in International School for the Synthesis of Expert's Knowledge Workshop, Bled, Slovenia. 1984.
94. Quinlan, J.R., Induction of decision trees. Machine Learning, 1986. 1(1): p. 81-106.
95. Meng, X. and N. Schenker, Maximum likelihood estimation for linear regression models with right censored outcomes and missing predictors. Computational Statistics & Data Analysis, 1999. 29(4): p. 471-483.
96. Rubin, D.B., Multiple Imputation for Nonresponse in Surveys. 2004, New York: Wiley & Sons.
97.
Marshall, A., et al., Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Medical Research Methodology, 2010. 10(1): p. 7.
98. Santhakumaran, F.P., An algorithm to reconstruct the missing values for diagnosing the breast cancer. Global Journal of Computer Science and Technology, 2010. 10(2): p. 25-28.
99. Simon, R., et al., Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute, 2003. 95(1): p. 14-18.
100. Blum, A.L. and P. Langley, Selection of relevant features and examples in machine learning. Artificial Intelligence, 1997. 97(1-2): p. 245-271.
101. Lee, S.L., Thyroid Problems. 2012 [cited 01/03/2012]; Available from: http://www.emedicinehealth.com/thyroid_problems/article_em.htm.
102. Introducing Hepatitis C. 2012; Available from: http://www.hep.org.au/.
103. Hall, M.A. and L.A. Smith, Feature subset selection: a correlation based filter approach. 1997.
104. Forman, G., An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 2003. 3: p. 1289-1305.
105. Zhang, H. and J. Su, Naïve Bayes for optimal ranking. Journal of Experimental & Theoretical Artificial Intelligence, 2008. 20(2): p. 79-93.
106. Ashraf, M., K. Le, and X. Huang, Information gain and adaptive neuro-fuzzy inference system for breast cancer diagnoses, in International Conference on Computer Sciences and Convergence Information Technology (ICCIT). 2010, IEEE: Seoul. p. 911-915.
107. Buxton, B.F., W.B. Langdon, and S.J. Barrett, Data fusion by intelligent classifier combination. Measurement and Control, 2001. 34(8): p. 229-234.
108. Goonatilake, S. and S. Khebbal, Intelligent Hybrid Systems. 1994: John Wiley & Sons, Inc.
109. Tsoumakas, G., L. Angelis, and I. Vlahavas, Selective fusion of heterogeneous classifiers.
Intelligent Data Analysis, 2005. 9(6): p. 511-525.
110. Džeroski, S. and B. Ženko, Is combining classifiers with stacking better than selecting the best one? Machine Learning, 2004. 54(3): p. 255-273.
111. Kuncheva, L. and C. Whitaker, Feature subsets for classifier combination: an enumerative experiment, in Multiple Classifier Systems, J. Kittler and F. Roli (Editors). 2001, Springer Berlin Heidelberg. p. 228-237.
112. Grabusts, P., The choice of metrics for clustering algorithms, in Proceedings of the 8th International Scientific and Practical Conference, Volume II. 2011. p. 70-76.
113. Jurman, G., S. Riccadonna, R. Visintainer, and C. Furlanello, Canberra distance on ranked lists, in Proceedings of the Advances in Ranking NIPS 09 Workshop. 2009: p. 22-27.