An Approach to Find Missing Values in Medical Datasets

B. Mathura Bai, Faculty of Information Technology, VNR VJIET, Hyderabad, India
N. Mangathayaru, Professor, Dept. of Information Technology, VNR VJIET, Hyderabad, India
B. Padmaja Rani, Professor, Computer Science & Engg. Dept., JNT University, Hyderabad, India

ABSTRACT
Mining medical datasets is a challenging problem for data mining researchers, as these datasets pose several hidden challenges compared to conventional datasets. From the collection of samples through field experiments and clinical trials to classification, there are challenges at every stage of the mining process. The preprocessing phase is itself challenging when we work on medical datasets. One of the prime challenges in mining medical datasets is handling missing values, which is part of the preprocessing phase. In this paper, we address the issue of handling missing values in medical datasets consisting of categorical attribute values. The main contribution of this research is a proposed imputation measure that is used to estimate and fix the missing values. We discuss a case study to demonstrate the working of the proposed measure.

Categories and Subject Descriptors
I.2.6 Learning - Knowledge acquisition; H.2.8 Database applications: Data mining; I.2.6 Artificial Intelligence: Learning - parameter learning

General Terms
analysis, dataset, record, attribute, noise, algorithm

Keywords
medical record, missing values, prediction, classification

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICEMIS '15, September 24-26, 2015, Istanbul, Turkey
© 2015 ACM. ISBN 978-1-4503-3418-1/15/09…$15.00
DOI: http://dx.doi.org/10.1145/2832987.2833083

1. INTRODUCTION
Disease prediction and classification is evolving as an important research problem in the fields of medical and health informatics. Medical data has several hidden challenges when compared to conventional datasets. The first step is sample data collection, which requires defining the attributes that functionally define the target specification. This phase is not free from challenges: the possible presence of missing values for some attributes is the first challenge facing data mining researchers. One simple method of handling missing values is to eliminate every medical record that contains them. But this does not solve the problem, because such a record might contain attribute values that are the deciding factors in predicting a disease in some situations. If we do not eliminate the records with missing values, the alternative is to use them for training, in which case we must fix the missing values. Researchers have done this in different ways. The most common way is to use a distance measure to impute the missing values and replace them with suitable values. This requires a proper estimate, because a poor estimate leads to a faulty outcome.

The problem of mining medical records may be divided into the problem of handling the missing values and the problems of classification and clustering. Classification requires that the class label of each record is available. This is a supervised learning process.
However, clustering medical data to group similar medical records or patients is an unsupervised learning process. Most works in the literature use the Euclidean distance measure to fix the missing values. Some researchers cluster the medical data with the k-means algorithm and then fix the missing values. Our findings indicate that there is considerable scope for research on fixing missing values by designing new distance measures that satisfy the basic properties of distance measures. With such a measure, we can estimate the missing values of the records that carry a class label.

In this paper, Section 2 discusses the related works. Section 3 points out the research issues in medical data mining. Section 4 introduces the problem definition and a generalized approach for handling medical datasets.

2. RELATED WORKS
Missing values are common when field experiments and clinical trials are carried out. These missing values pose hidden challenges to the data miner and analyst. In this section, we discuss research works carried out recently in healthcare and medical data mining. In [1], the authors discuss the top 10 data mining algorithms. The works in [2, 3, 4, 5] predict heart disease using data mining approaches. The work in [6] deals with knowledge discovery issues in data mining. The work in [7] deals with an evolutionary approach to data mining, and [8] deals with finding interesting rules in a dataset. In [9], the authors apply a frequent itemset finding algorithm to discover periodically frequent patterns in transaction data. The authors of [10, 14, 15] propose an approach for handling missing values when the feature values are continuous, using a non-parametric discretization technique. They impute the missing values in the medical records using the concept of z-score.
This requires the underlying data to be continuous. In [11, 13], the authors use imputation to handle missing values in a dengue fever dataset. They design a procedure to impute the missing attribute values for attributes that may be numeric or categorical. They also use a decision tree to generate efficient decision rules for classification. In [12], the authors compare the approaches available in the literature for handling missing values on benchmark datasets. In [17, 18], the authors perform classification of medical records. The research in [26-40] includes contributions of various researchers towards finding missing values, classification, prediction, and clustering of medical records.

3. RESEARCH CHALLENGES
3.1 Dimensionality Reduction
Dimensionality reduction for medical datasets must handle the data without any loss. The conventional approach, in which we reduce the dataset by eliminating irrelevant or least important features during preprocessing (for example, stop-word elimination and stemming), does not hold good for medical datasets. Dimensionality reduction is a major concern when dealing with large datasets and very large databases, and it becomes much more complex when mining medical records, because we must reduce the dimensionality while retaining all the information, which is not as simple as it appears. Dimensionality reduction techniques have been discussed extensively in the literature. The preprocessing phase also requires normalizing the data so that it can be handled efficiently. Once the data is normalized, the next problem is the missing values.
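The normalization mentioned at the end of Section 3.1 is not specified further in the text. As a minimal sketch, assuming min-max scaling is an acceptable choice and that missing entries are represented by `None` (both are assumptions made here, and `min_max_normalize` is a hypothetical helper name), one attribute column could be normalized as follows while leaving the missing entries in place for a later imputation step:

```python
def min_max_normalize(column):
    """Scale the non-missing values of one attribute to [0, 1]; None stays None.

    Assumes the column holds at least one non-missing value.
    """
    present = [v for v in column if v is not None]
    lo, hi = min(present), max(present)
    if hi == lo:
        # Constant attribute: map every present value to 0.0 to avoid dividing by zero.
        return [None if v is None else 0.0 for v in column]
    return [None if v is None else (v - lo) / (hi - lo) for v in column]


# Illustrative values only; None marks a missing reading.
scaled = min_max_normalize([10, 5, 7, 10, None, 3])
print(scaled)
```

Keeping the missing entries as `None` instead of dropping them preserves the record order, so the normalized column can still be aligned with the other attributes of each record.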
3.2 Missing Values
A peculiar problem that arises when handling medical datasets is the presence of missing values, which is very natural because the data may have been collected or sampled from various clinical trials, field experiments, or other traditional mechanisms. Proper care must be taken to handle missing values suitably and accurately. A simple approach would be to discard every record that contains a missing attribute value. But this does not solve the problem, as we lose the record, and the record may be an important, deciding factor in prediction and classification. The problem has been widely addressed by data mining practitioners and researchers, who have explored various approaches to handling missing values [26-33]. Handling missing values requires either designing suitable imputation measures or using existing ones. If we design a new measure, it should satisfy the typical properties of a distance measure (non-negativity, symmetry, and the triangle inequality). There is scope for research in this direction: researchers may come up with suitable measures, or alternatively validate the use of mathematical functions for fixing and filling missing values.

3.3 Functional Relation among Attributes
Another important challenge that cannot be left unexplored is discovering the functional relations among the attributes of medical datasets. This must be handled very carefully, as the features or attributes are the major elements.

3.4 Imputation Measure
The choice of a suitable imputation mechanism, approach, and measure is critical for handling medical data. Estimating missing values with an effective imputation measure is the deciding stage when performing classification and prediction. The training set and the number of samples are also critical when handling medical datasets.
One approach to handling missing values is to use an existing imputation measure that can fix the missing values efficiently. Alternatively, we may design a suitable imputation measure, use it to estimate the missing value, and replace that missing value with the equivalent value computed by the measure. Some of the works in this direction include [10, 14].

3.5 Deciding on the Number of Attributes
One of the challenging problems in data mining is the number of attributes. As the number of attributes increases, the complexity of handling the dataset also increases. This is because the number of attribute combinations grows with every added attribute, which makes an exhaustive analysis of attribute subsets NP-hard; the complexity grows non-linearly. Dimensionality reduction techniques such as feature selection, feature reduction, PCA, and SVD are usually applied when the number of attributes is very large. In some situations, domain experts and domain knowledge may help reduce the attributes. Domain knowledge may also be useful for obtaining the dependency relations among the attributes. Parameter tuning is also one of the problems when handling medical databases and datasets.

3.6 Update Anomaly
Medical data is not static. It is time-variant and multidimensional. Since medical data keeps being updated by readings arriving from laboratory tests, the data mining approaches, methods, and algorithms designed for it should be able to learn dynamically and to build and update the knowledge base incrementally. Incremental data mining approaches are better suited to medical databases than static methods. Techniques such as incremental frequent pattern mining, incremental clustering, and dynamic itemset finding algorithms handle the time-varying nature of medical datasets.
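Section 3.6 argues for methods that update their knowledge incrementally as new readings arrive. One quantity that similarity-based imputation commonly needs is the per-attribute standard deviation, and it can be maintained without rescanning the data. The sketch below uses Welford's online algorithm, which is a standard technique chosen here for illustration and is not taken from the paper; the class name `RunningStd` and the input values are hypothetical.

```python
import math


class RunningStd:
    """Welford's online algorithm for the sample standard deviation."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new reading into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        """Sample standard deviation (divides by n - 1)."""
        if self.n < 2:
            raise ValueError("need at least two values")
        return math.sqrt(self.m2 / (self.n - 1))


tracker = RunningStd()
for reading in [5, 7, 5, 9, 5, 6, 6]:  # illustrative readings arriving one at a time
    tracker.update(reading)
print(tracker.std())
```

Each update costs constant time and memory, so the statistic stays current as laboratory readings arrive, which is the property the section asks of incremental methods.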
3.7 Problem with Data Interpretation
Medical data usually comes from various homogeneous and heterogeneous sources, which brings problems such as data inconsistency, data incompleteness, and difficulty of data interpretation. This is because different sources may use different terminologies, standards, and attributes. The attributes may be continuous or not.

3.8 Imbalanced Nature of Medical Datasets
Medical datasets are usually imbalanced. In such a case, the dataset must be preprocessed before it is used. The available preprocessing techniques must be applied suitably to handle such anomalies.

4. PROBLEM DEFINITION
Given a dataset of medical records, the problem is to classify each medical record into one of the labeled classes. Applying data mining techniques to medical datasets helps in identifying hidden knowledge that supports prediction and classification. There are various approaches for classifying and clustering datasets, and each algorithm has its own pros and cons. Moreover, the dataset to which we apply these techniques may not be free from missing values, since not every attribute value of the schema is recorded for every record. The first step in handling the dataset may therefore itself be challenging, because it requires handling missing values. There is scope for research on designing algorithms for handling missing values in medical records.

Once we handle the missing values in the medical dataset, the dataset M is transformed to M'. M' must then be normalized to make the medical dataset suitable for classification using the existing algorithms. The third challenge is identifying the outlier attributes or features and reducing M' to M''. This process is called dimensionality reduction. Finally, we may classify a record into one of the available class labels. To handle missing values, we must know the class label of each record; otherwise the process becomes much more complex. All the existing algorithms for handling missing values impose the constraint that the class labels be known a priori, and they use the labels to fill the missing values. If the class labels are missing, we must first identify the nearest class label and then fill the missing values in the medical records.

5. PROPOSED MEASURE
5.1 IMPUTATION ALGORITHM
Step-1: Place the dataset in record-attribute matrix format, with records as rows and attributes as columns.
Step-2: Decompose the medical dataset into two parts. The first part consists of the records with no missing values, and the second part consists of the records with missing values.
Step-3: Normalize the datasets obtained in Step-2 suitably. This involves mapping the attribute values to a domain on which the proposed measure can be applied, i.e., we fix the domain of the attribute values.
Step-4: Choose one record with a missing attribute value (or values) at a time, and fix the missing values of that record. We call this record the test record.
Step-5: From the test record (second part of the dataset), discard the attribute column that contains the missing value. Similarly, discard this attribute column from the records with no missing values (first part of the dataset).
Step-6: Apply the proposed measure to find the similarity between the test record and each of the records without missing values.
Step-7: The class label of the test record is the label of the record to which the test record shows maximum similarity.
Step-8: Fill the missing attribute value of the test record with the corresponding attribute value of the record to which it has maximum similarity. In case of ambiguity, we may also consider the frequency of occurrence.
Step-9: Repeat the above steps for each record having missing values.

5.2 IMPUTATION MEASURE
In this section, we discuss the proposed imputation measure, which is used to fix the missing attribute values present in the medical records. The proposed imputation measure, IMMV, is defined by Equation (1), written here in the form that reproduces the similarity values of Tables 6 and 7:

$$IMMV(R_i, R_j) = \frac{1}{m} \sum_{k=1}^{m} Sim_k(R_i, R_j) \qquad (1)$$

where

$$Sim_k(R_i, R_j) = 0.5 \left[ 1 + \exp\left( -\left( \frac{V_{ik} - V_{jk}}{\sigma_k} \right)^{2} \right) \right] \qquad (2)$$

The parameters $V_{ik}$ and $V_{jk}$ denote the values of the kth attribute in records $R_i$ and $R_j$, and $m$ is the number of attributes that remain after the attribute with the missing value is discarded. Similarly, $\sigma_k$ denotes the standard deviation of the kth attribute over the records of the dataset that have no missing values. The distance function is given by

$$Distance(R_i, R_j) = 1 - IMMV(R_i, R_j) \qquad (3)$$

6. CASE STUDY
Consider Table 1, which has nine medical records, two of which (R3 and R5) contain missing values.

Table 1. Sample Dataset, MD
      A1    A2   A3    A4
R1    a11   5    a31   10
R2    a13   7    a31   5
R3    a11   7    ?     7
R4    a12   5    a31   10
R5    a13   3    a32   ?
R6    a12   9    a31   10
R7    a11   5    a32   3
R8    a13   6    a32   7
R9    a12   6    a32   10

Table 2 shows the records R1, R2, R4, R6, R7, R8, and R9, which have all attribute values defined. There are no missing values in these records.

Table 2. Records with no missing values
      A1    A2   A3    A4
R1    a11   5    a31   10
R2    a13   7    a31   5
R4    a12   5    a31   10
R6    a12   9    a31   10
R7    a11   5    a32   3
R8    a13   6    a32   7
R9    a12   6    a32   10

Table 3 shows the records R3 and R5, which are missing the values of attributes A3 and A4, respectively.

Table 3. Records with Missing Values
      A1    A2   A3    A4
R3    a11   7    ?     7
R5    a13   3    a32   ?

The values of attributes A1 and A3 are mapped to a different domain to make them suitable for processing by the algorithm and the proposed measure: a11, a12, and a13 are mapped to 1, 2, and 3, and a31 and a32 are mapped to 1 and 2 (Table 5). Table 4 records the standard deviation of each attribute, computed as the sample standard deviation over the mapped records with no missing values. Similarly, Table 8 shows the mapping of attribute values for records R3 and R5.

Table 4. Standard deviation of attributes
A1         A2        A3         A4
0.816497   1.46385   0.534522   2.91139

Table 5. Mapping A1 and A3 attribute values
          A1       A2       A3       A4
R1        1        5        1        10
R2        3        7        1        5
R4        2        5        1        10
R6        2        9        1        10
R7        1        5        2        3
R8        3        6        2        7
R9        2        6        2        10
Std.dev   0.8164   1.4638   0.5345   2.9113

The similarity values of records R3 and R5 with respect to records R1, R2, R4, R6, R7, R8, and R9 are shown in Table 6 and Table 7, respectively.

Table 6. Similarity of Record R3 with Records in Table 2
Record   Similarity with R3
R1       0.750079
R2       0.771048
R4       0.6206
R6       0.6206
R7       0.717678
R8       0.771595
R9       0.699342

Table 7. Similarity of Record R5 with Records in Table 2
Record   Similarity with R5
R1       0.531219
R2       0.671795
R4       0.567994
R6       0.542221
R7       0.692853
R8       0.835833
R9       0.706354

Table 8. Mapping Records with Missing Values
      A1   A2   A3   A4
R3    1    7    ?    7
R5    3    3    2    ?

In Tables 6 and 7, both R3 and R5 show maximum similarity with record R8 (0.771595 and 0.835833, respectively). Hence we fix the missing attribute values of these records using record R8: A3 of R3 takes the mapped value 2 (a32), and A4 of R5 takes the value 7. Table 9 shows the mapped records after the missing values are fixed, and Table 10 shows the attribute values of the records after fixing missing values using the proposed measure.

Table 9. Records with Mapping Values fixed
      A1   A2   A3   A4
R3    1    7    2    7
R5    3    3    2    7

Table 10. Records with Missing Values fixed
      A1   A2   A3    A4
R3    1    7    a32   7
R5    3    3    2     7

In the original domain of Table 1, these records correspond to R3 = (a11, 7, a32, 7) and R5 = (a13, 3, a32, 7).

7. CONCLUSIONS
Missing values are very common in medical data. Handling missing values is a challenging and interesting area of research from the perspective of data mining and information retrieval. In this work, we propose an imputation measure that is used to fix the missing attribute values of the records of a dataset. This is done by finding similarity values between records with missing values and records without missing values. A case study is discussed that shows the procedure for fixing missing values.

8. ACKNOWLEDGEMENTS
We are thankful to the anonymous reviewers for their valuable suggestions, which helped bring this work to completion. This work is not funded by any public, private, or government agency and has been carried out as part of research work. We are also thankful to the Head of the Department, IT, and the Principal, VNR VJIET, for motivating us towards research.

REFERENCES
[1] Xindong Wu, Vipin Kumar, J.
Ross Quinlan et al. 2007. Top 10 algorithms in data mining. Knowledge Inf. Syst. 14, 1, 1-37. DOI=10.1007/s10115-007-0114-2 http://dx.doi.org/10.1007/s10115-007-0114-2
[2] M. A. Jabbar et al. An Evolutionary algorithm for Heart Disease Prediction. Communications in Computer and Information Science. Springer Verlag. Volume 292, pp. 378-389. http://dx.doi.org/10.1007/978-3-642-31686-9_44
[3] M. A. Jabbar, B. L. Deekshatulu, Priti Chandra. Graph Based Approach for Heart Disease Prediction. Proceedings of the Third International Conference on Trends in Information, Telecommunication and Computing, Volume 150 of the series Lecture Notes in Electrical Engineering, pp. 465-474.
[4] V. Krishnaiah, G. Narsimha, N. Subhash Chandra. Heart Disease Prediction System Using Data Mining Technique by Fuzzy K-NN Approach. Emerging ICT for Bridging the Future - Proceedings of the 49th Annual Convention of the Computer Society of India (CSI). Series Advances in Intelligent Systems and Computing, Volume 337, pp. 371-384.
[5] Sujata Joshi, Mydhili K. Nair. Prediction of Heart Disease Using Classification Based Data Mining Techniques. Computational Intelligence in Data Mining - Volume 2, Volume 32 of the series Smart Innovation, Systems and Technologies, pp. 503-511.
[6] John F. Roddick, Peter Fule, and Warwick J. Graco. 2003. Exploratory medical knowledge discovery: experiences and issues. SIGKDD Explor. Newsl. 5, 1 (July 2003), 94-99. DOI=10.1145/959242.959243 http://doi.acm.org/10.1145/959242.959243
[7] Michele Sebag, Jerome Aze, Noel Lucas. ROC-Based Evolutionary Learning: Application to Medical Data Mining. Artificial Evolution, Volume 2936 of the series Lecture Notes in Computer Science, pp. 384-396.
[8] Miho Ohsaki et al. Investigation of Rule Interestingness in Medical Data Mining. Active Mining, Volume 3430 of the series Lecture Notes in Computer Science, pp. 174-189.
[9] Mohammed Abdul Khaleel, G. N. Dash, K. S. Choudhury, Mohiuddin Ali Khan. Medical Data Mining for Discovering Periodically Frequent Diseases from Transactional Databases. Computational Intelligence in Data Mining - Volume 1, Volume 31 of the series Smart Innovation, Systems and Technologies, pp. 87-96.
[10] G. Madhu et al. A Non-Parametric Discretization Based Imputation Algorithm for Continuous Attributes with Missing Data Values. International Journal of Information Processing, 8(1), 64-72, 2014.
[11] Sreehari Rao et al. A New Intelligence-Based Approach for Computer-Aided Diagnosis of Dengue Fever. IEEE Transactions on Information Technology in Biomedicine, Jan. 2012, 112-118.
[12] M. Naresh Kumar. Performance comparison of State-of-the-art Missing Value Imputation Algorithms on Some Benchmark Datasets.
[13] V. Sree Hari Rao et al. Novel Approaches for Predicting Risk Factors of Atherosclerosis. IEEE Journal of Biomedical and Health Informatics, Jan. 2013, 183-189.
[14] G. Madhu et al. A novel index measure imputation algorithm for missing data values: A machine learning approach. IEEE International Conference on Computational Intelligence & Computing Research, Dec. 2012.
[15] A Novel Discretization Method for Continuous Attributes: A Machine Learning Approach. International Journal of Data Mining and Emerging Technologies 4 (1), 34-43.
[16] G. Madhu et al. Improve the Classifier Accuracy for Continuous Attributes in Biomedical Datasets Using a New Discretization Method. Procedia Computer Science, 2014, Vol. 31, 671-79.
[17] A. Casillas et al. First approaches on Spanish medical record classification using Diagnostic Term to class transduction. Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, 60-64.
[18] Tomasz Orczyk. Influence of Missing data imputation method on the classification accuracy of medical data. Journal of Medical Informatics & Technologies, Vol. 22/2013, ISSN 1642-6037.
[19] Krzysztof J. Cios and G. William Moore. 2002. Uniqueness of medical data mining. Artif. Intell. Med. 26, 1-2 (September 2002), 1-24.
[20] George Tzanis. 2014. Biological and Medical Big Data Mining. Int. J. Knowl. Disc. Bioinfo. 4, 1 (January 2014), 42-56. DOI=10.4018/ijkdb.2014010104
[21] Ashkan Sami. 2006. Obstacles and misunderstandings facing medical data mining. In Proceedings of the Second international conference on Advanced Data Mining and Applications (ADMA'06), Xue Li, Osmar R. Zaïane, and Zhan-huai Li (Eds.). Springer-Verlag, Berlin, Heidelberg, 856-863.
[22] Sean N. Ghazavi and Thunshun W. Liao. 2008. Medical data mining by fuzzy modeling with selected features. Artif. Intell. Med. 43, 3 (July 2008), 195-206.
[23] Tamas Sandor Gal. 2011. Privacy Preserving Data Mining for Medical Data. Ph.D. Dissertation. University of Maryland at Baltimore County, Catonsville, MD, USA. Advisor(s) Zhiyuan Chen. AAI3490994.
[24] Andrea L. Houston, Hsinchun Chen, Susan M. Hubbard, Bruce R. Schatz, Tobun D. Ng, Robin R. Sewell, and Kristin M. Tolle. 1999. Medical Data Mining on the Internet: Research on a Cancer Information System. Artif. Intell. Rev. 13, 5-6 (December 1999), 437-466.
[25] M. Akhil Jabbar, B. L. Deekshatulu, Priti Chandra. Classification of Heart Disease Using K-Nearest Neighbor and Genetic Algorithm. Procedia Technology, Volume 10, 2013, 85-94, ISSN 2212-0173.
[26] Martti Juhola and Jorma Laurikkala. 2013. Missing values: how many can they be to preserve classification reliability? Artif. Intell. Rev. 40, 3 (October 2013), 231-245. DOI=10.1007/s10462-011-9282-2 http://dx.doi.org/10.1007/s10462-011-9282-2
[27] Geaur Rahman and Zahidul Islam. 2011. A decision tree-based missing value imputation technique for data pre-processing. In Proceedings of the Ninth Australasian Data Mining Conference - Volume 121 (AusDM '11), Peter Vamplew, Andrew Stranieri, Kok-Leong Ong, Peter Christen, and Paul J. Kennedy (Eds.), Vol. 121. Australian Computer Society, Inc., Darlinghurst, Australia, 41-50.
[28] Pilsung Kang. 2013. Locally linear reconstruction based missing value imputation for supervised learning. Neurocomput. 118, 65-78.
[29] Krzysztof Simiński. 2013. Clustering with Missing Values. Fundam. Inf. 123, 3 (July 2013), 331-350. DOI=10.3233/FI-2013-814 http://dx.doi.org/10.3233/FI-2013-814
[30] Archana Purwar and Sandeep Kumar Singh. 2015. Hybrid prediction model with missing value imputation for medical data. Expert Syst. Appl. 42, 13 (August 2015), 5621-5631. DOI=10.1016/j.eswa.2015.02.050 http://dx.doi.org/10.1016/j.eswa.2015.02.050
[31] Alireza Farhangfar, Lukasz Kurgan, and Jennifer Dy. 2008. Impact of imputation of missing values on classification error for discrete data. Pattern Recogn. 41, 12 (December 2008), 3692-3705.
[32] Shariq Bashir, Saad Razzaq, Umer Maqbool, Sonya Tahir, and A. Rauf Baig. 2006. Using association rules for better treatment of missing values. In Proceedings of the 10th WSEAS international conference on Computers (ICCOMP'06), Zoran S. Bojkovic (Ed.). World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, Wisconsin, USA, 1133-1138.
[33] Hai Wang and Shouhong Wang. 2007. A data mining approach to assessing the extent of damage of missing values in survey. Int. J. Bus. Intell. Data Min. 2, 3 (October 2007), 249-260. DOI=10.1504/IJBIDM.2007.015484
[34] Leila Ben Othman and Sadok Ben Yahia. 2006. Yet another approach for completing missing values. In Proceedings of the 4th international conference on Concept lattices and their applications (CLA'06), Sadok Ben Yahia, Engelbert Mephu Nguifo, and Radim Belohlavek (Eds.). Springer-Verlag, Berlin, Heidelberg, 155-169.
[35] Heiko Timm, Christian Döring, and Rudolf Kruse. 2003. Differentiated treatment of missing values in fuzzy clustering. In Proceedings of the 10th international fuzzy systems association World Congress conference on Fuzzy sets and systems (IFSA'03), Taner Bilgiç, Bernard De Baets, and Okyay Kaynak (Eds.). Springer-Verlag, Berlin, Heidelberg, 354-361.
[36] Estevam R. Hruschka, Jr. and Nelson F. F. Ebecken. 2002. Missing values prediction with K2. Intell. Data Anal. 6, 6 (December 2002), 557-566.
[37] Selma T. Milagre, Carlos Dias Maciel, Jose Carlos Pereira, and Adriano A. Pereira. 2009. Fuzzy cluster stability analysis with missing values using resampling. Int. J. Bioinformatics Res. Appl. 5, 2 (March 2009), 207-223.
[38] Hai Wang and Shouhong Wang. 2007. Visualization of the critical patterns of missing values in classification data. In Proceedings of the 9th international conference on Advances in visual information systems (VISUAL'07), Guoping Qiu, Clement Leung, Xiangyang Xue, and Robert Laurini (Eds.). Springer-Verlag, Berlin, Heidelberg, 267-274.
[39] Chi-Chun Huang and Hahn-Ming Lee. 2004. A Grey-Based Nearest Neighbor Approach for Missing Attribute Value Prediction. Applied Intelligence 20, 3 (May 2004), 239-252.
[40] Xiaodong Feng, Sen Wu, and Yanchi Liu. 2011. Imputing missing values for mixed numeric and categorical attributes based on incomplete data hierarchical clustering. In Proceedings of the 5th international conference on Knowledge Science, Engineering and Management (KSEM'11). Springer-Verlag, Heidelberg, 414-424.
[41] Federico Cismondi, André S. Fialho, Susana M. Vieira, Shane R. Reti, João M. C. Sousa, and Stan N. Finkelstein. 2013. Missing data in medical databases: Impute, delete or classify? Artif. Intell. Med. 58, 1 (May 2013), 63-72. DOI=10.1016/j.artmed.2013.01.003 http://dx.doi.org/10.1016/j.artmed.2013.01.003
[42] Atif Khan, John A. Doucette, and Robin Cohen. 2013. Validation of an ontological medical decision support system for patient treatment using a repository of patient data: Insights into the value of machine learning. ACM Trans. Intell. Syst. Technol. 4, 4, Article 68 (October 2013), 31 pages. DOI=10.1145/2508037.2508049 http://doi.acm.org/10.1145/2508037.2508049
[43] Jau-Huei Lin and Peter J. Haug. 2008. Exploiting missing clinical data in Bayesian network modeling for predicting medical problems. J. of Biomedical Informatics 41, 1 (February 2008), 1-14. DOI=10.1016/j.jbi.2007.06.001 http://dx.doi.org/10.1016/j.jbi.2007.06.001
[44] Karla L. Caballero Barajas and Ram Akella. 2015. Dynamically Modeling Patient's Health State from Electronic Medical Records: A Time Series Approach. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). 69-78. DOI=http://dx.doi.org/10.1145/2783258.2783289
[45] Zhenxing Qin, Shichao Zhang, and Chengqi Zhang. 2006. Missing or absent? A Question in Cost-sensitive Decision Tree. In Proceedings of the 2006 conference on Advances in Intelligent IT: Active Media Technology 2006, Yuefeng Li, Mark Looi, and Ning Zhong (Eds.). IOS Press, 118-125.
[46] Shobeir Fakhraei, Hamid Soltanian-Zadeh, Farshad Fotouhi, and Kost Elisevich. 2010. Effect of classifiers in consensus feature ranking for biomedical datasets. In Proceedings of the ACM fourth international workshop on Data and text mining in biomedical informatics (DTMBIO '10). 67-68.
[47] Bogdan Gabrys. 2009. Learning with Missing or Incomplete Data. In Proceedings of the 15th International Conference on Image Analysis and Processing (ICIAP '09), Pasquale Foggia, Carlo Sansone, and Mario Vento (Eds.). Springer-Verlag, Berlin, Heidelberg, 1-4. DOI=10.1007/978-3-642-04146-4_1 http://dx.doi.org/10.1007/978-3-642-04146-4_1
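
The imputation procedure of Section 5 and the case study of Section 6 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes the measure is the mean, over the attributes that are present in the test record, of 0.5 * [1 + exp(-((V_ik - V_jk) / sigma_k)^2)], with sigma_k taken as the sample standard deviation over the complete records; this is the form that reproduces the similarity values reported in Tables 6 and 7. The function names (`attribute_stdevs`, `immv`, `impute`) and the use of `None` for a missing value are hypothetical choices made for this sketch.

```python
import math
from statistics import stdev

# Mapped complete records from Table 5 (a11, a12, a13 -> 1, 2, 3; a31, a32 -> 1, 2).
COMPLETE = {
    "R1": [1, 5, 1, 10],
    "R2": [3, 7, 1, 5],
    "R4": [2, 5, 1, 10],
    "R6": [2, 9, 1, 10],
    "R7": [1, 5, 2, 3],
    "R8": [3, 6, 2, 7],
    "R9": [2, 6, 2, 10],
}
# Mapped records with missing values from Table 8 (None marks "?").
INCOMPLETE = {"R3": [1, 7, None, 7], "R5": [3, 3, 2, None]}


def attribute_stdevs(records):
    """Sample standard deviation of each attribute over the complete records."""
    return [stdev(column) for column in zip(*records.values())]


def immv(test_record, complete_record, sigmas):
    """Mean per-attribute similarity, skipping attributes missing in test_record."""
    sims = [
        0.5 * (1.0 + math.exp(-(((x - y) / s) ** 2)))
        for x, y, s in zip(test_record, complete_record, sigmas)
        if x is not None
    ]
    return sum(sims) / len(sims)


def impute(test_record, complete, sigmas):
    """Fill missing attributes from the most similar complete record (Steps 5 to 8)."""
    scores = {name: immv(test_record, vals, sigmas) for name, vals in complete.items()}
    best = max(scores, key=scores.get)  # ties resolve to the first record encountered
    filled = [
        complete[best][k] if value is None else value
        for k, value in enumerate(test_record)
    ]
    return filled, best, scores


SIGMAS = attribute_stdevs(COMPLETE)
for name, record in INCOMPLETE.items():
    filled, best, scores = impute(record, COMPLETE, SIGMAS)
    print(name, "->", filled, "most similar:", best)
```

Under these assumptions the sketch selects R8 as the most similar record for both R3 and R5 and fills them as [1, 7, 2, 7] and [3, 3, 2, 7], which agrees with Table 9. The tie-breaking rule is a simplification: Step-8 of the algorithm suggests considering the frequency of occurrence when the maximum similarity is ambiguous, which this sketch does not do.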