國立雲林科技大學 National Yunlin University of Science and Technology
Intelligent Database Systems Lab, N.Y.U.S.T. I.M.

Exploiting data preparation to enhance mining and knowledge discovery
  Advisor: Dr. Hsu
  Graduate: Keng-Wei Chang
  Authors: Balaji Rajagopalan, Mark W. Isken
  IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS-PART C: APPLICATIONS AND REVIEWS, VOL. 31, NO. 4, NOVEMBER 2001

Outline
  Motivation
  Objective
  Introduction
  Data Preparation
  Research Method
  Results

Motivation
  Organizations want to use their data for mining and knowledge discovery
  Organizational data is not amenable to mining in its natural form

Objective
  Show that data enhancement, by the introduction of new attributes along with judicious aggregation of existing attributes, results in higher quality knowledge discovery
  Examine the differential impact on the performance of different mining algorithms

Introduction
  Exponential growth of information presents knowledge workers with a tremendous volume of data
  Knowledge management solutions
    Knowledge repository
    Knowledge sharing
    Knowledge discovery

Data Preparation
  Presents a framework based on prior research in knowledge discovery
    Data quality
    Data characteristics
    Data preparation

Research Method
  A data set from a large tertiary care hospital in the United States was used
  Topics
    A. Problem Domain
    B. Data
    C. Clustering Algorithms for Knowledge Discovery
    D. Entropy-Based Metrics for Cluster Quality Assessment
    E. Rule Extraction Metrics

Problem Domain
  Allocation of inpatient beds
  The more difficult part is quantitative resource allocation over a manageable set of patient types
  Quantitative resource: the sequence of hospital units visited and the corresponding lengths of stay
  Patient type: a group of patients consuming a similar level of hospital resources
  This is referred to as the patient classification problem
  Too few vs. too many patient types
  The key is to identify the set of patient types

Data
  Inpatient obstetrical and gynecological (OB/GYN) patient flow
  There are numerous fields
    Demographics
    Physician information
    ICD9-CM diagnostic and procedure codes
    Diagnosis-related groups (DRGs)
  Almost 500 DRGs are defined; those in the range 353-384 are related to OB/GYN
  These DRGs are grouped into five DRG types

Clustering Algorithms for Knowledge Discovery
  K-means and Kohonen self-organizing maps
  Similarity: Euclidean distance function
    d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

Entropy-Based Metrics for Cluster Quality Assessment
  Let n_{ij} be the number of cases having a DRG type of i in cluster j
    p_{ij} = n_{ij} / \sum_{l} n_{lj}
  Entropy
    E_j = -\sum_{i} p_{ij} \log_2 p_{ij}
  Weighted entropy
    Using cluster size as the weight, calculate a weighted average entropy measure for a cluster solution
  Purity
    P_j = \max_{i} p_{ij}

Rule Extraction Metrics
  Most of the extracted rules are expected to show a high degree of resonance with our domain knowledge

Results
  Details the data enhancements relevant to this study
    A. Data Preparation: Basics
    B. Mining and Knowledge Discovery
    C. Differential Impact Based on Clustering Method
    D. Usefulness of Knowledge Discovered
    E. Limitations
    F. Implications for Research and Practice

Data Preparation: Basics
  The data set included fields that represent the path and the associated lengths of stay along that path
  Three data sets are considered in order to illustrate the impact of data preparation
  ED1
    Eight numeric variables
  ED2
    Both DRG and CCS were designed to serve as aggregate measures of hospital resource consumption
    In addition to the ED1 variables, ED2 adds five nominal variables
  ED3
    In addition to the ED2 variables, ED3 contains two binary variables
      Whether or not the patient gave birth during the visit
      Whether or not the patient gave birth via C-section

Mining and Knowledge Discovery

Differential Impact Based on Clustering Method

Usefulness of Knowledge Discovered

Limitations
  The findings may not be exactly applicable in every case
  Only two data mining algorithms were examined
    K-means and Kohonen self-organizing maps
  The study is illustrative, not exhaustive
  Domain knowledge played a critical role in the data preparation process

Implications for Research and Practice
  Provides empirical evidence demonstrating the impact of data preparation on mining and knowledge discovery
  Engage in a comparative investigation of multiple algorithms

Personal opinion
  …
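
Worked sketch: similarity and cluster quality metrics
  The Euclidean distance, entropy, weighted entropy, and purity defined on the Research Method slides can be illustrated with a short Python sketch. It is not the authors' implementation. The function names, the toy labels "A" and "B", and the choice of n_j / N (cluster size over total cases) as the weight are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def euclidean_distance(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2) for two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def cluster_quality(drg_types, clusters):
    """Return (entropy, purity, weighted_entropy) for one cluster solution.

    drg_types[k] is the DRG type of case k and clusters[k] is its cluster.
    entropy and purity are dicts keyed by cluster id.
    """
    counts = defaultdict(Counter)  # counts[j][i] = n_ij
    for drg_type, cluster in zip(drg_types, clusters):
        counts[cluster][drg_type] += 1

    total = len(drg_types)
    entropy, purity = {}, {}
    weighted_entropy = 0.0
    for j, n_j in counts.items():
        size = sum(n_j.values())                   # sum_l n_lj
        probs = [n / size for n in n_j.values()]   # p_ij
        entropy[j] = -sum(p * math.log2(p) for p in probs)
        purity[j] = max(probs)                     # P_j = max_i p_ij
        # Assumed weighting: each cluster counts in proportion to its size.
        weighted_entropy += (size / total) * entropy[j]
    return entropy, purity, weighted_entropy


# Toy data (hypothetical): cluster 0 mixes two DRG types, cluster 1 is pure.
toy_types = ["A", "A", "B", "B", "A", "A"]
toy_clusters = [0, 0, 0, 0, 1, 1]
toy_entropy, toy_purity, toy_weighted = cluster_quality(toy_types, toy_clusters)
```

  In the toy data, the evenly mixed cluster has the highest possible entropy for two types and a purity of one half, while the pure cluster has zero entropy and a purity of one. Lower weighted entropy therefore indicates a cluster solution whose clusters align more closely with the DRG types.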