KNOWLEDGE DISCOVERY AND DATA MINING¹ ON THE EXAMPLE OF CLINICAL DATABASES

Magdalena Topczewska¹, Leon Bobrowski¹,²
¹ Institute of Computer Science, Technical University of Bialystok, Poland
² Institute of Biocybernetics and Biomedical Engineering PAS, Warsaw, Poland

INTRODUCTION

During the last decade, progress in computer technology has made it possible to collect large amounts of both quantitative and qualitative data in databases. In the opinion of many specialists, our ability to collect information has surpassed our ability to analyze it. This fact motivated the creation of a new field of research, closely tied to current practice, known as “data mining and knowledge discovery in databases”. Two international congresses have been devoted so far to this line of research, and a first specialised journal entitled “Data Mining” has been issued [1], [2].

During the development of new data mining techniques, methods from other scientific disciplines are often adopted [3]. These include multidimensional statistical analysis, machine learning, decision trees, neural networks, Bayesian networks, genetic algorithms, and rough and fuzzy set approaches. Data visualisation techniques also play an important role.

Data mining methods may be applied, among other uses, in support of business decision making and economic advising [4]. With these methods, decision rules can be created from the detailed information contained in databases. Data mining techniques also allow the development of prediction factors and the discovery of atypical situations (objects).

To illustrate some of the methods, we use examples from medical databases [5]. Our experience in developing data mining techniques is associated with the design, building and practical implementation of clinical databases.
The patient descriptions in such databases usually contain a large amount of diagnostic information (including patient questionnaire data and laboratory test results). A description can also include the diagnosis (the label) of the patient. As a result, the clinical database holds disease profiles built from the descriptions of individual patients. In this case, information from the database can be used for diagnostic support: a diagnostic decision support rule applied to a new patient can be created on the basis of similar historical cases.

¹ This work was partially supported by the grant 8T11E00811 from the State Committee for Scientific Research (KBN) in Poland and by the grant WII from the Technical University of Białystok.

An example is “Hepar” [6]. The computer system “Hepar” was designed and built a few years ago at the Institute of Biocybernetics and Biomedical Engineering PAS in co-operation with doctors. The system comprises a hepatological database and a shell of automated procedures which facilitate data analysis and diagnosis support. The database contains the descriptions of about 700 hepatological patients from the Gastroenterological Clinic of the Institute of Food and Feeding in Warsaw. The “Hepar” system is currently used at this Clinic, and its database is steadily being enlarged.

The description of each patient in the database is represented in the form of a so-called “feature vector”. The components of these vectors are numerical results of a variety of medical examinations. They are both qualitative and quantitative, because they cover signs and symptoms as well as numerical results of laboratory tests. About 200 different features are used to describe a single patient in the “Hepar” system.

Recently, a clinical database has been built using the SAS system at the Computer Science Department of the Technical University of Bialystok.
This database has been created for doctors at the Institute of Obstetrics and Gynecology of the Clinical Hospital in Bialystok. The choice of the programming language and of the system in which to develop the new database was dictated largely by medical demands. The variables describing every patient were derived from traditional paper histories of disease. The electronic histories of disease contain about 500 positions (fields).

Fig.1 The main menu of the Clinical Database

The Clinical Database is now being put into medical practice. Doctors have started to collect diagnostic information on patients with cancer. The database has a graphical user interface that makes it easy for researchers to use, contains essential automated procedures for data manipulation and procedures for statistical analysis supported by graphical presentations, and makes many tools of the SAS system available [7], [8]. The range of data analysis has been defined by medical specialists. New procedures can easily be added to the program, extending the database with new functions. In the near future we expect our database to be a powerful analytical tool that can be used in a client/server environment, and we also plan to provide it with a data mining module.

A prototype data mining system is being developed at our Computer Science Department. This tool includes procedures which allow:
i. visualizing transformations of the data using diagnostic maps,
ii. obtaining decision rules and decision trees interactively, and
iii. developing neural networks on the basis of data sets.

i. An example of a diagnostic map from our system is shown in the figure below (Fig.2). The map is used to evaluate the following diagnostic hypothesis: “patient x is related to the k-th disease”.
The map results from a visualizing transformation of a multivariate data set into a two-dimensional space, chosen so that the k-th disease is located in the centre of the map. If the patient being diagnosed is also located in the centre of the map, then the hypothesis is supported by the map. The diagnostic map is designed to allow good visual separation of the distinguished (k-th) disease from the remaining diseases.

Fig.2 An example of the diagnostic map from the DaVinci system

ii. Decision tree learning is one of the most widely used methods for inductive inference. It is possible to generate decision rules and decision trees on the basis of data sets and to measure their quality both quantitatively and graphically. These methods have been successfully applied to a broad range of tasks, from medical diagnosis to assessing the credit risk of loan applicants. Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute, and each branch descending from that node corresponds to one of the possible values of this attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, and then moving down the branch corresponding to the value of that attribute. This process is repeated for the subtree rooted at the new node. We have developed specialised evolutionary programming algorithms aimed at the design of decision trees.

iii. Development of neural networks covers automated generation of the network architecture (the number of layers and the number of neural elements in each layer) and the establishing of parameters (weights). Algorithms similar to those used in linear programming are used.
Linear programming methods are used, for instance, in econometric research, and they enable the manipulation of large data sets.

Other methods of data mining are also being applied and tested at our department, e.g. Bayesian networks. These networks are useful in any system in which [9]: 1) causality plays some role, and 2) there is some uncertainty present. A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data modeling. Because the model encodes dependencies among all variables, it readily handles situations where some entries are missing. A Bayesian network can be used to learn causal relationships, and hence to gain understanding of a problem domain and to predict the consequences of intervention. Because the model has both causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data.

Over the last decade, the Bayesian network has become a popular representation for encoding uncertain expert knowledge in expert systems. More recently, researchers have developed methods for learning Bayesian networks from data. These techniques are new and still evolving, but they have been shown to be remarkably effective for some data-modeling problems [10].

To illustrate the process of building a Bayesian network, consider the problem of detecting cellular-phone fraud. We begin by determining the variables to model.
One possible choice for our problem is Fraud (Fr), Gender (G), Frequency in the last hour (F), Where call is placed (W) and Time of the day (T), representing, respectively, whether or not the current phone call is fraudulent, the gender of the caller, whether or not the frequency of calls in the last hour has increased, where calls in the last hour have been placed, and the time of the day. The states of these variables are shown in the figure below. Of course, in a realistic problem, we would include many more variables. Also, we could model the states of one or more of these variables at a finer level of detail. This network can then be used in combination with some observed evidence: for example, if the gender and the frequency of calls are observed in a particular case, then the probability that the call is fraudulent can be calculated.

Gender
          M       F
          0.7     0.3

Fraud
          M       F
Fr        0.001   0.003
-Fr       0.999   0.997

Frequency in the last hour
          Fr      -Fr
F=Low     0.0     0.1
F=Medium  0.1     0.7
F=High    0.9     0.2

Where call is placed
          Fr      -Fr
W=Local   0.1     0.8
W=Trunk   0.5     0.15
W=Abroad  0.4     0.05

Time of the day
          Fr              -Fr
          M       F       M       F
T=Work    0.5 0.1 0.5 0.8 0.1 0.6
T=Out     0.5     0.5     0.2     0.4

Fig.3 An example of the Bayesian network.

SUMMARY

Data mining methods, as these examples have shown, are used to produce tools that support decision making, so they may be applied in various industries, in particular those which operate on large data sets. The Institute of Computer Science is engaged in work on these techniques. We are developing our own software in this area and are designing applications with the SAS System. We also teach students to perform multivariate data analysis and to build applications with the SAS System.

REFERENCES

1. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press, The MIT Press, London, 1996.
2. U. Fayyad, H. Mannila, G. Piatetsky-Shapiro (Eds.), Data Mining and Knowledge Discovery, Vol. 1, Number 1-4, Kluwer Academic Publishers, 1997.
3. T. M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
4. J. P. Bigus, Data Mining with Neural Networks: Solving Business Problems – from Application Development to Decision Support, McGraw-Hill, New York, 1996.
5. E. Kącki (Ed.), Computers in Medicine (4-th International Conference), Vol. 1-2, Polish Society of Medical Informatics, Lodz, 1997.
6. L. Bobrowski (Ed.), Hepar – computer system for diagnosis support and data analysis (in Polish), IBIB, 31, 1992.
7. SAS Screen Control Language, Version 6, First Edition, SAS Institute Inc., Cary, 1991.
8. SAS Language: Reference, Version 6, First Edition, SAS Institute Inc., Cary, 1991.
9. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, California, 1988.
10. M. J. Druzdzel, Five Useful Properties of Probabilistic Knowledge Representations From the Point of View of Intelligent Systems, Fundamenta Informaticae, Special Issue on Knowledge Representation and Machine Learning, 30(3/4):241-254.
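
NOTE ON THE FRAUD EXAMPLE

The calculation mentioned for the network of Fig.3 (the probability of fraud when the gender and the call frequency are observed) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' software: the arcs Gender -> Fraud and Fraud -> Frequency are inferred from the way the tables in Fig.3 are conditioned, and the names P_FRAUD_GIVEN_GENDER, P_FREQ_GIVEN_FRAUD and posterior_fraud are hypothetical. Under that assumed structure the unobserved variables W and T sum out of the posterior, so their tables are not needed here.

```python
# Probabilities read off Fig.3. The arc directions (Gender -> Fraud,
# Fraud -> Frequency) are an assumption inferred from the table layout.
P_FRAUD_GIVEN_GENDER = {"M": 0.001, "F": 0.003}  # P(Fr | G)
P_FREQ_GIVEN_FRAUD = {                           # P(F | Fr) and P(F | -Fr)
    True: {"Low": 0.0, "Medium": 0.1, "High": 0.9},
    False: {"Low": 0.1, "Medium": 0.7, "High": 0.2},
}


def posterior_fraud(gender, frequency):
    """Return P(Fr | G = gender, F = frequency) using Bayes' rule."""
    p_fraud = P_FRAUD_GIVEN_GENDER[gender]
    # Unnormalised weights of the two hypotheses given the evidence.
    weight_fraud = p_fraud * P_FREQ_GIVEN_FRAUD[True][frequency]
    weight_legit = (1.0 - p_fraud) * P_FREQ_GIVEN_FRAUD[False][frequency]
    # Normalise so that the two posteriors sum to one.
    return weight_fraud / (weight_fraud + weight_legit)


if __name__ == "__main__":
    for gender in ("M", "F"):
        for frequency in ("Low", "Medium", "High"):
            p = posterior_fraud(gender, frequency)
            print(f"G={gender}, F={frequency}: P(Fr | evidence) = {p:.4f}")
```

With these numbers, observing a high call frequency for a female caller raises the probability of fraud from 0.003 to about 0.013, which shows how even weak evidence shifts a small prior.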