KNOWLEDGE DISCOVERY AND DATA MINING ON THE EXAMPLE OF CLINICAL DATABASES ¹
Magdalena Topczewska ¹, Leon Bobrowski ¹ ²
¹ Institute of Computer Science, Technical University of Bialystok, Poland
² Institute of Biocybernetics and Biomedical Engineering PAS, Warsaw, Poland
INTRODUCTION
During the last decade, progress in computer technology has made it possible
to collect large amounts of both quantitative and qualitative data in databases.
In the opinion of many specialists, the ability to collect information has surpassed our ability
to analyze it. This fact motivated the creation of a new field of research with strong
associations with current practice, known as “Data mining and knowledge discovery
in databases”. Two international congresses have been devoted so far to this line of research,
and the first specialised journal, entitled “Data Mining”, has been issued [1], [2].
During the development of new techniques for data mining, methods from other scientific
disciplines are often adopted [3]. These include multidimensional statistical analysis,
machine learning, decision trees, neural networks, Bayesian networks, genetic algorithms,
and the rough and fuzzy set approaches. Data visualisation techniques also play an important role.
Data mining methods may be applied, among other applications, in support
of business decision making and economic advising [4]. By means of these methods, decision
rules can be created from the detailed information contained in databases. Data mining
techniques also allow for the development of prediction factors and the discovery of atypical
situations (objects).
To illustrate some of these methods, we will use examples
from medical databases [5]. Our experience in developing data mining techniques is associated
with the design, building and practical implementation of clinical databases. The patient
descriptions in such databases usually contain a large amount of diagnostic information
(including patient questionnaire data and laboratory test results). Descriptions can also include
the diagnosis (the label) of the considered patient. As a result, disease profiles
are collected in the clinical database from the descriptions of individual patients. In this case,
information from the database can be used for diagnostic support purposes. A diagnostic
decision support rule applied to a new patient can be created on the basis of similar historical
cases.
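The idea of supporting a diagnosis with similar historical cases can be sketched as a nearest-neighbour rule. This is only an illustration under stated assumptions: the paper does not say which similarity measure or voting scheme is used, so the Euclidean distance, the majority vote, the function name suggest_diagnosis and the toy records below are all hypothetical.

```python
import math
from collections import Counter


def suggest_diagnosis(new_patient, history, k=3):
    """Return the majority label among the k most similar historical cases.

    history is a list of (feature_vector, diagnosis_label) pairs.
    Similarity is measured with Euclidean distance (an assumption).
    """
    nearest = sorted(history, key=lambda case: math.dist(new_patient, case[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], nearest


# Made-up records: two numeric features per patient plus a diagnosis label.
history = [
    ((1.0, 0.9), "disease A"),
    ((1.2, 1.1), "disease A"),
    ((0.9, 1.3), "disease A"),
    ((4.0, 3.8), "disease B"),
    ((4.2, 4.1), "disease B"),
]

label, similar_cases = suggest_diagnosis((1.1, 1.0), history, k=3)
print(label, similar_cases)
```

Returning the similar cases together with the suggested label lets a doctor inspect the evidence behind the suggestion rather than accept it blindly.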
¹ This work was partially supported by grant 8T11E00811 from the State Committee for Scientific
Research (KBN) in Poland and by grant WII from the Technical University of Białystok.
An example is “Hepar” [6]. The computer system “Hepar” was designed and built
a few years ago at the Institute of Biocybernetics and Biomedical Engineering PAS
in co-operation with doctors. The system comprises a hepatological database and a shell
of automated procedures which facilitate data analysis and diagnosis support. The database
of this system contains the descriptions of about 700 hepatological patients
from the Gastroenterological Clinic of the Institute of Food and Feeding in Warsaw.
The “Hepar” system is currently used at this Clinic, and its database is steadily being
enlarged. The description of each patient in the database is represented in the form of a so-called
“feature vector”. The components of these vectors are the results of a variety
of medical examinations. These components are both qualitative and quantitative,
because they contain signs and symptoms as well as numerical results of laboratory tests.
About 200 different features are used to describe a single patient in the “Hepar”
system.
Recently, a clinical database has been built with the SAS System at the Computer Science
Department of the Technical University of Bialystok. This database has been
created for doctors at the Institute of Obstetrics and Gynecology of the Clinical Hospital
in Bialystok. The selection of the programming language and the system in which the new
database should be developed was dictated largely by medical demands. The variables
describing every patient were created on the basis of traditional paper histories of disease.
The electronic histories of disease contain about 500 positions (fields).
Fig.1 The main menu of the Clinical Database
The Clinical Database is now being put into medical practice. Doctors have started
to collect diagnostic information on patients with cancer. The database has a graphical user
interface that makes it easy for researchers to use. It contains essential
automated procedures for data manipulation and procedures for statistical analysis supported
by graphical presentations, and it makes possible the use of many tools included in the SAS
System [7], [8]. The range of data analysis has been defined by medical specialists. New
procedures can easily be added to this program, thus extending the database with new
functions. In the near future we expect our database to be a powerful analytical tool that can
be used in a client/server environment, and we also plan to provide it with a data
mining module.
A prototype data mining system is being developed at our Computer Science
Department. This tool includes procedures which allow:
i. visualizing transformations of the data using diagnostic maps,
ii. obtaining decision rules and decision trees interactively, and
iii. developing neural networks on the basis of data sets.
i. An example of a diagnostic map from our system is shown in the figure below (Fig.2).
The map is used to evaluate the following diagnostic hypothesis: “patient x
is related to the k-th disease”. The map results from a visualizing transformation
of a multivariate data set into two-dimensional space, performed in such a manner that
the k-th disease is located in the centre of the map. If the patient being diagnosed is also located
in the centre of the map, then the hypothesis is supported by the map.
The diagnostic map is designed in such a manner as to allow good visual separation
of the distinguished (k-th) disease from the remaining diseases.
Fig.2 An example of the diagnostic map from the DaVinci system
ii. Decision tree learning is one of the most widely used methods for inductive
inference. It is possible to generate decision rules and decision trees on the basis of data
sets and to measure their quality both quantitatively and graphically. These methods have
been successfully applied to a broad range of tasks, from medical diagnosis
to assessing the credit risk of loan applicants. Decision trees classify instances
by sorting them down the tree from the root to some leaf node, which provides
the classification of the instance. Each node in the tree specifies a test of some attribute,
and each branch descending from that node corresponds to one of the possible values
of this attribute. An instance is classified by starting at the root node of the tree, testing
the attribute specified by this node, then moving down the tree branch corresponding
to the value of the attribute. This process is then repeated for the subtree rooted at the new
node. We have developed specialised evolutionary programming algorithms aimed
at the design of decision trees.
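The classification walk described above (test the attribute at the current node, follow the branch for its value, repeat until a leaf is reached) can be sketched as follows. The nested-dictionary representation, the attribute names and the class labels are hypothetical and chosen only for illustration. The sketch covers classification with a given tree, not the evolutionary algorithms used to design it.

```python
def classify(node, instance):
    """Sort an instance down the tree from the root to a leaf."""
    # Internal nodes hold an attribute to test; leaves hold a class label.
    while "label" not in node:
        value = instance[node["attribute"]]  # test the node's attribute
        node = node["branches"][value]       # follow the matching branch
    return node["label"]


# A hand-written toy tree with made-up attributes and labels.
tree = {
    "attribute": "symptom_present",
    "branches": {
        "no": {"label": "class 1"},
        "yes": {
            "attribute": "test_result",
            "branches": {
                "normal": {"label": "class 1"},
                "elevated": {"label": "class 2"},
            },
        },
    },
}

patient = {"symptom_present": "yes", "test_result": "elevated"}
print(classify(tree, patient))
```

Each root-to-leaf path corresponds to one decision rule, for instance "if symptom_present is yes and test_result is elevated, then class 2".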
iii. Development of neural networks covers automated generation of the network architecture
(the number of layers and the number of neural elements in each layer) and establishing
the parameters (weights). The algorithms used are similar to those used in linear programming.
Linear programming methods are utilized, for instance, in econometric research,
and they enable the manipulation of large data sets.
Other methods of data mining are also being applied and tested at our department,
e.g. Bayesian networks. These networks are useful in any system in which [9]:
1) causality plays some role, and
2) there is some uncertainty present.
A Bayesian network is a graphical model that encodes probabilistic relationships among
variables of interest. When used in conjunction with statistical techniques, the graphical
model has several advantages for data modeling:
1) Because the model encodes dependencies among all variables, it readily handles
situations where some entries are missing.
2) A Bayesian network can be used to learn causal relationships, and hence can be used
to gain understanding about a problem domain and to predict the consequences
of intervention.
3) Because the model has both causal and probabilistic semantics, it is an ideal
representation for combining prior knowledge (which often comes in causal form)
and data.
4) Bayesian statistical methods in conjunction with Bayesian networks offer an efficient
and principled approach for avoiding the overfitting of data.
Over the last decade, the Bayesian network has become a popular representation
for encoding uncertain expert knowledge in expert systems. More recently, researchers have
developed methods for learning Bayesian networks from data. The techniques that have been
developed are new and still evolving, but they have been shown to be remarkably effective
for some data-modeling problems [10].
To illustrate the process of building a Bayesian network, consider the problem
of detecting cellular-phone fraud. We begin by determining the variables to model.
One possible choice for our problem is Fraud (Fr), Gender (G), Frequency in the last hour
(F), Where call is placed (W) and Time of the day (T), representing whether or not the current
phone call is fraudulent, the gender of the caller, whether or not the frequency in the last hour
has increased, where calls in the last hour have been placed and the time of the day,
respectively. The states of these variables are shown in the figure below. Of course,
in a realistic problem, we would include many more variables. Also, we could model the
states of one or more of these variables at a finer level of detail. The network can then be
used in combination with observed evidence. For example, if the gender
and the frequency of calls are observed in a particular case, then the probability that the call is
fraudulent can be calculated.
Nodes of the network and their probability tables (Fr = fraud, -Fr = no fraud):

Gender
              M       F
              0.7     0.3

Fraud (given Gender)
              M       F
  Fr          0.001   0.003
  -Fr         0.999   0.997

Frequency in the last hour (given Fraud)
              Fr      -Fr
  F=Low       0.0     0.1
  F=Medium    0.1     0.7
  F=High      0.9     0.2

Where call is placed (given Fraud)
              Fr      -Fr
  W=Local     0.1     0.8
  W=Trunk     0.5     0.15
  W=Abroad    0.4     0.05

Time of the day (given Fraud and Gender)
              Fr              -Fr
              M       F       M       F
  T=Work      0.5     0.5     0.8     0.6
  T=Out       0.5     0.5     0.2     0.4
Fig.3 An example of the Bayesian network.
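The calculation mentioned in the text, the probability of fraud once the gender and the call frequency have been observed, can be sketched with Bayes' rule using the values of Fig.3. The sketch assumes the structure suggested by the tables: Fraud depends on Gender, and Frequency depends only on Fraud. The variable and function names are hypothetical, and the entry printed as "0." in the figure is read as 0.0.

```python
# P(Fraud = yes | Gender), taken from Fig.3.
P_FRAUD_GIVEN_GENDER = {"M": 0.001, "F": 0.003}

# P(Frequency | Fraud), taken from Fig.3; True means the call is fraudulent.
P_FREQ_GIVEN_FRAUD = {
    True: {"Low": 0.0, "Medium": 0.1, "High": 0.9},   # "0." in the figure read as 0.0
    False: {"Low": 0.1, "Medium": 0.7, "High": 0.2},
}


def posterior_fraud(gender, frequency):
    """P(Fraud = yes | Gender, Frequency) by Bayes' rule.

    Assumes Frequency is independent of Gender once Fraud is known.
    """
    prior = P_FRAUD_GIVEN_GENDER[gender]
    joint_fraud = prior * P_FREQ_GIVEN_FRAUD[True][frequency]
    joint_legit = (1.0 - prior) * P_FREQ_GIVEN_FRAUD[False][frequency]
    return joint_fraud / (joint_fraud + joint_legit)


for gender in ("M", "F"):
    for frequency in ("Low", "Medium", "High"):
        print(gender, frequency, round(posterior_fraud(gender, frequency), 5))
```

For a male caller with a high call frequency this gives 0.0009 / (0.0009 + 0.1998), about 0.0045, which is roughly 4.5 times the prior of 0.001. The remaining variables (W and T) are left unobserved here, so they drop out of the calculation.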
SUMMARY
Data mining methods, as these examples have shown, are used to produce tools
that support decision making, so they may be applied in various industries, in particular those
which operate with large data sets. The Institute of Computer Science is engaged in work on these
techniques. We are developing our own software in this area and are designing applications
with the SAS System. We also teach students to perform multivariate data analysis
and to build applications with the SAS System.
REFERENCES
1. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in
Knowledge Discovery and Data Mining, AAAI Press, The MIT Press, London, 1996
2. U. Fayyad, H. Mannila, G. Piatetsky-Shapiro (Eds.), Data Mining and Knowledge
Discovery, Vol. 1, Number 1-4, Kluwer Academic Publishers, 1997
3. T. M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997
4. J. P. Bigus, Data Mining with Neural Networks: Solving Business Problems – from
Application Development to Decision Support, McGraw-Hill, New York, 1996
5. E. Kącki (Ed.), Computers in Medicine (4th International Conference), Vol. 1-2,
Polish Society of Medical Informatics, Lodz, 1997
6. L. Bobrowski (Ed.): Hepar – computer system for diagnosis support and data analysis
(in Polish), IBIB, 31, 1992
7. SAS Screen Control Language, Version 6 First Edition, SAS Institute Inc., Cary, 1991.
8. SAS Language: Reference, Version 6 First Edition, SAS Institute Inc., Cary, 1991.
9. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference,
Morgan Kaufmann, San Mateo, California, 1988
10. M. J. Druzdzel. Five Useful Properties of Probabilistic Knowledge Representations
From the Point of View of Intelligent Systems. Fundamenta Informaticae, Special Issue
on Knowledge Representation and Machine Learning, 30(3/4):241-254.