Computing and Information Systems, 7 (2000), p. 73-78
© University of Paisley 2000
UnDeT: a tool for building uncertain trees
Dominique Fournier
We also associate with the uncertain domains those problems for which the source of the data is not reliable (i.e. some data are certainly noisy without it being possible to detect them). As an example, we had the opportunity to study medical data resulting from brain MRI examinations [3][4]. During this study, among the 800 cases at our disposal, the examination specialists were able to detect a number of examples whose measurements were conspicuously noisy, but without apparent reason. Such cases suggest the presence in the database of other erroneous elements that are undetectable by direct human analysis. We therefore associate this kind of problem (the presence of hidden noisy data) with the uncertain domains.
There are many decision tree programs and pruning facilities [1][7][8][9][12]. However, they do not prove suitable for the study of uncertain domains. We introduce here a multi-platform decision tree program which implements a quality index and a pruning method fitted to this kind of field, and we explain how its new functionalities make it useful for the study of uncertain domains.
INTRODUCTION
In this paper, we present UnDeT, an experimentation tool developed in our laboratory and integrated into a data mining software platform. Our module is dedicated to the study of difficult problems arising from "uncertain domains". During various collaborations, mainly in the medical field, we noticed the utility of such a tool, which allows an expert assessment of these domains.

To sum up, an uncertain problem is characterized by unreliable labelling and an incomplete and/or noisy representation. We will detail the expectations of users who are confronted with these problems. We first clarify what we call "uncertain domains" and present our approach to them. Secondly, we detail UnDeT's design and functionalities. We finish by presenting an experiment that illustrates our work and the results we have obtained.
UNCERTAIN DOMAINS AND THEIR SPECIFIC NEEDS

Uncertain domains

We consider that a data mining problem belongs to the uncertain domains if it is a difficult problem. More precisely, a problem is difficult if no expert is able to solve it simply. In this case, the choice of relevant attributes for its characterization is not always possible because of the lack of mastery of the field. Indeed, during the data collection phase, the users proceed as well as possible by selecting the greatest number of variables likely to interact with the question considered, but cannot be sure they have not forgotten other, more informative attributes. In the same way, the expert, who does not have a systematic procedure at his disposal, may make errors when labelling the pilot data. The choice of the class is therefore not always reliable. In such an environment, one can be sure neither of the validity of the pre-labelled examples nor of the utility of the variables chosen to describe the studied data. It is thus also possible to have useless variables which generate noise.

Specific needs

Within such a framework, the traditional tools for classification are inoperative because of the lack of certainty concerning the attributes of the data. A global answer to all the difficulties inherent in the uncertain domains does not seem possible either. That is why the requirements in this matter are directed towards a contribution to the data analysis which details its validity and relevance. For such a study it is thus necessary to have a noise-robust tool, able to explore badly defined data. Such a tool must also be able to explore a great number of attributes, even poorly informative ones, in order to extract those which bring reliable information, even partially. It must thus lead to the selection of well-characterized subsets of cases, or of subsets of attributes that are at least partially relevant. The results obtained can then be submitted to the evaluation and validation of the expert, who will thus improve his knowledge of the field and will be able to make corrections to the initial data.
OUR APPROACH

In this state of mind, what we seek to do is to evaluate the quality, the informative value, of the training obtained from data coming from uncertain domains, and to establish which parts of these results are reliable, not very reliable or even suspect, and thus to quantify the user's confidence in the results. Our objective is thus to split the results of a classification in order to separate the reliable components from the others. This extraction will then allow the user to study the data again and perhaps, with this new highlighting, improve his expertise of the domain.
In this respect, our work was directed towards decision trees. Indeed, this method shows several advantages which were decisive when we chose it. First, it is the structure of the models built by this type of algorithm which interested us. The nodes contained in the trees constitute a partition of the original database. Such a partition, which can be associated with a set of rules, allows a detailed browsing of the database. Compared with methods which do not proceed by successive refinement to reach their results, neural networks for example, it appears well fitted to our needs. Moreover, decision tree algorithms examine all the attributes on equal terms, so the presence of noisy variables is not a failure factor. In addition, the association of decision trees with the uncertain domains also had the advantage of being well studied locally [2]. We thus endeavoured to develop a quality index that answers our expectations as regards the evaluation of the results [6]. We compute these evaluations with UnDeT.
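To make the rule reading of tree nodes mentioned above concrete, here is a minimal sketch, not taken from UnDeT, that turns each root-to-leaf path of a decision tree into an "IF ... THEN class" rule; the attribute tests and class names are borrowed from the meningitis tree of Figure 5 for illustration only.

```java
// Minimal sketch (not UnDeT's code): reading each root-to-leaf path of a
// decision tree as an "IF ... THEN class" rule. Tests and class names are
// illustrative, borrowed from the meningitis tree of Figure 5.
import java.util.ArrayList;
import java.util.List;

class Node {
    String test;            // test on the branch leading here, e.g. "gram<=0" (null for the root)
    String predictedClass;  // non-null only for leaves
    List<Node> children = new ArrayList<>();
    Node(String test, String predictedClass) {
        this.test = test;
        this.predictedClass = predictedClass;
    }
}

public class PathToRule {
    /** Collects one rule per leaf by accumulating the tests met along the path. */
    static void collectRules(Node node, List<String> conditions, List<String> rules) {
        if (node.test != null) conditions.add(node.test);
        if (node.children.isEmpty()) {
            rules.add("IF " + String.join(" AND ", conditions)
                    + " THEN class = " + node.predictedClass);
        } else {
            for (Node child : node.children) collectRules(child, conditions, rules);
        }
        if (node.test != null) conditions.remove(conditions.size() - 1);
    }

    public static void main(String[] args) {
        Node root = new Node(null, null);
        Node gramLow = new Node("gram<=0", null);
        root.children.add(gramLow);
        root.children.add(new Node("gram>0", "bact"));
        gramLow.children.add(new Node("prot<=0.95", "vir"));
        gramLow.children.add(new Node("prot>0.95", "bact"));

        List<String> rules = new ArrayList<>();
        collectRules(root, new ArrayList<>(), rules);
        rules.forEach(System.out::println);   // one rule per element of the partition
    }
}
```

Each printed rule corresponds to one subset of the partition induced by the tree, which is exactly the granularity at which we want to assess reliability.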
Functionality

For UnDeT we used a traditional decision tree algorithm in the style of C4.5 [9] (it uses the C4.5 data file format, which has the advantage of being commonly used in the Machine Learning community). It implements three criteria for best attribute selection: the gain criterion, the GainRatio criterion and finally the ORT criterion. The first two criteria are already used in Quinlan's C4.5; the third comes from Fayyad and Irani [5]. Our tool also offers all the functionalities one can expect from a traditional classification algorithm: management of training and test files, automatic cross-validation, confusion matrices, and saving and re-use of the built classification model.
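For readers unfamiliar with the first two criteria, here is a minimal sketch, not UnDeT's implementation, of the entropy-based gain and GainRatio computations; the ORT criterion of Fayyad and Irani [5] is not reproduced here. The example counts are taken from the root split of the tree shown later in Figure 5.

```java
// Minimal sketch (not UnDeT's code) of the two entropy-based selection
// criteria named above: information gain and GainRatio. Each int[] holds
// the class counts (here: bacterial, viral) of a node or of a child node.
public class SplitCriteria {

    /** Entropy, in bits, of the class distribution given by counts. */
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    private static int sum(int[] counts) {
        int s = 0;
        for (int c : counts) s += c;
        return s;
    }

    /** Information gain: parent entropy minus the weighted entropies of the children. */
    static double gain(int[] parent, int[][] children) {
        double remainder = 0.0;
        for (int[] child : children) {
            remainder += ((double) sum(child) / sum(parent)) * entropy(child);
        }
        return entropy(parent) - remainder;
    }

    /** GainRatio: the gain divided by the entropy of the split itself ("split information"). */
    static double gainRatio(int[] parent, int[][] children) {
        double splitInfo = 0.0;
        for (int[] child : children) {
            double p = (double) sum(child) / sum(parent);
            if (p > 0) splitInfo -= p * Math.log(p) / Math.log(2);
        }
        return splitInfo == 0.0 ? 0.0 : gain(parent, children) / splitInfo;
    }

    public static void main(String[] args) {
        // Root split of the tree in Figure 5: (bacterial/viral) counts.
        int[] parent = {84, 245};                    // 30+54 bacterial, 245 viral
        int[][] byGram = { {30, 245}, {54, 0} };     // gram<=0 and gram>0
        System.out.printf("gain = %.3f, gainRatio = %.3f%n",
                gain(parent, byGram), gainRatio(parent, byGram));
    }
}
```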
UNDET: UNCERTAIN DECISION TREE

The innovative parts of our tool relate to the development of a quality index and to its association with a pruning heuristic.
Design
When designing our tool, in addition to our interest in the uncertain domains, we had the ambition to develop within the laboratory an experimental platform dedicated to the study of such fields. We consequently had, on the one hand, portability constraints to free us from the heterogeneity of the operating systems used. On the other hand, the association of UnDeT with other modules such as RAR [10] and MVC [11] within our platform imposed a maximum of modularity. This is why the JAVA language seemed particularly well adapted to our working environment. This language is indeed available on any type of computer and, as an object-oriented programming language, makes it possible to have a tool that is flexible and easy to evolve, which is an additional asset within an experimental framework that tends to change over time. The graphical interface was built using the SWING libraries, which will constitute the standard of the next distributions of the JAVA language and thus ensure the lasting usability of our program.
Quality index and pruning method

The main characteristic of our tool is the development of a quality index and a pruning method [6] which are well adapted to the uncertain fields, by emphasizing and preserving the data subsets that are well recognized in the decision tree.
DI(T) = \sum_{i=1}^{m} \alpha_i \, (1 - I(\nu_i)) \, f(\mathrm{depth}(\nu_i))

Equation 1
The quality index, which we called DI (for Depth-Impurity quality index), associates, for each node of a tree, the complexity of the associated sub-tree with the purity of its leaves. The precise definition of DI for a tree T (DI(T) is also noted DI(ν), the quality index of the tree T which has ν as its root) is given in Equation 1, where the ν_i are the m leaves of T, α_i is the weight of ν_i in ν, and I is an impurity measure normalized in [0,1]. Let us finally specify that f is a damping function which decreases with the depth of the leaves.

DI(\nu) \leq 1 - I(\nu)

Equation 2

The pruning method associated with DI, called DI pruning, consists in cutting the tree nodes whose quality remains lower than their purity, according to the condition given by Equation 2. This method is based on the following principle: the construction cost of the tree must improve its overall quality. So, when a node has a purity greater than its quality index (in the sense of the definition of our quality index), one can consider that the sub-tree below it is too expensive with regard to its effectiveness.
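To make Equation 1 and the pruning condition of Equation 2 concrete, here is a minimal sketch, not UnDeT's actual code. Two choices are assumptions on our part, since the text only constrains them loosely: the impurity I is taken as the normalized entropy, and the damping function is taken as f(d) = BETA^d with 0 < BETA < 1 (the weights α_i are taken as relative leaf sizes).

```java
// Minimal sketch (not UnDeT's code) of the DI quality index (Equation 1)
// and of DI pruning (Equation 2). Assumptions: impurity I is the normalized
// entropy, and the damping function is f(d) = BETA^d with 0 < BETA < 1.
import java.util.ArrayList;
import java.util.List;

class TreeNode {
    int[] classCounts;                          // class distribution of the node
    List<TreeNode> children = new ArrayList<>();
    TreeNode(int... counts) { this.classCounts = counts; }
    boolean isLeaf() { return children.isEmpty(); }
    int size() { int s = 0; for (int c : classCounts) s += c; return s; }
}

public class DepthImpurity {
    static final double BETA = 0.9;             // assumed damping constant

    /** Impurity I, normalized in [0,1]: entropy divided by its maximum value. */
    static double impurity(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(counts.length);
        }
        return h;
    }

    /** Equation 1: DI(T) = sum over leaves of alpha_i * (1 - I(leaf_i)) * f(depth(leaf_i)). */
    static double di(TreeNode root) {
        return diOfLeaves(root, 0, root.size());
    }
    private static double diOfLeaves(TreeNode node, int depth, int rootSize) {
        if (node.isLeaf()) {
            double alpha = (double) node.size() / rootSize;  // weight of the leaf in the (sub-)tree root
            return alpha * (1.0 - impurity(node.classCounts)) * Math.pow(BETA, depth);
        }
        double sum = 0.0;
        for (TreeNode child : node.children) sum += diOfLeaves(child, depth + 1, rootSize);
        return sum;
    }

    /** Equation 2: turn into a leaf every node whose quality DI is not above its own purity. */
    static void diPrune(TreeNode node) {
        for (TreeNode child : node.children) diPrune(child);
        if (!node.isLeaf() && di(node) <= 1.0 - impurity(node.classCounts)) {
            node.children.clear();               // the sub-tree costs more than it brings
        }
    }
}
```

In UnDeT the damping constant is one of the user-adjustable preferences, which is what allows the two pruning levels compared in the experiment below.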
Guided tour

The interface of UnDeT remains rather traditional: a menu bar gives access to the various functions available. The Fichier menu (Figure 1) handles the loading and saving of the data (using a dialog box, Figure 2) and exiting the program. The Arbre menu allows the construction and the pruning of the trees as well as the "Train and Test" procedure and automatic cross-validation. Finally, the Options menu makes it possible to choose the selection criterion used in the tree induction, the pruning method and the quality index which will be associated with the tree. The preferences dialog (Figure 3) allows the user to parameterize the constants used during the computation.
Figure 1: Fichier and Options Menus

Figure 2: Dialog Box

Figure 3: Preferences

The results display is relatively basic (Figure 4) but gives the user a lot of information. In addition to a recall of the parameters used during the construction of the tree or its pruning, the display shows, for each node, the number of associated examples and their distribution over the classes. The impurity and the quality index of each node are also available.
Figure 4: Results display
DI=0.848
-1- (30/245) gram<=0: DI=0.818
| -3- (16/243) prot<=0.95: DI=0.826
| | -5- (9/223) purpura=0: DI=0.818
| | | -10- (7/24) age<=1.66: DI=0.722
| | | | -12- (0/14) vs<=78: I=0 Class=vir
| | | | -13- (7/10) vs>78: DI=0.631
| | | | | -14- (0/7) age<=0.58: I=0 Class=vir
| | | | | -15- (7/3) age>0.58: DI=0.514
| | | | | | -16- (1/3) fever<=39.7: I=0.811 Class=vir
| | | | | | -17- (6/0) fever>39.7: I=0 Class=bact
| | | -11- (2/199) age>1.66: DI=0.832
| | | | -18- (0/119) lung=0: I=0 Class=vir
| | | | -19- (0/54) lung=1: I=0 Class=vir
| | | | -20- (0/20) lung=2: I=0 Class=vir
| | | | -21- (2/6) lung=3: DI=0.797
| | | | | -22- (0/1) season=winter: I=0 Class=vir
| | | | | -23- (2/0) season=spring: I=0 Class=bact
| | | | | -24- (0/5) season=summer: I=0 Class=vir
| | -6- (1/10) purpura=1: DI=0.873
| | | -26- (0/10) cytol<=400: I=0 Class=vir
| | | -27- (1/0) cytol>400: I=0 Class=bact
| | -7- (0/10) purpura=2: I=0 Class=vir
| | -8- (4/0) purpura=3: I=0 Class=bact
| | -9- (2/0) purpura=4: I=0 Class=bact
| -4- (14/2) prot>0.95: DI=0.685
| | -28- (2/2) cytol<=600: I=1 Class=bact
| | -29- (12/0) cytol>600: I=0 Class=bact
-2- (54/0) gram>0: I=0 Class=bact
Figure 5: Initial tree on meningitis database.
EXPERIMENT

We now comment on an experiment that we consider representative of our various tests. For this experiment we used UnDeT and its pruning strategy.

Child's meningitis database

Here is an example of the use of our software on a medical database coming from the University Hospital at Grenoble (France). This database is the result of a medical study on young patients affected by meningitis. Meningitis is a pathology that is generally caused by a viral infection, but about one quarter of the cases are due to bacteria. While viral cases only need medical supervision, patients infected by bacteria have to be treated with suitable antibiotics. An early diagnosis is important because a bacterial infection endangers the patient's life.

This database contains 328 examples described by 23 (quantitative or qualitative) attributes. The class is bivalued (viral versus bacterial). Whereas the diagnosis is considered obvious in typical cases, nearly one third of these examples shows non-typical clinical and biological data: the prediction becomes difficult because of the poor link between the attributes, considered separately, and the diagnosis. We show in this section how the pruning method provided by UnDeT is relevant in such domains.

Results

The initial tree has a high quality index (0.848) due to the relevance of the chosen attributes. Such a result is considered quite decent in the medical domain.
To comment on these results, let us call Ti the sub-tree whose root is labelled -i-. In the tree of Figure 5, T4 has a lower impurity than T10 (0.54 against 0.77); even so, the quality index of T10 is higher (DI(T10)=0.722 and DI(T4)=0.685). Although the deeper sub-tree is penalized, DI(T10) is better because it explains the instances better (it leads to a single misclassified instance). The complexity cost of T10 will appear at the pruning stage below.
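To see how the condition of Equation 2 reads on these figures, here is a small illustrative check using the impurity and DI values quoted above; under this condition neither sub-tree is cut at this stage, which is consistent with the first pruned tree of Figure 6.

```latex
% Illustrative check of the Equation 2 condition (prune when DI(nu) <= 1 - I(nu)),
% using the impurity and DI values quoted above; both sub-trees are kept here.
\[ T_{10}:\quad 1 - I = 1 - 0.77 = 0.23 \;<\; DI(T_{10}) = 0.722 \quad\Rightarrow\quad \text{kept} \]
\[ T_{4}:\quad  1 - I = 1 - 0.54 = 0.46 \;<\; DI(T_{4})  = 0.685 \quad\Rightarrow\quad \text{kept} \]
```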
DI=0.829
-1- (30/245) gram<=0: DI=0.796
| -3- (16/243) prot<=0.95: DI=0.803
| | -5- (9/223) purpura=0: DI=0.792
| | | -10- (7/24) age<=1.66: DI=0.722
| | | | -12- (0/14) vs<=78: I=0 Class=vir
| | | | -13- (7/10) vs>78: DI=0.631
| | | | | -14- (0/7) age<=0.58: I=0 Class=vir
| | | | | -15- (7/3) age>0.58: DI=0.514
| | | | | | -16- (1/3) fever<=39.7: I=0.811 Class=vir
| | | | | | -17- (6/0) fever>39.7: I=0 Class=bact
| | | -11- (2/199) age>1.66: I=0.080 Class=vir
| | -6- (1/10) purpura=1: DI=0.873
| | | -26- (0/10) cytol<=400: I=0 Class=vir
| | | -27- (1/0) cytol>400: I=0 Class=bact
| | -7- (0/10) purpura=2: I=0 Class=vir
| | -8- (4/0) purpura=3: I=0 Class=bact
| | -9- (2/0) purpura=4: I=0 Class=bact
| -4- (14/2) prot>0.95: DI=0.685
| | -28- (2/2) cytol<=600: I=1 Class=bact
| | -29- (12/0) cytol>600: I=0 Class=bact
-2- (54/0) gram>0: I=0 Class=bact

Figure 6: First pruned tree.
DI=0.762
-1- (30/245) gram<=0: DI=0.716
| -3- (16/243) prot<=0.95: DI=0.718
| | -5- (9/223) purpura=0: I=0.237 Class=vir
| | -6- (1/10) purpura=1: DI=0.873
| | | -26- (0/10) cytol<=400: I=0 Class=vir
| | | -27- (1/0) cytol>400: I=0 Class=bact
| | -7- (0/10) purpura=2: I=0 Class=vir
| | -8- (4/0) purpura=3: I=0 Class=bact
| | -9- (2/0) purpura=4: I=0 Class=bact
| -4- (14/2) prot>0.95: DI=0.685
| | -28- (2/2) cytol<=600: I=1 Class=bact
| | -29- (12/0) cytol>600: I=0 Class=bact
-2- (54/0) gram>0: I=0 Class=bact
Figure 7: Second pruned tree.
In the first pruned tree (Figure 6), the node T11 becomes a leaf, introducing 2 errors. This is not too high for a node of 201 instances. Yet, the initial tree had built a complex sub-tree there and produced deep leaves which are not reliable in uncertain domains. The second pruned tree (Figure 7), built with a stronger damping function in the quality index, turns the node T5 into a leaf. In this process, T10 has also disappeared: its complexity was not reliable.

Finally, let us notice that the sub-tree T4 is not pruned, whereas common pruning methods would cut it first because it does not lessen the error rate. This sub-tree preserves a reliable population with the attribute "cytol" higher than 600 (a result checked by a medical expert). This is typically the situation we were looking for in our motivations (see the section on specific needs above).

CONCLUSION

We have introduced UnDeT, a program that builds decision trees for the uncertain domains. This software can be used either as a supervised learning system or as a way to study the data used in a machine learning process. We have then presented DI, a quality index, and the pruning strategy stemming from it, which are useful when studying uncertain domains.

Yet, DI allows the evaluation of a decision tree, but does not give information about the quality of the rules induced from the tree. This is why we are currently working on a new quality index for rules induced from decision trees. Another project is to transform UnDeT into an applet available on the web; a small and restricted version can already be reached at: http://cush.info.unicaen.fr/~dom/ArbreDecision.html.

References

[1] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. The Wadsworth Statistics/Probability Series, 1984.
[2] B. Crémilleux, C. Robert and M. Gaio. Uncertain Domains and Decision Trees: ORT versus C.M. Criteria. In 7th Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU), pages 540-546, Paris (France), July 1998. EDK.
[3] N. Delcroix, F. Kauffman, J.M. Constans, B. Mazoyer and B. Victorri. Fitting with Prior Knowledge: Improvements in Quantification of in Vivo Proton MRS of Brain Metabolites. ISMRM, (1901), 1998.
[4] N. Delcroix, F. Kauffman, S. Philippe, J.M. Constans, B. Mazoyer and B. Victorri. Ajustement paramétrique pour améliorer la quantification in vivo de la SRM: Application au cerveau. GRAMM, (12), 1998.
[5] U.M. Fayyad and K.B. Irani. The Attribute Selection Problem in Decision Tree Generation. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 104-110, Cambridge, MA, 1992. AAAI Press/MIT Press.
[6] D. Fournier and B. Crémilleux. Using Impurity and Depth for Decision Trees Pruning. To be published in Proceedings of the Second International ICSC Symposium on Engineering of Intelligent Systems (EIS 2000), Paisley, Scotland, U.K., June 2000.
[7] D. Jensen and M. Schmill. Adjusting for Multiple Comparisons in Decision Tree Pruning. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97), Newport Beach, California, 1997. AAAI Press.
[8] J. Mingers. An Empirical Comparison of Pruning Methods for Decision Tree Induction. Machine Learning, 4, pages 227-243, 1989. Kluwer Academic Publishers, Boston.
[9] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California, 1993.
[10] A. Ragel. Preprocessing of Missing Values using Robust Association Rules. In J.M. Zytkow and M. Quafafou, editors, Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD '98), Lecture Notes in Artificial Intelligence No. 1520, pages 414-422, Nantes (France), 1998. Springer-Verlag.
[11] A. Ragel and B. Crémilleux. MVC - a Preprocessing Method to Deal with Missing Values. Knowledge-Based Systems, pages 285-291, 1999.
[12] P. Utgoff, N. Berkman and J. Clouse. Decision Tree Induction Based on Efficient Tree Restructuring. Machine Learning, 29, pages 5-44, 1997. Kluwer Academic Publishers.