Computing and Information Systems, 7 (2000), pp. 73-78
© University of Paisley 2000

UnDeT: a tool for building uncertain trees

Dominique Fournier

There are many decision tree software packages offering pruning facilities [1][7][8][9][12]. However, they do not prove suitable for the study of uncertain domains. We introduce here a multi-platform decision tree software which implements a quality index and a pruning method fitted to this kind of investigation field. We will explain how its new functionalities make our program useful for the study of uncertain domains.

INTRODUCTION

In this paper, we present UnDeT, an experimentation tool developed in our laboratory and integrated into a data mining software platform. Our module is dedicated to the study of difficult problems issued from "uncertain domains". During various collaborations, mainly in the medical field, we noticed the utility of such a tool, which allows an expertise of these domains. To sum up, an uncertain problem is characterized by unreliable labelling and an incomplete and/or noisy representation. We will detail the expectations of users who are confronted with these problems.

We will initially clarify what we call "uncertain domains" and present our approach to these domains. Secondly, we detail UnDeT's design and functionalities. We finish by presenting an experiment to illustrate our work and the results we have obtained.

UNCERTAIN DOMAINS AND THEIR SPECIFIC NEEDS

Uncertain domains

We estimate that a data mining problem belongs to the uncertain domains if it is a difficult problem. More precisely, a problem is difficult if no expert is able to solve it simply. In this case, the choice of attributes relevant to its characterization is not always possible because of the lack of control over the field. Indeed, during the data harvesting phase, the users proceed as well as possible by selecting the greatest number of variables likely to interact with the question considered, but cannot be sure they have not forgotten some other, more informative attributes. In the same way, the expert, who does not have a systematic procedure at his disposal, may make errors during the labelling of the pilot data. The choice of the class is therefore not always reliable. In such an environment, one can be sure of neither the validity of pre-sorted examples nor the utility of the variables chosen to describe the studied data. It is thus also possible to have useless variables which generate noise.

We also associate with the uncertain domains those problems for which the source of the data is not sure (i.e. some data are certainly noisy without it being possible to detect them). As an example, we had the opportunity to study medical data resulting from brain MRI examinations [3][4]. During this study, among the 800 cases we had at our disposal, the examination specialists were able to detect a number of examples for which the obtained measurements were conspicuously noisy, but without apparent reason. Such cases suggest the presence in the database of other erroneous elements which are undetectable by direct human analysis. We therefore associate this kind of problem (presence of hidden noisy data) with the uncertain domains.

Specific needs

Within such a framework, the traditional tools for classification are inoperative because of the lack of certainty concerning the attributes of the data. A global answer to all the difficulties inherent in the uncertain domains does not seem possible either. That is the reason why the requirements in this matter are directed towards a contribution to the data analysis which details its validity and its relevance. It is thus necessary to have, for such a study, a noise-robust tool able to explore badly defined data. Such a tool must also be able to explore a great number of attributes, even scarcely informative ones, and to extract from them those which bring reliable information, even partially. Such a tool must thus lead to the selection of subsets of well-characterized cases, or of subsets of attributes that are at least partially relevant. The results obtained can then be submitted to the evaluation and validation of the expert, who will thus improve his knowledge of the field and will be able to make corrections to the initial data.

OUR APPROACH

In this state of mind, what we seek to do is to evaluate the quality, the informative value, of the training obtained from data resulting from uncertain domains, and to be able to establish which parts of these results are reliable, not very reliable, or even suspect, and thus to quantify the user's confidence in the results. Our objective is thus to split the results of a classification in order to separate the reliable components from the others. This extraction will then allow the user to study the data again and perhaps, with this new highlighting, improve his expertise of the domain.

In this respect, our work was directed towards decision trees. Indeed, this method shows several advantages which were decisive when we chose it. Initially, it is the structure of the models built by this type of algorithm which interested us. The nodes contained in the trees constitute a partition of the original database. Such a partition, which can be associated with a set of rules, allows a detailed browsing of the database. When one compares this method with others which do not proceed by successive refinement to reach their results, neural networks for example, it appears well fitted to our need. Moreover, decision tree algorithms scan all the attributes on equal terms, so the presence of noisy variables is not a failure factor. In addition, decision trees applied to the uncertain domains had the advantage of being well known locally [2]. We thus endeavoured to develop a quality index that meets our expectations regarding the evaluation of the results [6]. We compute these evaluations with UnDeT.
UNDET: UNCERTAIN DECISION TREE

The innovative parts of our tool relate to the development of a quality index and its association with a pruning heuristic.

Design

When designing our tool, in addition to our interest in the uncertain domains, we had the ambition to develop within the laboratory an experimental platform dedicated to the study of uncertain fields. We consequently had, on the one hand, constraints of portability to free us from the heterogeneity of the operating systems used. On the other hand, the association of UnDeT with other modules such as RAR [10] and MVC [11] within our platform imposed on us a maximum of modularity. This is why the JAVA language seemed particularly well adapted to our work environment. This language is indeed available on any type of computer and, as an object-oriented programming language, makes it possible to have a tool that is flexible and easy to evolve, which is an additional asset within an experimental framework which tends to change over time. The graphic interface was made using the SWING graphic libraries, which will constitute the standard of the next distributions of the JAVA language, ensuring our program better long-term usability.

Functionality

UnDeT uses the traditional decision tree algorithm of C4.5 [9] (it uses C4.5's data file format, which has the advantage of being commonly used in the Machine Learning community). It implements three criteria for best attribute selection: the gain criterion, the GainRatio criterion and finally the ORT criterion. The first two criteria are already used in Quinlan's C4.5; the third comes from Fayyad and Irani [5]. Our tool also presents all the functionalities which one can expect from a traditional classification algorithm: management of training and test files, automatic cross-validation, confusion matrix, and backup and re-use of the built classification model.
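To make the first two selection criteria concrete, here is a minimal JAVA sketch of the gain and GainRatio computations as defined for C4.5 [9]. The class layout and names are ours, not UnDeT's code, and the ORT criterion is omitted.

import java.util.Arrays;

public class SelectionCriteria {

    // Shannon entropy (base 2) of a class distribution given as counts.
    static double entropy(int[] counts) {
        int total = Arrays.stream(counts).sum();
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Information gain of a split: parent entropy minus the weighted
    // entropy of the branches. children[j] holds the class counts of branch j.
    static double gain(int[] parent, int[][] children) {
        int total = Arrays.stream(parent).sum();
        double after = 0.0;
        for (int[] branch : children) {
            int n = Arrays.stream(branch).sum();
            after += (double) n / total * entropy(branch);
        }
        return entropy(parent) - after;
    }

    // GainRatio: gain normalized by the split information, which penalizes
    // attributes that scatter the data into many small branches.
    static double gainRatio(int[] parent, int[][] children) {
        int total = Arrays.stream(parent).sum();
        double splitInfo = 0.0;
        for (int[] branch : children) {
            int n = Arrays.stream(branch).sum();
            if (n > 0) {
                double p = (double) n / total;
                splitInfo -= p * Math.log(p) / Math.log(2);
            }
        }
        return splitInfo == 0.0 ? 0.0 : gain(parent, children) / splitInfo;
    }

    public static void main(String[] args) {
        // The root split on "gram" from the experiment's initial tree
        // (Figure 5 below), as (bact/vir) counts.
        int[] parent = {84, 245};
        int[][] children = {{30, 245}, {54, 0}};
        System.out.printf("gain=%.3f  gainRatio=%.3f%n",
                gain(parent, children), gainRatio(parent, children));
    }
}

During induction, the attribute maximizing the chosen criterion is selected at each node.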
Quality index and pruning method

The characteristic of our tool relates to the development of a quality index and a pruning method [6] which are well adapted to the uncertain fields, by emphasizing and preserving the data subsets which are well recognized in the decision tree.

The quality index that we called DI (for DepthImpurity quality index) associates, for each node of a tree, the complexity of the associated sub-tree and the purity of its leaves. The precise definition of DI for a tree T (DI(T) is also noted DI(\nu), the quality index of the tree T which has \nu as its root) is given in Equation 1, where the \nu'_i are the m leaves of T, \omega_i is the weight of \nu'_i in \nu, and I is an impurity measure normalized in [0,1]. Let us finally specify that f is a damping function decreasing with the depth of the leaves.

DI(T) = \sum_{i=1}^{m} \omega_i \, (1 - I(\nu'_i)) \, f(\mathrm{depth}(\nu'_i))    (Equation 1)
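To fix ideas, the following JAVA sketch evaluates Equation 1 bottom-up. The paper leaves the damping function f open, so this sketch assumes a geometric decay f(d) = beta^d with beta in (0,1); for I it uses two-class entropy, which matches the I values displayed in Figures 5-7 below. The Node structure is a hypothetical stand-in for UnDeT's internal tree, not its actual code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DepthImpurity {

    static final double BETA = 0.95; // assumed damping constant, f(d) = BETA^d

    static class Node {
        int[] counts;                          // class distribution at this node
        List<Node> children = new ArrayList<>();
        boolean isLeaf() { return children.isEmpty(); }
    }

    // Impurity I normalized in [0,1]: binary entropy for a two-class problem.
    static double impurity(int[] counts) {
        int total = Arrays.stream(counts).sum();
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Equation 1: DI(T) = sum over the leaves of w_i * (1 - I(leaf_i)) * f(depth_i),
    // where w_i is the leaf's share of the examples of the root.
    static double di(Node root) {
        return diLeaves(root, 0, Arrays.stream(root.counts).sum());
    }

    private static double diLeaves(Node node, int depth, int rootSize) {
        if (node.isLeaf()) {
            double w = (double) Arrays.stream(node.counts).sum() / rootSize;
            return w * (1.0 - impurity(node.counts)) * Math.pow(BETA, depth);
        }
        double sum = 0.0;
        for (Node child : node.children) {
            sum += diLeaves(child, depth + 1, rootSize);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Tiny check: a root with two pure leaves at depth 1.
        Node root = new Node();  root.counts = new int[]{10, 10};
        Node left = new Node();  left.counts = new int[]{10, 0};
        Node right = new Node(); right.counts = new int[]{0, 10};
        root.children.add(left); root.children.add(right);
        System.out.printf("DI = %.3f%n", di(root)); // prints 0.950 with BETA = 0.95
    }
}

A pure leaf close to the root thus contributes almost its full weight, while deep or impure leaves contribute little, which is exactly the trade-off between complexity and purity that DI is meant to capture.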
The pruning method associated with DI, called DI pruning, consists in cutting the tree nodes whose quality remains lower than their purity, according to the condition given by Equation 2.

DI(T_\nu) \le 1 - I(\nu)    (Equation 2)

This method is based on the following principle: the construction cost of the tree must improve the overall quality of the tree. So, when a node has a purity greater than its quality index (with respect to the definition of our quality index), one can estimate that the cost of the sub-tree is too expensive regarding its effectiveness.
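A possible rendering of DI pruning in the same JAVA sketch, reusing the Node, impurity and di helpers above. Equation 2's test comes from the paper; the bottom-up traversal and the recomputation of DI after each cut are our assumptions.

    // To be added to the DepthImpurity sketch above.
    static void diPrune(Node node) {
        for (Node child : node.children) {
            diPrune(child);                  // prune the deepest nodes first
        }
        if (!node.isLeaf()) {
            double purity = 1.0 - impurity(node.counts);
            if (di(node) <= purity) {        // Equation 2: DI(T_v) <= 1 - I(v)
                node.children.clear();       // the sub-tree costs more than it brings
            }
        }
    }

Raising the damping (lowering beta in this sketch, "a higher damping function" in the paper's terms) lowers the DI of deep sub-trees and therefore prunes more aggressively, which is how the second pruned tree of the experiment below is obtained.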
Guided tour

The interface of UnDeT remains rather traditional: a menu bar gives access to the various available functions. The Fichier menu (Figure 1) ensures the loading and backup of the data (using a dialog box, Figure 2) and the program exit. The Arbre menu allows the construction and pruning of the trees, as well as the "Train and Test" procedure and automatic cross-validation. Finally, the Options menu makes it possible to choose the selection criterion used in the tree induction, the pruning method and the quality index which will be associated with the tree. The preferences (Figure 3) allow the user to parameterize the constants used during the computation.

Figure 1: Fichier and Options menus
Figure 2: Dialog box
Figure 3: Preferences

The results display is relatively basic (Figure 4) but gives the user several pieces of information. In addition to a recall of the parameters used during the construction of the tree or its pruning, the display specifies, for each node, the number of associated examples and their distribution over each class. The impurity and the quality index of each node are also available.

Figure 4: Results display

EXPERIMENT

We now comment on an experiment we consider representative of the various tests we carried out. For this experiment we used UnDeT and its pruning strategy.

Child's meningitis database

Here is an example of the use of our software on a medical database coming from the University Hospital at Grenoble (France). This database is the result of a medical study on young patients affected by meningitis. Meningitis is a pathology generally caused by a viral infection, but about one quarter of the cases are caused by bacteria. While viral cases only need medical supervision, patients infected by bacteria have to be treated with suitable antibiotics. An early diagnosis is important because a bacterial infection endangers the patient's life.

This database contains 328 examples described by 23 (quantitative or qualitative) attributes. The class is bivalued (viral versus bacterial). Whereas the diagnosis is considered obvious on typical cases, nearly one third of these examples present non-typical clinical and biological data: the prediction becomes difficult because of the poor link between the attributes, considered separately, and the diagnosis. We show in this section how the pruning method provided by UnDeT is relevant in such domains.

Results

The initial tree has a high quality index (0.848), due to the relevance of the chosen attributes. Such a result is considered quite decent in the medical domain.

DI=0.848
-1- (30/245) gram<=0: DI=0.818
| -3- (16/243) prot<=0.95: DI=0.826
| | -5- (9/223) purpura=0: DI=0.818
| | | -10- (7/24) age<=1.66: DI=0.722
| | | | -12- (0/14) vs<=78: I=0 Class=vir
| | | | -13- (7/10) vs>78: DI=0.631
| | | | | -14- (0/7) age<=0.58: I=0 Class=vir
| | | | | -15- (7/3) age>0.58: DI=0.514
| | | | | | -16- (1/3) fever<=39.7: I=0.811 Class=vir
| | | | | | -17- (6/0) fever>39.7: I=0 Class=bact
| | | -11- (2/199) age>1.66: DI=0.832
| | | | -18- (0/119) lung=0: I=0 Class=vir
| | | | -19- (0/54) lung=1: I=0 Class=vir
| | | | -20- (0/20) lung=2: I=0 Class=vir
| | | | -21- (2/6) lung=3: DI=0.797
| | | | | -22- (0/1) season=winter: I=0 Class=vir
| | | | | -23- (2/0) season=spring: I=0 Class=bact
| | | | | -24- (0/5) season=summer: I=0 Class=vir
| | -6- (1/10) purpura=1: DI=0.873
| | | -26- (0/10) cytol<=400: I=0 Class=vir
| | | -27- (1/0) cytol>400: I=0 Class=bact
| | -7- (0/10) purpura=2: I=0 Class=vir
| | -8- (4/0) purpura=3: I=0 Class=bact
| | -9- (2/0) purpura=4: I=0 Class=bact
| -4- (14/2) prot>0.95: DI=0.685
| | -28- (2/2) cytol<=600: I=1 Class=bact
| | -29- (12/0) cytol>600: I=0 Class=bact
-2- (54/0) gram>0: I=0 Class=bact

Figure 5: Initial tree on the meningitis database.

To comment on these results, let us call Ti the sub-tree whose root is labelled -i-. On the tree in Figure 5, T4 has a lower impurity than T10 (0.54 against 0.77); even so, the quality index of T10 is higher (DI(T10)=0.722 and DI(T4)=0.685). Whereas the deeper sub-tree is penalized, DI(T10) is better because it explains its instances better (it leads to a single misclassified instance). The complexity cost of T10 will appear with the pruning stage below.

DI=0.829
-1- (30/245) gram<=0: DI=0.796
| -3- (16/243) prot<=0.95: DI=0.803
| | -5- (9/223) purpura=0: DI=0.792
| | | -10- (7/24) age<=1.66: DI=0.722
| | | | -12- (0/14) vs<=78: I=0 Class=vir
| | | | -13- (7/10) vs>78: DI=0.631
| | | | | -14- (0/7) age<=0.58: I=0 Class=vir
| | | | | -15- (7/3) age>0.58: DI=0.514
| | | | | | -16- (1/3) fever<=39.7: I=0.811 Class=vir
| | | | | | -17- (6/0) fever>39.7: I=0 Class=bact
| | | -11- (2/199) age>1.66: I=0.080 Class=vir
| | -6- (1/10) purpura=1: DI=0.873
| | | -26- (0/10) cytol<=400: I=0 Class=vir
| | | -27- (1/0) cytol>400: I=0 Class=bact
| | -7- (0/10) purpura=2: I=0 Class=vir
| | -8- (4/0) purpura=3: I=0 Class=bact
| | -9- (2/0) purpura=4: I=0 Class=bact
| -4- (14/2) prot>0.95: DI=0.685
| | -28- (2/2) cytol<=600: I=1 Class=bact
| | -29- (12/0) cytol>600: I=0 Class=bact
-2- (54/0) gram>0: I=0 Class=bact

Figure 6: First pruned tree.

In the first pruned tree (Figure 6), the node T11 becomes a leaf, introducing 2 errors. This is not too high in a node of 201 instances. Yet the initial tree had built a complex sub-tree there and produced deep leaves, which are not reliable in uncertain domains.

DI=0.762
-1- (30/245) gram<=0: DI=0.716
| -3- (16/243) prot<=0.95: DI=0.718
| | -5- (9/223) purpura=0: I=0.237 Class=vir
| | -6- (1/10) purpura=1: DI=0.873
| | | -26- (0/10) cytol<=400: I=0 Class=vir
| | | -27- (1/0) cytol>400: I=0 Class=bact
| | -7- (0/10) purpura=2: I=0 Class=vir
| | -8- (4/0) purpura=3: I=0 Class=bact
| | -9- (2/0) purpura=4: I=0 Class=bact
| -4- (14/2) prot>0.95: DI=0.685
| | -28- (2/2) cytol<=600: I=1 Class=bact
| | -29- (12/0) cytol>600: I=0 Class=bact
-2- (54/0) gram>0: I=0 Class=bact

Figure 7: Second pruned tree.

The second pruned tree (Figure 7), built with a higher damping function on the quality index, turns the node T5 into a leaf. In this process, T10 has also disappeared: its complexity was not reliable.

Finally, let us notice that the sub-tree T4 is not pruned, whereas all the common pruning methods would cut it first because it does not lessen the error rate. This sub-tree preserves a reliable population with the attribute "cytol" higher than 600 (a result checked by a medical expert). It is typically the situation we were looking for in our motivations (see the "Specific needs" section).
CONCLUSION

We have introduced UnDeT, a program that builds decision trees for the uncertain domains. This software can be used either as a supervised learning system or as a way to study the data used in a machine learning process. We have then presented DI, a quality index, and the pruning strategy stemming from it, which are useful when studying uncertain domains.

DI allows the evaluation of a decision tree, but it does not inform us about the quality of the rules induced from the tree. This is the reason why we are currently working to define a new quality index for rules induced from decision trees. Another project is to transform UnDeT into an applet available on the web. A small and restricted model may already be reached there: http://cush.info.unicaen.fr/~dom/ArbreDecision.html.

References

[1] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. The Wadsworth Statistics/Probability Series, 1984.

[2] B. Crémilleux, C. Robert and M. Gaio. Uncertain Domains and Decision Trees: ORT versus C.M. Criteria. In Proceedings of the 7th Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU), pages 540-546, Paris (France), July 1998. EDK.

[3] N. Delcroix, F. Kauffman, J.M. Constans, B. Mazoyer and B. Victorri. Fitting with Prior Knowledge: Improvements in Quantification of in Vivo Proton MRS of Brain Metabolites. ISMRM, (1901), 1998.

[4] N. Delcroix, F. Kauffman, S. Philippe, J.M. Constans, B. Mazoyer and B. Victorri. Ajustement paramétrique pour améliorer la quantification in vivo de la SRM: application au cerveau [Parametric fitting to improve in vivo MRS quantification: application to the brain]. GRAMM, (12), 1998.

[5] U.M. Fayyad and K.B. Irani. The Attribute Selection Problem in Decision Tree Generation. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 104-110, Cambridge, MA, 1992. AAAI Press/MIT Press.

[6] D. Fournier and B. Crémilleux. Using Impurity and Depth for Decision Trees Pruning. To be published in Proceedings of the Second International ICSC Symposium on Engineering of Intelligent Systems (EIS 2000), Paisley, Scotland, U.K., June 2000.

[7] D. Jensen and M. Schmill. Adjusting for Multiple Comparisons in Decision Tree Pruning. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD 97), Newport Beach, California, 1997. AAAI Press.

[8] J. Mingers. An Empirical Comparison of Pruning Methods for Decision Tree Induction. Machine Learning, 4, pages 227-243, 1989. Kluwer Academic Publishers, Boston.

[9] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California, 1993.

[10] A. Ragel. Preprocessing of Missing Values using Robust Association Rules. In J.M. Zytkow and M. Quafafou, editors, Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD 98), Lecture Notes in Artificial Intelligence No. 1520, pages 414-422, Nantes (France), 1998. Springer-Verlag.

[11] A. Ragel and B. Crémilleux. MVC - a Preprocessing Method to Deal with Missing Values. Knowledge-Based Systems, pages 285-291, 1999.

[12] P. Utgoff, N. Berkman and J. Clouse. Decision Tree Induction Based on Efficient Tree Restructuring. Machine Learning, 29, pages 5-44, 1997. Kluwer Academic Press.