USING CLASSIFICATION TREES TECHNIQUES LIKE SENSITIVITY ANALYSIS IN THE FIELD OF RADIOECOLOGY

B. Briand1*, C. Mercat-Rommens1, G. Ducharme2
1 Institute for Radioprotection and Nuclear Safety, France; 2 Université Montpellier II, France
[email protected]

This study is carried out within the framework of the SENSIB project (an acronym referring to radioecological sensitivity) [1], which has been under development since 2003 at the French Institute for Radioprotection and Nuclear Safety (IRSN) and benefits from the financial support of ADEME, the French Environment and Energy Management Agency. The main goal is to develop a standardised tool with a single scale of indexes in order to describe and compare the sensitivity of various environments to radioactive pollution. Each index will represent a level of response of an environment to a pollution; for example, an index of 1 refers to a territory of low sensitivity, whereas an index of 5 refers to a territory of high sensitivity.

This communication focuses solely on the agricultural aspects of the SENSIB project. The objective is to determine which factors (agronomical or radioecological) have the greatest influence on the radioactive contamination of agricultural production; these factors will form the basis for constructing the indexes. The identification of the characteristics of the French territories that most strongly influence the fate of a radioactive contamination in the environment is based on radioecological models. These models are generally non-linear and use agronomical and radioecological input variables, often linked by linear and/or non-linear relations. For this reason, and in order to better understand how the models work, we decided to perform an original global sensitivity analysis using classification tree techniques [2,3].
Contrary to the other methods of global sensitivity analysis [4], classification tree techniques make it possible to determine which input variables, or combinations of input variables, contribute most to each of the predetermined categories of the model output. The pathways linking the input variables to the output of the model can therefore be described more precisely and used to propose recommendations to mitigate the consequences of environmental radioactive contamination.

The method used to perform the sensitivity analysis is the CART method (Classification And Regression Trees) developed by Breiman et al. [5]. The method is non-parametric and enables the construction of regression or classification trees, depending on whether the output variable is quantitative or qualitative. A classification tree is constructed by successively splitting the data set into subsets called nodes. A recursive binary partitioning process is applied, whereby each parent node is always divided into two descendant nodes (intermediate or terminal), and the process is repeated by treating each intermediate node as a parent node (see Figure 1).

Figure 1: Example of classification tree. The root node $t_0$ is split on $X_k < \delta$ versus $X_k \geq \delta$; the tree contains the nodes $t_0$ to $t_{10}$, and each terminal node is assigned to output class 1 or class 2.

By performing a division at the value $\delta$ on one of the input variables $X_k$, the root node $t_0$ (containing all of the output values) is divided into two child nodes $t_1$ and $t_2$ (two data sets): the node $t_1$ contains the output values for which $X_k < \delta$, and the node $t_2$ contains the complementary set. This process is then repeated on the descendant nodes, which are in turn split into two child nodes. Nodes that are not divided further are called terminal nodes, or leaves, and are each assigned to a class of the output variable. Thus, each branch originating from the root node $t_0$ constitutes a path which, through a series of yes/no questions, arrives at a terminal node.
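The tree structure described above can be illustrated with a minimal Python sketch. It is not the S-PLUS implementation used in the study: the `Node` class, the `classify` function, the variable names `X1` and `X2`, and the thresholds are all hypothetical and serve only to show how a sample follows a path of yes/no questions from the root to a leaf.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """A node of a binary classification tree (hypothetical structure).

    An intermediate node holds a split (variable, threshold) and two children;
    a terminal node (leaf) holds only the output class assigned to it.
    """
    variable: Optional[str] = None     # name of the input variable X_k
    threshold: Optional[float] = None  # split value delta
    left: Optional["Node"] = None      # child reached when X_k < delta
    right: Optional["Node"] = None     # child reached when X_k >= delta
    label: Optional[str] = None        # output class, set on leaves only

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def classify(node: Node, x: Dict[str, float]) -> str:
    """Follow the yes/no questions from the root down to a terminal node."""
    while not node.is_leaf():
        node = node.left if x[node.variable] < node.threshold else node.right
    return node.label


# Hypothetical two-level tree: X1 < 0.5 gives class 1, otherwise split on X2.
tree = Node("X1", 0.5,
            left=Node(label="class 1"),
            right=Node("X2", 10.0,
                       left=Node(label="class 2"),
                       right=Node(label="class 1")))

print(classify(tree, {"X1": 0.2, "X2": 3.0}))  # class 1
print(classify(tree, {"X1": 0.8, "X2": 3.0}))  # class 2
```

Each root-to-leaf path in such a structure corresponds to one decision rule, which is what makes a single tree readable as a set of recommendations.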
The building of a classification tree rests on a splitting criterion based on an impurity function (a pure node contains only values of one class of the output variable). Different splitting criteria exist; for the present study we use the entropy criterion, defined by:

$$ i(t) = -\sum_{k=1}^{m} P(k \mid t)\,\log P(k \mid t) $$

where $m$ is the number of classes of the output variable and $P(k \mid t)$ is the conditional probability of class $k$ given that we are in the node $t$. Every split $d$ at the node $t$ leads to an impurity reduction:

$$ \Delta i(d, t) = i(t) - p_g\, i(t_g) - p_d\, i(t_d) $$

where $p_g$ and $p_d$ are the proportions of the values sent to the left node $t_g$ and the right node $t_d$, respectively. The best split $d^*$ maximizes the impurity reduction over the set $D$ of candidate splits:

$$ \Delta i(d^*, t) = \max_{d \in D} \Delta i(d, t) $$

The building of a classification tree by the CART method rests on the successive application of the following three steps:
- Growing the maximal tree. The data set is split successively in order to build an extended tree. The splitting process stops when a node is pure or when the number of values in the node is less than a fixed size.
- Tree pruning. A sequence of trees is built by removing the large branches of the extended tree whose removal causes only a weak increase in the misclassification rate. To measure this increase, a complexity parameter α is calculated; as it increases, more and more branches are pruned away, leading to smaller trees.
- Selection of the optimal tree. The optimal tree has to be selected from this sequence of subtrees. The selection is based on an evaluation of the predictive error using cross-validation or a pruning sample.

This method was applied to the concrete example of the transfer of strontium 90 to lettuce, the strontium 90 being released into an agricultural environment following an accidental emission into the atmosphere. The classification tree obtained (with the S-PLUS software) is presented in Figure 2. The primary node is based on the input variable Rc, the interception rate (range [0, 1]).
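Returning to the splitting criterion defined above, the entropy impurity, the impurity reduction and the search for the best split on a single variable can be sketched as follows. This is a minimal illustration under stated assumptions, not the S-PLUS routine used for the study: the function names are hypothetical, the class probabilities are estimated by class frequencies in the node, candidate splits are taken as midpoints between consecutive distinct values, and the six data points are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """i(t) = -sum_k P(k|t) log P(k|t), with P(k|t) estimated by class frequencies."""
    n = len(labels)
    return sum(-(c / n) * math.log(c / n) for c in Counter(labels).values())


def impurity_reduction(xs, labels, delta):
    """Impurity reduction for the split x < delta (left) versus x >= delta (right)."""
    left = [y for x, y in zip(xs, labels) if x < delta]
    right = [y for x, y in zip(xs, labels) if x >= delta]
    if not left or not right:
        return 0.0  # degenerate split: one child is empty, nothing is gained
    p_g = len(left) / len(labels)   # proportion sent to the left node
    p_d = len(right) / len(labels)  # proportion sent to the right node
    return entropy(labels) - p_g * entropy(left) - p_d * entropy(right)


def best_split(xs, labels):
    """Return (delta*, reduction) maximizing the impurity reduction.

    Candidate splits are the midpoints between consecutive distinct values
    (an assumption of this sketch). Returns (None, 0.0) if no split exists.
    """
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    scored = ((d, impurity_reduction(xs, labels, d)) for d in candidates)
    return max(scored, key=lambda pair: pair[1], default=(None, 0.0))


# Invented toy data: one input variable and a two-class output.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = ["low", "low", "low", "high", "high", "high"]

delta_star, reduction = best_split(xs, labels)
print(delta_star, reduction)
```

On this toy sample the two classes are perfectly separated, so the best split produces two pure child nodes and the impurity reduction equals the entropy of the parent node, log 2.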
From the left branch of the tree, we can deduce a first decision rule: if Rc < 0.1 then 90Sr activity = low. Because the variable Rc can be correlated with the growth stage of the vegetable, such a threshold on Rc can be translated directly into operational countermeasures for farmers. Variables, or combinations of input variables, that characterize high values of activity are deduced from other branches of the optimal tree, for example: if Rc ≥ 0.1 and Delai < 29.5 and Dep ≥ 4476.23 then 90Sr activity = high. A complete examination of the tree structure thus makes it possible to determine which combinations of factors are responsible for low and high values of radioactive contamination.

Figure 2: Classification tree obtained. Output classes: 1: low (if the values of concentrations are lower than 100 Bq.kg⁻¹ fresh); 2: high (if the values of concentrations are higher than 100 Bq.kg⁻¹ fresh). Input variables: Rc: Interception capacity of the lettuce (wd); Dep: Activity deposited (Bq.m⁻²); Delai: Time between the deposit and the harvest of the plant (day).

However, one of the disadvantages of classification trees is their instability [6]. A small modification of the data set used to build the tree can lead to a very different tree. This instability affects the tree nodes (splits), the tree size and, moreover, the prediction. To avoid this problem and to stabilize the predictions, methods based on model aggregation, such as Bagging [7] or Random Forest [8], have been proposed. These methods can clearly improve the performance of the predictors; however, the tree structure is lost, and with it the potential decision rules that result from it. In order to preserve the tree structure and to obtain more stable decision rules, a node-level stabilizing procedure has been proposed [9]. Using this algorithm, a new extended tree is built.
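The sensitivity of a split to the data set, and the aggregation idea behind Bagging, can be illustrated with a toy Python sketch. It is neither the node-level stabilizing procedure of [9] nor the analysis performed in this study: the data are synthetic, each "tree" is reduced to a single split (a stump) chosen with the entropy criterion, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Entropy impurity estimated from class frequencies."""
    n = len(labels)
    return sum(-(c / n) * math.log(c / n) for c in Counter(labels).values())


def majority(labels):
    """Most frequent class in a list of labels."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(data):
    """Fit a one-split tree on (x, class) pairs with the entropy criterion.

    Returns (delta, left_class, right_class); delta is None when no split
    reduces the impurity (for example, a resample with a single class).
    """
    ys = [y for _, y in data]
    values = sorted({x for x, _ in data})
    best, best_gain = (None, majority(ys), majority(ys)), 0.0
    for a, b in zip(values, values[1:]):
        delta = (a + b) / 2
        left = [y for x, y in data if x < delta]
        right = [y for x, y in data if x >= delta]
        gain = (entropy(ys)
                - len(left) / len(ys) * entropy(left)
                - len(right) / len(ys) * entropy(right))
        if gain > best_gain:
            best, best_gain = (delta, majority(left), majority(right)), gain
    return best


def predict_stump(stump, x):
    delta, left_class, right_class = stump
    return left_class if delta is None or x < delta else right_class


def bagged_predict(stumps, x):
    """Majority vote over stumps fitted on bootstrap resamples."""
    return Counter(predict_stump(s, x) for s in stumps).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Synthetic data: x on a regular grid, class "high" when x >= 0.5.
data = [(i / 40, "high" if i >= 20 else "low") for i in range(40)]
# One stump per bootstrap resample (sampling with replacement).
stumps = [fit_stump(rng.choices(data, k=len(data))) for _ in range(25)]
thresholds = sorted({s[0] for s in stumps if s[0] is not None})

print(thresholds)                   # split values chosen on the resamples
print(bagged_predict(stumps, 0.9))  # aggregated class for a new value
```

The resampled data sets may each select a somewhat different split value, which is a small-scale picture of the instability discussed above, while the aggregated vote no longer corresponds to a single readable decision rule.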
The new optimal tree obtained is more stable and supports a more robust identification of the most sensitive variables. This method was applied to the preceding example and allows us to propose robust recommendations to mitigate the consequences of environmental radioactive contamination.

References
[1] Mercat C, Renaud P: From radioecological sensitivity to risk management: the SENSIB project. Radioactivity in the Environment, Nice, October 2005.
[2] Mishra S, Deeds N.E, RamaRao B.S: Application of classification trees in the sensitivity analysis of probabilistic model results. Reliability Engineering and System Safety, 79:123-129, 2003.
[3] Mokhtari A, Frey H.C, Jaykus L.A: Application of Classification and Regression Trees for Sensitivity Analysis of the Escherichia coli O157:H7 Food Safety Process Risk Model. Journal of Food Protection, 69(3):609-618, 2006.
[4] Saltelli A, Chan K, Scott M: Sensitivity Analysis. John Wiley & Sons, Probability and Statistics series, 2000.
[5] Breiman L, Friedman J.H, Olshen R, Stone C.J: Classification and Regression Trees. Wadsworth, Belmont CA, 1984.
[6] Breiman L: Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6):2350-2383, 1996.
[7] Breiman L: Bagging predictors. Machine Learning, 24:123-140, 1996.
[8] Breiman L: Random Forests. Machine Learning, 45(1):5-32, 2001.
[9] Dannegger F: Tree stability diagnostics and some remedies for instability. Statistics in Medicine, 19:475-491, 2000.