USING CLASSIFICATION TREES TECHNIQUES LIKE SENSITIVITY ANALYSIS
IN THE FIELD OF RADIOECOLOGY
B. Briand1*, C. Mercat-Rommens1, G. Ducharme2
1Institute for Radioprotection and Nuclear Safety, France; 2Université Montpellier II, France
[email protected]
This study was carried out within the framework of the SENSIB project (an acronym referring to radioecological
sensitivity) [1], which has been developed since 2003 by the French Institute for Radioprotection and Nuclear
Safety (IRSN) with financial support from ADEME, the French Environment and Energy
Management Agency. The main goal is to develop a standardised tool with a single scale of indexes in order to
describe and compare the sensitivity of various environments to radioactive pollution. Each index will represent
a level of response of an environment to a pollution event; for example, an index of 1 refers to a territory of low
sensitivity, whereas an index of 5 refers to a territory of high sensitivity.
This communication focuses solely on the agricultural aspects of the SENSIB project. The objective is to
determine which factors (agronomical or radioecological) have the greatest influence on the radioactive
contamination of agricultural products and will form the basis for constructing the indexes. The characteristics
of the French territories that most strongly influence the fate of radioactive contamination in the
environment are identified using radioecological models. These models are generally non-linear and use agronomical
and radioecological input variables, often linked by linear and/or non-linear relations. Therefore, to
better understand how the models work, we decided to perform an original global
sensitivity analysis using classification tree techniques [2,3]. Unlike other methods of global
sensitivity analysis [4], classification tree techniques make it possible to determine which input variables, or
combinations of input variables, contribute most to the different (predetermined) categories of the model output.
The pathways linking the input variables to the model output can thus be described more precisely and
used to propose recommendations to mitigate the consequences of environmental radioactive contamination.
The sensitivity analysis is performed with the CART method (Classification And Regression Trees)
developed by Breiman et al. [5]. The method is non-parametric and builds regression or
classification trees depending on whether the output variable is quantitative or qualitative. A classification tree is
constructed by successively splitting the data set into subsets called nodes. A recursive binary partitioning
process is applied whereby each parent node is divided into two descendant nodes (intermediate or
terminal), and the process is repeated by treating each intermediate node as a parent node (see Figure 1).
[Figure 1: Example of classification tree. The root node t0 is split by the test Xk < δ / Xk ≥ δ; nodes t1 to t10; output classes: class 1, class 2.]
By applying a split at threshold δ on one of the input variables Xk, the root
node t0 (containing all of the output values) is divided into two child
nodes t1 and t2 (two data sets): node t1 contains the output values
for which Xk < δ and node t2 contains the complementary set.
This process is then repeated on the descendant nodes, which are also
split into two child nodes. Nodes that are not divided further
are called terminal nodes or leaves and are assigned to a class of the
output variable. Thus, each branch originating from the root node t0
of the tree constitutes a path which, through a series of yes/no questions,
arrives at a terminal node.
The building of a classification tree rests on a splitting criterion based on an impurity function (a pure node
contains only values of one class of the output variable). Several splitting criteria exist; for the
present study we use the entropy criterion, defined by:
$$ i(t) = -\sum_{k=1}^{m} P(k/t)\,\log P(k/t) $$
where m is the number of classes of the output variable and P(k/t) is the conditional probability of class k given
that we are in node t. Every split d at node t leads to an impurity reduction:
$$ \Delta i(d, t) = i(t) - p_g\, i(t_g) - p_d\, i(t_d) $$
where p_g and p_d are the proportions of the values of node t sent to the left node t_g and the right node t_d,
respectively. The best split d* maximizes the impurity reduction over the set D of candidate splits:
$$ \Delta i(d^*, t) = \max_{d \in D} \Delta i(d, t) $$
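The entropy criterion and the search for the best split can be illustrated with a short Python sketch. It is only a minimal illustration, not the S-PLUS implementation used in the study: the function names, the choice of midpoints between consecutive observed values as the candidate set D, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy impurity i(t) = -sum_k P(k/t) log P(k/t) of the labels in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def impurity_reduction(x, y, threshold):
    """Impurity reduction of the split x < threshold versus x >= threshold."""
    left = [label for value, label in zip(x, y) if value < threshold]
    right = [label for value, label in zip(x, y) if value >= threshold]
    p_g, p_d = len(left) / len(y), len(right) / len(y)
    return entropy(y) - p_g * entropy(left) - p_d * entropy(right)

def best_split(x, y):
    """Return (threshold, reduction) with the largest impurity reduction.

    Assumption: candidate thresholds are the midpoints between consecutive
    distinct values of x, so x must contain at least two distinct values.
    """
    values = sorted(set(x))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(((d, impurity_reduction(x, y, d)) for d in candidates),
               key=lambda pair: pair[1])

# Toy data invented for illustration: one input variable, two output classes.
x = [0.02, 0.05, 0.08, 0.15, 0.30, 0.60]
y = ["low", "low", "low", "high", "high", "high"]
threshold, gain = best_split(x, y)
print(threshold, gain)
```

On this toy data the selected threshold lies between 0.08 and 0.15 and separates the two classes perfectly, so the impurity reduction equals the entropy of the root node, log 2.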
The building of a classification tree by the CART method rests on the successive application of the following
three steps:
- Growing the maximal tree. The data set is split successively in order to build an extended tree. The splitting
process stops when a node is pure or when the number of values in the node is less than a fixed size.
- Tree pruning. A sequence of trees is built by removing the large branches of the extended tree whose removal
causes only a small increase in the misclassification rate. To measure this increase, a complexity
parameter α is calculated; as it increases, more and more branches are pruned away, leading to smaller trees.
- Selection of the optimal tree. The optimal tree is selected from this sequence of subtrees. The
selection is based on an estimate of the prediction error obtained by cross-validation or from a pruning sample.
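The role of the complexity parameter α in the pruning step can be made explicit with the cost-complexity measure of Breiman et al. [5]. The notation below (R, T, T_t and the tilde for the set of terminal nodes) is introduced here for illustration and does not appear in the text above.

```latex
R_\alpha(T) = R(T) + \alpha\,|\tilde{T}|,
\qquad
g(t) = \frac{R(t) - R(T_t)}{|\tilde{T}_t| - 1}
```

Here R(T) is the misclassification rate of the tree T, |T̃| is its number of terminal nodes, R(t) is the misclassification rate obtained if the node t is turned into a leaf, and T_t is the branch rooted at t. At each pruning step the branch with the smallest value of g(t) is removed; this value is the smallest α for which pruning the branch no longer increases R_α, which is why larger values of α correspond to smaller trees.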
This method was applied to the concrete example of the transfer of strontium 90 to lettuce, the strontium 90
being released into an agricultural environment following an accidental emission into the atmosphere. The classification tree
obtained (with the S-PLUS software) is presented in Figure 2. The root node is based on the input variable
Rc, the interception rate (range [0, 1]). From the left branch of the tree, we can deduce a first decision rule:
if Rc < 0.1 then 90Sr activity = low.
Because the variable Rc can be correlated with the growth stage of the vegetable, such an indication on Rc can be
translated directly into an operational countermeasure for farmers. Variables or combinations of input variables that
characterize high values of activity are deduced from another branch of the optimal tree, for example:
if Rc ≥ 0.1 and Delai < 29.5 and Dep ≥ 4476.23 then 90Sr activity = high.
Thus, a complete examination of the tree structure makes it possible to determine which combinations of factors are
responsible for low and high values of radioactive contamination.
[Figure 2: Classification tree obtained. Branches are labelled yes/no. Output classes: 1: low (concentrations lower than 100 Bq.kg-1 fresh); 2: high (concentrations higher than 100 Bq.kg-1 fresh). Rc: Interception capacity of the lettuce (dimensionless); Dep: Activity deposited (Bq.m-2); Delai: Time between the deposit and the harvest of the plant (day).]
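The two decision rules quoted above can be written as a small Python function. This is a minimal sketch: the function name and the argument names are hypothetical, and only the two branches given in the text are encoded, so the function returns None for inputs that fall on the branches of Figure 2 that are not described here.

```python
def classify_sr90_activity(rc, delai, dep):
    """Apply the two decision rules quoted in the text.

    rc: interception rate (range [0, 1])
    delai: time between the deposit and the harvest (days)
    dep: activity deposited (Bq.m-2)
    Returns "low", "high", or None when neither quoted rule applies.
    """
    if rc < 0.1:
        return "low"
    if delai < 29.5 and dep >= 4476.23:
        return "high"
    return None  # remaining branches of the tree are not given in the text

print(classify_sr90_activity(0.05, 10, 5000.0))  # low
print(classify_sr90_activity(0.4, 10, 5000.0))   # high
print(classify_sr90_activity(0.4, 60, 5000.0))   # None
```

Written this way, each rule reads as one path from the root node to a terminal node, which is what makes the tree directly usable for operational recommendations.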
However, one of the disadvantages of classification trees is their instability [6]. A small change in the data
set used to build the tree can lead to a very different tree. This instability affects the tree nodes
(splits), the tree size and also the predictions. To avoid this problem and to stabilize the
predictions, methods based on model aggregation such as Bagging [7] or Random Forests [8] have been proposed. These
methods can clearly improve the performance of the predictors; however, the tree structure is lost, and with it the
decision rules that could be derived from it. In order to preserve the tree structure and to obtain
more stable decision rules, a node-level stabilizing procedure has been proposed [9]. With this algorithm a new
extended tree is built. The new optimal tree obtained is more stable and supports a more robust identification of
the most sensitive variables. This method was applied to the preceding example and allowed us to propose robust
recommendations to mitigate the consequences of environmental radioactive contamination.
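The instability of a split, and the way aggregation stabilizes predictions at the cost of the tree structure, can be illustrated with a minimal Python sketch of Bagging [7]. It is not the node-level stabilizing procedure of [9]. The trees are reduced to a single split (stumps) chosen by misclassification count rather than entropy, and the function names, the synthetic data and the number of resamples are assumptions made for this example.

```python
import random
from collections import Counter

def majority(labels, default):
    """Most frequent label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(x, y):
    """Fit a one-split tree (threshold, left_class, right_class) with the fewest errors."""
    values = sorted(set(x))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values[:1]
    overall = majority(y, None)
    best = None
    for d in candidates:
        left = [label for v, label in zip(x, y) if v < d]
        right = [label for v, label in zip(x, y) if v >= d]
        left_class, right_class = majority(left, overall), majority(right, overall)
        errors = (sum(label != left_class for label in left)
                  + sum(label != right_class for label in right))
        if best is None or errors < best[0]:
            best = (errors, d, left_class, right_class)
    return best[1:]

def predict_stump(stump, v):
    threshold, left_class, right_class = stump
    return left_class if v < threshold else right_class

def bagged_stumps(x, y, n_trees, rng):
    """Fit one stump per bootstrap resample (sampling with replacement)."""
    n = len(x)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return stumps

def predict_bagged(stumps, v):
    """Aggregate the stumps by majority vote."""
    return Counter(predict_stump(s, v) for s in stumps).most_common(1)[0][0]

# Synthetic data invented for illustration: noisy threshold at 0.5.
rng = random.Random(0)
x = [rng.random() for _ in range(60)]
y = ["high" if v + rng.gauss(0, 0.05) >= 0.5 else "low" for v in x]

stumps = bagged_stumps(x, y, 25, rng)
thresholds = [s[0] for s in stumps]
print(min(thresholds), max(thresholds))  # spread of the split across resamples
print(predict_bagged(stumps, 0.05), predict_bagged(stumps, 0.95))
```

The spread between the smallest and largest threshold shows how the split moves from one resample to the next. The majority vote gives a stable prediction, but there is no longer a single tree from which decision rules can be read.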
References
[1] Mercat C, Renaud P: From radioecological sensitivity to risk management: the SENSIB project.
Radioactivity in the Environment, Nice, October 2005.
[2] Mishra S, Deeds N.E, RamaRao B.S: Application of classification trees in the sensitivity analysis of
probabilistic model results. Reliability Engineering and System Safety, 79:123-129, 2003.
[3] Mokhtari A, Frey H.C, Jaykus L.A: Application of classification and regression trees for sensitivity
analysis of the Escherichia coli O157:H7 food safety process risk model. Journal of Food Protection,
69(3):609-618, 2006.
[4] Saltelli A, Chan K, Scott M: Sensitivity Analysis. John Wiley & Sons, Probability and Statistics
series, 2000.
[5] Breiman L, Friedman J.H, Olshen R, Stone C.J: Classification and Regression Trees. Wadsworth,
Belmont CA, 1984.
[6] Breiman L: Heuristics of instability and stabilization in model selection. The Annals of Statistics,
24(6):2350-2383, 1996.
[7] Breiman L: Bagging predictors. Machine Learning, 24:123-140, 1996.
[8] Breiman L: Random forests. Machine Learning, 45(1):5-32, 2001.
[9] Dannegger F: Tree stability diagnostics and some remedies for instability. Statistics in Medicine,
19:475-491, 2000.