LEARNING MONOTONE MODELS FROM DATA AT ECML PKDD 2009
MoMo 2009, September 7, 2009, Bled, Slovenia

Workshop Organization
• Rob Potharst (Erasmus Universiteit Rotterdam)
• Ad Feelders (Universiteit Utrecht)

Program Committee
Arno Siebes, Universiteit Utrecht, The Netherlands
Michael Berthold, Universität Konstanz, Germany
Malik Magdon-Ismail, Rensselaer Polytechnic Institute, USA
Ivan Bratko, University of Ljubljana, Slovenia
Hennie Daniels, Tilburg University, The Netherlands
Oleg Burdakov, Linköping University, Sweden
Arie Ben-David, Holon Institute of Technology, Israel
Bernard De Baets, Universiteit Gent, Belgium
Michael Rademaker, Universiteit Gent, Belgium
Roman Slowinski, University of Poznan, Poland
Linda van der Gaag, Universiteit Utrecht, The Netherlands
Ioannis Demetriou, University of Athens, Greece

Preface

In many application areas of data analysis, we know beforehand that the relation between some predictor variable and the response should be increasing (or decreasing). Such prior knowledge can be translated to the model requirement that the predicted response should be a (partially) monotone function of the predictor variables. There are also many applications where a nonmonotone model would be considered unfair or unreasonable. Try explaining to a rejected job applicant why someone who scored worse on all application criteria got the job! The same holds for many other application areas, such as credit rating and university entrance selection. These considerations have motivated the development of learning algorithms that are guaranteed to produce (or have a bias towards) monotone models. Examples are monotone versions of classification trees, neural networks, rule learning, Bayesian networks, nearest neighbor methods and rough sets. Work on this subject has however been scattered over different research communities (machine learning, data mining, neural networks, statistics and operations research), and our aim with the workshop Learning Monotone Models from Data at ECML PKDD 2009 is to bring together researchers from these different fields to exchange ideas. Even though the number of submissions was not overwhelming, we were pleased to receive some high quality contributions that have been included in these proceedings. We are also very happy that Bernard De Baets from Ghent University in Belgium and Arie Ben-David from the Holon Institute of Technology in Israel accepted our invitation to give their view of the research area in two invited lectures at the workshop.

Rob Potharst and Ad Feelders, August 2009.
Table of Contents

Arie Ben-David, Monotone Ordinal Concept Learning: Past, Present and Future
Bernard De Baets, Monotone but not Boring: how to deal with Reversed Preference in Monotone Classification
Marina Velikova and Hennie Daniels, On Testing Monotonicity of Datasets
Oleg Burdakov, Anders Grimvall and Oleg Sysoev, Generalized PAV Algorithm with Block Refinement for Partially Ordered Monotonic Regression
Jure Žabkar, Martin Možina, Ivan Bratko and Janez Demšar, Discovering Monotone Relations with Padé
Nicola Barile and Ad Feelders, Nonparametric Ordinal Classification with Monotonicity Constraints

Monotone Ordinal Concept Learning: Past, Present and Future
Arie Ben-David
Department of Technology Management, Holon Institute of Technology, Holon, Israel

Abstract: This talk will survey the history of ordinal concept learning in general and that of monotone ordinal learning in particular. Some approaches that were taken over the years will be presented, as well as a survey of recent publications about the topic. Some key points that should be addressed in future research will be discussed, in particular: the use of a standard operating environment, the establishment of a "large enough" publicly available body of ordinal benchmarking files, and the use of an agreed-upon set of metrics and procedures for comparing the performance of the various models. Time will be allocated for an open discussion about how these and other goals can be promoted.

Monotone but not Boring: how to deal with Reversed Preference in Monotone Classification
Bernard De Baets
KERMIT Research Unit Knowledge-based Systems, Ghent University, Ghent, Belgium

Abstract: We deal with a particular type of classification problem, in which there exists a linear ordering on the label set (as in ordinal regression) as well as on the domain of each of the features. Moreover, there exists a monotone relationship between the features and the class labels. Such problems of monotone classification typically arise in a multi-criteria evaluation setting. When learning such a model from a data set, we are confronted with data impurity in the form of reversed preference. We present the Ordinal Stochastic Dominance Learner framework, which allows us to build various instance-based algorithms able to process such data. Moreover, we explain how reversed preference can be eliminated by relating this problem to the maximum independent set problem and solving it efficiently using flow network algorithms.

On Testing Monotonicity of Datasets
Marina Velikova (1) and Hennie Daniels (2)
(1) Department of Radiology, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands, [email protected]
(2) Center for Economic Research, Tilburg University, The Netherlands, and ERIM Institute of Advanced Management Studies, Erasmus University Rotterdam, Rotterdam, The Netherlands, [email protected]

Abstract. In this paper we discuss two heuristic tests to measure the degree of monotonicity of datasets.
It is shown that in the case of artificially generated data both measures discover the monotone and non-monotone variables as they were created in the test data. Furthermore, in the same study we demonstrate that the tests produce the same ordering on the independent variables from monotone to non-monotone. Finally, we prove that although both tests work well in practice, in a theoretical sense it can in some cases be impossible to decide whether the data were generated from a monotone relation or not.

1 Introduction

In many classification and prediction problems in economics one has to deal with relatively small data sets. In these circumstances flexible estimators like neural networks have a tendency to overfit the data. It has been illustrated that in the case of a monotone relationship between the response variable and a (sub)set of independent variables, (partially) monotone neural networks yield smaller prediction errors and have a lower variance compared to ordinary feed-forward networks ([1-3]). This is mainly due to the fact that the constraints imposing a monotone relation between input and response variable repress noise, but maintain the flexibility to model a non-linear association. In most practical cases one has a belief in advance about which of the independent variables have a monotone relation with the response variable ([4-6]). For example, we suppose that the house price increases with the number of square meters of living space, and that the income of a person rises with age and level of education. However, in the case where only a limited amount of empirical data is available, one would like to have a test to find out whether the presupposed behaviour is confirmed by the data. Earlier results on estimating regression functions under monotonicity constraints (isotonic regression) and testing monotonicity of a given regression function can be found, for example, in [7, 8].

In this paper we compare two heuristic tests for monotonicity. The first index measures the degree of monotonicity of the whole dataset and is defined as the fraction of comparable pairs that are monotone, computed for the datasets obtained after removing one of the independent variables. The second measures the degree of monotonicity of a certain independent variable with respect to the response variable, based on the monotonicity index introduced in [9, 10], which is defined by fitting a neural network to the data and therefore depends on the network architecture. This monotonicity index ranges between 0 and 1 for each variable, with 1 indicating full monotonicity and 0 indicating a non-monotone relation. This index induces an ordering on the independent variables depending on the degree of monotonicity. To compare both measures for monotonicity, we develop a procedure to produce an ordering of the independent variables also from the first measure. It is shown that both methods produce the same ordering of the variables in the case of experimentally generated data. We show that the second index has a small variance with respect to different neural network architectures but varies considerably with respect to the noise level in the data. Finally, we prove that if the data set has no comparable pairs, in which case the first measure is not calculable, we can build a piecewise linear monotone model of which the data is a sample. This is possible even when the data were generated from a completely non-monotone model.
The model can be constructed both as a non-decreasing and as a non-increasing piecewise linear function. This result is related to Proposition 2.5 of [11], where the authors show that an optimal linear fit to a non-linear monotone function may find the incorrect monotonicity direction for certain input densities. As a consequence we infer that the second index cannot be stable if the number of degrees of freedom in the neural network model is unlimited.

2 Monotone Prediction Problems and Models

Let X = X_1 × X_2 × ... × X_k be an input space represented by k attributes (features). A particular point x ∈ X is defined by the vector x = (x_1, x_2, ..., x_k), where x_i ∈ X_i, i = 1, ..., k. Furthermore, a totally ordered set of labels L is defined. In the discrete case, we have L = {1, ..., ℓ_max}, where ℓ_max is the maximal label. Note that ordinal labels can easily be quantified by assigning numbers from 1 for the lowest category to ℓ_max for the highest category. In the continuous case, we have L ⊂ R or L ⊂ R+. Unless the distinction is made explicitly, the term label is used to refer generally to the dependent variable irrespective of its type (continuous or discrete).

Next a function f is defined as a mapping f : X → L that assigns a label ℓ ∈ L to every input vector x ∈ X. In prediction problems, the objective is to find an approximation f̂ of f that is as close as possible, for example in the L1, L2, or L∞ norm. In particular, in regression we try to estimate the average dependence of ℓ given x, E[ℓ | x]. Any estimator, such as the neural network used in this paper, is an approximation of this function. In classification, we look for a discrete mapping represented by a classification rule r(x) assigning a class ℓ_x to each point x in the input space.

In reality, the information we have about f is mostly provided by a dataset D = {(x_n, ℓ_{x_n})}, n = 1, ..., N, where N is the number of points, x_n ∈ X and ℓ_{x_n} ∈ L. Then X = {x_n}, n = 1, ..., N, is a set of k independent variables represented by an N × k matrix, and L = {ℓ_{x_n}}, n = 1, ..., N, is a vector with the values of the dependent variable. In this context, D corresponds to a mapping f_D : X → L and we assume that f_D is a close approximation of f. Ideally, f_D is equal to f over X, which is seldom the case in practice due to the noise present in D. Hence, our ultimate goal in prediction problems is to obtain a close approximation f̂_MD of f by building a prediction model M_D from the given data D. The main assumption we make here is that f exhibits monotonicity properties with respect to the input variables and therefore f̂_MD should also obey these properties in a strict fashion.

In this study we distinguish between two types of problems, and their respective models, concerning the monotonicity properties. The distinction is based on the set of input variables that are in a monotone relationship with the response:

1. Totally monotone prediction problems (models): f (respectively f̂_MD) depends monotonically on all variables in the input space.
2. Partially monotone prediction problems (models): f (respectively f̂_MD) depends monotonically on some variables in the input space but not all.

Without loss of generality, in the remainder of the paper we consider monotone problems in which the label is continuous.

2.1 Total Monotonicity

In monotone prediction problems, we assume that D is generated by a process with the following properties:

    ℓ_x = f(x) + ε,    (1)

where f is a monotone function and ε is a random variable with zero mean and constant variance σ_ε².
Note that in classification problems ε is not additive but multiplicative (non-homogeneous variance), and there is a small probability that the assigned class is incorrect. We say that f is non-decreasing in x if

    x1 ≥ x2 ⇒ f(x1) ≥ f(x2),    (2)

where x1 ≥ x2 is the partial ordering on X defined by x1_i ≥ x2_i for i = 1, ..., k. The pair (x1, x2) is called comparable if x1 ≥ x2 or x1 ≤ x2, and if the relationship defined in (2) also holds, it is a monotone pair. Throughout the paper we assume that all monotone relationships are monotone increasing. If a relationship is monotone decreasing then the data are transformed in such a way that it becomes monotone increasing. The degree of monotonicity DgrMon of a dataset D is defined by

    DgrMon(D) = #Monotone pairs(D) / #Comparable pairs(D).    (3)

If all comparable pairs are monotone then DgrMon = 1 and the dataset is called monotone (non-decreasing by assumption).

2.2 Partial Monotonicity

In partially monotone problems, we have X = X^m × X^nm with X^m = X_1 × ... × X_m and X^nm = X_{m+1} × ... × X_k for 1 ≤ m < k. Furthermore, we have a data set D = {(x^m, x^nm, ℓ_x)} of N observations, where x^m ∈ X^m and x^nm ∈ X^nm. A data point x ∈ D is represented by x = (x^m, x^nm); the label of x is ℓ_x. We assume that D is generated by the following process:

    ℓ_x = f(x^m, x^nm) + ε,    (4)

where f is a monotone function in x^m and ε is a random error defined as before. The partial monotonicity constraint of f on x^m is defined by

    x1^nm = x2^nm and x1^m ≥ x2^m ⇒ f(x1) ≥ f(x2).    (5)

Henceforth, we call X^m the set of monotone variables and X^nm the set of non-monotone variables. By non-monotone we mean that a monotone dependence is not known a priori. Although we do not constrain the size of the two sets, our main assumption for the problems considered in this paper is that we have only a small number of non-monotone variables (e.g., < 5) and a large number of monotone variables.

3 Tests for Monotonicity of a Dataset

In partially monotone problems, one usually has prior knowledge about the monotone relationships of a subset of attributes with respect to the target, whereas for the remaining attributes such dependences are unknown a priori. To determine whether variables are monotone or non-monotone, we propose the following two empirical tests based on the available data.

3.1 Measuring Monotonicity by Removal of a Variable

To determine the ordering of the independent variables from monotone to less monotone we use the measure for monotonicity DgrMon defined in (3). Here we assume that the independent variables are not highly correlated and that all monotone relationships are monotone increasing. We compare the measure DgrMon obtained for the original data and for the data with an independent variable removed. The truncated dataset has one dimension less than the original data but the same number of data points. Next we show that (i) if DgrMon decreases after removing a variable then the variable is monotone, and (ii) if DgrMon increases after removing a variable then the variable is non-monotone. To see the effect of removing a single variable from a dataset we consider all possible cases of pairs, as illustrated in Fig. 1. Let (a, x, ℓ) denote a data point in the original dataset D, where a is the variable to be removed, x are the other independent variables and ℓ is the label.
First, we observe that the removal of a variable keeps a pair comparable or makes it comparable (monotone or non-monotone). This implies that the number of comparable pairs in the truncated data will increase in comparison to that of the original data. Next we note that the pairs that are incomparable due to x (i.e., x1 ≰ x2 and x1 ≱ x2) will remain incomparable after the removal of a variable. So, these pairs will not have an effect on the degree of monotonicity and they are not shown in Fig. 1. We then consider the remaining four cases of pairs and their transformation after the removal of the variable a, all illustrated in Fig. 1.

[Fig. 1. Effect of removing the variable a on a comparable (monotone, non-monotone) or incomparable pair.]

Case (1) corresponds to a monotone pair in the original data which remains monotone after the removal of a. Case (2) presents a non-monotone pair, which remains non-monotone without the variable a. Case (3) is an incomparable pair in the original data, which turns into a non-monotone pair when a is removed. Case (4) is also originally an incomparable pair, which becomes monotone after the removal of a. We denote the effect on monotonicity due to the removal of the variable a as follows: '0' means that there is no change in the type of the pair (cases (1) and (2)), '−' means that a non-monotone pair is created (case (3)) and '+' means that a monotone pair is created (case (4)).

We now look at the change of the degree of monotonicity DgrMon when we remove the variable a from the data. If DgrMon decreases after removing a then case (3) is more likely to occur than case (4). This means that a has a monotone relationship with the label, because we assumed that none of the variables, including x, has a decreasing relation with the label. If there is a substantial increase of DgrMon then case (4) is more likely to happen than case (3). As a consequence a should be a non-monotone variable, because if a were monotone, cases (1) and (3) would be more likely to occur than cases (2) and (4), which contradicts the increase of DgrMon. If DgrMon remains relatively the same then it cannot be decided whether or not the variable is monotone. This might occur, for example, when two or more independent variables are highly correlated. For example, consider the extreme case when two variables are identical. Then removing either of the two variables, irrespective of their type (monotone or non-monotone), will not affect the degree of monotonicity. In such cases, a straightforward solution is to remove all but one of the highly correlated variables. The arguments stated above explain the response of DgrMon to the removal of a non-monotone or monotone variable in Table 1, concerning the simulation study presented in Section 4.

3.2 Measuring Monotonicity by Function Approximation

For datasets with very few or no comparable pairs, however, DgrMon cannot be used as a consistent measure. This is likely to occur in cases where the number of independent variables is relatively large and the sample size relatively small.
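Purely as an illustration of the pair-based test just described (this sketch is not part of the original paper), DgrMon from (3) and the effect of removing each variable might be computed as follows; the function names and the NumPy dependency are illustrative choices, and the explicit double loop over pairs is intended only for small datasets.

    import numpy as np

    def dgr_mon(X, y):
        """Degree of monotonicity (3): fraction of comparable pairs that are monotone."""
        X, y = np.asarray(X), np.asarray(y)
        comparable = monotone = 0
        for i in range(len(y)):
            for j in range(i + 1, len(y)):
                le, ge = np.all(X[i] <= X[j]), np.all(X[i] >= X[j])
                if le or ge:                          # the pair is comparable
                    comparable += 1
                    if le and ge:                     # identical input vectors
                        monotone += int(y[i] == y[j])
                    elif le:
                        monotone += int(y[i] <= y[j])
                    else:
                        monotone += int(y[i] >= y[j])
        return monotone / comparable if comparable else float('nan')

    def removal_test(X, y):
        """Change in DgrMon after removing each variable in turn (Section 3.1)."""
        X = np.asarray(X)
        base = dgr_mon(X, y)
        deltas = {col: dgr_mon(np.delete(X, col, axis=1), y) - base
                  for col in range(X.shape[1])}
        return base, deltas   # delta < 0 suggests a monotone variable, delta > 0 a non-monotone one

On data such as the simulation study of Section 4, a clear drop of DgrMon after removing a variable is read as evidence that the variable is monotone, and a clear increase as evidence that it is non-monotone.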
Note that the number of comparable pairs is of the order 2^(−k) N², where k is the number of independent variables and N is the number of data points. In the case where there are no comparable pairs we can always construct a perfect fit with a monotone increasing or monotone decreasing piecewise linear function. This is shown in the Appendix.

An alternative test for monotonicity, which is independent of the number of comparable pairs, is based on the monotonicity index proposed in [9, 10]. To define this index we fit the data with a standard neural network. Then for every explanatory variable the partial derivative ∂f/∂x_i at each data point x_p is computed, where f denotes the neural network approximation. The monotonicity index in variable x_i is defined as

    MonInd(x_i) = (1/N) | Σ_{p=1..N} [ I⁺(∂f/∂x_i (x_p)) − I⁻(∂f/∂x_i (x_p)) ] |,    (6)

where I⁺(z) = 1 if z > 0 and I⁺(z) = 0 if z ≤ 0, I⁻(z) = 1 if z ≤ 0 and I⁻(z) = 0 if z > 0, N is the number of observations and x_p is the p-th observation (vector). Note that 0 ≤ MonInd(x_i) ≤ 1. A value of this index close to zero indicates a non-monotonic relationship; a value close to 1 indicates a monotonic relationship. The sign of the sum inside the absolute value indicates whether the relation of f with respect to x_i is increasing or decreasing.

Since the monotonicity index depends on the neural network approximation, it is interesting to see how the index depends on the network architecture and the noise level in the data. To check this we conducted the following experiment with a one-dimensional dataset. We generated a vector x of 200 observations drawn from the uniform distribution U(0, 1). Based on x we generated a label ℓ_x by

    ℓ_x = sin(3πx/4) + ε,

where ε represents noise. We computed the monotonicity index for different numbers of hidden nodes (1, 2, 5 and 10) and different noise levels ε: 0, 0.1·Norm(0, 1), 0.5·Norm(0, 1) and Norm(0, 1), where Norm(0, 1) denotes the normal distribution with zero mean and unit variance. The results are presented in Fig. 2.

[Fig. 2. Change in the monotonicity index for different numbers of hidden nodes and noise levels.]

If the neural network has one hidden neuron the output depends linearly on the input variable and the index is 1. If the number of hidden nodes is increased the network will capture the true signal and the monotonicity indices of the variables will tend to the right value. For the noise-free dataset (ε = 0) we expect that MonInd ≈ 1/3, and this is confirmed by the experiments with different numbers of hidden nodes. For small noise levels in the data and a number of hidden nodes larger than one, the neural network is still able to obtain a good approximation and a MonInd close to the expected value. However, when the noise level increases considerably the indices become inaccurate and eventually get close to 0, as shown in Fig. 2. This is due to the fact that the network starts fitting the noise, which is completely random.

4 Experiments with Simulation Data

In this section we demonstrate the application of the monotonicity tests on artificial data. Furthermore, to illustrate the importance of determining the true monotone relationships for building correct models, we compare the performance of three types of classifiers: partially monotone MIN-MAX networks, totally monotone MIN-MAX networks and standard feed-forward neural networks with weight decay.
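Before describing these experiments, a minimal sketch (again not the authors' code) of how the index in (6) could be evaluated for an already fitted model may be useful; the finite-difference approximation stands in for differentiating the fitted network, and the function name and step size are editorial choices.

    import numpy as np

    def mon_ind(f, X, eps=1e-4):
        """Monotonicity index (6) for each input variable of a fitted model.

        f : callable mapping an (N, k) array of inputs to an (N,) array of predictions,
            e.g. the prediction function of a fitted neural network.
        The partial derivatives at every data point are approximated by central
        finite differences; I+ and I- are the indicator functions from the text.
        """
        X = np.asarray(X, dtype=float)
        N, k = X.shape
        indices = np.empty(k)
        for i in range(k):
            step = np.zeros(k)
            step[i] = eps
            deriv = (f(X + step) - f(X - step)) / (2 * eps)   # approximate df/dx_i at every x_p
            i_plus = (deriv > 0).astype(float)                # I+(df/dx_i)
            i_minus = (deriv <= 0).astype(float)              # I-(df/dx_i)
            indices[i] = abs(np.sum(i_plus - i_minus)) / N
        return indices

For the one-dimensional experiment above, applying this to the prediction function of the fitted network should give a value near 1/3 in the noise-free case. We now return to the simulation experiments and the three types of models listed above.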
The first two classifiers are based on the two-hidden-layer network architecture introduced in [2], with a combination of minimum and maximum operators over linear functions. In [2] it is shown that totally monotone MIN-MAX networks are universal approximators of totally monotone functions, and in [1] this result is extended to partially monotone MIN-MAX networks.

We generate an artificial dataset D of 200 points and 5 independent variables. Each independent variable X_i, i = 1, ..., 5, is drawn from the uniform distribution U(0, 1). The dependent variable ℓ_x is generated by the following process:

    ℓ_x = x_1 + 1.5x_2 + 2x_3 + cos(10x_4) + sin(12x_5) + 0.01·Norm(0, 1).

Clearly, ℓ_x is a partially monotone label with added noise. We perform both tests for monotonicity described in the previous section. The monotonicity index is computed for two types of standard networks: one with 4 hidden nodes and one with 8 hidden nodes. The results are reported in Tables 1 and 2.

Table 1. Degree of monotonicity (DgrMon) of the original and modified simulation data after removing a variable

    Removed variable     Number of comparable pairs   DgrMon
    - (original data)    1003                         0.664
    x1                   2193                         0.623
    x2                   2093                         0.563
    x3                   2235                         0.499
    x4                   2586                         0.743
    x5                   1949                         0.737

Table 2. Monotonicity index for all independent variables in the simulation data

    Variable   NNet-4   NNet-8
    x1         1        0.94
    x2         1        0.97
    x3         1        0.97
    x4         0.38     0.37
    x5         0.68     0.44

With respect to the degree of monotonicity we observe that after the removal of one of the monotone variables the measure decreases compared to the original data, whereas the removal of one of the non-monotone variables leads to a considerably larger measure. The monotonicity indices reported in Table 2 also comply with the expected monotone relationships in the simulation data. Note that the results are comparable with respect to both function approximators: a neural network with 4 and with 8 hidden nodes. Both tests induce the same ordering of the variables from monotone to less monotone: x3 ≻ x2 ≻ x1 ≻ x5 ≻ x4.

Using this knowledge about the (non-)monotone relationships in the simulation data, we apply partially monotone MIN-MAX networks. As benchmark methods for comparison we use totally monotone MIN-MAX networks and standard neural networks with weight decay. We randomly split the original data into training data of 150 observations (75%) and test data of 50 observations (25%). The random partition of the data is repeated 20 times. The performance of the models is measured by computing the mean-squared error (MSE). We use nine combinations of parameters for MIN-MAX networks (groups: 2, 3, 4; planes: 2, 3, 4) and for standard neural networks (hidden nodes: 4, 8, 12; weight decay: 0.000001, 0.00001, 0.0001). At each of the twenty runs, we select the model that obtains the minimum MSE out of the nine parameter combinations with each method. Table 3 reports the minimal, mean and maximal value and the variance of the estimated MSE across the runs.

Table 3. Estimated prediction errors of the partially monotone networks (PartMonNet), totally monotone networks (FullMonNet) and standard neural networks with weight decay (NNet) for the simulation data

    Method       Min     Mean    Max     Variance
    PartMonNet   0.395   0.534   0.677   0.0089
    FullMonNet   0.845   1.118   1.331   0.0168
    NNet         0.537   0.946   1.173   0.0277

The results show that the models generated by partially monotone MIN-MAX networks are more accurate than the models generated by totally monotone MIN-MAX networks and standard neural networks.
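As a brief aside on the architecture (an editorial sketch in the spirit of [1, 2], not the implementation used in the paper), a partially monotone MIN-MAX unit combines minimum and maximum operators over linear functions, with the weights on the monotone inputs forced to be non-negative; all names and shapes below are illustrative.

    import numpy as np

    def minmax_forward(x_mon, x_non, params):
        """Forward pass of a partially monotone MIN-MAX unit (illustrative sketch).

        Output = max over groups of (min over planes of a linear function).
        Weights on the monotone inputs are made non-negative via exponentiation,
        so the output is non-decreasing in x_mon; weights on x_non are unconstrained.
        params['z'] : (groups, planes, n_mon)  unconstrained parameters for monotone inputs
        params['v'] : (groups, planes, n_non)  free weights for non-monotone inputs
        params['b'] : (groups, planes)         biases
        """
        w_mon = np.exp(params['z'])                        # non-negative weights
        planes = (np.einsum('gpm,m->gp', w_mon, x_mon)
                  + np.einsum('gpn,n->gp', params['v'], x_non)
                  + params['b'])                           # linear planes, shape (groups, planes)
        return np.max(np.min(planes, axis=1))              # min within groups, max across groups

A totally monotone MIN-MAX network corresponds to the special case with no unconstrained inputs. The comparison in Table 3, and the error-variance discussion that follows, refer to networks of this general form, with the parameters fitted by minimizing the MSE on the training data.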
Furthermore, the variation in the errors across runs is smaller for partially monotone MIN-MAX networks than for the benchmark methods, as are the differences between the respective maximum and minimum error values in Table 3. To check the significance of the differences between the network results, we performed statistical tests. Since the test set in the experiments with the three methods is the same, we conduct paired t-tests to test the null hypothesis that the models derived from partially monotone MIN-MAX networks have the same errors as the models derived from each of the benchmark methods, against the one-sided alternatives. The p-values obtained from the tests and the confidence intervals at 95% and 90% are reported in Table 4.

Table 4. p-values of paired t-tests and one-sided confidence intervals for the difference in error means in the simulation study

    Comparison                  p-value   95% confidence interval   90% confidence interval
    PartMonNet vs. FullMonNet   0.0%      (-0.651, -0.512)          (-0.640, -0.528)
    PartMonNet vs. NNet         0.0%      (-0.493, -0.332)          (-0.480, -0.345)

The results show that partially monotone MIN-MAX networks lead to models with significantly smaller errors than the errors of the models derived from the benchmark networks. In addition, we perform F-tests for the differences between the error variances of the partially monotone models and the benchmark models. With respect to the totally monotone MIN-MAX networks, the difference is statistically insignificant: the p-value is 8.78%. Regarding the standard neural networks, partially monotone MIN-MAX networks have significantly lower variances for the MSE: the p-value is 0.86%.

5 Conclusions and Future Work

In this paper we developed a method to measure the degree of monotonicity of a response variable with respect to the independent variables. This is done by calculating two measures for monotonicity. The first one is based on the removal of one independent variable at a time and computing the degree of monotonicity as the fraction of monotone pairs out of the comparable pairs in the truncated data. The second monotonicity measure is based on a neural network approximation to fit the data. This allows us to compute in a straightforward manner a measure for a variable's influence (monotone or non-monotone) on the label using the partial derivative of the network's output with respect to every variable. We have shown that both monotonicity measures induce the same ordering on the independent variables, from completely monotone to less monotone. We also exploited the monotonicity properties of the variables in the construction of a neural network approximation to fit the data. It turned out that partially monotone networks constructed in this way have lower prediction errors compared to totally monotone networks and ordinary feed-forward neural networks. With respect to the latter networks the variance of the partially monotone networks was also significantly smaller.

To conclude, the study presented here showed that investigating the monotonicity properties of the data in order to enforce them in the modelling stage is required to guarantee a successful outcome of the knowledge discovery process. Although the proposed methods are an important contribution in this direction, a number of open questions remain. Both measures for monotonicity are empirical and depend largely on the data under study.
This requires further investigation of the impact of factors such as the input distribution, the noise level and the correlation between variables on the computed measures. Moreover, the output of each monotonicity measure lies in the continuous range between 0 and 1, providing an order of the variables according to their monotone influence. However, to determine whether or not monotonicity properties hold with respect to a variable, we further need to define a benchmark value to compare with. Finally, we plan experiments with real-world data to get more insight into the development and application of the proposed measures.

Appendix

In the proposition below we show that if there are no comparable pairs in a dataset D we can construct a perfect non-decreasing (or non-increasing) monotone fit with a piecewise linear function.

Proposition 1. Suppose D is a dataset with no comparable pairs. Then there exists a piecewise linear non-decreasing (non-increasing) function f such that f fits D exactly.

Proof. In the proof we use a similar construction to the one in [2]. For simplicity we restrict ourselves to the 2-dimensional case and the case when f is non-decreasing. The generalisation to higher dimensions is straightforward. Let D = {(x_i, y_i, ℓ_i)}, i = 1, ..., N, with no comparable pairs. Now we define 3 hyperplanes for every point d_i = (x_i, y_i, ℓ_i) ∈ D as follows:

    h_i1(x, y) = ℓ_i (constant),
    h_i2(x, y) = a(x − x_i) + ℓ_i,  a > 0,
    h_i3(x, y) = b(y − y_i) + ℓ_i,  b > 0.

Next we define a piecewise linear non-decreasing function by

    f_i(x, y) = min_{j=1,2,3} h_ij(x, y).

Note that f_i(x_i, y_i) = ℓ_i. Finally we define

    F(x, y) = max_{i=1,...,N} f_i(x, y).

We now show that for a and b large enough the following holds: 1. F is non-decreasing in x and y, and 2. F(x_i, y_i) = ℓ_i. Point 1 follows directly from the definition of the functions f_i and F. To prove Point 2, note that f_i(x_i, y_i) = ℓ_i and therefore F(x_i, y_i) ≥ ℓ_i. Suppose that F(x_i, y_i) > ℓ_i. Then for some k: f_k(x_i, y_i) > ℓ_i, implying h_kj(x_i, y_i) > ℓ_i for j = 1, 2, 3. So,

    ℓ_k > ℓ_i,
    a(x_i − x_k) + ℓ_k > ℓ_i,
    b(y_i − y_k) + ℓ_k > ℓ_i.

Since the points (x_i, y_i) and (x_k, y_k) are incomparable, either (x_i − x_k) < 0 or (y_i − y_k) < 0. This leads to a contradiction if a and b are large enough. □

Remark: To construct a monotone non-increasing piecewise linear fit to D we follow the same procedure as in the proof of Proposition 1, but for the mirror image of the data set D where every point (x, y) is mapped to (−x, −y). The function obtained in this way is transformed back by taking g(x, y) = f(−x, −y). Then g fits D and is monotone non-increasing.

References

1. Velikova, M.V.: Monotone models for prediction in data mining. PhD thesis, Tilburg University, Tilburg, The Netherlands (2006)
2. Sill, J.: Monotonic networks. In: Advances in Neural Information Processing Systems (NIPS). Volume 10. MIT Press (1998) 661–667
3. Minin, A., Lang, B.: Comparison of neural networks incorporating partial monotonicity by structure. Lecture Notes in Computer Science 5164 (2008) 597–606
4. Velikova, M., Daniels, H.: Partially monotone networks applied to breast cancer detection on mammograms. 5163 (2008) 917–926
5. Feelders, A., Velikova, M., Daniels, H.: Two polynomial algorithms for relabeling non-monotone data. Technical report UU-CS-2006-046, Utrecht University (2006)
6. Rademaker, M., De Baets, B., De Meyer, H.: Data sets for supervised ranking: to clean or not to clean.
In: Proceedings of the fifteenth Annual Machine Learning Conference of Belgium and The Netherlands: BENELEARN 2006, Ghent, Belgium (2006) 139–146
7. Ghosal, S., Sen, A., Van der Vaart, A.W.: Testing monotonicity of regression. Annals of Statistics 28(4) (2000) 1054–1082
8. Spouge, J., Wan, H., Wilbur, W.: Least squares isotonic regression in two dimensions. Journal of Optimization Theory and Applications 117(3) (2003) 585–605
9. Verkooijen, J.: Neural networks in economic modelling – an empirical study. PhD thesis, Tilburg University, Tilburg, The Netherlands (1996)
10. Daniels, H.A.M., Kamp, B.: Application of MLP networks to bond rating and house pricing. Neural Computing & Applications 8(3) (1999) 226–234
11. Magdon-Ismail, M., Sill, J.: A linear fit gets the correct monotonicity directions. Machine Learning 70(1) (2008) 21–43

Generalized PAV Algorithm with Block Refinement for Partially Ordered Monotonic Regression
Oleg Burdakov (1), Anders Grimvall (2) and Oleg Sysoev (2)
(1) Department of Mathematics, Linköping University, SE-58183 Linköping, Sweden
(2) Department of Computer and Information Sciences, Linköping University, SE-58183 Linköping, Sweden
{Oleg.Burdakov, Anders.Grimvall, Oleg.Sysoev}@liu.se
* This work was supported by the Swedish Research Council.

Abstract. In this paper, the monotonic regression problem (MR) is considered. We have recently generalized for MR the well-known Pool-Adjacent-Violators algorithm (PAV) from the case of completely to partially ordered data sets. The new algorithm, called GPAV, combines both high accuracy and low computational complexity, which grows quadratically with the problem size. The actual growth observed in practice is typically far lower than quadratic. The fitted values of the exact MR solution compose blocks of equal values. The GPAV approximation to this solution also has a block structure. We present here a technique for refining blocks produced by the GPAV algorithm to make the new blocks much closer to those in the exact solution. This substantially improves the accuracy of the GPAV solution and does not deteriorate its computational complexity. The computational time for the new technique is approximately triple the time of running the GPAV algorithm. Its efficiency is demonstrated by the results of our numerical experiments.

Key words: Monotonic regression, Partially ordered data set, Pool-adjacent-violators algorithm, Quadratic programming, Large scale optimization, Least distance problem.

1 Introduction

The monotonic regression problem (MR), which is also known as the isotonic regression problem, deals with an ordered data set of observations. We focus on partially ordered data sets, because in this case, in contrast to completely ordered sets, there are no efficient algorithms for solving large scale MR problems. The MR problem has important statistical applications in physics, chemistry, medicine, biology, environmental science etc. (see [2, 23]). It is also present in operations research (production planning, inventory control, multi-center location etc.) [13, 15, 24] and signal processing [22, 25]. All these problems are often a kind of monotonic data fitting problem, which is addressed in Section 4, where we use it for generating test problems. The most challenging of the applied MR problems are characterized by a very large value of the number of observations, denoted here by n.
For such large-scale problems, it is of great practical value to develop algorithms whose complexity does not rise too rapidly with n.

To formulate the MR problem, we introduce the following notations. The vector of observed values is denoted by Y ∈ R^n. The partial order is expressed here with the use of a directed acyclic graph G(N, E), where N = {1, 2, ..., n} is a set of nodes and E is a set of edges. Each node is associated with one observation, and each edge is associated with one monotonicity relation as described below. In the MR problem, we must find, among those vectors u ∈ R^n preserving the monotonicity of the partially ordered data set, the one which is the closest to Y in the least-squares sense. It can be formulated as follows. Given Y, G(N, E) and a strictly positive vector of weights w ∈ R^n, find the vector of fitted values u* ∈ R^n that solves the problem:

    min  Σ_{i=1..n} w_i (u_i − Y_i)²
    s.t. u_i ≤ u_j  for all (i, j) ∈ E.    (1)

It can be viewed as a problem of minimizing the weighted distance from the vector Y to the set of feasible points, which is a convex cone. This strictly convex quadratic programming problem has a unique optimal solution. The conventional quadratic programming algorithms (see [19]) can be used for solving the general MR problem only in the case of moderate values of n, up to a few hundred.

There are some algorithms especially developed for solving this problem. They can be divided into two separate groups, namely exact and approximate MR algorithms. The most efficient and the most widely used of the exact algorithms is the Pool-Adjacent-Violators (PAV) algorithm [1, 14, 16]. Although it has a very low computational complexity, namely O(n) [11], the area of its application is severely restricted by the complete order. In this case, the graph is just a path and the monotonicity constraints in (1) take the simple form u_1 ≤ u_2 ≤ ... ≤ u_n. In [20], the PAV algorithm was extended to a more general, but still restricted, case of a rooted-tree type of graph defining the monotonicity. The computational complexity of this exact algorithm is O(n log n). The minimum lower set algorithm [5, 6] is known to be the first exact algorithm designed for solving partially ordered MR problems. If the order is complete, its complexity is O(n²). In the partial order case, its complexity is unknown, but it is expected to grow with n much more rapidly than quadratically. The best known computational complexity of the exact algorithms which are able to solve partially ordered MR problems is O(n⁴). It refers to an algorithm introduced in [15, 24]. This algorithm is based on solving the dual problem to (1) by solving O(n) minimal flow problems. The resulting growth of computational requirements in proportion to n⁴ becomes excessive for large n.

The isotonic block class with recursion (IBCR) algorithm was developed in [4] for solving partially ordered MR problems. It is an exact algorithm. Its computational complexity is unknown, but according to [21], it is bounded below by O(n³). In practice, despite this estimate, it is the fastest among the exact algorithms. This is the reason why we use it in our numerical experiments to compare the performance of algorithms. Perhaps the most widely used inexact algorithms for solving large-scale partially ordered MR problems are based on simple averaging techniques [17, 18, 26].
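To make the optimization problem (1) concrete, the following small sketch (not from the paper) solves a tiny MR instance with a general-purpose solver; the function name and the use of SciPy's SLSQP routine are editorial choices, and such generic solvers are precisely what becomes impractical beyond moderate n, as noted above.

    import numpy as np
    from scipy.optimize import minimize

    def monotonic_regression(Y, w, edges):
        """Problem (1): min sum_i w_i (u_i - Y_i)^2  s.t.  u_i <= u_j for every (i, j) in E."""
        Y, w = np.asarray(Y, float), np.asarray(w, float)
        objective = lambda u: np.sum(w * (u - Y) ** 2)
        constraints = [{'type': 'ineq', 'fun': (lambda u, i=i, j=j: u[j] - u[i])}
                       for i, j in edges]                 # enforce u_j - u_i >= 0
        return minimize(objective, Y.copy(), constraints=constraints, method='SLSQP').x

    # Tiny example: a path 0 -> 1 -> 2 (complete order) with unit weights.
    print(monotonic_regression([3.0, 1.0, 2.0], [1, 1, 1], [(0, 1), (1, 2)]))  # approximately [2, 2, 2]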
These averaging techniques can be easily implemented and have a relatively low computational burden, but the quality of their approximation to u* is very case-dependent and, furthermore, the approximation error can be too large (see [7]).

In [7, 8], we generalized the well-known Pool-Adjacent-Violators algorithm (PAV) from the case of completely to partially ordered variables. The new algorithm, called GPAV, combines both low computational complexity, O(n²), and high accuracy. The computational time grows in practice less rapidly with n than in this worst-case estimate. The GPAV solution is feasible, and it is optimal if its active constraints are regarded as equalities. The corresponding active set induces a partitioning of N into connected subsets of nodes, called blocks, which are obtained after excluding from E the edges representing the non-active constraints. We present here a block refinement technique which substantially improves the accuracy of the GPAV solution, while the overall computational complexity remains O(n²). Its run time is between twice and triple the time of running GPAV.

Since the GPAV and IBCR algorithms coincide with the PAV algorithm when the order is complete, they can both be viewed as its generalizations. Although they have much in common, the main difference is that the first of them is an approximate algorithm, while the second one is exact. Moreover, according to the results of the numerical experiments reported here, GPAV is much faster than IBCR, and the difference in computational time grows rapidly with n.

The paper is organized as follows. The block refinement technique is introduced in Section 2. This technique is illustrated with a simple example in Section 3. The results of our numerical experiments are presented and discussed in Section 4. In Section 5, we draw conclusions about the performance of the block refinement technique and discuss future work.

2 Block Refinement Technique

Our block refinement technique is based on the GPAV algorithm. To present this algorithm, we will use the definitions and notations from [8]. Let i⁻ = {j ∈ N : (j, i) ∈ E} denote the set of all immediate predecessors of node i ∈ N. The connected subset of nodes B ⊂ N is called a block if, for any i, j ∈ B, all the nodes on all the undirected paths between i and j belong to B. The block B_i is said to be an immediate predecessor of B_j, or adjacent to B_j, if there exist k ∈ B_i and l ∈ B_j
The GPAV algorithm creates initially the singletone blocks Bi = {i} and sets Bi− = i− for all the nodes i ∈ N . Subsequently it operates with the blocks only. It treats them in the order consistent with the topological sort, namely, B1 , B2 , . . . , Bn . When at iteration k the block Bk is treated, its common value (2) is compared with those of its adjacent blocks. While there exists an adjacent violator of the monotonicity, the block Bk absorbs the one responsible for the most severe violation. The common value Uk and the lists of adjacent blocks are updated accordingly. The outlined GPAV algorithm can be formally presented as follows. Algorithm 1 (GPAV) Given: vectors w, Y ∈ Rn and a directed acyclic graph G(N, E) with topologically sorted nodes. Set H = N . For all i ∈ N , set Bi = {i}, Bi− = i− , Ui = Yi and Wi = wi . For k = 1, 2, . . . , n, do: While there exists i ∈ Bk− such that Ui ≥ Uk , do: Find j ∈ Bk− such that Uj = max{Ui : i ∈ Bk− }. Set H = H \ {j}. Set Bk− = Bj− ∪ Bk− \ {j}. Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 27 Set Uk = (Wk Uk + Wj Uj ) / (Wk + Wj ). Set Bk = Bk ∪ Bj and Wk = Wk + Wj . For all i ∈ H such that j ∈ Bi− , set Bi− = Bi− ∪ {k} \ {j}. For all k ∈ H and for all i ∈ Bk , set ui = Uk . We shall refer to this algorithm as the forward mode of GPAV. We denote its output as uF and {BkF }k∈H F . Its backward mode is related to solving in uB ∈ Rn the MR problem: Pn B 2 B min (3) i=1 wi (ui − Yi ) B s.t. uB i ≤ uj ∀(i, j) ∈ E B where YiB = −Yi and E B = {(i, j) : (j, i) ∈ E}. Note that the optimal solution to this problem equals −u∗ . The backward mode of GPAV returns uB and {BkB }k∈H B . In this mode, one can use the inverse of the topological order used by the forward mode. In our block refinement technique, it is assumed that two or more approximate solutions to problem (1) are available. They could result from applying any approximate algorithms, for instance, the forward and backward modes of Algorithm GPAV in combination with one or more topological orders. We will refer to them as old solutions. We denote the vector of the component-wise average of the old solutions by v, i.e. if there are available two old solutions, u′ and u′′ , then v = (u′ + u′′ )/2. The vector v is feasible in problem (1), and it can be viewed as a new approximation to its solution. It induces a block partitioning of the nodes N . The new blocks are denoted by Bknew where k belongs to the new set of head nodes H new . Let h(j) denote the head element of the new block which contains node j, i.e. if j ∈ Bknew then h(j) = k. It can be seen that the new blocks are, roughly speaking, nonempty intersections of the old blocks. Thus, the new ones results from a certain splitting of the old ones. The use of the vector v will allow us to simplify the construction of {Bknew }k∈H new . This idea is presented by the following algorithm. Algorithm 2 (SPLIT) Given: vector v ∈ Rn and a directed acyclic graph G(N, E) with topologically sorted nodes. Set H new = ∅ and E new = ∅. For i = 1, 2, . . . , n, do: If there exists j ∈ i− such that vj = vi , then for k = h(j), set Bknew = Bknew ∪ {i}, else set H new = H new ∪ {i} and Binew = {i}. For all j ∈ i− such that h(j) 6= h(i) & (h(j), h(i)) ∈ / E new , do: new new set E =E ∪ (h(j), h(i)). This algorithm returns not only the new blocks {Bknew }k∈H new , but also the set of directed edges E new ⊂ H new × H new . The new blocks are represented in the new directed acyclic graph G(H new , E new ) by their head elements. 
It can easily be seen that a topological order in the new graph can be obtained by sorting the nodes H^new in increasing order of the values v_k. Using the notations

    w_k^new = Σ_{i ∈ B_k^new} w_i,   Y_k^new = ( Σ_{i ∈ B_k^new} Y_i w_i ) / w_k^new,    (4)

where k ∈ H^new, we can formulate for the new graph the new MR problem:

    min  Σ_{k ∈ H^new} w_k^new (u_k^new − Y_k^new)²
    s.t. u_i^new ≤ u_j^new  for all (i, j) ∈ E^new.    (5)

Since the number of unknowns u_k^new is typically less than n and |E^new| < |E|, this problem is smaller in size than the original problem (1). One can apply here, for instance, Algorithm GPAV. The resulting refined blocks yield an approximate solution to (1) which in practice provides a very high accuracy (see Section 4). Furthermore, if the blocks {B_k^new}, k ∈ H^new, are the same as those induced by u*, then the optimal solution to problem (5) produces u*.

In our numerical experiments, we use the following implementation of the block refinement technique.

Algorithm 3 (GPAVR)
Given: vectors w, Y ∈ R^n and a directed acyclic graph G(N, E) with topologically sorted nodes.
1. Use the forward and backward modes of Algorithm GPAV to produce two approximations, u^F and −u^B, to u*.
2. Run Algorithm SPLIT with v = (u^F − u^B)/2.
3. For all k ∈ H^new, compute w_k^new and Y_k^new by formula (4).
4. Use v to sort the nodes H^new topologically.
5. Use Algorithm GPAV to find u^new, which solves approximately the new MR problem (5).
6. For all k ∈ H^new and for all i ∈ B_k^new, set u_i = u_k^new.

The block refinement technique is illustrated in the next section with a simple example. It is not difficult to show that the computational complexity of Algorithm GPAVR is estimated, like for Algorithm GPAV, as O(n²). Indeed, Step 1 involves two runs of Algorithm GPAV, which does not break the estimate. The computational burden of Step 2 is proportional to the number of edges in E, which does not exceed n². The number of arithmetic operations in Steps 3 and 6 grows in proportion to n. The computational complexity of the sorting in Step 4 is estimated as O(n log(n)). In Step 5, Algorithm GPAV is applied to the graph G(H^new, E^new), whose number of nodes does not exceed n. This finally proves the desired estimate.

3 Illustrative Example

Consider the MR problem defined in Fig. 1. The weights are w_i = 1, i = 1, 2, 3, 4. The vector u* = (0, −6, 6, 0) is the optimal solution to this problem. It induces the optimal block partitioning {1, 4}, {2}, {3} (indicated by the dashed lines).

[Fig. 1. The graph G(N, E) and observed responses Y.]

One can see that the nodes are already topologically sorted. The forward mode of Algorithm GPAV produces the block partitioning B_2^F = {2}, B_4^F = {1, 3, 4} and the corresponding approximate solution u^F = (2, −6, 2, 2). Fig. 2 defines the MR problem (3) for the backward mode.

[Fig. 2. The graph G(N, E^B) and observed responses Y^B.]

In this case, the topological order is 4, 3, 2, 1. The backward mode of Algorithm GPAV produces the block partitioning B_1^B = {1, 2, 4}, B_3^B = {3} and the corresponding approximate solution u^B = (2, 2, −6, 2). The nonempty intersections (dashed lines) of the forward mode blocks (dotted lines) and the backward mode blocks (solid lines) are shown in Fig. 3.

[Fig. 3. The old blocks (solid and dotted lines) and their splitting, which yields the new blocks (dashed lines).]
The same splitting of the old blocks is provided by the input vector v = (u^F − u^B)/2 = (0, −2, 2, 0) of Algorithm SPLIT. This algorithm produces H^new = {1, 2, 3}, B_1^new = {1, 4}, B_2^new = {2}, B_3^new = {3}. The new MR problem (5) is defined by Fig. 4.

[Fig. 4. The new graph G(H^new, E^new) and observed responses Y^new.]

The topological sort for the new graph is, obviously, 2, 1, 3. After applying Algorithm GPAV to solving the new MR problem, we obtain u_1^new = 0, u_2^new = −6, u_3^new = 6. Step 6 of Algorithm GPAVR then yields the vector u = (0, −6, 6, 0), which is optimal in the original MR problem. In general, it is not guaranteed that the GPAVR solution is optimal.

4 Numerical Results

We use here test problems of the same type as in our earlier paper [8]. They originate from the monotonic data fitting problem, which is one of the most common types of applied MR problems. In monotonic data fitting, it is assumed that there exists an unknown response function y(x) of p explanatory variables x ∈ R^p. It is supposed to be monotonic in the sense that

    y(x′) ≤ y(x″) for all x′ ⪯ x″, x′, x″ ∈ R^p,

where ⪯ is a component-wise ≤-type relation. Instead of the function y(x), we have available a data set of n observed explanatory variables X_i ∈ R^p, i = 1, ..., n, and the corresponding observed responses Y_i ∈ R, i = 1, ..., n. The function and the data set are related as follows:

    Y_i = y(X_i) + ε_i,  i = 1, ..., n,    (6)

where ε_i is an observation error. In general, if the relation X_i ⪯ X_j holds, this does not imply that Y_i ≤ Y_j, because of this error. The relation ⪯ induces a partial order on the set {X_i}, i = 1, ..., n. The order can be represented by a directed acyclic graph G(N, E) in which node i ∈ N corresponds to the i-th observation, and the presence of edge (i, j) in E means that X_i ⪯ X_j. This graph is unique if all redundant relations are eliminated. We call edge (i, j), and also the corresponding monotonicity relations X_i ⪯ X_j and u_i ≤ u_j, redundant if there is a directed path from i to j. Redundant edges, if removed, leave the feasible set in (1) unchanged.

In monotonic data fitting, one must construct a monotonic response surface model u(x) whose values u(X_i) are as close as possible to the observed responses Y_i for all i = 1, ..., n. Denoting

    u_i = u(X_i),  i = 1, ..., n,    (7)

and using the sum of squares as the distance function, one can reformulate this problem as the MR problem (1). In the numerical experiments, we set the weights w_i = 1 for all i = 1, ..., n.

In the experiments, we restrict our attention to the case of p = 2 for the following reasons. Suppose that, for two vectors X_i and X_j in R^p, neither X_i ⪯ X_j nor X_j ⪯ X_i holds, i.e. they are incomparable. Then, if one and the same component is deleted (or disregarded) in these vectors, the reduced vectors may become comparable in R^(p−1). On the other hand, if two vectors in R^p are comparable, no deletion of a component is able to break this relation. This means that, for any fixed number of multivariate observations n, the number of monotonic relations X_i ⪯ X_j attains its maximum value when p = 2.

For our test problems, we use two types of functions y(x) of two explanatory variables, namely linear and nonlinear. Our nonlinear function is given by the formula

    y_nonlin(x) = f(x_1) + f(x_2),    (8)

where

    f(t) = ∛t for t ≤ 0,  f(t) = t³ for t > 0.    (9)

This function is shown in Fig. 5.

[Fig. 5. Nonlinear function y(x) defined by (8)–(9).]
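As a concrete illustration of this test setup (an editorial sketch, not the authors' MATLAB code), the nonlinear test function (8)–(9) and the observation model (6) might be written as follows; the sampling interval [−2, 2] and the standard normal errors are those specified in the next paragraph, and the linear functions in (10) below can be added analogously.

    import numpy as np

    def f_piece(t):
        """f(t) from (9): the real cube root for t <= 0, the cube for t > 0."""
        t = np.asarray(t, dtype=float)
        return np.where(t <= 0, -np.abs(t) ** (1.0 / 3.0), t ** 3)

    def y_nonlin(x):
        """Nonlinear response (8) for an (n, 2) array of explanatory variables."""
        return f_piece(x[:, 0]) + f_piece(x[:, 1])

    def make_dataset(y_func, n, seed=0):
        """Observations (6): X uniform on [-2, 2]^2, Y = y(X) + standard normal error."""
        rng = np.random.default_rng(seed)
        X = rng.uniform(-2.0, 2.0, size=(n, 2))
        Y = y_func(X) + rng.standard_normal(n)
        return X, Y

The componentwise partial order on the rows of X then defines the graph G(N, E) on which GPAV, GPAVR and IBCR are run.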
Our choice of the linear test problems is inspired by the observation that the optimal values u*_i that correspond to a local area of values of x depend mostly on the local behavior of the response function y(x) and on the values of the observation errors in this area. Due to the block structure of u*, these local values of u*_i typically do not change if the function values y(x) are perturbed in distant areas. Therefore, we assume that the local behavior can be well imitated by linear local models. For the linear models, we consider the following two functions:

    y_lin1(x) = 0.1x_1 + 0.1x_2,   y_lin2(x) = x_1 + x_2.    (10)

They model slower and faster monotonic increase, respectively. The nonlinear function combines the considered types of behavior. In addition, depending on the side from which x_1 or x_2 approaches zero, the function value changes either sharply or remarkably slowly.

For the numerical experiments, samples of n = 10², n = 10³ and n = 10⁴ observations {X_i}, i = 1, ..., n, were generated with the use of the independent uniform distribution of the explanatory variables in the interval [−2, 2]. The error terms ε_i in (6) were independent and normally distributed with mean zero and variance one. It should be emphasized that, in our numerical experiments, the variance of the error ε_i is comparable with the function values y(X_i). Such observations, with a high level of noise, were deliberately chosen for checking the performance of the algorithms in this difficult case.

We call preprocessing the stage at which the MR problem (1) is generated. The observed explanatory variables {X_i} are involved in formulating this problem implicitly, namely via the partial order intrinsic in this data set of vectors. Given {X_i}, i = 1, ..., n, we generate a directed acyclic graph G(N, E) with all the redundant edges removed. The adjacency-matrix representation [10] is used for the graph. The preprocessing is accomplished by a topological sorting of the nodes N.

We used MATLAB for implementing the algorithms GPAV, GPAVR and IBCR. The implementations are based on one of the topological sorts studied in [8], namely (NumPred). By this sorting, the nodes in N are sorted in ascending order of the number of their predecessors. The (NumPred) sorting is applied in Algorithms GPAV and IBCR to the graph G(N, E), as well as in Step 1 of Algorithm GPAVR to the graphs G(N, E) and G(N, E^B) in the forward and backward modes, respectively. The numerical results presented here were obtained on a PC running under Windows XP with a Pentium 4 processor (2.8 GHz, 1 GB RAM).

We evaluate the performance of the algorithms GPAV and GPAVR by comparing the relative error

    e(u^A) = ( ϕ(u^A) − ϕ(u*) ) / ϕ(u*),

where ϕ(u^A) is the objective function value in (1) obtained by Algorithm A, and ϕ(u*) is the optimal value provided by IBCR. In [8], it was shown that

    ‖u^A − u*‖ / ‖Y − u*‖ ≤ √( e(u^A) ),

which means that e(u^A) provides an upper estimate for the relative distance between the approximate solution u^A and the optimal solution u*.

Tables 1 and 2 summarize the performance data obtained for the algorithms. For n = 10² and n = 10³, we present the average values for 10 MR problems generated as described above. For n = 10⁴, we report the results of solving only one MR problem, because it takes about 40 minutes of CPU time to solve one such large-scale problem with the use of IBCR.
We evaluate the performance of the algorithms GPAV and GPAVR by comparing the relative error

e(u^A) = (ϕ(u^A) − ϕ(u*)) / ϕ(u*),

where ϕ(u^A) is the objective function value in (1) obtained by Algorithm A, and ϕ(u*) is the optimal value provided by IBCR. In [8], it was shown that

‖u^A − u*‖ / ‖Y − u*‖ ≤ √e(u^A),

which means that e(u^A) provides an upper estimate for the relative distance between the approximate solution u^A and the optimal solution u*.

Tables 1 and 2 summarize the performance data obtained for the algorithms. For n = 10² and n = 10³, we present the average values over 10 MR problems generated as described above. For n = 10⁴, we report the results of solving only one MR problem, because it takes about 40 minutes of CPU time to solve one such large-scale problem with IBCR. The sign '—' corresponds in the tables to the case when IBCR failed to solve the problem within 6 hours. The numbers of constraints (#constr.) reported in Table 1 are intended to indicate how difficult the quadratic programming problem (1) is for conventional optimization methods when n is large.

Table 1. Relative error e(u^A) · 100% for A = GPAV, GPAVR

algorithm  model   n = 10² (#constr. = 322)   n = 10³ (#constr. = 5497)   n = 10⁴ (#constr. = 78170)
GPAV       lin1    0.98                       0.77                        —
GPAV       lin2    2.79                       2.43                        2.02
GPAV       nonlin  3.27                       5.66                        11.56
GPAVR      lin1    0.01                       0.07                        —
GPAVR      lin2    0.08                       0.12                        0.24
GPAVR      nonlin  0.00                       0.17                        0.46

Table 2. Computational time in seconds

algorithm  model   n = 10²   n = 10³   n = 10⁴
GPAV       lin1    0.02      0.76      89.37
GPAV       lin2    0.01      0.71      93.76
GPAV       nonlin  0.01      0.67      87.51
GPAVR      lin1    0.05      1.67      234.31
GPAVR      lin2    0.05      1.60      197.06
GPAVR      nonlin  0.04      1.58      192.08
IBCR       lin1    0.21      129.74    —
IBCR       lin2    0.09      5.07      2203.10
IBCR       nonlin  0.08      6.68      3448.94

The tables show that GPAVR substantially improves the accuracy of the GPAV solution, while its run time is between two and three times that of GPAV. They also demonstrate the limited abilities of IBCR.

5 Conclusions and Future Work

To the best of our knowledge, GPAV is the only practical algorithm able to produce sufficiently accurate solutions to very large-scale MR problems with partially ordered observations. To date, no other practical algorithm has been found that solves such large-scale MR problems with an accuracy as high as that provided by the block refinement technique introduced here in combination with the GPAV algorithm. This can be viewed as the main contribution of the paper.

In this paper, we focused on solving the MR problem (1), which is related to the first stage of constructing a monotonic response model u(x) of an unknown monotonic response function y(x). The second stage accomplishes the construction of a model u(x) which is a monotonic function and interpolates, in accordance with (7), the fitted response values u_i, i = 1, ..., n, obtained in the first stage. The quality of the obtained monotonic response model u(x) depends not only on the accuracy of solving problem (1), but also on the interpolation methods used in the second stage.

Among the existing approaches to the monotonicity-preserving interpolation problem, one can recognize the following three. One approach [27, 28] involves minimizing some measure of smoothness over a convex cone of smooth functions which are monotone. The drawback of this approach is that the solutions must be found by solving constrained minimization problems, and the solutions are generally somewhat complicated, nonlocal and non-piecewise-polynomial functions. The second approach is to use a space of piecewise polynomials defined over a partition of the interpolation domain, usually into triangles. Up until now, this approach has been studied only in the case where the data is given on a grid (see e.g. [3, 9]). The third approach [12] is based on creating gridded data from the scattered data. This approach is not suitable for a large number of data points n, because the number of grid nodes grows with n as n^p and easily becomes unacceptably large, even in the bivariate case (p = 2). Bearing in mind the importance of the second stage and the shortcomings of the existing approaches, we plan to develop efficient monotonicity-preserving methods for interpolation of scattered multivariate data.

References

1.
Ayer, M., Brunk, H.D., Ewing, G.M., Reid, W.T., Silverman, E.: An empirical distribution function for sampling with incomplete information. T he Annals of Mathematical Statistics 26, 641–647 (1955) Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 36 2. Barlow, R.E., Bartholomew, D.J., Bremner, J.M., Brunk, H.D.: Statistical inference under order restrictions. Wiley, New York (1972) 3. Beatson, R.K., and Ziegler, Z.: Monotonicity presertving surface interpolation. SIAM J. Numer. Anal. 22, 401-411 (1985) 4. Block, H., Qian, S., Sampson, A.: Structure Algorithms for Partially Ordered Isotonic regression. Journal of Computational and Graphical Statistics 3, 285–300 (1994) 5. Brunk, H.D.: Maximum likelihood estimates of monotone parameters. The Annals of Mathematical Statistics 26, 607–616 (1955) 6. Brunk H.D., Ewing G.M., Utz W.R.: Minimizing integrals in certain classes of monotone functions, Pacific J. Math. 7, 833–847 (1957) 7. Burdakov, O., Sysoev, O., Grimvall A., Hussian, M.: An O(n2 ) algorithm for isotonic regression problems. In: Di Pillo, G., Roma, M. (eds.) Large Scale Nonlinear Optimization. Ser. Nonconvex Optimization and Its Applications, vol. 83, pp. 25– 33, Springer-Verlag (2006) 8. Burdakov, O., Grimvall, A., Sysoev, O.: Data preordering in generalized PAV algorithm for monotonic regression. Journal of Computational Mathematics 4, 771–790 (2006) 9. Carlson, R.E., and Fritsch, F.N.: Monotone piecewise bicubic interpolation. SIAM J. Numer. Anal. 22, 386–400 (1985) 10. Cormen T.H., Leiserson C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms (Second Edition), MIT Press, Cambridge (2001) 11. Grotzinger, S.J., Witzgall, C.: Projection onto order simplexes. Applications of Mathematics and Optimization 12, 247–270 (1984) 12. Han, L., and Schumaker L.L.: Fitting monotone surfaces to scattered data using C1 piecewise cubics. SIAM J. Numer. Anal. 34, 569–585 (1997) 13. Kaufman, Y. , Tamir, A.: Locating service centers with precedence constraints. Discrete Applied Mathematics 47, 251-261 (1993) 14. Kruskal, J.B.: Nonmetric multidimensional scaling: a numerical method, Psychometrika 29, 115–129 (1964) 15. Maxwell, W.L., Muchstadt, J.A.: Establishing consistent and realistic reorder intervals in production-distribution systems. Operations Research 33, 1316–1341 (1985) 16. Miles, R.E.: The complete amalgamation into blocks, by weighted means, of a finite set of real numbers. Biometrikabf 46:3/4, 317–327 (1959) 17. Mukarjee, H.: Monotone nonparametric regression, The Annals of Statistics 16, 741–750 (1988) 18. H. Mukarjee, H. Stern, Feasible nonparametric estimation of multiargument monotone functions. Journal of the American Statistical Association 425, 77–80 (1994) 19. Nocedal, J., Wright, S.J.: Numerical Optimization (2nd ed.). Springer-Verlag, New York (2006) 20. Pardalos, P.M., Xue, G.: Algorithms for a class of isotonic regression problems. Algorithmica 23, 211–222 (1999) 21. Qian, S., Eddy, W.F.: An Algorithm for Isotonic Regression on Ordered Rectangular Grids. J. of Computational and Graphical Statistics 5, 225–235 (1996) 22. Restrepo, A., Bovik, A.C.: Locally monotonic regression. IEEE Transactions on Signal Processing 41, 2796–2810 (1993) 23. Robertson T., Wright, F.T., Dykstra, R.L.: Order Restricted Statistical Inference. Wiley, New York (1988) 24. Roundy, R.: A 98% effective lot-sizing rule for a multiproduct multistage production/inventory system. 
Mathematics of Operations Research 11, 699–727 (1986) Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 37 25. Sidiropoulos, N.D., Bro, R.: Mathematical programming algorithms for regressionbased nonlinear filtering in Rn . IEEE Transactions on Signal Processing 47, 771– 782 (1999) 26. Strand, M.: Comparison of methods for monotone nonparametric multiple regression. Communications in Statistics - Simulation and Computation 32, 165–178 (2003) 27. Utreras, F.I.: Constrained surface construction. In: Chui, C.K., Schumaker, L.L., and Utreras F. (eds.) Topics in Multivariate Approximation, pp. 233–254. Academic Press, New York (1987) 28. Utreras, F.I., and Varas M.: Monotone interpolation of scattered data in R2. Constr. Approx. 7 , 49–68 (1991) Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 38 Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 39 Discovering monotone relations with Padé Jure Žabkar1 , Martin Možina1 , Ivan Bratko1 , and Janez Demšar1 University of Ljubljana, Faculty of Computer and Information Science, Tržaška 25, SI-1000 Ljubljana, Slovenia, [email protected], www.ailab.si/jure Abstract. We propose a new approach to discovering monotone relations in numerical data. We describe Padé, a tool for estimating partial derivatives of target function from numerical data. Padé is basically a preprocessor that takes numerical data as input and assigns computed qualitative partial derivatives to the learning examples. Using the preprocessed data, an appropriate machine learning method can be used to induce a generalized model. The induced models describe monotone relations between the class variable and the attributes. Experiments performed on artificial domains showed that Padé is quite accurate and robust. 1 Introduction In many real-world applications, e.g. in financial or insurance sector, some relations between the observed variables are known to be monotone. In such cases we can limit the induction to monotone models which are consistent with the domain knowledge and give better predictions. Monotonicity properties can be imposed by a human expert (based on experience) or domain theory itself (e.g. in economics). However, such knowledge is not always available. In this paper we present Padé, a machine learning method for discovering monotone relations in numerical data. Padé can substitute for an expert or domain theory when they are not known or not given. A general scheme in which Padé plays a major role is presented in Fig. 1. Padé works as a data preprocessor. Taking numerical data as input, it calculates qualitative partial derivatives of the class variable w.r.t. each attribute and assigns them to original learning examples. The attributes normally correspond to independent variables in our problem space, and the class corresponds to a dependent variable. Computed qualitative partial derivatives define the class value in the new data set, which can be used as input to an appropriate machine learning method for induction of a qualitative model of discovered monotone relations. For a simple example, consider f as a function of x: f = x2 . The learning data would consist of a sample of pairs of values (x, f ) where x is the attribute and f is the class. Let us take (x, f ) = (2, 4). Padé observes examples in a tube in the direction of the attribute x and uses them to compute the approximation of partial derivative in (x, f ) = (2, 4). 
It would find out that in this direction, larger values of x imply larger values of f , so the derivative is positive. Padé constructs a new attribute, e.g. Qx, with Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 40 Fig. 1. A general scheme in which Padé works as a preprocessor for computing partial derivatives from numerical data. values from {+, −} and assigns a + to example (2,4). After doing the same for all points in the training data, we can apply a rule learning algorithms using the newly constructed attribute as a class value. The correct qualitative model induced from this data would be: if x > 0 then f = Q(+x) if x < 0 then f = Q(−x) The constraint f = Q(+x) is read as f is qualitatively proportional to x. Roughly, this means that f increases with x, or precisely, ∂f > 0. ∂x The notation f = Q(−x) means that f is inversely qualitatively proportional to x (i.e. the partial derivative of f w.r.t. x is negative). We will also be using an abbreviated notation when referring to several qualitative proportionalities. For example, two constraints f = Q(+x) and f = Q(−y) will be abbreviated to f = Q(+x, −y). Qualitative proportionalities correspond to the monotone relations. The constraint f = Q(+x) means that f is monotonically increasing with x and f = Q(−x) means that f is monotonically decreasing with x. In section 2 we shortly describe two related algorithms from the field of qualitative reasoning that also discover qualitative patterns in data. We present the details of the algorithm Padé in section 3. In section 4 we present the experiments and conclude in section 5. Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 41 2 Related work QUIN [1, 2] is a machine learning algorithm which works on similar data as Padé and uses it to compute a qualitative tree. Qualitative trees are similar to classification trees, except that their leaves state the qualitative relations holding in a particular region. Although Padé can also be used to construct such trees, there are numerous differences between the two algorithms. QUIN is based on an specific impurity measure and can only construct trees. Padé is not a learning algorithm but a preprocessor, which can be used with (in principle) any learning algorithm. Padé computes partial derivatives by the specified attribute(s), while QUIN identifies the regions in which there is a monotone relation between the class and some attributes, which are not specified in advance. In this respect, QUIN is more of a subgroup discovery algorithm than a learning method. Related to that, Padé can compute the derivative at any given data point, while QUIN identifies the relation which is generally true in some region. Finally, the partial derivatives of Padé are computed as coefficients of univariate linear regression constructed from examples in the tube lying in the direction of the derivative. This makes it more mathematically sound and much faster than QUIN, which cannot efficiently handle more than half a dozen attributes. Algorithm QING [3] also looks for monotone subspaces in numerical data. It is based on discrete Morse theory [4, 5] from the field of computational topology. The main difference between QING and other algorithms for induction of qualitative models is in attribute space partitioning. Unlike algorithms that split on attribute values (e.g. trees, rules), QING triangulates the space (domain) and constructs the qualitative field which for every learning example tells the directions of increasing/decreasing class. 
Finally, it abstracts the qualitative field to a qualitative graph in which only local extrema are represented. Its main disadvantage is that it lacks the robustness needed for real applications.

One could also use locally weighted multivariate linear regression (LWR) [6] to approximate the gradient at a given data point. Our experiments show that LWR performs much worse than Padé. We present the experiments in section 4.2.

Padé was already presented at the QR'07 workshop [7]. There we described a few prototype methods and showed a few case-study experiments. The method has matured since then, so we here present the reformulated and practically useful version of Padé, and include a somewhat larger set of experiments.

3 Algorithm

Padé is based on the mathematical definition of the partial derivative of a function f(x_1, ..., x_n) in the direction x_i at the point (a_1, ..., a_n):

∂f/∂x_i (a_1, ..., a_n) = lim_{h→0} [f(a_1, ..., a_i + h, ..., a_n) − f(a_1, ..., a_n)] / h.

The problem in our setup is that we cannot compute the value of the function at an arbitrary point, since we are only given a function tabulated on a finite data set. Instead, we have to estimate the derivative based on the function values in local neighborhoods. Padé looks for points that lie in the neighborhood of the reference point in the direction of the derived variable, while keeping the values of all other attributes as constant as possible.

The input for Padé is a set of examples described by a list of attribute values and the value of a continuous dependent variable. An example of a function with two attributes is depicted in Fig. 2(a): each point represents a learning example and is assigned a continuous value. Our task is to compute a partial derivative at each given data point P = (a_1, ..., a_n). To simplify the notation, we will denote (a_1, ..., a_i + h, ..., a_n) as P + h. In our illustrations, we shall show the computation at point P = (5, 5), which is marked with a hollow symbol.

Padé considers a certain number of examples nearest to the axis in the direction in which we compute the derivative. These examples define a (hyper)tube (Fig. 2(b)). We now assume that the function is approximately linear within short parts of the tube and estimate the derivative from the corresponding coefficient computed by univariate regression over the points in the tube. We call this method Tube regression (Fig. 2(b)).

Fig. 2. Computing a partial derivative from numerical data by Tube regression: (a) sampled function x² − y²; (b) Tube regression.

Since the tube can also contain points that lie quite far from P, we weight the points by their distances from P along the tube (that is, ignoring all dimensions but x_i) using a Gaussian kernel. The weight of the j-th point in the tube equals

w_j = e^{−(a_i − t_ji)² / σ²},

where t_ji is the j-th point's i-th coordinate, and σ is chosen so that the farthest point in the tube has a user-set negligible weight. For the experiments in this paper we used tubes with 20 points, with the farthest point having a weight of w_20 = 0.001. We then use standard one-dimensional weighted least squares regression to compute the coefficient of the linear term. We set the free term to 0 in order for the regression line to pass through point P.
The formula for the coefficient thus simplifies to

b_i = Σ_j w_j (t_ji − a_i)(y_j − y_i) / Σ_j w_j (t_ji − a_i)²,

where y_j is the function value at t_j and y_i is the function value at the reference point P. The reason for omitting the free term is simple: if the function goes through the point, so should its derivative. The regression line should determine the sign of the derivative, not make the best fit to the data. This requirement may not seem reasonable in noisy domains, but we handle noise differently, by using a machine learning algorithm on the derived data.

The Tube regression is computed from a larger sample of points. We use the t-test to obtain estimates of the significance of the derivative. Significance together with the sign of b_i can be used to define qualitative derivatives in the following way: if the significance is above the user-specified threshold (e.g. t = 0.7), the qualitative derivative equals the sign of b_i; if the significance is below the threshold, we define the qualitative derivative to be steady, disregarding the sign of b_i.

4 Experiments

We implemented the described algorithms inside the general machine learning framework Orange [8]. We evaluated Padé on artificial data sets. We observed the correctness of Padé's derivatives, how it compares to locally weighted regression (LWR), how well it scales in the number of attributes and how it copes with noise. We measured the accuracy by comparing the predicted qualitative behavior with the analytically known one.

4.1 Accuracy

In order to estimate the accuracy of Padé we chose a few interesting mathematical functions and calculated numerical derivatives in all sampled points to compare them with Padé's results. All functions were sampled in 1000 points, chosen uniformly at random in the specified interval in the x–y plane. First we chose f(x, y) = x² − y² and f(x, y) = xy in [−10, 10] × [−10, 10] as examples of functions that are continuous and differentiable on the whole interval. The functions are presented in Fig. 3.

Fig. 3. The functions from which we obtained artificial data by sampling.

We calculated numerical derivatives and compared their signs with the qualitative derivatives calculated by Padé. The proportion of correctly calculated qualitative derivatives is shown in Table 4.

f(x, y)        ∂f/∂x   ∂f/∂y
x² − y²        96.5%   97.1%
xy             99.8%   99.4%
sin x sin y    50.7%   54.7%

Fig. 4. The accuracy of Padé on artificial data sets.

Padé fails in the sin x sin y domain due to the tube being too long and thus covering multiple periods of the function. However, the maximal number of examples in the tube is a parameter of the algorithm and can be adjusted according to the domain properties. On the other hand, our experiments show that for most domains the value of this parameter does not matter much, as shown in section 4.4.

4.2 Comparison with LWR

We compared the accuracies of Padé and locally weighted regression (LWR) on an artificial data set. We sampled f(x, y) = x² − y² in 50 points from [−10, 10] × [−10, 10]. To this domain, we added 10 random attributes a_1, ..., a_10 with values in the same range as x and y, i.e. [−10, 10]. The attributes a_1, ..., a_10 had no influence on the output variable f. We took the sign of the coefficients at x and y from LWR to estimate the partial derivatives at each point. Again, we compared the results of Padé and LWR to the analytically obtained partial derivatives of f(x, y).
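Before turning to those results, the Tube regression estimate of Section 3 can be summarised in a short NumPy sketch (the t-test for significance and the resulting "steady" label are omitted). The function name and the way the bandwidth is derived from the weight of the farthest point are our rendering, not code from the Orange implementation.

import numpy as np

def tube_derivative_sign(X, y, p, i, k=20, w_far=0.001):
    # Qualitative partial derivative of y w.r.t. attribute i at the reference
    # point X[p], estimated by weighted regression over the k points nearest
    # to the axis through X[p] in direction i (illustrative sketch only).
    others = [c for c in range(X.shape[1]) if c != i]
    dist_to_axis = np.linalg.norm(X[:, others] - X[p, others], axis=1)
    tube = np.argsort(dist_to_axis)[:k]
    t = X[tube, i] - X[p, i]                    # offsets along the tube
    # sigma chosen so that the farthest tube point gets the negligible weight w_far
    far = np.abs(t).max()
    sigma2 = far ** 2 / np.log(1.0 / w_far) if far > 0 else 1.0
    w = np.exp(-t ** 2 / sigma2)
    # weighted least squares with the free term fixed to 0 (line through the reference point)
    b = np.sum(w * t * (y[tube] - y[p])) / np.sum(w * t ** 2)
    return np.sign(b)                           # +1 or -1 (0 in degenerate cases)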
Table 5 shows the proportion of correctly calculated qualitative derivatives.

         Padé               LWR
         ∂f/∂x   ∂f/∂y      ∂f/∂x   ∂f/∂y
         90%     96%        70%     70%

Fig. 5. The proportion of correctly calculated qualitative derivatives of Padé and LWR.

This experiment confirms the importance of the chosen neighborhood. LWR takes a spherical neighborhood around each point, while Padé does the regression on the points in the tube.

4.3 Dimensionality

We checked the scalability of Padé to high (∼100) dimensional spaces. Again, we took the function x² − y² as above, but increased the dimensionality by adding 98 attributes with random values from [−10, 10]. We analyzed the results by inducing classification trees with the computed qualitative derivatives as classes. The trees for the derivatives w.r.t. x and y agree well with the correct results (Fig. 6).

Fig. 6. Qualitative models of the qualitative partial derivatives of the x² − y² data set with 98 additional attributes with random values unrelated to the output.

4.4 Noise

In this experiment we vary the amount of noise that we add to the class value f. The target function is again f(x, y) = x² − y² defined on [−10, 10] × [−10, 10], which puts f in [−100, 100]. We control the noise through its standard deviation (SD), from SD=0 (no noise) to SD=50, where the latter means that the noise added to the function value is half the magnitude of the signal itself. For each noise level, we vary the number of examples in the tube to observe the effect this parameter has on the final result. Finally, we evaluate Padé and Padé combined with the classification tree algorithm C4.5 [9] against the ground truth, measuring the classification accuracy (Table 7).

Noise    Tube   ∂f/∂x, Padé   ∂f/∂x, Padé+C4.5   ∂f/∂y, Padé   ∂f/∂y, Padé+C4.5
SD=0      5     80%           85%                82%           97%
SD=0     15     98%           99%                98%           99%
SD=0     30     99%           99%                98%           98%
SD=10     5     77%           86%                79%           90%
SD=10    15     96%           99%                95%           99%
SD=10    30     98%           99%                97%           98%
SD=50     5     63%           51%                66%           90%
SD=50    15     77%           99%                82%           99%
SD=50    30     84%           92%                85%           98%

Fig. 7. The analysis of Padé w.r.t. noise, tube sizes and additional use of C4.5.

Regarding noise, we observe that Padé itself is quite robust. Yet, it greatly benefits from being used together with C4.5. Regarding the tube: Tube regression is highly noise resistant, which will also make it smear fine details in noiseless data. Two considerations regulate this smoothing. The width of the tube should balance between having enough examples for a reliable estimation of the coefficient on one side, and not including examples where the values of the other attributes could significantly affect the function value on the other. However, if the tube is symmetrically covered by the examples (this is probably true except on the boundaries of the covered attribute space) and if the function which we model is negatively symmetrical with respect to the other attributes' values in the part of the space covered by the tube, the impacts of the other attributes can be expected to cancel out.

5 Conclusion

We proposed a new algorithm, Padé, which discovers significant monotone relationships in data. It does so by computing local approximations to partial derivatives at each learning example. For each learning example Padé takes the sign of the partial derivative and, instead of inducing the model directly, translates the problem to the field of learning classification models, which is one of the most developed and active fields of artificial intelligence.
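As an illustration of this translation step, the signs produced by the sketch given earlier can be attached to the examples as a new class value and handed to any classifier. In the hypothetical fragment below, a depth-limited scikit-learn decision tree stands in for the C4.5 learner used in the paper, and X, y denote the attribute matrix and the continuous class; none of these names come from the published implementation.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# qualitative derivative w.r.t. attribute 0 for every learning example,
# using the tube_derivative_sign sketch shown in Section 4.2 above
Q = np.array([tube_derivative_sign(X, y, p, i=0) for p in range(len(X))])

# any classifier can now model the monotone relation; a shallow tree is used here
qualitative_model = DecisionTreeClassifier(max_depth=3).fit(X, Q)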
Another distinctive feature of Padé is that it is based on pure mathematical concepts – calculus of real functions and statistical regression. As such, the work can provide a good foundation for further development of the field. References 1. Šuc, D., Bratko, I.: Induction of qualitative trees. In De Raedt, L., Flach, P., eds.: Proceedings of the 12th European Conference on Machine Learning, Springer (2001) 442–453 Freiburg, Germany. 2. Bratko, I., Šuc, D.: Learning qualitative models. AI Magazine 24(4) (2003) 107–119 3. Žabkar, J., Jerše, G., Mramor, N., Bratko, I.: Induction of qualitative models using discrete morse theory. In: Proceedings of the 21st Workshop on Qualitative Reasoning, Aberystwyth (2007) 4. Forman, R.: A user’s guide to discrete Morse theory (2001) 5. King, H.C., Knudson, K., Mramor Kosta, N.: Generating discrete morse functions from point data. Exp. math. 14(4) (2005) 435–444 6. Atkeson, C., Moore, A., Schaal, S.: Locally weighted learning. Artificial Intelligence Review 11 (1997) 11–73 7. Žabkar, J., Bratko, I., Demšar, J.: Learning qualitative models through partial derivatives by pad. In: Proceedings of the 21th International Workshop on Qualitative Reasoning, Aberystwyth, U.K. (2007) 8. Zupan, B., Leban, G., Demšar, J.: Orange: Widgets and visual programming, a white paper (2004) 9. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers (1993) Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 47 Nonparametric Ordinal Classification with Monotonicity Constraints Nicola Barile and Ad Feelders Utrecht University, Department of Information and Computing Sciences, P.O. Box 80089, 3508TB Utrecht, The Netherlands, {barile,ad}@cs.uu.nl Abstract. In many applications of ordinal classification we know that the class label must be increasing (or decreasing) in the attributes. Such relations are called monotone. We discuss two nonparametric approaches to monotone classification: osdl and moca. Our conjecture is that both methods have a tendency to overfit on the training sample, because their basic class probability estimates are often computed on a few observations only. Therefore, we propose to smooth these basic probability estimates by using weighted k nearest neighbour. Through substantial experiments we show how this adjustment improves the classification performance of osdl considerably. The effect on moca on the other hand is less conclusive. 1 Introduction In many applications of data analysis it is reasonable to assume that the response variable is increasing (or decreasing) in one or more of the attributes or features. Such relations between response and attribute are called monotone. Besides being plausible, monotonicity may also be a desirable property of a decision model for reasons of explanation, justification and fairness. Consider two applicants for the same job, where the one who scores worse on all criteria gets the job. While human experts tend to feel uncomfortable expressing their knowledge and experience in terms of numeric assessments, they typically are able to state their knowledge in a semi-numerical or qualitative form with relative conviction and clarity, and with less cognitive effort [9]. Experts, for example, can often easily indicate which of two probabilities is smallest. In addition to requiring less cognitive effort, such relative judgments tend to be more reliable than direct numerical assessments [18]. 
Hence, monotonicity constraints occur frequently in learning problems and such constraints can be elicited from subject area experts with Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 48 relative ease and reliability. This has motivated the development of algorithms that are able to enforce such constraints in a justified manner. Several data mining techniques have been adapted in order to be able to handle monotonicity constraints in one form or another. Examples are: classification trees [19, 10, 6], neural networks [20, 21], Bayesian networks [1, 11] and rules [8]. In this paper, we confine our attention to two nonparametric approaches to monotone classification: osdl [7, 15] and moca [4]. These methods rely on the estimation of the class probabilities for each observed attribute vector. These basic estimates as we will call them are then further processed in order to extend the classifier to the entire attribute space (by interpolation), and to guarantee the monotonicity of the resulting classification rule. Because the basic estimates are often based on very few observations, we conjecture that osdl and moca are prone to overfitting. Therefore we propose to smooth the basic estimates by including observations that are near to where an estimate is required. We perform a substantial number of experiments to verify whether this indeed improves the classification performance. This paper is organized as follows. In the next section, we establish some concepts and notation that will be used throughout the paper. In section 3 we give a short description of osdl and moca and establish similarities and differences between them. We also provide a small example to illustrate both methods. In section 4 we propose how to adapt the basic estimates that go into osdl and moca, by using weighted k nearest neighbour. Subsequently, these adapted estimates are tested experimentally in section 5. We compare the original algorithms to their adapted counterparts, and test whether significant differences in predictive performance can be found. Finally, we draw conclusions in section 6. 2 Preliminaries Let X denote the vector of attributes, which takes values x in a pdimensional input space X = ×Xi , and let Y denote the class variable which takes values y in a one-dimensional space Y = {1, 2, . . . , q}, where q is the number of class labels. We assume that the values in Xi , i = 1, . . . , p, and the values in Y are totally ordered. An attribute Xi has a positive influence on Y if for all xi , x0i ∈ Xi : xi ≤ x0i ⇒ P (Y |xi , x−i ) P (Y |x0i , x−i ) (1) where x−i is any value assignment to the attributes other than Xi [22]. Here P (Y |xi , x−i ) P (Y |x0i , x−i ) means that the distribution of Y for Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 49 attribute values (xi , x−i ) is stochastically smaller than for attribute values (x0i , x−i ), that is F (y|xi , x−i ) ≥ F (y|x0i , x−i ), y = 1, 2, . . . , q where F (y) = P (Y ≤ y). In words: for the larger value of Xi , larger values of Y are more likely. A negative influence is defined analogously, where for larger values of Xi smaller values of Y are more likely. Without loss of generality, we henceforth assume that all influences are positive. A negative influence from Xi to Y can be made positive simply by reordering the values in Xi . 
Considering the constraints (1) corresponding to all positive influences together, we get the constraint: ∀x, x0 ∈ X : x x0 ⇒ P (Y |x) P (Y |x0 ), (2) where the order on X is the product order x x0 ⇔ ∀i = 1, . . . , p : xi ≤ x0i . It is customary to evaluate a classifier on the basis of its error-rate or 0/1 loss. For classification problems with ordered class labels this choice is less obvious. It makes sense to incur a higher cost for those misclassifications that are “far” from the true label, than to those that are “close”. One loss function that has this property is L1 loss: L1 (i, j) = |i − j| i, j = 1, . . . , q (3) where i is the true label, and j the predicted label. We note that this is not the only possible choice. One could also choose L2 loss for example, or another loss function that has the desired property that misclassifications that are far from the true label incur a higher loss. Nevertheless, L1 loss is a reasonable candidate, and in this paper we confine our attention to this loss function. It is a well known result from probability theory that predicting the median minimizes L1 loss. A median m of Y has the property that P (Y ≤ m) ≥ 0.5 and P (Y ≥ m) ≥ 0.5. The median may not be unique. Let m` denote the smallest median of Y and let mu denote the largest median. We have [15] P (Y |x) P (Y |x0 ) ⇒ m` (x) ≤ m` (x0 ) ∧ mu (x) ≤ mu (x0 ) The above result shows that predicting the smallest (or largest) median gives an allocation rule c : X → Y that satisfies ∀x, x0 ∈ X : x x0 ⇒ c(x) ≤ c(x0 ), Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 (4) 50 that is, a lower ordered input cannot have a higher class label. Kotlowski [14] shows that if a collection of probability distributions satisfies the stochastic order constraint (2), then the Bayes allocation rule cB (·) satisfies the monotonicity constraint (4), provided the loss function is convex. This encompasses many reasonable loss functions but not 0/1 loss, unless the class label is binary. Let D = {(xi , yi )}N i=1 denote the set of observed data points in X × Y, and let Z denote the set of distinct x values occurring in D. We define the downset of x with respect to Z to be the set {x0 ∈ Z : x0 x}. The upset of x is defined analogously. Any real-valued function f on Z is isotonic with respect to if, for any x, x0 ∈ Z, x x0 implies f (x) ≤ f (x0 ). Likewise, a real-valued function a on Z is antitonic with respect to if, for any x, x0 ∈ Z, x x0 implies a(x) ≥ a(x0 ). 3 OSDL and MOCA In this section we give a short description of osdl and moca, and discuss their similarities and differences. 3.1 OSDL The ordinal stochastic dominance learner (osdl) was developed by CaoVan [7] and generalized by Lievens et al. in [15]. Recall that Z is the set of distinct x values present in the training sample D. Let P̂ (y|x) = n(x, y) , n(x) x ∈ Z, y = 1, . . . , q where n(x) denotes the number of observations in D with attribute values x, and n(x, y) denotes the number of observations in D with attribute values x and class label y. Furthermore, let X F̂ (y|x) = P̂ (j|x), x∈Z j≤y denote the unconstrained maximum likelihood estimate of F (y|x) = P (Y ≤ y|x), x ∈ Z. To obtain a collection of distribution functions that satisfy the stochastic order restriction, Cao-Van [7] defines: F min (y|x0 ) = min F̂ (y|x) xx0 Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 (5) 51 and F max (y|x0 ) = max F̂ (y|x), x0 x (6) where x ∈ Z. If there is no point x in Z such that x x0 , then F min (y|x0 ) = 1 (y = 1, . . . 
, q), and if there is no point x in Z such that x0 x, then F max (y|x0 ) = 0 (y = 1, . . . , q − 1), and F min (q|x0 ) = 1. Note that ∀x, x0 ∈ X : x x0 ⇒ F min (y|x) ≥ F min (y|x0 ) 0 xx ⇒ F max (y|x) ≥ F max (y|x0 ). (7) (8) Proposition (7) holds, since the downset of x is a subset of the downset of x0 , and the minimum taken over a given set is never above the minimum taken over one of its subsets. Proposition (8) follows similarly. In the constant interpolation version of OSDL, the final estimates are obtained by putting F̃ (y|x0 ) = αF min (y|x0 ) + (1 − α)F max (y|x0 ), (9) with α ∈ [0, 1]. This rule is used both for observed data points, as well as for new data points. The interpolation parameter α is a free parameter whose value can be selected so as to minimize empirical loss on a test sample. Note that F̃ satisfies the stochastic order constraint, because both (7) and (8) hold. More sophisticated interpolation schemes called balanced and double balanced osdl are discussed in [15]; we refer the reader to this paper for details. These osdl versions are also included in the experimental evaluation that is presented in section 5. 3.2 MOCA In this section, we give a short description of moca. For each value of y, the moca estimator F ∗ (y|x), x ∈ Z minimizes the sum of squared errors X o2 n n(x) F̂ (y|x) − a(x) (10) x∈Z within the class of antitonic functions a(x) on Z. This is an isotonic regression problem. It has a unique solution, and the best time complexity known is O(|Z|)4 [17]. The algorithm has to be performed q − 1 times, since obviously F ∗ (q|x) = 1. Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 52 Note that this estimator satisfies the stochastic order constraint ∀x, x0 ∈ Z: x x0 ⇒ F ∗ (y|x) ≥ F ∗ (y|x0 ) y = 1, . . . , q (11) by construction. Now the isotonic regression is only defined on the observed data points. Typically the training sample does not cover the entire input space, so we need some way to estimate F (y|x0 ) for points x0 not in the training sample. Of course these estimates should satisfy the stochastic order constraint with respect to F ∗ (Y |x). Hence, we can derive the following bounds: F min (y|x0 ) = max F ∗ (y|x) y = 1, . . . , q (12) F max (y|x0 ) = min F ∗ (y|x) y = 1, . . . , q (13) x0 x and xx0 If there is no point x in Z such that x x0 , then we put F max (y|x0 ) = 1 (y = 1, . . . , q), and if there is no point x in Z such that x0 x, then we put F min (y|x0 ) = 0 (y = 1, . . . , q − 1), and F min (q|x0 ) = 1. Because F ∗ is antitonic we always have F min (y) ≤ F max (y). Any choice from the interval [F min (y), F max (y)] satisfies the stochastic order constraint with respect to the training data. A simple interpolation scheme that is guaranteed to produce globally monotone estimates is to take the convex combination F̆ (y|x0 ) = αF min (y|x0 ) + (1 − α)F max (y|x0 ), (14) with α ∈ [0, 1]. Note that for x0 ∈ Z, we have F̆ (y|x0 ) = F ∗ (y|x0 ), since both F min (y|x0 ) and F max (y|x0 ) are equal to F ∗ (y|x0 ). The value of α can be chosen so as to minimize empirical loss on a test sample. Since moca should produce a class prediction, we still have to specify an allocation rule. moca allocates x to the smallest median of F̆ (Y |x): c∗ (x) = min : F̆ (y|x) ≥ 0.5. y First of all, note that since F̆ (y) satisfies the stochastic order constraint (2), c∗ will satisfy the monotonicity constraint given in (4). 
Furthermore, it can be shown (see [4]) that c∗ minimizes L1 loss N X |yi − c(xi )| i=1 Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 53 within the class of monotone integer-valued functions c(·). In other words, of all monotone classifiers, c∗ is among the ones (there may be more than one) that minimize L1 loss on the training sample. 3.3 An Example To illustrate osdl and moca, we present a small example. Suppose we have two real-valued attributes X1 and X2 and a ternary class label Y , that is, Y = {1, 2, 3}. Consider the dataset given in Figure 1. X2 5 1 7 3 3 2 4 2 2 1 1 1 6 2 X1 Fig. 1. Data for example. Observations are numbered for identification. Class label is printed in boldface next to the observation. Table 1 gives the different estimates of F . Since all attribute vectors occur only once, the estimates F̂ are based on only a single observation. The attribute vector of observation 5 is bigger than that of 3 and 4, but observation 5 has a smaller class label. This leads to order reversals in F̂ (1). We have F̂ (1|3) and F̂ (1|4) smaller than F̂ (1|5) (where in a slight abuse of notation we are conditioning on observation numbers), but observation 5 is in the upset of 3 and 4. In this case, the antitonic regression resolves this order reversal by averaging these violators: F ∗ (1|3) = F ∗ (1|4) = F ∗ (1|5) = 0+0+1 1 = 3 3 Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 54 This is the only monotonicity violation present in F̂ so no further averaging is required. We explain the computation of the constant interpolation version of osdl estimate through an example. In Table 1 we see that F̃ (1|3) = 1/2. This is computed as follows: F min (1|3) = min{F̂ (1|1), F̂ (1|2), F̂ (1|3)} = min{1, 1, 0} = 0, since the downset of observation 3 is {1, 2, 3}. Likewise, we can compute F max (1|3) = max{F̂ (1|3), F̂ (1|5), F̂ (1|7)} = max{0, 1, 0} = 1, since the upset of observation 3 is {3, 5, 7}. Combining these together with α = 0.5 we get: F̃ (1|3) = αF min (1|3) + (1 − α)F max (1|3) = 1 2 Table 1. Maximum Likelihood, moca and osdl (α = 0.5) estimates of F (1) and F (2). F̂ (mle) F̆ (moca) F̃ (osdl) ∗ obs 1 2 y 1 2 c 1 2 1 2 3 4 5 6 7 1 1 0 0 1 0 0 1 1 1 1 1 1 0 1 1 2 2 1 2 3 1 1 1/3 1/3 1/3 0 0 1 1 1 1 1 1 0 1 1 2 2 2 2 3 1 1 1/2 1/2 1/2 0 0 1 1 1 1 1 1 1/2 cmin cmax 1 1 1 1 1 2 2 1 1 2 2 2 2 3 The moca allocation rule c∗ allocates to the smallest median of F̆ , which gives a total absolute error of 1, since observation 5 has label 1, but is predicted to have label 2 by c∗ . All other predictions of c∗ are correct. An absolute error of 1 is the minimum achievable on the training sample for any monotone classifier. For osdl we have given two allocation rules: one that assigns to the smallest median (cmin ) and one that assigns to the largest median (cmax ). The former has an absolute error of 3 on the training sample, and the latter achieves the minimum absolute error of 1: it is identical to c∗ in this case. Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 55 3.4 Comparison of osdl and moca The reader will have noticed the similarity between moca and osdl: moca uses the same interpolation method, and the moca definitions of F min and F max are the reverse of the corresponding definitions for osdl. An important difference is that osdl plugs in the maximum likelihood estimates F̂ in equations (5) and (6), whereas moca plugs in the isotonic regression estimates F ∗ in equations (12) and (13). 
It should be noted that osdl in principle allows any estimate of F (y|x) to be plugged into equations (5) and (6). From that viewpoint, moca can be viewed as an instantiation of osdl. However, to the best of our knowledge only the unconstrained maximum likelihood estimate has been used in osdl to date. Because F ∗ is plugged in, moca is guaranteed to minimize L1 loss on the training sample. While this is a nice property, the objective is not to minimize L1 loss on the training sample. It remains to be seen whether this also results in better out-of-sample predictive performance. It should be noted that if F̂ already satisfies the stochastic order restriction, then both methods are identical. In that case the isotonic regression will not make any changes to F̂ , since there are no order reversals. Our conjecture is that both methods have a tendency to overfit on the training data. In many applications attribute vectors occur in D only once, in particular when the attributes are numeric. Hence the basic estimates that go into equations (5) and (6) are usually based on a single observation only. Now the interpolation performed in equation (14) will have some smoothing effect, but it is the question whether this is sufficient to prevent overfitting. The same reasoning applies to moca, but to a lesser extent because the isotonic regression has an additional smoothing effect: in case of order reversals basic estimates are averaged to remove the monotonicity violation. Nevertheless, it is possible that moca could be improved by performing the isotonic regression on a smoothed estimate rather than on F̂ in (10). This is what we propose in the next section. 4 Weighted kNN probability estimation In order to prevent overfitting in estimating P (Y ≤ y|x), x ∈ Z, we develop a weight-based estimator based on the nearest neighbours principle along the lines of the one introduced in [13]. In the following, we first discuss kNN as a classification technique and then illustrate how we use it to perform probability estimation. Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 56 4.1 k Nearest Neighbour Classification The k Nearest Neighbor technique is an example of instance-based learning: the training dataset is stored, and the classification of new, unlabelled instances is performed by comparing each of them to the k most similar (least dissimilar) elements to it in the training dataset. The dissimilarity is determined by means of a distance metric or function, which is a real-valued function d such that for any data points x, y, and z: 1. d(x, y) > 0, d(x, x) = 0; 2. d(x, y) = d(y, x); 3. d(x, z) ≤ d(x, y) + d(y, z); The distance measure which we adopted is the Euclidean distance: v u p uX d(x, y) = t (xi − yi )2 . i=1 In order to avoid attributes that have large values from having a stronger influence than attributes measured on a smaller scale, it is important to normalize the attribute values. We adopted the Z-score standardization technique, whereby each value x of an attribute X is replaced by x − x̄ , sX where sX denotes the sample standard deviation of X. Once a distance measure to determine the neighbourhood of an unlabelled observation x0 has been selected, the next step to use kNN as a classification technique is represented by determining a criterion whereby the selected labelled individuals will be used to determine the label of x0 . 
The most straightforward solution is represented by (unweighted) majority voting: the chosen label is the one occurring most frequently in the neighbourhood of x0.

4.2 Weighted kNN Classification

In kNN it is reasonable to request that neighbours that are closer to x0 have a greater importance in deciding its class than those that are more distant. In the Weighted Nearest Neighbours approach, x0 is assigned to the class y0 which has a weighted majority among its k nearest neighbours, namely

y0 = arg max_y Σ_{i=1}^k ω_i I(y_i = y).

Each of the k members x_i of the neighbourhood of x0 is assigned a weight ω_i which is inversely proportional to its distance d = d(x0, x_i) from x0 and which is obtained by means of a weighting function or kernel K_λ(x0, x_i) [12]:

ω_i = K_λ(x0, x_i) = G(d(x0, x_i)/λ). (15)

Kernels are at the basis of the Parzen density estimation method [12]; in that context, the smoothing parameter or bandwidth λ dictates the width of the window considered to perform the estimation. A large λ implies lower variance (averages over more observations) but higher bias (we essentially assume the true function is constant within the window). Here G(·) can be any function with a maximum at d = d(x, y) = 0 and values that get smaller with growing d. Thus the following properties must hold [13]:

1. G(d) ≥ 0 for all d ∈ R;
2. G(d) attains its maximum for d = 0;
3. G(d) descends monotonically for d → ∞.

In the one-dimensional case, one popular kernel is obtained by using the Gaussian density function φ(t) as G(·), with the standard deviation playing the role of the parameter λ. In R^p, with p > 1, the natural generalization is

K_λ(x0, x_i) = (1/(√(2π) λ)) exp{ −(1/2) (‖x0 − x_i‖/λ)² },

which is the kernel we adopt in our method. Although the kernel used is in a sense a parameter of wkNN, experience has shown that the choice of kernel (apart from the rectangular kernel, which gives equal weights to all neighbours) is not crucial [12].

In equation (15) it is assumed that λ is a fixed value over the whole of the space of data samples. The optimal value of λ may be location-dependent, giving a large value in regions where the data samples are sparse and a small value where the data samples are densely packed. One solution is represented by the use of adaptive windows, where λ depends on the location of the sample in the data space. Let h_λ(x0) be a width function (indexed by λ) which determines the width of the neighbourhood at x0. Then we have

K_λ(x0, x_i) = G( d(x0, x_i) / h_λ(x0) ).

As kernels are used to compute weights in wkNN, we set h_λ(x0) equal to the distance d(x0, x_{k+1}) of x0 from the first neighbour x_{k+1} that is not taken into consideration [12, 13].
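As a small illustration of Sections 4.1 and 4.2, the sketch below standardises the attributes, computes the adaptive Gaussian weights and returns the weighted-majority label. It assumes more than k training points, makes no attempt to handle ties or a zero bandwidth, and the names are ours.

import numpy as np

def zscore(X):
    # Z-score standardization of each attribute
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def wknn_label(X, y, x0, k):
    # weighted majority vote among the k nearest neighbours of x0 (X already standardized)
    d = np.linalg.norm(X - x0, axis=1)
    order = np.argsort(d)
    h = d[order[k]]                            # adaptive bandwidth: first excluded neighbour
    neigh = order[:k]
    w = np.exp(-0.5 * (d[neigh] / h) ** 2)     # Gaussian kernel; constant factor does not affect the vote
    labels = np.unique(y[neigh])
    return labels[np.argmax([np.sum(w[y[neigh] == c]) for c in labels])]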
In the case of moca, the adoption of this new probability estimator has an impact on the computation of the moca estimator not only in terms of the probability estimates that the antitonic regression is performed on but also on the weights used, which are now equal to the cardinality of Nk (x) for each x ∈ Z. Note that if k = 1, then equation (16) produces the maximum likelihood estimates used in standard osdl and moca. The estimator presented in this section is analogous to the one adopted in [13], where the estimates obtained are used to perform ordinal classification (without monotonicity constraints) by predicting the median. 5 Experiments We performed a series of experiments on a several real-world datasets in order to determine whether and to what extent moca and osdl would benefit from the new wkNN probability estimator. The results were measured in terms of the average L1 error rate achieved by the two algorithms. Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 59 5.1 Datasets We selected a number of datasets where monotonicity constraints are likely to apply. We used the KC1, KC4, PC3, and PC4 datasets from the NASA Metrics Data Program [16], the Acceptance/Rejection, Employee Selection, Lecturers Evaluation and Social Workers Decisions datasets from A. Ben-David [5], the Windsor Housing dataset [2], as well as several datasets from the UCI Machine Learning Repository [3]. Table 2 lists all the datasets used. Table 2. Charasterics of datasets used in the experiments Dataset cardinality #attributes #labels Australian Credit 690 14 2 Auto MPG 392 7 4 Boston Housing 506 13 4 Car Evaluation 1728 6 4 Empoyee Rej/Acc 1000 4 9 Employee selection 488 4 9 Haberman survival 306 3 2 KC1 2107 21 3 KC4 122 4 6 Lecturers evaluation 1000 4 5 CPU Performance 209 6 4 PC3 320 15 5 PC4 356 16 6 Pima Indians 768 8 2 Social Workers Decisions 1000 10 4 Windsor Housing 546 11 4 5.2 Dataset Preprocessing For datasets with a numeric response that is not a count (Auto MPG, Boston Housing, CPU Performance, and Windsor Housing) we discretized the response values into four separate intervals, each interval containing roughly the same number of observations. For all datasets from the NASA Metrics Data Program the attribute ERROR COUNT was used as the response. All attributes that contained missing values were removed. Furthermore, the attribute MODULE was removed because it is a unique identifier of the module and the ERROR DENSITY was removed because it is a function of the response variable. Furthermore, attributes with zero variance were removed from the dataset. Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 60 5.3 Experimental results Each of the datasets was randomly divided into two parts, a training set (containing roughly two thirds of the data) and a validation set. The training set was used to determine the optimal values for k and α in moca and osdl through 10-fold cross validation. We started with k = 1 and incremented its value by one until the difference of the average L1 error between two consecutive iterations for both classifiers was less than or equal to 1−6 . For each value of k we determined the the optimal α in {0, 0.25, 0.5, 0.75, 1}. Once the optimal parameter values were determined, they were used to train both algorithms on the complete training set and then to test them on the validation set. We then performed a paired ttest of the L1 errors on the validation set to determine whether observed differences were significant. Table 3 lists all the results. 
Table 3. Experimental results. The first four columns contain average L1 errors on the validation set. The final two columns contain p-values. The penultimate column compairs smoothed moca to standard moca. The final column compairs smoothed osdl to standard osdl. Dataset MOCA wkNN OSDL wkNN MOCA MLE OSDL MLE 1. vs. 3. 2. vs. 4. Australian Credit 0.1304 0.1130 0.1348 0.3565 0.3184 0 Auto MPG 0.2977 0.2977 0.2977 0.2977 − − Boston Housing 0.5030 0.4675 0.4675 0.5207 0.4929 0.2085 Car Evaluation 0.0625 0.0556 0.0625 0.0556 − − Empoyee Rej/Acc 1.2006 1.2066 1.2814 1.2814 0.0247 0.0445 Employee selection 0.3620 0.3742 0.3620 0.4110 1 0.2018 Haberman survival 0.3529 0.3431 0.3529 0.3431 − − KC1 0.1863 0.1977 0.1863 0.3940 1 0 KC4 0.8095 0.8095 0.8571 0.8571 0.4208 0.5336 Lecturers evaluation 0.4162 0.4162 0.4102 0.4102 0.8060 0.8060 CPU Performance 0.3571 0.3286 0.3571 0.3571 − 0.5310 PC3 0.1228 0.1228 0.1228 0.1228 − − PC4 0.1872 0.1872 0.1872 0.1872 − − Pima Indians 0.3008 0.3086 0.3008 0.3086 − − Social Workers 0.5359 0.5479 0.5060 0.4940 0.1492 0.0092 Windsor Housing 0.5604 0.5220 0.5604 0.6044 − 0.0249 We first check whether smoothing actually improves the classifiers. Comparing standard osdl against smoothed osdl we observe that the latter is signifcantly better (at α = 0.05) four times, whereas it is significantly worse one time (for Social Workers Decisions). Furthermore, the smoothed version almost never has higher estimated error (Lecturer Evaluation and Social Workers Decisions being the two exceptions). Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 61 Comparing standard moca against smoothed moca, we observe that the latter is significantly better only once (on Employee Rejection). All other observed differences are not significant. Table 4. Experimental results for balanced and double balanced osdl. The first four columns contain average L1 errors on the validation set. The final two columns contain p-values. The penultimate column compairs smoothed balanced osdl to standard balanced osdl. The final column compairs smoothed double balanced osdl to standard double balanced osdl. Dataset bosdl wkNN bosdl MLE dbosdl wkNN dbosdl MLE 1. vs. 2. 3. vs. 4. Australian Credit 0.3565 0.3565 0.3565 0.3565 − − Auto MPG 0.2977 0.2977 0.2977 0.2977 − − Boston Housing 0.5207 0.5207 0.5207 0.5207 − − Car Evaluation 0.0556 0.0556 0.0556 0.0556 − − Empoyee Rej/Acc 1.8533 1.9760 1.8533 1.9760 0.1065 0.1065 Employee selection 0.5092 0.5092 0.5092 0.5092 − − Haberman survival 0.3431 0.3431 0.3431 0.3431 − − KC1 0.1892 0.3030 0.3883 0.3940 0 0.2061 KC4 0.7619 0.8571 0.7857 0.8571 0.1031 0.1829 Lecturers evaluation 0.9251 0.9251 0.9251 0.9251 − − CPU Performance 0.4143 0.4143 0.4143 0.4143 − − PC3 0.1228 0.1228 0.1228 0.1228 − − PC4 0.1872 0.1872 0.1872 0.1872 − − Pima Indians 0.2930 0.2930 0.2930 0.2930 − − Social Workers 0.6617 0.6557 0.6617 0.6437 0.8212 0.4863 Windsor Housing 0.6044 0.6044 0.6044 0.6044 − − In table 4 the effect of smoothing on balanced and double balanced osdl is investigated. We conclude that smoothing doesn’t have much effect in either case: only one significant improvement is found (for KC1). Furthermore, comparing table 3 and table 4, we observe that constant interpolation osdl with smoothing tends to outperform its balanced and double balanced counterparts. 6 Conclusion We have discussed two related methods for nonparametric monotone classification: osdl and moca. 
The basic class probability estimates used by these algorithms are typically based on very few observations. Therefore we conjectured that they both have a tendency to overfit on the training sample. We have proposed a weighted k nearest neighbour approach to smoothing the basic estimates. The experiments have shown that smoothing is beneficial for osdl: the predictive performance was significantly better on a number of datasets, Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 62 and almost never worse. For moca, smoothing seems to have much less effect. This is probably due to the fact that the isotonic regression already smooths the basic estimates by averaging them in case of order reversals. Hence, moca is already quite competitive for k = 1. The more sophisticated interpolation schemes of osdl (balanced and double balanced) do not seem to lead to an improvement over the constant interpolation version on the datasets considered. References 1. E.A. Altendorf, A.C. Restificar, and T.G. Dietterich. Learning from sparse data by exploiting monotonicity constraints. In F. Bacchus and T. Jaakkola, editors, Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI05), pages 18–25. AUAI Press, 2005. 2. P.M. Anglin and R. Gençay. Semiparametric estimation of a hedonic price function. Journal of Applied Econometrics, 11(6):633–648, 1996. 3. A. Asuncion and D.J. Newman. UCI machine learning repository, 2007. 4. N. Barile and A. Feelders. Nonparametric monotone classification with MOCA. In F. Giannotti, editor, Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM 2008), pages 731–736. IEEE Computer Society, 2008. 5. A. Ben-David, L. Sterling, and Y. Pao. Learning and classification of monotonic ordinal concepts. Computational Intelligence, 5:45–49, 1989. 6. Arie Ben-David. Monotonicity maintenance in information-theoretic machine learning algorithms. Machine Learning, 19:29–43, 1995. 7. K. Cao-Van. Supervised ranking, from semantics to algorithms. PhD thesis, Universiteit Gent, 2003. 8. K. Dembczynski, W. Kotlowski, and R. Slowinski. Ensemble of decision rules for ordinal classification with monotonicity constraints. In Rough Sets and Knowledge Technology, volume 5009 of LNCS, pages 260–267. Springer, 2008. 9. M.J. Druzdzel and L.C. van der Gaag. Elicitation of probabilities for belief networks: combining qualitative and quantitative information. In P. Besnard and S. Hanks, editors, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI-95), pages 141–148. Morgan Kaufmann, 1995. 10. A. Feelders and M. Pardoel. Pruning for monotone classification trees. In M.R. Berthold, H-J. Lenz, E. Bradley, R. Kruse, and C. Borgelt, editors, Advances in Intelligent Data Analysis V, volume 2810 of LNCS, pages 1–12. Springer, 2003. 11. A. Feelders and L. van der Gaag. Learning Bayesian network parameters with prior knowledge about context-specific qualitative influences. In F. Bacchus and T. Jaakkola, editors, Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI-05), pages 193–200. AUAI Press, 2005. 12. R. Hastie, T.and Tibshirani and J. Friedman. The Elements of Statistical Learning. Springer, second edition, 2009. 13. K. Hechenbichler and K. Schliep. Weighted k-nearest-neighbor techniques and ordinal classification. Discussion Paper 399, Collaborative Research Center (SFB) 386 Statistical Analysis of discrete structures - Applications in Biometrics and Econometrics, 2004. 
Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009 63 14. W. Kotlowski and R. Slowinski. Statistical approach to ordinal classification with monotonicity constraints. In ECML PKDD 2008 Workshop on Preference Learning, 2008. 15. S. Lievens, B. De Baets, and K. Cao-Van. A probabilistic framework for the design of instance-based supervised ranking algorithms in an ordinal setting. Annals of Operations Research, 163:115–142, 2008. 16. J. Long. NASA metrics data program [http://mdp.ivv.nasa.gov/repository.html]. 2008. 17. W.L. Maxwell and J.A. Muckstadt. Establishing consistent and realistic reorder intervals in production-distribution systems. Operations Research, 33(6):1316–1341, 1985. 18. M.A. Meyer and J.M. Booker. Eliciting and Analyzing Expert Judgment: A Practical Guide. Series on Statistics and Applied Probability. ASA-SIAM, 2001. 19. R. Potharst and J.C. Bioch. Decision trees for ordinal classification. Intelligent Data Analysis, 4(2):97–112, 2000. 20. J. Sill. Monotonic networks. In Advances in neural information processing systems, NIPS (Vol. 10), pages 661–667, 1998. 21. M. Velikova, H. Daniels, and M. Samulski. Partially monotone networks applied to breast cancer detection on mammograms. In Proceedings of the 18th International Conference on Artificial Neural Networks (ICANN), volume 5163 of LNCS, pages 917–926. Springer, 2008. 22. M.P. Wellman. Fundamental concepts of qualitative probabilistic networks. Artificial Intelligence, 44:257–303, 1990. Ad Feelders and Rob Potharst (eds) Proceedings of MoMo 2009