LEARNING MONOTONE MODELS
FROM DATA
AT ECML PKDD 2009
MoMo 2009
September 7, 2009
Bled, Slovenia
Workshop Organization
• Rob Potharst (Erasmus Universiteit Rotterdam)
• Ad Feelders (Universiteit Utrecht)
Program Committee
Arno Siebes            Universiteit Utrecht               The Netherlands
Michael Berthold       Universität Konstanz               Germany
Malik Magdon-Ismail    Rensselaer Polytechnic Institute   USA
Ivan Bratko            University of Ljubljana            Slovenia
Hennie Daniels         Tilburg University                 The Netherlands
Oleg Burdakov          Linköping University               Sweden
Arie Ben-David         Holon Institute of Technology      Israel
Bernard De Baets       Universiteit Gent                  Belgium
Michael Rademaker      Universiteit Gent                  Belgium
Roman Slowinski        University of Poznan               Poland
Linda van der Gaag     Universiteit Utrecht               The Netherlands
Ioannis Demetriou      University of Athens               Greece
Preface
In many application areas of data analysis, we know beforehand that the relation between some
predictor variable and the response should be increasing (or decreasing). Such prior knowledge
can be translated to the model requirement that the predicted response should be a (partially)
monotone function of the predictor variables. There are also many applications where a non-monotone model would be considered unfair or unreasonable. Try explaining to a rejected job
applicant why someone who scored worse on all application criteria got the job! The same holds
for many other application areas, such as credit rating and university entrance selection.
These considerations have motivated the development of learning algorithms that are guaranteed
to produce (or have a bias towards) monotone models. Examples are monotone versions of
classification trees, neural networks, rule learning, Bayesian networks, nearest neighbor methods
and rough sets. Work on this subject has however been scattered over different research
communities (machine learning, data mining, neural networks, statistics and operations
research), and our aim with the workshop Learning Monotone Models from Data at ECML
PKDD 2009 is to bring together researchers from these different fields to exchange ideas.
Even though the number of submissions was not overwhelming, we were pleased to receive
some high quality contributions that have been included in these proceedings. We are also very
happy that Bernard De Baets from Ghent University in Belgium and Arie Ben-David from the
Holon Institute of Technology in Israel accepted our invitation to give their view of the research
area in two invited lectures at the workshop.
Rob Potharst and Ad Feelders, August 2009.
Table of Contents
Arie Ben-David, Monotone Ordinal Concept Learning:
Past, Present and Future…………………………………………………………..7
Bernard De Baets, Monotone but not Boring: how to deal with Reversed
Preference in Monotone Classification…………………………………………….9
Marina Velikova and Hennie Daniels, On Testing Monotonicity of Datasets…...11
Oleg Burdakov, Anders Grimvall and Oleg Sysoev, Generalized PAV
Algorithm with Block Refinement for Partially Ordered Monotonic Regression...23
Jure Žabkar, Martin Možina, Ivan Bratko and Janez Demšar, Discovering
Monotone Relations with Padé……………………………………………………39
Nicola Barile and Ad Feelders, Nonparametric Ordinal Classification with
Monotonicity Constraints…………………………………………………………47
Monotone Ordinal Concept Learning:
Past, Present and Future
Arie Ben-David
Department of Technology Management
Holon Institute of Technology
Holon, Israel
Abstract: This talk will survey the history of ordinal concept learning in general and that of
monotone ordinal learning in particular. Some approaches that were taken over the years will
be presented, as well as a survey of recent publications about the topic. Some key points that
should be addressed in future research will be discussed. In particular: the use of a standard
operating environment, the establishment of a "large enough" publicly available body of
ordinal benchmarking files, and the use of an agreed-upon set of metrics and procedures for
comparing the performance of the various models. Time will be allocated for an open
discussion about how these and other goals can be promoted.
Monotone but not Boring: how to deal with Reversed
Preference in Monotone Classification
Bernard De Baets
KERMIT
Research Unit Knowledge-based Systems
Ghent University
Ghent, Belgium
Abstract: We deal with a particular type of classification problem, in which there exists a linear
ordering on the label set (as in ordinal regression) as well as on the domain of each of the
features. Moreover, there exists a monotone relationship between the features and the class
labels. Such problems of monotone classification typically arise in a multi-criteria evaluation
setting. When learning such a model from a data set, we are confronted with data impurity in the
form of reversed preference. We present the Ordinal Stochastic Dominance Learner framework, which allows us to build various instance-based algorithms able to process such data. Moreover, we
explain how reversed preference can be eliminated by relating this problem to the maximum
independent set problem and solving it efficiently using flow network algorithms.
On Testing Monotonicity of Datasets
Marina Velikova¹ and Hennie Daniels²
¹ Department of Radiology, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands
[email protected]
² Center for Economic Research, Tilburg University, The Netherlands and ERIM Institute of Advanced Management Studies, Erasmus University Rotterdam, Rotterdam, The Netherlands
[email protected]
Abstract. In this paper we discuss two heuristic tests to measure the degree of monotonicity of datasets. It is shown that in the case of artificially generated data both measures discover the monotone and non-monotone variables as they were created in the test data. Furthermore, in the same study we demonstrate that the tests produce the same ordering on the independent variables from monotone to non-monotone. Finally
we prove that although both tests work well in practice, in a theoretical
sense, in some cases it can be impossible to decide whether the data were
generated from a monotone relation or not.
1 Introduction
In many classification and prediction problems in economics one has to deal
with relatively small data sets. In these circumstances flexible estimators like
neural networks have a tendency to overfit the data. It has been illustrated
that, in the case of a monotone relationship between the response variable and a (sub)set of independent variables, (partially) monotone neural networks yield smaller prediction errors and have a lower variance compared to ordinary feedforward networks ([1–3]). This is mainly due to the fact that the constraints
imposing a monotone relation between input and response variable repress noise,
but maintain the flexibility to model a non-linear association.
In most practical cases one has a prior belief about which of the independent variables have a monotone relation with the response variable ([4–6]). For example, we suppose that the house price increases with the number of square meters of living space, and that the income of a person rises with age and level of
education. However, in the case where only a limited amount of empirical data is
available, one would like to have a test to find out if the presupposed behaviour
is confirmed by the data. Earlier results on estimating regression functions under monotonicity constraints (isotonic regression) and testing monotonicity of a
given regression function can be found, for example, in [7, 8].
In this paper we compare two heuristic tests for monotonicity. The first index
measures the degree of monotonicity of the whole dataset and is defined as the
fraction of comparable pairs that are monotone for the datasets obtained after
removing one of the independent variables. The second measures the degree
of monotonicity of a certain independent variable with respect to the response
variable based on the monotonicity index introduced in [9, 10], which is defined
by fitting a neural network to the data, and therefore depends on the network
architecture. This monotonicity index ranges between 0 and 1 for each variable,
with 1 indicating full monotonicity and 0 indicating a non-monotone relation.
This index induces an ordering on the independent variables depending on the
degree of monotonicity.
To compare both measures for monotonicity, we develop a procedure to produce an ordering of the independent variables also from the first measure. It is
shown that both methods will produce the same ordering of the variables in the
case of experimentally generated data. We show that the second index has a
small variance with respect to different neural network architectures but varies
considerably with respect to the noise level in the data. Finally we prove that
if the data set has no comparable pairs, in which case the first measure is not
calculable, we can build a piecewise linear monotone model of which the data is
a sample. This is possible even when the data were generated from a completely non-monotone model. The model can be constructed either as a non-decreasing or as a non-increasing piecewise linear function. This result is related to Proposition 2.5 of [11],
where the authors show that an optimal linear fit to a non-linear monotone function may find the incorrect monotonicity direction, for certain input densities.
As a consequence we infer that the second index cannot be stable if the number
of degrees of freedom in the neural network model is unlimited.
2 Monotone Prediction Problems and Models
Let $X = \prod_{i=1}^{k} X_i$ be an input space represented by $k$ attributes (features). A particular point $x \in X$ is defined by the vector $x = (x_1, x_2, \ldots, x_k)$, where $x_i \in X_i$, $i = 1, \ldots, k$. Furthermore, a totally ordered set of labels $L$ is defined. In the discrete case, we have $L = \{1, \ldots, \ell_{\max}\}$ where $\ell_{\max}$ is the maximal label. Note that ordinal labels can be easily quantified by assigning numbers from 1 for the lowest category to $\ell_{\max}$ for the highest category. In the continuous case, we have $L \subset \mathbb{R}$ or $L \subset \mathbb{R}_{+}$. Unless the distinction is made explicitly, the term label is used to refer generally to the dependent variable irrespective of its type (continuous or discrete). Next, a function $f$ is defined as a mapping $f : X \to L$ that assigns a label $\ell \in L$ to every input vector $x \in X$.
In prediction problems, the objective is to find an approximation $\hat{f}$ of $f$ that is as close as possible, for example in the $L_1$, $L_2$, or $L_\infty$ norm. In particular, in regression we try to estimate the average dependence of $\ell$ given $x$, $E[\ell|x]$. Any estimator, such as the neural network used in this paper, is an approximation of this function. In classification, we look for a discrete mapping represented by a classification rule $r$ assigning a class $\ell_x$ to each point $x$ in the input space.
In reality, the information we have about $f$ is mostly provided by a dataset $D = \{(x_n, \ell_{x_n})\}_{n=1}^{N}$, where $N$ is the number of points, $x_n \in X$ and $\ell_{x_n} \in L$. Then,
$X = \{x_n\}_{n=1}^{N}$ is a set of $k$ independent variables represented by an $N \times k$ matrix, and $L = \{\ell_{x_n}\}_{n=1}^{N}$ is a vector with the values of the dependent variable. In this context, $D$ corresponds to a mapping $f_D : X \to L$ and we assume that $f_D$ is a close approximation of $f$. Ideally, $f_D$ is equal to $f$ over $X$, which is seldom the case in practice due to the noise present in $D$.
Hence, our ultimate goal in prediction problems is to obtain a close approximation $\hat{f}_{M_D}$ of $f$ by building a prediction model $M_D$ from the given data $D$. The main assumption we make here is that $f$ exhibits monotonicity properties with respect to the input variables and therefore $\hat{f}_{M_D}$ should also obey these properties in a strict fashion.
In this study we distinguish between two types of problems, and their respective models, concerning the monotonicity properties. The distinction is based on the set of input variables which are in monotone relationships with the response:
1. Totally monotone prediction problems (models): $f$ ($\hat{f}_{M_D}$) depends monotonically on all variables in the input space.
2. Partially monotone prediction problems (models): $f$ ($\hat{f}_{M_D}$) depends monotonically on some variables in the input space, but not all.
Without loss of generality, in the remainder of the paper, we consider monotone problems in which the label is continuous.
2.1 Total Monotonicity
In monotone prediction problems, we assume that $D$ is generated by a process with the following properties:

$$\ell_x = f(x) + \varepsilon \tag{1}$$

where $f$ is a monotone function, and $\varepsilon$ is a random variable with zero mean and constant variance $\sigma_\varepsilon^2$. Note that in classification problems $\varepsilon$ is not additive but multiplicative (non-homogeneous variance), and there is a small probability that the assigned class is incorrect. We say that $f$ is non-decreasing in $x$ if

$$x^1 \ge x^2 \Rightarrow f(x^1) \ge f(x^2) \tag{2}$$

where $x^1 \ge x^2$ is a partial ordering on $X$ defined by $x^1_i \ge x^2_i$ for $i = 1, \ldots, k$. The pair $(x^1, x^2)$ is called comparable if $x^1 \ge x^2$ or $x^1 \le x^2$, and if the relationship defined in (2) holds, it is also a monotone pair. Throughout the paper we assume that all monotone relationships are monotone increasing. If a relationship is monotone decreasing, then the data are transformed in such a way that it becomes monotone increasing. The degree of monotonicity DgrMon of a dataset $D$ is defined by:

$$\mathrm{DgrMon}(D) = \frac{\#\text{Monotone pairs}(D)}{\#\text{Comparable pairs}(D)}. \tag{3}$$

If all comparable pairs are monotone then DgrMon = 1 and the dataset is called monotone (non-decreasing by assumption).
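The degree of monotonicity in (3) can be computed directly by enumerating all pairs of observations. The following Python sketch is our own illustration (not code from the paper); it assumes the data are given as a NumPy array X of independent variables and a vector y of labels, with all monotone relationships already oriented as increasing.

```python
import numpy as np
from itertools import combinations

def dgr_mon(X, y):
    """Degree of monotonicity DgrMon(D), equation (3): the fraction of
    comparable pairs that are also monotone (all relations increasing)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n_comparable = 0
    n_monotone = 0
    for i, j in combinations(range(len(y)), 2):
        if np.all(X[i] <= X[j]):
            lo, hi = i, j          # x_i <= x_j component-wise
        elif np.all(X[i] >= X[j]):
            lo, hi = j, i
        else:
            continue               # incomparable pair
        n_comparable += 1
        if y[hi] >= y[lo]:
            n_monotone += 1
    return n_monotone / n_comparable if n_comparable else float("nan")
```

For continuous attributes, ties between identical input vectors occur with probability zero and are not treated specially here; the enumeration takes $O(N^2 k)$ time.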
2.2 Partial Monotonicity
In partially monotone problems, we have $X = X^m \times X^{nm}$ with $X^m = \prod_{i=1}^{m} X_i$ and $X^{nm} = \prod_{i=m+1}^{k} X_i$ for $1 \le m < k$. Furthermore, we have a data set $D = \{(x^m, x^{nm}, \ell_x)\}_{n=1}^{N}$, where $x^m \in X^m$, $x^{nm} \in X^{nm}$, and $N$ is the number of observations. A data point $x \in D$ is represented by $x = (x^m, x^{nm})$; the label of $x$ is $\ell_x$. We assume that $D$ is generated by the following process

$$\ell_x = f(x^m, x^{nm}) + \varepsilon, \tag{4}$$

where $f$ is a monotone function in $x^m$ and $\varepsilon$ is a random error defined as before. The partial monotonicity constraint of $f$ on $x^m$ is defined by

$$x^{nm}_1 = x^{nm}_2 \ \text{and} \ x^m_1 \ge x^m_2 \Rightarrow f(x_1) \ge f(x_2). \tag{5}$$

Henceforth, we call $X^m$ the set of monotone variables and $X^{nm}$ the set of non-monotone variables. By non-monotone we mean that a monotone dependence is not known a priori. Although we do not constrain the size of the two sets, our main assumption for the problems considered in this paper is that we have only a small number of non-monotone variables (e.g., fewer than 5) and a large number of monotone variables.
3 Tests for Monotonicity of a Dataset
In partially monotone problems, one usually has prior knowledge about the
monotone relationships of a subset of attributes with respect to the target,
whereas for the remaining attributes such dependences are unknown a priori.
To determine whether variables are monotone or non-monotone, we propose the
following two empirical tests based on the available data.
3.1 Measuring Monotonicity by Removal of a Variable
To determine the ordering of the independent variables from monotone to less
monotone we use the measure for monotonicity DgrMon defined in (3). Here we
assume that the independent variables are not highly correlated and all monotone relationships are monotone increasing. We compare the measure DgrMon
obtained for the original data and for the data with an independent variable
removed. The truncated dataset has one dimension less than the original data
but the same number of data points. Next we show that (i) if DgrMon decreases
after removing a variable then the variable is monotone and (ii) if DgrMon increases after removing a variable then the variable is non-monotone. To see the
effect of removing a single variable from a dataset we consider all possible cases
of pairs as illustrated in Fig. 1.
Fig. 1. Effect of removing the variable a on a comparable (monotone, non-monotone) or incomparable pair.

Let $(a, x, \ell)$ denote a data point in the original dataset $D$, where $a$ is the variable to be removed, $x$ are the other independent variables and $\ell$ is the label. First, we observe that the removal of a variable keeps or makes a pair comparable
(monotone or non-monotone). This implies that the number of comparable pairs in the truncated data will increase in comparison to that of the original data. Next we note that the incomparable pairs due to $x$ (i.e., $x_1 \not\le x_2$ and $x_1 \not\ge x_2$) will remain incomparable after the removal of a variable. So, these pairs will not have an effect on the degree of monotonicity and they are not shown in Fig. 1.
We then consider the remaining four cases of pairs and their transformation after the removal of the variable $a$, all illustrated in Fig. 1. Case (1) corresponds to a monotone pair in the original data which remains monotone after the removal of $a$. Case (2) presents a non-monotone pair, which remains non-monotone without the variable $a$. Case (3) is an incomparable pair in the original data, which turns into a non-monotone pair when $a$ is removed. Case (4) is also originally an incomparable pair, which becomes monotone after the removal of $a$. We denote the effect on monotonicity due to the removal of the variable $a$ as follows: '0' means that there is no change in the type of the pair (cases (1) and (2)), '−' means that a non-monotone pair is created (case (3)) and '+' means that a monotone pair is created (case (4)). We now look at the change of the degree of monotonicity DgrMon when we remove the variable $a$ from the data.
If DgrMon decreases after removing $a$, then case (3) is more likely to occur than case (4). This means that $a$ has a monotone relationship with the label, because we assumed that none of the variables, including $x$, has a decreasing relation with the label. If there is a substantial increase of DgrMon, then case (4) is more likely to happen than case (3). As a consequence, $a$ should be a non-monotone variable, because if $a$ were monotone, cases (1) and (3) would be more likely to occur than cases (2) and (4), which contradicts the increase of DgrMon. If DgrMon remains relatively the same, then it cannot be decided whether or not the variable is monotone. This might occur, for example, when two or more independent variables are highly correlated. For example, consider the extreme case when two variables are identical. Then removing either of the two variables, irrespective of their type (monotone or non-monotone), will not affect the degree
of monotonicity. In such cases, a straightforward solution is to remove all but
one of the highly correlated variables. The arguments stated above explain the
response of DgrMon due to the removal of a non-monotone or monotone variable
in Table 1, concerning the simulation study presented in Section 4.
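As a sketch of the procedure just described (again our own illustration, reusing the dgr_mon function given earlier), the removal test deletes one column at a time and compares DgrMon of the truncated data with that of the original data:

```python
import numpy as np

def removal_test(X, y):
    """Variable-removal test of Section 3.1: a drop in DgrMon after deleting a
    column suggests a monotone variable; a clear rise suggests a non-monotone one."""
    X = np.asarray(X, dtype=float)
    base = dgr_mon(X, y)
    change = {}
    for col in range(X.shape[1]):
        truncated = np.delete(X, col, axis=1)   # same points, one dimension less
        change[col] = dgr_mon(truncated, y) - base
    return base, change
```

Sorting the variables by the change (most negative first) reproduces the ordering from monotone to less monotone used in Section 4.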
3.2 Measuring Monotonicity by Function Approximation

For datasets with very few or no comparable pairs, however, DgrMon cannot be used as a consistent measure. This is likely to occur in cases where the number of independent variables is relatively large and the sample size relatively small. Note that the number of comparable pairs is of the order $2^{-k} N^2$, where $k$ is the number of independent variables and $N$ is the number of data points. In the case where there are no comparable pairs we can always construct a perfect fit with a monotone increasing or monotone decreasing piecewise linear function. This is shown in the Appendix.
An alternative test for monotonicity, which is independent of the number of comparable pairs, is based on the monotonicity index proposed in [9, 10]. To define this index we fit the data with a standard neural network. Then for every explanatory variable the partial derivative $\partial f/\partial x_i$ at each data point $x_p$ is computed, where $f$ denotes the neural network approximation. The monotonicity index in variable $x_i$ is defined as

$$\mathrm{MonInd}(x_i) = \frac{1}{N}\left|\sum_{p=1}^{N}\left(I^{+}\!\left(\frac{\partial f}{\partial x_i}(x_p)\right) - I^{-}\!\left(\frac{\partial f}{\partial x_i}(x_p)\right)\right)\right| \tag{6}$$

where $I^{+}(z) = 1$ if $z > 0$ and $I^{+}(z) = 0$ if $z \le 0$, $I^{-}(z) = 1$ if $z \le 0$ and $I^{-}(z) = 0$ if $z > 0$, $N$ is the number of observations and $x_p$ is the $p$-th observation (vector). Note that $0 \le \mathrm{MonInd}(x_i) \le 1$. A value of this index close to zero indicates a non-monotonic relationship; a value close to 1 indicates a monotonic relationship. The sign of the sum inside the absolute value indicates whether the relation of $f$ with respect to $x_i$ is increasing or decreasing.
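A minimal sketch of the index computation is given below; it is our own illustration and estimates the partial derivatives of the fitted model by central finite differences rather than using the analytic network derivatives of [9, 10]. Any smooth fitted approximation with a vectorised prediction function (for instance the predict method of a trained feed-forward regressor) can be plugged in as f_hat.

```python
import numpy as np

def mon_ind(f_hat, X, eps=1e-4):
    """Monotonicity index (6) for every input variable.

    f_hat : callable mapping an (N, k) array to N predicted labels.
    X     : (N, k) array of data points at which the derivatives are taken.
    """
    X = np.asarray(X, dtype=float)
    N, k = X.shape
    index = np.empty(k)
    for i in range(k):
        step = np.zeros(k)
        step[i] = eps
        deriv = (f_hat(X + step) - f_hat(X - step)) / (2 * eps)  # df/dx_i at each x_p
        i_plus = (deriv > 0).astype(int)      # I+(z)
        i_minus = (deriv <= 0).astype(int)    # I-(z)
        index[i] = abs(np.sum(i_plus - i_minus)) / N
    return index
```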
Since the monotonicity index depends on the neural network approximation, it is interesting to see how the index depends on the network architecture and the noise level in the data. To check this we conducted the following experiment with a one-dimensional dataset. We generated a vector $x$ of 200 observations drawn from the uniform distribution $U(0, 1)$. Based on $x$ we generated a label $\ell_x$ by

$$\ell_x = \sin\tfrac{3}{4}\pi x + \varepsilon,$$

where $\varepsilon$ represents noise. We computed the monotonicity index for different numbers of hidden nodes (1, 2, 5 and 10) and different noise levels $\varepsilon$: 0, $0.1\,\mathrm{Norm}(0, 1)$, $0.5\,\mathrm{Norm}(0, 1)$ and $\mathrm{Norm}(0, 1)$, where $\mathrm{Norm}(0, 1)$ denotes the normal distribution with zero mean and unit variance. The results are presented in Fig. 2.
Fig. 2. Change in the monotonicity index for different numbers of hidden nodes and noise levels.

If the neural network has one hidden neuron, the output depends linearly on the input variable and the index is 1. If the number of hidden nodes is increased, the network will capture the true signal and the monotonicity indices of the variables will tend to the right value. For the noise-free dataset ($\varepsilon = 0$) we expect that MonInd ≈ 1/3, and this is confirmed by the experiments with different numbers of hidden nodes. For small noise levels in the data and a number of hidden nodes larger than one, the neural network is still able to obtain a good approximation and a MonInd close to the expected value. However, when the noise level increases considerably, the indices become inaccurate and eventually get close to 0, as shown in Fig. 2. This is due to the fact that the network starts fitting the noise, which is completely random.
4 Experiments with Simulation Data
In this section we demonstrate the application of the monotonicity tests on
artificial data. Furthermore, to illustrate the importance of determining the true
monotone relationships for building correct models, we compare the performance
of three types of classifiers: partially monotone MIN-MAX networks, totally
monotone MIN-MAX networks and standard feed-forward neural networks with
weight decay. The first two classifiers are based on the two-hidden-layer network architecture introduced in [2] with a combination of minimum and maximum operators over linear functions. In [2] it is shown that totally monotone MIN-MAX networks are universal approximators of totally monotone functions, and in [1] this result is extended to partially monotone MIN-MAX networks.
We generate an artificial dataset $D$ of 200 points and 5 independent variables. Each independent variable $X_i$, $i = 1, \ldots, 5$, is drawn from the uniform distribution $U(0, 1)$. The dependent variable $\ell_x$ is generated by the following process:

$$\ell_x = x_1 + 1.5x_2 + 2x_3 + \cos 10x_4 + \sin 12x_5 + 0.01\,\mathrm{Norm}(0, 1)$$

Clearly, $\ell_x$ is a partially monotone label with added noise. We perform both tests for monotonicity described in the previous section. The monotonicity index is computed for two types of standard networks: one with 4 hidden nodes and one with 8 hidden nodes. The results are reported in Tables 1 and 2.
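For illustration, the simulated dataset can be regenerated along the lines described above; the sketch below is our own and uses an arbitrary random seed, so pair counts and index values will differ slightly from those in Tables 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(2009)            # arbitrary seed, not from the paper
X = rng.uniform(0.0, 1.0, size=(200, 5))
y = (X[:, 0] + 1.5 * X[:, 1] + 2.0 * X[:, 2]
     + np.cos(10 * X[:, 3]) + np.sin(12 * X[:, 4])
     + 0.01 * rng.standard_normal(200))

print("DgrMon of the full data:", dgr_mon(X, y))
print("change in DgrMon per removed variable:", removal_test(X, y)[1])
```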
Table 1. Degree of monotonicity (DgrMon) of the original and modified simulation data after removing a variable

Removed variable     Number of comparable pairs   DgrMon
- (original data)    1003                         0.664
x1                   2193                         0.623
x2                   2093                         0.563
x3                   2235                         0.499
x4                   2586                         0.743
x5                   1949                         0.737
Table 2. Monotonicity index for all independent variables in the simulation data

Variable   NNet-4   NNet-8
x1         1        0.94
x2         1        0.97
x3         1        0.97
x4         0.38     0.37
x5         0.68     0.44
With respect to the degree of monotonicity, we observe that after the removal of one of the monotone variables the measure decreases compared to the original data, whereas the removal of one of the non-monotone variables leads to a considerably larger measure. The monotonicity indices reported in Table 2 also comply with the expected monotone relationships in the simulation data. Note that the results are comparable with respect to both function approximators: a neural network
with 4 and 8 hidden nodes. Both tests induce the same ordering of the variables from monotone to less monotone: $x_3 \succ x_2 \succ x_1 \succ x_5 \succ x_4$.
Using this knowledge about the (non)-monotone relationships in the simulation data, we apply partially monotone MIN-MAX networks. As benchmark
methods for comparison we use totally monotone MIN-MAX networks and standard neural networks with weight decay. We randomly split the original data into
training data of 150 observations (75%) and test data of 50 observations (25%).
The random partition of the data is repeated 20 times. The performance of the
models is measured by computing the mean-squared error (MSE). We use nine
combinations of parameters for MIN-MAX networks: groups - 2, 3, 4; planes - 2,
3, 4, and for standard neural networks: hidden nodes - 4, 8, 12; weight decay 0.000001, 0.00001, 0.0001. At each of the twenty runs, we select the model that
obtains the minimum MSE out of the nine parameter combinations with each
method. Table 3 reports the minimal, mean and maximal value and the variance
of the estimated MSE across the runs.
Table 3. Estimated prediction errors of the partially monotone networks (PartMonNet), totally monotone networks (FullMonNet) and standard neural networks with weight decay (NNet) for the simulation data

Method        Min     Mean    Max     Variance
PartMonNet    0.395   0.534   0.677   0.0089
FullMonNet    0.845   1.118   1.331   0.0168
NNet          0.537   0.946   1.173   0.0277
The results show that the models generated by partially monotone MIN-MAX networks are more accurate than the models generated by totally monotone MIN-MAX networks and standard neural networks. Furthermore, the variation in the errors across runs is smaller for partially monotone MIN-MAX networks than for the benchmark methods, as are the differences between the respective maximum and minimum error values in Table 3.
To check the significance of the differences in the results of the networks, we performed statistical tests. Since the test set in the experiments with the three methods is the same, we conduct paired t-tests to test the null hypothesis that the models derived from partially monotone MIN-MAX networks have the same errors as the models derived from each of the benchmark methods against the one-sided alternatives. The p-values obtained from the tests and the confidence intervals at 95% and 90% are reported in Table 4. The results show that partially monotone MIN-MAX networks lead to models with significantly smaller errors than the errors of the models derived from the benchmark networks.
In addition, we perform F-tests for the differences between the error variances of the partially monotone models and the benchmark models. With respect to the totally monotone MIN-MAX networks, the difference is statistically insignificant: the p-value is 8.78%. Regarding the standard neural networks, partially monotone MIN-MAX networks have significantly lower variances for the MSE: the p-value is 0.86%.

Table 4. p-values of paired t-tests and one-sided confidence intervals for the difference in error means in the simulation study

Comparison                   p-value   95% confidence interval   90% confidence interval
PartMonNet vs. FullMonNet    0.0%      (-0.651, -0.512)          (-0.640, -0.528)
PartMonNet vs. NNet          0.0%      (-0.493, -0.332)          (-0.480, -0.345)
5 Conclusions and Future Work
In this paper we developed a method to measure the degree of monotonicity of
a response variable with respect to the independent variables. This is done by
calculating two measures for monotonicity. The first one is based on the removal of one independent variable at a time and computing the degree of monotonicity as the fraction of monotone pairs out of the comparable pairs in the truncated data.
The second monotonicity measure is based on a neural network approximation to
fit the data. This allows us to compute in a straightforward manner a measure
for a variable’s influence (monotone or non-monotone) on the label using the
partial derivative of the network’s output with respect to every variable. We
have shown that both monotonicity measures induce the same ordering on the
independent variables, from completely monotone to less monotone.
We also exploited the monotonicity properties of the variables in the construction of a neural network approximation to fit the data. It turned out that partially monotone networks constructed in this way have lower prediction errors compared to totally monotone networks and ordinary feedforward neural networks. With respect to the latter networks, the variance of the partially monotone networks was also significantly smaller.
To conclude, the study presented here showed that investigating the monotonicity properties of the data, in order to enforce them in the modelling stage, is required to guarantee a successful outcome of the knowledge discovery process.
Although the proposed methods are an important contribution in this direction, a number of open questions remain. Both measures for monotonicity are empirical and depend largely on the data under study. This requires further investigation of the impact of factors such as the input distribution, the noise level and the correlation between variables on the computed measures. Moreover, the output of each monotonicity measure lies in the continuous range between 0 and 1, providing an order of the variables according to their monotone influence. However, to determine whether or not monotonicity properties hold with respect to a variable, we further need to define a benchmark value to compare against. Finally, we plan experiments with real-world data to get more insight into the development and application of the proposed measures.
Appendix
In the proposition below we show that if there are no comparable pairs in a
dataset D we can construct a perfect non-decreasing (also non-increasing) monotone fit with a piecewise linear function.
Proposition 1. Suppose $D$ is a dataset with no comparable pairs. Then there exists a piecewise linear non-decreasing (non-increasing) function $f$ such that $f$ fits $D$ exactly.
Proof. In the proof we use a construction similar to the one in [2]. For simplicity we restrict ourselves to the 2-dimensional case and to the case when $f$ is non-decreasing. The generalisation to higher dimensions is straightforward.
Let $D = (x_i, y_i, \ell_i)_{i=1}^N$ with no comparable pairs. Now we define 3 hyperplanes for every point $d_i = (x_i, y_i, \ell_i) \in D$ as follows:

$$h^i_1 = \ell_i \ \text{(constant)}, \qquad h^i_2 = a(x - x_i) + \ell_i, \ a > 0, \qquad h^i_3 = b(y - y_i) + \ell_i, \ b > 0.$$

Next we define a piecewise linear non-decreasing function by

$$f^i(x, y) = \min_{j=1,2,3} h^i_j(x, y).$$

Note that $f^i(x_i, y_i) = \ell_i$. Finally we define

$$F(x, y) = \max_{i=1,\ldots,N} f^i(x, y).$$

We now show that for $a$ and $b$ large enough the following holds:
1. $F$ is non-decreasing in $x$ and $y$, and
2. $F(x_i, y_i) = \ell_i$.
Point 1 follows directly from the definition of $f$. To prove Point 2, note that $f^i(x_i, y_i) = \ell_i$ and therefore $F(x_i, y_i) \ge \ell_i$. Suppose that $F(x_i, y_i) > \ell_i$. Then for some $k$: $f^k(x_i, y_i) > \ell_i$, implying $h^k_j(x_i, y_i) > \ell_i$ for $j = 1, 2, 3$. So,

$$\ell_k > \ell_i, \qquad a(x_i - x_k) + \ell_k > \ell_i, \qquad b(y_i - y_k) + \ell_k > \ell_i.$$

Since the points $(x_i, y_i)$ and $(x_k, y_k)$ are incomparable, either $(x_i - x_k) < 0$ or $(y_i - y_k) < 0$. This leads to a contradiction if $a$ and $b$ are large enough. □
Remark: To construct a monotone non-increasing piecewise linear fit to $D$, we follow the same procedure as in the proof of Proposition 1, but for the mirror image of the data set $D$ in which every point $(x, y)$ is mapped to $(-x, -y)$. The function obtained in this way is transformed back by taking $g(x, y) = f(-x, -y)$. Then $g$ fits $D$ and is monotone non-increasing.
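The construction in the proof is easy to reproduce numerically. The sketch below is our own code; the data-driven rule for choosing a and b is merely one way of making them "large enough" and is not prescribed by the paper, and the three example points are hypothetical.

```python
import itertools

def monotone_fit(points):
    """Non-decreasing piecewise linear interpolant F from Proposition 1 for
    pairwise incomparable points (x_i, y_i, l_i) in the plane."""
    pts = [(float(x), float(y), float(l)) for x, y, l in points]
    # pick slopes a, b "large enough": for every pair with x_k > x_i (resp.
    # y_k > y_i), the sloped hyperplane of point k drops to l_i or below at point i
    a = b = 1.0
    for (x1, y1, l1), (x2, y2, l2) in itertools.permutations(pts, 2):
        if x2 > x1:
            a = max(a, (l2 - l1) / (x2 - x1) + 1.0)
        if y2 > y1:
            b = max(b, (l2 - l1) / (y2 - y1) + 1.0)

    def F(x, y):
        # F(x, y) = max_i min( l_i, a(x - x_i) + l_i, b(y - y_i) + l_i )
        return max(min(l, a * (x - xi) + l, b * (y - yi) + l) for xi, yi, l in pts)
    return F

# Hypothetical example: three pairwise incomparable points are fitted exactly.
data = [(0.1, 0.9, 5.0), (0.5, 0.5, 1.0), (0.9, 0.1, -3.0)]
F = monotone_fit(data)
print([F(x, y) for x, y, _ in data])   # -> [5.0, 1.0, -3.0]
```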
References
1. Velikova, M.V.: Monotone models for prediction in data mining. PhD thesis,
Tilburg University, Tilburg, The Netherlands (2006)
2. Sill, J.: Monotonic networks. In: Advances in Neural Information Processing
Systems (NIPS). Volume 10. MIT Press (1998) 661–667
3. Minin, A., Lang, B.: Comparison of neural networks incorporating partial monotonicity by structure. Lecture Notes in Computer Science 5164 (2008) 597–606
4. Velikova, M., Daniels, H.: Partially monotone networks applied to breast cancer detection on mammograms. Lecture Notes in Computer Science 5163 (2008) 917–926
5. Feelders, A., Velikova, M., Daniels, H.: Two polynomial algorithms for relabeling
non-monotone data. Technical report UU-CS-2006-046, Utrecht University (2006)
6. Rademaker, M., De Baets, B., De Meyer, H.: Data sets for supervised ranking: to
clean or not to clean. In: Proceedings of the fifteenth Annual Machine Learning
Conference of Belgium and The Netherlands: BENELEARN 2006, Ghent, Belgium.
(2006) 139–146
7. Ghosal, S., Sen, A., Van der Vaart, A.W.: Testing monotonicity of regression.
Annals of Statistics 28(4) (2000) 1054–1082
8. Spouge, J., Wan, H., Wilbur, W.: Least squares isotonic regression in two dimensions. Journal of Optimization Theory and Applications 117(3) (2003) 585–605
9. Verkooijen, J.: Neural networks in economic modelling - an empirical study. PhD
thesis, Tilburg University, Tilburg, The Netherlands (1996)
10. Daniels, H.A.M., Kamp, B.: Application of MLP networks to bond rating and
house pricing. Neural Computing & Applications 8(3) (1999) 226–234
11. Magdon-Ismail, M., Sill, J.: A linear fit gets the correct monotonicity directions.
Machine Learning 70(1) (2008) 21–43
Generalized PAV Algorithm with Block
Refinement for Partially Ordered
Monotonic Regression⋆
Oleg Burdakov¹, Anders Grimvall², and Oleg Sysoev²
¹ Department of Mathematics,
² Department of Computer and Information Sciences,
Linköping University, SE-58183 Linköping, Sweden
{Oleg.Burdakov, Anders.Grimvall, Oleg.Sysoev}@liu.se
⋆ This work was supported by the Swedish Research Council.
Abstract. In this paper, the monotonic regression problem (MR) is considered. We have recently generalized for MR the well-known Pool-Adjacent-Violators algorithm (PAV) from the case of completely to partially ordered data sets. The new algorithm, called GPAV, combines high accuracy with a low computational complexity which grows quadratically with the problem size. The actual growth observed in practice is typically far lower than quadratic. The fitted values of the exact MR solution compose blocks of equal values. The GPAV approximation to this solution also has a block structure. We present here a technique for refining the blocks produced by the GPAV algorithm to make the new blocks much closer to those in the exact solution. This substantially improves the accuracy of the GPAV solution and does not deteriorate its computational complexity. The computational time for the new technique is approximately triple the time of running the GPAV algorithm. Its efficiency is demonstrated by the results of our numerical experiments.
Key words: Monotonic regression, Partially ordered data set, Pool-adjacent-violators algorithm, Quadratic programming, Large scale optimization, Least distance problem.
1 Introduction
The monotonic regression problem (MR), which is also known as the isotonic
regression problem, deals with an ordered data set of observations. We focus on
partially ordered data sets, because in this case, in contrast to completely ordered
sets, there are no efficient algorithms for solving large scale MR problems.
The MR problem has important statistical applications in physics, chemistry, medicine, biology, environmental science, etc. (see [2, 23]). It is also present in operations research (production planning, inventory control, multi-center location, etc.) [13, 15, 24] and signal processing [22, 25]. All these problems are often a kind of monotonic data fitting problem, which is addressed in Section 4, where we use it for generating test problems. The most challenging of the applied MR
problems are characterized by a very large number of observations, denoted here by n. For such large-scale problems, it is of great practical value to develop algorithms whose complexity does not rise too rapidly with n.
To formulate the MR problem, we introduce the following notation. The vector of observed values is denoted by $Y \in \mathbb{R}^n$. The partial order is expressed here with the use of a directed acyclic graph $G(N, E)$, where $N = \{1, 2, \ldots, n\}$ is a set of nodes and $E$ is a set of edges. Each node is associated with one observation, and each edge is associated with one monotonicity relation, as described below. In the MR problem, we must find, among the vectors $u \in \mathbb{R}^n$ preserving the monotonicity of the partially ordered data set, the one that is closest to $Y$ in the least-squares sense. It can be formulated as follows. Given $Y$, $G(N, E)$ and a strictly positive vector of weights $w \in \mathbb{R}^n$, find the vector of fitted values $u^* \in \mathbb{R}^n$ that solves the problem:

$$\min \sum_{i=1}^{n} w_i (u_i - Y_i)^2 \quad \text{s.t. } u_i \le u_j \ \ \forall (i, j) \in E \tag{1}$$

It can be viewed as a problem of minimizing the weighted distance from the vector $Y$ to the set of feasible points, which is a convex cone. This strictly convex quadratic programming problem has a unique optimal solution.
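For small instances, problem (1) can be handed directly to a general-purpose solver, which is useful for checking specialised algorithms on toy examples. The sketch below is our own illustration (the 4-node data and edges are hypothetical, not taken from the paper); it uses SciPy's SLSQP method with the constraints u_j − u_i ≥ 0 for every edge (i, j).

```python
import numpy as np
from scipy.optimize import minimize

def monotonic_regression_qp(Y, w, E):
    """Solve the MR problem (1) as a small quadratic program (practical only
    for modest n). E is a list of edges (i, j) encoding u_i <= u_j."""
    Y, w = np.asarray(Y, float), np.asarray(w, float)
    n = len(Y)
    A = np.zeros((len(E), n))
    for r, (i, j) in enumerate(E):
        A[r, i], A[r, j] = -1.0, 1.0                  # row encodes u_j - u_i >= 0
    cons = [{"type": "ineq", "fun": lambda u: A @ u}]
    res = minimize(lambda u: np.sum(w * (u - Y) ** 2), Y.copy(),
                   jac=lambda u: 2 * w * (u - Y),
                   constraints=cons, method="SLSQP")
    return res.x

# Hypothetical 4-node partial order with edges 0->2, 1->2, 1->3 (0-based).
print(monotonic_regression_qp(Y=[1.0, 3.0, 2.0, 0.5], w=[1.0] * 4,
                              E=[(0, 2), (1, 2), (1, 3)]))
```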
The conventional quadratic programming algorithms (see [19]) can be used for solving the general MR problem only in the case of moderate values of n, up to a few hundred.
There are some algorithms developed especially for solving this problem. They can be divided into two separate groups, namely, exact and approximate MR algorithms.
The most efficient and the most widely used of the exact algorithms is the Pool-Adjacent-Violators (PAV) algorithm [1, 14, 16]. Although it has a very low computational complexity, namely $O(n)$ [11], the area of its application is severely restricted to the completely ordered case. In this case, the graph is just a path and the monotonicity constraints in (1) take the simple form:

$$u_1 \le u_2 \le \ldots \le u_n.$$

In [20], the PAV algorithm was extended to a more general, but still restricted, case in which the graph defining the monotonicity is a rooted tree. The computational complexity of this exact algorithm is $O(n \log n)$.
The minimum lower set algorithm [5, 6] is known to be the first exact algorithm designed for solving partially ordered MR problems. If the order is complete, its complexity is $O(n^2)$. In the partial order case, its complexity is unknown, but it is expected to grow with n much more rapidly than quadratically. The best known computational complexity among the exact algorithms able to solve partially ordered MR problems is $O(n^4)$; it refers to an algorithm introduced in [15, 24]. This algorithm is based on solving the dual problem to (1) by solving $O(n)$ minimal flow problems. The resulting growth of computational requirements in proportion to $n^4$ becomes excessive for large n.
The isotonic block class with recursion (IBCR) algorithm was developed in [4] for solving partially ordered MR problems. It is an exact algorithm. Its computational complexity is unknown, but according to [21] it is bounded below by $O(n^3)$. In practice, despite this estimate, it is the fastest among the exact algorithms. This is the reason why we use it in our numerical experiments to compare the performance of the algorithms.
Perhaps the most widely used inexact algorithms for solving large-scale partially ordered MR problems are based on simple averaging techniques [17, 18, 26]. They can be easily implemented and have a relatively low computational burden, but the quality of their approximation to $u^*$ is very case-dependent and, furthermore, the approximation error can be too large (see [7]).
In [7, 8], we generalized the well-known Pool-Adjacent-Violators algorithm (PAV) from the case of completely to partially ordered variables. The new algorithm, called GPAV, combines a low computational complexity of $O(n^2)$ with high accuracy. In practice the computational time grows less rapidly with n than in this worst-case estimate. The GPAV solution is feasible, and it is optimal if its active constraints are regarded as equalities. The corresponding active set induces a partitioning of N into connected subsets of nodes, called blocks, which are obtained after excluding from E the edges representing the non-active constraints. We present here a block refinement technique which substantially improves the accuracy of the GPAV solution, while the overall computational complexity remains $O(n^2)$. Its run time is between two and three times the time of running GPAV.
Since the GPAV and IBCR algorithms coincide with the PAV algorithm when the order is complete, they can both be viewed as its generalizations. Although they have much in common, the main difference is that the first of them is an approximate algorithm, while the second one is exact. Moreover, according to the results of the numerical experiments reported here, GPAV is much faster than IBCR, and the difference in computational time grows rapidly with n.
The paper is organized as follows. The block refinement technique is introduced in Section 2. This technique is illustrated with a simple example in
Section 3. The results of our numerical experiments are presented and discussed
in Section 4. In Section 5, we draw conclusions about the performance of the
block refinement technique and discuss future work.
2 Block Refinement Technique
Our block refinement technique is based on the GPAV algorithm. To present this
algorithm, we will use the definitions and notations from [8]. Let
i− = {j ∈ N : (j, i) ∈ E}
denote the set of all immediate predecessors for node i ∈ N . The connected
subset of nodes B ⊂ N is called a block if, for any i, j ∈ B, all the nodes in all
the undirected paths between i and j belong to B. The block Bi is said to be an
immediate predecessor for Bj , or adjacent to Bj , if there exist k ∈ Bi and l ∈ Bj
such that k ∈ l− . Let Bi− denote the set of all blocks adjacent to block Bi . We
associate each block with one of its nodes, which is called the head node. If i is
the head node for some block, we denote this block by Bi . The set of all head
nodes is denoted by H. The set of blocks {Bi }i∈H , where H ⊂ N , is called a
block partitioning of N if
[
Bi = N
i∈H
and
Bi ∩ Bj = ∅,
∀i 6= j,
i, j ∈ H.
Let Wk denote the weight of the block Bk . It is computed by the formula
X
wi .
Wk =
i∈Bk
The GPAV algorithm produces a block partitioning of N . It returns also a
solution u ∈ Rn which is uniquely defined by the block partitioning as follows.
If node i belongs to a block Bk , the corresponding component ui of the solution
equals the block common value:
P
wi Yi
.
(2)
Uk = i∈Bk
Wk
This algorithm treats the nodes N or, equivalently, the observations in consecutive order. Any topological order of N is acceptable, but the accuracy of the resulting solution depends on the choice (see [8]). We assume that the nodes N have been sorted topologically. The GPAV algorithm initially creates the singleton blocks $B_i = \{i\}$ and sets $B_i^- = i^-$ for all nodes $i \in N$. Subsequently it operates with the blocks only. It treats them in the order consistent with the topological sort, namely, $B_1, B_2, \ldots, B_n$. When the block $B_k$ is treated at iteration $k$, its common value (2) is compared with those of its adjacent blocks. While there exists an adjacent violator of the monotonicity, the block $B_k$ absorbs the one responsible for the most severe violation. The common value $U_k$ and the lists of adjacent blocks are updated accordingly.
The outlined GPAV algorithm can be formally presented as follows.
Algorithm 1 (GPAV)
Given: vectors $w, Y \in \mathbb{R}^n$ and a directed acyclic graph $G(N, E)$ with topologically sorted nodes.
Set $H = N$.
For all $i \in N$, set $B_i = \{i\}$, $B_i^- = i^-$, $U_i = Y_i$ and $W_i = w_i$.
For $k = 1, 2, \ldots, n$, do:
    While there exists $i \in B_k^-$ such that $U_i \ge U_k$, do:
        Find $j \in B_k^-$ such that $U_j = \max\{U_i : i \in B_k^-\}$.
        Set $H = H \setminus \{j\}$.
        Set $B_k^- = B_j^- \cup B_k^- \setminus \{j\}$.
        Set $U_k = (W_k U_k + W_j U_j)/(W_k + W_j)$.
        Set $B_k = B_k \cup B_j$ and $W_k = W_k + W_j$.
        For all $i \in H$ such that $j \in B_i^-$, set $B_i^- = B_i^- \cup \{k\} \setminus \{j\}$.
For all $k \in H$ and for all $i \in B_k$, set $u_i = U_k$.
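A direct transcription of Algorithm 1 into Python is sketched below (our own code, kept close to the pseudocode rather than optimised). Nodes are assumed to be numbered 0, ..., n−1 in a topological order, and preds[i] holds the immediate predecessors i⁻.

```python
import numpy as np

def gpav(Y, w, preds):
    """Forward-mode GPAV (Algorithm 1): approximate solution of problem (1).

    Y, w  : observed values and positive weights, in topological node order.
    preds : preds[i] = set of immediate predecessors of node i (all < i).
    """
    n = len(Y)
    in_H = [True] * n                                 # membership in the head set H
    blocks = {i: {i} for i in range(n)}               # B_i
    adj = {i: set(preds[i]) for i in range(n)}        # B_i^-: adjacent predecessor blocks
    U = [float(Y[i]) for i in range(n)]               # block common values
    W = [float(w[i]) for i in range(n)]               # block weights
    for k in range(n):
        while True:
            violators = [j for j in adj[k] if U[j] >= U[k]]
            if not violators:
                break
            j = max(violators, key=lambda b: U[b])    # most severe violation
            in_H[j] = False
            adj[k] = (adj[j] | adj[k]) - {j}
            U[k] = (W[k] * U[k] + W[j] * U[j]) / (W[k] + W[j])
            blocks[k] |= blocks[j]
            W[k] += W[j]
            for i in range(n):                        # re-route blocks adjacent to j
                if in_H[i] and i != k and j in adj[i]:
                    adj[i] = (adj[i] | {k}) - {j}
    u = np.empty(n)
    for k in range(n):
        if in_H[k]:
            for i in blocks[k]:
                u[i] = U[k]
    return u
```

The backward mode corresponds to calling the same routine on the reversed graph with Y replaced by −Y, i.e. on problem (3) below.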
We shall refer to this algorithm as the forward mode of GPAV. We denote its output as $u^F$ and $\{B_k^F\}_{k \in H^F}$. Its backward mode is related to solving in $u^B \in \mathbb{R}^n$ the MR problem:

$$\min \sum_{i=1}^{n} w_i (u_i^B - Y_i^B)^2 \quad \text{s.t. } u_i^B \le u_j^B \ \ \forall (i, j) \in E^B \tag{3}$$

where $Y_i^B = -Y_i$ and $E^B = \{(i, j) : (j, i) \in E\}$. Note that the optimal solution to this problem equals $-u^*$. The backward mode of GPAV returns $u^B$ and $\{B_k^B\}_{k \in H^B}$. In this mode, one can use the inverse of the topological order used by the forward mode.
In our block refinement technique, it is assumed that two or more approximate solutions to problem (1) are available. They could result from applying
any approximate algorithms, for instance, the forward and backward modes of
Algorithm GPAV in combination with one or more topological orders. We will
refer to them as old solutions.
We denote the component-wise average of the old solutions by v; i.e., if two old solutions u′ and u′′ are available, then v = (u′ + u′′)/2. The
vector v is feasible in problem (1), and it can be viewed as a new approximation
to its solution. It induces a block partitioning of the nodes N . The new blocks
are denoted by Bknew where k belongs to the new set of head nodes H new . Let
h(j) denote the head element of the new block which contains node j, i.e. if
j ∈ Bknew then h(j) = k.
It can be seen that the new blocks are, roughly speaking, nonempty intersections of the old blocks. Thus, the new ones result from a certain splitting of
the old ones. The use of the vector v will allow us to simplify the construction
of {Bknew }k∈H new . This idea is presented by the following algorithm.
Algorithm 2 (SPLIT)
Given: vector $v \in \mathbb{R}^n$ and a directed acyclic graph $G(N, E)$ with topologically sorted nodes.
Set $H^{new} = \emptyset$ and $E^{new} = \emptyset$.
For $i = 1, 2, \ldots, n$, do:
    If there exists $j \in i^-$ such that $v_j = v_i$,
        then for $k = h(j)$, set $B_k^{new} = B_k^{new} \cup \{i\}$,
        else set $H^{new} = H^{new} \cup \{i\}$ and $B_i^{new} = \{i\}$.
    For all $j \in i^-$ such that $h(j) \ne h(i)$ and $(h(j), h(i)) \notin E^{new}$, do:
        set $E^{new} = E^{new} \cup \{(h(j), h(i))\}$.

This algorithm returns not only the new blocks $\{B_k^{new}\}_{k \in H^{new}}$, but also the set of directed edges $E^{new} \subset H^{new} \times H^{new}$. The new blocks are represented in the new directed acyclic graph $G(H^{new}, E^{new})$ by their head elements.
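Algorithm 2 translates almost line for line into Python; the sketch below is our own and adds a small numerical tolerance for the equality test v_j = v_i, which the pseudocode states exactly.

```python
def split(v, preds, tol=1e-12):
    """Algorithm SPLIT: build the new blocks and the reduced edge set E^new
    from a feasible vector v (nodes in topological order, preds as in gpav)."""
    n = len(v)
    head = [0] * n                 # h(i): head node of the new block containing i
    blocks = {}                    # B_k^new, keyed by head node k in H^new
    E_new = set()
    for i in range(n):
        equal = [j for j in preds[i] if abs(v[j] - v[i]) <= tol]
        if equal:
            k = head[equal[0]]     # join the block of an equal-valued predecessor
            blocks[k].add(i)
            head[i] = k
        else:
            blocks[i] = {i}        # i starts a new block and becomes its head
            head[i] = i
        for j in preds[i]:
            if head[j] != head[i]:
                E_new.add((head[j], head[i]))
    return blocks, head, E_new
```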
It can be easily seen that a topological order in the new graph can be obtained
by sorting the nodes H new in increasing order of the values vk .
Using the notation

$$w_k^{new} = \sum_{i \in B_k^{new}} w_i, \qquad Y_k^{new} = \sum_{i \in B_k^{new}} y_i w_i / w_k^{new}, \tag{4}$$

where $k \in H^{new}$, we can formulate for the new graph the new MR problem:

$$\min \sum_{k \in H^{new}} w_k^{new} (u_k^{new} - Y_k^{new})^2 \quad \text{s.t. } u_i^{new} \le u_j^{new} \ \ \forall (i, j) \in E^{new} \tag{5}$$

Since the number of unknowns $u_k^{new}$ is typically less than n and $|E^{new}| < |E|$, this problem is smaller in size than the original problem (1). One can apply here, for instance, Algorithm GPAV. The resulting refined blocks yield an approximate solution to (1) which in practice provides a very high accuracy (see Section 4). Furthermore, if the blocks $\{B_k^{new}\}_{k \in H^{new}}$ are the same as those induced by $u^*$, then the optimal solution to problem (5) produces $u^*$.
In our numerical experiments, we use the following implementation of the
block refinement technique.
Algorithm 3 (GPAVR)
Given: vectors $w, Y \in \mathbb{R}^n$ and a directed acyclic graph $G(N, E)$ with topologically sorted nodes.
1. Use the forward and backward modes of Algorithm GPAV to produce two approximations, $u_i^F$ and $-u_i^B$, to $u^*$.
2. Run Algorithm SPLIT with $v = (u_i^F - u_i^B)/2$.
3. For all $k \in H^{new}$, compute $w_k^{new}$ and $Y_k^{new}$ by formula (4).
4. Use $v$ to sort topologically the nodes $H^{new}$.
5. Use Algorithm GPAV to find $u^{new}$ which approximately solves the new MR problem (5).
6. For all $k \in H^{new}$ and for all $i \in B_k^{new}$, set $u_i = u_k^{new}$.
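Steps 2–6 of Algorithm GPAVR can be sketched on top of the gpav and split functions given above. The routine below is our own illustration; it takes the two old solutions as arguments (for the backward mode this would be the negated solution of problem (3)) and assumes, for simplicity, that ties in v between different heads do not upset the topological order obtained by sorting on v.

```python
import numpy as np

def gpavr_refine(Y, w, preds, u_old1, u_old2):
    """Block refinement (steps 2-6 of Algorithm GPAVR, sketch)."""
    Y, w = np.asarray(Y, float), np.asarray(w, float)
    v = (np.asarray(u_old1, float) + np.asarray(u_old2, float)) / 2.0
    blocks, head, E_new = split(v, preds)
    heads = sorted(blocks, key=lambda k: v[k])        # topological order of the new graph
    pos = {k: t for t, k in enumerate(heads)}
    w_new = [sum(w[i] for i in blocks[k]) for k in heads]              # formula (4)
    Y_new = [sum(w[i] * Y[i] for i in blocks[k]) / wk
             for k, wk in zip(heads, w_new)]
    preds_new = [set() for _ in heads]
    for (a, b) in E_new:
        preds_new[pos[b]].add(pos[a])
    u_reduced = gpav(Y_new, w_new, preds_new)         # solve the reduced problem (5)
    u = np.empty(len(Y))
    for t, k in enumerate(heads):
        for i in blocks[k]:
            u[i] = u_reduced[t]
    return u
```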
The block refinement technique is illustrated in the next section with a simple
example.
It is not difficult to show that the computational complexity of Algorithm GPAVR is estimated, like that of Algorithm GPAV, as $O(n^2)$. Indeed, Step 1 involves two runs of Algorithm GPAV, which does not break the estimate. The computational burden of Step 2 is proportional to the number of edges in E, which does not exceed $n^2$. The number of arithmetic operations in Steps 3 and 6 grows in proportion to n. The computational complexity of the sorting in Step 4 is estimated as $O(n \log n)$. In Step 5, Algorithm GPAV is applied to the graph $G(H^{new}, E^{new})$, whose number of nodes does not exceed n. This finally proves the desired estimate.
3 Illustrative Example
Consider the MR problem defined in Fig. 1. The weights wi = 1, i = 1, 2, 3, 4.
The vector u∗ = (0, −6, 6, 0) is the optimal solution to this problem. It induces
the optimal block partitioning: {1, 4}, {2}, {3} (indicated by the dashed lines).
Fig. 1. The graph G(N, E) and observed responses Y .
One can see that the nodes are already topologically sorted. The forward
mode of Algorithm GPAV produces the block partitioning: B2F = {2}, B4F =
{1, 3, 4} and the corresponding approximate solution uF = (2, −6, 2, 2).
Fig. 2 defines the MR problem (3) for the backward mode.
Fig. 2. The graph G(N, E B ) and observed responses Y B .
In this case, the topological order is 4, 3, 2, 1. The backward mode of Algorithm GPAV produces the block partitioning: B1B = {1, 2, 4}, B3B = {3} and the
corresponding approximate solution uB = (2, 2, −6, 2).
The nonempty intersections (dashed lines) of the forward mode blocks (dotted lines) and the backward mode blocks (solid lines) are shown in Fig. 3.
Fig. 3. The old blocks (solid and dotted lines) and their splitting which yields the new blocks (dashed lines).
The same splitting of the old blocks is provided by the input vector $v = (u_i^F - u_i^B)/2 = (0, -4, 4, 0)$ of Algorithm SPLIT. This algorithm produces $H^{new} = \{1, 2, 3\}$, $B_1^{new} = \{1, 4\}$, $B_2^{new} = \{2\}$, $B_3^{new} = \{3\}$. The new MR problem (5) is defined by Fig. 4.
Fig. 4. The new graph $G(H^{new}, E^{new})$ and observed responses $Y^{new}$.
The topological sort for the new graph is, obviously, 2, 1, 3. After applying Algorithm GPAV to the new MR problem, we obtain $u_1^{new} = 0$, $u_2^{new} = -6$, $u_3^{new} = 6$. Step 6 of Algorithm GPAVR yields the vector $u = (0, -6, 6, 0)$, which is optimal in the original MR problem. In general, it is not guaranteed that the GPAVR solution is optimal.
4 Numerical Results
We use here test problems of the same type as in our earlier paper [8]. They
originate from the monotonic data fitting problem which is one of the most
common types of applied MR problems. In monotonic data fitting, it is assumed that there exists an unknown response function $y(x)$ of $p$ explanatory variables $x \in \mathbb{R}^p$. It is supposed to be monotonic in the sense that

$$y(x') \le y(x''), \quad \forall x' \preceq x'', \quad x', x'' \in \mathbb{R}^p,$$

where $\preceq$ is a component-wise ≤-type relation. Instead of the function $y(x)$, we have available a data set of $n$ observed explanatory variables

$$X_i \in \mathbb{R}^p, \quad i = 1, \ldots, n,$$

and the corresponding observed responses

$$Y_i \in \mathbb{R}^1, \quad i = 1, \ldots, n.$$

The function and the data set are related as follows

$$Y_i = y(X_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{6}$$

where $\varepsilon_i$ is an observation error. In general, if the relation $X_i \preceq X_j$ holds, this does not imply that $Y_i \le Y_j$, because of this error.
The relation $\preceq$ induces a partial order on the set $\{X_i\}_{i=1}^n$. The order can be presented by a directed acyclic graph $G(N, E)$ in which node $i \in N$ corresponds to the $i$-th observation, and the presence of edge $(i, j)$ in $E$ means that $X_i \preceq X_j$. This graph is unique if all redundant relations are eliminated. We call edge $(i, j)$, and also the corresponding monotonicity relations $X_i \preceq X_j$ and $u_i \le u_j$, redundant if there is another directed path from $i$ to $j$. Redundant edges, if removed, leave the feasible set in (1) unchanged.
In monotonic data fitting, one must construct a monotonic response surface
model u(x) whose values u(Xi ) are as close as possible to the observed responses
Yi for all i = 1, . . . , n. Denoting

$$u_i = u(X_i), \quad i = 1, \ldots, n \tag{7}$$
and using the sum of squares as the distance function, one can reformulate this
problem as the MR problem (1). In the numerical experiments, we set the weights
wi = 1 for all i = 1, . . . , n.
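A sketch of this reformulation for the experiments (our own code): given the observed explanatory variables, the edge set of G(N, E) is obtained from the component-wise order. For simplicity, redundant edges are not removed here; as noted above, they do not change the feasible set of (1).

```python
import numpy as np

def build_order_graph(X):
    """Edge set of the partial-order DAG: (i, j) whenever X_i <= X_j
    component-wise and i != j (redundant edges are kept)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and np.all(X[i] <= X[j])]
```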
In the experiments, we restrict our attention to the case of $p = 2$ for the following reasons. Suppose that, for two vectors $X_i$ and $X_j$ in $\mathbb{R}^p$, neither $X_i \preceq X_j$ nor $X_j \preceq X_i$ holds, i.e. they are incomparable. Then, if one and the same component is deleted (or disregarded) in both vectors, the reduced vectors may become comparable in $\mathbb{R}^{p-1}$. On the other hand, if two vectors in $\mathbb{R}^p$ are comparable, no deletion of a component can break this relation. This means that, for any fixed number of multivariate observations n, the number of monotonic relations $X_i \preceq X_j$ attains its maximum value when $p = 2$.
For our test problems, we use two types of functions y(x) of two explanatory
variables, namely, linear and nonlinear.
Our nonlinear function is given by the formula

$$y_{nonlin}(x) = f(x_1) + f(x_2), \tag{8}$$

where

$$f(t) = \begin{cases} \sqrt[3]{t}, & t \le 0, \\ t^3, & t > 0. \end{cases} \tag{9}$$
This function is shown in Fig. 5.
Fig. 5. Nonlinear function y(x) defined by (8)–(9).
Our choice of the linear test problems is inspired by the observation that the optimal values $u_i^*$ that correspond to a local area of values of x depend mostly on the local behavior of the response function y(x) and on the values of the observation errors in this area. Due to the block structure of $u^*$, these local values of $u_i^*$ typically do not change if the function values y(x) are perturbed in distant areas. Therefore, we assume that the local behavior can be well imitated by local linear models.
For the linear models, we consider the following two functions

$$y_{lin1}(x) = 0.1x_1 + 0.1x_2, \qquad y_{lin2}(x) = x_1 + x_2. \tag{10}$$

They model slower and faster monotonic increase, respectively.
The nonlinear function combines the considered types of behavior. In addition, depending on the side from which $x_1$ or $x_2$ approaches zero, the function value changes either sharply or remarkably slowly.
For the numerical experiments, samples of $n = 10^2$, $n = 10^3$ and $n = 10^4$ observations $\{X_i\}_{i=1}^n$ were generated with the use of the independent uniform distribution of the explanatory variables in the interval $[-2, 2]$. The error terms $\varepsilon_i$ in (6) were independent and normally distributed with mean zero and variance one.
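A small instance of the nonlinear test problem can be generated as follows (our own sketch; the seed and sample size are arbitrary, and this toy setup does not reproduce the running times or errors reported below).

```python
import numpy as np

def f_piecewise(t):
    """Equation (9): cube root for t <= 0, cube for t > 0."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= 0, np.cbrt(t), t ** 3)

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-2.0, 2.0, size=(n, 2))
Y = f_piecewise(X[:, 0]) + f_piecewise(X[:, 1]) + rng.standard_normal(n)  # (6), (8)
E = build_order_graph(X)          # partial order on the observations
w = np.ones(n)                    # unit weights, as in the experiments
```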
It should be emphasized that, in our numerical experiments, the variance of
the error εi is comparable with the function values y(Xi ). Such observations,
with a high level of noise, were deliberately chosen for checking the performance
of the algorithms in this difficult case.
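A minimal sketch of this data-generating process (our own code with hypothetical names; the seed is arbitrary) is:

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed

def make_sample(n, response, low=-2.0, high=2.0, noise_sd=1.0):
    """Draw n points uniformly on [low, high]^2 and add N(0, noise_sd^2) errors."""
    X = rng.uniform(low, high, size=(n, 2))
    Y = response(X[:, 0], X[:, 1]) + rng.normal(0.0, noise_sd, size=n)
    return X, Y

# e.g. the linear test responses from (10)
y_lin1 = lambda x1, x2: 0.1 * x1 + 0.1 * x2
y_lin2 = lambda x1, x2: x1 + x2
X, Y = make_sample(1000, y_lin2)
```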
We call preprocessing the stage at which the MR problem (1) is generated. The
observed explanatory variables {Xi}ni=1 enter the formulation of this problem only
implicitly, namely via the partial order intrinsic in this data set of vectors. Given
{Xi }ni=1 , we generate a directed acyclic graph G(N, E) with all the redundant
edges removed. The adjacency-matrix representation [10] is used for the graph.
The preprocessing is accomplished by a topological sorting of the nodes N .
We used MATLAB for implementing the algorithms GPAV, GPAVR and
IBCR. The implementations are based on one of the topological sorts studied
in [8], namely NumPred. Under this sorting, the nodes in N are sorted in ascending order of their number of predecessors. NumPred is applied
in Algorithms GPAV and IBCR to the graph G(N, E), as well as on Step 1 of
Algorithm GPAVR to the graphs G(N, E) and G(N, E^B) in the forward and
backward modes, respectively.
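Assuming the comparability matrix leq from the earlier sketch, the NumPred preordering can be illustrated as follows; we read "number of predecessors" as the size of a node's downset, which indeed yields a topological order of G(N, E). This is only our reading of [8], not the authors' code.

```python
def numpred_order(leq):
    """NumPred preordering: nodes in ascending order of their number of
    predecessors (size of the downset), a topological order of G(N, E)."""
    n_pred = leq.sum(axis=0)            # n_pred[j] = #{i : X_i precedes X_j}
    return sorted(range(leq.shape[0]), key=lambda j: n_pred[j])
```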
The numerical results presented here were obtained on a PC running under
Windows XP with a Pentium 4 processor (2.8 GHz, 1GB RAM).
We evaluate the performance of the algorithms GPAV and GPAVR by comparing
the relative error
e(uA) = (ϕ(uA) − ϕ(u∗)) / ϕ(u∗),
where ϕ(uA) is the objective function value in (1) obtained by Algorithm A, and
ϕ(u∗) is the optimal value provided by IBCR.
In [8], it was shown that
‖uA − u∗‖ / ‖Y − u∗‖ ≤ √(e(uA)),
which means that e(uA) provides an upper estimate for the relative distance
between the approximate solution uA and the optimal solution u∗.
Tables 1 and 2 summarize the performance data obtained for the algorithms.
For n = 10² and n = 10³, we present the average values for 10 MR problems
generated as described above. For n = 10⁴, we report the results of solving only
one MR problem, because it takes about 40 minutes of CPU time to solve one
such large-scale problem with the use of IBCR. The sign '—' corresponds in the
tables to the case when IBCR failed to solve the problem within 6 hours.
The number of constraints (#constr.) reported in Table 1 is intended to indicate how difficult the quadratic programming problem (1) is for conventional optimization methods when n is large.
Table 1. Relative error e(uA) · 100% for A = GPAV, GPAVR

algorithm  model    n = 10²            n = 10³             n = 10⁴
                    (#constr. = 322)   (#constr. = 5497)   (#constr. = 78170)
GPAV       lin1     0.98               0.77                —
           lin2     2.79               2.43                2.02
           nonlin   3.27               5.66                11.56
GPAVR      lin1     0.01               0.07                —
           lin2     0.08               0.12                0.24
           nonlin   0.00               0.17                0.46
Table 2. Computational time in seconds

algorithm  model    n = 10²   n = 10³   n = 10⁴
GPAV       lin1     0.02      0.76      89.37
           lin2     0.01      0.71      93.76
           nonlin   0.01      0.67      87.51
GPAVR      lin1     0.05      1.67      234.31
           lin2     0.05      1.60      197.06
           nonlin   0.04      1.58      192.08
IBCR       lin1     0.21      129.74    —
           lin2     0.09      5.07      2203.10
           nonlin   0.08      6.68      3448.94
The tables show that GPAVR substantially improves the accuracy of the
GPAV solution, while its run time is between two and three times that of
GPAV. They also demonstrate the limited abilities of IBCR.
5 Conclusions and Future Work
To the best of our knowledge, GPAV is the only practical algorithm able to produce sufficiently accurate solutions to very large scale MR problems with partially
ordered observations. Until now, no practical algorithm has been available that solves such large scale MR problems with an accuracy as high as
that provided by the block refinement technique introduced here in combination with the GPAV algorithm. This can be viewed as the main contribution
of the paper.
In this paper, we focused on solving the MR problem (1) which is related
to the first stage of constructing a monotonic response model u(x) of an unknown monotonic response function y(x). The second stage accomplishes the
construction of a model u(x), which is a monotonic function and interpolates, in
accordance with (7), the fitted response values ui , i = 1, . . . , n, obtained in the
first stage.
The quality of the obtained monotonic response model u(x) depends not only
on the accuracy of solving problem (1), but also on the interpolation methods
used in the second stage. Among the existing main approaches to solving the
monotonicity-preserving interpolation problem one can recognize the following
three.
One approach [27, 28] involves minimizing some measure of smoothness over
a convex cone of smooth functions which are monotone. The drawback of this
approach is that the solutions must be found by solving constrained minimization
problems, and the solutions are generally somewhat complicated, nonlocal and
nonpiecewise polynomial functions.
The second approach is to use a space of piecewise polynomials defined over
a partition of the interpolation domain, usually into triangles. Up until now, this
approach has been studied only in the case where the data is given on a grid
(see e.g. [3, 9]).
The third approach [12] is based on creating gridded data from the scattered
data. This approach is not suitable for a large number of data points n, because
the number of grid nodes grows with n as nᵖ and easily becomes unacceptably
large, even for the bivariate case (p = 2).
Bearing in mind the importance of the second stage and the shortcomings
of the existing approaches, we plan to develop efficient monotonicity-preserving
methods for interpolation of scattered multivariate data.
References
1. Ayer, M., Brunk, H.D., Ewing, G.M., Reid, W.T., Silverman, E.: An empirical
distribution function for sampling with incomplete information. The Annals of
Mathematical Statistics 26, 641–647 (1955)
2. Barlow, R.E., Bartholomew, D.J., Bremner, J.M., Brunk, H.D.: Statistical inference under order restrictions. Wiley, New York (1972)
3. Beatson, R.K., Ziegler, Z.: Monotonicity preserving surface interpolation.
SIAM J. Numer. Anal. 22, 401–411 (1985)
4. Block, H., Qian, S., Sampson, A.: Structure Algorithms for Partially Ordered Isotonic regression. Journal of Computational and Graphical Statistics 3, 285–300
(1994)
5. Brunk, H.D.: Maximum likelihood estimates of monotone parameters. The Annals
of Mathematical Statistics 26, 607–616 (1955)
6. Brunk H.D., Ewing G.M., Utz W.R.: Minimizing integrals in certain classes of
monotone functions, Pacific J. Math. 7, 833–847 (1957)
7. Burdakov, O., Sysoev, O., Grimvall A., Hussian, M.: An O(n2 ) algorithm for isotonic regression problems. In: Di Pillo, G., Roma, M. (eds.) Large Scale Nonlinear
Optimization. Ser. Nonconvex Optimization and Its Applications, vol. 83, pp. 25–
33, Springer-Verlag (2006)
8. Burdakov, O., Grimvall, A., Sysoev, O.: Data preordering in generalized PAV algorithm for monotonic regression. Journal of Computational Mathematics 4, 771–790
(2006)
9. Carlson, R.E., and Fritsch, F.N.: Monotone piecewise bicubic interpolation. SIAM
J. Numer. Anal. 22, 386–400 (1985)
10. Cormen T.H., Leiserson C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms
(Second Edition), MIT Press, Cambridge (2001)
11. Grotzinger, S.J., Witzgall, C.: Projection onto order simplexes. Applications of
Mathematics and Optimization 12, 247–270 (1984)
12. Han, L., and Schumaker L.L.: Fitting monotone surfaces to scattered data using
C1 piecewise cubics. SIAM J. Numer. Anal. 34, 569–585 (1997)
13. Kaufman, Y. , Tamir, A.: Locating service centers with precedence constraints.
Discrete Applied Mathematics 47, 251-261 (1993)
14. Kruskal, J.B.: Nonmetric multidimensional scaling: a numerical method, Psychometrika 29, 115–129 (1964)
15. Maxwell, W.L., Muchstadt, J.A.: Establishing consistent and realistic reorder intervals in production-distribution systems. Operations Research 33, 1316–1341 (1985)
16. Miles, R.E.: The complete amalgamation into blocks, by weighted means, of a finite
set of real numbers. Biometrika 46:3/4, 317–327 (1959)
17. Mukarjee, H.: Monotone nonparametric regression, The Annals of Statistics 16,
741–750 (1988)
18. Mukarjee, H., Stern, H.: Feasible nonparametric estimation of multiargument monotone functions. Journal of the American Statistical Association 425, 77–80 (1994)
19. Nocedal, J., Wright, S.J.: Numerical Optimization (2nd ed.). Springer-Verlag, New
York (2006)
20. Pardalos, P.M., Xue, G.: Algorithms for a class of isotonic regression problems.
Algorithmica 23, 211–222 (1999)
21. Qian, S., Eddy, W.F.: An Algorithm for Isotonic Regression on Ordered Rectangular Grids. J. of Computational and Graphical Statistics 5, 225–235 (1996)
22. Restrepo, A., Bovik, A.C.: Locally monotonic regression. IEEE Transactions on
Signal Processing 41, 2796–2810 (1993)
23. Robertson T., Wright, F.T., Dykstra, R.L.: Order Restricted Statistical Inference.
Wiley, New York (1988)
24. Roundy, R.: A 98% effective lot-sizing rule for a multiproduct multistage production/inventory system. Mathematics of Operations Research 11, 699–727 (1986)
25. Sidiropoulos, N.D., Bro, R.: Mathematical programming algorithms for regressionbased nonlinear filtering in Rn . IEEE Transactions on Signal Processing 47, 771–
782 (1999)
26. Strand, M.: Comparison of methods for monotone nonparametric multiple regression. Communications in Statistics - Simulation and Computation 32, 165–178
(2003)
27. Utreras, F.I.: Constrained surface construction. In: Chui, C.K., Schumaker, L.L.,
and Utreras F. (eds.) Topics in Multivariate Approximation, pp. 233–254. Academic Press, New York (1987)
28. Utreras, F.I., Varas, M.: Monotone interpolation of scattered data in R². Constr. Approx. 7, 49–68 (1991)
Discovering monotone relations with Padé
Jure Žabkar1 , Martin Možina1 , Ivan Bratko1 , and Janez Demšar1
University of Ljubljana,
Faculty of Computer and Information Science,
Tržaška 25, SI-1000 Ljubljana, Slovenia,
[email protected],
www.ailab.si/jure
Abstract. We propose a new approach to discovering monotone relations in numerical data. We describe Padé, a tool for estimating partial derivatives of a target function from numerical data. Padé is basically a preprocessor that takes numerical data as input and assigns computed qualitative partial derivatives to the
learning examples. Using the preprocessed data, an appropriate machine learning
method can be used to induce a generalized model. The induced models describe
monotone relations between the class variable and the attributes. Experiments
performed on artificial domains showed that Padé is quite accurate and robust.
1 Introduction
In many real-world applications, e.g. in financial or insurance sector, some relations
between the observed variables are known to be monotone. In such cases we can limit
the induction to monotone models which are consistent with the domain knowledge
and give better predictions. Monotonicity properties can be imposed by a human expert (based on experience) or domain theory itself (e.g. in economics). However, such
knowledge is not always available. In this paper we present Padé, a machine learning
method for discovering monotone relations in numerical data. Padé can substitute for
an expert or domain theory when they are not known or not given.
A general scheme in which Padé plays a major role is presented in Fig. 1. Padé
works as a data preprocessor. Taking numerical data as input, it calculates qualitative
partial derivatives of the class variable w.r.t. each attribute and assigns them to original
learning examples. The attributes normally correspond to independent variables in our
problem space, and the class corresponds to a dependent variable. Computed qualitative
partial derivatives define the class value in the new data set, which can be used as
input to an appropriate machine learning method for induction of a qualitative model of
discovered monotone relations.
For a simple example, consider f as a function of x: f = x2 . The learning data
would consist of a sample of pairs of values (x, f ) where x is the attribute and f is the
class.
Let us take (x, f ) = (2, 4). Padé observes examples in a tube in the direction of
the attribute x and uses them to compute the approximation of partial derivative in
(x, f ) = (2, 4). It would find out that in this direction, larger values of x imply larger
values of f , so the derivative is positive. Padé constructs a new attribute, e.g. Qx, with
Fig. 1. A general scheme in which Padé works as a preprocessor for computing partial derivatives
from numerical data.
values from {+, −} and assigns a + to example (2, 4). After doing the same for all
points in the training data, we can apply a rule learning algorithm, using the newly
constructed attribute as the class value. The correct qualitative model induced from this
data would be:
if x > 0 then f = Q(+x)
if x < 0 then f = Q(−x)
The constraint f = Q(+x) is read as f is qualitatively proportional to x. Roughly,
this means that f increases with x, or precisely, ∂f/∂x > 0.
The notation f = Q(−x) means that f is inversely qualitatively proportional to x
(i.e. the partial derivative of f w.r.t. x is negative). We will also be using an abbreviated notation when referring to several qualitative proportionalities. For example, two
constraints f = Q(+x) and f = Q(−y) will be abbreviated to f = Q(+x, −y).
Qualitative proportionalities correspond to the monotone relations. The constraint
f = Q(+x) means that f is monotonically increasing with x and f = Q(−x) means
that f is monotonically decreasing with x.
In section 2 we shortly describe two related algorithms from the field of qualitative
reasoning that also discover qualitative patterns in data. We present the details of the
algorithm Padé in section 3. In section 4 we present the experiments and conclude in
section 5.
2 Related work
QUIN [1, 2] is a machine learning algorithm which works on similar data as Padé and
uses it to compute a qualitative tree. Qualitative trees are similar to classification trees,
except that their leaves state the qualitative relations holding in a particular region. Although Padé can also be used to construct such trees, there are numerous differences
between the two algorithms. QUIN is based on a specific impurity measure and can
only construct trees. Padé is not a learning algorithm but a preprocessor, which can be
used with (in principle) any learning algorithm. Padé computes partial derivatives by the
specified attribute(s), while QUIN identifies the regions in which there is a monotone relation between the class and some attributes, which are not specified in advance. In this
respect, QUIN is more of a subgroup discovery algorithm than a learning method. Related to that, Padé can compute the derivative at any given data point, while QUIN identifies the relation which is generally true in some region. Finally, the partial derivatives
of Padé are computed as coefficients of univariate linear regression constructed from
examples in the tube lying in the direction of the derivative. This makes it more mathematically sound and much faster than QUIN, which cannot efficiently handle more than
half a dozen attributes.
Algorithm QING [3] also looks for monotone subspaces in numerical data. It is
based on discrete Morse theory [4, 5] from the field of computational topology. The
main difference between QING and other algorithms for induction of qualitative models is in attribute space partitioning. Unlike algorithms that split on attribute values (e.g.
trees, rules), QING triangulates the space (domain) and constructs the qualitative field
which for every learning example tells the directions of increasing/decreasing class. Finally, it abstracts the qualitative field to a qualitative graph in which only local extrema
are represented. Its main disadvantage is that it lacks the robustness needed for real applications.
One could also use locally weighted multivariate linear regression (LWR) [6] to
approximate the gradient in the given data point. Our experiments show that LWR performs much worse than Padé. We present the experiments in section 4.2.
Padé was already presented at the QR’07 workshop [7]. There we described a few
prototype methods and showed a few case-study experiments. The method has matured
since then, so we here present the reformulated and practically useful version of Padé,
and include a somewhat larger set of experiments.
3 Algorithm
Padé is based on the mathematical definition of the partial derivative of a function
f(x1, . . . , xn) in the direction xi at the point (a1, . . . , an):
∂f/∂xi (a1, . . . , an) = lim_{h→0} [ f(a1, . . . , ai + h, . . . , an) − f(a1, . . . , an) ] / h.
The problem in our setup is that we cannot compute the value of the function at an
arbitrary point, since we are only given a function tabulated on a finite data set. Instead,
we have to estimate the derivative based on the function values in local neighborhoods.
Padé looks for points that lie in the neighborhood of the reference point in the direction of the derived variable while keeping the values of all other attributes as constant
as possible.
The input for Padé is a set of examples described by a list of attribute values and the
value of a continuous dependent variable. An example of a function with two attributes
is depicted in Fig. 2(a): each point represents a learning example and is assigned a
continuous value. Our task is to compute a partial derivative at each given data point
P = (a1 , . . . , an ). To simplify the notation, we will denote (a1 , . . . , ai + h, . . . , an ) as
P + h. In our illustrations, we shall show the computation at point P = (5, 5), which
is marked with a hollow symbol.
Padé considers a certain number of examples nearest to the axis in the direction in
which we compute the derivative. These examples define a (hyper)tube (Fig. 2(b)). We
now assume that the function is approximately linear within short parts of the tube and
estimate the derivative from the corresponding coefficient computed by the univariate
regression over the points in the tube. We call this method Tube regression (Fig. 2(b)).
(a) Sampled function x² − y²   (b) Tube regression
Fig. 2. Computing a partial derivative from numerical data by Tube regression.
Since the tube can also contain points that lie quite far from P , we weigh the points
by their distances from P along the tube (that is, ignoring all dimensions but xi ) using
a Gaussian kernel. The weight of the j-th point in the tube equals
wj = exp( −(ai − tji)² / σ² ),
where tji is point’s i-th coordinate, and σ is chosen so that the farthest point in the tube
has a user-set negligible weight. For the experiments in this paper we used tubes with
20 points, with the farthest point having a weight of w20 = 0.001.
We then use a standard 1-dimensional weighted least squares regression to compute
the coefficient of the linear term. We set the free term to 0 in order for the regression
line to pass through point P . The formula for the coefficient thus simplifies to
bi = Σj wj (tji − ai)(yj − yi) / Σj wj (tji − ai)²,
where yj is the function value at tj .
The reason for omitting the free term is simple: if the function goes through the
point, so should its derivative. The regression line should determine the sign of the
derivative, not make the best fit to the data. This requirement may not seem reasonable
in noisy domains but we handle noise differently, by using a machine learning algorithm
on derived data.
The Tube regression is computed from a larger sample of points. We use the t-test
to obtain the estimates of the significance of the derivative. Significance together with
the sign of bi can be used to define qualitative derivatives in the following way: if the
significance is above the user-specified threshold (e.g. t = 0.7) then the qualitative
derivative equals the sign of bi ; if significance is below the threshold we define the
qualitative derivative to be steady, disregarding the sign of b.
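The following sketch assembles the steps above (tube selection, Gaussian weighting along the tube, and weighted regression through the reference point) for a single reference point. It is our own illustration with hypothetical names, the default constants mirror the text (20 tube points, farthest weight 0.001), and the t-test-based significance filter is omitted.

```python
import numpy as np

def qualitative_derivative(X, y, ref, i, tube_size=20, min_weight=0.001):
    """Sign of the partial derivative of y w.r.t. attribute i at X[ref],
    estimated by Tube regression (a sketch, not the authors' code)."""
    a, y0 = X[ref], y[ref]
    # tube: the tube_size points closest to the axis through X[ref] in
    # direction i (distance measured in all coordinates except i)
    off_axis = np.delete(X - a, i, axis=1)
    tube = np.argsort(np.linalg.norm(off_axis, axis=1))[:tube_size]
    d = X[tube, i] - a[i]                  # offsets along the derived attribute
    # Gaussian weights; sigma chosen so the farthest tube point has the
    # user-set negligible weight (0.001 in the paper)
    d_max = np.abs(d).max()
    sigma2 = d_max ** 2 / np.log(1.0 / min_weight) if d_max > 0 else 1.0
    w = np.exp(-d ** 2 / sigma2)
    # weighted least squares with the free term fixed to 0, as in the text
    denom = np.sum(w * d ** 2)
    b = np.sum(w * d * (y[tube] - y0)) / denom if denom > 0 else 0.0
    return '+' if b > 0 else '-' if b < 0 else '0'
```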
4 Experiments
We implemented the described algorithms inside the general machine learning framework Orange [8].
We evaluated Padé on artificial data sets. We observed the correctness of Padé’s
derivatives, how it compares to locally weighted regression (LWR), how well it scales
in the number of attributes and how it copes with noise. We measured the accuracy by
comparing the predicted qualitative behavior with the analytically known one.
4.1 Accuracy
In order to estimate the accuracy of Padé we chose a few interesting mathematical functions and calculated numerical derivatives in all sampled points to compare them with
Padé’s results. All functions were sampled in 1000 points, chosen uniform randomly in
the specified interval in x−y plane. First we chose f (x, y) = x2 −y 2 and f (x, y) = xy
in [−10, 10] × [−10, 10] as examples of continuous and differentiable functions in the
whole interval. The functions are presented in Fig. 3.
Fig. 3. The functions from which we obtained artificial data by sampling.
We calculated numerical derivatives and compared their signs with the qualitative derivatives calculated by Padé. The proportion of correctly calculated qualitative derivatives
is shown in Fig. 4.
f(x, y)        Padé ∂f/∂x   Padé ∂f/∂y
x² − y²        96.5%        97.1%
xy             99.8%        99.4%
sin x sin y    50.7%        54.7%
Fig. 4. The accuracy of Padé on artificial data sets.
Padé fails in the sin x sin y domain due to the tube being too long and thus covering
multiple periods of the function. However, the maximal number of examples in the tube
is a parameter of the algorithm and can be adjusted according to the domain properties. On the other hand, our experiments show that for most domains the value of this
parameter does not matter much, as shown in section 4.4.
4.2 Comparison with LWR
We compared the accuracies of Padé and locally weighted regression (LWR) on an
artificial data set. We sampled f(x, y) = x² − y² in 50 points from [−10, 10] × [−10, 10].
To this domain, we added 10 random attributes a1 , . . . , a10 with values in the same
range as x and y, i.e. [−10, 10]. The attributes a1 , . . . , a10 had no influence on the
output variable f . We took the sign of the coefficients at x and y from LWR to estimate
the partial derivatives at each point. Again, we compared the results of Padé and LWR
to the analytically obtained partial derivatives of f(x, y). Fig. 5 shows the proportion of
correctly calculated qualitative derivatives.
          Padé                LWR
∂f/∂x   ∂f/∂y        ∂f/∂x   ∂f/∂y
 90%     96%          70%     70%
Fig. 5. The proportion of correctly calculated qualitative derivatives of Padé and LWR.
This experiment confirms the importance of the chosen neighborhood. LWR takes a
spherical neighborhood around each point while Padé does the regression on the points
in the tube.
4.3 Dimensionality
We checked the scalability of Padé to high (∼ 100) dimensional spaces. Again, we took
the function x² − y² as above, but increased the dimensionality by adding 98 attributes
with random values from [−10, 10]. We analyzed the results by inducing classification
trees with the computed qualitative derivatives as classes. The trees for derivatives w.r.t.
x and y agree well with the correct results (Fig. 6).
Fig. 6. Qualitative models of the qualitative partial derivatives of the x² − y² data set with 98 additional attributes with random values unrelated to the output.
4.4 Noise
In this experiment we vary the amount of noise that we add to the class value f. The
target function is again f(x, y) = x² − y², defined on [−10, 10] × [−10, 10], which puts
f in [−100, 100]. We control the noise through its standard deviation (SD), from SD=0 (no
noise) to SD=50, where the latter means that the noise added to the function value is
half the magnitude of the signal itself. For each noise level, we vary the number of examples
in the tube to observe the effect this parameter has on the final result. Finally, we evaluate Padé and Padé combined with the classification tree algorithm C4.5 [9] against the ground
truth, measuring the classification accuracy (Fig. 7).
Noise                 SD=0               SD=10              SD=50
Tube                  5    15    30      5    15    30      5    15    30
∂f/∂x  Padé           80%  85%   98%     77%  86%   96%     63%  51%   77%
       Padé + C4.5    99%  99%   99%     99%  98%   99%     99%  84%   92%
∂f/∂y  Padé           82%  97%   98%     79%  90%   95%     66%  90%   82%
       Padé + C4.5    99%  98%   98%     99%  97%   98%     99%  85%   98%
Fig. 7. The analysis of Padé w.r.t. noise, tube sizes and additional use of C4.5.
Regarding noise, we observe that Padé itself is quite robust. Yet, it greatly benefits from being used together with C4.5. Regarding the tube: Tube regression is highly
noise resistant, which also makes it smear fine details in noiseless data. The smoothing is regulated by two arguments. The width of the tube should balance between having enough examples for a reliable estimation of the coefficient on the one hand, and not
observing examples where the values of other attributes could significantly affect
the function value on the other. However, if the tube is symmetrically covered by the
examples (this is probably true except on the boundaries of the covered attribute space)
and if the function which we model is negatively symmetrical with respect to the other attributes' values in the part of the space covered by the tube, the impacts of the other attributes
can be expected to cancel out.
5 Conclusion
We proposed a new algorithm, Padé, which discovers significant monotone relationships in data. It does so by computing local approximations to partial derivatives in
each learning example. For each learning example Padé takes the sign of the partial
derivative and instead of inducing the model directly, translates the problem to the field
of learning classification models which is one of the most developed and active fields
of artificial intelligence. Another distinctive feature of Padé is that it is based on pure
mathematical concepts – calculus of real functions and statistical regression. As such,
the work can provide a good foundation for further development of the field.
References
1. Šuc, D., Bratko, I.: Induction of qualitative trees. In De Raedt, L., Flach, P., eds.: Proceedings
of the 12th European Conference on Machine Learning, Springer (2001) 442–453 Freiburg,
Germany.
2. Bratko, I., Šuc, D.: Learning qualitative models. AI Magazine 24(4) (2003) 107–119
3. Žabkar, J., Jerše, G., Mramor, N., Bratko, I.: Induction of qualitative models using discrete
morse theory. In: Proceedings of the 21st Workshop on Qualitative Reasoning, Aberystwyth
(2007)
4. Forman, R.: A user’s guide to discrete Morse theory (2001)
5. King, H.C., Knudson, K., Mramor Kosta, N.: Generating discrete morse functions from point
data. Exp. math. 14(4) (2005) 435–444
6. Atkeson, C., Moore, A., Schaal, S.: Locally weighted learning. Artificial Intelligence Review
11 (1997) 11–73
7. Žabkar, J., Bratko, I., Demšar, J.: Learning qualitative models through partial derivatives by Padé. In: Proceedings of the 21st International Workshop on Qualitative Reasoning, Aberystwyth, U.K. (2007)
8. Zupan, B., Leban, G., Demšar, J.: Orange: Widgets and visual programming, a white paper
(2004)
9. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers (1993)
Nonparametric Ordinal Classification with
Monotonicity Constraints
Nicola Barile and Ad Feelders
Utrecht University, Department of Information and Computing Sciences,
P.O. Box 80089, 3508TB Utrecht, The Netherlands,
{barile,ad}@cs.uu.nl
Abstract. In many applications of ordinal classification we know that
the class label must be increasing (or decreasing) in the attributes. Such
relations are called monotone. We discuss two nonparametric approaches
to monotone classification: osdl and moca. Our conjecture is that both
methods have a tendency to overfit on the training sample, because their
basic class probability estimates are often computed on a few observations only. Therefore, we propose to smooth these basic probability
estimates by using weighted k nearest neighbour. Through substantial
experiments we show how this adjustment improves the classification performance of osdl considerably. The effect on moca on the other hand is
less conclusive.
1 Introduction
In many applications of data analysis it is reasonable to assume that the
response variable is increasing (or decreasing) in one or more of the attributes or features. Such relations between response and attribute are
called monotone. Besides being plausible, monotonicity may also be a
desirable property of a decision model for reasons of explanation, justification and fairness. Consider two applicants for the same job, where the
one who scores worse on all criteria gets the job.
While human experts tend to feel uncomfortable expressing their
knowledge and experience in terms of numeric assessments, they typically are able to state their knowledge in a semi-numerical or qualitative
form with relative conviction and clarity, and with less cognitive effort [9].
Experts, for example, can often easily indicate which of two probabilities
is smallest. In addition to requiring less cognitive effort, such relative
judgments tend to be more reliable than direct numerical assessments
[18].
Hence, monotonicity constraints occur frequently in learning problems and such constraints can be elicited from subject area experts with
relative ease and reliability. This has motivated the development of algorithms that are able to enforce such constraints in a justified manner.
Several data mining techniques have been adapted in order to be able to
handle monotonicity constraints in one form or another. Examples are:
classification trees [19, 10, 6], neural networks [20, 21], Bayesian networks
[1, 11] and rules [8].
In this paper, we confine our attention to two nonparametric approaches to monotone classification: osdl [7, 15] and moca [4]. These
methods rely on the estimation of the class probabilities for each observed attribute vector. These basic estimates as we will call them are
then further processed in order to extend the classifier to the entire attribute space (by interpolation), and to guarantee the monotonicity of the
resulting classification rule. Because the basic estimates are often based
on very few observations, we conjecture that osdl and moca are prone
to overfitting. Therefore we propose to smooth the basic estimates by
including observations that are near to where an estimate is required.
We perform a substantial number of experiments to verify whether this
indeed improves the classification performance.
This paper is organized as follows. In the next section, we establish
some concepts and notation that will be used throughout the paper. In
section 3 we give a short description of osdl and moca and establish
similarities and differences between them. We also provide a small example to illustrate both methods. In section 4 we propose how to adapt the
basic estimates that go into osdl and moca, by using weighted k nearest
neighbour. Subsequently, these adapted estimates are tested experimentally in section 5. We compare the original algorithms to their adapted
counterparts, and test whether significant differences in predictive performance can be found. Finally, we draw conclusions in section 6.
2 Preliminaries
Let X denote the vector of attributes, which takes values x in a p-dimensional input space X = ×Xi, and let Y denote the class variable
which takes values y in a one-dimensional space Y = {1, 2, . . . , q}, where q
is the number of class labels. We assume that the values in Xi, i = 1, . . . , p,
and the values in Y are totally ordered. An attribute Xi has a positive
influence on Y if for all xi, x′i ∈ Xi:
xi ≤ x′i ⇒ P(Y | xi, x−i) ⪯ P(Y | x′i, x−i),  (1)
where x−i is any value assignment to the attributes other than Xi [22].
Here P(Y | xi, x−i) ⪯ P(Y | x′i, x−i) means that the distribution of Y for
attribute values (xi, x−i) is stochastically smaller than for attribute values
(x′i, x−i), that is,
F(y | xi, x−i) ≥ F(y | x′i, x−i),  y = 1, 2, . . . , q,
where F (y) = P (Y ≤ y). In words: for the larger value of Xi , larger
values of Y are more likely. A negative influence is defined analogously,
where for larger values of Xi smaller values of Y are more likely. Without
loss of generality, we henceforth assume that all influences are positive. A
negative influence from Xi to Y can be made positive simply by reordering
the values in Xi .
Considering the constraints (1) corresponding to all positive influences
together, we get the constraint:
∀x, x′ ∈ X : x ⪯ x′ ⇒ P(Y | x) ⪯ P(Y | x′),  (2)
where the order ⪯ on X is the product order
x ⪯ x′ ⇔ ∀i = 1, . . . , p : xi ≤ x′i.
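The two orders involved can be checked directly; the sketch below (our own helper names) tests the product order on attribute vectors and the stochastic order between two estimated distribution functions.

```python
import numpy as np

def precedes(x, x_prime):
    """Product order on the attribute space: x <= x' componentwise."""
    return bool(np.all(np.asarray(x) <= np.asarray(x_prime)))

def stochastically_smaller(F_x, F_x_prime):
    """P(Y | x) is stochastically smaller than P(Y | x') iff
    F(y | x) >= F(y | x') for every class label y (CDFs given as arrays)."""
    return bool(np.all(np.asarray(F_x) >= np.asarray(F_x_prime)))

# constraint (2): precedes(x, x_prime) should imply
# stochastically_smaller(F at x, F at x_prime)
```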
It is customary to evaluate a classifier on the basis of its error-rate or
0/1 loss. For classification problems with ordered class labels this choice
is less obvious. It makes sense to incur a higher cost for those misclassifications that are "far" from the true label than for those that are "close".
One loss function that has this property is L1 loss:
L1(i, j) = |i − j|,  i, j = 1, . . . , q,  (3)
where i is the true label, and j the predicted label. We note that this is
not the only possible choice. One could also choose L2 loss for example, or
another loss function that has the desired property that misclassifications
that are far from the true label incur a higher loss. Nevertheless, L1 loss
is a reasonable candidate, and in this paper we confine our attention to
this loss function. It is a well known result from probability theory that
predicting the median minimizes L1 loss.
A median m of Y has the property that P (Y ≤ m) ≥ 0.5 and P (Y ≥
m) ≥ 0.5. The median may not be unique. Let mℓ denote the smallest
median of Y and let mu denote the largest median. We have [15]
P(Y | x) ⪯ P(Y | x′) ⇒ mℓ(x) ≤ mℓ(x′) ∧ mu(x) ≤ mu(x′).
The above result shows that predicting the smallest (or largest) median gives an allocation rule c : X → Y that satisfies
∀x, x′ ∈ X : x ⪯ x′ ⇒ c(x) ≤ c(x′),  (4)
that is, a lower ordered input cannot have a higher class label. Kotlowski
[14] shows that if a collection of probability distributions satisfies the
stochastic order constraint (2), then the Bayes allocation rule cB (·) satisfies the monotonicity constraint (4), provided the loss function is convex.
This encompasses many reasonable loss functions but not 0/1 loss, unless
the class label is binary.
Let D = {(xi, yi) : i = 1, . . . , N} denote the set of observed data points in X × Y,
and let Z denote the set of distinct x values occurring in D. We define the
downset of x with respect to Z to be the set {x′ ∈ Z : x′ ⪯ x}. The upset
of x is defined analogously. Any real-valued function f on Z is isotonic
with respect to ⪯ if, for any x, x′ ∈ Z, x ⪯ x′ implies f(x) ≤ f(x′).
Likewise, a real-valued function a on Z is antitonic with respect to ⪯ if,
for any x, x′ ∈ Z, x ⪯ x′ implies a(x) ≥ a(x′).
3 OSDL and MOCA
In this section we give a short description of osdl and moca, and discuss
their similarities and differences.
3.1 OSDL
The ordinal stochastic dominance learner (osdl) was developed by Cao-Van [7] and generalized by Lievens et al. in [15]. Recall that Z is the set
of distinct x values present in the training sample D. Let
P̂(y | x) = n(x, y) / n(x),  x ∈ Z, y = 1, . . . , q,
where n(x) denotes the number of observations in D with attribute values
x, and n(x, y) denotes the number of observations in D with attribute
values x and class label y. Furthermore, let
F̂(y | x) = Σ_{j≤y} P̂(j | x),  x ∈ Z,
denote the unconstrained maximum likelihood estimate of
F (y|x) = P (Y ≤ y|x), x ∈ Z.
To obtain a collection of distribution functions that satisfy the stochastic order restriction, Cao-Van [7] defines:
Fmin(y | x′) = min_{x ⪯ x′} F̂(y | x)  (5)
and
Fmax(y | x′) = max_{x′ ⪯ x} F̂(y | x),  (6)
where x ∈ Z. If there is no point x in Z such that x ⪯ x′, then
Fmin(y | x′) = 1 (y = 1, . . . , q), and if there is no point x in Z such
that x′ ⪯ x, then Fmax(y | x′) = 0 (y = 1, . . . , q − 1), and Fmax(q | x′) = 1.
Note that ∀x, x′ ∈ X:
x ⪯ x′ ⇒ Fmin(y | x) ≥ Fmin(y | x′),  (7)
x ⪯ x′ ⇒ Fmax(y | x) ≥ Fmax(y | x′).  (8)
Proposition (7) holds, since the downset of x is a subset of the downset of
x′, and the minimum taken over a given set is never above the minimum
taken over one of its subsets. Proposition (8) follows similarly.
In the constant interpolation version of OSDL, the final estimates are
obtained by putting
F̃(y | x′) = αFmin(y | x′) + (1 − α)Fmax(y | x′),  (9)
with α ∈ [0, 1].
This rule is used both for observed data points, as well as for new data
points. The interpolation parameter α is a free parameter whose value
can be selected so as to minimize empirical loss on a test sample. Note
that F̃ satisfies the stochastic order constraint, because both (7) and (8)
hold. More sophisticated interpolation schemes called balanced and double
balanced osdl are discussed in [15]; we refer the reader to this paper
for details. These osdl versions are also included in the experimental
evaluation that is presented in section 5.
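As an illustration only, the constant-interpolation estimate given by (5), (6) and (9) for a single query point can be sketched as follows. Here Z and F_hat are hypothetical arrays holding the distinct training vectors and their empirical CDFs, and the handling of empty downsets and upsets follows the convention stated above (with the last CDF value forced to 1).

```python
import numpy as np

def osdl_cdf(x_new, Z, F_hat, alpha=0.5):
    """Constant-interpolation OSDL estimate of F(. | x_new); Z is an (m, p)
    array of distinct training vectors, F_hat an (m, q) array of empirical
    CDFs at those vectors.  A sketch, not the authors' implementation."""
    m, q = F_hat.shape
    downset = [j for j in range(m) if np.all(Z[j] <= x_new)]
    upset = [j for j in range(m) if np.all(x_new <= Z[j])]
    F_min = F_hat[downset].min(axis=0) if downset else np.ones(q)
    F_max = F_hat[upset].max(axis=0) if upset else np.r_[np.zeros(q - 1), 1.0]
    F_min[-1] = 1.0                     # a CDF always ends in 1
    return alpha * F_min + (1 - alpha) * F_max
```

Allocation to the smallest or largest median can then be read off the returned estimate of the distribution function.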
3.2 MOCA
In this section, we give a short description of moca. For each value of y,
the moca estimator F∗(y | x), x ∈ Z, minimizes the sum of squared errors
Σ_{x∈Z} n(x) { F̂(y | x) − a(x) }²  (10)
within the class of antitonic functions a(x) on Z. This is an isotonic
regression problem. It has a unique solution, and the best time complexity
known is O(|Z|⁴) [17]. The algorithm has to be performed q − 1 times,
since obviously F∗(q | x) = 1.
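For intuition, the classical pool-adjacent-violators algorithm solves the special case of (10) in which Z is totally ordered (a chain). The sketch below is our own illustration of that special case only; the partially ordered setting of moca requires the algorithms cited above.

```python
def antitonic_regression_chain(values, weights):
    """Weighted antitonic (non-increasing) regression on a chain via
    pool-adjacent-violators; a sketch of the totally ordered special case."""
    # blocks of (weighted mean, total weight, number of points)
    blocks = []
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # pool while the fit would be increasing (violating antitonicity)
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            v2, w2, n2 = blocks.pop()
            v1, w1, n1 = blocks.pop()
            blocks.append([(w1 * v1 + w2 * v2) / (w1 + w2), w1 + w2, n1 + n2])
    fit = []
    for v, w, n in blocks:
        fit.extend([v] * n)
    return fit
```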
Note that this estimator satisfies the stochastic order constraint: for all x, x′ ∈ Z,
x ⪯ x′ ⇒ F∗(y | x) ≥ F∗(y | x′),  y = 1, . . . , q,  (11)
by construction.
Now the isotonic regression is only defined on the observed data
points. Typically the training sample does not cover the entire input
space, so we need some way to estimate F(y | x′) for points x′ not in
the training sample. Of course these estimates should satisfy the stochastic order constraint with respect to F∗(Y | x). Hence, we can derive the
following bounds:
Fmin(y | x′) = max_{x′ ⪯ x} F∗(y | x),  y = 1, . . . , q,  (12)
and
Fmax(y | x′) = min_{x ⪯ x′} F∗(y | x),  y = 1, . . . , q.  (13)
If there is no point x in Z such that x ⪯ x′, then we put Fmax(y | x′) = 1
(y = 1, . . . , q), and if there is no point x in Z such that x′ ⪯ x, then we
put Fmin(y | x′) = 0 (y = 1, . . . , q − 1), and Fmin(q | x′) = 1.
Because F ∗ is antitonic we always have F min (y) ≤ F max (y). Any
choice from the interval [F min (y), F max (y)] satisfies the stochastic order
constraint with respect to the training data.
A simple interpolation scheme that is guaranteed to produce globally
monotone estimates is to take the convex combination
F̆(y | x′) = αFmin(y | x′) + (1 − α)Fmax(y | x′),  (14)
with α ∈ [0, 1]. Note that for x′ ∈ Z, we have F̆(y | x′) = F∗(y | x′), since
both Fmin(y | x′) and Fmax(y | x′) are equal to F∗(y | x′). The value of α
can be chosen so as to minimize empirical loss on a test sample.
Since moca should produce a class prediction, we still have to specify
an allocation rule. moca allocates x to the smallest median of F̆(Y | x):
c∗(x) = min{ y : F̆(y | x) ≥ 0.5 }.
First of all, note that since F̆ (y) satisfies the stochastic order constraint (2),
c∗ will satisfy the monotonicity constraint given in (4). Furthermore, it
can be shown (see [4]) that c∗ minimizes the L1 loss
Σ_{i=1}^{N} |yi − c(xi)|
within the class of monotone integer-valued functions c(·). In other words,
of all monotone classifiers, c∗ is among the ones (there may be more than
one) that minimize L1 loss on the training sample.
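A compact sketch of the moca prediction step, combining the bounds (12)–(13), the interpolation (14) and allocation to the smallest median, is given below. Z and F_star are hypothetical arrays holding the distinct training vectors and their antitonic estimates, and class labels are assumed to be 1, . . . , q.

```python
import numpy as np

def moca_predict(x_new, Z, F_star, alpha=0.5):
    """MOCA prediction for a new point: bounds (12)-(13), interpolation (14)
    and allocation to the smallest median.  A sketch with our own naming."""
    m, q = F_star.shape
    downset = [j for j in range(m) if np.all(Z[j] <= x_new)]
    upset = [j for j in range(m) if np.all(x_new <= Z[j])]
    # lower bound comes from the upset, upper bound from the downset
    F_min = F_star[upset].max(axis=0) if upset else np.r_[np.zeros(q - 1), 1.0]
    F_max = F_star[downset].min(axis=0) if downset else np.ones(q)
    F = alpha * F_min + (1 - alpha) * F_max
    # smallest median: first label y with F(y | x_new) >= 0.5
    return int(np.argmax(F >= 0.5)) + 1
```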
3.3 An Example
To illustrate osdl and moca, we present a small example. Suppose we
have two real-valued attributes X1 and X2 and a ternary class label Y ,
that is, Y = {1, 2, 3}. Consider the dataset given in Figure 1.
[Scatter plot of the seven observations in the (X1, X2) plane omitted; observation numbers and class labels are as listed in Table 1.]
Fig. 1. Data for example. Observations are numbered for identification. Class label is
printed in boldface next to the observation.
Table 1 gives the different estimates of F . Since all attribute vectors
occur only once, the estimates F̂ are based on only a single observation.
The attribute vector of observation 5 is bigger than that of 3 and 4,
but observation 5 has a smaller class label. This leads to order reversals
in F̂ (1). We have F̂ (1|3) and F̂ (1|4) smaller than F̂ (1|5) (where in a
slight abuse of notation we are conditioning on observation numbers),
but observation 5 is in the upset of 3 and 4. In this case, the antitonic
regression resolves this order reversal by averaging these violators:
F∗(1 | 3) = F∗(1 | 4) = F∗(1 | 5) = (0 + 0 + 1)/3 = 1/3.
This is the only monotonicity violation present in F̂, so no further averaging is required. We explain the computation of the constant interpolation
version of the osdl estimate through an example. In Table 1 we see that
F̃ (1|3) = 1/2. This is computed as follows:
F min (1|3) = min{F̂ (1|1), F̂ (1|2), F̂ (1|3)} = min{1, 1, 0} = 0,
since the downset of observation 3 is {1, 2, 3}. Likewise, we can compute
F max (1|3) = max{F̂ (1|3), F̂ (1|5), F̂ (1|7)} = max{0, 1, 0} = 1,
since the upset of observation 3 is {3, 5, 7}. Combining these together
with α = 0.5 we get:
F̃(1 | 3) = αFmin(1 | 3) + (1 − α)Fmax(1 | 3) = 1/2.
Table 1. Maximum Likelihood, moca and osdl (α = 0.5) estimates of F(1) and F(2).

            F̂ (mle)       F̆ (moca)            F̃ (osdl)
obs   y     1      2      1      2      c∗     1      2      cmin   cmax
1     1     1      1      1      1      1      1      1      1      1
2     1     1      1      1      1      1      1      1      1      1
3     2     0      1      1/3    1      2      1/2    1      1      2
4     2     0      1      1/3    1      2      1/2    1      1      2
5     1     1      1      1/3    1      2      1/2    1      1      2
6     2     0      1      0      1      2      0      1      2      2
7     3     0      0      0      0      3      0      1/2    2      3
The moca allocation rule c∗ allocates to the smallest median of F̆ ,
which gives a total absolute error of 1, since observation 5 has label 1, but
is predicted to have label 2 by c∗ . All other predictions of c∗ are correct.
An absolute error of 1 is the minimum achievable on the training sample
for any monotone classifier. For osdl we have given two allocation rules:
one that assigns to the smallest median (cmin ) and one that assigns to
the largest median (cmax ). The former has an absolute error of 3 on the
training sample, and the latter achieves the minimum absolute error of 1:
it is identical to c∗ in this case.
3.4 Comparison of osdl and moca
The reader will have noticed the similarity between moca and osdl:
moca uses the same interpolation method, and the moca definitions of
F min and F max are the reverse of the corresponding definitions for osdl.
An important difference is that osdl plugs in the maximum likelihood
estimates F̂ in equations (5) and (6), whereas moca plugs in the isotonic
regression estimates F ∗ in equations (12) and (13). It should be noted
that osdl in principle allows any estimate of F (y|x) to be plugged into
equations (5) and (6). From that viewpoint, moca can be viewed as an
instantiation of osdl. However, to the best of our knowledge only the
unconstrained maximum likelihood estimate has been used in osdl to
date.
Because F ∗ is plugged in, moca is guaranteed to minimize L1 loss on
the training sample. While this is a nice property, the objective is not to
minimize L1 loss on the training sample. It remains to be seen whether
this also results in better out-of-sample predictive performance.
It should be noted that if F̂ already satisfies the stochastic order
restriction, then both methods are identical. In that case the isotonic
regression will not make any changes to F̂ , since there are no order reversals.
Our conjecture is that both methods have a tendency to overfit on
the training data. In many applications attribute vectors occur in D only
once, in particular when the attributes are numeric. Hence the basic estimates that go into equations (5) and (6) are usually based on a single
observation only. Now the interpolation performed in equation (14) will
have some smoothing effect, but it is the question whether this is sufficient to prevent overfitting. The same reasoning applies to moca, but to a
lesser extent because the isotonic regression has an additional smoothing
effect: in case of order reversals basic estimates are averaged to remove the
monotonicity violation. Nevertheless, it is possible that moca could be
improved by performing the isotonic regression on a smoothed estimate
rather than on F̂ in (10). This is what we propose in the next section.
4 Weighted kNN probability estimation
In order to prevent overfitting in estimating P (Y ≤ y|x), x ∈ Z, we develop a weight-based estimator based on the nearest neighbours principle
along the lines of the one introduced in [13].
In the following, we first discuss kNN as a classification technique and
then illustrate how we use it to perform probability estimation.
4.1 k Nearest Neighbour Classification
The k Nearest Neighbor technique is an example of instance-based learning: the training dataset is stored, and the classification of new, unlabelled
instances is performed by comparing each of them to the k most similar
(least dissimilar) elements in the training dataset. The dissimilarity is determined by means of a distance metric or function, which is a
real-valued function d such that for any data points x, y, and z:
1. d(x, y) > 0 for x ≠ y, and d(x, x) = 0;
2. d(x, y) = d(y, x);
3. d(x, z) ≤ d(x, y) + d(y, z).
The distance measure which we adopted is the Euclidean distance:
d(x, y) = √( Σ_{i=1}^{p} (xi − yi)² ).
To prevent attributes that take large values from having a stronger
influence than attributes measured on a smaller scale, it is important to
normalize the attribute values. We adopted the Z-score standardization
technique, whereby each value x of an attribute X is replaced by
(x − x̄) / sX,
where sX denotes the sample standard deviation of X. Once a distance
measure to determine the neighbourhood of an unlabelled observation x0
has been selected, the next step in using kNN as a classification technique
is to choose a criterion whereby the selected labelled observations are used
to determine the label of x0.
The most straightforward solution is (unweighted) majority voting: the
chosen label is the one occurring most frequently in the neighbourhood of x0.
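A minimal sketch of this baseline (our own function names; attributes are assumed to have been standardized beforehand) is:

```python
import numpy as np
from collections import Counter

def zscore(X):
    """Z-score standardization of each attribute (column) of X."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def knn_majority_label(X_train, y_train, x0, k):
    """Unweighted k-nearest-neighbour classification with Euclidean distance."""
    d = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]
```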
4.2 Weighted kNN Classification
In kNN it is reasonable to request that neighbours that are closer to x0
have a greater importance in deciding its class than those that are more
distant.
In the Weighted Nearest Neighbours approach x0 is assigned to the
class y0 which has a weighted majority among its k nearest neighbours,
namely
y0 = arg max_y Σ_{i=1}^{k} ωi I(yi = y).
Each of the k members xi of the neighbourhood of x0 is assigned a weight
ωi which is inversely proportional to its distance d = d(x0 , xi ) from x0 and
which is obtained by means of a weighting function or kernel Kλ (x0 , xi )
[12]:
ωi = Kλ(x0, xi) = G( d(x0, xi) / λ ).  (15)
Kernels are at the basis of the Parzen density estimation method [12];
in that context, the smoothing parameter or bandwidth λ dictates the
width of the window considered to perform the estimation. A large λ
implies lower variance (averages over more observations) but higher bias
(we essentially assume the true function is constant within the window).
Here G(·) can be any function with its maximum at d = d(x, y) = 0
and values that get smaller as d grows. Thus the following
properties must hold [13]:
1. G(d) ≥ 0 for all d ∈ R;
2. G(d) gets its maximum for d = 0;
3. G(d) decreases monotonically as d → ∞.
In the one-dimensional case, one popular kernel is obtained by using
the Gaussian density function φ(t) as G(·), with the standard deviation
playing the role of the parameter λ. In Rᵖ, with p > 1, the natural
generalization is
Kλ(x0, xi) = ( 1 / (√(2π) λ) ) exp{ −(1/2) ( ‖x0 − xi‖ / λ )² },
which is the kernel we adopt in our method.
Although the kernel used is in a sense a parameter of wkNN, experience has shown that the choice of kernel (apart from the rectangular
kernel, which gives equal weights to all neighbours) is not crucial [12].
In equation (15) it is assumed that λ is a fixed value over the whole
of the space of data samples. The optimal value of λ may be location-dependent, taking a large value in regions where the data samples are
sparse and a small value where the data samples are densely packed. One
solution is the use of adaptive windows, where λ depends
on the location of the sample in the data space. Let hλ(x0) be a width
function (indexed by λ) which determines the width of the neighbourhood
at x0. Then we have
Kλ(x0, xi) = G( d(x0, xi) / hλ(x0) ).
As kernels are used to compute weights in wkNN, we set hλ (x0 ) equal to
the distance d(x0 , xk+1 ) of x0 from the first neighbour xk+1 that is not
taken into consideration [12, 13].
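Putting the pieces of this section together, a sketch of weighted kNN classification with the Gaussian kernel and the adaptive bandwidth hλ(x0) follows. The names are ours, the kernel's normalizing constant cancels in the vote, and ties in distance or a zero bandwidth are not handled.

```python
import numpy as np

def wknn_classify(X_train, y_train, x0, k, labels):
    """Weighted kNN vote with a Gaussian kernel; the bandwidth is set to the
    distance of the first neighbour that is left out (the (k+1)-th)."""
    d = np.linalg.norm(X_train - x0, axis=1)
    order = np.argsort(d)
    neighbours, h = order[:k], d[order[k]]          # needs len(X_train) > k
    w = np.exp(-0.5 * (d[neighbours] / h) ** 2)     # kernel constant cancels
    scores = {y: w[y_train[neighbours] == y].sum() for y in labels}
    return max(scores, key=scores.get)
```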
4.3 Using wkNN for Probability Estimation
We adopted the weighted k-nearest neighbour principle to estimate class
probabilities for each distinct observation x ∈ Z as follows: let Nk(x) be
the set of indices in D of all occurrences of the k attribute vectors in Z
closest to x. Note that Nk(x) may contain more than k elements if some
attribute vectors occur multiple times in D. Then
P̂(y | x) = Σ_{i∈Nk(x)} ωi I(yi = y) / Σ_{i∈Nk(x)} ωi,  y = 1, . . . , q.  (16)
It should be noted that x is included in its own neighbourhood and its
occurrences have a relatively large weight ωi in (16).
In the case of moca, the adoption of this new probability estimator affects
the computation of the moca estimator not only through
the probability estimates that the antitonic regression is performed on,
but also through the weights used, which are now equal to the cardinality of
Nk (x) for each x ∈ Z. Note that if k = 1, then equation (16) produces
the maximum likelihood estimates used in standard osdl and moca.
The estimator presented in this section is analogous to the one adopted
in [13], where the estimates obtained are used to perform ordinal classification (without monotonicity constraints) by predicting the median.
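A sketch of the smoothed estimate (16) is given below. For simplicity it weights the k nearest training rows individually, whereas Nk(x) in the text groups all occurrences of the k closest distinct attribute vectors; the helper names are ours and labels are assumed to be 1, . . . , q.

```python
import numpy as np

def wknn_class_probabilities(X_train, y_train, x, k, q):
    """Gaussian-weighted kNN estimate of P(y | x) in the spirit of (16)."""
    d = np.linalg.norm(X_train - x, axis=1)
    order = np.argsort(d)
    nb, h = order[:k], d[order[k]]                  # adaptive bandwidth
    w = np.exp(-0.5 * (d[nb] / h) ** 2)             # kernel constant cancels
    p = np.array([w[y_train[nb] == y].sum() for y in range(1, q + 1)])
    return p / w.sum()
```

The cumulative sums of these estimates then play the role of F̂(y | x) in osdl, or of the input to the antitonic regression in moca.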
5 Experiments
We performed a series of experiments on several real-world datasets in
order to determine whether and to what extent moca and osdl would
benefit from the new wkNN probability estimator. The results were measured in terms of the average L1 error rate achieved by the two algorithms.
5.1 Datasets
We selected a number of datasets where monotonicity constraints are
likely to apply. We used the KC1, KC4, PC3, and PC4 datasets from the
NASA Metrics Data Program [16], the Acceptance/Rejection, Employee
Selection, Lecturers Evaluation and Social Workers Decisions datasets
from A. Ben-David [5], the Windsor Housing dataset [2], as well as several
datasets from the UCI Machine Learning Repository [3]. Table 2 lists all
the datasets used.
Table 2. Characteristics of datasets used in the experiments

Dataset                     cardinality   #attributes   #labels
Australian Credit           690           14            2
Auto MPG                    392           7             4
Boston Housing              506           13            4
Car Evaluation              1728          6             4
Employee Rej/Acc            1000          4             9
Employee selection          488           4             9
Haberman survival           306           3             2
KC1                         2107          21            3
KC4                         122           4             6
Lecturers evaluation        1000          4             5
CPU Performance             209           6             4
PC3                         320           15            5
PC4                         356           16            6
Pima Indians                768           8             2
Social Workers Decisions    1000          10            4
Windsor Housing             546           11            4
5.2 Dataset Preprocessing
For datasets with a numeric response that is not a count (Auto MPG,
Boston Housing, CPU Performance, and Windsor Housing) we discretized
the response values into four separate intervals, each interval containing
roughly the same number of observations.
For all datasets from the NASA Metrics Data Program the attribute
ERROR COUNT was used as the response. All attributes that contained missing values were removed. Furthermore, the attribute MODULE was removed
because it is a unique identifier of the module and the ERROR DENSITY was
removed because it is a function of the response variable. Finally,
attributes with zero variance were removed from the dataset.
5.3 Experimental results
Each of the datasets was randomly divided into two parts, a training
set (containing roughly two thirds of the data) and a validation set. The
training set was used to determine the optimal values for k and α in
moca and osdl through 10-fold cross validation. We started with k = 1
and incremented its value by one until the difference of the average L1
error between two consecutive iterations for both classifiers was less than
or equal to 10⁻⁶. For each value of k we determined the optimal α in
{0, 0.25, 0.5, 0.75, 1}. Once the optimal parameter values were determined,
they were used to train both algorithms on the complete training set and
then to test them on the validation set. We then performed a paired t-test of the L1 errors on the validation set to determine whether observed
differences were significant. Table 3 lists all the results.
Table 3. Experimental results. The first four columns contain average L1 errors on
the validation set. The final two columns contain p-values. The penultimate column
compares smoothed moca to standard moca. The final column compares smoothed
osdl to standard osdl.

Dataset                MOCA wkNN  OSDL wkNN  MOCA MLE  OSDL MLE  1. vs. 3.  2. vs. 4.
Australian Credit      0.1304     0.1130     0.1348    0.3565    0.3184     0
Auto MPG               0.2977     0.2977     0.2977    0.2977    −          −
Boston Housing         0.5030     0.4675     0.4675    0.5207    0.4929     0.2085
Car Evaluation         0.0625     0.0556     0.0625    0.0556    −          −
Employee Rej/Acc       1.2006     1.2066     1.2814    1.2814    0.0247     0.0445
Employee selection     0.3620     0.3742     0.3620    0.4110    1          0.2018
Haberman survival      0.3529     0.3431     0.3529    0.3431    −          −
KC1                    0.1863     0.1977     0.1863    0.3940    1          0
KC4                    0.8095     0.8095     0.8571    0.8571    0.4208     0.5336
Lecturers evaluation   0.4162     0.4162     0.4102    0.4102    0.8060     0.8060
CPU Performance        0.3571     0.3286     0.3571    0.3571    −          0.5310
PC3                    0.1228     0.1228     0.1228    0.1228    −          −
PC4                    0.1872     0.1872     0.1872    0.1872    −          −
Pima Indians           0.3008     0.3086     0.3008    0.3086    −          −
Social Workers         0.5359     0.5479     0.5060    0.4940    0.1492     0.0092
Windsor Housing        0.5604     0.5220     0.5604    0.6044    −          0.0249
We first check whether smoothing actually improves the classifiers.
Comparing standard osdl against smoothed osdl we observe that the
latter is significantly better (at α = 0.05) four times, whereas it is significantly worse once (for Social Workers Decisions). Furthermore,
the smoothed version almost never has higher estimated error (Lecturer
Evaluation and Social Workers Decisions being the two exceptions).
Comparing standard moca against smoothed moca, we observe that
the latter is significantly better only once (on Employee Rejection). All
other observed differences are not significant.
Table 4. Experimental results for balanced and double balanced osdl. The first four
columns contain average L1 errors on the validation set. The final two columns contain
p-values. The penultimate column compares smoothed balanced osdl to standard balanced osdl. The final column compares smoothed double balanced osdl to standard
double balanced osdl.

Dataset                bosdl wkNN  bosdl MLE  dbosdl wkNN  dbosdl MLE  1. vs. 2.  3. vs. 4.
Australian Credit      0.3565      0.3565     0.3565       0.3565      −          −
Auto MPG               0.2977      0.2977     0.2977       0.2977      −          −
Boston Housing         0.5207      0.5207     0.5207       0.5207      −          −
Car Evaluation         0.0556      0.0556     0.0556       0.0556      −          −
Employee Rej/Acc       1.8533      1.9760     1.8533       1.9760      0.1065     0.1065
Employee selection     0.5092      0.5092     0.5092       0.5092      −          −
Haberman survival      0.3431      0.3431     0.3431       0.3431      −          −
KC1                    0.1892      0.3030     0.3883       0.3940      0          0.2061
KC4                    0.7619      0.8571     0.7857       0.8571      0.1031     0.1829
Lecturers evaluation   0.9251      0.9251     0.9251       0.9251      −          −
CPU Performance        0.4143      0.4143     0.4143       0.4143      −          −
PC3                    0.1228      0.1228     0.1228       0.1228      −          −
PC4                    0.1872      0.1872     0.1872       0.1872      −          −
Pima Indians           0.2930      0.2930     0.2930       0.2930      −          −
Social Workers         0.6617      0.6557     0.6617       0.6437      0.8212     0.4863
Windsor Housing        0.6044      0.6044     0.6044       0.6044      −          −
In Table 4 the effect of smoothing on balanced and double balanced
osdl is investigated. We conclude that smoothing does not have much
effect in either case: only one significant improvement is found (for KC1).
Furthermore, comparing Table 3 and Table 4, we observe that constant
interpolation osdl with smoothing tends to outperform its balanced and
double balanced counterparts.
6 Conclusion
We have discussed two related methods for nonparametric monotone classification: osdl and moca. The basic class probability estimates used by
these algorithms are typically based on very few observations, and we
therefore conjectured that both have a tendency to overfit the training
sample. We have proposed a weighted k-nearest-neighbour approach to
smoothing the basic estimates.
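As a rough illustration of the kind of smoothing meant here, the sketch below replaces the raw class-frequency estimate at a point by a distance-weighted average over its k nearest neighbours. The Euclidean metric, the inverse-distance kernel and the function name are illustrative assumptions and not the exact weighting scheme used in the paper.

```python
import numpy as np

def smoothed_class_probs(X, y, x0, k, n_classes):
    """Distance-weighted k-NN estimate of P(class | x0).

    Illustrative only: Euclidean distances and an inverse-distance kernel;
    the weighting scheme actually used for moca/osdl may differ.
    """
    d = np.linalg.norm(X - x0, axis=1)         # distances from x0 to all training points
    nn = np.argsort(d)[:k]                     # indices of the k nearest neighbours
    w = 1.0 / (d[nn] + 1e-12)                  # inverse-distance weights (avoid division by zero)
    probs = np.zeros(n_classes)
    for i, wi in zip(nn, w):
        probs[int(y[i])] += wi                 # accumulate weight for the neighbour's class label
    return probs / probs.sum()

# With k = 1 this reduces to the empirical (unsmoothed) estimate at the nearest point.
```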
The experiments have shown that smoothing is beneficial for osdl: the
predictive performance was significantly better on a number of datasets,
and almost never worse. For moca, smoothing seems to have much less
effect. This is probably because the isotonic regression already
smooths the basic estimates by averaging them in case of order reversals.
Hence, moca is already quite competitive for k = 1.
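To make the averaging remark concrete, here is a minimal pool-adjacent-violators sketch for a totally ordered chain of estimates. moca works with a partial order, so this only illustrates how order reversals are resolved by pooling and is not the algorithm used in the paper.

```python
def pava(values, weights=None):
    """Pool-adjacent-violators on a chain: enforce a non-decreasing fit by averaging violators.

    Minimal illustration only; the partially ordered case needed by moca requires
    a more general isotonic regression algorithm.
    """
    if weights is None:
        weights = [1.0] * len(values)
    blocks = [[v, w] for v, w in zip(values, weights)]   # each block: [weighted mean, total weight]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0]:              # order reversal: pool the two blocks
            v1, w1 = blocks[i]
            v2, w2 = blocks[i + 1]
            blocks[i] = [(w1 * v1 + w2 * v2) / (w1 + w2), w1 + w2]
            del blocks[i + 1]
            i = max(i - 1, 0)                            # pooling may create a new violation on the left
        else:
            i += 1
    return blocks

# Example: in [0.2, 0.6, 0.4, 0.7] the reversal 0.6 > 0.4 is pooled into a block with mean 0.5.
```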
The more sophisticated interpolation schemes of osdl (balanced and
double balanced) do not seem to lead to an improvement over the constant
interpolation version on the datasets considered.