Article type: Advanced Review
Computational rank-based statistics
Joseph W. McKean, [email protected] Western Michigan University
Jeff T. Terpstra, [email protected] North Dakota State University
John D. Kloke, [email protected] Pomona College
least absolute deviation/least absolute value, newton method/newton raphson method, rank
method/rank statistics/rank tests, robust regression, statistical software
This review discusses two algorithms which can be used to compute rank-based regression
estimates. For completeness, a brief overview of rank-based inference procedures in the context
of a linear model is presented. The discussion includes geometry, estimation, inference, and
diagnostics. In regard to computing the rank-based estimates, we discuss two approaches. The
first approach is based on an algebraic identity which allows one to compute the (Wilcoxon)
estimates using an $L_1$ regression routine. The other approach is a Newton type algorithm. In
addition, we discuss how rank-based inference can be generalized to nonlinear and random
effects models. Some simple examples using existing statistical software are also presented for
the sake of illustration and comparison.
Traditional least squares (LS) procedures offer the user an encompassing methodology for
analyzing models, linear or non-linear. These procedures are based on the simple premise of
fitting the model by minimizing the Euclidean distance between the vector of responses and the
model. Besides the fit, the LS procedures include diagnostics to check the quality of fit and an
array of inference procedures including confidence intervals (regions) and tests of hypotheses.
LS procedures, though, are not robust. One outlier can spoil the LS fit, its associated inference
and even its diagnostic procedures (i.e. methods which should detect the outliers).
Rank-based procedures also offer the user a complete methodology. The only essential change
is to replace the Euclidean norm by another norm, so that the geometry remains the same. As
with the LS procedures, these rank-based procedures offer the user diagnostic tools to check the
quality of fit and associated inference procedures. Further, in contrast to the LS procedures, they
are robust to the effect of outliers. They are generalizations of simple nonparametric rank
procedures such as the Wilcoxon one and two-sample methods and they retain the high
efficiency of these simple rank methods. Further, depending on the knowledge of the underlying
error distribution, this rank-based analysis can be optimized by the choice of the norm (scores).
Weighted versions of the fit can obtain high (50%) breakdown.
Rank-Based Inference for Linear Models
For brevity, we discuss rank-based inference in terms of a linear model. This analysis, though,
has been generalized to many different models. For instance, time series models are discussed
in Terpstra et al. (2000, 2001a, 2001b, 2001), nonlinear models are discussed in Abebe and
McKean (2007), and random effect models are discussed in Kloke et al. (2008).
What follows then is a very brief description of the rank-based procedures for linear models. For
more details, see Chapters 3-5 of Hettmansperger and McKean (1998). Let $Y$ denote an $n \times 1$
vector of responses and assume that it follows the linear model,
$$Y = \mathbf{1}\alpha + X\beta + e, \qquad (1)$$
where $\mathbf{1}$ denotes an $n \times 1$ vector of ones, $\alpha$ is the intercept parameter, $X$ is an
$n \times p$ matrix of predictors, $\beta$ is a $p \times 1$ vector of regression coefficients, and $e$ is an
$n \times 1$ vector of random errors. The design matrix $X$ can contain incidence vectors as well as
continuous predictors (covariates). Let $\Omega$ be the range of $X$ and assume without loss of
generality that $X$ is centered and has full column rank, so that the dimension of $\Omega$ is $p$.
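The geometry of the rank-based procedures is the same as least squares (LS), except that
instead of the Euclidean norm, the following pseudo-norm is used:
$$\|v\|_\varphi = \sum_{i=1}^{n} a(R(v_i))\, v_i, \quad v \in \mathbb{R}^n, \qquad (2)$$
where the scores are generated as $a(i) = \varphi(i/(n+1))$ for a nondecreasing function $\varphi(u)$,
defined on the interval $(0,1)$, and $R(v_i)$ is the rank of $v_i$ among $v_1, \ldots, v_n$. We assume
without loss of generality that the scores sum to zero. Two of the most popular score functions are
the Wilcoxon ($\varphi(u) = \sqrt{12}\,(u - 1/2)$) and the sign ($\varphi(u) = \mathrm{sgn}(u - 1/2)$) scores.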
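As a small illustration of the pseudo-norm, the following sketch computes Wilcoxon scores and the quantity $\sum_i a(R(v_i))\,v_i$ for a short vector. The function names are hypothetical, ties are ignored for simplicity, and the Wilcoxon score function $\varphi(u) = \sqrt{12}(u - 1/2)$ is assumed.

```python
import math

def wilcoxon_scores(n):
    """Scores a(i) = phi(i / (n + 1)) with phi(u) = sqrt(12) * (u - 1/2)."""
    return [math.sqrt(12.0) * (i / (n + 1) - 0.5) for i in range(1, n + 1)]

def ranks(v):
    """Ranks 1..n of the entries of v (assumes no ties, for simplicity)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def pseudo_norm(v, scores=None):
    """Rank-based pseudo-norm: sum_i a(R(v_i)) * v_i."""
    a = wilcoxon_scores(len(v)) if scores is None else scores
    return sum(a[r - 1] * vi for r, vi in zip(ranks(v), v))

v = [0.3, -1.2, 2.5, 0.1, -0.4]
shifted = [vi + 10.0 for vi in v]          # add a constant to every entry
score_sum = sum(wilcoxon_scores(len(v)))   # should be zero up to rounding
norm_v = pseudo_norm(v)
norm_shifted = pseudo_norm(shifted)        # same value: shift invariance
```

Because the scores sum to zero and ranks do not change under a constant shift, `norm_v` and `norm_shifted` agree up to rounding, which is why the intercept cannot be recovered from this norm.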
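The rank-based estimate of $\beta$ is given by
$$\hat\beta_\varphi = \mathrm{Argmin}_{\beta}\, \|Y - X\beta\|_\varphi. \qquad (3)$$
Because $\|\cdot\|_\varphi$ is a norm, $\|Y - X\beta\|_\varphi$ is a convex function of $\beta$; hence,
$\hat\beta_\varphi$ exists. Because the scores sum to zero and the ranks are invariant to a constant
shift, the intercept cannot be estimated using the norm. Instead, it is usually estimated as the
median of the residuals. That is, $\hat\alpha = \mathrm{med}_i\{Y_i - x_i^T\hat\beta_\varphi\}$, where
$x_i^T$ is the $i$th row of $X$.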
The rank-based estimate generalizes robust estimates in the simple location problems. For
example, in the two-sample location problem $X$ is the incidence vector of the two samples. If
Wilcoxon scores are used, then $\hat\beta_\varphi$ is the Hodges-Lehmann estimate (i.e. the median of all
differences between the samples).
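The two-sample case can be sketched directly. The function name and the toy data below are hypothetical; the sign convention (second sample minus first) is an assumption.

```python
from statistics import median

def hodges_lehmann_shift(sample1, sample2):
    """Median of all pairwise differences y - x, x in sample1, y in sample2."""
    return median(y - x for x in sample1 for y in sample2)

x = [1.0, 2.0, 3.0, 4.0]
y = [3.5, 4.5, 5.5, 60.0]        # the last value is a gross outlier
hl = hodges_lehmann_shift(x, y)   # 2.5
mean_diff = sum(y) / len(y) - sum(x) / len(x)   # 15.875, pulled by the outlier
```

The single outlier moves the difference in sample means far from the bulk of the data, while the median of the pairwise differences is barely affected.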
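Let $f(t)$ denote the probability density function (pdf) of $e_i$. Then, under regularity conditions,
$$\hat\beta_\varphi \ \text{has an approximate}\ N\!\left(\beta,\ \tau_\varphi^2 (X^T X)^{-1}\right) \text{distribution}, \qquad (4)$$
where $\tau_\varphi$ is a scale parameter which depends on $f$ and the score function $\varphi$.
It follows from this asymptotic theory that the asymptotic relative efficiency between the rank-based
and least squares estimates of $\beta$ is $\sigma^2/\tau_\varphi^2$, where $\sigma^2$ denotes the variance
of $e_i$. This is the same efficiency as in the simple location problems. In particular, for Wilcoxon
scores and normal errors, this efficiency is 0.955. For error distributions with thicker tails, this
efficiency can be quite high. Furthermore, depending on the knowledge of the error pdf $f(t)$,
appropriate scores can result in asymptotically efficient estimates.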
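The value 0.955 can be checked by hand. The short derivation below assumes the standard expression for the Wilcoxon scale parameter, $\tau_W = [\sqrt{12}\int f^2(t)\,dt]^{-1}$, which is not stated explicitly in the text above.

```latex
% Normal errors with variance \sigma^2:
\int f^2(t)\,dt = \frac{1}{2\sigma\sqrt{\pi}}
\quad\Longrightarrow\quad
\tau_W = \frac{1}{\sqrt{12}\int f^2(t)\,dt} = \sigma\sqrt{\frac{\pi}{3}},
\qquad
\frac{\sigma^2}{\tau_W^2} = \frac{3}{\pi} \approx 0.955 .
```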
Next, consider the general linear hypothesis, $H_0: M\beta = 0$ versus $H_A: M\beta \neq 0$, where
$M$ is a $q \times p$ matrix of full row rank. Denote the reduced model subspace by
$\omega = \{v \in \Omega : v = X\beta,\ M\beta = 0\}$. There are several rank-based tests: a Wald-type
test based on the asymptotic distribution of $\hat\beta_\varphi$, an aligned rank (scores-type) test, and
a reduction in dispersion test. The reduction in dispersion test is given by
$$F_\varphi = \frac{[d(Y,\omega) - d(Y,\Omega)]/q}{\hat\tau_\varphi/2},$$
where $d(Y,\cdot)$ denotes the distance, in terms of the pseudo-norm (2), between $Y$ and the
corresponding subspace. While $qF_\varphi$ is asymptotically $\chi^2(q)$ under $H_0$, small
sample studies indicate that $F_\varphi$ should be compared with the same $F$-critical values as the LS
$F$ test. For the aligned rank test, the reduced model residuals are obtained and then the test
statistic is an adjusted gradient test based on the ranks of these residuals. In the literature, LS is
often used for the reduced model fit. In this case, the aligned rank test is not robust. A better
choice is to use the rank-based estimate for the reduced model fit. Rank-based confidence
intervals for linear functions of $\beta$ are the same as those for LS except that the estimate of
$\sigma$ is replaced by the estimate of $\tau_\varphi$.
This rank-based inference inherits the good efficiency properties of the estimates discussed
above. Further, the influence functions of both estimates and tests are bounded in the $Y$ (response)
space. They are not, however, bounded in the $x$ (factor) space. However, this is irrelevant for
designed experiments. The rank-based high breakdown procedures discussed in the sequel have
high breakdown and bounded influence in both the $Y$ and $x$ spaces and are therefore more
appropriate for observational studies.
Diagnostics for the rank-based analysis are discussed in McKean et al. (1993). In particular,
studentized residuals are presented, which adjust each point for both its error variation and
its location in factor space. These performed well in a Monte Carlo study for autoregressive
rank-based procedures as discussed by Terpstra et al. (2003).
The main computational problem for these rank-based procedures concerns the computation of
$\hat\beta_\varphi$. We discuss two algorithms. The first algorithm is for the Wilcoxon scores for which a
simple identity enables the computation to be carried out as an $L_1$ estimate on differences of residuals.
The other algorithm is a Newton type algorithm which can be used for general scores.
Computing Rank-based Regression Estimates
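$L_1$-Algorithms for Wilcoxon and Sign Scores
The Wilcoxon estimate of $\beta$ is defined to be a minimum of the following dispersion function:
$$D_W(\beta) = \frac{\sqrt{12}}{n+1}\sum_{i=1}^{n}\left[R(e_i(\beta)) - \frac{n+1}{2}\right] e_i(\beta), \qquad (5)$$
where $e_i(\beta) = Y_i - x_i^T\beta$ and $R(e_i(\beta))$ denotes the rank of $e_i(\beta)$ among
$e_1(\beta), \ldots, e_n(\beta)$. However, it can be shown (e.g. Hettmansperger, 1984) that
$$D_W(\beta) = \sum_{i<j} b_{ij}\,|e_i(\beta) - e_j(\beta)|, \qquad (6)$$
where $b_{ij} = \sqrt{3}/(n+1)$ for all $i$ and $j$. Note that multiplying the right-hand side of (6) by a
constant does not change the estimate of $\beta$. Hence, for the Wilcoxon estimate, we can assume
the weights are equal to one without loss of generality. This algebraic identity can be exploited to
readily compute the Wilcoxon estimate of $\beta$. For example, to compute the Wilcoxon estimate one
can simply use an $L_1$ regression routine with $Y_i - Y_j$ and $x_i - x_j$, $i < j$, playing the role of the
response variables and design points, respectively. Recall that the $L_1$ regression estimate
minimizes $\sum_{i=1}^{n} |e_i(\beta)|$, where $e_i(\beta) = Y_i - x_i^T\beta$.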
Incidentally, it is interesting to point out that the $L_1$ regression estimate can be viewed as a
rank-based estimate as well. For instance, the identity on page 188 of Hettmansperger and McKean
(1998) illustrates that the $L_1$ regression estimate essentially corresponds to the rank-based
estimate with sign scores, $\varphi(u) = \mathrm{sgn}(u - 1/2)$.
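The pairwise-difference identity and its use for computation can be sketched for a single predictor. The function names and data are hypothetical. The constants assume Wilcoxon scores $a(i) = \sqrt{12}\,(i/(n+1) - 1/2)$. For one predictor, the $L_1$ fit on pairwise differences reduces to a weighted median of pairwise slopes, which stands in here for a general $L_1$ regression routine.

```python
import math
from itertools import combinations

def dispersion_ranks(e):
    """Rank form: sqrt(12)/(n+1) * sum_i (R(e_i) - (n+1)/2) * e_i."""
    n = len(e)
    order = sorted(range(n), key=lambda i: e[i])
    total = sum((rank - (n + 1) / 2.0) * e[i]
                for rank, i in enumerate(order, start=1))
    return math.sqrt(12.0) / (n + 1) * total

def dispersion_pairs(e):
    """Pairwise form: sqrt(3)/(n+1) * sum_{i<j} |e_i - e_j|."""
    n = len(e)
    return math.sqrt(3.0) / (n + 1) * sum(abs(a - b)
                                          for a, b in combinations(e, 2))

def wilcoxon_slope(x, y):
    """Minimize sum_{i<j} |(y_i - y_j) - (x_i - x_j) * b| over b.

    This is an L1 regression through the origin on pairwise differences;
    a minimizer is the weighted median of the pairwise slopes, with
    weights |x_i - x_j|.
    """
    pairs = sorted(((yi - yj) / (xi - xj), abs(xi - xj))
                   for (xi, yi), (xj, yj) in combinations(zip(x, y), 2)
                   if xi != xj)
    half = sum(w for _, w in pairs) / 2.0
    cum = 0.0
    for slope, w in pairs:
        cum += w
        if cum >= half:
            return slope

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 40.0]      # last response is an outlier
b = wilcoxon_slope(x, y)
resid = [yi - b * xi for xi, yi in zip(x, y)]
```

Evaluating `dispersion_ranks(resid)` and `dispersion_pairs(resid)` gives the same number up to rounding, which is the identity that lets an $L_1$ routine do the work.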
Many of the mainstream statistical software packages have the ability to compute $L_1$ regression
estimates. For example, in SAS, $L_1$ regression estimates can be computed using the IML procedure
and the LAV subroutine. In S-PLUS, $L_1$ regression estimates are computed via the l1fit function. To
our knowledge, R does not provide an explicit $L_1$ regression function. Nevertheless, since $L_1$
regression estimates are equivalent to (median) quantile regression estimates, the quantreg
package written by Roger Koenker can be used to calculate the Wilcoxon estimate. We refer the
interested reader to Koenker (2005) for more information on quantile regression. This R
implementation is discussed in Terpstra and McKean (2005) as the Wilcoxon estimate is a
special case of their so-called class of weighted Wilcoxon (WW) estimates.
Weighted Wilcoxon Estimates
Briefly, a WW-estimate corresponds to a minimum of (6) where the weights $b_{ij}$ are no longer
constant. In terms of computing the WW-estimate there is essentially no difference; simply use an
$L_1$ regression routine with $b_{ij}(Y_i - Y_j)$ and $b_{ij}(x_i - x_j)$ playing the role of the response
variables and design points, respectively. This is basically how weighted LS regression estimators are
computed.
Of course, the simplest weighting scheme corresponds to $b_{ij} = 1$ (i.e. the Wilcoxon estimate). The
Wilcoxon estimate, though, is not robust against outlying points in the design. However,
robustness against outlying design points can be achieved by selecting an appropriate weighting
scheme. Weights of the form $b_{ij} = h(x_i)h(x_j)$ are typically referred to as Mallows weights and are
discussed in Sievers (1983), Naranjo and Hettmansperger (1994), and Hettmansperger and
McKean (1998). As discussed by Naranjo and Hettmansperger, an appropriately chosen set of
Mallows weights can yield a bounded influence function and positive breakdown point. For
instance, for the simple linear regression model, $b_{ij} = |x_i - x_j|^{-1}$ yields the well-known Theil
(1950) estimate, which has a 33% breakdown point (e.g. Theorem 5.7.3 of Hettmansperger and
McKean, 1998).
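To see why inverse-distance weights lead to the Theil estimate, a short worked equation helps. It assumes the weights are $b_{ij} = |x_i - x_j|^{-1}$ with distinct $x_i$ and writes $S_{ij}$ (a symbol introduced here for illustration) for the pairwise slope.

```latex
\sum_{i<j} \frac{1}{|x_i - x_j|}\,
  \bigl|(Y_i - Y_j) - (x_i - x_j)\beta\bigr|
  = \sum_{i<j} \bigl|S_{ij} - \beta\bigr|,
\qquad
S_{ij} = \frac{Y_i - Y_j}{x_i - x_j},
```

so the minimizing $\beta$ is the median of the pairwise slopes $S_{ij}$.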
A 50% breakdown estimate can be obtained by allowing the weights to depend on both the
design points and the residuals from an initial fit. These weights are typically referred to as
Schweppe weights. As an example, let
$$b_{ij} = \min\left\{1,\ c\,\frac{\hat\sigma}{|\hat e_i^{(0)}|}\,\frac{\hat\sigma}{|\hat e_j^{(0)}|}\, m_i m_j\right\}, \qquad (7)$$
$$m_i = \min\{1,\ b/Q_i\}, \quad Q_i = (x_i - \hat v)^T \hat V^{-1} (x_i - \hat v), \qquad (8)$$
where $b$ and $c$ are tuning constants, $Q_i$ denotes a robust version of squared Mahalanobis distance
based on robust estimates $\hat v$ and $\hat V$ of location and scatter, $\hat\sigma$ is a robust estimate of
scale, and $\hat e_i^{(0)}$ denotes the $i$th residual based on an initial estimate $\hat\beta^{(0)}$. Thus, these
weights incorporate information from both the design points and the responses via the initial residuals.
The weights in (7) and (8) were suggested by Chang et al. (1999) and are further discussed in
section 5.8.1 of Hettmansperger and McKean (1998). This particular WW-estimate is referred to
as the HBR-estimate since it acquires a 50% breakdown point provided the initial estimates (i.e.
$\hat v$, $\hat V$, and $\hat\beta^{(0)}$) are 50% breakdown estimates. These weights, as well as others, are
available in R via the suite of functions discussed by Terpstra and McKean (2005).
A Newton Algorithm for General Scores
A finite algorithm to minimize the dispersion function $D_\varphi(\beta) = \|Y - X\beta\|_\varphi$ is discussed
in Osborne (1985). However, the algorithm we describe in this section minimizes $D_\varphi(\beta)$ using a
Newton type of algorithm based on the asymptotic quadraticity of $D_\varphi(\beta)$. It is generally much
faster than algorithms based on the gradient and is currently used in MINITAB (i.e. the RREGRESS
command) and RGLM (Kapenga, McKean, and Vidmar, 1988).
Our discussion is brief; more details can be found in Section 3.7 of Hettmansperger and McKean
(1998). The algorithm begins with an initial estimate (e.g. least squares) which we denote as
$\hat\beta^{(0)}$. Based on the asymptotic quadraticity of $D_\varphi(\beta)$, the first Newton step is given by
$$\hat\beta^{(1)} = \hat\beta^{(0)} + \hat\tau_\varphi (X^T X)^{-1} S(\hat\beta^{(0)}), \qquad (9)$$
where $\hat\tau_\varphi$ is an estimate of $\tau_\varphi$ and $S(\beta) = X^T a(R(Y - X\beta))$ denotes the negative
of the gradient of $D_\varphi(\beta)$. The subsequent iterated estimates are often called $k$-step estimates.
Under regularity conditions, it can be shown that $\hat\beta^{(k)}$ has the same asymptotic distribution as
$\hat\beta_\varphi$; see McKean and Hettmansperger (1978). At that time, this was an
important result for computing R-estimates. For instance, one could simply use one or two steps.
Because of the speed of modern computing, the importance of (9) now is that it is the basis for a
Newton algorithm to obtain fully iterated estimates. This more formal algorithm written in terms of
a QR-decomposition of $X$ is presented next.
Consider the QR-decomposition of $X$ which is given by $X = QR$, where $Q$ is an $n \times n$
orthogonal matrix and $R$ is an $n \times p$ upper triangular matrix of rank $p$; see Stewart (1973). Writing
$Q = [Q_1\ Q_2]$, where $Q_1$ is $n \times p$, it follows that the columns of $Q_1$ form an orthonormal
basis for $\Omega$, the column space of $X$. In particular the projection matrix onto $\Omega$ is given by
$H = Q_1 Q_1^T$. The software package LAPACK is a collection of subroutines which efficiently
computes QR-decompositions and it further has routines which obtain projections of vectors.
Note that we can write the $k$th Newton step in terms of residuals as
$$\hat e^{(k)} = \hat e^{(k-1)} - \hat\tau_\varphi H a(R(\hat e^{(k-1)})),$$
where $a(R(\hat e^{(k-1)}))$ denotes the vector whose $i$th component is $a(R(\hat e_i^{(k-1)}))$. Let
$D^{(k)}$ denote the dispersion function evaluated at $\hat e^{(k)}$. The Newton step is a step
along the direction $-\hat\tau_\varphi H a(R(\hat e^{(k-1)}))$. If $D^{(k)} < D^{(k-1)}$ the step is successful;
otherwise, a linear search can be made along the direction to find a value which minimizes $D$.
This would then become the $k$th step residual. Such a search can be performed using methods
such as false position as discussed in Chapter 3 of Hettmansperger and McKean (1998).
Stopping rules can be based on the relative drop in dispersion. For instance, one can stop when
$(D^{(k-1)} - D^{(k)})/D^{(k-1)} < \epsilon_D$, where $\epsilon_D$ is a specified tolerance. A similar stopping
rule can be based on the relative size of the step. Upon stopping at step $k$, obtain the fitted value
$\hat Y = Y - \hat e^{(k)}$ and then obtain the estimate of $\beta$ by solving $X\beta = \hat Y$. This is the
algorithm for the package RGLM, except that, as more steps fail, an average over the past history of
directions is used.
Also, Abebe and McKean (2007) discuss the convergence of this algorithm in a probability sense.
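The residual form of the Newton iteration can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the RGLM or MINITAB implementation: the function names are hypothetical, Wilcoxon scores are used, the scale `tau` is supplied by the caller rather than estimated, and simple step halving stands in for the false-position line search.

```python
import numpy as np

def wilcoxon_scores_of(e):
    """a(R(e_i)) with a(i) = sqrt(12) * (i/(n+1) - 1/2); ties broken arbitrarily."""
    n = len(e)
    ranks = np.empty(n)
    ranks[np.argsort(e)] = np.arange(1, n + 1)
    return np.sqrt(12.0) * (ranks / (n + 1) - 0.5)

def dispersion(e):
    """Dispersion of a residual vector: sum_i a(R(e_i)) * e_i."""
    return float(wilcoxon_scores_of(e) @ e)

def rank_newton(X, y, tau, tol=1e-8, max_iter=100):
    """Newton-type iteration on residuals; X is assumed centered, full column rank."""
    Q1, R = np.linalg.qr(X)              # reduced QR: columns of Q1 span col(X)
    e = y - Q1 @ (Q1.T @ y)              # least squares residuals as the start
    d_old = dispersion(e)
    for _ in range(max_iter):
        if d_old <= 0.0:
            break                        # perfect fit, nothing to improve
        direction = Q1 @ (Q1.T @ wilcoxon_scores_of(e))   # H a(R(e))
        step = tau
        e_new = e - step * direction
        d_new = dispersion(e_new)
        # Stand-in for the line search: halve the step until D decreases.
        while d_new >= d_old and step > 1e-12:
            step /= 2.0
            e_new = e - step * direction
            d_new = dispersion(e_new)
        if d_new >= d_old:
            break                        # no decrease found; keep current fit
        converged = (d_old - d_new) / d_old < tol
        e, d_old = e_new, d_new
        if converged:
            break
    beta = np.linalg.solve(R, Q1.T @ (y - e))   # solve X beta = fitted values
    return beta, e

x = np.arange(10.0)
X = (x - x.mean()).reshape(-1, 1)        # centered design, one predictor
y = 1.0 + 2.0 * x + 0.1 * np.sin(x)
y[9] += 50.0                             # one gross outlier in the response
beta, resid = rank_newton(X, y, tau=1.0)
```

Steps are accepted only when the dispersion drops, so the returned fit never has larger dispersion than the least squares start. The intercept would then be estimated separately, for example by the median of `resid`.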
The QR-decomposition can readily be used to form a reduced model design matrix for the
reduction in dispersion and the aligned rank tests of general linear hypotheses of the form
$H_0: M\beta = 0$. Let $U$ be a $p \times (p - q)$ matrix whose columns consist of an orthonormal basis
for the null space of $M$. Then, the matrix $XU$ is a design matrix for the reduced model, which is
easily computed by LAPACK routines.
Rank-Based Estimation for Other Models
Nonlinear Models
The algorithms discussed in the previous section can also be used for nonlinear models. In this
case the model is of the form $Y = f(\theta) + e$, where $f(\theta)$ is a vector with components
$f_i(\theta) = g(\theta, x_i)$ for some specified function $g$. The rank-based estimate is
$\hat\theta_\varphi = \mathrm{Argmin}_{\theta}\, \|Y - f(\theta)\|_\varphi$. Since the
model is nonlinear, this is not necessarily a convex function of $\theta$, but the usual Gauss-Newton
type of algorithm can be used to obtain the rank-based fit. Recall that this is an iterated algorithm
which uses the Taylor series expansion of the function $f(\theta)$ evaluated at the current estimate to
obtain the estimate at the next iteration. Thus, each of the iterations consists of the rank-based fit
of a linear model. As such, the fits can be computed using the above algorithms; see Abebe and
McKean (2007) for details.
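The Gauss-Newton step described above can be written out as follows. The notation ($\theta^{(k)}$ for the current estimate, $J^{(k)}$ for the Jacobian of $f$ at $\theta^{(k)}$, $\Delta$ for the increment) is introduced here for illustration and may differ from that of Abebe and McKean (2007).

```latex
f(\theta) \approx f(\theta^{(k)}) + J^{(k)}\,(\theta - \theta^{(k)}),
\qquad
J^{(k)} = \left.\frac{\partial f(\theta)}{\partial \theta^{T}}\right|_{\theta = \theta^{(k)}},
\\[4pt]
\hat\Delta^{(k)} = \mathrm{Argmin}_{\Delta}\,
   \bigl\| \,[Y - f(\theta^{(k)})] - J^{(k)}\Delta \,\bigr\|_{\varphi},
\qquad
\theta^{(k+1)} = \theta^{(k)} + \hat\Delta^{(k)} .
```

Each iteration is therefore a rank-based fit of a linear model with response $Y - f(\theta^{(k)})$ and design matrix $J^{(k)}$.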
Random Effect Models
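Here, we briefly look at a mixed model which includes a random effect. Consider an experiment
performed over $m$ random blocks, where block $k$ has $n_k$ observations. Within block $k$, let
$Y_k$, $X_k$, and $e_k$ denote respectively the $n_k \times 1$ vector of responses, the $n_k \times p$ design
matrix and the $n_k \times 1$ vector of errors. Let $\mathbf{1}_{n_k}$ denote a vector of $n_k$ ones. Then, the
model for $Y_k$ is
$$Y_k = \alpha \mathbf{1}_{n_k} + X_k\beta + \mathbf{1}_{n_k} b_k + e_k, \quad k = 1, \ldots, m,$$
where $b_k$ is the random effect. For a matrix formulation of the model, let $N = \sum_{k=1}^{m} n_k$
denote the total sample size. Let $Y = (Y_1^T, \ldots, Y_m^T)^T$, $X = (X_1^T, \ldots, X_m^T)^T$,
$e = (e_1^T, \ldots, e_m^T)^T$, and $b = (b_1, \ldots, b_m)^T$. We can now write the mixed model as
$$Y = \alpha \mathbf{1}_N + X\beta + Zb + e,$$
where $Z$ is the $N \times m$ incidence matrix denoting block membership.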
For this article, we are interested in the computation of rank-based estimates of the vector of fixed
effects, $\beta$. We present two estimators. Note that for the estimation of the fixed effects, the
geometry is the same as that for the linear model in (1). Hence, our first estimator is the
rank-based estimator given in (3), $\hat\beta_\varphi$. Its computation remains the same as discussed above.
For this model, asymptotic theory is presented in Kloke et al. (2008). Correct standardization of the
associated inference, of course, involves estimation of variance components which are also
discussed in Kloke et al.
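For the second estimator, consider the case where the fixed effects model is complete within
each block. For example, this occurs in a multi-center clinical design where the block is
synonymous with center. The rank-based estimator $\hat\beta_\varphi$ can be used. However, the following
estimator is invariant to the random effects. For each center $k$, consider Jaeckel's
(1972) dispersion function given by
$$D_k(\beta) = \|Y_k - X_k\beta\|_{\varphi,k} = \sum_{j=1}^{n_k} a(R_k(Y_{kj} - x_{kj}^T\beta))\,(Y_{kj} - x_{kj}^T\beta), \qquad (13)$$
where $\|\cdot\|_{\varphi,k}$ is the pseudo-norm in equation (2), but here over the space $\mathbb{R}^{n_k}$. That is,
$R_k(Y_{kj} - x_{kj}^T\beta)$ denotes the rank of $Y_{kj} - x_{kj}^T\beta$ among
$Y_{k1} - x_{k1}^T\beta, \ldots, Y_{kn_k} - x_{kn_k}^T\beta$. We use the same scores for each center.
Since the rankings are within center, it is easy to see from (13) that $D_k(\beta)$ is unchanged when
$b_k$ is added to every response in center $k$; hence, $D_k(\beta)$ is invariant to the random effect $b_k$.
Our objective function is the dispersion function given by
$$D_{MR}(\beta) = \sum_{k=1}^{m} D_k(\beta),$$
where the subscript $MR$ stands for multiple rankings (i.e. a ranking for each center). Thus,
$D_{MR}(\beta)$ is invariant to the vector of random effects $b$. Since the functions $D_k(\beta)$ are convex,
their sum is also. An estimator of the fixed effects is given by the value that minimizes $D_{MR}(\beta)$.
That is, $\hat\beta_{MR} = \mathrm{Argmin}_{\beta}\, D_{MR}(\beta)$.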
This estimator was proposed by Rashid et al. (2008). Asymptotic theory follows if the number of
centers is held fixed and $n_k \to \infty$ for each $k$. Since $D_{MR}(\beta)$ is convex, there is software
available to compute $\hat\beta_{MR}$. The function $D_{MR}(\beta)$ can be minimized using the Nelder-Mead
algorithm (Olsson and Nelson, 1975) or a quasi-Newton algorithm. Neither algorithm requires
second derivatives. In SAS, we can use the subroutines NLPNMS (for the Nelder-Mead algorithm)
and NLPQN (for the quasi-Newton algorithm) of PROC NLP to minimize the dispersion function. In
the statistical software R, the non-linear minimization function nlm can be used. Also, based on
the theory of the estimator, a Newton step algorithm can be written similar to the algorithm
discussed above for R estimates; see Kloke and McKean (2004).
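The multiple-rankings dispersion and its invariance to block effects can be illustrated with a short sketch. The function names and toy data are hypothetical, a single predictor and Wilcoxon scores are assumed, and each center is represented as a pair of lists.

```python
import math

def scores(n):
    """Wilcoxon scores a(i) = sqrt(12) * (i/(n+1) - 1/2), i = 1..n."""
    return [math.sqrt(12.0) * (i / (n + 1) - 0.5) for i in range(1, n + 1)]

def jaeckel_dispersion(e):
    """sum_j a(R(e_j)) * e_j, with ranks taken within the vector e."""
    a = scores(len(e))
    order = sorted(range(len(e)), key=lambda j: e[j])
    return sum(a[r] * e[j] for r, j in enumerate(order))

def mr_dispersion(beta, blocks):
    """Sum of within-center dispersions; blocks is a list of (x, y) lists."""
    total = 0.0
    for xs, ys in blocks:
        resid = [y - beta * x for x, y in zip(xs, ys)]
        total += jaeckel_dispersion(resid)   # ranking is done per center
    return total

blocks = [([1.0, 2.0, 3.0, 4.0], [2.2, 3.9, 6.1, 8.3]),
          ([0.5, 1.5, 2.5], [1.2, 2.8, 5.1])]
effects = [10.0, -7.0]                        # one random effect per center
shifted = [(xs, [y + b for y in ys]) for (xs, ys), b in zip(blocks, effects)]

d_plain = mr_dispersion(2.0, blocks)
d_shifted = mr_dispersion(2.0, shifted)       # equal to d_plain up to rounding
```

Because the function is convex in `beta`, any general-purpose minimizer that does not need second derivatives could be applied to `mr_dispersion`.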
Quadratic Regression Data
The purpose of this example is to illustrate the computational approaches discussed above. More
specifically, we compute the Wilcoxon estimate for a simulated data set using the R code
discussed by Terpstra and McKean (2005), MINITAB’s RREGRESS command and RGLM. In
addition, we compute the least squares estimate for the sake of comparison.
The model for the data is a polynomial model of degree two. That is,
$$Y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + e_i,$$
where the distribution of $e_i$ is a mixture of a standard normal distribution (75%) and a normal
distribution with mean zero and standard deviation 26 (25%). This type of distribution is
commonly referred to as a contaminated normal distribution in the robust and nonparametric
literature. Thus, the tails of this distribution are heavier than those of the standard normal
distribution. The generated observations for this example are given in the Supplementary
Information section.
The estimates and corresponding standard errors are given in Table 1.
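Table 1: Wilcoxon and Least Squares Estimates for the Quadratic Regression Data
Columns: $\hat\alpha$, SE($\hat\alpha$), $\hat\beta_1$, SE($\hat\beta_1$), $\hat\beta_2$, SE($\hat\beta_2$). Rows: Wilcoxon, Least Squares.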
The Wilcoxon estimates that appear in Table 1 were obtained using the R code by Terpstra and
McKean (2005). The estimates produced by MINITAB’s RREGRESS command and RGLM are the
same (to three decimal places) as those given in Table 1 and are therefore not presented. We
note that the RGLM estimates are the same to four decimal places. Thus, the three packages
used to compute the Wilcoxon estimate are in agreement.
On the other hand, the least squares estimate is quite different from the Wilcoxon estimate. Most
notable is the estimate pertaining to the coefficient on the quadratic term (i.e. $\beta_2$). The inference
based on the Wilcoxon estimate indicates that this coefficient is significantly different from zero
(as it should be) and is very close to the true value of the parameter (i.e. 0.15). In contrast, the
least squares estimate is not significantly different from zero and is, in fact, negative. The fits and
residual plots given in the Supplementary Information section indicate that the 19th observation is
largely responsible for the poor fit of the least squares estimate. As previously mentioned, the
reason that this does not happen for the Wilcoxon estimate is that the influence function for the
Wilcoxon estimate is bounded in the $Y$ (response) space.
Lastly, we also tested $H_0: \beta_2 = 0$ using the Wilcoxon version of the reduction in dispersion test
given by $F_\varphi$ above. Again, we used the R code of Terpstra and McKean (2005) which produced a
test statistic value of 10.9369 and a p-value of 0.0042. We note that the (standard normal)
p-value for the Wald test, based on the information given in Table 1, is 0.0024. Hence, contrary to
the least squares procedures, the reduction in dispersion test and the Wald test are not
equivalent. Nevertheless, both tests are in agreement as to the significance of the result.
Stars Data
In this example we analyze the well-known stars data, which appears in many textbooks and
journal articles pertaining to robust regression techniques. Briefly, the data set consists of two
measurements on 47 stars. The first measurement is the independent variable and represents
the logarithm of the temperature of the star while the second measurement is the response
variable which represents the logarithm of the light intensity of the star. The actual data, along
with a more detailed description, is given in Rousseeuw and Leroy (1987).
The stars data set is commonly used to illustrate the effects of bad leverage points on different
types of estimates. Here, a bad leverage point denotes a point where the value of $x$ is outlying
with respect to the other design points and the value of $Y$ is inconsistent with the fit of the majority
of data. The stars data contains at least four such points. In fact, all four of these values relate to
giant stars.
In what follows we will compare the Wilcoxon (WIL) estimate to the HBR weighted Wilcoxon
estimate in (7) and (8). All of the analyses were done using the R code of Terpstra and McKean
(2005). To begin, the estimates and corresponding standard errors are given in Table 2.
Table 2: Wilcoxon and HBR Estimates for the Stars Data
Columns: $\hat\alpha$, SE($\hat\alpha$), $\hat\beta$, SE($\hat\beta$). Rows: Wilcoxon, HBR.
To summarize, the Wilcoxon and HBR estimates are clearly different. Most notably, the sign on
the slope parameter for the Wilcoxon estimate is negative while the sign of the slope parameter
for the HBR estimate is positive. As the top plot in Figure 2 of the Supplementary Information
section indicates, the HBR fit is undoubtedly the better fit for this data set. The Wilcoxon estimate
is heavily influenced by the four giant stars. Of course, this illustrates the well-known fact that the
Wilcoxon estimate is not robust against outliers in the design as its influence function is
unbounded in the $x$ (factor) space.
It is clear from this example that the Wilcoxon estimate and a corresponding WW-estimate can be
quite different. Thus, it is desirable to have a diagnostic that compares the Wilcoxon estimate
($b_{ij} = 1$ for all $i, j$) to a WW-estimate ($b_{ij} \neq 1$ for some $i, j$). This can be accomplished using
the following diagnostic:
$$TDBETA = (\hat b_W - \hat b_{WW})^T\, \hat\Sigma_W^{-1}\, (\hat b_W - \hat b_{WW}), \qquad (17)$$
where $\hat b_W$ and $\hat b_{WW}$ denote the parameter estimates (including the intercept) for the Wilcoxon
and WW-estimate, respectively. The variance-covariance matrix $\hat\Sigma_W$ in (17) is an estimate of the
variance-covariance matrix given in (4). The two fits are declared different if the diagnostic
$TDBETA > 4(p+1)^2/n$. See, for example, McKean et al. (1996a, 1996b). As the bottom plot in
Figure 2 of the Supplementary Information section indicates, the Wilcoxon and HBR estimates
are clearly different in this regard.
In such instances, it is often informative (especially in multiple linear regression problems) to
determine which fits are different. A diagnostic which compares the individual fits is given by
$$CFITS_i = \frac{\hat Y_{W,i} - \hat Y_{WW,i}}{\widehat{SE}(\hat Y_{W,i})}, \qquad (18)$$
where the denominator of (18) corresponds to the estimated standard error of $\hat Y_{W,i}$. Here,
$\hat Y_{W,i}$ and $\hat Y_{WW,i}$ denote the Wilcoxon and WW fits for the $i$th case. Hettmansperger and McKean
(1998) suggest using $2\sqrt{(p+1)/n}$ as a benchmark for declaring two individual fits different.
Again, referring to the bottom plot in Figure 2 of the Supplementary Information section we see
that the Wilcoxon and HBR fits are vastly different for the stars data set. Lastly, we note that the
intent of these diagnostics is to determine differences in fits as opposed to determining which fit is
better. The question as to which fit is better should be answered via a standard residual analysis
based on the studentized residuals.
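The benchmarks for the two diagnostics are simple to evaluate. The sketch below assumes the forms $4(p+1)^2/n$ for TDBETA and $2\sqrt{(p+1)/n}$ for CFITS; the function names are hypothetical. For the stars data ($n = 47$, $p = 1$) the TDBETA benchmark reproduces the value 0.34 reported with Figure 2.

```python
import math

def tdbeta_benchmark(n, p):
    """Assumed cutoff for TDBETA: 4 (p + 1)^2 / n."""
    return 4.0 * (p + 1) ** 2 / n

def cfits_benchmark(n, p):
    """Assumed cutoff for |CFITS_i|: 2 * sqrt((p + 1) / n)."""
    return 2.0 * math.sqrt((p + 1) / n)

n, p = 47, 1                                  # stars data: 47 stars, one predictor
td_cut = round(tdbeta_benchmark(n, p), 2)     # 0.34
cf_cut = round(cfits_benchmark(n, p), 2)      # 0.41
```

The reported TDBETA of 65.96 exceeds the 0.34 cutoff by a wide margin, consistent with the conclusion that the two fits differ.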
As discussed in this review, rank-based analyses for linear and nonlinear models can range from
highly efficient to highly robust, depending on the score and/or weight functions employed. With
the exception of MINITAB’s RREGRESS command, this valuable class of estimators has yet to be
implemented in any mainstream statistical software packages. Nevertheless, Wilcoxon estimates,
and more generally weighted Wilcoxon estimates, can be easily implemented in any software
package that has the ability to compute an $L_1$ regression estimate. Most notable is the collection of
R functions discussed by Terpstra and McKean (2005). However, other statistical software
packages can be used to obtain these analyses as well. For example, in S-PLUS, $L_1$ regression
estimates are computed via the l1fit function. In SAS, $L_1$ regression estimates can be
computed using the IML procedure and the LAV subroutine. Furthermore, the HBR weights
defined in (7) and (8) can be computed using calls to MAD, MCD, and LTS. On the other hand,
rank-based estimates that utilize a nonlinear score function can be computed using the Newton
algorithm discussed in this review. As previously mentioned, the web-based RGLM program and
MINITAB’s RREGRESS command implement this algorithm. Furthermore, both programs have the
capability of producing a complete inference, including residual analyses.
The R code for the WW-estimates can be found on the web at:
RGLM can be accessed via the web at:
LAPACK can be found on the web at:
Supplementary Information
Table 3. Quadratic Regression Data Set
Figure 1. Fits (row 1) and residual plots (row 2) corresponding to the least squares
(column 1) and Wilcoxon (column 2) estimates of the quadratic regression data. The residual
plots show studentized residuals (Stud. Residuals).
Figure 2. These plots compare the Wilcoxon and HBR estimates for the stars data. The top
plot (Log Light Intensity versus Log Temperature) shows the Wilcoxon (dotted line) and HBR
(solid line) fits along with the stars data. The bottom plot compares the Wilcoxon and HBR fits
for each point via the CFITS diagnostic (TDBETA: 65.96, Benchmark: 0.34).