Nonlinear and Multiple
Regression with
Transformed Variables
The necessity for an alternative to the linear model
$Y = \beta_0 + \beta_1 x + \epsilon$ may be suggested either by a theoretical
argument or by examining diagnostic plots from a
linear regression analysis.
In either case, settling on a model whose parameters can
be easily estimated is desirable. An important class of such
models is specified by means of functions that are
“intrinsically linear.”
A function relating y to x is intrinsically linear if, by means
of a transformation on x and/or y, the function can be
expressed as $y' = \beta_0 + \beta_1 x'$, where $x'$ = the transformed
independent variable and $y'$ = the transformed dependent
variable.
Four of the most useful intrinsically linear functions are
given in Table 13.1.
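Table 13.1 Useful Intrinsically Linear Functions
a. Exponential, $y = \alpha e^{\beta x}$: transform $y' = \ln(y)$; linear form $y' = \ln(\alpha) + \beta x$
b. Power, $y = \alpha x^{\beta}$: transform $y' = \log(y)$, $x' = \log(x)$; linear form $y' = \log(\alpha) + \beta x'$
c. $y = \alpha + \beta \log(x)$: transform $x' = \log(x)$; linear form $y = \alpha + \beta x'$
d. Reciprocal, $y = \alpha + \beta \cdot (1/x)$: transform $x' = 1/x$; linear form $y = \alpha + \beta x'$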
In each case, the appropriate transformation is either a log
transformation—either base 10 or natural logarithm
(base e)—or a reciprocal transformation.
Representative graphs of the four functions appear in
Figure 13.3.
Figure 13.3 Graphs of the intrinsically linear functions given in Table 13.1
For an exponential function relationship, only y is
transformed to achieve linearity, whereas for a power
function relationship, both x and y are transformed.
Because the variable x is in the exponent in an exponential
relationship, y increases (if $\beta > 0$) or decreases (if $\beta < 0$)
much more rapidly as x increases than is the case for the
power function, though over a short interval of x values it
can be difficult to differentiate between the two functions.
Examples of functions that are not intrinsically linear are
$y = \alpha + \gamma e^{\beta x}$ and $y = \alpha + \gamma x^{\beta}$.
Intrinsically linear functions lead directly to probabilistic
models that, though not linear in x as a function, have
parameters whose values are easily estimated using
ordinary least squares.
A probabilistic model relating Y to x is intrinsically linear if,
by means of a transformation on Y and/or x, it can be
reduced to a linear probabilistic model
$Y' = \beta_0 + \beta_1 x' + \epsilon'$.
The intrinsically linear probabilistic models that correspond
to the four functions of Table 13.1 are as follows:
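a. $Y = \alpha e^{\beta x} \cdot \epsilon$, a multiplicative exponential model, from
which $\ln(Y) = Y' = \beta_0 + \beta_1 x' + \epsilon'$ with $x' = x$, $\beta_0 = \ln(\alpha)$,
$\beta_1 = \beta$, and $\epsilon' = \ln(\epsilon)$.
b. $Y = \alpha x^{\beta} \cdot \epsilon$, a multiplicative power model, so that
$\log(Y) = Y' = \beta_0 + \beta_1 x' + \epsilon'$ with $x' = \log(x)$,
$\beta_0 = \log(\alpha)$, $\beta_1 = \beta$, and $\epsilon' = \log(\epsilon)$.
c. $Y = \alpha + \beta \log(x) + \epsilon$, so that $x' = \log(x)$ immediately
linearizes the model.
d. $Y = \alpha + \beta \cdot 1/x + \epsilon$, so that $x' = 1/x$ yields a linear model.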
The additive exponential and power models, $Y = \alpha e^{\beta x} + \epsilon$
and $Y = \alpha x^{\beta} + \epsilon$, are not intrinsically linear.
Notice that both (a) and (b) require a transformation on Y
and, as a result, a transformation on the error variable $\epsilon$.
In fact, if $\epsilon$ has a lognormal distribution with
$E(\epsilon) = e^{\sigma^2/2}$ and $V(\epsilon) = \tau^2$ independent of x, then the
transformed models for both (a) and (b) will satisfy all the
assumptions regarding the linear probabilistic model; this in
turn implies that all inferences for the parameters of the
transformed model based on these assumptions will be
valid.
If $\sigma^2$ is small, $\mu_{Y \cdot x} \approx \alpha e^{\beta x}$ in (a) or $\alpha x^{\beta}$ in (b).
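The major advantage of an intrinsically linear model is that
the parameters $\beta_0$ and $\beta_1$ of the transformed model can be
immediately estimated using the principle of least squares
simply by substituting $x'$ and $y'$ into the estimating formulas:
$\hat\beta_1 = \frac{\sum x_i' y_i' - \sum x_i' \sum y_i' / n}{\sum (x_i')^2 - (\sum x_i')^2 / n}$
$\hat\beta_0 = \frac{\sum y_i' - \hat\beta_1 \sum x_i'}{n} = \bar{y}' - \hat\beta_1 \bar{x}'$
Parameters of the original nonlinear model can then be
estimated by transforming back $\hat\beta_0$ and/or $\hat\beta_1$ if necessary.
Once a prediction interval for $y'$ when $x' = x'^{*}$ has been
calculated, reversing the transformation gives a PI for y
itself.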
In cases (a) and (b), when $\sigma^2$ is small, an approximate CI
for $\mu_{Y \cdot x^*}$ results from taking antilogs of the limits in the
CI for $\beta_0 + \beta_1 x'^{*}$ (strictly speaking, taking antilogs gives a CI
for the median of the Y distribution, i.e., for
$\tilde\mu_{Y \cdot x^*}$. Because
the lognormal distribution is positively skewed,
$\mu_{Y \cdot x^*} > \tilde\mu_{Y \cdot x^*}$; the
two are approximately equal if $\sigma^2$ is close to 0).
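The transform, fit, and back-transform steps can be sketched in a few lines of Python for the multiplicative exponential model (a). This is a minimal illustration, not the text's own code: the function names `fit_line` and `fit_exponential` are hypothetical, and the data are noise-free values generated from assumed parameters alpha = 2 and beta = 0.5 so that the recovered estimates can be compared with known truth.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares intercept and slope for y = b0 + b1*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b1 = (sxy - sx * sy / n) / (sxx - sx * sx / n)
    b0 = (sy - b1 * sx) / n
    return b0, b1

def fit_exponential(xs, ys):
    """Fit y = alpha * exp(beta * x) by regressing y' = ln(y) on x."""
    b0, b1 = fit_line(xs, [math.log(y) for y in ys])
    # Transform back: beta0 = ln(alpha), beta1 = beta
    return math.exp(b0), b1

# Illustrative, noise-free data (assumed values alpha = 2, beta = 0.5)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
alpha_hat, beta_hat = fit_exponential(xs, ys)
print(alpha_hat, beta_hat)
```

With real data the multiplicative error makes the recovered values estimates rather than exact matches; the power model (b) is handled the same way after also replacing x by log(x).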
Example 3
Taylor’s equation for tool life y as a function of cutting time
x states that $x y^c = k$ or, equivalently, that $y = \alpha x^{\beta}$.
The article “The Effect of Experimental Error on the
Determination of Optimum Metal Cutting Conditions”
(J. of Engr. for Industry, 1967: 315–322) observes that the
relationship is not exact (deterministic) and that the
parameters $\alpha$ and $\beta$ must be estimated from data.
Thus an appropriate model is the multiplicative power
model $Y = \alpha \cdot x^{\beta} \cdot \epsilon$, which the author fit to the
accompanying data consisting of 12 carbide tool life
observations (Table 13.2).
Table 13.2 Data for Example 3
In addition to the x, y, $x'$, and $y'$ values, the predicted
transformed values ($\hat{y}'$) and the predicted values on the
original scale ($\hat{y}$, after transforming back) are given.
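The summary statistics for fitting a straight line to the
transformed data are $\sum x_i' = 74.41200$, $\sum y_i' = 26.22601$,
$\sum x_i'^2 = 461.75874$, $\sum y_i'^2 = 67.74609$, and
$\sum x_i' y_i' = 160.84601$, so
$\hat\beta_1 = \frac{160.84601 - (74.41200)(26.22601)/12}{461.75874 - (74.41200)^2/12} = -5.3996$
$\hat\beta_0 = \frac{26.22601 - (-5.3996)(74.41200)}{12} = 35.6684$
The estimated values of $\alpha$ and $\beta$, the parameters of the
power function model, are $\hat\beta = \hat\beta_1 = -5.3996$ and
$\hat\alpha = e^{\hat\beta_0} = 3.094491530 \times 10^{15}$.
Thus the estimated regression function is
$\hat\mu_{Y \cdot x} \approx 3.094491530 \times 10^{15} \cdot x^{-5.3996}$.
To recapture Taylor’s (estimated) equation,
set $y = 3.094491530 \times 10^{15} \cdot x^{-5.3996}$, whence $x y^{.185} = 740$.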
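The arithmetic in this example can be checked directly from the summary statistics. The short Python sketch below plugs the sums quoted in the text into the usual least squares formulas and then back-transforms; the variable names are illustrative, and the natural logarithm is assumed (consistent with x = 500 being transformed to 6.21461 later in the example).

```python
import math

n = 12                                   # number of tool life observations
sum_x, sum_y = 74.41200, 26.22601        # sums of x' and y'
sum_xx, sum_xy = 461.75874, 160.84601    # sums of x'^2 and x'y'

# Least squares estimates for the transformed (linear) model
b1 = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
b0 = (sum_y - b1 * sum_x) / n

# Back-transform to the power model y = alpha * x**beta
beta_hat = b1
alpha_hat = math.exp(b0)

# Taylor's form x * y**c = k follows from c = -1/beta and k = alpha**c
c_hat = -1.0 / beta_hat
k_hat = alpha_hat ** c_hat

print(b1, b0, alpha_hat, c_hat, k_hat)
```

Because the numerator and denominator of the slope are small differences of large numbers, the sums need all the decimal places shown; rounding them earlier noticeably changes the estimates.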
Figure 13.4(a) gives a plot of the standardized residuals
from the linear regression using transformed variables (for
which $r^2 = .922$); there is no apparent pattern in the plot,
though one standardized residual is a bit large, and the
residuals look as they should for a simple linear regression.
Figure 13.4 (a) Standardized residuals versus $x'$ from Example 3
Figure 13.4(b) pictures a plot of $\hat{y}$ versus y, which indicates
satisfactory predictions on the original scale.
Figure 13.4 (b) $\hat{y}$ versus y from Example 3
To obtain a confidence interval for median tool life when
cutting time is 500, we transform x = 500 to $x' = 6.21461$.
Then $\hat\beta_0 + \hat\beta_1 x' = 2.1120$, and a 95% CI for $\beta_0 + \beta_1(6.21461)$
is $2.1120 \pm (2.228)(.0824) = (1.928, 2.296)$.
The 95% CI for $\tilde\mu_{Y \cdot 500}$
is then obtained by taking antilogs:
$(e^{1.928}, e^{2.296}) = (6.876, 9.930)$.
It is easily checked that for the transformed data
$s^2 = \hat\sigma^2 \approx .081$. Because this is quite small, (6.876, 9.930) is
an approximate interval for $\mu_{Y \cdot 500}$.
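The interval calculation can be reproduced as a small Python sketch. The point estimate 2.1120, the critical value 2.228 (the t value for 12 - 2 = 10 degrees of freedom), and the standard error .0824 are taken from the text; the variable names are illustrative. Limits computed from unrounded intermediate values may differ from the printed ones in the third decimal place.

```python
import math

x_prime = math.log(500)   # transformed cutting time, about 6.21461
estimate = 2.1120         # fitted value of beta0 + beta1 * x' (from the text)
t_crit = 2.228            # t critical value quoted in the text
std_err = 0.0824          # estimated standard deviation of the fitted value

# CI on the transformed (log) scale
half_width = t_crit * std_err
log_limits = (estimate - half_width, estimate + half_width)

# Taking antilogs gives a CI for the median tool life at x = 500
median_limits = tuple(math.exp(v) for v in log_limits)

print(x_prime, log_limits, median_limits)
```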
More General Regression Methods
Thus far we have assumed that either $Y = f(x) + \epsilon$ (an
additive model) or that $Y = f(x) \cdot \epsilon$ (a multiplicative model).
In the case of an additive model, $\mu_{Y \cdot x} = f(x)$, so estimating
the regression function f(x) amounts to estimating the curve
of mean y values.
On occasion, a scatter plot of the data suggests that there
is no simple mathematical expression for f(x).
Statisticians have recently developed some more flexible
methods that permit a wide variety of patterns to be
modeled using the same fitting procedure.
One such method is LOWESS (or LOESS), short for locally
weighted scatter plot smoother. Let $(x^*, y^*)$ denote a
particular one of the n (x, y) pairs in the sample.
The $\hat{y}$ value corresponding to $(x^*, y^*)$ is obtained by fitting a
straight line using only a specified percentage of the data
(e.g., 25%) whose x values are closest to $x^*$.
Furthermore, rather than use “ordinary” least squares,
which gives equal weight to all points, those with x values
closer to $x^*$ are more heavily weighted than those whose x
values are farther away.
The height of the resulting line above $x^*$ is the fitted value
$\hat{y}^*$.
This process is repeated for each of the n points, so n
different lines are fit (you surely wouldn’t want to do all this
by hand).
Finally, the fitted points are connected to produce a
LOWESS curve.
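The procedure just described can be sketched in plain Python. This is an illustrative simplification, not the algorithm used by Minitab or any particular package: the text says only that closer points are weighted more heavily, so the tricube weight function below is an assumption, and the function name `lowess` and its `span` parameter are hypothetical.

```python
def lowess(xs, ys, span=0.5):
    """Fitted value at each x from a locally weighted straight-line fit."""
    n = len(xs)
    k = min(n, max(2, round(span * n)))   # number of nearest points per local fit
    fitted = []
    for x0 in xs:
        # Indices of the k observations whose x values are closest to x0
        near = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        h = max(abs(xs[i] - x0) for i in near)
        # Tricube weights (an assumed choice): closer x values get more weight
        w = [(1 - (abs(xs[i] - x0) / h) ** 3) ** 3 if h > 0 else 1.0
             for i in near]
        # Weighted least squares sums for the local line
        sw = sum(w)
        swx = sum(wi * xs[i] for wi, i in zip(w, near))
        swy = sum(wi * ys[i] for wi, i in zip(w, near))
        swxx = sum(wi * xs[i] ** 2 for wi, i in zip(w, near))
        swxy = sum(wi * xs[i] * ys[i] for wi, i in zip(w, near))
        den = sw * swxx - swx ** 2
        if abs(den) < 1e-12:
            # Degenerate neighbourhood: fall back to the weighted mean
            fitted.append(swy / sw)
        else:
            b1 = (sw * swxy - swx * swy) / den
            b0 = (swy - b1 * swx) / sw
            fitted.append(b0 + b1 * x0)   # height of the local line above x0
    return fitted

# Illustrative check on exactly linear data: the smoother returns the line
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
fitted = lowess(xs, ys, span=0.5)
```

Connecting the points (xs[i], fitted[i]) in order of x gives the smoothed curve; a larger span uses more of the data in each local fit and produces a smoother result.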
Example 5
Weighing large deceased animals found in wilderness
areas is usually not feasible, so it is desirable to have a
method for estimating weight from various characteristics of
an animal that can be easily determined.
Minitab has a stored data set consisting of various
characteristics for a sample of n = 143 wild bears.
Figure 13.7(a) displays a scatter plot of y = weight versus
x = distance around the chest (chest girth).
Figure 13.7 (a) A Minitab scatter plot for the bear weight data
At first glance, it looks as though a single line obtained from
ordinary least squares would effectively summarize the
pattern. Figure 13.7(b) shows the LOWESS curve
produced by Minitab using a span of 50% [the fit at $(x^*, y^*)$
is determined by the closest 50% of the sample].
Figure 13.7 (b) A Minitab LOWESS curve for the bear weight data
The curve appears to consist of two straight line segments
joined together above approximately x = 38.
The steeper line is to the right of 38, indicating that weight
tends to increase more rapidly with girth for girths
exceeding 38 in.
Logistic Regression
The simple linear regression model is appropriate for
relating a quantitative response variable to a quantitative
predictor x.
Consider now a dichotomous response variable with
possible values 1 and 0 corresponding to success and
failure.
Let p = P(S) = P(Y = 1). Frequently, the value of p will
depend on the value of some quantitative variable x.
For example, the probability that a car needs warranty
service of a certain kind might well depend on the car’s
mileage, or the probability of avoiding an infection of a
certain type might depend on the dosage in an inoculation.
Instead of using just the symbol p for the success
probability, we now use p(x) to emphasize the dependence
of this probability on the value of x. The simple linear
regression equation $Y = \beta_0 + \beta_1 x + \epsilon$ is no longer
appropriate, for taking the mean value on each side of the
equation gives
$\mu_{Y \cdot x} = 1 \cdot p(x) + 0 \cdot (1 - p(x)) = p(x) = \beta_0 + \beta_1 x$
Whereas p(x) is a probability and therefore must be
between 0 and 1, $\beta_0 + \beta_1 x$ need not be in this range.
Instead of letting the mean value of Y be a linear function of
x, we now consider a model in which some function of the
mean value of Y is a linear function of x.
In other words, we allow p(x) to be a function of $\beta_0 + \beta_1 x$
rather than $\beta_0 + \beta_1 x$ itself. A function that has been found
quite useful in many applications is the logit function
$p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$
Figure 13.8 shows a graph of p(x) for particular values of
$\beta_0$ and $\beta_1$ with $\beta_1 > 0$.
Figure 13.8 A graph of a logit function
As x increases, the probability of success increases. For
$\beta_1$ negative, the success probability would be a decreasing
function of x.
Logistic regression means assuming that p(x) is related to
x by the logit function. Straightforward algebra shows that
$\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x}$
The expression on the left-hand side is called the odds.
If, for example, $\frac{p(60)}{1 - p(60)} = 3$,
then when x = 60 a success is three times as likely as a
failure.
We now see that the logarithm of the odds is a linear
function of the predictor.
In particular, the slope parameter $\beta_1$ is the change in the log
odds associated with a one-unit increase in x.
This implies that the odds itself changes by the
multiplicative factor $e^{\beta_1}$
when x increases by 1 unit.
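These relationships among the success probability, the odds, and the log odds can be verified numerically. In the Python sketch below the parameter values are hypothetical, chosen only so that the odds at x = 60 equal 3 as in the illustration above; the function names are likewise illustrative.

```python
import math

def p(x, b0, b1):
    """Logit function: success probability at x."""
    e = math.exp(b0 + b1 * x)
    return e / (1 + e)

def odds(x, b0, b1):
    """Odds of success, p(x) / (1 - p(x))."""
    px = p(x, b0, b1)
    return px / (1 - px)

# Hypothetical parameters chosen so that b0 + 60 * b1 = ln(3)
b1 = 0.05
b0 = math.log(3) - 60 * b1

odds_60 = odds(60, b0, b1)                  # success three times as likely as failure
log_odds_60 = math.log(odds_60)             # equals b0 + b1 * 60
odds_ratio = odds(61, b0, b1) / odds_60     # equals exp(b1)

print(odds_60, log_odds_60, odds_ratio)
```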
Fitting the logistic regression to sample data requires that
the parameters $\beta_0$ and $\beta_1$ be estimated. This is usually done
using the maximum likelihood technique.
The details are quite involved, but fortunately the most
popular statistical computer packages will do this on
request and provide quantitative and pictorial indications of
how well the model fits.
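To give a feel for what the packages do, here is a minimal Python sketch of maximum likelihood fitting by Newton-Raphson iteration for a single predictor. It is one common way to maximize the likelihood, not necessarily the method any given package uses. The function name `fit_logistic` is hypothetical, and the small data set is made up purely for illustration (it is not from the text).

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Maximum likelihood estimates (b0, b1) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        # Gradient of the log likelihood (score equations)
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # Information matrix entries, using weights p(1 - p)
        ws = [p * (1 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up illustrative data: successes become more common as x grows
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0_hat, b1_hat = fit_logistic(xs, ys)
print(b0_hat, b1_hat)
```

The sketch has no safeguards: if the successes and failures are perfectly separated by some x value, the maximum likelihood estimates do not exist and the iteration diverges, which production software detects and reports.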
Example 6
Here are data, in the form of a comparative stem-and-leaf
display, on launch temperature and the incidence of failure
of O-rings in 23 space shuttle launches prior to the
Challenger disaster of 1986 (Y = yes, failed; N = no, did not
fail).
Observations on the left side of the display tend to be
smaller than those on the right side.
Stem: Tens digit
Leaf: Ones digit
Figure 13.9 shows Minitab output for a logistic regression
analysis and a graph of the estimated logit function from
the R software.
Figure 13.9 (a) Logistic regression output from Minitab; (b) graph of estimated logistic function and classification probabilities from R
We have chosen to let p denote the probability of failure.
The graph of $\hat{p}$ decreases as temperature increases
because failures tended to occur at lower temperatures
than did successes.
The estimate of $\beta_1$ and its estimated standard deviation are
$\hat\beta_1 = -.232$ and $s_{\hat\beta_1} = .1082$, respectively.
We assume that the sample size n is large enough here so
that $\hat\beta_1$ has approximately a normal distribution.
If $\beta_1 = 0$ (i.e., temperature does not affect the likelihood of
O-ring failure), the test statistic
$Z = \hat\beta_1 / s_{\hat\beta_1}$ has approximately
a standard normal distribution.
The reported value of this ratio is z = –2.14, with a
corresponding two-tailed P-value of .032 (some packages
report a chi-square value which is just $z^2$, with the same
P-value).
At significance level .05, we reject the null hypothesis of no
temperature effect.
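The test statistic and its P-value can be reproduced from the two numbers reported above. This Python sketch uses the standard identity that the two-tailed standard normal tail area equals erfc(|z|/sqrt(2)); the variable names are illustrative.

```python
import math

beta1_hat = -0.232    # estimated slope from the output
se_beta1 = 0.1082     # its estimated standard deviation

z = beta1_hat / se_beta1
# Two-tailed P-value under the standard normal: P(|Z| >= |z|)
p_value = math.erfc(abs(z) / math.sqrt(2))
chi_square = z ** 2   # the equivalent chi-square statistic (1 df)

print(round(z, 2), round(p_value, 3), round(chi_square, 2))
```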
The estimated odds of failure for any particular temperature
value x is
$\frac{\hat{p}(x)}{1 - \hat{p}(x)} = e^{\hat\beta_0 + \hat\beta_1 x}$
This implies that the odds ratio, the odds of failure at a
temperature of x + 1 divided by the odds of failure at a
temperature of x, is
$e^{\hat\beta_1} = e^{-.232} \approx .79$
The interpretation is that for each additional degree of
temperature, we estimate that the odds of failure will
be multiplied by a factor of .79 (a 21% decrease). A 95% CI for the true
odds ratio also appears on output.
In addition, Minitab provides three different ways of
assessing model lack-of-fit: the Pearson, deviance, and
Hosmer-Lemeshow tests. Large P-values are consistent
with a good model.
These tests are useful in multiple logistic regression, where
there is more than one predictor in the model relationship
so there is no single graph like that of Figure 13.9(b).
Various diagnostic plots are also available. The R output
provides information based on classifying an observation
as a failure if the estimated p(x) is at least .5 and as a
non-failure otherwise.
Since p(x) = .5 when x = 64.80, three of the seven failures
(Ys in the graph) would be misclassified as non-failures
(a misclassification proportion of .429), whereas none of
the non-failure observations would be misclassified.
A better way to assess the likelihood of misclassification is
to use cross-validation: Remove the first observation from
the sample, estimate the relationship, then classify the first
observation based on this estimated relationship, and
repeat this process with each of the other sample
observations (so a sample observation does not affect its
own classification).
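The leave-one-out procedure just described can be sketched generically in Python. To keep the example short, the classifier here is a simple stand-in (a cutoff at the midpoint between the two class means of x, with larger x classified as 1), not a fitted logistic regression; in practice the `fit` and `predict` arguments would wrap a logistic fit and the p(x) at least .5 rule. All function names and the data are hypothetical.

```python
def fit_threshold(xs, ys):
    """Stand-in classifier: cutoff midway between the two class means of x."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def classify(threshold, x):
    """Predict 1 when x is at or above the cutoff, else 0."""
    return 1 if x >= threshold else 0

def loocv_error(xs, ys, fit, predict):
    """Leave-one-out cross-validated misclassification proportion."""
    errors = 0
    for i in range(len(xs)):
        # Remove observation i, refit, then classify the held-out point
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)

# Made-up illustrative data (not from the text)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
cv_error = loocv_error(xs, ys, fit_threshold, classify)
print(cv_error)
```

Because each observation is classified by a model that never saw it, the resulting proportion is usually a less optimistic estimate than the one obtained by classifying the same data used for fitting.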
The launch temperature for the Challenger mission was
only 31°F. This temperature is much lower than any value
in the sample, so it is dangerous to extrapolate the
estimated relationship. Nevertheless, it appears that O-ring
failure is virtually a sure thing for a temperature this low.