MATH 2441
Probability and Statistics for Biological Sciences
Statistical Inferences Involving Linear Regression and Linear Correlation
This document summarizes important types of statistical inference that can be done on quantities arising in
and from a simple linear regression model and the linear correlation coefficient. We looked at these two
approaches to describing or characterizing relationships between two variables earlier in the course.
Review of Basic Formulas
Methods of simple linear regression and correlation are applied in situations in which we have two
quantitative variables, x and y. It is thought that the values of x and y tend to vary together (either increasing or decreasing together, or with one variable tending to decrease in value as the other increases). At this level, the goal is first to determine whether or not that relationship is linear, and if so, to characterize or quantify the apparent linear relationship between the two variables.
In a regression study, there is also the notion that the variable x is the independent or control variable -- that
it is the value of x which influences or partially determines the value of y. The variable y is thus considered
to be a dependent variable. The value of y may be influenced by the values of a large number of variables
other than x as well. For the simple linear regression study to be useful, none of those other influences can
be very important. However, because of their existence, we are not able to predict the value of y precisely
knowing just the value of x. Even if the relationship between y and x itself is a linear one, the observed
values of y will not satisfy a linear equation involving just x, and so when we plot the data as points on a
graph, they will not lie on a perfect straight line.
On the other hand, in correlation studies, we are simply attempting to establish the existence of a linear
relationship between x and y without trying to attribute control or dependence of one variable on the other.
When we first considered the simple linear regression and linear correlation models as methods of
descriptive statistics, we primarily viewed them as procedures for summarizing data that appeared to
indicate the presence of some sort of a linear relationship. We pictured an experiment being carried out in
which for each of a set of n values of x, {x1, x2, …, xn}, a value of y was observed: {y1, y2, …, yn}. Together,
these two sets of observations amounted to n pairs of values: (x1, y1), (x2, y2), … (xn, yn). When these pairs
of values were plotted as points on a set of x-y axes, the resulting "scatterplot" of points, while not forming a
perfect straight line, might appear very much as if the points are scattered randomly along a straight line
path. Arguments were presented earlier that the so-called correlation coefficient
r = SSxy / √(SSx·SSy)                                                (LR - 1)
where
SSx = Σ xk² - (Σ xk)²/n = Σ xk² - n·x̄²                               (LR - 2a)
SSy = Σ yk² - (Σ yk)²/n = Σ yk² - n·ȳ²                               (LR - 2b)
and
SSxy = Σ xkyk - (Σ xk)(Σ yk)/n = Σ xkyk - n·x̄·ȳ                      (LR - 2c)
(all sums running from k = 1 to n)
was a single number that could be used to measure the degree to which the points of the scatterplot
clustered about a straight line. If r = ±1, the points lie on a perfect straight line (of positive slope when r = +1 and negative slope when r = -1), values of r between +1 and -1 indicate lesser degrees of clustering about a straight-line pattern, and values of r near zero indicate little or no discernible linear pattern to the points.
In the case of linear regression analysis, we actually attempted to come up with an equation for the straight
line most closely fitting the points in the scatterplot. We wrote that equation in the standard form
ŷ = b0 + b1x                                                         (LR - 3)
The coefficients, b0 and b1, were then to be computed so that the sum of the squares of the vertical
distances of the points from the line was as small as possible -- the so-called "least squares" criterion of
determining what is meant by the best fit straight line. We define the residual of point #k as
εk = yk - ŷk                                                         (LR - 4)
where ŷk is ŷ evaluated at x = xk. Thus, εk is the vertical distance by which the point (xk, yk) is separated from the best fit straight line given by (LR - 3). The least squares principle then means that we choose b0 and b1 so that the quantity
SSE = Σ εk² = Σ (yk - ŷk)² = Σ (yk - b0 - b1xk)²                      (LR - 5)
(sums running from k = 1 to n)
is made as small as possible. This is not a difficult problem to solve symbolically, and we end up with the
formulas
b1 = SSxy / SSx     and     b0 = ȳ - b1·x̄                            (LR - 6)
where x̄ and ȳ are the mean values of the x and y coordinates, respectively, for the points in the scatterplot, and SSxy and SSx are calculated using formulas (LR - 2a-c) above.
The "goodness" of the best-fit straight line was then simply evaluated visually (did it seem to come quite
close to all of the points), or, by computation of the so-called coefficient of determination:
r² = (SST - SSE)/SST = 1 - SSE/SST                                   (LR - 7)
which turned out to be the square of the value of the correlation coefficient, r, defined in formula (LR - 1)
above. However, with the definitions
SSE = SSy - 2b1·SSxy + b1²·SSx = SSy - b1·SSxy = SSy - (SSxy)²/SSx   (LR - 8)
obtained by substituting formulas (LR - 6) into (LR - 5), so that (LR - 8) is a measure of the variation of the
observed y-values about the best fit straight line, and
SST = Σ (yk - ȳ)²   ( = SSy )                                        (LR - 9)
(sum running from k = 1 to n)
is a measure of the variability in the y-values without reference to the straight line, we get an alternative, and
very informative interpretation of r2. If all of the points lay exactly on a straight line, SSE given by (LR - 8)
would be exactly zero, and thus r2 would be exactly 1. When not all of the points fall on a straight line, we
conclude that the value of x is insufficient to completely determine the value of y, and hence, SSE is a
measure of how much of the variability in the y-values is not accounted for by reference to x. Since SST
measures the variability in the y-values without any reference to x at all, the ratio SSE/SST is the fraction of
the variability in the y-values that is not accounted for by reference to x. Thus, r2 is the fraction of the
variability in the y-values that is accounted for by reference to the values of x. In a sense, you can think of
the decimal value of r2 as telling you the fraction of the variation in the y-values that can be attributed to the
fact that they correspond to different x-values.
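If you have access to Python with NumPy, the following minimal sketch (the function name and the toy data are invented purely for illustration) shows how formulas (LR - 1) through (LR - 9) fit together computationally:

import numpy as np

def least_squares_summary(x, y):
    # Sums of squares (LR - 2a, b, c), coefficients (LR - 6),
    # SSE (LR - 8), SST (LR - 9) and r, r^2 (LR - 1, LR - 7).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    SSx = np.sum(x**2) - np.sum(x)**2 / n
    SSy = np.sum(y**2) - np.sum(y)**2 / n
    SSxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    b1 = SSxy / SSx
    b0 = y.mean() - b1 * x.mean()
    SSE = SSy - SSxy**2 / SSx
    r = SSxy / np.sqrt(SSx * SSy)
    return {"b0": b0, "b1": b1, "SSE": SSE, "SST": SSy, "r": r, "r2": r**2}

# toy data, invented purely for illustration
print(least_squares_summary([1, 2, 3, 4, 5, 6], [2.1, 4.3, 5.8, 8.4, 9.9, 12.2]))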
All of these formulas and interpretations were based on intuitive arguments when we looked at them earlier.
Now, however, we can begin to reinterpret that work from a more statistical point of view, which will allow us
to characterize the "significance" of the numbers more rigorously. While it is beyond the scope of this
course to examine the mathematical background of linear regression and correlation to any depth, we need
to go over a few basic ideas so that you will be able to understand the various methods presented later in
this document.
The Simple Linear Regression Model
Because we hadn't discussed probability and random sampling when we first looked at the problem of
describing relationships, it wasn't possible to point out that we are really considering y to be a random
variable in the linear regression model, and both x and y to be random variables in a correlation model.
That is, even if you know the value of x in advance, you cannot predict with certainty the precise value of y
that will be observed. We expect, or at least hope, that the value of x will have a strong influence on the
value of y -- that's why we're trying to characterize or quantify the relationship between the two quantities -- but we know that since the scatterplot does not give points lying on a perfect straight line, there must be other things that we are not taking into account that are also influencing the value of y.
The situation in a linear regression study is illustrated in the accompanying diagram. [Figure: for each of x1, x2, ..., xn, the possible y-values form a normal distribution whose mean lies on the line μy = β0 + β1x.] For a given value of x, the possible values of y are considered to form a normal distribution, with some mean value and standard deviation. The formulas and methods to follow make the assumption that the standard deviation of these distributions does not change as x varies. On the other hand, the mean values of y for different values of x do change. A linear regression model applies when the graph of those mean values of y lies on a straight line:
μy = β0 + β1x                                                        (LR - 10)
We consider (LR - 10) to be the equation of the line passing through this succession of mean values of y for
varying values of x. Thus, (LR - 10) amounts to the "population regression line" or "true regression line" that results when all possible observations of y for every possible value of x are included. (This is the same notion
of a statistical population that we've used throughout the course: "all possible things of a certain type."
Here, those "things" are pairs of (x, y) values rather than apples or fish or cans of soup, etc.) Then the
equation (LR - 3),
ŷ = b0 + b1x                                                         (LR - 3)
containing actual numerical values of the coefficients b0 and b1 that were calculated from the n observed pairs of values (xk, yk), can be viewed as the "sample regression line," forming a "point estimate" of the population regression line in (LR - 10). Thus, b0, calculated from the data, will be used as an estimate of β0, and b1, calculated from the data, will be used as an estimate of β1. If two experiments are performed, the n pairs of values (xk, yk) that result will be different, so the values of b0 and b1 will be different, and so we see that b0 and b1 are actually random variables.
By a similar argument, we find that the correlation coefficient, r (or more correctly, the sample correlation coefficient r), is a random variable that can be used to estimate a corresponding population correlation coefficient ρ. Because in this case both x and y are considered random variables, the probability picture is a bit more involved than the one for the linear regression case sketched above (actually, the sketch above is a bit of a simplification anyway). It involves the so-called "bivariate normal distribution," which amounts to a normal distribution in three dimensions -- a shape similar to a real brass bell! The study of the detailed mathematical properties of the bivariate normal distribution is beyond the scope of this course, so the relevant consequences will just be stated as required.
In fact, all of the descriptive measures mentioned so far (b0, b1, ŷ, r, etc.) are random variables whose
values can be used to estimate various corresponding population parameters. The rest of this document will
describe a few of the more important techniques.
Throughout, the basic issue is the same one that lies behind all of the statistical inference we've done until now, namely: what is the probability that the values or patterns observed in the sample data really reflect a pattern or value in the population from which the sample data was obtained?
Is There Really a Linear Relationship Between y and x?
The first question to be handled is whether the apparent linear relationship we see in the sample data really
reflects a linear relationship in the population. We've already looked at informal ways to assess this issue.
To make the answering of this question a bit more rigorous, we need to set up a hypothesis test along the
lines:
H0: there is no linear relationship between y and x
vs.
HA: there is a linear relationship between y and x
The most common way to test these hypotheses is a procedure based on the F-distribution. Organize the
relevant information into a table with the following standard form:
Source of Variation    Degrees of Freedom    Sum of Squares    Mean Square               F
Regression             1                     SSR               MSR = SSR/1               F = MSR/MSE
Error                  n - 2                 SSE               MSE = s² = SSE/(n - 2)
Total                  n - 1                 SST
For reasons that will become clear later in the course, this summary of information in this tabular format is often called an ANOVA table. Table entries that have not been defined in preceding formulas are defined by the formulas in the table itself. Notice that the third row is just the sum of the first and second rows. The mean squares are just the sum-of-squares quantities divided by the degrees of freedom -- hence the somewhat strange way of writing the formula for MSR. (In regression models involving k independent variables, the regression degrees of freedom would be k, and so the denominator in the formula for MSR would be k rather than 1.)
The quantity in the last column, F, is an F-distributed random variable with the numerator degrees of freedom equal to 1 and the denominator degrees of freedom equal to n - 2. If the points all lie exactly on a straight line, SSE and hence MSE will be zero, and so F will be +∞. This means that the rejection region for the null hypothesis above is a right-tailed rejection region:
reject H0 if F > Fα, 1, n-2
and of course, this means that
p-value = Pr(F > value given in the table)
It turns out that this F-test gives identical results to the t-test for β1 (which we will describe shortly) when there is just one control variable in the problem. However, when more than one control variable is present -- a situation which the Food Technology students will explore in greater detail next term -- this F-test gives results not duplicated by the other methods described below.
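As a computational aside, here is a hedged sketch (Python with NumPy and SciPy assumed available; the function name is hypothetical) of how the entries of the ANOVA table and the F-test p-value could be assembled for a single control variable:

import numpy as np
from scipy import stats

def regression_anova(x, y):
    # Builds SSR, SSE, SST, MSR, MSE, F and the right-tailed p-value
    # for the one-control-variable regression ANOVA table.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    SSx = np.sum(x**2) - np.sum(x)**2 / n
    SSy = np.sum(y**2) - np.sum(y)**2 / n
    SSxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    SST = SSy
    SSE = SSy - SSxy**2 / SSx
    SSR = SST - SSE
    MSR = SSR / 1              # regression df = 1 here
    MSE = SSE / (n - 2)        # error df = n - 2
    F = MSR / MSE
    p_value = stats.f.sf(F, 1, n - 2)   # Pr(F > observed value)
    return SSR, SSE, SST, MSR, MSE, F, p_value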
As we proceed through this document, we'll illustrate the various types of calculations or analysis using one or more of the five examples first described in the earlier document on characterizing relationships.
Those examples were:
Example 1: the simple six point example,
Example 2: the nectarine size example,
Example 3: the potato yield example,
Example 4: the "no apparent linear relationship" example, and
Example 5: the "non-linear relationship" example.
For easier reference, we repeat a somewhat expanded version of the summary table given near the end of
the document on characterizing relationships:
Example:    Simple          PotatoYield     NectarineSize   No Linear       Non-Linear
n           6               65              50              100             70
Σ xk        38              26974           11918           24572           24757
Σ yk        200             3487.1          8790            15525           38401
Σ xk²       320             12085222        3427438         7851466         10232087
Σ yk²       7402            191947.93       1617678         3223933         22015057
Σ xkyk      1507            1499131.4       1909678         3781584         14498751
SSx         79.33333333     891426.9846     586663.52       1813634.16      1476243.443
SSy         735.33333333    4873.062154     72396           813676.75       948816.9857
SSxy        240.33333333    52038.54769     -185506.4       -33219          917414.4714
b0          14.14705882     29.42226827     251.1708114     159.7506721     328.7958901
b1          3.02941176      0.058376680     -0.316205787    -0.018316263    0.621452021
SSE         7.26470588      1835.224515     13737.80281     813068.3021     378687.9081
r           0.99504800      0.789553029     -0.900133800    -0.027345493    0.775167168
r²          0.99012053      0.623393986     0.810240858     0.000747776     0.600884139
Example 1:
For the simple example, n = 6, so the Error degrees of freedom are n - 2 = 4, and the Total degrees of
freedom are n - 1 = 5. SST = SSy = 735.33. SSE has already been calculated, given as 7.2647, so that
SSR = SST - SSE = 735.3333 - 7.2647 = 728.0686
and this gives the value of MSR = SSR/1 as well. Since
MSE = 7.2647/4 = 1.8162
we get
F = MSR/MSE = 728.0686/1.8162 = 400.88
Thus, the full ANOVA table here becomes:

Source of Variation    Degrees of Freedom    Sum of Squares    Mean Square    F
Regression             1                     728.0686          728.0686       400.88
Error                  4                     7.2647            1.8162
Total                  5                     735.3333
Now, from the table of critical values of the F-distribution, we get F0.05,1,4 = 7.71, so that we can reject H0 at a
level of significance of 0.05 if the value of our F-statistic exceeds 7.71. Since 400.88 well exceeds 7.71, we
can reject H0 at a level of significance of 0.05. The conclusion is that the data supports the existence of a
linear relationship between y and x at a level of significance of 0.05.
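If SciPy is available, the critical value and the p-value for this F-test can be checked with a couple of lines (a cross-check sketch, not part of the original hand calculation):

from scipy import stats

F = 728.0686 / 1.8162              # = 400.88, from the ANOVA table above
print(stats.f.ppf(0.95, 1, 4))     # critical value F(0.05, 1, 4), approximately 7.71
print(stats.f.sf(F, 1, 4))         # p-value Pr(F > 400.88), vanishingly small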
For reference purposes and to serve as practice problems for you, we just state the ANOVA tables for the
other four data sets with just brief comments. You can calculate all of these numbers using information from
the summary table above.
PotatoYield:
Source of Variation    Degrees of Freedom    Sum of Squares    Mean Square    F
Regression             1                     3037.84           3037.84        104.28
Error                  63                    1835.22           29.1305
Total                  64                    4873.06
With denominator degrees of freedom being 63, we need to compare this test statistic with the closest critical value available in our tables of critical values of the F-distribution: F0.05,1,60 = 4.00. Again, the criterion for rejecting H0 is met in great excess at α = 0.05, and so we conclude without hesitation that the PotatoYield data reflects a linear relationship between yield and mm of available water. The p-value here is 5.5 x 10^-15, reinforcing that conclusion, and also giving you an idea of how small the p-value of this F-test is when a fairly large number of data points can be seen to cluster at least loosely about a straight line.
NectarineSize:

Source of Variation    Degrees of Freedom    Sum of Squares    Mean Square    F
Regression             1                     58658.2           58658.2        204.95
Error                  48                    13737.8           286.204
Total                  49                    72396
If we compare this value of F with a critical value of just over 4.00 from our α = 0.05 F-table, we see that the rejection criterion is again met far beyond the requirement. (The p-value here is about 3.9 x 10^-19.) Thus, we can conclude with no practical probability of error that this data supports a conclusion that the size of nectarines is related linearly to the crop load.
No Apparent Linear Relationship:
Source of Variation    Degrees of Freedom    Sum of Squares    Mean Square    F
Regression             1                     608.4479          608.4479       0.0733
Error                  98                    813068.3          8296.615
Total                  99                    813676.8
From the first three examples, it was beginning to look like this F-test always gives values of the test statistic in the hundreds and p-values that are zero for all practical purposes. However, for this set of data, constructed to have no discernible linear pattern, you see that we get a value of 0.0733 for the test statistic, whereas rejection of the null hypothesis in this case at α = 0.05 would require the test statistic to be greater than approximately 4.00. So we cannot reject H0 here at α = 0.05. In fact, the p-value in this case is 0.787.
What may be a bit surprising are the following results for the example in which we saw that the points clearly
did not follow a linear path.
Non-linear Relationship:
Source of Variation    Degrees of Freedom    Sum of Squares    Mean Square    F
Regression             1                     570129.078        570129.078     102.38
Error                  68                    378687.908        5568.93983
Total                  69                    948816.986
The p-value for this F-test is also a minuscule 2.87 x 10^-15, which indicates incredibly strong evidence in favor of the alternative hypothesis over the null hypothesis here -- yet we know that
HA: there is a linear relationship between y and x
is false here. What's happening is worth a caution.
The best-fit straight line for the data is the slanted line in the accompanying figure. What the F-test is detecting
is that even with the curved pattern of points, the
points are on the average closer to this line than they
are to the horizontal line at the mean of the y-values.
If you like, the F-test is just measuring how much
better a straight line is than no line at all, but that
doesn't necessarily mean that the straight line itself is
all that good a representation of the data. This is why
you must always do a scatterplot to verify visually that
a straight line model is plausible.
[Figure: "Curved Pattern" -- scatterplot of the non-linear example data (y from 0 to 900, x from 0 to 700) with the best-fit straight line superimposed.]
Testing Hypotheses Involving the Slope, β1
The null hypothesis can be written
H0: β1 = β1,0
where β1,0 is some specific numerical target value. Most commonly, this value β1,0 is zero, and so the result of the hypothesis test would be a decision as to whether the data supports the conclusion of a slope that is nonzero in some way.
Under the conditions already described, the standardized test statistic to use here is
t = (b1 - β1,0) / sb1                                                (LR - 11a)
where
sb1 = s/√SSx = √(MSE/SSx)                                            (LR - 11b)
is an estimate of the standard deviation of the sampling distribution of b1. Then, just do the t-test (with ν = n - 2 degrees of freedom):
Table 1.

Hypotheses                           Reject H0 at a level of significance α if:           p-value
H0: β1 = β1,0  vs  HA: β1 > β1,0     t > tα,ν  (single-tailed rejection region)           Pr(t > test statistic value)
H0: β1 = β1,0  vs  HA: β1 < β1,0     t < -tα,ν  (single-tailed rejection region)          Pr(t < test statistic value)
H0: β1 = β1,0  vs  HA: β1 ≠ β1,0     t > tα/2,ν or t < -tα/2,ν  (two-tailed rejection region)    2 Pr(t > test statistic value)
The hypothesis tests here when the value β1,0 is zero are particularly important because rejection of H0 means that the part of the equation for y which involves x does not disappear -- that is, rejection of H0 means that a linear relationship really does exist between y and x. Thus, this test gives a similar sort of conclusion as does the F-test we described earlier, and it is not a surprise that the p-value for the two-tailed t-test in this situation is equal to the p-value for the F-test. (This is true only when the regression model contains just one control variable.)
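A minimal sketch of this t-test in Python (NumPy and SciPy assumed; the function name and the `alternative` argument are inventions for illustration) follows directly from formulas (LR - 11a, b):

import numpy as np
from scipy import stats

def slope_t_test(x, y, b1_0=0.0, alternative="two-sided"):
    # t-test of H0: beta1 = b1_0 using (LR - 11a, b); df = n - 2.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    SSx = np.sum(x**2) - np.sum(x)**2 / n
    SSy = np.sum(y**2) - np.sum(y)**2 / n
    SSxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    b1 = SSxy / SSx
    MSE = (SSy - SSxy**2 / SSx) / (n - 2)
    s_b1 = np.sqrt(MSE / SSx)              # (LR - 11b)
    t = (b1 - b1_0) / s_b1                 # (LR - 11a)
    if alternative == "greater":
        p = stats.t.sf(t, n - 2)
    elif alternative == "less":
        p = stats.t.cdf(t, n - 2)
    else:
        p = 2 * stats.t.sf(abs(t), n - 2)
    return t, p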
Example 6: A food technologist is studying the relationship between brine concentration (%) and the iron
content of pickled fish of a certain species. She takes 15 specimens of the fish pickled in various brines,
and determines both the brine concentrations and the iron concentrations in the fish in ppm. For the
purposes of this experiment, the iron concentrations are considered to be the dependent variable (y), since it
is difficult to imagine how small variations in the iron content of the fish (at a ppm level) could have a
measurable effect on the concentration of the brine surrounding the fish.
Anyway, the data she collected is given in the table below. From this data, we can compute SSx = 86.4493, SSy = 472.6133, SSxy = 165.1767, which gives SSE = 157.0142. Further, the equation of the regression line comes out as
ŷ(x) = 45.11 + 1.911x
Is this data adequate evidence to conclude that the average
ppm Fe is linearly related to concentration of the brine?
Further, can we conclude that the Fe content increases by
more than one ppm for each additional percent
concentration of the brine?
Specimen    Concentration of brine (%)    Fe content (ppm)
1           3.8                           51.8
2           3.9                           52.5
3           4.3                           59.3
4           4.3                           52.8
5           4.8                           55.6
6           4.9                           50.7
7           7.4                           53.9
8           7.5                           54.9
9           7.9                           65.2
10          9.1                           65.8
11          9.2                           62.9
12          9.3                           60.4
13          9.7                           61.3
14          10.1                          66.8
15          10.2                          66.1

Solution
The two questions here require two somewhat different approaches. To answer the first question: Is this
data evidence of a linear relationship between brine percentage and ppm Fe, we could use the F-test
described earlier, or we could test the hypotheses
H0: β1 = 0     vs     HA: β1 ≠ 0
For the F-test, the ANOVA table is:
Source of Variation    Degrees of Freedom    Sum of Squares    Mean Square    F
Regression             1                     315.5991          315.5991       26.13
Error                  13                    157.0142          12.0780
Total                  14                    472.6133
leading to a p-value of 0.000199588 (obtained using Excel's FDIST() function). Thus, the F-test indicates
the data is strong evidence in support of a linear relationship between ppm Fe and percent concentration of
brine.
We get the same result using the t-test on the hypotheses involving β1. First we calculate
sb1 = √(MSE/SSx) = √(12.0780/86.4493) = 0.37378
Thus, the standardized test statistic has the value:
t
b1  1,0
sb1

1.911  0
 5.112
0.37378
This is quite a large value -- much larger than t0.025, 13 = 2.160 from our tables -- so that we can reject H0 in favor of HA here easily at a level of significance of 0.05. In fact, when we calculate the p-value for this test, we get
p-value = 2 Pr(t > 5.112, ν = 13) = 2 x 0.000099794 = 0.000199588,
just as we got for the F-test.
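Before turning to the second question, note that this first test can be cross-checked with SciPy's linregress routine, which reports the slope, its estimated standard error, and the two-sided p-value for H0: β1 = 0 (a sketch; the fifteen data pairs are simply retyped from the table above):

from scipy import stats

brine = [3.8, 3.9, 4.3, 4.3, 4.8, 4.9, 7.4, 7.5, 7.9, 9.1, 9.2, 9.3, 9.7, 10.1, 10.2]
fe = [51.8, 52.5, 59.3, 52.8, 55.6, 50.7, 53.9, 54.9, 65.2, 65.8, 62.9, 60.4, 61.3, 66.8, 66.1]

result = stats.linregress(brine, fe)
print(result.intercept, result.slope)   # about 45.11 and 1.911
print(result.stderr)                    # about 0.374, i.e. s_b1
print(result.pvalue)                    # about 0.0002, the two-sided test of beta1 = 0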
To answer the second question here, we need to test a different set of hypotheses:
H0: β1 = 1     vs     HA: β1 > 1
If we are able to reject this H0 in favor of this HA, we will have demonstrated support for the claim that the
slope of the best fit line relating ppm Fe to % concentration of the brine is greater than 1, equivalent to
saying that the ppm Fe increases by more than one unit for each percent increase in the brine
concentration. These hypotheses are tested again using the t-test, but now the standardized test statistic is:
t = (b1 - β1,0)/sb1 = (1.911 - 1)/0.37378 = 2.437
We can reject H0 at a level of significance of 0.05 if this test statistic is greater than t0.05, 13 = 1.771. Since
2.437 is greater than 1.771, we can reject H0, and so conclude that the ppm Fe does increase by more than
one unit for each percent increase in brine concentration. (The p-value in this case is given by Excel's
TDIST() function as 0.0150, which is well below the conventional cutoff at 0.05.)
From formulas (LR - 11a, b) it is easy to deduce that the formula for the 100(1 - α)% confidence interval estimate of β1 is
β1 = b1 ± tα/2,ν · sb1                                               (LR - 12)
Example 7: Construct a 95% confidence interval estimate of the slope of the true regression line based on
the ppm Fe vs brine concentration data in Example 6.
Solution
We just substitute the required (already calculated) values into formula (LR - 12). The degrees of freedom, ν, here are n - 2 = 15 - 2 = 13, and from our t-tables, we get that t0.025, 13 = 2.160. Thus,
β1 = 1.911 ± (2.160)(0.37378)    @95%
or
β1 = 1.911 ± 0.807               @95%
or
1.104 ≤ β1 ≤ 2.718               @95%
This means that there is a 95% probability that the interval 1.104 to 2.718 captures or contains the value of the slope of the true regression line in this case.
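The same interval can be reproduced in a few lines (a sketch assuming SciPy; the numbers are the ones already computed in Examples 6 and 7):

from scipy import stats

b1, s_b1, df = 1.911, 0.37378, 13
t_crit = stats.t.ppf(0.975, df)                     # = 2.160
print(b1 - t_crit * s_b1, b1 + t_crit * s_b1)       # about 1.104 and 2.718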
Estimation and Prediction of y-values
The formula
ŷ = b0 + b1x                                                         (LR - 3)
gives us information about the values of y that may occur for a specific value of x. Generally, there are two issues of interest here, distinguished by the keywords estimation and prediction.
We use the word estimation to indicate the process of calculating an estimate of the mean of all of the y-values corresponding to a specific value of x, which we will indicate by the symbol μy|x. The symbol μ indicates that we are speaking of a population mean, the subscript 'y' indicates that it is a mean value of y, and the subscript 'x' indicates that this value depends on the specific value of x selected. The value of ŷ(x) serves as an acceptable point estimator for μy|x. Of course, ŷ(x) is a random variable (because the formula (LR - 3) is based on the data in a random sample of observations), and so to obtain an interval estimate of μy|x or to be able to test hypotheses involving μy|x, we need a formula for the variance of the sampling distribution of ŷ(x).
The underlying probability theory here is quite complex and really just beyond the scope of this course, so
we will just state the resulting formulas and demonstrate their use. It can be shown mathematically that
t = (ŷ(x*) - μy|x*) / sŷ(x*)                                         (LR - 13)
is a t-distributed random variable with n - 2 degrees of freedom. In this formula, we have used the symbol x* to indicate that a specific value of x is being considered. The denominator, sŷ(x*), indicates the standard deviation of the sampling distribution of ŷ(x) for that particular value of x, and is given by the formula
s²ŷ(x*) = s² [ 1/n + (x* - x̄)²/SSx ]                                 (LR - 14)
where s² is a synonym for MSE, defined in the ANOVA table earlier. The form of this expression has some important practical implications. The first term in the brackets gives the usual s²/n type of expression for the variance of the sampling distribution of a mean. The second term is novel here, though. Notice that the second term is zero when x* = x̄, that is, when x* is at the horizontal "center" of the scatterplot of points, but increases in value if you move either rightwards or leftwards from that location. Since the precision with which we will be able to estimate μy|x* goes down as this standard deviation increases in value, what this term in (LR - 14) is reflecting is that the greatest estimation precision will occur right in the middle of the cloud of points, and gets poorer as you move out towards the extremes of the observations.
Given these two formulas, we can now write down the usual statistical inference formulas for μy|x*. The 100(1 - α)% confidence interval estimate of μy|x* is given by
μy|x* = ŷ(x*) ± tα/2,n-2 · sŷ(x*)        @ 100(1 - α)%               (LR - 15)
and the null hypothesis
H0: μy|x* = μ0                                                       (LR - 16a)
can be tested by applying the t-test formulas to the standardized test statistic
t = (ŷ(x*) - μ0) / sŷ(x*)                                            (LR - 16b)
Example 8: Using the data given in Example 6 above, obtain 95% confidence interval estimates of the
mean ppm Fe in fish pickled in a 7% brine solution and of fish pickled in a 10% brine solution.
Solution
We are being asked to construct a 95% confidence interval estimate for μy|x* for the values x* = 7.0 and x* = 10.0. For the data given in Example 6, it is easy to determine that x̄ = 7.0933. We already know from previous calculations that n = 15, SSx = 86.4493 and s² = MSE = 12.0780. From the t-table, we get that tα/2,n-2 = t0.025, 13 = 2.160. Thus, we just combine formulas (LR - 14) and (LR - 15), substituting in the required numbers to get:
μy|7.0 = ŷ(7.0) ± t0.025,13 · sŷ(7.0)        @95%
for which we need
ŷ(7.0) = 45.11 + 1.911 x 7.0 = 58.4883 ppm
and
sŷ(7.0) = √( s² [1/n + (7.0 - x̄)²/SSx] ) = √( 12.0780 [1/15 + (7.0 - 7.0933)²/86.4493] ) = 0.8980
So
μy|7.0 = 58.4883 ± 2.160 x 0.8980 = 58.4883 ± 1.9397        @ 95%
We should probably round these numbers to about 2 decimal places at the most before reporting the result, but the precision above will help you confirm that you've got the calculations right when you attempt this on your own. In words, this result means that there is a 95% probability that pickled fish of this type that have been prepared in a 7.0% brine solution will have a mean ppm Fe which is contained in the interval 58.48 ± 1.94 ppm.
The calculation for x* = 10.0% is done in exactly the same way, substituting the value 10.0 wherever 7.0 occurs in the above calculation. In summary, we get
ŷ(10.0) = 45.11 + 1.911 x 10.0 = 64.2204 ppm
sŷ(10.0) = √( s² [1/n + (10.0 - x̄)²/SSx] ) = √( 12.0780 [1/15 + (10.0 - 7.0933)²/86.4493] ) = 1.4091
and so, finally,
μy|10.0 = 64.2204 ± 2.160 x 1.4091 = 64.2204 ± 3.0437        @ 95%
Notice particularly how much larger sŷ(10.0) is in value than is sŷ(7.0), and as a result, how much wider the confidence interval is for μy|10.0 than it is for μy|7.0. This is due to the fact that x* = 7.0 is much nearer to the value of x̄ = 7.0933 than is the value x* = 10.0, and so the second term in formula (LR - 14) makes a much larger contribution to the standard deviation when x* = 10.0 than when x* = 7.0.
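Both intervals can be reproduced numerically with the sketch below (SciPy assumed; small differences in the last decimal place are due to the rounded regression coefficients):

import numpy as np
from scipy import stats

n, SSx, MSE, xbar = 15, 86.4493, 12.0780, 7.0933
b0, b1 = 45.11, 1.911
t_crit = stats.t.ppf(0.975, n - 2)                  # = 2.160

for x_star in (7.0, 10.0):
    y_hat = b0 + b1 * x_star
    s_yhat = np.sqrt(MSE * (1 / n + (x_star - xbar)**2 / SSx))   # (LR - 14)
    print(x_star, y_hat - t_crit * s_yhat, y_hat + t_crit * s_yhat)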
Example 9: Does the pickled fish data given in Example 6 support a claim that when this type of fish is
pickled in a 7% brine solution, the mean ppm Fe is less than 60 ppm?
Solution
To answer this question, we need to test the hypotheses:
H0: μy|7.0 = 60 ppm     vs     HA: μy|7.0 < 60 ppm
The value of the standardized test statistic for these hypotheses can be calculated from numbers already
obtained in Example 8, just above. We get
t = (ŷ(7.0) - μ0)/sŷ(7.0) = (58.4883 - 60)/0.8980 = -1.683
The hypotheses here require a left-tailed test: we can reject H0 at a level of significance of 0.05 if the
standardized test statistic has a value less than -t0.05,13 = -1.771. Since -1.683 is not quite less than -1.771,
we cannot reject H0 here at a level of significance of 0.05, and so the original claim is not supported at that
level of significance. (From Excel's TDIST() function, we get that the p-value is 0.05808 here.)
On the other hand, we use the word prediction in statistics to refer to the attempt to forecast the actual
values that may be observed for a random variable. So, whereas the word estimation has been used when
we speak of estimating the value of a population parameter such as a mean value (for instance, in this
discussion, the mean value of y for a specific value of x), we use the word prediction to speak of
forecasting what value of y may result when an individual observation, yp(x*), is made for a specific value x* of x. Of course, a single number result here is not too meaningful -- what is more useful is an "interval
estimate" type of result for this value.
Again, the best "predictor" of the value of y that we are likely to observe is ŷ( x *) , which is a point estimator
of the mean,  y x * , of all of the y-values that could be observed for x = x*. A measure of the uncertainty in
such a prediction is the variance of the error, Var[y - ŷ( x *) ]. But by results quoted much earlier in the
course, we can write
Var[y - ŷ(x*)] = Var[y] + Var[ŷ(x*)] = σ² + σ²[ 1/n + (x* - x̄)²/SSx ] = σ²[ 1 + 1/n + (x* - x̄)²/SSx ]        (LR - 17)
where the variance, σ², of y would in practice be estimated by the value of s² = MSE. Thus, the variance of the individual values of y is given by a formula very similar to (LR - 14), the variance for ŷ(x*) -- the only difference is the additional term "1" in the square brackets. For values of x* near x̄, this can be a relatively large contribution to the value of the expression in square brackets.
Finally, since ŷ(x*) is essentially t-distributed with n - 2 degrees of freedom under the conditions normally adopted in linear regression work, we can write:
yp(x*) = ŷ(x*) ± tα/2,n-2 · s·√( 1 + 1/n + (x* - x̄)²/SSx )        @ 100(1 - α)%        (LR - 18)
This formula means that there is a probability of 100(1 - α)% that the next value of y observed when x = x* will lie between the numbers given when the + and - signs are used. But this means that the two limits given by (LR - 18) enclose 100(1 - α)% of the possible individual y-values. For this reason, formula (LR - 18) is not normally used to predict the value of a single observation of y, but rather, to compute the interval containing the middle 100(1 - α)% of the possible individual y-values that may be observed.
In fact, the limits of both the estimation interval, (LR - 15), and this prediction interval, (LR - 18), are functions of x, and so it is possible to calculate these limits in each case for a succession of values of x, and plot them on a graph. The upper and lower limits of the estimation and prediction intervals then form pairs of curves above and below the regression line, enclosing regions located symmetrically about the regression line. These regions are called, respectively, the 100(1 - α)% estimation band and the 100(1 - α)% prediction band for the regression line.
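A hedged sketch of both interval formulas in a single Python helper (NumPy and SciPy assumed; the function name is an invention for illustration):

import numpy as np
from scipy import stats

def mean_and_prediction_intervals(x, y, x_star, conf=0.90):
    # Returns the (LR - 15) estimation limits and the (LR - 18) prediction
    # limits for a single value x_star.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    SSx = np.sum(x**2) - np.sum(x)**2 / n
    SSy = np.sum(y**2) - np.sum(y)**2 / n
    SSxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    b1 = SSxy / SSx
    b0 = y.mean() - b1 * x.mean()
    MSE = (SSy - SSxy**2 / SSx) / (n - 2)
    y_hat = b0 + b1 * x_star
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, n - 2)
    h = 1 / n + (x_star - x.mean())**2 / SSx
    half_est = t_crit * np.sqrt(MSE * h)          # half-width of (LR - 15)
    half_pred = t_crit * np.sqrt(MSE * (1 + h))   # half-width of (LR - 18)
    return (y_hat - half_est, y_hat + half_est), (y_hat - half_pred, y_hat + half_pred)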
Example 10: Construct a graph of the 90% estimation and prediction bands based on the data presented in
Example 6 above.
Solution
The calculation of values of the limits given by (LR - 15) and (LR - 18) for a succession of values of x* is a very tedious task to do by hand. Instead, we've set the formulas up in an Excel/97 spreadsheet. The following are just some of the calculated values obtained using α = 0.10:
x*    ŷ(x*)       μy|x* (-)    μy|x* (+)    yp(x*) (-)    yp(x*) (+)
1     47.02428    42.68894     51.35963     39.49585      54.55271
4     52.75631    50.16431     55.34831     46.07795      59.43467
7     58.48834    56.89796     60.07871     52.13135      64.84532
9     62.30969    60.28028     64.33909     55.82891      68.79047

Here μy|x* (+) and μy|x* (-) are the results obtained from formula (LR - 15) using the + and - signs, respectively. Similarly, yp(x*) (+) and yp(x*) (-) are the results obtained from formula (LR - 18) with the + and - signs, respectively.
[Figure: "Iron Content of Pickled Fish" -- Fe (ppm) vs brine (%), showing the fifteen data points, the best fit line, and the 90% estimation and prediction bands.]
The graph above shows the relationship between a variety of items arising in this example. The plotted points are the fifteen data points forming the original scatterplot. The straight line in the middle of the pattern is the actual best fit straight line calculated using formula (LR - 3).
The pair of curves closest to this straight line are the graphs of μy|x* (-) and μy|x* (+) vs x*. You see that these two curves taken as a pair seem to form a band that is narrowest around x = 7, the vicinity of the mean of the x-values for which data was obtained. For a given value of x, the vertical interval between these two curves is the 90% confidence interval estimate for the mean value of y corresponding to that value of x.
The outer two curves are the graphs of yp(x*) (-) and yp(x*) (+). These also form a band which is narrowest in the vicinity of the mean of the x-values for which observations were taken. The vertical interval between these two curves contains 90% of the values of y for that value of x.
Notice that because of the shape of these "bands", we get the most precise estimate of the mean value of y, and of the interval containing the next value of y to be observed, in the vicinity of the mean of the values of x observed, and both of these "estimates" get less precise as you move away to the left or the right. Thus, you should always plan your experiments so that the mean of the observed x-values falls in the vicinity of the x-values for which you may want to calculate μy|x* or yp(x*).
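For completeness, here is a sketch of how a plot like the one in Example 10 could be generated with matplotlib (NumPy, SciPy and matplotlib assumed available; the data are the Example 6 values):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

brine = np.array([3.8, 3.9, 4.3, 4.3, 4.8, 4.9, 7.4, 7.5, 7.9, 9.1, 9.2, 9.3, 9.7, 10.1, 10.2])
fe = np.array([51.8, 52.5, 59.3, 52.8, 55.6, 50.7, 53.9, 54.9, 65.2, 65.8, 62.9, 60.4, 61.3, 66.8, 66.1])

n = len(brine)
SSx = np.sum(brine**2) - np.sum(brine)**2 / n
b1 = (np.sum(brine * fe) - np.sum(brine) * np.sum(fe) / n) / SSx
b0 = fe.mean() - b1 * brine.mean()
MSE = np.sum((fe - (b0 + b1 * brine))**2) / (n - 2)
t_crit = stats.t.ppf(0.95, n - 2)                 # 90% bands, so alpha/2 = 0.05

xs = np.linspace(1, 12, 200)
fit = b0 + b1 * xs
h = 1 / n + (xs - brine.mean())**2 / SSx
half_est = t_crit * np.sqrt(MSE * h)              # estimation band, (LR - 15)
half_pred = t_crit * np.sqrt(MSE * (1 + h))       # prediction band, (LR - 18)

plt.scatter(brine, fe)
plt.plot(xs, fit)
plt.plot(xs, fit - half_est, '--'); plt.plot(xs, fit + half_est, '--')
plt.plot(xs, fit - half_pred, ':'); plt.plot(xs, fit + half_pred, ':')
plt.xlabel('brine (%)'); plt.ylabel('Fe (ppm)')
plt.show()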
Inferences About the Correlation Coefficient
The theory here gets well beyond the study of probability theory that we are able to do in this course. It
involves the so-called bivariate normal distribution, which must be graphed in three-dimensions. Its graph
would be a surface something like an actual brass bell.
The methods below are valid only if the (x, y) data obeys the bivariate normal distribution at least
approximately. There are no simple tests to verify this, but there is a sort of a negative test. If the values of
x by themselves or the values of y by themselves are not approximately normally distributed, then taken
together as pairs of values, (x, y), they cannot be approximately bivariate normally distributed. So, you
could prepare normal probability plots of the x's and the y's separately. If either plot caused serious concern
that the data was not adequately consistent with the normal distribution, then the methods we now describe
are probably not valid.
We developed and illustrated the calculation and rule-of-thumb interpretation of the sample correlation coefficient, r, earlier in the document on characterizing relationships. This sample correlation coefficient, r, is, of course, a random variable, estimating the population correlation coefficient, ρ. It turns out that to test the null hypothesis
H0: ρ = ρ0
we need to consider two distinct situations: one when ρ0 = 0, and the other when ρ0 is nonzero.
(i) Testing H0: ρ = 0
In this case, the quantity
t = r·√(n - 2) / √(1 - r²)                                           (LR - 19)
has the t-distribution with ν = n - 2 degrees of freedom. Calculate this quantity, and apply the t-test rules.
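A minimal sketch of this test in Python (SciPy assumed; the function name and the sample values in the final line are hypothetical):

import numpy as np
from scipy import stats

def corr_t_test(r, n):
    # Test of H0: rho = 0 via (LR - 19); returns t and the two-sided p-value.
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    return t, 2 * stats.t.sf(abs(t), n - 2)

print(corr_t_test(0.82, 15))   # hypothetical r and n, purely for illustration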
Example 11: Does the data given in Example 6 above support a claim that there is a nonzero correlation
between % concentration of the brine and the ppm Fe in the pickled fish?
Solution
This question is asking us to test the hypotheses
H0: ρ = 0     vs     HA: ρ ≠ 0                                       (LR - 20)
where ρ is the population correlation coefficient between the two variables described.
For completeness, we start by displaying the normal probability plots for the x- and y-values here:
[Figure: normal probability plots for the x-values (% concentration of brine) and for the y-values (ppm Fe).]
Neither of these is particularly great. The plot for the x-coordinates shows definite heavy tails, and the plot for the y-values is not all that tightly clustered about a straight line. These would normally be cause for considerable concern. However, because the value of the standardized test statistic will turn out to be so much larger than necessary to reject H0 at a level of significance of 0.05, we can probably safely proceed here -- keeping in mind that our results may be somewhat optimistic.
The values of SSx, SSy and SSxy are given in Example 6, so we can calculate immediately that
r = SSxy/√(SSx·SSy) = 165.1767/√(86.4493 x 472.6133) = 0.8172
Thus, the value of the standardized test statistic is
t = r·√(n - 2)/√(1 - r²) = 0.8172·√(15 - 2)/√(1 - 0.8172²) = 5.112
Since this is a two-tailed test, we reject H0 at α = 0.05 if t > t0.025,13 = 2.160 or if t < -t0.025,13 = -2.160. Since 5.112 is much greater than 2.160, we can comfortably reject H0 at a level of significance of 0.05, concluding that there is a nonzero correlation coefficient between the variables x and y in this example -- a linear relationship does exist. The p-value for this test works out to be 0.0001996, but we may not want to make too much of this precise result because of the concerns over the potential failure of the population to obey a bivariate normal distribution.
Note three things about the method just illustrated:
(i) Testing H0: ρ = 0 is quite a bit more specific a way of evaluating the presence of evidence of a linear relationship than is the "rule of thumb" given in the document on characterizing relationships. For one thing, the hypothesis test takes into account sample size, which we have seen is a factor of considerable importance in statistical work.
(ii) We can't really exploit this procedure to deduce a formula for a confidence interval estimate of ρ based on the t-distribution. Because H0 states that ρ = 0, a confidence interval that allowed the possibility of ρ having one of a range of values about zero would be contradictory. The more general method described just below does allow the derivation of a confidence interval estimate formula for ρ.
(iii) You may have noticed that the value of the standardized test statistic, t, just above looks a bit familiar. In fact, it is precisely the value we got when testing
H0: β1 = 0     vs     HA: β1 ≠ 0                                     (LR - 21)
earlier in Example 6. In fact, in this case, the hypotheses (LR - 20) and (LR - 21) are completely equivalent.
In fact, we can show that the formulas for the two test statistics are identical. Starting with the standardized
test statistic for (LR - 21), we can proceed as follows:
t = b1/sb1 = (SSxy/SSx) / √(MSE/SSx) = (SSxy/SSx) / √( SSE / (SSx(n - 2)) )
But, from formula (LR - 8) earlier
SSE = SSy - SSxy²/SSx = (SSx·SSy - SSxy²)/SSx
Substitute this into the denominator of the previous equation:
t = (SSxy/SSx) / √( (SSx·SSy - SSxy²) / (SSx²(n - 2)) )
  = (SSxy/SSx) / [ ( √( SSx·SSy·(1 - SSxy²/(SSx·SSy)) ) / SSx ) · ( 1/√(n - 2) ) ]
This looks pretty bad, but all we've done is some factoring in the denominator. Now, invert and multiply the
factor involving the square root of n - 2 -- it will appear in the numerator of the result. The SSx in the
denominator of the numerator cancels the SSx in the denominator of the denominator. The small square
root left in the denominator can be moved into the denominator of the numerator, leaving
t = ( SSxy/√(SSx·SSy) ) · √(n - 2) / √( 1 - SSxy²/(SSx·SSy) ) = r·√(n - 2) / √(1 - r²)
since
r = SSxy/√(SSx·SSy)     and so     1 - SSxy²/(SSx·SSy) = 1 - r²
Thus, we've demonstrated algebraically that the values of the test statistics for the two tests are given by
equivalent formulas. It doesn't matter which of these two pairs of hypotheses you test -- you'll get the same
result, and so they both lead to the same interpretation.
(ii) Testing H0: ρ = ρ0 (ρ0 ≠ 0)
The formulas get a bit more complicated here, but not beyond what your mathematical skills should be able to handle. Here, we must use something called the Fisher Transform, given by
V = (1/2)·ln( (1 + r)/(1 - r) )                                      (LR - 22)
where "ln" is the symbol for the natural logarithm. Then it turns out that if the x and y-values obey a bivariate
normal distribution, the quantity (actually, a sample statistic, since its value depends on the sample statistic
r) V is normally distributed with mean and standard deviation given by
V 
1 1   

ln 
2  1   
and
V 
1
n3
(LR - 23)
Thus, to test the hypothesis
H0: ρ = ρ0  (ρ0 ≠ 0)
simply calculate the standardized test statistic
z = [ V - (1/2)·ln( (1 + ρ0)/(1 - ρ0) ) ] / ( 1/√(n - 3) )           (LR - 24)
and draw your conclusion based on the standard z-test rules.
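A sketch of this z-test in Python (NumPy and SciPy assumed; the function name is an invention, and the final two lines simply anticipate the numbers worked out by hand in Example 12 below):

import numpy as np
from scipy import stats

def fisher_z_test(r, n, rho_0):
    # z statistic (LR - 24) for H0: rho = rho_0 (rho_0 nonzero).
    V = 0.5 * np.log((1 + r) / (1 - r))             # (LR - 22); same as np.arctanh(r)
    mu_V = 0.5 * np.log((1 + rho_0) / (1 - rho_0))  # mean under H0, from (LR - 23)
    return (V - mu_V) * np.sqrt(n - 3)

z = fisher_z_test(-0.9001338, 50, -0.8)
print(z, stats.norm.cdf(z))   # z and the left-tailed p-value; compare Example 12 below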
Example 12: Is the data for Nectarine Sizes, given in the summary table earlier in this document, evidence to claim that the population correlation coefficient between the size of nectarines and the crop load is less than -0.8 (that is, that there is a strong negative correlation between these two variables)?
Solution
We are asked to test the hypotheses
H0: ρ = -0.8     vs     HA: ρ < -0.8
The relevant data in the summary table is that the sample was of size n = 50, and the sample correlation coefficient was r = -0.9001338. To calculate the required standardized test statistic, start with formula (LR - 22):
V = (1/2)·ln( (1 + r)/(1 - r) ) = (1/2)·ln( (1 + (-0.9001338))/(1 - (-0.9001338)) ) = -1.4729
Then, using formula (LR - 24),
z = [ V - (1/2)·ln( (1 + ρ0)/(1 - ρ0) ) ] / ( 1/√(n - 3) ) = [ -1.4729 - (1/2)·ln( (1 + (-0.8))/(1 - (-0.8)) ) ] / ( 1/√(50 - 3) ) = -2.566
The hypotheses here require a left-tailed test, so we would reject H0 at a level of significance of 0.05 if the value of z turns out to be less than -z0.05 = -1.645. But, since -2.566 is less than -1.645, we can reject H0 at a level of significance of 0.05, and so conclude that the data does support the conclusion that the correlation coefficient for these two variables is less than -0.8. (The p-value for this test is 0.0051.)
Formulas (LR - 22) and (LR - 23), along with the fact that V is normally distributed, can be exploited to derive a confidence interval formula for the correlation coefficient, ρ. As usual when the normal distribution is involved, we start out with the statement
Pr( -zα/2 ≤ z ≤ zα/2 ) = 1 - α
but now, for z, substitute the generic equivalent of equation (LR - 24):
Pr( -zα/2 ≤ [ V - (1/2)·ln( (1 + ρ)/(1 - ρ) ) ] / ( 1/√(n - 3) ) ≤ zα/2 ) = 1 - α
This means that there is a probability of 1 - α that the following event will occur:
-zα/2 ≤ [ V - (1/2)·ln( (1 + ρ)/(1 - ρ) ) ] / ( 1/√(n - 3) ) ≤ zα/2        (LR - 25)
Now, what we really need is just ρ in the middle of this double inequality. This means we're going to have to rearrange things a bit (including solving a logarithmic equation -- you probably never expected to see one of those again!). It looks bad, but not much more than careful basic algebra is required. For example, take the left-hand inequality:
-zα/2 ≤ [ V - (1/2)·ln( (1 + ρ)/(1 - ρ) ) ] / ( 1/√(n - 3) )               (LR - 26)
First isolate the logarithmic term:
(1/2)·ln( (1 + ρ)/(1 - ρ) ) ≤ V + zα/2 · ( 1/√(n - 3) )
It will be convenient to use some notation to make the equations a bit simpler looking. So, use the symbol θ2 to stand for the right-hand side of this inequality:
θ2 = V + zα/2 · ( 1/√(n - 3) )                                       (LR - 27)
so that now, we can write the preceding inequality as
(1/2)·ln( (1 + ρ)/(1 - ρ) ) ≤ θ2
Now, we must solve for ρ. The procedure is to get the logarithm absolutely by itself on one side:
ln( (1 + ρ)/(1 - ρ) ) ≤ 2θ2
and then convert the equation to exponential form:
(1 + ρ)/(1 - ρ) ≤ e^(2θ2)
The inequality stays as it was, because the ln function is a strictly increasing function. Now we have a relatively simple algebraic inequality for ρ (keep in mind that the right-hand side of this last inequality is just a number). Solving, we get
ρ ≤ ( e^(2θ2) - 1 ) / ( e^(2θ2) + 1 )                                (LR - 28a)
On the other hand, if you start with the right-hand inequality in (LR - 25) and isolate the logarithmic term, you come up with
ln( (1 + ρ)/(1 - ρ) ) ≥ 2θ1 = 2·[ V - zα/2/√(n - 3) ]
(where the expression in the square brackets defines what we mean by θ1 here). Solving this logarithmic inequality for ρ gives
ρ ≥ ( e^(2θ1) - 1 ) / ( e^(2θ1) + 1 )                                (LR - 28b)
Finally, putting (LR - 28a, b) together gives the desired result:
( e^(2θ1) - 1 ) / ( e^(2θ1) + 1 ) ≤ ρ ≤ ( e^(2θ2) - 1 ) / ( e^(2θ2) + 1 )        @ 100(1 - α)%        (LR - 29)
Example 13: Compute the 95% confidence interval estimate of the population correlation coefficient from the NectarineSize data.
Solution
Very briefly, we've already determined that r = -0.9001 to four decimal places, and we know that the sample size is n = 50. From the normal probability tables we get z0.025 = 1.96. In Example 12, we calculated V to be -1.4729. Now, calculate the quantities θ1 and θ2:
θ1 = V - zα/2/√(n - 3) = -1.4729 - 1.96/√(50 - 3) = -1.7588
and
θ2 = V + zα/2/√(n - 3) = -1.4729 + 1.96/√(50 - 3) = -1.1870
Thus, formula (LR - 29) gives
( e^(2(-1.7588)) - 1 ) / ( e^(2(-1.7588)) + 1 ) ≤ ρ ≤ ( e^(2(-1.1870)) - 1 ) / ( e^(2(-1.1870)) + 1 )
or
-0.9424 ≤ ρ ≤ -0.8297        @95%
as the desired result. Thus, there is a probability of 95% that the interval between -0.9424 and -0.8297 contains the true value of the population correlation coefficient, ρ, in this case.
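This interval can be cross-checked in a few lines (a sketch assuming NumPy and SciPy; np.arctanh and np.tanh implement the Fisher transform and its inverse, which is exactly the algebra carried out in formulas (LR - 22) through (LR - 29)):

import numpy as np
from scipy import stats

def rho_confidence_interval(r, n, conf=0.95):
    # Confidence interval for rho; np.tanh inverts the Fisher transform,
    # which is what formula (LR - 29) does algebraically.
    V = np.arctanh(r)                               # (LR - 22)
    half = stats.norm.ppf(1 - (1 - conf) / 2) / np.sqrt(n - 3)
    return np.tanh(V - half), np.tanh(V + half)

print(rho_confidence_interval(-0.9001338, 50))      # about (-0.9424, -0.8297)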
This is as far as we go with linear regression and correlation in this course. Those of you in Food
Technology will be taking a subsequent course that will go deeper into the multiple regression and nonlinear
multiple regression models, which tend to be very useful in research in food technology.