Degrees of freedom
These notes discuss the concept of degrees of freedom. This concept arose in a strictly
mathematical context relating to the freedom of points in n-dimensional space to vary within
lower dimensional subspaces of the full space. In this regard, degrees of freedom are
identified with dimension.
For ordinary mortals, less terrifying expositions are required. Here, the idea is introduced in the
context of estimating a population or process standard deviation. Topics discussed include:
Estimating σ from a set of measurements
  - bias due to constraint on the deviations on which the estimate is based
  - reduction in bias
Degrees of freedom for
  - 2 measurements
  - 3 measurements
  - n measurements
  - without and with constraints on their freedom to vary
Degrees of freedom and number of linear parameters to be estimated before estimating σ
  - simple linear regression
  - multiple linear regression
Degrees of freedom and the Analysis of Variance
Minitab knows the rules!
Estimating σ
When calculating the standard deviation of a set of n numbers, say X1, X2, . . . Xn, the formula
s_X = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 }

is used.
This is frequently referred to as the population standard deviation formula because it is the
formula that would be used if the set of numbers available were measurements on a complete
population of n individuals, for example, in a census, where n might be millions. In that case, X̄
is the process or population mean that is otherwise denoted by μ, and the formula above may
be denoted by σ.
Bias in estimating σ
When estimating a process or population standard deviation, σ, from a sample of n
measurements, say X1, X2, . . . , Xn, it is conventional to use a slight variation on the formula
above in which the divisor, n, is replaced by n – 1. This is because the deviations Xi – X̄ of
individual measurements from the sample mean tend to be smaller than the deviations of the Xi
from μ, the population or process mean, which are the deviations we would ideally use. Indeed,
X̄ is the least squares estimator of μ, so it is as close to the n sample values, in the least squares
sense, as any one number can be. The result is that the formula with divisor n is smaller than it
really should be, that is, it is biased downwards.
Dividing by n – 1 instead of n helps¹ counteract this downward bias in the estimate of σ. The
resulting estimate of the population or process standard deviation, σ, is

\hat{\sigma} = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 } .

This estimate for σ is used by popular convention.

¹ The adjustment does not work perfectly in estimating σ. It does work perfectly in estimating σ², in the
sense that the so-called expected value of σ̂² is σ². The expected value of σ̂² is the mean of the
sampling distribution of σ̂², that is, the long run average of values of σ̂² resulting from indefinitely
repeated sampling of the same population or process. Because the average of the square roots of values
of σ̂² is not the same as the square root of the average of the values of σ̂², the expected value of σ̂ is
not equal to σ, meaning that σ̂ is biased as an estimate of σ. It is possible to make a further adjustment
to remove the remaining small bias, but this adjustment is rather complicated and not worth the effort.
Some mathematical statisticians make a big virtue out of the unbiasedness of σ̂² as an estimate of σ²,
ignoring the fact that it is σ, rather than σ², that is meaningful in most applications.
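As a quick numerical illustration of the two divisors above (a minimal sketch, assuming NumPy is available; the measurements are hypothetical):

import numpy as np

# Hypothetical measurements
x = np.array([4.1, 5.3, 4.8, 5.0, 4.6])
n = len(x)
ss_dev = np.sum((x - x.mean())**2)          # sum of squared deviations from the sample mean

print(np.sqrt(ss_dev / n), np.std(x))                # divisor n; np.std uses ddof=0 by default
print(np.sqrt(ss_dev / (n - 1)), np.std(x, ddof=1))  # divisor n - 1, the conventional estimate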
Degrees of freedom for estimating σ
In mathematical language, n – 1 is referred to as the number of degrees of freedom associated
with the n deviations, X1 – X̄, X2 – X̄, . . . , Xn – X̄, which are the basic building blocks from which
the standard deviation formula is constructed. This terminology arises from the fact that the
sum of the deviations is 0,
\sum_{i=1}^{n} (X_i - \bar{X}) = 0 .
This equation constrains the deviations Xi – X̄ such that, if values are given for any n – 1 of the
deviations, the value of the remaining deviation is automatically determined; it is minus the sum
of the given n – 1 values. For example,
n 1
Xn  X  
 ( X  X) .
i
i 1
Because one of the n deviations is determined by the other n – 1 deviations in this way, the n
deviations are said to have lost one degree of freedom, so that the n deviations have only n – 1
degrees of freedom.
Explaining degrees of freedom in special cases, n = 2, 3, n
This rather abstract account may be illustrated more concretely as follows.
Consider an arbitrary pair of variables, say
(X1, X2),
where each variable is free to vary independently of the other, that is, each variable may
assume any value, irrespective of the value of the other. Such a pair may be said to have two
degrees of freedom. However, if the pair is required to satisfy the equation
X1 + X2 = 0,
then specifying a value for one of the variables automatically determines the value for the other
variable. If X1 = 2, then X2 must be –2. In that case, the pair is said to have lost one degree of
freedom and so has just one degree of freedom instead of two.
The triple of variables
(X1, X2, X3),
is thought of as having three degrees of freedom. However, if the three variables also satisfy
the equation
X1 + X2 + X3 = 0,
then the triple has just two degrees of freedom, having lost one degree of freedom because of
the constraint. If X1 = 2 and X2 = 4, then X3 must be – 6;
X3 = – ( X1 + X2 ).
More generally, if
\sum_{i=1}^{n} X_i = 0,

then

X_n = -\sum_{i=1}^{n-1} X_i .
The n variables, constrained by having to sum to 0, thus lose one degree of freedom and so
have n – 1 degrees of freedom rather than the n degrees of freedom that they would have if they
were unconstrained.
Replacing the variables X1, X2, . . . , Xn in the last paragraph by the deviations
X1 – X̄, X2 – X̄, . . . , Xn – X̄
and noting that the deviations do sum to 0, we conclude that the deviations lose one degree of
freedom due to the constraint of summing to 0 and so have n – 1 degrees of freedom to vary.
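A small numerical check of this constraint (a sketch, assuming NumPy; the data are hypothetical): the deviations from the sample mean sum to zero, so the last deviation is fixed once the other n – 1 are known.

import numpy as np

x = np.array([2.0, 4.0, 7.0, 1.0, 6.0])   # hypothetical data
dev = x - x.mean()                        # deviations from the sample mean

print(dev.sum())                          # 0, up to rounding error
print(dev[-1], -dev[:-1].sum())           # last deviation equals minus the sum of the others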
Degrees of freedom for estimating σ
As noted earlier, given a sample of n measurements X1, X2, . . . , Xn from a population or
process, we would like to be able to estimate σ by
\sqrt{ \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2 } .
However, since typically we do not know μ, we estimate μ in this formula by X̄, replace n by
n – 1 to adjust for the bias introduced by using X̄ and thus end up with the estimate
\hat{\sigma} = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 }
for . Using the mathematical language informally, we say that we lose one degree of freedom
in estimating  by X and have n – 1 degrees of freedom remaining on which to base an
estimate of , that is we have the n – 1 degrees of freedom associated with the n deviations
X1 – X , X2 – X , . . . , X n – X

on which to base an estimate of .
Simple linear regression
In simple linear regression, the estimate of σ is based on the residuals

e_i = Y_i - \hat{\alpha} - \hat{\beta} X_i ,   1 ≤ i ≤ n.

The actual formula used is

\hat{\sigma} = \sqrt{ \frac{\sum_{i=1}^{n} e_i^2}{n-2} } .
Note that ē does not occur in the sum of squares part of the formula, as it would in a normal
standard deviation formula. This is because the sum of the residuals is 0 and so ē = 0.
Bias
Ideally, the actual values of α and β would be used in calculating the residuals, in which case
the divisor used would be n. However, it is their least squares estimates, α̂ and β̂, that are
used. Because these are chosen so that the fitted values, α̂ + β̂Xi, are as close as possible to
the observed values, Yi, using them means that σ̂ is biased downwards. The divisor n – 2 is
used to counteract this bias.
We may think of this as using two degrees of freedom to estimate the two parameters α and β in
the ideal residuals, Yi – α – βXi, leading to the least squares residuals, ei = Yi – α̂ – β̂Xi.
Thus, we end up having n – 2 degrees of freedom on which to base an estimate of σ.
"Losing" degrees of freedom
More technically, the mathematical derivation leading to the formulas for α̂ and β̂ includes two
equations involving the residuals: Σei = 0 and ΣXiei = 0. The residuals can, in principle, take
on any values. However, the first equation says that their sum must be 0, so that, once n – 1
of them are assigned values, the last one is determined as minus their sum and thus the last
one is not free to vary; one degree of freedom is lost. Similarly, because of the second
equation, a second degree of freedom is lost. Thus, ultimately, the residuals have n – 2
degrees of freedom.
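These two constraints, and the n – 2 divisor, can be checked numerically (a sketch, assuming NumPy; the x and y values below are made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(x)

b, a = np.polyfit(x, y, 1)                # least squares slope b and intercept a
e = y - (a + b * x)                       # residuals

print(e.sum(), (x * e).sum())             # both (near) zero: two degrees of freedom lost
print(np.sqrt(np.sum(e**2) / (n - 2)))    # sigma-hat based on n - 2 degrees of freedom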
Multiple linear regression
This account extends naturally to multiple linear regression. The mathematical derivation of the
regression coefficient estimates involves the solution of algebraic equations involving the
residuals, as many equations as there are regression coefficients. The fact that the residuals
satisfy these equations places a corresponding number of constraints on the residuals.
Mathematically, they are not free to vary; they have lost as many degrees of freedom as there
are regression coefficients, one for each equation. This is reflected in the formula
\hat{\sigma} = \sqrt{ \frac{\sum_{i=1}^{n} e_i^2}{n-p} } .

Using least squares estimates of the p regression coefficients means that the residuals are
biased downwards in magnitude and so, therefore, is σ̂. Using n – p as the divisor in the formula
for σ̂ adjusts for this bias.
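The same point can be sketched for several coefficients (assuming NumPy; the design matrix and responses below are simulated purely for illustration): the residuals satisfy one constraint equation per fitted coefficient, and σ̂ uses the divisor n – p.

import numpy as np

rng = np.random.default_rng(2)
n, p = 12, 3                                                  # intercept plus two slopes
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])    # design matrix with p columns
y = 1.0 + 2.0 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.4, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)              # least squares coefficients
e = y - X @ beta_hat                                          # residuals

print(X.T @ e)                            # p constraint equations, all (near) zero
print(np.sqrt(np.sum(e**2) / (n - p)))    # sigma-hat based on n - p degrees of freedom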
Degrees of freedom in the analysis of variance in regression
Standard computer output of the results of a regression analysis invariably includes an Analysis
of Variance table. The main component of this table that may be useful in regression is the
value of the F statistic for testing the hypothesis that all the regression coefficients, excluding
the intercept, are zero. This hypothesis means that the explanatory variables, the X's, actually
explain nothing regarding variation in the response variable, Y. The sampling distribution of the
F statistic depends on the numbers of degrees of freedom associated with its numerator and its
denominator. These numbers of degrees of freedom relate to the fitted values and the residuals,
respectively.
To illustrate, consider the analysis of variance table resulting from the first fit regression in the
Jobtimes case study:
Analysis of Variance

Source            DF        SS        MS        F      P
Regression         4    756055    189014   134.69  0.000
Residual Error    15     21050      1403
Total             19    777105
Degrees of freedom and the F test statistic
The calculated value of F, 134.69, is highly statistically significant, since the corresponding p-value
is 0 to 3 decimal places. Alternatively, the 5% critical value for F with 4 and 15 degrees of
freedom (the degrees of freedom being given in the DF column) is F_{4,15; 0.05} = 3.1, considerably
exceeded by the calculated value of 134.69.
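Both figures are easy to reproduce (a minimal sketch, assuming SciPy is available):

from scipy import stats

f_crit = stats.f.ppf(0.95, dfn=4, dfd=15)     # 5% critical value, approximately 3.06
p_value = stats.f.sf(134.69, dfn=4, dfd=15)   # upper-tail p-value for the observed F
print(round(f_crit, 2), p_value)              # the p-value is far below 0.0005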
Degrees of freedom for Residual Error
The number of degrees of freedom for "Residual Error" is 15 because the 20 residuals each
incorporate estimates of the 5 regression coefficients, so that 5 degrees of freedom are lost due
to estimation of the regression coefficients, leaving 20 – 5 = 15 degrees of freedom in the
residuals for the estimation of σ.
Note that the sum of squares of the residuals, a crude measure of the variation in the residuals,
is given in the SS column as 21,050. Dividing this by 15, the corresponding number of degrees
of freedom, gives the value of the "Mean Square of Residual", given in the MS column as 1,403.
This is the estimate of σ², based on these data. Note that the square root of this is the value of
s, correct to 2 decimal places, that is, σ̂.
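As a small arithmetic check, using only the values quoted from the table above:

ss_residual, df_residual = 21050, 15
ms_residual = ss_residual / df_residual    # about 1403, the Residual Error mean square
sigma_hat = ms_residual ** 0.5             # about 37.5, the value reported as s
print(round(ms_residual), round(sigma_hat, 2))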
Degrees of freedom for Regression
The number of degrees of freedom for "Regression" is more subtle. The sum of squares for
regression, 756,055 as given in the SS column, is actually the sum of the squares of the
deviations of the fitted values, Ŷi, from their mean, which equals Ȳ. Given the values of the X
variables, the n = 20 fitted values are determined by the values of the estimates of the five
regression coefficients. This means that the n = 20 fitted values have five degrees of freedom;
they have as much freedom to vary as have the five regression coefficients. Because the sum
of squares involves the deviations of the fitted values from their mean, and these deviations
necessarily sum to 0, the deviations have one less degree of freedom, that is, 5 – 1 = 4 in this
case.
Note that the sum of the degrees of freedom for Regression and the degrees of freedom for
Residual Error sum to the Total degrees of freedom, 4 + 15 = 19. The Total number of degrees
of freedom corresponds to the deviations of the observed values, Yi, from their mean, Ȳ, and so
equals 20 – 1 = 19. The corresponding sum of squares, 777,105 in the SS column of the table, is
the sum of squares of these deviations. Note that this sum of squares is the sum of the other
two.
Basis for the analysis of variance
This last equation is the basis for the analysis of variance. Recall from Lecture 2.2, Slide 49:
Regression Sum of Squares measures explained variation.
Residual Sum of Squares measures unexplained (chance) variation.

Total Variation = Explained + Unexplained.
In terms of sum of squares formulas,
\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 .
Degrees of freedom follow a corresponding equation:
n - 1 = (p - 1) + (n - p),
where p is the number of regression coefficients (including the intercept).
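Both decompositions can be verified numerically (a sketch, assuming NumPy; the data below are simulated purely for illustration):

import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3                                   # p coefficients: intercept plus two slopes
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.8, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

sst = np.sum((y - y.mean())**2)                # total sum of squares
ssr = np.sum((fitted - y.mean())**2)           # regression (explained) sum of squares
sse = np.sum((y - fitted)**2)                  # residual (unexplained) sum of squares

print(np.isclose(sst, ssr + sse))              # True: total = explained + unexplained
print(n - 1, (p - 1) + (n - p))                # 19 and 2 + 17: the matching degrees of freedom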
The software knows the rules
Fortunately, there is no need to memorise all this detail; awareness of the ideas is enough.
Virtually all statistical software computes the appropriate numbers of degrees of freedom and
displays them in an analysis of variance table. In many cases, the software will indicate
explicitly the number of degrees of freedom associated with s = σ̂, that is, the residual degrees
of freedom. Minitab does not do this, so the number of residual degrees of freedom must be
read from the Analysis of Variance table.