Chapter 4: Variability
Variability provides a quantitative measure of the degree to which scores in a distribution are
spread out or clustered together.
The range (either maximum score – minimum score or maximum score – minimum score + 1) is
one gross measure of variability (that relies on only two scores from the distribution). The
semi-interquartile range is half of the interquartile range (Q3 – Q1), so (Q3 – Q1) / 2. We won’t really
make much use of either of these measures throughout the semester.
Let’s think about the type of measure that we’d like to use to assess variability. First of all, we’d
probably like every score to contribute to the measure. And one very reasonable approach would
be to assess the difference (or distance) of every score from the mean. If every score were near
the mean, then those distances would be small, so the measure of variability would be small as
well. If scores were far from the mean, then those distances would also be large, resulting in a
large measure of variability. Sound good so far?
OK, then we could simply sum all the differences between the scores in our distribution and the
mean of the distribution. But we already know what will happen then, right?
X        X - $\mu$
1
2
3
Sum
The problem is that the signs of the differences cancel out one another, resulting in zero. So, how
to get rid of the signs? One approach would be to take absolute values of the differences [ $\sum |X - \mu|$ ].
X        |X - $\mu$|
1
2
3
Sum
That’s actually a pretty good measure of variability, except that it will typically be larger with
larger data sets. A more reasonable measure of variability would correct for the number of scores
(as we do for the mean). So, then a very reasonable measure of variability (the mean absolute
deviation) would be:
$$MAD = \frac{\sum |X - \mu|}{N}$$
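To see both ideas in action, here is a minimal Python sketch using the three scores from the tables above (1, 2, 3). The variable names are just illustrative choices, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

scores = [1, 2, 3]                  # the small data set from the tables above
N = len(scores)
mu = Fraction(sum(scores), N)       # population mean, kept exact

# Signed differences from the mean: the negatives and positives cancel.
deviations = [x - mu for x in scores]
sum_of_deviations = sum(deviations)

# Absolute differences remove the signs, and dividing by N corrects for set size.
sum_abs_deviations = sum(abs(d) for d in deviations)
mad = sum_abs_deviations / N        # mean absolute deviation

print(sum_of_deviations)    # 0
print(sum_abs_deviations)   # 2
print(mad)                  # 2/3
```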
Although reasonable, we won’t actually use this measure. Instead, we’ll take a different approach
to removing those pesky signs…we’ll square all the differences. That measure is called the sum
of squared differences from the mean (SS), and would be computed as follows:
$$SS = \sum (X - \mu)^2$$

X        (X - $\mu$)^2
1
2
3
Sum
Once again, however, the SS isn’t a good measure of variability because it will typically be larger
for larger data sets. So, once again, we’d be inclined to divide the SS by the number of scores in
the data set. That measure of variability is one that we will use, and it’s called the variance (here
for the population).
$$\sigma^2 = \frac{SS}{N} = \frac{\sum (X - \mu)^2}{N}$$
The variance is a nifty measure, but it has the problem that the units involved in the measure
(seconds, inches, etc.) are now squared. In order to rectify that problem, we need a new measure
of variability that returns to the original units of measurement, which is the standard deviation
($\sigma$). To compute the standard deviation, you simply take the square root of the variance.
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{SS}{N}} = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$
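The chain from SS to variance to standard deviation can be sketched in a few lines of Python. This is only an illustration under the population formulas above (dividing by N). The function names are hypothetical helpers, not part of any library.

```python
import math

def sum_of_squares(scores):
    """Definitional SS: sum of squared differences from the mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)

def population_variance(scores):
    """Population variance: SS divided by the number of scores N."""
    return sum_of_squares(scores) / len(scores)

def population_sd(scores):
    """Population standard deviation: square root of the variance."""
    return math.sqrt(population_variance(scores))

scores = [1, 2, 3]
print(sum_of_squares(scores))       # 2.0
print(population_variance(scores))  # about 0.667 (that is, 2/3)
print(population_sd(scores))        # about 0.816
```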
So, there you have it. That’s the logic that underlies the measures of variability that we’ll use
most often. Now, we’ll have to expand on these measures just a bit as we develop an easier way
to compute the SS on acalculator and develop the measures of variability that we’d use to
estimate population variability from a sample.
The computational formula for SS
Computer programs for statistical analyses actually use the definitional formula for computing
SS. However, when using a calculator (and a small data set), it’s better to use the computational
formula for SS:
$$SS = \sum X^2 - \frac{(\sum X)^2}{n}$$
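The computational formula is just the definitional formula rearranged. As a short check, writing the mean as $\mu = \sum X / n$ (the same algebra works with a sample mean), the squared differences expand like this:

```latex
\begin{align*}
SS &= \sum (X - \mu)^2 \\
   &= \sum X^2 - 2\mu \sum X + n\mu^2 \\
   &= \sum X^2 - 2\,\frac{(\sum X)^2}{n} + \frac{(\sum X)^2}{n} \\
   &= \sum X^2 - \frac{(\sum X)^2}{n}
\end{align*}
```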
So, for this simple data set, compute SS.

X        X^2
1
2
3
Sum

SS =
Note that this is the only computational formula for SS. It does not differ when computing the SS
for a population or a sample.
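If you would like to check your calculator work, the following Python sketch computes SS both ways for the same three scores. The function names are illustrative only. The point is simply that the two formulas agree.

```python
def ss_definitional(scores):
    """SS as the sum of squared differences from the mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)

def ss_computational(scores):
    """SS as (sum of X squared) minus (sum of X) squared over n."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x_squared = sum(x ** 2 for x in scores)
    return sum_x_squared - sum_x ** 2 / n

scores = [1, 2, 3]
print(ss_definitional(scores))   # 2.0
print(ss_computational(scores))  # 2.0  (14 - 36/3)
```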
Describing population variance and Estimating population variance from a sample
The formulas for describing the variance and standard deviation of a population are
straightforward (as illustrated above). However, when we are dealing with sample data (as is
typically the case), and we’re interested in estimating the population variability, then the
formulas are slightly different.
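Describing Population Variability

SS: $SS = \sum X^2 - \frac{(\sum X)^2}{N}$
Variance: $\sigma^2 = \frac{SS}{N} = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}$
Standard Deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{SS}{N}} = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}}$

Estimating Population Variability

SS: $SS = \sum X^2 - \frac{(\sum X)^2}{n}$
Variance: $s^2 = \hat{\sigma}^2 = \frac{SS}{n - 1} = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}$
Standard Deviation: $s = \hat{\sigma} = \sqrt{s^2} = \sqrt{\frac{SS}{n - 1}} = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}}$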
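Python's standard `statistics` module draws the same distinction, which makes for a handy check. In this sketch, `pvariance` and `pstdev` divide SS by N (describing a population), while `variance` and `stdev` divide by n - 1 (estimating from a sample). Treating the scores 1, 2, 3 as both a population and a sample is purely for illustration.

```python
import math
import statistics

scores = [1, 2, 3]   # SS = 2 for these scores

# Describing a population: SS / N
pop_var = statistics.pvariance(scores)
pop_sd = statistics.pstdev(scores)

# Estimating population variability from a sample: SS / (n - 1)
sample_var = statistics.variance(scores)
sample_sd = statistics.stdev(scores)

print(pop_var, pop_sd)        # about 0.667 and 0.816
print(sample_var, sample_sd)  # 1 and 1 (possibly shown as 1.0)
```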
Why are the degrees of freedom (n-1) for sample variance?
When one wants to estimate the population variance ($\sigma^2$) from a sample, one divides the
sample sum of squares (SS) by the appropriate degrees of freedom (df). I want to demonstrate
here why the df needs to be (n - 1).
Suppose that one has a simple population, with only three members (1, 2, 3). From the
computations above, you should know the population mean ($\mu$) and variance ($\sigma^2$). [Remember,
this is the population, not a sample, so one would use descriptive statistics. For the variance, just
divide the SS by N, not (N - 1).] Write those values below:
$\mu$ =            $\sigma^2$ =
Next we want to get a sense of all possible samples of two scores we might take from this
population. We will sample with replacement, which means that it is possible to get the same
score twice (like picking it out of the hat, putting it back in, and then picking it out again). What
are the nine possible combinations of scores we could get from this population?
Now, for each sample, compute the sample mean ($\bar{X}$) and the SS.
Then divide the SS by n (n = 2 in this instance) for one estimate of the population variance. Then
divide SS by n - 1 (or 1 in this instance) as a second estimate of the population variance. Finally,
compute the mean for each column of statistics.
Sample        Mean        SS        SS / n        SS / (n - 1)

Mean ->
When you compute the statistic for every possible sample which can be taken from a
distribution, and then create a frequency distribution, you have created a sampling distribution.
So the means of all 9 samples form a sampling distribution of the mean. [This is a very
important distribution, to which we will return shortly.] The mean of the sampling distribution
should equal the population parameter that one wants to estimate. So, if you wanted to estimate
the parameter $\mu$, the population mean, the sample mean would be an accurate statistic, because
the average (mean) sample mean is exactly equal to the population mean. Check your mean of all
9 sample means. Does it equal $\mu$? Can you see why the sample mean is, on average, a good
estimate of the population mean? Can you also see that the sample mean does not always equal
the population mean? (That’s sampling error.)
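For anyone who wants to verify the hand calculations, here is a small Python sketch that lists all nine samples and builds the sampling distribution of the mean. It assumes ordered samples of size 2 drawn with replacement, as described above. The variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [1, 2, 3]
n = 2  # sample size

# All ordered samples of size n drawn with replacement: 3 * 3 = 9 samples.
samples = list(product(population, repeat=n))
sample_means = [Fraction(sum(s), n) for s in samples]

# The frequency distribution of those means is the sampling distribution of the mean.
sampling_distribution = Counter(sample_means)

mu = Fraction(sum(population), len(population))
mean_of_means = sum(sample_means) / len(sample_means)

for value in sorted(sampling_distribution):
    print(value, sampling_distribution[value])   # each mean and how often it occurs
print(mean_of_means == mu)                       # True
```

Individual sample means range from 1 to 3, so any single sample can miss $\mu$ even though the average of all nine means hits it exactly.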
Here are a couple of other questions related to the sampling distribution of the mean.
What proportion of the population scores fell at the mean ($\mu$)? What proportion of means in the
sampling distribution fell at the mean ($\mu$)? Can you explain why this happens?
So, now the question is, “Using which df in the sample variance yields the best estimate
of the population variance?” Looking at the means of the two columns should convince you that
using df = n produces an underestimation of the population variance, while using df = n - 1 yields
an accurate estimation of the population variance. So you should be convinced that it is
important to use n - 1 as the df when computing the sample variance to estimate the population
variance.
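The whole demonstration can also be run as a short Python sketch. It assumes the same setup as above (population 1, 2, 3 and all nine ordered samples of size 2 drawn with replacement). The helper `ss` is a hypothetical name, and `Fraction` keeps the averages exact.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]
N = len(population)
mu = Fraction(sum(population), N)
sigma_squared = sum((x - mu) ** 2 for x in population) / N   # population SS / N

def ss(sample):
    """Sum of squared differences from the sample's own mean."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample)

n = 2
samples = list(product(population, repeat=n))   # the nine possible samples

# Average each candidate estimate over every possible sample.
mean_ss_over_n = sum(ss(s) / n for s in samples) / len(samples)
mean_ss_over_n_minus_1 = sum(ss(s) / (n - 1) for s in samples) / len(samples)

print(sigma_squared)            # 2/3
print(mean_ss_over_n)           # 1/3  (underestimates the population variance)
print(mean_ss_over_n_minus_1)   # 2/3  (matches the population variance)
```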
How is variability affected by constant changes in scores?
What would be the impact on variance of adding a constant to each member of the data set?
What would be the impact on variance of multiplying each member of the data set by a constant?
You could answer these questions logically, but let’s simply compute to determine the answers.
             X        X + 3        X • 3
             1        4            3
             2        5            6
             3        6            9
             4        7            12
             5        8            15
$\bar{X}$
SS
$s^2$
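Once you have filled in the table by hand, a short Python sketch can confirm the pattern. It uses the sample variance (SS divided by n - 1), matching the $s^2$ row. The function name is an illustrative choice.

```python
def sample_variance(scores):
    """Sample variance: SS divided by n - 1."""
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((x - mean) ** 2 for x in scores)
    return ss / (n - 1)

x = [1, 2, 3, 4, 5]
x_plus_3 = [score + 3 for score in x]    # add a constant to every score
x_times_3 = [score * 3 for score in x]   # multiply every score by a constant

print(sample_variance(x))          # 2.5
print(sample_variance(x_plus_3))   # 2.5   (adding a constant changes nothing)
print(sample_variance(x_times_3))  # 22.5  (3 squared times 2.5)
```

Adding a constant shifts every score and the mean by the same amount, so the differences from the mean are untouched. Multiplying by a constant stretches each difference by that constant, so the variance is multiplied by the constant squared and the standard deviation by the constant itself.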