The Standard Deviation
After the middle of the data, the second most important thing is how spread out or variable
it is.
The standard deviation describes the typical distance of a data value from
the mean.
• Name - the Population S.D. is called σ, the Greek letter “sigma.” The
Sample S.D. is called s.
• The Empirical Rule - for symmetric unimodal data:
– About 2/3 of data falls within one standard deviation of the
mean, or between µ − σ and µ + σ.
– About 95% of data falls within two standard deviations of the
mean, or between µ − 2σ and µ + 2σ.
– Almost all data falls within three standard deviations of the
mean, or between µ − 3σ and µ + 3σ.
• σ is the Greek ancestor of our “s” for standard deviation.
• People sometimes talk about the variance, which is the standard deviation squared. It is
another way to package the same info about the spread of data.
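As a rough check of the Empirical Rule, here is a minimal sketch that simulates bell-shaped data and counts how much of it falls within one, two, and three standard deviations of the mean. The simulated normal data, the seed, and the helper name `fraction_within` are assumptions made for illustration; they are not part of the lecture. For normal data the "about 2/3" figure is closer to 68%.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumption: simulated normal data stands in for "symmetric unimodal" data.
data = [random.gauss(0.0, 1.0) for _ in range(20_000)]

mu = statistics.fmean(data)      # mean of the simulated data
sigma = statistics.pstdev(data)  # population s.d. (divides by n)


def fraction_within(values, center, spread, k):
    """Fraction of values lying within k spreads of the center."""
    low, high = center - k * spread, center + k * spread
    return sum(low <= v <= high for v in values) / len(values)


for k in (1, 2, 3):
    print(k, round(fraction_within(data, mu, sigma, k), 3))
```

The printed fractions should land near 2/3, 95%, and almost all, in that order. Data from a strongly skewed distribution would not be expected to match as well.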
Empirical Rule
• 2/3 w/in 1 s.d of mean
• 95% w/in 2 s.d.s of mean
• almost all w/in 3 s.d.s of mean
Ex: College-age women’s heights are roughly symmetric and unimodal, with mean µ =
65.5 inches and a standard deviation of 2.5 inches. So
• 2/3 of women are between 65.5 − 2.5 = 63 in. and 65.5 + 2.5 = 68 in., or
5′3″ and 5′8″
• 95% of women are between 65.5 − 5 = 60.5 in. and 65.5 + 5 = 70.5 in., or
5′0.5″ and 5′10.5″
• almost all women are between 65.5 − 7.5 = 58 in. and 65.5 + 7.5 = 73 in.,
or 4′10″ and 6′1″
• This rule works pretty well even if the data is not unimodal, but it works poorly for highly skewed
data. In general don’t use mean and standard deviation for highly skewed data.
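The arithmetic in the heights example can be sketched in a few lines of Python. The function names `empirical_rule_intervals` and `feet_inches` are hypothetical helpers introduced here; only the mean of 65.5 inches and the s.d. of 2.5 inches come from the example.

```python
def empirical_rule_intervals(mean, sd):
    """Return {k: (low, high)} for k = 1, 2, 3 standard deviations."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


def feet_inches(total_inches):
    """Format a height given in inches as feet and inches."""
    feet, inches = divmod(total_inches, 12)
    return f"{int(feet)} ft {inches:g} in"


# Values from the example: mean 65.5 in., standard deviation 2.5 in.
intervals = empirical_rule_intervals(65.5, 2.5)
labels = {1: "about 2/3", 2: "about 95%", 3: "almost all"}

for k, (low, high) in intervals.items():
    print(f"{labels[k]}: {low} in. to {high} in. "
          f"({feet_inches(low)} to {feet_inches(high)})")
```

Changing the two arguments gives the same three intervals for any other roughly symmetric data set.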
Standard Deviation Formula
1. Each data value minus the mean (distance from mean):
$x_1 - \mu,\; x_2 - \mu,\; x_3 - \mu,\; \ldots,\; x_n - \mu$
2. Square each difference (makes distances positive)
$(x_1 - \mu)^2,\; (x_2 - \mu)^2,\; (x_3 - \mu)^2,\; \ldots,\; (x_n - \mu)^2$
3. Average
$$\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \cdots + (x_n - \mu)^2}{n}$$
4. Take square root
$$\sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \cdots + (x_n - \mu)^2}{n}}$$
Calculating Standard Deviation
1. Data value minus mean.
2. Square each difference.
3. Average.
4. Square root.
Ex: The mean of 0 and 4 is µ = 2
$x_i$              0    4
$x_i - \mu$       −2    2
$(x_i - \mu)^2$    4    4
$$\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2}{2} = \frac{4 + 4}{2} = 4$$
$$\sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2}{2}} = \sqrt{4} = 2.$$
• This is about as hard a calculation of standard deviation as I will expect you to do (we’ll
learn how to compute all these quantities in Excel). But the formula is helpful to understand
how s.d. works.
• So the typical distance of 0 and 4 from their mean (2) is 2. That makes sense.
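The four steps can also be written out as a short Python function. This is a minimal sketch, not the Excel method the lecture refers to; the function name `population_sd` is made up for this example.

```python
import math


def population_sd(values):
    """Population standard deviation, following the four steps."""
    n = len(values)
    mu = sum(values) / n                     # the mean
    differences = [x - mu for x in values]   # step 1: data value minus mean
    squares = [d ** 2 for d in differences]  # step 2: square each difference
    average = sum(squares) / n               # step 3: average (divide by n)
    return math.sqrt(average)                # step 4: square root


print(population_sd([0, 4]))
```

Run on the data 0 and 4, it reproduces the hand calculation of σ = 2.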
Sample Standard Deviation
The formula for sample standard deviation is almost the same, but not quite.
Of course you use sample mean instead of population, but also
$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$
you divide by n − 1 instead of n.
This turns out to make the sample s.d. do a better job of
approximating the population s.d. But if you calculate both
with the same data the sample s.d. will be slightly larger.
Whenever someone says the standard deviation they mean the sample s.d.
The standard deviation is sensitive to outliers.
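To see the effect of dividing by n − 1 instead of n, here is a small sketch that computes both versions on the same data. The data values are invented for illustration and `sample_sd` is a hypothetical helper; `statistics.stdev` and `statistics.pstdev` are the Python standard library's sample and population versions.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation: divide the sum of squares by n - 1."""
    n = len(values)
    x_bar = sum(values) / n                       # sample mean
    squares = [(x - x_bar) ** 2 for x in values]  # squared distances from mean
    return math.sqrt(sum(squares) / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, mean 5

print("sample s.d.     :", sample_sd(data))
print("library sample  :", statistics.stdev(data))
print("population s.d. :", statistics.pstdev(data))
```

On the same data the sample version comes out slightly larger than the population version, and the gap shrinks as n grows.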
• If your data is reasonably symmetric the mean and the standard deviation together give a
nice simple picture of the data (the Empirical Rule tells you how to interpret this). If your
data is skewed, you should not use the mean and standard deviation, but instead should use
the median and ... you’ll find out what next lecture!
• When you calculate with real data, it is almost always the sample standard deviation. If you thought
there was only one standard deviation given by the formula with n − 1 in it you would never
run into any practical problems. But this sample s.d. is just a rough approximation to the
population standard deviation, which is a Platonic ideal. When we speak about what the
standard deviation means, such as in the Empirical Rule, we are always speaking about the
population standard deviation.
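The sensitivity to outliers mentioned above can be seen with a small experiment. The numbers are invented for illustration; the sketch compares the sample s.d. and the median before and after adding one extreme value.

```python
import statistics

typical = [62, 64, 65, 66, 68]  # made-up values clustered around 65
with_outlier = typical + [100]  # same values plus one extreme value

sd_before = statistics.stdev(typical)
sd_after = statistics.stdev(with_outlier)

median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)

print("s.d.  :", round(sd_before, 2), "->", round(sd_after, 2))
print("median:", median_before, "->", median_after)
```

A single extreme value inflates the standard deviation several times over while the median barely moves, which is one reason the mean and s.d. are a poor summary for skewed data.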
Calculating Sample Standard Deviation
1. Data value minus mean.
2. Square each difference.
3. Divide sum by n − 1
4. Square root.
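Ex: The mean of 0 and 4 is $\bar{x} = 2$
$x_i$                  0    4
$x_i - \bar{x}$       −2    2
$(x_i - \bar{x})^2$    4    4
$$\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2}{2 - 1} = \frac{4 + 4}{1} = 8$$
$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2}{2 - 1}} = \sqrt{8} = 2\sqrt{2} \approx 2.83.$$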
Lecture 7 Key Points
After this lecture you should be able to
• Apply the Empirical Rule given the mean and s.d., and have a
rough sense of what the s.d. tells you based on the Empirical Rule and
the interpretation as the typical distance from the mean.
• Very roughly estimate the standard deviation from a histogram.
• Know not to use the s.d. to summarize a skewed distribution.
After processing this lecture you should be able to
• Understand that there is a difference between the population and sample
standard deviation
• Use the standard deviation formula to understand the properties of the
standard deviation and to compute it by hand for very small data sets.
• Calculate population and sample standard deviation in Excel.