Download Module 7 - Wharton Statistics

STAT 101, Module 7:
The Root-N Law, the Central Limit Theorem,
Standard Errors, and Confidence Intervals
(Book: chapter 7)
Independent and Uncorrelated Random Variables
 Definitions: Two random variables X and Y are called …
o … independent if the events Ai = (X=xi) and Bj = (Y=yj) are
independent for all possible values xi of X and yj of Y.
o … uncorrelated if C(X,Y) = 0.
We say “uncorrelated” even though we use the covariance in the definition.
Maybe that’s because we can’t say “uncovarianced”, or maybe because if σ(X)
and σ(Y) are both >0, then C(X,Y) = 0 ⇔ c(X,Y) = 0.
 Theorem: If X and Y are independent, they are uncorrelated.
The theorem is important because it is often easier to recognize that
two random variables are independent than uncorrelated, even though
independence is a more stringent condition. For example, one
recognizes coin flips immediately as independent.
o “Proof”: C(X,Y) = E((X – µX) (Y – µY)) = E(X – µX) E(Y – µY) = 0· 0 = 0
The second equality is some grinding algebra, but nothing deep.
o The converse is not true! Here is a counterexample of a pair of random
variables that are uncorrelated but not independent:
P(X=1 and Y=2) = P(X=2 and Y=1) =
P(X=2 and Y=3) = P(X=3 and Y=2) = ¼
This can be realized by a game where two dice are thrown repeatedly till
they show a 1-2 pair or a 2-3 pair in any order. The outcome will be one
of (1,2), (2,1), (2,3), (3,2), with equal probability.
To see that the two random variables are not independent, check the
marginal (plain) probabilities:
P(X=1) = P(Y=1) = P(X=3) = P(Y=3) = ¼
P(X=2) = P(Y=2) = ½
=> P(X=1 and Y=2) = ¼
≠ P(X=1)·P(Y=2) = ¼ · ½ = ⅛
Intuitively, the two random variables cannot be independent because if
X=1 we know Y=2, for example.
To see that the two random variables are uncorrelated, calculate their
covariance. Note, however, that E(X) = E(Y) = 2, hence each summand in
C(X,Y) has a factor 0 (each pair has an outcome 2). Therefore C(X,Y) = 0.
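The two checks above can be redone with exact fractions. This is only a sketch of the counterexample: the helper names `marginal` and `covariance` are made up for illustration, and the joint table is the one given in the text.

```python
from fractions import Fraction

# Joint distribution of the counterexample: four outcomes, each with probability 1/4.
joint = {(1, 2): Fraction(1, 4), (2, 1): Fraction(1, 4),
         (2, 3): Fraction(1, 4), (3, 2): Fraction(1, 4)}

def marginal(joint, axis):
    """Marginal probabilities of X (axis=0) or Y (axis=1)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0) + p
    return out

def covariance(joint):
    """C(X,Y) = E((X - E(X)) (Y - E(Y))), computed exactly."""
    ex = sum(x * p for (x, _), p in joint.items())
    ey = sum(y * p for (_, y), p in joint.items())
    return sum((x - ex) * (y - ey) * p for (x, y), p in joint.items())

px, py = marginal(joint, 0), marginal(joint, 1)
cov = covariance(joint)
product = px[1] * py[2]  # what P(X=1 and Y=2) would be under independence

print("C(X,Y) =", cov)
print("P(X=1 and Y=2) =", joint[(1, 2)], "versus P(X=1)*P(Y=2) =", product)
```

The covariance comes out as zero while the joint probability differs from the product of the marginals, which is the whole point of the counterexample.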
o The following is an example of two independent variables:
P(X=1) = P(X=2) = P(X=3) = ⅓
P(Y=1) = P(Y=2) = P(Y=3) = ⅓
P(X=x and Y=y) = P(X=x) · P(Y=y)
Note that for independent variables we only need to specify the marginal
probabilities, and the joint probabilities are obtained by multiplication. In
this example the important thing is not that the probabilities of 1,2,3 are
equal, but that they can be multiplied to obtain the joint probability of all
pairs of values.
This can be realized by a game
where two dice are thrown till
they both show a value of 3 or less.
o Both of the above examples are constructed by “conditioning”. This is
often useful to go from a known situation to a slightly different one:
simply single out the cases that you like and condition on them. In this
case the instructor didn’t want to deal with 6 · 6 = 36 outcomes, which is
why he scaled things down to outcomes 3 or less.
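The conditioning trick can be sketched by listing all 36 outcomes of two fair dice and keeping only the ones we like. The function names `condition` and `is_independent` are hypothetical, and the second event ("both dice show 3 or less") is read off the remark about scaling down to outcomes 3 or less.

```python
from fractions import Fraction
from itertools import product

def condition(event):
    """Joint distribution of two fair dice, conditioned on event(x, y) being true."""
    kept = [(x, y) for x, y in product(range(1, 7), repeat=2) if event(x, y)]
    return {pair: Fraction(1, len(kept)) for pair in kept}

def is_independent(joint):
    """True if every joint probability equals the product of its marginals."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in xs}
    py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in ys}
    return all(joint.get((x, y), 0) == px[x] * py[y] for x in xs for y in ys)

# First example: keep only a 1-2 pair or a 2-3 pair, in any order.
pairs = condition(lambda x, y: {x, y} in ({1, 2}, {2, 3}))
# Second example: keep only throws where both dice show 3 or less.
small = condition(lambda x, y: x <= 3 and y <= 3)

print(len(pairs), "outcomes, independent:", is_independent(pairs))
print(len(small), "outcomes, independent:", is_independent(small))
```

Throwing the dice "till" the event occurs gives the same distribution as conditioning on the event, which is why a plain enumeration is enough here.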
The Root-N Law and the Standard Error
 We are finally able to determine the rate at which relative frequencies
and means grow more precise. It will be a disappointing result,
because the precision gets better only very slooooowwwwwwly…
We consider a possibly long series of random variables with identical
possible outcomes and identical probabilities for these outcomes
(“identically distributed”), and we also assume the variables are
uncorrelated:
X1 , X2 , X3 , X4 , X5 , …, XN
As examples, keep in mind flipping a coin, rolling a die, but also daily
stock market returns (which are surprisingly uncorrelated day to day), or
the monthly credit card bills of a randomly sampled series of households,
measurements of blood glucose in a given patient (the measurements are
slightly different even from the same blood sample due to measurement
error), survival times of cancer patients treated with a new therapy,…
Note that X1 stands for the values of the first case across datasets,
X2 for the values of the second case across datasets,… It therefore makes
sense to talk about the probability distribution of the variable X1, X2,…
The assumption of identical distribution has the consequence that not
only are all the probabilities P(X1=x) = P(X2=x) = P(X3=x) =… the
same for all possible values x, but so are the expected values and
variances and SDs:
E(X1) = E(X2) = E(X3) = … = E(XN) = µ
V(X1) = V(X2) = V(X3) = … = V(XN) = σ²
We think of the whole series as repeatable: Over and over, we could
o flip another N coins,
o roll another N dice,
o look at another series of N daily stock returns,
o another sample of N households and their monthly credit card bills,
o another set of N blood glucose measurements from the same blood sample,
o another clinical trial with N treated patients and their survival times,…
We are now interested in the mean value of these outcomes:
X̄ = (X1 + X2 + X3 + X4 + X5 + … + XN) / N
Because of the assumed repeatability, X̄ is a random variable in its
own right: every repetition would produce a slightly different mean.
Its expected value is obviously E(X̄) = μ, but what is its SD?
 Theorem: If X1 , X2 , …, XN are uncorrelated and identically
distributed with same variance σ², then
V(X̄) = σ² / N
o Proof:
V(X̄) = V(X1 + X2 + … + XN) / N²
= C(X1 + X2 + … + XN , X1 + X2 + … + XN) / N²
= ( V(X1) + V(X2) + … + V(XN) +
… + C(Xi, Xj) + … ) / N²
= ( N σ² ) / N²
= σ² / N
The steps of the proof are as follows: 1) pull out the factor 1/N as 1/N²;
2) expand the variance of the sum into N variances and N(N–1)
covariances; 3) use the fact that all covariances disappear; 4) use the fact
that all variances are the same, σ².
(For those who enjoy math: This is really a giant application of a version
of the theorem of Pythagoras. It is like taking a giant N-dimensional
triangle, or N-angle, really, and doing something like this: hypotenuse² =
(side 1)² + (side 2)² + (side 3)² + … + (side N)², where all sides are of
equal length, so that hypotenuse² = N · (any side)². The quantity we are
examining, though, is hypotenuse² / N² = (any side)² / N, which is σ²/N.)
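The theorem can also be checked by brute force. The following is a minimal simulation sketch, assuming fair die rolls as the observations (population variance 35/12); the function name, the number of simulated datasets, and the seed are arbitrary choices for illustration.

```python
import random
import statistics

def variance_of_means(n, datasets=5000, seed=0):
    """Simulate `datasets` datasets of n fair-die rolls; return the variance of their means."""
    rng = random.Random(seed)
    means = [sum(rng.randint(1, 6) for _ in range(n)) / n
             for _ in range(datasets)]
    return statistics.pvariance(means)

sigma2 = 35 / 12  # population variance of a single fair die roll

# Compare the simulated variance of the mean with sigma^2 / N.
for n in (1, 4, 16, 64):
    print(n, round(variance_of_means(n), 4), round(sigma2 / n, 4))
```

The two printed columns should agree up to simulation noise, and both shrink by a factor of 4 each time N is quadrupled.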
o What is disappointing about this result? It becomes clear once
we reformulate it in terms of standard deviations, which are the
real measure of dispersion:
σ(X̄) = σ / N½
 Definition: σ(X̄) is called the standard error of the mean.
The standard error is a standard deviation but only in a special case: when
describing the variability of an estimate such as a mean across datasets.
 Interpretation: The standard error of the mean is a measure of
dispersion of the mean from dataset to dataset , assuming one could
obtain datasets like the observed one over and over and over…
This mental exercise should give you something to think about. In any given
data analysis, you are looking at one single dataset. You are calculating
one number from a column, its mean (mean household income, say, and
this could be something like $53,128.358). How come we are going to
think that this number is “variable”? It’s one number, right? There are
many households in the sample, but there is only one mean. And how are
we going to pretend we knew something about the “variability” of this one
number?
Well, the mental exercise starts from the realization of repeatability of the
data collection. We could collect other datasets just like the one we have,
at least in principle, and each time the mean would be slightly different.
The miracle of the root-N law for the standard error of the mean happens
by making an assumption that the cases/rows/records were uncorrelated,
which is usually the case when the cases are obtained by independent
sampling or can otherwise be thought of as arising independently of each
other. This is where the math proof gives insight: the “Pythagorean
miracle” happens only because we assume that the individual observations
are uncorrelated (“orthogonal”) to each other.
 Examples:
o Assume we are looking at household surveys of various sample
sizes. To make things concrete, assume the observations are the
household incomes, which may average around $50,000 with a
SD of $30,000. Then:
N = 1: σ(X̄) = σ / 1 (= dispersion of the raw observations)
N = 100: σ(X̄) = σ / 10
N = 10000: σ(X̄) = σ / 100
N = 1000000: σ(X̄) = σ / 1000
Thus the uncertainty in the mean household incomes drops to ±$3,000
(N=100), to ±$300 (N=10,000), to ±$30 (N=1,000,000).
We have a diminishing returns effect! Gaining 10-fold
precision requires 100-fold increases in sample size.
o Your employer conducted a survey of households on a
shoestring budget, and the sample size was just N = 200. The
manager is naturally dissatisfied with the precision of the
estimates of product take-rates, average household income,
average household spending, household preferences,… So
he/she presses upper management for more money. He/she
happily reports back to the group that conducted the survey,
saying “I got sufficient funds to double the sample size, so we
can slash the errors by a factor of two.”
What should your response be?
“Apologies, but we’ll be able to reduce the errors only by
about 30%, not 50%.”
Why is this the correct response?
The sample size grows from 200 to 400. The standard
error decreases from σ/200½ to σ/400½. The ratio is
( σ/400½ ) / ( σ/200½ ) =
(200/400)½ = 1/√2 = 0.7071068 ≈ 70%
Thus the reduction is not even quite 30%.
To slash the standard error by half, one needs to quadruple
the sample size!!!
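Both examples are one-line applications of σ/N½, as the following sketch shows. It uses the SD of $30,000 assumed in the first example; the function name `standard_error` is chosen here for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 30_000  # SD of household income assumed in the first example

# First example: standard error for growing sample sizes.
for n in (1, 100, 10_000, 1_000_000):
    print(f"N = {n:>9,}: standard error = ${standard_error(sigma, n):,.0f}")

# Second example: doubling the sample size from 200 to 400.
ratio = standard_error(sigma, 400) / standard_error(sigma, 200)
print(f"ratio = {ratio:.4f}, reduction = {1 - ratio:.1%}")
```

Note that σ cancels in the ratio, so the 29% reduction from doubling the sample holds for every variable in the survey.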
Standard Error Estimate of the Mean
 The root-N law and the standard error are theoretical so far because
they rest on an unknown population quantity σ. While it is nice to
have insights into how precision depends on the sample size N, it
would be even nicer if the standard error could be estimated. This is
indeed done and part of standard statistical practice:
 Although we don’t know σ, we can estimate it! The obvious estimate
is the empirical standard deviation s of the observations:
$$ s = \left( \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_N - \bar{X})^2}{N - 1} \right)^{1/2} $$
which in the limit N → ∞ goes to
$$ \sigma = \left( (x_1 - \mu)^2 \cdot P(X = x_1) + (x_2 - \mu)^2 \cdot P(X = x_2) + \dots \right)^{1/2} $$
In words:
o σ is the “true” or population SD “calculated” from infinitely many
observations Xi.
o s is the estimated or sample SD calculated from the N observations X1,
X2, …, XN of a single dataset.
Therefore, the natural estimate of the standard error is:
stderr(X̄) = s / N½
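As a small sketch of this estimation step, the code below computes s with the N – 1 divisor and then divides by the root of N. The function names are hypothetical and the data values are made up for illustration; they are not taken from any course dataset.

```python
import math

def sample_sd(values):
    """Empirical standard deviation s, using the N - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def stderr_mean(values):
    """Standard error estimate of the mean: s / sqrt(N)."""
    return sample_sd(values) / math.sqrt(len(values))

# Made-up observations, for illustration only.
heights = [64.0, 66.5, 67.0, 68.0, 69.5, 71.0, 72.0, 66.0]
print(round(sample_sd(heights), 3), round(stderr_mean(heights), 3))
```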
With this estimation step, we have achieved something remarkable:
Based on one single dataset (the one we have in hand),
we estimate how much the mean of a variable varies across
datasets.
Isn’t this stranger than strange? How is this possible? It is possible
due to the math that goes into the root-N law. This math draws on the
assumption that the cases/rows of the dataset are sampled
independently. Such independence makes the N values of a variable
uncorrelated if we could repeat data collections. Having zero covariances between all N observations wipes out most terms in
V(X̄), and the root-N miracle happens, leaving us with a population
standard deviation that can be estimated from any single dataset…
The full ramifications will become clear as we develop the notion of a
confidence interval constructed from standard errors.
When polls around election time report a margin of error, it is the
standard error of a proportion of voters. Recall that a proportion is
just a mean of 0s and 1s, where 1=‘in favor of the incumbent’, 0=‘in
favor of the challenger’.
As for terminology: the technically correct term “standard error
estimate of the mean” is usually replaced with the shorter “standard
error of the mean” or even shorter “standard error”. This is
technically not correct because the standard error is a theoretical
population quantity, but the precise term is too much of a mouthful
to bother.
 Standard Errors in JMP: Take any dataset with quantitative
variables and apply Distribution to them. For example, go to the
dataset PennStudents.JMP and run the variables Height and Weight
through Distribution. We focus on the bottom list, which reports for each
of the two variables the quantities labeled Mean, Std Dev, Std Err Mean,
upper 95% Mean, and lower 95% Mean.
From Module 6 we know how to interpret the ‘Mean’ and the ‘Std
Dev’ in conjunction with the bellcurve.
What is new is that we can make sense of the next three numbers,
labeled ‘Std Err Mean’, ‘upper 95% Mean’ and ‘lower 95% Mean’:
o The ‘Std Err Mean’ is of course the standard error estimate of
the mean. We can confirm that it is obtained from the standard
deviation (of the observations) by dividing with the root of N:
3.9749694 / 390½ = 0.2012804
So: The mean, which is 67.75 for this dataset, would be
different for other datasets, but it would vary around the
population mean of Height (which we don’t know) with a
standard deviation of about 0.2.
o The next two numbers, ‘upper 95% Mean’ and ‘lower 95%
Mean’, are roughly the mean ± two standard errors. So why
aren’t these two numbers exactly
67.754193 ± 2· 0.2012804 = (67.35154, 68.15666) ?
The reason is that the empirical rule as we formulated it with a
nice factor 2 is not exact. JMP and all software packages
calculate more exact numbers to achieve 95% coverage, but you see
that JMP’s numbers are reasonably close to the rough-and-ready ± 2 stderr rule. When available, use JMP’s numbers;
when not, use the empirical rule.
Wait a minute! How could JMP assume that the distribution of the
means across datasets is approximately normally distributed? This is
what seems to be going on when labeling these bounds as upper and
lower bounds of a 95% coverage interval.
Something is missing: the Central Limit Theorem.
The Central Limit Theorem
 Theorem: If X1 , X2 , …, XN are mutually independent and identically
distributed with the same population mean μ and the same population
variance σ², then, as N → ∞, the variation of the sample means
X̄ = (X1 + X2 + … + XN) / N
from dataset to dataset resembles ever more a normal distribution
with population mean μ and population variance σ²/N.
We knew the last part already: Whatever the distribution is, it must have
population mean (expected value) μ and variance σ²/N, the latter due to
the root-N law. The powerful part is that this distribution looks ever more
like a bellcurve.
Unfortunately, we can’t indulge the intellectually curious with a proof or even a proof
idea. The best we can do is to illustrate with a simulation, and this is what you are doing
in Homework 5. In class we will do another simulation using Sim 300x Uniform.JMP
The powerful and counter-intuitive part of the central limit theorem (CLT)
is that it does not matter what the distribution of the observations Xi of a
variable is: means across datasets will look ever more normally
distributed. In other words, your Distribution analysis of the
variable/column with values (X1 , X2 , …, XN) may look skewed or
discrete, yet the means of the same variable across datasets would look
approximately normally distributed, and this approximation gets better as
N → ∞.
A rule of thumb is that for sample sizes as low as N = 50, the normal
distribution is a good approximation to the distribution of means
across datasets.
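A simulation makes the point concrete. The sketch below assumes strongly skewed observations (exponential with mean 1 and SD 1, a choice made here for illustration) and reports how often the sample mean lands within one standard error of μ. For a bellcurve this share should be about 0.68; the function name and default settings are hypothetical.

```python
import random

def fraction_within(n, k=1.0, datasets=4000, seed=0):
    """Share of simulated sample means within k*sigma/sqrt(n) of mu, for exponential data."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and SD of the exponential distribution with rate 1
    half_width = k * sigma / n ** 0.5
    hits = 0
    for _ in range(datasets):
        mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
        hits += abs(mean - mu) <= half_width
    return hits / datasets

# For N = 1 the share is far from 0.68; it moves toward 0.68 as N grows.
for n in (1, 5, 50):
    print(n, fraction_within(n))
```

For N = 1 the "mean" is just a single skewed observation, so the share is well above 0.68; by N = 50 it should sit close to the normal benchmark.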
 Reminder: We have been careful spelling out that the object of study
is the distribution of the mean of a variable/column across datasets
with N cases/rows. “Across datasets” means “across dataset
collections”. Keep in mind: We are playing a mind game by
examining hypothetically what means would look like if we could
collect datasets over and over and over…
So we said this is a hypothetical mind game. In reality, if we are ever in
the situation of collecting more than one dataset with the same variables,
we will most likely not analyze the datasets separately. Instead, we will
merge them into one larger dataset with many more cases and the same
variables. If the two datasets were both of size N, the merged dataset will
be of size 2N. (By how much can we hope to slash the standard error of
the means of the variables?)
(A note on “meta-analysis” for the intellectually curious: There exists a
situation in which one analyzes results from multiple datasets, namely,
when one surveys research that has been going on for years and has
produced multiple studies of roughly the same problem resulting in
datasets that all contain some of the same variables of interest. This is the
case typically in the medical field where a disease is investigated over and
over from various angles. Such studies will have some of the same
variables and also some that are specific to them. When surveying such
studies, one can use techniques from a statistical specialty called “meta-analysis”. Typically one has only access to the summary statistics such as
means, standard deviations, correlations of the variables as reported in
papers published in scientific journals, but one does not have access to the
multiple datasets themselves. By combining the estimates of multiple
studies, meta-analytic techniques will then provide more accurate
estimates than any of the individual studies.)
The Empirical Rule Based on the Central Limit Theorem
 The upshot of the central limit theorem is that for moderate and large
sample sizes (N ≥ 30), we can make approximate probability
statements such as those of the empirical rule:
P( | X̄ – μ | ≤ 2σ/N½ ) ≈ 19/20
P( | X̄ – μ | ≤ σ/N½ ) ≈ 2/3
These statements are true but not yet useful, because σ is unknown. They become potentially
useful once we try to estimate the unknown population standard
deviation σ with a sample standard deviation s:
P( | X̄ – μ | ≤ 2s/N½ ) ≈ 19/20
P( | X̄ – μ | ≤ s/N½ ) ≈ 2/3
Are these still acceptable approximations? It turns out the answer is
yes! Here is why this is a non-trivial answer: By estimating σ with s,
we incur dataset-to-dataset variability in s, just as in the sample
mean X̄. Wouldn’t one expect this variation in s to destroy the nice
empirical rule? Think about it: s undershoots σ as often as it
overshoots, and when it overshoots, it makes the interval wider than
necessary, so maybe the problem is not so bad. In fact, it isn’t.
Here is what mathematical statistics found out: For small sample sizes
we need to lift the factor 2 just a little bit, but for large sample sizes
we can actually use a factor slightly below 2. The following table lists
factors for various sample sizes as suggested by the theory:
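Such factors can be recomputed numerically. The sketch below is one possible way, not necessarily how any particular software does it: it integrates the density of Student's t-distribution with Simpson's rule and finds the factor by bisection. It assumes the standard convention of N – 1 degrees of freedom for a sample of size N, and all function names are chosen here for illustration.

```python
import math

def t_density(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def central_probability(c, df, steps=1000):
    """P(|t| <= c), by Simpson's rule on [0, c] (steps must be even)."""
    h = c / steps
    total = t_density(0.0, df) + t_density(c, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 2 * total * h / 3

def t_factor(n, coverage=0.95):
    """Factor t_N with P(|t| <= t_N) = coverage, using n - 1 degrees of freedom."""
    lo, hi = 0.0, 100.0
    for _ in range(40):  # bisection on the half-width
        mid = (lo + hi) / 2
        if central_probability(mid, n - 1) < coverage:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for n in (5, 10, 20, 30, 50, 100, 1000):
    print(f"N = {n:>4}: factor = {t_factor(n):.3f}")
```

The printed factors start well above 2 for small N and settle slightly below 2 for large N, in line with the description above.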
These factors used to be tabulated but are now computed by software
such as JMP as needed. If we denote the factors by tN , the following
probability statement is made exact, assuming the data themselves are
normally and independently distributed:
P( | X̄ – μ | ≤ tN ·s/N½ ) = 0.95
Note the equal sign! For all practical purposes, the factor 2 will be
just fine if we only remember that it is a little too small for N less than
about 50. For N ≥ 100, the factor may actually be conservative in
many cases, which is not a problem. It only means the probability
may be a tad greater than 0.95, such as 0.952 for N = 100. Now, these
probabilities are computed assuming normal data. If the data are non-normal, such as skewed or discrete, the probability may be a touch
below 0.95. In all,
P( | X̄ – μ | ≤ 2 s/N½ ) ≈ 0.95
is a pretty good rule, definitely for N ≥ 100 and, unless the data are crazy,
even for N ≥ 50.
Insight into the problem discussed here developed in the early 1900s.
Someone named Gosset did a mathematical investigation into the
probability distribution of the quantity t = (X̄ – μ)/(s/N½), the so-called t-statistic, assuming that the observations X1, X2,… are all normally and
independently distributed. He actually derived the density function for
this statistic. What we denote as tN is the 97.5% quantile of this t-distribution. The quantity N – 1 is called the “degrees of freedom” of the t-distribution, and you may encounter references to “a t-distribution with N – 1
degrees of freedom”.
Here are some trivia surrounding these discoveries, quoted from the
Wikipedia (search “student’s t”): “The derivation of the t-distribution was
first published in 1908 by William Sealy Gosset, while he worked at a
Guinness Brewery in Dublin. He was not allowed to publish under his
own name, so the paper was written under the pseudonym Student. The t-test and the associated theory became well-known through the work of
R.A. Fisher, who called the distribution "Student's distribution".”
Confidence Intervals
 Above we looked at the probability of the statement | X̄ – μ | ≤ 2s/N½
and came away with the message that it is close to 0.95 for N ≥ 50.
 Preliminary observation for the next step:
o In words, the inequality | X̄ – μ | ≤ 2 s/N½ expresses the idea that the
distance between X̄ and μ is no more than 2s/N½.
o There are two ways to express the same idea asymmetrically:
 μ is no further away from X̄ than 2s/N½ :
X̄ – 2 s/N½ ≤ μ ≤ X̄ + 2 s/N½
 X̄ is no further away from μ than 2 s/N½ :
μ – 2 s/N½ ≤ X̄ ≤ μ + 2 s/N½
It is the first of these two asymmetric formulations we now consider.
The figure below illustrates the three ways of looking at the condition.
The top of the figure shows the
distance between X̄ and μ and
compares it with 2 s/N½, which
here is larger.
The middle of the figure shows an
interval centered at X̄ of half-width
2s/N½ catching the value μ.
The bottom of the figure shows an
interval centered at μ of half-width
2s/N½ catching the value X̄.
 We rewrite the probability statement in the following suggestive form:
P( X̄ – 2 s/N½ ≤ μ ≤ X̄ + 2 s/N½ ) ≈ 0.95
In words:
The interval “X̄ ± 2 s/N½” catches the true population mean μ
for about 19 out of 20 datasets.
This interval is called a 95% confidence interval
for the unknown population mean μ.
A common abbreviation for “confidence interval” is “CI”, so we may
say “the 95% CI for the mean is …” The number 95% or 0.95 refers
to the “coverage probability”, where “coverage” refers to covering the
true value μ in the interval.
Q: What would be the coverage probability of the CI X̄ ± s/N½ ?
 Two vexing aspects of confidence intervals:
o CIs are random intervals because they are constructed from
datasets: each dataset produces one value for X̄ and one for s,
and both vary from dataset to dataset.
o The target μ, by contrast, does not vary. In this mental game it
is fixed but unknown across data collections.
You can compare the situation to blindly shooting an arrow with a wide
suction cup as a tip at a bull’s eye, and 19 out of 20 times the arrow’s
suction cup covers the very center point of the bull’s eye (μ). Note that
not only is it random where the arrow hits (X̄), but random is also the
radius of the suction cup (2s/N½). Of course this is a two-dimensional
metaphor for something that is going on in one dimension only.
The figure on the left shows 100
CIs from 100 simulated datasets,
each of size N=20. The dot shows
the horizontal location of X̄, and
the vertical centerline shows the
fixed target μ. The 100 horizontal
line segments represent the 100
CIs, vertically spread out for better
comparison. Note that there are
shorter and longer line segments,
representing the variability in s.
It appears that 6 or 7 intervals are
missing the true value μ, in rough
agreement with the approximate
5% missing rate expected in the
long run as the number of datasets
goes to infinity.
In reality, you will see only one
dataset and hence one CI for a
variable, but you have to mentally
embed this one CI in this picture.
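An experiment of this kind can be sketched as follows. How the datasets behind the figure were simulated is not stated, so normal observations with μ = 0 and σ = 1 are an assumption made here, as are the function name and the seed.

```python
import math
import random

def count_misses(n=20, datasets=100, mu=0.0, sigma=1.0, seed=0):
    """Count how many intervals mean ± 2 s/sqrt(n) fail to contain mu."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(datasets):
        data = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(data) / n
        s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
        half_width = 2 * s / math.sqrt(n)
        misses += abs(mean - mu) > half_width
    return misses

print(count_misses())                        # misses among 100 simulated CIs
print(count_misses(datasets=20000) / 20000)  # long-run missing rate
```

With only 100 datasets the number of misses fluctuates noticeably from run to run; the long-run rate for N = 20 sits a little above 5% because the factor 2 is slightly too small for such a small sample.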
 CIs in practice: We return to the analysis of the variables Height and
Weight in the dataset PennStudents.JMP:
[JMP Distribution output for Height and Weight: Std Dev, Std Err Mean, upper 95% Mean, lower 95% Mean]
First, recall that Std Err Mean = Std Dev / N½ .
We now understand “upper 95% Mean” and “lower 95% Mean”:
These represent the 95% CI = Mean ± tN · Std Err Mean, where tN ≈ 2.
The interpretation is:
o The interval (67.36 in, 68.15 in) has about a 95% chance of
catching the population mean height of Penn students.
o The interval (147.1 lb, 153.1 lb) has about a 95% chance of
catching the population mean weight of Penn students.
o The unknown “population mean height” is fixed while the CI
varies from sample to sample. This particular CI (67.36, 68.15)
has a 95% chance of containing the population mean height,
and this is all we can know about the population mean height.
o The unknown “population mean weight” is fixed while the CI
varies from sample to sample. The particular CI (147.1, 153.1)
has a 95% chance of containing the population mean weight,
and this is all we can know about the population mean weight.
In either case, are we ever going to know whether these particular
two intervals contain the respective population means? No, we will
never know, yet it’s the best game we can play.
 The trade-off between precision and uncertainty: We will never
know whether the interval actually contains the population mean μ,
but we can play with the width of the interval, that is, with the
precision requirement:
o We could lower the precision by widening the CI to three
standard errors, which would reduce the uncertainty by raising
the coverage probability from 0.95 to 0.997.
o We could raise the precision by narrowing the CI to one
standard error, which would raise the uncertainty by lowering
the coverage probability from 0.95 to 0.68.
There is no escaping the trade-off between precision and uncertainty.
The conventional trade-off is to ask for 0.95 coverage probability,
leaving a 5% uncertainty, implying a precision of ±2 stderr.
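The coverage probabilities quoted above follow from the normal approximation, as this short sketch illustrates. It uses `NormalDist` from Python's standard library; the function name `coverage` is chosen here for illustration.

```python
from statistics import NormalDist

def coverage(k):
    """Normal-approximation coverage probability of the interval mean ± k standard errors."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"±{k} stderr: coverage ≈ {coverage(k):.3f}")
```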
When would we want a wider CI? If the cost of acting on the CI is high.
For example, if a clinical trial says average survival times increase by
2 years with a new treatment, wouldn’t you want to be quite certain
before switching to the new treatment?
Standard Error Estimates and CIs for Proportions
 The case of a random variable with 0/1 outcomes is so special for its
simplicity and importance that we should examine it separately. It
even has a special name: a Bernoulli random variable.
When X=1 and X=0 are the only values, then
o the sample mean of N realizations X1, X2, X3, …, XN is just the
relative frequency or proportion of 1’s, and
o the population mean is just the probability of observing 1.
For these reasons, one writes
p = P(X = 1)
p̂ = X̄ = #{Xi = 1} / N
This notation indicates that the proportion p̂ is an estimate of the
probability p.
Terminology: X is called a Bernoulli random variable with
parameter p.
 Variance and standard deviation: We found in Module 6 that the
population variance and standard deviation of X are
V(X) = p(1–p) ,
σ(X) = (p(1–p))½
Now doesn’t this suggest that the sample variance and standard
deviation should be the following?
s²(X) = p̂ (1– p̂ ) ,
s(X) = ( p̂ (1– p̂ ))½
Very close! In fact, if we calculate the sample variance according to
the usual formula and make use of the fact that the values are only 1’s
and 0’s, we get
$$ \frac{1}{N-1} \left( (X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_N - \bar{X})^2 \right) = \hat{p}(1 - \hat{p}) \, \frac{N}{N-1} $$
Now, there is a reason to ignore the factor N/(N–1), which is close to
1 anyway, and we don’t need to know the details. Hence we take the
formulas s²(X) = p̂ (1– p̂ ) and s(X) = ( p̂ (1– p̂ ))½ as the final definitions.
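The identity is easy to confirm numerically. The sketch below builds a column of 0s and 1s and compares the usual sample variance with p̂(1– p̂)·N/(N–1); the counts (465 ones among 1000 values) and the function name are illustrative choices.

```python
import statistics

def check_bernoulli_variance(ones, n):
    """Compare the usual sample variance of 0/1 data with p̂(1 - p̂) N/(N - 1)."""
    data = [1] * ones + [0] * (n - ones)
    p_hat = ones / n
    usual = statistics.variance(data)              # divides by N - 1
    shortcut = p_hat * (1 - p_hat) * n / (n - 1)   # formula for 0/1 data
    simplified = p_hat * (1 - p_hat)               # factor N/(N - 1) ignored
    return usual, shortcut, simplified

usual, shortcut, simplified = check_bernoulli_variance(465, 1000)
print(usual, shortcut, simplified)
```

The first two numbers agree up to rounding, and the third differs only by the factor N/(N–1), which is about 1.001 for N = 1000.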
 Standard errors: Sample standard deviations of 0/1 outcomes have
really no practical meaning because there is certainly no empirical
rule that applies here. Instead, the purpose of the standard deviation is
as an aid in calculating a standard error estimate of the proportion:
stderr( p̂ ) = ( p̂ (1– p̂ ) / N )½
We can restate the empirical rule for proportions as follows:
P( | p̂ – p | ≤ 2 stderr( p̂ ) ) ≈ 0.95
In words: The sample proportion p̂ has a chance of about 19 in 20 to
catch the true probability p within two standard errors.
 Application: Consider a poll of a candidate based on 1000
respondents (= people bothered by phone during dinner time, yet willing to
volunteer an answer). Let’s say 465 were in favor of candidate Z. The
proportion is p̂ = 0.465. The standard error (…estimate of the
proportion) is stderr = (0.465 · 0.535 / 1000)½ = 0.01577. Thus two
standard errors is 0.03154, and the newspapers will report “candidate
Z favored by 46.5% of likely voters, with a margin of error of 3%”.
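The poll calculation can be retraced in a few lines. This is a sketch using the numbers of the example (465 of 1000 respondents); the function name `proportion_stderr` is made up for illustration.

```python
import math

def proportion_stderr(successes, n):
    """Standard error estimate of a proportion: sqrt(p̂ (1 - p̂) / N)."""
    p_hat = successes / n
    return math.sqrt(p_hat * (1 - p_hat) / n)

se = proportion_stderr(465, 1000)
margin = 2 * se  # margin of error as two standard errors
print(f"proportion = {465 / 1000:.3f}, stderr = {se:.5f}, margin of error = {margin:.4f}")
```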
Clarification: The newspapers invented the term “statistical dead
heat”. They mean that based on the margin of error one cannot be
sure that one candidate is ahead of the other. You should realize that
this “dead heat” is less a property of what is going on in the
population than a definition based on convention and sample size. It
is assumed (and this is a convention) that the “margin of error” should
be based on two standard errors, implying a coverage probability of
the true proportion of 95%. Also, the pollsters seem to have taken
sample sizes around 1000 as the standard (see for example the sample
sizes quoted in BushJobRatingsGallup.JMP). These two facts combine to
a definition of “statistical dead heat”. One would have fewer dead
heats if one made the CIs narrower and allowed greater uncertainty in
coverage, and/or if one used sample sizes greater than 1000. For your
own thinking, you could switch to a confidence interval based on
±stderr, which leaves you with an uncertainty of 1/3 instead of 1/20,
but lets you gamble that candidate Z is ahead or lagging.
Evidence for/against μ: Rejection and Significance Levels based on CIs
 CIs are random intervals centered at the sample mean that, like a
fishing net, try to catch something, here the population mean μ.
Now let us change the point of view: let us center things at the
unknown population mean μ, as in the following two figures:
We drew the bellcurve because based on the CLT it is a good
approximation for the dataset-to-dataset distribution of sample means
X̄ around the population mean μ.
As the figures state, μ makes X̄ look more likely in the first case than
in the second case. The farther X̄ is from μ, the lower the density
function and the lower the probability of X̄ gets.
Turning things around, we now ask what evidence X̄ lends for the
unknown μ. The following seems reasonable:
o When μ makes X̄ more probable, X̄ lends more evidence for μ.
o When μ makes X̄ less probable, X̄ lends less evidence for μ.
The principle at work here is:
o Population means μ assign probabilities to sample means X̄.
o Sample means X̄ assign evidence to population means μ.
Strictly speaking, of course, it is the pair of parameters (μ, σ) that
define a normal population, and it is this population that assigns
probabilities to intervals of values of X̄.
Conversely, it is X̄ in conjunction with the standard error estimate
s(X)/N½ that together assign evidence to values of μ.
Never mind, the two bullets above are more catchy and more
The evidence game has given rise to the following formulation:
Values of μ that are further away from X̄ than two standard
errors are rejected at the 5% significance level.
This is language from the theory of “statistical tests” which we’ll look
into in the next module. For now it gives us another way of
interpreting the values inside and outside the CI:
o The values inside the CI are possible population means μ for
which there is no evidence to reject them.
o The values outside the CI are possible population means μ for
which there is evidence to reject them.
 Example: Above we looked at political candidate Z who had 46.5%
of respondents favoring him/her. Two standard errors of this
proportion turned out to be about 3.2%. Thus the confidence
interval CI is 46.5%±3.2% = (43.3%, 49.7%). This implies that the
value μ=50% is rejected at the 5% significance level because it falls
outside the 95% CI. The evidence at the 5% significance level is that
the candidate does not have a majority.
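The rejection rule can be sketched as a small function that builds the interval and checks whether a hypothetical value falls outside it. The function name and the default of two standard errors are illustrative choices; the counts are those of the poll example.

```python
import math

def ci_and_rejected(successes, n, value, k=2.0):
    """Return the interval p̂ ± k·stderr and whether `value` falls outside it."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    lower, upper = p_hat - k * se, p_hat + k * se
    return (lower, upper), not (lower <= value <= upper)

(lower, upper), rejected = ci_and_rejected(465, 1000, 0.50)
print(f"CI = ({lower:.3f}, {upper:.3f}); value 50% rejected: {rejected}")
```

A value such as 48%, which lies inside the interval, would not be rejected by the same rule.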
Please, do not confuse the various percentages: the proportion 46.5% and
the CI refer to proportions of likely voters. So does the hypothetical
boundary value 50% that divides majority from minority. The two values
95% and 5%, however, refer to strength of evidence in these voter
proportions: 95% is the probability that the CI catches the true population
proportion; 100% – 95% = 5% is the “unlikeliness” of finding the sample
mean this far out in the tail.
 Coverage probabilities of CIs and significance level of rejection:
If the CI has another half-width, for example, ± 3 stderr, then the
coverage probability is 0.997, and we will say that we reject the
values outside this CI at the 0.3% significance level.
If we wanted to play the game at the 1% significance level, we would
have to use CIs with coverage probability 99%, requiring a half-width
of about ± 2.6 stderr.
In general, if the coverage probability of the CI is 1– α, we say the
values outside the CI are rejected at the α·100% significance level.
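The correspondence between half-width and significance level can be sketched with the normal approximation, again using `NormalDist` from Python's standard library. The two function names are chosen here for illustration.

```python
from statistics import NormalDist

def significance_level(k):
    """Significance level alpha for rejecting values more than k standard errors away."""
    return 2 * (1 - NormalDist().cdf(k))

def half_width(alpha):
    """Number of standard errors k that gives coverage probability 1 - alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for k in (2, 3):
    print(f"±{k} stderr: alpha ≈ {significance_level(k):.4f}")
for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: half-width ≈ ±{half_width(alpha):.2f} stderr")
```

The exact half-width for the 5% level comes out slightly below 2, which is why ±2 stderr is only a convenient rounding.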