IE241: Introduction to
Hypothesis Testing
Topic
Hypothesis testing
Light bulb example
Null and alternative hypotheses
Two types of error
Decision rule
  test statistic
  critical region
  Power of the test
Simple hypothesis testing
  Neyman-Pearson lemma
  example
Composite hypothesis testing
  example
  Likelihood ratio test
    relationship to mean
Examples of 1-sided composite hypotheses
  drug to help sleep
  civil service exam
    difference between two proportions
    effect of size of n
  railroad ties
  fertilizer to improve yield of corn
    test of two variances
    F distribution
Tests of correlated means
Bayes’ likelihood ratio test
  example
Chi-square tests
  goodness of fit
  independence in contingency tables
  testing sample vs hypothesized variance
Significance testing
We said before that estimation of parameters
was one of the two major areas of statistics.
Now let’s turn to the second major area of
statistics, hypothesis testing.
A test of a statistical hypothesis is a
procedure for deciding whether or not to
reject the hypothesis.
What is a statistical hypothesis? A statistical
hypothesis is an assumption about f(X) if X is
continuous or p(X) if X is discrete.
Let’s look at an example.
A buyer of light bulbs bought 50 bulbs
of each of two brands. When he tested
them, Brand A had an average life of
1208 hours with a standard deviation of
94 hours. Brand B had a mean life of
1282 hours with a standard deviation of
80 hours. Are brands A and B really
different in quality?
We set up two hypotheses.
The first, called the null hypothesis Ho,
is the hypothesis of no difference.
Ho: μA = μB
The second, called the alternative
hypothesis Ha, is the hypothesis that
there is a difference.
Ha: μA ≠ μB
On the basis of the sample of 50 from
each of the two populations of light
bulbs, we shall either reject or not reject
the hypothesis of no difference.
In statistics, we always test the null
hypothesis. The alternative hypothesis
is the default winner if the null
hypothesis is rejected.
We never really accept the null
hypothesis; we simply fail to reject it on
the basis of the evidence in hand.
Now we need a procedure to test the
null hypothesis. A test of a statistical
hypothesis is a procedure for deciding
whether or not to reject the null
hypothesis.
There are two possible decisions, reject
or not reject. This means there are also
two kinds of error we could make.
The two types of error are shown in the table
below.

                      True state
Decision              Ho true                Ho false
Reject Ho             Type 1 error (α)       Correct decision
Do not reject Ho      Correct decision       Type 2 error (β)
If we reject Ho when Ho is in fact true,
then we make a type 1 error. The
probability of type 1 error is α.
If we do not reject Ho when Ho is really
false, then we make a type 2 error. The
probability of a type 2 error is β.
Now we need a decision rule that will make
the probability of the two types of error very
small. The problem is that the rule cannot
make both of them small simultaneously.
The one type of error the experimenter has
under his control is α error. He can choose
the size of α.
Because in science we have to take the
conservative route and never claim that we
have found a new result unless we are really
convinced that it is true, we choose a very
small α, the probability of type 1 error.
Then among all possible decision rules given α, we
choose the one that makes β as small as possible.
The decision rule consists of a test statistic and a
critical region where the test statistic may fall. For
means from a normal population, the test statistic is
$$t = \frac{\bar{X}_A - \bar{X}_B}{s_{diff}} = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}$$
where the denominator is the standard deviation of
the difference between two independent means.
The critical region is a tail of the distribution
of the test statistic. If the test statistic falls in
the critical region, Ho is rejected.
Now, how much of the tail should be in the
critical region? That depends on just how
small you want α to be. The usual choice is
α = .05, but in some very critical cases, α is
set at .01.
Here we have just a non-critical choice of
light bulbs, so we’ll choose α = .05. This
means that the critical region has probability
= .025 in each tail of the t distribution.
For a t distribution with .025 in each tail,
the critical value of t = 1.96, the same
as z because the sample size is greater
than 30. The critical region then is
|t |> 1.96.
In our light bulb example, the test
statistic is
$$t = \frac{1282 - 1208}{\sqrt{\dfrac{80^2}{50} + \dfrac{94^2}{50}}} = \frac{74}{17.5} = 4.23$$
Now 4.23 is much greater than 1.96 so
we reject the null hypothesis of no
difference and declare that the average
life of the B bulbs is longer than that of
the A bulbs.
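The light bulb calculation can be reproduced in a few lines. This is a minimal sketch, assuming only the summary statistics quoted above; the function name `two_sample_t` is made up for illustration. Without rounding the standard error to 17.5, t comes out near 4.24 rather than 4.23.

```python
import math

def two_sample_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (t, s_diff) for the difference mean_b - mean_a of two independent samples."""
    # Standard deviation of the difference between two independent means.
    s_diff = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_b - mean_a) / s_diff, s_diff

# Brand A: mean 1208, sd 94, n 50. Brand B: mean 1282, sd 80, n 50.
t_stat, s_diff = two_sample_t(1208, 94, 50, 1282, 80, 50)

CRITICAL_T = 1.96  # two-tailed critical value for alpha = .05, taken from the text
reject = abs(t_stat) > CRITICAL_T

print(f"s_diff = {s_diff:.2f}, t = {t_stat:.2f}, reject Ho: {reject}")
```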
Because α = .05, we have 95%
confidence in the decision we made.
We cannot say that there is a 95% probability that we
are right because we are either right or wrong and we
don’t know which.
But there is such a small probability that t will land in
the critical region if Ho is true that if it does get there,
we choose to believe that Ho is not true.
If we had chosen α = .01, the critical value of t would
be 2.58 and because 4.23 is greater than 2.58, we
would still reject Ho. This time it would be with 99%
confidence.
How do we know that the test we used
is the best test possible?
We have controlled the probability of
Type 1 error. But what is the probability
of Type 2 error in this test? Does this
test minimize it subject to the value of α?
To answer this question, we need to
consider the concept of test power.
The power of a statistical test is the
probability of rejecting Ho when Ho is
really false. Thus power = 1-β.
Clearly if the test maximizes power, it
minimizes the probability of Type 2 error
β. If a test maximizes power for given
α, it is called an admissible testing
strategy.
Before going further, we need to distinguish
between two types of hypotheses.
A simple hypothesis is one where the value of
the parameter under Ho is a specified
constant and the value of the parameter
under Ha is a different specified constant.
For example, if you test
Ho: μ = 0
vs
Ha: μ = 10
then you have a simple hypothesis test.
Here you have a particular value for Ho and a
different particular value for Ha.
For testing one simple hypothesis Ha against
the simple hypothesis Ho, a ground-breaking
result called the Neyman-Pearson lemma
provides the most powerful test.
$$\lambda = \frac{L(\hat{\theta}_a)}{L(\hat{\theta}_0)}$$
λ is a likelihood ratio with the Ha parameter
MLE in the numerator and the Ho parameter
MLE in the denominator. Clearly, any value of
λ > 1 would favor the alternative hypothesis,
while values less than 1 would favor the null
hypothesis.
Basically, this lemma says that if there
exists a critical region A of size α and a
constant k such that

$$\frac{L_a}{L_o} = \frac{\prod_{i=1}^{n} f(x_i;\theta_a)}{\prod_{i=1}^{n} f(x_i;\theta_o)} \ge k \quad \text{inside } A$$

and

$$\frac{L_a}{L_o} = \frac{\prod_{i=1}^{n} f(x_i;\theta_a)}{\prod_{i=1}^{n} f(x_i;\theta_o)} \le k \quad \text{outside } A$$
then A is a best (most powerful) critical region
of size α.
Consider the following example of a
test of two simple hypotheses.
A coin is either fair or has p(H) = 2/3.
Under Ho, P(H) = ½ and under Ha, P(H)
= 2/3.
The coin will be tossed 3 times and a
decision will be made between the two
hypotheses. Thus X = number of heads
= 0, 1, 2, or 3. Now let’s look at how
the decision will be made.
First, let’s look at the probability of Type 1
error α. In the table below, Ho⇒ P(H) =1/2
and Ha⇒ P(H) = 2/3.
X    P(X|Ho)    P(X|Ha)
0    1/8        1/27
1    3/8        6/27
2    3/8        12/27
3    1/8        8/27
Now what should the critical region be?
Under Ho, if X = 0, α = 1/8. Under Ho, if X = 3,
α = 1/8. So if either of these two values is chosen
as the critical region, the probability of Type 1 error
would be the same.
Now what if Ha is true? If X = 0 is chosen as the
critical region, the value of β = 26/27 because that is
the probability that X ≠ 0.
On the other hand, if X = 3 is chosen as the critical
region, the value of β = 19/27 because that is the
probability that X ≠ 3.
Clearly, the better choice for the critical region is X=3
because that is the region that minimizes β for fixed
α. So this critical region provides the more powerful
test.
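The comparison of the two candidate critical regions can be checked with exact fractions. The following is a small sketch of the coin example; the helper names `binom_pmf` and `alpha_beta` are illustrative and not part of the lecture.

```python
from fractions import Fraction
from math import comb

N_TOSSES = 3
P_NULL = Fraction(1, 2)  # P(H) under Ho
P_ALT = Fraction(2, 3)   # P(H) under Ha

def binom_pmf(x, n, p):
    """Binomial probability of x heads in n tosses."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def alpha_beta(region):
    """Return (alpha, beta) for a critical region given as a set of X values."""
    alpha = sum(binom_pmf(x, N_TOSSES, P_NULL) for x in region)
    power = sum(binom_pmf(x, N_TOSSES, P_ALT) for x in region)
    return alpha, 1 - power

for region in ({0}, {3}):
    alpha, beta = alpha_beta(region)
    print(f"critical region X in {sorted(region)}: alpha = {alpha}, beta = {beta}")
```

Both regions have α = 1/8, but X = 3 gives the smaller β (19/27 against 26/27), which matches the choice made above.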
In discrete variable problems like this, it
may not be possible to choose a critical
region of the desired α. In this
illustration, you simply cannot find a
critical region where α = .05 or .01.
This is seldom a problem in real-life
experimentation because n is usually
sufficiently large so that there is a wide
variety of choices for critical regions.
This problem, which illustrates the general
method for selecting the best test, was
easy to discuss because there was only
a single alternative to Ho.
Most problems involve more than a
single alternative. Such hypotheses are
called composite hypotheses.
Examples of composite hypotheses:
Ho: μ = 0
vs
Ha: μ ≠ 0
which is a two-sided Ha.
A one-sided Ha can be written as
Ho: μ = 0
vs
Ha: μ > 0
or
Ho: μ = 0
vs
Ha: μ < 0
All of these hypotheses are composite because they
include more than one value for Ha. And
unfortunately, the size of β here depends on the
particular alternative value of μ being considered.
In the composite case, it is necessary
to compare Type 2 errors for all
possible alternative values under Ha.
So now the size of Type 2 error is a
function of the alternative parameter
value θ.
So β(θ) is the probability that the
sample point will fall in the noncritical
region when θ is the true value of the
parameter.
Because it is more convenient to work
with the critical region, the power
function 1-β(θ) is usually used.
The power function is the probability
that the sample point will fall in the
critical region when θ is the true value
of the parameter.
As an illustration of these points,
consider the following continuous
example.
Let X = the time that elapses between
two successive trippings of a Geiger
counter in studying cosmic radiation.
The density function is
$$f(x;\theta) = \theta e^{-\theta x}$$
where θ is a parameter which depends
on experimental conditions.
Under Ho, θ = 2. Now a physicist
believes that θ < 2. So under Ha, θ < 2.
Now one choice for the critical region is the right tail
of the distribution, X ≥ 1, for which

$$\alpha = \int_{1}^{\infty} 2e^{-2x}\,dx = .135$$

Another choice is the left tail, X ≤ .07, for which α
is approximately the same. That is,

$$\alpha = \int_{0}^{.07} 2e^{-2x}\,dx \approx .13$$
Now let’s examine the power for the two competing
critical regions.
For the right-tail critical region X > 1,

$$1 - \beta_1(\theta) = \int_{1}^{\infty} \theta e^{-\theta x}\,dx = e^{-\theta}$$

and for the left-tail critical region X <.07,

$$1 - \beta_2(\theta) = \int_{0}^{.07} \theta e^{-\theta x}\,dx = 1 - e^{-.07\theta}$$
The graphs of these two functions are called
the power curves for the two critical regions.
These two power functions are plotted in the figure.
[Figure: Power functions for two critical regions. Power (vertical axis) is plotted against theta (horizontal axis, 0 to 4) for critical region X>1 and critical region X<.07.]
Note that the power function for X>1 region is always
higher than the power function for X<.07 region before
they cross at θ = 2. Since the alternative θ values in
the problem are all θ<2, clearly the right-tail critical
region X>1 is more powerful than the left-tail region.
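The two power functions are simple enough to tabulate directly. This sketch assumes the closed forms e^(-θ) for the region X > 1 and 1 - e^(-.07θ) for the region X < .07; the function names are illustrative.

```python
import math

def power_right_tail(theta, cutoff=1.0):
    """P(X > cutoff) when X is exponential with rate theta."""
    return math.exp(-theta * cutoff)

def power_left_tail(theta, cutoff=0.07):
    """P(X < cutoff) when X is exponential with rate theta."""
    return 1.0 - math.exp(-theta * cutoff)

print("theta  right-tail  left-tail")
for theta in (0.25, 0.5, 1.0, 1.5, 2.0):
    print(f"{theta:5.2f}  {power_right_tail(theta):10.3f}  {power_left_tail(theta):9.3f}")
```

For every alternative θ below 2 in this table the right-tail region has the higher power, and at θ = 2 both values are close to the size of the test.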
What we just saw was a 1-sided composite
alternative hypothesis test. Unfortunately, with
two-sided composite alternative hypotheses,
there is no best test that covers all alternative
values.
Clearly, if the alternative were θa < θo , the
left tail would be best, and if the alternative
were θa > θo , the right tail would be best.
This shows that best critical regions exist only
if the alternative hypothesis is suitably
restricted.
So for composite hypotheses, a new
principle needs to be introduced to find a
good test. This principle is called a
likelihood ratio test.
$$\lambda = \frac{L(\hat{\theta}_0)}{L(\hat{\theta})}$$
where the denominator is the maximum of
the likelihood function with respect to all the
parameters, and the numerator is the
maximum of the likelihood function after
some or all of the parameters have been
restricted by Ho.
Consequently, the numerator can never exceed the
denominator, so λ can assume values only between
0 and 1.
A value of λ close to 1 lends support to Ho because
then it is clear that allowing the parameters to
assume values other than those possible under Ho
would not increase the likelihood of the sample
values very much, if at all.
If, however, λ is close to 0, then the probability of the
sample values of X is very low under Ho, and Ho is
therefore not supported by the data.
Because increasing values of λ correspond to
increasing degrees of belief in Ho, λ may serve as a
statistic for testing Ho, with small values leading to
rejection of Ho.
Now the MLEs are functions of the values of the
random variable X, so λ is also a function of these
values of X and is therefore an observable random
variable.
λ is often related to a statistic such as X̄ whose distribution is known,
so it is not necessary to find the distribution of λ.
Suppose we have a normal population
with σ = 1 and we are interested in
testing whether the mean = μo. That is,
$$f(x;\mu) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\mu)^2}$$
Let’s see how we would construct a
likelihood ratio test.
In this case,
$$L(\mu) = (2\pi)^{-\frac{n}{2}}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^2}$$
Since maximizing L(μ) is equivalent to
maximizing log L(μ),

$$\frac{\partial \log L(\mu)}{\partial \mu} = \sum_{i=1}^{n}(x_i-\mu)$$

so μ̂ = X̄ and therefore

$$L(\hat{\mu}) = (2\pi)^{-\frac{n}{2}}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i-\bar{X})^2}$$
Under Ho, there are no parameters to
be estimated, so
$$L(\mu_o) = (2\pi)^{-\frac{n}{2}}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i-\mu_o)^2}$$
and λ then is

$$\lambda = e^{-\frac{1}{2}\left[\sum_{i=1}^{n}(x_i-\mu_o)^2 - \sum_{i=1}^{n}(x_i-\bar{X})^2\right]} = e^{-\frac{n}{2}(\bar{X}-\mu_o)^2}$$
This expression shows a relationship between
λ and X̄, such that for each value of λ, there
are two critical values of X̄, which are
symmetrical with respect to X̄ = μo.
So the 5% critical region for λ corresponds to
the two 2.5% tails of the normal X̄ distribution
given by

$$|\bar{X} - \mu_o| > \frac{1.96}{\sqrt{n}}$$
Thus the likelihood ratio test is identical to the
t test and serves as a compromise test when
no best test is available.
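The simplification of λ to a function of X̄ rests on one algebraic step that is easy to miss. A short derivation, using only the symbols already defined for the normal example with σ = 1:

```latex
\begin{aligned}
\sum_{i=1}^{n}(x_i-\mu_o)^2
  &= \sum_{i=1}^{n}\bigl[(x_i-\bar{X}) + (\bar{X}-\mu_o)\bigr]^2 \\
  &= \sum_{i=1}^{n}(x_i-\bar{X})^2
     + 2(\bar{X}-\mu_o)\sum_{i=1}^{n}(x_i-\bar{X})
     + n(\bar{X}-\mu_o)^2 \\
  &= \sum_{i=1}^{n}(x_i-\bar{X})^2 + n(\bar{X}-\mu_o)^2
  \qquad \text{since } \sum_{i=1}^{n}(x_i-\bar{X}) = 0 .
\end{aligned}
```

Subtracting the sum of squares about X̄ leaves n(X̄ − μo)², so λ is small exactly when X̄ is far from μo in either direction.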
It is because of the concept of power
that we simply fail to reject the null
hypothesis and do not accept it when
the test value does not fall into the
rejection region.
The reason is that if we had a more
powerful test, we might have been able
to reject Ho.
Now let’s look at some examples.
As an example of a one-sided composite
hypothesis test, suppose a new drug is
available which claims to produce additional
sleep. The drug is tested on 10 patients with
the results shown.
Patient    Hours gained
1          0.7
2          -1.1
3          -0.2
4          1.2
5          0.1
6          3.4
7          3.7
8          0.8
9          1.8
10         2.0
We are testing the hypothesis
Ho: μ = 0 vs Ha: μ > 0
The mean hours gained = 1.24 and s = 1.45. So the
t statistic is
$$t = \frac{1.24 - 0}{1.45 / \sqrt{10}} = 2.7$$
which has 9 df.
For df = 9 and a one-tailed α = .05, the required t = 1.833 (the two-tailed value would be 2.262).
Since our obtained t is greater than the required t,
we can, with 95% confidence, reject Ho.
So in this case, even with only 10 patients, we can
endorse the drug for obtaining longer sleep.
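The sleep-drug statistic can be recomputed from the ten gains. This is a sketch using only the Python standard library. One observation from the recomputation: the quoted s = 1.45 matches the standard deviation with divisor n, while the divisor n − 1 gives about 1.53 and a t of about 2.57. Both values exceed the one-tailed critical value of 1.833, so the conclusion does not change.

```python
import math
import statistics

gains = [0.7, -1.1, -0.2, 1.2, 0.1, 3.4, 3.7, 0.8, 1.8, 2.0]
n = len(gains)

mean_gain = statistics.mean(gains)
sd_n = statistics.pstdev(gains)   # divisor n, matches the quoted 1.45
sd_n1 = statistics.stdev(gains)   # divisor n - 1

t_slide = mean_gain / (sd_n / math.sqrt(n))
t_n1 = mean_gain / (sd_n1 / math.sqrt(n))

print(f"mean = {mean_gain:.2f}, s (divisor n) = {sd_n:.2f}, t = {t_slide:.2f}")
print(f"s (divisor n-1) = {sd_n1:.2f}, t = {t_n1:.2f}")
```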
Now let’s take a second example. A civil
service exam is given to a group of 200
candidates. Based on their total scores, the
200 candidates are divided into two groups,
the top 30% and the bottom 70%.
Now consider the first question in the
examination. In the upper 30% group, 40 had
the right answer. In the lower 70% group, 80
had the right answer. Is the question a good
discriminator between the top scorers and
the lower scorers?
To answer this question, we first set up the
two hypotheses.
In this case, the null hypothesis is
Ho: pu = pl
and the alternative is
Ha: pu > pl
because we would expect the upper group to
do better than the lower group on all
questions.
In binomial situations, we must deal
with proportions instead of counts
unless the two sample sizes are the
same.
The proportion of successes p = x/n
may be assumed to be normally
distributed with mean p and variance
pq/n if n is large.
Then the difference between two sample proportions
may also be approximately normally distributed if n is
large.
In this situation, μp1-p2 = p1-p2 and

$$\sigma^2_{p_1-p_2} = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}$$
Just as for the binomial distribution, the normal
approximation will be satisfactory if each nipi exceeds
5 when p ≤ ½ and niqi exceeds 5 when p > ½.
The test statistic is
$$t = \frac{p_u - p_l}{\sqrt{\dfrac{pq}{n_u} + \dfrac{pq}{n_l}}}$$
We need the common estimate of p
under Ho to use in the denominator, so
we use the estimate for the entire group.
So p = 120/200 = 3/5 =.6 and q = .4.
The p for the upper group = 40/60 = .67.
The p for the lower group = 80/140≈.57.
So inserting our values into the test statistic,
we get
$$t = \frac{.67 - .57}{\sqrt{\dfrac{.6(.4)}{60} + \dfrac{.6(.4)}{140}}} = \frac{.10}{.076} = 1.32$$
Our critical region is t > 1.65 because we
have set α = .05 as the significance level in this
1-tailed test. Because of the large sample
size, t.95 = z.95 .
Because the obtained t = 1.32 is lower than
the required t = 1.65, we cannot reject the
null hypothesis.
So, given the data, we conclude that the first
question is not a good one for distinguishing
between the upper scorers and the lower
scorers on the entire test.
Now let’s look at our test problem again.
Suppose instead of 200 candidates we
tested 500, but kept everything else in the
problem the same.
$$t = \frac{.67 - .57}{\sqrt{\dfrac{.6(.4)}{150} + \dfrac{.6(.4)}{350}}} = \frac{.10}{.0478} = 2.092$$
Now we will reject Ho because now
t = 2.092, which is greater than 1.65, the
critical value of t.
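The effect of sample size can be seen by running the same pooled test on both group sizes. This is a sketch: the function name `pooled_two_proportion_z` is illustrative, and the counts for the 500-candidate case (100 of 150 and 200 of 350) are an assumption that keeps the proportions unchanged. Using unrounded proportions, the statistics come out near 1.26 and 1.99 instead of 1.32 and 2.092, which used .67 and .57; the decisions are the same.

```python
import math

def pooled_two_proportion_z(x1, n1, x2, n2):
    """Test statistic for Ho: p1 = p2, using the pooled estimate of p."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                       # common estimate under Ho
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

CRITICAL = 1.65  # one-tailed critical value for alpha = .05, from the text

z_200 = pooled_two_proportion_z(40, 60, 80, 140)     # 200 candidates
z_500 = pooled_two_proportion_z(100, 150, 200, 350)  # 500 candidates, same proportions

for label, z in (("n = 200", z_200), ("n = 500", z_500)):
    print(f"{label}: statistic = {z:.2f}, reject Ho: {z > CRITICAL}")
```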
This is why we never accept Ho, but
only fail to reject it with the evidence in
hand. It is always possible that a more
powerful test will provide evidence to
reject Ho.
But this leads to another question. If,
theoretically, we can always keep
increasing sample size, then eventually
we will always be able to reject Ho. So
why do the test to begin with?
The reality is that you can’t keep
increasing n in the real world because
there are constraints on time, money,
and manpower that prevent having n so
large that rejection of Ho is a foregone
conclusion.
We usually have to get by with the n we
have available.
Furthermore, even if we could get a
larger sample size, there is no
guarantee that everything else will
remain the same.
The mean difference in the numerator
could change. So could the variance
estimates in the denominator.
So we do the test because there is no
other choice.
Let’s look at another example of testing
the difference between two proportions.
A railroad company installed two sets
of 50 ties. The two sets were treated
by creosote using two different
processes.
After a number of years in service, 22
ties of the first set and 18 ties of the
second set were still in good condition.
The question is whether one method of
treating with creosote is better than the
other. So we set up two hypotheses:
Ho: p1 = p2
Ha: p1 ≠ p2
Now we can use the t test statistic because
the samples are large enough to assume
normality of p1 – p2. For a 2-tailed test with
α = .05, the critical value of t = 1.96.
$$t = \frac{p_1 - p_2}{\sqrt{\dfrac{pq}{n_1} + \dfrac{pq}{n_2}}}$$
First, we need to get the values of p and q
for the denominator. Since Ho treats both
p1 and p2 as coming from populations with
the same p, the common estimate of p is
(22+18)/100 = .4. So q = .6.
Now the t test is
$$t = \frac{.44 - .36}{\sqrt{\dfrac{(.4)(.6)}{50} + \dfrac{(.4)(.6)}{50}}} = \frac{.08}{.09798} = .816$$
Clearly, we cannot reject Ho.
As another example, consider the application
of a fertilizer to plots of farm ground and the
effect it has on the yield of corn in bushels.
The data are
Treated      6.2   5.7   6.5   6     6.3   5.8   5.7   6     6     5.8
Untreated    5.6   5.9   5.6   5.7   5.8   5.7   6     5.5   5.7   5.5
The average yield for the treated plots = 6.0,
with s2 = 0.0711. The average yield for the
untreated plots = 5.7 with s2 =0.0267.
Ho: μtreated = μuntreated
Ha: μtreated ≠ μuntreated
The test statistic is
$$t = \frac{6 - 5.7}{\sqrt{\dfrac{0.0711}{10} + \dfrac{0.0267}{10}}} = \frac{.3}{.098883} = 3.0339$$
So Ho can be rejected because α = .05 and
t.025 = 2.101 with 18 df. When you test the
difference between two means,
df = (nA-1) + (nB-1).
So we can conclude that the fertilizer will help
produce more bushels of corn.
Now we can ask how many extra bushels of corn we will get
with the fertilizer. The point estimate is .3, but we can
find a 95% confidence interval around this estimate.
In the case of a confidence interval for the difference
between two means,

$$(\bar{X}_A - \bar{X}_B) \pm t_{.025}\sqrt{s^2_{\bar{X}_A} + s^2_{\bar{X}_B}} = .3 \pm 2.101(.0989) = .3 \pm .208$$

$$.092 < \mu_A - \mu_B < .508$$
Note that the 95% confidence interval does not include 0,
and the t test rejected Ho that the difference = 0. The
confidence interval thus confirms the t test outcome.
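The fertilizer test and its confidence interval can be recomputed from the twenty yields. This is a sketch using the standard library; the critical value 2.101 is copied from the text rather than computed, and the assignment of yields to the two groups is the one that reproduces the quoted means and variances.

```python
import math
import statistics

treated = [6.2, 5.7, 6.5, 6.0, 6.3, 5.8, 5.7, 6.0, 6.0, 5.8]
untreated = [5.6, 5.9, 5.6, 5.7, 5.8, 5.7, 6.0, 5.5, 5.7, 5.5]

diff = statistics.mean(treated) - statistics.mean(untreated)
# Standard error of the difference between two independent means.
se = math.sqrt(statistics.variance(treated) / len(treated)
               + statistics.variance(untreated) / len(untreated))
t_stat = diff / se

T_CRIT = 2.101  # t with 18 df and .025 in each tail, from the text
half_width = T_CRIT * se
ci = (diff - half_width, diff + half_width)

print(f"t = {t_stat:.4f}, reject Ho: {abs(t_stat) > T_CRIT}")
print(f"95% CI for the mean difference: {ci[0]:.3f} to {ci[1]:.3f}")
```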
Because the sample size was only 10
for each group, we can’t say with any
degree of confidence that the increase
in yield is more than .092, but it may be
as much as .508.
One caution about using small samples
to test the difference between means is
that t assumes equality of the two
variances. If the samples are large, this
assumption is unnecessary.
Now how can we know if the two
variances are equal? We can test them.
We already know that each variance is
distributed as chi-square. Now how
can we test to see if two variances are
equal?
The answer is the F test.
The F distribution is a ratio of two
chi-square variables. So if s21 and s22
possess independent chi-square
distributions with v1 and v2 df,
respectively, then
$$F = \frac{s_1^2 / v_1}{s_2^2 / v_2}$$
has the F distribution with v1 and v2 df.
The F distribution is
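$$f(F) = c\,F^{\frac{1}{2}(v_1-2)}\,(v_2 + v_1 F)^{-\frac{1}{2}(v_1+v_2)}$$

where c is given by

$$c = \frac{\Gamma\!\left(\dfrac{v_1+v_2}{2}\right) v_1^{v_1/2}\, v_2^{v_2/2}}{\Gamma\!\left(\dfrac{v_1}{2}\right)\Gamma\!\left(\dfrac{v_2}{2}\right)} = \frac{\left(\dfrac{v_1+v_2-2}{2}\right)!\; v_1^{v_1/2}\, v_2^{v_2/2}}{\left(\dfrac{v_1-2}{2}\right)!\left(\dfrac{v_2-2}{2}\right)!}$$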
and the symbol Γ(x) denotes the gamma or factorial
function of x, which has the property that Γ(x+1) =
xΓ(x).
Now let’s do the test to see if our two
variances are equal. In the problem,
the two variances are .0711 and .0267.
So
$$F = \frac{s_1^2}{s_2^2} = \frac{.0711}{.0267} = 2.66$$
s2 .0267
and there are 10-1 and 10-1 df. Is
2.66 greater than would be expected if
the two variances were equal? To
answer this, we must consult the F
distribution.
Now it turns out that the critical regions in the two tails
have critical values that are reciprocals of each other.
That is, if

$$F = \frac{s_1^2}{s_2^2}$$

then

$$1/F = \frac{s_2^2}{s_1^2}$$
Because of this reciprocal property, the procedure is
always to place the larger variance over the smaller.
Then we can refer to the F distribution for the .025
critical region to see if the hypothesis of a common
variance is to be rejected.
For this case, with 9 and 9 df, the
critical value of F = 4.025. There is an
FINV function in EXCEL to find critical
values for the F distribution.
Since the observed value of 2.66 is
less than the critical value of 4.025, we
cannot reject the null hypothesis of
common variance.
To see the reciprocality, if we had
placed the smaller variance over the
larger, the observed ratio would be,
1/2.66 = .376
The critical value would be
1/F = 1/4.025 = .2484
In this case, for the left tail, the
observed value should be less than the
critical value. But here .376 > .2484, so
we would not reject Ho.
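The variance-ratio test and its reciprocal form can be written out as a short check. This is a sketch in which the critical value 4.025 is taken from the text, not computed; the variable names are illustrative.

```python
var_treated = 0.0711
var_untreated = 0.0267

# Larger variance over the smaller, so only the right tail is needed.
f_obs = max(var_treated, var_untreated) / min(var_treated, var_untreated)

F_CRIT = 4.025  # critical value for 9 and 9 df with .025 in the tail, from the text
reject_right = f_obs > F_CRIT

# Reciprocal form: smaller variance over the larger, compared in the left tail.
f_recip = 1 / f_obs
crit_recip = 1 / F_CRIT
reject_left = f_recip < crit_recip

print(f"F = {f_obs:.2f} vs {F_CRIT}: reject = {reject_right}")
print(f"1/F = {f_recip:.3f} vs {crit_recip:.4f}: reject = {reject_left}")
```

Both forms give the same decision, which is why placing the larger variance on top loses nothing.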
So far, we have always been talking
about the difference between two
independent means. What if the means
are not independent?
In the situation of correlated means, we
must find the standard error of the
difference between two dependent
means. So now the situation is
complicated.
Suppose we have 10 heart patients who took
a treadmill test, then went on a strict exercise
program, then retook the treadmill test.
Patient    Test 1    Test 2    Gain
1          15        21        6
2          18        23        5
3          18        21        3
4          20        25        5
5          20        24        4
6          22        29        7
7          23        28        5
8          24        28        4
9          22        25        3
10         25        33        8
Mean       20.7      25.7      5
St dev     3.093     3.802     1.633
Our question is whether the strict
exercise program really helped the
patients’ treadmill performance.
Ho: μafter = μbefore
Ha: μafter > μbefore
Clearly, the before and after test values
are correlated because they are scores
for the same patients. In fact, the
correlation coefficient = .908. We can
use the t test statistic, but what do we
put in the denominator?
We have to incorporate this dependence in the
standard error. The simplest way to do this is
to reformulate Ho and Ha.
Ho: μdiff = 0
Ha: μdiff > 0
Now all we have to do is test the observed
difference against the null difference of 0.
Clearly the difference between means is
normally distributed, so the t test can be used.
Now we have
50
5
t

 9.68
1.633 .5164
10
For α =.05 in the right tail (1-tailed test),
the critical value of t with df = 9 is 1.833.
Our observed value of 9.68 is much
greater than 1.833 so we can reject Ho
with 95% confidence.
The point to see here is that with correlated
data, the test of the differences gives the
same result as a test of before and after
values if the covariance is incorporated in the
standard error.
Incorporating the covariance directly in the standard error is cumbersome. Therefore,
the simpler way to handle correlated values is to use
the average of the differences in the t test
statistic numerator and the standard deviation
of the differences, divided by the square root
of n, in the denominator.
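The equivalence between the two routes can be checked numerically on the treadmill data. This sketch relies on the standard identity Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y), which is not stated in the lecture; the variable names are illustrative.

```python
import math
import statistics

test1 = [15, 18, 18, 20, 20, 22, 23, 24, 22, 25]
test2 = [21, 23, 21, 25, 24, 29, 28, 28, 25, 33]
n = len(test1)

# Route 1: work directly with the differences.
gains = [after - before for before, after in zip(test1, test2)]
se_diff = statistics.stdev(gains) / math.sqrt(n)
t_paired = statistics.mean(gains) / se_diff

# Route 2: before and after values with the covariance in the standard error.
m1, m2 = statistics.mean(test1), statistics.mean(test2)
cov = sum((a - m1) * (b - m2) for a, b in zip(test1, test2)) / (n - 1)
var_of_diff = statistics.variance(test1) + statistics.variance(test2) - 2 * cov
se_cov = math.sqrt(var_of_diff / n)
t_cov = (m2 - m1) / se_cov

print(f"differences route: se = {se_diff:.4f}, t = {t_paired:.2f}")
print(f"covariance route:  se = {se_cov:.4f}, t = {t_cov:.2f}")
```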
Now what about a confidence interval
around this difference?
The 95% confidence interval is
5 ± 2.262(.5164)
= 5 ± 1.168
3.832 < μdiff < 6.168
Again the 95% confidence interval does not
include 0, confirming the t test result at α
= .05. In this case, the t test is one-sided
and the confidence interval is 2-sided, yet they agree.
Such agreement does not always happen with 1-sided t
tests.
Tests of correlated data are much less
common than tests of independent groups.
Nonetheless, they do happen and testing the
differences is the way to handle this.
Most correlated data come from before-after
situations, so you must be careful in your
inferences about the effect of the intervening
activity if you reject Ho.
For example, it is possible in our example,
that some of the patients were taking some
drug that helped their treadmill performance.
So we can’t conclude that it was just the
exercise program unless we make sure that
there are no other factors to consider.
Another approach to an admissible test
strategy is that developed by Bayes, which
turns out to be a likelihood ratio test. Bayes’
formula is used to determine the likelihood of
a hypothesis, given an outcome.
$$P(H_i \mid D) = \frac{P(H_i)\,P(D \mid H_i)}{\sum_{j=1}^{k} P(H_j)\,P(D \mid H_j)}$$
This formula gives the likelihood of Hi given
the data you actually got versus the total
likelihood of every hypothesis given the data
you got. So Bayes’ strategy is a likelihood
ratio test.
Consider an example where there are two
identical boxes. Box 1 contains 2 red balls and
Box 2 contains 1 red ball and 1 white ball.
Now a box is selected by chance and 1 ball is
drawn from it. What is the probability that it was
Box 1 that was selected if the ball that was
drawn was red?
Let’s test this with Bayes’ formula.
There are only two hypotheses here, so H1= Box1
and H2 = Box2. The data, of course, = R. So we
can find
$$P(H_1 \mid R) = \frac{P(H_1)\,P(R \mid H_1)}{P(H_1)\,P(R \mid H_1) + P(H_2)\,P(R \mid H_2)} = \frac{(1/2)(1)}{(1/2)(1) + (1/2)(1/2)} = \frac{2}{3}$$
And we can find
$$P(H_2 \mid R) = \frac{P(H_2)\,P(R \mid H_2)}{P(H_1)\,P(R \mid H_1) + P(H_2)\,P(R \mid H_2)} = \frac{(1/2)(1/2)}{(1/2)(1) + (1/2)(1/2)} = \frac{1}{3}$$
So we can see that the odds of the data favoring
Box1 to Box2 are 2:1.
We are twice as likely to be right if we
choose Box 1, but there is still some
probability that it could be Box 2.
The reason we choose Box 1 is because
it is more likely, given the data we have.
This is the whole idea behind likelihood
ratio tests. We choose the hypothesis
which has the greater likelihood, given
the data we have. With other data, we
might choose another hypothesis.
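The two-box calculation generalizes to any finite set of hypotheses. A minimal sketch with exact fractions follows; the function name `posterior` is illustrative.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Bayes' formula: P(H_i | D) for each hypothesis H_i."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # total probability of the data over all hypotheses
    return [j / total for j in joint]

priors = [Fraction(1, 2), Fraction(1, 2)]       # each box equally likely to be picked
likelihood_red = [Fraction(1), Fraction(1, 2)]  # P(R | Box 1), P(R | Box 2)

post = posterior(priors, likelihood_red)
print(f"P(Box 1 | R) = {post[0]}, P(Box 2 | R) = {post[1]}, odds = {post[0] / post[1]}:1")
```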
Now we’re going to look at tests where the
test statistic is χ2.
The first case is a test of goodness-of-fit.
Here we are comparing an observed
distribution with some distribution expected
under the null hypothesis to see if the data fit
Ho or not.
Since we’re dealing with distributions here
and not means, we will use frequencies
instead of measurements.
Suppose you want to know if a die is
fair. Under the null hypothesis, the die
is fair. Under the alternative hypothesis,
the die is not fair.
Ho: p = 1/6 for all sides of the die
Ha: p ≠ 1/6 for at least one side of the die
Now how do we test this?
We do an experiment to test the die. The
observed data are shown in the table below:
X    Observed    Expected
1    n1          n/6
2    n2          n/6
3    n3          n/6
4    n4          n/6
5    n5          n/6
6    n6          n/6
The expected data are what is expected
under Ho.
The question is whether or not the observed
data agree with the expected data. If they do
not agree, then we reject Ho that the die is
fair.
This goodness-of-fit test is due to Karl
Pearson. The test statistic is
$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_{io})^2}{np_{io}}$$
where k = number of categories.
In fact, this χ2 test, like the t test, turns out to
be equivalent to a likelihood ratio test.
Now suppose the experiment consists of 60
rolls of the die with the following results:
X    Observed    Expected
1    15          60/6 = 10
2    7           60/6 = 10
3    4           60/6 = 10
4    11          60/6 = 10
5    6           60/6 = 10
6    17          60/6 = 10
Now we can apply our test statistic to these
data to see if Ho is to be rejected or not.
The test statistic is
$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_{io})^2}{np_{io}} = \frac{(15-10)^2}{10} + \frac{(7-10)^2}{10} + \frac{(4-10)^2}{10} + \frac{(11-10)^2}{10} + \frac{(6-10)^2}{10} + \frac{(17-10)^2}{10} = \frac{136}{10} = 13.6$$
Now for α = .05, the critical value of χ2 with
6-1 = 5 df is 11.1. Since our observed
χ2 > 11.1, we reject Ho that the die is fair.
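The goodness-of-fit statistic for the die is a one-line sum. This sketch takes the critical value 11.1 from the text; the function name `chi_square` is illustrative.

```python
def chi_square(observed, expected):
    """Pearson's statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [15, 7, 4, 11, 6, 17]   # counts for faces 1 to 6
n = sum(observed)
expected = [n / 6] * 6             # fair die under Ho

chi2 = chi_square(observed, expected)
CRITICAL = 11.1                    # chi-square with 5 df, alpha = .05, from the text

print(f"chi-square = {chi2:.1f}, reject Ho: {chi2 > CRITICAL}")
```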
The test statistic we used approaches
the χ2 distribution when n is large
because the proportions p are
distributed normally when n is large.
A limitation of the use of χ2 is that all of
the expected frequencies must be ≥ 5.
This is similar to the limitation for the
use of the normal approximation to the
binomial in which np and nq were
required to be > 5.
The expected frequencies are not always equal.
Consider the following example.
In experiments on breeding flowers, the colors were
purple, red, green, and yellow. The flower colors are
expected to occur in a 9:3:3:1 ratio. So
Ho: pp = 9/16, pr = 3/16, pg = 3/16, py = 1/16
in a multinomial distribution involving four categories
for which n = 217.
The question is whether the colors are in accord with
the theoretically expected frequencies.
The observed and expected data are
      purple    red    green    yellow    total
ni    120       48     36       13        217
ei    122       41     41       14        218
The expected data are obtained by multiplying each expected p
by n. So the expected frequencies for purple and red flowers
are
purple: (217) 9/16 = 122.06
red: (217) 3/16 = 40.69
Because the expected frequencies are rounded to whole numbers, the totals for
observed and expected frequencies may not be identical.
Here the test statistic is
$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_{io})^2}{np_{io}} = \frac{(120-122)^2}{122} + \frac{(48-41)^2}{41} + \frac{(36-41)^2}{41} + \frac{(13-14)^2}{14} = 1.909$$
The critical value of χ2 with 3 df for a critical
region of size α=.05 is 7.8. Since the
observed χ2 of 1.909 < 7.8, the null
hypothesis cannot be rejected.
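When the expected frequencies come from a ratio, they can be computed without rounding. This sketch uses the 9:3:3:1 ratio and n = 217 from the example; with unrounded expected values the statistic is about 1.91, essentially the same as the 1.909 obtained from whole-number expected frequencies.

```python
observed = {"purple": 120, "red": 48, "green": 36, "yellow": 13}
ratio = {"purple": 9, "red": 3, "green": 3, "yellow": 1}

n = sum(observed.values())
ratio_total = sum(ratio.values())

# Expected frequency for each color is n times its hypothesized proportion.
expected = {color: n * parts / ratio_total for color, parts in ratio.items()}

chi2 = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
CRITICAL = 7.8  # chi-square with 3 df, alpha = .05, from the text

for color in observed:
    print(f"{color:7s} observed = {observed[color]:3d}, expected = {expected[color]:.2f}")
print(f"chi-square = {chi2:.3f}, reject Ho: {chi2 > CRITICAL}")
```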
Another use of the χ2 distribution for
hypothesis testing is for tests of
independence in contingency tables.
A contingency table is a cross-classification of two variables.

              Variable B
Variable A    B1     B2     B3            B4     Totals
A1                                               p1.
A2                          (p2.)(p.3)           p2.
A3                                               p3.
Totals        p.1    p.2    p.3           p.4
where pi. is the probability of being
in the ith row and p.j is the
probability of being in the jth
column.
Since, under Ho, these variables
are independent, the probability of
the ijth cell is the product of the row
and column probabilities.
The dot notation is common in cross-classifications, where the dot is in place
of what has been summed over to get
the marginal.

$$n_{i\cdot} = \sum_{j=1}^{c} n_{ij} \qquad n_{\cdot j} = \sum_{i=1}^{r} n_{ij}$$

and pi. = ni. / n, p.j = n.j / n
In this case,
Ho: Variables A and B are independent
Ha: Variables A and B are not independent
and the test statistic is
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - n p_{i.} p_{.j})^2}{n p_{i.} p_{.j}}$$
with (r-1)(c-1) df. This test is also due to
Karl Pearson.
Let’s look at an example. Suppose an experimenter is
interested in whether or not educational level is
related to marital adjustment, and has collected the
following data, where the values in parentheses are
expected frequencies.
Educational                Marital Adjustment Score
Level           Very low   Low       High      Very high   Totals
College         18 (27)    29 (39)   70 (64)   115 (102)   232
High school     17 (13)    28 (19)   30 (32)   41 (51)     116
Grade school    11 (6)     10 (9)    11 (14)   20 (23)     52
Totals          46         67        111       176         400
How do you get the expected frequencies?
Just replace the pi. and p.j with their
maximum likelihood estimates, the observed
marginal proportions ni./n and n.j/n.
The expected frequency is
$$n p_{i.} p_{.j} = n \left(\frac{n_{i.}}{n}\right)\left(\frac{n_{.j}}{n}\right) = \frac{n_{i.} n_{.j}}{n}$$
and we can find the expected frequency
for cell11 by
$$\frac{232 \times 46}{400} = 26.68 \approx 27$$
Now the test statistic is
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\left(n_{ij} - \frac{n_{i.} n_{.j}}{n}\right)^2}{\frac{n_{i.} n_{.j}}{n}}$$
which has a χ2 distribution with (r-1)(c-1) df
if n is sufficiently large and Ho is true.
Note that nij is the observed cell frequency
and (ni.n.j)/n is the expected cell frequency
under Ho.
Now we can find all the cell expected
frequencies and find the observed χ2
just as we did for the goodness-of-fit
test.
The critical value of χ2 for (3-1)(4-1) = 6 df with α = .05
is 12.592. The observed χ2 is 20.7 for
our contingency table, so the null
hypothesis can be rejected at the .05
level.
Again, we must be sure that all
expected frequencies are ≥ 5 to assure
the validity of the χ2 test.
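The expected frequencies, the test statistic, and the validity check for the education example can be sketched as follows. This is an illustrative sketch, not the slides' own code: the function names are assumptions introduced here, and the critical value 12.592 is copied from the text.

```python
def expected_frequencies(table):
    """Expected cell counts under independence: (row total * column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Return (statistic, df, expected) for an r x c contingency table."""
    expected = expected_frequencies(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, expected

# Rows: college, high school, grade school
# Columns: very low, low, high, very high marital adjustment
education = [
    [18, 29, 70, 115],
    [17, 28, 30, 41],
    [11, 10, 11, 20],
]

stat, df, expected = chi_square_independence(education)
all_cells_ok = all(e >= 5 for row in expected for e in row)  # validity condition
critical_value = 12.592  # chi-square, 6 df, alpha = .05 (from the text)
print(round(expected[0][0], 2), df, round(stat, 2), all_cells_ok, stat > critical_value)
```

Because the sketch keeps the expected frequencies unrounded, its statistic comes out somewhat below the 20.7 obtained with whole-number expected frequencies. The decision to reject Ho is unchanged.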
Now let’s look at some more examples.
The following data are for school children in
a city in Scotland.
                          Hair
Eyes      Fair    Red     Medium   Brown   Black   Total
blue      1368    170     1041     398     1       2978
light     2577    474     2703     932     11      6697
medium    1390    420     3826     1842    33      7511
dark      454     255     1848     2506    112     5175
Total     5789    1319    9418     5678    157     22361
Test to see whether hair color and eye color
are independently distributed.
In this study,
Ho: hair and eye color are independent
Ha: hair and eye color are not independent
The appropriate test statistic here is χ2
with (4-1)(5-1) = 12 df. For α = .05,
the critical value of χ2 = 21.026.
Now this is a big table and involves so
much computation that we will look for
a shortcut.
Let’s look for some obvious
incompatibility. The smallest eye color
total overall is for blue eyes. Blue eyes
have the smallest frequency for all hair
colors except fair.
Since cell11 is incompatible with the
other cells in row 1, let’s look at the χ2
value for this cell.
For cell11, the expected frequency under Ho is
$$\frac{n_{1.} n_{.1}}{n} = \frac{2978 \times 5789}{22361} = 770.97 \approx 771$$
So the χ2 value for this cell is
$$\frac{(1368 - 771)^2}{771} = \frac{356409}{771} = 462.27$$
Just with this one cell, we can reject Ho
because 462.27 >> 21.026. We don’t need to
compute the χ2 values for all the other cells
because we already can reject Ho. If we were
to add the χ2 values for all other cells, the
rejection would be very much stronger.
This example illustrates the value of looking
at the data to see where there is an
inconsistency and then checking that cell.
We also could have checked cell35 because
the highest total row is row 3 but this cell is
not highest in its column.
Looking at the data also helps to find the
cell(s) with the most unexpected result. It
may be that most of the cells are close to
expectation, but that one or two of them are
very much not in accord with expectation.
This could be a very useful finding.
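Finding the cells that are most out of line with expectation can be sketched by computing every cell's χ2 term and ranking them. This is a minimal illustration, not part of the original slides: the helper name `cell_contributions` and the label lists are assumptions introduced here, and the critical value 21.026 is copied from the text.

```python
def cell_contributions(table):
    """Per-cell chi-square terms (observed - expected)^2 / expected under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [
        [(o - r * c / n) ** 2 / (r * c / n) for o, c in zip(row, col_totals)]
        for row, r in zip(table, row_totals)
    ]

eye_labels = ["blue", "light", "medium", "dark"]
hair_labels = ["fair", "red", "medium", "brown", "black"]
hair_eye = [
    [1368, 170, 1041, 398, 1],
    [2577, 474, 2703, 932, 11],
    [1390, 420, 3826, 1842, 33],
    [454, 255, 1848, 2506, 112],
]

contrib = cell_contributions(hair_eye)
total = sum(sum(row) for row in contrib)
critical_value = 21.026  # chi-square, 12 df, alpha = .05 (from the text)

# Rank cells from most to least unexpected
ranked = sorted(
    ((contrib[i][j], eye_labels[i], hair_labels[j])
     for i in range(len(eye_labels)) for j in range(len(hair_labels))),
    reverse=True,
)
for value, eye, hair in ranked[:3]:
    print(f"{eye:>6} eyes, {hair:>6} hair: {value:8.2f}")
```

The term for the blue-eyed, fair-haired cell is about 462.3 here rather than 462.27 because the expected frequency is not rounded to 771. The ranking makes it easy to see which cells depart most from independence.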
Take another example. Five brands of canned
salmon are being tested for quality. The tester
examines 24 cans of each brand and finds the
following results.
                     Brand
Quality     A     B     C     D     E     Total
High        21    14    17    22    16    90
Very low    3     10    7     2     8     30
Total       24    24    24    24    24    120
Can you say which brand of salmon you would
most like to buy and which you would not
accept?
In this case,
Ho: brand and quality are independent
Ha: brand and quality are not independent
The critical value for the χ2 statistic with
(2-1)(5-1) = 4 df where α = .05 is 9.488.
Let’s see if we can reject Ho. The expected
frequencies are easy to compute here
because the column totals are all equal. So
for the first row, all the expected frequencies
= 18. For the second row, all the expected
frequencies are 6.
The χ2 is
$$\chi^2 = \frac{(21-18)^2}{18} + \frac{(14-18)^2}{18} + \frac{(17-18)^2}{18} + \frac{(22-18)^2}{18} + \frac{(16-18)^2}{18}$$
$$+ \frac{(3-6)^2}{6} + \frac{(10-6)^2}{6} + \frac{(7-6)^2}{6} + \frac{(2-6)^2}{6} + \frac{(8-6)^2}{6}$$
$$= .5 + .89 + .06 + .89 + .22 + 1.5 + 2.67 + .17 + 2.67 + .67 = 10.22$$
which is greater than the critical value of 9.488, so
we can reject Ho at the .05 level.
But we can do more than that. We can choose the
salmon brand we will buy (brand D) and the brand we
will avoid (brand B) because they have the largest
combined (high quality + low quality) χ2 values: D has
more high-quality cans than expected and B has fewer.
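The per-brand comparison can be sketched by adding each brand's high-quality and low-quality χ2 terms and keeping the sign of the high-quality residual. This is an illustrative sketch, not from the original slides; the dictionary `per_brand` and its layout are assumptions introduced here.

```python
brands = ["A", "B", "C", "D", "E"]
high = [21, 14, 17, 22, 16]
low = [3, 10, 7, 2, 8]

n = sum(high) + sum(low)
per_brand = {}
for brand, hi_count, lo_count in zip(brands, high, low):
    cans = hi_count + lo_count
    exp_high = sum(high) * cans / n   # 18 for every brand
    exp_low = sum(low) * cans / n     # 6 for every brand
    chi_part = ((hi_count - exp_high) ** 2 / exp_high
                + (lo_count - exp_low) ** 2 / exp_low)
    # Store (combined chi-square term, high-quality residual)
    per_brand[brand] = (chi_part, hi_count - exp_high)

total = sum(part for part, _ in per_brand.values())
for brand, (part, resid) in per_brand.items():
    print(f"brand {brand}: chi-square part = {part:.2f}, high-quality residual = {resid:+.0f}")
```

Brands B and D have the same combined term, so the sign of the residual is what separates the brand to buy (more high-quality cans than expected) from the brand to avoid (fewer).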
Finally, there is a third way to use χ2 as a test
statistic. We have already seen χ2 used to
form a confidence interval for the variance σ2.
Now we see how to use χ2 for testing an
observed vs a hypothetical value of σ.
Consider the following problem. Past
experience for a manufactured product has
shown that σ = 7.5. However, the latest
sample of size 25 gave s =10. Has the
variability in this product increased?
We set up the two hypotheses:
Ho: σ = 7.5
Ha: σ > 7.5
The test statistic is
$$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2} = \frac{(25-1)10^2}{7.5^2} = 42.67$$
and the critical value of χ2 with 24 df for α = .05 in
this 1-tailed test is 36.415. Since our observed
value > the critical value, we reject Ho at the .05
level and claim that variability has increased.
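The arithmetic of this variance test can be sketched in a few lines. The function name `variance_test_statistic` is an assumption introduced here for illustration, and the critical value 36.415 is copied from the text rather than computed.

```python
def variance_test_statistic(n, s, sigma0):
    """Chi-square statistic (n - 1) s^2 / sigma0^2, with n - 1 df."""
    return (n - 1) * s ** 2 / sigma0 ** 2, n - 1

stat, df = variance_test_statistic(n=25, s=10, sigma0=7.5)
critical_value = 36.415  # chi-square, 24 df, upper-tail alpha = .05 (from the text)
reject = stat > critical_value
print(round(stat, 2), df, reject)
```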
The right tail of the χ2 distribution is a
restricted best critical region because
the best critical region exists only when
μ = 0.
If the alternative were σ < σo, then the
left tail of the χ2 distribution is a
restricted best test.
But if the alternative were σ ≠ σo, there
is no best critical region and we use the
two equal tails of the χ2 distribution as
the compromise critical region.
We have looked at hypothesis testing
with three different statistics, t, F, χ2.
In all cases, we have chosen the
desired α level and used the test
statistic to see if the result falls in the
critical region. When it does and we
reject Ho at the chosen α level, we call
the result significant.
There is another way to deal with testing
that does not involve setting the chosen
α level beforehand.
This way is called significance testing.
We still do hypothesis testing as before
except that now we do not choose the α level.
That is, we do not have a critical region or a
critical value of the test statistic.
Instead we compute the observed value of
the test statistic and then find the probability,
assuming Ho is true, that the test statistic will
equal or exceed this value. If the probability is
small enough, we say the result is significant at
p < that probability.
For example, in the light bulb test, we
observed a t of 4.23. If we had been
doing significance testing instead of
hypothesis testing, we would find the
probability of a t this great or greater.
It turns out that this probability is
0.0000994, so we would say that this
result is significant at p < .0000994.
Significance testing, in effect, gives us
the smallest α level at which we would
reject Ho for the data we have.
For the treadmill test of correlated
means, we observed a t = 9.68, which
was significant at the .05 level.
If we were doing significance testing
here, we would declare the result
significant at p < 0.00000469.
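The same idea can be sketched for two of the χ2 examples from earlier in this section. The t examples would need the tail of the t distribution, which the Python standard library does not provide, so this sketch uses χ2 instead. It relies on a closed form for the upper tail that holds only when the df is even; that formula and the function name `chi_square_upper_tail` are assumptions brought in here, not something the slides use.

```python
import math

def chi_square_upper_tail(x, df):
    """P(chi-square with df degrees of freedom >= x), for even df only.

    Uses the closed form exp(-x/2) * sum over k < df/2 of (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch handles positive even df only")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# Observed statistics and df taken from the earlier examples
p_salmon = chi_square_upper_tail(10.22, 4)      # brand vs quality
p_variance = chi_square_upper_tail(42.67, 24)   # sigma = 7.5 vs sigma > 7.5
print(f"{p_salmon:.4f} {p_variance:.4f}")
```

As a sanity check, feeding the tabled .05 critical values (9.488 for 4 df, 36.415 for 24 df) into the function should return a probability very close to .05.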
Significance testing vs hypothesis
testing was at one time a very
controversial issue. But today
significance testing is more commonly
used than hypothesis testing.
This is a case where technological
advances changed statistical practice.
When hypothesis testing was first
developed, there were no computers so
people selected α levels because they
were tabled for the major test statistics.
With computers, significance testing
took over because there was no longer
any need to use tables. The p-value
was computable in a split second, so
why not use it?