Hypothesis Testing
COMP 245 STATISTICS
Dr N A Heard
Contents

1 Hypothesis Testing
  1.1 Introduction
  1.2 Error Rates and Power of a Test

2 Testing for a population mean
  2.1 Normal Distribution with Known Variance
      2.1.1 Duality with Confidence Intervals
  2.2 Normal Distribution with Unknown Variance

3 Testing for differences in population means
  3.1 Two Sample Problems
  3.2 Normal Distributions with Known Variances
  3.3 Normal Distributions with Unknown Variances

4 Goodness of Fit
  4.1 Count Data and Chi-Square Tests
  4.2 Proportions
  4.3 Model Checking
  4.4 Independence
1 Hypothesis Testing

1.1 Introduction
Hypotheses
Suppose we are going to obtain a random i.i.d. sample X = (X1, . . . , Xn) of a random variable X with an unknown distribution PX. To proceed with modelling the underlying population, we might hypothesise probability models for PX and then test whether such hypotheses seem plausible in light of the realised data x = (x1, . . . , xn).
Or, more specifically, we might fix upon a parametric family PX|θ with unknown parameter θ and then hypothesise values for θ. Generically, let θ0 denote a hypothesised value for θ. Then, after observing the data, we wish to test whether we can indeed reasonably assume θ = θ0.
For example, if X ∼ N(µ, σ²) we may wish to test whether µ = 0 is plausible in light of the data x.
Formally, we define a null hypothesis H0 as our hypothesised model of interest, and also specify an alternative hypothesis H1 of rival models against which we wish to test H0.
Most often we simply test H0 : θ = θ0 against H1 : θ ≠ θ0. This is known as a two-sided test. In some situations it may be more appropriate to consider alternatives of the form H1 : θ > θ0 or H1 : θ < θ0, known as one-sided tests.
Rejection Region for a Test Statistic
To test the validity of H0, we first choose a test statistic T(X) of the data for which we can find the distribution, PT, under H0.
Then we identify a rejection region R ⊂ ℝ of low-probability values of T under the assumption that H0 is true, so that

    P(T ∈ R | H0) = α

for some small probability α (typically 5%).
A well-chosen rejection region will have relatively high probability under H1, whilst retaining low probability under H0.
Finally, we calculate the observed test statistic t(x) for our observed data x. If t ∈ R we "reject the null hypothesis at the 100α% level".
p-Values
For each possible significance level α ∈ (0, 1), a hypothesis test at the 100α% level will
result in either rejecting or not rejecting H0 .
As α → 0 it becomes less and less likely that the null hypothesis will be rejected, as the rejection region becomes smaller and smaller. Similarly, as α → 1 it becomes more and more likely that the null hypothesis will be rejected.
For a given data set and resulting test statistic, we might therefore be interested in identifying the critical significance level which marks the threshold between rejecting and not rejecting the null hypothesis. This is known as the p-value of the data.
Smaller p-values suggest stronger evidence against H0 .
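To make this concrete, the p-value can be read straight off the null distribution of the test statistic. Here is a minimal Python sketch (using scipy), assuming for illustration a test statistic that is standard normal under H0, with a hypothetical observed value z = 2.1:

    from scipy import stats

    # Hypothetical example: a test statistic that is N(0, 1) under H0,
    # with observed value z = 2.1 and a two-sided alternative.
    z = 2.1

    # The p-value is the critical significance level: the smallest alpha
    # at which H0 would be rejected.
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    print(round(p_value, 4))  # approx 0.0357: reject at the 5% level, not at 1%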
1.2 Error Rates and Power of a Test
Test Errors
There are two types of error in the outcome of a hypothesis test:

• Type I: Rejecting H0 when in fact H0 is true. By construction, this happens with probability α. For this reason, the significance level of a hypothesis test is also referred to as the Type I error rate.

• Type II: Not rejecting H0 when in fact H1 is true. The probability with which this type of error occurs depends on the unknown true value of θ, so to calculate it we plug in a plausible alternative value θ1 ≠ θ0 for θ, and let β = P(T ∉ R | θ = θ1) be the probability of a Type II error.
Power
We define the power of a hypothesis test by

    1 − β = P(T ∈ R | θ = θ1).

For a fixed significance level α, a well-chosen test statistic T and rejection region R will have high power; that is, they maximise the probability of rejecting the null hypothesis when the alternative is true.
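For a concrete power calculation, consider a two-sided test for a normal mean with known variance (anticipating Section 2.1). A minimal Python sketch, where all the numerical values are hypothetical:

    import numpy as np
    from scipy import stats

    # Power of a two-sided z-test of H0: mu = mu0 at level alpha when the
    # true mean is mu1; all numerical values here are hypothetical.
    mu0, mu1, sigma, n, alpha = 0.0, 0.5, 1.0, 25, 0.05

    z_crit = stats.norm.ppf(1 - alpha / 2)       # reject when |Z| > z_crit
    shift = (mu1 - mu0) / (sigma / np.sqrt(n))   # mean of Z when mu = mu1

    # beta = P(|Z| <= z_crit) with Z ~ N(shift, 1), the Type II error rate
    beta = stats.norm.cdf(z_crit - shift) - stats.norm.cdf(-z_crit - shift)
    print(round(1 - beta, 3))  # power, approx 0.705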
2 Testing for a population mean

2.1 Normal Distribution with Known Variance
N(µ, σ²) with σ² Known
Suppose X1, . . . , Xn are i.i.d. N(µ, σ²) with σ² known and µ unknown.
We may wish to test if µ = µ0 for some specific value µ0 (e.g. µ0 = 0, µ0 = 9.8).
Then we can state our null and alternative hypotheses as

    H0 : µ = µ0;    H1 : µ ≠ µ0.
Under H0 : µ = µ0, we then know both µ and σ². So for the sample mean X̄ we have a known distribution for the test statistic

    Z = (X̄ − µ0) / (σ/√n) ∼ Φ.

So if we define our rejection region R to be the 100α% tails of the standard normal distribution,

    R = (−∞, −z_{1−α/2}) ∪ (z_{1−α/2}, ∞) ≡ {z : |z| > z_{1−α/2}},

we have P(Z ∈ R) = α under H0.
We thus reject H0 at the 100α% significance level ⇐⇒ our observed test statistic z = (x̄ − µ0)/(σ/√n) ∈ R.
The p-value is given by 2 × {1 − Φ(|z|)}.
[Figure: Ex. 5% rejection region for a N(0, 1) statistic; standard normal density φ(z) with the rejection region R shaded in the two tails.]
2.1.1 Duality with Confidence Intervals
There is a strong connection in this context between hypothesis testing and confidence intervals.
Suppose we have constructed a 100(1 − α)% confidence interval for µ. Then this is precisely the set of values {µ0} for which there would be insufficient evidence to reject a null hypothesis H0 : µ = µ0 at the 100α% level.
Example
A company makes packets of snack foods. The bags are labelled as weighing 454g; of
course they won’t all be exactly 454g, and let’s suppose the variance of bag weights is known
to be 70g². The following data show the mass in grams of 50 randomly sampled packets.
464, 450, 450, 456, 452, 433, 446, 446, 450, 447, 442, 438, 452, 447, 460, 450, 453, 456,
446, 433, 448, 450, 439, 452, 459, 454, 456, 454, 452, 449, 463, 449, 447, 466, 446, 447,
450, 449, 457, 464, 468, 447, 433, 464, 469, 457, 454, 451, 453, 443
Are these data consistent with the claim that the mean weight of packets is 454g?
1. We wish to test H0 : µ = 454 vs. H1 : µ ≠ 454. So set µ0 = 454.

2. Although we have not been told that the packet weights are individually normally distributed, by the CLT we still have that the mean weight of the sample of packets is approximately normally distributed, and hence we still approximately have Z = (X̄ − µ0)/(σ/√n) ∼ Φ.

3. x̄ = 451.22 and n = 50 ⇒ z = (x̄ − µ0)/(σ/√n) = −2.350.

4. For a 5%-level significance test, we compare the statistic z = −2.350 with the rejection region R = (−∞, −z_{0.975}) ∪ (z_{0.975}, ∞) = (−∞, −1.96) ∪ (1.96, ∞). Clearly we have z ∈ R, and so at the 5% level we reject the null hypothesis that the mean packet weight is 454g.
5. At which significance levels would we have not rejected the null hypothesis?

   • For a 1%-level significance test, the rejection region would have been R = (−∞, −z_{0.995}) ∪ (z_{0.995}, ∞) = (−∞, −2.576) ∪ (2.576, ∞). In this case z ∉ R, and so at the 1% level we would not have rejected the null hypothesis.

   • The p-value is 2 × {1 − Φ(|z|)} = 2 × {1 − Φ(|−2.350|)} ≈ 2(1 − 0.9906) = 0.019, and so we would only reject the null hypothesis for α > 1.9%.
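These figures are easy to verify numerically. A minimal Python sketch (using numpy and scipy) with the 50 sampled weights:

    import numpy as np
    from scipy import stats

    weights = np.array([
        464, 450, 450, 456, 452, 433, 446, 446, 450, 447, 442, 438, 452,
        447, 460, 450, 453, 456, 446, 433, 448, 450, 439, 452, 459, 454,
        456, 454, 452, 449, 463, 449, 447, 466, 446, 447, 450, 449, 457,
        464, 468, 447, 433, 464, 469, 457, 454, 451, 453, 443,
    ])

    mu0, sigma2 = 454, 70                    # hypothesised mean, known variance
    z = (weights.mean() - mu0) / np.sqrt(sigma2 / len(weights))
    p_value = 2 * stats.norm.sf(abs(z))      # sf(x) = 1 - cdf(x)
    print(round(z, 3), round(p_value, 3))    # z approx -2.35, p approx 0.019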
2.2 Normal Distribution with Unknown Variance

N(µ, σ²) with σ² Unknown
Similarly, if σ² in the above example were unknown, we still have that

    T = (X̄ − µ0) / (S_{n−1}/√n) ∼ t_{n−1}.

So for a test of H0 : µ = µ0 vs. H1 : µ ≠ µ0 at the α level, the rejection region for our observed test statistic t = (x̄ − µ0)/(s_{n−1}/√n) is

    R = (−∞, −t_{n−1,1−α/2}) ∪ (t_{n−1,1−α/2}, ∞) ≡ {t : |t| > t_{n−1,1−α/2}}.
[Figure: Ex. 5% rejection region for a t5 statistic; density f(t) with the rejection region R shaded in the two tails.]
Example 1
Consider again the snack food weights example. There, we assumed the variance of bag weights was known to be 70. Without this, we could have estimated the variance by

    s²_{n−1} = {1/(n − 1)} Σ_{i=1}^{n} (xi − x̄)² = 70.502.

Then the corresponding t-statistic becomes

    t = (x̄ − µ0) / (s_{n−1}/√n) = −2.341,

very similar to the z-statistic from before.
And since n = 50, we compare with the t49 distribution, which is approximately N(0, 1). So the hypothesis test results and p-value would be practically identical.
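The same test is available directly in scipy. A minimal sketch, assuming the `weights` array defined in the previous sketch:

    from scipy import stats

    # One-sample t-test of H0: mu = 454 with sigma^2 unknown, using the
    # `weights` array defined in the previous sketch.
    t_stat, p_value = stats.ttest_1samp(weights, popmean=454)
    print(round(t_stat, 3), round(p_value, 3))  # t approx -2.341, p approx 0.023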
Example 2
A particular piece of code takes a random time to run on a computer, but the average time is known to be 6 seconds. The programmer tries an alternative optimisation in compilation and wishes to know whether the mean run time has changed. To explore this, he runs the re-optimised code 16 times, obtaining a sample mean run time of 5.8 seconds and a bias-corrected sample standard deviation of 1.2 seconds. Is the code any faster?

1. We wish to test H0 : µ = 6 vs. H1 : µ ≠ 6. So set µ0 = 6.

2. Assuming the run times are approximately normal, T = (X̄ − µ0)/(S_{n−1}/√n) ∼ t_{n−1}. That is, (X̄ − 6)/(S_{n−1}/√16) ∼ t15. So we reject H0 at the 100α% level if |T| > t_{15,1−α/2}.

3. x̄ = 5.8, s_{n−1} = 1.2 and n = 16 ⇒ t = (x̄ − µ0)/(s_{n−1}/√n) = −2/3.

4. We have |t| = 2/3 ≪ 2.13 = t_{15,0.975}, so we have insufficient evidence to reject H0 at the 5% level.

5. In fact, the p-value for these data is 51.51%, so there is very little evidence to suggest the code is now any faster.
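Since only summary statistics are given here, the test can be carried out directly from the t quantiles. A minimal Python sketch:

    import numpy as np
    from scipy import stats

    # One-sample t-test from the summary statistics of Example 2
    xbar, s, n, mu0, alpha = 5.8, 1.2, 16, 6.0, 0.05

    t = (xbar - mu0) / (s / np.sqrt(n))
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    p_value = 2 * stats.t.sf(abs(t), df=n - 1)
    print(round(t, 3), round(t_crit, 3), round(p_value, 4))
    # t approx -0.667, critical value approx 2.131, p approx 0.515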
3 Testing for differences in population means

3.1 Two Sample Problems
Samples from 2 Populations
Suppose, as before, we have a random sample X = (X1, . . . , Xn1) from an unknown population distribution PX.
But now suppose we have a further random sample Y = (Y1, . . . , Yn2) from a second, different population PY.
Then we may wish to test hypotheses concerning the similarity of the two distributions PX and PY.
In particular, we are often interested in testing whether PX and PY have equal means; that is, to test

    H0 : µX = µY vs. H1 : µX ≠ µY.
Paired Data
A special case is when the two samples X and Y are paired: that is, if n1 = n2 = n and the data are collected as pairs (X1, Y1), . . . , (Xn, Yn) so that, for each i, Xi and Yi are possibly dependent.
For example, we might have a random sample of n individuals, with Xi representing the heart rate of the i-th person before light exercise and Yi the heart rate of the same person afterwards.
In this special case, for a test of equal means we can consider the sample of differences Z1 = X1 − Y1, . . . , Zn = Xn − Yn and test H0 : µZ = 0 using the single-sample methods we have seen. In the above example, this would test whether light exercise causes a change in heart rate.
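A minimal Python sketch of the paired approach, with hypothetical heart-rate data (scipy's ttest_rel is equivalent to a one-sample t-test on the differences):

    import numpy as np
    from scipy import stats

    # Hypothetical paired data: heart rate before and after light exercise
    before = np.array([68, 72, 75, 66, 80, 71, 69, 74])
    after = np.array([70, 76, 74, 70, 83, 74, 70, 78])

    # Test H0: mu_Z = 0 on the differences Z_i = X_i - Y_i ...
    t_stat, p_value = stats.ttest_1samp(before - after, popmean=0.0)

    # ... which is exactly what scipy's paired t-test does directly
    t_rel, p_rel = stats.ttest_rel(before, after)
    print(round(t_stat, 3), round(p_value, 4))  # identical results either way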
3.2 Normal Distributions with Known Variances

N(µX, σX²), N(µY, σY²)
Suppose

• X = (X1, . . . , Xn1) are i.i.d. N(µX, σX²) with µX unknown;

• Y = (Y1, . . . , Yn2) are i.i.d. N(µY, σY²) with µY unknown;

• the two samples X and Y are independent.

Then we still have that, independently,

    X̄ ∼ N(µX, σX²/n1),    Ȳ ∼ N(µY, σY²/n2).

From this it follows that the difference in sample means satisfies

    X̄ − Ȳ ∼ N(µX − µY, σX²/n1 + σY²/n2),

and hence

    {(X̄ − Ȳ) − (µX − µY)} / √(σX²/n1 + σY²/n2) ∼ Φ.

So under the null hypothesis H0 : µX = µY, we have

    Z = (X̄ − Ȳ) / √(σX²/n1 + σY²/n2) ∼ Φ.
N(µX, σX²), N(µY, σY²) with σX², σY² Known
So if σX² and σY² are known, we immediately have a test statistic

    z = (x̄ − ȳ) / √(σX²/n1 + σY²/n2),

which we can compare against the quantiles of a standard normal.
That is,

    R = {z : |z| > z_{1−α/2}}

gives a rejection region for a hypothesis test of H0 : µX = µY vs. H1 : µX ≠ µY at the 100α% level.
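A minimal Python sketch of this two-sample z-test, where all the numerical values are hypothetical:

    import numpy as np
    from scipy import stats

    # Two-sample z-test with known variances; all values are hypothetical.
    xbar, ybar = 10.3, 9.8
    var_x, var_y = 4.0, 5.0
    n1, n2, alpha = 40, 50, 0.05

    z = (xbar - ybar) / np.sqrt(var_x / n1 + var_y / n2)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    print(round(z, 3), abs(z) > z_crit)  # 1.118, False: do not reject H0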
3.3 Normal Distributions with Unknown Variances

N(µX, σ²), N(µY, σ²) with σ² Unknown
On the other hand, suppose σX² and σY² are unknown.
Then if we know σX² = σY² = σ² but σ² is unknown, we can still proceed.
We have

    {(X̄ − Ȳ) − (µX − µY)} / {σ √(1/n1 + 1/n2)} ∼ Φ,

and so, under H0 : µX = µY,

    (X̄ − Ȳ) / {σ √(1/n1 + 1/n2)} ∼ Φ,

but with σ unknown.
Pooled Estimate of Population Variance
We need an estimator for the variance using samples from two populations with different means. Just combining the samples together into one big sample would over-estimate the variance, since some of the variability in the samples would be due to the difference between µX and µY.
So we define the bias-corrected pooled sample variance

    S²_{n1+n2−2} = {Σ_{i=1}^{n1} (Xi − X̄)² + Σ_{i=1}^{n2} (Yi − Ȳ)²} / (n1 + n2 − 2),

which is an unbiased estimator for σ².
We can immediately see that S²_{n1+n2−2} is indeed unbiased for σ² by noting that

    S²_{n1+n2−2} = {(n1 − 1)/(n1 + n2 − 2)} S²_{n1−1} + {(n2 − 1)/(n1 + n2 − 2)} S²_{n2−1};

that is, S²_{n1+n2−2} is a weighted average of the bias-corrected sample variances of the two individual samples, which are both unbiased estimators for σ².
Then, substituting S_{n1+n2−2} for σ, we get

    {(X̄ − Ȳ) − (µX − µY)} / {S_{n1+n2−2} √(1/n1 + 1/n2)} ∼ t_{n1+n2−2},

and so, under H0 : µX = µY,

    T = (X̄ − Ȳ) / {S_{n1+n2−2} √(1/n1 + 1/n2)} ∼ t_{n1+n2−2}.

So we have a rejection region for a hypothesis test of H0 : µX = µY vs. H1 : µX ≠ µY at the 100α% level given by

    R = {t : |t| > t_{n1+n2−2,1−α/2}},

for the statistic

    t = (x̄ − ȳ) / {s_{n1+n2−2} √(1/n1 + 1/n2)}.
Example
The same piece of C code was repeatedly run after compilation under two different C compilers, and the run times under each compiler were recorded. The sample mean and bias-corrected sample variance for Compiler 1 were 114s and 310s² respectively, and the corresponding figures for Compiler 2 were 94s and 290s². Both sets of data were based on 15 runs.
Suppose that Compiler 2 is a refined version of Compiler 1, so that if µ1, µ2 are the expected run times of the code under the two compilations, we might fairly assume µ2 ≤ µ1.
Conduct a hypothesis test of H0 : µ1 = µ2 vs. H1 : µ1 > µ2 at the 5% level.
Until now we have mostly considered two-sided tests, that is, tests of the form H0 : θ = θ0 vs. H1 : θ ≠ θ0.
Here we need to consider a one-sided test, which differs by the alternative hypothesis being of the form H1 : θ < θ0 or H1 : θ > θ0.
This presents no extra methodological challenge and requires only a slight adjustment in the construction of the rejection region.
We still use the t-statistic

    t = (x̄ − ȳ) / {s_{n1+n2−2} √(1/n1 + 1/n2)},

where x̄, ȳ are the sample mean run times under Compilers 1 and 2 respectively. But now the one-sided rejection region becomes

    R = {t : t > t_{n1+n2−2,1−α}}.
First calculating the bias-corrected pooled sample variance, we get

    s²_{n1+n2−2} = (14 × 310 + 14 × 290) / 28 = 300.

(Note that since the sample sizes n1 and n2 are equal, the pooled estimate of the variance is simply the average of the individual estimates.)
So

    t = (x̄ − ȳ) / {s_{n1+n2−2} √(1/n1 + 1/n2)} = (114 − 94) / {√300 √(1/15 + 1/15)} = √10 = 3.162.

For a one-sided test we compare t = 3.162 with t_{28,0.95} = 1.701 and conclude that we reject the null hypothesis at the 5% level; the second compilation is significantly faster.
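A minimal Python sketch reproducing this one-sided pooled-variance test from the summary statistics:

    import numpy as np
    from scipy import stats

    # One-sided pooled-variance t-test from the compiler summary statistics
    xbar, ybar = 114, 94       # sample mean run times (s)
    var1, var2 = 310, 290      # bias-corrected sample variances (s^2)
    n1, n2, alpha = 15, 15, 0.05

    s2_pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    t = (xbar - ybar) / np.sqrt(s2_pooled * (1 / n1 + 1 / n2))
    t_crit = stats.t.ppf(1 - alpha, df=n1 + n2 - 2)  # one-sided critical value
    print(round(t, 3), round(t_crit, 3), t > t_crit)  # 3.162, 1.701, True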
4 Goodness of Fit

4.1 Count Data and Chi-Square Tests
Count Data
The results in the previous sections relied upon the data being either normally distributed or, at least, having a sample mean that is approximately normally distributed via the CLT. Tests were then developed for making inferences about population means under those assumptions. These tests were very much model-based.
Another important but very different problem concerns model checking, which can be addressed through a more general consideration of count data for simple (discrete and finite)
distributions.
The following ideas can then be trivially extended to infinite range discrete and continuous
r.v.s by binning observed samples into a finite collection of predefined intervals.
Samples from a Simple Random Variable
Let X be a simple random variable taking values in the range {x1, . . . , xk}, with probability mass function pj = P(X = xj), j = 1, . . . , k.
A random sample of size n from the distribution of X can be summarised by the observed frequency counts O = (O1, . . . , Ok) at the points x1, . . . , xk (so Σ_{j=1}^{k} Oj = n).
Suppose it is hypothesised that the true pmf {pj} is from a particular parametric model pj = P(X = xj | θ), j = 1, . . . , k, for some unknown parameter p-vector θ.
To test this hypothesis about the model, we first need to estimate the unknown parameter θ so that we are able to calculate the distribution of any statistic under the null hypothesis H0 : pj = P(X = xj | θ), j = 1, . . . , k. Let θ̂ be such an estimator, obtained using the sample O.
Then under H0 we have estimated probabilities for the pmf, p̂j = P(X = xj | θ̂), j = 1, . . . , k, and so we are able to calculate estimated expected frequency counts E = (E1, . . . , Ek) by Ej = n p̂j. (Note again we have Σ_{j=1}^{k} Ej = n.)
We then seek to compare the observed frequencies with the expected frequencies to test for
goodness of fit.
Chi-Square Test
To test H0 : pj = P(X = xj | θ) vs. H1 : 0 ≤ pj ≤ 1, Σ pj = 1, we use the chi-square statistic

    X² = Σ_{i=1}^{k} (Oi − Ei)² / Ei.

If H0 were true, then the statistic X² would approximately follow a chi-square distribution with ν = k − p − 1 degrees of freedom.

• k is the number of values (categories) the simple r.v. X can take.

• p is the number of parameters being estimated (dim(θ)).

• For the approximation to be valid, we should have Ej ≥ 5 for all j. This may require some merging of categories.
[Figure: χ²ν pdf for ν = 1, 2, 3, 5, 10.]
Rejection Region
Clearly, larger values of X² correspond to larger deviations from the null hypothesis model; conversely, if X² = 0 the observed counts exactly match those expected under H0.
For this reason, we always perform a one-sided goodness of fit test using the χ² statistic, looking only at the upper tail of the distribution.
Hence the rejection region for a goodness of fit hypothesis test at the 100α% level is given by

    R = {x² : x² > χ²_{k−p−1,1−α}}.

4.2 Proportions
Example
Each year, around 1.3 million people in the USA suffer adverse drug effects (ADEs). A study in the Journal of the American Medical Association (July 5, 1995) gave the causes of 95 ADEs below.

    Cause                       Number of ADEs
    Lack of knowledge of drug   29
    Rule violation              17
    Faulty dose checking        13
    Slips                        9
    Other                       27

Test whether the true percentages of ADEs differ across the 5 causes.
Under the null hypothesis that the 5 causes are equally likely, we would have expected counts of 95/5 = 19 for each cause.
So our χ² statistic becomes

    x² = (29 − 19)²/19 + (17 − 19)²/19 + (13 − 19)²/19 + (9 − 19)²/19 + (27 − 19)²/19
       = 100/19 + 4/19 + 36/19 + 100/19 + 64/19 = 304/19 = 16.

We have not estimated any parameters from the data, so we compare x² with the quantiles of the χ²_{5−1} = χ²_4 distribution.
Since 16 > 9.49 = χ²_{4,0.95}, we reject the null hypothesis at the 5% level; we have reason to suppose that there is a difference in the true percentages across the different causes.
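A minimal Python sketch of this test (with no expected frequencies supplied, scipy's chisquare tests against equal category probabilities):

    from scipy import stats

    observed = [29, 17, 13, 9, 27]  # ADE counts by cause
    # Default null: all 5 categories equally likely, expected count 19 each
    result = stats.chisquare(observed)
    print(result.statistic, round(result.pvalue, 4))  # 16.0, p approx 0.003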
4.3 Model Checking

Example: Fitting a Poisson Distribution to Data
Recall the example from the Discrete Random Variables chapter, where the number of particles emitted by a radioactive substance which reached a Geiger counter was measured for
2608 time intervals, each of length 7.5 seconds.
We fitted a Poisson(λ) distribution to the data by plugging in the sample mean number of
counts (3.870) for the rate parameter λ. (Which we now know to be the MLE!)
    x       0     1     2     3     4     5     6     7     8     9   ≥10
    O(nx)  57   203   383   525   532   408   273   139    45    27    16
    E(nx) 54.4 210.5 407.4 525.5 508.4 393.5 253.8 140.3  67.9  29.2  17.1

(O = Observed, E = Expected.)
Whilst the fitted Poisson(3.87) expected frequencies looked quite convincing to the eye, at
that time we had no formal method of quantitatively assessing the fit. However, we now know
how to proceed.
    x            0      1      2      3      4      5      6      7      8      9    ≥10
    O           57    203    383    525    532    408    273    139     45     27     16
    E         54.4  210.5  407.4  525.5  508.4  393.5  253.8  140.3   67.9   29.2   17.1
    O − E      2.6   −7.5  −24.4   −0.5   23.6   14.5   19.2   −1.3  −22.9   −2.2   −1.1
    (O−E)²/E 0.124  0.267  1.461  0.000  1.096  0.534  1.452  0.012  7.723  0.166  0.071

The statistic x² = Σ (O − E)²/E = 12.906 should be compared with a χ²_{11−1−1} = χ²_9 distribution.
Since χ²_{9,0.95} = 16.92 > 12.906, at the 5% level we do not reject the null hypothesis of a Poisson(3.87) model for the data.
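A minimal Python sketch of the whole calculation; the exact expected counts differ slightly from the rounded values tabulated above:

    import numpy as np
    from scipy import stats

    lam, n = 3.870, 2608
    observed = np.array([57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 16])

    # Expected counts under Poisson(3.87): pmf at x = 0..9, upper tail for >= 10
    expected = n * np.append(stats.poisson.pmf(np.arange(10), lam),
                             stats.poisson.sf(9, lam))

    x2 = np.sum((observed - expected) ** 2 / expected)
    p_value = stats.chi2.sf(x2, df=11 - 1 - 1)  # one parameter estimated
    print(round(x2, 3), round(p_value, 3))  # X^2 approx 12.9, p approx 0.17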
4.4 Independence

Contingency Tables
Suppose we have two simple random variables X and Y which are jointly distributed with unknown probability mass function pXY.
We are often interested in trying to ascertain whether X and Y are independent; that is, to determine whether pXY(x, y) = pX(x) pY(y).
Let the ranges of the r.v.s X and Y be {x1, . . . , xk} and {y1, . . . , yℓ} respectively.
Then an i.i.d. sample of size n from the joint distribution of (X, Y) can be represented by a list of counts nij (1 ≤ i ≤ k; 1 ≤ j ≤ ℓ) of the number of times we observe the pair (xi, yj).
Tabulating these data in the following way gives what is known as a k × ℓ contingency table.
          y1    y2   . . .   yℓ
    x1    n11   n12  . . .   n1ℓ    n1·
    x2    n21   n22  . . .   n2ℓ    n2·
    ...
    xk    nk1   nk2  . . .   nkℓ    nk·
          n·1   n·2  . . .   n·ℓ    n
Note the row sums (n1·, n2·, . . . , nk·) represent the frequencies of x1, x2, . . . , xk in the sample (that is, ignoring the value of Y). Similarly for the column sums (n·1, n·2, . . . , n·ℓ) and y1, . . . , yℓ.
Under the null hypothesis

    H0 : X and Y are independent,

the expected values of the entries of the contingency table, conditional on the row and column sums, can be estimated by

    n̂ij = (ni· × n·j) / n,    1 ≤ i ≤ k, 1 ≤ j ≤ ℓ.

To see this, consider the marginal distribution of X; we could estimate pX(xi) by p̂i· = ni·/n. Similarly, for pY(yj) we get p̂·j = n·j/n.
Then under the null hypothesis of independence pXY(xi, yj) = pX(xi) pY(yj), and so we can estimate pXY(xi, yj) by

    p̂ij = p̂i· × p̂·j = (ni· × n·j) / n².
Now that we have a set of expected frequencies to compare against our k × ℓ observed frequencies, a χ² test can be performed.
We are using both the row and column sums to estimate our probabilities, and there are k and ℓ of these respectively; since each set of estimated probabilities must sum to one, this amounts to (k − 1) + (ℓ − 1) free parameters. So we compare our calculated x² statistic against a χ² distribution with kℓ − {(k − 1) + (ℓ − 1)} − 1 = (k − 1)(ℓ − 1) degrees of freedom.
Hence the rejection region for a hypothesis test of independence in a k × ℓ contingency table at the 100α% level is given by

    R = {x² : x² > χ²_{(k−1)(ℓ−1),1−α}}.
Example
An article in the International Journal of Sports Psychology (July-Sept 1990) evaluated the relationship between physical fitness and stress. 549 people were classified as having good, average, or poor fitness, and were also tested for signs of stress (yes or no). The data are shown in the table below.

                 Poor Fitness   Average Fitness   Good Fitness
    Stress            206             184              85         475
    No stress          36              28              10          74
                      242             212              95         549

Is there any relationship between stress and fitness?
Under independence we would estimate the expected values to be

                 Poor Fitness   Average Fitness   Good Fitness
    Stress          209.4            183.4            82.2        475
    No stress        32.6             28.6            12.8         74
                    242              212              95          549
Hence the χ² statistic is calculated to be

    x² = Σ_i (Oi − Ei)²/Ei = (206 − 209.4)²/209.4 + . . . + (10 − 12.8)²/12.8 = 1.1323.

This should be compared with a χ² distribution with (2 − 1) × (3 − 1) = 2 degrees of freedom.
Since χ²_{2,0.95} = 5.99 > 1.1323, we have no significant evidence to suggest there is any relationship between fitness and stress.
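A minimal Python sketch of this test of independence (scipy's chi2_contingency computes the expected table, the statistic, and the degrees of freedom):

    import numpy as np
    from scipy import stats

    # Observed 2x3 table: rows are stress / no stress, columns are fitness
    observed = np.array([[206, 184, 85],
                         [36, 28, 10]])

    x2, p_value, dof, expected = stats.chi2_contingency(observed)
    print(round(x2, 4), dof, round(p_value, 3))  # 1.1323, 2, p approx 0.568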