Political Science 236
Hypothesis Testing: Review and Bootstrapping
Rocío Titiunik
Fall 2007
1 Hypothesis Testing
Definition 1.1 Hypothesis. A hypothesis is a statement about a population parameter.
The goal of hypothesis testing is to decide, using a sample from the population, which of two
complementary hypotheses is true. In general, the two complementary hypotheses are called the
null hypothesis and the alternative hypothesis. If we let θ be a population parameter and Θ be the
parameter space, we can define these complementary hypotheses as follows:
Definition 1.2 Let Θ0 and Θ1 ≡ Θ0^c be a partition of the parameter space Θ. Then the null and
alternative hypotheses are defined as follows:
1. Null Hypothesis: H0 : θ ∈ Θ0
2. Alternative Hypothesis: H1 : θ ∈ Θ1
Definition 1.3 Testing Procedure. A testing procedure is a rule, based on the outcome of a
random sample from the population under study, used to decide whether to reject H0 .
The subset of the sample space for which H0 will be rejected is called the critical region (or
the rejection region), and its complement is called the acceptance region. In general, a hypothesis
test will be specified in terms of a test statistic T (X1 , X2 , ..., XN ) ≡ T (X), which is a function of
the sample. We can define the critical region formally as follows.
Definition 1.4 Critical Region. The subset Cc ⊂ R^N of the sample space for which H0 is rejected
is called the critical region and is defined by

Cc = {x ∈ R^N : T(x) > c}

for some c ∈ R. The value c is called the critical value. The complement of Cc, Ca ≡ Cc^c, is called
the acceptance region.
If we let CTc be the critical region of the test statistic T(X) (i.e., CTc is defined by
Cc = {x ∈ R^N : T(x) ∈ CTc}), a statistical test of H0 against H1 will generally be defined as:

T(x) ∈ CTc =⇒ Reject H0
T(x) ∉ CTc =⇒ Accept H0
A hypothesis test of H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1 can make one of two types of errors.
Definition 1.5 Type I and Type II Errors. Let H0 be a null hypothesis being tested for acceptance or rejection. The two types of errors that can be made are
1. Type I Error: rejecting H0 when θ ∈ Θ0 (i.e., when H0 is true)
2. Type II Error: accepting H0 when θ ∈ Θ1 (i.e., when H0 is false)
So a type I error is committed when the statistical test mistakenly rejects the null hypothesis,
and a type II error is committed when the test mistakenly accepts the null hypothesis. The ideal
test is one where the hypothesis would always be correctly identified as being either true or false.
For such an ideal test to exist, we would have to partition the range of potential sample outcomes in
such a way that outcomes in the critical region Cc would occur if and only if H0 were false and
outcomes in the acceptance region Ca would occur if and only if H0 were true. In general, ideal
tests cannot be constructed.
For θ ∈ Θ0 , the test will make a mistake if x ∈ Cc and therefore the probability of a type
I error is Pθ (X ∈ Cc ) and for θ ∈ Θ1 , the test will make a mistake if x ∈ Ca and therefore the
probability of a type II error is Pθ (X ∈ Ca ). Note that Pθ (X ∈ Cc ) = 1 − Pθ (X ∈ Ca ). We will
now define the power function of a test. The power function completely summarizes all of the
operating characteristics of a statistical test with respect to probabilities of making correct and
incorrect decisions about H0 . The power function is defined below.
Definition 1.6 Let H0 be defined as H0 : θ ∈ Θ0 and H1 be defined as H1 : θ ∈ Θ1. Let the critical
region Cc define a test of H0. Then the power function of the statistical test is the function of θ
defined by

β(θ) ≡ Pθ(X ∈ Cc) = { probability of Type I error,               if θ ∈ Θ0
                     { one minus probability of Type II error,    if θ ∈ Θ1
In words, the power function indicates the probability of rejecting H0 for every value of θ ∈ Θ.
The value of the power function at a particular value of the parameter space θp ∈ Θ is called the
power of the test at θp and represents the probability of rejecting H0 if θp were the true value of
the parameter vector. The ideal power function is 0 for all θ ∈ Θ0 and 1 for all θ ∈ Θ1 . In general,
this ideal cannot be attained and we say that a good test has power function near 0 for all θ ∈ Θ0
and near 1 for all θ ∈ Θ1. When comparing two tests for a given H0, a test is better if it has lower
power for θ ∈ Θ0 and higher power for θ ∈ Θ1, which implies that the test has lower probabilities
of both type I and type II error.
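The power function can be made concrete with a short example (this particular test is my own illustration, not from the notes): for X1, ..., Xn iid N(θ, 1), test H0 : θ ≤ 0 against H1 : θ > 0 by rejecting when √n·X̄ > c. Then β(θ) = P(√n·X̄ > c) = 1 − Φ(c − √n·θ), which a minimal Python sketch can evaluate:

```python
# Illustrative sketch (the concrete test is my own example): for
# X_1, ..., X_n iid N(theta, 1), reject H0: theta <= 0 when sqrt(n)*Xbar > c.
# Then beta(theta) = 1 - Phi(c - sqrt(n)*theta).
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power(theta, n=25, c=1.645):
    """beta(theta): probability the test rejects H0 when theta is the true value."""
    return 1.0 - norm_cdf(c - sqrt(n) * theta)

# Size: the supremum of beta over Theta_0 = (-inf, 0], attained at theta = 0.
size = power(0.0)                        # about 0.05 for c = 1.645
# Power rises toward 1 as theta moves deeper into the alternative.
curve = [power(t) for t in (0.0, 0.2, 0.5, 1.0)]
```

Comparing two such tests amounts to comparing their power curves: the better test has the lower curve on Θ0 and the higher curve on Θ1.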
We now define the size and level of a test:
Definition 1.7 Size. For 0 ≤ α ≤ 1, a test with power function β(θ) is a size-α test if
sup_{θ∈Θ0} β(θ) = α.

Definition 1.8 Level. For 0 ≤ α ≤ 1, a test with power function β(θ) is a level-α test if
sup_{θ∈Θ0} β(θ) ≤ α.
In words, the size of the test is the maximum probability of Type I error associated with a given
test rule. The lower the size of the test, the lower the maximum probability of mistakenly rejecting
H0 . The level of a test is an upper bound to the type I error probability of a statistical test. The
key difference between these two concepts is that the size represents the maximum value of β (θ)
for θ ∈ Θ0 (i.e. the maximum type I error) while the level is only a bound that might not equal
β (θ) for any θ ∈ Θ0 nor equal the supremum of β (θ) for θ ∈ Θ0 . Thus, the set of level-α tests
contains the set of size-α tests. In other words, a test of H0 having size γ is an α-level test for any
α ≥ γ.
In applications, when we say that H0 is (not) rejected at the α-significance level, we often mean
that α was the bound on the level of protection against type I error that was used when constructing
the test. A more accurate statement about the level of protection against type I error is that
H0 is (not) rejected using a size-α test.
2 Bootstrapping Hypothesis Tests
The simplest situation involves a simple null hypothesis H0 that completely specifies the probability
distribution of the data. Thus, if we have a sample x1 , x2 , ..., xn from a population with CDF F ,
then H0 specifies that F = F0 where F0 contains no unknown parameters. A statistical test is based
on a test statistic T which measures the discrepancy between the data and the null hypothesis. We
will follow the convention that large values of T are evidence against H0 . If the null hypothesis
is simple and the observed value of the test statistic is denoted by t, then the level of evidence
against H0 is measured by the significance probability
p = P (T ≥ t | H0 )
which is referred to as the p-value. The p-value is effectively the smallest test size at which the
null hypothesis would be rejected given the observed outcome of X. A corresponding notion is that
of a critical value tp for t, which is associated with testing at level p: if t ≥ tp then H0 is rejected
at level p or 100p%. It follows that tp is defined as
P (T ≥ tp | H0 ) = p
Note that p is what we defined earlier as the size of the test, and the set {(x1, x2, ..., xn) : t ≥ tp}
is the level-p critical region of the test. The distribution of T under H0 is called the null
distribution of T.
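When H0 is simple, the null distribution is fully specified, so both p and tp can be computed directly from it. As an illustrative sketch (the standard-normal null is my own choice, not from the notes):

```python
# Sketch (assumed setup): T ~ N(0, 1) under a simple H0, with large values of T
# counting as evidence against H0. Both p and t_p come from the null distribution.
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_value(t):
    """p = P(T >= t | H0): the smallest test size at which t rejects H0."""
    return 1.0 - norm_cdf(t)

def critical_value(p, lo=-10.0, hi=10.0):
    """t_p solving P(T >= t_p | H0) = p, by bisection (p_value is decreasing in t)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_value(mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t_05 = critical_value(0.05)   # about 1.645: reject at level 0.05 iff t >= t_05
```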
2.1 How to choose the test statistic
In a parametric setting, there is an explicit form of the sampling distribution of the data with a
finite number of unknown parameters. In these cases the alternative hypothesis guides the choice
of the test statistic (usually through use of the likelihood function of the data). In non-parametric
settings, no particular forms are specified for the distributions and hence the appropriate choice
of T is less clear. However, the choice of T should always be based on some notion of what is of
concern in the case that H0 turns out to be false.
In all non-parametric problems, the null hypothesis H0 leaves some parameters unknown and
therefore does not completely specify F . In this case, the p-value is not well defined because
P (T ≥ t | F ) may depend upon which F satisfying H0 is taken.
2.1.1 Pivot Tests
When H0 concerns a particular parameter value, we can use the equivalence between hypothesis
tests and confidence intervals. This equivalence implies that if the value of θ0 is outside a 1 − α
confidence interval for θ, then θ differs from θ0 with p-value less than α. A specific form of test
based on this equivalence is a pivot test. Suppose that T is an estimator for a scalar θ, with
estimated variance V. Suppose also that the studentized version of T,

Z = (T − θ) / V^{1/2},

is a pivot (i.e., its distribution is the same for all relevant F, and in particular for all θ). For a
one-sided test of H0 : θ = θ0 versus H1 : θ > θ0, the p-value that corresponds to the observed
studentized test statistic z0 = (t − θ0)/v^{1/2} is

p = P{ (T − θ0)/V^{1/2} ≥ (t − θ0)/v^{1/2} | H0 }
However, under H0 we have θ = θ0, so (T − θ0)/V^{1/2} = Z, and since Z is a pivot,

P{ (T − θ0)/V^{1/2} ≥ (t − θ0)/v^{1/2} | H0 } = P{ Z ≥ (t − θ0)/v^{1/2} | H0 }
                                              = P{ Z ≥ (t − θ0)/v^{1/2} | F }
and therefore the p-value can be written as
p = P {Z ≥ z0 | F }
Note that this has a big advantage in the context of bootstrapping, because we do not have to
construct a special null-hypothesis sampling distribution.
2.2 Non-Parametric Bootstrap Tests
Testing hypotheses requires that probability calculations be done under the null hypothesis model.
This means that the usual bootstrap setting must be modified, since resampling from the empirical
CDF F̂ and applying the plug-in principle to obtain θ̂ = t(F̂) won't give us an estimator of θ
under the null hypothesis H0. In the hypothesis testing context, instead of resampling from the
empirical CDF F̂, we must resample from an empirical CDF F̂0 which satisfies the relevant null
hypothesis H0 (unless, as we mentioned above, we can construct a pivotal test statistic).
Once we have decided on the null resampling distribution F̂0, the basic bootstrap test will
compute the p-value as

pboot = P*{ T* ≥ t | F̂0 }

or will approximate it by

pboot = #{ t*b ≥ t } / B

using the results t*1, t*2, ..., t*B from B bootstrap samples.
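This approximation can be sketched in a few lines of Python (the function names are mine; `null_sampler` stands in for draws from F̂0):

```python
# Minimal sketch of the basic bootstrap p-value: draw B data sets from the null
# model Fhat_0, recompute the statistic on each, and report the fraction of
# bootstrap statistics at least as large as the observed t.
import random

def boot_pvalue(t_obs, null_sampler, stat, B=1000, seed=0):
    """Approximate pboot = #{t*_b >= t_obs} / B.

    null_sampler(rng) -> one resampled data set drawn under H0 (from Fhat_0)
    stat(data)        -> the test statistic T computed on a data set
    """
    rng = random.Random(seed)
    count = sum(1 for _ in range(B) if stat(null_sampler(rng)) >= t_obs)
    return count / B
```

Large values of T count as evidence against H0 here, matching the convention adopted above.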
Example 2.1 Difference in means. Suppose we want to compare two population means µ1 and µ2
using the test statistic t = x̄2 − x̄1. We will use the following sample data:
sample1: 82 79 81 79 77 79 79 78 79 82 76 73 64
sample2: 84 86 85 82 77 76 77 80 83 81 78 78 78
If the shapes of the underlying distributions are identical, then under H0 : µ1 = µ2 the two distributions are the same. In this case, it is sensible to choose for F̂0 the pooled empirical CDF of the
two samples. Applying this procedure with 1,000 bootstrap samples yielded 52 values of t* greater
than the observed value t = 80.38 − 77.54 = 2.84, which implies a p-value of 52/1000 = 0.052. So we
cannot reject the null at 5% (but we can at 5.2%!)
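A sketch of this pooled-resampling calculation (B and the seed are my own choices, so the exceedance count will not reproduce the reported 52 exactly):

```python
# Sketch of Example 2.1: pooled bootstrap test of H0: mu1 = mu2 with t = xbar2 - xbar1.
import random

sample1 = [82, 79, 81, 79, 77, 79, 79, 78, 79, 82, 76, 73, 64]
sample2 = [84, 86, 85, 82, 77, 76, 77, 80, 83, 81, 78, 78, 78]

def mean(x):
    return sum(x) / len(x)

t_obs = mean(sample2) - mean(sample1)   # observed t (approx. 2.85)

# Under H0 the two distributions are identical, so Fhat_0 is the pooled EDF:
# both bootstrap groups are resampled from the combined data.
pooled = sample1 + sample2
rng = random.Random(1)
B = 1000
count = 0
for _ in range(B):
    star1 = [rng.choice(pooled) for _ in sample1]
    star2 = [rng.choice(pooled) for _ in sample2]
    if mean(star2) - mean(star1) >= t_obs:
        count += 1
p_boot = count / B                      # typically close to the 0.05 reported above
```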
2.2.1 Studentized Bootstrap Method
For some problems, it is possible to obtain more stable significance tests by studentizing comparisons. Remember that, because of the relationship between confidence sets and hypothesis tests,
such a test can be obtained by calculating a 1 − p confidence set by the studentized bootstrap method
and concluding that the p-value is less than p if the null hypothesis parameter falls outside the
confidence set.
We can also implement this idea by bootstrapping the test statistic directly rather than constructing confidence intervals. In this case, the p-value can be obtained directly. Suppose that θ is
a scalar with estimator T and that we want to test H0 : θ = θ0 against H1 : θ > θ0 . The method
we mentioned in the section Pivot Tests applies when
Z = (T − θ) / V^{1/2}

is approximately a pivot (i.e., its distribution is approximately independent of unknown parameters). Then, with z0 = (t − θ0)/v^{1/2} being the observed studentized test statistic, the bootstrap
analog of

p = P{ Z ≥ z0 | F }

is

p = P{ Z* ≥ z0 | F̂ }

which we can approximate by bootstrapping without having to decide on a null empirical distribution F̂0.
Example 2.2 Let's continue the example of the difference in means. We were comparing two
population means µ1 and µ2 using the test statistic t = x̄2 − x̄1. Now, it would be reasonable to
suppose that the usual two-sample t-statistic
suppose that the usual two-sample t-statistic
Z = (X̄2 − X̄1 − (µ2 − µ1)) / (S2^2/n2 + S1^2/n1)^{1/2}
is approximately pivotal. We take F̂ to be the pair of empirical CDFs of the two samples, since
no assumptions are made connecting the two distributions. The observed value of the
test statistic under the null is
z0 = (x̄2 − x̄1) / (s2^2/n2 + s1^2/n1)^{1/2}
We also calculate B values of

z* = (x̄*2 − x̄*1 − (x̄2 − x̄1)) / (s*2^2/n2 + s*1^2/n1)^{1/2}

and approximate the p-value by #{ z*b ≥ z0 } / B.
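A sketch of this studentized bootstrap calculation (each sample is resampled from its own empirical CDF, which is what centering z* at x̄2 − x̄1 implies; B, the seed, and the helper names are my choices):

```python
# Sketch of Example 2.2: studentized bootstrap test. Each sample is resampled
# from its own EDF, and z* is centered at the observed difference xbar2 - xbar1.
import random

sample1 = [82, 79, 81, 79, 77, 79, 79, 78, 79, 82, 76, 73, 64]
sample2 = [84, 86, 85, 82, 77, 76, 77, 80, 83, 81, 78, 78, 78]

def mean(x):
    return sum(x) / len(x)

def svar(x):
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)  # sample variance

def z_stat(x1, x2, center):
    se = (svar(x2) / len(x2) + svar(x1) / len(x1)) ** 0.5
    return (mean(x2) - mean(x1) - center) / se

z0 = z_stat(sample1, sample2, 0.0)      # observed studentized statistic
diff = mean(sample2) - mean(sample1)

rng = random.Random(1)
B = 1000
count = 0
for _ in range(B):
    star1 = [rng.choice(sample1) for _ in sample1]
    star2 = [rng.choice(sample2) for _ in sample2]
    if z_stat(star1, star2, diff) >= z0:
        count += 1
p_boot = count / B                      # approximates P(Z* >= z0 | Fhat)
```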
3 Testing Linear Restrictions in OLS
Consider the problem of testing the following null hypothesis
H0 : Rβ = r
where the d × K matrix R is the matrix of restrictions (d being the number of restrictions) and r
is a d × 1 vector of constants. The alternative hypothesis is H1 : Rβ ≠ r. Using standard results
about multivariate normal distributions, we know that
T1 ≡ (Rβ̂ − r)^T [R(X^T X)^{-1} R^T]^{-1} (Rβ̂ − r) / σ² ∼ χ²_d

T2 ≡ (y − Xβ̂)^T (y − Xβ̂) / σ² ∼ χ²_{N−K}

T1 ⊥ T2
and hence we have a pivotal statistic given by

F ≡ (T1 / d) / (T2 / (N − K))
  = [ (Rβ̂ − r)^T [R(X^T X)^{-1} R^T]^{-1} (Rβ̂ − r) · (1/d) ] / [ (y − Xβ̂)^T (y − Xβ̂) · 1/(N − K) ]
  = (Rβ̂ − r)^T [R(X^T X)^{-1} R^T]^{-1} (Rβ̂ − r) / (d s²) ∼ F(d, N − K)

where s² ≡ (y − Xβ̂)^T (y − Xβ̂) / (N − K) is the usual unbiased estimator of σ² (note that σ²
cancels from the ratio).
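The F statistic just derived can be computed directly with matrix operations; a minimal sketch with simulated data (the design matrix, true β, and the restriction tested are all invented for illustration; only the formula comes from the text):

```python
# Sketch of the F test for H0: R beta = r in OLS, using the statistic above.
# The data-generating process and the tested restriction are made up.
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, 2.0])
y = X @ beta_true + rng.normal(size=N)

# OLS fit
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ (X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (N - K)          # s^2 = (y - Xb)'(y - Xb) / (N - K)

# One restriction (d = 1): H0: beta_1 = beta_2, i.e. R beta = r
R = np.array([[0.0, 1.0, -1.0]])
r = np.array([0.0])
d = R.shape[0]

Rb_r = R @ beta_hat - r
F = (Rb_r @ np.linalg.inv(R @ XtX_inv @ R.T) @ Rb_r) / (d * s2)
# Under H0 (true here by construction), F ~ F(d, N - K), so F should be moderate.
```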
References
• Davison, A. C. and D. V. Hinkley, 2006. "Bootstrap Methods and their Application". Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.