Political Science 236
Hypothesis Testing: Review and Bootstrapping
Rocío Titiunik
Fall 2007

1 Hypothesis Testing

Definition 1.1 Hypothesis. A hypothesis is a statement about a population parameter.

The goal of hypothesis testing is to decide, using a sample from the population, which of two complementary hypotheses is true. In general, the two complementary hypotheses are called the null hypothesis and the alternative hypothesis. If we let $\theta$ be a population parameter and $\Theta$ be the parameter space, we can define these complementary hypotheses as follows:

Definition 1.2 Let $\Theta_0$ and $\Theta_1 \equiv \Theta_0^c$ be a partition of the parameter space $\Theta$. Then the null and alternative hypotheses are defined as follows:
1. Null Hypothesis: $H_0 : \theta \in \Theta_0$
2. Alternative Hypothesis: $H_1 : \theta \in \Theta_1$

Definition 1.3 Testing Procedure. A testing procedure is a rule, based on the outcome of a random sample from the population under study, used to decide whether to reject $H_0$.

The subset of the sample space for which $H_0$ will be rejected is called the critical region (or the rejection region), and its complement is called the acceptance region. In general, a hypothesis test will be specified in terms of a test statistic $T(X_1, X_2, \ldots, X_N) \equiv T(X)$, which is a function of the sample. We can define the critical region formally as follows.

Definition 1.4 Critical Region. The subset $C_c \subset \mathbb{R}^N$ of the sample space for which $H_0$ is rejected is called the critical region and is defined by
\[ C_c = \{ x \in \mathbb{R}^N : T(x) > c \} \]
for some $c \in \mathbb{R}$. The value $c$ is called the critical value. The complement of $C_c$, $C_a \equiv C_c^c$, is called the acceptance region.

If we let $C_c^T$ be the critical region of the test statistic $T(X)$ (i.e. $C_c^T$ is defined by $C_c = \{ x \in \mathbb{R}^N : T(x) \in C_c^T \}$), a statistical test of $H_0$ against $H_1$ will generally be defined as:
1. $T(x) \in C_c^T \implies$ Reject $H_0$
2. $T(x) \notin C_c^T \implies$ Accept $H_0$

A hypothesis test of $H_0 : \theta \in \Theta_0$ against $H_1 : \theta \in \Theta_1$ can make one of two types of errors.

Definition 1.5 Type I and Type II Errors.
Let $H_0$ be a null hypothesis being tested for acceptance or rejection. The two types of errors that can be made are
1. Type I Error: rejecting $H_0$ when $\theta \in \Theta_0$ (i.e., when $H_0$ is true)
2. Type II Error: accepting $H_0$ when $\theta \in \Theta_1$ (i.e., when $H_0$ is false)

So a type I error is committed when the statistical test mistakenly rejects the null hypothesis, and a type II error is committed when the test mistakenly accepts the null hypothesis.

The ideal test is one where the hypothesis would always be correctly identified as being either true or false. For such an ideal test to exist, we must partition the range of potential sample outcomes in such a way that outcomes in the critical region $C_c$ would occur if and only if $H_0$ were false and outcomes in the acceptance region $C_a$ would occur if and only if $H_0$ were true. In general, ideal tests cannot be constructed.

For $\theta \in \Theta_0$, the test will make a mistake if $x \in C_c$, and therefore the probability of a type I error is $P_\theta(X \in C_c)$. For $\theta \in \Theta_1$, the test will make a mistake if $x \in C_a$, and therefore the probability of a type II error is $P_\theta(X \in C_a)$. Note that $P_\theta(X \in C_c) = 1 - P_\theta(X \in C_a)$.

We will now define the power function of a test. The power function completely summarizes all of the operating characteristics of a statistical test with respect to probabilities of making correct and incorrect decisions about $H_0$. The power function is defined below.

Definition 1.6 Let $H_0$ be defined as $H_0 : \theta \in \Theta_0$ and $H_1$ be defined as $H_1 : \theta \in \Theta_1$. Let the critical region $C_c$ define a test of $H_0$. Then the power function of the statistical test is the function of $\theta$ defined by
\[ \beta(\theta) \equiv P_\theta(X \in C_c) = \begin{cases} \text{probability of Type I error} & \text{if } \theta \in \Theta_0 \\ \text{one minus probability of Type II error} & \text{if } \theta \in \Theta_1 \end{cases} \]

In words, the power function indicates the probability of rejecting $H_0$ for every value of $\theta \in \Theta$.
The value of the power function at a particular point $\theta_p \in \Theta$ of the parameter space is called the power of the test at $\theta_p$ and represents the probability of rejecting $H_0$ if $\theta_p$ were the true value of the parameter vector. The ideal power function is 0 for all $\theta \in \Theta_0$ and 1 for all $\theta \in \Theta_1$. In general, this ideal cannot be attained and we say that a good test has power function near 0 for all $\theta \in \Theta_0$ and near 1 for all $\theta \in \Theta_1$. When comparing two tests for a given $H_0$, a test is better if it has lower power for $\theta \in \Theta_0$ and higher power for $\theta \in \Theta_1$, which implies that the test has lower probabilities of both type I and type II error.

We now define the size and level of a test:

Definition 1.7 Size. For $0 \leq \alpha \leq 1$, a test with power function $\beta(\theta)$ is a size-$\alpha$ test if
\[ \sup_{\theta \in \Theta_0} \beta(\theta) = \alpha \]

Definition 1.8 Level. For $0 \leq \alpha \leq 1$, a test with power function $\beta(\theta)$ is a level-$\alpha$ test if
\[ \sup_{\theta \in \Theta_0} \beta(\theta) \leq \alpha \]

In words, the size of the test is the maximum probability of Type I error associated with a given test rule. The lower the size of the test, the lower the maximum probability of mistakenly rejecting $H_0$. The level of a test is an upper bound to the type I error probability of a statistical test. The key difference between these two concepts is that the size represents the maximum value of $\beta(\theta)$ for $\theta \in \Theta_0$ (i.e. the maximum type I error), while the level is only a bound that might not equal $\beta(\theta)$ for any $\theta \in \Theta_0$ nor equal the supremum of $\beta(\theta)$ for $\theta \in \Theta_0$. Thus, the set of level-$\alpha$ tests contains the set of size-$\alpha$ tests. In other words, a test of $H_0$ having size $\gamma$ is an $\alpha$-level test for any $\alpha \geq \gamma$.

In applications, when we say that $H_0$ is (not) rejected at the $\alpha$-significance level, we often mean that $\alpha$ was the bound on the level of protection against type I error that was used when constructing the test. A more accurate statement regarding the level of protection against type I error is that $H_0$ is (not) rejected using a size-$\alpha$ test.
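To make the definitions of this section concrete, the following sketch evaluates the power function of a one-sided z-test. The setup is an assumption made only for illustration and does not come from these notes: $X_1, \ldots, X_n$ are i.i.d. $N(\theta, \sigma^2)$ with $\sigma$ known, $H_0 : \theta \leq \theta_0$ is tested against $H_1 : \theta > \theta_0$, and $H_0$ is rejected when $T = \sqrt{n}(\bar{x} - \theta_0)/\sigma$ exceeds $c$. The names `power`, `theta0`, `sigma`, `n` and `alpha` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def power(theta, theta0=0.0, sigma=1.0, n=25, alpha=0.05):
    """Power function beta(theta) = P_theta(T > c) of the assumed one-sided z-test."""
    c = STD_NORMAL.inv_cdf(1.0 - alpha)         # critical value c
    shift = sqrt(n) * (theta - theta0) / sigma  # mean of T when the true parameter is theta
    return 1.0 - STD_NORMAL.cdf(c - shift)      # T ~ N(shift, 1) under the assumed model


# Size: the largest value of beta(theta) over a grid in Theta_0 = {theta <= theta0}.
null_grid = [-i / 100 for i in range(101)]      # 0.00, -0.01, ..., -1.00
size = max(power(theta) for theta in null_grid)

# Power at two hypothetical alternatives in Theta_1.
power_near = power(0.2)
power_far = power(0.5)

print(f"size = {size:.4f}")
print(f"power at 0.2 = {power_near:.4f}, power at 0.5 = {power_far:.4f}")
```

Under these assumptions $\beta(\theta)$ is increasing in $\theta$, so the supremum over $\Theta_0$ is attained at the boundary $\theta_0$ and the test has size $\alpha$. It is therefore also a level-$\alpha'$ test for any $\alpha' \geq \alpha$.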
2 Bootstrapping Hypothesis Tests

The simplest situation involves a simple null hypothesis $H_0$ that completely specifies the probability distribution of the data. Thus, if we have a sample $x_1, x_2, \ldots, x_n$ from a population with CDF $F$, then $H_0$ specifies that $F = F_0$, where $F_0$ contains no unknown parameters. A statistical test is based on a test statistic $T$ which measures the discrepancy between the data and the null hypothesis. We will follow the convention that large values of $T$ are evidence against $H_0$. If the null hypothesis is simple and the observed value of the test statistic is denoted by $t$, then the level of evidence against $H_0$ is measured by the significance probability
\[ p = P(T \geq t \mid H_0) \]
which is referred to as the p-value. The p-value is effectively the marginal size at which a given hypothesis would be rejected based on the observed outcome of $X$.

A corresponding notion is that of a critical value $t_p$ for $t$, which is associated with testing at level $p$: if $t \geq t_p$ then $H_0$ is rejected at level $p$, or $100p\%$. It follows that $t_p$ is defined by
\[ P(T \geq t_p \mid H_0) = p \]
Note that $p$ is what we defined earlier as the size of the test, and the set $\{ (x_1, x_2, \ldots, x_n) : t \geq t_p \}$ is the level-$p$ critical region of the test. The distribution of $T$ under $H_0$ is called the null distribution of $T$.

2.1 How to choose the test-statistic

In a parametric setting, there is an explicit form of the sampling distribution of the data with a finite number of unknown parameters. In these cases the alternative hypothesis guides the choice of the test statistic (usually through use of the likelihood function of the data). In non-parametric settings, no particular forms are specified for the distributions and hence the appropriate choice of $T$ is less clear. However, the choice of $T$ should always be based on some notion of what is of concern in the case that $H_0$ turns out to be false.
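When the null hypothesis is simple, as at the start of this section, the null distribution of $T$ is fully known and the significance probability $p = P(T \geq t \mid H_0)$ can be approximated by simulating from $F_0$. The following is a minimal sketch, assuming for illustration that $F_0$ is the standard normal distribution and that $T = \sqrt{n}\,\bar{x}$ (neither choice comes from these notes). The data and the names `mc_p_value`, `n_sim` and `seed` are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def mc_p_value(sample, n_sim=20000, seed=0):
    """Approximate P(T >= t | H0) by simulation when H0 says F = F0 = N(0, 1) (assumed)."""
    rng = random.Random(seed)
    n = len(sample)
    t_obs = sqrt(n) * fmean(sample)                    # observed value t of the statistic
    exceed = 0
    for _ in range(n_sim):
        sim = [rng.gauss(0.0, 1.0) for _ in range(n)]  # one sample drawn from F0
        if sqrt(n) * fmean(sim) >= t_obs:              # large values are evidence against H0
            exceed += 1
    return t_obs, exceed / n_sim


# Hypothetical observed sample, used only to exercise the function.
observed = [0.8, -0.3, 1.1, 0.4, 0.9, -0.2, 0.6, 1.3, 0.1, 0.5]
t_obs, p_mc = mc_p_value(observed)

# For this particular T, the null distribution is exactly N(0, 1), which allows a check.
p_exact = 1.0 - NormalDist().cdf(t_obs)
print(f"t = {t_obs:.3f}, simulated p = {p_mc:.4f}, exact p = {p_exact:.4f}")
```

Because this particular $T$ is standard normal under the assumed $F_0$, the simulated value can be compared with the exact tail probability $1 - \Phi(t)$. No such check is available once $H_0$ leaves part of $F$ unspecified.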
In all non-parametric problems, the null hypothesis $H_0$ leaves some parameters unknown and therefore does not completely specify $F$. In this case, the p-value is not well defined because $P(T \geq t \mid F)$ may depend upon which $F$ satisfying $H_0$ is taken.

2.1.1 Pivot Tests

When $H_0$ concerns a particular parameter value, we can use the equivalence between hypothesis tests and confidence intervals. This equivalence implies that if the value $\theta_0$ is outside a $1 - \alpha$ confidence interval for $\theta$, then $\theta$ differs from $\theta_0$ with p-value less than $\alpha$. A specific form of test based on this equivalence is a pivot test.

Suppose that $T$ is an estimator for a scalar $\theta$, with estimated variance $V$. Suppose also that the studentized version of $T$,
\[ Z = \frac{T - \theta}{V^{1/2}}, \]
is a pivot (i.e. its distribution is the same for all relevant $F$, and in particular for all $\theta$). For a one-sided test of $H_0 : \theta = \theta_0$ versus $H_1 : \theta > \theta_0$, the p-value that corresponds to the observed studentized test statistic $z_0 = (t - \theta_0)/v^{1/2}$ is
\[ p = P\left\{ \frac{T - \theta_0}{V^{1/2}} \geq \frac{t - \theta_0}{v^{1/2}} \;\Big|\; H_0 \right\} \]
However, since $Z$ is a pivot we have
\[ P\left\{ \frac{T - \theta_0}{V^{1/2}} \geq \frac{t - \theta_0}{v^{1/2}} \;\Big|\; H_0 \right\} = P\left\{ Z \geq \frac{t - \theta_0}{v^{1/2}} \;\Big|\; H_0 \right\} = P\left\{ Z \geq \frac{t - \theta_0}{v^{1/2}} \;\Big|\; F \right\} \]
and therefore the p-value can be written as
\[ p = P\{ Z \geq z_0 \mid F \} \]
Note that this has a big advantage in the context of bootstrapping, because we do not have to construct a special null-hypothesis sampling distribution.

2.2 Non-Parametric Bootstrap Tests

Testing hypotheses requires that probability calculations be done under the null hypothesis model. This means that the usual bootstrap setting must be modified, since resampling from the empirical CDF $\hat{F}$ and applying the plug-in principle to obtain $\hat{\theta} = t(\hat{F})$ won't give us an estimator of $\theta$ under the null hypothesis $H_0$. In the hypothesis testing context, instead of resampling from the empirical CDF $\hat{F}$, we must resample from an empirical CDF $\hat{F}_0$ which satisfies the relevant null hypothesis $H_0$ (unless, as we mentioned above, we can construct a pivotal test statistic).
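The need to resample from a distribution that satisfies $H_0$ can be sketched for a comparison of two means. The sketch assumes that the two distributions have the same shape, so that the pooled empirical CDF is a sensible $\hat{F}_0$ under $H_0 : \mu_1 = \mu_2$, and it approximates the p-value by the proportion of resampled statistics $t^*$ that are at least as large as the observed $t$. The data are the two samples of the difference-in-means example in this section. The function name, the number of resamples and the seed are illustrative choices, so the resulting p-value will not reproduce any reported figure exactly.

```python
import random
from statistics import fmean

# The two samples from the difference-in-means example.
sample1 = [82, 79, 81, 79, 77, 79, 79, 78, 79, 82, 76, 73, 64]
sample2 = [84, 86, 85, 82, 77, 76, 77, 80, 83, 81, 78, 78, 78]


def pooled_bootstrap_p_value(x1, x2, n_boot=10000, seed=0):
    """One-sided bootstrap p-value for H0: mu1 = mu2, resampling from the pooled data."""
    rng = random.Random(seed)
    pooled = list(x1) + list(x2)      # pooled sample defines the null distribution F0-hat
    t_obs = fmean(x2) - fmean(x1)     # observed statistic: second mean minus first mean
    exceed = 0
    for _ in range(n_boot):
        # Draw both bootstrap samples, with replacement, from the pooled data.
        star1 = rng.choices(pooled, k=len(x1))
        star2 = rng.choices(pooled, k=len(x2))
        if fmean(star2) - fmean(star1) >= t_obs:
            exceed += 1
    return t_obs, exceed / n_boot     # p_boot = #{t* >= t} / B


t_obs, p_boot = pooled_bootstrap_p_value(sample1, sample2)
print(f"observed t = {t_obs:.3f}, bootstrap p-value = {p_boot:.4f}")
```

Drawing both bootstrap samples from the same pooled list is what imposes the null hypothesis. Resampling each sample from its own data would instead reproduce the observed difference on average.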
Once we have decided on the null resampling distribution $\hat{F}_0$, the basic bootstrap test will compute the p-value as
\[ p_{boot} = P^*\{ T^* \geq t \mid \hat{F}_0 \} \]
or will approximate it by
\[ p_{boot} = \frac{\#\{ t_b^* \geq t \}}{B} \]
using the results $t_1^*, t_2^*, \ldots, t_B^*$ from $B$ bootstrap samples.

Example 2.1 Difference in means. Suppose we want to compare two population means $\mu_1$ and $\mu_2$ using the test statistic $t = \bar{x}_2 - \bar{x}_1$. We will use the following sample data:

sample1: 82 79 81 79 77 79 79 78 79 82 76 73 64
sample2: 84 86 85 82 77 76 77 80 83 81 78 78 78

If the shapes of the underlying distributions are identical, then under $H_0 : \mu_1 = \mu_2$ the two distributions are the same. In this case, it is sensible to choose for $\hat{F}_0$ the pooled empirical CDF of the two samples. Applying this procedure with 1,000 bootstrap samples yielded 52 values of $t^*$ greater than the observed value $t = 80.38 - 77.54 \approx 2.85$, which implies a p-value of $52/1000 = 0.052$. So we cannot reject the null at 5% (but we can at 5.2%!).

2.2.1 Studentized Bootstrap Method

For some problems, it is possible to obtain more stable significance tests by studentizing comparisons. Remember that, because of the relationship between confidence sets and hypothesis tests, such a test can be obtained by calculating a $1 - p$ confidence set by the studentized bootstrap method and concluding that the p-value is less than $p$ if the null hypothesis parameter value falls outside the confidence set.

We can also implement this idea by bootstrapping the test statistic directly rather than constructing confidence intervals. In this case, the p-value can be obtained directly. Suppose that $\theta$ is a scalar with estimator $T$ and that we want to test $H_0 : \theta = \theta_0$ against $H_1 : \theta > \theta_0$. The method we mentioned in the section Pivot Tests applies when
\[ Z = \frac{T - \theta}{V^{1/2}} \]
is approximately a pivot (i.e. its distribution is approximately independent of unknown parameters).
Then, with $z_0 = (t - \theta_0)/v^{1/2}$ being the observed studentized test statistic, the bootstrap analog of $p = P\{ Z \geq z_0 \mid F \}$ is
\[ p = P\{ Z^* \geq z_0 \mid \hat{F} \} \]
which we can approximate by bootstrapping without having to decide on a null empirical distribution $\hat{F}_0$.

Example 2.2 Let's continue the example of the difference in means. We were comparing two population means $\mu_1$ and $\mu_2$ using the test statistic $t = \bar{x}_2 - \bar{x}_1$. Now, it would be reasonable to suppose that the usual two-sample t-statistic
\[ Z = \frac{\bar{X}_2 - \bar{X}_1 - (\mu_2 - \mu_1)}{\left( S_2^2/n_2 + S_1^2/n_1 \right)^{1/2}} \]
is approximately pivotal. We take $\hat{F}$ to be the pair of empirical CDFs of the two samples, each sample being resampled from its own empirical CDF, provided that no assumptions are made connecting the two distributions. The observed value of the test statistic under the null is
\[ z_0 = \frac{\bar{x}_2 - \bar{x}_1}{\left( s_2^2/n_2 + s_1^2/n_1 \right)^{1/2}} \]
We also calculate $B$ values of
\[ z^* = \frac{\bar{x}_2^* - \bar{x}_1^* - (\bar{x}_2 - \bar{x}_1)}{\left( s_2^{*2}/n_2 + s_1^{*2}/n_1 \right)^{1/2}} \]
and approximate the p-value by the proportion of the $z^*$ values that are greater than or equal to $z_0$.

3 Testing Linear Restrictions in OLS

Consider the problem of testing the null hypothesis $H_0 : R\beta = r$, where the $d \times K$ matrix $R$ is a matrix of restrictions (where $d$ is the number of restrictions) and $r$ is a $d \times 1$ vector of constants. The alternative hypothesis is $H_1 : R\beta \neq r$. Using standard results from multivariate normal distributions, we know that, under $H_0$,
\[ T_1 \equiv \frac{(R\hat{\beta} - r)^T \left[ R (X^T X)^{-1} R^T \right]^{-1} (R\hat{\beta} - r)}{\sigma^2} \sim \chi^2_d \]
\[ T_2 \equiv \frac{(y - X\hat{\beta})^T (y - X\hat{\beta})}{\sigma^2} \sim \chi^2_{N-K} \]
\[ T_1 \perp T_2 \]
and hence we have a pivotal statistic given by
\[ F \equiv \frac{T_1 / d}{T_2 / (N-K)} = \frac{(R\hat{\beta} - r)^T \left[ R (X^T X)^{-1} R^T \right]^{-1} (R\hat{\beta} - r) \cdot \frac{1}{d}}{(y - X\hat{\beta})^T (y - X\hat{\beta}) \cdot \frac{1}{N-K}} = \frac{(R\hat{\beta} - r)^T \left[ R (X^T X)^{-1} R^T \right]^{-1} (R\hat{\beta} - r)}{d s^2} \sim F_{d, N-K} \]
where $s^2 \equiv (y - X\hat{\beta})^T (y - X\hat{\beta}) / (N-K)$.

References
• Davison, A. C. and D. V. Hinkley, 2006. “Bootstrap Methods and their Application”. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.