Transcript

IE241: Introduction to Hypothesis Testing

Topics
  Hypothesis testing
  Light bulb example
  Null and alternative hypotheses
  Two types of error
  Decision rule
    test statistic
    critical region
    power of the test
  Simple hypothesis testing
    Neyman-Pearson lemma
    example
  Composite hypothesis testing
    example
    likelihood ratio test
      relationship to mean
  Examples of 1-sided composite hypotheses
    drug to help sleep
    civil service exam
    difference between two proportions
      effect of size of n
    railroad ties
    fertilizer to improve yield of corn
    test of two variances
      F distribution
  Tests of correlated means
  Bayes' likelihood ratio test
    example
  Chi-square tests
    goodness of fit
    independence in contingency tables
    testing sample vs hypothesized variance
  Significance testing

We said before that estimation of parameters was one of the two major areas of statistics. Now let's turn to the second major area of statistics, hypothesis testing. A test of a statistical hypothesis is a procedure for deciding whether or not to reject the hypothesis. What is a statistical hypothesis? A statistical hypothesis is an assumption about f(x) if X is continuous or p(x) if X is discrete. Let's look at an example. A buyer of light bulbs bought 50 bulbs of each of two brands. When he tested them, Brand A had an average life of 1208 hours with a standard deviation of 94 hours. Brand B had a mean life of 1282 hours with a standard deviation of 80 hours.
Are brands A and B really different in quality? We set up two hypotheses. The first, called the null hypothesis Ho, is the hypothesis of no difference:

Ho: μA = μB

The second, called the alternative hypothesis Ha, is the hypothesis that there is a difference:

Ha: μA ≠ μB

On the basis of the sample of 50 from each of the two populations of light bulbs, we shall either reject or not reject the hypothesis of no difference. In statistics, we always test the null hypothesis. The alternative hypothesis wins by default if the null hypothesis is rejected. We never really accept the null hypothesis; we simply fail to reject it on the basis of the evidence in hand. Now we need a procedure to test the null hypothesis. A test of a statistical hypothesis is a procedure for deciding whether or not to reject the null hypothesis. There are two possible decisions, reject or do not reject. This means there are also two kinds of error we could make. The two types of error are shown in the table below.

                          True state
  Decision             Ho true              Ho false
  Reject Ho            Type 1 error (α)     Correct decision
  Do not reject Ho     Correct decision     Type 2 error (β)

If we reject Ho when Ho is in fact true, then we make a type 1 error. The probability of a type 1 error is α. If we do not reject Ho when Ho is really false, then we make a type 2 error. The probability of a type 2 error is β. Now we need a decision rule that will make the probability of the two types of error very small. The problem is that no rule can make both of them small simultaneously. The one type of error the experimenter has under his control is the α error: he can choose the size of α. Because in science we have to take the conservative route and never claim that we have found a new result unless we are really convinced that it is true, we choose a very small α, the probability of type 1 error. Then among all possible decision rules with that α, we choose the one that makes β as small as possible.
The decision rule consists of a test statistic and a critical region where the test statistic may fall. For means from a normal population, the test statistic is

t = (X̄A − X̄B) / s_diff,  where  s_diff = √(s²A/nA + s²B/nB)

is the standard deviation of the difference between two independent means. The critical region is a tail of the distribution of the test statistic. If the test statistic falls in the critical region, Ho is rejected. Now, how much of the tail should be in the critical region? That depends on just how small you want α to be. The usual choice is α = .05, but in some very critical cases, α is set at .01. Here we have just a non-critical choice of light bulbs, so we'll choose α = .05. This means that the critical region has probability .025 in each tail of the t distribution. For a t distribution with .025 in each tail, the critical value of t is 1.96, the same as z, because the sample size is greater than 30. The critical region then is |t| > 1.96. In our light bulb example, the test statistic is

t = (1282 − 1208) / √(80²/50 + 94²/50) = 74 / 17.5 ≈ 4.23

Now 4.23 is much greater than 1.96, so we reject the null hypothesis of no difference and declare that the average life of the B bulbs is longer than that of the A bulbs. Because α = .05, we have 95% confidence in the decision we made. We cannot say that there is a 95% probability that we are right, because we are either right or wrong and we don't know which. But there is such a small probability that t will land in the critical region if Ho is true that if it does get there, we choose to believe that Ho is not true. If we had chosen α = .01, the critical value of t would be 2.58, and because 4.23 is greater than 2.58, we would still reject Ho, this time with 99% confidence. How do we know that the test we used is the best test possible? We have controlled the probability of Type 1 error. But what is the probability of Type 2 error in this test? Does this test minimize it subject to the chosen value of α?
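The arithmetic above can be checked with a short Python sketch (not part of the original slides); it reproduces the standard error and the test statistic from the summary numbers given for the two brands.

```python
import math

# Two-sample test for the light bulb example (large samples, so t is
# effectively z here).
n_a, mean_a, sd_a = 50, 1208.0, 94.0
n_b, mean_b, sd_b = 50, 1282.0, 80.0

# Standard deviation of the difference between two independent means
s_diff = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
t_stat = (mean_b - mean_a) / s_diff

print(round(s_diff, 1))  # 17.5
print(round(t_stat, 2))  # 4.24 (the slides round to 4.23)
```

Since 4.24 falls far beyond the critical value 1.96, the rejection of Ho is clear either way.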
To answer this question, we need to consider the concept of test power. The power of a statistical test is the probability of rejecting Ho when Ho is really false. Thus power = 1 − β. Clearly, if the test maximizes power, it minimizes the probability of Type 2 error, β. If a test maximizes power for given α, it is called an admissible testing strategy. Before going further, we need to distinguish between two types of hypotheses. A simple hypothesis is one where the value of the parameter under Ho is a specified constant and the value of the parameter under Ha is a different specified constant. For example, if you test Ho: μ = 0 vs Ha: μ = 10, then you have a simple hypothesis test: a particular value under Ho and a different particular value under Ha. For testing one simple hypothesis Ha against the simple hypothesis Ho, a ground-breaking result called the Neyman-Pearson lemma provides the most powerful test. Define the likelihood ratio

λ = L(θ̂a) / L(θ̂o)

with the Ha parameter likelihood in the numerator and the Ho parameter likelihood in the denominator. Clearly, any value of λ > 1 would favor the alternative hypothesis, while values less than 1 would favor the null hypothesis. Basically, the lemma says that if there exists a critical region A of size α and a constant k such that

La/Lo = ∏(i=1..n) f(xi; θa) / ∏(i=1..n) f(xi; θo) ≥ k inside A

and

La/Lo = ∏(i=1..n) f(xi; θa) / ∏(i=1..n) f(xi; θo) ≤ k outside A

then A is a best (most powerful) critical region of size α. Consider the following example of a test of two simple hypotheses. A coin is either fair or has p(H) = 2/3. Under Ho, P(H) = 1/2, and under Ha, P(H) = 2/3. The coin will be tossed 3 times and a decision will be made between the two hypotheses. Thus X = number of heads = 0, 1, 2, or 3. Now let's look at how the decision will be made. First, let's look at the probability of Type 1 error, α. In the table below, Ho ⇒ P(H) = 1/2 and Ha ⇒ P(H) = 2/3.
X    P(X|Ho)   P(X|Ha)
0    1/8       1/27
1    3/8       6/27
2    3/8       12/27
3    1/8       8/27

Now what should the critical region be? Under Ho, if X = 0 is the critical region, α = 1/8. Under Ho, if X = 3 is the critical region, α = 1/8. So if either of these two values is chosen as the critical region, the probability of Type 1 error would be the same. Now what if Ha is true? If X = 0 is chosen as the critical region, the value of β = 26/27, because that is the probability that X ≠ 0. On the other hand, if X = 3 is chosen as the critical region, the value of β = 19/27, because that is the probability that X ≠ 3. Clearly, the better choice for the critical region is X = 3, because that is the region that minimizes β for fixed α. So this critical region provides the more powerful test. In discrete variable problems like this, it may not be possible to choose a critical region of the desired α. In this illustration, you simply cannot find a critical region where α = .05 or .01. This is seldom a problem in real-life experimentation, because n is usually sufficiently large that there is a wide variety of choices for critical regions. This problem illustrating the general method for selecting the best test was easy to discuss because there was only a single alternative to Ho. Most problems involve more than a single alternative. Such hypotheses are called composite hypotheses. Examples of composite hypotheses: Ho: μ = 0 vs Ha: μ ≠ 0, which is a two-sided Ha. A one-sided Ha can be written as Ho: μ = 0 vs Ha: μ > 0, or Ho: μ = 0 vs Ha: μ < 0. All of these hypotheses are composite because they include more than one value for Ha. And unfortunately, the size of β here depends on the particular alternative value of μ being considered. In the composite case, it is necessary to compare Type 2 errors for all possible alternative values under Ha. So now the size of the Type 2 error is a function of the alternative parameter value θ: β(θ) is the probability that the sample point will fall in the noncritical region when θ is the true value of the parameter.
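The α and β values in the coin example can be reproduced exactly with a small Python sketch (an illustration added here, not from the slides), using exact fractions to match the table.

```python
from fractions import Fraction
from math import comb

# Neyman-Pearson coin example: Ho: p = 1/2 vs Ha: p = 2/3, n = 3 tosses.
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

p0, pa, n = Fraction(1, 2), Fraction(2, 3), 3

# Critical region {X = 3}: alpha = P(X = 3 | Ho), beta = P(X != 3 | Ha)
alpha = binom_pmf(3, n, p0)
beta = 1 - binom_pmf(3, n, pa)
print(alpha, beta)  # 1/8 19/27

# The rival region {X = 0} has the same alpha but much larger beta (26/27),
# so {X = 3} is the more powerful choice.
```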
Because it is more convenient to work with the critical region, the power function 1 − β(θ) is usually used. The power function is the probability that the sample point will fall in the critical region when θ is the true value of the parameter. As an illustration of these points, consider the following continuous example. Let X = the time that elapses between two successive trippings of a Geiger counter in studying cosmic radiation. The density function is f(x; θ) = θe^(−θx), where θ is a parameter which depends on experimental conditions. Under Ho, θ = 2. Now a physicist believes that θ < 2, so under Ha, θ < 2. One choice for the critical region is the right tail of the distribution, X ≥ 1, for which

α = ∫(1 to ∞) 2e^(−2x) dx = e^(−2) ≈ .135

Another choice is the left tail, X ≤ .07, for which α is essentially the same:

α = ∫(0 to .07) 2e^(−2x) dx = 1 − e^(−.14) ≈ .13

Now let's examine the power for the two competing critical regions. For the right-tail critical region X ≥ 1,

1 − β(θ) = ∫(1 to ∞) θe^(−θx) dx = e^(−θ)

and for the left-tail critical region X ≤ .07,

1 − β(θ) = ∫(0 to .07) θe^(−θx) dx = 1 − e^(−.07θ)

The graphs of these two functions are called the power curves for the two critical regions. [Figure: power functions for the two critical regions, X ≥ 1 and X ≤ .07, plotted against θ from 0 to 4.] Note that the power function for the X ≥ 1 region is always higher than the power function for the X ≤ .07 region before they cross near θ = 2. Since the alternative θ values in the problem are all θ < 2, clearly the right-tail critical region X ≥ 1 is more powerful than the left-tail region. What we just saw was a 1-sided composite alternative hypothesis test. Unfortunately, with two-sided composite alternative hypotheses, there is no best test that covers all alternative values: one tail is best for alternatives on one side of θo, and the other tail is best for alternatives on the other side.
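The two power functions derived above are simple closed forms, so the comparison in the figure can be checked numerically with a short sketch (added here for illustration).

```python
import math

# Power functions for the two candidate critical regions in the
# Geiger-counter example, where f(x; theta) = theta * exp(-theta * x).
def power_right_tail(theta):   # critical region X >= 1
    return math.exp(-theta)

def power_left_tail(theta):    # critical region X <= 0.07
    return 1 - math.exp(-0.07 * theta)

# Under the alternatives of interest (theta < 2) the right-tail region
# is uniformly more powerful:
for theta in [0.5, 1.0, 1.5]:
    print(theta, round(power_right_tail(theta), 3), round(power_left_tail(theta), 3))
```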
This shows that best critical regions exist only if the alternative hypothesis is suitably restricted. So for composite hypotheses, a new principle needs to be introduced to find a good test. This principle is called the likelihood ratio test:

λ = L(θ̂o) / L(θ̂)

where the denominator is the maximum of the likelihood function with respect to all the parameters, and the numerator is the maximum of the likelihood function after some or all of the parameters have been restricted by Ho. Consequently, the numerator can never exceed the denominator, so λ can assume values only between 0 and 1. A value of λ close to 1 lends support to Ho, because then it is clear that allowing the parameters to assume values other than those possible under Ho would not increase the likelihood of the sample values very much, if at all. If, however, λ is close to 0, then the probability of the sample values of X is very low under Ho, and Ho is therefore not supported by the data. Because increasing values of λ correspond to increasing degrees of belief in Ho, λ may serve as a statistic for testing Ho, with small values leading to rejection of Ho. Now the MLEs are functions of the values of the random variable X, so λ is also a function of these values of X and is therefore an observable random variable. λ is often related to a statistic whose distribution is known, so it is not necessary to find the distribution of λ itself. Suppose we have a normal population with σ = 1 and we are interested in testing whether the mean = μo. That is, the density is

f(x; μ) = (1/√(2π)) e^(−(1/2)(x − μ)²)

Let's see how we would construct a likelihood ratio test.
In this case,

L(μ) = (2π)^(−n/2) e^(−(1/2) Σ(xi − μ)²)

Since maximizing L(μ) is equivalent to maximizing log L(μ), where

log L(μ) = −(n/2) log(2π) − (1/2) Σ(xi − μ)²

we get μ̂ = X̄, and therefore

L(μ̂) = (2π)^(−n/2) e^(−(1/2) Σ(xi − X̄)²)

Under Ho, there are no parameters left to be estimated, so

L(μo) = (2π)^(−n/2) e^(−(1/2) Σ(xi − μo)²)

and λ then is

λ = e^(−(1/2)[Σ(xi − μo)² − Σ(xi − X̄)²]) = e^(−(n/2)(X̄ − μo)²)

This expression shows a relationship between λ and X̄ such that for each value of λ there are two critical values of X̄, symmetrical with respect to X̄ = μo. So the 5% critical region for λ corresponds to the two 2.5% tails of the normal distribution of X̄, given by

|X̄ − μo| / (1/√n) > 1.96

Thus the likelihood ratio test is identical to the familiar two-tailed test of the mean and serves as a compromise test when no best test is available. It is because of the concept of power that we simply fail to reject the null hypothesis, rather than accept it, when the test value does not fall into the rejection region. The reason is that if we had a more powerful test, we might have been able to reject Ho. Now let's look at some examples. As an example of a one-sided composite hypothesis test, suppose a new drug is available which claims to produce additional sleep. The drug is tested on 10 patients with the results shown.

Patient:       1    2    3    4    5    6    7    8    9    10
Hours gained:  0.7 -1.1 -0.2  1.2  0.1  3.4  3.7  0.8  1.8  2.0

We are testing the hypothesis Ho: μ = 0 vs Ha: μ > 0. The mean hours gained is 1.24 and s = 1.45, so the t statistic is

t = (1.24 − 0) / (1.45/√10) ≈ 2.7

which has 9 df. For df = 9 and α = .05 in this 1-tailed test, the required t = 1.833. Since our obtained t is greater than the required t, we can, with 95% confidence, reject Ho. So in this case, even with only 10 patients, we can endorse the drug for obtaining longer sleep. Now let's take a second example. A civil service exam is given to a group of 200 candidates. Based on their total scores, the 200 candidates are divided into two groups, the top 30% and the bottom 70%. Now consider the first question in the examination.
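The sleep-drug test is easy to verify with a short sketch (added for illustration). Note one small wrinkle: the transcript's s = 1.45 appears to use the n divisor; with the usual n − 1 divisor, s ≈ 1.53 and t ≈ 2.57, but the conclusion is the same either way.

```python
import math

# One-sided, one-sample t test for the sleep-drug example.
# Ho: mu = 0 vs Ha: mu > 0.
gains = [0.7, -1.1, -0.2, 1.2, 0.1, 3.4, 3.7, 0.8, 1.8, 2.0]
n = len(gains)
mean = sum(gains) / n

# Sample standard deviation with the n - 1 divisor
s = math.sqrt(sum((x - mean) ** 2 for x in gains) / (n - 1))
t_stat = mean / (s / math.sqrt(n))
print(round(mean, 2), round(s, 2), round(t_stat, 2))  # 1.24 1.53 2.57
```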
In the upper 30% group, 40 had the right answer. In the lower 70% group, 80 had the right answer. Is the question a good discriminator between the top scorers and the lower scorers? To answer this question, we first set up the two hypotheses. In this case, the null hypothesis is Ho: pu = pl, and the alternative is Ha: pu > pl, because we would expect the upper group to do better than the lower group on all questions. In binomial situations, we must deal with proportions instead of counts unless the two sample sizes are the same. The proportion of successes p = x/n may be assumed to be normally distributed with mean p and variance pq/n if n is large. Then the difference between two sample proportions may also be approximately normally distributed if n is large. In this situation,

μ(p1−p2) = p1 − p2  and  σ²(p1−p2) = p1q1/n1 + p2q2/n2

Just as for the binomial distribution, the normal approximation will be satisfactory if each nipi exceeds 5 when p ≤ 1/2 and each niqi exceeds 5 when p > 1/2. The test statistic is

t = (pu − pl) / √(pq/nu + pq/nl)

We need the common estimate of p under Ho to use in the denominator, so we use the estimate for the entire group: p = 120/200 = 3/5 = .6 and q = .4. The p for the upper group = 40/60 ≈ .67. The p for the lower group = 80/140 ≈ .57. So inserting our values into the test statistic, we get

t = (.67 − .57) / √(.6(.4)/60 + .6(.4)/140) = .10 / .076 ≈ 1.32

Our critical region is t > 1.65, because we have set α = .05 in this 1-tailed test. Because of the large sample size, t.95 = z.95. Because the obtained t = 1.32 is lower than the required t = 1.65, we cannot reject the null hypothesis. So, given the data, we conclude that the first question is not a good one for distinguishing between the upper scorers and the lower scorers on the entire test. Now let's look at our test problem again. Suppose instead of 200 candidates we tested 500, but kept everything else in the problem the same.
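The pooled two-proportion test can be sketched in Python as below (an added illustration). Working with the exact proportions rather than the rounded .67 and .57 gives a slightly smaller statistic (≈1.26 instead of 1.32), but the conclusion is unchanged.

```python
import math

# Pooled two-proportion test for the civil-service question (one-tailed).
n_u, x_u = 60, 40     # upper 30% of the 200 candidates
n_l, x_l = 140, 80    # lower 70%
p_u, p_l = x_u / n_u, x_l / n_l

p = (x_u + x_l) / (n_u + n_l)   # pooled estimate of p under Ho: 0.6
q = 1 - p
se = math.sqrt(p * q / n_u + p * q / n_l)
t_stat = (p_u - p_l) / se
print(round(se, 3), round(t_stat, 2))  # 0.076 1.26
```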
t = (.67 − .57) / √(.6(.4)/150 + .6(.4)/350) = .10 / .0478 ≈ 2.092

Now we will reject Ho, because t = 2.092 is greater than 1.65, the critical value of t. This is why we never accept Ho, but only fail to reject it with the evidence in hand. It is always possible that a more powerful test will provide evidence to reject Ho. But this leads to another question. If, theoretically, we can always keep increasing sample size, then eventually we will always be able to reject Ho. So why do the test to begin with? The reality is that you can't keep increasing n in the real world, because there are constraints on time, money, and manpower that prevent having n so large that rejection of Ho is a foregone conclusion. We usually have to get by with the n we have available. Furthermore, even if we could get a larger sample size, there is no guarantee that everything else will remain the same. The mean difference in the numerator could change, and so could the variance estimates in the denominator. So we do the test because there is no other choice. Let's look at another example of testing the difference between two proportions. A railroad company installed two sets of 50 ties. The two sets were treated with creosote using two different processes. After a number of years in service, 22 ties of the first set and 18 ties of the second set were still in good condition. The question is whether one method of treating with creosote is better than the other. So we set up two hypotheses:

Ho: p1 = p2
Ha: p1 ≠ p2

Now we can use the t test statistic because the samples are large enough to assume normality of p1 − p2. For a 2-tailed test with α = .05, the critical value of t is 1.96.

t = (p1 − p2) / √(pq/n1 + pq/n2)

First, we need to get the values of p and q for the denominator. Since Ho treats both p1 and p2 as coming from populations with the same p, the common estimate of p is (22 + 18)/100 = .4, so q = .6. Now the t test is

t = (.44 − .36) / √((.4)(.6)/50 + (.4)(.6)/50) = .08 / .09798 ≈ .816

Clearly, we cannot reject Ho.
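The railroad-tie test follows the same pooled-proportion pattern; a quick sketch (added for illustration) confirms the numbers.

```python
import math

# Railroad-tie example: two creosote processes, 50 ties each (two-tailed).
n1 = n2 = 50
p1, p2 = 22 / 50, 18 / 50

p = (22 + 18) / (n1 + n2)   # pooled p = 0.4 under Ho
q = 1 - p
se = math.sqrt(p * q / n1 + p * q / n2)
t_stat = (p1 - p2) / se
print(round(se, 5), round(t_stat, 3))  # 0.09798 0.816
```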
As another example, consider the application of a fertilizer to plots of farm ground and the effect it has on the yield of corn in bushels. The data are

Treated:   6.2  5.7  6.5  6.0  6.3  5.8  5.7  6.0  6.0  5.8
Untreated: 5.6  5.9  5.6  5.7  5.7  5.8  5.7  6.0  5.5  5.5

The average yield for the treated plots is 6.0 with s² = 0.0711. The average yield for the untreated plots is 5.7 with s² = 0.0267.

Ho: μtreated = μuntreated
Ha: μtreated ≠ μuntreated

The test statistic is

t = (6.0 − 5.7) / √(0.0711/10 + 0.0267/10) = .3 / .098883 ≈ 3.034

So Ho can be rejected, because α = .05 and t.025 = 2.101 with 18 df. When you test the difference between two means, df = (nA − 1) + (nB − 1). So we can conclude that the fertilizer will help produce more bushels of corn. Now we can ask how many extra bushels of corn we will get with the fertilizer. The point estimate is .3, but we can find a 95% confidence interval around this estimate. In the case of a confidence interval for the difference between two means,

(X̄A − X̄B) ± t.025 √(s²X̄A + s²X̄B) = .3 ± 2.101(.0989) = .3 ± .208

.092 < μA − μB < .508

Note that the 95% confidence interval does not include 0, and the t test rejected Ho that the difference = 0. The confidence interval thus confirms the t test outcome. Because the sample size was only 10 for each group, we can't say with any degree of confidence that the increase in yield is more than .092, but it may be as much as .508. One caution about using small samples to test the difference between means is that t assumes equality of the two variances. If the samples are large, this assumption is unnecessary. Now how can we know if the two variances are equal? We can test them. We already know that each sample variance is related to a chi-square distribution. Now how can we test to see if two variances are equal? The answer is the F test. The F distribution is a ratio of two chi-square variables. So if χ²1 and χ²2 possess independent chi-square distributions with v1 and v2 df, respectively, then

F = (χ²1/v1) / (χ²2/v2)

has the F distribution with v1 and v2 df.
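A sketch of the two-sample t test and confidence interval follows (added for illustration). The data rows here are reconstructed from the flattened table above; they reproduce the stated means (6.0, 5.7) and variances (0.0711, 0.0267).

```python
import math

# Two-sample t test and 95% CI for the fertilizer example.
treated = [6.2, 5.7, 6.5, 6.0, 6.3, 5.8, 5.7, 6.0, 6.0, 5.8]
untreated = [5.6, 5.9, 5.6, 5.7, 5.7, 5.8, 5.7, 6.0, 5.5, 5.5]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):  # sample variance, n - 1 divisor
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

diff = mean(treated) - mean(untreated)
se = math.sqrt(var(treated) / 10 + var(untreated) / 10)
t_stat = diff / se

T_CRIT = 2.101  # two-tailed .05 point of t with 18 df, from tables
ci = (diff - T_CRIT * se, diff + T_CRIT * se)
print(round(t_stat, 2), round(ci[0], 3), round(ci[1], 3))  # 3.03 0.092 0.508
```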
The F distribution is

f(F) = c F^((v1/2) − 1) (v2 + v1F)^(−(v1+v2)/2)

where c is given by

c = [Γ((v1+v2)/2) / (Γ(v1/2) Γ(v2/2))] v1^(v1/2) v2^(v2/2)

and the symbol Γ(x) denotes the gamma or factorial function of x, which has the property that Γ(x+1) = xΓ(x). Now let's do the test to see if our two variances are equal. In the problem, the two variances are .0711 and .0267, so

F = s²1/s²2 = .0711/.0267 ≈ 2.66

with 10 − 1 = 9 and 10 − 1 = 9 df. Is 2.66 greater than would be expected if the two variances were equal? To answer this, we must consult the F distribution. Now it turns out that the critical values in the two tails are reciprocals of each other. That is, if F = s²1/s²2, then s²2/s²1 = 1/F. Because of this reciprocal property, the procedure is always to place the larger variance over the smaller. Then we can refer to the F distribution for the .025 critical region to see if the hypothesis of a common variance is to be rejected. For this case, with 9 and 9 df, the critical value of F = 4.025. (There is an FINV function in EXCEL to find critical values for the F distribution.) Since the observed value of 2.66 is less than the critical value of 4.025, we cannot reject the null hypothesis of a common variance. To see the reciprocality, if we had placed the smaller variance over the larger, the observed ratio would be 1/2.66 = .376 and the critical value would be 1/4.025 = .2484. In this case, for the left tail, the observed value must be less than the critical value to reject. But here .376 > .2484, so we would not reject Ho. So far, we have always been talking about the difference between two independent means. What if the means are not independent? In the situation of correlated means, we must find the standard error of the difference between two dependent means. So now the situation is complicated. Suppose we have 10 heart patients who took a treadmill test, then went on a strict exercise program, then retook the treadmill test.
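The variance-ratio test itself is one line of arithmetic; a short sketch (added for illustration, with the critical value taken from F tables as in the text):

```python
# Variance-ratio (F) test for the fertilizer example: Ho: the two
# population variances are equal.
s1_sq, s2_sq = 0.0711, 0.0267   # sample variances, 9 df each

# Convention: larger variance over the smaller, then compare with the
# upper .025 tail of F(9, 9).
f_ratio = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)
F_CRIT = 4.025  # upper .025 point of F(9, 9), from tables
print(round(f_ratio, 2), f_ratio > F_CRIT)  # 2.66 False
```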
Patient   Test 1   Test 2   Gain
1         15       21       6
2         18       23       5
3         18       21       3
4         20       25       5
5         20       24       4
6         22       29       7
7         23       28       5
8         24       28       4
9         22       25       3
10        25       33       8
Mean      20.7     25.7     5
St dev    3.093    3.802    1.633

Our question is whether the strict exercise program really helped the patients' treadmill performance.

Ho: μafter = μbefore
Ha: μafter > μbefore

Clearly, the before and after test values are correlated because they are scores for the same patients. In fact, the correlation coefficient = .908. We can use the t test statistic, but what do we put in the denominator? We have to incorporate this dependence in the standard error. The simplest way to do this is to reformulate Ho and Ha:

Ho: μdiff = 0
Ha: μdiff > 0

Now all we have to do is test the observed difference against the null difference of 0. Clearly the difference between means is normally distributed, so the t test can be used. Now we have

t = 5 / (1.633/√10) = 5 / .5164 ≈ 9.68

For α = .05 in the right tail (a 1-tailed test), the critical value of t with df = 9 is 1.833. Our observed value of 9.68 is much greater than 1.833, so we can reject Ho with 95% confidence. The point to see here is that a test of the before and after means would give the same result if the covariance were incorporated in the standard error, but there is no simple way to do that directly. Therefore, the way to handle correlated values is to use the average of the differences in the t test statistic numerator and the standard deviation of the differences, divided by the square root of n, in the denominator. Now what about a confidence interval around this difference? The 95% confidence interval is 5 ± 2.262(.5164) = 5 ± 1.168, or

3.832 < μdiff < 6.168

Again the 95% confidence interval does not include 0, confirming the t test result at α = .05. In this case, the t test is one-sided and the confidence interval is 2-sided; the two do not always agree for 1-sided t tests.
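The paired test on the gains can be sketched directly from the table (an added illustration):

```python
import math

# Paired (correlated-means) t test for the treadmill example:
# test the mean of the within-patient differences against 0.
before = [15, 18, 18, 20, 20, 22, 23, 24, 22, 25]
after = [21, 23, 21, 25, 24, 29, 28, 28, 25, 33]
gains = [a - b for a, b in zip(after, before)]

n = len(gains)
mean_gain = sum(gains) / n
s_gain = math.sqrt(sum((g - mean_gain) ** 2 for g in gains) / (n - 1))
t_stat = mean_gain / (s_gain / math.sqrt(n))
print(round(mean_gain, 1), round(s_gain, 3), round(t_stat, 2))  # 5.0 1.633 9.68
```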
Tests of correlated data are much less common than tests of independent groups. Nonetheless, they do happen, and testing the differences is the way to handle them. Most correlated data come from before-after situations, so you must be careful in your inferences about the effect of the intervening activity if you reject Ho. For example, it is possible in our example that some of the patients were taking some drug that helped their treadmill performance. So we can't conclude that it was just the exercise program unless we make sure that there are no other factors to consider. Another approach to an admissible test strategy is the one developed by Bayes, which turns out to be a likelihood ratio test. Bayes' formula is used to determine the likelihood of a hypothesis, given an outcome:

P(Hi | D) = P(Hi) P(D | Hi) / Σ(j=1..k) P(Hj) P(D | Hj)

This formula gives the likelihood of Hi given the data you actually got, relative to the total likelihood of every hypothesis given the data you got. So Bayes' strategy is a likelihood ratio test. Consider an example where there are two identical boxes. Box 1 contains 2 red balls and Box 2 contains 1 red ball and 1 white ball. Now a box is selected by chance and 1 ball is drawn from it. What is the probability that it was Box 1 that was selected if the ball that was drawn was red? Let's test this with Bayes' formula. There are only two hypotheses here, so H1 = Box 1 and H2 = Box 2. The data, of course, = R. So we can find

P(H1 | R) = P(H1)P(R | H1) / [P(H1)P(R | H1) + P(H2)P(R | H2)]
          = (1/2)(1) / [(1/2)(1) + (1/2)(1/2)] = 2/3

And we can find

P(H2 | R) = P(H2)P(R | H2) / [P(H1)P(R | H1) + P(H2)P(R | H2)]
          = (1/2)(1/2) / [(1/2)(1) + (1/2)(1/2)] = 1/3

So we can see that the odds of the data favoring Box 1 over Box 2 are 2:1. We are twice as likely to be right if we choose Box 1, but there is still some probability that it could be Box 2. The reason we choose Box 1 is that it is more likely, given the data we have.
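Bayes' formula for the two-box example can be written out directly, using exact fractions so the 2/3 and 1/3 appear exactly (an added illustration):

```python
from fractions import Fraction

# Bayes' rule for the two-box example: Box 1 holds 2 red balls,
# Box 2 holds 1 red and 1 white; a box is chosen at random, a red ball drawn.
priors = {"Box1": Fraction(1, 2), "Box2": Fraction(1, 2)}
likelihood_red = {"Box1": Fraction(1), "Box2": Fraction(1, 2)}

# Denominator of Bayes' formula: total probability of drawing red
evidence = sum(priors[h] * likelihood_red[h] for h in priors)
posterior = {h: priors[h] * likelihood_red[h] / evidence for h in priors}
print(posterior["Box1"], posterior["Box2"])  # 2/3 1/3
```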
This is the whole idea behind likelihood ratio tests. We choose the hypothesis which has the greater likelihood, given the data we have. With other data, we might choose another hypothesis. Now we're going to look at tests where the test statistic is χ². The first case is a test of goodness of fit. Here we are comparing an observed distribution with some distribution expected under the null hypothesis to see if the data fit Ho or not. Since we're dealing with distributions here and not means, we will use frequencies instead of measurements. Suppose you want to know if a die is fair. Under the null hypothesis, the die is fair. Under the alternative hypothesis, the die is not fair.

Ho: p = 1/6 for all sides of the die
Ha: p ≠ 1/6 for some side of the die

Now how do we test this? We do an experiment to test the die. The observed data are shown in the table below:

X:         1    2    3    4    5    6
Observed:  n1   n2   n3   n4   n5   n6
Expected:  n/6  n/6  n/6  n/6  n/6  n/6

The expected data are what is expected under Ho. The question is whether or not the observed data agree with the expected data. If they do not agree, then we reject Ho that the die is fair. This goodness-of-fit test is due to Karl Pearson. The test statistic is

χ² = Σ(i=1..k) (ni − npio)² / npio

where k = number of categories. In fact, this χ² test, like the t test, turns out to be equivalent to a likelihood ratio test. Now suppose the experiment consists of 60 rolls of the die with the following results:

X:         1         2         3         4         5         6
Observed:  15        7         4         11        6         17
Expected:  60/6=10   60/6=10   60/6=10   60/6=10   60/6=10   60/6=10

Now we can apply our test statistic to these data to see if Ho is to be rejected or not:

χ² = (15−10)²/10 + (7−10)²/10 + (4−10)²/10 + (11−10)²/10 + (6−10)²/10 + (17−10)²/10 = 136/10 = 13.6

Now for α = .05, the critical value of χ² with 6 − 1 = 5 df is 11.1. Since our observed χ² > 11.1, we reject Ho that the die is fair.
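Pearson's goodness-of-fit statistic for the die is a one-liner; the sketch below (added for illustration) reproduces the 13.6.

```python
# Pearson goodness-of-fit test: is the die fair? 60 rolls observed.
observed = [15, 7, 4, 11, 6, 17]
n = sum(observed)             # 60
expected = [n / 6] * 6        # 10 per face under Ho

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 1))  # 13.6, beyond the 11.1 critical value for 5 df
```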
The test statistic we used approaches the χ² distribution when n is large, because the proportions p are approximately normally distributed when n is large. A limitation of the use of χ² is that all of the expected frequencies must be ≥ 5. This is similar to the limitation for the use of the normal approximation to the binomial, in which np and nq were required to be > 5. The expected frequencies are not always equal. Consider the following example. In experiments on breeding flowers, the colors were purple, red, green, and yellow. The flower colors are expected to occur in a 9:3:3:1 ratio. So

Ho: pp = 9/16, pr = 3/16, pg = 3/16, py = 1/16

in a multinomial distribution involving four categories, for which n = 217. The question is whether the colors are in accord with the theoretically expected frequencies. The observed and expected data are

      purple  red  green  yellow  total
ni    120     48   36     13      217
ei    122     41   41     14      218

The expected data are obtained by multiplying each expected p by n. So the expected frequencies for purple and red flowers are

purple: (217)(9/16) = 122.06
red: (217)(3/16) = 40.69

Because this gives rounded decimal values, the totals for observed and expected frequencies may not be identical. Here the test statistic is

χ² = (120−122)²/122 + (48−41)²/41 + (36−41)²/41 + (13−14)²/14 ≈ 1.909

The critical value of χ² with 3 df for a critical region of size α = .05 is 7.8. Since the observed χ² of 1.909 < 7.8, the null hypothesis cannot be rejected. Another use of the χ² distribution for hypothesis testing is for tests of independence in contingency tables. A contingency table is a cross-classification of two variables:

          B1    B2         B3    B4    Totals
A1                                     p1.
A2              (p2.)(p.3)             p2.
A3                                     p3.
Totals    p.1   p.2        p.3   p.4

where pi. is the probability of being in the ith row and p.j is the probability of being in the jth column. Since, under Ho, these variables are independent, the probability of the ijth cell is the product of the row and column probabilities, as illustrated for cell 23 above.
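The flower-color test with unequal expected frequencies can be sketched the same way (added for illustration); using the unrounded expected values gives ≈1.91, matching the transcript's 1.909.

```python
# Goodness of fit to the 9:3:3:1 flower-color ratio, n = 217.
observed = [120, 48, 36, 13]             # purple, red, green, yellow
ratios = [9, 3, 3, 1]
n = sum(observed)
expected = [n * r / 16 for r in ratios]  # 122.06, 40.69, 40.69, 13.56

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))  # 1.91, well below the 7.8 critical value for 3 df
```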
The dot notation is common in cross-classifications, where the dot stands in place of the subscript that has been summed over to get the marginal:

ni. = Σ(j=1..c) nij,  n.j = Σ(i=1..r) nij,  pi. = ni./n,  p.j = n.j/n

In this case,

Ho: Variables A and B are independent
Ha: Variables A and B are not independent

and the test statistic is

χ² = Σ(i=1..r) Σ(j=1..c) (nij − npi.p.j)² / (npi.p.j)

with (r−1)(c−1) df. This test is also due to Karl Pearson. Let's look at an example. Suppose an experimenter is interested in whether or not educational level is related to marital adjustment, and has collected the following data, where the values in parentheses are expected frequencies.

                            Marital Adjustment Score
Educational Level   Very low   Low       High      Very high   Totals
College             18 (27)    29 (39)   70 (64)   115 (102)   232
High school         17 (13)    28 (19)   30 (32)   41 (51)     116
Grade school        11 (6)     10 (9)    11 (14)   20 (23)     52
Totals              46         67        111       176         400

How do you get the expected frequencies? Just replace pi. and p.j with their maximum likelihood estimates, so that the expected frequency is

npi.p.j = n (ni./n)(n.j/n) = ni.n.j/n

and we can find the expected frequency for cell 11 by

(232)(46)/400 = 26.68 ≈ 27

Now the test statistic is

χ² = Σ(i=1..r) Σ(j=1..c) (nij − ni.n.j/n)² / (ni.n.j/n)

which has a χ² distribution with (r−1)(c−1) df if n is sufficiently large and Ho is true. Note that nij is the observed cell frequency and ni.n.j/n is the expected cell frequency under Ho. Now we can find all the cell expected frequencies and find the observed χ², just as we did for the goodness-of-fit test. The critical χ² for (3−1)(4−1) = 6 df with α = .05 is 12.592. The observed χ² for our contingency table is 20.7, so the null hypothesis can be rejected with 95% confidence. Again, we must be sure that all expected frequencies are ≥ 5 to assure the validity of the χ² test. Now let's look at some more examples. The following data are for school children in a city in Scotland.
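The full independence test can be sketched in a few lines (added for illustration). Note that with exact expected frequencies the statistic comes out ≈19.9; the transcript's 20.7 comes from rounding the expecteds to whole numbers, and the conclusion is the same either way.

```python
# Chi-square test of independence for the education x marital-adjustment data.
table = [
    [18, 29, 70, 115],   # college
    [17, 28, 30, 41],    # high school
    [11, 10, 11, 20],    # grade school
]
n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Expected cell frequency under Ho is (row total)(column total)/n
chi_sq = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(len(table))
    for j in range(len(table[0]))
)
print(round(chi_sq, 1))  # 19.9, beyond the 12.592 critical value for 6 df
```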
                          Hair
Eyes      Fair    Red   Medium   Brown   Black    Total
blue      1368    170    1041     398       1      2978
light     2577    474    2703     932      11      6697
medium    1390    420    3826    1842      33      7511
dark       454    255    1848    2506     112      5175
Total     5789   1319    9418    5678     157     22361

Test to see whether hair color and eye color are independently distributed. In this study,

Ho: hair and eye color are independent
Ha: hair and eye color are not independent

The appropriate test statistic here is χ2 with (4−1)(5−1) = 12 df. For α = .05, the critical value of χ2 = 21.026.

Now this is a big table and involves so much computation that we will look for a shortcut. Let's look for some obvious incompatibility. The smallest eye-color total overall is for blue eyes, yet blue eyes has the smallest frequency for all hair colors except fair. Since cell11 is incompatible with the other cells in row 1, let's look at the χ2 value for this cell. For cell11, the expected frequency under Ho is

ni.n.j/n = (2978)(5789)/22361 = 770.97 ≈ 771

So the χ2 value for this cell is

(1368 − 771)²/771 = 356409/771 = 462.27

Just with this one cell, we can reject Ho because 462.27 >> 21.026. We don't need to compute the χ2 values for all the other cells because we can already reject Ho. If we were to add the χ2 values for all the other cells, the rejection would be very much stronger.

This example illustrates the value of looking at the data to see where there is an inconsistency and then checking that cell. We also could have checked cell35, because row 3 has the highest row total yet this cell is not the highest in its column. Looking at the data also helps to find the cell(s) with the most unexpected result. It may be that most of the cells are close to expectation, but that one or two of them are very much not in accord with expectation. This could be a very useful finding.

Take another example. Five brands of canned salmon are being tested for quality. The tester examines 24 cans of each brand and finds the following results.
                      Brand
Quality      A     B     C     D     E    Total
High        21    14    17    22    16     90
Very low     3    10     7     2     8     30
Total       24    24    24    24    24    120

Can you say which brand of salmon you would most like to buy and which you would not accept? In this case,

Ho: brand and quality are independent
Ha: brand and quality are not independent

The critical value for the χ2 statistic with (2−1)(5−1) = 4 df where α = .05 is 9.488. Let's see if we can reject Ho.

The expected frequencies are easy to compute here because the column totals are all equal. So for the first row, all the expected frequencies = (90)(24)/120 = 18. For the second row, all the expected frequencies = (30)(24)/120 = 6. The χ2 is

  (21 − 18)²/18 + (14 − 18)²/18 + (17 − 18)²/18 + (22 − 18)²/18 + (16 − 18)²/18
+ (3 − 6)²/6 + (10 − 6)²/6 + (7 − 6)²/6 + (2 − 6)²/6 + (8 − 6)²/6
= .5 + .89 + .06 + .89 + .22 + 1.5 + 2.67 + .17 + 2.67 + .67
= 10.22

which is greater than the critical value of 9.488, so we can reject Ho with 95% confidence. But we can do more than that. We can choose the salmon brand we will buy (brand D) and the brand we will avoid (brand B) because they have the largest combined (high quality + very low quality) χ2 values.

Finally, there is a third way to use χ2 as a test statistic. We have already seen χ2 used to form a confidence interval for the variance σ². Now we see how to use χ2 for testing an observed vs a hypothesized value of σ. Consider the following problem. Past experience for a manufactured product has shown that σ = 7.5. However, the latest sample of size 25 gave s = 10. Has the variability of this product increased? We set up the two hypotheses:

Ho: σ = 7.5
Ha: σ > 7.5

The test statistic is

χ2 = Σi=1..n (xi − x̄)²/σ² = (n − 1)s²/σ² = (25 − 1)(10)²/(7.5)² = 42.67

and the critical value of χ2 with 24 df for α = .05 in this 1-tailed test is 36.415. Since our observed value > the critical value, we reject Ho with 95% confidence and claim that variability has increased. The right tail of the χ2 distribution is a restricted best critical region because the best critical region exists only when μ = 0.
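The variance test above is a one-line computation. Here is a minimal Python sketch (not from the slides), assuming the tabled 24-df critical value:

```python
# Chi-square test of an observed s against a hypothesized sigma.
# Ho: sigma = 7.5   Ha: sigma > 7.5   (n = 25, s = 10)
n = 25
s = 10.0
sigma0 = 7.5

chi2 = (n - 1) * s ** 2 / sigma0 ** 2   # (n-1)s^2 / sigma^2
print(round(chi2, 2))  # 42.67

CRITICAL = 36.415  # tabled chi-square critical value, 24 df, alpha = .05, upper tail
print(chi2 > CRITICAL)  # True: reject Ho, variability has increased
```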
If the alternative were σ < σo, then the left tail of the χ2 distribution is a restricted best test. But if the alternative were σ ≠ σo, there is no best critical region, and we use the two equal tails of the χ2 distribution as the compromise critical region.

We have looked at hypothesis testing with three different statistics: t, F, and χ2. In all cases, we have chosen the desired α level and used the test statistic to see if the result falls in the critical region. When it does and we reject Ho at the chosen α level, we call the result significant.

There is another way to deal with testing that does not involve setting the chosen α level beforehand. This way is called significance testing. We still do hypothesis testing as before, except that now we do not choose the α level. That is, we do not have a critical region or a critical value of the test statistic. Instead we compute the observed value of the test statistic and then find the probability that the test statistic will exceed this value. If the probability is small enough, we say the result is significant at p < whatever probability value we get.

For example, in the light bulb test, we observed a t of 4.23. If we had been doing significance testing instead of hypothesis testing, we would find the probability of a t this great or greater. It turns out that this probability is 0.0000994, so we would say that this result is significant at p < .0000994. Significance testing, in effect, gives us the smallest α level at which we would reject Ho for the data we have.

For the treadmill test of correlated means, we observed a t = 9.68, which was significant at the .05 level. If we were doing significance testing here, we would declare the result significant at p < 0.00000469.

Significance testing vs hypothesis testing was at one time a very controversial issue. But today significance testing is more commonly used than hypothesis testing. This is a case where technological advances changed statistical practice.
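A p-value is just the tail probability beyond the observed statistic. As a sketch (not from the slides), for an even number of degrees of freedom the chi-square upper tail has a closed form, exp(−x/2)·Σ (x/2)^k/k! for k < df/2, so we can find the p-value for the variance test above (observed χ2 = 42.67 with 24 df) without tables; the function name chi2_sf is our own:

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi2 > x), valid only for EVEN df,
    via the closed-form series exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    assert df % 2 == 0, "closed form used here requires even df"
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

# p-value for the variance test: chi2 = (25-1)(10^2)/(7.5^2) with 24 df
chi2 = (25 - 1) * 10.0 ** 2 / 7.5 ** 2
p = chi2_sf(chi2, 24)
print(round(p, 3))  # 0.011 -- the smallest alpha at which Ho would be rejected
```

Since 0.011 < .05, this agrees with the earlier rejection at the α = .05 level.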
When hypothesis testing was first developed, there were no computers, so people selected α levels for which the critical values of the major test statistics were tabled. With computers, significance testing took over because there was no longer any need to use tables. The p-value was computable in a split second, so why not use it?