22 September 2006
MA 224.1
Paul E. Hand

1 Basic Hypothesis Tests

1.1 Objectives

In this section pay attention to:
• the logic by which we say a coin is unfair
• the concept of a significance level

1.2 Problem Statements

Suppose someone claimed to have extrasensory perception (ESP). Naturally, others might doubt this claim, but what could a person do to convince others that he or she has ESP? One problematic issue in an inquiry like this is that a test subject might appear to have ESP by doing nothing but guessing. The tenability of such an explanation depends on the nature of the task at hand and the subject’s performance, among other things. The field of statistics is concerned with determining whether chance alone is a sufficient explanation for certain observations.

Consider testing for ESP by asking a series of questions with only two possible answers. The correct answer is determined by some randomizing device known to be fair. Essentially, the test subject is trying to guess the outcome of a fair coin toss. For example, a sequence of cards might lie face down and a subject calls out whether the next card is red or black.

In an investigation into the kind of ESP called clairvoyance, subjects had to call out which of 5 shapes was on a face-down card. In a pooling of many studies, there were 907,030 calls. One would expect around 907,030/5 = 181,406 correct answers. In the pooling of studies, there were 194,605 correct calls, meaning subjects were right roughly 5.36/25 of the time, against the 5/25 expected from guessing (Rhine 1940). Could this fluctuation be due to chance alone? How do we tell if someone is more likely than expected to correctly call the outcome of some randomizing device?

Alternatively, one could wonder whether a coin toss is fair. After all, because there are different engravings on the two sides, there is no a priori reason to believe the coin is exactly 50/50. A more useful question is then whether or not a coin toss is fair enough.
Is there enough evidence to doubt its fairness? We will pursue this question first.

How do we tell if a coin is fair? Consider extreme cases:
• Suppose we flip a coin 100 times and get 50 heads. This isn’t proof that the coin is fair, but it certainly isn’t reason to doubt its fairness.
• Suppose we flip a coin 100 times and get 100 heads. This isn’t proof that the coin isn’t fair, but it does cast substantial doubt.
• Suppose we flip a coin 2 times and we get 2 heads. Somehow, this isn’t terribly convincing of anything.
• If we flip a coin 1 time and get 1 head, it would be foolish to be convinced that the coin is unfair.

Do the following cases differ in the conclusions we can draw?
• What if we flip a coin 10 times and get 7 heads?
• What if we flip a coin 100 times and get 70 heads?
• What if we flip a coin 1000 times and get 700 heads?

We need a framework in which to test the hypothesis that the coin is or isn’t fair.

1.3 A naive hypothesis test for fairness of a coin

Consider flipping a coin n times. Suppose we observe X heads. We wish to decide between two hypotheses:
• Ho: The coin is fair, i.e. the probabilities of heads and tails are both 0.5.
• HA: The coin is unfair, i.e. one of heads or tails is more likely than the other.

Ho is called the null hypothesis. HA is called the alternative hypothesis. We need a method to decide whether to accept Ho or to reject Ho in favor of HA. It is naively reasonable to decide via the following test:
• If X = n/2, we believe the coin is fair. We accept Ho. If it were unfair, which of heads or tails should we believe is preferred?
• If X > n/2, we believe the coin is unfair, biased toward heads. We reject Ho and accept HA. After all, it did show at least a slight inclination toward heads.
• If X < n/2, we believe the coin is unfair, biased toward tails. We reject Ho and accept HA.

Is this a reasonable method? Let’s conduct an experiment. Flip a coin 10 times. Use this method to determine if it is fair.
Repeat several times. The flaw in the method should be apparent.

Exercise 1. Find other flaws in this method.

To determine if this is a reasonable method of assessing the fairness of a coin, we would ask whether fair coins are determined to be fair and whether unfair coins are determined to be unfair. Suppose a coin were fair. Upon flipping n times, would it appear to be fair?

A test of fairness will be based on the number of observed heads in n flips, but it is possible to get any number of heads from 0 to n. We are hence resigned to the fact that there will always be some chance the coin incorrectly appears biased. Hopefully, such incorrect outcomes are improbable.

Assume a fair coin is flipped n times. What is the probability that the method above determines the coin is fair? The coin is determined fair if exactly n/2 flips give heads. Assuming n is even, the probability of n/2 heads in n flips is $\binom{n}{n/2} \frac{1}{2^n}$, as there are $\binom{n}{n/2}$ ways of getting n/2 heads in n flips and there are $2^n$ outcomes of n flips. The following table shows the probability that a fair coin is determined to be fair after n flips.

n        P(fair coin is determined to be fair)
2        $\frac{1}{2}$
10       $\frac{1}{2^{10}} \binom{10}{5} \approx 0.25$
100      $\frac{1}{2^{100}} \binom{100}{50} \approx 0.08$
1000     $\frac{1}{2^{1000}} \binom{1000}{500} \approx 0.03$
10000    $\frac{1}{2^{10000}} \binom{10000}{5000} \approx 0.008$

None of these results is acceptable. If we flip twice, we will wrongly conclude the coin is biased half of the time. If we flip more, we wrongly conclude bias even more often.

Precisely, the fault with the proposed test is that P(determine coin is biased | coin is fair) is too big. This is to say P(reject Ho | Ho) is too big. We read the above notation as “the probability of rejecting Ho given Ho is true.” Denote α = P(reject Ho | Ho) and call α the significance level of the test. Alternatively, we could call it the probability of a false positive. We would like α to be small. Typically the value α = 0.05 is used. While small values of α are preferred, it will not be possible, in non-trivial scenarios, to have α = 0.
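The probabilities in the table above can be checked numerically. The following is a minimal Python sketch, assuming n is even as in the text; the function name prob_exactly_half is introduced here for illustration only and does not come from the notes.

```python
from math import comb


def prob_exactly_half(n: int) -> float:
    """Probability of exactly n/2 heads in n flips of a fair coin (n even)."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    # comb(n, n // 2) counts the ways to place n/2 heads among n flips;
    # 2**n counts all equally likely outcomes of n flips.
    return comb(n, n // 2) / 2**n


if __name__ == "__main__":
    for n in (2, 10, 100, 1000, 10000):
        p_fair = prob_exactly_half(n)
        # 1 - p_fair is the significance level of the naive test.
        print(f"n = {n:>5}: P(declared fair) = {p_fair:.4f}, alpha = {1 - p_fair:.4f}")
```

Both the binomial coefficient and the power of two are computed as exact integers, and only the final division produces a floating-point number, so large values such as n = 10000 do not overflow.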

1.4 A better test

We need a better test for the fairness of a coin. The fact that the significance level of the prior test was too high means that the circumstance under which we decided in favor of Ho was too rare, assuming Ho is true. In accordance with our experiment with fair coins, there is some range of values in which the number of heads is expected to lie. Only if the number of heads is outside this range can we start doubting the presumption that the coin is fair. Thus we propose the following variation of the hypothesis test above, when n = 10.

Flip a coin 10 times. Let X be the number of times that the coin comes up heads.
• If |X − 5| > ___ then reject Ho.
• Otherwise, accept Ho.

What number should be in place of the blank above? The test from the last section had 0. This was too restrictive. Let’s try to find the range which would give a test with significance level α = 0.05.

Consider the test above with rejection of Ho if |X − 5| > 2. That is to say, we reject Ho if X = 0, 1, 2, 8, 9, or 10. What is the significance level of the test? The probability of getting k heads in n flips of a fair coin is $\binom{n}{k} \frac{1}{2^n}$. We calculate

\[
\alpha = P(\text{reject } H_o \mid H_o) = P(X \le 2 \text{ or } X \ge 8 \mid H_o)
= \frac{\binom{10}{0} + \binom{10}{1} + \binom{10}{2} + \binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}}
\approx 0.11
\]

The above calculation shows that with this method, the probability of declaring a fair coin to be biased is greater than one tenth. We want this value to be at most one in twenty.

Consider the test above with rejection if |X − 5| > 3. That is to say, we reject Ho if X = 0, 1, 9, or 10. What is the significance level of the test?

\[
\alpha = P(\text{reject } H_o \mid H_o) = P(X \le 1 \text{ or } X \ge 9 \mid H_o)
= \frac{\binom{10}{0} + \binom{10}{1} + \binom{10}{9} + \binom{10}{10}}{2^{10}}
\approx 0.02
\]

This significance level meets our requirement that α ≤ 0.05. We can say that the following test has significance level α ≈ 0.02:

Flip a coin 10 times. Let X be the number of times that the coin comes up heads.
• If |X − 5| > 3 then reject Ho.
• Otherwise, accept Ho.
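The two calculations above follow the same pattern: add up the binomial counts over the rejection region and divide by 2^n. The following Python sketch does this for a general cutoff c, assuming the test has the form "reject Ho if |X − n/2| > c". The names significance_level and smallest_cutoff are hypothetical helpers introduced for this illustration.

```python
from math import comb


def significance_level(n: int, c: int) -> float:
    """alpha for the test 'reject Ho if |X - n/2| > c' when the coin is fair."""
    # Count the outcomes of n flips whose number of heads k lies in the rejection region.
    rejecting = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) > c)
    return rejecting / 2**n


def smallest_cutoff(n: int, alpha_max: float = 0.05) -> int:
    """Smallest integer c for which the test has significance level <= alpha_max."""
    for c in range(n + 1):
        if significance_level(n, c) <= alpha_max:
            return c
    raise ValueError("no cutoff found")  # not reached: c = n gives alpha = 0


if __name__ == "__main__":
    for c in (2, 3):
        print(f"n = 10, c = {c}: alpha = {significance_level(10, c):.4f}")
    print("smallest cutoff for n = 10:", smallest_cutoff(10))
```

The search starts at c = 0 and stops at the first cutoff that meets the requirement, which mirrors the trial-and-error reasoning in the text. The same function can be called with other values of n to obtain an exact cutoff where hand calculation is impractical.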

1.5 Terminology

In the above section, Ho is called the null hypothesis. It represents the hypothesis we are trying to disprove. Often, Ho corresponds to where presumption would lie. In the example with a coin flip, we presume the coin is fair. If there is strong enough evidence, we will doubt this presumption.

HA is called the alternative hypothesis.

X was the number of times we got heads in a certain number of flips. This is a random variable. It is the outcome of a random experiment. We will usually denote random variables with capital letters.

The above test rejected Ho if |X − 5| > 3. We call the set of all such X the rejection region for the test. Here the rejection region is {0, 1, 9, 10}. The acceptance region is everything else: {2, 3, 4, 5, 6, 7, 8}.

As mentioned before, the significance level of the test is α = P(reject Ho | Ho). If X is in the rejection region of a test with significance level α, we can say that we reject the hypothesis that the coin is fair at significance level α. We could also say that if X is in the rejection region, the difference in the fraction of heads from 1/2 was statistically significant. This is to say, if Ho is true, the probability of a difference at least as large as that observed is less than 0.05.

1.6 Simulation

Sometimes it might be easier to simulate random experiments than to treat them analytically. We will consider a hypothesis test where we flip a coin 1000 times:

Flip a coin 1000 times. Let X be the number of times that the coin comes up heads.
• If |X − 500| > ___ then reject Ho.
• Otherwise, accept Ho.

What number should be in place of the blank above in order to get a test with a 5% significance level? We will have a computer simulate 1000 fair coin tosses, record the value of X, and repeat 10000 times. Below is the histogram of the resulting X’s.

[Figure: histogram of the simulated values of X. Horizontal axis: X (ticks from 440 to 540); vertical axis: Count (ticks from 50 to 250).]

There were 527 outcomes with |X − 500| > 30 but only 449 outcomes with |X − 500| > 31.
Hence, just over 95% of the outcomes were in the region [469, 531]. We can now give a reasonable but approximate rejection region. The two different colors in the above histogram show the regions where 95% and 5% of the outcomes lie.

Flip a coin 1000 times. Let X be the number of times that the coin comes up heads.
• If |X − 500| > 31 then reject Ho.
• Otherwise, accept Ho.

As per the simulation above, there is a probability of approximately 449/10000 that the test rejects Ho given Ho is true, giving α ≈ 0.045.

Exercise 2. Compare the rejection region from the experiment with 10 flips to that from the experiment with 1000 flips.

References:
Rhine, J. (1940). Extra-Sensory Perception: A Review. Scientific Monthly, 51, 450-459.
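
The simulation described in Section 1.6 can be sketched in Python as follows. This is an illustration under stated assumptions, not the program used to produce the figures in the notes: the function names, the fixed seed, and the use of Python's standard random module are all choices made here, so the counts it prints will generally differ somewhat from the 527 and 449 reported above.

```python
import random


def simulate_heads(n_flips: int, n_trials: int, seed: int = 0) -> list[int]:
    """Return the number of heads in each of n_trials runs of n_flips fair flips."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    # getrandbits(n_flips) draws n_flips independent fair bits at once;
    # counting the 1 bits counts the heads.
    return [bin(rng.getrandbits(n_flips)).count("1") for _ in range(n_trials)]


def estimated_alpha(heads_counts: list[int], center: int, c: int) -> float:
    """Fraction of simulated runs that fall in the region |X - center| > c."""
    rejected = sum(1 for x in heads_counts if abs(x - center) > c)
    return rejected / len(heads_counts)


counts = simulate_heads(n_flips=1000, n_trials=10_000)

if __name__ == "__main__":
    for c in (30, 31):
        print(f"|X - 500| > {c}: estimated alpha = {estimated_alpha(counts, 500, c):.4f}")
```

Because the estimate is based on a finite number of trials, it fluctuates from one seed to another; increasing n_trials narrows that fluctuation.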