22 September 2006
MA 224.1
Paul E. Hand
1 Basic Hypothesis Tests
1.1 Objectives
In this section pay attention to:
• the logic by which we say a coin is unfair
• the concept of a significance level
1.2 Problem Statements
Suppose someone claimed to have extra sensory perception (ESP). Naturally, others might doubt this claim,
but what could a person do to convince others s/he has ESP?
One problematic issue in an inquiry like this is that it is conceivable that a test subject might appear to
have ESP by doing nothing but guessing. The tenability of such an explanation depends on the nature of
the task at hand and the subject’s performance, among other things. The field of statistics is concerned with
determining whether chance alone is a sufficient explanation for certain observations.
Consider testing for ESP by asking a series of questions with only two possible answers. The correct
answer is determined by some randomizing device known to be fair. Essentially, the test subject is trying to
guess the outcome of a fair coin toss. For example, a sequence of cards might lie face down and a subject
calls out whether the next card is red or black.
In an investigation into the kind of ESP called clairvoyance, subjects had to call out which among 5
shapes is on a face down card. In a pooling of many studies, there were 907,030 calls. One would expect
around 907030/5 = 181406 correct answers. In the pooling of studies, there were 194,605 correct calls,
meaning subjects were right roughly 5.36/25 of the time, rather than the 5/25 expected from guessing
(Rhine 1940). Could this fluctuation be due to chance alone?
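The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch of the calculation quoted above; the variable names are illustrative and not part of the original notes.

```python
# Figures quoted from the pooled studies (Rhine 1940).
calls = 907_030      # total number of calls
correct = 194_605    # number of correct calls
shapes = 5           # number of possible shapes per card

expected_correct = calls / shapes      # hits expected from pure guessing
observed_rate = correct / calls        # fraction of calls that were right
excess = correct - expected_correct    # hits beyond what guessing predicts

print(f"expected correct: {expected_correct:.0f}")
print(f"observed rate:    {observed_rate:.4f} (guessing gives {1 / shapes:.4f})")
print(f"excess correct:   {excess:.0f}")
```

The observed rate is a little above 0.2, which is the gap the question above asks about.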
How do we tell if someone is more likely than expected to call the outcome of some randomizing device?
Alternatively, one could wonder whether a coin toss is fair. After all, because there are different engravings on the two sides, there is no a priori reason to believe the coin is exactly 50/50. A more useful question
is then whether or not a coin toss is fair enough. Is there enough evidence to doubt its fairness? We will
pursue this question first.
How do we tell if a coin is fair?
Consider extreme cases:
• Suppose we flip a coin 100 times and get 50 heads. This isn’t proof that the coin is fair, but it certainly
isn’t reason to doubt its fairness.
• Suppose we flip a coin 100 times and get 100 heads. This isn’t proof that the coin isn’t fair, but it does
cast substantial doubt.
• Suppose we flip a coin 2 times and we get 2 heads. Somehow, this isn’t terribly convincing of anything.
• If we flip a coin 1 time and get 1 head, it would be foolish to be convinced that the coin is unfair.
Is there a difference between the following in the conclusions we can draw?
• What if we flip a coin 10 times and get 7 heads?
• What if we flip a coin 100 times and get 70 heads?
• What if we flip a coin 1000 times and get 700 heads?
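One way to get a feel for these three cases is to ask how often a fair coin would land at least this far from an even split. The sketch below computes that probability exactly by counting outcomes; the function name `prob_at_least_as_far` is a hypothetical helper introduced here, not something from the notes.

```python
from math import comb

def prob_at_least_as_far(n, heads):
    """P(|X - n/2| >= |heads - n/2|), where X counts heads in n fair flips.

    Distances are compared as |2k - n| so that everything stays in integers.
    """
    distance = abs(2 * heads - n)
    favorable = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= distance)
    return favorable / 2**n

for n, heads in [(10, 7), (100, 70), (1000, 700)]:
    print(f"{heads} heads in {n} flips: {prob_at_least_as_far(n, heads):.3g}")
```

For 7 heads in 10 flips the value is 0.34375, so such a result is unremarkable for a fair coin. The same 70% fraction of heads becomes far less likely as the number of flips grows.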
We need a framework in which to test the hypothesis that the coin is or isn’t fair.
1.3 A naive hypothesis test for fairness of a coin
Consider flipping a coin n times. Suppose we observe X heads. We wish to decide between two hypotheses
• Ho : The coin is fair, i.e. the probabilities of heads and tails are both 0.5.
• HA : The coin is unfair, i.e. one of heads or tails is more likely than the other.
Ho is called the null hypothesis. HA is called the alternative hypothesis.
We need a method to decide whether to accept Ho or to reject Ho in favor of HA .
It is naively reasonable to decide via the following test:
• If X = n/2, we believe the coin is fair. We accept Ho . If it were unfair, which of heads or tails should
we believe is preferred?
• If X > n/2, we believe the coin is unfair, biased toward heads. We reject Ho and accept HA . After all,
it did show at least a slight inclination toward heads.
• If X < n/2, we believe the coin is unfair, biased toward tails. We reject Ho and accept HA .
Is this a reasonable method?
Let’s conduct an experiment. Flip a coin 10 times. Use this method to determine if it is fair. Repeat
several times. The flaw in the method should be apparent.
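If no coin is at hand, the experiment can be imitated on a computer. The following is a minimal sketch that assumes Python's pseudorandom generator is an adequate stand-in for a fair coin; the names `naive_test` and `run_experiments` and the seed are illustrative choices.

```python
import random

def naive_test(heads, n):
    """Return True if the naive test accepts Ho, i.e. exactly half the flips are heads."""
    return 2 * heads == n

def run_experiments(n=10, repeats=20, seed=0):
    """Flip a simulated fair coin n times, apply the naive test, and repeat."""
    rng = random.Random(seed)  # fixed seed only so that runs are reproducible
    verdicts = []
    for _ in range(repeats):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        verdicts.append(naive_test(heads, n))
    return verdicts

verdicts = run_experiments()
print(f"fair coin declared fair in {sum(verdicts)} of {len(verdicts)} experiments")
```

Changing `seed` or `repeats` gives different runs; the coin being simulated is fair in every one of them.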
Exercise 1. Find other flaws in this method.
To determine if this is a reasonable method of assessing the fairness of a coin, we would wonder if fair
coins are determined to be fair and if unfair coins are determined to be unfair.
Suppose a coin were fair. Upon flipping n times, would it appear to be fair?
A test of fairness will be based on the number of observed heads in n flips, but it is possible to get any
number of heads from 0 to n. We are hence resigned to the fact that there will always be some chance the
coin incorrectly appears biased. Hopefully, such incorrect outcomes are improbable. Assume a fair coin is
flipped n times. What is the probability that the method above determines the coin is fair?
The coin is determined fair if exactly n/2 flips give heads. Assuming n is even, the probability of n/2
heads in n flips is $\binom{n}{n/2} \frac{1}{2^n}$, as there are $\binom{n}{n/2}$ ways of getting n/2 heads in n flips and there are $2^n$
outcomes of n flips.
The following table shows the probability that a fair coin is determined to be fair after n flips.
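n        P(fair coin is determined to be fair)
2        $\frac{1}{2}$
10       $\binom{10}{5} \frac{1}{2^{10}} \approx 0.25$
100      $\binom{100}{50} \frac{1}{2^{100}} \approx 0.08$
1000     $\binom{1000}{500} \frac{1}{2^{1000}} \approx 0.03$
10000    $\binom{10000}{5000} \frac{1}{2^{10000}} \approx 0.008$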
None of these results are acceptable. If we flip twice, we will wrongly conclude the coin is biased half
of the time. If we flip more, we wrongly conclude bias even more often.
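The entries of the table can be reproduced exactly from the counting formula. The sketch below relies on Python's arbitrary-precision integers, so no approximation is needed even for n = 10000; `prob_declared_fair` is a name chosen here for illustration.

```python
from math import comb

def prob_declared_fair(n):
    """P(exactly n/2 heads in n fair flips) for even n: C(n, n/2) / 2^n."""
    return comb(n, n // 2) / 2**n

for n in (2, 10, 100, 1000, 10000):
    p = prob_declared_fair(n)
    print(f"n = {n:5d}: declared fair {p:.4f}, wrongly declared biased {1 - p:.4f}")
```

The second column printed for each n is the chance that the naive test wrongly calls a fair coin biased.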
Precisely, the fault with the proposed test is that P(determine coin is biased | coin is fair) is too big. This
is to say
P(reject Ho | Ho ) is too big.
We read the above notation as “the probability of rejecting Ho given that Ho is true.” Denote
α = P (reject Ho | Ho )
and call α the significance level of the test. Alternatively, we could call it the probability of a false positive.
We would like α to be small. Typically the value α = 0.05 is used. While small values of α are preferred,
it will not be possible, in non-trivial scenarios, to have α = 0.
1.4 A better test
We need a better test for the fairness of a coin. The fact that the significance level of the prior test was too
high means that the circumstance under which we decided in favor of Ho was too rare, assuming Ho is true.
In accordance with our experiment with fair coins, there is some range of values the number of heads
is expected to be in. Only if the number of heads is outside this range can we start doubting the presumption that the coin is fair. Thus we propose the following variation of the hypothesis test above, when n = 10.
Flip a coin 10 times. Let X be the number of times that the coin comes up heads.
• If |X − 5| > ___ then reject Ho .
• Otherwise, accept Ho .
What number should be in place of the underscore above? The test from the last section had 0. This was too
restrictive. Let's try to find the range which would give a test with significance level α = 0.05.
Consider the test above with rejection of Ho if |X − 5| > 2. That is to say, we reject Ho if X = 0, 1, 2,
8, 9, or 10. What is the significance level of the test?
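The probability of getting k heads in n flips of a fair coin is $\binom{n}{k} \frac{1}{2^n}$. We calculate
$$
\begin{aligned}
\alpha &= P(\text{reject } H_o \mid H_o) \\
&= P(X \le 2 \text{ or } X \ge 8 \mid H_o) \\
&= \frac{\binom{10}{0} + \binom{10}{1} + \binom{10}{2} + \binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} \\
&\approx 0.11
\end{aligned}
$$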
The above calculation shows that with this method, the probability of declaring a fair coin to be biased is
greater than one tenth. We want this value to be at most one in twenty.
Consider the test above with rejection if |X − 5| > 3. That is to say, we reject Ho if X = 0, 1, 9, or 10.
What is the significance level of the test?
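$$
\begin{aligned}
\alpha &= P(\text{reject } H_o \mid H_o) \\
&= P(X \le 1 \text{ or } X \ge 9 \mid H_o) \\
&= \frac{\binom{10}{0} + \binom{10}{1} + \binom{10}{9} + \binom{10}{10}}{2^{10}} \\
&\approx 0.02
\end{aligned}
$$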
This significance level meets our requirement that α ≤ 0.05.
We can say that the following test has significance level α ≈ 0.02:
Flip a coin 10 times. Let X be the number of times that the coin comes up heads.
• If |X − 5| > 3 then reject Ho .
• Otherwise, accept Ho .
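The two significance levels worked out in this section can be verified by summing binomial coefficients directly. The sketch below uses exact fractions; the function name `significance_level` and the parameter `c` (the number on the right of the inequality) are notation introduced here, not taken from the notes.

```python
from fractions import Fraction
from math import comb

def significance_level(n, c):
    """Exact alpha for the test 'reject Ho if |X - n/2| > c' (n even, fair coin)."""
    rejected = sum(comb(n, k) for k in range(n + 1) if abs(k - n // 2) > c)
    return Fraction(rejected, 2**n)

for c in range(5):
    alpha = significance_level(10, c)
    print(f"reject if |X - 5| > {c}: alpha = {alpha} (about {float(alpha):.3f})")
```

With c = 2 this gives 7/64, about 0.109, and with c = 3 it gives 11/512, about 0.021, matching the rounded values above. The case c = 0 is the naive test of the previous section.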
1.5 Terminology
In the above section, Ho is called the null hypothesis. It represents the hypothesis we are trying to disprove.
Often, Ho corresponds to where presumption would lie. In the example with a coin flip, we presume the
coin is fair. If there is strong enough evidence, we will doubt this presumption.
HA is called the alternative hypothesis.
X was the number of times we got heads in a certain number of flips. This is a random variable. It is
the outcome of a random experiment. We will usually denote random variables with capital letters.
The above test rejected Ho if |X − 5| > 3. We call the set of all such X the rejection region for the test.
Here the rejection region is {0, 1, 9, 10}. The acceptance region is everything else: {2, 3, 4, 5, 6, 7, 8}.
As mentioned before, the significance level of the test is α = P (reject Ho | Ho ).
If X is in the rejection region of a test with significance level α, we can say that we reject the hypothesis
that the coin is fair with significance level α.
We could also say that if X is in the rejection region, the difference in fraction of heads from 1/2 was
statistically significant. This is to say, if the coin were fair, the probability of a difference at least as
large as that observed would be less than 0.05.
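The terminology can be made concrete with a short program that searches for the narrowest symmetric test meeting a target significance level and lists its rejection region. This is a sketch under the assumption that the test always has the form "reject if |X − n/2| > c" with n even; the helper names are hypothetical.

```python
from math import comb

def alpha_for(n, c):
    """P(|X - n/2| > c) for X heads in n fair flips, compared in integers as |2k - n| > 2c."""
    rejected = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) > 2 * c)
    return rejected / 2**n

def smallest_threshold(n, target=0.05):
    """Smallest integer c for which 'reject if |X - n/2| > c' has alpha <= target."""
    c = 0
    while alpha_for(n, c) > target:
        c += 1  # widening the acceptance region lowers alpha; c = n/2 gives alpha = 0
    return c

def rejection_region(n, c):
    """All head counts k that lead to rejecting Ho."""
    return [k for k in range(n + 1) if abs(2 * k - n) > 2 * c]

c = smallest_threshold(10)
print("threshold:", c)
print("rejection region:", rejection_region(10, c))
print("alpha:", alpha_for(10, c))
```

For n = 10 and a target of 0.05 the search stops at c = 3, which reproduces the rejection region {0, 1, 9, 10} described above.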
1.6 Simulation
Sometimes it might be easier to simulate random experiments than to work out their probabilities analytically. We will
consider a hypothesis test where we flip a coin 1000 times:
Flip a coin 1000 times. Let X be the number of times that the coin comes up heads.
• If |X − 500| > ___ then reject Ho .
• Otherwise, accept Ho .
What number should be in place of the underscore above in order to get a test with a 5% significance level?
We will have a computer simulate 1000 fair coin tosses, record the value of X, and repeat 10000 times.
Below is the histogram of the resulting X’s.
[Figure: histogram of the simulated values of X. Horizontal axis: X (ticks from 440 to 540). Vertical axis: Count (ticks from 50 to 250).]
There were 527 outcomes with |X − 500| > 30 but only 449 outcomes with |X − 500| > 31. Hence,
just over 95% of the outcomes were in the region [469, 531].
We can now give a reasonable but approximate rejection region. The two different colors in the above
histogram show the regions where 95% and 5% of the outcomes lie.
Flip a coin 1000 times. Let X be the number of times that the coin comes up heads.
• If |X − 500| > 31 then reject Ho .
• Otherwise, accept Ho .
As per the simulation above, there is a probability of approximately 449/10000 that the test rejects Ho given
Ho is true, giving α ≈ 0.045.
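A simulation of this kind takes only a few lines. The sketch below is one possible version, not the program used to produce the counts quoted above, so its numbers will differ slightly from 527 and 449; the seed and function names are arbitrary choices made here.

```python
import random

def simulate_head_counts(flips=1000, repeats=10_000, seed=224):
    """Simulate `repeats` experiments of `flips` fair coin tosses; return the head counts."""
    rng = random.Random(seed)  # fixed seed only so that runs are reproducible
    # getrandbits(flips) draws `flips` fair random bits; counting the 1 bits counts heads.
    return [bin(rng.getrandbits(flips)).count("1") for _ in range(repeats)]

def estimated_alpha(counts, center, c):
    """Fraction of simulated experiments with |X - center| > c."""
    return sum(abs(x - center) > c for x in counts) / len(counts)

counts = simulate_head_counts()
for c in (30, 31):
    print(f"|X - 500| > {c}: estimated alpha = {estimated_alpha(counts, 500, c):.4f}")
```

Because the estimate comes from a finite number of simulated experiments, it varies from run to run when the seed is changed.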
Exercise 2. Compare the rejection region from the experiment with 10 flips to that from the experiment with
1000 flips.
References:
Rhine, J. (1940). Extra-Sensory Perception: A Review. Scientific Monthly, 51, 450-459.