AMS 5
HYPOTHESIS TESTING
Hypothesis Testing
Was it due to chance, or something else?
Decide between two hypotheses that are
mutually exclusive on the basis of
evidence from observations.
Test of Significance
• A certain brand of tobacco has an average nicotine content of
2.5 milligrams, while another one has an average of only 1.5
milligrams. A cigarette manufacturer receives an unlabeled
shipment of tobacco and needs to determine the nicotine
content using a sample of the tobacco.
• A certain type of vaccine is known to be only 25% effective
over a period of 2 years. A new type of vaccine is being tested
on 2000 people chosen at random. We want to test if this new
vaccine is more effective than the original one.
Test of Significance
• A machine for filling bottles of soda has to put 333 ml of
liquid in each bottle. If the average amount is too low or too
high with respect to the expected content then the machine is
considered to be out of control. The machine is regularly
inspected to check whether it is out of control by taking a
sample of bottles.
• A balance designed for precision weighing in a lab has to be
kept very well calibrated. To test the calibration of the balance
several measurements of the weight of an object are made. If
they differ too much then the balance has to be re-calibrated.
In all these examples a decision has to be taken based on
the numbers in a sample. These numbers are subject to
uncertainty and we have to decide if the differences that
we observe are only due to chance or not.
Null and Alternative Hypotheses
A bill is proposed to simplify the tax code. The proposer claims
that the bill is `revenue-neutral', that is, it will not lower tax
revenues. A simulation is run using 100 tax returns chosen at
random and the differences between the tax paid using the old
rules and those that would be paid using the new rules are
recorded. The average difference comes out to be -$219 with a
standard deviation of $725. Can we claim that the new rules are
revenue-neutral?
We can put the problem in terms of two hypotheses:
NULL HYPOTHESIS, H0
and
ALTERNATIVE HYPOTHESIS, H1
Under the null hypothesis there is no difference in revenue
and the fact that the observed value is not 0 is totally due
to chance. Under the alternative hypothesis the difference
is real.
Null and Alternative Hypotheses
For the examples that we considered at the beginning we have
that:
• Cigarette: H0: the mean nicotine content is 1.5. H1: the
mean nicotine content is 2.5
• Vaccine: H0: the proportion is 25%. H1: the proportion is
higher than 25%.
• Soda Bottles: H0: the average amount of liquid is 333 ml.
H1: the average amount of liquid is not equal to 333 ml.
• Balance: H0: the device is calibrated. H1: the device is not
properly calibrated.
Test Statistics
How do we test the null hypothesis against the alternative?
Back to the tax example. Suppose the null is true. Then the
difference should be $0. How `large' is -$219 with respect to $0?
To answer this question we convert to standard units. Given that
the sample was of size 100, the SE of the average is approximately
$725 / √100 ≈ $72:
z = (−$219 − $0) / $72 ≈ −3
so the difference between the observed value and the value under
the null is −3 standard units.
The area under the normal curve to the left of −3 is about 0.001, that
is, one chance in 1,000. So, under the null hypothesis, −$219 is a
very unlikely value.
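The arithmetic of the tax example can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are illustrative and the SE is kept unrounded ($72.50), so z comes out very slightly different from the rounded hand calculation.

```python
from math import sqrt
from statistics import NormalDist

n = 100            # number of tax returns in the sample
observed = -219.0  # average difference, in dollars
expected = 0.0     # value of the average under the null hypothesis
sd = 725.0         # SD of the differences, in dollars

se = sd / sqrt(n)                # SE of the average
z = (observed - expected) / se   # difference in standard units
p_left = NormalDist().cdf(z)     # area under the normal curve to the left of z

print(f"SE = {se:.1f}, z = {z:.2f}, left-tail area = {p_left:.4f}")
```

The left-tail area comes out at roughly one in a thousand, in line with the value read off the normal curve.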
Test Statistics
In general we are calculating a test statistic given by
z = (observed − expected) / SE
which is referred to as the z-test.
Once the z-test is calculated we have to decide whether its value
is `large' or `small'. In the tax example we look at the probability
of the left tail of the normal curve, below the z-test. If this
probability is small then the value of the z-test is far from the
center of the distribution.
This probability is called the observed significance level, or P-value.
Test Statistics
The smaller the P-value, the stronger the evidence against
the null, but
Making a test of significance
To make a test of significance you need to:
• set up the null hypothesis
• pick a test statistic to measure the difference between the data
and what is expected under the null hypothesis
• compute the test statistic and the corresponding observed
significance level.
A small observed significance level implies that the data are far
from the values expected from the model under the null
hypothesis.
What is a small observed significance level?
This is somewhat arbitrary, but it is usually considered that if P is
less than 5% the results are significant. If P is less than 1% the
results are highly significant.
Examples
1. A random sample of 85 8th graders has a mean score of 265
with an SD of 55 on a national math test. A State Administrator
claims that the mean score of 8th graders on the examination
is above 260. Is there enough evidence to support the
administrator's claim? The hypotheses are:
H0: mean ≤ 260 vs H1: mean > 260
The test statistic is obtained by changing to standard units:
z = (265 − 260) / (55 / √85) ≈ 0.838
The probability that a standard normal is above 0.838 is about
20%. This is a rather large P-value, so there does not seem to
be enough evidence to reject H0.
Examples
2. A light bulb manufacturer guarantees that the mean life of the
bulbs is at least 750 hours. A random sample of 36 light bulbs
has a mean of 725 hours and a standard deviation of 60 hours.
Is there enough evidence to reject the manufacturer's claim?
The hypotheses are:
H0: mean ≥ 750 vs H1: mean < 750
The test statistic is obtained by changing to standard units:
z = (725 − 750) / (60 / √36) = −25 / 10 = −2.5
The probability that a standard normal is below −2.5 is about
0.6%. This is a very small P-value, so there is strong evidence
to reject the manufacturer's claim.
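Both examples follow the same recipe, which can be sketched as a small helper. The function name `one_sample_z` and its `tail` argument are illustrative choices, not something defined in the course. Recomputing from the stated figures, the SE in the light bulb example is 60 / √36 = 10, so z = −2.5.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, null_mean, sd, n, tail):
    """Return (z, P) for a z-test on an average.

    tail="right" gives the area above z, tail="left" the area below z.
    """
    se = sd / sqrt(n)                    # SE of the average
    z = (sample_mean - null_mean) / se   # standard units
    below = NormalDist().cdf(z)
    p = below if tail == "left" else 1 - below
    return z, p

# Example 1: math scores, H1 says the mean is above 260
z1, p1 = one_sample_z(265, 260, 55, 85, tail="right")
# Example 2: light bulbs, H1 says the mean is below 750
z2, p2 = one_sample_z(725, 750, 60, 36, tail="left")

print(f"math scores: z = {z1:.3f}, P = {p1:.3f}")
print(f"light bulbs: z = {z2:.2f}, P = {p2:.4f}")
```

The first P-value is large (about one chance in five), the second is well below 1%.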
Binary boxes
Consider again the problem of testing the new vaccine. This is a
binary model since we can classify the population in two groups:
the group of people for which the vaccine was effective and that
for which the vaccine was not.
Under the null hypothesis the box model that generates the
sample consists of a box with one ticket marked 1 (effective) for
every three tickets marked 0 (not effective),
since there is a 25% chance that the vaccine is effective.
Suppose that the number of people in the sample (of 2000 people)
for which the vaccine was effective is 534. According to the null
hypothesis the expected number would be 500. Is the difference of
34 people large enough to reject the null hypothesis and claim
that the new vaccine is more effective?
Binary boxes
We need to calculate the standard units of the difference between
500 and 534. Under the null hypothesis the SD of the box is
SD = √(0.25 × 0.75) ≈ 0.43
so the standard error is SE = √2000 × 0.43 ≈ 19.23. Then:
z = (534 − 500) / 19.23 ≈ 1.768
The observed significance level is given by the area under the normal
curve corresponding to the interval above 1.768. This is around 4%,
which is small enough to conclude that the difference is
statistically significant. So there is evidence to support the claim
that the new vaccine is more effective than the standard.
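As a check, the vaccine calculation can be sketched in Python. The variable names are illustrative. The SD of the box is not rounded to 0.43 here, so the SE and z differ slightly from the hand calculation, but the P-value is still about 4%.

```python
from math import sqrt
from statistics import NormalDist

n = 2000        # people in the sample
p0 = 0.25       # chance of "effective" under the null hypothesis
observed = 534  # number for whom the vaccine was effective

expected = n * p0                  # expected count under the null: 500
sd_box = sqrt(p0 * (1 - p0))       # SD of a 0-1 box with 25% ones
se = sqrt(n) * sd_box              # SE of the count (sum of the draws)
z = (observed - expected) / se     # standard units
p_right = 1 - NormalDist().cdf(z)  # area above z: H1 says "more effective"

print(f"SE = {se:.2f}, z = {z:.3f}, P = {p_right:.3f}")
```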
The t-test
The examples that we have seen so far rely on the fact that the
sample size is large. So, even when the SD of the box is
unknown, we can still use the normal curve to obtain the
observed significance level of the test.
This is not the case when the sample size is small. In this case we
need a modification of the z-test due to `Student', a pen name for
a statistician called Gosset.
Consider the following example. To check the calibration of an
instrument, five measurements of the concentration of carbon
monoxide (CO) are taken from a gas sample where the
concentration is precisely controlled to be 70 parts per million (ppm):
78 83 68 72 88
The t-test
The null hypothesis is that the device is calibrated and so the
average of the measurements is 70 ppm. The average of the
sample is 77.8 ppm, the SD is 7.22 ppm and thus the SE of the
average is 7.22 / √5 ≈ 3.23. The z-test can be obtained as
z = (77.8 − 70) / 3.23 ≈ 2.4
To determine the observed significance level we calculate the area
to the right of 2.4 under the normal curve. This is less than 1%,
which looks like strong evidence against the null hypothesis.
Unfortunately we have to remember that the SD that we have
calculated is NOT the SD of the box. It is the SD of the sample,
whose size is fairly small, and thus the approximation is not very
precise. We correct the procedure with the following steps.
The t-test
Step 1: Consider a different estimate of the SD,
SD+ = √(number of measurements / (number of measurements − 1)) × SD
Notice that SD+ > SD.
In our previous example we get SD+ = √(5/4) × 7.22 ≈ 8.07, so the SE
of the average becomes 8.07 / √5 ≈ 3.61, as opposed to 3.23,
reflecting a higher level of uncertainty.
Then the test statistic becomes
t = (77.8 − 70) / 3.61 ≈ 2.2
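Step 1 can be reproduced from the five raw measurements. This is a sketch with the standard `statistics` module: `pstdev` divides by the number of measurements (the SD of the sample), while `stdev` divides by the number of measurements minus 1, which is the same quantity as SD+.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

data = [78, 83, 68, 72, 88]  # CO measurements, in ppm
null_value = 70              # concentration if the device is calibrated

n = len(data)
avg = mean(data)
sd = pstdev(data)                 # SD of the sample (divides by n)
sd_plus = sqrt(n / (n - 1)) * sd  # corrected estimate SD+
se = sd_plus / sqrt(n)            # SE of the average, based on SD+
t = (avg - null_value) / se       # the t statistic

print(f"average = {avg}, SD = {sd:.2f}, SD+ = {sd_plus:.2f}")
print(f"SE = {se:.2f}, t = {t:.2f}")
print(f"stdev(data) = {stdev(data):.2f}  (same as SD+)")
```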
The t-test
Step 2: To find the observed significance level we cannot use the
normal curve any more. We need to use a Student's t curve. This
curve depends on the degrees of freedom (DF). These are
calculated as
degrees of freedom = number of measurements - 1
A table for the Student's t curves is found at the end of the book.
There is one curve for each value of the DF. Each row
corresponds to one curve. The probabilities that are reported
correspond to the right hand tail, as opposed to what was
reported for the normal curve. These curves are symmetric
around 0 and for DF above 25 they resemble the normal curve
very closely. Thus, in our example, we need a Student's t curve
with 4 DF. The value 2.2 is not present in the table for the row
corresponding to 4 DF. The closest value is 2.13, which
corresponds to 5%. So the P-value for this test is about 5%,
which is much weaker evidence against the null than before.
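Instead of a printed table, the right-hand tail of a Student's t curve can be approximated numerically. The sketch below is one possible way to do it, not the method used in the course: it integrates the t density with Simpson's rule, and the function names `t_density` and `t_right_tail` are hypothetical.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist

def t_density(x, df):
    """Height of the Student's t curve with df degrees of freedom at x."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_right_tail(t, df, steps=2000):
    """Area to the right of t under the t curve.

    The curve is symmetric around 0, so the tail is 0.5 minus the
    area between 0 and |t|, found with Simpson's rule (steps is even).
    """
    a = abs(t)
    h = a / steps
    total = t_density(0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area = total * h / 3
    return 0.5 - area if t >= 0 else 0.5 + area

print(f"4 DF, t = 2.13: tail = {t_right_tail(2.13, 4):.3f}")
print(f"4 DF, t = 2.16: tail = {t_right_tail(2.16, 4):.3f}")
print(f"normal curve, z = 2.16: tail = {1 - NormalDist().cdf(2.16):.3f}")
```

With 4 DF the tail beyond 2.13 is about 5%, as in the table, while the normal curve gives a much smaller area for the same value.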
The t-test
Suppose now that 6 measurements are taken with the device:
72 79 65 84 67 77
The average is equal to 74 ppm and the SD is 6.68 ppm. The
corrected SD is SD+ = √(6/5) × 6.68 ≈ 7.32. The SE of the average
is 7.32 / √6 ≈ 2.99. So the t-test is
t = (74 − 70) / 2.99 ≈ 1.34
This time the DF are 5 and if we look at the table we find that the
probability corresponding to 1.48 is 10%. Since 1.34 is smaller
than 1.48 we have that the P-value is larger than 10%. This is not
enough evidence against the null. So the device can be
considered to be well calibrated.
Examples
1. An environmentalist estimates that the mean waste recycled by
adults in the US is more than 1 pound per person per day. You
take a sample of 12 adults and find that the mean waste recycled
per person per day is 1.2 pounds with a standard deviation of
0.3 pounds. Can you support the environmentalist's claim?
The hypotheses are:
H0: mean ≤ 1 vs H1: mean > 1
The corrected value of the SD is √(12/11) × 0.3 ≈ 0.313 and the test
statistic is obtained by changing to standard units:
t = (1.2 − 1) / (0.313 / √12) ≈ 2.21
The probability that a Student's t with 11 degrees of freedom will
be above 2.21 is about 2.5%. This is a rather small P-value, so
there seems to be enough evidence to reject H0.
Examples
2. A microwave oven repairer says that the mean repair cost for
damaged microwave ovens is less than $100. You find that a random
sample of 5 ovens has a mean repair cost of $75 with an SD of
$12.5. Do you have enough evidence to support the repairer's
claim?
H0: mean ≥ 100 vs H1: mean < 100
The corrected value of the SD is √(5/4) × 12.5 ≈ 13.98 and the test
statistic is obtained by changing to standard units:
t = (75 − 100) / (13.98 / √5) ≈ −4.00
The probability that a Student's t with 4 degrees of freedom will
be below −4.00 is less than 1%. This is a rather small P-value,
so there seems to be enough evidence to reject H0.
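When only the sample size, mean and SD are given, the t statistic can be computed directly from those summaries. The helper below is a sketch (the name `t_from_summary` is illustrative) and carries no intermediate rounding, so for the stated numbers it gives t ≈ 2.21 for the recycling example and t = −4.00 for the repair-cost example. Note that SD+ / √n is the same as SD / √(n − 1).

```python
from math import sqrt

def t_from_summary(sample_mean, null_mean, sd, n):
    """Return (t, degrees of freedom) from summary statistics.

    sd is the SD of the sample (computed with n in the denominator).
    """
    sd_plus = sqrt(n / (n - 1)) * sd  # corrected SD
    se = sd_plus / sqrt(n)            # equals sd / sqrt(n - 1)
    return (sample_mean - null_mean) / se, n - 1

t1, df1 = t_from_summary(1.2, 1.0, 0.3, 12)  # recycled waste, H1: mean > 1
t2, df2 = t_from_summary(75, 100, 12.5, 5)   # repair cost, H1: mean < 100

print(f"recycling: t = {t1:.2f} with {df1} DF (right tail)")
print(f"repairs:   t = {t2:.2f} with {df2} DF (left tail)")
```

The sign of t matters: a right-tail alternative looks at the area above t, a left-tail alternative at the area below it.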
Was the result significant?
How small does the P-value have to get before you
reject the null hypothesis?
If P-value < 5%: statistically significant.
If P-value < 1%: highly significant.
Taken literally, this rule treats a P-value of 5.1% as totally
different from a P-value of 4.9%, although the two carry almost
the same evidence.
Investigators should summarize the data, say
what test was used, and report the P-value
instead of just comparing it to 5% or 1%.
Data Snooping
“A result which is statistically significant cannot
be explained as chance variation.” This is false.
Even if the null hypothesis is right, there is a 5%
chance of getting a difference which the test will
call “statistically significant”!
Therefore an investigator who makes 100 tests
when the null hypothesis is right can expect to get
5 results which are “statistically significant” due to chance!!!
To make matters worse, investigators often
decide which hypotheses to test only after they have
seen the data. This is called data snooping.
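The 5% false-alarm rate can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the box is a standard normal (mean 0, SD 1, with the SD treated as known), each test uses 25 draws, and 2000 tests are run instead of 100 so that the proportion is more stable.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(0)  # arbitrary seed, only so that reruns give the same numbers

def null_p_value(n=25):
    """P-value of one z-test on data for which the null is true."""
    sample = [random.gauss(0, 1) for _ in range(n)]  # box: mean 0, SD 1
    z = (mean(sample) - 0) / (1 / sqrt(n))           # SE uses the known SD of 1
    return 1 - NormalDist().cdf(z)                   # right-tail P-value

tests = 2000
significant = sum(null_p_value() < 0.05 for _ in range(tests))
rate = significant / tests

print(f"{significant} of {tests} tests were 'significant' at 5% ({rate:.1%})")
```

Every one of these "significant" results is due to chance alone, since the null hypothesis holds in every simulated test.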
Practical significance
Statistical significance and practical significance
are two different ideas!!!
The P-value of a test depends on the sample
size. With a large sample, even a small
difference can be “statistically significant”, that
is, hard to explain by the luck of the draw. This
doesn’t necessarily make it important.
Conversely, an important difference may not be
statistically significant if the sample is too small.
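The dependence of the P-value on the sample size can be seen by holding a difference fixed and letting the sample grow. The numbers below are hypothetical, chosen only for illustration: an observed difference of 1 unit from the null value and an SD of 50 units.

```python
from math import sqrt
from statistics import NormalDist

def right_tail_p(difference, sd, n):
    """Right-tail P-value of a z-test for a given observed difference."""
    z = difference / (sd / sqrt(n))  # the SE shrinks as n grows
    return 1 - NormalDist().cdf(z)

# Hypothetical numbers: the same small difference at three sample sizes
for n in (100, 10_000, 1_000_000):
    p = right_tail_p(difference=1, sd=50, n=n)
    print(f"n = {n:>9,}: P = {p:.4f}")
```

The difference of 1 unit is the same in all three cases. Only the sample size changes, and with it the verdict of the test.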