Applied Data Analysis
Spring 2017
[Photo: Karen, age 7]
Karen Albert
[email protected]
Thursdays, 4-5 PM (Hark 302)
Lecture outline
1. One and two-tailed tests
2. Types of errors
The alternative hypothesis
When the null hypothesis is
H0 : µ = 10
we have three choices for the alternative hypothesis:
H1 : µ > 10
H1 : µ < 10
H1 : µ ≠ 10
One v. two-tailed
H1 : µ > 10
H1 : µ < 10
A “greater than” or “less than” test is a one-tailed test.
H1 : µ ≠ 10
A “not equal to” test is a two-tailed test.
The difference affects the p-value.
The p-value
Remember that the p-value is the probability of seeing a result
as extreme or more extreme than the observed result given that
the null hypothesis is true.
Let’s say that we have calculated the test statistic and the
z-score is -2. The p-value depends on whether the test is
one-tailed or two-tailed.
The “less than” test
When the alternative hypothesis is
H1 : µ < 10
it means that if the null hypothesis of chance is wrong, the true
mean is smaller than the value given by the null.
In this case, the p-value is on the left-hand side of the curve.
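For example, in R the area to the left of a z-score of -2 is
pnorm(-2)
## [1] 0.02275013
so the one-tailed p-value here is about 0.023.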
The “greater than” test
When the alternative hypothesis is
H1 : µ > 10
it means that if the null hypothesis of chance is wrong, the true
mean is larger than the value given by the null.
This time, assume that the test statistic is 2. The p-value is on
the right-hand side of the curve.
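In R, the area to the right of 2 is
pnorm(2,lower.tail=FALSE)
## [1] 0.02275013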
The “not equal to” test
When the alternative hypothesis is
H1 : µ 6= 10
it means that if the null hypothesis of chance is wrong, the true
mean is either larger or smaller than the value given by the null.
In cases such as these, we have to remember to multiply the
one-tailed p-value by 2.
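With the z-score of -2 from before, the two-tailed p-value in R is
simply double the one-tailed area:
2*pnorm(-2)
## [1] 0.04550026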
How can you tell which alternative?
The problem tells us...
Suppose a test has been given to all high school students in a
certain state. The mean test score of the entire state is 70, with
a S.D. of 10. Members of the school board suspect that female
students have a higher mean score on the test than male
students. The mean score of a random sample of 64 female
students is 73. Does this provide strong evidence that the overall mean for
female students is higher?
Answer
H0 : µ = 70
H1 : µ > 70
pnorm(73,70,10/sqrt(64),lower.tail=FALSE)
## [1] 0.008197536
The p-value is 0.008.
The p-value is small so we reject the null hypothesis.
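Equivalently, we could standardize first and pass the z-score to
pnorm (same numbers, same answer):
z <- (73-70)/(10/sqrt(64))
z
## [1] 2.4
pnorm(z,lower.tail=FALSE)
## [1] 0.008197536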
Answer if the alt. had been “not equal to.”
Members of the school board suspect that female students
have a different mean score on the test than male students.
Does this provide strong evidence that the overall mean for
female students is different from the male students?
The p-value would have been 0.008 × 2 or 0.016.
Note that this p-value is still small so we would still reject the
null hypothesis.
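In R this is just the earlier call doubled:
2*pnorm(73,70,10/sqrt(64),lower.tail=FALSE)
## [1] 0.01639507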
Two-tailed tests and CIs
For z’s and t’s, two-tailed tests and confidence intervals are
equivalent.
If the value of the null hypothesis falls outside the confidence
interval, we decide to reject the null hypothesis.
If the null falls into the confidence interval, we fail to reject the
null.
Why don’t we accept the null? Think of all the values contained
in the confidence interval and remember that the null
hypothesis is only about 1 value.
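As a quick sketch of the equivalence, the 95% confidence interval
for the test-score example (sample mean 73, SE 10/sqrt(64)) is
round(73+c(-1,1)*qnorm(0.975)*10/sqrt(64),2)
## [1] 70.55 75.45
which does not contain the null value of 70, so the two-tailed test
at the 5% level rejects.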
Types of errors
If we reject the null hypothesis when it is true, we make an error.
If we fail to reject the null hypothesis when it is false, we make
an error.
                             World
                      H0                H1
Decision   H0         no error          type II error
           H1         type I error      no error
Error probabilities
                             World
                      H0                H1
Decision   H0         -                 β
           H1         α                 -
• α is the probability of a type I error—the probability of
rejecting the null hypothesis when it is true.
• It’s a number we agree upon as a community.
• If the p-value is less than α, we decide to reject the null
hypothesis.
The probability of a type I error
α is the long-run probability of rejecting the null hypothesis
when it is true.
If we want to protect against a type I error, why not set α to a
really low number?
Look at the picture that I am drawing on the board.
As α decreases, β (the probability of a type II error) increases.
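A rough numerical sketch of the tradeoff, reusing the test-score
example and assuming (hypothetically) that the true mean is 73:
alpha <- c(0.05,0.01)
se <- 10/sqrt(64)
crit <- 70+qnorm(1-alpha)*se   # cutoffs for rejecting H0 in a one-tailed test
pnorm(crit,73,se)              # beta: chance of failing to reject when the true mean is 73
Lowering α from 5% to 1% pushes the cutoff up, and β grows from
roughly one in four to nearly one in two.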
Limitations of testing 1
There is nothing special about 5% or 1%.
If our α level is 5%, what is the difference between a p-value of
4.9% and a p-value of 5.1%?
One is statistically significant, and one is not.
But does that make sense?
Always report the p-value, not just the conclusion.
Limitations of testing 2
Data snooping
What does a significance level of 5% mean?
There is a 5% chance of rejecting the null hypothesis when it is
true.
If our significance level is 5%, how many results would be
“statistically significant” just by chance if we ran 100 tests?
We would expect 5 to be “statistically significant,” and 1 to be
“highly significant.”
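A small simulation sketch (made-up data, all generated with a true
mean of 0, so the null is true in every test):
pvals <- replicate(100,t.test(rnorm(25))$p.value)   # 100 tests where the null holds
sum(pvals<0.05)
On a typical run, around 5 of the 100 p-values fall below 0.05
purely by chance.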
Limitations of testing 2.1
So what can we do?
1. Always state how many tests were run before statistically
significant results turned up.
2. Always test your conclusions on an independent set of
data, if possible.
Limitations of testing 3
Was the result important?
If we increase the sample size, what happens to:
• the standard error?
• the test statistic?
• the p-value?
A statistically significant difference may not be important, and
an important difference may not be statistically significant.
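A sketch with made-up numbers: suppose the true difference is only
0.2 points (a trivially small difference) and the SD is 10.
pnorm(0.2/(10/sqrt(100)),lower.tail=FALSE)     # n = 100
## [1] 0.4207403
pnorm(0.2/(10/sqrt(40000)),lower.tail=FALSE)   # n = 40,000
## [1] 3.167124e-05
The same tiny difference is nowhere near significant with n = 100
but “highly significant” with n = 40,000.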
Limitations of testing 4
The role of the model.
Significance tests only make sense when we can talk about
them in the context of a box model.
Give me two examples of when we would not have a box model.
Limitations of testing 4.1
Two possibilities:
• We have the entire population.
There is no such thing as sampling variability in this case.
All data is subject to many small errors, but these are not
like draws from a box.
• We do not have a probability sample.
With a sample of convenience, the concept of chance is
hard to define, the phrase “the difference is due to chance”
is hard to interpret, and p-values are nearly meaningless.
Limitations of testing 5
Does the difference prove the point?
Consider an ESP experiment in which a die is rolled, and the
subject tries to make it land showing 6 pips. This is repeated
720 times, and the die lands 6 in 143 of these trials.
If the die is fair and the subject does not have ESP, we would
expect 720*1/6=120 sixes.
The observed difference is 143-120=23.
Limitations of testing 5.1
The standard error is
se <- sqrt(720)*sqrt((1/6)*(5/6))
se
## [1] 10
and the p-value is
pnorm(143,120,se,lower.tail=FALSE)
## [1] 0.01072411
So is ESP real, or are there alternative explanations for these
findings?
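As a cross-check, base R's exact binomial test gives a similarly
small p-value:
binom.test(143,720,p=1/6,alternative="greater")
Either way, the chance of a fair die landing 6 this often is on the
order of 1 in 100.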
Limitations of testing 5.2
The die isn’t fair!
The take-home message:
Significance tests do not check the design of the study.
What did we learn?
• The difference between one-tailed and two-tailed tests.
• Types of errors and their probabilities.
• The limitations of hypothesis testing.