Download Comparing Classical and Bayesian Approaches to Hypothesis

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Latitudinal gradients in species diversity wikipedia , lookup

Transcript
Comparing Classical and Bayesian
Approaches to Hypothesis Testing
James O. Berger
Institute of Statistics and Decision Sciences
Duke University
www.stat.duke.edu
Outline
•
•
•
•
The apparent overuse of hypothesis testing
When is point null testing needed?
The misleading nature of P-values
Bayesian and conditional frequentist testing of
plausible hypotheses
• Advantages of Bayesian testing
• Conclusions
I. The apparent overuse of
hypothesis testing
• Tests are often performed when they are
irrelevant.
• Rejection by an irrelevant test is sometimes
viewed as “license” to forget statistics in further
analysis
Prototypical example
Habitat
Type
A
B
C
D
E
F
Rank
1
2
3
4
5
6
Observed
Hypothesis
Usage
3.8
3.6
H0 : "mean usage is
2.8
equal for all habitats"
1.8
Rejected (P<.025)
1.5
0.7
Statistical mistakes in the example
• The hypothesis is not plausible; testing serves no
purpose.
• The observed usage levels are given without
confidence sets.
• The rankings are based only on observed means,
and are given without uncertainties. (For instance,
perhaps Pr (A>B)=0.6 only.)
Prototypical example
Habitat
Type
A
B
C
D
E
F
Rank
1
2
3
4
5
6
Observed
Hypothesis
Usage
3.8
3.6
H0 : "mean usage is
2.8
equal for all habitats"
1.8
Rejected (P<.025)
1.5
0.7
Statistical mistakes in the example
• The hypothesis is not plausible; testing serves no
purpose.
• The observed usage levels are given without
confidence sets.
• The rankings are based only on observed means,
and are given without uncertainties. (For instance,
perhaps Pr (A>B)=0.6 only.)
Prototypical example
Habitat
Type
A
B
C
D
E
F
Rank
1
2
3
4
5
6
Observed
Hypothesis
Usage
3.8
3.6
H0 : "mean usage is
2.8
equal for all habitats"
1.8
Rejected (P<.025)
1.5
0.7
II. When is testing of a point null
hypothesis needed?
Answer: When the hypothesis is plausible, to
some degree.
Note that, while H 0 :  0 is typically not
plausible, it is a good approximation to
H 0 :| |   , as long as  <  (4 n )
(assuming n Gaussian observations with
standard deviation  ).
Examples of hypotheses that are not
realistically plausible
• H0: small mammals are as abundant on livestock grazing
land as on non-grazing land
• H0: survival rates of brood mates are independent
• H0: bird abundance does not depend on the type of forest
habitat they occupy
• H0: cottontail choice of habitat does not depend on the
season
Examples of hypotheses that may be
plausible, to at least some degree:
• H0: Males and females of a species are the same
in terms of characteristic A.
• H0: Proximity to logging roads does not affect
ground nest predation.
• H0: Pollutant A does not affect Species B.
III. For plausible hypotheses, P-values
are misleading as measures of evidence
Example: Experimental drugs D1, D2, D3, . . .
are to be tested.
Each Test: H0: Di has negligible effect
H1: Di is effective
Typical Bayesian Answer: The probability
that H0 is true is 0.06.
Classical Answer (P-value): If H0 were true, the
probability of observing hypothetical data as or
more "extreme" than the actual data is 0.06.
DRUG
D1
D2
D3
D4
D5
D6
P-VALUE 0.41 0.04 0.32 0.94 0.01 0.28
DRUG
D7
D8
D9
D10
D11 D12
P-VALUE 0.11 0.05 0.65 0.009 0.09 0.66
Question: How strongly do we believe that
Drug i has a nonnegligible effect when
(i) the P-value is approximately 0.05?
(ii) the P-value is approximately 0.01?
A Surprising Fact: Suppose it is known
that, apriori, about 50% of the Drugs will
have negligible effect. Then,
(i) of the Drugs for which the P-value  0.05,
at least 25% (and typically over 50%)
will have negligible effect;
(ii) of the Drugs for which the P-value  0.01,
at least 7% (and typically over 15%)
will have negligible effect.
DRUG
D1
D2
D3
D4
D5
D6
P-VALUE 0.41 0.04 0.32 0.94 0.01 0.28
DRUG
D7
D8
D9
D10
D11 D12
P-VALUE 0.11 0.05 0.65 0.009 0.09 0.66
Question: How strongly do we believe that
Drug i has a nonnegligible effect when
(i) the P-value is approximately 0.05?
(ii) the P-value is approximately 0.01?
A Surprising Fact: Suppose it is known
that, apriori, about 50% of the Drugs will
have negligible effect. Then,
(i) of the Drugs for which the P-value  0.05,
at least 25% (and typically over 50%)
will have negligible effect;
(ii) of the Drugs for which the P-value  0.01,
at least 7% (and typically over 15%)
will have negligible effect.
DRUG
D1
D2
D3
D4
D5
D6
P-VALUE 0.41 0.04 0.32 0.94 0.01 0.28
DRUG
D7
D8
D9
D10
D11 D12
P-VALUE 0.11 0.05 0.65 0.009 0.09 0.66
Question: How strongly do we believe that
Drug i has a nonnegligible effect when
(i) the P-value is approximately 0.05?
(ii) the P-value is approximately 0.01?
IV. Bayesian testing of point hypotheses
Data and Model: X has density f ( x| )
Example: X  # of eggs hatched out of n eggs
in a recently polluted area (so f is binomial,
and  is the true proportion that would hatch).
To Test: H 0 :  0 versus H1 :  0
Example: 0 is the historically known proportion
of eggs that hatch in the area
The prior distribution
Let P1 and P2 be the prior probabilities of H1 and H 2 .
(The usual default choice is P1  P2  0.5.)
Under H1 , let  ( ) be the density representing
information concerning the location of  . (The usual
default choice for the binomial problem is  ( )  1.)
Note: There are two schools of Bayesian statistics,
the subjective school, where the prior distribution
reflects real extraneous information, and the objective
school, where the prior is chosen in a default fashion.
Posterior probability that H0 is true,
given the data (from Bayes theorem):
Pr( H 0 | data x ) 

P0 f ( x|0 )
P0 f ( x|0 )  P1
f ( x| ) ( )d

 
{ 
 1
x
0
1   
0
x n
0}
Beta ( x  1, n  x  1)
(for the binomial testing problem)

( 1)
Note: Some prefer to use the Bayes Factor (or
weighted likelihood ratio) of H 0 to H1 ,
B
f ( x|0 )
f ( x| ) ( ) d

 
{ 
0}
likelihood of data under H 0
=
,
" average" likelihood of data under H1
since this does not involve prior probabilties of the H i .
Example: Suppose x=40 eggs hatched out of n=100.
Then Pr( H 0 | data x )  0.52 and B  0.92. (Here a
classical test would yield P  value  0.05.)
Conditional frequentist interpretation of
the posterior probability of H0
Pr( H 0 | data x ) is also the frequentist type I error
probability , conditional on observing data of the
same "strength of evidence" as the actual data x.
(The classical type I error probability makes the
mistake of reporting the error averaged over data
of very different strengths.)
V. Advantages of Bayesian testing
• Pr (H0 | data x) reflects real expected error rates:
P-values do not.
• A default formula exists for all situations:
( 1)
*
*
*


f
(
x
,

)
f
(
x
,

)
f
(
x
,

)
dx
d

0


 ,
Pr( H 0 | data x )  1 
*


f
(
x
,

)
f
(
x
,

)
d

0



where x * is independent (unobserved) data of the smallest
size such that the above integrals exist.
• Posterior probabilities allow for incorporation of
personal opinion, if desired. Indeed, if the
published default posterior probability of H0 is P*,
and your prior probability of H0 is P0, then your
posterior probability of H0 is
( 1)
  1
 1

Pr( H 0 | data x )  1    1  *  1 


P
P


0


*
Example: In the binomial example, recall P  0.52.
A "skeptic" has P0  01
. ; hence Pr( H 0 | data x )  011
. .
A " believer" has P0  0.9; hence Pr( H 0 | data x )  0.91.
• Posterior probabilities are not affected by the
reason for stopping experimentation, and hence
do not require rigid experimental designs (as do
classical testing measures).
• Posterior probabilities can be used for multiple
models or hypotheses.
Example: H 0 : pollutant A has no effect on species B
H1 : pollutant A decreases abundance of species B
H 2 : pollutant A increases abundance of species B
Pr( H 0 | data ) .30, Pr( H1 | data ) .68, Pr( H 2 | data ) .02
An aside: integrating science and
statistics via the Bayesian paradigm
• Any scientific question can be asked (e.g., What is
the probability that switching to management plan
A will increase species abundance by 20% more
than will plan B?)
• Models can be built that simultaneously
incorporate known science and statistics.
• If desired, expert opinion can be built into the
analysis.
Conclusions
• Hypothesis testing is overutilized while (Bayesian)
statistics is underutilized.
• Hypothesis testing is needed only when testing a
“plausible” hypothesis (and this may be a rare
occurrence in wildlife studies).
• The Bayesian approach to hypothesis testing has
considerable advantages in terms of interpretability
(actual error rates), general applicability, and
flexible experimentation.