Introduction to Frequentist and Bayesian Approaches
Statistical Modeling and Data Analysis
Given a data set, the first question a statistician asks is,
"What is the statistical model for this data?"
We then characterize and analyze the parameters of the model
with an objective in mind.
• Example: SBP of cancer patients vs. normal patients
Cancer: 145, 165, 134, 120, 112, 156, 145, 133, 135, 120
Normal: 138, 120, 112, 110, 128, 134, 128, 109, 138, 140
Objective: Do cancer patients have higher SBP than normal patients?
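As a concrete illustration, here is a minimal sketch in Python (SciPy assumed). The slides do not prescribe a particular test; a one-sided two-sample t-test is one standard choice for this comparison.

```python
# A minimal sketch: compare mean SBP of cancer vs. normal patients
# using a one-sided two-sample t-test (one standard choice; the slides
# do not prescribe a specific test).
from scipy import stats

cancer = [145, 165, 134, 120, 112, 156, 145, 133, 135, 120]
normal = [138, 120, 112, 110, 128, 134, 128, 109, 138, 140]

# H0: mu_cancer = mu_normal  vs.  Ha: mu_cancer > mu_normal
t_stat, p_value = stats.ttest_ind(cancer, normal, alternative="greater")
print(f"t = {t_stat:.3f}, one-sided p-value = {p_value:.3f}")
```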
Population of cancer patients with a probability distribution;
population of normal patients with a probability distribution.

[Figure: two density curves over systolic blood pressure, labeled "normal" (mean $\mu_1$) and "cancer" (mean $\mu_2$).]

Objective is to test the hypothesis $\mu_2 > \mu_1$.
Does the data support this hypothesis?
Assumption: The data are random and are generated from normal distributions.
• Random Variable $X$: $X : S \to \mathbb{R}$, where $S$ is the collection of all subjects. What we observe is one realization $X(s)$.
• Random Sample: $\{X_1, X_2, \ldots, X_n\}$. We collect a sample of subjects $\{s_1, s_2, \ldots, s_n\}$.
Observed Sample: $\{X(s_1), X(s_2), \ldots, X(s_n)\}$
Assumption: $\{s_1, s_2, \ldots, s_n\}$ is a simple random sample
(equally likely as any other sample).
• Multivariate Observations
$\boldsymbol{X} = (X_1, X_2, \ldots, X_k)^\top : S \to \mathbb{R}^k$
An observed vector is one realization of this, i.e., $\boldsymbol{X}(s)$.
Random Sample: {๐‘ฟ1 , ๐‘ฟ2 , โ€ฆ , ๐‘ฟ๐‘› }
Observed sample is a realization of
{๐‘ฟ ๐‘ 1 , ๐‘ฟ ๐‘ 2 , โ€ฆ , ๐‘ฟ ๐‘ ๐‘› }
Note: If the simultaneous inference is to made on its
components, the probability statement should be
viewed in terms of probability of observing
{๐‘ 1 , ๐‘ 2 , โ€ฆ , ๐‘ ๐‘› }
5
Stochastic Process
$\{X(t),\ 0 \le t < \infty\}$
The observed value of this is one realization $\{X(t, s),\ 0 \le t < \infty\}$.
Can we describe a probability distribution of $\{X(t),\ 0 \le t < \infty\}$?
The Kolmogorov consistency theorem says that such a probability distribution can be described.
[Figure: three realizations of the process, each with $X(0) = 0$.]
Discrete time points
$\{X(t),\ t = \ldots, -2, -1, 0, 1, 2, \ldots\}$
If this process is stationary, then a probability model for $X(t)$ can be described in a concise way. For example,
$X(t) = \rho X(t-1) + \epsilon(t) = \sum_{k=0}^{\infty} \rho^k \epsilon(t-k)$,
where $\{\epsilon(t)\}$ is white noise.
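A minimal simulation sketch of this AR(1) model (assuming Gaussian white noise and $\rho = 0.7$; both are illustrative choices, not from the slides):

```python
# Simulate the stationary AR(1) process X(t) = rho * X(t-1) + eps(t).
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.7, 500                    # |rho| < 1 ensures stationarity
eps = rng.normal(0.0, 1.0, size=n)   # Gaussian white noise (assumed)
x = np.zeros(n)
for t in range(1, n):
    x[t] = rho * x[t - 1] + eps[t]

# For large t, x[t] behaves like the moving average sum_k rho^k * eps(t-k),
# whose stationary variance is 1 / (1 - rho^2) for unit-variance noise.
print(x.var(), 1.0 / (1 - rho**2))
```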
Image Process:
$\{X(p),\ p \in Q\}$, where $Q$ is the set of all pixels.
Note that what we observe is a realization of this: $\{X(p, s),\ p \in Q\}$.
The same can be said about a weather map.
Data Analysis
Generally speaking, we perform one or more of the following tasks in data analysis (statistical inference):
• Estimate the model
• Hypothesis testing
• Predictive analysis
Given the sample data, the objective is to make inferences about the population described by the probability model. All inferences are based on the assumed probability model.
Estimation
$\hat{\theta}$ denotes an estimate of $\theta$.
Think of estimating any parameter of a probability model. For example, estimating $\beta_0$ and $\beta_1$ of a regression model
$y = \beta_0 + \beta_1 x + \epsilon$
How good is the estimate $\hat{\theta}$? Well, you might say that if $\hat{\theta} \cong \theta$, it is a good estimate. Not so simple! Note that $\theta$ is unknown.
Frequentist's Interpretation
Note that $\hat{\theta}$ depends on the sample we observe.

[Table: each repeatedly observed sample yields values of two competing estimators $\hat{\theta}$ and $\tilde{\theta}$, together with their errors $|\hat{\theta} - \theta|$ and $|\tilde{\theta} - \theta|$.]

$\hat{\theta}$ is better than $\tilde{\theta}$ if the average of $(\hat{\theta} - \theta)^2$ is smaller than the average of $(\tilde{\theta} - \theta)^2$, i.e.,
$E(\hat{\theta} - \theta)^2 < E(\tilde{\theta} - \theta)^2$ for all $\theta$.
$\hat{\theta}$ is better than $\tilde{\theta}$ if
$E(\hat{\theta} - \theta)^2 < E(\tilde{\theta} - \theta)^2$ for all $\theta$.
A best estimate, in this sense, is of course not possible: if $\hat{\theta}_0 \equiv \theta_0$ irrespective of the observed sample, then
$E(\hat{\theta}_0 - \theta)^2 = 0$ for $\theta = \theta_0$,
so no estimator can beat this constant estimator at $\theta = \theta_0$.
We restrict to a class of estimators, and then try to find the best estimate within this class. For example, we may consider the class of all unbiased estimators.
Theories are well developed for achieving best estimates among the class of unbiased estimates for simple probability models. For complicated models, we can always fall back on maximum likelihood estimates.
Obtain the estimate by maximizing the likelihood function
$L(\theta \mid x_1, x_2, \ldots, x_n) = \Pr(x_1, x_2, \ldots, x_n \mid \theta)$
For small sample size $n$, this may not always yield a good estimate, but for large sample size $n$, this generally yields optimal estimates.
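A sketch of numerical maximum likelihood for an assumed $N(\mu, \sigma^2)$ model (the data values and starting point are illustrative):

```python
# Maximum likelihood: minimize the negative log-likelihood numerically
# for an assumed N(mu, sigma^2) model.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

data = np.array([145., 165., 134., 120., 112., 156., 145., 133., 135., 120.])

def neg_log_likelihood(params):
    mu, log_sigma = params            # optimize log(sigma) to keep sigma > 0
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=[100.0, 1.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
# For the normal model the MLEs are the sample mean and the root of the
# mean squared deviation, so these should match the closed forms.
print(mu_hat, sigma_hat)
```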
Asymptotic Optimality of Maximum Likelihood Estimates
$\{\hat{\theta}_n\}$: a sequence of asymptotically normal estimates, i.e.,
$V_n(\theta)^{-1/2}(\hat{\theta}_n - \theta) \to_d N(0, I)$ as $n \to \infty$.
$V_n(\theta)$ can be interpreted as the asymptotic variance of $\{\hat{\theta}_n\}$, and
$V_n(\theta) \ge I_n(\theta)^{-1}$,
where $I_n(\theta)$ is the Fisher information matrix. Under regular probability models, the maximum likelihood estimates $\{\hat{\theta}_{ML}\}$ achieve this lower bound.
Bayesian Interpretation
Prior Distribution: $\pi(\theta)$
Through this we might say that some values of $\theta$ are more likely than other values.
$\hat{\theta}$ is better than $\tilde{\theta}$ if
$\int E(\hat{\theta} - \theta)^2\, \pi(\theta)\, d\theta < \int E(\tilde{\theta} - \theta)^2\, \pi(\theta)\, d\theta$.
A best estimate is now possible; for example,
$\hat{\theta}_B = E(\theta \mid data)$
The RHS is the expectation with respect to the posterior distribution of $\theta$.
Prior Distribution: $\pi(\theta)$
Really? Where did it come from?
You may not believe this, but we are really talking in terms of a statistical philosophy. Can you really believe that the true state of nature $\theta$ is random?

[Figure: the two density curves again, "normal" (mean $\mu_1$) and "cancer" (mean $\mu_2$), over systolic blood pressure.]
๐œ‡1 and ๐œ‡2 are supposed to be fixed mean SBPs of the
normal and cancer populations. Now, we are saying that
they are random.
Bayesian Paradigm
๐œƒ is never a fixed value; under most circumstances some
values of ๐œƒ are more likely than other values.
Before a data is analyzed, we should explore this prior. Then
update it based on the information provided by the data.
Prior: ๐œ‹(๐œƒ)
Data: ๐‘“(๐‘‘๐‘Ž๐‘ก๐‘Ž|๐œƒ)
Posterior: ๐œ‹(๐œƒ|๐‘‘๐‘Ž๐‘ก๐‘Ž)
All information about ๐œƒ is contained in the posterior.
20
Example:
1 in 1,000 in the population carry a particular genetic disorder. Certain tests on a person are performed, and data are collected.
Data: $\{x_1, x_2, \ldots, x_n\}$, with likelihoods $f(data \mid +)$ and $f(data \mid -)$
Prior: $\pi(+) = \frac{1}{1000}$
Posterior: $\pi(+ \mid data) = \dfrac{f(data \mid +)\,\pi(+)}{f(data \mid +)\,\pi(+) + f(data \mid -)\,\pi(-)}$
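A numeric sketch of this posterior calculation; the likelihood values $f(data \mid +)$ and $f(data \mid -)$ below are hypothetical stand-ins, since the slides leave them unspecified:

```python
# Bayes' rule for the genetic-disorder example. The two likelihoods are
# assumed illustrative values, not given in the slides.
prior_pos = 1 / 1000
f_data_pos = 0.98   # assumed f(data | +): carrier
f_data_neg = 0.01   # assumed f(data | -): non-carrier

posterior_pos = (f_data_pos * prior_pos) / (
    f_data_pos * prior_pos + f_data_neg * (1 - prior_pos)
)
print(posterior_pos)  # ~0.089: even a strong test leaves modest posterior probability
```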
The main issues with Bayesian inference are
(1) Appropriateness of the prior
(2) Computation of the posterior distribution
Example: $\{X_1, X_2, \ldots, X_n\}$ is a random sample from $N(\mu, \sigma^2)$.
Prior: $\mu \sim N(\nu_0, \sigma^2 \omega_0)$, $(\sigma^2)^{-1} \sim \chi^2_m$
This is a conjugate prior because the posterior distribution is of the same form as the prior distribution.
Is this prior appropriate?
Prior: $\mu \sim N(\nu_0, \sigma^2 \omega_0)$, $(\sigma^2)^{-1} \sim \chi^2_m$
If nothing is known about $(\mu, \sigma^2)$, take $\omega_0 \approx \infty$, $m = 1$, $\nu_0 = 0$. This gives an almost flat prior for $\mu$ and $\sigma^2$. There are other ways to assign non-informative priors.
Note that if instead
Prior: $\mu \sim N(\nu_0, \tau_0^2)$, $(\sigma^2)^{-1} \sim \chi^2_m$,
then we will have a computational problem in computing the posterior distribution.
Computation of the posterior
There are two popular techniques for computing the posterior distribution:
1. Metropolis-Hastings Algorithm
2. Gibbs Sampler
These techniques can be used effectively for complex probability models and reasonable priors.
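A minimal random-walk Metropolis-Hastings sketch, assuming a simplified setting ($\sigma^2$ known, normal prior on $\mu$; all numbers illustrative) rather than the full conjugate model above:

```python
# Random-walk Metropolis-Hastings for the posterior of mu in a N(mu, sigma^2)
# model with sigma known and a N(nu0, tau0^2) prior on mu (all values assumed).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = np.array([138., 120., 112., 110., 128., 134., 128., 109., 138., 140.])
sigma, nu0, tau0 = 12.0, 120.0, 20.0

def log_posterior(mu):
    # log prior + log likelihood (up to an additive constant)
    return norm.logpdf(mu, nu0, tau0) + np.sum(norm.logpdf(data, mu, sigma))

samples, mu = [], 120.0
for _ in range(5000):
    proposal = mu + rng.normal(0.0, 2.0)          # random-walk proposal
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(mu):
        mu = proposal                              # accept; otherwise keep mu
    samples.append(mu)

print(np.mean(samples[1000:]))                     # posterior mean after burn-in
```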
Frequentist vs. Bayesian

Frequentist:
• All data information is contained in the likelihood function.
• The estimates are viewed in terms of how they behave on the average.
• Estimates are generally obtained by maximizing the likelihood function. Techniques include Newton-Raphson and the EM algorithm.

Bayesian:
• All data information is contained in the likelihood function and the prior.
• Estimates are viewed in terms of where they are located in the posterior.
• Estimates are obtained from the posterior. Techniques include the Gibbs sampler, Metropolis-Hastings, etc.
Mixture Models
Suppose the population is a mixture of two or more populations:
$y_i = \beta_{0i} + \beta_{1i} x_i + \epsilon_i$, with $(\beta_{0i}, \beta_{1i})$ following a mixture of normals.
Bayesians have a better answer for estimating this model than frequentists do.
Hypothesis Testing
Think about how it started in the statistical literature.
Data: $\{X_1, X_2, \ldots, X_n\}$ drawn from a probability model.
$H$: a hypothesis associated with the probability model.
Does the data support this hypothesis?
Bayesians had an answer to this, but they were not popular at the time.
Ans. $P(H \mid data)$
$p$-value (Fisher)
$\{X_1, X_2, \ldots, X_n\}$ drawn from $N(\mu, \sigma^2)$
Hypothesis $H$: $\mu = \mu_0$
Compute $t = \bar{x} - \mu_0$.
$p\text{-value} = \Pr(\text{observing a value } t \text{ or more extreme} \mid H)$
If this $p$-value is very small ($< 0.05$), then the data provide very little evidence in support of the hypothesis.
Conclusion: Reject the hypothesis.
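A sketch of this $p$-value computation; standardizing $\bar{x} - \mu_0$ by $s/\sqrt{n}$ to get the usual $t$ statistic is an assumption the slides gloss over:

```python
# Fisher's p-value for H: mu = mu0, via the usual one-sample t statistic.
import numpy as np
from scipy import stats

data = np.array([138., 120., 112., 110., 128., 134., 128., 109., 138., 140.])
mu0 = 120.0

t = (data.mean() - mu0) / (data.std(ddof=1) / np.sqrt(len(data)))
p_value = 2 * stats.t.sf(abs(t), df=len(data) - 1)   # two-sided "t or more extreme"
print(t, p_value)
```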
Analysis of Variance (ANOVA)
ANOVA is one of the most popular statistical tools for analyzing data.

[Diagram: Factors 1, 2, and 3 each pointing to a response variable Y.]

Does Y (the response) depend on any of the factors?
Example 1: You are doing research on mpg (miles per gallon) for a brand of automobiles.
Question: What affects mpg?

[Diagram: wind speed, air temperature, and air moisture pointing to mpg.]

Do wind speed, air temperature, and air moisture affect mpg?
Example 2:
Research Question: Does blood pressure (BP) depend on weight and gender?

[Diagram: weight and gender pointing to BP.]
There is variation in BP. Some is due to weight, and some is due to gender.

[Figure: scatter plot of BP against weight, with separate markers for female and male subjects.]
Concept:
Variation(BP) = Variation(Weight) + Variation(Gender) + Variation(Error)
These variations can be described by sums of squares:
SS(BP) = SS(Weight) + SS(Gender) + SS(Error)
$df_{BP} = df_w + df_g + df_e$
$df$ is the degrees of freedom, which represents the effective number of terms in the sums of squares.
F-Statistics
Weight: Test statistic
$F_1 = \dfrac{SS(\text{Weight})/df_w}{SS(\text{Error})/df_e} = \dfrac{MS_W}{MS_E}$
Hypothesis $H$: Weight is not a factor in BP.
$p\text{-value} = P(\text{observing a value more extreme than } F_1 \mid H)$
If the $p$-value is small ($< 0.05$), then there is little evidence that weight is not a factor.
Gender: Test statistic
$F_2 = \dfrac{SS(\text{Gender})/df_g}{SS(\text{Error})/df_e} = \dfrac{MS_G}{MS_E}$
The same can be done to see if gender is a factor.
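A sketch of this BP ~ Weight + Gender ANOVA using statsmodels; the synthetic data-generating values are assumptions for illustration:

```python
# Two-factor ANOVA for BP on weight and gender, on synthetic data
# (variable names and generating values are illustrative assumptions).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
n = 60
weight = rng.normal(75, 12, n)
gender = rng.choice(["F", "M"], n)
bp = 90 + 0.4 * weight + 5 * (gender == "M") + rng.normal(0, 8, n)

df = pd.DataFrame({"BP": bp, "Weight": weight, "Gender": gender})
model = smf.ols("BP ~ Weight + C(Gender)", data=df).fit()
print(anova_lm(model))   # SS, df, F statistics, and p-values for each factor
```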
Neyman-Pearson Lemma
Basis for classical hypothesis testing:
$H_0$: Null hypothesis
$H_a$: Alternative hypothesis (research hypothesis)
TS: Test statistic
Decision rule
Conclusion
Type-I Error: False discovery
Type-II Error: False non-discovery
Devise a decision rule so that
$\alpha = \Pr(\text{False Discovery})$
is very small ($= 0.05$, say). Through the Neyman-Pearson Lemma, a most powerful decision rule can be obtained.
$H_0$: $\mu = \mu_0$
The uniformly most powerful unbiased decision rule is: reject when
$|\bar{X} - \mu_0| > k$,
where $k$ is such that
$\Pr(|\bar{X} - \mu_0| > k) = 0.05$.
Note that this is a frequentist method, since the probability statement should be interpreted in a frequentist manner.
Likelihood Approach
The Neyman-Pearson Lemma works only for simple probability models. More generally, use the likelihood ratio test statistic
$-2 \log \lambda = 2(\max \log L - \max_H \log L)$,
where the second maximum is taken under the hypothesis $H$. If the hypothesis $H$ is correct, then $-2 \log \lambda$ should be close to 0. Thus, we reject the hypothesis $H$ if
$-2 \log \lambda > c$.
The cut-off point $c$ can be obtained through the asymptotic distribution of $-2 \log \lambda$, which is usually $\chi^2$.
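A sketch of the likelihood ratio test in the simplest case, assuming $\sigma^2$ known so that $-2 \log \lambda$ has the closed form $n(\bar{x} - \mu_0)^2/\sigma^2$ (an assumed simplification; data and values are illustrative):

```python
# Likelihood ratio test of H: mu = mu0 for N(mu, sigma^2) with sigma known.
# In this special case -2 log(lambda) = n * (xbar - mu0)^2 / sigma^2,
# which is exactly chi-square with 1 df under H.
import numpy as np
from scipy.stats import chi2

data = np.array([138., 120., 112., 110., 128., 134., 128., 109., 138., 140.])
mu0, sigma = 120.0, 12.0

neg2_log_lambda = len(data) * (data.mean() - mu0) ** 2 / sigma ** 2
c = chi2.ppf(0.95, df=1)             # cutoff from the chi-square_1 distribution
print(neg2_log_lambda, c, neg2_log_lambda > c)
```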
Model Selection
Suppose you want to choose one model out of several. This is a type of multiple-hypotheses problem.
Regression: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$
Not all predictors $x_1, x_2, \ldots, x_k$ are significant, and you want to select the set of significant predictors. This can be viewed as selecting one of several models $M_j$, $j = 1, 2, \ldots, m$:
$-2 \log \lambda_{M_j} = 2(\max \log L - \max_{M_j} \log L)$
Choose the model that yields the smallest $-2 \log \lambda_{M_j}$.
This yields a biased selection, meaning that a model with a larger number of parameters has a better chance of being selected. The AIC and BIC information criteria penalize model size:
$AIC = 2 \max_{M_j} \log L - 2 \times (\#\text{ of parameters in } M_j)$
$BIC = 2 \max_{M_j} \log L - \log n \times (\#\text{ of parameters in } M_j)$
Select the model with the highest value of AIC (or BIC).
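A sketch of AIC/BIC comparison with statsmodels on synthetic data. Caution on sign conventions: statsmodels reports $AIC = -2 \max \log L + 2k$, so there one selects the smallest reported value, which matches selecting the largest value under the slides' definition.

```python
# Compare nested regression models by AIC/BIC (statsmodels convention:
# smaller reported AIC/BIC is better). Data are synthetic assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)     # x2 is irrelevant by construction
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

for formula in ["y ~ x1", "y ~ x1 + x2"]:
    fit = smf.ols(formula, data=df).fit()
    print(formula, fit.aic, fit.bic)        # "y ~ x1" should win
```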
Bayesian Hypothesis Testing
Data: $\{X_1, X_2, \ldots, X_n\}$ drawn from $N(\mu, \sigma^2)$
Hypothesis $H$: $\mu = \mu_0$
Prior: $p_0 = \Pr(\mu = \mu_0)$, $1 - p_0 = \Pr(\mu \ne \mu_0)$
Posterior: $p = \Pr(\mu = \mu_0 \mid Data)$, $1 - p = \Pr(\mu \ne \mu_0 \mid Data)$
Bayes Factor:
$BF = \dfrac{p/(1-p)}{p_0/(1-p_0)}$
If this Bayes factor is large ($BF \ge 20$), the data have sufficient evidence to support the hypothesis $H$: $\mu = \mu_0$.
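A small sketch of the Bayes factor formula; the posterior probability below is assumed for illustration, not computed from data:

```python
# Bayes factor as the ratio of posterior odds to prior odds of H.
p0 = 0.5            # prior Pr(mu = mu0)
p = 0.95            # posterior Pr(mu = mu0 | data), assumed for illustration

bayes_factor = (p / (1 - p)) / (p0 / (1 - p0))
print(bayes_factor)  # 19.0; BF >= 20 would be read as strong support for H
```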
Frequentist vs. Bayesian
Note that both the $p$-value and classical hypothesis tests are frequentist, since their probability statements are interpreted in terms of repeated sampling:
$p\text{-value} = \Pr(\text{observing a value } t \text{ or more extreme} \mid H)$
$\alpha = \Pr(\text{False Discovery}) = \Pr(|\bar{X} - \mu_0| > c)$
The Bayes factor is used in Bayesian tests, which are based on the posterior probability $\Pr(H \mid Data)$.
Multiple Hypotheses:
Consider 1000 independent tests, each at a Type-I error of $\alpha = 0.05$. Then 5% of the true null hypotheses would be falsely rejected. In other words, if 50 of the hypotheses were rejected, there is no guarantee that they were not all falsely rejected.
FWER: with $m$ = # of hypotheses,
$\pi = P(\text{one or more falsely rejected null hypotheses}) = 1 - (1 - \alpha)^m$
$\alpha = 1 - (1 - \pi)^{1/m} \approx \pi/m$ (Bonferroni correction)
If $m$ is large, $\alpha$ would be very small. Thus the power of detecting any true positive would be very small.
Sequential Bonferroni Corrections:
Let $p_{[1]} \le p_{[2]} \le \cdots \le p_{[m]}$ be the $p$-values of independent tests with corresponding null hypotheses $H_{(1)}, H_{(2)}, \ldots, H_{(m)}$.
Holm's Method (Holm, 1979, Scand. J. Statist.):
• If $p_{[1]} > \pi/m$, accept all nulls.
• If $p_{[1]} \le \pi/m$, reject $H_{(1)}$; then if $p_{[2]} > \pi/(m-1)$, accept the rest of the nulls.
• Continue until the first $j$ such that $p_{[j]} > \pi/(m - j + 1)$. In that case reject all $H_{(i)}$, $i \le j - 1$, and accept the rest of the nulls.
Simes Method (Biometrika, 1986):
• If $p_{[m]} \le \pi$, reject all nulls.
• If not, but if $p_{[m-1]} \le \pi/2$, reject all $H_{(i)}$, $i = 1, 2, \ldots, m-1$.
• Continue until the first $p_{[i]} \le \pi/(m - i + 1)$. In that case reject all $H_{(j)}$, $j = 1, 2, \ldots, i$.
Note: Both Holm's and Simes' methods are designed to refine the FWER.
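A sketch of these FWER-controlling corrections via statsmodels' multipletests: 'holm' is Holm's step-down method, and 'simes-hochberg' is the Simes-based step-up procedure described above (the $p$-values are illustrative):

```python
# FWER-controlling multiple-testing corrections on illustrative p-values.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
for method in ["bonferroni", "holm", "simes-hochberg"]:
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, reject.sum(), "rejections")
```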
False Discovery Rate (FDR): Benjamini and Hochberg (1995), JRSS
When the number of hypotheses $m$ is very large (say in the thousands), and each individual hypothesis is not important, the FWER criterion is not very useful, since it yields few discoveries. For example, in a microarray data analysis the objective is to detect potential genes for future exploration; here each individual gene is not important. In such cases, tests with a controlled FWER would yield few discoveries.
FDR = Expected proportion of false rejections.

                    Accept Null   Reject Null   Total
True Null               U             V          m_0
True Alternatives       T             S          m - m_0
Total                 m - R           R           m

$FDR = E\left[\frac{V}{R}\right]$, where $\frac{V}{R} = 0$ if $R = 0$; equivalently,
$FDR = E\left[\frac{V}{R} \mid R > 0\right] P(R > 0)$.
Note that $FWER = P(V > 0)$.
Benjamini and Hochberg proved that the following procedure produces $FDR \le q$:
Let $k$ be the largest integer $i$ such that $p_{[i]} \le \frac{i}{m} q$; then reject all $H_{(j)}$, $j = 1, 2, \ldots, k$.
The result was proved under the assumption of independent test statistics. It was later extended to positively correlated test statistics by Benjamini and Yekutieli (2001, Ann. Stat.).
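A sketch of this Benjamini-Hochberg step-up rule, implemented directly and checked against statsmodels' 'fdr_bh' (illustrative $p$-values):

```python
# Benjamini-Hochberg: find the largest k with p_(k) <= (k/m) q and reject
# the k smallest p-values; then cross-check with statsmodels.
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042,
                  0.06, 0.074, 0.205, 0.212, 0.36])
q, m = 0.05, len(pvals)

order = np.argsort(pvals)
below = pvals[order] <= (np.arange(1, m + 1) / m) * q
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
print("BH rejections (direct):", k)

reject, _, _, _ = multipletests(pvals, alpha=q, method="fdr_bh")
print("BH rejections (statsmodels):", reject.sum())   # should agree
```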
Bayesian Interpretation (Storey, 2003, Ann. Stat.)
$pFDR = E\left[\frac{V}{R} \mid R > 0\right]$
Test $H_0^i: \theta_i = 0$ vs. $H_a^i: \theta_i \ne 0$, $i = 1, 2, \ldots, m$. Let $T_i$ be test statistics that reject $H_0^i$ if $T_i > c$, with $T_i$, $i = 1, 2, \ldots, m$, independently distributed. If $\theta_1, \theta_2, \ldots, \theta_m$ are i.i.d. with $p = P(\theta_i \in H_0^i) > 0$, then
$pFDR = P(H_0^i \mid T_i > c)$
Note: pFDR is a posterior version of the Type-I error.
Directional Hypothesis Problem (three-decision problem):
Suppose $H_0^i: \theta_i = 0$ is rejected, but it is also important to find the direction of $\theta_i$, i.e., $\theta_i < 0$ or $\theta_i > 0$. So the problem is to find subsets $S_-$ and $S_+$ of $\{1, 2, \ldots, m\}$ such that
$S_- = \{i : \theta_i < 0\}$ and $S_+ = \{i : \theta_i > 0\}$.
Example: Gene selection
When genes are altered under an adverse condition, such as cancer, the affected genes show under- or over-expression in a microarray.
$X_i$: expression level, $X_i \sim P(\theta_i, \alpha)$
$H_0^i: \theta_i = 0$ vs. $H_-^i: \theta_i < 0$ or $H_+^i: \theta_i > 0$
The objective is to find the genes with under-expression and the genes with over-expression.
Directional Error (Type III error):
A Type III error is defined as P(selection of the false direction when the null is rejected). The traditional method does not control the directional error: for example, rejecting when $|t| > t_{\alpha/2}$ and declaring a positive direction when $t > t_{\alpha/2}$, an error occurs if $\theta < 0$.
Sarkar and Zhou (2008, JSPI); Finner (1999, AS); Shaffer (2002, Psychological Methods); Lehmann (1952, AMS; 1957, AMS).
The main point of these works is that if the objective is to find the true direction of the alternative after rejecting the null, then the Type III error must be controlled instead of the Type I error.
Bayesian Decision-Theoretic Framework
$H_0^i: \theta_i = \theta_0$ (say, 0), $H_-^i: \theta_i < \theta_0$, $H_+^i: \theta_i > \theta_0$
Suppose $\theta_1, \theta_2, \ldots, \theta_m$ are generated from
$\pi(\theta) = p_- \pi_-(\theta) + p_0 \pi_0(\theta) + p_+ \pi_+(\theta)$,
where
$\pi_-(\theta) = g_-(\theta) I(\theta < 0)$, $\pi_0(\theta) = I(\theta = 0)$, $\pi_+(\theta) = g_+(\theta) I(\theta > 0)$,
$g_-(\cdot)$ is a density with support contained in $(-\infty, 0)$, and $g_+(\cdot)$ is a density with support contained in $(0, \infty)$. $g_-$ and $g_+$ could be truncated versions of a single density on $\theta$.
The skewness in the prior is introduced by $(p_-, p_0, p_+)$:
• $p_- < p_+$ reflects that the right tail is more likely than the left tail.
• $p_- = 0$ (or $p_+ = 0$) would yield a one-tail test.
• $p_- = p_+$, with $g_-$ and $g_+$ as truncations of a symmetric density, would yield a two-tail test.
• $p_-$ and $p_+$ can be assigned based on which tail is more important.
Loss Function
$L(\boldsymbol{\theta}, \boldsymbol{d}) = \sum_{i=1}^{m} L_i(\theta_i, d_i)$,
where $d_i = (d_-^i, d_0^i, d_+^i)$ takes the values
$(1, 0, 0)$ for selecting $H_-^i$,
$(0, 1, 0)$ for selecting $H_0^i$,
$(0, 0, 1)$ for selecting $H_+^i$.
Let $\delta_i = (\delta_-^i, \delta_0^i, \delta_+^i)$ be a randomized rule. The average risk for a decision rule $\delta = (\delta_1, \ldots, \delta_m)$ is given by
$r_\delta(\pi) = p_- r_{-\delta}(\pi_-) + p_0 r_{0\delta}(0) + p_+ r_{+\delta}(\pi_+)$,
where
$r_{-\delta}(\pi_-) = \sum_i \int_{\theta_i < 0} R(\theta_i, \delta_i)\, \pi_-(\theta_i)\, d\theta_i$,
$r_{+\delta}(\pi_+) = \sum_i \int_{\theta_i > 0} R(\theta_i, \delta_i)\, \pi_+(\theta_i)\, d\theta_i$,
$r_{0\delta}(0) = \sum_i R(0, \delta_i)$.
For a fixed prior $\pi$, decision rules can be compared by comparing the space
$S(\pi) = \{(r_{-\delta}(\pi_-), r_{0\delta}(0), r_{+\delta}(\pi_+)) : \delta \in D^*\}$;
consider the class of all rules $\delta$ for which $R(0, \delta)$ is the same.

[Figure: the risk set in the $(r_{-\delta}(\pi_-), r_{+\delta}(\pi_+))$ plane, with the Bayes rule on the lower boundary supported by a line whose slope depends on $p_-$ and $p_+$ (here $p_- > p_+$).]
Remark: This theorem implies that if a priori it is known that $H_+^i$ is more likely than $H_-^i$ ($p_+ > p_-$), then the average risk of the Bayes rule in the positive direction will be smaller than the average risk in the negative direction.
For the "0-1" loss, this would mean that the expected number of falsely detected genes in the positive direction would be less than the expected number of falsely detected genes in the negative direction.