Bayes made simple
Significance is…
P(obtaining a test statistic more extreme than the one we observed | H0 is true)
This is not a test of the strength of evidence supporting a hypothesis (or model). It is simply a statement about the probability of obtaining extreme values that we have not observed.
A frequentist confidence interval
In frequentist statistics, a 95% CI represents an interval such that, if the experiment were repeated 100 times, 95% of the resulting CIs (e.g., average ± 1.96 SE) would contain the true parameter value.
A new approach to insight
Pose a question and think about what is needed to answer it.
Ask:
• How do the data arise?
• What is the hypothesized process that produces them?
• What are the sources of randomness/uncertainty in the process and the way we observe it?
• How can we model the process and its associated uncertainty in a way that allows the data to speak informatively?
This approach is based on a firm intuitive understanding of the relationship between process models and probability models.
Why Bayes?
Light Limitation of Trees
μ_i = γ(L_i − c) / (γ/α + (L_i − c))
γ = maximum growth rate at high light
c = minimum light requirement
α = slope of the curve at low light
y_i ~ Normal(μ_i, σ^2)
p(y_i | θ) = Normal(μ_i, σ^2)
[Figure: growth rate plotted against light availability, with the light-limitation curve.]
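A minimal R sketch of this model may help make the curve concrete; the parameter values below are hypothetical, chosen only for illustration.

# Deterministic light-limitation curve with normally distributed noise
# around it (all values illustrative).
gamma <- 1.0    # maximum growth rate at high light
c.min <- 0.5    # "c" in the model: minimum light requirement
alpha <- 0.8    # slope of the curve at low light
sigma <- 0.05   # standard deviation of the noise

L  <- seq(0.5, 7, by = 0.1)                                # light availability
mu <- gamma * (L - c.min) / (gamma / alpha + (L - c.min))  # mu_i
y  <- rnorm(length(L), mean = mu, sd = sigma)              # y_i ~ Normal(mu_i, sigma^2)

plot(L, y, xlab = "Light availability", ylab = "Growth rate")
lines(L, mu, lwd = 2, col = "blue")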
Where do uncertainties arise?
• Variation due to processes we failed to model.
• Error in our observations:
  • of the process
  • of covariates or predictor variables
• What about genetic variation among individuals? Geographic variation among sites?
• What does the current science tell us about the process we are modeling?
• How can we exploit what is already known about the processes we are modeling?
[Diagrams: the light-limitation example drawn as a hierarchical model, built up in stages.]
• Process model: μ_i = g(L_i, θ) with process variance σ_proc, where θ = (α, γ, c).
• Parameter model: distributions for α, γ, c; in the final diagram these are governed by hyperparameters.
• Data model: the observed response y_i and predictor x_i are linked to the process through observation errors σ_obs,y and σ_obs,x, with p(y_i | θ) = Normal(μ_i, σ^2), i.e., y_i ~ Normal(μ_i, σ^2).
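To see how the layers combine, here is a hedged simulation sketch of how data might arise under this hierarchical structure: true light drives growth through the process model, the process adds variation around the curve, and both the predictor and the response are observed with error. All parameter values are hypothetical.

# Simulate data from the layered model (illustrative values only).
set.seed(1)
gamma <- 1.0; c.min <- 0.5; alpha <- 0.8
sigma.proc  <- 0.05   # process variation around the deterministic curve
sigma.obs.x <- 0.10   # observation error on the predictor (light)
sigma.obs.y <- 0.05   # observation error on the response (growth)

L.true <- runif(50, 1, 7)                                        # true light availability
mu     <- gamma * (L.true - c.min) / (gamma / alpha + (L.true - c.min))
z      <- rnorm(50, mu, sigma.proc)                              # latent (true) growth
x.obs  <- rnorm(50, L.true, sigma.obs.x)                         # observed predictor
y.obs  <- rnorm(50, z, sigma.obs.y)                              # observed response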
Today
• Derivation of Bayes Law
• Understanding each piece
  • P(y|θ)
  • P(θ)
  • P(y|θ) P(θ)
  • P(y)
• Putting the pieces together
• The relationship between likelihood and Bayes
• Priors and conjugacy (…probably into Thursday)
Concept of Probability
P(A) = probability that event A occurs = area of A / area of S
[Diagram: event A within the sample space S; a second event B is then added to the same sample space.]
Conditional Probabilities
Probability of B given that we know A occurred:
P(B | A) = area of B and A / area of A = P(B ∩ A) / P(A) = P(B, A) / P(A)
[Diagram: overlapping events A and B within the sample space S.]
Conditional Probabilities
What is P(A | B), the probability that A occurred given that B occurred?
[Diagram: overlapping events A and B within the sample space S.]
Bayes Law: Get this now and forever
1) P(B | A) = P(B, A) / P(A)
2) P(A | B) = P(A, B) / P(B)
Solving 1 for P(B, A):
P(B, A) = P(B | A) P(A)
Substituting into 2 gives Bayes Law:
P(A | B) = P(B | A) P(A) / P(B)
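A quick numerical check of Bayes Law, with hypothetical probabilities chosen only for illustration:

# Suppose P(A) = 0.3, P(B | A) = 0.8, and P(B | not A) = 0.2.
p.A      <- 0.3
p.B.A    <- 0.8
p.B.notA <- 0.2

p.B   <- p.B.A * p.A + p.B.notA * (1 - p.A)  # total probability of B
p.A.B <- p.B.A * p.A / p.B                   # Bayes Law: P(A | B)
p.A.B                                        # 0.24 / 0.38, about 0.63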
We are interested in P(θ|y)
• We have some new data (y) in hand; the data represent the “event” that has occurred.
• What is the probability of the parameters given the data? By symmetry, Bayes law is:
P(θ | y) = P(θ, y) / P(y) = P(y | θ) P(θ) / P(y)
(the joint P(θ, y) is expanded using the product rule; P(y) is the marginal)
The Holy Grail
The posterior distribution specifies P(θ|y) as a function of θ. It returns a probability of the parameter value in light of the data.
[Figure: the posterior distribution P(θ|y) plotted against θ.]
Bayes Law
P(θ | y) = P(y | θ) P(θ) / P(y)
• P(y | θ): what is this? Haven’t we seen this before?
• P(θ): the probability that the parameter takes on a particular value in light of prior data on θ = the prior distribution.
• P(θ | y): what we seek, the probability that a parameter takes on a particular value in light of the new data = the posterior distribution.
• P(y): the probability of the data, aka the marginal distribution. More on this coming up.
Components
• Understanding P(θ) = the prior
• Understanding P(y|θ)P(θ) = the joint distribution
• Understanding P(y) = the marginal distribution
What is P(θ) (aka the prior)?
[Figure: two priors on θ plotted over the range 30–55. Top panel: dnorm(x, 40, 2), an informative prior. Bottom panel: dnorm(x, 0, 100), an uninformative prior.]
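The two priors in the figure can be reproduced directly (the plotting range is for display only):

# Informative vs. uninformative priors on theta.
x <- seq(30, 55, by = 0.1)
par(mfrow = c(2, 1))
plot(x, dnorm(x, 40, 2), type = "l", xlab = expression(theta),
     ylab = "Density", main = "Informative prior")
plot(x, dnorm(x, 0, 100), type = "l", xlab = expression(theta),
     ylab = "Density", main = "Uninformative prior")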
Where do priors come from?
• If we have a mean and a standard deviation from
earlier studies of θ, then we have a prior on θ.
P(θ|y) in our current study becomes
P(θ) in future studies
• If we don’t have prior information, the prior will be
uninformative
The joint
So what is P(y|θ)P(θ) (aka the joint distribution)?
Exercise
• You have 8 observations of the standing crop of carbon
in a grassland from 0.25 sq. m. plots. Assume the data are
normally distributed.
y=(16.5,15.7,16,15.3,14.9,15.7,14.7,15.6)
• A previous estimate of carbon standing crop was
mean=20, sd=2.2.
• Calculate and plot the prior, the likelihood, and the joint
distribution.
Point estimates vs. distribution
[Figure: two curves plotted against θ (the mean).]
• The posterior, P(θ | y) = P(y | θ) P(θ) / P(y): a probability distribution, so the area under the curve = 1.
• The likelihood, P(y | θ), e.g., P(y = 4 infected | θ = 0.12). As a function of the parameter, L(θ | y) = dnorm(y, theta, sigma): the data are constant, the parameter varies, and the area under the curve ≠ 1.
# Data
y <- c(16.5, 15.7, 16, 15.3, 14.9, 15.7, 14.7, 15.6)
y.sd <- sd(y)
# Prior mean and sd on theta
p.mean <- 20
p.sd <- 2.2
D <- NULL
theta <- seq(0, 30, .1)  # set up a vector of potential values for theta
# Likelihood x prior = joint
for (i in 1:length(theta)) {  # note we do this for all values of theta
  # prior
  P <- dnorm(theta[i], p.mean, p.sd)
  # likelihood
  L <- prod(dnorm(y, theta[i], y.sd))  # note the product (not log-likelihood)
  # likelihood x prior
  LP <- L * P
  D <- rbind(D, c(theta[i], LP, L, P))
}
D <- as.data.frame(D)
names(D) <- c("theta", "LP", "L", "P")
# Plot everything
par(mfrow = c(3, 1))
# prior
plot(D$theta, D$P, type = "l", lwd = 2, xlab = expression(theta),
     ylab = expression(paste("P(", theta, ")")), main = "Prior", col = "blue")
# likelihood
plot(D$theta, D$L, type = "l", lwd = 2, xlab = expression(theta),
     ylab = expression(paste("P(y|", theta, ")")), main = "Likelihood", col = "blue")
# prior * likelihood = joint
plot(D$theta, D$LP, type = "l", lwd = 2, xlab = expression(theta),
     ylab = expression(paste("P(y|", theta, ")P(", theta, ")")), main = "Joint", col = "blue")
[Figure: the three panels produced by the code above. Top: the prior, P(θ). Middle: the likelihood, P(y|θ). Bottom: the joint, P(y|θ)P(θ). Each is plotted over θ from 0 to 30.]
What is P(y)?
P(θ | y) = P(y | θ) P(θ) / P(y)
Because P(y) is a constant,
P(θ | y) ∝ P(y | θ) P(θ)
So, without knowing the denominator, we can evaluate the relative support for each value of θ, but not the probability. This is what maximum likelihood does. To get at the probability, we must “normalize” the relative support by dividing by P(y).
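Continuing the exercise code above, here is a hedged sketch of that normalization: approximate P(y) by summing the joint over the grid of θ values (a Riemann sum with width 0.1), then divide.

# Approximate the marginal P(y) and normalize the joint to get the posterior.
# Continues from the data frame D built in the exercise code.
d.theta <- 0.1
p.y <- sum(D$LP * d.theta)     # P(y): integral of P(y|theta) P(theta) over theta
D$post <- D$LP / p.y           # P(theta|y) = P(y|theta) P(theta) / P(y)
sum(D$post * d.theta)          # should be approximately 1
plot(D$theta, D$post, type = "l", lwd = 2, xlab = expression(theta),
     ylab = expression(paste("P(", theta, "|y)")), main = "Posterior", col = "blue")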
So what is P(y)?
The θ are mutually exhaustive, mutually exclusive hypotheses.
[Diagram: the sample space (all possible outcomes of observation, experiment, etc.) partitioned into regions θ1, θ2, θ3.]
So what is P(y)?
The θ are mutually exhaustive, mutually exclusive hypotheses.
[Diagram: the sample space (all possible outcomes of observation, experiment, etc.; the green blob) partitioned into θ1, θ2, θ3, with the data, the observed outcome, shown as a blue blob.]
P(y) = P(data) = area of blue blob / area of green blob
So what is P(y)?
P(θ3 ∩ y) = P(θ3, y) = P(y | θ3) P(θ3)
Because the probability of y is
P(θ3 ∩ y) + P(θ2 ∩ y) + P(θ1 ∩ y),
P(y) = Σ_{i=1}^{3} P(y | θi) P(θi)
[Diagram: the data blob overlapping the regions θ1, θ2, θ3.]
Bayes law for discrete parameters
P(θi | y) = P(y | θi) P(θi) / P(y)
P(θi | y) = P(y | θi) P(θi) / Σ_{j=1}^{J} P(y | θj) P(θj)
P(θi|y) reads: in light of the data, the probability that the parameter has the value θi. If we find this value for all possible values of the parameter θ, then we have the posterior distribution.
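As a small numerical sketch, with three candidate parameter values and priors and likelihoods invented purely for illustration:

# Discrete Bayes law: normalize likelihood x prior by their sum.
prior      <- c(0.2, 0.5, 0.3)      # P(theta_i), sums to 1
likelihood <- c(0.10, 0.40, 0.05)   # P(y | theta_i)

p.y       <- sum(likelihood * prior)   # the marginal P(y), the denominator
posterior <- likelihood * prior / p.y  # P(theta_i | y)
posterior                              # sums to 1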
An example from medical testing: false positives
Prob(ill) = 10^-6
Prob(test + | ill) = 1
Prob(test + | healthy) = 10^-3
What is Prob(ill | test +)?
Prob(ill | +) = Prob(+ | ill) Prob(ill) / Prob(+)
[Diagram: the positive test results split between ill and not ill.]
Prob(+) = prob(ill ∩ +) + prob(healthy ∩ +)
= prob(ill) prob(+ | ill) + prob(healthy) prob(+ | healthy)
= 1 × 10^-6 + (1 − 10^-6) × 10^-3 ≈ 10^-3
The Definite Integral
The integral between a and b:
lim_{Δx→0} Σ_{i=1}^{n} f(x_i) Δx = ∫_a^b f(x) dx
[Figure: the curve y = f(x) between x = a and x = b, with the area underneath approximated by rectangles of width Δx.]
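A quick sketch of this idea (the function and interval are arbitrary illustrations): as Δx shrinks, the sum of rectangle areas approaches the integral.

# Riemann-sum approximation of the integral of a standard normal density
# over [-2, 2], compared with the exact value from pnorm().
f  <- function(x) dnorm(x, 0, 1)
dx <- 0.001
x  <- seq(-2, 2 - dx, by = dx)
sum(f(x) * dx)        # approximation, about 0.954
pnorm(2) - pnorm(-2)  # exact value, about 0.9545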
Bayes law for continuous parameters
P(θ | y) = P(y | θ) P(θ) / P(y)
P(θ | y) = P(y | θ) P(θ) / ∫ P(y | θ) P(θ) dθ
[Figure: the posterior distribution P(θ|y) plotted against θ.]
P(θ|y) reads: in light of the data, the probability that the parameter has the value θ. If we find this value for all possible values of the parameter θ, then we have the posterior distribution.
Bayes Law
P(θ | y) = P(y | θ) P(θ) / P(y)
• P(y | θ): the likelihood.
• P(θ): the probability that the parameter takes on a particular value in light of prior data on θ = the prior distribution.
• P(θ | y): the probability that a parameter takes on a particular value in light of the new data = the posterior distribution.
• P(y): the probability of the data, aka the marginal distribution.
[Figure: the joint density of y and θ, with the marginal density of θ, the marginal density of y, and the conditional density of θ given y = y0.]
Pr(θ | y) = Pr(y, θ) / Pr(y) = Pr(y | θ) Pr(θ) / Pr(y) ∝ p(y | θ) p(θ)
How do we derive a posterior distribution?
P(θ | y) = P(y | θ) P(θ) / P(y)
• The prior distribution, P(θ), can be subjective or objective, informative or non-informative.
• The likelihood function, aka the data distribution, P(y|θ).
• The product of the prior and the likelihood function, P(θ)P(y|θ), is the joint, P(y, θ).
• The denominator is the marginal distribution, or normalization constant:
P(y) = ∫ P(θ) P(y | θ) dθ
• What we are seeking: the posterior distribution, P(θ|y).
Note that we are dividing each point on the dashed line by the area under the dashed line to obtain a probability reflecting our prior and current knowledge.
Summary: Bayes vs likelihood
• The difference is not the use of prior information.
• In likelihood, we find parameter estimates by maximizing the likelihood.
We have a likelihood profile, but it is somewhat cumbersome for developing
confidence or support envelopes.
• In Bayes, we integrate or sum over the entire range of parameter values to
get a PDF by dividing each point on the “likelihood profile” by the area
beneath the profile. The estimate of our parameter is the mean or the median
of the resulting PDF. We also obtain estimates of the mode, the variance, kurtosis, etc., which allow us to make statements about the probability of our parameter(s). Likelihood cannot make these statements.
• This posterior PDF in our current study forms the prior in subsequent
studies.
• The real value of Bayes over likelihood emerges as our process and
probability models become complex. In this case, we can exploit the product
rule to simplify our problem, to break it up into manageable chunks that can
be reassembled in a coherent way.