Bayes made simple

Significance is…
P(obtaining a test statistic more extreme than the one we observed | H0 is true)
This is not a test of the strength of evidence supporting a hypothesis (or model). It is simply a statement about the probability of obtaining extreme values that we have not observed.

A frequentist confidence interval
In frequentist statistics, a 95% CI represents an interval such that, if the experiment were repeated 100 times, 95% of the resulting CIs (e.g., average ± 1.96 SE) would contain the true parameter value.

A new approach to insight
Pose a question and think about what is needed to answer it. Ask:
• How do the data arise?
• What is the hypothesized process that produces them?
• What are the sources of randomness/uncertainty in the process and in the way we observe it?
• How can we model the process and its associated uncertainty in a way that allows the data to speak informatively?
This approach is based on a firm intuitive understanding of the relationship between process models and probability models.

Why Bayes? Light limitation of trees
A process model for growth rate as a function of light availability:
μi = γ(Li − c) / (γ/α + (Li − c))
where γ = maximum growth rate at high light, c = minimum light requirement, and α = slope of the curve at low light. The observations are modeled as
yi ~ Normal(μi, σ²), i.e., p(yi | θ) = Normal(μi, σ²).
[Figure: growth rate (0 to 1.0) plotted against light availability (0 to 7), showing the saturating curve μi.]

Where do uncertainties arise?
• Variation due to processes we failed to model.
• Error in our observations: of the process, and of covariates or predictor variables.
• What about genetic variation among individuals? Geographic variation among sites?
• What does the current science tell us about the process we are modeling?
• How can we exploit what is already known about the processes we are modeling?

[Diagram, built up across three slides: a process model μi = g(Li, θ) with process variance σ²proc and parameters θ = (γ, α, c); a parameter model for those parameters; then a data model linking the data on the response (yi, obs. y) and the predictor (xi, obs. x), each with observation error, to the process model; and finally hyperparameters governing the parameter model.]

Today
• Derivation of Bayes law
• Understanding each piece: P(y|θ), P(θ), P(y|θ)P(θ), P(y)
• Putting the pieces together
• The relationship between likelihood and Bayes
• Priors and conjugacy (…probably into Thursday)

Concept of probability
P(A) = probability that event A occurs = area of A / area of S, where S = the sample space.
[Figure: event A (and, on the following slide, event B) drawn inside the sample space S.]

Conditional probabilities
Probability of B given that we know A occurred:
P(B | A) = area of (B and A) / area of A = P(B, A) / P(A)
What is the probability that A occurred, given that B occurred?

Bayes law: get this now and forever
1) P(B | A) = P(B, A) / P(A)
2) P(A | B) = P(A, B) / P(B)
Solving 1 for P(B, A): P(B, A) = P(B | A) P(A).
Substituting into 2 gives Bayes law:
P(A | B) = P(B | A) P(A) / P(B)

We are interested in P(θ|y)
• We have some new data (y) in hand; the data represent the "event" that has occurred.
• What is the probability of the parameters given the data?
By symmetry, Bayes law is:
P(θ | y) = P(θ, y) / P(y) = P(y | θ) P(θ) / P(y)
where P(θ, y) is the joint, the numerator P(y | θ) P(θ) comes from the product rule, and P(y) is the marginal.

The Holy Grail
The posterior distribution specifies P(θ|y) as a function of θ. It returns a probability of the parameter value in light of the data.
[Figure: the posterior P(θ|y) plotted against θ.]
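A quick numerical illustration of the derivation above may help: estimate the "areas" by Monte Carlo and confirm that the definition of conditional probability gives Bayes law. This is a minimal sketch in R; the sample space and the events A and B are invented for illustration.

set.seed(1)
n <- 1e5
u <- runif(n); v <- runif(n)             # points scattered over the sample space S (the unit square)
A <- u < 0.6                             # event A: an arbitrary region of S
B <- v < 0.3                             # event B: another arbitrary region of S
P.A  <- mean(A)                          # area of A / area of S
P.B  <- mean(B)                          # area of B / area of S
P.AB <- mean(A & B)                      # area of (A and B) / area of S
P.B.given.A <- P.AB / P.A                # conditional probability of B given A
c(definition = P.AB / P.B,               # P(A | B) straight from the definition
  bayes.law  = P.B.given.A * P.A / P.B)  # P(B | A) P(A) / P(B): the same number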
Bayes Law
P(θ | y) = P(y | θ) P(θ) / P(y)
• P(y | θ): what is this? Haven't we seen this before? It is the likelihood.
• P(θ): the probability that the parameter takes on a particular value in light of prior data on θ = the prior distribution.
• P(θ | y): what we seek, the probability that a parameter takes on a particular value in light of the new data = the posterior distribution.
• P(y): the probability of the data, aka the marginal distribution. More on this coming up.

Components
• Understanding P(θ) = the prior
• Understanding P(y|θ)P(θ) = the joint distribution
• Understanding P(y) = the marginal distribution

What is P(θ) (aka the prior)?
[Figure: two priors on θ: an informative prior, dnorm(x, 40, 2), and an uninformative prior, dnorm(x, 0, 100), plotted over roughly 30 to 55.]

Where do priors come from?
• If we have a mean and a standard deviation from earlier studies of θ, then we have a prior on θ. P(θ|y) in our current study becomes P(θ) in future studies.
• If we don't have prior information, the prior will be uninformative.

The joint
So what is P(y|θ)P(θ) (aka the joint distribution)?

Exercise
• You have 8 observations of the standing crop of carbon in a grassland from 0.25 m² plots. Assume the data are normally distributed. y = (16.5, 15.7, 16, 15.3, 14.9, 15.7, 14.7, 15.6).
• A previous estimate of carbon standing crop was mean = 20, sd = 2.2.
• Calculate and plot the prior, the likelihood, and the joint distribution.

Point estimates vs. distributions
• The posterior, P(θ|y) = P(y|θ)P(θ)/P(y), is a probability distribution over θ (the mean): the area under the curve = 1.
• The likelihood, L(θ|y) = dnorm(y, θ, sigma) (a single point of which is, e.g., P(y = 4 infected | θ = 0.12)), holds the data constant while the parameter varies: the area under the curve ≠ 1.

# Data
y = c(16.5, 15.7, 16, 15.3, 14.9, 15.7, 14.7, 15.6)
y.sd <- sd(y)
# prior mean and sd on theta
p.mean = 20
p.sd = 2.2
D = NULL
theta = seq(0, 30, .1)  # set up a vector of potential values for theta
# Likelihood x prior = joint
for (i in 1:length(theta)) {  # note we do this for all values of theta
  # prior
  P = dnorm(theta[i], p.mean, p.sd)
  # likelihood
  L = prod(dnorm(y, theta[i], y.sd))  # note the product (not log-likelihood)
  # likelihood x prior
  LP = L * P
  D = rbind(D, c(theta[i], LP, L, P))
}
D = as.data.frame(D)
names(D) = c("theta", "LP", "L", "P")
# Plot everything
par(mfrow = c(3, 1))
# prior
plot(D$theta, D$P, type = "l", lwd = 2, xlab = expression(theta),
     ylab = expression(paste("P(", theta, ")")), main = "Prior", col = "blue")
# likelihood
plot(D$theta, D$L, type = "l", lwd = 2, xlab = expression(theta),
     ylab = expression(paste("P(y|", theta, ")")), main = "Likelihood", col = "blue")
# prior x likelihood = joint
plot(D$theta, D$LP, type = "l", lwd = 2, xlab = expression(theta),
     ylab = expression(paste("P(y|", theta, ")P(", theta, ")")), main = "Joint", col = "blue")

[Figure: the three resulting panels, Prior P(θ), Likelihood P(y|θ), and Joint P(y|θ)P(θ), each plotted against θ from 0 to 30.]

What is P(y)?
P(θ | y) = P(y | θ) P(θ) / P(y)
Because P(y) is a constant,
P(θ | y) ∝ P(y | θ) P(θ)
So, without knowing the denominator, we can evaluate the relative support for each value of θ, but not the probability. This is what maximum likelihood does. To get at the probability, we must "normalize" the relative support by dividing by P(y).

So what is P(y)?
The θ are mutually exclusive, exhaustive hypotheses.
[Figure: θ1, θ2, and θ3 partition the sample space (the green blob): all possible outcomes of observation, experiment, etc. The data are the observed outcome (the blue blob).]
P(y) = P(data) = area of blue blob / area of green blob
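To make this picture concrete before the formulas that follow, here is a minimal sketch in R with three mutually exclusive, exhaustive hypotheses; the prior probabilities and likelihoods are invented for illustration.

prior <- c(0.5, 0.3, 0.2)     # P(theta_i) for theta_1, theta_2, theta_3; sums to 1
like  <- c(0.10, 0.40, 0.05)  # P(y | theta_i): how probable the observed data are under each hypothesis
P.y   <- sum(like * prior)    # P(y): total probability of the observed outcome
post  <- like * prior / P.y   # P(theta_i | y): posterior over the three hypotheses; sums to 1
rbind(prior, like, post)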
For a particular hypothesis, say θ3,
P(θ3 | y) = P(θ3, y) / P(y) = P(y | θ3) P(θ3) / P(y)
and, because the hypotheses are exclusive and exhaustive, the probability of y is
P(y) = P(θ1, y) + P(θ2, y) + P(θ3, y) = Σ P(y | θi) P(θi), summing over i = 1, 2, 3.

Bayes law for discrete parameters
P(θi | y) = P(y | θi) P(θi) / P(y) = P(y | θi) P(θi) / Σ P(y | θj) P(θj), where the sum runs over all J possible values θj.
P(θi|y) reads: in light of the data, the probability that the parameter has the value θi. If we find this value for all possible values of the parameter θ, then we have the posterior distribution.

An example from medical testing: false positives
Prob(ill) = 10⁻⁶
Prob(test + | ill) = 1
Prob(test + | healthy) = 10⁻³
What is Prob(ill | test +)?
Prob(ill | +) = Prob(+ | ill) Prob(ill) / Prob(+)
The denominator sums over the two mutually exclusive states, ill and not ill:
Prob(+) = Prob(ill) Prob(+ | ill) + Prob(healthy) Prob(+ | healthy)
        = 1 × 10⁻⁶ + (1 − 10⁻⁶) × 10⁻³ ≈ 10⁻³
So Prob(ill | +) ≈ (1 × 10⁻⁶) / 10⁻³ = 10⁻³: even though the test never misses a true case, a positive result still implies only about a 1-in-1000 chance of being ill.

The definite integral
The integral between a and b is the limit, as Δx → 0, of Σ f(xi) Δx, written ∫ f(x) dx over [a, b].
[Figure: the curve y = f(x) over the interval from a to b.]

Bayes law for continuous parameters
P(θ | y) = P(y | θ) P(θ) / P(y) = P(y | θ) P(θ) / ∫ P(y | θ) P(θ) dθ
[Figure: the posterior P(θ|y) plotted against θ.]
P(θ|y) reads: in light of the data, the probability that the parameter has the value θ. If we find this value for all possible values of the parameter θ, then we have the posterior distribution.

Bayes law
P(θ | y) = P(y | θ) P(θ) / P(y)
• P(y | θ): the likelihood.
• P(θ): the probability that the parameter takes on a particular value in light of prior data on θ = the prior distribution.
• P(θ | y): the probability that a parameter takes on a particular value in light of the new data = the posterior distribution.
• P(y): the probability of the data, aka the marginal distribution.

In terms of densities:
Pr(θ | y) = Pr(y, θ) / Pr(y) = Pr(y | θ) Pr(θ) / Pr(y), with Pr(y) = ∫ p(y | θ) p(θ) dθ
• Pr(θ | y): the conditional density of θ given y = y0
• Pr(y, θ): the joint density of y and θ
• Pr(θ): the marginal density of θ
• Pr(y): the marginal density of y

How do we derive a posterior distribution?
P(θ | y) = P(y | θ) P(θ) / P(y)
• The prior distribution, P(θ), can be subjective or objective, informative or non-informative.
• The likelihood function, aka the data distribution, P(y | θ).
• The product of the prior and the likelihood, P(θ) P(y | θ) = the joint, P(y, θ).
• The denominator, the marginal distribution or normalizing constant: P(y) = ∫ P(θ) P(y | θ) dθ.
• What we are seeking: the posterior distribution, P(θ | y).
Note that we are dividing each point on the dashed line (the joint) by the area under the dashed line to obtain a probability reflecting our prior and current knowledge.
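A minimal continuation of the carbon exercise above shows this division in code. It assumes the data frame D and the grid spacing of 0.1 from the earlier sketch; the normalization step itself is exactly the operation just described.

d.theta <- 0.1                  # grid spacing used for theta above
P.y     <- sum(D$LP) * d.theta  # numerical approximation of the integral of P(y|theta)P(theta) d theta
D$post  <- D$LP / P.y           # posterior density P(theta|y) at each value of theta
sum(D$post) * d.theta           # ~1: the posterior now integrates to 1
plot(D$theta, D$post, type = "l", lwd = 2, xlab = expression(theta),
     ylab = expression(paste("P(", theta, "|y)")), main = "Posterior", col = "blue")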
Summary: Bayes vs. likelihood
• The difference is not the use of prior information.
• In likelihood, we find parameter estimates by maximizing the likelihood. We have a likelihood profile, but it is somewhat cumbersome for developing confidence or support envelopes.
• In Bayes, we integrate or sum over the entire range of parameter values to get a PDF, dividing each point on the "likelihood profile" by the area beneath the profile. The estimate of our parameter is the mean or median of the resulting PDF. We also obtain estimates of the mode, the variance, kurtosis, etc., which allows us to make statements about the probability of our parameter(s). Likelihood cannot make these statements.
• The posterior PDF from our current study forms the prior in subsequent studies.
• The real value of Bayes over likelihood emerges as our process and probability models become complex. In this case, we can exploit the product rule to simplify the problem: to break it up into manageable chunks that can be reassembled in a coherent way.
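As a closing illustration of that last point, here is a minimal sketch, not taken from the slides, of how the product rule lets the light-limitation model from the beginning of the lecture be written as data model × process model × parameter model. The prior means and standard deviations, the observation and process standard deviations, and the latent vector mu are placeholders chosen only to show the structure.

# process model: mean growth as a function of light availability
g <- function(L, gamma, c, alpha) gamma * (L - c) / (gamma / alpha + (L - c))

# unnormalized log joint = log data model + log process model + log parameter models
log.joint <- function(y, mu, L, gamma, c, alpha, sigma.obs, sigma.proc) {
  sum(dnorm(y, mu, sigma.obs, log = TRUE)) +                        # data model: observations given the latent process
  sum(dnorm(mu, g(L, gamma, c, alpha), sigma.proc, log = TRUE)) +   # process model: latent means given the parameters
  dnorm(gamma, 0.5, 1, log = TRUE) +                                # parameter models (priors); values are hypothetical
  dnorm(c, 0.1, 1, log = TRUE) +
  dnorm(alpha, 0.2, 1, log = TRUE)
}

Each factor is one manageable chunk that can be written down and checked separately, then added on the log scale to give the unnormalized joint for the whole model.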