Additional Slides on Bayesian
Statistics for STA 101
Prof. Jerry Reiter
Fall 2008
Can we use this method to learn
about means and percentages?
• To learn about population averages and
percentages, we’ve used data (like the
DNA test results), but not prior information
(like the list of suspects).
• We show how to combine data and prior
information in class.
Combining the prior beliefs and the
data using Bayes Rule
• In the Bayes rule problem before break, we combined the prior beliefs and the data using Bayes rule:

Pr(p \mid X = 1) = \frac{Pr(X = 1 \mid p)\, Pr(p)}{Pr(X = 1)}

• Pr(p|X=1) represents our posterior beliefs about p.
Estimation of unknown parameters
in statistical models (Bayesian and
non-Bayesian)
• Suppose we posit a probability distribution to
model data. How do we estimate its unknown
parameters?
• Example: assume data follow regression model.
Where do the estimates of the regression
coefficients come from?
• Classical statistics: maximum likelihood
estimation.
• Bayesian statistics: Bayes rule.
Estimating percentage of Dukies
who plan to get advanced degree
• Suppose we want to estimate the
percentage of Duke students who plan to
get an advanced degree (MBA, JD, MD,
PhD, etc.). Call this percentage p.
• We sample 20 people at random, and
8 of them say they plan to get an
advanced degree.
• What should be our estimate of p?
Estimating the average IQ of Duke
professors
[Figure: normal quantile plot of Prof IQs (hypothetical data), IQ values roughly 100 to 170.]
• Let µ be the
population average
IQ of Duke profs.
• Suppose we
randomly sample
25 Duke profs and
record their IQs.
• What should be
our estimate of µ?
Moments
Mean              132.16
Std Dev            11.710679
Std Err Mean        2.3421358
Upper 95% Mean    136.99393
Lower 95% Mean    127.32607
N                  25
Maximum likelihood estimation: A
principled approach to estimation
• Usually we can use subject-matter
knowledge to specify a distribution for the
data. But, we don’t know the parameters
of that distribution.
1) Number out of 20 who want advanced
degree: binomial distribution.
2) Profs’ IQs: normal distribution.
Maximum likelihood estimation
• We need to estimate the parameters of the
distribution.
Why do we care?
A) So we can make probability statements
about future events.
B) The parameters themselves may be
important.
Maximum likelihood estimation
• The maximum likelihood estimate of the
unknown parameter is the value for which
the data were most likely to have
occurred.
• Let’s see how this works in the examples.
Advanced degree example
• Let Y be the random variable for the number of people out of 20
that plan to get an advanced degree.
• Y has a binomial distribution with n = 20, and unknown
probability p.
• In the data, Y = 8. If we knew p, the value of the probability distribution function at Y = 8 would be:

Pr(Y = 8) = \frac{20!}{(8!)(12!)}\, p^8 (1 - p)^{20-8}
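As a quick numerical sketch (not part of the original slides), we can evaluate this probability for a few candidate values of p; the snippet below uses scipy.stats.binom, and the three p values tried are arbitrary illustrations.

    from scipy.stats import binom

    # Pr(Y = 8) with n = 20 trials, evaluated at a few candidate values of p
    for p in [0.2, 0.4, 0.6]:
        print(p, binom.pmf(8, 20, p))
    # roughly 0.022, 0.180, and 0.035: the observed data are most likely when p is near 0.4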
MLE for degree example
• Let’s graph Pr(Y = 8) as a function of the unknown p.
• Label the function L(p). L(p) is called the likelihood
function for p.
Maximum likelihood
• The maximum likelihood estimate of p is
the value of p that maximizes L(p).
• This is a reasonable estimate because it is
the value of p for which the observed data
(y = 8) had the greatest chance of
occurring.
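A small grid search (again, just an illustrative sketch, not from the slides) makes this concrete: evaluate L(p) over many candidate values of p and pick the maximizer.

    import numpy as np
    from scipy.stats import binom

    p_grid = np.linspace(0.001, 0.999, 999)   # candidate values of p
    L = binom.pmf(8, 20, p_grid)              # likelihood L(p) = Pr(Y = 8 | p)
    print(p_grid[np.argmax(L)])               # about 0.40, i.e. the sample percentage 8/20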
Finding the MLE for degree
example
• To maximize the likelihood function, we
need to take the derivative of
L(p) = \frac{20!}{(8!)(12!)}\, p^8 (1 - p)^{20-8}
with respect to p, set it equal to zero, and
finally solve for p.
You get the sample percentage!
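The calculus is easiest on the log of the likelihood; the steps below are a standard derivation, filled in here for reference.

\log L(p) = \log\frac{20!}{(8!)(12!)} + 8\log p + 12\log(1 - p)

\frac{d}{dp}\log L(p) = \frac{8}{p} - \frac{12}{1 - p} = 0
\quad\Rightarrow\quad 8(1 - p) = 12p
\quad\Rightarrow\quad \hat{p} = \frac{8}{20} = 0.40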
Estimating the average IQ of Duke professors
[Repeated slide: the normal quantile plot of Prof IQs (hypothetical data) and the Moments summary shown earlier (mean 132.16, SD 11.71, SE 2.34, n = 25).]
• Let µ be the population average IQ of Duke profs. We randomly sample 25 Duke profs and record their IQs.
• What should be our estimate of µ?
Model for Professors’ IQs
• The mathematical function for a normal curve for
any prof’s IQ, which we label Y, is:
f(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y - \mu)^2 / 2\sigma^2}
• All normal curves have this form, with different means and SDs. Here, we’ll assume σ = 15. We don’t know µ, which is what we’re after.
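As a small illustration (not in the original slides), we can evaluate this density at one observed IQ for a few candidate values of µ, keeping σ fixed at 15; the observed value 132 and the candidate µ values are arbitrary choices.

    from scipy.stats import norm

    # density of a single observed IQ of 132 under normal curves with sigma = 15
    for mu in [120, 130, 140]:
        print(mu, norm.pdf(132, loc=mu, scale=15))
    # the density is largest for the mu closest to the observed IQ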
Model for all 25 IQs
• We need the function for all 25 IQs.
• Assuming each prof’s IQ is independent
of other profs’ IQs, we have
f(y_1, y_2, \ldots, y_{25}) = f(y_1) \times f(y_2) \times \cdots \times f(y_{25}) = \prod_{i=1}^{25} \frac{1}{15\sqrt{2\pi}}\, e^{-(y_i - \mu)^2 / 2(15^2)}
Model for all 25 IQs
• With some algebra and simplifications, the
likelihood function is:
L(\mu) = \left( \frac{1}{15\sqrt{2\pi}} \right)^{25} e^{-\sum_{i=1}^{25} (y_i - \mu)^2 / 2(15^2)}
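As a rough numerical sketch (not from the slides), we can evaluate this likelihood on a grid of µ values. The 25 raw IQs are not listed in the slides, so the snippet simulates a hypothetical stand-in sample; the point is only that the maximizer essentially coincides with the sample average.

    import numpy as np

    rng = np.random.default_rng(0)
    iqs = rng.normal(132, 12, size=25)   # hypothetical stand-in for the 25 observed IQs

    def log_likelihood(mu, y, sigma=15.0):
        # log of L(mu): sum of normal log-densities, with sigma treated as known
        return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2))

    mu_grid = np.linspace(100, 170, 701)
    values = [log_likelihood(mu, iqs) for mu in mu_grid]
    print(mu_grid[np.argmax(values)], iqs.mean())   # the maximizer is essentially the sample mean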
Likelihood function and maximum
likelihood estimates
• A graph of the likelihood function L(µ) is a single bell-shaped peak over the possible values of µ.
• The function is maximized when µ is the sample average. So,
we use 132.16 as our estimate of the average Duke prof’s IQ.
• This sample average is the MLE for µ in any normal curve.
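For completeness, the standard calculus step behind this claim (not shown on the slide): maximizing L(µ) is the same as minimizing the sum of squared deviations in the exponent.

\frac{d}{d\mu} \sum_{i=1}^{25} (y_i - \mu)^2 = -2 \sum_{i=1}^{25} (y_i - \mu) = 0
\quad\Rightarrow\quad \hat{\mu} = \frac{1}{25}\sum_{i=1}^{25} y_i = \bar{y} = 132.16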
The Bayesian approach to
estimation of means
• Let’s show how to combine data and prior
information to address the following
motivating question:
What is a likely range for the average IQ of
Duke professors?
Combining the prior beliefs and the
data using Bayes Rule
• We combine our prior beliefs and the data using
Bayes rule.
f(\mu \mid \text{data}) = \frac{f(\text{data} \mid \mu)\, f(\mu)}{f(\text{data})}
• f(µ|data) represents our posterior beliefs about µ .
Formalizing a model for prior
information
• Let’s assign a distribution for µ that reflects
our a priori beliefs about its likely range.
Label this f(µ).
• Using the data you supplied in class, the
curve describing our beliefs about µ is the
normal curve with
mean = 128
SD = 15
Mathematical equation for normal
curve
• We can write down the equation for this
normal curve.
f(\mu) = \frac{1}{15\sqrt{2\pi}}\, e^{-(\mu - 128)^2 / 2(15^2)}
Model for the data (25 IQs)
• If we knew µ, the model for the data (the
professors’ IQs) is
f(y_1, y_2, \ldots, y_{25} \mid \mu) = \prod_{i=1}^{25} \frac{1}{15\sqrt{2\pi}}\, e^{-(y_i - \mu)^2 / 2(15^2)}
Estimating the average IQ of Duke professors
[Repeated slide: the same normal quantile plot and Moments summary as before (mean 132.16, SD 11.71, SE 2.34, n = 25).]
• Recall: µ is the population average IQ of Duke profs, and we have a random sample of 25 profs with recorded IQs.
Combining the prior beliefs and the
data using Bayes Rule
• We combine the model for the prior beliefs and the
model for the data using Bayes rule.
f(\mu \mid \text{data}) = \frac{f(\text{data} \mid \mu)\, f(\mu)}{f(\text{data})}
• f(µ|data) represents our posterior beliefs about µ .
Posterior distribution
• Using calculus, one can show that f(µ|data) is a
normal curve with
mean = \frac{(1/\mathrm{SE}^2)\,\bar{y} + (1/\text{Prior SD}^2)\,\text{Prior mean}}{1/\mathrm{SE}^2 + 1/\text{Prior SD}^2}

SD = \sqrt{\frac{1}{1/\mathrm{SE}^2 + 1/\text{Prior SD}^2}}
Posterior distribution
• For our data and prior beliefs, the posterior distribution f(µ|data) is the normal curve with

mean = \frac{(1/2.34^2)\,132.16 + (1/15^2)\,128}{1/2.34^2 + 1/15^2} \approx 132.06

SD = \sqrt{\frac{1}{1/2.34^2 + 1/15^2}} \approx 2.313
Using the posterior distribution to
summarize beliefs about µ
• Because f(µ|data) describes beliefs about µ, we
can make probability statements about µ.
• For example, using a normal curve with mean
equal to 132.06 and SD equal to 2.314,
Pr(µ > 130 | data) = .813
• A 95% posterior interval for µ stretches from
127.52 to 136.59.
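A short Python sketch can reproduce these numbers from the summary values reported in the slides; scipy.stats.norm is used for the probability statements, and small discrepancies from the slides are rounding.

    from scipy.stats import norm

    ybar, se = 132.16, 2.34            # sample mean and standard error (Moments table)
    prior_mean, prior_sd = 128, 15     # prior beliefs specified in class

    w_data, w_prior = 1 / se**2, 1 / prior_sd**2
    post_mean = (w_data * ybar + w_prior * prior_mean) / (w_data + w_prior)
    post_sd = (1 / (w_data + w_prior)) ** 0.5

    posterior = norm(loc=post_mean, scale=post_sd)
    print(post_mean, post_sd)          # about 132.06 and 2.31
    print(1 - posterior.cdf(130))      # Pr(mu > 130 | data), about 0.81
    print(posterior.interval(0.95))    # 95% posterior interval, about (127.5, 136.6)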
Bayesian statistics in general
• Bayesian methods exist for any population parameter, including
percentiles, maxima and minima, ratios, etc.
• The method is general:
1) specify a mathematical curve that reflects prior beliefs about the
population parameter.
2) specify a mathematical curve that describes the distribution of the
data, given a value of the population parameter.
3) combine the curves from 1 and 2 mathematically to get posterior
beliefs for the parameter, updated for the data.
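To make the three steps concrete, here is a minimal grid-based sketch for the advanced-degree example; the flat prior on p is purely an illustrative choice, since the slides do not specify a prior for p.

    import numpy as np
    from scipy.stats import binom

    p_grid = np.linspace(0.001, 0.999, 999)

    prior = np.ones_like(p_grid)             # step 1: prior curve for p (flat, for illustration)
    likelihood = binom.pmf(8, 20, p_grid)    # step 2: distribution of the data given p
    posterior = prior * likelihood           # step 3: combine via Bayes rule ...
    posterior = posterior / posterior.sum()  # ...and normalize so the curve sums to 1

    print(p_grid[np.argmax(posterior)])      # posterior beliefs about p are centered near 0.40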
Differences between frequentist
and Bayesian
FREQUENTIST
• Parameters are not random.
• Confidence intervals.

BAYESIAN
• Parameters are random.
• Posterior distributions.