Lecture 2
Bayesian Statistics and Inference
Lecture Contents
• What is Bayesian inference
• Prior distributions
• Examples of conjugate Bayesian analysis
• Credible intervals
• Bayes factors
• Bayesian linear regression
Bayes Theorem
• Bayesian statistics named after Rev. Thomas
Bayes (1702-1761)
• Bayes Theorem for probability events A and B
$$p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}$$
• Or for a set of mutually exclusive and exhaustive events $A_i$ (i.e. $p(\cup_i A_i) = \sum_i p(A_i) = 1$), then
$$p(A_i \mid B) = \frac{p(B \mid A_i)\,p(A_i)}{\sum_j p(B \mid A_j)\,p(A_j)}$$
Example – coin tossing
• Let A be the event of 2 Heads in three tosses of a fair coin, and B the event that the first toss is a Head.
• Three tosses give 8 equally probable outcomes
{HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}
• A = {HHT,HTH,THH} → p(A) = 3/8
• B = {HHH,HHT,HTH,HTT} → p(B) = 1/2
• A∩B = {HHT,HTH}, so p(A|B) = p(A∩B)/p(B) = 1/2
• B∩A = {HHT,HTH}, so p(B|A) = p(B∩A)/p(A) = 2/3
• p(A|B) = p(B|A)p(A)/p(B) = (2/3 × 3/8)/(1/2) = 1/2
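As a quick check, here is a minimal Python sketch (not part of the original lecture) that enumerates the eight outcomes and reproduces these probabilities:

```python
from itertools import product

# Enumerate all 8 equally likely outcomes of three fair-coin tosses
outcomes = [''.join(t) for t in product('HT', repeat=3)]

A = {o for o in outcomes if o.count('H') == 2}   # exactly 2 heads
B = {o for o in outcomes if o[0] == 'H'}         # first toss is a head

p = lambda s: len(s) / len(outcomes)
print(p(A), p(B))                      # 0.375 0.5
print(p(A & B) / p(B))                 # p(A|B) = 0.5
print(p(B & A) / p(A) * p(A) / p(B))   # Bayes: p(B|A)p(A)/p(B) = 0.5
```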
Example 2 – Diagnostic testing
• A new HIV test is claimed to have “95%
sensitivity and 98% specificity”
• In a population with an HIV prevalence of
1/1000, what is the chance that a patient
testing positive actually has HIV?
Let A be the event that the patient is truly positive and A’ the event that they are truly negative.
Let B be the event that they test positive.
Diagnostic Testing ctd.
• We want p(A|B)
• “95% sensitivity” means that p(B|A) = 0.95
• “98% specificity” means that p(B’|A’) = 0.98, i.e. p(B|A’) = 0.02
So from Bayes Theorem
$$p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B \mid A)\,p(A) + p(B \mid A')\,p(A')} = \frac{0.95 \times 0.001}{0.95 \times 0.001 + 0.02 \times 0.999} \approx 0.045$$
Thus over 95% of those testing positive will, in fact, not have HIV.
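The same calculation in a few lines of Python (a sketch using the figures from the slide):

```python
# Posterior probability of HIV given a positive test (prevalence 1/1000)
sens, spec, prev = 0.95, 0.98, 0.001
p_pos = sens * prev + (1 - spec) * (1 - prev)  # total probability of a positive test
print(sens * prev / p_pos)                     # ~0.045
```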
Being Bayesian!
• So the vital issue in this example is how should
this test result change our prior belief that the
patient is HIV positive?
• The disease prevalence (p=0.001) can be
thought of as a ‘prior’ probability.
• Observing a positive result causes us to modify
this probability to p=0.045 which is our ‘posterior’
probability that the patient is HIV positive.
• This use of Bayes theorem applied to observables is uncontroversial; however, its use in general statistical analyses, where parameters are unknown quantities, is more controversial.
Bayesian Inference
In Bayesian inference there is a fundamental distinction
between
• Observable quantities x, i.e. the data
• Unknown quantities θ
θ can be statistical parameters, missing data, latent
variables…
• Parameters are treated as random variables
In the Bayesian framework we make probability statements
about model parameters
In the frequentist framework, parameters are fixed, non-random quantities and the probability statements concern the data.
Prior distributions
As with all statistical analyses we start by positing
a model which specifies p(x| θ)
This is the likelihood, which relates all the variables in a ‘full probability model’.
However from a Bayesian point of view :
• θ is unknown so should have a probability
distribution reflecting our uncertainty about it
before seeing the data
• Therefore we specify a prior distribution p(θ)
Note this is like the prevalence in the example
Posterior Distributions
Also x is known, so should be conditioned on; here we use Bayes theorem to obtain the conditional distribution for the unobserved quantities given the data, which is known as the posterior distribution.
$$p(\theta \mid x) = \frac{p(\theta)\,p(x \mid \theta)}{\int p(\theta)\,p(x \mid \theta)\,d\theta} \propto p(\theta)\,p(x \mid \theta)$$
The prior distribution expresses our uncertainty about θ before seeing the data.
The posterior distribution expresses our uncertainty about θ after seeing the data.
Examples of Bayesian Inference
using the Normal distribution
Known variance, unknown mean
It is easier to consider first a model with 1
unknown parameter. Suppose we have a
sample of Normal data: $x_i \sim N(\mu, \sigma^2),\ i = 1, \dots, n$.
Let us assume we know the variance, $\sigma^2$, and we assume a prior distribution for the mean $\mu$ based on our prior beliefs:
$$\mu \sim N(\mu_0, \sigma_0^2)$$
Now we wish to construct the posterior distribution $p(\mu \mid x)$.
Posterior for Normal distribution
mean
So we have
$$p(\mu) = (2\pi\sigma_0^2)^{-1/2} \exp\{-\tfrac{1}{2}(\mu - \mu_0)^2/\sigma_0^2\}$$
$$p(x_i \mid \mu) = (2\pi\sigma^2)^{-1/2} \exp\{-\tfrac{1}{2}(x_i - \mu)^2/\sigma^2\}$$
and hence
$$p(\mu \mid x) \propto p(\mu)\,p(x \mid \mu) \propto (2\pi\sigma_0^2)^{-1/2} \exp\{-\tfrac{1}{2}(\mu - \mu_0)^2/\sigma_0^2\} \times \prod_{i=1}^{n}(2\pi\sigma^2)^{-1/2} \exp\{-\tfrac{1}{2}(x_i - \mu)^2/\sigma^2\}$$
$$\propto \exp\{-\tfrac{1}{2}\mu^2(1/\sigma_0^2 + n/\sigma^2) + \mu(\mu_0/\sigma_0^2 + \textstyle\sum_i x_i/\sigma^2) + \text{const}\}$$
Posterior for Normal distribution
mean (continued)
For a Normal distribution with response $y$ with mean $\theta$ and variance $\phi$ we have
$$f(y) = (2\pi\phi)^{-1/2} \exp\{-\tfrac{1}{2}(y - \theta)^2/\phi\} \propto \exp\{-\tfrac{1}{2}y^2\phi^{-1} + y\theta/\phi + \text{const}\}$$
We can equate this to our posterior as follows:
$$p(\mu \mid x) \propto \exp\{-\tfrac{1}{2}\mu^2(1/\sigma_0^2 + n/\sigma^2) + \mu(\mu_0/\sigma_0^2 + \textstyle\sum_i x_i/\sigma^2) + \text{const}\}$$
$$\Rightarrow \phi = (1/\sigma_0^2 + n/\sigma^2)^{-1} \quad \text{and} \quad \theta = \phi\,(\mu_0/\sigma_0^2 + \textstyle\sum_i x_i/\sigma^2)$$
Precisions and means
• In Bayesian statistics the precision = 1/variance
is often more important than the variance.
• For the Normal model we have
$$1/\phi = 1/\sigma_0^2 + n/\sigma^2 \quad \text{and} \quad \theta = \phi\,(\mu_0/\sigma_0^2 + \bar{x}/(\sigma^2/n))$$
In other words the posterior precision = sum
of prior precision and data precision, and the
posterior mean is a (precision weighted)
average of the prior mean and data mean.
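A minimal Python sketch of this update rule (the function name is illustrative, not from the lecture):

```python
def normal_posterior(mu0, var0, xbar, sigma2, n):
    """Posterior mean and variance for a Normal mean with known data
    variance sigma2 and a Normal(mu0, var0) prior: precisions add and
    the posterior mean is a precision-weighted average."""
    post_var = 1 / (1 / var0 + n / sigma2)
    post_mean = post_var * (mu0 / var0 + xbar / (sigma2 / n))
    return post_mean, post_var
```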
Large sample properties
As $n \to \infty$:
Posterior precision
$$1/\phi = 1/\sigma_0^2 + n/\sigma^2 \to n/\sigma^2$$
so the posterior variance $\phi \to \sigma^2/n$.
Posterior mean
$$\theta = \phi\,(\mu_0/\sigma_0^2 + \bar{x}/(\sigma^2/n)) \to \bar{x}$$
And so the posterior distribution
$$p(\mu \mid x) \to N(\bar{x}, \sigma^2/n),$$
compared to $p(\bar{x} \mid \mu) = N(\mu, \sigma^2/n)$ in the frequentist setting.
Girls Heights Example
• 10 girls aged 18 had both their heights and
weights measured.
• Their heights (in cm) were as follows:
169.6,166.8,157.1,181.1,158.4,165.6,166.7,156.5,168.1,165.3
We will assume the variance is known to be 50.
Two individuals gave the following prior
distributions for the mean height
Individual 1: $p_1(\mu) \sim N(165, 2^2)$
Individual 2: $p_2(\mu) \sim N(170, 3^2)$
Constructing posterior 1
• To construct the posterior we use the
formulae we have just calculated
• From the prior, $\mu_0 = 165$, $\sigma_0^2 = 4$
• From the data, $\bar{x} = 165.52$, $\sigma^2 = 50$, $n = 10$
• The posterior is therefore
$$p(\mu \mid x) \sim N(\theta_1, \phi_1)$$
where $\phi_1 = (\tfrac{1}{4} + \tfrac{10}{50})^{-1} = 2.222$ and $\theta_1 = \phi_1(\tfrac{165}{4} + \tfrac{1655.2}{50}) = 165.23$.
Prior and posterior comparison
Constructing posterior 2
• Again, to construct the posterior we use the formulae calculated earlier
• From the prior, $\mu_0 = 170$, $\sigma_0^2 = 9$
• From the data, $\bar{x} = 165.52$, $\sigma^2 = 50$, $n = 10$
• The posterior is therefore
$$p(\mu \mid x) \sim N(\theta_2, \phi_2)$$
where $\phi_2 = (\tfrac{1}{9} + \tfrac{10}{50})^{-1} = 3.214$ and $\theta_2 = \phi_2(\tfrac{170}{9} + \tfrac{1655.2}{50}) = 167.12$.
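Both posteriors can be checked numerically with the update formula above (a short Python sketch using the slide values):

```python
# Data summaries from the slides: mean height, known variance, sample size
xbar, sigma2, n = 165.52, 50, 10
for mu0, var0 in [(165, 4), (170, 9)]:             # prior 1 and prior 2
    post_var = 1 / (1 / var0 + n / sigma2)
    post_mean = post_var * (mu0 / var0 + xbar / (sigma2 / n))
    print(round(post_mean, 2), round(post_var, 3))  # 165.23 2.222, then 167.12 3.214
```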
Prior 2 comparison
Note this prior is not as close to the data as prior 1, and hence the posterior lies somewhere between the prior and the likelihood.
Other conjugate examples
• When the posterior is in the same family as the
prior we have conjugacy. Examples include:
Likelihood   Parameter     Prior    Posterior
Normal       Mean          Normal   Normal
Normal       Precision     Gamma    Gamma
Binomial     Probability   Beta     Beta
Poisson      Mean          Gamma    Gamma
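For instance, the Binomial–Beta row of the table corresponds to the following simple update (an illustrative example with assumed numbers, not from the lecture):

```python
# Beta(a, b) prior for a probability, k successes observed in m Binomial trials
a, b = 2, 2
k, m = 7, 10
posterior = (a + k, b + m - k)   # conjugacy: posterior is Beta(a + k, b + m - k)
print(posterior)                 # (9, 5)
```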
In all cases
• The posterior mean is a compromise between
the prior mean and the MLE
• The posterior s.d. is less than both the prior s.d.
and the s.e. (MLE)
‘A Bayesian is one who, vaguely expecting a horse
and catching a glimpse of a donkey, strongly
concludes he has seen a mule’ (Senn)
As $n \to \infty$:
• The posterior mean → the MLE
• The posterior s.d. → the s.e. (MLE)
• The posterior does not depend on the prior.
Non-informative priors
• We often do not have any prior information, although true Bayesians would argue we always have some prior information!
• We would hope to have good agreement between the
frequentist approach and the Bayesian approach with a
non-informative prior.
• Diffuse or flat priors are often better terms to use as no
prior is strictly non-informative!
• For our example of an unknown mean, candidate priors
are a Uniform distribution over a large range or a Normal
distribution with a huge variance.
Improper priors
• The limiting prior of both the Uniform and Normal is a
Uniform prior on the whole real line.
• Such a prior is defined as improper as it is not strictly a
probability distribution and doesn’t integrate to 1.
• Some care has to be taken with improper priors; however, in many cases they are acceptable provided they result in a proper posterior distribution.
• Uniform priors are often used as non-informative priors; however, it is worth noting that a uniform prior on one scale can be very informative on another.
• For example: If we have an unknown variance we may
put a uniform prior on the variance, standard deviation or
log(variance) which will all have different effects.
Point and Interval Estimation
• In Bayesian inference the outcome of interest for a parameter is its full posterior distribution; however, we may be interested in summaries of this distribution.
• A simple point estimate would be the mean of the posterior (although the median and mode are alternatives).
• Interval estimates are also easy to obtain from the
posterior distribution and are given several names, for
example credible intervals, Bayesian confidence
intervals and Highest density regions (HDR). All of these
refer to the same quantity.
Credible Intervals
• If we consider the heights example with our first prior
then our posterior is
p(μ|x) ~ N(165.23, 2.222),
and a 95% credible interval for μ is
165.23 ± 1.96 × sqrt(2.222) = (162.31, 168.15).
Similarly, prior 2 results in a 95% credible interval for μ of (163.61, 170.63).
Note that credible intervals can be interpreted in the more
natural way that there is a probability of 0.95 that the
interval contains μ rather than the frequentist conclusion
that 95% of such intervals contain μ.
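These intervals can be reproduced with a short script (a sketch assuming SciPy is available):

```python
from scipy.stats import norm

# 95% credible intervals from the two posteriors given in the slides
for mean, var in [(165.23, 2.222), (167.12, 3.214)]:
    lower, upper = norm.interval(0.95, loc=mean, scale=var ** 0.5)
    print(round(lower, 2), round(upper, 2))   # (162.31, 168.15), then (163.61, 170.63)
```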
Hypothesis Testing
Another big issue in statistical modelling is the ability to
test hypotheses and model comparisons in general.
The Bayesian approach is in some ways more
straightforward. For an unknown parameter θ
we simply calculate the posterior probabilities
$$p_0 = P(\theta \in \Theta_0 \mid x), \qquad p_1 = P(\theta \in \Theta_1 \mid x)$$
and decide between H0 and H1 accordingly.
We also require the prior probabilities to achieve this:
$$\pi_0 = P(\theta \in \Theta_0), \qquad \pi_1 = P(\theta \in \Theta_1)$$
Bayes factors
• Prior odds on H0 against H1 is π0 /π1
• Posterior odds on H0 against H1 is p0 /p1
• The Bayes factor B in favour of H0 against H1 is
$$B = \frac{p_0/p_1}{\pi_0/\pi_1} = \frac{p_0\,\pi_1}{p_1\,\pi_0}$$
Note that when the hypotheses are simple, B is the likelihood ratio of H0 against H1, i.e. the odds in favour of H0 against H1 given by the data; for complex hypotheses, however, B also involves the prior distributions.
Bayes factors – Girls height
example prior 1
Let us assume that H0 is μ >165 and hence
H1 is μ ≤165. Now we have π0= π1=0.5
under the N(165,4) prior
The posterior is N(165.23, 2.222), which gives p0 = 0.561 and p1 = 0.439, and hence a Bayes factor of 0.561/0.439 = 1.278.
Here the Bayes factor is close to 1, so the data have not much altered our beliefs about the hypothesis under discussion.
Bayes factors – Girls height
example prior 2
Now under the N(170,9) prior we have π0 = 0.952 and π1 = 0.048, so there is strong a priori evidence for H0 against H1.
The posterior is N(167.12, 3.214), which gives p0 = 0.881 and p1 = 0.119, and hence a Bayes factor of (0.881 × 0.048)/(0.952 × 0.119) = 0.373. So in this case the Bayes factor is smaller than 1, as the data give less evidence for H0 against H1 than the prior distribution.
It should be noted that care needs to be taken
when using Bayes factors and non-informative
priors.
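The two Bayes factors can be reproduced as follows (a sketch assuming SciPy; small differences from the slide values are due to rounding):

```python
from scipy.stats import norm

# H0: mu > 165. Each case lists (prior mean, prior var, posterior mean, posterior var).
cases = [(165, 4, 165.23, 2.222), (170, 9, 167.12, 3.214)]
for m0, v0, m1, v1 in cases:
    pi0 = norm.sf(165, loc=m0, scale=v0 ** 0.5)   # prior P(mu > 165)
    p0 = norm.sf(165, loc=m1, scale=v1 ** 0.5)    # posterior P(mu > 165)
    bayes_factor = (p0 / (1 - p0)) / (pi0 / (1 - pi0))
    print(round(bayes_factor, 3))                 # ~1.28 and ~0.373
```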
Bayesian inference with more
unknown parameters
We have so far restricted ourselves to an example with
only 1 unknown parameter which is generally
unrealistic.
For example it would be more common to consider a
Normal distribution with both mean and variance
unknown.
In such a situation interest may focus on the marginal
posterior distribution of the mean treating the variance
as a nuisance parameter.
The marginal distribution is created by integrating the joint
posterior distribution over the nuisance parameters
$$p(\mu \mid x) = \int p(\mu, \sigma \mid x)\,d\sigma$$
Bayesian inference with more
unknown parameters
This integration is one of the reasons why Bayesian statistics has been of less practical use in the past: even for reasonably simple models, the required integration makes Bayesian inference involved.
However the revolution in computer speed and memory
size has meant that integrations can be easily
approximated by simulation methods as we will describe
in the next session.
We will now briefly describe a Bayesian linear regression
model before going on to a lab that allows you to try
simulation approaches to solve the simple models in
these lectures.
Linear Regression example
In our whistle-stop tour of Bayesian statistics we
have here skipped over many standard multiple
parameter models. We will focus on linear
regression here for comparison with the
frequentist methods.
I will give only brief details, as it is less important to know how to calculate posterior distributions analytically when we will generally use simulation-based methods later.
Although the intention is not to scare you, the
derivations are rather complex.
Linear Regression
• Our model is
$$p(y_i \mid \beta_0, \beta_1, \sigma^2, x_i) \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$$
Now we need priors for the 3 unknown parameters, $\beta_0, \beta_1, \sigma^2$, which we will consider in more detail in the practical.
For now we will use a convenient non-informative prior based on a uniform distribution on $(\beta_0, \beta_1, \log\sigma)$.
This results in $p(\beta_0, \beta_1, \sigma^2 \mid x) \propto \sigma^{-2}$.
The posterior can be expressed as follows:
$$p(\beta, \sigma^2 \mid y) = p(\beta \mid \sigma^2, y)\,p(\sigma^2 \mid y)$$
Linear Regression
We then get
$$p(\beta \mid \sigma^2, y) \sim N(\hat{\beta}, V\sigma^2)$$
where $\hat{\beta} = (X^T X)^{-1} X^T y$ and $V = (X^T X)^{-1}$,
$$p(\sigma^2 \mid y) \sim \text{Inv-}\chi^2(n - k, s^2)$$
where $s^2 = \frac{1}{n-k}(y - X\hat{\beta})^T (y - X\hat{\beta})$.
Note that the classical estimates of $\beta$ and $\sigma^2$ are $\hat{\beta}$ and $s^2$ respectively, with the classical standard errors of $\beta$ obtained by setting $\sigma^2 = s^2$.
Estimating the linear regression
To sample from the posterior distribution given, we firstly calculate the values of $\hat{\beta}$, $V$ and $s^2$ before sampling from the distributions for $\sigma^2$ and $\beta$ in turn.
Note that in the practical we will return to the heights example and regress the girls' heights on their weights while trying various informative priors.
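A minimal NumPy sketch of this two-step sampler, assuming the design matrix X already contains a column of ones for the intercept (this is an illustration, not the WinBUGS code used in the practical):

```python
import numpy as np

rng = np.random.default_rng(2023)

def sample_posterior(X, y, n_draws=1000):
    """Sample (beta, sigma^2) under the flat prior p(beta, sigma^2) proportional
    to sigma^-2: draw sigma^2 | y ~ Inv-chi^2(n - k, s^2), then
    beta | sigma^2, y ~ N(beta_hat, V sigma^2)."""
    n, k = X.shape
    V = np.linalg.inv(X.T @ X)
    beta_hat = V @ X.T @ y                 # classical least-squares estimate
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)
    draws = []
    for _ in range(n_draws):
        sigma2 = (n - k) * s2 / rng.chisquare(n - k)        # scaled inverse chi-square draw
        beta = rng.multivariate_normal(beta_hat, V * sigma2)
        draws.append((beta, sigma2))
    return draws
```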
Information for the Practical
In this first practical you will use an MCMC
estimation package called WinBUGS to fit
the models discussed in the lecture.
This practical is meant to confirm the
answers from the lecture notes and also to
familiarize you a little with WinBUGS.
We will give more details on WinBUGS in
later lectures.