Statistical Decision Theory
Bayes’ theorem:
For discrete events:
$B_1, \ldots, B_M$ mutually exclusive events with $\bigcup_{j=1}^{M} B_j = S$ (sample space)
$$\Pr(B_i \mid A) = \frac{\Pr(A \mid B_i)\,\Pr(B_i)}{\sum_{j=1}^{M}\Pr(A \mid B_j)\,\Pr(B_j)} \propto \Pr(A \mid B_i)\,\Pr(B_i)$$
For probability density functions:
$$f_{Y\mid X}(y \mid x) = \frac{f_{X\mid Y}(x \mid y)\,f_Y(y)}{\int f_{X\mid Y}(x \mid z)\,f_Y(z)\,dz} = \frac{f_{X\mid Y}(x \mid y)\,f_Y(y)}{f_X(x)} \propto f_{X\mid Y}(x \mid y)\,f_Y(y)$$
The Bayesian “philosophy”
The classical approach (frequentist’s view):
The random sample $X = (X_1, \ldots, X_n)$ is assumed to come from a distribution with a probability density function $f(x; \theta)$ where $\theta$ is an unknown but fixed parameter.
The sample is investigated through its random-variable properties relating to $f(x; \theta)$. The uncertainty about $\theta$ is assessed solely on the basis of the sample properties.
The Bayesian approach:
The random sample $X = (X_1, \ldots, X_n)$ is assumed to come from a distribution with a probability density function $f(x; \theta)$ where the uncertainty about $\theta$ is modelled with a probability distribution (i.e. a p.d.f.), called the prior distribution.
The obtained values of the sample, i.e. $x = (x_1, \ldots, x_n)$, are used to update the information from the prior distribution to a posterior distribution for $\theta$.
Main differences:
In the classical approach, $\theta$ is fixed, while in the Bayesian approach $\theta$ is a random variable.
In the classical approach the focus is on the sampling distribution of $X$, while in the Bayesian approach the focus is on the variation of $\theta$.
Bayesian: “What we observe is fixed, what we do not observe is random.”
Frequentist: “What we observe is random, what we do not observe is fixed.”
Concepts of the Bayesian framework
Prior density: $p(\theta)$
Likelihood: $L(\theta; x)$ (“as before”)
Posterior density: $q(\theta \mid x) = q(\theta; x)$ (the book uses the second notation)
Relation through Bayes’ theorem:
$$q(\theta \mid x) = \frac{f_X(x \mid \theta)\,p(\theta)}{\int f_X(x \mid \lambda)\,p(\lambda)\,d\lambda} = [\text{book's notation}] = \frac{f_X(x; \theta)\,p(\theta)}{\int f_X(x; \lambda)\,p(\lambda)\,d\lambda} = \frac{L(\theta; x)\,p(\theta)}{\int L(\lambda; x)\,p(\lambda)\,d\lambda} = \frac{L(\theta; x)\,p(\theta)}{f_X(x)} \propto L(\theta; x)\,p(\theta)$$
The textbook writes $q(\theta \mid x) = q(\theta; x)$ and $f_X(x) = h(x)$, so that
$$q(\theta; x) = \frac{L(\theta; x)\,p(\theta)}{h(x)}$$
Still the posterior is referred to as the distribution of $\theta$ conditional on $x$.
Decision-theoretic elements
1. One of a number of actions should be decided on.
2. State of nature: A number of states possible. Usually represented by $\theta$
3. For each state of nature the relative desirabilities of the different actions
possible can be quantified
4. Prior information for the different states of nature may be available: Prior
distribution of $\theta$
5. Data may be available. Usually represented by x. Can be used to update the
knowledge about the relative desirabilities of the different actions.
In mathematical notation for this course:
True state of nature: $\theta$. Uncertainty described by the prior $p(\theta)$
Data: $x$, an observation of $X$, whose p.d.f. depends on $\theta$ (data is thus assumed to be available)
Decision procedure: $\delta$
Action: $\delta(x)$. The decision procedure becomes an action when applied to given data $x$
Loss function: $L_S(\theta, \delta(x))$ measures the loss from taking action $\delta(x)$ when $\theta$ holds
Risk function:
$$R(\theta, \delta) = \int L_S(\theta, \delta(x))\,L(\theta; x)\,dx = E_X\left[L_S(\theta, \delta(X))\right]$$
Note that the risk function is the expected loss with respect to the simultaneous distribution of $X_1, \ldots, X_n$.
Note also that the risk function is for the decision procedure, and not for the particular action.
Admissible procedures:
A procedure $\delta_1$ is inadmissible if there exists another procedure $\delta_2$ such that
$R(\theta, \delta_1) \ge R(\theta, \delta_2)$ for all values of $\theta$, with strict inequality for at least one $\theta$.
A procedure which is not inadmissible (i.e. no other procedure can be found whose risk function is never higher and somewhere lower) is said to be admissible.
Minimax procedure:
A procedure $\delta^*$ is a minimax procedure if
$$\max_{\theta} R(\theta, \delta^*) = \min_{\delta}\left\{\max_{\theta} R(\theta, \delta)\right\}$$
i.e. $\theta$ is chosen to be the “worst” possible value, and under that value the procedure that gives the lowest possible risk is chosen.
The minimax procedure uses no prior information about $\theta$, thus it is not a Bayesian procedure.
Example
Suppose you are about to decide whether you should buy or rent a new TV.
$\delta_1$ = “Buy the TV”
$\delta_2$ = “Rent the TV”
Now, assume $\theta$ is the mean time until the TV breaks down for the first time.
Let $\theta$ assume three possible values: 6, 12 and 24 months.
The cost of the TV is 500 dollars if you buy it and 30 dollars per month if you rent it.
If you bought the TV and it breaks down after 12 months, you will have to replace it at the same cost as you bought it for. If you rented it, you will get a new TV at no cost provided you proceed with your contract.
Let $X$ be the time in months until the TV breaks down and assume this variable is exponentially distributed with mean $\theta$.
A loss function for an ownership of maximum 24 months may be defined as
$L_S(\theta, \delta_1(X)) = 500 + 500 \cdot H(X - 12)$ and $L_S(\theta, \delta_2(X)) = 30 \cdot 24 = 720$,
where $H$ is the Heaviside step function ($H(u) = 1$ for $u > 0$ and $0$ otherwise).
Then
$$R(\theta, \delta_1) = E_X\left[500 + 500 \cdot H(X - 12)\right] = 500 + 500 \int_{12}^{\infty} \theta^{-1} e^{-\theta^{-1} x}\,dx = 500\left(1 + e^{-12/\theta}\right)$$
$$R(\theta, \delta_2) = 720$$
Now compare the risks for the three possible values of $\theta$:

| $\theta$ | $R(\theta, \delta_1)$ | $R(\theta, \delta_2)$ |
|---|---|---|
| 6 | 568 | 720 |
| 12 | 684 | 720 |
| 24 | 803 | 720 |
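The risk comparison can be reproduced numerically. The following is a minimal sketch, assuming the risk functions $500(1 + e^{-12/\theta})$ for buying and the constant 720 for renting; the function and variable names are illustrative only.

```python
import math

THETAS = (6, 12, 24)  # possible mean lifetimes in months


def risk_buy(theta):
    """Risk of buying: 500 * (1 + Pr(X > 12)) for an exponential lifetime with mean theta."""
    return 500.0 * (1.0 + math.exp(-12.0 / theta))


def risk_rent(theta):
    """Risk of renting: 30 per month for 24 months, independent of theta."""
    return 30.0 * 24


def minimax_procedure():
    """Return the name of the procedure with the smallest maximum risk over THETAS."""
    procedures = {"buy": risk_buy, "rent": risk_rent}
    return min(procedures, key=lambda name: max(procedures[name](t) for t in THETAS))


for theta in THETAS:
    print(theta, round(risk_buy(theta)), round(risk_rent(theta)))
print("minimax:", minimax_procedure())
```

Taking the maximum over $\theta$ first and then the minimum over procedures mirrors the minimax definition directly.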
Clearly the risk for the first procedure increases with $\theta$ while the risk for the second is constant. In searching for the minimax procedure we therefore focus on the largest possible value of $\theta$, $\theta = 24$, where $\delta_2$ has the smallest risk (720 against 803).
$\Rightarrow$ $\delta_2$ is the minimax procedure
Bayes procedure
Bayes risk:
$$B(\delta) = \int R(\theta, \delta)\,p(\theta)\,d\theta$$
Uses the prior distribution of the unknown parameter.
A Bayes procedure is a procedure that minimizes the Bayes risk:
$$\delta_B = \arg\min_{\delta} \int R(\theta, \delta)\,p(\theta)\,d\theta$$
Example cont.
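Assume the three possible values of $\theta$ (6, 12 and 24) have the prior probabilities 0.2, 0.3 and 0.5.
Then, with $R(\theta, \delta_1) = 500\left(1 + e^{-12/\theta}\right)$,
$$B(\delta_1) = 500 \cdot \left[\left(1 + e^{-12/6}\right) \cdot 0.2 + \left(1 + e^{-12/12}\right) \cdot 0.3 + \left(1 + e^{-12/24}\right) \cdot 0.5\right] \approx 720.3$$
$$B(\delta_2) = 720 \quad \text{(does not depend on } \theta\text{)}$$
Thus the Bayes risk is minimized, by a narrow margin, by $\delta_2$, and therefore $\delta_2$ is the Bayes procedure under this prior.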
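The Bayes risk for a discrete prior is a weighted sum of the risk function, which is easy to check by machine. A minimal sketch, assuming the risk $R(\theta, \delta_1) = 500(1 + e^{-12/\theta})$ from the example (table values 568, 684, 803) and the prior 0.2, 0.3, 0.5; all names are illustrative.

```python
import math

PRIOR = {6: 0.2, 12: 0.3, 24: 0.5}  # prior probabilities of the three theta values


def risk_buy(theta):
    """R(theta, delta_1) = 500 * (1 + exp(-12 / theta))."""
    return 500.0 * (1.0 + math.exp(-12.0 / theta))


def risk_rent(theta):
    """R(theta, delta_2) = 720 for every theta."""
    return 720.0


def bayes_risk(risk_fn, prior=PRIOR):
    """Prior-weighted average of the risk function."""
    return sum(risk_fn(theta) * prob for theta, prob in prior.items())


def bayes_procedure():
    """Name of the procedure with the smallest Bayes risk."""
    procedures = {"buy": risk_buy, "rent": risk_rent}
    return min(procedures, key=lambda name: bayes_risk(procedures[name]))


print(round(bayes_risk(risk_buy), 2), round(bayes_risk(risk_rent), 2), bayes_procedure())
```

With this prior the two Bayes risks differ by well under one dollar, so the choice of Bayes procedure is sensitive to small changes in the prior probabilities.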
Decision theory applied on point estimation
The action is a particular point estimator.
The state of nature is the true value of $\theta$.
The loss function is a measure of how good (desirable) the estimator is of $\theta$:
$L_S = L_S(\theta, \hat{\theta})$
Prior information is quantified by the prior distribution (p.d.f.) $p(\theta)$.
Data is the random sample $x$ from a distribution with p.d.f. $f(x; \theta)$.
Three simple loss functions
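Zero-one loss:
$$L_S(\theta, \hat{\theta}) = \begin{cases} 0 & |\hat{\theta} - \theta| < b \\ a & |\hat{\theta} - \theta| \ge b \end{cases} \qquad a, b > 0$$
Absolute error loss:
$$L_S(\theta, \hat{\theta}) = a \cdot |\hat{\theta} - \theta|, \qquad a > 0$$
Quadratic (error) loss:
$$L_S(\theta, \hat{\theta}) = a \cdot (\hat{\theta} - \theta)^2, \qquad a > 0$$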
Minimax estimators:
For each estimator in the set of estimators $\hat{\theta}(X)$, find the value of $\theta$ that maximizes the expected loss with respect to the sample values, i.e. that maximizes
$$E_X\left[L_S(\theta, \hat{\theta}(X))\right]$$
Then, the particular estimator that minimizes this maximal risk is the minimax estimator.
Not so easy to find!
Bayes estimators
A Bayes estimator is the estimator that minimizes
$$\int R(\theta, \hat{\theta})\,p(\theta)\,d\theta = \int \left[\int L_S(\theta, \hat{\theta}(x))\,L(\theta; x)\,dx\right] p(\theta)\,d\theta$$
$$= \iint L_S(\theta, \hat{\theta}(x))\,L(\theta; x)\,p(\theta)\,d\theta\,dx = \iint L_S(\theta, \hat{\theta}(x))\,q(\theta; x)\,h(x)\,d\theta\,dx$$
$$= \int h(x) \left[\int L_S(\theta, \hat{\theta}(x))\,q(\theta; x)\,d\theta\right] dx$$
$\Rightarrow$ For any given value of $x$ what has to be minimized is
$$\int L_S(\theta, \hat{\theta}(x))\,q(\theta; x)\,d\theta$$
The Bayes philosophy is that data ($x$) should be considered to be given, and therefore the minimization is carried out conditional on the observed $x$ rather than averaged over other possible samples.
Now minimization with respect to different loss functions will result in measures of location in the posterior distribution of $\theta$.
Zero-one loss: $\hat{\theta}(x)$ is the posterior mode for $\theta$ given $x$
Absolute error loss: $\hat{\theta}(x)$ is the posterior median for $\theta$ given $x$
Quadratic loss: $\hat{\theta}(x)$ is the posterior mean for $\theta$ given $x$
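The link between loss functions and posterior location measures can be checked by brute force. This is a minimal sketch, assuming a hypothetical Beta(3, 5) posterior discretized on a grid; the half-width b = 0.02 of the zero-one loss is also an arbitrary illustrative choice.

```python
def posterior_grid(alpha, beta, n=400):
    """Midpoint grid on (0, 1) with normalized weights of a Beta(alpha, beta) density."""
    step = 1.0 / n
    thetas = [(i + 0.5) * step for i in range(n)]
    dens = [t ** (alpha - 1) * (1 - t) ** (beta - 1) for t in thetas]
    total = sum(dens)
    return thetas, [d / total for d in dens]


def bayes_estimate(loss, thetas, weights):
    """Grid point minimizing the posterior expected loss."""
    def expected_loss(est):
        return sum(w * loss(t, est) for t, w in zip(thetas, weights))
    return min(thetas, key=expected_loss)


def quadratic(theta, est):
    return (est - theta) ** 2


def absolute(theta, est):
    return abs(est - theta)


def zero_one(theta, est, b=0.02):
    return 0.0 if abs(est - theta) < b else 1.0


thetas, weights = posterior_grid(3.0, 5.0)  # hypothetical posterior
est_mean = bayes_estimate(quadratic, thetas, weights)
est_median = bayes_estimate(absolute, thetas, weights)
est_mode = bayes_estimate(zero_one, thetas, weights)
print(est_mean, est_median, est_mode)
```

For Beta(3, 5) the exact mean is 3/8 and the exact mode is 1/3; the grid minimizers should land close to these, with the zero-one result approaching the mode as b shrinks.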
About prior distributions
Conjugate prior distributions
Example: Assume the parameter of interest is $\theta$, the proportion of some property of interest in the population (i.e. the probability for this property to occur).
A reasonable prior density for $\theta$ is the Beta density:
$$p(\theta; \alpha, \beta) = \frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}; \qquad 0 \le \theta \le 1$$
where $\alpha > 0$ and $\beta > 0$ are two (constant) parameters and
$$B(\alpha, \beta) = \int_0^1 x^{\alpha - 1}(1 - x)^{\beta - 1}\,dx = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}$$
is the so-called Beta function.
[Figure: Beta densities on the interval 0 to 1 for Beta(1,1), Beta(5,5), Beta(1,5), Beta(5,1), Beta(2,5), Beta(5,2), Beta(0.5,0.5), Beta(0.3,0.7) and Beta(0.7,0.3)]
Now, assume a sample of size $n$ from the population in which $y$ of the values possess the property of interest.
The likelihood becomes
$$L(\theta; y) = \binom{n}{y}\,\theta^{y}(1 - \theta)^{n - y}$$
so that
$$q(\theta; y) = \frac{L(\theta; y)\,p(\theta)}{\int_0^1 L(x; y)\,p(x)\,dx} = \frac{\binom{n}{y}\,\theta^{y}(1 - \theta)^{n - y}\,\dfrac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}}{\int_0^1 \binom{n}{y}\,x^{y}(1 - x)^{n - y}\,\dfrac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}\,dx}$$
$$= \frac{\theta^{y}(1 - \theta)^{n - y}\,\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{\int_0^1 x^{y}(1 - x)^{n - y}\,x^{\alpha - 1}(1 - x)^{\beta - 1}\,dx} = \frac{\theta^{y + \alpha - 1}(1 - \theta)^{n - y + \beta - 1}}{\int_0^1 x^{y + \alpha - 1}(1 - x)^{n - y + \beta - 1}\,dx}$$
$$= \frac{\theta^{y + \alpha - 1}(1 - \theta)^{n - y + \beta - 1}}{B(y + \alpha,\ n - y + \beta)}$$
Thus, the posterior density is also a Beta density with parameters $y + \alpha$ and $n - y + \beta$.
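The conjugate update can be verified numerically by normalizing likelihood times prior on a grid and comparing with the closed-form Beta posterior. A minimal sketch using only the standard library; the values α = 2, β = 5, n = 10, y = 3 are hypothetical.

```python
import math


def beta_pdf(theta, a, b):
    """Beta(a, b) density, computed on the log scale for stability."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_norm)


def posterior_params(alpha, beta, n, y):
    """Closed-form update: Beta(alpha, beta) prior, y successes in n trials."""
    return alpha + y, beta + n - y


def grid_posterior(theta, alpha, beta, n, y, m=20000):
    """Likelihood x prior at theta, normalized by a midpoint-rule integral over (0, 1)."""
    def unnorm(t):
        return math.comb(n, y) * t ** y * (1 - t) ** (n - y) * beta_pdf(t, alpha, beta)
    norm = sum(unnorm((i + 0.5) / m) for i in range(m)) / m
    return unnorm(theta) / norm


a_post, b_post = posterior_params(2, 5, 10, 3)  # hypothetical prior and data
print(a_post, b_post, beta_pdf(0.3, a_post, b_post), grid_posterior(0.3, 2, 5, 10, 3))
```

The two density values printed last should agree to several decimals, since the binomial coefficient and $B(\alpha, \beta)$ cancel in the normalization.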
Prior distributions that, combined with the likelihood, give a posterior in the same distributional family are named conjugate priors.
(Note that by a distributional family we mean distributions that go under a common name: Normal distribution, Binomial distribution, Poisson distribution etc.)
A conjugate prior always goes together with a particular likelihood to produce the posterior.
We sometimes refer to a conjugate pair of distributions, meaning
(prior distribution, sample distribution = likelihood)
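In particular, if the sample distribution, i.e. $f(x; \theta)$, belongs to the $k$-parameter exponential family of distributions:
$$f(x; \theta) = \exp\left\{\sum_{j=1}^{k} A_j(\theta)\,B_j(x) + C(x) + D(\theta)\right\}$$
we may put
$$p(\theta) = \exp\left\{\sum_{j=1}^{k} A_j(\theta)\,\alpha_j + \alpha_{k+1} D(\theta) + K(\alpha_1, \ldots, \alpha_k, \alpha_{k+1})\right\} \propto \exp\left\{\sum_{j=1}^{k} A_j(\theta)\,\alpha_j + \alpha_{k+1} D(\theta)\right\}$$
where $\alpha_1, \ldots, \alpha_{k+1}$ are parameters of this prior distribution and $K(\cdot)$ is a function of $\alpha_1, \ldots, \alpha_{k+1}$ only.
Then
$$q(\theta; x) \propto L(\theta; x)\,p(\theta) = \exp\left\{\sum_{j=1}^{k} A_j(\theta) \sum_{i=1}^{n} B_j(x_i) + \sum_{i=1}^{n} C(x_i) + nD(\theta)\right\} \cdot \exp\left\{\sum_{j=1}^{k} A_j(\theta)\,\alpha_j + \alpha_{k+1} D(\theta) + K(\alpha_1, \ldots, \alpha_k, \alpha_{k+1})\right\}$$
$$= \exp\left\{\sum_{j=1}^{k} A_j(\theta)\left[\sum_{i=1}^{n} B_j(x_i) + \alpha_j\right] + (n + \alpha_{k+1})\,D(\theta)\right\} \cdot e^{\sum_{i=1}^{n} C(x_i)} \cdot e^{K(\alpha_1, \ldots, \alpha_k, \alpha_{k+1})}$$
$$\propto \exp\left\{\sum_{j=1}^{k} A_j(\theta)\left[\sum_{i=1}^{n} B_j(x_i) + \alpha_j\right] + (n + \alpha_{k+1})\,D(\theta)\right\}$$
i.e. the posterior distribution is of the same form as the prior distribution but with parameters
$$\alpha_1 + \sum_{i=1}^{n} B_1(x_i),\ \ldots,\ \alpha_k + \sum_{i=1}^{n} B_k(x_i),\ \alpha_{k+1} + n$$
instead of
$$\alpha_1, \ldots, \alpha_k, \alpha_{k+1}$$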
Some common cases:
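| Conjugate prior | Sample distribution | Posterior |
|---|---|---|
| Beta: $\theta \sim \text{Beta}(\alpha, \beta)$ | Binomial: $X \sim \text{Bin}(n, \theta)$ | Beta: $\theta \mid x \sim \text{Beta}(\alpha + x,\ \beta + n - x)$ |
| Normal: $\mu \sim N(\varphi, \tau^2)$ | Normal, known $\sigma^2$: $X_i \sim N(\mu, \sigma^2)$ | Normal: $\mu \mid \bar{x} \sim N\left(\dfrac{\sigma^2}{\sigma^2 + n\tau^2}\,\varphi + \dfrac{n\tau^2}{\sigma^2 + n\tau^2}\,\bar{x},\ \dfrac{\sigma^2 \tau^2}{\sigma^2 + n\tau^2}\right)$ |
| Gamma: $\lambda \sim \text{Gamma}(\alpha, \beta)$ | Poisson: $X_i \sim \text{Po}(\lambda)$ | Gamma: $\lambda \mid \sum x_i \sim \text{Gamma}\left(\alpha + \sum x_i,\ \beta + n\right)$ |
| Pareto: $p(\theta) \propto \theta^{-\alpha};\ \theta > \beta$ | Uniform: $X_i \sim U(0, \theta)$ | Pareto: $q(\theta; x) \propto \theta^{-(\alpha + n)};\ \theta > \max(\beta, x_{(n)})$ |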
Example
Assume we have a sample $x = (x_1, \ldots, x_n)$ from $U(0, \theta)$ and that a prior density for $\theta$ is the Pareto density
$$p(\theta) = (\alpha - 1)\,\beta^{\alpha - 1}\,\theta^{-\alpha}, \quad \theta > \beta;\ \alpha > 1,\ \beta > 0$$
What is the Bayes estimator of $\theta$ under quadratic loss?
The Bayes estimator is the posterior mean.
The posterior distribution is also Pareto with
$$q(\theta; x) = (\alpha + n - 1)\left[\max(\beta, x_{(n)})\right]^{\alpha + n - 1}\,\theta^{-(\alpha + n)}, \quad \theta > \max(\beta, x_{(n)})$$
$$E(\theta \mid x) = \int_{\max(\beta, x_{(n)})}^{\infty} \theta\,(\alpha + n - 1)\left[\max(\beta, x_{(n)})\right]^{\alpha + n - 1}\,\theta^{-(\alpha + n)}\,d\theta$$
$$= (\alpha + n - 1)\left[\max(\beta, x_{(n)})\right]^{\alpha + n - 1} \int_{\max(\beta, x_{(n)})}^{\infty} \theta^{-(\alpha + n - 1)}\,d\theta$$
$$= (\alpha + n - 1)\left[\max(\beta, x_{(n)})\right]^{\alpha + n - 1} \left[\frac{\theta^{-(\alpha + n - 2)}}{-(\alpha + n - 2)}\right]_{\max(\beta, x_{(n)})}^{\infty}$$
$$= (\alpha + n - 1)\left[\max(\beta, x_{(n)})\right]^{\alpha + n - 1} \left(0 + \frac{\left[\max(\beta, x_{(n)})\right]^{-(\alpha + n - 2)}}{\alpha + n - 2}\right)$$
$$= \frac{\alpha + n - 1}{\alpha + n - 2} \cdot \max(\beta, x_{(n)}) = \hat{\theta}_B$$
(the integral converges when $\alpha + n > 2$). Compare with $\hat{\theta}_{ML} = x_{(n)}$.
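The closed-form posterior mean for the uniform sample with a Pareto prior can be cross-checked against a numerical integral. A minimal sketch: the sample and the prior parameters are hypothetical, and the numerical check uses the substitution $u = m/\theta$ with $m = \max(\beta, x_{(n)})$, which turns the integral into $(s - 1)\,m \int_0^1 u^{s - 3}\,du$ with $s = \alpha + n$.

```python
def bayes_estimate_uniform(sample, alpha, beta):
    """Posterior mean of theta for U(0, theta) data with a Pareto(alpha, beta) prior."""
    s = alpha + len(sample)
    if s <= 2:
        raise ValueError("the posterior mean requires alpha + n > 2")
    m = max(beta, max(sample))
    return (s - 1) / (s - 2) * m


def posterior_mean_numeric(sample, alpha, beta, steps=100000):
    """Midpoint-rule evaluation of (s - 1) * m * integral_0^1 u**(s - 3) du."""
    s = alpha + len(sample)
    m = max(beta, max(sample))
    h = 1.0 / steps
    integral = sum(((i + 0.5) * h) ** (s - 3) for i in range(steps)) * h
    return (s - 1) * m * integral


sample = [0.8, 2.9, 1.7, 2.2]  # hypothetical observations
print(max(sample), bayes_estimate_uniform(sample, 2, 1), posterior_mean_numeric(sample, 2, 1))
```

The Bayes estimate always exceeds the sample maximum, which is the maximum-likelihood estimate, by the factor $(\alpha + n - 1)/(\alpha + n - 2)$.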
Non-informative priors (uninformative)
A prior distribution that gives no more information about $\theta$ than possibly the parameter space is called a non-informative or uninformative prior.
Example: Beta(1,1) for an unknown proportion $\theta$ simply says that the parameter can be any value between 0 and 1 (which coincides with its definition).
A non-informative prior is characterized by the property that all values in the parameter space are equally likely.
Proper non-informative priors:
The prior is a true density or mass function.
[Figures: two example plots of proper non-informative priors, one with horizontal axis from 1 to 5 and one with horizontal axis from 0 to 1]
Improper non-informative priors:
The prior is a constant value over $R^k$.
Example: $N(\mu, \infty)$ for the mean of a normal population
Decision theory applied on hypothesis testing
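Test of $H_0: \theta = \theta_0$ vs. $H_1: \theta = \theta_1$
Decision procedure: $\delta_C$ = Use a test with critical region $C$
Action: $\delta_C(x)$ = “Reject $H_0$ if $x \in C$, otherwise accept $H_0$”
Loss function:

| | $H_0$ true | $H_1$ true |
|---|---|---|
| Accept $H_0$ | 0 | $b$ |
| Reject $H_0$ | $a$ | 0 |

Risk function:
$$R(\delta_C; \theta) = E_X\left[L_S(\theta, \delta_C(X))\right] = (\text{Loss when rejecting } H_0 \text{ for true value } \theta) \cdot \Pr(X \in C \mid \theta) + (\text{Loss when accepting } H_0 \text{ for true value } \theta) \cdot \Pr(X \notin C \mid \theta)$$
With size $\alpha = \Pr(X \in C \mid \theta_0)$ and $\beta = \Pr(X \notin C \mid \theta_1)$:
$$R(\delta_C; \theta_0) = a \cdot \alpha + 0 \cdot (1 - \alpha) = a\alpha$$
$$R(\delta_C; \theta_1) = 0 \cdot (1 - \beta) + b \cdot \beta = b\beta$$
Assume a prior setting $p_0 = \Pr(H_0 \text{ is true}) = \Pr(\theta = \theta_0)$ and $p_1 = \Pr(H_1 \text{ is true}) = \Pr(\theta = \theta_1)$
$\Rightarrow$ The prior expected risk becomes
$$E_\theta\left[R(\delta_C; \theta)\right] = a\alpha \cdot p_0 + b\beta \cdot p_1$$
Bayes test:
$$\delta_B = \arg\min_C E_\theta\left[R(\delta_C; \theta)\right] = \arg\min_C \left(a\alpha p_0 + b\beta p_1\right)$$
Minimax test:
$$\delta^* = \arg\min_C \left\{\max_\theta R(\delta_C; \theta)\right\} = \arg\min_C \left\{\max(a\alpha,\ b\beta)\right\}$$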
Lemma 6.1: Bayes tests and most powerful tests (Neyman-Pearson lemma) are equivalent in that
every most powerful test is a Bayes test for some values of $p_0$ and $p_1$, and
every Bayes test is a most powerful test, rejecting $H_0$ when
$$\frac{L(\theta_1; x)}{L(\theta_0; x)} \ge \frac{p_0\,a}{p_1\,b}$$
Example:
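Assume $x = (x_1, x_2)$ is a random sample from Exp($\theta$), i.e.
$$f(x; \theta) = \theta^{-1} e^{-\theta^{-1} x}, \quad x \ge 0;\ \theta > 0$$
We would like to test $H_0: \theta = 1$ vs. $H_1: \theta = 2$ with a Bayes test with losses $a = 2$ and $b = 1$ and with prior probabilities $p_0$ and $p_1$.
The Bayes test rejects $H_0$ when
$$\frac{L(\theta_1; x)}{L(\theta_0; x)} = \frac{\theta_1^{-1} e^{-\theta_1^{-1} x_1} \cdot \theta_1^{-1} e^{-\theta_1^{-1} x_2}}{\theta_0^{-1} e^{-\theta_0^{-1} x_1} \cdot \theta_0^{-1} e^{-\theta_0^{-1} x_2}} = \frac{\tfrac{1}{4}\,e^{-(x_1 + x_2)/2}}{e^{-(x_1 + x_2)}} = \tfrac{1}{4}\,e^{(x_1 + x_2)/2} \ge \frac{p_0\,a}{p_1\,b} = 2\,\frac{p_0}{p_1}$$
$$\Leftrightarrow\ x_1 + x_2 \ge 2 \ln\left(\frac{8 p_0}{p_1}\right) = t$$
Now,
$$P(X_1 + X_2 \le t) = \int_{x_1 = 0}^{t} \int_{x_2 = 0}^{t - x_1} \theta^{-1} e^{-\theta^{-1} x_1}\,\theta^{-1} e^{-\theta^{-1} x_2}\,dx_2\,dx_1 = \int_{x_1 = 0}^{t} \theta^{-1} e^{-\theta^{-1} x_1}\left(1 - e^{-\theta^{-1}(t - x_1)}\right) dx_1$$
$$= \int_{x_1 = 0}^{t} \theta^{-1} e^{-\theta^{-1} x_1}\,dx_1 - \theta^{-1} e^{-\theta^{-1} t} \int_{x_1 = 0}^{t} dx_1 = 1 - e^{-\theta^{-1} t} - \theta^{-1}\,t\,e^{-\theta^{-1} t}$$
$\Rightarrow$ for $t > 0$ (i.e. $8 p_0 > p_1$), using $e^{-t/2} = p_1 / (8 p_0)$:
$$\alpha = P(X_1 + X_2 \ge t \mid \theta = 1) = (1 + t)\,e^{-t} = \left(\frac{p_1}{8 p_0}\right)^2 \left(1 + 2 \ln\frac{8 p_0}{p_1}\right)$$
$$\beta = P(X_1 + X_2 < t \mid \theta = 2) = 1 - \left(1 + \frac{t}{2}\right) e^{-t/2} = 1 - \frac{p_1}{8 p_0}\left(1 + \ln\frac{8 p_0}{p_1}\right)$$
A fixed size $\alpha$ gives conditions on $p_0$ and $p_1$, and a certain choice will give a minimized $\beta$.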
Sequential probability ratio test (SPRT)
Suppose that we consider the sampling to be the observation of values in a “stream” $x_1, x_2, \ldots$, i.e. we do not consider a sample with fixed size.
We would like to test $H_0: \theta = \theta_0$ vs. $H_1: \theta = \theta_1$
After $n$ observations have been taken we have $x_n = (x_1, \ldots, x_n)$, and we put
$$LR(n) = \frac{L(\theta_1; x_n)}{L(\theta_0; x_n)}$$
as the current test statistic.
The frequentist approach:
Specify two numbers $K_1$ and $K_2$ not depending on $n$ such that $0 < K_1 < K_2 < \infty$.
Then
If $LR(n) \le K_1$ $\Rightarrow$ Stop sampling, accept $H_0$
If $LR(n) \ge K_2$ $\Rightarrow$ Stop sampling, reject $H_0$
If $K_1 < LR(n) < K_2$ $\Rightarrow$ Take another observation
Usual choice of $K_1$ and $K_2$ (Property 6.3):
If the size $\alpha$ and the power $1 - \beta$ are pre-specified, put
$$K_1 = \frac{\beta}{1 - \alpha} \quad \text{and} \quad K_2 = \frac{1 - \beta}{\alpha}$$
This gives approximate true size $\alpha$ and approximate true power $1 - \beta$.
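The stopping rule above translates directly into a loop over the data stream. A minimal sketch on the log scale, assuming (for illustration only) an exponential model with mean θ, hypotheses θ0 = 1 and θ1 = 2, and the default values α = 0.05 and β = 0.10; the function names are not from the textbook.

```python
import math


def wald_thresholds(alpha, beta):
    """K1 = beta / (1 - alpha) and K2 = (1 - beta) / alpha."""
    return beta / (1.0 - alpha), (1.0 - beta) / alpha


def exp_log_lr(theta0, theta1):
    """Per-observation log f(x; theta1) / f(x; theta0) for an exponential with mean theta."""
    def increment(x):
        return math.log(theta0 / theta1) - x / theta1 + x / theta0
    return increment


def sprt(stream, log_lr_increment, alpha=0.05, beta=0.10):
    """Run the SPRT over an iterable; return (decision, number of observations used)."""
    k1, k2 = wald_thresholds(alpha, beta)
    log_k1, log_k2 = math.log(k1), math.log(k2)
    log_lr, n = 0.0, 0
    for n, x in enumerate(stream, start=1):
        log_lr += log_lr_increment(x)
        if log_lr <= log_k1:
            return "accept H0", n
        if log_lr >= log_k2:
            return "reject H0", n
    return "continue sampling", n  # stream exhausted before a boundary was reached


print(sprt([0.2, 0.1, 0.3, 0.2, 0.1], exp_log_lr(1.0, 2.0)))
```

Accumulating log-likelihood-ratio increments avoids numerical underflow in long streams and is equivalent to comparing LR(n) with K1 and K2.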
The Bayesian approach:
The structure is the same, but the choices of $K_1$ and $K_2$ are different.
Let $c$ be the cost of taking one observation, and let as before $a$ and $b$ be the loss values for taking the wrong decisions, and $p_0$ and $p_1$ be the prior probabilities of $H_0$ and $H_1$ respectively.
Then the Bayesian choices of $K_1$ and $K_2$ are
$$(K_1, K_2) = \arg\min_{k_1, k_2} \left\{ p_0 \left[\frac{1 - k_1}{k_2 - k_1}\,a + c\,\frac{(k_2 - 1)\ln k_1 + (1 - k_1)\ln k_2}{(k_2 - k_1)\,\mu_0}\right] + p_1 \left[\frac{k_1 (k_2 - 1)}{k_2 - k_1}\,b + c\,\frac{k_1 (k_2 - 1)\ln k_1 + k_2 (1 - k_1)\ln k_2}{(k_2 - k_1)\,\mu_1}\right] \right\}$$
where
$$\mu_0 = E\left[\ln\left(\frac{f(X; \theta_1)}{f(X; \theta_0)}\right) \,\Big|\, H_0\right] \quad \text{and} \quad \mu_1 = E\left[\ln\left(\frac{f(X; \theta_1)}{f(X; \theta_0)}\right) \,\Big|\, H_1\right]$$
Bayesian inference
“Very much lies in the posterior distribution”
Bayesian definition of sufficiency:
A statistic $T(x_1, \ldots, x_n)$ is sufficient for $\theta$ if the posterior distribution of $\theta$ given the sample observations $x_1, \ldots, x_n$ is the same as the posterior distribution of $\theta$ given $T$, i.e.
$$q(\theta \mid x) = q(\theta \mid T(x))$$
The Bayesian definition is equivalent to the original definition (Theorem 7.1).
Credible regions
Let $\Omega$ be the parameter space for $\theta$ and let $S_x$ be a subset of $\Omega$.
If
$$\int_{S_x} q(\theta \mid x)\,d\theta = 1 - \alpha$$
then $S_x$ is a $1 - \alpha$ credible region for $\theta$.
For $\theta$ a scalar we refer to it as a credible interval.
The important difference compared to confidence regions:
A credible region is a fixed region (conditional on the sample) to which the random variable $\theta$ belongs with probability $1 - \alpha$.
A confidence region is a random region that covers the fixed $\theta$ with probability $1 - \alpha$.
Highest posterior density (HPD) regions
If for all $\theta_1 \in S_x$, $\theta_2 \notin S_x$
$$q(\theta_1 \mid x) \ge q(\theta_2 \mid x)$$
then $S_x$ is called a $1 - \alpha$ highest posterior density (HPD) credible region.
Equal-tailed credible intervals
An equal-tailed $1 - \alpha$ credible interval is a $1 - \alpha$ credible interval $(a, b)$ such that
$$\Pr(\theta < a \mid x) = \Pr(\theta > b \mid x) = \alpha / 2$$
Example:
In a consignment of 5000 pills suspected to contain the drug Ecstasy, 10 pills are sampled for chemical analysis. Let $\theta$ be the unknown proportion of Ecstasy pills in the consignment and assume we have a high prior belief that this proportion is 100%.
Such a high prior belief can be modelled with a Beta density
$$p(\theta) = \frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}$$
where $\beta$ is set to 1 and $\alpha$ is set to a fairly high value, say 20, i.e. Beta(20, 1).
[Figure: the Beta(20,1) density on the interval 0 to 1]
Now, suppose that after chemical analysis of the 10 pills, all of them turned out to contain Ecstasy.
The posterior density for $\theta$ is Beta($\alpha + x$, $\beta + n - x$) (conjugate prior with binomial sample distribution, as the population is large)
$\Rightarrow$ Beta(20 + 10, 1 + 10 − 10) = Beta(30, 1)
Then a lower-tail 99% credible interval $(a, 1)$ for $\theta$ satisfies
$$\int_a^1 \frac{x^{29}}{B(30, 1)}\,dx = 0.99 \ \Rightarrow\ \frac{1 - a^{30}}{30\,B(30, 1)} = 0.99 \ \Rightarrow\ a = \left(1 - 0.99 \cdot 30\,B(30, 1)\right)^{1/30} \approx 0.858$$
where $B(30, 1) = 1/30$.
Thus with 99% certainty we can state that at least 85.8% of the pills in the consignment contain Ecstasy.
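The lower credible bound has a closed form because the CDF of a Beta(k, 1) distribution is $\theta^k$. A minimal sketch of the calculation; the function names are illustrative, and the closed form applies only when the second Beta parameter equals 1.

```python
def beta_binomial_update(alpha, beta, n, x):
    """Posterior Beta parameters after x successes in n binomial trials."""
    return alpha + x, beta + n - x


def lower_credible_bound(k, level):
    """Lower bound a with Pr(theta >= a) = level under a Beta(k, 1) posterior.

    The Beta(k, 1) CDF is theta**k, so 1 - a**k = level gives a = (1 - level)**(1 / k).
    """
    return (1.0 - level) ** (1.0 / k)


a_post, b_post = beta_binomial_update(20, 1, 10, 10)  # prior Beta(20, 1), 10 of 10 positive
bound = lower_credible_bound(a_post, 0.99)
print(a_post, b_post, round(bound, 3))
```

For a general Beta posterior the bound would instead require a numerical inverse of the Beta CDF.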
Comments:
We have used the binomial distribution for the sample. More correct would be to use the hypergeometric distribution, but the binomial is a good approximation.
For a smaller consignment (i.e. population) we can benefit from using the result that the posterior for the number of pills containing Ecstasy in the rest of the consignment, after removing the sample, is beta-binomial.
This would however give similar results.
If a sample consists 100% of one kind, how would a confidence interval for $\theta$ be obtained?
Bayesian hypothesis testing
The issue is to test $H_0$ vs. $H_1$.
Without specifying the hypotheses further, we seek to judge which of the two hypotheses is, conditional on the sample, the most probable.
Note the difference compared to the classical approach: There we seek to reject $H_0$ in favour of $H_1$ and never the opposite.
The aim of Bayesian hypothesis testing is to determine the posterior “odds”
$$Q^* = \frac{\Pr(H_0 \mid x)}{\Pr(H_1 \mid x)}$$
With $Q^* > 1$ we then say that conditional on the sample $H_0$ is $Q^*$ times more probable than $H_1$, and with $Q^* < 1$ the expression is reversed.
Some review of probability theory:
For two events $A$ and $B$ from a random experiment we have
$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$$
$$\Rightarrow\ \frac{\Pr(A \mid B)}{\Pr(\bar{A} \mid B)} = \frac{\Pr(B \mid A)\,\Pr(A) / \Pr(B)}{\Pr(B \mid \bar{A})\,\Pr(\bar{A}) / \Pr(B)} = \frac{\Pr(B \mid A)}{\Pr(B \mid \bar{A})} \cdot \frac{\Pr(A)}{\Pr(\bar{A})}$$
Actually $\bar{A}$ can be replaced by any other event $C$, but when $\bar{A}$ is used the relation can be written
$$\text{Odds}(A \mid B) = \frac{\Pr(B \mid A)}{\Pr(B \mid \bar{A})} \cdot \text{Odds}(A)$$
This result is usually referred to as Bayes’ theorem on odds form.
Now, it is possible to replace $A$ with $H_0$, $\bar{A}$ with $H_1$ and $B$ with $x$ (the sample)
$$\Rightarrow\ \frac{\Pr(H_0 \mid x)}{\Pr(H_1 \mid x)} = \frac{f(x \mid H_0)}{f(x \mid H_1)} \cdot \frac{\Pr(H_0)}{\Pr(H_1)} \ \Rightarrow\ Q^* = \frac{f(x \mid H_0)}{f(x \mid H_1)} \cdot Q$$
where $Q$ is the prior “odds”.
Note that $\dfrac{f(x \mid H_0)}{f(x \mid H_1)}$ can be written $\dfrac{L(H_0; x)}{L(H_1; x)}$, where $L(H; x)$ is the likelihood of $H$. The concept of likelihood is not restricted to parameters.
Note also that $f(x \mid H_0)$ need not have the same functional form as $f(x \mid H_1)$.
$$B = \frac{f(x \mid H_0)}{f(x \mid H_1)}$$
will be referred to as the Bayes factor, but is sometimes called the likelihood ratio.
To make the comparison with classical hypothesis testing more transparent the posterior odds may be transformed to posterior probabilities for each of the two hypotheses:
$$\Pr(H_0 \mid x) = \frac{Q^*}{Q^* + 1}; \qquad \Pr(H_1 \mid x) = \frac{1}{Q^* + 1}$$
Now, if the posterior probability of $H_1$ is 0.95 this would be a result that could be compared with “$H_0$ is rejected at 5% level of significance”.
However, the two approaches cannot be made equal.
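The conversion between odds and probabilities is a one-line calculation in each direction. A minimal sketch with illustrative function names; the value $Q^* = 1/19$ is chosen because it corresponds to a posterior probability of 0.95 for $H_1$.

```python
def posterior_odds(bayes_factor, prior_odds):
    """Q* = B * Q (Bayes' theorem on odds form)."""
    return bayes_factor * prior_odds


def posterior_probabilities(q_star):
    """Return (Pr(H0 | x), Pr(H1 | x)) from the posterior odds Q*."""
    return q_star / (q_star + 1.0), 1.0 / (q_star + 1.0)


q_star = posterior_odds(bayes_factor=1.0 / 19.0, prior_odds=1.0)
p_h0, p_h1 = posterior_probabilities(q_star)
print(round(p_h0, 4), round(p_h1, 4))
```

With prior odds equal to 1 the posterior odds coincide with the Bayes factor, so the data alone determine the posterior probabilities.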
Another example from forensic science
Assume a crime has been committed where a blood stain was left at the crime scene.
A suspect is identified and a saliva sample is taken from this person. The DNA profiles are compared between the saliva sample and the blood stain and they appear to match (i.e. they are equal).
Put
$H_0$: “The blood stain comes from the suspect”
$H_1$: “The blood stain comes from another person than the suspect”
The Bayes factor becomes
$$B = \frac{\Pr(\text{Matching DNA profiles} \mid H_0)}{\Pr(\text{Matching DNA profiles} \mid H_1)}$$
Now, if laboratory mistakes can be disregarded,
$\Pr(\text{Matching DNA profiles} \mid H_0) = 1$
How about the probability in the denominator of $B$?
This probability relates to the commonness of the current profile among people in general.
Today’s DNA analysis is such that if a full profile is obtained (i.e. if there are no missing DNA markers), the probability is very low, about 1 in 10 million.
$\Rightarrow$ $B$ becomes very large (about $10^7$)
If the suspect was caught for reasons not related to the DNA analysis (as it should be), the prior odds $Q = \Pr(H_0) / \Pr(H_1)$ is probably greater than 1.
If we calculate with $Q = 1$ then $Q^* = B$ and thus very large
$\Rightarrow$ The DNA analysis very strongly supports $H_0$ to be true.
Hypotheses expressed in terms of a parameter
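For the sake of simplicity we express the parameter as a scalar, but the results also apply to multidimensional parameters.
Case 1: $H_0: \theta = \theta_0$, $H_1: \theta = \theta_1$
$$\Pr(H_0) = p_0;\ \Pr(H_1) = p_1 \ \Rightarrow\ Q = \frac{p_0}{p_1}; \qquad B = \frac{f(x; \theta_0)}{f(x; \theta_1)}$$
Example: Let $x$ be an observation from Bin($n$, $\theta$) and $H_0: \theta = \theta_0$ vs. $H_1: \theta = \theta_1$
$$B = \frac{\binom{n}{x}\,\theta_0^{x}(1 - \theta_0)^{n - x}}{\binom{n}{x}\,\theta_1^{x}(1 - \theta_1)^{n - x}} = \left(\frac{\theta_0}{\theta_1}\right)^{x} \left(\frac{1 - \theta_0}{1 - \theta_1}\right)^{n - x}$$
$$\Rightarrow\ Q^* = \left(\frac{\theta_0}{\theta_1}\right)^{x} \left(\frac{1 - \theta_0}{1 - \theta_1}\right)^{n - x} \cdot \frac{p_0}{p_1}$$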
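The binomial Bayes factor for two simple hypotheses is straightforward to evaluate. A minimal sketch; the data (7 successes in 10 trials), the hypothesized values 0.5 and 0.8, and the prior probabilities are hypothetical illustration values.

```python
def binomial_bayes_factor(x, n, theta0, theta1):
    """B = (theta0 / theta1)**x * ((1 - theta0) / (1 - theta1))**(n - x)."""
    return (theta0 / theta1) ** x * ((1.0 - theta0) / (1.0 - theta1)) ** (n - x)


def binomial_posterior_odds(x, n, theta0, theta1, p0, p1):
    """Q* = B * p0 / p1."""
    return binomial_bayes_factor(x, n, theta0, theta1) * p0 / p1


b = binomial_bayes_factor(7, 10, 0.5, 0.8)
q_star = binomial_posterior_odds(7, 10, 0.5, 0.8, p0=0.5, p1=0.5)
print(b, q_star)
```

The binomial coefficient cancels in the ratio, so it never needs to be computed.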
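Case 2: $H_0: \theta \in \omega$, $H_1: \theta \in \Omega - \omega$ $(\Omega \setminus \omega)$
$$\Pr(H_0) = \int_{\omega} p(\theta)\,d\theta = \int_{H_0} p(\theta)\,d\theta; \qquad \Pr(H_1) = \int_{\Omega - \omega} p(\theta)\,d\theta = \int_{H_1} p(\theta)\,d\theta$$
$$B = \frac{\int_{\omega} f(x; \theta)\,p(\theta)\,d\theta}{\int_{\Omega - \omega} f(x; \theta)\,p(\theta)\,d\theta}$$
In $B$, the prior $p(\theta)$ in each integral is to be read as normalized to integrate to 1 over the region of the respective hypothesis; with the unnormalized prior the same ratio equals the posterior odds $Q^*$.
Note that the textbook uses the somewhat redundant notation $p_0(\theta \mid H_0)$ and $p_1(\theta \mid H_1)$ for the prior(s).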
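Case 3: $H_0: \theta = \theta_0$, $H_1: \theta \ne \theta_0$
$$\Pr(H_0) = p(\theta_0); \qquad \Pr(H_1) = \int_{H_1} p(\theta)\,d\theta = \int_{\theta \ne \theta_0} p(\theta)\,d\theta$$
$$B = \frac{f(x; \theta_0)}{\int_{\theta \ne \theta_0} f(x; \theta)\,p(\theta)\,d\theta}$$
It can be shown (Theorem 7.3) that in this case
$$B = \lim_{\theta \to \theta_0} \frac{q(\theta \mid x)}{p(\theta)}$$
(As $\theta \ne \theta_0$ defines the region for $H_1$, the conditioning on $H_1$ in the textbook seems redundant.)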
Nuisance parameters and predictive distributions
If two parameters $\theta$ and $\lambda$ are involved and the parameter of interest is $\theta$, then $\lambda$ is referred to as a nuisance parameter.
The marginal posterior density for $\theta$ is then obtained by integrating out $\lambda$ from the joint posterior density:
$$q(\theta \mid x) = \int_{\Lambda} q(\theta, \lambda \mid x)\,d\lambda$$
where $\Lambda$ is the parameter space for $\lambda$.
Predictive distributions
Let $x_n = (x_1, \ldots, x_n)$ be a random sample from a distribution with p.d.f. $f(x; \theta)$.
Suppose we will take a new observation $x_{n+1}$ and would like to make so-called predictive inference about it. In practice this means that we would like to express the uncertainty about it in terms of a prediction interval.
The marginal p.d.f. of $X_{n+1}$ is the same as that of each variable $X_i$ in the sample, i.e. $f(x; \theta)$.
However, if we want to make use of the sample we should rather study the simultaneous density of $X_{n+1}$ and $\theta \mid x_n$.
Given $\theta$, $X_{n+1}$ and $X_n$ are independent by definition
$\Rightarrow$ the density of $X_{n+1}$ given $\theta$ does not depend on $x_n$, while $\theta \mid x_n$ has the posterior density $q(\theta \mid x_n)$.
The simultaneous density of $X_{n+1}$ and $\theta \mid x_n$ is
$$f(x_{n+1}; \theta) \cdot q(\theta \mid x_n)$$
Now, treat $\theta$ (temporarily) as a nuisance parameter.
The posterior predictive distribution (density) for $X_{n+1}$ given $x_n$ is then
$$g(x_{n+1} \mid x_n) = \int_{\Omega} f(x_{n+1}; \theta)\,q(\theta \mid x_n)\,d\theta$$
$g$ can be used to
• find a point prediction of $X_{n+1}$ as
– the mean of $g$ with quadratic loss, i.e. $\int_{R^1} t \cdot g(t \mid x_n)\,dt$
– the median of $g$ with absolute error loss
– the mode of $g$ with zero-one loss
• compute a $1 - \alpha$ prediction interval for $X_{n+1}$ by solving for $c$ and $d$ in
$$\int_c^d g(t \mid x_n)\,dt = 1 - \alpha$$
usually with equal tails
Example
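The number of calls to a telephone exchange during an hour can usually be shown to follow a Poisson distribution with mean $\lambda$. Assume that a prior for $\lambda$ is a Gamma($a$, $b$)-distribution, i.e. the prior density is
$$p(\lambda) = \frac{b^{a + 1}\,\lambda^{a}\,e^{-b\lambda}}{\Gamma(a + 1)}$$
Now assume that we have observed $x_1$, $x_2$ and $x_3$ calls during each of the three previous hours and we wish to make predictive inference about the number of calls $x_4$ during the current hour.
The posterior distribution is also Gamma with
$$q(\lambda \mid x_1, x_2, x_3) = \frac{(b + 3)^{a + x_1 + x_2 + x_3 + 1}\,\lambda^{a + x_1 + x_2 + x_3}\,e^{-(b + 3)\lambda}}{\Gamma(a + x_1 + x_2 + x_3 + 1)} = \frac{(b + 3)^{a + t + 1}\,\lambda^{a + t}\,e^{-(b + 3)\lambda}}{\Gamma(a + t + 1)}$$
with $t = x_1 + x_2 + x_3$.
Now,
$$g(x_4 \mid x_1, x_2, x_3) = \int_0^{\infty} \frac{\lambda^{x_4}}{x_4!}\,e^{-\lambda} \cdot \frac{(b + 3)^{a + t + 1}\,\lambda^{a + t}\,e^{-(b + 3)\lambda}}{\Gamma(a + t + 1)}\,d\lambda$$
$$= \frac{1}{x_4!} \int_0^{\infty} \frac{(b + 3)^{a + t + 1}\,\lambda^{a + t + x_4}\,e^{-(b + 4)\lambda}}{\Gamma(a + t + 1)}\,d\lambda$$
$$= \frac{1}{x_4!} \cdot \frac{\Gamma(a + t + x_4 + 1)}{\Gamma(a + t + 1)} \cdot \frac{(b + 3)^{a + t + 1}}{(b + 4)^{a + t + x_4 + 1}} \cdot \underbrace{\int_0^{\infty} \frac{(b + 4)^{a + t + x_4 + 1}\,\lambda^{a + t + x_4}\,e^{-(b + 4)\lambda}}{\Gamma(a + t + x_4 + 1)}\,d\lambda}_{=\,1\ \text{(Gamma density)}}$$
$$= \left(\frac{b + 3}{b + 4}\right)^{a + t + 1} \cdot \frac{\Gamma(a + t + x_4 + 1)}{\Gamma(a + t + 1)} \cdot \frac{1}{(b + 4)^{x_4}\,x_4!}$$
and e.g. a point prediction under quadratic loss of the number of calls is obtained by
$$\sum_{x = 0}^{\infty} x \cdot \left(\frac{b + 3}{b + 4}\right)^{a + t + 1} \cdot \frac{\Gamma(a + t + x + 1)}{\Gamma(a + t + 1)} \cdot \frac{1}{(b + 4)^{x}\,x!} = \frac{a + t + 1}{b + 3}$$
which equals the posterior mean of $\lambda$.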
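The predictive probabilities for the number of calls can be evaluated on the log scale with the standard library's log-gamma function. A minimal sketch, assuming the prior density proportional to $\lambda^{a} e^{-b\lambda}$ used in the example; the values a = 2, b = 1 and t = 12 are hypothetical.

```python
import math


def predictive_pmf(x_new, a, b, t, n=3):
    """g(x_new | data) for Poisson counts: prior prop. to lam**a * exp(-b * lam), n hours, total t."""
    shape = a + t + 1   # posterior Gamma shape
    rate = b + n        # posterior Gamma rate
    log_g = (math.lgamma(shape + x_new) - math.lgamma(shape) - math.lgamma(x_new + 1)
             + shape * math.log(rate / (rate + 1.0)) - x_new * math.log(rate + 1.0))
    return math.exp(log_g)


def predictive_mean(a, b, t, n=3, upper=200):
    """Point prediction under quadratic loss: truncated sum of x * g(x | data)."""
    return sum(x * predictive_pmf(x, a, b, t, n) for x in range(upper + 1))


a, b, t = 2, 1, 12  # hypothetical prior parameters and observed total number of calls
total = sum(predictive_pmf(x, a, b, t) for x in range(201))
print(total, predictive_mean(a, b, t), (a + t + 1) / (b + 3))
```

The truncated sum of the probabilities should be numerically 1 and the predictive mean should match $(a + t + 1)/(b + 3)$; the truncation point 200 is an arbitrary choice that is ample for these values.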
Empirical Bayes
“Something between the Bayesian and the frequentist approach”
General idea:
Use some of the available data (sample) to estimate the prior for $\theta$
Use the rest of the available data to make inference about $\theta$
Parametric set-up:
Let data-point $i$ be represented by the bivariate random variable $(X_i, \theta_i)$, $i = 1, 2, \ldots, k$
Let $f(x; \theta_i)$ be the p.d.f. for $X_i$ and $p(\theta)$ be the marginal density of $\theta_i$
$f$ is assumed to be known apart from $\theta$ and $p$ is assumed to be unknown.
$\theta_1, \ldots, \theta_k$ are unobservable quantities.
With Empirical Bayes we try to make inference about $\theta_k$
Use $x_1, \ldots, x_{k-1}$ to estimate $p(\theta)$
Then make inference about $\theta_k$ through its posterior distribution conditional on $x_k$
$$q(\theta_k \mid x_k) = \frac{f(x_k; \theta_k)\,\hat{p}(\theta_k)}{\int f(x_k; \theta)\,\hat{p}(\theta)\,d\theta}$$
the usual way:
• point estimation under certain choices of loss functions
• credible regions
• hypothesis testing
How to estimate $p(\theta)$?
The marginal density for $X_i$ is
$$f(x) = \int f(x; \theta)\,p(\theta)\,d\theta$$
Let $p = p(\theta; \psi)$ where $\psi$ is a (multidimensional) parameter defining the prior, assuming the functional form of $p$ is known.
Method of moments:
Put up the equations
$$\int x f(x)\,dx = E(X) = E_{\psi}\left[E(X \mid \theta)\right] = \iint x f(x; \theta)\,dx\ p(\theta; \psi)\,d\theta$$
$$Var(X) = E_{\psi}\left[Var(X \mid \theta)\right] + Var_{\psi}\left[E(X \mid \theta)\right]$$
from which an estimate of $\psi$ can be deduced.
Maximum-Likelihood:
$\psi$ is estimated as
$$\hat{\psi} = \arg\max_{\psi} L(\psi; x_1, \ldots, x_{k-1}) = \arg\max_{\psi} \prod_{i=1}^{k-1} \int f(x_i; \theta)\,p(\theta; \psi)\,d\theta$$
Apparently, the textbook is not correct here, as the estimation of $p$ should be based on the first $k - 1$ observations and the $k$th should be excluded from that stage.
See further the textbook for simplifications of the MLE procedure.
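The method-of-moments route can be illustrated with a concrete parametric choice. The sketch below assumes, purely for illustration, Poisson observations $X_i \mid \theta_i$ and a Gamma prior with shape κ and rate ρ, so that $E(X) = \kappa/\rho$ and $Var(X) = \kappa/\rho + \kappa/\rho^2$; neither this model nor the data come from the textbook.

```python
from statistics import mean, variance


def fit_gamma_prior(xs):
    """Method-of-moments estimate (shape, rate) of a Gamma prior from Poisson counts xs."""
    m, v = mean(xs), variance(xs)
    if v <= m:
        raise ValueError("moment equations need sample variance > sample mean")
    rate = m / (v - m)   # from Var(X) - E(X) = shape / rate**2
    shape = m * rate     # from E(X) = shape / rate
    return shape, rate


def eb_posterior_mean(x_k, shape, rate):
    """Posterior mean of theta_k given x_k under the estimated Gamma prior."""
    return (shape + x_k) / (rate + 1.0)


first_k_minus_1 = [2.0, 7.0, 4.0, 10.0, 1.0, 6.0]  # hypothetical x_1, ..., x_{k-1}
shape_hat, rate_hat = fit_gamma_prior(first_k_minus_1)
print(shape_hat, rate_hat, eb_posterior_mean(8.0, shape_hat, rate_hat))
```

Only the first k − 1 observations enter the prior estimate; the k-th observation is used solely in the posterior step, and the resulting estimate is shrunk from $x_k$ toward the mean of the earlier observations.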