Chapter 15: Likelihood, Bayesian, and Decision Theory
AMS 572
Group Members
Yen-hsiu Chen, Valencia Joseph, Lola Ojo,
Andrea Roberson, Dave Roelfs,
Saskya Sauer, Olivia Shy, Ping Tung
Introduction
"To call in the statistician after the experiment is done may be no more than
asking him to perform a post-mortem examination: he may be able to say what
the experiment died of."
- R.A. Fisher

Maximum likelihood, Bayesian, and decision theory are widely applied and
have proven themselves useful and necessary in the sciences, such as
physics, as well as in research in general.

They provide a practical way to begin and carry out an
analysis or experiment.
15.1 Maximum Likelihood Estimation
15.1.1 Likelihood Function

Objective: estimating the unknown parameters θ of a population
distribution based on a random sample $x_1, \ldots, x_n$ from that
distribution

Previous chapters : Intuitive Estimates
=> Sample Means for Population Mean

To improve estimation, R. A. Fisher (1890~1962)
proposed MLE in 1912~1922.
Ronald Aylmer Fisher (1890~1962)
"The greatest of Darwin's successors"

Known for:
• 1912 : Maximum likelihood
• 1922 : F-test
• 1925 : Analysis of variance (Statistical Methods for Research Workers)

Notable Prizes:
• Royal Medal (1938)
• Copley Medal (1955)

Source: http://www-history.mcs.standrews.ac.uk/history/PictDisplay/Fisher.html
Joint p.d.f. vs. Likelihood Function
• Identical quantities
• Different interpretation

Joint p.d.f. of $X_1, \ldots, X_n$:
• A function of $x_1, \ldots, x_n$ for given θ
• Probability interpretation
$$f(x_1, \ldots, x_n \mid \theta) = f(x_1 \mid \theta)\, f(x_2 \mid \theta) \cdots f(x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

Likelihood function of θ:
• A function of θ for given $x_1, \ldots, x_n$
• No probability interpretation
$$L(\theta \mid x_1, \ldots, x_n) = f(x_1, \ldots, x_n \mid \theta) = f(x_1 \mid \theta) \cdots f(x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
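Example: Normal Distribution

Suppose $x_1, \ldots, x_n$ is a random sample from a normal distribution with p.d.f.:
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$$

With parameter $(\mu, \sigma^2)$, the likelihood function is:
$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \left[\frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}\right] = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right\}$$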
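As a quick numerical illustration of the normal likelihood above, the following sketch evaluates it both as a product of densities and in the closed form. The sample values and the function names are hypothetical and chosen only for illustration.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Normal density f(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def likelihood_product(data, mu, sigma2):
    """L(mu, sigma^2) as the product of the individual densities."""
    result = 1.0
    for x in data:
        result *= normal_pdf(x, mu, sigma2)
    return result

def likelihood_closed_form(data, mu, sigma2):
    """L(mu, sigma^2) in the closed form with the sum of squares in the exponent."""
    n = len(data)
    ss = sum((x - mu) ** 2 for x in data)
    return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(-ss / (2 * sigma2))

sample = [4.9, 5.3, 5.1, 4.7, 5.0]  # hypothetical data, not from the slides

# The same data give different likelihood values for different parameter values.
for mu in (4.0, 5.0, 6.0):
    print(mu, likelihood_product(sample, mu, 0.04), likelihood_closed_form(sample, mu, 0.04))
```

The data stay fixed while the parameter varies, which is exactly the change of viewpoint from joint p.d.f. to likelihood function.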
15.1.2 Calculation of
Maximum Likelihood Estimators (MLE)

MLE of an unknown parameter θ:
The value $\hat\theta = \hat\theta(x_1, \ldots, x_n)$ which maximizes the likelihood
function $L(\theta \mid x_1, \ldots, x_n)$

Example of MLE:
• 2 independent Bernoulli trials with success probability θ
• θ is known to be either 1/4 or 1/3
=> parameter space Θ = {1/4, 1/3}
• Using the binomial distribution, the probabilities of observing
x = 0, 1, 2 successes can be calculated
Example of MLE

Probability of observing x successes (x = the number of successes)

θ (parameter space Θ)    x = 0    x = 1    x = 2
1/4                      9/16     6/16     1/16
1/3                      4/9      4/9      1/9

• When x = 0, the MLE of θ: $\hat\theta = 1/4$
• When x = 1 or 2, the MLE of θ: $\hat\theta = 1/3$
• The MLE $\hat\theta$ is chosen to maximize $L(\theta \mid x)$ for the observed x
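The table and the resulting MLEs can be reproduced with a short sketch. Exact fractions are used so the entries match the slide; the helper names `binom_pmf` and `mle` are illustrative, not from the text.

```python
from fractions import Fraction
from math import comb

PARAMETER_SPACE = [Fraction(1, 4), Fraction(1, 3)]

def binom_pmf(x, n, theta):
    """P(X = x) for X ~ Binomial(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)

def mle(x, n=2):
    """Value of theta in the two-point parameter space that maximizes L(theta | x)."""
    return max(PARAMETER_SPACE, key=lambda theta: binom_pmf(x, n, theta))

for x in range(3):
    row = [str(binom_pmf(x, 2, theta)) for theta in PARAMETER_SPACE]
    print("x =", x, "likelihoods:", row, "MLE:", mle(x))
```

Note that `Fraction` prints values in lowest terms, so 6/16 appears as 3/8.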
15.1.3 Properties of MLE’s

Objective: optimality properties in large samples

Fisher information (continuous case):
$$I(\theta) = \int \left[\frac{d \ln f(x \mid \theta)}{d\theta}\right]^2 f(x \mid \theta)\,dx = E\left\{\left[\frac{d \ln f(X \mid \theta)}{d\theta}\right]^2\right\}$$
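Alternatives of Fisher information

(1)
$$I(\theta) = E\left\{\left[\frac{d \ln f(X \mid \theta)}{d\theta}\right]^2\right\} = \mathrm{Var}\left[\frac{d \ln f(X \mid \theta)}{d\theta}\right]$$

(2)
$$I(\theta) = -\int \frac{d^2 \ln f(x \mid \theta)}{d\theta^2}\, f(x \mid \theta)\,dx = -E\left[\frac{d^2 \ln f(X \mid \theta)}{d\theta^2}\right]$$

Proof of (1): start from
$$\int f(x \mid \theta)\,dx = 1$$
and differentiate with respect to θ:
$$\int \frac{d f(x \mid \theta)}{d\theta}\,dx = \frac{d}{d\theta}(1) = 0$$
$$\int \frac{d f(x \mid \theta)}{d\theta}\,dx = \int \frac{d f(x \mid \theta)}{d\theta}\,\frac{1}{f(x \mid \theta)}\, f(x \mid \theta)\,dx = \int \frac{d \ln f(x \mid \theta)}{d\theta}\, f(x \mid \theta)\,dx = E\left[\frac{d \ln f(X \mid \theta)}{d\theta}\right] = 0$$
Since the mean of $d \ln f(X \mid \theta)/d\theta$ is zero, its second moment equals its variance, which gives (1).

Proof of (2): differentiating
$$\int \frac{d \ln f(x \mid \theta)}{d\theta}\, f(x \mid \theta)\,dx = 0$$
with respect to θ gives
$$\int \left[\frac{d^2 \ln f(x \mid \theta)}{d\theta^2}\, f(x \mid \theta) + \frac{d \ln f(x \mid \theta)}{d\theta}\,\frac{d f(x \mid \theta)}{d\theta}\right] dx = 0$$
$$\int \left[\frac{d^2 \ln f(x \mid \theta)}{d\theta^2} + \frac{d \ln f(x \mid \theta)}{d\theta}\,\frac{1}{f(x \mid \theta)}\,\frac{d f(x \mid \theta)}{d\theta}\right] f(x \mid \theta)\,dx = 0$$
$$\int \left\{\frac{d^2 \ln f(x \mid \theta)}{d\theta^2} + \left[\frac{d \ln f(x \mid \theta)}{d\theta}\right]^2\right\} f(x \mid \theta)\,dx = 0$$
so that
$$I(\theta) = E\left\{\left[\frac{d \ln f(X \mid \theta)}{d\theta}\right]^2\right\} = -E\left[\frac{d^2 \ln f(X \mid \theta)}{d\theta^2}\right]$$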
MLE (Continued)

Define the Fisher information for an i.i.d. sample:
let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from p.d.f. $f(x \mid \theta)$. Then
$$I_n(\theta) = -E\left[\frac{d^2 \ln f(X_1, X_2, \ldots, X_n \mid \theta)}{d\theta^2}\right]$$
$$= -E\left\{\frac{d^2}{d\theta^2}\left[\ln f(X_1 \mid \theta) + \ln f(X_2 \mid \theta) + \cdots + \ln f(X_n \mid \theta)\right]\right\}$$
$$= -E\left[\frac{d^2 \ln f(X_1 \mid \theta)}{d\theta^2}\right] - E\left[\frac{d^2 \ln f(X_2 \mid \theta)}{d\theta^2}\right] - \cdots - E\left[\frac{d^2 \ln f(X_n \mid \theta)}{d\theta^2}\right]$$
$$= I(\theta) + I(\theta) + \cdots + I(\theta) = nI(\theta)$$
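The equivalent forms of the Fisher information and the relation $I_n(\theta) = nI(\theta)$ can be checked numerically. The sketch below uses a single Bernoulli(θ) observation as an assumed example (it is a discrete case, so the integral becomes a sum over x = 0, 1), for which $I(\theta) = 1/[\theta(1-\theta)]$. The values of θ and n are arbitrary.

```python
import math

def score(x, theta):
    """d/dtheta of ln f(x | theta) for one Bernoulli(theta) observation."""
    return x / theta - (1 - x) / (1 - theta)

def curvature(x, theta):
    """d^2/dtheta^2 of ln f(x | theta) for one Bernoulli(theta) observation."""
    return -x / theta**2 - (1 - x) / (1 - theta) ** 2

def expect(g, theta):
    """Exact E[g(X, theta)] for X ~ Bernoulli(theta), summing over x = 0, 1."""
    return (1 - theta) * g(0, theta) + theta * g(1, theta)

theta, n = 0.3, 25  # illustrative values only

mean_score = expect(score, theta)                                   # should be 0
info_squared_score = expect(lambda x, t: score(x, t) ** 2, theta)   # E[score^2]
info_curvature = -expect(curvature, theta)                          # -E[second derivative]
info_sample = n * info_curvature                                    # I_n(theta) = n I(theta)

print(mean_score, info_squared_score, info_curvature, info_sample)
```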



MLE (Continued)
• Generalization of the Fisher information for a k-dimensional vector parameter
The p.d.f. of an r.v. X is $f(x \mid \theta)$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$.
The information matrix of θ, $I(\theta)$, is given by
$$I_{ij}(\theta) = E\left\{\left[\frac{\partial \ln f(X \mid \theta)}{\partial \theta_i}\right]\left[\frac{\partial \ln f(X \mid \theta)}{\partial \theta_j}\right]\right\} = -E\left[\frac{\partial^2 \ln f(X \mid \theta)}{\partial \theta_i\, \partial \theta_j}\right]$$
MLE (Continued)
• Cramér-Rao Lower Bound
Let $X_1, X_2, \ldots, X_n$ be a random sample from p.d.f. $f(x \mid \theta)$.
Let $\hat\theta$ be any estimator of θ with $E(\hat\theta) = \theta + B(\theta)$, where $B(\theta)$ is the
bias of $\hat\theta$. If $B(\theta)$ is differentiable in θ and if certain regularity
conditions hold, then
$$\mathrm{Var}(\hat\theta) \ge \frac{[1 + B'(\theta)]^2}{nI(\theta)}$$
(Cramér-Rao inequality)
The ratio of the lower bound to the variance of any estimator of θ
is called the efficiency of the estimator.
An estimator with efficiency 1 is called an efficient estimator.
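15.1.4 Large Sample Inference Based on the MLE’s
Large sample inference on the unknown parameter θ uses
$$\mathrm{Var}(\hat\theta) \approx \frac{1}{nI(\theta)}$$
where $I(\theta)$ is estimated by
$$I(\hat\theta) = -\frac{1}{n}\sum_{i=1}^{n}\left[\frac{d^2 \ln f(X_i \mid \theta)}{d\theta^2}\right]_{\theta = \hat\theta}$$
100(1-α)% CI for θ:
$$\hat\theta - z_{\alpha/2}\,\frac{1}{\sqrt{nI(\hat\theta)}} \le \theta \le \hat\theta + z_{\alpha/2}\,\frac{1}{\sqrt{nI(\hat\theta)}}$$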
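As a sketch of the large-sample interval above, assume a Poisson(θ) sample (an example chosen here, not taken from this slide). Then $\ln f(x \mid \theta) = x\ln\theta - \theta - \ln x!$, the second derivative is $-x/\theta^2$, and $\hat\theta = \bar{x}$. The counts below are hypothetical.

```python
from statistics import NormalDist, mean

def poisson_mle_ci(counts, alpha=0.05):
    """Large-sample 100(1-alpha)% CI for a Poisson mean, based on the MLE."""
    n = len(counts)
    theta_hat = mean(counts)  # MLE of the Poisson mean
    # I(theta_hat) = -(1/n) * sum of d^2/dtheta^2 ln f(x_i | theta) at theta_hat
    info_hat = sum(x / theta_hat**2 for x in counts) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 point of N(0, 1)
    half_width = z / (n * info_hat) ** 0.5
    return theta_hat - half_width, theta_hat + half_width

counts = [3, 1, 4, 2, 0, 5, 2, 3, 1, 4]  # hypothetical data
lower, upper = poisson_mle_ci(counts)
print(lower, upper)
```

For the Poisson case the estimated information simplifies to $1/\bar{x}$, so the interval is $\bar{x} \pm z_{\alpha/2}\sqrt{\bar{x}/n}$.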
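15.1.5 Delta Method for Approximating the Variance of an Estimator

Delta method: estimate a nonlinear function $h(\theta)$.
Suppose that $E(\hat\theta) \approx \theta$ and $\mathrm{Var}(\hat\theta)$ is a known function of θ.
Expand $h(\hat\theta)$ around θ using a first-order Taylor series:
$$h(\hat\theta) \approx h(\theta) + (\hat\theta - \theta)\,h'(\theta)$$
Using $E(\hat\theta - \theta) \approx 0$,
$$\mathrm{Var}[h(\hat\theta)] \approx [h'(\theta)]^2\, \mathrm{Var}(\hat\theta)$$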
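A minimal sketch of the delta method, assuming the sample proportion $\hat{p}$ from n Bernoulli trials with $\mathrm{Var}(\hat{p}) = p(1-p)/n$ and the log-odds $h(p) = \ln[p/(1-p)]$ as the nonlinear function. Both choices are illustrative. The derivative is taken numerically so that any smooth h can be plugged in.

```python
import math

def delta_method_variance(h, theta, var_theta, eps=1e-6):
    """Approximate Var[h(theta_hat)] by [h'(theta)]^2 * Var(theta_hat).

    h'(theta) is approximated with a central difference.
    """
    derivative = (h(theta + eps) - h(theta - eps)) / (2 * eps)
    return derivative**2 * var_theta

def log_odds(p):
    return math.log(p / (1 - p))

n, p = 100, 0.3  # illustrative values
approx_var = delta_method_variance(log_odds, p, p * (1 - p) / n)
print(approx_var)
```

Here $h'(p) = 1/[p(1-p)]$, so the approximation reduces to $1/[np(1-p)]$.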
15.2 Likelihood Ratio Tests
The last section presented inference for point estimation based on
likelihood theory. In this section, we present the corresponding
inference for testing hypotheses.
Let $f(x; \theta)$ be a probability density function where θ is a real-valued
parameter taking values in an interval Ω that could be the whole real
line. We call Ω the parameter space. An alternative hypothesis $H_1$ will
restrict the parameter θ to some subset $\Omega_1$ of the parameter space Ω.
The null hypothesis $H_0$ is then the complement of $\Omega_1$ with respect to Ω.

• Consider the two-sided hypothesis
$H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$, where $\theta_0$ is a specified value.
We will test $H_0$ versus $H_1$ on the basis of the random sample
$X_1, X_2, \ldots, X_n$ from $f(x; \theta)$. If the null hypothesis holds, we
would expect the likelihood $L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$ to be
relatively large when evaluated at the prevailing value $\theta_0$.
Consider the ratio of two likelihood functions, namely
$$\lambda = \frac{L(\theta_0)}{L(\hat\theta)}$$
Note that $\lambda \le 1$, but if $H_0$ is true λ should be close to 1,
while if $H_1$ is true, λ should be smaller. For a specified
significance level α, we have the decision rule: reject $H_0$
in favor of $H_1$ if $\lambda \le c$, where c is such that $\alpha = P_{\theta_0}[\lambda \le c]$.
This test is called the likelihood ratio test.
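Example 1
Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from a normal
distribution with known variance. Obtain the likelihood ratio for
testing $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$.
$$L(\mu \mid X_1, \ldots, X_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = (2\pi\sigma^2)^{-n/2}\, e^{-\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}}$$
$$\ln L(\mu) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{\sum (x_i-\mu)^2}{2\sigma^2}$$
$$\frac{\partial}{\partial\mu}\ln L(\mu) = \frac{\sum (x_i-\mu)}{\sigma^2} = 0$$
So $\hat\mu = \bar{x}$ is a maximum since
$$\frac{\partial^2}{\partial\mu^2}\ln L(\mu) = -\frac{n}{\sigma^2} < 0.$$
Thus $\hat\mu = \bar{x}$ is the MLE of μ.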
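Example 1 (continued)
$$\lambda = \frac{L(\mu_0)}{L(\hat\mu)} = \frac{(2\pi\sigma^2)^{-n/2}\, e^{-\frac{\sum (x_i-\mu_0)^2}{2\sigma^2}}}{(2\pi\sigma^2)^{-n/2}\, e^{-\frac{\sum (x_i-\bar{x})^2}{2\sigma^2}}}$$
$$= \exp\left\{-\frac{\sum (x_i-\mu_0)^2 - \sum (x_i-\bar{x})^2}{2\sigma^2}\right\}$$
$$= \exp\left\{-\frac{\sum [(x_i-\bar{x}) + (\bar{x}-\mu_0)]^2 - \sum (x_i-\bar{x})^2}{2\sigma^2}\right\}$$
$$= \exp\left\{-\frac{\sum (x_i-\bar{x})^2 + 2(\bar{x}-\mu_0)\sum (x_i-\bar{x}) + n(\bar{x}-\mu_0)^2 - \sum (x_i-\bar{x})^2}{2\sigma^2}\right\}$$
$$= \exp\left\{-\frac{n(\bar{x}-\mu_0)^2}{2\sigma^2}\right\} = e^{-z_0^2/2}, \quad \text{where } z_0 = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}$$
(the cross term vanishes because $\sum (x_i - \bar{x}) = 0$).
So $\lambda \le c$ is equivalent to $e^{-z_0^2/2} \le c$, or $z_0^2 \ge c^*$, thus $|z_0| \ge c^{**}$.
Since $P(|Z_0| \ge c^{**}) = \alpha$ under $H_0$, we have $c^{**} = z_{\alpha/2}$.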
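The result of Example 1 can be checked numerically: the likelihood ratio computed directly from the two likelihoods should agree with $e^{-z_0^2/2}$. The data, $\mu_0$, and σ below are hypothetical, and the function names are illustrative.

```python
import math
from statistics import NormalDist, mean

def log_likelihood(data, mu, sigma):
    """ln L(mu) for a normal sample with known standard deviation sigma."""
    n = len(data)
    ss = sum((x - mu) ** 2 for x in data)
    return -n / 2 * math.log(2 * math.pi * sigma**2) - ss / (2 * sigma**2)

def lrt_known_variance(data, mu0, sigma, alpha=0.05):
    """Return (lambda, z0, reject) for H0: mu = mu0 versus H1: mu != mu0."""
    n = len(data)
    z0 = (mean(data) - mu0) / (sigma / math.sqrt(n))
    lam = math.exp(-z0**2 / 2)
    reject = abs(z0) >= NormalDist().inv_cdf(1 - alpha / 2)  # |z0| >= z_{alpha/2}
    return lam, z0, reject

data = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.3, 10.0]  # hypothetical sample
lam, z0, reject = lrt_known_variance(data, mu0=10.0, sigma=0.25)

# Direct ratio L(mu0) / L(mu_hat) with mu_hat = sample mean
lam_direct = math.exp(log_likelihood(data, 10.0, 0.25) - log_likelihood(data, mean(data), 0.25))
print(lam, lam_direct, z0, reject)
```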
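Example 2
Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution
with mean $\theta > 0$.
a. Show that the likelihood ratio test of $H_0: \theta = \theta_0$
versus $H_1: \theta \ne \theta_0$ is based upon the statistic $Y = \sum x_i$.
Obtain the null distribution of Y.
$$L(\theta) = \prod_{i=1}^{n} \frac{\theta^{x_i} e^{-\theta}}{x_i!} = \frac{\theta^{\sum x_i}\, e^{-n\theta}}{\prod x_i!}$$
$$\ln L(\theta) = \sum x_i \ln\theta - n\theta - \ln \prod x_i!$$
$$\frac{\partial}{\partial\theta}\ln L(\theta) = \frac{\sum x_i}{\theta} - n = 0$$
So $\hat\theta = \bar{x}$ is a maximum since
$$\frac{\partial^2}{\partial\theta^2}\ln L(\theta)\Big|_{\theta=\hat\theta} = -\frac{\sum x_i}{\hat\theta^2} = -\frac{n\hat\theta}{\hat\theta^2} = -\frac{n}{\hat\theta} < 0,$$
thus $\hat\theta = \bar{x}$ is the mle of θ.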
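Example 2 (continued)
The likelihood ratio test statistic is:
$$\lambda = \frac{L(\theta_0)}{L(\hat\theta)} = \frac{\theta_0^{\sum x_i}\, e^{-n\theta_0} / \prod x_i!}{\hat\theta^{\sum x_i}\, e^{-n\hat\theta} / \prod x_i!} = \left(\frac{\theta_0}{\hat\theta}\right)^{\sum x_i} e^{n\hat\theta - n\theta_0} = \left(\frac{n\theta_0}{\sum x_i}\right)^{\sum x_i} e^{\sum x_i - n\theta_0}$$
And it's a function of $Y = \sum x_i$.
Under $H_0$,
$$X_1, X_2, \ldots, X_n \sim \text{Poisson}(\theta_0) \;\Rightarrow\; Y \sim \text{Poisson}(n\theta_0)$$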
Example 2 (continued)
b. For $\theta_0 = 2$ and n = 5, find the significance level of the test that
rejects $H_0$ if $y \le 4$ or $y \ge 17$.
The null distribution of Y is Poisson(10).
$$\alpha = P_{H_0}(Y \le 4) + P_{H_0}(Y \ge 17) = P_{H_0}(Y \le 4) + 1 - P_{H_0}(Y \le 16)$$
$$= .029 + 1 - .973 = .056$$
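The significance level in part b can be recomputed from the Poisson(10) null distribution. This is a small sketch using only the standard library; the helper name `poisson_cdf` is illustrative.

```python
import math

def poisson_cdf(k, mean):
    """P(Y <= k) for Y ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean**j / math.factorial(j) for j in range(k + 1))

theta0, n = 2, 5
null_mean = n * theta0  # Y = sum of the X_i is Poisson(n * theta0) under H0

# Reject H0 if y <= 4 or y >= 17
alpha = poisson_cdf(4, null_mean) + 1 - poisson_cdf(16, null_mean)
print(round(alpha, 3))
```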
Composite Null Hypothesis
The likelihood ratio approach has to be modified slightly when the null
hypothesis is composite. When testing the null hypothesis $H_0: \mu = \mu_0$
concerning a normal mean when $\sigma^2$ is unknown, the parameter space
$\Omega = \{(\mu, \sigma^2) : -\infty < \mu < \infty,\; 0 < \sigma^2 < \infty\}$ is a subset of $R^2$.
The null hypothesis is composite and $\Omega_0 = \{(\mu, \sigma^2) : \mu = \mu_0,\; 0 < \sigma^2 < \infty\}$.
Since the null hypothesis is composite, it isn’t certain which value of
the parameter(s) prevails even under $H_0$. So we take the maximum of the
likelihood over $\Omega_0$.
The generalized likelihood ratio test statistic is defined as
$$\lambda = \frac{\max_{\theta \in \Omega_0} L(\theta)}{\max_{\theta \in \Omega} L(\theta)} = \frac{L(\hat\theta_0)}{L(\hat\theta)}$$
Example 3
Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from a normal
distribution with unknown mean and variance. Obtain the likelihood
ratio test statistic for testing $H_0: \sigma^2 = \sigma_0^2$ versus $H_1: \sigma^2 \ne \sigma_0^2$.
Here $\theta = (\mu, \sigma^2)$ and $\Omega_0 = \{(\mu, \sigma^2) : -\infty < \mu < \infty,\; \sigma^2 = \sigma_0^2\}$.
In Example 1, we found the unrestricted mle: $\hat\mu = \bar{x}$.
Now
$$L(\mu, \sigma^2 \mid X_1, \ldots, X_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = (2\pi\sigma^2)^{-n/2}\, e^{-\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}}$$
Since $\sum_{i=1}^{n}(x_i-\bar{x})^2 \le \sum_{i=1}^{n}(x_i-\mu)^2$, we have $L(\bar{x}, \sigma^2; x) \ge L(\mu, \sigma^2; x)$, so
we only need to find the value of $\sigma^2$ maximizing $L(\bar{x}, \sigma^2; x)$.
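Example 3 (continued)
$$\ln L(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{\sum (x_i-\mu)^2}{2\sigma^2}$$
$$\frac{\partial}{\partial\sigma^2}\ln L(\bar{x}, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{\sum (x_i-\bar{x})^2}{2\sigma^4} = 0$$
So $\hat\sigma^2 = \frac{\sum (x_i-\bar{x})^2}{n}$ is a maximum since
$$\frac{\partial^2}{\partial(\sigma^2)^2}\ln L(\bar{x}, \sigma^2)\Big|_{\sigma^2=\hat\sigma^2} = \frac{n}{2\hat\sigma^4} - \frac{\sum (x_i-\bar{x})^2}{\hat\sigma^6} = -\frac{n}{2(\hat\sigma^2)^2} < 0$$
Thus $\hat\mu = \bar{x}$ is the MLE of μ, and
$\hat\sigma^2 = \frac{\sum (x_i-\bar{x})^2}{n}$ is the MLE of $\sigma^2$.
We can also write
$$\hat\sigma^2 = \frac{\sum (x_i-\bar{x})^2}{n} = \frac{(n-1)s^2}{n}$$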
Example 3 (continued)
$$L(\hat\theta) = \left[\frac{2\pi(n-1)s^2}{n}\right]^{-n/2} e^{-\frac{n\sum_{i=1}^{n}(x_i-\bar{x})^2}{2(n-1)s^2}} = \left[\frac{2\pi(n-1)s^2}{n}\right]^{-n/2} e^{-\frac{n(n-1)s^2}{2(n-1)s^2}} = \left[\frac{2\pi(n-1)s^2}{n}\right]^{-n/2} e^{-n/2}$$
$$L(\hat\theta_0) = (2\pi\sigma_0^2)^{-n/2}\, e^{-\frac{(n-1)s^2}{2\sigma_0^2}}$$
$$\lambda = \frac{L(\hat\theta_0)}{L(\hat\theta)} = \frac{(2\pi\sigma_0^2)^{-n/2}\, e^{-\frac{(n-1)s^2}{2\sigma_0^2}}}{\left[\frac{2\pi(n-1)s^2}{n}\right]^{-n/2} e^{-n/2}} = \left[\frac{(n-1)s^2}{n\sigma_0^2}\right]^{n/2} e^{-\frac{(n-1)s^2}{2\sigma_0^2}}\, e^{n/2}$$
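Example 3 (continued)
Rejection region: $\lambda \le c$, such that $\alpha = P_{H_0}[\lambda \le c]$,
so
$$u^{n/2} e^{-u/2} \le k, \quad \text{where } u = \frac{(n-1)s^2}{\sigma_0^2}$$
Define $h(u) = u^{n/2} e^{-u/2}$. Then
$$h'(u) = \frac{n}{2}\, u^{\frac{n}{2}-1} e^{-u/2} - \frac{1}{2}\, u^{n/2} e^{-u/2} = \frac{1}{2}\, u^{\frac{n}{2}-1} e^{-u/2}\,(n-u) = 0 \;\Rightarrow\; u = n, \quad u > 0$$
so h(u) increases for u < n and decreases for u > n.
So $\lambda \le c$ implies $u \le c_1$ or $u \ge c_2$, where
$$P_{H_0}(c_1 < \chi^2_{n-1} < c_2) = 1 - \alpha$$
since $u \sim \chi^2(n-1)$ under $H_0$.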
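To see the shape of the rejection region, the sketch below evaluates λ as a function of $u = (n-1)s^2/\sigma_0^2$, using $\lambda = (u/n)^{n/2} e^{(n-u)/2}$ from the previous slide. The data and the values of $\sigma_0^2$ are hypothetical. Finding $c_1$ and $c_2$ would additionally need chi-square quantiles, which are not computed here.

```python
import math
from statistics import variance

def variance_lrt(data, sigma0_sq):
    """Return (u, lambda) for H0: sigma^2 = sigma0^2 with unknown mean."""
    n = len(data)
    u = (n - 1) * variance(data) / sigma0_sq  # variance() is s^2 (divisor n - 1)
    lam = (u / n) ** (n / 2) * math.exp((n - u) / 2)
    return u, lam

data = [1, 2, 3, 4, 5]  # hypothetical sample with s^2 = 2.5

# lambda peaks at u = n and falls off in both directions
for sigma0_sq in (8, 2, 1):
    print(sigma0_sq, variance_lrt(data, sigma0_sq))
```

Small values of λ therefore correspond to values of u that are either much smaller or much larger than n, which is why the test is two-sided in u.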
15.3 : Bayesian Inference
Bayesian inference refers to statistical inference in which new facts
are presented and used to draw updated conclusions on a prior belief.
The term ‘Bayesian’ stems from the well-known Bayes’ theorem, which
was first derived by Reverend Thomas Bayes.
Thomas Bayes (c. 1702 – April 17, 1761)
Source: www.wikipedia.com
Thomas Bayes (pictured above) was a Presbyterian minister and a
mathematician born in London who developed a special case of Bayes’
theorem which was published and studied after his death.
Bayes’ Theorem (review):
f (A | B) = f (A ∩ B) / f (B) = f (B | A) f (A) / f (B)    (15.1)
since f (A ∩ B) = f (B ∩ A) = f (B | A) f (A)
Some Key Terms in Bayesian Inference…
…in plain English
•prior distribution – a probability distribution for an uncertain quantity, θ,
that expresses previous knowledge of θ from, for example, past
experience, before any evidence (data) is taken into account
•posterior distribution – this distribution takes the evidence into account and
is then the conditional distribution of θ given the data. The posterior
distribution is computed from the prior and the likelihood function using
Bayes’ theorem.
•posterior mean – the mean of the posterior distribution
•posterior variance – the variance of the posterior distribution
•conjugate priors - a family of prior probability distributions in which
the key property is that the posterior probability distribution also
belongs to the family of the prior probability distribution
15.3.1 Bayesian Estimation
So far we’ve learned that the Bayesian approach treats θ as a random variable
and then data is used to update the prior distribution to obtain the posterior
distribution of θ. Now let’s move on to how we can estimate parameters using
this approach.
(Using text notation)
Let θ be an unknown parameter to be estimated based on a random sample, x1, x2, …, xn, from
a distribution with pdf/pmf f (x | θ).
Let π (θ) be the prior distribution of θ.
Let π *(θ | x1, x2, …, xn) be the posterior distribution.
**Note that π *(θ | x1, x2, …, xn) is the conditional distribution of θ given the
observed data, x1, x2, …, xn.
If we apply Bayes’ Theorem (Eq. 15.1), our posterior distribution becomes:
$$\pi^*(\theta \mid x_1, x_2, \ldots, x_n) = \frac{f(x_1, x_2, \ldots, x_n \mid \theta)\,\pi(\theta)}{\int f(x_1, x_2, \ldots, x_n \mid \theta)\,\pi(\theta)\,d\theta} = \frac{f(x_1, x_2, \ldots, x_n \mid \theta)\,\pi(\theta)}{f^*(x_1, x_2, \ldots, x_n)} \quad (15.2)$$
*Note that f *(x1, x2, …, xn) is the marginal PDF of X1, X2, …, Xn
Bayesian Estimation (continued)
As seen in equation 15.2, the posterior distribution represents what is
known about θ after observing the data X = x1, x2, …, xn . From earlier
in this chapter, we know that the likelihood of θ is f (X | θ) .
So, to get a better idea of the posterior distribution, we note that:
posterior distribution ∝ likelihood x prior distribution
i.e.
π *(θ | X) ∝ f (X | θ) x π (θ)
For a detailed practical example of deriving the posterior mean and
using Bayesian estimation, visit:
http://www.stat.berkeley.edu/users/rice/Stat135/Bayes.pdf
Example 15.26
Let x be the number of successes from n i.i.d. Bernoulli trials with
unknown success probability p=θ. Show that the beta distribution is a
conjugate prior on θ.

Goal:
★ $f(x \mid \theta)\,\pi(\theta) = f(x, \theta)$
★ $f(x) = \int f(x, \theta)\,d\theta = \int f(x \mid \theta)\,\pi(\theta)\,d\theta$
★ $\pi^*(\theta) = \pi(\theta \mid x) = \dfrac{f(x, \theta)}{f(x)} = \dfrac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,d\theta}$
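Example 15.26 (continued)
X has a binomial distribution with parameters n and p = θ:
$$f(x \mid \theta) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x}, \quad x = 0, 1, \ldots, n$$
The prior distribution of θ is the beta distribution:
$$\pi(\theta) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}, \quad 0 \le \theta \le 1$$
$$f(x, \theta) = f(x \mid \theta)\,\pi(\theta) = \binom{n}{x}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{x+a-1}(1-\theta)^{n-x+b-1}$$
$$f(x) = \int_0^1 f(x, \theta)\,d\theta = \binom{n}{x}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\cdot\frac{\Gamma(a+x)\,\Gamma(n+b-x)}{\Gamma(n+a+b)}$$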
Example 15.26 (continued)
$$\pi^*(\theta) = \pi(\theta \mid x) = \frac{f(x, \theta)}{f(x)} = \frac{\Gamma(n+a+b)}{\Gamma(x+a)\,\Gamma(n-x+b)}\,\theta^{x+a-1}(1-\theta)^{n-x+b-1}$$
It is a beta distribution with parameters (x+a) and (n-x+b)!!
Notes:
1. The parameters a and b of the prior distribution
may be interpreted as prior successes and prior
failures, with m=a+b being the total number of
prior observations.
After actually observing x successes and n-x
failures in n i.i.d Bernoulli trials, these parameters
are updated to a+x and b+n-x, respectively.
2. The prior and posterior means are, respectively,
a/m and (a+x)/(m+n)
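The conjugate update and the posterior mean from the notes above can be sketched as follows. The prior parameters and data are hypothetical, and integer a and b are assumed so that exact fractions can be used. The grid function is an independent check that normalizing likelihood x prior numerically gives the same posterior mean.

```python
from fractions import Fraction

def beta_binomial_update(a, b, x, n):
    """Posterior beta parameters after x successes in n Bernoulli trials."""
    return a + x, b + n - x

def beta_mean(a, b):
    """Mean of a beta(a, b) distribution (integer a, b assumed here)."""
    return Fraction(a, a + b)

def grid_posterior_mean(a, b, x, n, steps=20000):
    """Posterior mean from unnormalized likelihood x prior, midpoint rule on (0, 1)."""
    numerator = denominator = 0.0
    for i in range(steps):
        theta = (i + 0.5) / steps
        weight = theta ** (x + a - 1) * (1 - theta) ** (n - x + b - 1)
        numerator += theta * weight
        denominator += weight
    return numerator / denominator

a, b = 2, 3   # hypothetical prior: 2 prior successes, 3 prior failures
x, n = 7, 10  # hypothetical data: 7 successes in 10 trials

post_a, post_b = beta_binomial_update(a, b, x, n)
print("prior mean:", beta_mean(a, b))
print("posterior:", (post_a, post_b), "mean:", beta_mean(post_a, post_b))
print("grid check:", grid_posterior_mean(a, b, x, n))
```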
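15.3.2 Bayesian Testing
Assumption:
$H_0: \theta = \theta_0$
$H_a: \theta = \theta_a$
$$\pi_0^* = \pi^*(\theta_0) = P(\theta = \theta_0 \mid x)$$
$$\pi_a^* = \pi^*(\theta_a) = P(\theta = \theta_a \mid x)$$
$$\pi_0^* + \pi_a^* = 1$$
If
$$\frac{\pi_a^*}{\pi_0^*} = \frac{1 - \pi_0^*}{\pi_0^*} > k$$
, we reject H0 in favor of Ha .
Where k >0 is a suitably chosen critical constant.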
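A small sketch of this test for two simple hypotheses, reusing the two-point parameter space {1/4, 1/3} from the earlier MLE example with binomial data. The prior probability of $H_0$ (`prior0`) and the default k = 1 are assumptions made for illustration.

```python
from math import comb

def binomial_likelihood(x, n, theta):
    """f(x | theta) for x successes in n Bernoulli(theta) trials."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)

def posterior_probabilities(x, n, theta0, theta_a, prior0=0.5):
    """Return (pi0*, pia*) = posterior probabilities of theta0 and theta_a."""
    w0 = prior0 * binomial_likelihood(x, n, theta0)
    wa = (1 - prior0) * binomial_likelihood(x, n, theta_a)
    post0 = w0 / (w0 + wa)
    return post0, 1 - post0

def bayes_test(x, n, theta0, theta_a, prior0=0.5, k=1.0):
    """Reject H0 (return True) when pia* / pi0* > k."""
    post0, post_a = posterior_probabilities(x, n, theta0, theta_a, prior0)
    return post_a / post0 > k

for x in range(3):
    print(x, posterior_probabilities(x, 2, 1 / 4, 1 / 3), bayes_test(x, 2, 1 / 4, 1 / 3))
```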
Abraham Wald (1902-1950) was the founder of statistical decision theory.
His goal was to provide a unified theoretical framework for diverse
problems, i.e. point estimation, confidence interval estimation and
hypothesis testing.
Source: http://www-history.mcs.st-andrews.ac.uk/history/PictDisplay/Wald.html
Statistical Decision Problem
• The goal is to choose a decision d from a set of possible
decisions D, based on a sample outcome (data) x
• Decision space: D
• Sample space: the set of all sample outcomes, denoted by X
• Decision rule: δ is a function δ(x) which assigns to every
sample outcome x є X a decision d є D.
Continued…
• Denote by X the R.V. corresponding to x and the probability
distribution of X by f (x|θ).
• The above distribution depends on an unknown parameter θ
belonging to a parameter space Θ
• Suppose one chooses a decision d when the true parameter is θ; a
loss of L (d, θ) is then incurred, where L is known as the loss function.
• The decision rule is assessed by evaluating its expected loss, called
the risk function:
R(δ, θ) = E[L(δ(X),θ)] = ∫ L(δ(x),θ) f (x|θ)dx, with the integral taken over the sample space.
Example
Calculate and compare the risk functions for the squared error
loss of two estimators of the success probability p from n i.i.d.
Bernoulli trials. The first is the usual sample proportion of
successes and the second is the Bayes estimator from Example 15.26:
$\hat{p}_1 = X/n$
and
$\hat{p}_2 = (a + X)/(m + n)$
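The two risk functions can be computed exactly by summing the squared error over the binomial distribution of X. The sketch below assumes n = 10 and a beta prior with a = b = 1 (so m = 2); these values are illustrative. For reference, the closed forms are $p(1-p)/n$ for the sample proportion and $[np(1-p) + (a - mp)^2]/(m+n)^2$ for the Bayes estimator.

```python
from math import comb

def risk(estimator, n, p):
    """Exact squared-error risk E[(estimator(X) - p)^2] for X ~ Binomial(n, p)."""
    return sum(
        comb(n, x) * p**x * (1 - p) ** (n - x) * (estimator(x) - p) ** 2
        for x in range(n + 1)
    )

n, a, b = 10, 1, 1  # hypothetical sample size and prior parameters
m = a + b

def p_hat_1(x):
    """Sample proportion of successes."""
    return x / n

def p_hat_2(x):
    """Bayes estimator (posterior mean) from Example 15.26."""
    return (a + x) / (m + n)

for p in (0.05, 0.25, 0.5):
    print(p, risk(p_hat_1, n, p), risk(p_hat_2, n, p))
```

With these settings the Bayes estimator has the smaller risk near p = 0.5 and the larger risk near p = 0, so neither estimator dominates the other for all p.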
Von Neumann (1928): Minimax
Source:http://jeff560.tripod.com/
How Minimax Works
• Focuses on risk avoidance
• Can be applied to both zero-sum and non-zero-sum games
• Can be applied to multi-stage games
• Can be applied to multi-person games
Classic Example: The Prisoner’s Dilemma
• Each player evaluates his/her alternatives, attempting to minimize
his/her own risk
• From a common sense standpoint, a sub-optimal equilibrium results

                          Prisoner B Stays Silent      Prisoner B Betrays
Prisoner A Stays Silent   Both serve six months        Prisoner A serves ten years;
                                                       Prisoner B goes free
Prisoner A Betrays        Prisoner A goes free;        Both serve two years
                          Prisoner B serves ten years
Classic example: With Probabilities
Two player game with simultaneous moves, where the probabilities with which
player two acts are known to both players.

Rows: player 1's actions. Columns: player 2's actions.

              Action A    Action B    Action C    Action D
              [P(A)=p]    [P(B)=q]    [P(C)=r]    [P(D)=1-p-q-r]
Action A         -1           1          -2           4
Action B         -2           7           1           1
Action C          0          -1           0           3
Action D          1           0           2           3

• When the probabilities are disregarded in playing the game, (D,B) is the
equilibrium point under minimax
• With probabilities (p=q=r=1/4), player one will choose B. This is…
…how Bayes works
• View {(pi,qi,ri)} as θi, where i=1 in the previous example
• Letting i = 1, …, n, we get a much better idea of what Bayes meant by
“states of nature” and how the probabilities of each state enter into
one’s strategy
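The two choices in the game above can be reproduced with a short sketch. It assumes the table entries are payoffs to player one (larger is better), which the slide does not state explicitly. The "minimax" choice is computed as the action with the best worst-case payoff, and the "Bayes" choice as the action with the best expected payoff under the known probabilities p = q = r = 1/4.

```python
# Rows: player one's actions; list positions: player two's actions A, B, C, D.
# Assumption: entries are payoffs to player one.
PAYOFFS = {
    "A": [-1, 1, -2, 4],
    "B": [-2, 7, 1, 1],
    "C": [0, -1, 0, 3],
    "D": [1, 0, 2, 3],
}
COLUMNS = ["A", "B", "C", "D"]

def maximin_action(payoffs):
    """Action whose worst-case payoff is largest (ignores probabilities)."""
    return max(payoffs, key=lambda action: min(payoffs[action]))

def worst_case_column(payoffs, action):
    """Player two's action that gives player one the lowest payoff."""
    row = payoffs[action]
    return COLUMNS[row.index(min(row))]

def bayes_action(payoffs, probs):
    """Action with the largest expected payoff under player two's probabilities."""
    return max(payoffs, key=lambda action: sum(p * v for p, v in zip(probs, payoffs[action])))

cautious = maximin_action(PAYOFFS)
print("minimax point:", (cautious, worst_case_column(PAYOFFS, cautious)))
print("Bayes choice:", bayes_action(PAYOFFS, [0.25, 0.25, 0.25, 0.25]))
```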
Conclusion
We covered three theoretical approaches in our presentation:
• Likelihood
  - provides statistical justification for many of the methods used in
    statistics
  - MLE: method used to make inferences about parameters of the
    underlying probability distribution of a given data set
• Bayesian and Decision Theory
  - paradigms used in statistics
  - Bayesian Theory: probabilities are associated with individual events or
    statements rather than with sequences of events
  - Decision Theory: describes and rationalizes the process of decision
    making, that is, making a choice among several possible alternatives
Source: http://www.answers.com/maximum%20likelihood, http://www.answers.com/bayesian%20theory, http://www.answers.com/decision%20theory
The End
Any questions for the group?