Chapter 15
Maximum Likelihood Estimation, Likelihood Ratio Test, Bayes Estimation, and Decision Theory
Bei Ye, Yajing Zhao, Lin Qian, Lin Sun, Ralph Hurtado, Gao Chen, Yuanchi Xue, Tim Knapik, Yunan Min, Rui Li
Section 15.1
Maximum Likelihood Estimation (MLE)
1. Likelihood function
2. Calculation of MLE
3. Properties of MLE
4. Large sample inference and delta method
Likelihood Function
1.1 Parameter space Θ
X1 ,…, Xn : i.i.d. observations
θ: an unknown parameter
Θ: The set of all possible values of θ
1.2 Joint p.d.f. or p.m.f. of X1 ,…, Xn
$$ f(x_1, \dots, x_n \mid \theta) = f(x_1 \mid \theta)\, f(x_2 \mid \theta) \cdots f(x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) $$
Likelihood Function
1.3 Likelihood Function of θ
For observed x1, …, xn:
$$ L(\theta \mid x_1, \dots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta) $$
The joint p.d.f. or p.m.f. is a function of x1, …, xn for given θ.
The likelihood function is a function of θ for given x1, …, xn.
Example: Normal Distribution
Suppose x1, …, xn is a random sample from a normal distribution with p.d.f.
$$ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\} $$
Here the parameter is a vector: (μ, σ²).
Likelihood function:
$$ L(\mu, \sigma^2) = \prod_{i=1}^{n} \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x_i-\mu)^2}{2\sigma^2} \right\} \right] = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i-\mu)^2 \right\} $$
Calculation of MLE
2.1 Maximum Likelihood Estimation:
We need to find θ̂ = θ̂(x1, …, xn) which maximizes the likelihood function L(θ | x1, …, xn).
Simple Example:
• Two independent Bernoulli trials with success probability θ.
• θ is known to be either 1/4 or 1/3, so Θ = {1/4, 1/3}.
• The probabilities of observing x = 0, 1, 2 successes can be calculated. Let's look at the following table.
Calculation of MLE
Probability of Observing x Successes (rows: parameter space Θ; columns: x = number of successes)

θ = 1/4:   x=0: 9/16    x=1: 6/16    x=2: 1/16
θ = 1/3:   x=0: 4/9     x=1: 4/9     x=2: 1/9

The MLE θ̂ is chosen to maximize L(θ | x) for the observed x.
x = 0: θ̂ = 1/4;  x = 1 or 2: θ̂ = 1/3.
Calculation of MLE
2.2 Log-likelihood function:
$$ \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i \mid \theta) $$
The MLE is found by setting the derivative of the log-likelihood equal to zero and solving for θ:
$$ \frac{d[\ln L(\theta)]}{d\theta} = 0 $$
Note: this method can be used only when the likelihood function is differentiable in θ.
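When the derivative equation has no convenient closed form, the log-likelihood can also be maximized numerically. Below is a minimal sketch (assumed Bernoulli data, not from the slides) that does this with scipy and checks the result against the known closed-form MLE x̄.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative sketch: maximize the Bernoulli log-likelihood numerically
# instead of solving d ln L / d theta = 0 by hand.
x = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # assumed sample of Bernoulli trials

def neg_log_likelihood(theta):
    # ln L(theta) = sum_i [x_i ln theta + (1 - x_i) ln(1 - theta)]
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())   # numerical MLE agrees with the closed form x-bar
```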
Properties of MLE
MLEs have optimality properties in large samples. These properties are expressed using the concept of information due to Fisher.
Properties of MLE
3.1 Fisher Information:
$$ I(\theta) = \int \left( \frac{d \ln f(x \mid \theta)}{d\theta} \right)^{2} f(x \mid \theta)\, dx = E\!\left[ \left( \frac{d \ln f(X \mid \theta)}{d\theta} \right)^{2} \right] $$
Alternative expression:
$$ I(\theta) = -\int \frac{d^{2} \ln f(x \mid \theta)}{d\theta^{2}}\, f(x \mid \theta)\, dx = -E\!\left[ \frac{d^{2} \ln f(X \mid \theta)}{d\theta^{2}} \right] $$
Properties of MLE
For an i.i.d. sample: let X1, X2, …, Xn be an i.i.d. sample from p.d.f. f(x | θ). Then
$$ I_n(\theta) = -E\!\left[ \frac{d^{2} \ln f(X_1, X_2, \dots, X_n \mid \theta)}{d\theta^{2}} \right]
 = -E\!\left[ \frac{d^{2}}{d\theta^{2}} \left\{ \ln f(X_1 \mid \theta) + \ln f(X_2 \mid \theta) + \cdots + \ln f(X_n \mid \theta) \right\} \right] $$
$$ = -E\!\left[ \frac{d^{2} \ln f(X_1 \mid \theta)}{d\theta^{2}} \right] - E\!\left[ \frac{d^{2} \ln f(X_2 \mid \theta)}{d\theta^{2}} \right] - \cdots - E\!\left[ \frac{d^{2} \ln f(X_n \mid \theta)}{d\theta^{2}} \right]
 = I(\theta) + I(\theta) + \cdots + I(\theta) = nI(\theta) $$
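As a small illustrative check (an assumption-based example, not from the slides): for a Bernoulli(θ) observation the score is x/θ - (1-x)/(1-θ) and the closed-form information is I(θ) = 1/(θ(1-θ)). The sketch below verifies the expectation form of I(θ) by Monte Carlo.

```python
import numpy as np

theta = 0.3
rng = np.random.default_rng(1)
x = rng.binomial(1, theta, size=200_000)

score = x / theta - (1 - x) / (1 - theta)     # d ln f(x | theta) / d theta
I_mc = np.mean(score ** 2)                    # E[(d ln f / d theta)^2], Monte Carlo
I_closed = 1.0 / (theta * (1 - theta))        # known closed form for Bernoulli

print(I_mc, I_closed)                         # both near 4.76; I_n(theta) = n * I(theta)
```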
Properties of MLE
For a k-dimensional vector parameter: suppose the p.d.f. of an r.v. X is f(x | θ), where θ = (θ1, θ2, …, θk). The (i, j)-th entry of the information matrix I(θ) is given by
$$ I_{ij}(\theta) = E\!\left[ \left( \frac{\partial \ln f(X \mid \theta)}{\partial \theta_i} \right) \left( \frac{\partial \ln f(X \mid \theta)}{\partial \theta_j} \right) \right] = -E\!\left[ \frac{\partial^{2} \ln f(X \mid \theta)}{\partial \theta_i\, \partial \theta_j} \right] $$
Properties of MLE
3.2 Cramér-Rao Lower Bound
Consider a random sample X1, X2, …, Xn from p.d.f. f(x | θ). Let θ̂ be any estimator of θ with E(θ̂) = θ + B(θ), where B(θ) is the bias of θ̂. If B(θ) is differentiable in θ and if certain regularity conditions hold, then
$$ \operatorname{Var}(\hat{\theta}) \ge \frac{\left[ 1 + B'(\theta) \right]^{2}}{n I(\theta)} \qquad \text{(Cramér-Rao inequality)} $$
The ratio of the lower bound to the variance of any estimator of θ is called the efficiency of that estimator.
An estimator with efficiency equal to 1 is called an efficient estimator.
Large Sample Inferences and Delta Method
4.1 Large Sample Inferences
To make large sample inferences on an unknown (single) parameter θ, we need to estimate
$$ \operatorname{Var}(\hat{\theta}) \approx \frac{1}{n I(\theta)} $$
I(θ) is estimated by
$$ \hat{I}(\hat{\theta}) = -\frac{1}{n} \sum_{i=1}^{n} \left. \frac{d^{2} \ln f(X_i \mid \theta)}{d\theta^{2}} \right|_{\theta = \hat{\theta}} $$
This estimate does not require evaluation of the expected value.
An approximate large sample CI on θ (with z_{α/2} the upper α/2 critical point of the standard normal) is
$$ \hat{\theta} - \frac{z_{\alpha/2}}{\sqrt{n \hat{I}(\hat{\theta})}} \le \theta \le \hat{\theta} + \frac{z_{\alpha/2}}{\sqrt{n \hat{I}(\hat{\theta})}} $$
4. Large Sample Inferences and Delta Method
4.2 Delta Method for Approximating the Variance of an Estimator
Suppose we want to estimate a nonlinear function h(θ) of the parameter, using an estimator θ̂ with E(θ̂) ≈ θ and with Var(θ̂) a known function of θ.
Delta method: linearize h(θ̂) around θ,
$$ h(\hat{\theta}) \approx h(\theta) + (\hat{\theta} - \theta)\, h'(\theta) $$
so that
$$ \operatorname{Var}\!\left[ h(\hat{\theta}) \right] \approx \left[ h'(\theta) \right]^{2} \operatorname{Var}(\hat{\theta}) $$
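A short illustrative sketch (assumed example, not from the slides): the delta-method variance of the estimated odds h(p) = p/(1-p), based on p̂ = x̄ from Bernoulli data, compared against a Monte Carlo estimate of the true variance.

```python
import numpy as np

p, n = 0.3, 100
h_prime = 1.0 / (1.0 - p) ** 2                 # h'(p) for h(p) = p / (1 - p)
var_p_hat = p * (1 - p) / n                    # Var(p_hat) for Bernoulli data
delta_var = h_prime ** 2 * var_p_hat           # [h'(p)]^2 Var(p_hat)

# Monte Carlo check of the approximation
rng = np.random.default_rng(3)
p_hat = rng.binomial(n, p, size=200_000) / n
mc_var = np.var(p_hat / (1 - p_hat))

print(delta_var, mc_var)                       # the two values should be close
```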
Section 15.2
Likelihood Ratio Test
Likelihood Ratio (LR) Test
1. Background of LR test
2. Neyman-Pearson Lemma and Test
3. Examples
4. Generalized Likelihood Ratio Test
Background of LR test
• Jerzy Spława-Neyman (1894-1981), Polish-American mathematician
• Egon Sharpe Pearson (1895-1980), English mathematician
Neyman-Pearson lemma
We want to find a rejection region R such that the probabilities of both type I and type II errors are as small as possible.
Suppose x1, x2, …, xn have joint p.d.f. f(x1, …, xn; θ). Consider the ratio
$$ \lambda(x; \theta_0, \theta_1) = \frac{f(x_1, \dots, x_n; \theta_1)}{f(x_1, \dots, x_n; \theta_0)} = \frac{L(\theta_1)}{L(\theta_0)} $$
Then a best (most powerful) critical region of size α is
$$ C = \{ (x_1, \dots, x_n) : \lambda(x; \theta_0, \theta_1) \ge k \} $$
where k is a constant such that P((x1, …, xn) ∈ C; H0) = α.
What is Likelihood Ratio (LR) test
• A ratio is computed between the likelihoods of the observed result under the alternative and null hypotheses:
$$ \Lambda = \frac{L_1}{L_0} $$
where the numerator is the likelihood of the observed result under the alternative hypothesis and the denominator is the likelihood under the null hypothesis.
• Test idea: if we observe x, then the condition L1(x) > L0(x) is evidence in favor of the alternative; the opposite inequality is evidence against the alternative.
• Hence, the decision to reject the null hypothesis is made based on the value of this ratio.
The Test
• Let x1, x2, …, xn be a random sample with p.d.f. f(x | θ).
• Hypotheses:
  H0: θ = θ0
  H1: θ = θ1
• Test statistic:
$$ \Lambda = \frac{L(\theta_1 \mid x_1, x_2, \dots, x_n)}{L(\theta_0 \mid x_1, x_2, \dots, x_n)} $$
• Reject H0 when the likelihood ratio exceeds the critical constant k.
Characteristics of LR Test
• It is the most powerful test of significance level α; it maximizes the power
$$ 1 - \beta = P(\text{reject } H_0 \mid \theta_1) $$
• Very useful and widely applicable, e.g., in medicine to assist in interpreting diagnostic tests.
• The exact distribution of the likelihood ratio corresponding to specific hypotheses can be very difficult to determine.
• The computations are often difficult to perform by hand.
Example 1: Test on Normal Distribution Mean
H0 :    0
H1 :   1  0
The likelihoods under H 0 and H1 are:
n
L( 0 x1, x2 ,, xn )  (
1
) n exp( 
 2
2
(
x


)
 i 0
i 1
2
2
)
n
1
L( 1 x1, x2 ,, xn )  (
) n exp( 
 2
2
(
x


)
 i 1
i 1
2
2
)
Example 1 continued
The likelihood ratio is
$$ \frac{L(\mu_1 \mid x_1, x_2, \dots, x_n)}{L(\mu_0 \mid x_1, x_2, \dots, x_n)} = \exp\!\left( \frac{n}{2\sigma^2} \left( 2\bar{x} - (\mu_0 + \mu_1) \right) (\mu_1 - \mu_0) \right) $$
Since μ1 > μ0, this ratio is increasing in x̄, so we reject H0 when x̄ exceeds a constant k chosen for the specified significance level:
$$ P(\bar{x} > k \mid \mu = \mu_0) = \alpha $$
The test does not depend on μ1. It is the most powerful α-level test for all μ1 > μ0.
A Numerical Example
Suppose we have a single observation x (n = 1) from f(x; θ) = θ x^{θ-1} I_{(0,1)}(x).
Test H0: θ = 1 versus H1: θ = 2 with α = 0.05.
$$ \lambda(x; 1, 2) = \frac{f(x; 2)}{f(x; 1)} = \frac{2x\, I_{(0,1)}(x)}{1 \cdot I_{(0,1)}(x)} = 2x $$
Under H0, X is uniform on (0, 1), so P(2X > k; H0) = 1 - k/2 = α = 0.05 gives k = 1.9.
So we reject H0 if 2x > 1.9, i.e., if x > 0.95.
Generalized Likelihood Ratio Test
• Neyman-Pearson Lemma shows that the
most powerful test for a Simple vs. Simple
hypothesis testing problem is a Likelihood
Ratio Test.
• We can generalize the likelihood ratio
method for the Composite vs. Composite
hypothesis testing problem.
Hypothesis
• Suppose H0 specifies that θ is in Θ0 and
H1 specifies that θ is in Θ0c. Symbolically,
the hypotheses are:
H 0 :   0
H1 :   
c
0
Test Statistics
$$ \lambda = \lambda(x_1, x_2, \dots, x_n) = \frac{\max_{\theta \in \Theta_0} L(\theta \mid x_1, x_2, \dots, x_n)}{\max_{\theta \in \Theta} L(\theta \mid x_1, x_2, \dots, x_n)} = \frac{L(\hat{\theta}_0 \mid x_1, x_2, \dots, x_n)}{L(\hat{\theta} \mid x_1, x_2, \dots, x_n)} $$
Test Statistics
• Note that λ ≤ 1.
• An intuitive way to understand λ is to view:
  – the numerator of λ as the maximum probability of the observed sample computed over parameters in the null hypothesis,
  – the denominator of λ as the maximum probability of the observed sample over all possible parameters,
  – λ as the ratio of these two maxima.
Test Statistics
• If H0 is true, λ should be close to 1.
• If H1 is true, λ should be smaller.
• A small λ means that the observed sample is much more likely for parameter points in the alternative hypothesis than for any parameter point in the null hypothesis.
Rejection Region & Critical Constant
• Reject H0 if λ ≤ k, where k is the critical constant (k < 1).
• k is chosen to make the level of the test equal to the specified α, that is, α = P_{Θ0}(λ ≤ k).
A Simple Example to illustrate GLR test
• Ex 15.18 (GLR Test for Normal Mean: Known Variance)
• For a random sample x1, x2, …, xn from an N(μ, σ²) distribution with known σ², derive the GLR test for the one-sided testing problem:
H0: µ ≤ µ0 vs. H1: µ > µ0
where µ0 is specified.
Solutions
The likelihood function is
$$ L(\mu \mid x_1, x_2, \dots, x_n) = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^{n} \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right\}
 = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^{n} \exp\!\left\{ -\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2 \right] \right\} $$
Solutions
• If x̄ ≤ µ0, then the restricted MLE of µ under H0 is simply x̄.
• If x̄ > µ0, then the restricted MLE of µ under H0 is µ0, because in this case the maximum of the likelihood function under H0 is attained at µ = µ0.
Solutions
Thus, the numerator and denominator of the likelihood ratio are, respectively,
$$ \max_{\mu \le \mu_0} L(\mu \mid x_1, x_2, \dots, x_n) =
\begin{cases}
\left( \dfrac{1}{\sigma\sqrt{2\pi}} \right)^{n} \exp\!\left\{ -\dfrac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right\} & \text{if } \bar{x} \le \mu_0 \\[2ex]
\left( \dfrac{1}{\sigma\sqrt{2\pi}} \right)^{n} \exp\!\left\{ -\dfrac{1}{2\sigma^2} \left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu_0)^2 \right] \right\} & \text{if } \bar{x} > \mu_0
\end{cases} $$
$$ \max_{\mu} L(\mu \mid x_1, x_2, \dots, x_n) = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^{n} \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right\} $$
Solution
Taking the ratio of the two and canceling the common terms, we get
$$ \lambda =
\begin{cases}
1 & \text{if } \bar{x} \le \mu_0 \\
\exp\!\left\{ -\dfrac{n}{2\sigma^2} (\bar{x} - \mu_0)^2 \right\} & \text{if } \bar{x} > \mu_0
\end{cases} $$
Solution
• Clearly, we do not reject H0 when λ = 1, i.e., when x̄ ≤ µ0.
• Therefore, the condition λ < k is equivalent to n(x̄ - µ0)²/(2σ²) > -ln k, subject to x̄ > µ0.
• In other words, we reject H0 if z = √n (x̄ - µ0)/σ is large, which leads to the usual upper one-sided z-test.
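A minimal sketch of that upper one-sided z-test (assumed data values, not from the slides):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
sigma, mu0 = 2.0, 10.0
x = rng.normal(loc=10.8, scale=sigma, size=40)   # assumed sample

z = np.sqrt(len(x)) * (x.mean() - mu0) / sigma   # z statistic from the GLR reduction
p_value = 1 - norm.cdf(z)                        # upper-tailed p-value

print(z, p_value)                                # reject H0 at level alpha if p_value < alpha
```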
Section 15.3
Bayesian Inference
1. Background of Bayes
2. Bayesian Inference defined
3. Bayesian Estimation
4. Bayesian Testing
Background of Thomas Bayes
• Thomas Bayes
– 1702 – 1761
– British mathematician and
Presbyterian minister
– Fellow of the Royal Society
– Studied logic and theology at the
University of Edinburgh
– He was barred from studying at Oxford
and Cambridge because of his religion
Background of Bayes
• Bayes' Theorem
– Famous probability theorem for finding "reverse probability":
$$ P(B_j \mid A) = \frac{P(B_j)\, P(A \mid B_j)}{\sum_{i=1}^{k} P(B_i)\, P(A \mid B_i)} $$
– The theorem was published posthumously in a paper entitled "Essay Towards Solving a Problem in the Doctrine of Chances".
Bayesian Inference
• Application to Statistics – Qualitative Overview
– Estimate an unknown parameter θ
– Assumes the investigator has some prior knowledge of the unknown parameter θ
– Assumes the prior knowledge can be summarized in the form of a probability distribution on θ, called the prior distribution
– Thus, θ is a random variable
Bayesian Inference
• Application to Statistics – Qualitative Overview (cont.)
– The data are used to update the prior distribution and obtain the posterior distribution
– Inferences on θ are based on the posterior distribution
Bayesian Inference
• Criticisms by Frequentists
– Prior knowledge is not accurate enough to form a
meaningful prior distribution
– Perceptions of prior knowledge differ from person
to person
– This may cause inferences on the same data to
differ from person to person.
Some Key Terms in Bayesian Inference…
In the classical approach the parameter, θ, is thought to be an unknown,
but fixed, quantity. In the Bayesian approach, θ is considered to be a
quantity whose variation can be described by a probability distribution
which is called prior distribution.
• prior distribution – a subjective distribution, based on experimenter’s
belief, and is formulated before the data are seen.
• posterior distribution – is computed from the prior and the likelihood
function using Bayes’ theorem.
• posterior mean – the mean of the posterior distribution
• posterior variance – the variance of the posterior distribution
• conjugate priors – a family of prior probability distributions with the key property that the posterior distribution belongs to the same family as the prior
15.3.1 Bayesian Estimation
Now let's move on to how we can estimate parameters using the Bayesian approach.
(Using the text's notation:)
Let θ be an unknown parameter, estimated from a random sample x1, x2, …, xn from a distribution with p.d.f./p.m.f. f(x | θ).
Let π(θ) be the prior distribution of θ.
Let π*(θ | x1, x2, …, xn) be the posterior distribution.
If we apply Bayes' Theorem (Eq. 15.1), the posterior distribution becomes:
$$ \pi^{*}(\theta \mid x_1, x_2, \dots, x_n) = \frac{f(x_1, x_2, \dots, x_n \mid \theta)\, \pi(\theta)}{\int f(x_1, x_2, \dots, x_n \mid \theta)\, \pi(\theta)\, d\theta} = \frac{f(x_1, x_2, \dots, x_n \mid \theta)\, \pi(\theta)}{f^{*}(x_1, x_2, \dots, x_n)} $$
Note that f*(x1, x2, …, xn) = ∫ f(x1, x2, …, xn | θ) π(θ) dθ is the marginal p.d.f. of X1, X2, …, Xn.
Bayesian Estimation (continued)
As seen in equation 15.2, the posterior distribution represents what is known about θ after observing the data x = (x1, x2, …, xn). From earlier chapters, we know that the likelihood of θ is f(x | θ), viewed as a function of θ.
So, to get a better idea of the posterior distribution, we note that:
posterior distribution ∝ likelihood × prior distribution
i.e.
$$ \pi^{*}(\theta \mid x) \propto f(x \mid \theta) \cdot \pi(\theta) $$
For a detailed practical example of deriving the posterior mean and using Bayesian estimation, visit:
http://www.stat.berkeley.edu/users/rice/Stat135/Bayes.pdf
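A minimal numerical sketch of "posterior ∝ likelihood × prior" (assumptions: Bernoulli likelihood and a Beta(2, 2) prior, neither taken from the slides): compute the unnormalized posterior on a grid and normalize numerically.

```python
import numpy as np

x = np.array([1, 1, 0, 1, 0, 1, 1, 1])          # assumed observed Bernoulli data
theta = np.linspace(0.001, 0.999, 999)          # grid over the parameter space

likelihood = theta ** x.sum() * (1 - theta) ** (len(x) - x.sum())
prior = theta * (1 - theta)                     # Beta(2, 2) density, up to a constant

unnormalized = likelihood * prior
posterior = unnormalized / np.trapz(unnormalized, theta)   # normalize to integrate to 1

posterior_mean = np.trapz(theta * posterior, theta)
print(posterior_mean)   # close to the exact Beta(8, 4) posterior mean, 8/12 ≈ 0.667
```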
Example 15.25
Let x be an observation from an N(µ, σ²) distribution where µ is unknown and σ² is known. Show that the normal distribution is a conjugate prior on µ.
$$ f(x \mid \mu)\, \pi(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left\{ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right\}
 = \frac{1}{2\pi\,\sigma\,\sigma_0} \exp\!\left\{ -\frac{\sigma_0^2 (x - \mu)^2 + \sigma^2 (\mu - \mu_0)^2}{2\sigma^2 \sigma_0^2} \right\} $$
We can ignore the factor 1/(2π σ σ0) because it will cancel from both the numerator and denominator of the expression for π*(µ | x). Similarly, any terms not involving µ can be canceled from the numerator and denominator.
Example 15.25 (continued)
Thus, we see that π*(µ | x) is proportional to
$$ \exp\!\left\{ -\frac{(\mu - \mu^{*})^2}{2\sigma^{*2}} \right\} $$
where
$$ \mu^{*} = p_0 x + q_0 \mu_0, \qquad p_0 = \frac{\sigma_0^2}{\sigma^2 + \sigma_0^2}, \qquad q_0 = 1 - p_0 = \frac{\sigma^2}{\sigma^2 + \sigma_0^2}, \qquad \sigma^{*2} = \left( \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2} \right)^{-1} $$
It follows that π*(µ | x) has the form of the normal distribution. Specifically, π*(µ | x) is the N(µ*, σ*²) distribution (the normalizing constant 1/(√(2π) σ*) comes from the denominator).
Example 15.25 (continued)
Note the following remarks:
1. If the single observation x is replaced by the sample mean x̄ of n i.i.d. observations from N(µ, σ²), then the posterior distribution of µ is N(µ*, σ*²) with
$$ \mu^{*} = p_0 \bar{x} + q_0 \mu_0, \qquad p_0 = \frac{\sigma_0^2}{\sigma^2/n + \sigma_0^2}, \qquad q_0 = 1 - p_0 = \frac{\sigma^2/n}{\sigma^2/n + \sigma_0^2}, \qquad \sigma^{*2} = \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)^{-1} $$
2. The posterior mean (Bayes estimate) µ* is a weighted average of x̄ and µ0 with weights p0 and q0 inversely proportional to their respective variances. If the prior variance σ0² << σ²/n, then the prior dominates the data and µ* is closer to µ0. Conversely, if σ0² >> σ²/n, then the data dominate the prior and µ* is closer to x̄.
3. The posterior variance σ*² is the harmonic sum of σ²/n and σ0².
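A short sketch of these formulas in code (the prior and data values are assumptions for illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, mu0, sigma0 = 2.0, 0.0, 1.5           # known data s.d., prior mean, prior s.d.
x = rng.normal(loc=1.0, scale=sigma, size=25)
n, xbar = len(x), x.mean()

p0 = sigma0**2 / (sigma**2 / n + sigma0**2)
q0 = 1 - p0
mu_star = p0 * xbar + q0 * mu0                       # posterior mean: weighted average
var_star = 1.0 / (n / sigma**2 + 1.0 / sigma0**2)    # posterior variance: harmonic sum

print(mu_star, var_star)
```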
15.3.2 Bayesian Testing
Assumption: we test
H0: θ = θ0 versus H1: θ = θ1
Let
$$ \pi_0^{*} = \pi^{*}(\theta_0) = P(\theta = \theta_0 \mid x), \qquad \pi_1^{*} = \pi^{*}(\theta_1) = P(\theta = \theta_1 \mid x), \qquad \pi_0^{*} + \pi_1^{*} = 1 $$
If
$$ \frac{\pi_1^{*}}{\pi_0^{*}} \ge k $$
we reject H0 in favor of H1, where k > 0 is a suitably chosen critical constant.
Example 15.27
Consider the problem of testing H0: µ = µ0 vs. H1: µ = µ1, where µ1 > µ0, based on a random sample x1, x2, …, xn from an N(µ, σ²) distribution with known σ².
Using the expression for the likelihood function from Example 15.16 and canceling the common terms, the ratio of the posteriors becomes
$$ \frac{\pi_1^{*}}{\pi_0^{*}} = \frac{\pi_1 \exp\!\left\{ -\frac{n}{2\sigma^2} (\bar{x} - \mu_1)^2 \right\}}{\pi_0 \exp\!\left\{ -\frac{n}{2\sigma^2} (\bar{x} - \mu_0)^2 \right\}} = \frac{\pi_1}{\pi_0} \exp\!\left\{ \frac{n(\mu_1 - \mu_0)}{2\sigma^2} \left( 2\bar{x} - \mu_0 - \mu_1 \right) \right\} $$
Therefore π1*/π0* ≥ k holds if
$$ \exp\!\left\{ \frac{n(\mu_1 - \mu_0)}{2\sigma^2} \left( 2\bar{x} - \mu_0 - \mu_1 \right) \right\} \ge k \cdot \frac{\pi_0}{\pi_1} $$
We see that the Bayes test has the same form as the Neyman-Pearson test, but the critical constant is modified by the ratio of the prior probabilities.
Decision theory
1. Definition
– Loss
– Risk
2. Comparison of estimators
– Admissibility
– Minimax
– Bayes risk
3. Example of hypothesis testing
Section 15.4
Decision Theory
Definition
• Decision theory aims to unite the following under
a common framework
– point estimation
– confidence interval estimation
– hypothesis testing
• D: the set of all decisions
• X: the set of all outcomes, typically a sample of independent and identically distributed random variables
• δ: X → D, the function that chooses a decision given the sample
Loss
• How do you evaluate the performance of a decision function δ?
• Consider point estimation:
– X = (X1, …, Xn), where each Xi depends on an unknown parameter θ ∈ Θ
– D = Θ, which is natural since we are attempting to estimate θ
– The decision function returns an estimate θ̂ of the true parameter θ
– A "good" decision rule selects values close to the actual value of θ
Loss and risk
• A loss function incurs a penalty when the estimate deviates from the true value of θ, e.g., squared error loss:
$$ L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2 $$
• The risk function is a measure of the decision rule's performance averaged over the set of outcomes:
$$ R(\theta, \delta) = E[L(\theta, \delta(X))] $$
Squared error loss
$$ L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2 $$
$$ R(\theta, \hat{\theta}) = E[(\theta - \hat{\theta})^2]
 = E[\theta^2 - 2\theta\hat{\theta} + \hat{\theta}^2]
 = \theta^2 - 2\theta E[\hat{\theta}] + E[\hat{\theta}^2] $$
$$ = \left( E[\hat{\theta}^2] - E[\hat{\theta}]^2 \right) + E[\hat{\theta}]^2 - 2\theta E[\hat{\theta}] + \theta^2
 = \operatorname{Var}(\hat{\theta}) + \left( E[\hat{\theta}] - \theta \right)^2
 = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta}, \theta)^2 $$
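A quick Monte Carlo sketch of this variance-plus-squared-bias decomposition, using an assumed biased shrinkage estimator of a Bernoulli proportion (the example is an assumption, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, m = 0.3, 10, 500_000

x_sums = rng.binomial(n, p, size=m)
theta_hat = (x_sums + 1) / (n + 2)        # a (biased) shrinkage estimator of p

risk = np.mean((theta_hat - p) ** 2)      # Monte Carlo estimate of E[(p - p_hat)^2]
var_plus_bias2 = np.var(theta_hat) + (np.mean(theta_hat) - p) ** 2

print(risk, var_plus_bias2)               # the two quantities agree (up to MC error)
```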
Comparing estimators
• Let X = (X1, …, Xn), where Xi ~ Bernoulli(p).
• Unbiased estimator for p: δ1 = X̄, with risk
$$ R(\delta_1, p) = \operatorname{Var}(\bar{X}) = \frac{p(1-p)}{n} $$
• Bayes estimator for p (assuming the prior P ~ Beta(a, b)):
$$ \delta_2 = \frac{a + \sum_{i=1}^{n} X_i}{a + b + n}, \qquad
R(\delta_2, p) = \frac{\left[ (a+b)^2 - n \right] p^2 + \left[ n - 2a(a+b) \right] p + a^2}{(a + b + n)^2} $$
Comparing estimators
If we let a = b = 1, and n = 4, then the risk functions become:
$$ R(\delta_1, p) = \frac{p(1-p)}{4}, \qquad R(\delta_2, p) = \frac{1}{36} $$
[Figure: Comparison of risk. Plot of risk versus p (0 to 1) for the point estimator, R(δ1, p) = p(1-p)/4, and the Bayes estimator, R(δ2, p) = 1/36.]
The estimators perform differently depending on the value of p. How do we select which estimator to use?
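A small sketch (not from the slides) tabulating the two risk functions in place of the plot:

```python
import numpy as np

p = np.linspace(0.0, 1.0, 11)
risk_point = p * (1 - p) / 4          # risk of the unbiased estimator X-bar (n = 4)
risk_bayes = np.full_like(p, 1 / 36)  # constant risk of the Beta(1, 1) Bayes estimator

for pi, r1, r2 in zip(p, risk_point, risk_bayes):
    print(f"p = {pi:.1f}   R(delta1) = {r1:.4f}   R(delta2) = {r2:.4f}")
# The Bayes estimator has lower risk for p near 1/2; X-bar is better near 0 or 1.
```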
Admissibility
• A decision 1 is inadmissible if there exists
another decision  2 such that
R(1, )  R( 2 , ) 
Comparison of risk
1
0.8
Risk
Decision rule 1 is inadmissible
because it’s risk is higher than
the other decision rule for all
values of p
Decision 1
Decision 2
0.6
0.4
0.2
0
0
0.2
0.4
0.6
p
0.8
1
Choosing estimators
• Restrict the set of estimators to those that are admissible.
• Minimax
– Choose the decision rule that has the lowest maximum value of risk:
$$ \min_{\delta} \left\{ \max_{\theta \in \Theta} R(\theta, \delta) \right\} $$
• Bayes risk
– Assume θ comes from a distribution with p.d.f. π(θ).
– The Bayes risk is the average risk across values of θ:
$$ B(\delta) = \int_{\Theta} R(\theta, \delta)\, \pi(\theta)\, d\theta $$
– Select the estimator with the smallest Bayes risk.
Comparing minimax and Bayes
Purchase price: $5; sale price (not defective): $6; penalty cost (defective): $2; probability of defective: p.
Decision d1: purchase; decision d2: do not purchase.

No. of defectives x    Decision Rule δ1    Decision Rule δ2
0                      d1                  d1
1                      d2                  d1
2                      d2                  d2

Loss functions: L(d1, p) = 8p - 1,  L(d2, p) = 2
Risk functions: R(δ1, p) = 8p³ - 19p² + 14p - 1,  R(δ2, p) = -8p³ + 3p² + 8p - 1
Comparing minimax and Bayes
A retailer must decide whether to purchase a large lot of items containing an unknown fraction p of defectives. The purchase price is $5 per item. If an item is not defective, he can sell it at $6, so the profit would be $1. But if an item is defective, it must be junked. Moreover, there is a penalty cost of $2 for no trade. The retailer inspects two items at random to decide whether to purchase (d1) or not purchase (d2), following the decision rules δ1 and δ2 in the table above.
Comparing minimax and Bayes
Loss functions:
$$ L(d_1, p) = (2 + 5)p - 1 \cdot (1 - p) = 8p - 1, \qquad L(d_2, p) = 2 $$
Risk functions:
$$ R(\delta_1, p) = L(d_1, p) P(d_1 \mid p) + L(d_2, p) P(d_2 \mid p) = (8p - 1) P(X = 0 \mid p) + 2 P(X = 1 \text{ or } 2 \mid p) $$
$$ = (8p - 1)(1 - p)^2 + 2\left[ 1 - (1 - p)^2 \right] = 8p^3 - 19p^2 + 14p - 1 $$
$$ R(\delta_2, p) = L(d_1, p) P(d_1 \mid p) + L(d_2, p) P(d_2 \mid p) = (8p - 1) P(X = 0 \text{ or } 1 \mid p) + 2 P(X = 2 \mid p) $$
$$ = (8p - 1)(1 - p^2) + 2p^2 = -8p^3 + 3p^2 + 8p - 1 $$
Comparing minimax and Bayes
The maximum of R(δ1, p) over p is 2.289 and the maximum of R(δ2, p) is 2.329, so δ1 is the minimax rule.
For the prior π(p) = 2(1 - p):
$$ B(\delta_1) = \int_0^1 R(\delta_1, p)\, \pi(p)\, dp = \frac{39}{30}, \qquad B(\delta_2) = \int_0^1 R(\delta_2, p)\, \pi(p)\, dp = \frac{41}{30} $$
so δ1 is also the Bayes rule.
[Figure: Risk of δ1 and risk of δ2 plotted against p on (0, 1).]
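A brief numerical check of these maximum-risk and Bayes-risk values (an illustrative sketch, not from the slides), using the risk functions derived above:

```python
import numpy as np

p = np.linspace(0, 1, 100_001)
R1 = 8 * p**3 - 19 * p**2 + 14 * p - 1
R2 = -8 * p**3 + 3 * p**2 + 8 * p - 1
prior = 2 * (1 - p)                      # prior density pi(p) = 2(1 - p)

print(R1.max(), R2.max())                                  # ~2.289 and ~2.329: delta1 is minimax
print(np.trapz(R1 * prior, p), np.trapz(R2 * prior, p))    # ~1.300 and ~1.367: delta1 is Bayes
```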
Bayes Decision Rule
• Is δ1 the best rule among all the decision rules?
$$ B(\delta) = \int_{\Theta} R(\theta, \delta)\, \pi(\theta)\, d\theta = \int_{\Theta} \left[ \int_{X} L(\delta(x), \theta)\, f(x \mid \theta)\, dx \right] \pi(\theta)\, d\theta
 = \int_{X} \left[ \int_{\Theta} L(\delta(x), \theta)\, f(x \mid \theta)\, \pi(\theta)\, d\theta \right] dx = \int_{X} A(\delta(x), x)\, dx $$
where
$$ A(d, x) = \int_{\Theta} L(d, \theta)\, f(x \mid \theta)\, \pi(\theta)\, d\theta $$
Bayes Decision Rule
Since X is discrete for this problem,
$$ B(\delta) = \int_{\Theta} R(\theta, \delta)\, \pi(\theta)\, d\theta = \sum_{x \in X} A(\delta(x), x) $$
Then B(δ) is minimized by choosing, for each x ∈ X, the decision d = δ(x) which minimizes A(d, x).
For the following loss functions:
$$ L(d_1, p) = 8p - 1 \quad \text{and} \quad L(d_2, p) = 2 $$
Bayes Decision Rule
$$ A(d_1, x) = \int_0^1 L(d_1, p)\, f(x \mid p)\, \pi(p)\, dp
 = \int_0^1 (8p - 1) \binom{2}{x} p^x (1 - p)^{2 - x} \cdot 2(1 - p)\, dp $$
$$ = \frac{2! \cdot 2}{x!(2 - x)!} \left[ 8 \int_0^1 p^{x+1} (1 - p)^{3 - x}\, dp - \int_0^1 p^{x} (1 - p)^{3 - x}\, dp \right]
 = \frac{4}{x!(2 - x)!} \left[ 8\, \frac{(x+1)!(3 - x)!}{5!} - \frac{x!(3 - x)!}{4!} \right]
 = \frac{(8x + 3)(3 - x)}{30} $$
Note:
$$ \int_0^1 p^a (1 - p)^b\, dp = \frac{a!\, b!}{(a + b + 1)!} $$
Bayes Decision Rule
$$ A(d_1, x) = \int_0^1 L(d_1, p)\, f(x \mid p)\, \pi(p)\, dp = \frac{(8x + 3)(3 - x)}{30}, \qquad
A(d_2, x) = \int_0^1 L(d_2, p)\, f(x \mid p)\, \pi(p)\, dp = \frac{3 - x}{3} $$
• Now we can check:
x = 0: A(d1, x) = 9/30 < A(d2, x) = 1, so choose d1
x = 1: A(d1, x) = 22/30 > A(d2, x) = 2/3, so choose d2
x = 2: A(d1, x) = 19/30 > A(d2, x) = 1/3, so choose d2
• Therefore δ_min = (d1, d2, d2) = δ1 is the best rule among all decision rules with respect to the given prior.
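A short numerical sketch (not from the slides) that recovers these A(d, x) values by integrating on a grid under the prior π(p) = 2(1 - p):

```python
import numpy as np
from math import comb

p = np.linspace(0, 1, 100_001)
prior = 2 * (1 - p)

for x in (0, 1, 2):
    f_x_given_p = comb(2, x) * p**x * (1 - p) ** (2 - x)   # Binomial(2, p) p.m.f.
    A_d1 = np.trapz((8 * p - 1) * f_x_given_p * prior, p)  # expected loss of "purchase"
    A_d2 = np.trapz(2 * f_x_given_p * prior, p)            # expected loss of "do not purchase"
    best = "d1" if A_d1 < A_d2 else "d2"
    print(x, round(A_d1, 4), round(A_d2, 4), best)         # recovers delta1 = (d1, d2, d2)
```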
Loss functions revisited (hypothesis testing)
• Consider the hypothesis testing problem:
H0: θ = θ0 vs. H1: θ = θ1
where θ is the unknown parameter, with prior probabilities π0 = π(θ0) of H0 and π1 = π(θ1) of H1, where π0 + π1 = 1.
Loss functions revisited (hypothesis testing)
• What would be a good loss function?
$$ L(d_0, \theta_0) = 0, \quad L(d_1, \theta_0) = c_0 \qquad \text{and} \qquad L(d_0, \theta_1) = c_1, \quad L(d_1, \theta_1) = 0 $$
where
d0: do not reject H0
d1: reject H0
c0: the cost of a type I error
c1: the cost of a type II error
Loss functions revisited (hypothesis testing)
• We have
$$ A(d_0; x_1, x_2, \dots, x_n) = c_1\, \pi_1\, f(x_1, x_2, \dots, x_n \mid \theta_1) $$
and
$$ A(d_1; x_1, x_2, \dots, x_n) = c_0\, \pi_0\, f(x_1, x_2, \dots, x_n \mid \theta_0) $$
The Bayes test rejects H0 (makes decision d1) if
$$ A(d_1; x_1, x_2, \dots, x_n) < A(d_0; x_1, x_2, \dots, x_n) $$
which holds if
$$ \frac{\pi_1\, f(x_1, x_2, \dots, x_n \mid \theta_1)}{\pi_0\, f(x_1, x_2, \dots, x_n \mid \theta_0)} > \frac{c_0}{c_1} $$