Likelihood, Bayesian and
Decision Theory
Kenneth Yu
History
• The likelihood principle was first
introduced by R.A. Fisher in 1922.
The law of likelihood was identified
by Ian Hacking.
• "Modern statisticians are familiar
with the notion that any finite body
of data contains only a limited
amount of information on any point
under examination; that this limit is
set by the nature of the data
themselves…the statistician's task,
in fact, is limited to the extraction
of the whole of the available
information on any particular
issue." R. A. Fisher
Likelihood Principle
• All relevant information in the data is contained in the likelihood function L(θ | x) = P(X = x | θ).
Law of Likelihood
• The extent to which the evidence supports one parameter value over another can be measured by taking the ratio of their likelihoods.
• These two concepts allow us to utilize
likelihood for inferences on θ.
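As a quick numeric illustration of the law of likelihood (the data here are hypothetical): with 7 heads in 10 tosses, the evidence for θ = 0.7 over θ = 0.5 is their likelihood ratio.

% Hypothetical example: 7 heads in 10 tosses; compare theta = 0.7 against 0.5.
n = 10; h = 7;
L = @(theta) theta.^h .* (1 - theta).^(n - h);    % likelihood, up to a constant
ratio = L(0.7) / L(0.5)                           % about 2.28, favoring theta = 0.7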
Motivation and Applications
• Likelihood (especially the MLE) is used in a range of statistical models, such as structural equation modeling, confirmatory factor analysis, linear models, etc., to make inferences about the parameter in a function. Its importance comes from the need to find the "best" parameter value in the presence of error.
• This makes use of only the evidence and disregards the prior probability of the hypothesis. By making inferences on unknown parameters from our past observations, we are able to estimate the true θ value for the population.
• The likelihood is a function of the form:
$L(\theta \mid x) \in \{\, \alpha\, P(X = x \mid \theta) : \alpha > 0 \,\}$
• This represents how "likely" θ is given the observed outcomes x. It is proportional to the probability of X = x happening given the parameter θ.
• Likelihood functions are equivalent if they differ only by a constant α (they are proportional). Inferences on the parameter θ are the same if based on equivalent functions.
Maximum Likelihood Method
By Hanchao
Main topics include:
• 1. Why use the Maximum Likelihood Method?
• 2. Likelihood Function
• 3. Maximum Likelihood Estimators
• 4. How to calculate the MLE?
1. Why use Maximum
Likelihood Method?
Difference between:
Method of Moments
&
Method of Maximum likelihood
• Mostly, the same!
• However, the method of maximum likelihood does yield "good" estimators:
1. It is an after-the-fact calculation from the observed data.
2. It gives more versatile methods for fitting parametric statistical models to data.
3. It is well suited to large data samples.
2. Likelihood Function
• Definition: Let $f(x_1, \ldots, x_n; \theta)$, $\theta \in \Theta \subseteq R^k$, be the joint probability (or density) function of the $n$ random variables $X_1, \ldots, X_n$ with sample values $x_1, \ldots, x_n$. The likelihood function of the sample is given by:
$L(\theta; x_1, \ldots, x_n) = f(x_1, \ldots, x_n; \theta)$
• If $X_1, \ldots, X_n$ are discrete iid random variables with probability function $p(x, \theta)$, then the likelihood function is given by
$L(\theta) = P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i) = \prod_{i=1}^{n} p(x_i, \theta)$
• In the continuous case, if the density is $f(x, \theta)$, then the likelihood function is given by
$L(\theta) = \prod_{i=1}^{n} f(x_i, \theta)$
• Example: Let $X_1, \ldots, X_n$ be $N(\mu, \sigma^2)$ iid random variables. Find the likelihood function.
$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) = \frac{1}{(2\pi)^{n/2}\,\sigma^{n}} \exp\!\left(-\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{2\sigma^2}\right)$
4. Procedure of one approach to find the MLE
• 1). Define the likelihood function, L(θ)
• 2). Take the natural logarithm (ln) of L(θ)
• 3). Differentiate ln L(θ) with respect to θ, and then equate the derivative to 0.
• 4). Solve for the parameter θ to obtain the estimate $\hat\theta$.
• 5). Check whether it is a maximum (ideally a global maximum).
• Still confused? A numerical sketch of the same steps follows.
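Where no closed form is convenient, the same steps can be mirrored numerically; a minimal sketch (the data are made up, and the Poisson model anticipates Ex1 below) that minimizes the negative log-likelihood with fminsearch:

% Numerical MLE sketch: maximize a Poisson log-likelihood via fminsearch,
% parameterized as eta = log(lambda) so the search stays unconstrained.
x = [2 0 3 1 4 2 1];                              % hypothetical sample
negloglik = @(eta) -sum(x * eta - exp(eta) - gammaln(x + 1));
lambdaHat = exp(fminsearch(negloglik, 0))         % should be close to mean(x)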
Ex1: Suppose $X_1, \ldots, X_n$ are random samples from a Poisson distribution with parameter $\lambda$. Find the MLE $\hat\lambda$.
We have the pmf:
$p(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots; \;\; \lambda > 0$
Hence, the likelihood function is:
$L(\lambda) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \frac{\lambda^{\sum x_i}\, e^{-n\lambda}}{\prod_{i=1}^{n} x_i!}$
Differentiating $\ln L(\lambda)$ with respect to $\lambda$ results in:
$\frac{d \ln L(\lambda)}{d\lambda} = \frac{\sum_{i=1}^{n} x_i}{\lambda} - n$
And setting the result equal to zero:
$\frac{d \ln L(\lambda)}{d\lambda} = \frac{\sum_{i=1}^{n} x_i}{\lambda} - n = 0$
That is, $\lambda = \frac{\sum_{i=1}^{n} x_i}{n} = \bar{x}$.
Hence, the MLE of $\lambda$ is $\hat\lambda = \bar{X}$.
Ex2: Let $X_1, \ldots, X_n$ be $N(\mu, \sigma^2)$.
a) If $\mu$ is unknown and $\sigma^2 = \sigma_0^2$ is known, find the MLE for $\mu$.
b) If $\mu = \mu_0$ is known and $\sigma^2$ is unknown, find the MLE for $\sigma^2$.
c) If $\mu$ and $\sigma^2$ are both unknown, find the MLE for $\theta = (\mu, \sigma^2)$.
• Ans: Let $\theta = \sigma^2$, so the likelihood function is:
$L(\mu, \theta) = (2\pi\theta)^{-n/2} \exp\!\left(-\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{2\theta}\right)$
So after taking the natural log we have:
$\ln L(\mu, \theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\theta - \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{2\theta}$
• a). When $\sigma^2 = \sigma_0^2$ is known, we only need to solve for the unknown parameter $\mu$:
$\frac{\partial}{\partial\mu}\ln L(\mu, \theta_0) = \frac{\sum_{i=1}^{n} 2(x_i - \mu)}{2\theta_0} = 0 \;\Rightarrow\; \sum_{i=1}^{n}(x_i - \mu) = 0 \;\Rightarrow\; \sum_{i=1}^{n} x_i - n\mu = 0 \;\Rightarrow\; \hat\mu = \bar{x}$
• b). When $\mu = \mu_0$ is known, we only need to solve for the one parameter $\theta = \sigma^2$:
$\frac{\partial}{\partial\theta}\ln L(\mu_0, \theta) = -\frac{n}{2\theta} + \frac{\sum_{i=1}^{n}(x_i - \mu_0)^2}{2\theta^2} = 0 \;\Rightarrow\; \hat\sigma^2 = \hat\theta = \frac{\sum_{i=1}^{n}(X_i - \mu_0)^2}{n}$
• c). When both $\mu$ and $\theta$ are unknown, we need to differentiate with respect to both parameters and mostly follow the same steps as in parts a). and b); the resulting estimates are shown below.
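Carrying out part c) explicitly gives the standard result (stated here for completeness, since the slide only outlines the steps):
$\hat\mu = \bar{X}, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$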
Reality example: sound localization
Diagram: two microphones (Mic1, Mic2) feeding an MCU.
Robust Sound Localization
IEEE Transactions on Signal Processing, Vol. 53, No. 6, June 2005
Diagram: a sound source reaching the microphones along with noise and reverberations.
The ideality and reality
Figure: the signals received at Mic1 and Mic2 for a 1 kHz source at 1 meter and a 60 degree angle.
Figure: the Fourier transform of the received signal (amplitude vs. frequency, in units of 100 Hz) shows the noise.
Algorithm:
1. Signal collection (original signal samples in the time domain):
$m_1(t) = s(t) + n_1(t)$
$m_2(t) = s(t - \tau) + n_2(t)$
2. Cross-correlation (received signals after the DFT, in the frequency domain):
$\tilde\tau = \arg\max_{\tau} \int_{-\infty}^{\infty} M_1(\omega)\, M_2^*(\omega)\, e^{j\omega\tau}\, d\omega$
• However, we have noise mixed into the signal, so the weighted cross-correlation algorithm becomes:
$\tilde\tau = \arg\max_{\tau} \int_{-\infty}^{\infty} W(\omega)\, M_1(\omega)\, M_2^*(\omega)\, e^{j\omega\tau}\, d\omega$
• where the ML method is used as the weighting function to reduce the sensitivity to noise and reverberations:
$W(\omega) = \frac{|M_1(\omega)|\,|M_2(\omega)|}{|N_1(\omega)|^2 |M_2(\omega)|^2 + |N_2(\omega)|^2 |M_1(\omega)|^2}$
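A minimal simulation sketch of the plain (unweighted) cross-correlation step: a hypothetical broadband source, made-up noise levels, and a known delay are used to check that the correlation peak recovers the delay.

% Estimate the inter-microphone delay by maximizing the cross-correlation.
s  = randn(1, 1024);                          % hypothetical broadband source
d  = 7;                                       % true delay of mic 2, in samples
m1 = s + 0.1*randn(size(s));                  % mic 1: source plus noise
m2 = [zeros(1, d), s(1:end-d)] + 0.1*randn(size(s));   % mic 2: delayed source plus noise
maxLag = 20; c = zeros(1, 2*maxLag + 1);
for lag = -maxLag:maxLag                      % brute-force cross-correlation
    if lag >= 0
        c(lag + maxLag + 1) = sum(m1(1:end-lag) .* m2(1+lag:end));
    else
        c(lag + maxLag + 1) = sum(m1(1-lag:end) .* m2(1:end+lag));
    end
end
[~, kmax] = max(c);
dHat = kmax - maxLag - 1                      % should recover d = 7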
The disadvantage of MLE
• Complicated calculation (slow): it is often almost the last approach used to solve a problem.
• Approximate results (not exact).
Reference:
[1] Halupka, 2005, Robust sound localization in 0.18 um CMOS.
[2] S. Zucker, 2003, Cross-correlation and maximum-likelihood analysis: a new approach to combining cross-correlation functions.
[3] Tamhane and Dunlop, "Statistics and Data Analysis: from Elementary to Intermediate", Chap. 15.
[4] Kandethody M. Ramachandran and Chris P. Tsokos, "Mathematical Statistics with Applications", pp. 235-252.
Likelihood ratio test
Ji Wang
Brief Introduction
• The likelihood ratio test was first proposed by Neyman and E. Pearson in 1928. This test method is widely used and often has some kind of optimality.
• In statistics, a likelihood ratio test is used to compare the fit of two models, one of which is nested within the other. This often occurs when testing whether a simplifying assumption for a model is valid, as when two or more model parameters are assumed to be related.
Introduction to the most powerful test
For the hypotheses $H_0: \theta \in \Theta_0 \leftrightarrow H_1: \theta \in \Theta_1$, suppose we have two test functions $Y_1$ and $Y_2$. If $E_\theta Y_1 \ge E_\theta Y_2$ for every $\theta \in \Theta_1$ (*), then we call $Y_1$ more powerful than $Y_2$.
If there is a test function $Y$ satisfying the (*) inequality against every test function $Y_2$, then we call $Y$ the uniformly most powerful test.
The advantage of the likelihood ratio test compared to the significance test
• The significance test can only deal with hypotheses of a simple, specific form such as $H_0: \theta = \theta_0 \leftrightarrow H_1: \theta = \theta_1$, but cannot handle the very common hypotheses $H_0: \theta \in \Theta_0 \leftrightarrow H_1: \theta \in \Theta_1 = \Theta - \Theta_0$, because we cannot use the method of the significance test to find the rejection region.
Definition of the likelihood ratio test statistic
• $X_1, \ldots, X_n$ are a random sample from the family of distributions $F = \{ f(x, \theta) : \theta \in \Theta \}$. For the test $H_0: \theta \in \Theta_0 \leftrightarrow H_1: \theta \in \Theta_1 = \Theta - \Theta_0$, let
$\lambda(x) = \frac{\max_{\theta \in \Theta_0} l(\theta \mid x_1, \ldots, x_n)}{\max_{\theta \in \Theta} l(\theta \mid x_1, \ldots, x_n)}$
We call $\lambda(X)$ the likelihood ratio of the above-mentioned hypothesis. Sometimes it is also called the generalized likelihood ratio.
# From the definition of the likelihood ratio test statistic, we can see that if the value of $\lambda(X)$ is small, the alternative hypothesis $H_1: \theta \in \Theta_1$ explains the data much better than the null hypothesis $H_0: \theta \in \Theta_0$, so it is reasonable for us to reject the null hypothesis.
Thus, this test rejects $H_0$ if
$\lambda(X) \le C$
The definition of the likelihood ratio test
• We use $\lambda(X)$ as the test statistic of the test
$H_0: \theta \in \Theta_0 \leftrightarrow H_1: \theta \in \Theta_1 = \Theta - \Theta_0$
and the rejection region is $\{\lambda(X) \le C\}$, where $C$ satisfies the inequality
$P_\theta\{\lambda(X) \le C\} \le \alpha$ for every $\theta \in \Theta_0$.
Then this test is the likelihood ratio test of level $\alpha$.
# If we do not know the distribution of $\lambda(X)$ under the null hypothesis, it is very difficult for us to find the critical value of the LRT. However, if there is a statistic $T(X)$ which is monotone in $\lambda(X)$ and whose distribution under the null hypothesis is known, we can make a significance test based on $T(X)$.
The steps to make a likelihood ratio test
• Step 1: Find the likelihood function of the sample $X_1, \ldots, X_n$.
• Step 2: Find $\lambda(X)$, the test statistic, or some other statistic that is monotone in $\lambda(X)$.
• Step 3: Construct the rejection region by controlling the type I error at the significance level $\alpha$.
• Example: $X_1, \ldots, X_n$ are random samples having the pdf
$f(x, \theta) = e^{-(x - \theta)}, \quad x \ge \theta, \quad \theta \in \Theta = R$
Derive the rejection region of the level-$\alpha$ test of $H_0: \theta = 0 \leftrightarrow H_1: \theta > 0$.
Solution:
● Step 1: the sample distribution is
$f(X, \theta) = e^{-\sum_{i=1}^{n}(x_i - \theta)}\, I(x_{(1)} \ge \theta)$
and it is also the likelihood function. The parameter space is $\Theta = R$, $\Theta_0 = \{0\}$. Then we derive
$\max_{\theta \in \Theta_0} l(\theta \mid x_1, \ldots, x_n) = e^{-\sum_{i=1}^{n} x_i}, \qquad \max_{\theta \in \Theta} l(\theta \mid x_1, \ldots, x_n) = e^{-\sum_{i=1}^{n} x_i + n x_{(1)}}$
● Step 2: the likelihood ratio test statistic is
$\lambda(X) = e^{-n X_{(1)}} = e^{-\frac{1}{2}\,(2 n X_{(1)})}$
We can just use $2 n X_{(1)}$, because it is monotone in $\lambda(x)$.
● Step 3: Under the null hypothesis, $2 n X_{(1)} \sim \chi^2(2)$, so the critical value $c = \chi_\alpha^2(2)$ is found from $P_{\theta = 0}\{2 n X_{(1)} > c\} = \alpha$.
That is to say, $2 n X_{(1)}$ is the likelihood ratio test statistic and the rejection region is $\{2 n X_{(1)} > \chi_\alpha^2(2)\}$.
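A minimal simulation sketch of this test (the sample size, the true shift, and α are assumed values; under H0, $2 n X_{(1)} \sim \chi^2(2)$ as derived above):

% LRT for the shifted-exponential example: reject H0 when 2*n*X(1) > chi2_alpha(2).
n = 30; theta = 0.15;                 % hypothetical sample size and true shift
x = theta - log(rand(1, n));          % shifted exponential sample (Exp(1) + theta)
T = 2 * n * min(x);                   % the statistic 2 n X_(1)
alpha = 0.05;
c = -2 * log(alpha);                  % chi-square(2) upper alpha point, about 5.99
rejectH0 = T > c                      % with theta > 0 this should usually be true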
Wald Sequential Probability Ratio Test
Xiao Yu
So far we assumed that the sample size is fixed in advance. What if it is not fixed?
Abraham Wald (1902-1950) developed the sequential probability ratio test (SPRT) by applying the idea of likelihood ratio testing; it samples sequentially, taking observations one at a time.
Hypotheses: $H_0: \theta = \theta_0; \;\; H_1: \theta = \theta_1$
$\lambda_n = \lambda_n(x_1, x_2, \ldots, x_n) = \frac{L_n(\theta_1 \mid x_1, x_2, \ldots, x_n)}{L_n(\theta_0 \mid x_1, x_2, \ldots, x_n)} = \frac{\prod_{i=1}^{n} f(x_i \mid \theta_1)}{\prod_{i=1}^{n} f(x_i \mid \theta_0)}$
With $A = \frac{\beta}{1-\alpha}$ and $B = \frac{1-\beta}{\alpha}$:
• If $\lambda_n(x_1, x_2, \ldots, x_n) \le A$, stop sampling and decide to not reject $H_0$.
• If $A < \lambda_n(x_1, x_2, \ldots, x_n) < B$, continue sampling.
• If $\lambda_n(x_1, x_2, \ldots, x_n) \ge B$, stop sampling and decide to reject $H_0$.
SPRT for Bernoulli Parameter
• An electrical parts manufacturer receives a large lot of fuses from a vendor. The lot is regarded as "satisfactory" if the fraction defective p is no more than 0.1; otherwise it is regarded as "unsatisfactory".
$H_0: p = p_0 = 0.1; \;\; H_1: p = p_1 = 0.3$
$\lambda_n = \left(\frac{p_1}{p_0}\right)^{s_n} \left(\frac{1-p_1}{1-p_0}\right)^{n - s_n}$, where $s_n$ is the number of defectives among the first $n$ fuses.
With $\alpha = 0.10$ and $\beta = 0.20$ we have $\ln A = -1.504$ and $\ln B = 2.079$; taking logs, the test continues sampling while
$-1.114 + 0.186\,n < s_n < 1.540 + 0.186\,n$,
accepting $H_0$ at or below the lower boundary and rejecting $H_0$ at or above the upper boundary.
Fisher Information
$I(\theta) = E\!\left[\left(\frac{d \ln f(X \mid \theta)}{d\theta}\right)^{2}\right] = -E\!\left[\frac{d^{2} \ln f(X \mid \theta)}{d\theta^{2}}\right]$
The score is $\frac{d \ln f(X \mid \theta)}{d\theta}$, and it has mean zero: $E\!\left[\frac{d \ln f(X \mid \theta)}{d\theta}\right] = 0$.
Cramer-Rao Lower Bound
$\mathrm{Var}(\hat\theta) \ge \frac{1}{n\, I(\theta)}$
Single-Parameter Bernoulli experiment
• The Fisher information contained in n independent Bernoulli trials may be calculated as follows. In the following, A represents the number of successes and B the number of failures.
$I(\theta) = -E\!\left[\frac{\partial^2}{\partial\theta^2} \ln f(A;\theta) \,\Big|\, \theta\right] = -E\!\left[\frac{\partial^2}{\partial\theta^2} \ln\!\left(\theta^{A}(1-\theta)^{B}\,\frac{(A+B)!}{A!\,B!}\right) \Big|\, \theta\right]$
$= -E\!\left[\frac{\partial^2}{\partial\theta^2}\big(A\ln\theta + B\ln(1-\theta)\big) \,\Big|\, \theta\right] = E\!\left[\frac{A}{\theta^2} + \frac{B}{(1-\theta)^2} \,\Big|\, \theta\right] = \frac{n\theta}{\theta^2} + \frac{n(1-\theta)}{(1-\theta)^2} = \frac{n}{\theta(1-\theta)}$
We can see it is the reciprocal of the variance of the sample proportion of successes in the n Bernoulli trials. The more the variance, the less the Fisher information.
Large Sample Inferences Based on the MLE's
For large samples, the MLE satisfies approximately
$\hat\theta - \theta \approx \frac{d \ln L(\theta)/d\theta}{-\,d^{2} \ln L(\theta)/d\theta^{2}}$ and $\hat\theta \sim N\!\left(\theta,\; \frac{1}{n\, I(\theta)}\right)$.
Plugging in the Fisher information of Bernoulli trials, we can see the estimator is consistent, as we have learned.
• An approximate large-sample (1 - α)-level confidence interval (CI) is given by
$\hat\theta - z_{\alpha/2}\,\frac{1}{\sqrt{n\, I(\hat\theta)}} \;\le\; \theta \;\le\; \hat\theta + z_{\alpha/2}\,\frac{1}{\sqrt{n\, I(\hat\theta)}}$
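A minimal numeric sketch of this interval for Bernoulli data (the counts below are hypothetical; $I(\theta) = 1/(\theta(1-\theta))$ from the previous slide):

% Large-sample CI for a Bernoulli theta: thetaHat +/- z_{alpha/2} / sqrt(n I(thetaHat)).
n = 200; successes = 130;                   % hypothetical counts
thetaHat = successes / n;
alpha = 0.05;
z = sqrt(2) * erfinv(1 - alpha);            % z_{alpha/2}, about 1.96
se = sqrt(thetaHat * (1 - thetaHat) / n);   % equals 1/sqrt(n I(thetaHat))
ci = [thetaHat - z*se, thetaHat + z*se]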
Bayes' theorem
Jaeheun Kim
Thomas Bayes (1702-1761)
- English mathematician and a Presbyterian minister born in London
- He proved a specific case of the theorem (Bayes' theorem), which was published after his death (by Richard Price).
Bayesian inference
• Bayesian inference is a method of statistical
inference in which some kind of evidence or
observations are used to calculate the
probability that a hypothesis may be true, or
else to update its previously-calculated
probability.
• "Bayesian" comes from its use of the Bayes'
theorem in the calculation process.
BAYES’ THEOREM
Bayes' theorem shows the relation between two
conditional probabilities
P( A  B)  P( A | B) P( B)  P( B | A) P( A)
P( A | B) P( B)
P( B | A) 
P( A)
• we can make updated probability(posterior
probability) from the initial probability(prior
probability) using new information.
• we call this updating process Bayes' Theorem
Prior
prob.
New
info.
Using
Bayes
thm
Posterior
prob.
MONTY HALL
Should we switch the door or stay?
http://en.wikipedia.org/wiki/Monty_Hall_problem
A contestant chose door 1 and then the host opened one of the other doors (door 3). Would switching from door 1 to door 2 increase the chances of winning the car?
$D_i$ = {Door i conceals the car}
$O_j$ = {Host opens Door j after the contestant chooses Door 1}
$p(D_1) = p(D_2) = p(D_3) = \frac{1}{3}$
$p(O_3 \mid D_1) = \frac{1}{2}, \quad p(O_3 \mid D_2) = 1, \quad p(O_3 \mid D_3) = 0$
$p(D_1 \mid O_3) = \frac{p(O_3 \mid D_1)\,p(D_1)}{p(O_3 \mid D_1)\,p(D_1) + p(O_3 \mid D_2)\,p(D_2) + p(O_3 \mid D_3)\,p(D_3)} = \frac{\frac{1}{2}\cdot\frac{1}{3}}{\frac{1}{2}\cdot\frac{1}{3} + 1\cdot\frac{1}{3} + 0\cdot\frac{1}{3}} = \frac{1}{3}$ (when you stay)
$p(D_2 \mid O_3) = \frac{p(O_3 \mid D_2)\,p(D_2)}{p(O_3 \mid D_1)\,p(D_1) + p(O_3 \mid D_2)\,p(D_2) + p(O_3 \mid D_3)\,p(D_3)} = \frac{1\cdot\frac{1}{3}}{\frac{1}{2}\cdot\frac{1}{3} + 1\cdot\frac{1}{3} + 0\cdot\frac{1}{3}} = \frac{2}{3}$ (when you switch)
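A small Monte Carlo sketch that checks these posterior probabilities by simulation (the trial count and door numbering are arbitrary choices):

% Monte Carlo check of the Monty Hall answer: staying wins ~1/3, switching ~2/3.
nTrials = 1e5; stayWins = 0; switchWins = 0;
for t = 1:nTrials
    car  = randi(3);                          % door hiding the car
    pick = 1;                                 % contestant always picks door 1
    % host opens a door that is neither the contestant's pick nor the car
    candidates = setdiff(1:3, [pick car]);
    opened = candidates(randi(numel(candidates)));
    switchPick = setdiff(1:3, [pick opened]);
    stayWins   = stayWins   + (pick == car);
    switchWins = switchWins + (switchPick == car);
end
[stayWins switchWins] / nTrials               % approximately [1/3 2/3]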
15.3.1 Bayesian Estimation
Zhenrui & friends
Premises for doing a Bayesian estimation:
1. Prior knowledge about the unknown parameter θ ("Trust me. I know this θ!")
2. A probability distribution for θ: π(θ) (the prior distribution)
General equation:
$\pi^*(\theta) = \frac{f(x_1, x_2, \ldots, x_n \mid \theta)\,\pi(\theta)}{\int_\Theta f(x_1, x_2, \ldots, x_n \mid \theta)\,\pi(\theta)\,d\theta} \;\propto\; f(X \mid \theta)\,\pi(\theta)$
where
• θ is the unknown parameter from a distribution with pdf/pmf f(x | θ); it is considered a random variable in Bayesian estimation;
• π(θ) is the prior distribution;
• π*(θ) is the posterior distribution;
• f(x1, x2, ..., xn | θ) is the likelihood function of θ based on the observed values x1, x2, ..., xn;
• the denominator is the marginal pdf of X1, X2, ..., Xn, just a normalizing constant that makes $\int_\Theta \pi^*(\theta)\,d\theta = 1$.
The mean µ* and variance σ*² of π*(θ) are called the posterior mean and variance, respectively. µ* can be used as a point estimate of θ (the Bayes estimate).
Bayesian Estimation continued
A question: If I only know the possible value range of θ, but can't summarize it in the form of a probability distribution, can I still do a Bayesian estimation?
No! To apply Bayes' theorem, every term in the equation has to be a probability term: π(θ) works, a bare range for θ does not.
Criticisms of the Bayesian approach:
1. Perceptions of prior knowledge differ from person to person; it is 'subjective'.
2. It can be too fuzzy to quantify the prior knowledge of θ in the form of a distribution.
Conjugate Priors: $\pi^*(\theta) \propto f(X \mid \theta)\,\pi(\theta)$
A family of prior distributions such that the posterior distribution is of the same form as the prior distribution.
Examples of conjugate priors (from the textbook, Examples 15.25 and 15.26):
• The normal distribution is a conjugate prior on µ of N(µ, σ²) (if σ² is already known).
• The Beta distribution is a conjugate prior on p of a binomial distribution Bin(n, p).
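A short derivation of the Beta-binomial case (a standard conjugacy result; the prior parameters a, b and the counts s out of n are generic symbols, not values from the slides):

\pi^*(p) \;\propto\; \underbrace{p^{s}(1-p)^{n-s}}_{\text{Bin}(n,p)\ \text{likelihood}} \;\underbrace{p^{a-1}(1-p)^{b-1}}_{\text{Beta}(a,b)\ \text{prior}} \;=\; p^{a+s-1}(1-p)^{b+n-s-1}
\;\Longrightarrow\; p \mid x \sim \mathrm{Beta}(a+s,\; b+n-s), \qquad \mu^* = \frac{a+s}{a+b+n}.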
15.3.2 Bayesian Testing
Simple vs. simple hypothesis test: $H_0: \theta = \theta_0$ vs. $H_1: \theta = \theta_1$
Prior probabilities of H0 and H1: $\pi_0 = \pi(\theta_0)$, $\pi_1 = \pi(\theta_1) = 1 - \pi_0$.
Posterior probabilities:
$\pi_0^* = \pi^*(\theta_0) = \frac{a}{a+b}, \qquad \pi_1^* = \pi^*(\theta_1) = \frac{b}{a+b} = 1 - \pi_0^*$
where $a = \pi_0\, f(x_1, x_2, \ldots, x_n \mid \theta_0)$ and $b = \pi_1\, f(x_1, x_2, \ldots, x_n \mid \theta_1)$. Therefore
$\frac{\pi_1^*}{\pi_0^*} = \frac{b}{a} = \frac{\pi_1\, f(x_1, x_2, \ldots, x_n \mid \theta_1)}{\pi_0\, f(x_1, x_2, \ldots, x_n \mid \theta_0)}$
A Bayesian test rejects $H_0$ if $\frac{\pi_1^*}{\pi_0^*} > k$, where k > 0 is a suitably chosen critical constant. A large value of k corresponds to a small value of α.
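A minimal numeric sketch of this posterior-odds rule (the observations, the two hypothesized means, and the unit variance are all assumed purely for illustration):

% Posterior odds pi1*/pi0* for H0: theta = 0 vs H1: theta = 1 with equal priors,
% using hypothetical N(theta, 1) observations.
x = [0.8 1.3 0.2 1.1 0.9];
pi0 = 0.5; pi1 = 0.5;
f0 = prod(exp(-(x - 0).^2 / 2));    % f(x | theta0); common constants cancel in the ratio
f1 = prod(exp(-(x - 1).^2 / 2));    % f(x | theta1)
postOdds = (pi1 * f1) / (pi0 * f0)  % reject H0 if this exceeds the critical constant k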
Bayesian Testing continued
Bayesian test vs. Neyman-Pearson likelihood ratio test (15.18)
Neyman-Pearson Lemma: reject $H_0$ if
$\frac{L(\theta_1 \mid x_1, x_2, \ldots, x_n)}{L(\theta_0 \mid x_1, x_2, \ldots, x_n)} = \frac{f(x_1, x_2, \ldots, x_n \mid \theta_1)}{f(x_1, x_2, \ldots, x_n \mid \theta_0)} > k$
Bayesian test: reject $H_0$ if
$\frac{\pi_1^*}{\pi_0^*} = \frac{b}{a} = \frac{\pi_1\, f(x_1, x_2, \ldots, x_n \mid \theta_1)}{\pi_0\, f(x_1, x_2, \ldots, x_n \mid \theta_0)} > k^*$, i.e. $\frac{f(x_1, x_2, \ldots, x_n \mid \theta_1)}{f(x_1, x_2, \ldots, x_n \mid \theta_0)} > \left(\frac{\pi_0}{\pi_1}\right) k^*$
A Bayesian test can be considered a specialized Neyman-Pearson likelihood ratio test where the probabilities of each hypothesis (H0 & H1) being true are known: π0 & π1.
If $\pi_0 = \pi_1 = 1/2$, then $\frac{\pi_1^*}{\pi_0^*} = \frac{f(x_1, x_2, \ldots, x_n \mid \theta_1)}{f(x_1, x_2, \ldots, x_n \mid \theta_0)} > k$ and the Bayesian test becomes the Neyman-Pearson likelihood ratio test.
Bayesian Inference for
one parameter
Bingqi Cheng
A biased coin
• Bernoulli random variable
• Prob(Head)= ϴ
• ϴ is unknown
Bayesian Statistics
Three ingredients:
• Prior distribution: an initial guess or prior knowledge about the parameter ϴ; highly subjective.
• Likelihood function: fits or describes the distribution of the real data (e.g. a sequence of heads or tails when tossing the coin).
• Bayes' theorem: updates the prior distribution with the real data, giving Prob(ϴ | data), the posterior distribution.
Prior Distribution
Beta distributions are conjugate priors to Bernoulli distributions: if the prior is Beta and the likelihood function is Bernoulli, then the posterior is Beta.
Figure: density of the Beta prior distribution plotted over x in [0, 1].
Likelihood function
For this biased coin, the likelihood function of ϴ given a sequence of tosses is the Bernoulli (binomial) likelihood. The calculation steps follow the general equation from before: multiply the Beta prior distribution by this likelihood and normalize, which yields the posterior distribution.
Posterior Distribution
Figure: density of the resulting Beta posterior distribution plotted over x in [0, 1]; it is noticeably more concentrated than the prior.
Predictive Probability
Bayesian vs. M.L.E. with the calculus method
Xiao Yu
Back to the example of the biased coin: we still have 20 trials and get 13 heads.
$f(p) = \binom{20}{13}\, p^{13}(1-p)^{7}$
$f'(p) = \binom{20}{13}\, p^{12}(1-p)^{6}(13 - 20p) = 0 \;\Rightarrow\; p = 0.65$
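A quick numeric comparison of the two answers, assuming (purely for illustration) a uniform Beta(1, 1) prior on p:

% 13 heads in 20 flips: the calculus MLE vs the posterior mean under a Beta(1,1) prior.
n = 20; h = 13;
pMLE   = h / n               % 0.65, matching the derivative calculation above
pBayes = (h + 1) / (n + 2)   % posterior mean of Beta(14, 8), about 0.636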
Jeffreys Prior
$p(\theta) \propto \sqrt{\det I(\theta)}$
• M.L.E. with the calculus method: find the parameter quickly and directly, if possible -> one huge step.
• Bayesian: initial guess + approximation + convergence -> a different starting line, many small steps, and maybe not the best value.
Why bother to use Bayesian? With a large amount of data, the Bayesian computation can be easier to handle.
$\log L(p, \mu, C \mid x_1, \ldots, x_n) = \sum_{j=1}^{n} \log\!\left( p_1 \frac{1}{\sqrt{2\pi}\,C_1}\, e^{-(x_j - \mu_1)^2 / 2C_1^2} + p_2 \frac{1}{\sqrt{2\pi}\,C_2}\, e^{-(x_j - \mu_2)^2 / 2C_2^2} \right)$
• This is a Gaussian mixture; in general the observations are vectors and C is the covariance matrix. Finding the maximum likelihood estimate for a mixture by direct application of calculus is tough.
Bayesian Learning
• The more evidence we have, the more we learn: the more flips we do, the more we know about the probability of getting a head, which is the parameter of the binomial distribution.
An application: the EM (Expectation Maximization) algorithm, which can beautifully handle some regression problems (a small sketch follows).
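A minimal sketch of the EM idea for the mixture above, in one dimension with unit variances and simulated data (all values here are made up for illustration):

% Minimal EM sketch for a two-component 1-D Gaussian mixture with unit variances.
x = [randn(1,200) - 2, randn(1,200) + 2];     % hypothetical mixture sample
p = 0.5; mu = [-1 1];                         % initial guesses
for it = 1:50
    % E-step: responsibility of component 1 for each observation
    d1 = p     * exp(-(x - mu(1)).^2 / 2);
    d2 = (1-p) * exp(-(x - mu(2)).^2 / 2);
    r  = d1 ./ (d1 + d2);
    % M-step: re-estimate the mixing weight and the two means
    p  = mean(r);
    mu = [sum(r .* x) / sum(r), sum((1-r) .* x) / sum(1-r)];
end
[p mu]                                        % roughly [0.5 -2 2]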
Two coins Game:
Suppose now that there are two
coins which can be flipped. The
probability of heads is p1 for the
first coin, and p2 for the second
coin. We decide on each flip which
of the two coins will be flipped, and
our objective is to maximize the
number of heads that occur.(p1
and p2 are unknown)
Matlab code for the strategy

function [] = twocoin2(p1,p2,n)
% Sequentially choose which coin to flip, guided by the posterior
% probability that coin 1 has the higher heads probability.
H1=0; T1=0;            % heads/tails counts for coin 1
H2=0; T2=0;            % heads/tails counts for coin 2
syms ps1;
syms ps2;
for k=1:n,
    % p(k) = P(ps1 > ps2 | counts) under independent uniform priors
    temp = int(ps2^H2*(1-ps2)^T2,0,ps1);
    p(k) = double(int(ps1^H1*(1-ps1)^T1*temp,0,1)/(beta(H1+1,T1+1)*beta(H2+1,T2+1)));
    if rand < p(k),
        guess(k) = 1;
        y(k) = rand < p1;
        H1 = H1 + y(k);
        T1 = T1 + (1 - y(k));
    else
        guess(k) = 2;
        y(k) = rand < p2;
        H2 = H2 + y(k);
        T2 = T2 + (1 - y(k));
    end
end
disp('Guesses: ')
tabulate(guess)
disp('Outcomes: ')
tabulate(y)
figure(2)
plot(p)
end

Figure: for p1 = 0.4, p2 = 0.6, the plotted value of P(p1 > p2 | H1, T1, H2, T2) over the n flips.
Statistical Decision Theory
Kicheon Park
ABRAHAM WALD (1902-1950)
• Hungarian mathematician
• Major contributions: geometry, econometrics, statistical sequential analysis, and decision theory
• Died in an airplane accident in 1950
Hans Schneeweiss, "Abraham Wald", Department of Statistics, University of Munich, Akademiestr. 1, 80799 München, Germany
Why is decision theory needed?
Limits of classical statistics:
I. Prior information and loss
II. Initial and final precision
III. Formulational inadequacy
Limits of Classical Statistics
• Prior information and loss: relevant effects from past experience, and the losses from each possible decision
• Initial and final precision: before and after observation of sample information, which is the result of a long series of identical experiments
• Formulational inadequacy: in the majority of problems, there is a limit to how meaningful a decision can be reached
Classical statistics vs. Decision Theory
• Classical statistics: direct use of sample information
• Decision theory: combines the sample information with other relevant aspects of the problem to reach the best decision
→ The goal of decision theory is to make decisions based not only on the available statistical knowledge but also on the uncertainties (θ) that are involved in the decision problem.
Two types of relevant information
I. Knowledge of the possible consequences of the decisions → the loss resulting from each possible decision
II. Prior information → effects from past experience with similar situations
Statistical Decision Theory - Elements
• Sample space X, with observed data x ∈ X
• Unknown parameter θ
• Decision space D; a decision rule δ maps each sample outcome x to a decision δ(x) in D
"Abraham Wald", Wolfowitz, Annals of Mathematical Statistics
"Statistics & Data Analysis", Tamhane & Dunlop, Prentice Hall
"Statistical Decision Theory", Berger, Springer-Verlag
Mun Sang Yue
Statistical Decision Theory - Eqns
• Loss Function: L(d, θ), the loss incurred when decision d is made and the true parameter value is θ (or L(δ(x), θ) for a decision rule δ).
• Risk Function: R(δ, θ) = E_θ[ L(δ(X), θ) ], the expected loss of rule δ when the parameter is θ.
Statistical Decision Theory - Decision Rules
• Decision Rules - Minimax: choose the rule δ that minimizes the maximum risk, min_δ { max_θ R(δ, θ) }.
• Decision Rules - Bayes: choose the rule δ that minimizes the Bayes risk, the risk averaged over the prior, ∫ R(δ, θ) π(θ) dθ.
Statistical Decision Theory - Example
• A retailer must decide whether to purchase a
large lot of items containing an unknown
fraction p of defectives. Before making the
decision of whether to purchase the lot
(decision d1) or not to purchase the lot
(decision d2), 2 items are randomly selected
from the lot for inspection. The retailer wants
to evaluate the two decision rules formulated below.
Prior: π(p) = 2(1 - p)
Example - Continued
No. of Defectives x | Decision Rule δ1 | Decision Rule δ2
0                   | d1               | d1
1                   | d2               | d1
2                   | d2               | d2
• Loss Functions
L(d1, p) = 8p - 1, and L(d2, p) = 2
• Risk Functions (X ~ Bin(2, p) is the number of defectives in the sample)
- R(δ1, p) = L(d1, p) P(δ1 chooses d1 | p) + L(d2, p) P(δ1 chooses d2 | p) = (8p - 1) P(X=0 | p) + 2 P(X=1 or 2 | p)
- R(δ2, p) = L(d1, p) P(δ2 chooses d1 | p) + L(d2, p) P(δ2 chooses d2 | p) = (8p - 1) P(X=0 or 1 | p) + 2 P(X=2 | p)
Example - Continued
Figure: the risk functions R(δ1, p) and R(δ2, p) plotted against p for 0 ≤ p ≤ 1; max R(δ1, p) = 2.289 and max R(δ2, p) = 3.329, so δ1 has the smaller maximum risk.
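A small sketch that reproduces these maxima by evaluating the two risk functions on a grid of p values (the grid size is arbitrary):

% Evaluate R(delta1, p) and R(delta2, p) over p and compare their maxima.
p  = linspace(0, 1, 1001);
R1 = (8*p - 1) .* (1 - p).^2 + 2 * (1 - (1 - p).^2);   % delta1 chooses d1 only if X = 0
R2 = (8*p - 1) .* (1 - p.^2) + 2 * p.^2;               % delta2 chooses d1 if X = 0 or 1
[max(R1) max(R2)]                  % about [2.289 3.329], so delta1 is the minimax choice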
Statistical Decision Theory - Example
• A shipment of transistors was received by a
radio company. A sampling plan was used to
check the shipment as a whole to ensure
contractual requirement of 0.05 defect rate
was not exceeded. A random sample of n
transistors was chosen from the shipment and
tested. Based upon X, the number of
defective transistors in the sample, the
shipment will be accepted or rejected.
Example (continued)
• The proportion of defective transistors in the shipment is θ.
• Decision Rule:
a1: accept the lot if X/n ≤ 0.05
a2: reject the lot if X/n > 0.05
• Loss Function: L(a1, θ) = 10θ; L(a2, θ) = 1
• π(θ) can be estimated based on prior experience
• R(δ, θ) can then be calculated, as in the sketch below
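A minimal sketch of that risk calculation (the sample size n = 20 and the grid of θ values are assumptions; X ~ Bin(n, θ)):

% Risk of the acceptance rule: R(delta, theta) = 10*theta*P(accept) + 1*P(reject),
% accepting when X <= floor(0.05*n), with X ~ Bin(n, theta).
n = 20; k = floor(0.05 * n);
theta = linspace(0, 0.3, 7);
Paccept = zeros(size(theta));
for j = 1:numel(theta)
    for x = 0:k
        Paccept(j) = Paccept(j) + nchoosek(n, x) * theta(j)^x * (1 - theta(j))^(n - x);
    end
end
R = 10 * theta .* Paccept + (1 - Paccept)   % risk at each theta value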
Summary
• Maximum Likelihood Estimation selects an estimate
of the unknown parameter that maximizes the
likelihood function.
• The Likelihood Ratio Test compares the likelihood of the observed outcomes under the null hypothesis to the likelihood under the alternative hypothesis.
• Bayesian methods treat unknown models or variables as random variables with known distributions instead of deterministic quantities that happen to be unknown.
Summary (Continued)
• Statistical Decision Theory moves statistics beyond its traditional role of just drawing inferences from incomplete information. The theory focuses on the problem of statistical actions rather than inference.
"Here in the 21st Century … a combination of Bayesian and frequentist ideas will be needed to deal with our increasingly intense scientific environment."
Bradley Efron, 164th ASA Presidential Address
THANK YOU!
Questions?