Chapter 1 Basic Concepts
Thomas Bayes (1702-1761): two articles from his pen, published posthumously in 1764 by his friend Richard Price.
Laplace (1774): stated the theorem on inverse probability in general form.
Jeffreys (1939): rediscovered Laplace's work.
Example 1:
$y_i$, $i = 1, 2, \ldots, n$: the lifetime of batteries.
Assume $y_i \sim N(\mu, \sigma^2)$. Then,
$$p(y \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right], \qquad y = (y_1, \ldots, y_n)^t.$$
To obtain information about the values of $\mu$ and $\sigma^2$, two methods are available:
(a) Sampling theory (frequentist):
$\mu$ and $\sigma^2$ are the hypothetical true values. We can use
- point estimation: finding some statistics $\hat{\mu}(y)$ and $\hat{\sigma}^2(y)$ to estimate $\mu$ and $\sigma^2$, for example,
$$\hat{\mu}(y) = \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}, \qquad \hat{\sigma}^2(y) = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1};$$
- interval estimation: finding interval estimates $\left(\hat{\mu}_1(y), \hat{\mu}_2(y)\right)$ and $\left(\hat{\sigma}_1^2(y), \hat{\sigma}_2^2(y)\right)$ for $\mu$ and $\sigma^2$, for example, the interval estimate for $\mu$,
$$\left(\bar{y} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{y} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right), \qquad P\left(Z > z_{\alpha/2}\right) = \frac{\alpha}{2}, \quad Z \sim N(0, 1).$$
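The following sketch illustrates these frequentist estimates numerically; the battery lifetimes are simulated (hypothetical) data, and the interval uses the z formula above with $s$ in place of $\sigma$.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    y = rng.normal(loc=100.0, scale=15.0, size=30)   # hypothetical lifetimes, n = 30

    n = len(y)
    y_bar = y.mean()                    # point estimate of mu
    s2 = y.var(ddof=1)                  # point estimate of sigma^2 (divides by n - 1)
    s = np.sqrt(s2)

    alpha = 0.05
    z = stats.norm.ppf(1 - alpha / 2)   # z_{alpha/2} = 1.96 for alpha = 0.05
    lower = y_bar - z * s / np.sqrt(n)
    upper = y_bar + z * s / np.sqrt(n)
    print(y_bar, s2, (lower, upper))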
(b) Bayesian approach:
Introduce a prior density $\pi(\mu, \sigma^2)$ for $\mu$ and $\sigma^2$. Then, after some manipulations, the posterior density (the conditional density given $y$), $f(\mu, \sigma^2 \mid y)$, can be obtained. Based on the posterior density, inferences about $\mu$ and $\sigma^2$ can be drawn.
Example 2:
$X \sim b(10, p)$: the number of wins for some gambler in 10 bets, where $p$ is the probability of winning. Then,
$$f(x \mid p) = \binom{10}{x} p^x (1-p)^{10-x}, \qquad x = 0, 1, 2, \ldots, 10.$$
(a) Sampling theory (frequentist):
To estimate the parameter $p$, we can employ the maximum likelihood principle. That is, we try to find the estimate $\hat{p}$ that maximizes the likelihood function
$$l(p \mid x) = f(x \mid p) = \binom{10}{x} p^x (1-p)^{10-x}.$$
For example, as $x = 10$,
$$l(p \mid x) = l(p \mid 10) = \binom{10}{10} p^{10}(1-p)^{0} = p^{10}.$$
Thus, $\hat{p} = 1$. It is a sensible estimate: since we won all the time, the sensible estimate of the probability of winning should be 1. On the other hand, as $x = 0$,
$$l(p \mid x) = l(p \mid 0) = \binom{10}{0} p^{0}(1-p)^{10} = (1-p)^{10}.$$
Thus, $\hat{p} = 0$. Since we lost all the time, the sensible estimate of the probability of winning should be 0. In general, as $x = n$, $\hat{p} = n/10$, $n = 0, 1, \ldots, 10$, maximizes the likelihood function.
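As a quick numerical sketch (with a hypothetical observed count, purely for illustration), the binomial likelihood can be evaluated on a grid of $p$ values to confirm that it peaks at $x/10$:

    import numpy as np
    from scipy.stats import binom

    x, n = 7, 10                           # e.g., 7 wins in 10 bets (hypothetical)
    p_grid = np.linspace(0, 1, 10001)      # fine grid over [0, 1]
    likelihood = binom.pmf(x, n, p_grid)   # l(p | x) evaluated at each grid point

    p_hat = p_grid[np.argmax(likelihood)]
    print(p_hat)                           # approximately 0.7 = x / n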
(b) Bayesian approach:
$\pi(p)$: prior density for $p$, i.e., prior beliefs, expressed in terms of probabilities, about the various possible values of $p$ being true. Let
$$\pi(p) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} p^{a-1}(1-p)^{b-1} \equiv Beta(a, b).$$
Thus, if we know the gambler is a professional gambler, then we can use the following beta density function,
$$\pi(p) = 2p \equiv Beta(2, 1),$$
to describe the winning probability $p$ of the gambler.
[Figure: plot of the Beta(2,1) prior density over $p \in [0, 1]$.]
Since a professional gambler is likely to win, higher probability is assigned to large values of $p$.
If we know the gambler is a gambler with bad luck, then we can use the following beta density function,
$$\pi(p) = 2(1-p) \equiv Beta(1, 2),$$
to describe the winning probability $p$ of the gambler.
[Figure: plot of the Beta(1,2) prior density over $p \in [0, 1]$.]
Since a gambler with bad luck is likely to lose, higher probability is assigned to small values of $p$.
If we feel the winning probability is more likely to be around 0.5, then we can use the following beta density function,
$$\pi(p) = 6p(1-p) \equiv Beta(2, 2),$$
to describe the winning probability $p$ of the gambler.
[Figure: plot of the Beta(2,2) prior density over $p \in [0, 1]$.]
If we don't have any information about the gambler, then we can use the following beta density function,
$$\pi(p) = 1 \equiv Beta(1, 1),$$
to describe the winning probability $p$ of the gambler.
[Figure: plot of the Beta(1,1) (uniform) prior density over $p \in [0, 1]$.]
The posterior density of $p$ given $x$ is the conditional density of $p$ given $x$:
$$f(p \mid x) = \frac{f(x, p)}{f(x)} = \frac{\text{joint density of } x \text{ and } p}{\text{marginal density of } x} = \frac{f(x \mid p)\,\pi(p)}{f(x)} \propto f(x \mid p)\,\pi(p) = l(p \mid x)\,\pi(p).$$
Thus, the posterior density of $p$ given $x$ is
$$f(p \mid x) \propto \pi(p)\, l(p \mid x) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} p^{a-1}(1-p)^{b-1} \binom{10}{x} p^{x}(1-p)^{10-x} = c(a, b, x)\, p^{x+a-1}(1-p)^{b+10-x-1}.$$
In fact,
$$f(p \mid x) = \frac{\Gamma(a+b+10)}{\Gamma(x+a)\,\Gamma(b+10-x)}\, p^{x+a-1}(1-p)^{b+10-x-1} \equiv Beta(x+a,\ b+10-x).$$
Then, we can use some statistic based on the posterior density, for example, the posterior mean
$$E(p \mid x) = \int_0^1 p\, f(p \mid x)\, dp = \frac{x+a}{a+b+10}.$$
As $x = n$,
$$\hat{p} = E(p \mid n) = \frac{n+a}{a+b+10},$$
which is different from the maximum likelihood estimate $n/10$.
Note:
$$f(p \mid x) \propto \pi(p)\, l(p \mid x),$$
i.e., (the original information about $p$) $\times$ (the information from the data) $=$ (the new information about $p$ given the data).
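As a sketch of this conjugate updating (assuming the four Beta(a, b) priors listed above and a hypothetical observed number of wins $x$), the following computes the posterior and compares the posterior mean with the maximum likelihood estimate:

    from scipy.stats import beta

    def posterior(x, a, b, n=10):
        # Beta(a, b) prior + Binomial(n, p) data with x wins -> Beta(x + a, b + n - x) posterior
        return beta(x + a, b + n - x)

    x = 7                                          # hypothetical number of wins in 10 bets
    for a, b in [(2, 1), (1, 2), (2, 2), (1, 1)]:  # the four priors discussed above
        post = posterior(x, a, b)
        print(a, b, post.mean())                   # posterior mean (x + a) / (a + b + 10)

    print(x / 10)                                  # maximum likelihood estimate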
Properties of Bayesian Analysis:
1. Precise assumptions lead to precise consequent inferences.
2. Bayesian analysis automatically makes use of all the information from
the data.
3. Unacceptable inferences must come from inappropriate assumptions, not from inadequacies of the inferential system.
4. Awkward problems encountered in sampling theory do not arise.
5. Bayesian inference provides a satisfactory way of explicitly
introducing and keeping track of assumptions about prior knowledge or
ignorance.
1.1 Introduction
Goal: statistical decision theory is concerned with making decisions in the presence of statistical knowledge which sheds light on some of the uncertainties involved in the decision problem.
3 Types of Information:
1. Sample information: the information from observations.
2. Decision information: the information about the possible consequences of the decisions, for example, the loss due to a wrong decision.
3. Prior information: the information about the parameter.
1.2 Basic Elements
$\theta$: parameter.
$\Theta$: parameter space, consisting of all possible values of $\theta$.
$a$: decision or action (or some statistic used to estimate $\theta$).
$\mathcal{A}$: the set of all possible actions.
$L(\theta, a): \Theta \times \mathcal{A} \to R$: loss function.
$L(\theta_1, a_1)$: the loss when the parameter value is $\theta_1$ and the action $a_1$ is taken.
$X = (X_1, X_2, \ldots, X_n)$: $X_1, \ldots, X_n$ are independent observations from a common distribution.
$\mathcal{X}$: sample space (all the possible values of $X$); usually $\mathcal{X}$ will be a subset of $R^n$.
$$P_\theta(X \in A) = \int_A dF^X(x \mid \theta) = \begin{cases} \int_A f(x_1, \ldots, x_n \mid \theta)\, dx_1 \cdots dx_n & \text{(continuous case)} \\ \sum_{x \in A} f(x_1, \ldots, x_n \mid \theta) & \text{(discrete case)} \end{cases}$$
where $F^X(x \mid \theta)$ is the cumulative distribution of $X$.
$$E_\theta[h(X)] = \int h(x)\, dF^X(x \mid \theta) = \begin{cases} \int_{\mathcal{X}} h(x)\, f(x \mid \theta)\, dx \\ \sum_{x} h(x)\, f(x \mid \theta) \end{cases}$$
Example 2 (continue):
Let $A = \{1, 3, 5, 7, 9\}$. Then,
$$P_p(X \in A) = \sum_{x \in A} \binom{10}{x} p^x (1-p)^{10-x} = \sum_{x \in \{1, 3, 5, 7, 9\}} \binom{10}{x} p^x (1-p)^{10-x}$$
$$= \binom{10}{1} p(1-p)^9 + \binom{10}{3} p^3(1-p)^7 + \binom{10}{5} p^5(1-p)^5 + \binom{10}{7} p^7(1-p)^3 + \binom{10}{9} p^9(1-p).$$
Let
$$a_1 = \text{the estimate of } p = \frac{X}{10}.$$
Also, let
$$h(X) = L\left(p, \frac{X}{10}\right) = \frac{X}{10} - p \quad \text{(loss function)}.$$
Then,
$$E_p[h(X)] = E_p\left[L\left(p, \frac{X}{10}\right)\right] = E_p\left(\frac{X}{10} - p\right) = \frac{10p}{10} - p = 0.$$
Example 3:
Let $X \sim Beta(\theta, 1)$ and $h(x) = x^2$. Then,
$$E_\theta[h(X)] = \int h(x)\, f(x \mid \theta)\, dx = \int_0^1 x^2\, \theta x^{\theta-1}\, dx = \left.\frac{\theta\, x^{\theta+2}}{\theta+2}\right|_0^1 = \frac{\theta}{\theta+2}.$$
Example 4:
$a_1$: sell the stock.
$a_2$: keep the stock.
$\theta_1$: the stock price will go down.
$\theta_2$: the stock price will go up.
Let
$$L(\theta_1, a_1) = -500, \quad L(\theta_1, a_2) = 300, \quad L(\theta_2, a_1) = 1000, \quad L(\theta_2, a_2) = -300.$$
The above loss function can be summarized by

             theta_1    theta_2
    a_1        -500       1000
    a_2         300       -300

Note that there is no sample information from an associated statistical experiment in this example. We call such a problem a no-data problem.
1.3 Expected Loss, Decision Rules, and Risk
Motivation:
In the previous section, we introduced the loss of making a decision
(taking an action). In this section, we consider the “expected” loss of
making a decision. Two types of expected loss are considered:
- Bayesian expected loss
- Frequentist risk
(a) Bayesian Expected Loss:
Definition:
The Bayesian expected loss of an action $a$ is
$$\rho(\pi, a) = E^\pi[L(\theta, a)] = \int_\Theta L(\theta, a)\, dF^\pi(\theta) = \int_\Theta L(\theta, a)\, \pi(\theta)\, d\theta,$$
where $\pi(\theta)$ and $F^\pi(\theta)$ are the prior density and cumulative distribution of $\theta$, respectively.
Example 4 (continue):
Let $\pi(\theta_1) = 0.99$, $\pi(\theta_2) = 0.01$. Then,
$$\rho(\pi, a_1) = E^\pi[L(\theta, a_1)] = \pi(\theta_1) L(\theta_1, a_1) + \pi(\theta_2) L(\theta_2, a_1) = 0.99 \times (-500) + 0.01 \times 1000 = -485$$
and
$$\rho(\pi, a_2) = E^\pi[L(\theta, a_2)] = \pi(\theta_1) L(\theta_1, a_2) + \pi(\theta_2) L(\theta_2, a_2) = 0.99 \times 300 + 0.01 \times (-300) = 294.$$
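A minimal sketch of this computation, encoding the loss table of Example 4 as a dictionary (the string labels are only illustrative names):

    # Loss table from Example 4: loss[(theta, action)]
    loss = {("theta1", "a1"): -500, ("theta1", "a2"): 300,
            ("theta2", "a1"): 1000, ("theta2", "a2"): -300}
    prior = {"theta1": 0.99, "theta2": 0.01}

    def bayesian_expected_loss(action):
        # rho(pi, a) = sum over theta of pi(theta) * L(theta, a)
        return sum(prior[t] * loss[(t, action)] for t in prior)

    print(bayesian_expected_loss("a1"))   # -485.0
    print(bayesian_expected_loss("a2"))   #  294.0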
(b) Frequentist Risk:
Definition:
A (nonrandomized) decision rule $\delta(X)$ is a function from $\mathcal{X}$ into $\mathcal{A}$. If $X = x_0$ is observed, then $\delta(x_0)$ is the action that will be taken. Two decision rules, $\delta_1$ and $\delta_2$, are considered equivalent if
$$P_\theta\left(\delta_1(X) = \delta_2(X)\right) = 1, \quad \text{for every } \theta.$$
Definition:
The risk function of a decision rule $\delta(X)$ is defined by
$$R(\theta, \delta) = E_\theta[L(\theta, \delta(X))] = \int L(\theta, \delta(x))\, dF^X(x \mid \theta) = \int_{\mathcal{X}} L(\theta, \delta(x))\, f(x \mid \theta)\, dx.$$
Definition:
If
$$R(\theta, \delta_1) \le R(\theta, \delta_2), \quad \text{for all } \theta \in \Theta,$$
with strict inequality for some $\theta$, then the decision rule $\delta_1$ is R-better than the decision rule $\delta_2$. A decision rule is admissible if there exists no R-better decision rule. On the other hand, a decision rule is inadmissible if there does exist an R-better decision rule.
Note:
A rule $\delta_1$ is R-equivalent to $\delta_2$ if $R(\theta, \delta_1) = R(\theta, \delta_2)$ for all $\theta \in \Theta$.
Example 4 (continue):
$$R(\theta_1, a_1) = L(\theta_1, a_1) = -500 < 300 = L(\theta_1, a_2) = R(\theta_1, a_2)$$
and
$$R(\theta_2, a_1) = L(\theta_2, a_1) = 1000 > -300 = L(\theta_2, a_2) = R(\theta_2, a_2).$$
Therefore, both $a_1$ and $a_2$ are admissible.
Example 5:
Let
$$X \sim N(\theta, 1), \quad L(\theta, a) = (\theta - a)^2, \quad \delta_1(X) = X, \quad \delta_2(X) = \frac{X}{2}.$$
Note that $E_\theta(X) = \theta$ and $Var_\theta(X) = 1$. Then,
$$R(\theta, \delta_1) = E_\theta[L(\theta, \delta_1(X))] = E_\theta(\theta - X)^2 = Var_\theta(X) = 1$$
and
$$R(\theta, \delta_2) = E_\theta[L(\theta, \delta_2(X))] = E_\theta\left(\theta - \frac{X}{2}\right)^2 = E_\theta\left(\frac{X - \theta}{2} - \frac{\theta}{2}\right)^2$$
$$= E_\theta\left(\frac{X - \theta}{2}\right)^2 - 2\,\frac{\theta}{2}\, E_\theta\left(\frac{X - \theta}{2}\right) + \frac{\theta^2}{4} = \frac{Var_\theta(X)}{4} + 0 + \frac{\theta^2}{4} = \frac{1}{4} + \frac{\theta^2}{4}.$$
Definition:
The Bayes risk of a decision rule $\delta$ with respect to a prior distribution $\pi$ on $\Theta$ is defined as
$$r(\pi, \delta) = E^\pi[R(\theta, \delta)] = \int_\Theta R(\theta, \delta)\, \pi(\theta)\, d\theta.$$
Example 5 (continue):
Let
$$\theta \sim N(0, \tau^2), \qquad \pi(\theta) = \frac{1}{\sqrt{2\pi}\,\tau}\, e^{-\frac{\theta^2}{2\tau^2}}.$$
Then,
$$r(\pi, \delta_1) = E^\pi[R(\theta, \delta_1)] = E^\pi[1] = 1$$
and
$$r(\pi, \delta_2) = E^\pi[R(\theta, \delta_2)] = E^\pi\left[\frac{1}{4} + \frac{\theta^2}{4}\right] = \frac{1}{4} + \frac{E^\pi(\theta^2)}{4} = \frac{1}{4} + \frac{Var(\theta)}{4} = \frac{1}{4} + \frac{\tau^2}{4}.$$
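A small Monte Carlo sketch (the values $\theta = 2$ and $\tau = 1$ are hypothetical, chosen only for illustration) can be used to check the risk and Bayes risk formulas above numerically:

    import numpy as np

    rng = np.random.default_rng(1)
    N = 1_000_000

    # Risk R(theta, delta_2) = E_theta (theta - X/2)^2 with X ~ N(theta, 1)
    theta = 2.0
    X = rng.normal(theta, 1.0, size=N)
    print(np.mean((theta - X / 2) ** 2), 1 / 4 + theta**2 / 4)   # both ~1.25

    # Bayes risk r(pi, delta_2) with theta ~ N(0, tau^2)
    tau = 1.0
    thetas = rng.normal(0.0, tau, size=N)
    Xs = rng.normal(thetas, 1.0)
    print(np.mean((thetas - Xs / 2) ** 2), 1 / 4 + tau**2 / 4)   # both ~0.5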
1.4 Decision Principles
The principles used to select a sensible decision are:
(a) Conditional Bayes Decision Principle
(b) Frequentist Decision Principle.
(a) Conditional Bayes Decision Principle:
Choose an action $a \in \mathcal{A}$ minimizing $\rho(\pi, a)$. Such an $a$ will be called a Bayes action and will be denoted $a^\pi$.
Example 4 (continue):
Let $\mathcal{A} = \{a_1, a_2\}$, $\pi(\theta_1) = 0.99$, $\pi(\theta_2) = 0.01$. Thus,
$$\rho(\pi, a_1) = -485, \qquad \rho(\pi, a_2) = 294.$$
Therefore, $a^\pi = a_1$.
(b) Frequentist Decision Principle:
The 3 most important frequentist decision principles are:
- Bayes risk principle
- Minimax principle
- Invariance principle
(1) Bayes Risk Principle:
Let $D$ be the class of decision rules. Then, for $\delta_1, \delta_2 \in D$, a decision rule $\delta_1$ is preferred to a rule $\delta_2$ based on the Bayes risk principle if
$$r(\pi, \delta_1) < r(\pi, \delta_2).$$
A decision rule minimizing $r(\pi, \delta)$ among all decision rules in class $D$ is called a Bayes rule and will be denoted $\delta^\pi$. The quantity
$$r(\pi) = r(\pi, \delta^\pi)$$
is called the Bayes risk for $\pi$.
Example 5 (continue):
$$X \sim N(\theta, 1), \qquad \theta \sim N(0, \tau^2), \qquad D = \{\delta_c(x) = cx : c \text{ is any constant}\}.$$
Let $\delta_c(X) = cX$. Then,
$$R(\theta, \delta_c) = E_\theta(\theta - cX)^2 = E_\theta\left[c(X - \theta) + (c - 1)\theta\right]^2$$
$$= c^2 E_\theta(X - \theta)^2 + 2c(c-1)\theta\, E_\theta(X - \theta) + (c-1)^2\theta^2 = c^2\, Var_\theta(X) + (c-1)^2\theta^2 = c^2 + (c-1)^2\theta^2,$$
and
$$r(\pi, \delta_c) = E^\pi[R(\theta, \delta_c)] = E^\pi\left[c^2 + (c-1)^2\theta^2\right] = c^2 + (c-1)^2 E^\pi(\theta^2) = c^2 + (c-1)^2\tau^2.$$
Note that $r(\pi, \delta_c)$ is a function of $c$, and $r(\pi, \delta_c)$ attains its minimum at
$$c = \frac{\tau^2}{1+\tau^2}$$
$$\left(f(c) = r(\pi, \delta_c) = c^2 + (c-1)^2\tau^2, \quad f'(c) = 2c + 2(c-1)\tau^2 = 0 \;\Longrightarrow\; c = \frac{\tau^2}{1+\tau^2}\right).$$
Thus,
$$\delta^\pi(X) = \frac{\tau^2}{1+\tau^2}\, X$$
is the Bayes estimator. In addition,
$$r(\pi) = r(\pi, \delta^\pi) = \left(\frac{\tau^2}{1+\tau^2}\right)^2 + \left(\frac{\tau^2}{1+\tau^2} - 1\right)^2 \tau^2 = \frac{\tau^4}{(1+\tau^2)^2} + \frac{\tau^2}{(1+\tau^2)^2} = \frac{\tau^2(\tau^2 + 1)}{(1+\tau^2)^2} = \frac{\tau^2}{1+\tau^2}.$$
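As a numerical sketch (with $\tau = 2$ assumed purely for illustration), the Bayes risk $r(\pi, \delta_c) = c^2 + (c-1)^2\tau^2$ can be evaluated over a grid of $c$ values to confirm that the minimizer and the minimum agree with $\tau^2/(1+\tau^2)$:

    import numpy as np

    tau = 2.0
    c_grid = np.linspace(0, 1.5, 1501)
    bayes_risk = c_grid**2 + (c_grid - 1) ** 2 * tau**2    # r(pi, delta_c)

    c_star = c_grid[np.argmin(bayes_risk)]
    print(c_star, tau**2 / (1 + tau**2))                   # both ~0.8
    print(bayes_risk.min(), tau**2 / (1 + tau**2))         # minimum Bayes risk, both ~0.8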
Example 4 (continue):
Let $D = \{a_1, a_2\}$, $\pi(\theta_1) = 0.99$, $\pi(\theta_2) = 0.01$. Then,
$$r(\pi, a_1) = E^\pi[R(\theta, a_1)] = E^\pi[L(\theta, a_1)] = -485$$
and
$$r(\pi, a_2) = E^\pi[R(\theta, a_2)] = E^\pi[L(\theta, a_2)] = 294.$$
Thus, $a_1$ is the Bayes rule.
Note:
In a no-data problem, the Bayes risk (frequentist principle) is equivalent to the Bayesian expected loss (conditional Bayes principle). Hence, the Bayes risk principle gives the same answer as the conditional Bayes decision principle.
Definition:
Let $X = (X_1, X_2, \ldots, X_n)$ have the probability distribution function (or probability density function) $f(x \mid \theta) = f(x_1, x_2, \ldots, x_n \mid \theta)$, with prior density and prior cumulative distribution $\pi(\theta)$ and $F^\pi(\theta)$, respectively. Then, the marginal density or distribution of $X = (X_1, X_2, \ldots, X_n)$ is
$$m(x) = m(x_1, x_2, \ldots, x_n) = \begin{cases} \int_\Theta f(x \mid \theta)\, \pi(\theta)\, d\theta = \int_\Theta f(x \mid \theta)\, dF^\pi(\theta) \\ \sum_\theta f(x \mid \theta)\, \pi(\theta) \end{cases}$$
The posterior density or distribution of $\theta$ given $x$ is
$$f(\theta \mid x) = f(\theta \mid x_1, x_2, \ldots, x_n) = \frac{\pi(\theta)\, f(x \mid \theta)}{m(x)}.$$
The posterior expectation of $g(\theta)$ given $x$ is
$$E^{f(\theta \mid x)}[g(\theta)] = \begin{cases} \int_\Theta g(\theta)\, f(\theta \mid x)\, d\theta = \dfrac{\int_\Theta g(\theta)\, \pi(\theta)\, f(x \mid \theta)\, d\theta}{m(x)} \\[2mm] \sum_\theta g(\theta)\, f(\theta \mid x) = \dfrac{\sum_\theta g(\theta)\, \pi(\theta)\, f(x \mid \theta)}{m(x)} \end{cases}$$
Very Important Result:
Let $X = (X_1, X_2, \ldots, X_n)$ have the probability distribution function (or probability density function) $f(x \mid \theta) = f(x_1, x_2, \ldots, x_n \mid \theta)$, with prior density and prior cumulative distribution $\pi(\theta)$ and $F^\pi(\theta)$, respectively. Suppose the following two assumptions hold:
(a) There exists an estimator $\delta_0$ with finite Bayes risk.
(b) For almost all $x$, there exists a value $\delta^\pi(x)$ minimizing
$$\rho(\pi, \delta(x)) = E^{f(\theta \mid x)}[L(\theta, \delta(x))] = \begin{cases} \int_\Theta L(\theta, \delta(x))\, f(\theta \mid x)\, d\theta \\ \sum_\theta L(\theta, \delta(x))\, f(\theta \mid x) \end{cases}$$
Then,
(a) if $L(\theta, a) = (a - g(\theta))^2$, then
$$\delta^\pi(x) = E^{f(\theta \mid x)}[g(\theta)] = \begin{cases} \int_\Theta g(\theta)\, f(\theta \mid x)\, d\theta \\ \sum_\theta g(\theta)\, f(\theta \mid x) \end{cases}$$
and, more generally, if $L(\theta, a) = w(\theta)(a - g(\theta))^2$, then
$$\delta^\pi(x) = \frac{E^{f(\theta \mid x)}[w(\theta)\, g(\theta)]}{E^{f(\theta \mid x)}[w(\theta)]} = \begin{cases} \dfrac{\int_\Theta w(\theta)\, g(\theta)\, f(\theta \mid x)\, d\theta}{\int_\Theta w(\theta)\, f(\theta \mid x)\, d\theta} \\[2mm] \dfrac{\sum_\theta w(\theta)\, g(\theta)\, f(\theta \mid x)}{\sum_\theta w(\theta)\, f(\theta \mid x)} \end{cases}$$
(b) if $L(\theta, a) = |a - \theta|$, then $\delta^\pi(x)$ is the median of the posterior density or distribution $f(\theta \mid x)$ of $\theta$ given $x$. Further, if
$$L(\theta, a) = \begin{cases} k_0(\theta - a), & \theta - a \ge 0 \\ k_1(a - \theta), & \theta - a < 0 \end{cases}$$
then $\delta^\pi(x)$ is the $\dfrac{k_0}{k_0 + k_1}$ percentile of the posterior density or distribution $f(\theta \mid x)$ of $\theta$ given $x$.
(c) if
$$L(\theta, a) = \begin{cases} 0 & \text{when } |a - \theta| \le c \\ 1 & \text{when } |a - \theta| > c \end{cases}$$
then $\delta^\pi(x)$ is the midpoint of the interval $I$ of length $2c$ which maximizes
$$P(\theta \in I \mid x) = \begin{cases} \int_I f(\theta \mid x)\, d\theta \\ \sum_{\theta \in I} f(\theta \mid x) \end{cases}$$
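To illustrate this result, a short sketch (reusing the beta-binomial posterior from Example 2 with a Beta(2, 2) prior and a hypothetical $x = 7$ wins; the loss weights $k_0 = 1$, $k_1 = 3$ are likewise only illustrative) computes the Bayes estimates under squared-error, absolute-error, and the asymmetric linear loss:

    from scipy.stats import beta

    # Posterior from Example 2: Beta(2, 2) prior, x = 7 wins out of n = 10 -> Beta(9, 5)
    a, b, n, x = 2, 2, 10, 7
    post = beta(x + a, b + n - x)

    k0, k1 = 1, 3                     # asymmetric linear loss weights (illustrative)

    print(post.mean())                # Bayes estimate under squared-error loss
    print(post.median())              # Bayes estimate under absolute-error loss
    print(post.ppf(k0 / (k0 + k1)))   # k0/(k0+k1) quantile for the asymmetric loss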
[Outline of proof]
(a)
$$\rho(\pi, a) = E^{f(\theta \mid x)}[L(\theta, a)] = E^{f(\theta \mid x)}\left[w(\theta)(a - g(\theta))^2\right] = E^{f(\theta \mid x)}\left[w(\theta)\left(a^2 - 2a\, g(\theta) + g^2(\theta)\right)\right]$$
$$= a^2\, E^{f(\theta \mid x)}[w(\theta)] - 2a\, E^{f(\theta \mid x)}[g(\theta)\, w(\theta)] + E^{f(\theta \mid x)}[g^2(\theta)\, w(\theta)].$$
Thus,
$$\frac{\partial \rho(\pi, a)}{\partial a} = 2a\, E^{f(\theta \mid x)}[w(\theta)] - 2\, E^{f(\theta \mid x)}[g(\theta)\, w(\theta)] = 0 \;\Longrightarrow\; a = \frac{E^{f(\theta \mid x)}[g(\theta)\, w(\theta)]}{E^{f(\theta \mid x)}[w(\theta)]}.$$
(b)
Without loss of generality, assume $m$ is the median of $f(\theta \mid x)$. We want to prove that
$$\rho(\pi, m) - \rho(\pi, a) = E^{f(\theta \mid x)}[L(\theta, m) - L(\theta, a)] \le 0, \quad \text{for } a > m.$$
Since, for $a > m$,
$$L(\theta, m) - L(\theta, a) = |m - \theta| - |a - \theta| = \begin{cases} m - a, & \theta \le m \\ 2\theta - (m + a) \le a - m, & m < \theta \le a \\ a - m, & \theta > a \end{cases}$$
then
$$E^{f(\theta \mid x)}[L(\theta, m) - L(\theta, a)] \le (m - a)\, P(\theta \le m \mid x) + (a - m)\, P(m < \theta \le a \mid x) + (a - m)\, P(\theta > a \mid x)$$
$$= (m - a)\, P(\theta \le m \mid x) + (a - m)\, P(\theta > m \mid x) \le \frac{m - a}{2} + \frac{a - m}{2} = 0,$$
where the last inequality holds because $m$ is the posterior median, so $P(\theta \le m \mid x) \ge \tfrac{1}{2}$ and $P(\theta > m \mid x) \le \tfrac{1}{2}$, while $m - a < 0 < a - m$. The case $a < m$ is similar.
[Intuition of the above proof:]
Consider three points $a_1 < a_2 < a_3$ on the real line, together with candidate points $c_1 < a_1$, $c_2 \in (a_1, a_3)$ with $c_2 \ne a_2$, and $c_3 > a_3$. We want to find a point $c$ such that $\sum_{i=1}^{3} |c - a_i|$ achieves its minimum. As $c = a_2$,
$$\sum_{i=1}^{3} |c - a_i| = |a_2 - a_1| + |a_2 - a_2| + |a_2 - a_3| = a_3 - a_1.$$
As $c = c_1$,
$$\sum_{i=1}^{3} |c_1 - a_i| \ge |c_1 - a_3| > a_3 - a_1.$$
As $c = c_2$,
$$\sum_{i=1}^{3} |c_2 - a_i| = |c_2 - a_1| + |c_2 - a_2| + |c_2 - a_3| = (a_3 - a_1) + |c_2 - a_2| \ge a_3 - a_1.$$
As $c = c_3$,
$$\sum_{i=1}^{3} |c_3 - a_i| \ge |c_3 - a_1| > a_3 - a_1.$$
Therefore, as $c = a_2$ (the median of the three points), $\sum_{i=1}^{3} |c - a_i|$ achieves its minimum.
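A tiny numerical sketch (with three arbitrary illustrative points) confirms that the sum of absolute deviations is minimized at the median:

    import numpy as np

    a = np.array([1.0, 4.0, 9.0])                      # three arbitrary points a1 < a2 < a3
    c_grid = np.linspace(0, 10, 1001)
    total = np.abs(c_grid[:, None] - a).sum(axis=1)    # sum_i |c - a_i| for each grid point c

    print(c_grid[np.argmin(total)], np.median(a))      # both ~4.0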
(2) Minimax Principle:
A decision rule $\delta_1$ is preferred to a rule $\delta_2$ based on the minimax principle if
$$\sup_\theta R(\theta, \delta_1) < \sup_\theta R(\theta, \delta_2).$$
A decision rule $\delta^M$ minimizing $\sup_\theta R(\theta, \delta)$ among all decision rules in class $D$ is called a minimax decision rule, i.e.,
$$\sup_\theta R(\theta, \delta^M) = \inf_{\delta \in D}\, \sup_\theta R(\theta, \delta).$$
Example 5 (continue):
$$D = \{\delta_c(x) = cx : c \text{ is any constant}\} \quad \text{and} \quad R(\theta, \delta_c) = c^2 + (c-1)^2\theta^2.$$
Thus,
$$\sup_\theta R(\theta, \delta_c) = \sup_\theta \left[c^2 + (1-c)^2\theta^2\right] = \begin{cases} 1 & \text{if } c = 1 \\ \infty & \text{if } c \ne 1 \end{cases}$$
Therefore, $\delta^M = \delta_1(X) = X$ is the minimax decision rule.
Example 4 (continue):
$D = \{a_1, a_2\}$. Then,
$$\sup_\theta R(\theta, a_1) = \sup_\theta L(\theta, a_1) = 1000$$
and
$$\sup_\theta R(\theta, a_2) = \sup_\theta L(\theta, a_2) = 300.$$
Thus, $\delta^M = a_2$.
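A minimal sketch of this minimax comparison, reusing the hypothetical loss table of Example 4 (in a no-data problem the risk equals the loss):

    # Loss table from Example 4: rows are actions, columns are states (theta1, theta2)
    loss = {"a1": [-500, 1000], "a2": [300, -300]}

    # sup over theta of R(theta, a) is the largest loss in each row;
    # the minimax action minimizes that worst case
    worst_case = {a: max(L) for a, L in loss.items()}
    print(worst_case)                            # {'a1': 1000, 'a2': 300}
    print(min(worst_case, key=worst_case.get))   # 'a2' is the minimax action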
(3) Invariance Principle:
If two problems have identical formal structures (i.e., the same sample space, parameter space, density, and loss function), then the same decision rule should be obtained, based on the invariance principle.
Example 6:
$X$: the decay time of a certain atomic particle (in seconds).
Let $X$ be exponentially distributed with mean $\theta$,
$$f(x \mid \theta) = \frac{1}{\theta}\, e^{-x/\theta}, \quad 0 < x < \infty.$$
Suppose we want to estimate the mean $\theta$. Thus, a sensible loss function is
$$L(\theta, a) = \frac{(\theta - a)^2}{\theta^2} = \left(1 - \frac{a}{\theta}\right)^2.$$
Suppose
$Y$: the decay time of a certain atomic particle (in minutes).
Then,
$$Y = \frac{X}{60}, \qquad f(y \mid \eta) = \frac{1}{\eta}\, e^{-y/\eta}, \quad 0 < y < \infty, \qquad \eta = \frac{\theta}{60}.$$
Thus,
$$L(\eta, a^*) = \left(1 - \frac{a^*}{\eta}\right)^2 = \left(1 - \frac{a/60}{\theta/60}\right)^2 = \left(1 - \frac{a}{\theta}\right)^2 = L(\theta, a),$$
where $a^* = \dfrac{a}{60}$.
Let
$\delta(X)$: the decision rule used to estimate $\theta$,
and
$\delta^*(Y)$: the decision rule used to estimate $\eta$.
Since the two problems have identical formal structures, the invariance principle requires $\delta^* = \delta$; and since the action estimating $\eta = \theta/60$ is $a^* = a/60$,
$$\delta^*(Y) = \frac{\delta(X)}{60} \;\Longrightarrow\; \delta\!\left(\frac{X}{60}\right) = \frac{\delta(X)}{60}.$$
The above argument holds for any transformation of the form $Y = cX$, $c > 0$, based on the invariance principle. Then,
$$\delta(X) = \frac{1}{c}\,\delta^*(Y) = \frac{1}{c}\,\delta(cX),$$
and taking $X = 1$ gives $\delta(c) = c\,\delta(1)$ for every $c > 0$, i.e.,
$$\delta(X) = X\,\delta(1) = kX, \qquad k = \delta(1).$$
Thus, $\delta(X) = kX$ is the decision rule based on the invariance principle.
1.5 Foundations
There are several fundamental principles discussed in this section. They are:
(a) Misuse of Classical Inference Procedure
(b) Frequentist Perspective
(c) Conditional Perspective
(d) Likelihood Principle
(e) Choosing Decision Principle
(a) Misuse of Classical Inference Procedure:
Example 7:
Let
$$X_1, X_2, \ldots, X_n \sim N(\theta, 1).$$
For the classical inference problem
$$H_0: \theta = 0 \quad \text{v.s.} \quad H_1: \theta \ne 0,$$
the rejection rule is
$$\sqrt{n}\,|\bar{x}| \ge 1.96, \qquad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n},$$
as $\alpha = 0.05$.
Assume the true mean $\theta = 10^{-10}$, $n = 10^{24}$, and $\bar{X} \sim N(10^{-10}, 10^{-24})$. Suppose $\bar{x} = 10^{-11}$; then
$$\sqrt{n}\,|\bar{x}| = 10^{12} \times 10^{-11} = 10 \ge 1.96,$$
and we reject $H_0$.
Intuitively, $\bar{x} = 10^{-11}$ seems to strongly indicate that $H_0$ should be true. However, for a large sample size, even when $\bar{x}$ is very close to 0, the classical inference method still indicates rejection of $H_0$. The above result seems to contradict the intuition.
Note: it might be more sensible to test, for example,
$$H_0: |\theta| \le 10^{-3} \quad \text{v.s.} \quad H_1: |\theta| > 10^{-3}.$$
Example 8:
Let
$$X_1, X_2, \ldots, X_{100} \sim N(\theta, 1).$$
For the classical inference problem
$$H_0: \theta = 0 \quad \text{v.s.} \quad H_1: \theta > 0,$$
the rejection rule is
$$\sqrt{n}\,\bar{x} = 10\bar{x} \ge 1.645, \qquad \bar{x} = \frac{\sum_{i=1}^{100} x_i}{100},$$
as $\alpha = 0.05$.
If $\bar{x} = 0.164$, then $10\bar{x} = 1.64 < 1.645$ and we do not reject $H_0$. However, the p-value is
$$p\text{-value} = P(\bar{X} \ge \bar{x} \mid \theta = 0) = P(Z \ge 1.64) = 0.0505,$$
so at a significance level only slightly larger than 0.05 (for example, $\alpha = 0.051$) we would reject $H_0$, even though the data are unchanged.
(b) Frequentist Perspective:
Example 9:
Let
$$X_1, X_2, \ldots, X_{100} \sim N(\theta, 1).$$
For the classical inference problem
$$H_0: \theta = 0 \quad \text{v.s.} \quad H_1: \theta = 1,$$
the rejection rule is
$$\sqrt{n}\,\bar{x} = 10\bar{x} \ge 1.645, \qquad \bar{x} = \frac{\sum_{i=1}^{100} x_i}{100},$$
as $\alpha = 0.05$. By employing the above rejection rule, about 5% of all rejections of the null hypothesis will actually be in error when $H_0$ is true. However, suppose the parameter values $\theta = 0$ and $\theta = 1$ occur equally often in repetitive use of the test. Then the chance of $H_0$ being true is 0.5. Therefore, correctly speaking, the 5% error rate is only correct for 50% of the repetitive uses. That is, one cannot make a useful statement about the actual error rate incurred in repetitive use without knowing $R(\theta, \delta)$ for all $\theta$.
(c) Conditional Perspective:
Example 10:
Frequentist viewpoint:
$X_1, X_2$ are independent with identical distribution,
$$P(X_i = \theta - 1) = P(X_i = \theta + 1) = \frac{1}{2}.$$
Then,
$$\delta(X_1, X_2) = \begin{cases} \dfrac{X_1 + X_2}{2} & \text{if } X_1 \ne X_2 \\ X_1 - 1 & \text{if } X_1 = X_2 \end{cases}$$
can be used to estimate $\theta$. In addition,
$$P_\theta(\delta(X_1, X_2) = \theta) = P(X_1 = \theta - 1, X_2 = \theta + 1 \text{ or } X_1 = \theta + 1, X_2 = \theta - 1) + P(X_1 = \theta + 1, X_2 = \theta + 1)$$
$$= 2 \times \frac{1}{2} \times \frac{1}{2} + \frac{1}{2} \times \frac{1}{2} = 0.75.$$
Thus, a frequentist claims a 75% confidence procedure.
Conditional viewpoint:
Given $X_1 \ne X_2$, $\delta(X_1, X_2)$ is 100% certain to estimate $\theta$ correctly, i.e., $P(\delta(X_1, X_2) = \theta \mid X_1 \ne X_2) = 1$.
Given $X_1 = X_2$, $\delta(X_1, X_2)$ is 50% certain to estimate $\theta$ correctly, i.e., $P(\delta(X_1, X_2) = \theta \mid X_1 = X_2) = 0.5$.
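A small simulation sketch (with an arbitrary true $\theta$ chosen only for illustration) reproduces both the unconditional 75% figure and the two conditional figures:

    import numpy as np

    rng = np.random.default_rng(2)
    theta, N = 5.0, 200_000

    X1 = theta + rng.choice([-1, 1], size=N)
    X2 = theta + rng.choice([-1, 1], size=N)
    estimate = np.where(X1 != X2, (X1 + X2) / 2, X1 - 1)
    correct = estimate == theta

    print(correct.mean())              # ~0.75 (unconditional coverage)
    print(correct[X1 != X2].mean())    # ~1.00 (conditional on X1 != X2)
    print(correct[X1 == X2].mean())    # ~0.50 (conditional on X1 == X2)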
Example 11:

           f(x | theta = 0)    f(x | theta = 1)    f(x | theta = 1) / f(x | theta = 0)
    X = 1        0.005               0.0051                 1.02
    X = 2        0.005               0.9849               196.98
    X = 3        0.99                0.01                   0.01

$X = 1$: some index (today) indicating the stock (tomorrow) will neither go up nor go down.
$X = 2$: some index (today) indicating the stock (tomorrow) will go up.
$X = 3$: some index (today) indicating the stock (tomorrow) will go down.
$\theta = 0$: the stock (tomorrow) will go down.
$\theta = 1$: the stock (tomorrow) will go up.
Frequentist viewpoint:
To test
$$H_0: \theta = 0 \quad \text{v.s.} \quad H_1: \theta = 1,$$
by the most powerful test with $\alpha = 0.01$, we reject $H_0$ when $X \in \{1, 2\}$, since
$$\alpha = P(X \in \{1, 2\} \mid \theta = 0) = 0.005 + 0.005 = 0.01.$$
Thus, as $X = 1$, we reject $H_0$ and conclude the stock will go up. This conclusion might not be very convincing, since the index does not indicate a rise of the stock.
Conditional viewpoint:
As $X = 1$,
$$\frac{f(1 \mid \theta = 1)}{f(1 \mid \theta = 0)} = 1.02.$$
Thus, $f(1 \mid \theta = 1)$ and $f(1 \mid \theta = 0)$ are very close to each other. Therefore, based on the conditional viewpoint, there is only about a 50% chance that the stock will go up tomorrow.
Example 12:
Suppose there are two laboratories, one in Kaohsiung and the other in Taichung. We flip a coin to decide the laboratory at which we will perform an experiment:
Head: Kaohsiung; Tail: Taichung.
Assume the coin comes up tails. Then, the laboratory in Taichung is used.
Question: do we need to perform another experiment in Kaohsiung in order to develop the report?
Frequentist viewpoint: we have to call for averaging over all possible data, including data that could have been obtained in Kaohsiung.
Conditional viewpoint: we don't need to perform another experiment in Kaohsiung. We can make statistical inference based on the data we have now.
The Weak Conditionality Principle:
Two experiments $E_1$ or $E_2$ can be performed to draw information about $\theta$. Then, the actual information about $\theta$ should depend only on the experiment $E_j$ ($j = 1$ or $2$) that is actually performed.
(d) Likelihood Principle:
Definition:
For observed data $x$, the function $l(\theta) = f(x \mid \theta)$, considered as a function of $\theta$, is called the likelihood function.
Likelihood Principle:
All relevant experimental information is contained in the likelihood function for the observed $x$. Two likelihood functions contain the same information about $\theta$ if they are proportional to each other.
Example 13:
$\theta$: the probability that a coin comes up heads.
Suppose we want to know whether the coin is fair, i.e.,
$$H_0: \theta = \frac{1}{2} \quad \text{v.s.} \quad H_1: \theta > \frac{1}{2},$$
with $\alpha = 0.05$. Then, we flip the coin in a series of trials and observe 9 heads and 3 tails. Let
$X$: the number of heads.
Two likelihood functions can be used. They are:
1. Binomial:
$$X \sim B(12, \theta), \qquad l_1(\theta) = f_1(x \mid \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}.$$
In this example,
$$l_1(\theta) = f_1(9 \mid \theta) = \binom{12}{9}\theta^9(1-\theta)^3 = 220\,\theta^9(1-\theta)^3.$$
2. Negative Binomial:
$$X \sim NB(n, \theta), \qquad l_2(\theta) = f_2(x \mid \theta) = \binom{n + x - 1}{x}\theta^x(1-\theta)^n.$$
In this example, we toss the coin until 3 tails come up. Therefore, $n = 3$, $x = 9$, and
$$l_2(\theta) = f_2(9 \mid \theta) = \binom{3 + 9 - 1}{9}\theta^9(1-\theta)^3 = 55\,\theta^9(1-\theta)^3.$$
By the likelihood principle, $l_1(\theta)$ and $l_2(\theta)$ contain the same information. Thus, intuitively, the same conclusion should be reached from the two likelihood functions. However, classical statistical inference results in bizarre conclusions from the frequentist point of view:
1. Binomial:
The rejection rule is $X \ge c$, where $c$ is some constant. Thus, in this example,
$$p\text{-value} = P\left(X \ge 9 \,\middle|\, \theta = \frac{1}{2}\right) = f_1\left(9 \,\middle|\, \frac{1}{2}\right) + f_1\left(10 \,\middle|\, \frac{1}{2}\right) + f_1\left(11 \,\middle|\, \frac{1}{2}\right) + f_1\left(12 \,\middle|\, \frac{1}{2}\right) = 0.075 > 0.05.$$
Thus, we do not reject $H_0$ and conclude the coin is fair.
2. Negative Binomial:
The rejection rule is $X \ge c$, where $c$ is some constant. Thus, in this example,
$$p\text{-value} = P\left(X \ge 9 \,\middle|\, \theta = \frac{1}{2}\right) = f_2\left(9 \,\middle|\, \frac{1}{2}\right) + f_2\left(10 \,\middle|\, \frac{1}{2}\right) + \cdots = 0.0325 < 0.05.$$
Thus, we reject $H_0$ and conclude the coin is not fair.
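As a sketch, the two p-values can be checked directly; scipy's nbinom counts the number of heads before the 3rd tail, matching $f_2$ above, and the computed values come out close to the figures quoted above:

    from scipy.stats import binom, nbinom

    # Binomial model: X = number of heads in 12 tosses
    p1 = binom.sf(8, 12, 0.5)     # P(X >= 9), roughly 0.073 > 0.05

    # Negative binomial model: X = number of heads before the 3rd tail
    p2 = nbinom.sf(8, 3, 0.5)     # P(X >= 9), roughly 0.033 < 0.05

    print(p1, p2)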
1.6 Choosing Decision Principle
The “robust” Bayesian paradigm, which takes into account the uncertainty in the prior, is the fundamentally correct paradigm.