Statistics 512 Notes 27: Bayesian Statistics
Views of Probability
“The probability that this coin will land heads up is 1/2.”

Frequentist (sometimes called objectivist) viewpoint: This statement means that if the experiment were repeated many, many times, the long-run average proportion of heads would tend to 1/2.
Bayesian (sometimes called subjectivist or personal)
viewpoint: This statement means that the person making
the statement has a prior opinion about the coin toss such
that he or she would as soon guess heads or tails if the
rewards are equal.
In the frequentist viewpoint, the probability of an event $A$, $P(A)$, represents the long-run frequency of event $A$ in repeated experiments.

In the Bayesian viewpoint, the probability of an event $A$, $P(A)$, has the following meaning: For a game in which the Bayesian will be paid $1 if $A$ occurs, $P(A)$ is the amount of money the Bayesian would be willing to pay to buy into the game. Thus, if the Bayesian is willing to pay 50 cents to buy in, $P(A) = .5$. Note that this concept of probability is personal: $P(A)$ may vary from person to person depending on their opinions.
In the Bayesian viewpoint, we can make probability
statements about lots of things, not just data which are
subject to random variation. For example, I might say that
“the probability that Franklin D. Roosevelt had a cup of
coffee on February 21, 1935” is .68. This does not refer to
any limiting frequency. It reflects my strength of belief
that the proposition is true.
Rules for Manipulating Subjective Probabilities
All the usual rules for manipulating probabilities apply to
subjective probabilities. For example,
Theorem 11.1: If $C_1$ and $C_2$ are mutually exclusive, then $P(C_1 \cup C_2) = P(C_1) + P(C_2)$.
Proof: Suppose a person thinks a fair price for $C_1$ is $p_1 = P(C_1)$ and that for $C_2$ is $p_2 = P(C_2)$. However, that person believes that the fair price for $C_1 \cup C_2$ is $p_3$, which differs from $p_1 + p_2$. Say $p_3 < p_1 + p_2$ and let the difference be $d = (p_1 + p_2) - p_3$. A gambler offers this person the price $p_3 + \frac{d}{4}$ for $C_1 \cup C_2$. The person takes the offer because it is better than $p_3$. The gambler sells $C_1$ at a discount price of $p_1 - \frac{d}{4}$ and sells $C_2$ at a discount price of $p_2 - \frac{d}{4}$ to the person. Being a rational person with those given prices of $p_1$, $p_2$, and $p_3$, all three of these deals seem very satisfactory. At this point, the person has received $p_3 + \frac{d}{4}$ and paid $p_1 + p_2 - \frac{d}{2}$. Thus, before any bets are paid off, the person has
$$p_3 + \frac{d}{4} - \left(p_1 + p_2 - \frac{d}{2}\right) = p_3 - p_1 - p_2 + \frac{3d}{4} = -\frac{d}{4}.$$
That is, the person is down $\frac{d}{4}$ before any bets are settled.
We now show that no matter what event happens, the person will pay and receive the same amount in settling the bets:
• Suppose $C_1$ happens: the gambler has $C_1 \cup C_2$ and the person has $C_1$, so they exchange $1's and the person is still down $\frac{d}{4}$. The same thing occurs if $C_2$ happens.
• Suppose neither $C_1$ nor $C_2$ happens; then the gambler and the person receive zero, and the person is still down $\frac{d}{4}$.
• $C_1$ and $C_2$ cannot occur together since they are mutually exclusive.
Thus, we see that it is bad for the person to assign
$$p_3 = P(C_1 \cup C_2) < p_1 + p_2 = P(C_1) + P(C_2),$$
because the gambler can put the person in a position to lose $(p_1 + p_2 - p_3)/4$ no matter what happens. This is sometimes referred to as a Dutch book. The argument when $p_3 > p_1 + p_2$ is similar and can also lead to a Dutch book. Thus $p_3$ must equal $p_1 + p_2$ to avoid a Dutch book; that is, $P(C_1 \cup C_2) = P(C_1) + P(C_2)$. ∎
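To see the Dutch book numerically, here is a minimal Python sketch (the prices $p_1 = 0.3$, $p_2 = 0.4$, $p_3 = 0.5$ are hypothetical choices, not from the notes) that follows the gambler's strategy and confirms the person ends up down $d/4$ whatever happens.

```python
# Hypothetical incoherent prices with p3 < p1 + p2 (so d > 0).
p1, p2, p3 = 0.3, 0.4, 0.5
d = (p1 + p2) - p3

# The person sells the bet on C1-or-C2 to the gambler for p3 + d/4,
# and buys the bets on C1 and C2 from the gambler at p1 - d/4 and p2 - d/4.
cash = (p3 + d / 4) - (p1 - d / 4) - (p2 - d / 4)

# Settle the bets for each possible outcome (C1 and C2 are mutually exclusive).
for outcome in ("C1 occurs", "C2 occurs", "neither occurs"):
    pays = 1.0 if outcome != "neither occurs" else 0.0        # person owes $1 on C1-or-C2
    receives = 1.0 if outcome != "neither occurs" else 0.0    # person collects $1 on C1 or C2
    print(outcome, round(cash - pays + receives, 10))         # always -d/4 = -0.05
```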
The Bayesian can consider subjective conditional probabilities, such as $P(C_1 \mid C_2)$, which is the fair price of $C_1$ only if $C_2$ is true. If $C_2$ is not true, the bet is off. Of course, $P(C_1 \mid C_2)$ could differ from $P(C_1)$. To illustrate, say $C_2$ is the event that “it will rain today” and $C_1$ is the event that “a certain person who will be outside on that day will catch a cold.” Most of us would probably assign the fair prices so that
$$P(C_1) < P(C_1 \mid C_2).$$
Consequently, a person has a better chance of getting a cold on a rainy day.
Frequentist vs. Bayesian statistics
The frequentist point of view towards statistics is based on
the following postulates:
• F1: Probability refers to limiting relative frequencies. Probabilities are objective properties of the real world.
• F2: Parameters are fixed, unknown constants. Because they are not fluctuating, no useful probability statements can be made about parameters.
• F3: Statistical procedures should be designed to have well-defined long-run frequency properties. For example, a 95 percent confidence interval should trap the true value of the parameter with limiting frequency at least 95 percent.
The Bayesian approach to statistics is based on the
following postulates:
• B1: Probability describes a person’s degree of belief, not limiting frequency.
• B2: We can make probability statements about parameters that reflect our degree of belief about the parameters, even though the parameters are fixed constants.
• B3: We can make inferences about a parameter $\theta$ by producing a probability distribution for $\theta$. Inferences, such as point estimates and interval estimates, may then be extracted from this distribution.
Bayesian inference
Bayesian inference about a parameter $\theta$ is usually carried out in the following way:

1. We choose a probability density $\pi(\theta)$ -- called the prior distribution -- that expresses our beliefs about a parameter $\theta$ before we see any data.

2. We choose a probability model $f(x \mid \theta)$ that reflects our beliefs about $x$ given $\theta$.

3. After observing data $X_1, \ldots, X_n$, we update our beliefs and calculate the posterior distribution $h(\theta \mid X_1, \ldots, X_n)$.

As in our discussion of the Bayesian approach in decision theory, the posterior distribution is calculated using Bayes rule:
$$h(\theta \mid x) = \frac{f_{X,\Theta}(x, \theta)}{f_X(x)} = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,d\theta}.$$
Note that $h(\theta \mid x) \propto f(x \mid \theta)\,\pi(\theta)$ as $\theta$ varies, so that the posterior distribution is proportional to the likelihood times the prior.
Based on the posterior distribution, we can get a point
estimate, an interval estimate and carry out hypothesis tests
as we shall discuss below.
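To make the three steps concrete, here is a minimal Python sketch (not from the notes): it approximates a posterior on a grid for a hypothetical coin-tossing example, with a Beta(2, 2) prior and a binomial model chosen purely for illustration, using the proportionality "posterior ∝ likelihood × prior."

```python
import numpy as np
from scipy import stats

# Hypothetical example: theta = probability of heads for a coin.
# Step 1: prior pi(theta), here taken to be Beta(2, 2) purely for illustration.
# Step 2: model f(x | theta): x heads in n independent tosses, Binomial(n, theta).
# Step 3: after observing the data, h(theta | x) is proportional to f(x | theta) * pi(theta).

theta = np.linspace(0.001, 0.999, 999)       # grid of parameter values
dtheta = theta[1] - theta[0]

prior = stats.beta.pdf(theta, 2, 2)          # pi(theta)
n, x = 10, 7                                 # hypothetical data: 7 heads in 10 tosses
likelihood = stats.binom.pmf(x, n, theta)    # f(x | theta)

unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dtheta)   # normalize: divide by f_X(x)

print("posterior mean:", (theta * posterior).sum() * dtheta)
# Conjugacy gives the exact posterior Beta(2 + 7, 2 + 3), whose mean is 9/14 ≈ 0.643,
# so the grid answer should be close to that.
```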
Bayesian inference for the normal distribution
Suppose that we observe a single observation $x$ from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. Suppose that our prior distribution for $\mu$ is $N(\mu_0, \sigma_0^2)$.
The posterior distribution of $\mu$ is
$$h(\mu \mid x) = \frac{f(x \mid \mu)\,\pi(\mu)}{\int f(x \mid \mu)\,\pi(\mu)\,d\mu} \propto f(x \mid \mu)\,\pi(\mu).$$
Now
$$f(x \mid \mu)\,\pi(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}\cdot\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left\{-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\right\}$$
$$\propto \exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\right\}$$
$$= \exp\left\{-\frac{1}{2}\left[\mu^2\left(\frac{1}{\sigma^2}+\frac{1}{\sigma_0^2}\right)-2\mu\left(\frac{x}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)+\left(\frac{x^2}{\sigma^2}+\frac{\mu_0^2}{\sigma_0^2}\right)\right]\right\}.$$
Let $a = \frac{1}{\sigma^2}+\frac{1}{\sigma_0^2}$, $b = \frac{x}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}$, and $c = \frac{x^2}{\sigma^2}+\frac{\mu_0^2}{\sigma_0^2}$ denote the coefficients in the quadratic polynomial in $\mu$ that is the last expression. The last expression may then be written as
$$\exp\left\{-\frac{a}{2}\left[\mu^2-\frac{2b}{a}\mu+\frac{c}{a}\right]\right\}.$$
To simplify this further, we use the technique of completing the square and rewrite the expression as
$$\exp\left\{-\frac{a}{2}\left(\mu-\frac{b}{a}\right)^2\right\}\exp\left\{-\frac{a}{2}\left(\frac{c}{a}-\frac{b^2}{a^2}\right)\right\}.$$
The second term does not depend on $\mu$ and we thus have that
$$h(\mu \mid x) \propto \exp\left\{-\frac{a}{2}\left(\mu-\frac{b}{a}\right)^2\right\}.$$
This is the density of a normal random variable with mean $b/a$ and variance $1/a$.
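As a symbolic sanity check of the completing-the-square step (not part of the original notes), the following Python/sympy sketch confirms that the exponent of $f(x \mid \mu)\,\pi(\mu)$ differs from $-\frac{a}{2}(\mu - b/a)^2$ only by a term that does not involve $\mu$.

```python
import sympy as sp

# Symbolic check of the completing-the-square step (illustrative, not from the notes).
mu, x, mu0 = sp.symbols('mu x mu_0', real=True)
sigma, sigma0 = sp.symbols('sigma sigma_0', positive=True)

# Exponent of f(x | mu) * pi(mu), dropping the normalizing constants.
Q = -(x - mu)**2 / (2 * sigma**2) - (mu - mu0)**2 / (2 * sigma0**2)

a = 1 / sigma**2 + 1 / sigma0**2       # coefficient a from the notes
b = x / sigma**2 + mu0 / sigma0**2     # coefficient b from the notes

# Q should equal -(a/2) * (mu - b/a)**2 plus a remainder that is free of mu.
remainder = sp.simplify(Q + (a / 2) * (mu - b / a)**2)
print(sp.simplify(sp.diff(remainder, mu)))   # prints 0: the remainder does not depend on mu
```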
Thus, the posterior distribution of $\mu$ is normal with mean
$$\mu_1 = \frac{\dfrac{x}{\sigma^2} + \dfrac{\mu_0}{\sigma_0^2}}{\dfrac{1}{\sigma^2} + \dfrac{1}{\sigma_0^2}}$$
and variance
$$\sigma_1^2 = \frac{1}{\dfrac{1}{\sigma^2} + \dfrac{1}{\sigma_0^2}}.$$
Comments about role of prior in the posterior distribution:
The posterior mean is a weighted average of the prior mean and the data, with weights proportional to the respective precisions of the prior and the data, where the precision is equal to 1/variance. If we assume that the experiment (the observation of $X$) is much more informative than the prior distribution, in the sense that $\sigma^2 \ll \sigma_0^2$, then
$$\sigma_1^2 \approx \sigma^2 \quad \text{and} \quad \mu_1 \approx x.$$
Thus, the posterior distribution of $\mu$ is nearly normal with mean $x$ and variance $\sigma^2$. This result illustrates that if the prior distribution is quite flat relative to the likelihood, then
1. the prior distribution has little influence on the posterior;
2. the posterior distribution is approximately proportional to the likelihood function.
On a heuristic level, the first point says that if one does not
have strong prior opinions, one’s posterior opinion is
mainly determined by the data one observes. Such a prior
distribution is often called a vague or noninformative prior.
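The following short Python sketch (not part of the notes; the numbers are hypothetical) computes the posterior mean and variance from the formulas above, and illustrates the flat-prior comment: when $\sigma_0^2$ is much larger than $\sigma^2$, the posterior is essentially $N(x, \sigma^2)$.

```python
def normal_posterior(x, sigma2, mu0, sigma02):
    """Posterior N(mu1, sigma1^2) of mu for a single observation x ~ N(mu, sigma2)
    with known variance sigma2 and prior mu ~ N(mu0, sigma02)."""
    precision = 1.0 / sigma2 + 1.0 / sigma02         # a = 1/sigma^2 + 1/sigma_0^2
    mu1 = (x / sigma2 + mu0 / sigma02) / precision   # posterior mean b/a
    sigma12 = 1.0 / precision                        # posterior variance 1/a
    return mu1, sigma12

# Hypothetical numbers: observe x = 3 with sigma^2 = 1 and prior N(0, 2).
print(normal_posterior(x=3.0, sigma2=1.0, mu0=0.0, sigma02=2.0))    # (2.0, 0.666...)

# Vague prior: sigma_0^2 >> sigma^2, so the posterior is approximately N(x, sigma^2).
print(normal_posterior(x=3.0, sigma2=1.0, mu0=0.0, sigma02=1e6))    # (~3.0, ~1.0)
```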
Bayesian decision making:

When faced with a decision, the Bayesian wants to minimize the expected loss (i.e., maximize the expected utility) of a decision rule under the prior distribution $\pi(\theta)$ for $\theta$. In other words, the Bayesian chooses the decision rule $d$ that minimizes the Bayes risk
$$B(d) = E_{\pi(\theta)}[R(\theta, d)];$$
i.e., the Bayesian chooses to use the Bayes rule for the Bayesian's prior distribution $\pi(\theta)$.
As we showed in the last class, for point estimation with squared error loss, the Bayes rule is to use the posterior mean as the estimate.
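As a quick numerical illustration of this fact (not from the notes; the discretized $N(2, 2/3)$ posterior below is a hypothetical example), the sketch checks that the posterior mean minimizes the posterior expected squared error loss over a grid of candidate estimates.

```python
import numpy as np
from scipy import stats

# Hypothetical posterior, e.g. the N(2.0, 2/3) posterior from the normal example above.
mu_grid = np.linspace(-3.0, 7.0, 2001)
post = stats.norm.pdf(mu_grid, loc=2.0, scale=np.sqrt(2.0 / 3.0))
post /= post.sum()                               # normalize the discretized posterior

# Posterior expected squared error loss for each candidate estimate a.
actions = np.linspace(0.0, 4.0, 401)
expected_loss = np.array([(post * (mu_grid - a) ** 2).sum() for a in actions])

print("loss-minimizing estimate:", actions[np.argmin(expected_loss)])   # ≈ 2.0
print("posterior mean:          ", (post * mu_grid).sum())              # ≈ 2.0
```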
Thus, for the above normal distribution setup, the Bayesian's estimate of $\mu$ is the posterior mean
$$\mu_1 = \frac{\dfrac{x}{\sigma^2} + \dfrac{\mu_0}{\sigma_0^2}}{\dfrac{1}{\sigma^2} + \dfrac{1}{\sigma_0^2}}.$$