Bayes and
robust Bayes
Scott Ferson, [email protected]
9 October 2007, Stony Brook University, MAR 550, Challenger 165
Outline
• Introduction
• Bayes’ rule
• Distributions
  – Getting priors and specifying likelihoods
  – Conjugate pairs
• Model uncertainty
• Updating
• Multinomial sampling
• Advantages and disadvantages
Robust Bayes
– Bounds on cumulatives, density-ratio bounds
• Imprecise Dirichlet model (IDM)
– Walley’s marbles, Laplace’s sunrises, (fault trees)
• Conclusions and synopsis
• Exercises
Bayesians are like snowflakes
• Huge diversity among practitioners
• Most, but not all, are subjectivists
• Most, but not all, regard all statistical
problems as part of decision theory
• Most, but not all, believe that precise
probabilities should and can be elicited
• Most, but amazingly not all, employ Bayes’ rule to obtain results
Bayesian domains and concerns
• Statistics
• Decision analysis
• Risk analysis
• Science itself
• Updating and aggregation, rarely convolutions
• Disparate lines of evidence, data, judgments
• Little or no data
Derivation of Bayes’ rule
P(A & B) = P(A|B) P(B) = P(B|A) P(A)
P(A|B) = P(A) P(B|A) / P(B)
[Venn diagram: events A and B overlapping in the region A & B]
Disease example
Prevalence in the general population is 0.01%
Tests of diseased people are positive 99.9% of the time
Tests of healthy people are negative 99.99% of the time
Given that you’ve tested positive, what’s the
chance that you actually have the disease?
Almost all doctors say 99% or greater, but the true
answer is only about 50%.
Apply Bayes’ rule
P(disease)  P(positive | disease)
P(disease | positive) = ——————————————
P(positive)
where the normalization factor P(positive) is computed as the sum
P(disease)  P(positive | disease) + (1  P(disease))  (1  P(negative | healthy))
prevalence = 0.01% = P(disease)
sensitivity = 99.9% = P(positive | disease)
specificity = 99.99% = P(negative | healthy)
prevalencesensitivity/(prevalencesensitivity+(1prevalence)(1specificity))
= 0.01%  99.9% / (0.01%  99.9% + (1  0.01%)  (1  99.99%))
= 0.4998
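A minimal R sketch of this calculation (the variable names are mine, not the slide’s):

R:
  # Posterior probability of disease given a positive test, via Bayes' rule
  prevalence  <- 0.0001    # P(disease) = 0.01%
  sensitivity <- 0.999     # P(positive | disease)
  specificity <- 0.9999    # P(negative | healthy)
  prevalence * sensitivity /
    (prevalence * sensitivity + (1 - prevalence) * (1 - specificity))
  # about 0.4998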
To see why this is so…
• Imagine 10,000 people without any risk factors
• The prevalence of 0.01% suggests about one
person from this group has the disease
• The sensitivity of the test (99.9%) says that this
person will test positive almost surely
• The specificity of 99.99% suggests that, of the
9,999 people who do not have the disease, another
will also test positive
• Thus, we’d expect two people to test positive for
the disease, but only one of them actually has it
Paternity probability from blood tests
• Mother has type O blood; girl has type B blood (she got the B allele from her father)
• An alleged father has type AB so could have been the donor of the B allele
P(B | F) = probability the girl would inherit the B allele if the man were her father
P(B | not F) = probability she could inherit the B allele from another man
P(F) = prior probability the man is the father of the girl
P(F | B) = probability the man is her father given that they share the B allele
• Genetics says P(B | F) = ½
• Background frequency of the B allele is 0.09, so P(B | not F) = 0.09
• Prior unknown, so set P(F) = ½
P(F)  P(B | F)
½½
P(F | B) = —————————————— = —————————— = 0.847
P(F)P(B|F)+(1P(F))P(B|not F) ½  ½ + (1  ½)  0.09
So is he the babydaddy?
Waiting for a bus
• You got to the bus stop early, where the bus is now 10 minutes late. It might not
  come at all, and the next one is in two hours. You might need to walk.
• 90% of buses do their rounds, and 80% of those that do are less than 10 min late
• 10% chance that bus won’t show up at all
• Given that it’s already 10 minutes late, what’s the probability the bus will come?

Event B: bus comes          Event W: you’re walking

Prior probabilities: P(B) = 90%, P(W) = 10%
Likelihoods given it’s late: P(10 min | B) = 20%, P(10 min | W) = 100%

P(B | 10 min) = P(B) × P(10 min | B) / (P(B) × P(10 min | B) + P(W) × P(10 min | W))
              = 90% × 20% / (90% × 20% + 10% × 100%)
              = 18 / 28
              ≈ 0.643
A decision (which is where Bayesians really want to be) about whether to continue to wait or
start walking would consider the costs of missing your appointment and how tired you are.
P(X | Y) versus P(Y | X)
• Not interchangeable
– No matter how convenient that’d be
• Error makers
– Stupid people
– Smart people: Laplace (who invented Bayesian inference) and
  de Morgan (who wrote the first treatise on probability in English)
• Bayes’ rule converts one to the other
Bayes’ rule on distributions
posterior  prior  likelihood
posterior (normalized)
likelihood
prior
-5
0
5
q
10
15
20
Density of normal(, ) is
P( y ) 
Example Example
  y  μ 2
exp  
2σ 2
2πσ 2

1
.




• One observation x = 13 drawn from normal(q, 2)
2


1
θ  6 

• Prior for q is normal(6,1):
P(θ) 
exp  

2
2π


 x  θ 2 
1

• Likelihood of normal(q,2):P( x | θ) 
exp  

4
4π


• Multiply:
  3θ 2  2(12  x)θ 

P(θ | x)  P(θ)P( x | θ)  exp 
4


• Posterior  exp((3q2+50q)/4) = normal(8⅓, ⅔ )
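A quick R check of this conjugate update, using the mean-and-variance parameterization above:

R:
  # Normal-normal update: prior normal(6, var 1), one datum x = 13 from normal(theta, var 2)
  x  <- 13
  m0 <- 6; v0 <- 1            # prior mean and variance
  vl <- 2                     # sampling variance in the likelihood
  v1 <- 1 / (1/v0 + 1/vl)     # posterior variance = 2/3
  m1 <- v1 * (m0/v0 + x/vl)   # posterior mean = 8.33...
  c(mean = m1, variance = v1)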
Answer

[Figure: posterior density (0 to 0.5) and cumulative distribution (0 to 1) of θ, plotted from 0 to 20]
Interpretation of the posterior
• Estimates value of θ (a distribution mean)
– Shrinkage (pull towards the prior’s mean)
• Uncertainty expressed as distributions
• Peaky density (steeper CDF) means surer
• Posterior is more certain than the prior
– Even though the datum is surprising to the prior
Credible intervals
[Figure: posterior density and CDF of θ, with the 90% credible interval (Bayesian confidence interval) marked]
Bayesian confidence intervals
• Express probability the value is in an interval
– No complex Neyman-Pearson interpretation
• No guarantee about performance
– Performance “90% of the time…” doesn’t apply
– Don’t call them “confidence intervals”
• Asymmetric intervals
– Narrowest (mode)
– Symmetric (median)
– Mean-centered
Prosecutor’s fallacy
Where does the likelihood come from?
• Mechanism that generated the data
• ‘Inverse’ probability (“Libby is Billy spelled sideways, sorta”)
  – Density: probability of x as a function of x, given θ
  – Likelihood: probability of x as a function of θ, given x
• Not a distribution (doesn’t integrate to one)
  – Equivalence class of functions, can scale however we want
• Appropriateness of a probability model that justifies a
particular likelihood function is always at best a
matter of opinion (Edwards 1972, page 51)
Where do priors come from?
• Whatever You say
• Maximum entropy
• EDFs or empirical Bayes (Newman 2000)
– Data used to estimate prior and also likelihood
• Uninformative or reference priors
– No prior can represent ignorance (Bernardo 1979)
– Just a way to minimize the prior’s influence to discover
what the data themselves are saying about the posterior
• Conjugate pairs
Conjugate pairings
(Likelihood parameters other than θ are assumed to be known.)

Likelihood          Prior             Posterior
bernoulli(θ)        uniform(0, 1)     beta(1+Σxi, 1+n−Σxi)
bernoulli(θ)        beta(α, β)        beta(α+Σxi, β+n−Σxi)
poisson(θ)          gamma(α, β)       gamma(α+Σxi, β+n)
normal(θ, s)        normal(μ, σ)      normal((μ/σ² + Σxi/s²)/v, 1/v),  v = 1/σ² + n/s²
exponential(θ)      gamma(α, β)       gamma(α+n, β+Σxi)
binomial(k, θ)      beta(α, β)        beta
uniform(0, θ)       pareto(b, c)      pareto
negbinomial         beta(α, β)        beta
normal(θ, s)        gamma(α, β)       gamma
exponential(θ)      invgamma(α, β)    inverse gamma
multinomial(k, θj)  dirichlet(s, tj)  dirichlet(n+s, (xj+stj)/(n+s)),  n = Σxj, j ∈ {1,…,k}
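As a worked instance of the bernoulli–beta row, here is a small R sketch; the prior parameters and the data are hypothetical:

R:
  # Conjugate update: bernoulli(theta) likelihood with a beta(a0, b0) prior
  a0 <- 2; b0 <- 2                  # hypothetical beta(2, 2) prior
  x  <- c(1, 0, 1, 1, 0, 1, 1, 1)   # hypothetical Bernoulli observations
  n  <- length(x); s <- sum(x)
  a1 <- a0 + s                      # posterior is beta(a0 + sum(x), b0 + n - sum(x))
  b1 <- b0 + n - s
  c(mean = a1/(a1 + b1),
    lo95 = qbeta(0.025, a1, b1),
    hi95 = qbeta(0.975, a1, b1))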
Model uncertainty
Bayesian model averaging
(Draper 1995)
• Similar to the probabilistic mixture
• Updates prior probabilities to get weights
• Takes account of available data
Bayesian model averaging
• Assume it’s actually the first model
• Compute probability distribution under that model
• Read off probability density of observed data
– Product if multiple data; it’s the likelihood for that model
• Repeat above steps for each model
• Compute posterior  prior  likelihood
– This gives the Bayes’ factors
• Use the posteriors as weights for the mixture
Example
Either  f(A,B) = fPlus(A,B) = A + B
or      f(A,B) = fTimes(A,B) = A × B
where   A ~ normal(0, 1),  B ~ normal(5, 1)

Single datum:  f(A,B) = 7.59
Priors:        P(fPlus) = 0.6,  P(fTimes) = 0.4
Likelihoods

[Figure: probability densities of fPlus and fTimes, with the datum 7.59 marked]

fPlus(A,B) ~ A + B ~ normal(5, 2)       LPlus(7.59) = 0.05273
fTimes(A,B) ~ A × B ~ normal(0, 26)     LTimes(7.59) = 0.02584

R: dnorm(7.59, 5, sqrt(2))
Excel: =NORMDIST(7.59, 5, SQRT(2), FALSE)
Bayes’ factors
Posterior probabilities for the two models:  posterior ∝ prior × likelihood

Plus:   0.6 × 0.05273 / (0.6 × 0.05273 + 0.4 × 0.02584) = 0.7538
Times:  0.4 × 0.02584 / (0.6 × 0.05273 + 0.4 × 0.02584) = 0.2462
        (the denominator is the normalization factor)

These are the weights for the mixture distribution
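A short R sketch reproducing these weights from the slide’s moment-matched normal likelihoods:

R:
  # Bayesian model averaging weights for fPlus and fTimes
  datum   <- 7.59
  L_plus  <- dnorm(datum, mean = 5, sd = sqrt(2))    # A + B ~ normal(5, variance 2)
  L_times <- dnorm(datum, mean = 0, sd = sqrt(26))   # A * B approximated as normal(0, variance 26)
  prior   <- c(plus = 0.6, times = 0.4)
  lik     <- c(plus = L_plus, times = L_times)
  prior * lik / sum(prior * lik)     # posterior weights, about 0.754 and 0.246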
[Figure: probability densities and cumulative distributions of fPlus and fTimes, plotted against f(A,B) from −20 to 25]
Bayesian model averaging
• Produces single distribution as answer
• Can account for differential prior credibility
• Takes account of available data
• Must enumerate all possible models
• May be computationally challenging
• Confounds variability and incertitude
• Averages together incompatible theories
• Underestimates tail risks
Updating from relational constraints
Updating with only constraints
Suppose W  H = A
W  [23, 33]
H  [112, 150]
A  [2000, 3200]
Any triples W  H  A get excluded with a likelihood of zero
Mass concentrated onto manifold of feasible combinations
Likelihood L( A  W  H | W , H , A)   ( A  W  H )
I (W  [23,33]) I ( H  [112,150]) I ( A  [2000,3200])
Prior Pr(W , H , A) 


33  23
150  112
3200  2000
Posterior f (W , H , A | A  W  H )   ( A  W  H )  Pr(W , H , A)
 () Dirac delta function
I() indicator function
[Figure: original interval ranges for W, H, and A, and the CDF bounds from interval constraint analysis]
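One way to approximate this constrained posterior is rejection sampling; the sketch below is my own Monte Carlo scheme, not the slide’s interval analysis:

R:
  # Sample W and H from their uniform priors, impose A = W*H, and keep only
  # triples whose product lands inside A's prior interval
  set.seed(1)
  W <- runif(1e6, 23, 33)
  H <- runif(1e6, 112, 150)
  A <- W * H
  keep <- A >= 2000 & A <= 3200      # the uniform prior on A contributes only this indicator
  W <- W[keep]; H <- H[keep]; A <- A[keep]
  plot(ecdf(A), main = "Posterior CDF of A given A = W*H")   # stays within [2000, 3200]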
Multinomial sampling
Multinomial sampling
• Bag of marbles colored red, blue, or green
• Bag is shaken and a marble is randomly drawn
• Its color is recorded and it’s put back in the bag
• What’s the chance the next marble is red?
– If we knew how many marbles of each color are in
the bag, we could compute the probability
– If we haven’t seen inside the bag, what can we say
given the colors of marbles drawn so far?
Multinomial sampling
• Consider N random observations (sampled
with replacement)
• As each marble is drawn, we record its color
j   = {red, blue, green}
• This is just the standard multinomial with
P(j) = qj, where j=1,2,3, 0  qj, and qj = 1
Bayesian analysis
• Given nj = number of marbles colored j,
  where Σnj = N, the likelihood function is
  ∏j θj^nj   (product over j = 1, …, k)
• A Dirichlet(s, tj) prior is conjugate, with density proportional to
  ∏j θj^(stj − 1)
  where 0 < s, 0 < tj < 1, Σtj = 1 (tj is the mean of θj)
Solution
• Posterior is proportional to
  ∏j θj^nj × ∏j θj^(stj − 1) = ∏j θj^(nj + stj − 1),
  i.e., Dirichlet(N+s, (nj+stj)/(N+s))
• Bayes-Laplace prior is uniform: Dirichlet(s=3, tj=1/3)
• Posterior for this prior is Dirichlet(N+3, (nj+1)/(N+3))
• The vector of means for a Dirichlet(s, tj) is tj, so the posterior
  expected value of θj is (nj+1)/(N+3)
• Suppose we had 7 draws: blue, green, blue, blue, red, green, red
• Predictive probability the next marble is red is (2+1)/(7+3) = 0.3
• Compare to the observed frequency of red marbles 2/7 ≈ 0.2857
• Before any data, N=0, the probability is (0+1)/(0+3) = 0.333   ← shrinkage
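A small R sketch of these posterior predictive probabilities under the Bayes-Laplace uniform prior (s = 3, tj = 1/3):

R:
  # Dirichlet posterior predictive means for the marble colours
  draws <- c("blue", "green", "blue", "blue", "red", "green", "red")
  n <- table(factor(draws, levels = c("red", "blue", "green")))
  N <- sum(n); s <- 3; t <- rep(1/3, 3)
  (n + s * t) / (N + s)      # red: (2+1)/(7+3) = 0.3, blue: 0.4, green: 0.3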
Beyond colored marbles
• Multinomial sampling occurs in many risk-analytic and statistical
  problems with iid random trials with finitely many possible outcomes
• The analysis, and solutions, for all such problems
  are just the same as for the marbles
Advantages of Bayesian statistics
• Ascendant
(Berry 1996; 1997)
– Fashionable and iconoclastic
– Answers easier since computer revolution
• Decision making
– Integrated into Bayesian framework
– Frequentists don’t even balance costs of Type II errors
• Rationality
– Maximizes expected utility, avoids sure loss (Dutch book)
– Rational agents seeing same info eventually agree (Dawid)
• Subjective information
– Legitimizes and takes account of prior beliefs
• Working without any empirical data
• Naturalness
• Data mining
Naturalness

Bayesian
• Any question (OJ’s guilt, God)
• Intuitive credibility intervals
• Distributions represent uncertainty
• Can use all information, even if it’s from outside the experiment
• Likelihood calculated only from the data actually seen
• Posterior depends on data only through the likelihood
• Free to stop whenever, so can enjoy serendipity, windfalls
• Can use scientific stopping rule

Frequentist
• Can only study repeatable events
• Torturous confidence intervals
• No parameter distributions (fixed)
• Can only take account of what’s observed during the experiment
• Consider probabilities of hypothetical data never observed
• Probability measures depend on the experimental design
• Must follow experimental design to compute p
• Can’t use scientific stopping rule
Data mining

Bayesian
• Perfectly okay
• Doesn’t affect results
• Scientists free to peek at data before and during
• Can have many more parameters than samples
• Can estimate lots of parameters at once

Frequentist
• Scientifically improper (can’t compute p)
• Data contaminates the scientist’s mind
• Bad to create a model with many parameters
• Bad to estimate more than a few parameters at a time

Important to neophytes
Controversial
Really important
Problems and technical issues
• Subjectivity required: Beliefs needed for priors may
be inconsistent with public policy/decision making
• Needs priors, sampling model/likelihood
• Computational difficulty: It can be very hard to
derive answers analytically
• Zero preservation: If either the likelihood or the
prior is zero for a given θ, then the posterior is zero there
• Inadequate model of ignorance: Doesn’t distinguish
between incertitude and equiprobability
Fear of the prior
• Many Bayesians are reluctant to employ subjective
priors in scientific work (Yang and Berger 1998)
• Hamming (1991, page 298) wrote bluntly
“If the prior distribution, at which I am frankly guessing,
has little or no effect on the result, then why bother; and
if it has a large effect, then since I do not know what I
am doing how would I dare act on the conclusions
drawn?” [emphasis in the original]
Everything’s in the prior

[Figure: posterior probability of paternity plotted against the prior probability of paternity, both from 0 to 1. The strength of the evidence is reflected in the function’s curvature.]
Groups
• Bayesian theory does not work for groups
– Rationality inconsistent with democratic process
• Some even say “groups can’t make decisions” (Geo. Hazelrigg, NSF)
– Despite the fact they come to verdicts,
conclusions, findings, choices, …
• Must imagine a personalist “You”
– Teams, agencies, collaborators, companies, clients
– Reviewers, peers
Is coherence enough?
• Bayesians say their reasoning is rational and their conclusions
sound if their beliefs and inferences are coherent (consistent to
avoid sure loss)
• Standard Bayesian inferences are coherent, but coherence is
only a minimal requirement of rationality (Walley 1991, 396f)
• Beliefs ought to also conform to evidence, which Bayesian
inferences rarely do (ibid.)
• Abandoning objectivity forsakes the essential connection to
evidence and accountability in the real world
• Evidence comes from measurement and objectivity
Robust Bayes
Robustness
• An answer is robust if it does not depend
sensitively on the assumptions and calculation
inputs on which it’s based
• Related to important ideas in other areas of
statistics such as robust and resistant estimators
• Robust Bayes analysis, also called Bayesian
sensitivity analysis, investigates the robustness of
answers
Bayes’ rule made safe for uncertainty
prevalence  sensitivity
—————————————————————————
sensitivity  prevalence + (1  specificity)  (1  prevalence)
1


1

 11  specificit y 
prevalence


1
sensitivit y
1 / (1 + ((1/prevalence  1)  (1  specificity)) / sensitivity)
sensitivity = 99.9%
specificity = 99.99%
prevalence = [0.005, 0.015]%
[0.005, 0.015]%  99.9%
——————————————————————————
([0.005,0.015]%  99.9% + (1[0.005,0.015]%)  (199.99%))
= [19.99, 99.95]%
1 / (1 + ((1/[0.005, 0.015]%  1)  (1  99.99%)) / 99.9%)
= [33.31, 59.98]%
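An R sketch of the interval calculation; the rearranged expression is monotone in prevalence, so evaluating it at the interval endpoints gives the bounds, whereas naive interval arithmetic on the original form repeats prevalence and inflates them:

R:
  # Robust Bayes with an interval prevalence in the disease example
  sens <- 0.999; spec <- 0.9999
  prev <- c(0.00005, 0.00015)                      # prevalence interval [0.005, 0.015]%
  # Single-use rearrangement: prevalence appears once, so endpoints give the exact bounds
  exact <- 1 / (1 + ((1/prev - 1) * (1 - spec)) / sens)
  round(100 * exact, 2)                            # about 33.31 and 59.98
  # Naive interval arithmetic treats the two occurrences of prevalence independently
  naive <- c(prev[1]*sens / (prev[2]*sens + (1 - prev[1])*(1 - spec)),
             prev[2]*sens / (prev[1]*sens + (1 - prev[2])*(1 - spec)))
  round(100 * naive, 2)                            # compare with the slide's [19.99, 99.95]%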
Uncertainty about the prior
class of prior distribution  class of posteriors
posteriors
priors
-5
0
likelihood
5
q
10
15
20
Uncertainty about the likelihood
class of likelihood functions  class of posteriors
posteriors
likelihoods
prior
-5
0
5
q
10
15
20
Uncertainty about both
[Figure: classes of priors and likelihoods together yield a class of posteriors, plotted against θ]
Moments of the normals

Prior: normal(μ, σ),       μ = [1, 6],   σ = [2.5, 6]
Likelihood: normal(x, s),  x = [5, 13],  s = [4, 5]

Posterior: normal(θ1, v1)
  mean  = (μ/σ² + x/s²) / (1/σ² + 1/s²) = [2.28, 10.35]
  stdev = √(1 / (1/σ² + 1/s²))          = [1.24, 1.66]

Conjugacy facilitates the calculation, but note that σ² and s² are repeated,
so be careful evaluating the mean.
Robust Bayes can make a p-box
[Figure: classes of priors and likelihoods produce a class of posteriors whose envelope is a p-box]

class of priors, class of likelihoods → class of posteriors
Uncertainty about decisions
class of probability models  class of decisions
class of utility functions  class of decisions
If you end up with a single decision,
you’re in like flint.
If the class of decisions is large and diverse,
then you should be somewhat tentative about
making any decision.
Bayesian dogma of ideal precision
• Robust Bayes is plainly inconsistent with
the Bayesian idea that uncertainty should be
measured by a single additive probability
measure and values should always be
measured by a precise utility function.
• Some Bayesians justify it as a convenience
• Others suggest it accounts for uncertainty
beyond probability theory
Desirable properties of the class
• Easy to understand and elicit
• Easy to compute with
• Sufficiently big to reflect one’s uncertainty
• Generalization to multiple dimensions
  (Berger 1994)
• Near ignorance (vacuous, but not in all ways)
Defining classes of priors
• Parametric conjugate families
• Parametric but non-conjugate families
• Density-ratio (bounded density distributions)
• ε-contamination, mixture, quantile classes, etc.
• Bounds on cumulatives (trivializes results)
Parametric class
Parameters can be chosen to make the class wide,
but it will often still be too sparse in the sense
that the distributions it admits are too similar to
reflect what we think might be possible.
[Figure: CDFs of a parametric class of priors]
Bounded cumulative distributions
• Shouldering: Bounds on CDFs (p-boxes) admit
distributions whose density may be zero anywhere.
• Renormalization: Posteriors have unit area.
• Zero preservation: Posterior’s range is inside the
intersection of the ranges of the prior and likelihood.
• Triviality: Together, these imply that all you can
ever conclude about the posterior is its range
Shouldering and zero preservation
A shoulder on a CDF corresponds to zero density.
Because the posterior can be zero almost everywhere,
all the mass could be pushed to any point.
Thus the posterior’s lower bound is zero everywhere; the result is trivial.

[Figure: CDF bounds on the prior, likelihood, and posterior, illustrating shouldering]
Bounded densities
[Figure: bounded prior and likelihood densities plotted against θ from 0 to 35]

Area below the prior’s lower bound is smaller than 1; area below the upper bound is larger than 1.
There are no restrictions on the likelihood bounds.
Un-normalized posterior bounds
[Figure: un-normalized posterior density bounds plotted against θ. How can we normalize?]
Normalized posterior bounds
Normalization trick: divide the lower bound by the area under the upper bound, and vice versa.

[Figure: normalized posterior density bounds plotted against θ]
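A toy R illustration of this normalization trick; the un-normalized bounds here are hypothetical, chosen only so that the lower bound has area less than 1 and the upper bound more than 1:

R:
  # Normalize density bounds: divide the lower bound by the area under the
  # upper bound, and the upper bound by the area under the lower bound
  theta <- seq(0, 35, length.out = 1001)
  lower <- 0.8 * dnorm(theta, 15, 4)     # hypothetical un-normalized lower bound (area about 0.8)
  upper <- 1.3 * dnorm(theta, 15, 4)     # hypothetical un-normalized upper bound (area about 1.3)
  area  <- function(f) sum(f) * diff(theta[1:2])   # crude numerical integration
  post_lower <- lower / area(upper)
  post_upper <- upper / area(lower)
  c(area(post_lower), area(post_upper))  # lower area < 1 < upper area, so unit-area densities fit between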
Bounds on posterior densities
• Imply bounds on cumulative tail risks
• Imply bounds on moments
• Imply bounds on the mode
Take-home lessons
• There are many ways to do robust Bayes, depending on the kind of uncertainty
• Parametric uncertainty can be too sparse
• Yet too much generality trivializes the result
• Often not so easy to do the calculations
Probability of an event
• Imagine a gamble that pays one dollar if an event
occurs (but nothing otherwise)
– How much would you pay to buy this gamble?
– How much would you be willing to sell it for?
• Probability theory requires the same price for both
– By asserting the probability of the event, you agree to
buy any such gamble offered for this amount or less,
and to sell the same gamble for any amount less than or
equal to this ‘fair’ price…and to do so for every event!
• IP just says, sometimes, your highest buying price
might be smaller than your lowest selling price
Imprecise Dirichlet model
Walley’s (1996) bag of marbles
• What’s the probability of drawing a red
marble if you’ve never seen inside the bag?
• First, consider a model with N random (iid)
observations (sampled with replacement).
As each of the N marbles is drawn, we can
see its color j   = {1, …, k}
• This is just the standard multinomial with
P(j) = qj, where j=1,…,k, 0  qj, and qj = 1
Bayesian analysis
• Given nj = number of marbles that were colored j,
nj = N, the likelihood function is proportional to
k
θ
nj
j
j 1
• A Dirichlet prior is convenient (it’s a multivariate
generalization of a beta), with density proportional to
k
θ
j 1
st j 1
j
Dirichlet(s, tj)
where 0 < s, 0 < tj < 1, tj = 1 (tj is mean for qj)
The answer seems easy
• Posterior is also Dirichlet, with density proportional to
  ∏j θj^nj × ∏j θj^(stj − 1) = ∏j θj^(nj + stj − 1),
  i.e., Dirichlet(N+s, (nj+stj)/(N+s))
• But what if you don’t even know how many colors k
there are in the bag?
– Letting  = {red, not red} gives a different answer from
letting  = {blue, green, yellow, red}
– Lots of ways to form , lots of different answers
• Peeking (using the data to form ) is incoherent
Walley’s “Imprecise Dirichlet model”
• Likelihood is multinomial, as before
• Prior is the class of all Dirichlet distributions
M0s = {F : F = Dirichlet(s, tj)} for a fixed s > 0
that does not depend on Ω
• The corresponding posterior class is
MNs = {F : F = Dirichlet(N+s, (nj+stj)/(N+s))}
Predictive probabilities
• Under this posterior, the predictive probability that
the next observation will be color j is the posterior
mean of qj which is (nj + stj)/(N + s)
• By extremizing this quantity on tj, we get posterior
upper and lower probabilities
  P̲ = nj / (N + s)          (attained as tj → 0)
  P̄ = (nj + s) / (N + s)    (attained as tj → 1)
• Easily generalizes to other events
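A small R sketch of these bounds, reusing the earlier seven draws (2 reds in N = 7) and taking s = 2, which the following slides suggest as a reasonable choice:

R:
  # IDM lower and upper predictive probabilities that the next marble is red
  n_red <- 2; N <- 7; s <- 2
  c(lower = n_red / (N + s),          # attained as t_red -> 0
    upper = (n_red + s) / (N + s))    # attained as t_red -> 1
  # about 0.222 and 0.444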
What if there’s no or little data?
• Before any observations, P = [P, P] = [0,1]
.
• This is appropriately vacuous (duh!)
• The width of the interval decreases as sample size
N increases, but doesn’t depend on k or the sample
space  at all
• In the limit N  , the Walley, Bayesian and
frequentist answers all converge to nj /N
What should s be?
• Determines the prior’s influence on posterior
If s  0, interval  point
(overconfident)
If s  , interval width  1
(ineducable)
Using s = 1 or 2 might be reasonable
• Letting it depend on N would be incoherent
• Letting it vary would be incoherent
• Different values of s give consistent inferences
that are nested (unlike Bayesian inconsistency)
Bayesian inconsistency

• Different priors give different answers (for an event comprising m of the k possible outcomes):
    Bayes                   (nj + m) / (N + k)
    Jeffreys (1946, 1961)   (nj + ½m) / (N + ½k)
    Perks (1947)            (nj + m/k) / (N + 1)
    Haldane (1948)          nj / N

• Different models for Ω give different answers:
    Bayes                   [0, 1]                              s = k
    Jeffreys (1946, 1961)   [0, 1]                              s = k/2
    Perks (1947)            [nj / (N + 1), (nj + 1) / (N + 1)]  s = 1
    Haldane (1948)          nj / N                              s = 0
    IDM                     [nj / (N + s), (nj + s) / (N + s)]  s ≥ 2
What if the event has never happened?
• If nj = 0, then P = 0 andP = s/(N+s)
– No reason to bet on such an event at any odds
– But should be willing to bet against it at shorter
and shorter odds as N increases
• Contrast this to Laplace’s famous calculation
of the probability the sun wouldn’t rise the
next day via his “rule of succession”
(nj+1)/(N+2) = 1/1826215
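For comparison, a one-line R version of Laplace’s calculation; the value 1/1826215 on the slide corresponds to N = 1826213 observed sunrises (about 5000 years):

R:
  # Rule of succession: probability the sun will NOT rise tomorrow,
  # given N sunrises and no failures: (0 + 1)/(N + 2)
  N <- 1826213
  1 / (N + 2)      # = 1/1826215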
Conclusions and synopsis
Bayesian analysis
How?
decide on your “prior” belief, P(model)
compute likelihood function, P(data | model), as a function of the model
multiply to get posterior belief, P(model | data)
today’s posterior is tomorrow’s prior
Why?
“rational”
acknowledges inescapability of subjectivity
updating rule supports the coherent accumulation of knowledge
Why not?
tolerates (requires) subjectivity
Bayesian rationality does not extend to group decisions
inadequate model of ignorance (confounds incertitude with equiprobability)
Robust Bayes (sensitivity analysis)
How?
use a class of distributions to represent uncertainty about the prior
use a class of functions to represent uncertainty about the likelihood
get class of posteriors by applying Bayes’ rule to every possible combination
Why?
accounts for the analyst’s doubts about required inputs
can be cheaper and easier than insisting on precise inputs
consistent with robust statistics (doesn’t require analyst’s omniscience)
expresses the reasonableness of the posterior
Why not?
can be computationally difficult
does not obey Bayesian dogma of ideal precision
still has zero preservation problem
References
• Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society 53: 370-418. Reprinted in 1958 in Biometrika 45: 293-315.
• Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.
• Bernardo, J.M. (1979). Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society, Series B 41: 113-147 (with discussion).
• Berry, D.A. (1996). Statistics: A Bayesian Perspective. Duxbury Press, Belmont, California.
• Berry, D.A. (1997). Bayesian Statistics. Institute of Statistics and Decision Sciences, Duke University, http://www.pnl.gov/Bayesian/Berry/.
• Dawid, A.P. (1982). Intersubjective statistical models. In Exchangeability in Probability and Statistics, edited by G. Koch and F. Spizzichino. North-Holland, Amsterdam.
• Draper, D. (1995). Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society, Series B 57: 45-97.
• Edwards, A.W.F. (1972). Likelihood. Cambridge University Press.
• Insua, D.R. and F. Ruggeri (eds.) (2000). Robust Bayesian Analysis. Lecture Notes in Statistics, Volume 152. Springer-Verlag, New York.
• Laplace, P.S., Marquis de (1820). Théorie analytique des probabilités (troisième édition). Courcier, Paris. The introduction (Essai philosophique sur les probabilités) is available in English translation as A Philosophical Essay on Probabilities (1951), Dover Publications, New York.
• Lavine, M. (1991). Sensitivity in Bayesian statistics, the prior and the likelihood. Journal of the American Statistical Association 86(414): 396-399.
• Pericchi, L.R. (2000). Sets of prior probabilities and Bayesian robustness. Imprecise Probability Project. http://ippserv.rug.ac.be/documentation/robust/robust.html; http://www.sipta.org/documentation/robust/pericchi.pdf.
• Walley, P. (1991). Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, London.
• Yang, R. and J.O. Berger (1998). A catalog of noninformative priors. Parexel International and Duke University, http://www.stat.missouri.edu/~bayes/catalog.ps.
Exercises
1. Check the math in the example on slide 13.
2. Sketch [P,P] for {-,B,G,B,B,G,R,G,…,(35/100)R,…(341/1000)R}.
3. What can be said about the posterior if the prior is
uniform(a,b) and the likelihood function is a nonzero
constant for values of θ between c and d and zero
elsewhere? Sketch the answer for a = 2, b = 14, c = 5,
d = 21. What is the posterior if a ∈ [1,3], b ∈ [11,17],
c ∈ [4,6], d ∈ [20,22]?
4. How could you do Bayesian backcalculation?
5. What is the statistical ensemble corresponding to a
prior probability distribution?
End
Missing failure data
Failure data is often incomplete
• We know how many times C failed, but don’t know whether it was A or B or both that made it fail
• We know from component testing how often A failed, but not whether B would have too

[Figure: fault tree with top event C fed by an OR gate over components A and B]
• Traditional Bayesian approach (Vesely, et al.)
– Makes assumptions about missing data
– Requires independence
• IDM can relax these assumptions
– Computes chance of C, A or B failing at next test
– Uses only failure data nC, NC, nA, NA, nB, NB
– And hardly any assumptions