The generalization of Bayes for continuous densities is that we have some density $f(y \mid \theta)$ where $y$ and $\theta$ are vectors of data and parameters, with $\theta$ being sampled from a prior $\pi(\theta \mid \lambda)$ where the $\lambda$ are hyperparameters. If $\lambda$ is known then Bayesian updating is

$$p(\theta \mid y, \lambda) = \frac{f(y \mid \theta)\,\pi(\theta \mid \lambda)}{\int f(y \mid \theta)\,\pi(\theta \mid \lambda)\,d\theta}$$

If $\lambda$ is not known then updating depends upon a distribution $h(\lambda)$, the hyperprior:

$$p(\theta \mid y) = \frac{\int f(y \mid \theta)\,\pi(\theta \mid \lambda)\,h(\lambda)\,d\lambda}{\iint f(y \mid \theta)\,\pi(\theta \mid \lambda)\,h(\lambda)\,d\theta\,d\lambda}$$

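As a concrete instance of the first updating formula, here is a minimal numerical sketch (not from the lecture; the data values and hyperparameters are made up for illustration) that computes a posterior on a grid for a binomial sample with a Beta prior, then checks it against the known conjugate answer:

```python
import numpy as np
from scipy import stats

# Hypothetical data: y successes out of n trials.
y, n = 7, 40

# Illustrative Beta(2, 8) prior on theta.
a, b = 2.0, 8.0

# Grid approximation of p(theta | y) = f(y|theta) pi(theta) / integral(...)
theta = np.linspace(1e-6, 1 - 1e-6, 2000)
unnorm = stats.binom.pmf(y, n, theta) * stats.beta.pdf(theta, a, b)
posterior = unnorm / np.trapz(unnorm, theta)   # normalize by the integral

# Conjugacy check: the exact posterior is Beta(a + y, b + n - y).
exact = stats.beta.pdf(theta, a + y, b + n - y)
print(np.max(np.abs(posterior - exact)))       # should be near 0
```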
For example, think back to the case of infested tree nuts, in which now there is probability $p_k$ of any nut being infested in location $k$. Now suppose the $p_k$'s for $k = 1, 2$ are possibly jointly distributed with a prior looking "like a Beta" with density:

$$g(p_1, p_2) = \frac{\Gamma(a+b+c)}{\Gamma(a)\,\Gamma(b)\,\Gamma(c)}\; p_1^{\,a-1}\, p_2^{\,b-1}\, (1 - p_1 - p_2)^{\,c-1}$$
This is a Dirichlet distribution, and the parameters $a$, $b$ and $c$ are hyperparameters for the prior distribution of the $p_k$, which are themselves parameters for a binomial distribution of samples of nuts from the two locations. If the hyperparameters are themselves determined through some process depending upon distance between the locations, then this is amenable to hierarchical Bayesian modeling.
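To make the hierarchy concrete, here is a small sketch (with illustrative hyperparameter values, not from the lecture) that draws $(p_1, p_2)$ from this Dirichlet prior and then draws binomial counts of infested nuts at the two locations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameters for the Dirichlet prior g(p1, p2).
a, b, c = 2.0, 3.0, 10.0

# numpy's dirichlet returns (p1, p2, 1 - p1 - p2) for parameters (a, b, c),
# matching the density above.
p1, p2, _ = rng.dirichlet([a, b, c])

# Binomial samples of n nuts from each location, given the drawn p_k.
n = 50
infested = rng.binomial(n, [p1, p2])
print(p1, p2, infested)
```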
The hyperparameters $\lambda$ (or $a, b, c$ in the example) might specify how the parameters vary in space or time between observations. One possible approach is to estimate $\lambda$ by choosing it to maximize the marginal distribution of the data as a function of $\lambda$,

$$p(y \mid \lambda) = \int f(y \mid \theta)\,\pi(\theta \mid \lambda)\,d\theta$$
giving an estimate $\hat{\lambda}$ and an estimated posterior $p(\theta \mid y, \hat{\lambda})$.
This is called an empirical Bayes approach. For the example, we could estimate $a, b, c$ from multiple samples of nuts from the two locations, and plug in these estimates, but this isn't very "Bayes-like" since we ignore information about distance!
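A minimal sketch of the empirical Bayes idea, simplified to a single location with a Beta prior (so the marginal $p(y \mid \lambda)$ is Beta-Binomial; the counts below are made up): estimate the hyperparameters by maximizing the marginal likelihood numerically, then plug them back in.

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical counts of infested nuts in several samples of 50 nuts.
counts = np.array([4, 7, 5, 9, 6])
n = 50

def neg_log_marginal(log_ab):
    # Marginal of a binomial with a Beta(a, b) prior is Beta-Binomial(n, a, b);
    # optimize on the log scale to keep a and b positive.
    a, b = np.exp(log_ab)
    return -np.sum(stats.betabinom.logpmf(counts, n, a, b))

res = optimize.minimize(neg_log_marginal, x0=[0.0, 0.0])
a_hat, b_hat = np.exp(res.x)
print(a_hat, b_hat)   # plug-in hyperparameter estimates (lambda-hat)
```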
Sufficient statistics
Suppose we take a sample $X_1, \ldots, X_n$ from a distribution family $\{f(x, \theta)\}$ with a statistic $Y_1 = u_1(X_1, \ldots, X_n)$. Then $Y_1$ is a sufficient statistic for $\theta$ if and only if for any other statistics $Y_2 = u_2(X_1, \ldots, X_n), \ldots, Y_n = u_n(X_1, \ldots, X_n)$, the conditional pdf $h(y_2, \ldots, y_n \mid y_1)$ of $Y_2, \ldots, Y_n$ given $Y_1 = y_1$ does not depend upon $\theta$, no matter what the value of $y_1$ is. So given that we know $Y_1 = y_1$, it isn't possible to use any other statistic $Y_2$ to make any inference about $\theta$. $Y_1$ exhausts all the information about $\theta$ that is contained in the sample.
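A standard worked example (not on the original slide): for a Bernoulli($\theta$) sample $X_1, \ldots, X_n$ with $Y_1 = \sum_i X_i$, the conditional distribution of the sample given $Y_1 = y_1$ is

$$P[X_1 = x_1, \ldots, X_n = x_n \mid Y_1 = y_1] = \frac{\theta^{y_1}(1-\theta)^{n-y_1}}{\binom{n}{y_1}\theta^{y_1}(1-\theta)^{n-y_1}} = \binom{n}{y_1}^{-1},$$

which does not depend on $\theta$, so the number of successes $Y_1$ is sufficient for $\theta$.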
Normal Distribution example of Bayesian updating
Sample $Y$ from $N(\theta, \sigma^2)$ with $\sigma^2$ known, and assume a Normal prior for $\theta$ with mean $\mu$ and variance $\tau^2$. Then the posterior for $\theta$ given $Y = y$ is also Normal (a conjugate prior) with mean

$$\frac{\sigma^2}{\sigma^2 + \tau^2}\,\mu + \frac{\tau^2}{\sigma^2 + \tau^2}\,y$$

and variance (precision is 1/variance)

$$\frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}, \qquad \text{i.e.} \qquad \frac{1}{\text{variance}} = \frac{1}{\sigma^2} + \frac{1}{\tau^2}$$
Now take a sample of size $n$, $Y = (y_1, \ldots, y_n)$, from $N(\theta, \sigma^2)$ with $\sigma^2$ known, and assume a Normal prior for $\theta$ with mean $\mu$ and variance $\tau^2$. Then the posterior for $\theta$ given $Y$ is the same as the posterior for $\theta$ given the sample mean $\bar{y}$: it is Normal with mean

$$\frac{\sigma^2}{\sigma^2 + n\tau^2}\,\mu + \frac{n\tau^2}{\sigma^2 + n\tau^2}\,\bar{y}$$

and variance (precision is 1/variance)

$$\frac{\sigma^2 \tau^2}{\sigma^2 + n\tau^2}, \qquad \text{i.e.} \qquad \frac{1}{\text{variance}} = \frac{n}{\sigma^2} + \frac{1}{\tau^2}$$

As $\tau$ gets large, the posterior mean goes to $\bar{y}$ and the variance goes to $\sigma^2/n$: the posterior approaches the likelihood in the limit of a noninformative prior.
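As a quick check of these formulas (a sketch with made-up numbers), the conjugate posterior can be compared against a grid approximation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Made-up setup: n draws from N(theta, sigma^2), N(mu, tau^2) prior on theta.
sigma, mu, tau, n = 2.0, 0.0, 3.0, 10
y = rng.normal(1.5, sigma, size=n)
ybar = y.mean()

# Conjugate posterior: precision n/sigma^2 + 1/tau^2.
post_var = 1.0 / (n / sigma**2 + 1 / tau**2)
post_mean = post_var * (n * ybar / sigma**2 + mu / tau**2)

# Grid approximation: posterior proportional to likelihood * prior.
theta = np.linspace(-5, 5, 4001)
logpost = stats.norm.logpdf(y[:, None], theta, sigma).sum(0) \
          + stats.norm.logpdf(theta, mu, tau)
w = np.exp(logpost - logpost.max())
w /= np.trapz(w, theta)
print(post_mean, np.trapz(theta * w, theta))                  # should agree
print(post_var, np.trapz((theta - post_mean)**2 * w, theta))  # should agree
```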

Bayesian approaches and Markov processes
One of the most useful numerical techniques for
Bayesian analysis involves the use of Markov
Chains in an appropriate way to estimate the
complex integrals that arise. So we do a quick
summary of Markov chains, say something
about their ergodic properties, and note how one
can simulate them.
A stochastic process is a collection of random variables indexed by elements of a set (usually a time interval or generations over some time).
A stochastic process is well defined if we know
the joint distributions of any collection of random
variables in it.
A Markov process is a stochastic process for which the distribution of future process values, given the present value, is independent of the past values (the future is independent of the past, given the present):

$$P[X_{t_{n+1}} = x_{n+1} \mid X_{t_n} = x_n, \ldots, X_{t_1} = x_1] = P[X_{t_{n+1}} = x_{n+1} \mid X_{t_n} = x_n]$$

for any $t_1 < \cdots < t_n < t_{n+1}$ and $x_1, \ldots, x_n, x_{n+1}$.
This property makes it easy to calculate the joint
distributions of the process at any list of times by
knowing the initial distribution and the transition
probability functions which specify the probability
of transition to any value given the present
value.
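For instance (a standard identity, not written out on the slide), iterating the Markov property gives the joint distribution at any list of times directly from the initial distribution and the transition probabilities:

$$P[X_{t_1} = x_1, \ldots, X_{t_n} = x_n] = P[X_{t_1} = x_1] \prod_{k=2}^{n} P[X_{t_k} = x_k \mid X_{t_{k-1}} = x_{k-1}]$$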
A Markov chain is a Markov process for which the states take on values from some discrete set (e.g. in classic pop gen, it would be allele frequencies $0, 1/2N, \ldots, 1$). If the index set is discrete time, we describe transitions by a transition matrix

$$P_{i,j}(n) = P[X_{t_{n+1}} = j \mid X_{t_n} = i]$$

which gives the $i,j$ entry of the matrix at time $n$. If the matrix is the same at any time, the transition matrix is called stationary (note this does not mean the Markov process is stationary; it will generally not be unless the initial condition is its stationary distribution). For continuous-time Markov processes, the matrix is replaced by a probability density function, conditioned on the present.
Ergodicity is a property of some stochastic processes for which there are consistent, long-term behaviors, the simplest being a limiting distribution, meaning that the process eventually comes close to having a particular distribution:

$$P_{i,j}(n) = P[X_{t_{n+1}} = j \mid X_{t_n} = i] \to \pi_j \quad \text{as } n \to \infty$$
Ergodicity implies that for long-enough samples, time
averages and sample averages are equivalent.
A Markov chain with a stationary transition matrix which is aperiodic, irreducible and recurrent has a stationary distribution which is its limiting distribution (it is approached independent of the initial distribution). For the discrete-time, finite state space case, the limiting distribution is the left eigenvector of the transition matrix for the dominant eigenvalue (1).
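A small sketch of finding the limiting distribution this way (the 3-state transition matrix is an arbitrary example of mine):

```python
import numpy as np

# Arbitrary 3-state stationary transition matrix (rows sum to 1);
# it is irreducible and aperiodic, so a limiting distribution exists.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

# Left eigenvectors of P are right eigenvectors of P.T.
vals, vecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(vals - 1.0))           # dominant eigenvalue is 1
pi = np.real(vecs[:, k])
pi /= pi.sum()                              # normalize to a distribution

print(pi)
print(pi @ np.linalg.matrix_power(P, 100))  # agrees: pi is invariant
```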
Continuous-time Markov chains have close connections to the exponential distribution, since the distribution of time in any state is exponential. Let the rate of transition from state $i$ to state $j$ be $q_{ij}$; then a transition out of state $i$ takes place after an exponential amount of time with parameter

$$\sum_{j \neq i} q_{ij}$$

At the time of transition the chain moves from state $i$ to state $j$ with probability

$$p_{ij} = \frac{q_{ij}}{\sum_{k \neq i} q_{ik}}$$

The matrix $Q$ is the infinitesimal generator of the chain, where

$$q_{ii} = -\sum_{j \neq i} q_{ij}$$

so the rows sum to 0.
A discrete-time sample of a continuous-time MC can be viewed in two ways: as following the MC at fixed intervals, so in some intervals the chain doesn't change state, or by following the MC only at the times of observable events (the discrete event simulation view), when it changes state.
The discrete event view provides a method to simulate an MC, sometimes called the Gillespie algorithm for its application to chemical reaction transitions.
To simulate:
(i) Choose an initial state $i$ based upon some initial distribution, using a pseudo-random number generator.
(ii) Choose the transition event time by simulating an exponential distribution with rate parameter $-q_{ii}$.
(iii) Choose the next state to be state $j$ with probability

$$p_{ij} = \frac{q_{ij}}{\sum_{k \neq i} q_{ik}}$$

(iv) Go back to (ii) with the new state if the total time is less than the desired simulation stopping time.
(v) Otherwise stop, or if multiple trajectories are desired, go back to (i).
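A minimal sketch of these steps (the 3-state generator $Q$ and the function name are mine, for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary 3-state infinitesimal generator Q; rows sum to 0.
Q = np.array([[-1.0,  0.6,  0.4],
              [ 0.5, -0.8,  0.3],
              [ 0.2,  0.7, -0.9]])

def gillespie(Q, i0, t_stop):
    """Simulate one trajectory as (time, state) pairs up to t_stop."""
    i, t = i0, 0.0                  # step (i): given initial state
    path = [(t, i)]
    while True:
        # step (ii): exponential holding time with rate -q_ii
        t += rng.exponential(1.0 / -Q[i, i])
        if t >= t_stop:             # steps (iv)/(v): stopping rule
            break
        # step (iii): jump to j with probability q_ij / sum_{k != i} q_ik
        p = Q[i].copy()
        p[i] = 0.0
        p /= p.sum()
        i = rng.choice(len(p), p=p)
        path.append((t, i))
    return path

print(gillespie(Q, i0=0, t_stop=10.0))
```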