Chapter 5. Importance sampling
For Monte-Carlo integration, one often has a choice for the probability distribution. Suppose instead we write
$$I = E[g(X)] = \int_{-\infty}^{\infty} g(x)\,p(x)\,dx = \int_{-\infty}^{\infty} g(x)\,\frac{p(x)}{p^*(x)}\,p^*(x)\,dx = E^*\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right],$$
where $p^*(x)$ is some other distribution, called the biasing probability distribution, and $E^*$ denotes the expectation taken with respect to $p^*$. From this we can then construct the estimate
$$\hat I = \frac{1}{N}\sum_{j=1}^{N} g(X_j)\,\frac{p(X_j)}{p^*(X_j)}\,,$$
where now the random samples $X_j$ are drawn according to the distribution $p^*(x)$. The quantity $p(x)/p^*(x)$ is called the likelihood ratio. Note that $E^*[\hat I] = I$. The utility of this becomes clear when one looks at the variance,
$$\mathrm{Var}^*\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right] = \int_{-\infty}^{\infty}\left(g(x)\,\frac{p(x)}{p^*(x)} - I\right)^{\!2} p^*(x)\,dx\,.$$
From this it's easy to see that we get zero variance if we choose
$$p^*(x) = \frac{g(x)\,p(x)}{I}\,.$$
The only problem with this, of course, is that we must know the answer I in order to construct this distribution.
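To see why the variance vanishes with this choice (assuming $g(x) \ge 0$, so that $p^*$ is a legitimate probability density), note that the weighted integrand becomes constant:
$$g(x)\,\frac{p(x)}{p^*(x)} = g(x)\,p(x)\,\frac{I}{g(x)\,p(x)} = I
\qquad\Longrightarrow\qquad
\mathrm{Var}^*\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right] = \int_{-\infty}^{\infty} (I - I)^2\,p^*(x)\,dx = 0\,.$$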
In general, note we can also write
$$\mathrm{Var}^*\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right] = \int_{-\infty}^{\infty}\left(g(x)\,\frac{p(x)}{p^*(x)} - I\right)^{\!2} p^*(x)\,dx = \int_{-\infty}^{\infty} g^2(x)\,\frac{p(x)}{p^*(x)}\,p(x)\,dx - I^2\,,$$
and a similar result for the original variance (just put $p^*(x) = p(x)$). Subtracting the two, we have
$$\mathrm{Var}[g(X)] - \mathrm{Var}^*\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right] = \int_{-\infty}^{\infty} g^2(x)\left(1 - \frac{p(x)}{p^*(x)}\right) p(x)\,dx\,.$$
This shows that if we choose $p^*(x) > p(x)$ when $g^2(x)\,p(x)$ is large and $p^*(x) < p(x)$ when $g^2(x)\,p(x)$ is small, then the variance will be reduced. In this case the probability mass is redistributed in accordance with its relative importance as measured by the weight $g(x)\,p(x)$.
In the non-ideal case, we can still get estimates for the variances of I and $\hat I$,
$$\hat\sigma_I^2 = \frac{1}{N-1}\sum_{j=1}^{N}\left(g(X_j)\,\frac{p(X_j)}{p^*(X_j)} - \hat I\right)^{\!2}\,,$$
and
$$\hat\sigma_{\hat I}^2 = \frac{1}{N(N-1)}\sum_{j=1}^{N}\left(g(X_j)\,\frac{p(X_j)}{p^*(X_j)} - \hat I\right)^{\!2}\,.$$
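Before moving to the examples, here is a minimal Python sketch of these formulas: it computes $\hat I$ and the error estimate $\hat\sigma_{\hat I}$ from samples drawn from a biasing density. The particular integrand and densities in the demonstration (a Gaussian tail probability biased by a mean-shifted Gaussian) are illustrative choices, not part of the notes.

```python
import numpy as np

def importance_sample(g, p, p_star_pdf, p_star_sampler, n, rng=None):
    """Estimate I = E[g(X)] for X ~ p by sampling X_j ~ p* and weighting
    by the likelihood ratio p(X_j)/p*(X_j); also return sigma_hat for I_hat."""
    rng = np.random.default_rng() if rng is None else rng
    x = p_star_sampler(rng, n)                    # samples from the biasing density
    terms = g(x) * p(x) / p_star_pdf(x)           # g(X_j) times the likelihood ratio
    i_hat = terms.mean()
    sigma_hat = np.sqrt(np.sum((terms - i_hat) ** 2) / (n * (n - 1)))
    return i_hat, sigma_hat

# Hypothetical demonstration: I = P(X > 4) for X ~ N(0,1), biased with N(4,1).
g = lambda x: (x > 4.0).astype(float)
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
p_star = lambda x: np.exp(-(x - 4.0)**2 / 2) / np.sqrt(2 * np.pi)
sampler = lambda rng, n: rng.normal(4.0, 1.0, size=n)
print(importance_sample(g, p, p_star, sampler, 100_000))
```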
5.1 An exact multidimensional example
Suppose we want to estimate $P(Z_N \ge m)$, where
$$Z_N = \sum_{j=1}^{N} X_j\,,$$
where the $X_j$ are iid (independent, identically-distributed) random variables. Let $I(z) = 1$ if $z > m$, and 0 otherwise. Then we want
$$P(Z_N \ge m) = \int_{-\infty}^{\infty} I(z)\,p(z)\,dz = \int_{-\infty}^{\infty} I(\vec x)\,p(\vec x)\,d\vec x = \int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty} I(x_1 + \ldots + x_N)\,p(x_1)\,p(x_2)\cdots p(x_N)\,dx_1\,dx_2\cdots dx_N\,.$$
If N is big, the difficulty is that it can be hard to find all of the regions in the N-dimensional space that contribute to the integral.
With importance sampling, the above becomes
$$P(Z_N \ge m) = \int_{-\infty}^{\infty} I(\vec x)\,\frac{p(\vec x)}{p^*(\vec x)}\,p^*(\vec x)\,d\vec x\,,$$
and when we do Monte Carlo sampling, this becomes
$$\hat P = \frac{1}{K}\sum_{j=1}^{K} I(\vec X_j)\,\frac{p(\vec X_j)}{p^*(\vec X_j)}\,,$$
where K is the number of samples, with an estimated variance of
$$\hat\sigma_{\hat P}^2 = \frac{1}{K(K-1)}\sum_{j=1}^{K}\left(I(\vec X_j)\,\frac{p(\vec X_j)}{p^*(\vec X_j)} - \hat P\right)^{\!2}\,.$$
What makes this work is that we know how to compute $I(\vec X_j) = I\big(X_1^{(j)} + \ldots + X_N^{(j)}\big)$; in this last, the subscripts denote component and the superscripts the particular trial. Also, since the $X_j$ are independent, we have
$$\frac{p(\vec X_j)}{p^*(\vec X_j)} = \prod_{k=1}^{N}\frac{p\big(X_k^{(j)}\big)}{p^*\big(X_k^{(j)}\big)}\,,$$
i.e., we can compute the overall likelihood ratio as a product of individual likelihood ratios.
As an example, suppose the $X_j$ are Gaussian random variables with zero mean and variance 1. Then, of course, because the sum of N Gaussians is also Gaussian with mean zero and variance N, we know that
$$P = P(Z_N \ge m) = \frac{1}{\sqrt{2\pi N}}\int_{m}^{\infty} e^{-x^2/2N}\,dx = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{m}{\sqrt{2N}}\right)\,.$$
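As a quick numerical check of this formula (using Python's standard-library `math.erfc`; the values N = 10 and m = 15 anticipate the specific example used later in this section):

```python
from math import erfc, sqrt

N, m = 10, 15
P_exact = 0.5 * erfc(m / sqrt(2 * N))
print(P_exact)   # approximately 1.05e-06 for these values
```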
Even though we know the exact answer, it's still instructive to compute this probability with importance sampling. The key step is determining the biasing distribution $p^*(x)$. One simple choice, for which many answers can be obtained analytically, is a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Note that with this particular choice we are doing the biasing parametrically, i.e., we are choosing a distribution and adjusting its free parameters to do the biasing.
Thus, we choose
$$p^*(x_j) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x_j - \mu)^2/2\sigma^2}\,.$$
Of course, we get the same result for $E^*\!\left[I(\vec X)\,p(\vec X)/p^*(\vec X)\right] = E[\hat P] = P$ as for the unbiased case; that's the result that we're looking for. The interesting result is the variance,
$$\mathrm{Var}^*\!\left[I(\vec X)\,\frac{p(\vec X)}{p^*(\vec X)}\right] = \int_{-\infty}^{\infty}\left[I(\vec X)\,\frac{p(\vec X)}{p^*(\vec X)}\right]^{2} p^*(\vec X)\,d\vec X - P^2 = \int_{-\infty}^{\infty} I(\vec X)\,\frac{p(\vec X)}{p^*(\vec X)}\,p(\vec X)\,d\vec X - P^2\,.$$
Suppose we first try $\mu = 0$; the idea here is to increase the variance to spread the distribution out, and thus get more samples at larger values. Then in the integral we need
$$\frac{p(\vec X)}{p^*(\vec X)}\,p(\vec X) = \prod_{j=1}^{N}\frac{p^2(x_j)}{p^*(x_j)} = \left(\frac{\sigma}{\sqrt{2\pi}}\right)^{\!N}\prod_{j=1}^{N}\frac{e^{-x_j^2}}{e^{-x_j^2/2\sigma^2}} = \frac{\sigma^N}{(\sqrt{2\pi})^N}\prod_{j=1}^{N} e^{-(1 - 1/2\sigma^2)\,x_j^2} = \sigma^N\hat\sigma^N\prod_{j=1}^{N}\frac{1}{\sqrt{2\pi}\,\hat\sigma}\,e^{-x_j^2/2\hat\sigma^2}\,,$$
where $\hat\sigma = \sigma/\sqrt{2\sigma^2 - 1}$. We now have an integral that looks like we are computing the probability for a sum of Gaussians with variance $\hat\sigma^2$ to be bigger than m, with a correction factor of $\sigma^N\hat\sigma^N$. Thus, the result must be
$$\mathrm{Var}^*\!\left[I\,\frac{p}{p^*}\right] = \sigma^N\hat\sigma^N\,\frac{1}{2}\,\mathrm{erfc}\!\left(\frac{m}{\sqrt{2N\hat\sigma^2}}\right) - P^2\,.$$
In the above, let's assume that N is large but that $m/\sqrt{N}$ is O(1). The behavior as a function of $\sigma$ will therefore be dominated by the prefactor, $(\sigma\hat\sigma)^N$. We are interested in minimizing the variance, which means that we want $\sigma\hat\sigma$ to be minimal. Taking logarithmic derivatives of
$$\sigma\hat\sigma = \frac{\sigma^2}{\sqrt{2\sigma^2 - 1}}
\quad\Longrightarrow\quad
\frac{d}{d\sigma}\left[2\ln\sigma - \tfrac{1}{2}\ln(2\sigma^2 - 1)\right] = \frac{2}{\sigma} - \frac{1}{2}\,\frac{4\sigma}{2\sigma^2 - 1} = 0\,,$$
which gives $\sigma^2 = 2\sigma^2 - 1$, or $\sigma = 1$. Note that in this case $\mathrm{Var}^*[I\,(p/p^*)] \approx P$, which means that the coefficient of variation (standard deviation divided by the mean) is $O(1/\sqrt{P})$, which means that many, many samples will be required to determine the value with Monte-Carlo sampling when P is small. In this particular case, importance sampling is really of no benefit. This particular issue is known as the “dimensionality problem” of variance scaling.
On the other hand, suppose we set $\sigma = 1$ and take $\mu$ nonzero. Then we get instead
$$\frac{p(\vec X)}{p^*(\vec X)}\,p(\vec X) = \prod_{j=1}^{N}\frac{p^2(x_j)}{p^*(x_j)} = \frac{1}{(\sqrt{2\pi})^N}\prod_{j=1}^{N}\frac{e^{-x_j^2}}{e^{-(x_j - \mu)^2/2}} = e^{N\mu^2}\prod_{j=1}^{N}\frac{1}{\sqrt{2\pi}}\,e^{-(x_j + \mu)^2/2}\,.$$
Now we have an integral that looks like we are computing the probability for a sum of Gaussians each with mean $-\mu$ and variance 1 to be bigger than m, with a correction factor of $e^{N\mu^2}$. Thus, the result will be
$$e^{N\mu^2}\,\frac{1}{2}\,\mathrm{erfc}\!\left(\frac{m + N\mu}{\sqrt{2N}}\right) - P^2\,.$$
In this expression, if we assume $m/\sqrt{N}$ is large, we can use the asymptotic expansion
$$\mathrm{erfc}(z) \approx \frac{1}{\sqrt{\pi}\,z}\,e^{-z^2}\,.$$
The above expression is then approximately
$$e^{N\mu^2}\,\frac{1}{2}\,e^{-(m + N\mu)^2/2N}\,\frac{1}{\sqrt{\pi}}\,\frac{\sqrt{2N}}{m + N\mu} - P^2\,.$$
Taking the logarithmic derivative of the first term, differentiating with respect to $\mu$, and neglecting some small terms, we get
$$2N\mu - \frac{2N(m + N\mu)}{2N} - \frac{N}{m + N\mu} = 0
\quad\Longrightarrow\quad
N\mu - m - \frac{N}{m + N\mu} = 0
\quad\Longrightarrow\quad
(N\mu)^2 - m^2 = N
\quad\Longrightarrow\quad
\mu = \frac{1}{N}\sqrt{m^2 + N}\,.$$
If, in addition, $m^2 \gg N$ (i.e., P is small) this becomes $\mu \approx m/N$. In addition, in this case it is easy to check that $\mathrm{Var}^*[I\,(p/p^*)] = O(P^2)$, which means that it's possible to get fairly good results with a relatively small number of samples.
A simpler way to see this result is to ask what is the most probable way for the sum $X_1 + X_2 + \cdots + X_N$ to achieve a particular sum S. Maximizing the probability for a bunch of Gaussians is equivalent to minimizing $X_1^2 + X_2^2 + \ldots + X_N^2$. Here we also want the sum to achieve a particular value, so we want to
$$\text{minimize } \sum_j X_j^2 \quad\text{subject to the constraint}\quad \sum_j X_j = S\,.$$
This is a simple Lagrange multiplier problem with the solution $X_j = S/N$, i.e., the mean of each Gaussian should be shifted by the same amount.
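For completeness, a short sketch of that constrained minimization: introducing a multiplier $\lambda$ and setting the gradient to zero,
$$\frac{\partial}{\partial X_k}\left[\sum_j X_j^2 - \lambda\Big(\sum_j X_j - S\Big)\right] = 2X_k - \lambda = 0
\quad\Longrightarrow\quad
X_k = \frac{\lambda}{2}\ \text{for every } k
\quad\Longrightarrow\quad
X_k = \frac{S}{N}\,.$$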
This type of biasing, where one shifts the mean of the distribution, is known as mean translation. Generally speaking, this method tends to work well in practice. The difficulty, of course,
is figuring out a good shift of the mean.
As a specific example, take N = 10 and m = 15. The exact probability for the sum of 10 Gaussians to be larger than 15 is $1.05\times10^{-6}$. By using Gaussians with a mean shifted by $m/N = 1.5$, producing 10,000 trials of the sum generates the sample mean value $1.04\times10^{-6}$ with a sample standard deviation of $2.4\times10^{-6}$, and a standard error of the mean of $2.4\times10^{-8}$ (or, a coefficient of variation of 0.023).
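A minimal Python sketch of this experiment (the function and variable names are ours; the per-trial likelihood ratio $\prod_j p(x_j)/p^*(x_j) = e^{N\mu^2/2 - \mu Z_N}$ follows from the Gaussian densities above):

```python
import numpy as np

def shifted_mean_tail_prob(N=10, m=15, trials=10_000, seed=0):
    """IS estimate of P(X_1 + ... + X_N >= m) for standard Gaussians X_j,
    sampling instead from N(mu, 1) with mu = m/N (mean translation)."""
    rng = np.random.default_rng(seed)
    mu = m / N
    x = rng.normal(mu, 1.0, size=(trials, N))       # samples from the biased density
    z = x.sum(axis=1)
    weights = np.exp(N * mu**2 / 2 - mu * z)        # product of per-component ratios
    terms = (z >= m) * weights
    return terms.mean(), terms.std(ddof=1) / np.sqrt(trials)

print(shifted_mean_tail_prob())   # mean near 1e-6, standard error of order 1e-8
```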
Running this numerical experiment 10,000 times gives an overall mean of $1.05\times10^{-6}$ with a standard deviation of $2.4\times10^{-8}$, consistent with the above. The histogram of these results is shown in Fig. 5.1.
[Figure: histogram of the computed mean (×10⁻⁶) versus number of hits.]
Figure 5.1: Histogram of importance-sampled Monte-Carlo results. Each individual numerical result is the estimated probability from 10,000 trials that the sum of 10 standard Gaussians is larger than or equal to 15. The histogram shows the results of 10,000 such numerical experiments.
5.2 A coin flipping example
Consider the specific problem where we flip a coin N times and count the number of heads.
This problem is discrete, of course, and so the previous continuous theory really should be
modified. On the other hand, the notation for the continuous case is a lot simpler, and it
provides sufficient information to guide the importance sampling even in the discrete case.
In this case we let $X_j$ be the result of flipping the coin the j-th time, and we let
$$X_j = \begin{cases} 1 & \text{for heads}\,, \\ 0 & \text{for tails}\,. \end{cases}$$
Also, for a fair coin $P(X_j = 1) = 1/2$ and $P(X_j = 0) = 1/2$. In addition, suppose we are interested in the probability that the number of heads is greater than or equal to m, i.e., $P(Z_N \ge m)$. Since the number of heads follows a binomial distribution, we can compute this probability exactly,
$$P(Z_N \ge m) = \sum_{j=m}^{N}\binom{N}{j}\left(\frac{1}{2}\right)^{\!j}\left(\frac{1}{2}\right)^{\!N-j} = \frac{1}{2^N}\sum_{j=m}^{N}\binom{N}{j}\,.$$
As an example, for N = 100 and m = 85, the probability is $2.4\times10^{-13}$.
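This exact sum is easy to evaluate directly, for instance with Python's standard-library `math.comb` (a quick sketch; the variable names are ours):

```python
from math import comb

N, m = 100, 85
P_exact = sum(comb(N, j) for j in range(m, N + 1)) / 2**N
print(P_exact)   # about 2.4e-13, as quoted above
```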
Because this probability is so low, there is no way that we can simulate this with standard
Monte Carlo. We can, however, simulate it with importance sampling. The idea is to use an
unfair coin, i.e., one with the probability of a head being p. The IS estimator, for M trials and
a vector of N heads or tails $\vec X$, is
$$\hat I = \frac{1}{M}\sum_{k=1}^{M} I(\vec X_k)\,\frac{p(\vec X_k)}{p^*(\vec X_k)} = \frac{1}{M}\sum_{k=1}^{M} I\!\left(\sum_{j=1}^{N} X_k^{(j)}\right)\prod_{j=1}^{N}\frac{p\big(X_k^{(j)}\big)}{p^*\big(X_k^{(j)}\big)}\,.$$
Keeping track of the total number of heads is simple, of course. The other piece of information
we need is the likelihood ratio. This is also easy to calculate from the likelihood ratios for each
single step, since
$$\frac{p\big(X_k^{(j)}\big)}{p^*\big(X_k^{(j)}\big)} =
\begin{cases}
\dfrac{1}{2p} & \text{if } X_k^{(j)} = 1\,, \\[2ex]
\dfrac{1}{2(1-p)} & \text{if } X_k^{(j)} = 0\,.
\end{cases}$$
We just multiply the likelihood ratios for each individual step to get the overall likelihood ratio.
Note that if $p > \frac{1}{2}$, heads are more prevalent, which means that there are more events with a single-step likelihood ratio of $1/(2p) < 1$. Thus, we expect the overall likelihood ratio to be
smaller than 1, as well. (It can actually get quite small, as we will see.)
The one thing we haven't yet addressed for this example is the best choice for the biasing probability p. In this case we can actually calculate the variance as a function of p,
$$\mathrm{Var}^*\!\left[I(\vec X)\,\frac{p(\vec X)}{p^*(\vec X)}\right] = \sum_{j=m}^{N}\binom{N}{j}\left(\frac{1}{2}\right)^{\!N}\left[\left(\frac{1}{2p}\right)^{\!j}\left(\frac{1}{2(1-p)}\right)^{\!N-j}\right] - I^2\,.$$
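This sum is straightforward to evaluate numerically. A brief Python sketch (with our own helper names) that carries out the kind of scan over p summarized in Figures 5.2 and 5.3:

```python
from math import comb, sqrt

N, m = 100, 85
P = sum(comb(N, j) for j in range(m, N + 1)) / 2**N        # exact probability

def is_variance(p):
    """Analytic variance of one IS trial, I(X) p(X)/p*(X), for head-probability p."""
    second_moment = sum(comb(N, j) * 0.5**N
                        * (1 / (2 * p))**j * (1 / (2 * (1 - p)))**(N - j)
                        for j in range(m, N + 1))
    return second_moment - P**2

for p in (0.6, 0.75, 0.85, 0.95):
    var = is_variance(p)
    # per-trial standard deviation, and trials needed for a 10% standard error
    print(p, sqrt(var), 100 * var / P**2)
```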
Figure 5.2 shows the resulting standard deviation as a function of p. Note that the minimum (roughly $5.6\times10^{-13}$) occurs near p = 0.85, which is the value of p for which the expected
position is 85. This makes sense intuitively; if p is too small, there will be too few samples
that produce the required position. This will also happen if p is too large.
Assuming that the trial variance is the sample variance divided by the number of trials, from this we can also calculate the expected number of trials needed to produce a standard deviation that is 10% of the expected mean value. This number is
$$\frac{100\,\sigma^2}{I^2}\,,$$
and is shown in Figure 5.3. Note that the number of trials is extremely large unless we are close to the optimal value of p. The figure shows that near the minimum only a relatively small number of trials (less than 1,000) is needed in order to determine the probability quite accurately.
These importance-sampled simulations are relatively straightforward to do. One merely draws
random numbers from a uniform distribution and declares a head if the number is less than
or equal to p, otherwise one gets a tail. The overall likelihood ratio is just the product of
the individual likelihood ratios, of course. Figures 5.4 and 5.5 show the computed mean and
standard deviation for this particular importance-sampled Monte-Carlo simulation.
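A minimal Python sketch of that procedure (the function and variable names are ours), accumulating the per-flip likelihood ratios exactly as described:

```python
import numpy as np

def coin_is_estimate(p, N=100, m=85, trials=10_000, seed=0):
    """IS estimate of P(at least m heads in N fair flips), simulating a coin
    with head-probability p and reweighting by the overall likelihood ratio."""
    rng = np.random.default_rng(seed)
    n_heads = (rng.random((trials, N)) <= p).sum(axis=1)   # biased-coin flips
    # overall likelihood ratio (1/(2p))^heads * (1/(2(1-p)))^tails, via logs
    log_L = n_heads * np.log(1 / (2 * p)) + (N - n_heads) * np.log(1 / (2 * (1 - p)))
    terms = (n_heads >= m) * np.exp(log_L)
    return terms.mean(), terms.std(ddof=1) / np.sqrt(trials)

print(coin_is_estimate(0.85))   # should land in the vicinity of the exact 2.4e-13
```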
[Figure: log10(σ) versus p for 0.5 ≤ p ≤ 1.]
Figure 5.2: Standard deviation of the importance-sampled symmetric random walk as a function of the biasing probability p for N = 100 and m = 85.
[Figure: log10(N) versus p for 0.5 ≤ p ≤ 1.]
Figure 5.3: Expected number of trials needed to produce a trial standard deviation that is 10% of the mean as a function of p.
[Figure: P(x ≥ m) (×10⁻¹³) versus p for 0.65 ≤ p ≤ 1.]
Figure 5.4: Importance-sampled Monte-Carlo result for the probability that flipping a coin 100 times results in 85 or more heads, as a function of the biasing probability p. Green is the exact result, blue is the numerical result.
[Figure: Var(P) (log scale) versus p for 0.65 ≤ p ≤ 1.]
Figure 5.5: Importance-sampled Monte-Carlo result for the standard deviation obtained from the experiment of flipping a coin 100 times to get 85 or more heads, as a function of the biasing probability p. Green is the exact result, blue is the numerical result.
5.3 Multiple importance sampling and balance heuristics
In many practical cases no single choice of biasing distribution can efficiently capture all the
regions of sample space that give rise to the events of interest. In these cases, it is necessary
to use importance sampling with more than one biasing distribution. The simultaneous use of
different biasing methods is called multiple importance sampling. When using several biasing
distributions $p_j(\vec x)$, a difficulty arises about how to correctly weight the results coming from different distributions. One solution to this problem can be found by assigning a weight $w_j(\vec x)$ to each distribution and by rewriting the probability P as
$$P = \sum_{j=1}^{J} P_j = \sum_{j=1}^{J}\int w_j(\vec x)\,I(\vec x)\,L_j(\vec x)\,p_j(\vec x)\,d\vec x\,, \qquad (5.1)$$
where J is the number of different biasing distributions used and $L_j(\vec x) = p(\vec x)/p_j(\vec x)$ is the likelihood ratio for the j-th distribution. Note that the weights $w_j(\vec x)$ depend on the value of the random variables for each individual sample. From Eq. (5.1), a multiply-importance-sampled Monte Carlo estimator for P can now be written as
$$\hat P = \sum_{j=1}^{J}\hat P_j = \sum_{j=1}^{J}\frac{1}{M_j}\sum_{m=1}^{M_j} w_j(\vec x_{j,m})\,I(\vec x_{j,m})\,L_j(\vec x_{j,m})\,, \qquad (5.2)$$
where $M_j$ is the number of samples drawn from the j-th distribution $p_j(\vec x)$, and $\vec x_{j,m}$ is the m-th such sample.
Several ways exist to choose the weights $w_j(\vec x)$, the particulars of which we will discuss momentarily. Generally, however, the quantity $\hat P$ is an unbiased estimator for P (i.e., the expectation value of $\hat P$ is equal to P) for any choice of weights such that $\sum_{j=1}^{J} w_j(\vec x) = 1$
for all $\vec x$. Thus, each choice of weights corresponds to a different way of partitioning the total probability. The simplest possibility is just to set $w_j(\vec x) = 1/J$ for all $\vec x$, meaning that
each distribution is assigned an equal weight in all regions of sample space. This choice is not
advantageous, however, as we will see shortly.
If $\hat P$ is a multiply-importance-sampled Monte Carlo estimator defined according to Eq. (5.2), then, similarly to previous results, one can show that an unbiased estimator of its variance is
$$\hat\sigma^2_{\hat P} = \sum_{j=1}^{J}\frac{1}{M_j(M_j - 1)}\sum_{m=1}^{M_j}\left[w_j^2(\vec x_{j,m})\,L_j^2(\vec x_{j,m})\,I(\vec x_{j,m}) - \hat P_j^2\right]\,. \qquad (5.3)$$
Recursion relations can also be written so that $\hat\sigma^2_{\hat P}$ can be obtained without the need of storing all the individual samples until the end of the simulation:
$$\hat\sigma^2_{\hat P} = \sum_{j=1}^{J}\frac{1}{M_j(M_j - 1)}\,\hat S_{j,M_j}\,, \qquad (5.4)$$
with $\hat P = \sum_{j=1}^{J}\hat P_{j,M_j}$ and
$$\hat P_{j,m} = \frac{m-1}{m}\,\hat P_{j,m-1} + \frac{1}{m}\,w_j(\vec x_{j,m})\,L_j(\vec x_{j,m})\,I(\vec x_{j,m})\,, \qquad (5.5a)$$
$$\hat S_{j,m} = \hat S_{j,m-1} + \frac{m-1}{m}\left[w_j(\vec x_{j,m})\,L_j(\vec x_{j,m})\,I(\vec x_{j,m}) - \hat P_{j,m-1}\right]^2\,. \qquad (5.5b)$$
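A compact Python sketch of this running accumulation (one accumulator per biasing distribution; the class and method names are ours), updating $\hat P_{j,m}$ and $\hat S_{j,m}$ one sample at a time as in Eqs. (5.5):

```python
class MISAccumulator:
    """Running mean P_hat and sum of squared deviations S_hat for one biasing
    distribution; feed it the term w_j(x) * L_j(x) * I(x) for each sample."""
    def __init__(self):
        self.m = 0        # samples seen so far
        self.P = 0.0      # running P_hat_{j,m}
        self.S = 0.0      # running S_hat_{j,m}

    def add(self, term):
        self.m += 1
        delta = term - self.P
        self.S += (self.m - 1) / self.m * delta**2     # Eq. (5.5b)
        self.P += delta / self.m                       # Eq. (5.5a), rearranged

    def variance_of_mean(self):
        return self.S / (self.m * (self.m - 1))
```

Summing the accumulators then gives the combined results of Eq. (5.4): $\hat P = \sum_j \hat P_{j,M_j}$ and $\hat\sigma^2_{\hat P} = \sum_j \hat S_{j,M_j}/[M_j(M_j - 1)]$.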
When using multiple importance sampling, the choice of weights $w_j(\vec x)$ is almost as important as the choice of biasing distributions $p_j(\vec x)$. Different weighting functions result in different
values for the variance of the combined estimator. A poor choice of weights can result in a
large variance, thus partially negating the gains obtained by importance sampling. The best
weighting strategies are the ones that yield the smallest variance.
For example, consider the case where the weighting functions are constant over the whole
domain. In this case,
$$P = \sum_{j=1}^{J} w_j\int I(\vec x)\,L_j(\vec x)\,p_j(\vec x)\,d\vec x = \sum_{j=1}^{J} w_j\,E_j\!\left[I(\vec x)\,L_j(\vec x)\right]\,. \qquad (5.6)$$
That is, the estimator is simply a weighted combination of the estimators obtained by using each of the biasing techniques. Unfortunately, the variance of $\hat P$ is also a weighted sum of the individual variances, $\sigma^2_{\hat P} = \sum_{j=1}^{J} w_j^2\,\sigma_j^2$, and if any of the sampling techniques is bad in a given region, then $\hat P$ will also have a high variance.
[Figure: probability (log scale) versus X for −30 ≤ X ≤ 30.]
Figure 5.6: Multiple-importance sampled probability distribution (via histograms) for the sum-of-Gaussians example, showing results of individual biasing distributions.
A relatively simple and particularly useful choice of weights is the balance heuristic. In this
case, the weights $w_j(\vec x)$ are assigned according to
$$w_j(\vec x) = \frac{M_j\,p_j(\vec x)}{\sum_{j'=1}^{J} M_{j'}\,p_{j'}(\vec x)}\,. \qquad (5.7)$$
The quantity $q_j(\vec x) = M_j\,p_j(\vec x)$ is proportional to the expected number of hits from the j-th distribution. Thus, the weight associated with a sample $\vec x$ under the balance heuristic is the likelihood of realizing that sample with the j-th distribution, relative to the total likelihood of realizing that same sample with all of the distributions. Thus, Eq. (5.7) weights each distribution $p_j(\vec x)$ most heavily in those regions of sample space where $p_j(\vec x)$ is largest. (Eq. (5.7) can also be written in terms of likelihood ratios, a form which is particularly convenient for use in Eq. (5.2).) The balance heuristic has been mathematically proven to be asymptotically close to optimal as the number of realizations becomes large.
Eric Veach, Robust Monte Carlo Methods for Light Transport Simulation, Ph.D. thesis, Stanford University (1997).
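A small Python sketch of the balance-heuristic weights of Eq. (5.7), here for unit-variance Gaussian biasing distributions like those in the mean-translation example (the particular means and sample counts are illustrative only):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma=1.0):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def balance_weights(x, mus, Ms):
    """Balance-heuristic weights w_j(x) = M_j p_j(x) / sum_k M_k p_k(x)."""
    q = np.array([M * gaussian_pdf(x, mu) for M, mu in zip(Ms, mus)])
    return q / q.sum(axis=0)

# Example: two mean-translated biasing distributions with equal sample budgets.
x = np.linspace(-5.0, 10.0, 7)
w = balance_weights(x, mus=[0.0, 5.0], Ms=[1000, 1000])
print(w.sum(axis=0))   # each column sums to 1, as the weights must
```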
[Figure: probability (log scale) versus X for −20 ≤ X ≤ 20, N = 10.]
Figure 5.7: Multiple-importance sampled probability distribution for the sum-of-Gaussians example, showing individual biasing distributions weighted, and then combined, with the balance heuristic.