Chapter 5. Importance sampling

For Monte-Carlo integration, one often has a choice for the probability distribution. Suppose instead we write

$$ I = E[g(X)] = \int_{-\infty}^{\infty} g(x)\,p(x)\,dx = \int_{-\infty}^{\infty} g(x)\,\frac{p(x)}{p^*(x)}\,p^*(x)\,dx = E^*\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right], $$

where $p^*(x)$ is called the biasing probability distribution. From this we can then construct the estimate

$$ \hat{I} = \frac{1}{N}\sum_{j=1}^{N} g(X_j)\,\frac{p(X_j)}{p^*(X_j)}, $$

where now the random samples $X_j$ are drawn according to the distribution $p^*(x)$. The quantity $p(x)/p^*(x)$ is called the likelihood ratio. Note that $E^*[\hat{I}] = I$.

The utility of this comes home when one looks at the variance,

$$ \mathrm{Var}^*\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right] = \int_{-\infty}^{\infty} \left[g(x)\,\frac{p(x)}{p^*(x)} - I\right]^2 p^*(x)\,dx. $$

From this it's easy to see that we get zero variance if we choose

$$ p^*(x) = \frac{g(x)\,p(x)}{I}. $$

The only problem with this, of course, is that we must know the answer $I$ in order to construct this distribution. In general, note we can also write

$$ \mathrm{Var}^*\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right] = \int_{-\infty}^{\infty} \left[g(x)\,\frac{p(x)}{p^*(x)}\right]^2 p^*(x)\,dx - I^2 = \int_{-\infty}^{\infty} g^2(x)\,\frac{p(x)}{p^*(x)}\,p(x)\,dx - I^2, $$

and a similar result for the original variance (just put $p^*(x) = p(x)$). Subtracting the two, we have

$$ \mathrm{Var}[g(X)] - \mathrm{Var}^*\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right] = \int_{-\infty}^{\infty} g^2(x)\left[1 - \frac{p(x)}{p^*(x)}\right] p(x)\,dx. $$

This shows that if we choose $p^*(x) > p(x)$ when $g^2(x)p(x)$ is large and $p^*(x) < p(x)$ when $g^2(x)p(x)$ is small, then the variance will be reduced. In this case the probability mass is redistributed in accordance with its relative importance as measured by the weight $g(x)p(x)$.

In the non-ideal case, we can still get estimates for the variances of $I$ and $\hat{I}$,

$$ \hat{\sigma}_I^2 = \frac{1}{N-1}\sum_{j=1}^{N}\left[g(X_j)\,\frac{p(X_j)}{p^*(X_j)} - \hat{I}\right]^2, \qquad \hat{\sigma}_{\hat{I}}^2 = \frac{1}{N(N-1)}\sum_{j=1}^{N}\left[g(X_j)\,\frac{p(X_j)}{p^*(X_j)} - \hat{I}\right]^2. $$

5.1 An exact multidimensional example

Suppose we want to estimate $P(Z_N \ge m)$, where

$$ Z_N = \sum_{j=1}^{N} X_j, $$

and the $X_j$ are iid (independent, identically-distributed) random variables. Let $I(z) = 1$ if $z \ge m$, and $0$ otherwise. Then we want

$$ P(Z_N \ge m) = \int_{-\infty}^{\infty} I(z)\,p(z)\,dz = \int I(\vec{x})\,p(\vec{x})\,d\vec{x} = \int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty} I(x_1+\dots+x_N)\,p(x_1)p(x_2)\cdots p(x_N)\,dx_1\,dx_2\cdots dx_N. $$

If $N$ is big, the difficulty is that it can be hard to find all of the regions in the $N$-dimensional space that contribute to the integral. With importance sampling, the above becomes

$$ P(Z_N \ge m) = \int_{-\infty}^{\infty} I(\vec{x})\,\frac{p(\vec{x})}{p^*(\vec{x})}\,p^*(\vec{x})\,d\vec{x}, $$

and when we do Monte Carlo sampling, this becomes

$$ \hat{P} = \frac{1}{K}\sum_{j=1}^{K} I(\vec{X}_j)\,\frac{p(\vec{X}_j)}{p^*(\vec{X}_j)}, $$

where $K$ is the number of samples, with an estimated variance of

$$ \hat{\sigma}_{\hat{P}}^2 = \frac{1}{K(K-1)}\sum_{j=1}^{K}\left[I(\vec{X}_j)\,\frac{p(\vec{X}_j)}{p^*(\vec{X}_j)} - \hat{P}\right]^2. $$

What makes this work is that we know how to compute

$$ I(\vec{X}_j) = I\!\left(X_1^{(j)} + \dots + X_N^{(j)}\right); $$

in this last, the subscripts denote the component and the superscripts the particular trial. Also, since the $X_j$ are independent, we have

$$ \frac{p(\vec{X}_j)}{p^*(\vec{X}_j)} = \prod_{k=1}^{N} \frac{p(X_k^{(j)})}{p^*(X_k^{(j)})}, $$

i.e., we can compute the overall likelihood ratio as a product of individual likelihood ratios.

As an example, suppose the $X_j$ are Gaussian random variables with zero mean and variance 1. Then, of course, because the sum of $N$ Gaussians is also Gaussian with mean zero and variance $N$, we know that

$$ P = P(Z_N \ge m) = \frac{1}{\sqrt{2\pi N}}\int_{m}^{\infty} e^{-x^2/2N}\,dx = \tfrac{1}{2}\,\mathrm{erfc}\!\left(m/\sqrt{2N}\right). $$

Even though we know the exact answer, it's still instructive to compute this probability with importance sampling. The key step is determining the biasing distribution $p^*(x)$. One simple choice, for which many answers can be obtained analytically, is a Gaussian distribution with mean $\mu$ and variance $\sigma^2$.
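To make the recipe concrete, the following minimal Python sketch (not part of the original notes; the function name `estimate_tail_probability` and its arguments are illustrative) assembles the pieces just described for the sum-of-Gaussians problem: it draws samples from the parametric Gaussian biasing distribution with mean $\mu$ and standard deviation $\sigma$, forms the indicator, multiplies the single-sample likelihood ratios, and returns the estimate $\hat{P}$ together with its estimated standard error.

```python
import numpy as np

def estimate_tail_probability(N, m, mu, sigma, K, rng=None):
    """Importance-sampled estimate of P(Z_N >= m) for Z_N = X_1 + ... + X_N,
    X_j iid standard normal, using a N(mu, sigma^2) biasing distribution."""
    rng = np.random.default_rng() if rng is None else rng

    # Draw K trials of N samples each from the biasing distribution p*(x).
    x = rng.normal(loc=mu, scale=sigma, size=(K, N))

    # Indicator I(x_1 + ... + x_N >= m) for each trial.
    indicator = (x.sum(axis=1) >= m).astype(float)

    # Overall likelihood ratio = product of single-sample ratios p(x)/p*(x),
    # accumulated in log space for numerical safety.
    log_p      = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
    log_p_star = -0.5 * ((x - mu) / sigma)**2 - 0.5 * np.log(2 * np.pi * sigma**2)
    L = np.exp((log_p - log_p_star).sum(axis=1))

    weights = indicator * L
    P_hat = weights.mean()
    # Estimated standard error of the mean (square root of sigma_hat_Phat^2).
    std_err = weights.std(ddof=1) / np.sqrt(K)
    return P_hat, std_err

# Example: N = 10 standard Gaussians, threshold m = 15, mean-shifted bias.
P_hat, se = estimate_tail_probability(N=10, m=15, mu=1.5, sigma=1.0, K=10_000)
print(P_hat, se)
```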
Note that with this particular choice we are doing the biasing parametrically, i.e., we are choosing a distribution and adjusting its free parameters to do the biasing. Thus, we choose

$$ p^*(x_j) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x_j-\mu)^2/2\sigma^2}. $$

Of course, we get the same result for $E^*[I(\vec{X})\,p(\vec{X})/p^*(\vec{X})] = E^*[\hat{P}] = P$ as for the unbiased case; that's the result that we're looking for. The interesting result is the variance,

$$ \mathrm{Var}^*\!\left[I(\vec{X})\,\frac{p(\vec{X})}{p^*(\vec{X})}\right] = \int_{-\infty}^{\infty}\left[I(\vec{X})\,\frac{p(\vec{X})}{p^*(\vec{X})}\right]^2 p^*(\vec{X})\,d\vec{X} - P^2 = \int_{-\infty}^{\infty} I(\vec{X})\,\frac{p(\vec{X})}{p^*(\vec{X})}\,p(\vec{X})\,d\vec{X} - P^2. $$

Suppose we first try $\mu = 0$, $\sigma \ne 1$; the idea here is to increase the variance to spread the distribution out, and thus get more samples at larger values. Then in the integral we need

$$ \frac{p(\vec{X})}{p^*(\vec{X})}\,p(\vec{X}) = \prod_{j=1}^{N}\frac{p^2(x_j)}{p^*(x_j)} = \prod_{j=1}^{N}\frac{e^{-x_j^2}/2\pi}{e^{-x_j^2/2\sigma^2}/\sqrt{2\pi\sigma^2}} = \frac{\sigma^N}{(\sqrt{2\pi})^N}\prod_{j=1}^{N} e^{-(1-1/2\sigma^2)x_j^2} = \hat{\sigma}^N\sigma^N\prod_{j=1}^{N}\frac{1}{\sqrt{2\pi\hat{\sigma}^2}}\,e^{-x_j^2/2\hat{\sigma}^2}, $$

where $\hat{\sigma} = \sigma/\sqrt{2\sigma^2-1}$. We now have an integral that looks like we are computing the probability for a sum of Gaussians with variance $\hat{\sigma}^2$ to be bigger than $m$, with a correction factor of $\sigma^N\hat{\sigma}^N$. Thus, the result must be

$$ \mathrm{Var}^*\!\left[I\,\frac{p}{p^*}\right] = \sigma^N\hat{\sigma}^N\,\tfrac{1}{2}\,\mathrm{erfc}\!\left(m/\sqrt{2N\hat{\sigma}^2}\right) - P^2. $$

In the above, let's assume that $N$ is large but that $m/\sqrt{N}$ is $O(1)$. The behavior as a function of $\sigma$ will therefore be dominated by the prefactor, $(\sigma\hat{\sigma})^N$. We are interested in minimizing the variance, which means that we want $\sigma\hat{\sigma}$ to be minimal. Taking the logarithmic derivative of

$$ \sigma\hat{\sigma} = \frac{\sigma^2}{\sqrt{2\sigma^2-1}} \;\Rightarrow\; \frac{d}{d\sigma}\left[2\ln\sigma - \tfrac{1}{2}\ln(2\sigma^2-1)\right] = \frac{2}{\sigma} - \frac{2\sigma}{2\sigma^2-1} = 0, $$

which gives $\sigma^2 = 2\sigma^2 - 1$, or $\sigma = 1$. Note that in this case $\mathrm{Var}^*[I(p/p^*)] \approx P$, which means that the coefficient of variation (standard deviation divided by the mean) is $O(1/\sqrt{P})$, which means that many, many samples will be required to determine the value with Monte-Carlo sampling when $P$ is small. In this particular case, importance sampling is really of no benefit. This particular issue is known as the "dimensionality problem" of variance scaling.

On the other hand, suppose we set $\sigma = 1$ and take $\mu$ nonzero. Then we get instead

$$ \frac{p(\vec{X})}{p^*(\vec{X})}\,p(\vec{X}) = \prod_{j=1}^{N}\frac{p^2(x_j)}{p^*(x_j)} = \frac{1}{(\sqrt{2\pi})^N}\prod_{j=1}^{N}\frac{e^{-x_j^2}}{e^{-(x_j-\mu)^2/2}} = \frac{1}{(\sqrt{2\pi})^N}\prod_{j=1}^{N} e^{-(x_j+\mu)^2/2+\mu^2} = e^{N\mu^2}\prod_{j=1}^{N}\frac{1}{\sqrt{2\pi}}\,e^{-(x_j+\mu)^2/2}. $$

Now we have an integral that looks like we are computing the probability for a sum of Gaussians each with mean $-\mu$ and variance 1 to be bigger than $m$, with a correction factor of $e^{N\mu^2}$. Thus, the result will be

$$ \mathrm{Var}^*\!\left[I\,\frac{p}{p^*}\right] = e^{N\mu^2}\,\tfrac{1}{2}\,\mathrm{erfc}\!\left((m+N\mu)/\sqrt{2N}\right) - P^2. $$

In this expression, if we assume $m/\sqrt{N}$ is large, we can use the asymptotic expansion

$$ \mathrm{erfc}(z) \approx \frac{1}{z\sqrt{\pi}}\,e^{-z^2}. $$

The above expression is then approximately

$$ e^{N\mu^2}\,\frac{1}{2}\,e^{-(m+N\mu)^2/2N}\,\sqrt{\frac{2N}{\pi}}\,\frac{1}{m+N\mu} - P^2. $$

Taking the logarithmic derivative of the first term, differentiating with respect to $\mu$, and neglecting some small terms, we get

$$ 2N\mu - \frac{2N(m+N\mu)}{2N} - \frac{N}{m+N\mu} = 0 \;\Rightarrow\; N\mu - m - \frac{N}{m+N\mu} = 0 \;\Rightarrow\; (N\mu)^2 - m^2 = N \;\Rightarrow\; \mu = \frac{1}{N}\sqrt{m^2+N}. $$

If, in addition, $m^2 \gg N$ (i.e., $P$ is small) this becomes $\mu \approx m/N$. In addition, in this case it is easy to check that $\mathrm{Var}^*[I(p/p^*)] = O(P^2)$, which means that it's possible to get fairly good results with a relatively small number of samples.
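The contrast between the two biasing strategies can be checked numerically from the closed-form variances derived above. The short sketch below (illustrative, not from the notes) evaluates both expressions for $N = 10$, $m = 15$ and prints the resulting per-sample coefficient of variation for a representative variance-inflated bias and for the mean-translated bias $\mu = m/N$.

```python
import math

N, m = 10, 15
P = 0.5 * math.erfc(m / math.sqrt(2 * N))   # exact answer, about 1.05e-6

def var_sigma_bias(sigma):
    """Var*[I p/p*] for the zero-mean, variance-inflated bias N(0, sigma^2)."""
    s_hat = sigma / math.sqrt(2 * sigma**2 - 1)
    return (sigma * s_hat)**N * 0.5 * math.erfc(m / math.sqrt(2 * N * s_hat**2)) - P**2

def var_mean_shift(mu):
    """Var*[I p/p*] for the mean-translated bias N(mu, 1)."""
    return math.exp(N * mu**2) * 0.5 * math.erfc((m + N * mu) / math.sqrt(2 * N)) - P**2

for label, var in [("sigma = 2 (inflation)", var_sigma_bias(2.0)),
                   ("mu = m/N = 1.5 (translation)", var_mean_shift(m / N))]:
    cov = math.sqrt(var) / P   # per-sample coefficient of variation
    print(f"{label}: std/P = {cov:.3g}")
# The mean translation gives an O(1) coefficient of variation (roughly 2.3 here),
# while the variance-inflated bias leaves it orders of magnitude larger.
```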
A simpler way to see this result is to ask what is the most probable way for the sum $X_1 + X_2 + \dots + X_N$ to achieve a particular sum $S$. Maximizing the probability for a bunch of Gaussians is equivalent to minimizing $X_1^2 + X_2^2 + \dots + X_N^2$. Here we also want the sum to achieve a particular value, so we want to

$$ \text{minimize } \sum_j X_j^2 \quad\text{subject to the constraint}\quad \sum_j X_j = S. $$

This is a simple Lagrange multiplier problem with the solution $X_j = S/N$, i.e., the mean of each Gaussian should be shifted by the same amount. This type of biasing, where one shifts the mean of the distribution, is known as mean translation. Generally speaking, this method tends to work well in practice. The difficulty, of course, is figuring out a good shift of the mean.

As a specific example, take $N = 10$ and $m = 15$. The exact probability for the sum of 10 Gaussians to be larger than 15 is $1.05\times10^{-6}$. Using Gaussians with a mean shifted by $m/N = 1.5$, producing 10,000 trials of the sum generates the sample mean value $1.04\times10^{-6}$ with a sample standard deviation of $2.4\times10^{-6}$, and a standard error of the mean of $2.4\times10^{-8}$ (or, a coefficient of variation of 0.023). Running this numerical experiment 10,000 times gives an overall mean of $1.05\times10^{-6}$ with a standard deviation of $2.4\times10^{-8}$, consistent with the above. The histogram of these results is shown in Fig. 5.1.

Figure 5.1: Histogram of importance-sampled Monte-Carlo results. Each individual numerical result is the estimated probability from 10,000 trials that the sum of 10 standard Gaussians is larger than or equal to 15. The histogram shows the results of 10,000 such numerical experiments.

5.2 A coin flipping example

Consider the specific problem where we flip a coin $N$ times and count the number of heads. This problem is discrete, of course, and so the previous continuous theory really should be modified. On the other hand, the notation for the continuous case is a lot simpler, and it provides sufficient information to guide the importance sampling even in the discrete case. In this case we let $X_j$ be the result of flipping the coin the $j$th time, and we let

$$ X_j = \begin{cases} 1 & \text{for heads}, \\ 0 & \text{for tails}. \end{cases} $$

Also, for a fair coin $P(X_j = 1) = 1/2$ and $P(X_j = 0) = 1/2$. In addition, suppose we are interested in the probability that the number of heads is greater than or equal to $m$, i.e., $P(Z_N \ge m)$. Since the number of heads follows a binomial distribution, we can compute this probability exactly,

$$ P(Z_N \ge m) = \sum_{j=m}^{N}\binom{N}{j}\left(\frac{1}{2}\right)^{j}\left(\frac{1}{2}\right)^{N-j} = \frac{1}{2^N}\sum_{j=m}^{N}\binom{N}{j}. $$

As an example, for $N = 100$ and $m = 85$, the probability is $2.4\times10^{-13}$. Because this probability is so low, there is no way that we can simulate this with standard Monte Carlo. We can, however, simulate it with importance sampling. The idea is to use an unfair coin, i.e., one with the probability of a head being $p$. The IS estimator, for $M$ trials and a vector of $N$ heads or tails $\vec{X}$, is

$$ \hat{I} = \frac{1}{M}\sum_{k=1}^{M} I(\vec{X}_k)\,\frac{p(\vec{X}_k)}{p^*(\vec{X}_k)} = \frac{1}{M}\sum_{k=1}^{M} I\!\left(\sum_{j=1}^{N} X_k^{(j)}\right)\prod_{j=1}^{N}\frac{p(X_k^{(j)})}{p^*(X_k^{(j)})}. $$

Keeping track of the total number of heads is simple, of course. The other piece of information we need is the likelihood ratio. This is also easy to calculate from the likelihood ratios for each single step, since

$$ \frac{p(X_k^{(j)})}{p^*(X_k^{(j)})} = \begin{cases} \dfrac{1}{2p} & \text{if } X_k^{(j)} = 1, \\[2mm] \dfrac{1}{2(1-p)} & \text{if } X_k^{(j)} = 0. \end{cases} $$

We just multiply the likelihood ratios for each individual step to get the overall likelihood ratio. Note that if $p > \tfrac{1}{2}$, heads are more prevalent, which means that there are more events with a single-step likelihood ratio of $1/(2p) < 1$. Thus, we expect the overall likelihood ratio to be smaller than 1 as well. (It can actually get quite small, as we will see.)
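A minimal simulation of this procedure might look as follows (a sketch, not code from the notes; the helper name `coin_flip_is` and its arguments are assumptions). It draws each flip from the biased coin, counts heads, accumulates the product of single-step likelihood ratios in log space, and averages exactly as in the estimator above.

```python
import numpy as np

def coin_flip_is(N, m, p, M, rng=None):
    """Importance-sampled estimate of P(at least m heads in N fair flips),
    sampling from a biased coin with P(head) = p."""
    rng = np.random.default_rng() if rng is None else rng

    # Each flip is a head if a uniform draw is <= p (the biased coin).
    flips = rng.random(size=(M, N)) <= p
    heads = flips.sum(axis=1)

    # Overall likelihood ratio: a factor 1/(2p) per head, 1/(2(1-p)) per tail.
    log_L = heads * np.log(0.5 / p) + (N - heads) * np.log(0.5 / (1.0 - p))
    weights = (heads >= m) * np.exp(log_L)

    P_hat = weights.mean()
    std_err = weights.std(ddof=1) / np.sqrt(M)
    return P_hat, std_err

# N = 100 flips, m = 85 heads, biased coin with p = 0.85, M = 10,000 trials.
print(coin_flip_is(N=100, m=85, p=0.85, M=10_000))
```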
The one thing we haven't yet addressed for this example is the best choice for the biasing probability $p$. In this case we can actually calculate the variance as a function of $p$,

$$ \sigma_I^2 = \mathrm{Var}^*\!\left[I(\vec{X})\,\frac{p(\vec{X})}{p^*(\vec{X})}\right] = \sum_{j=m}^{N}\binom{N}{j}\left(\frac{1}{2}\right)^{N}\left(\frac{1}{2p}\right)^{j}\left(\frac{1}{2(1-p)}\right)^{N-j} - I^2. $$

Figure 5.2 shows the resulting standard deviation as a function of $p$. Note that the minimum (roughly $5.6\times10^{-13}$) occurs near $p = 0.85$, which is the value of $p$ for which the expected number of heads is 85. This makes sense intuitively; if $p$ is too small, there will be too few samples that produce the required number of heads. This will also happen if $p$ is too large. Assuming that the trial variance is the sample variance divided by the number of trials, from this we can also calculate the expected number of trials needed to produce a standard deviation that is 10% of the expected mean value. This number is

$$ \frac{\sigma_I^2}{(0.1\,I)^2} = \frac{100\,\sigma_I^2}{I^2}, $$

and is shown in Figure 5.3. Note that the number of trials is extremely large unless we are close to the optimal value of $p$. The figure shows that near the minimum only a relatively small number of trials (less than 1,000) is needed in order to determine the probability quite accurately.

These importance-sampled simulations are relatively straightforward to do. One merely draws random numbers from a uniform distribution and declares a head if the number is less than or equal to $p$; otherwise one gets a tail. The overall likelihood ratio is just the product of the individual likelihood ratios, of course. Figures 5.4 and 5.5 show the computed mean and standard deviation for this particular importance-sampled Monte-Carlo simulation.

Figure 5.2: Standard deviation of the importance-sampled symmetric random walk as a function of the biased probability $p$ for $N = 100$ and $m = 85$.

Figure 5.3: Expected number of trials needed to produce a trial standard deviation that is 10% of the mean, as a function of $p$.

Figure 5.4: Importance-sampled Monte-Carlo result for the probability that flipping a coin 100 times results in 85 or more heads, as a function of the biasing probability $p$. Green is the exact result, blue is the numerical result.

Figure 5.5: Importance-sampled Monte-Carlo result for the standard deviation obtained from the experiment of flipping a coin 100 times to get 85 or more heads, as a function of the biasing probability $p$. Green is the exact result, blue is the numerical result.
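The curves in Figures 5.2 and 5.3 can be regenerated directly from the closed-form variance above; here is a brief sketch (illustrative names, not the code used to produce the figures).

```python
import math

N, m = 100, 85
P_exact = sum(math.comb(N, j) for j in range(m, N + 1)) / 2**N   # ~2.4e-13

def sample_variance(p):
    """Analytic per-sample variance of the IS estimator as a function of p."""
    second_moment = sum(
        math.comb(N, j) * 0.5**N * (0.5 / p)**j * (0.5 / (1 - p))**(N - j)
        for j in range(m, N + 1)
    )
    return second_moment - P_exact**2

for p in (0.5, 0.7, 0.85, 0.95):
    var = sample_variance(p)
    # Trials needed for a standard error equal to 10% of the mean:
    trials = var / (0.1 * P_exact)**2
    print(f"p = {p:.2f}: std = {math.sqrt(var):.2e}, trials for 10% error = {trials:.2e}")
```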
5.3 Multiple importance sampling and balance heuristics

In many practical cases no single choice of biasing distribution can efficiently capture all the regions of sample space that give rise to the events of interest. In these cases, it is necessary to use importance sampling with more than one biasing distribution. The simultaneous use of different biasing methods is called multiple importance sampling. When using several biasing distributions $p_j(\vec{x})$, a difficulty arises about how to correctly weight the results coming from different distributions. One solution to this problem can be found by assigning a weight $w_j(\vec{x})$ to each distribution and by rewriting the probability $P$ as

$$ P = \sum_{j=1}^{J} P_j = \sum_{j=1}^{J}\int w_j(\vec{x})\,I(\vec{x})\,L_j(\vec{x})\,p_j(\vec{x})\,d\vec{x}, \qquad (5.1) $$

where $J$ is the number of different biasing distributions used and $L_j(\vec{x}) = p(\vec{x})/p_j(\vec{x})$ is the likelihood ratio for the $j$-th distribution. Note that the weights $w_j(\vec{x})$ depend on the value of the random variables for each individual sample. From Eq. (5.1), a multiply-importance-sampled Monte Carlo estimator for $P$ can now be written as

$$ \hat{P} = \sum_{j=1}^{J}\hat{P}_j = \sum_{j=1}^{J}\frac{1}{M_j}\sum_{m=1}^{M_j} w_j(\vec{x}_{j,m})\,I(\vec{x}_{j,m})\,L_j(\vec{x}_{j,m}), \qquad (5.2) $$

where $M_j$ is the number of samples drawn from the $j$-th distribution $p_j(\vec{x})$, and $\vec{x}_{j,m}$ is the $m$-th such sample. Several ways exist to choose the weights $w_j(\vec{x})$, the particulars of which we will discuss momentarily. Generally, however, the quantity $\hat{P}$ is an unbiased estimator for $P$ (i.e., the expectation value of $\hat{P}$ is equal to $P$) for any choice of weights such that $\sum_{j=1}^{J} w_j(\vec{x}) = 1$ for all $\vec{x}$. Thus, each choice of weights corresponds to a different way of partitioning the total probability. The simplest possibility is just to set $w_j(\vec{x}) = 1/J$ for all $\vec{x}$, meaning that each distribution is assigned an equal weight in all regions of sample space. This choice is not advantageous, however, as we will see shortly.

If $\hat{P}$ is a multiply-importance-sampled Monte Carlo estimator defined according to Eq. (5.2), then, similarly to previous results, one can show that an unbiased estimator of its variance is

$$ \hat{\sigma}_{\hat{P}}^2 = \sum_{j=1}^{J}\frac{1}{M_j(M_j-1)}\sum_{m=1}^{M_j}\left[w_j(\vec{x}_{j,m})\,L_j(\vec{x}_{j,m})\,I(\vec{x}_{j,m}) - \hat{P}_j\right]^2. \qquad (5.3) $$

Recursion relations can also be written so that $\hat{\sigma}^2$ can be obtained without the need of storing all the individual samples until the end of the simulation:

$$ \hat{\sigma}_{\hat{P}}^2 = \sum_{j=1}^{J}\frac{1}{M_j(M_j-1)}\,\hat{S}_{j,M_j}, \qquad (5.4) $$

with $\hat{P} = \sum_{j=1}^{J}\hat{P}_{j,M_j}$ and

$$ \hat{P}_{j,m} = \frac{m-1}{m}\,\hat{P}_{j,m-1} + \frac{1}{m}\,w_j(\vec{x}_{j,m})\,L_j(\vec{x}_{j,m})\,I(\vec{x}_{j,m}), \qquad (5.5a) $$

$$ \hat{S}_{j,m} = \hat{S}_{j,m-1} + \frac{m-1}{m}\left[w_j(\vec{x}_{j,m})\,L_j(\vec{x}_{j,m})\,I(\vec{x}_{j,m}) - \hat{P}_{j,m-1}\right]^2. \qquad (5.5b) $$

When using multiple importance sampling, the choice of weights $w_j(\vec{x})$ is almost as important as the choice of biasing distributions $p_j(\vec{x})$. Different weighting functions result in different values for the variance of the combined estimator. A poor choice of weights can result in a large variance, thus partially negating the gains obtained by importance sampling. The best weighting strategies are the ones that yield the smallest value. For example, consider the case where the weighting functions are constant over the whole domain. In this case,

$$ P = \sum_{j=1}^{J} w_j\int I(\vec{x})\,L_j(\vec{x})\,p_j(\vec{x})\,d\vec{x} = \sum_{j=1}^{J} w_j\,E_j[I(\vec{x})\,L_j(\vec{x})]. \qquad (5.6) $$

That is, the estimator is simply a weighted combination of the estimators obtained by using each of the biasing techniques. Unfortunately, the variance of the combined estimator is also a weighted sum of the individual variances, $\sigma_{\hat{P}}^2 = \sum_{j=1}^{J} w_j^2\sigma_j^2$, and if any of the sampling techniques is bad in a given region, then $\hat{P}$ will also have a high variance.

Figure 5.6: Multiple-importance sampled probability distribution (via histograms) for the sum-of-Gaussians problem, showing the results of the individual biasing distributions.
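A generic implementation of the combined estimator, using the running updates of Eqs. (5.5a) and (5.5b) so that no samples need to be stored, might look like the following sketch (illustrative, not from the notes). The weight functions are passed in as arguments, so any normalized choice, including the balance heuristic introduced next, can be used.

```python
import numpy as np

def multiple_is_estimate(samplers, pdfs, weight_fns, target_pdf, indicator, M, rng=None):
    """Multiply-importance-sampled estimate of P (Eqs. (5.2), (5.4), (5.5)).

    samplers[j](rng) draws one sample from p_j, pdfs[j](x) evaluates p_j(x),
    weight_fns[j](x) evaluates w_j(x) (assumed to sum to 1 over j for every x),
    target_pdf(x) evaluates p(x), and indicator(x) is I(x).  M samples are
    drawn from each biasing distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    P_hat, var_hat = 0.0, 0.0
    for draw, p_j, w_j in zip(samplers, pdfs, weight_fns):
        Pj, Sj = 0.0, 0.0                        # running mean and sum of squares
        for m in range(1, M + 1):
            x = draw(rng)
            y = w_j(x) * (target_pdf(x) / p_j(x)) * indicator(x)
            Sj += (m - 1) / m * (y - Pj) ** 2    # Eq. (5.5b), using the old mean
            Pj += (y - Pj) / m                   # Eq. (5.5a), rearranged
        P_hat += Pj                              # Eq. (5.2)
        var_hat += Sj / (M * (M - 1))            # Eq. (5.4)
    return P_hat, np.sqrt(var_hat)               # estimate and its standard error
```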
A relatively simple and particularly useful choice of weights is the balance heuristic. In this case, the weights $w_j(\vec{x})$ are assigned according to

$$ w_j(\vec{x}) = \frac{M_j\,p_j(\vec{x})}{\sum_{j'=1}^{J} M_{j'}\,p_{j'}(\vec{x})}. \qquad (5.7) $$

The quantity $q_j(\vec{x}) = M_j\,p_j(\vec{x})$ is proportional to the expected number of hits from the $j$-th distribution. Thus, with the balance heuristic the weight associated with a sample $\vec{x}$ is the likelihood of realizing that sample with the $j$-th distribution relative to the total likelihood of realizing that same sample with all of the distributions. In other words, Eq. (5.7) weights each distribution $p_j(\vec{x})$ most heavily in those regions of sample space where $p_j(\vec{x})$ is largest. (Eq. (5.7) can also be written in terms of likelihood ratios, a form which is particularly convenient for use in Eq. (5.2).) The balance heuristic has been mathematically proven to be asymptotically close to optimal as the number of realizations becomes large (Eric Veach, Robust Monte Carlo Methods for Light Transport Simulation, Ph.D. thesis, Stanford University, 1997).

Figure 5.7: Multiple-importance sampled probability distribution for the sum-of-Gaussians problem, showing the individual biasing distributions weighted, and then combined, with the balance heuristic.
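For completeness, here is a hedged sketch of how the balance-heuristic weights of Eq. (5.7) could be built and used with the `multiple_is_estimate` helper sketched earlier (all names are illustrative; the example estimates a simple two-sided tail probability with two mean-translated biases rather than reproducing Figures 5.6 and 5.7).

```python
import math
import numpy as np

def balance_heuristic_weights(pdfs, counts):
    """Build the weight functions w_j(x) of Eq. (5.7) from the biasing
    densities p_j and the number of samples M_j drawn from each."""
    def make_wj(j):
        def w_j(x):
            total = sum(M * p(x) for M, p in zip(counts, pdfs))
            return counts[j] * pdfs[j](x) / total
        return w_j
    return [make_wj(j) for j in range(len(pdfs))]

# Example: P(|X_1 + ... + X_10| >= 15) with two mean-translated biases (+1.5, -1.5).
N, m, M = 10, 15, 10_000
shifts = (+m / N, -m / N)

def gauss_vec_pdf(shift):
    # N-dimensional product of unit-variance Gaussians centered at `shift`.
    return lambda x: math.exp(-0.5 * np.sum((x - shift) ** 2)) / (2 * math.pi) ** (N / 2)

samplers = [lambda rng, s=s: rng.normal(loc=s, size=N) for s in shifts]
pdfs     = [gauss_vec_pdf(s) for s in shifts]
weights  = balance_heuristic_weights(pdfs, counts=(M, M))

P_hat, se = multiple_is_estimate(samplers, pdfs, weights,
                                 target_pdf=gauss_vec_pdf(0.0),
                                 indicator=lambda x: float(abs(x.sum()) >= m),
                                 M=M)
print(P_hat, se)   # should come out near 2 * 1.05e-6 for both tails combined
```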