Random Number Generation

Uniform random numbers

• Rules of the form x_{n+1} ≡ a + b x_n (mod m) are called linear congruential generators (LCGs). The result is a sequence of "pseudo-uniform" integers on 0, ..., m − 1. An initial value x_0, called the seed, must be specified. To get pseudo-uniform draws on (0, 1), use U_n = x_n / m. The values of a, b, and m must be chosen very carefully in order to get a sequence that behaves like an iid uniform sequence.

• The period of an LCG is the number of distinct values that occur in the sequence of x_i's. Clearly this is at most m, but it can be smaller depending on the values of a and b.

• A special case of the LCG is the multiplicative congruential generator (MCG) x_{n+1} ≡ b x_n (mod m), or equivalently x_{n+1} ≡ x_0 b^n (mod m). For an MCG, m should be a prime number and b should be a "primitive root mod m", meaning that the powers of b generate all values between 1 and m − 1. For example, for m = 5, the powers of b = 2 are 1, 2, 4, 3, 1, 2, 4, 3, ..., so 2 is a primitive root modulo 5. But the powers of b = 4 are 1, 4, 1, 4, ..., so 4 is not a primitive root modulo 5. In practice, having b be a primitive root modulo m is necessary for an MCG to perform well, but far from sufficient. The values b = 16807 and m = 2^31 − 1 = 2147483647 have been shown to give reasonably good (but not state of the art) performance.

• When a ≠ 0, the modulus m may be non-prime without sacrificing the potential to cover all the integers between 0 and m − 1 (even 0 can be covered in this case, whereas an MCG can never reach zero). The standard C library function drand48 is an LCG with m = 2^48, a = 11, and b = 25214903917. This random number generator can be shown to have period 2^48. The following table shows a sequence from this LCG (the results differ from what you will actually get from the drand48 library function due to the use of the "shuffle box" described below).

              x_i                     a + b*x_i           x_{i+1}   x_{i+1}/2^48
                0                            11                11       0.000000
               11                  277363943098      277363943098       0.000985
     277363943098        6993705175256325314877    11718085204285       0.041631
   11718085204285      295470392517265591684356    49720483695876       0.176643
   49720483695876     1253697219098278389146303   102626409374399       0.364602
  102626409374399     2587715051722178864620894    25707281917278       0.091331
   25707281917278      648206643511396312177937    25979478236433       0.092298
   25979478236433      655070047545450703808072   137139456763464       0.487217
  137139456763464     3457958225520320556088499   148267022728371       0.526750
  148267022728371     3738538732155529954929218   127911637363266       0.454433
  127911637363266     3225279645980899415312933    65633894156837       0.233178
   65633894156837     1654952334863192683630540   233987836661708       0.831292
  233987836661708     5899980819171657253110247   262259097190887       0.931731
  262259097190887     6612837937027380313004390   159894566279526       0.568060
  159894566279526     4031726125588636254303353   156526639281273       0.556094
  156526639281273     3946804169928216632446352    14307911880080       0.050832
   14307911880080      360772623309120026273371   215905707320923       0.767051
  215905707320923     5444041665228996928755402     5324043867850       0.018915
    5324043867850      134245254577730795368461    71032958119949       0.252360
   71032958119949     1791089213934798995940244    83935042429844       0.298197
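To make the recursions above concrete, here is a minimal Octave sketch (an illustration, not part of the original notes) of the MCG with b = 16807 and m = 2^31 − 1 mentioned above. These constants are used for the illustration because b*x_n always stays below 2^53, so the product is exact in double precision; the drand48 constants would overflow doubles and require extended-precision arithmetic.

## Minimal MCG x_{n+1} = b*x_n mod m with b = 16807, m = 2^31 - 1.
b = 16807;
m = 2^31 - 1;
x = 1;                    ## the seed (any value in 1, ..., m-1)
U = zeros(10, 1);
for n = 1:10
  x = mod(b*x, m);        ## the congruential update, exact in doubles
  U(n) = x / m;           ## pseudo-uniform draw on (0, 1)
endfor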
• At least two problems have been observed with the performance of even the best MCGs and LCGs:

– All MCGs and LCGs exhibit some positive autocorrelation. For example, in an MCG, an extremely small value is always followed by another fairly small value. In the MCG given above (b = 16807, m = 2^31 − 1), the frequency of values less than 10^4 is 10^4/2^31 ≈ 4.7 × 10^−6, but such a value is always followed by a value less than 10^4 × 16807/2^31 ≈ 0.08 (on the (0, 1) scale).

– The output of MCGs and LCGs tends to fall on a rather small number of hyperplanes: if consecutive values are binned into k-tuples z_j = (x_{kj}, x_{kj+1}, ..., x_{kj+k−1}), then the z_j tend to fall onto around m^{1/k} distinct hyperplanes.

• A shuffle box can break up low-order serial correlations and destroy the hyperplane structure. To initialize, fill an array [v_1, ..., v_32] with random numbers, and let k denote another random number. To generate one draw, let k* denote the high-order five bits of k. The output random number is v_{k*}. Replace k with v_{k*}, and replace v_{k*} with a new random number.

• A method for generating pseudo-random sequences that performs well without shuffle boxes, but is computationally expensive, is the inversive congruential generator (ICG). These are rules of the form x_{n+1} ≡ a + b x_n^{−1} (mod m), where x_n^{−1} is the modular inverse of x_n, i.e. it satisfies m | x_n x_n^{−1} − 1.

Elementary simulation of non-uniform random numbers

• Inversion method: Suppose F(t) is a strictly increasing CDF and U is a uniform deviate. Since P(U < s) = s for s in (0, 1),

F(t) = P(U < F(t)) = P(F^{−1}(U) < t),

so F^{−1}(U) has CDF F. For example, the CDF of an exponential distribution with mean λ is F(t) = 1 − exp(−t/λ). Solving for the inverse gives F^{−1}(U) = −λ log(1 − U). Therefore the distribution of −λ log(1 − U) is exponential with mean λ. Since 1 − U and U are equal in distribution, we can simply use −λ log(U).

• The logistic distribution has CDF F(x) = e^x/(1 + e^x), which has inverse F^{−1}(u) = log(u/(1 − u)). Thus log(U/(1 − U)) has a logistic distribution.

• The Cauchy distribution has CDF F(t) = 1/2 + arctan(t)/π. Thus tan(π(U − 1/2)) has a Cauchy distribution.

• The inversion method can also be applied to discrete random variables. Suppose G has a geometric distribution, so the mass function is P(G = g) = (1 − p)^{g−1} p and the CDF is F(g) = 1 − (1 − p)^g for g = 1, 2, .... Since

P(G = g) = P(G ≤ g) − P(G ≤ g − 1) = P(F(g − 1) ≤ U ≤ F(g)) = P(g − 1 ≤ log(1 − U)/log(1 − p) ≤ g),

a geometric random variable can be simulated using ⌈log(U)/log(1 − p)⌉ (again using the fact that 1 − U and U are equal in distribution).

• Density transformation method: Suppose U = [U_1, ..., U_k] is a k-vector of independent uniform draws, and let G = [g_1(U), ..., g_k(U)] denote an invertible transformation of U. The density of G is p(G) = |J(G)|, where J is the Jacobian determinant of the inverse map from G back to U. For example, take k = 2, and let

g_1(x_1, x_2) = sqrt(−2 log(x_1)) cos(2π x_2)
g_2(x_1, x_2) = sqrt(−2 log(x_1)) sin(2π x_2).

It is easy to verify that |J(G)| = (1/2π) exp(−G_1^2/2) exp(−G_2^2/2), the bivariate standard normal density. This is called the Box-Muller method for generating Gaussian draws.
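The formulas above translate directly into vectorized Octave. The following minimal sketch (illustration only; the parameter values λ = 2 and p = 0.3 are arbitrary choices, not from the notes) produces exponential, logistic, Cauchy, and geometric draws by inversion, plus Gaussian draws via Box-Muller:

## Inversion method: transform uniform draws into other distributions.
U = rand(10000, 1);
lambda = 2;                       ## exponential mean (arbitrary choice)
E = -lambda*log(U);               ## exponential with mean lambda
X = log(U ./ (1 - U));            ## standard logistic
C = tan(pi*(U - 0.5));            ## standard Cauchy
p = 0.3;                          ## geometric success probability
G = ceil(log(U) / log(1 - p));    ## geometric on {1, 2, ...}

## Box-Muller: two independent uniforms give two independent normals.
U1 = rand(10000, 1);
U2 = rand(10000, 1);
Z1 = sqrt(-2*log(U1)) .* cos(2*pi*U2);
Z2 = sqrt(-2*log(U1)) .* sin(2*pi*U2);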
Rejection method

• Suppose that

1. We want to simulate from the density π, which we can evaluate as a function.
2. There is a density f whose sample space contains the sample space of π, and from which we can easily simulate.
3. There is a constant c such that sup_x π(x)/f(x) ≤ c.

Under these circumstances, we can generate a candidate draw (or proposal draw) from f (which is called the trial distribution, candidate distribution, or proposal distribution), then make a random decision to either accept or reject the candidate draw. The goal is to specify the probability of accepting the draw in such a way that the marginal distribution of accepted draws is π.

The rejection sampling algorithm:

1. Generate Z according to f.
2. With probability π(Z)/(c f(Z)), use Z as the next draw. Otherwise, reject it and draw a new Z.

We can easily confirm that π is the density of the accepted draws:

P(Z | accept) = P(accept | Z) P(Z) / P(accept)
              = [π(Z)/(c f(Z))] f(Z) / P(accept)
              = π(Z) / (c P(accept)).

Since P(Z | accept) and π(Z) are both densities in Z, integrating both sides gives P(accept) = 1/c, and hence P(Z | accept) = π(Z). Note that since π(Z) ≤ c f(Z) and both π and f are densities, we must have c ≥ 1.

Example 1: Suppose our target is π(x) = exp(−x²/2)/√(2π), the standard normal density, and the trial density is f(x) = π^{−1}/(1 + x²), the standard Cauchy density. It can be shown that

π(x)/f(x) ≤ √(2π/e).

Therefore if we simulate a Cauchy draw Z (e.g. using tan(πU), where U is uniform), and then accept it with probability

[exp(−Z²/2)/√(2π)] / [π^{−1}/(1 + Z²)] · √(e/(2π)) = (√e/2) exp(−Z²/2)(1 + Z²),

then the draws that are accepted will be iid standard normal.

The following Octave program generates 10000 iid standard normal draws by rejection sampling from a standard Cauchy distribution. You should check the output array N to confirm that the resulting values are actually normal rather than Cauchy. The counter ii keeps track of the total number of Cauchy draws generated. I got a total of 15,370 (it is a random variable, hence your results may vary). Thus around 1.5 uniform draws must be generated for each normal value produced. The Box-Muller method generates one normal draw per uniform draw, so it is more efficient.

ii = 0;                      ## total number of Cauchy candidates
for it = 1:10000
  while (1)
    Z = tan(pi*rand(1,1));   ## standard Cauchy candidate
    R = exp(-Z^2/2)*(1+Z^2)*sqrt(e)/2;   ## acceptance probability
    ii = ii + 1;
    if (rand(1,1) < R)
      N(it) = Z;
      break;
    endif
  endwhile
endfor

Note that while we can reject from a Cauchy distribution to produce normal draws, the reverse does not work: if π(x) is Cauchy and f(x) is normal, rejection sampling cannot be used, since π/f is unbounded. The intuitive reason for this is that the outliers from a Cauchy distribution can be rejected so that the accepted draws are approximately normal, but the tails of the normal distribution are so thin that no matter how much we reject from the center of the distribution, the result will never appear Cauchy.

Example 2: The triangular distribution has density p(x) = 2(1 − x) I(0 ≤ x ≤ 1). We can simulate from this distribution by rejecting from a uniform [0, 1] trial distribution. Here c = 2, so a uniform draw U is accepted with probability p(U)/(c · 1) = 1 − U. The marginal acceptance probability is P(accept) = 1/c = 1/2. (A sketch in code follows Example 3 below.)

Example 3: Suppose we want to sample from a binomial distribution with parameters n and p. If n is small then we can simply add independent Bernoulli trials, which are easy to simulate directly. If n is large, that approach is very inefficient. An alternative is to rejection sample from a Cauchy distribution. Although the Cauchy distribution is continuous and the binomial distribution is discrete, this can be justified as follows. In general, suppose p(k) (k = 1, 2, ...) is a probability mass function. We can define a density p̃(x) = p(⌊x⌋). If we sample x ∼ p̃ via rejection sampling from a continuous trial distribution, then ⌊x⌋ has mass function p. For generating binomial draws, we can rejection sample from a generalized Cauchy distribution with µ = np, σ = sqrt(np(1 − p)), and c = π/1.2.
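Returning to Example 2, here is a minimal Octave sketch (in the style of the program above, not part of the original notes) of rejection sampling for the triangular density:

## Rejection sampling for the triangular density p(x) = 2(1-x) on [0,1],
## using a uniform trial density with c = 2. A candidate U is accepted
## with probability p(U)/(c*1) = 1 - U.
T = zeros(10000, 1);
for i = 1:10000
  while (1)
    U = rand(1,1);
    if (rand(1,1) < 1 - U)
      T(i) = U;
      break;
    endif
  endwhile
endfor
## About half of the candidates are accepted (P(accept) = 1/c = 1/2).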
• Computational efficiency: The computational efficiency of rejection sampling is determined by P(accept) = 1/c: larger values of the bound c yield less efficient schemes. Suppose we cannot calculate a tight upper bound for π(Z)/f(Z), but we know that π(Z)/f(Z) ≤ c_0, and suppose that c is the tight upper bound. Since P(accept) = 1/c, the expected number of trials that must be made to get one accepted value is c. Thus the rejection sampling scheme using c_0 is c_0/c times worse than the optimal rejection sampling scheme in terms of average performance.

• Uniform sampling: A simple but important class of examples is the simulation of a uniform draw from a complicated sample space S. Suppose we embed S in a larger set T from which we can easily simulate a uniform draw (say a ball or a cube). Let f be the uniform density on T, and note that sup_x π(x)/f(x) = Vol(T)/Vol(S) ≡ c. If x ∈ S then π(x)/(c f(x)) = 1; otherwise it is 0. So we accept all candidate draws that lie in S and reject all draws in T \ S. The marginal acceptance rate is Vol(S)/Vol(T), so the method is more efficient when T is as small as possible. (A concrete sketch follows this list.)

• Normalizing constants: To use rejection sampling to generate draws from π(x) based on a trial density f(x), we only need to know π(x) and f(x) up to multiplicative constants. For example, we may be able to evaluate π̃(x) = aπ(x) and f̃(x) = bf(x) but lack the ability to compute the values of a and b. As long as we can bound π̃(x)/f̃(x) ≤ c, rejection sampling as described above still works:

P(Z | accept) = P(accept | Z) P(Z) / P(accept)
              = [π̃(Z)/(c f̃(Z))] f(Z) / P(accept)
              = (a/(bc)) π(Z) / P(accept)
              = π(Z),

since, as before, integrating both sides forces P(accept) = a/(bc). However, because a and b are unknown, we do not know the marginal acceptance rate.
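As a concrete instance of the uniform sampling bullet above, the following minimal sketch (an illustration, not from the notes) draws uniformly from the unit disk S by rejecting from the enclosing square T = [−1, 1]^2:

## Uniform draws on the unit disk via rejection from the square [-1,1]^2.
## Every candidate in S is accepted, every candidate in T \ S is rejected;
## the acceptance rate is Vol(S)/Vol(T) = pi/4.
D = zeros(10000, 2);
for i = 1:10000
  while (1)
    z = 2*rand(1,2) - 1;
    if (sum(z.^2) <= 1)
      D(i,:) = z;
      break;
    endif
  endwhile
endfor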
Rejection sampling the gamma distribution

Rejection sampling is typically used to generate draws from complicated distributions that arise in specialized problems (these distributions are too particular to have names). Among the classical distributions, most can be simulated using a deterministic operation on a uniform draw. The gamma distribution is more challenging and cannot be handled in this way. Therefore draws from the gamma distribution are usually generated via rejection sampling.

• Method 1: rejection sampling from a Cauchy trial distribution. Without loss of generality take β = 1, so we are sampling from the density π(x; α) = x^{α−1} exp(−x)/Γ(α). We consider the case α > 1. With the standard Cauchy trial density f(x) = π^{−1}/(1 + x²), the ratio π(x)/f(x) can be written

π(x)/f(x) = (π/Γ(α)) [exp((α − 1) log x − x) + exp((α + 1) log x − x)]
          ≤ (2π/Γ(α)) exp((α + 1) log x − x)
          ≤ (2π/Γ(α)) exp((α + 1)(log(α + 1) − 1)).

Note that c grows like (α + 1)^{α+1} e^{−(α+1)}/Γ(α), so the efficiency of this method is poor for large α (for α = 1, 1/c ≈ 0.294; for α = 10, 1/c ≈ 0.012; for α = 100, 1/c ≈ 0.0004).

• Method 2: rejection sampling from a generalized Cauchy trial distribution, which has density

p(x) = (πσ)^{−1} / (1 + ((x − µ)/σ)²).

We now have the opportunity to optimize the acceptance rate over µ and σ. To sample from the generalized Cauchy distribution, sample Z from a standard Cauchy distribution and transform via Z → σZ + µ. The following values of µ, σ, and c are proved optimal in J.H. Ahrens and U. Dieter, "Generating gamma variates by a modified rejection technique", Comm. ACM, 25(1):47–54, January 1982:

µ = α − 1
σ = √(2α − 1)
c = πσ exp(−µ(1 − log(µ)))/Γ(α).

• Method 3: a clever method for rejection sampling a gamma draw from an unusual trial distribution. Let

f(x) = 4 exp(−α) α^{λ+α} x^{λ−1} / (Γ(α)(α^λ + x^λ)²),

where λ = √(2α − 1). The function f(x) is non-negative and integrable, and thus can be renormalized to a density. It is relatively easy to show that π(x; α) ≤ f(x). Therefore if we simulate from the density proportional to f, then accept each draw with probability π(x; α)/f(x), we will have a gamma draw. The key is to recognize that if U is uniform on (0, 1), then α(U/(1 − U))^{1/λ} has density proportional to f: this transformation has CDF x^λ/(α^λ + x^λ), which differentiates to λ α^λ x^{λ−1}/(α^λ + x^λ)² ∝ f(x).

The following program implements Method 3.

## Generate draws from a gamma distribution with parameters alpha and
## beta=1.
alpha = 3;

## The lambda parameter for the trial distribution.
L = sqrt(2*alpha - 1);

## Count the total number of candidates.
n = 0;

## Generate 1000 draws.
for i = 1:1000

  ## Loop until one draw is accepted.
  while (1)
    n = n + 1;

    ## Generate the candidate point.
    u = rand(1,1);
    x = alpha*(u/(1-u))^(1/L);

    ## The log gamma density at x.
    g = -x + (alpha-1)*log(x) - lgamma(alpha);

    ## The log trial density, rescaled so that g - f <= 0.
    f = log(4) - alpha + (L+alpha)*log(alpha) + (L-1)*log(x) ...
        - lgamma(alpha) - 2*log(alpha^L + x^L);

    if (log(rand(1,1)) < g - f)
      G(i) = x;
      break;
    endif
  endwhile
endfor

Rejection sampling when sup_x π(x)/f(x) is difficult to determine or incorrectly specified

• Suppose we cannot calculate a value c such that sup_x π(x)/f(x) ≤ c, but we know that a finite such c exists. Begin with a guess c_0 and apply rejection sampling as if c_0 were the supremum. When we reach a candidate point Z such that π(Z)/f(Z) = c_1 > c_0, go back through all points that were previously accepted and reject each of them with probability 1 − c_0/c_1. Continue in this way, re-evaluating the accepted points each time a new upper bound for π/f is discovered. After many samples are drawn, the maximum ratio π/f among the generated points will be close to sup π/f, so the points that were never rejected will be very nearly a sample from π. (A sketch in code follows below.)

• Suppose that sup_x π(x)/(c f(x)) is finite, but not necessarily smaller than 1, and we adopt the acceptance probability P(accept | Z) = min(π(Z)/(c f(Z)), 1). Then we have

P(Z | accept) = P(accept | Z) P(Z) / P(accept)
              = min(π(Z)/(c f(Z)), 1) f(Z) / P(accept)
              = min(π(Z), c f(Z)) / (c P(accept))
              ∝ min(π(Z), c f(Z)).

If π(Z) ≤ c f(Z) everywhere, we have the usual rejection sampling, and hence the accepted draws have distribution π. If the inequality does not hold, we get a distribution that is shaped like π on U = {Z : π(Z)/f(Z) ≤ c} (although scaled incorrectly), and shaped like f on U^c. For example, suppose we wish to sample from a standard normal target density π, using a Cauchy trial density f with c = 1.2, where the correct bound is √(2π/e) ≈ 1.52. The resulting density of the accepted draws is correctly shaped in the tails, but not near zero (where it has a Cauchy rather than a Gaussian shape).

[Figure: left panel, the standard normal density overlaid with 1.2 times the Cauchy density on (−2, 2); right panel, the density of the accepted draws, which matches the normal in the tails but is Cauchy-shaped near zero.]
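The following minimal sketch (an illustration, not from the notes) implements the adaptive-bound scheme of the first bullet above, using the normal target and Cauchy trial pair of Example 1, for which the true bound √(2π/e) ≈ 1.52 is treated as unknown:

## Adaptive-bound rejection sampling: pi(z)/f(z) for a standard normal
## target and standard Cauchy trial is sqrt(pi/2)*(1+z^2)*exp(-z^2/2).
ratio = @(z) sqrt(pi/2) * (1 + z.^2) .* exp(-z.^2/2);
c = 0.5;                             ## deliberately low initial guess c_0
A = [];                              ## draws accepted so far
for it = 1:20000
  Z = tan(pi*(rand(1,1) - 0.5));     ## Cauchy candidate
  r = ratio(Z);
  if (r > c)
    ## A larger ratio c_1 = r was found: re-reject each previously
    ## accepted point with probability 1 - c_0/c_1, then raise the bound.
    A = A(rand(size(A)) < c/r);
    c = r;
  endif
  if (rand(1,1) < r/c)
    A(end+1) = Z;
  endif
endfor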
Importance sampling

• Suppose we wish to calculate the value of ∫ h(x)π(x) dx, where π > 0 integrates to a finite value. One of the main applications of simulation is to estimate such quantities by recasting them as estimation problems. First note that

∫ h(x)π(x) dx / ∫ π(x) dx = E_π h,

where E_π denotes the expected value with respect to the unique density function π̃ that is proportional to π. It follows that if we can simulate from π̃, we can estimate E_π h unbiasedly and consistently using the sample mean. More generally, we can choose a density f and write

∫ h(x)π(x) dx = ∫ (h(x)π(x)/f(x)) f(x) dx = E_f [hπ/f],

where it is necessary to select f so that hπ/f is integrable with respect to f. If we have an iid sample Z_1, ..., Z_n from f, then

n^{−1} Σ_i h(Z_i)π(Z_i)/f(Z_i)

is an unbiased and consistent estimate of ∫ h(x)π(x) dx (under the conditions of the LLN). This method for approximating an integral is called importance sampling. An important practical advantage of importance sampling over rejection sampling is that there is no need to calculate a bound for the ratio π/f.

Letting w_i = π(Z_i)/f(Z_i) be the importance weights, the estimator can be written n^{−1} Σ_i w_i h(Z_i). Note that the w_i are closely related to the acceptance probabilities under rejection sampling. The Z_i with small importance weights would likely be rejected under rejection sampling. Under importance sampling we allow them to contribute a small degree to the approximation.

Example: Suppose we wish to calculate ∫∫ exp(−|x| − |y|) dx dy over the region [−5, 5]². The true value is 4(1 − exp(−5))² ≈ 3.95. The following programs calculate the importance sampling estimates based on a uniform trial density on [−5, 5]², and on a bivariate standard normal trial density truncated to [−5, 5]². For the uniform trial density, the importance weights are w_i = 100, and for the truncated normal trial density the importance weights are

w_i = 2π (F(5) − F(−5))² exp((z_1² + z_2²)/2),

where F is the standard normal CDF.

## Do 1000 replicates using a uniform [-5,5]^2 trial density.
for r = 1:1000

  ## Simulate from the uniform trial density.
  Z = 10*rand(1000,2) - 5;

  ## The integrand values at the points in Z.
  F = e.^(-abs(Z(:,1)) - abs(Z(:,2)));

  ## Estimate the integral using weights equal to 100.
  I1(r) = 100*sum(F)/1000;
endfor

## Do 1000 replicates using a truncated standard normal trial density.
for r = 1:1000

  ## Simulate from a standard normal truncated to [-5,5]^2.
  Z = [];
  while (1)
    X = randn(1000,2);
    ii = find(max(abs(X)')' <= 5);
    Z = [Z; X(ii,:)];
    if (size(Z,1) >= 1000)
      break;
    endif
  endwhile
  Z = Z(1:1000,:);

  ## Importance weights for the truncated normal trial density
  ## (normal_cdf is called normcdf in recent Octave releases).
  W = 2*pi*(normal_cdf(5) - normal_cdf(-5))^2*exp((Z(:,1).^2 + Z(:,2).^2)/2);

  ## The integrand values at the points in Z.
  F = e.^(-abs(Z(:,1)) - abs(Z(:,2)));

  ## Estimate the integral.
  I2(r) = dot(F, W) / 1000;
endfor

• The efficiency of importance sampling depends on the skew of the weights. If the weights are highly skewed, the sample mean is mostly determined by just a few values, so the usual √n convergence will not hold. The "effective sample size" (ESS), given by the formula

ESS = SS / (1 + var(w)),

where SS is the nominal sample size, can provide some guidance: the weighted sample mean should converge at rate √ESS rather than √SS. Technically, this result may not apply in importance sampling, since the weights and the values being averaged are dependent, but it provides some heuristic guidance. If π is a density and we are approximating E_π h, then E_f w = 1 and var_f w = E_π w − 1, so the effective sample size is SS/E_π w.
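As a sketch, the ESS can be computed in one line from importance weights, e.g. the vector W produced by the truncated-normal program above. With weights scaled to have mean one, SS/(1 + var(w)) reduces (up to the 1/n versus 1/(n − 1) variance convention) to a scale-invariant form:

## Heuristic effective sample size from the (unnormalized) weights W.
ESS = sum(W)^2 / sum(W.^2);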
• Rao-Blackwellization: Suppose we generate Z_1, ..., Z_n from a rejection sampling trial density f, and let D_i indicate whether Z_i is accepted (so each D_i is a Bernoulli trial with success probability π(Z_i)/(c f(Z_i))). The rejection sampling estimator of EZ can be written

θ̂ = Σ_i D_i Z_i / Σ_i D_i.

View this as an estimator of EZ based on data Z_1, ..., Z_n. By the Rao-Blackwell theorem, θ* = E(θ̂ | Z_1, ..., Z_n) is unbiased and at least as efficient as θ̂. For large n, θ* is approximately the importance sampling estimate of EZ, so importance sampling can be viewed as the Rao-Blackwellization of rejection sampling.

• Normalizing constants: If the densities π and/or f are only known up to a constant of proportionality, then the importance sampling weights must be renormalized: w_i → w_i / Σ_j w_j. The resulting estimate of ∫ h(x)π(x) dx is still consistent, but it is no longer unbiased.
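A minimal sketch of the renormalized scheme (an assumed example, not from the notes): estimate E_π h for h(x) = x², where the standard normal target is known only up to a constant as exp(−x²/2) and the standard Cauchy trial density only up to a constant as 1/(1 + x²). Renormalizing the weights removes both unknown constants; the exact answer is E_π x² = 1.

## Self-normalized importance sampling with unnormalized densities.
Z = tan(pi*(rand(100000,1) - 0.5));   ## standard Cauchy draws
W = exp(-Z.^2/2) .* (1 + Z.^2);       ## pi-tilde over f-tilde
W = W / sum(W);                       ## renormalize to sum to one
I = dot(W, Z.^2);                     ## consistent estimate of E_pi x^2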