Random Number Generation
Uniform random numbers
• Rules of the form

    x_{n+1} ≡ a + b x_n (mod m)

are called linear congruential generators (LCG's). The result is a sequence of
"pseudo-uniform" integers on 0, ..., m − 1. An initial value x_0, called the seed,
must be specified. To get pseudo-uniform draws on (0, 1), use U_n = x_n/m.
The values of a, b, and m must be chosen very carefully in order to get a sequence
that behaves like an iid uniform sequence.
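As a minimal sketch, an LCG can be implemented in Octave as follows. The constants a, b, and m here are illustrative demo values (chosen small enough that a + b*x stays within the exactly representable double range), not a recommended generator.

function U = lcg(seed, n)
  ## Illustrative demo constants, not a recommended generator.
  a = 11; b = 16807; m = 2^31 - 1;
  x = seed;
  U = zeros(n,1);
  for i=1:n
    ## x_{i+1} = a + b*x_i (mod m); b*x < 2^45, so the arithmetic is exact.
    x = mod(a + b*x, m);
    ## Pseudo-uniform draw on (0, 1).
    U(i) = x/m;
  endfor
endfunction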
• The period of an LCG is the number of distinct values that occur in the sequence of
x_i's. Clearly this is at most m, but it can be smaller depending on the values of a and b.
• A special case of the LCG is the multiplicative congruential generator (MCG)

    x_{n+1} = b x_n (mod m),

or equivalently

    x_n = x_0 b^n (mod m).

For an MCG, m should be a prime number and b should be a "primitive root mod m",
meaning that the powers of b generate all values between 1 and m − 1.
For example, for m = 5, the powers of the value b = 2 are

    1, 2, 4, 3, 1, 2, 4, 3, ...

so 2 is a primitive root modulo 5. But the powers of b = 4 are

    1, 4, 1, 4, ...

so 4 is not a primitive root modulo 5. In practice, having b be a primitive root modulo
m is necessary for an MCG to perform well, but far from sufficient. The values b = 16807
and m = 2^31 − 1 = 2147483647 have been shown to give reasonably good (but not
state of the art) performance.
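The primitive root condition is easy to check numerically for a small modulus; the following Octave fragment reproduces the two examples above.

## For each b, list the powers b, b^2, ..., b^(m-1) (mod m);
## b is a primitive root mod 5 exactly when all of 1,...,4 appear.
m = 5;
for b = [2 4]
  x = 1;
  p = zeros(1, m-1);
  for k=1:m-1
    x = mod(b*x, m);
    p(k) = x;
  endfor
  printf("b = %d: %s\n", b, mat2str(p));
endfor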
• When a ≠ 0, the modulus m may be non-prime without sacrificing the potential to
cover all the integers between 1 and m (even 0 can be covered in this case, whereas an
MCG can never reach zero).
The standard C library function drand48 is an LCG with m = 2^48, a = 11, and
b = 25214903917. This random number generator can be shown to have period 2^48.
The following table shows a sequence from this LCG (the results differ from what you
will actually get from the drand48 library function due to the use of the "shuffle box"
described below).
x_i               a + b x_i                   x_{i+1}           x_{i+1}/2^48
0                 11                          11                0.000000
11                277363943098                277363943098      0.000985
277363943098      6993705175256325314877      11718085204285    0.041631
11718085204285    295470392517265591684356    49720483695876    0.176643
49720483695876    1253697219098278389146303   102626409374399   0.364602
102626409374399   2587715051722178864620894   25707281917278    0.091331
25707281917278    648206643511396312177937    25979478236433    0.092298
25979478236433    655070047545450703808072    137139456763464   0.487217
137139456763464   3457958225520320556088499   148267022728371   0.526750
148267022728371   3738538732155529954929218   127911637363266   0.454433
127911637363266   3225279645980899415312933   65633894156837    0.233178
65633894156837    1654952334863192683630540   233987836661708   0.831292
233987836661708   5899980819171657253110247   262259097190887   0.931731
262259097190887   6612837937027380313004390   159894566279526   0.568060
159894566279526   4031726125588636254303353   156526639281273   0.556094
156526639281273   3946804169928216632446352   14307911880080    0.050832
14307911880080    360772623309120026273371    215905707320923   0.767051
215905707320923   5444041665228996928755402   5324043867850     0.018915
5324043867850     134245254577730795368461    71032958119949    0.252360
71032958119949    1791089213934798995940244   83935042429844    0.298197
• At least two problems have been observed with the performance of even the best MCG's
and LCG's:

– All MCG's and LCG's exhibit some positive autocorrelation. For example, in an
MCG, an extremely small value is always followed by another fairly small value.
In the MCG given above (b = 16807, m = 2^31 − 1), the frequency of
values less than 10^4 is 10^4/2^31 ≈ 4.7 × 10^−6, but such a value is always followed
by a value less than 10^4 × 16807/2^31 ≈ 0.08 (on the (0, 1) scale).

– The output of MCG's and LCG's tends to fall on a rather small number of
hyperplanes: if consecutive values are binned into k-tuples z_j = (x_{kj}, x_{kj+1}, ..., x_{kj+k−1}),
then the z_j tend to fall onto around m^{1/k} distinct hyperplanes.
• A shuffle box can break up low-order serial correlations and destroy the hyperplane
structure. To initialize, fill an array [v_1, ..., v_32] with random numbers, and let k
denote another random number. To generate one draw, let k* denote the high-order
five bits of k. The output random number is v_{k*}. Replace k with v_{k*}, and replace v_{k*}
with a new random number. A sketch appears below.
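The following Octave fragment is a sketch of this scheme; as an illustration, Octave's rand() stands in for the underlying generator, and the high-order five bits of k are extracted as an index in 1, ..., 32.

## Shuffle box sketch; rand() stands in for the underlying generator.
v = rand(32,1);                ## fill the box
k = rand(1,1);
U = zeros(1000,1);
for i=1:1000
  j = 1 + floor(32*k);         ## high-order five bits of k as an index
  U(i) = v(j);                 ## output the selected entry
  k = v(j);                    ## replace k with the output value
  v(j) = rand(1,1);            ## refill the vacated slot
endfor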
• A method for generating pseudo-random sequences that performs well without a shuffle
box, but is computationally more expensive, is the ICG (inversive congruential
generator). These are rules of the form

    x_{n+1} = a + b x_n^{-1} (mod m),

where the modular inverse a^{-1} of a value a satisfies m | a a^{-1} − 1.
Elementary simulation of non-uniform random numbers
• Inversion method:
Suppose F(t) is a strictly increasing CDF and U is a uniform deviate on (0, 1). For
t ∈ (0, 1) we have P(U < t) = t, hence

    P(U < F(t)) = F(t),

and since F is strictly increasing, U < F(t) holds exactly when F^{-1}(U) < t, so

    P(F^{-1}(U) < t) = F(t).

Thus F^{-1}(U) has CDF F.
For example, the CDF of an exponential distribution with mean λ is F(t) = 1 −
exp(−t/λ). We can solve for the inverse:

    F^{-1}(U) = −λ log(1 − U).

Therefore the distribution of −λ log(1 − U) is exponential with mean λ. Since 1 − U
and U are equal in distribution, we can simply use −λ log(U).
• The logistic distribution has CDF F(x) = e^x/(1 + e^x), which has inverse F^{-1}(x) =
log(x/(1 − x)). Thus log(U/(1 − U)) has a logistic distribution.

• The Cauchy distribution has CDF F(t) = 1/2 + arctan(t)/π. Thus tan(π(U − 1/2)) has
a Cauchy distribution.
• The inversion method can also be applied to discrete random variables. Suppose G
has a geometric distribution, so the mass function is P(G = g) = (1 − p)^{g−1} p and the
CDF is F(g) = 1 − (1 − p)^g for g = 1, 2, .... Since

    P(G = g) = P(G ≤ g) − P(G ≤ g − 1)
             = P(F(g − 1) ≤ U ≤ F(g))
             = P(g − 1 ≤ log(1 − U)/log(1 − p) ≤ g),

a geometric random variable can be simulated using ⌈log(U)/log(1 − p)⌉.
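A brief Octave illustration of these inversion formulas (λ and p are arbitrary demo values):

U = rand(10000,1);
lambda = 2;                    ## demo value
p = 0.3;                       ## demo value
E = -lambda*log(U);            ## exponential with mean lambda
Lg = log(U./(1-U));            ## logistic
C = tan(pi*(U - 0.5));         ## Cauchy
G = ceil(log(U)/log(1-p));     ## geometric on {1, 2, ...}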
• Density transformation method: Suppose U = [U_1, ..., U_k] is a k-vector of independent
uniform draws, and let G = [g_1(U), ..., g_k(U)] denote a transformation of U. Since
the density of U is 1 on the unit cube, the density of G is p(G) = |J(G)|, where J is
the Jacobian determinant of the inverse transformation. For example, take k = 2, and let

    g_1(x_1, x_2) = √(−2 log(x_1)) cos(2π x_2)
    g_2(x_1, x_2) = √(−2 log(x_1)) sin(2π x_2).

It is easy to verify that |J(G)| = (1/2π) exp(−G_1^2/2) exp(−G_2^2/2), the product of
two standard normal densities. This is called the Box-Muller method for generating
Gaussian draws.
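A sketch of the transform in Octave; each pair of uniforms yields two independent standard normal draws:

## Box-Muller: 5000 uniform pairs give 10000 standard normal draws.
U1 = rand(5000,1);
U2 = rand(5000,1);
R = sqrt(-2*log(U1));
N = [R.*cos(2*pi*U2); R.*sin(2*pi*U2)];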
Rejection method
• Suppose that
1. We want to simulate from the density π, which we can evaluate as a function.
2. There is a density f whose sample space contains the sample space of π, and we
can easily simulate from f .
3. There is a constant c such that sup_x π(x)/f(x) < c.

Under these circumstances, we can generate a candidate draw (or proposal draw) from f
(which is called the trial distribution, candidate distribution, or proposal distribution),
then make a random decision to either accept or reject the candidate draw. The goal
is to specify the probability of accepting the draw in such a way that the marginal
distribution of the accepted draws is π.
The rejection sampling algorithm:
1. Generate Z according to f .
2. With probability π(Z)/(c f(Z)), use Z as the next draw. Otherwise, reject it and
draw a new Z.
We can easily confirm that π is the density of the accepted draws:

    P(Z | accept) = P(accept | Z) P(Z)/P(accept)
                  = π(Z) f(Z)/(c f(Z) P(accept))
                  = π(Z)/(c P(accept))
                  = π(Z).
Since P(Z | accept) and π(Z) are both densities in Z, it follows that P(accept) = 1/c.
Note that since π(Z) ≤ c f(Z), we must have c ≥ 1, because both π and f are densities.
Example 1: Suppose our target is π(x) = exp(−x^2/2)/√(2π), the standard normal
density, and the trial density is f(x) = π^{-1}/(1 + x^2), the standard Cauchy density. It
can be shown that

    π(x)/f(x) ≤ √(2π/e).

Therefore if we simulate a Cauchy draw Z (e.g. using tan(πU) where U is uniform),
and then accept it with probability

    √(e/(2π)) · [exp(−Z^2/2)/√(2π)] / [π^{-1}/(1 + Z^2)] = (√e/2) exp(−Z^2/2)(1 + Z^2),

then the draws that are accepted will be iid standard normal.
The following Octave program generates 10000 iid standard normal draws by rejection
sampling from a standard Cauchy distribution. You should check the output array N
to confirm that the resulting values are actually normal rather than Cauchy.
The counter ii keeps track of the total number of Cauchy draws generated. I got a
total of 15,370 (it is a random variable, hence your results may vary). Thus around 1.5
Cauchy candidates must be generated for each normal value produced, and each candidate
consumes two uniform draws (one for the candidate itself and one for the accept/reject
decision). The Box-Muller method generates one normal draw per uniform draw, so it
is more efficient.
ii = 0;                                 ## total number of Cauchy candidates
for it=1:10000

  ## Loop until one candidate is accepted.
  while (1)

    ## Generate a Cauchy candidate by inversion.
    Z = tan(pi*rand(1,1));

    ## The acceptance probability pi(Z)/(c*f(Z)).
    R = exp(-Z^2/2)*(1+Z^2)*sqrt(e)/2;

    ii = ii+1;
    if (rand(1,1) < R)
      N(it) = Z;
      break;
    endif
  endwhile
endfor
Note that while we can reject from a Cauchy distribution to produce normal draws, the
reverse does not work – if π(x) is Cauchy and f (x) is normal, rejection sampling cannot
be used since π/f is unbounded. The intuitive reason for this is that the outliers from
a Cauchy distribution can be rejected so that the accepted draws are approximately
normal, but the tails of the normal distribution are so thin that no matter how much
we reject from the center of the distribution, the result will never appear Cauchy.
Example 2: The triangular distribution has density p(x) = 2(1 − x)I(0 ≤ x ≤ 1). We
can simulate from this distribution by rejecting from a uniform [0, 1] trial distribution.
A uniform draw U is accepted with probability 1 − U . The marginal acceptance
probability is P (accept) = 1/c = 1/2.
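A sketch of this scheme in Octave:

## Triangular draws by rejection from U(0,1); a candidate U is
## accepted with probability 1-U.
T = zeros(1000,1);
for i=1:1000
  while (1)
    U = rand(1,1);
    if (rand(1,1) < 1-U)
      T(i) = U;
      break;
    endif
  endwhile
endfor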
Example 3: Suppose we want to sample from a binomial distribution with n trials and
success probability p. If n is small then we can simply add independent Bernoulli
trials, which are easy to simulate directly. If n is large, that approach is very
inefficient. An alternative is to rejection sample from a Cauchy distribution.
Although a Cauchy distribution is continuous and the binomial distribution is
discrete, this can be justified as follows. In general, suppose p(k) (k = 1, 2, ...)
is a probability mass function. We can define a density p̃(x) = p(⌊x⌋). If we sample
x ∼ p̃ via rejection sampling from a continuous trial distribution, then ⌊x⌋ has mass
function p. For generating binomial draws, we can rejection sample from a generalized
Cauchy distribution with µ = np, σ = √(np(1 − p)), and c = π/1.2.
• Computational efficiency: The computational efficiency of rejection sampling is
determined by P(accept) = 1/c: larger values of the bound c yield less efficient
schemes. Suppose we cannot calculate a tight upper bound for π(Z)/f(Z), but we know
that π(Z)/f(Z) ≤ c_0. Suppose that c is the tight upper bound. Since P(accept) = 1/c,
the expected number of trials that must be made to get one accepted value is c. Thus
the rejection sampling scheme using c_0 is c_0/c times worse than the optimal rejection
sampling scheme in terms of average performance.
• Uniform sampling: A simple but important class of examples is the simulation of a
uniform draw from a complicated sample space S. Suppose we embed S in a larger set
T from which we can easily simulate a uniform draw (say a ball or a cube). Let f be
the uniform distribution on T, and note that sup_x π(x)/f(x) = Vol(T)/Vol(S) ≡ c. If
x ∈ S then π(x)/(c f(x)) is 1, otherwise it is 0. So we accept all candidate draws that
lie in S and reject all draws in T \ S. The marginal acceptance rate is Vol(S)/Vol(T),
so the method is more efficient when T is as small as possible.
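For example, the following Octave sketch draws uniformly from the unit disk S by rejection from the enclosing square T = [−1, 1]^2; the acceptance rate is Vol(S)/Vol(T) = π/4 ≈ 0.785.

## Uniform draws from the unit disk by rejection from [-1,1]^2.
D = zeros(1000,2);
for i=1:1000
  while (1)
    x = 2*rand(1,2) - 1;
    if (sum(x.^2) <= 1)
      D(i,:) = x;
      break;
    endif
  endwhile
endfor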
• Normalizing constants: To use rejection sampling to generate draws from π(x) based on
a trial density f(x), we only need to know π(x) and f(x) up to multiplicative constants.
For example, we may be able to evaluate π̃(x) = aπ(x) and f̃(x) = bf(x) but lack the
ability to compute the values of a and b. As long as we can bound π̃(x)/f̃(x) < c,
rejection sampling as described above still works:

    P(Z | accept) = P(accept | Z) P(Z)/P(accept)
                  = [π̃(Z)/(c f̃(Z))] · f(Z)/P(accept)
                  = (a/(bc)) π(Z)/P(accept)
                  = π(Z),

since P(accept) = E_f [π̃(Z)/(c f̃(Z))] = a/(bc). However a/(bc) involves the unknown
constants a and b, so we do not know the marginal acceptance rate.
Rejection sampling the gamma distribution
Rejection sampling is typically used to generate draws from complicated distributions that
arise in specialized problems (these distributions are too particular to have names). Among
the classical distributions, most can be simulated using a deterministic operation on a uniform draw. The gamma distribution is more challenging, and cannot be handled in this way.
Therefore draws from the gamma distribution are usually generated via rejection sampling.
• Method 1: rejection sampling from a Cauchy trial distribution. Without loss of
generality take β = 1, so we are sampling from the density π(x; α) = x^{α−1} exp(−x)/Γ(α).
We consider the case α > 1. The ratio π(x)/f(x) can be written

    π(x)/f(x) = (π/Γ(α)) (exp((α − 1) log x − x) + exp((α + 1) log x − x))
              ≤ (2π/Γ(α)) exp((α + 1) log x − x)
              ≤ (2π/Γ(α)) exp((α + 1)(log(α + 1) − 1)).

Note that c goes to ∞ like n^n/n!, so the efficiency of this method is poor for large α
(i.e., for α = 1, 1/c ≈ .294; α = 10, 1/c ≈ .012; α = 100, 1/c ≈ .0004).
• Method 2: rejection sampling from a generalized Cauchy trial distribution. This
distribution has density

    p(x) = (πσ)^{-1} / (1 + ((x − µ)/σ)^2).
We now have the opportunity to optimize the acceptance rate over µ and σ. To
sample from the generalized Cauchy distribution, sample Z from a standard Cauchy
distribution, and transform via Z → σZ + µ.
It is proved in J.H. Ahrens and U. Dieter, "Generating gamma variates by a modified
rejection technique", Comm. ACM, 25(1):47–54, January 1982, that the following values
of µ, σ, and c are optimal:

    µ = α − 1
    σ = √(2α − 1)
    c = πσ exp(−µ(1 − log(µ)))/Γ(α).
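The following Octave fragment is a sketch of Method 2 using these values of µ and σ (assuming α > 1 so that µ > 0); the acceptance probability is computed on the log scale, and candidates with x ≤ 0 are rejected outright since π vanishes there.

## Gamma(alpha) draws by rejection from a generalized Cauchy
## trial density with the Ahrens-Dieter values of mu and sigma.
alpha = 3;
mu = alpha - 1;
sigma = sqrt(2*alpha - 1);
G = zeros(1000,1);
for i=1:1000
  while (1)
    ## Generalized Cauchy candidate.
    x = mu + sigma*tan(pi*(rand(1,1) - 0.5));
    if (x > 0)
      ## log of pi(x)/(c*f(x)) for the c given above.
      logr = (alpha-1)*log(x) - x + mu - mu*log(mu) ...
             + log(1 + ((x-mu)/sigma)^2);
      if (log(rand(1,1)) < logr)
        G(i) = x;
        break;
      endif
    endif
  endwhile
endfor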
• Method 3: a clever method for rejection sampling a gamma draw from an unusual
trial distribution. Let

    f(x) = 4 exp(−α) α^{λ+α} x^{λ−1} / (Γ(α)(α^λ + x^λ)^2),

where λ = √(2α − 1). The function f(x) is non-negative and integrable, thus can be
re-normalized to a density. It is relatively easy to show that π(x; α) ≤ f(x). Therefore
if we can simulate from the density that is proportional to f, then accept the draw
with probability π(x; α)/f(x), we will have a gamma draw.
The key is to recognize that if U is uniform on (0, 1), then α(U/(1 − U))^{1/λ} has density
proportional to f.
The following program implements Method 3.

## Generate draws from a gamma distribution with parameters alpha and
## beta=1.
alpha = 3;

## The lambda parameter for the trial distribution.
L = sqrt(2*alpha - 1);

## Count the total number of candidates.
n = 0;

## Generate 1000 draws.
for i=1:1000

  ## Loop until one draw is accepted.
  while (1)
    n = n+1;

    ## Generate the candidate point.
    u = rand(1,1);
    x = alpha*(u/(1-u))^(1/L);

    ## The log gamma density at x.
    g = -x + (alpha-1)*log(x) - lgamma(alpha);

    ## The log trial density, scaled so that g-f <= 0.
    f = log(4) - alpha + (L+alpha)*log(alpha) + (L-1)*log(x) ...
        - lgamma(alpha) - 2*log(alpha^L + x^L);

    if (log(rand(1,1)) < g-f)
      G(i) = x;
      break;
    endif
  endwhile
endfor
Rejection sampling when sup_x π(x)/f(x) is difficult to determine or incorrectly
specified
• Suppose we cannot calculate a value c such that sup_x π(x)/f(x) ≤ c, but we know that
a finite such c exists. Begin with a guess c_0 and apply rejection sampling as if c_0 were
the supremum. Then when we reach a candidate point Z such that π(Z)/f(Z) = c_1 > c_0,
we must go back through all points that were previously accepted and reject each
with probability 1 − c_0/c_1. Continue in this way, re-screening the accepted points each
time a new upper bound for π/f is discovered. After many samples are drawn, the maximum
ratio π/f among the generated points will be close to sup π/f, so the points that were
never rejected will be very nearly a sample from π. A sketch appears below.
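The following Octave sketch applies this scheme to the normal target and Cauchy trial density of Example 1, starting from the deliberately bad guess c_0 = 1:

## Adaptive-bound rejection: normal target, Cauchy trial density.
c0 = 1;
A = [];                                 ## draws accepted so far
for it=1:10000
  Z = tan(pi*(rand(1,1) - 0.5));        ## Cauchy candidate
  r = sqrt(pi/2)*(1+Z^2)*exp(-Z^2/2);   ## pi(Z)/f(Z)
  if (r > c0)
    ## A larger bound was found; re-screen earlier acceptances,
    ## keeping each with probability c0/r.
    A = A(rand(size(A)) < c0/r);
    c0 = r;
  endif
  if (rand(1,1) < r/c0)
    A(end+1) = Z;
  endif
endfor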
• Suppose that sup_x π(x)/(c f(x)) is finite, but not necessarily smaller than 1, and we
adopt the acceptance probability P(accept | Z) = min(π(Z)/(c f(Z)), 1). Then we have

    P(Z | accept) = P(accept | Z) P(Z)/P(accept)
                  = min(π(Z)/(c f(Z)), 1) · f(Z)/P(accept)
                  = min(π(Z), c f(Z))/(c P(accept))
                  ∝ min(π(Z), c f(Z)).

If π(Z) ≤ c f(Z) for all Z, we have the usual rejection sampling, and hence the accepted
draws have distribution π. If the inequality does not hold everywhere, we get a
distribution that is shaped like π on U = {Z : π(Z)/f(Z) ≤ c} (although scaled
incorrectly), and shaped like f on U^c.
For example, suppose we wish to sample from a standard normal target density π,
using a Cauchy trial density f with c = 1.2, where the correct bound is √(2π/e) ≈ 1.52.
The accepted draws will be distributed according to the density pictured below.
The resulting density is correctly shaped in the tails, but not near zero (where it has
a Cauchy rather than a Gaussian shape).
[Figure: two panels on (−2, 2): the standard normal density together with 1.2*Cauchy, and the density of the accepted draws.]
Importance sampling
• Suppose we wish to calculate the value of ∫ h(x)π(x)dx, where π > 0 integrates to a
finite value. One of the main applications of simulation is to estimate such quantities
by recasting them as estimation problems. First note that

    ∫ h(x)π(x)dx / ∫ π(x)dx = E_π h,

where E_π denotes the expected value with respect to the unique density function π̃
that is proportional to π. It follows that if we can simulate from π̃, we can estimate
E_π h unbiasedly and consistently using the sample mean.
More generally, we can choose a density f and write

    ∫ h(x)π(x)dx = ∫ (h(x)π(x)/f(x)) · f(x)dx = E_f [hπ/f],

where it is necessary to select f so that hπ/f is integrable with respect to f. If we
have an iid sample Z_1, ..., Z_n from f, then

    n^{-1} Σ_i h(Z_i)π(Z_i)/f(Z_i)

is an unbiased and consistent estimate of ∫ h(x)π(x)dx (under the conditions of the
LLN). This method for approximating an integral is called importance sampling.
An important practical advantage of importance sampling over rejection sampling is
that there is no need to calculate a bound for the ratio π/f.
Letting w_i = π(Z_i)/f(Z_i) be the importance weights, the estimator can be written

    n^{-1} Σ_i w_i h(Z_i).

Note that the w_i are closely related to the acceptance probabilities under rejection
sampling. The Z_i with small importance weights would likely be rejected under
rejection sampling; under importance sampling we allow them to contribute a small
amount to the approximation.
Example: Suppose we wish to compute ∫ exp(−|x| − |y|)dxdy over the region [−5, 5]^2.
The true value is 4(1 − exp(−5))^2 ≈ 3.95. The following programs calculate the
importance sampling estimates based on a uniform trial density on [−5, 5]^2, and on a
bivariate standard normal trial density truncated to [−5, 5]^2. For the uniform trial
density, the importance weights are w_i = 100, and for the normal trial density the
importance weights are

    w_i = 2π(F(5) − F(−5))^2 exp((z_1^2 + z_2^2)/2),

where F is the standard normal CDF.
## Do 1000 replicates using a uniform [-5,5]^2 trial density.
for r=1:1000

  ## Simulate trial density values.
  Z = 10*rand(1000,2) - 5;

  ## The integrand values at the points in Z.
  F = e.^(-abs(Z(:,1)) - abs(Z(:,2)));

  ## Estimate the integral using weights equal to 100.
  I1(r) = 100*sum(F)/1000;
endfor
## Do 1000 replicates using a truncated standard normal trial density.
for r=1:1000

  ## Simulate from a truncated normal on [-5,5]^2.
  Z = [];
  while (1)
    X = randn(1000,2);
    ii = find(max(abs(X)')' <= 5);
    Z = [Z; X(ii,:)];
    if (size(Z,1) >= 1000)
      break;
    endif
  endwhile
  Z = Z(1:1000,:);

  ## Importance weights for the truncated normal trial density.
  W = 2*pi*(normal_cdf(5) - normal_cdf(-5))^2*exp((Z(:,1).^2+Z(:,2).^2)/2);

  ## The integrand values at the points in Z.
  F = e.^(-abs(Z(:,1)) - abs(Z(:,2)));

  ## Estimate the integral.
  I2(r) = dot(F, W) / 1000;
endfor
• The efficiency of importance sampling depends on the skew of the weights. If the
weights are highly skewed, the sample mean is mostly determined by just a few values,
so the usual √n convergence will not hold. The "effective sample size" (ESS), given
by the formula

    ESS = SS/(1 + var(w)),

where SS is the actual sample size, can provide some guidance: the weighted sample
mean should converge at rate √ESS rather than √n. Technically, this result may not
apply in importance sampling, since the weights and the values being averaged are
dependent, but it provides some heuristic guidance.
If π is a density and we are approximating E_π h, then E_f w = 1 and var_f(w) = E_π w − 1,
so the effective sample size is SS/E_π w.
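For example, the ESS implied by the weights W in the truncated-normal program above can be estimated as follows (the weights are first rescaled to average one, since the target there is not a normalized density):

## Effective sample size for the weights W computed above.
w = W/mean(W);                 ## rescale the weights to average one
ESS = length(w)/(1 + var(w));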
• Rao-Blackwellization: Suppose we generate Z_1, ..., Z_n from a rejection sampling trial
density f, and let D_i be the indicator of whether Z_i is accepted (so each D_i is a
Bernoulli trial with success probability π(Z_i)/(c f(Z_i))). The rejection sampling
estimator of EZ can be written

    θ̂ = Σ_i D_i Z_i / Σ_i D_i.

View this as an estimator of EZ based on data Z_1, ..., Z_n. By the Rao-Blackwell
theorem, θ* = E(θ̂ | Z_1, ..., Z_n) is unbiased and at least as efficient as θ̂. For large n,
θ* is approximately the importance sampling estimate of EZ, so importance sampling
can be viewed as the Rao-Blackwellization of rejection sampling.
• Normalizing constants: If the densities π and/or f are only known up to constants
of proportionality, then the importance sampling weights must be renormalized:
w_i → w_i / Σ_j w_j. The resulting estimate of ∫ h(x)π(x)dx is still consistent, but is
no longer unbiased. A sketch appears below.
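A sketch of this renormalization in Octave, for the unnormalized target π̃(x) = exp(−x^2/2) and unnormalized Cauchy trial density f̃(x) = 1/(1 + x^2), estimating E_π h for h(x) = x^2 (the true value is 1):

## Self-normalized importance sampling with unnormalized densities.
Z = tan(pi*(rand(10000,1) - 0.5));   ## iid standard Cauchy draws
w = exp(-Z.^2/2).*(1 + Z.^2);        ## w_i = pi~(Z_i)/f~(Z_i)
Ih = sum(w.*Z.^2)/sum(w);            ## estimate of E_pi(h), about 1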