Monte Carlo Integration

Ronald Kleiss¹
IMAPP, Radboud University, Nijmegen, the Netherlands
NIKHEF Academic lectures, 7/12/07 - 10/12/07
¹ [email protected]

1 Ideology

• Any numerical procedure in which the outcome depends on at least one random variable is a Monte Carlo integral.
• Numerical answer to a given problem = R(r₁, r₂, . . . , rₙ), involving random variables r₁, r₂, . . . , rₙ, each drawn from its own domain A₁, A₂, . . . , Aₙ, with a combined probability density P(r₁, r₂, . . . , rₙ).
• Expected answer is given by
  ∫_{A₁} ∫_{A₂} · · · ∫_{Aₙ} R(x₁, x₂, . . . , xₙ) P(x₁, x₂, . . . , xₙ) dx₁ dx₂ · · · dxₙ
• The number n of random variables may be huge.

2 Probability theory and Monte Carlo estimators

2.1 Random numbers and probability densities

• A single number is never random.
• We can only discuss (potentially) infinite sequences of random numbers: a stream.
• It is promised, or assumed, that the relative frequency with which the numbers fall into given intervals takes on definite values.
• The basic property of randomness: no matter how much information we have about the number stream up to x_N, we are unable to predict x_{N+1} better than the relative frequency allows: the probability.
• A random variable r has probability density P(r) if the probability to find r in [x, x + dx] is P(x) dx (infinitesimal dx). We let −∞ < r < ∞: if r is bounded, P(x) = 0 outside the range.
• Discrete random variables have a probability density consisting of Dirac delta functions. In all cases
  P(x) ≥ 0 ,  ∫ dx P(x) = 1 .  (1)
• Combined probability densities P(x₁, x₂, . . . , x_k) of k real variables are defined by trivial extension. If
  P(x₁, x₂, . . . , x_k) = P₁(x₁) P₂(x₂) · · · P_k(x_k) ,  (2)
the random variables x₁, x₂, . . . , x_k are independent.
• If in addition Pᵢ(x) = Pⱼ(x) for all i, j in 1, 2, . . . , k, the random variables are independent, identically distributed, or iid. A perfect source of random numbers is supposed to deliver iid variables.

2.2 Expectation values

• Average value of f(x), sampled over very many values of x:
  ⟨f(x)⟩ = ∫ dx P(x) f(x) .
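The expectation value above can be checked by plain sampling; a minimal hedged sketch (the density P(x) = e⁻ˣ and the choice f(x) = x, for which ⟨f⟩ = 1 exactly, are illustrative only and not examples from the lectures):

```python
import random

# Estimate <f(x)> = ∫ dx P(x) f(x) for P(x) = exp(-x) on x > 0
# by averaging f over draws from P. Here f(x) = x, so <f> = 1 exactly.
random.seed(0)

draws = [random.expovariate(1.0) for _ in range(200_000)]
estimate = sum(draws) / len(draws)
print(estimate)  # close to 1
```

The sample size and seed are arbitrary; the statistical error of the average is of order σ/√N.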
• Variance (square of the standard deviation):
  σ(f(x))² = ⟨(f(x) − ⟨f(x)⟩)²⟩  (3)
           = ⟨f(x)²⟩ − ⟨f(x)⟩² .  (4)
• Moments of the density P(x) are expectation values of powers of x:
  Mₙ ≡ ⟨xⁿ⟩ = ∫ dx P(x) xⁿ  (n = 0, 1, 2, 3, . . .) .  (5)
• Characteristic function:
  φ(z) = Σ_{n≥0} (iz)ⁿ/n! Mₙ = ∫ dx P(x) e^{izx} = ⟨e^{izx}⟩ .  (6)
  φ(0) = 1, φ′(0) = iM₁, and φ″(0) = −M₂.

2.3 Chebyshev–Bienaymé theorem

For a density P(x) with finite mean m and variance σ², and a > 0:
  σ² = ∫ dx P(x) (x − m)²
     ≥ ∫ dx P(x) (x − m)² θ(|x − m| ≥ a)
     ≥ a² ∫ dx P(x) θ(|x − m| ≥ a)
     = a² Prob(|x − m| ≥ a) ,
so that
  Prob(|x − m| ≤ a) ≥ 1 − σ²/a² .  (7)

2.4 Central limit theorem

Let x₁, x₂, . . . , x_N be N iid variables with density P(x) and φ(z) = ⟨exp(izx)⟩, and
  ξ = (1/N) Σ_{j=1}^N x_j .  (8)
The density of ξ is
  Π(ξ) = ∫ dx₁ · · · dx_N P(x₁) · · · P(x_N) δ(ξ − (x₁ + · · · + x_N)/N) ,  (9)
with characteristic function
  Φ(z) = ∫ dξ Π(ξ) e^{izξ} = [ ∫ dx P(x) e^{izx/N} ]^N = φ(z/N)^N .  (10)
Expanding the logarithm,
  N log φ(z/N) ≈ N log(1 + (iz/N) M₁ − (z²/2N²) M₂ + · · ·)
              ≈ N [ (iz/N) M₁ + (z²/2N²) M₁² − (z²/2N²) M₂ + · · · ]
              ≈ izm − (z²/2N) σ² + · · ·
so that
  Π(ξ) = √(N/2πσ²) exp( −N(ξ − m)²/2σ² ) .

[Figure: approach to the Gaussian for P(x) = exp(−x) θ(x), for N = 5 and N = 15.]

2.5 Confidence levels

Probability that |a − A| ≤ kσ:

  k     Chebyshev   Gaussian
  0.5   ≥ 0         0.384
  1.0   ≥ 0         0.684
  1.5   ≥ 0.556     0.866
  2.0   ≥ 0.750     0.955
  2.5   ≥ 0.840     0.988
  3.0   ≥ 0.889     0.997

2.6 Integral estimator: bias and convergence

  J_m = ∫₀¹ dx f(x)^m ,  m = 1, 2, 3, . . .  (11)

The desired integral is J₁. Estimate this integral using a stream of iid random numbers x_j (j = 1, 2, 3, 4, . . .) uniform in [0, 1). Define
  S_m = Σ_{j=1}^N (w_j)^m ,  w_j ≡ f(x_j) .  (12)
The x_j are the events, the w_j are the weights. Then
  ⟨(w_j)^m⟩ = ∫₀¹ dx f(x)^m = J_m .  (13)
The Monte Carlo estimator of the integral J₁ is
  E₁ = S₁/N .  (14)
• The expectation value of E₁ is
  ⟨E₁⟩ = ⟨S₁⟩/N = (1/N) Σ_{j=1}^N ⟨w_j⟩ = J₁ :  (15)
the Monte Carlo estimate is unbiased.
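The estimator E₁ = S₁/N can be sketched in a few lines; a hedged illustration (the integrand f(x) = 3x², for which J₁ = 1 and J₂ = 9/5, and the sample size are my choices, not the lectures'; the error estimate anticipates the first-order estimator made precise in section 2.7):

```python
import random

# Monte Carlo estimate of J1 = ∫₀¹ f(x) dx with f(x) = 3x² (so J1 = 1):
# events x_j uniform in [0,1), weights w_j = f(x_j), estimator E1 = S1/N.
random.seed(4)

N = 200_000
weights = [3.0 * random.random() ** 2 for _ in range(N)]
S1 = sum(weights)
S2 = sum(w * w for w in weights)
E1 = S1 / N
E2 = (N * S2 - S1 ** 2) / N ** 3   # first-order error estimator (section 2.7)
print(E1, E2 ** 0.5)               # estimate and its one-sigma error
```

With J₂ − J₁² = 4/5, the expected one-sigma error is √(0.8/N) ≈ 0.002 here.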
• The variance of E₁ is
  σ(E₁)² = ⟨E₁²⟩ − ⟨E₁⟩² = (1/N²)⟨S₁²⟩ − J₁² = (1/N)(J₂ − J₁²) .  (16)
  By Chebyshev, the Monte Carlo estimate converges for N → ∞ provided J₂ < ∞.
• Convergence 'only' as 1/√N, but...
• Valid in all dimensions!

2.7 First- and second-order error estimators

The estimator of σ(E₁)² is
  E₂ = (1/N₂) ( S₂ − S₁²/N ) ≈ (N S₂ − S₁²)/N³ ,  (17)
where N_k = N(N−1)(N−2) · · · (N−k+1) = N!/(N−k)!. This is the correct estimator:
  ⟨E₂⟩ = σ(E₁)² .
E₂ has its own uncertainty (important for confidence levels!):
  σ(E₂)² = (1/N³)(J₄ − 4J₃J₁ + 3J₂²) + (1/N²)( N₄/(N₂)² − 1 )(J₂ − J₁²)² = O(N⁻³) .
The estimator for σ(E₂)² is
  E₄ ≈ (1/N⁷)( N³S₄ − 4N²S₃S₁ − N²S₂² + 8N S₂S₁² − 4S₁⁴ ) ,
...and so on to E₈, E₁₆, E₃₂, . . .

2.8 Examples

  f_α(x) = (1 + α) x^α θ(0 < x ≤ 1) ,  J₁ = 1 .  (18)
J_m is defined for α > −1/m. Estimates for various values of α, 1 ≤ N ≤ 2²⁰; E₂ and E₄ are multiplied by powers of N so as to approach constants.

[Figure: runs of the estimators against log₂ N for α = 1.0, −0.1, −0.3, −0.7 and −0.999.]

  α       E₁        E₂ , ⟨E₂⟩              E₄ , ⟨E₄⟩                |1 − E₁|/σ
  1.0     0.9990    3.178E-7 , 3.179E-7    7.716E-20 , 7.710E-20    1.858
  −0.1    1.000     1.193E-8 , 1.192E-8    2.209E-21 , 2.281E-21    1.657
  −0.3    1.000     2.114E-7 , 2.146E-7    1.918E-16 , —            1.170
  −0.7    0.9903    1.287E-4 , —           5.260E-9 , —             0.8474
  −0.999  1.589E-2  6.354E-6 , —           2.612E-11 , —            6.203

3 Pseudo-random number generators

3.1 Digital vs. analog methods

Against 'natural' random numbers:
• Randomness
• Speed
• Reproducibility
Consensus nowadays: streams produced by a simple, repeatable computer algorithm.
Not truly random; but usually the requirement of true randomness is much more than is actually needed: in a simple integration, a uniform distribution is sufficient. Computer-generated number streams are called pseudo-random¹.
¹ From the Greek, ψεύδειν = 'to lie'.

3.2 The ensemble of random-number algorithms

Generate streams of integers in (1, 2, 3, . . . , M), with M 'large'.
• Many actual random number generators work internally with integers.
• On output, scale to (0, 1) by dividing by M.
• No rounding errors internally.
• The precision of the numbers is constant over the interval.
• Consider algorithms that produce a new number in the stream using only the last one produced:
  n_{j+1} = f(n_j) ,  j = 1, 2, . . .
• The algorithm f(n) is completely specified by its set of results:
  f(n) ↔ {f(1), f(2), f(3), . . . , f(M − 1), f(M)} .
There are exactly M^M different algorithms in this ensemble.
• Starting value n₁; n₂ = f(n₁), n₃ = f(n₂), . . .
• As soon as a number reappears, the stream starts to cycle and becomes useless.
• The length of the stream up to the first repetition is the lifetime of the algorithm. The maximum lifetime is M; we want it as long as possible.
• Decide on a starting value n₁, and choose an algorithm at random:
• the probability that n₂ ≠ n₁ is (1 − 1/M);
• the probability that n₃ ≠ n₁ and n₃ ≠ n₂ is (1 − 1/M)(1 − 2/M), and so on.
• Suppose that up to n_p all numbers are different, but n_{p+1} is a reappearing number: the lifetime is p.
• The probability to pick an algorithm with lifetime p is
  Q(p) = p M! / ( (M − p)! M^{p+1} ) = A(p) − A(p + 1) ,  A(p) = M_p / M^p ,
where A(p) is the probability to pick an algorithm with a lifetime ≥ p.
• Σ_p Q(p) = A(1) = 1.
• Expected lifetime:
  ⟨p⟩ = Σ_{p=1}^M p Q(p) = Σ_{p=1}^M A(p) = √(πM/2) − 1/3 + O(1/√M) .
• Approximate form:
  Q(p) ≈ (p/M) exp(−p²/2M) :  (19)
the probabilities are a function of p/√M, so the lifetime is typically much smaller than M.
• The probability to hit upon a lifetime M by coincidence is
  Q(M) = M!/M^M ≈ e^{−M} √(2πM) .  (20)
• Upshot: pseudo-random number algorithms must be chosen very carefully.
• There are also algorithms using a larger 'register': n_j = f(n_{j−1}, n_{j−2}, . . . , n_{j−k}).
  – The full register is not necessarily used.
  – After n_j has been generated, the register must be shifted.
The analysis is the same as before, since a set of k numbers in (1, . . . , M) is equivalent to a single number taking values in (1, . . . , M^k): simply replace M by M^k. Maximum lifetime = M^k.

3.3 Obsolete algorithms

3.3.1 The mid-square method

  n_{j+1} = ⌊ n_j² / 10^k ⌋ mod 10^k .  (21)
• A typical example of a 'random algorithm';
• difficult to analyze, depends crucially on the starting value,
• performs poorly for lifetimes.

[Figure: lifetimes of the mid-square algorithm for decimal 4-digit numbers. The longest lifetime is a disappointing 111, for starting value 6239 — not even the 'expected' 125 for M = 10,000.]

3.3.2 The logistic map

  x_{j+1} = 4x_j(1 − x_j)  (22)
is supposed to produce a really chaotic sequence of values. However, writing
  x_j = (sin θ_j)² ,  0 ≤ θ_j ≤ π/2 ,  (23)
it becomes equivalent to
  θ_{j+1} = 2θ_j mod (π/2) .  (24)

[Figure: the distribution of lifetimes of the logistic map (22) as a function of the starting value. The numbers have been truncated to an accuracy of 10⁻⁵ in the algorithm, so that the maximum lifetime is potentially 10⁵; different truncations produce essentially the same plot, with a maximum lifetime of the order of the square root of the maximum.]

3.4 Linear congruential generators

  x_{n+1} = (a x_n + c) mod m ,  (25)
where a, c and m are predetermined integers.
• The maximum possible lifetime m is attained if
  1. c is relatively prime to m;
  2. every prime factor of m also divides b ≡ a − 1;
  3. if m is a multiple of 4, b is also a multiple of 4.
If c = 0 is chosen, the maximal period is smaller than m, but it can be made quite large. m should be very large compared to the number of numbers used.
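A linear congruential generator is a few lines of code; a hedged sketch with toy parameters of my own choosing that satisfy the maximum-lifetime conditions above (c = 1 coprime to m; the only prime factor of m = 2¹⁶ is 2, which divides a − 1 = 4; and 4 | a − 1 since 4 | m) — these are not recommended production parameters:

```python
# Linear congruential generator x_{n+1} = (a·x_n + c) mod m,
# scaled on output to [0, 1) by dividing by m, as described above.
def lcg(m, a, c, seed):
    x = seed
    while True:
        x = (a * x + c) % m
        yield x / m

m, a, c = 2 ** 16, 5, 1        # toy values: conditions for full period m hold
gen = lcg(m, a, c, seed=0)
stream = [next(gen) for _ in range(5)]
print(stream)
```

With these parameters the map n → (5n + 1) mod 2¹⁶ is a single cycle through all m values, so the lifetime is the maximal m = 65536.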
• Approximate real-number relation:
  x_{n+1} = a x_n + δ − k ,  δ = c/m ,  (26)
where k is a natural number such that
  (k − δ)/a ≤ x_n < (k + 1 − δ)/a .  (27)
To good approximation ⟨x_n⟩ ≈ 1/2 and ⟨x_n²⟩ ≈ 1/3, while
  ⟨x_n x_{n+1}⟩ ≈ ∫₀^{(1−δ)/a} dx x(ax + δ) + Σ_{k=1}^{a−1} ∫_{(k−δ)/a}^{(k+1−δ)/a} dx x(ax + δ − k) + ∫_{1−δ/a}^{1} dx x(ax + δ − a)
             = 1/4 + (1 − 6δ(1 − δ))/12a .
The serial correlation coefficient is therefore given by
  ( ⟨x_n x_{n+1}⟩ − ⟨x_n⟩⟨x_{n+1}⟩ ) / ( ⟨x_n²⟩ − ⟨x_n⟩² ) ≈ (1 − 6δ(1 − δ))/a .  (28)
The error made by approximating a discrete set of points by a continuum is of order O(a/m). The serial correlation can be small if a is large.
• Lattice structure: k-tuples (x_n, x_{n+1}, x_{n+2}, . . . , x_{n+k}), viewed as points in a k-dimensional unit hypercube, are restricted to a number of regularly spaced parallel hyperplanes. For high k the number of planes can actually be quite small! The so-called spectral test — the determination of the distance between the hyperplanes — can indicate a good a in a given dimension.

[Figure: distribution of 3-tuples (x_n, x_{n+1}, x_{n+2}) for the linear congruential generator with m = 2²⁰ + 7, a = 1021, c = 1 and x₀ = 512. Both m and a are prime numbers, so the period is maximal. The plot shows the projection of the slice x_{n+2} < 0.3, for the first 40,000 3-tuples.]

3.5 Modern forms: the RCARRY algorithm

The register consists of (x_{n−1}, x_{n−2}, . . . , x_{n−r}) (which can be stored as integers in (0, 1, 2, . . . , B − 1)) and a carry bit c_{n−1}, associated with x_{n−1}. The algorithm is as follows:
  y ← x_{n−s} − x_{n−r} − c_{n−1}
  if y ≥ 0 then x_n ← y , c_n ← 0
  if y < 0 then x_n ← y + B , c_n ← 1
Recommended choice:
  s = 10 ,  r = 24 ,  B = 2²⁴ .
On most machines the floating-point representation is exact if we take B = 2²⁴.
Analysis: define z_n through
  z_n = Σ_{j=0}^{r−1} B^j x_{n−r+j} − Σ_{j=0}^{s−1} B^j x_{n−s+j} + c_{n−1} .  (29)
A single step of the algorithm is then equivalent to
  B z_{n+1} − z_n − m x_n = 0 ,  m ≡ B^r − B^s + 1 ,  (30)
i.e. we have a linear congruential generator for the numbers z:
  z_{n+1} = a z_n mod m ,  a = m − (m − 1)/B ,  so that aB mod m = 1 .  (31)
The choices recommended above rely on the fact that
  m = 2⁵⁷⁶ − 2²⁴⁰ + 1 is prime ,  (32)
so that the period is (m − 1)/48 ≈ 5 × 10¹⁷¹. Explicitly,
  a = 2⁵⁷⁶ − 2⁵⁵² − 2²⁴⁰ + 2²¹⁶ + 1 .  (33)

4 Algorithm testing

4.1 General strategy

• Theoretical tests (e.g. the spectral test) mostly address the whole period, and are not guaranteed to be meaningful for shorter streams.
• Empirical tests claim to look for various aspects of the (non)randomness of a given stream. They consist of computing a certain number using the given stream, and comparing it with the result that an equally long stream of truly random numbers would give. One needs its mean, variance, and so on, in order to assign a confidence level to the pseudo-random result. Example: divide the unit interval into M bins, count the occupancies of these bins, and calculate the χ².
• Philosophical glitch: a stream of truly random numbers will fail any test at the one-σ level in about one out of three cases. So, if a given test is failed, what does one conclude? Many proposed generators are claimed to pass all tests performed. The conclusion should be: such a generator is not a good model of truly random numbers!

[Figure: occupancies of the 100 bins for 100,000 points generated by RCARRY, connected by a solid line to guide the eye. The average occupancy is of course 1000, with fluctuations of the expected O(√1000). The computed value of χ² is 85.234, or 0.86 per degree of freedom, or 0.98 standard deviations. As far as this simple test is concerned, the stream looks acceptably random.]

[Figure: occupancies of the 100 bins for 100,000 points generated by CORPUT, connected by a solid line to guide the eye. The average occupancy is of course 1000, but the fluctuations are tiny. The computed value of χ² is 0.148, or 0.0015 per degree of freedom, off the mark by 7 standard deviations. We conclude that the stream generated by CORPUT is much too uniform to be considered random.]
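The binned χ² test described above can be sketched directly; a hedged illustration in which Python's built-in generator (a Mersenne Twister, not one of the generators discussed in these lectures) stands in for the stream under test:

```python
import random

# Equidistribution test: divide [0,1) into n_bins bins, fill with the stream,
# and compute χ² against the expected occupancy len(stream)/n_bins.
random.seed(7)

def chi2_uniformity(stream, n_bins):
    counts = [0] * n_bins
    for x in stream:
        counts[int(x * n_bins)] += 1
    expected = len(stream) / n_bins
    return sum((c - expected) ** 2 / expected for c in counts)

points = [random.random() for _ in range(100_000)]
chi2 = chi2_uniformity(points, 100)
print(chi2 / 99)  # per degree of freedom; should be O(1) for a random-looking stream
```

For 99 degrees of freedom, a truly random stream gives χ² ≈ 99 ± √198; values far below 99 signal an over-uniform stream, as for CORPUT above.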
5 Discrepancies and Quasi-Monte Carlo

5.1 The notion of non-uniformity of point sets

• Monte Carlo goes as 1/√N;
• other methods (e.g. the trapezoid rule in D = 1) go as 1/N²;
• the difference: the (non)uniformity of the point set — its discrepancy.

5.2 Star discrepancies

• Take any D-dimensional point set X = (x⃗₁, x⃗₂, . . . , x⃗_N): random, regular, or otherwise.
• Characteristic function ('step function'):
  θ(y⃗ > x⃗) = Π_{µ=1}^D θ(0 ≤ x^µ ≤ y^µ) .
• Local discrepancy of the point set at y⃗:
  L*(y⃗) = (1/N) Σ_{j=1}^N θ(y⃗ > x⃗_j) − Π_{µ=1}^D y^µ .  (34)

[Figure: local discrepancy for D = 2. The point y⃗ is at (0.55, 0.55), and the rectangle spanned by the origin and y⃗ contains 3 out of the 12 points. The local discrepancy is therefore 3/12 − (0.55)² = −0.0525.]

Measures of global discrepancy:
  L*_∞ ≡ sup_{y⃗} |L*(y⃗)|  (Kolmogorov statistic)
  L*₂ ≡ ∫ d^D y⃗ (L*(y⃗))²  (Cramér–von Mises statistic)
These are small if the point set is globally uniform. There is an explicit formula in terms of the point set only:
  L*₂ = (1/N²) Σ_{i,j=1}^N Π_{µ=1}^D (1 − max(x_i^µ, x_j^µ)) − (2/N) Σ_{i=1}^N Π_{µ=1}^D (1 − (x_i^µ)²)/2 + (1/3)^D ,  (35)
where µ = 1, 2, . . . , D labels the coordinates of the points.
Expected value for random points:
  ⟨L*₂⟩ = (2^{−D} − 3^{−D})/N .  (36)
For the 'trapezoid point set' (a regular hypercubic lattice) of M^D points:
  L*₂ = (1/3)^D [ (1 + 1/2M²)^D − 2(1 + 1/8M²)^D + 1 ] ≈ D/(4M² 3^D) .  (37)

5.3 Koksma–Hlawka inequalities

Variation of a function in one dimension: take an arbitrary partition of the unit interval,
  P : (0 ≡ z₀ < z₁ < z₂ < z₃ < · · · < z_{m−1} < z_m ≡ 1) ,
and form
  W(P) = Σ_{j=1}^m |f(z_j) − f(z_{j−1})| .  (38)
The supremum over all possible partitions is the variation:
  Var[f] = sup_P W(P) .  (39)
Koksma's inequality:
  | (1/N) Σ_{j=1}^N f(x_j) − ∫₀¹ dx f(x) | ≤ Var[f] × L*_∞(X) .  (40)
For D > 1 there is the Koksma–Hlawka inequality (a generalization).
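The explicit formula (35) for L*₂ can be implemented directly; a hedged sketch (the function name is mine, and the O(N²) double loop is the naive evaluation, not an optimized one):

```python
# Explicit L*₂ formula for a D-dimensional point set with coordinates in [0,1):
# (1/N²) Σ_ij Π_µ (1 - max(x_i^µ, x_j^µ))
#   - (2/N) Σ_i Π_µ (1 - (x_i^µ)²)/2 + (1/3)^D
def star_discrepancy_l2_squared(points):
    n = len(points)
    d = len(points[0])
    term2 = 0.0
    for x in points:
        prod = 1.0
        for xm in x:
            prod *= (1.0 - xm * xm) / 2.0
        term2 += prod
    term3 = 0.0
    for xi in points:
        for xj in points:
            prod = 1.0
            for a, b in zip(xi, xj):
                prod *= 1.0 - max(a, b)
            term3 += prod
    return term3 / n ** 2 - 2.0 * term2 / n + 3.0 ** (-d)

print(star_discrepancy_l2_squared([[0.25], [0.75]]))  # ≈ 1/48 for the 1D midpoint set
```

As a check, the one-dimensional midpoint set x_j = (2j − 1)/2N gives L*₂ = 1/(12N²): here N = 2, so 1/48.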
• Finding the variation of an integrand is much harder than finding its integral.
• For bounded but discontinuous functions in more than one dimension, the variation can be ∞.
• These inequalities do show that the notion of non-uniformity of point sets, as embodied in the discrepancy, is relevant, and they suggest establishing a more manageable way of talking about uniformity.

5.4 Diaphonies

Another (possibly better) measure of non-uniformity:
  T(X) = (1/N) Σ_{n⃗≠0⃗} σ²_{n⃗} | Σ_{j=1}^N exp(2iπ n⃗·x⃗_j) |²
       = (1/N) Σ_{j,k=1}^N β(x⃗_j − x⃗_k) ,
  β(x⃗) = Σ_{n⃗≠0⃗} σ²_{n⃗} exp(2iπ n⃗·x⃗) .
It is invariant under translations mod 1 of the point set. Its properties depend on the strengths σ²_{n⃗}. It is useful to choose the strengths such that ⟨T⟩ = 1:
  Σ_{n⃗≠0⃗} σ²_{n⃗} = 1 .  (41)
For D = 1:
  σ²_n = 3/(π²n²) ,  n ≠ 0 .  (42)
The corresponding function β(x) reads
  β(x) = Σ_{n≥1} (6/π²n²) cos(2πnx) = 1 − 6{x}(1 − {x}) ,  {x} = x mod 1 .  (43)
Generalization to D dimensions:
  σ²_{n⃗} = c_D Π_{µ=1}^D τ(n^µ) ,  τ(n) = max(1, |n|)^{−2} ,  c_D = [ (1 + π²/3)^D − 1 ]^{−1} .  (44)
Another choice (trying to approximate rotational invariance):
  σ²_{n⃗} = exp(−v n⃗²)/K(v)  ∀ n⃗ ,  K(v) = −1 + ( Σ_{n=−∞}^∞ e^{−vn²} )^D ,  (45)
with v > 0 a real parameter: the Jacobi diaphony. The β function is in this case given by
  β(x⃗) = −1 + (1/K(v)) Π_{µ=1}^D φ(v, x^µ) ,  (46)
where
  φ(v, x) = Σ_n exp(−vn² + 2iπnx) = √(π/v) Σ_n exp( −(π²/v)(n + x)² ) ,  (47)
  K(v) = −1 + φ(v, 0)^D .  (48)

[Figure: the function φ(v, x) as a function of x for v = 0.1, 0.3, 1. For decreasing values of v the function (which always integrates to unity) becomes more and more peaked at {x} ≈ 0: in the limit v → 0 we actually have φ(0, x) = (δ(x) + δ(1 − x))/2.]

5.5 Assessing point sets

The discrepancy or diaphony can be used as a test of whether a given point set 'looks random'. Relevant: the variance of T for random point sets,
  σ(T(X))² = 2 ∫ dx⃗ β(x⃗)² + O(1/N) = 2 Σ_{n⃗≠0⃗} σ⁴_{n⃗} + O(1/N) .  (49)
More completely:
  ⟨exp(z T(X))⟩ = exp(ψ(z)) ,  ψ(z) = Σ_{m≥1} ( (2z)^m / 2m ) Σ_{n⃗≠0⃗} σ^{2m}_{n⃗} .
From this we may construct the actual T(X) distribution H(t), the probability density for T(X) to have the value t > 0, and compare with the actual T for a given point set. It can be proven that for large dimensionality the T(X) distribution for random point sets approaches a Gaussian (a law of large number of degrees of freedom).

6 Quasi-random number generators

6.1 Low-discrepancy point sets

The message from Koksma–Hlawka: the lower the discrepancy of the point set, the more accurate the integral. This goes under the name of Quasi-Monte Carlo.
• Irregularity of distribution: the optimal point set of N points is not the optimal point set of N + 1 points!
• There is a difference between fixed-size point sets and streams of quasi-random numbers.
• Quasi-random points are not independent of one another, so we have to reconsider the error estimate.

6.2 Finite point sets: Korobov sets

The method of good lattice points: take a D-dimensional vector g⃗ with integer components. The set of N points is then defined by
  x_j^µ = ( j g^µ / N ) mod 1 ,  j = 1, 2, . . . , N ,  µ = 1, 2, . . . , D .  (50)
The discrepancy depends on g⃗. Each g^µ should be relatively prime to N. An example for D = 2, based on the Fibonacci numbers:
  F₁ = F₂ = 1 ,  F_n = F_{n−1} + F_{n−2} ,  n ≥ 3 ,  (51)
  g⃗ = (1, F_{n−1}) ,  N = F_n .  (52)

[Figure: the Fibonacci point set for n = 15, which contains 610 points. The low value of L* can be ascribed to the tilting of the regular trapezoidal grid, so that no two points are precisely aligned horizontally or vertically. The regularity of the grid, however, makes it dangerous for integrands with a similar pattern, and this can be picked out by other measures of non-uniformity such as T.]

6.3 Infinite streams: Richtmeyer sequences

A drawback of Korobov sets: the finiteness of the number of different points. Replace the rational factor g⃗/N by an irrational θ⃗. In D = 1,
  x_j = (jθ) mod 1 ,  j = 1, 2, . . . , N ,  (53)
is equidistributed, i.e. the discrepancy goes to zero as N approaches infinity. The speed depends on θ.
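The one-dimensional sequence x_j = (jθ) mod 1 is trivial to generate; a hedged sketch using θ = (√5 − 1)/2 (the equidistribution check at the end, counting points below 1/2, is a crude illustration of my own, not a discrepancy computation):

```python
import math

# Richtmeyer-type sequence x_j = (jθ) mod 1 with θ = (√5 - 1)/2.
theta = (math.sqrt(5.0) - 1.0) / 2.0

def irrational_rotation(theta, n):
    return [(j * theta) % 1.0 for j in range(1, n + 1)]

xs = irrational_rotation(theta, 1000)
# crude equidistribution check: fraction of points below 0.5
frac = sum(1 for x in xs if x < 0.5) / len(xs)
print(frac)  # very close to 0.5
```

For this θ the count of points in any interval deviates from N times the interval length only by O(log N), which is why the fraction lands so close to 1/2.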
Continued fraction representation:
  θ = 1/(a₁ + 1/(a₂ + 1/(a₃ + 1/(a₄ + · · ·)))) ,  (54)
where the a's are positive integers, the continued fraction coefficients. The set a₁, a₂, a₃, . . . is completely equivalent to a unique θ. If some a_k = ∞, θ is rational. The discrepancy is lower if the coefficients are smaller, making θ 'more irrational'. The 'most irrational' number has a_j = 1 for all j: θ = (−1 + √5)/2.

Continued fraction coefficients for some simple numbers:

  θ           a₁, a₂, a₃, a₄, . . .
  √2 − 1      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, . . .
  √3 − 1      1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, . . .
  √5 − 2      4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, . . .
  √6 − 2      2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, . . .
  √7 − 2      1, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 4, . . .
  √8 − 2      1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, . . .
  √10 − 3     6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, . . .
  √11 − 3     3, 6, 3, 6, 3, 6, 3, 6, 3, 6, 3, 6, . . .
  √12 − 3     2, 6, 2, 6, 2, 6, 2, 6, 2, 6, 2, 6, . . .
  √2 − 1/2    1, 10, 1, 1, 1, 10, 1, 1, 1, 10, 1, 1, . . .
  √3 − 1/2    4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, . . .
  π − 3       7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1, 1, 2, . . .
  π − 5/2     1, 1, 1, 3, 1, 3, 4, 73, 6, 3, 3, 2, 1, 3, . . .
  e − 2       1, 2, 1, 1, 4, 1, 1, 6, 1, 1, 8, 1, 1, 10, 1, . . .

[Figure: discrepancy as a function of N, 1 ≤ N ≤ 1000, for the sequences x_j = (jθ) mod 1 with θ = (√5 − 1)/2, θ = √37 mod 1, θ = π mod 1 and θ = e mod 1.]

Multidimensional sequences: Richtmeyer sequences,
  x_j^µ = (j θ^µ) mod 1 ,  (55)
where the θ^µ (µ = 1, 2, . . . , D) are irrational numbers that are all relatively prime to one another. A simple choice is θ^µ = √p_µ mod 1, where p_µ is the µth prime number. Below we give some results for sets of 2000 points in two dimensions.
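The coefficients in the table above can be recovered numerically; a hedged sketch (the function name is mine, and with double precision only the first several coefficients are reliable, since the rounding error grows at each division):

```python
import math

# Recover the continued fraction coefficients a_1, a_2, ... of an
# irrational number 0 < θ < 1: repeatedly invert and take integer parts.
def continued_fraction(theta, n_terms):
    coeffs = []
    x = theta
    for _ in range(n_terms):
        x = 1.0 / x
        a = int(x)
        coeffs.append(a)
        x -= a
    return coeffs

print(continued_fraction(math.sqrt(2.0) - 1.0, 8))  # → [2, 2, 2, 2, 2, 2, 2, 2]
print(continued_fraction(math.pi - 3.0, 4))         # → [7, 15, 1, 292]
```

The outputs match the table's rows for √2 − 1 and π − 3.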
[Figure: two-dimensional Richtmeyer point sets of 2000 points, for θ⃗ = (√2 mod 1, √3 mod 1) and for θ⃗ = (√2 mod 1, √5 mod 1).]

6.4 van der Corput sequences

In the previous section we discussed Richtmeyer sequences, and presented results for N = 1000, for one dimension, for various θ's. The 1000-point point set with the minimal discrepancy is given by
  x_j = (2j − 1)/2000 ,  j = 1, 2, . . . , 1000 .  (56)
In the following figure we plot 12N²L*₂ for the first N points of this set.

[Figure: 12N²L*₂ for the first N points of the set (56), 1 ≤ N ≤ 1000.]

The curve reaches a maximum of about 250,000 in the middle. This is due to the fact that, for N = 500, the interval (0, 1/2) contains precisely 500 points, and the interval (1/2, 1) precisely none. The value for the 500-point point set with the minimal discrepancy would be 1, and this value would be obtained for N = 500 if we had constructed the point set by defining x_j = ((2j − 1)/1000) mod 1. The order in which the points are generated is extremely important.

Consider a numbering system in base b: that is, given an integer b ≥ 2, we can write any natural number n as
  n = n₀ + n₁b + n₂b² + n₃b³ + · · · + n_k b^k ,  n < b^{k+1} .  (57)
The van der Corput transform of n in base b is then
  φ_b(n) = n₀b⁻¹ + n₁b⁻² + n₂b⁻³ + · · · + n_k b^{−k−1} .  (58)
The least significant digit of n becomes the most significant digit of φ_b(n), and therefore changes most rapidly, followed by the next digit, and so on. For b = 2, the first few van der Corput transforms are given here:

  n       0   1    2    3    4    5    6    7    8     9     10    11     12    13     14    15
  φ₂(n)   0   1/2  1/4  3/4  1/8  5/8  3/8  7/8  1/16  9/16  5/16  13/16  3/16  11/16  7/16  15/16

These numbers have been used in the CORPUT generator mentioned before. This sequence attempts to fill the unit interval in as uniform a way as possible; after 2^k points the interval is filled almost optimally, with the points {j/2^k , j = 0, 1, . . . , 2^k − 1}.
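The transform (58) is a digit reversal around the radix point, and can be sketched in a few lines (the function name is mine):

```python
# van der Corput transform φ_b(n): reverse the base-b digits of n
# around the radix point, as in eq. (58).
def van_der_corput(n, base=2):
    phi, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        phi += digit / denom
    return phi

print([van_der_corput(n) for n in range(1, 8)])
# → [0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]
```

The printed values reproduce the b = 2 table above for n = 1, . . . , 7.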
The L*₂ in that case is 1/(3N²), almost optimal¹.
¹ For point sets defined by x_j = (2j − α)/(2N), j = 1, 2, . . . , N and −1 ≤ α ≤ 1, the discrepancy is given by L*₂ = (1 + 3(1 − α)²)/(12N²).

[Figure: the 'normalized' extreme discrepancy NL*∞ for the first N van der Corput numbers with base b = 2, for N from 1 to 2¹¹. Note the fractal structure: if we plot only the values for N a multiple of 2^c, we obtain exactly the same plot as that for N up to 2^{11−c}.]

[Figure: the 'normalized' extreme discrepancy NL*∞ for the first N van der Corput numbers with base b = 5, for N from 1 to 5⁵. The fractal structure is in this case of a 5-fold nature.]

In more dimensions, define the vector sequence x⃗_j by choosing a vector b⃗ with natural components, and defining
  x_j^µ = φ_{b^µ}(j) ,  (59)
where the bases b^µ, µ = 1, 2, . . . , D, are all relatively prime.

[Figure: the van der Corput points (φ₂(n), φ₃(n)) for 1 ≤ n ≤ 1000.]

[Figure: the van der Corput points (φ₂(n), φ₄(n)) for 1 ≤ n ≤ 1000. This plot shows why the bases of the axes should be relatively prime to one another.]

[Figure: the van der Corput points (φ₁₇(n), φ₁₉(n)) for 1 ≤ n ≤ 1000. The onset of uniformity is seen to be quite slow in this case, due to the fact that (a) the primes 17 and 19 are 'large' for N = 1000 (1000 ∼ 2 × 19²), and (b) they are close to one another.]

Here we give the normalized discrepancy NL*₂ as a function of N, for the two-dimensional van der Corput sequence with b⃗ = (2, 3) (lower curve), and for the RCARRY algorithm (upper curve). The straight line at 2⁻² − 3⁻² = 5/36 is the expected value of NL*₂ for truly random points.
The NL*₂ for the pseudo-random set from RCARRY shows appreciable fluctuations around the expected value, while that for the quasi-random set from CORPUT is steadily decreasing.

[Figure: the same comparison of NL*₂ as a function of N, in dimensions D = 3, 4, 5, 6, 8 and 10.]

6.5 Niederreiter sequences

In these sequences one may use b = 2 for all coordinates: we have
  x⃗_j = ( φ₂(p₁(j)) , φ₂(p₂(j)) , . . . , φ₂(p_D(j)) ) ,  j = 1, 2, 3, . . . ,  (60)
where the functions p_µ(j) are cleverly chosen permutations. The discrepancy can go to 0 rapidly as N → ∞.

6.6 Error estimates revisited

In standard Monte Carlo, the expectation value of the squared error decreases as 1/N under the assumption that the points are iid. For Quasi-Monte Carlo this is explicitly not true: indeed, the various points 'know' about each other's position. In the hypercube, the combined probability density for truly random points is
  P(x⃗₁, x⃗₂, . . . , x⃗_N) = 1 .  (61)
Quasi-random points are not independent; their combined probability density is
  P(x⃗₁, x⃗₂, . . . , x⃗_N) = 1 − (1/N) F_N(x⃗₁, x⃗₂, . . . , x⃗_N) ,  (62)
where the factor −1/N is a convention. Clearly, F_N must be symmetric in all its arguments. Moreover, let us define
  F_k(x⃗₁, x⃗₂, . . . , x⃗_k) = ∫₀¹ dx⃗_{k+1} dx⃗_{k+2} · · · dx⃗_N F_N(x⃗₁, x⃗₂, . . . , x⃗_N)  (63)
for k < N. If the point set is to be of any use at all, we must have F₁(x⃗) = 0, otherwise the integral estimate is biased. Assume the combined probability density of the points to be translation-invariant modulo 1:
  F_N(x⃗₁ + y⃗, . . . , x⃗_N + y⃗) = F_N(x⃗₁, . . . , x⃗_N)  ∀ y⃗ ,  (64)
so that the combined probability density of a pair of points depends only on their difference:
  P(x⃗₁, x⃗₂) = 1 − (1/N) F₂(x⃗₁ − x⃗₂) .  (65)
The integral estimator is
  E₁ = (1/N) Σ_{j=1}^N f(x⃗_j) ,  (66)
and we still find ⟨E₁⟩ = J₁. The expectation value of its square is
  ⟨E₁²⟩ = (1/N²) [ N J₂ + N₂ ( J₁² − (1/N) ∫ffF₂ ) ] ,  ∫ffF₂ ≡ ∫ dx⃗₁ dx⃗₂ f(x⃗₁) f(x⃗₂) F₂(x⃗₁ − x⃗₂) ,
so the squared error is
  σ(E₁)² = (1/N) [ J₂ − J₁² − ((N − 1)/N) ∫ffF₂ ] .  (67)
To actually estimate the squared error, we also have to modify E₂. It turns out that the definition that gives ⟨E₂⟩ = σ(E₁)² is of the form
  E₂ = Σ_{j,k=1}^N f(x⃗_j) f(x⃗_k) [ δ_{j,k} ( 1/N² + F₂(0⃗)/N³ + 1/(N N₂) ) − 1/(N N₂) − (1/N³) F₂(x⃗_j − x⃗_k) ] .  (68)
Unfortunately, in this case we do have to perform O(N²) sums. It is therefore customary, even if not justified, to employ the standard form of E₂. An alternative method is the following. Suppose that a set of 100,000 quasi-random points, say, is used in a calculation. We may then collect the weights in 100 groups of 1,000 each, and study the distribution of the 100 group averages in addition to their overall average, in order to arrive at an error estimate.

7 Phase space algorithms

7.1 General

One of the fundamental integrations in particle phenomenology: transition rates over phase space. Phase space integration element:
  dV(1, . . . , n; s) ≡ Π_{j=1}^n d⁴p_j δ(p_j² − m_j²) θ(p_j⁰) δ³( Σ_{j=1}^n p⃗_j ) δ( √s − Σ_{j=1}^n p_j⁰ ) .  (69)
Phase space volume:
  V(1, . . . , n; s) = ∫ dV(1, . . . , n; s) .

7.2 Hierarchical method

Based on the 2-particle split-up: insert
  d⁴Q du δ(Q² − u) δ⁴( Σ_{j=2}^n p_j − Q ) ,  (70)
so that the 1 → n process is split up into 1 → 2 followed by 1 → (n − 1):
  V(1, . . . , n; s) = ∫ du V(1, Q; s) V(2, . . . , n; u) .
Repeating this gives a cascade of 1 → 2 processes. Problem: the distribution of the u's, and the ordering of the u's. Old algorithm: FOWL, by F. James.

7.3 The RAMBO algorithm (RAndom Momenta BOoster)

Valid in the ultra-relativistic limit. Phase space element:
  Π_{j=1}^n d⁴p_j δ(p_j²) θ(p_j⁰) δ³(P⃗) δ(P⁰ − √s) ,  (71)
where
  P^µ = Σ_{j=1}^n p_j^µ ,  s = P² .  (72)
• Obtain four-momenta that satisfy the constraints,
• ensure that the whole phase space is covered,
• with as uniform a density as possible.
Start from
  1 = (2πw²)^{−n} ∫ Π_{j=1}^n d⁴q_j δ(q_j²) θ(q_j⁰) exp(−q_j⁰/w) .  (73)
The algorithm is as follows:
• k^µ = Σ_j q_j^µ. In general k is not at rest, nor does its square equal s.
• Perform on every q_j the Lorentz boost Λ which brings k^µ to its rest frame.
• Apply a scaling transform to correct the overall invariant mass.
• The resulting vectors are denoted by p_j.
Formally:
  1 = (2πw²)^{−n} ∫ Π_{j=1}^n d⁴q_j δ(q_j²) θ(q_j⁰) e^{−q_j⁰/w} d⁴k δ⁴( k − Σ_j q_j ) dx δ( x − √(k²)/√s ) Π_{j=1}^n d⁴p_j δ⁴( p_j − Λ(q_j)/x ) .
The scaling factor x runs from 0 to ∞. Owing to Lorentz invariance,
  δ⁴( p_j − Λ(q_j)/x ) δ(q_j²) = x² δ⁴( q_j − x Λ⁻¹(p_j) ) δ(p_j²) .  (74)
Furthermore,
  δ⁴( k − Σ_j q_j ) δ( x − √(k²)/√s ) = (2s/x³) δ³(P⃗) δ(P⁰ − √s) δ(k² − x²s) .
The formula can then be written as
  1 = N ∫ Π_{j=1}^n d⁴p_j δ(p_j²) θ(p_j⁰) δ³(P⃗) δ(P⁰ − √s) ,  (75)
where
  N = 2s (2πw²)^{−n} ∫₀^∞ dx ∫ d⁴k x^{2n−3} exp(−k⁰/w) δ(k² − x²s) .  (76)
This gives the volume of phase space in the ultra-relativistic limit:
  V_UR = N⁻¹ = (π/2)^{n−1} s^{n−2} / ( Γ(n) Γ(n − 1) ) .  (77)
We see that the RAMBO algorithm is essentially the best possible: the whole of phase space is covered uniformly. The energy scale w drops out, as it should, and therefore in practice we use w = 1.

7.4 Inclusion of particle masses

• Particles may be relatively light, but not massless. Assume that a 'massless' phase space point has been generated according to RAMBO:
• scale the three-momenta down,
• adjust the energies so as to build in the masses,
• while keeping energy conservation intact.
• Determine the unique root ξ₀ of
  F(ξ) = −√s + Σ_{j=1}^n √( ξ²(q_j⁰)² + m_j² ) .  (78)
• Replace q_j^µ by p_j^µ:
  p⃗_j ← ξ q⃗_j ,  p_j⁰ ← √( p⃗_j² + m_j² ) .  (79)
• Phase space is now not filled uniformly, but rather with a fluctuating density.
• This implies that we have to assign to each generated phase space point a weight
  w(p₁, p₂, . . . , pₙ) = V_UR ( Σ_{j=1}^n |p⃗_j|/√s )^{2n−3} ( Σ_{j=1}^n |p⃗_j|²/p_j⁰ )^{−1} √s Π_{j=1}^n |p⃗_j|/p_j⁰ .  (80)
• For all masses equal, m_j = m, the maximum weight is expected to be
  w_max = ( 1 − n²m²/s )^{(3n−5)/2} .  (81)

[Figure: distributions of the weight, for n = 3, 6, 10 and r = 0.3, 0.6, 0.9.]
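The massless RAMBO steps described in section 7.3 can be sketched compactly. This is a hedged illustration, not the reference implementation: the component ordering (E, pₓ, p_y, p_z), the function name and the explicit boost parametrization are my choices, with q⁰ drawn from the density q⁰ e^{−q⁰} implied by (73) (w = 1):

```python
import math
import random

# Massless RAMBO sketch: generate n isotropic lightlike vectors q_j with
# energies ~ q0·exp(-q0), boost them to the rest frame of k = Σ q_j, and
# scale so that the total invariant mass equals √s.
random.seed(3)

def rambo_massless(n, sqrt_s):
    qs = []
    for _ in range(n):
        c = 2.0 * random.random() - 1.0                    # cosθ, isotropic
        phi = 2.0 * math.pi * random.random()
        e = -math.log(random.random() * random.random())   # q0 ~ q0·exp(-q0)
        st = math.sqrt(1.0 - c * c)
        qs.append([e, e * st * math.cos(phi), e * st * math.sin(phi), e * c])
    k = [sum(q[i] for q in qs) for i in range(4)]
    mk = math.sqrt(k[0] ** 2 - k[1] ** 2 - k[2] ** 2 - k[3] ** 2)
    b = [-k[1] / mk, -k[2] / mk, -k[3] / mk]               # boost to k's rest frame
    g = k[0] / mk
    a = 1.0 / (1.0 + g)
    x = sqrt_s / mk                                        # scaling factor
    ps = []
    for q in qs:
        bq = b[0] * q[1] + b[1] * q[2] + b[2] * q[3]
        e = x * (g * q[0] + bq)
        ps.append([e] + [x * (q[i + 1] + b[i] * q[0] + a * bq * b[i]) for i in range(3)])
    return ps

momenta = rambo_massless(4, 100.0)
total = [sum(p[i] for p in momenta) for i in range(4)]
print(total)  # total four-momentum ≈ (100, 0, 0, 0)
```

By construction each output vector stays lightlike and the constraints δ³(P⃗) δ(P⁰ − √s) are satisfied up to floating-point rounding; for a massive-particle sketch one would add the rescaling of section 7.4 with the weight (80).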