Monte Carlo Integration

Ronald Kleiss ([email protected])
IMAPP, Radboud University, Nijmegen, the Netherlands

NIKHEF Academic lectures, 7/12/07 - 10/12/07
1 Ideology
• Any numerical procedure in which the outcome depends on at least one
random variable is a Monte Carlo integral.
• Numerical answer to a given problem = R(r1, r2, . . . , rn), involving random
variables r1, r2, . . . , rn, each drawn from its own domain A1, A2, . . . , An, with
a combined probability density P(r1, r2, . . . , rn).
• Expected answer is given by
$$\int_{A_1} \int_{A_2} \cdots \int_{A_n} R(x_1, x_2, \ldots, x_n)\, P(x_1, x_2, \ldots, x_n)\; dx_1\, dx_2 \cdots dx_n$$
• The number n of random variables may be huge.
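As a minimal illustration of this recipe in Python, for a single random variable (the integrand x² and the number of points are arbitrary choices for this sketch, not from the lectures):

    import random

    # Sketch: estimate the integral of f over [0,1) as the average of f
    # over N uniformly distributed random points.
    def mc_integrate(f, n_points):
        return sum(f(random.random()) for _ in range(n_points)) / n_points

    # The exact value of the integral of x^2 over [0,1) is 1/3.
    print(mc_integrate(lambda x: x * x, 100_000))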
2 Probability theory and Monte Carlo estimators
2.1 Random numbers and probability densities
• A single number is never random
• We can only discuss (potentially) infinite sequences of random numbers: a
stream.
• Promised or assumed that the relative frequency with which numbers fall
into given intervals takes on definite values.
• The basic property of randomness: no matter how much information we have about
the number stream up to xN, we are unable to predict xN+1 better than the relative
frequency allows: the probability.
• A random variable r has probability density P(r) if the probability to find r in
[x, x + dx] is P(x)dx (infinitesimal dx). In general −∞ < r < ∞; if r is bounded,
P(x) = 0 outside the range.
• Discrete random variables have a probability density consisting of Dirac delta
functions. In all cases,
$$P(x) \ge 0\ , \qquad \int dx\, P(x) = 1\ . \tag{1}$$
• Combined probability densities P(x1, x2, . . . , xk) of k real variables, by
trivial extension. If
P(x1, x2, . . . , xk) = P1(x1)P2(x2) · · · Pk(xk) ,
(2)
the random variables x1, x2, . . . , xk are independent.
• If moreover Pi(x) = Pj(x) for all i, j in 1, 2, . . . , k, the random variables are
independent, identically distributed, or iid. A perfect source of random numbers is
supposed to deliver iid variables.
2.2 Expectation values
• Average value of f(x) sampled over very many values of x:
$$\langle f(x) \rangle = \int dx\, P(x)\, f(x)\ . \tag{3}$$
• Variance (square of the standard deviation):
$$\sigma(f(x))^2 = \left\langle f(x)^2 \right\rangle - \langle f(x) \rangle^2\ . \tag{4}$$
• Moments of the density P(x) are expectation values of powers of x:
$$M_n \equiv \langle x^n \rangle = \int dx\, P(x)\, x^n \qquad (n = 0, 1, 2, 3, \ldots)\ . \tag{5}$$
• Characteristic function:
$$\phi(z) = \sum_{n \ge 0} \frac{(iz)^n}{n!}\, M_n = \int dx\, P(x)\, e^{izx} = \left\langle e^{izx} \right\rangle\ . \tag{6}$$
Note that φ(0) = 1, φ′(0) = iM1, and φ′′(0) = −M2.
2.3 Chebyshev-Bienaymé theorem
Take a density P(x) with finite mean m and variance σ², and a > 0. Then
$$\sigma^2 = \int dx\, P(x)\, (x - m)^2 \ge \int dx\, P(x)\, (x - m)^2\, \theta(|x - m| \ge a) \ge a^2 \int dx\, P(x)\, \theta(|x - m| \ge a) = a^2\, \mathrm{Prob}(|x - m| \ge a)\ ,$$
$$\mathrm{Prob}(|x - m| \le a) \ge 1 - \frac{\sigma^2}{a^2}\ . \tag{7}$$
2.4 Central limit theorem
Let x1, x2, . . . , xN be N iid random variables with density P(x) and φ(z) = ⟨exp(izx)⟩, and define
$$\xi = \frac{1}{N} \sum_{j=1}^{N} x_j\ . \tag{8}$$
Its density is
$$\Pi(\xi) = \int dx_1 \cdots dx_N\, P(x_1) \cdots P(x_N)\; \delta\!\left(\xi - \frac{1}{N}(x_1 + \cdots + x_N)\right)\ , \tag{9}$$
with characteristic function
$$\Phi(z) = \int d\xi\, \Pi(\xi)\, e^{iz\xi} = \left(\int dx\, P(x)\, e^{izx/N}\right)^{\!N} = \phi\!\left(\frac{z}{N}\right)^{\!N}\ ,$$
$$N \log \phi\!\left(\frac{z}{N}\right) \approx N \log\!\left(1 + \frac{iz}{N} M_1 - \frac{z^2}{2N^2} M_2 + \cdots\right) \approx N \left(\frac{iz}{N} M_1 + \frac{z^2}{2N^2} M_1^2 - \frac{z^2}{2N^2} M_2 + \cdots\right) \approx izm - \frac{z^2}{2N}\, \sigma^2 + \cdots \tag{10}$$
so that, for large N,
$$\Pi(\xi) = \sqrt{\frac{N}{2\pi\sigma^2}}\; \exp\!\left(-\frac{N}{2\sigma^2}\, (\xi - m)^2\right)\ .$$
[Plots: approach to the Gaussian for P(x) = exp(−x)θ(x), for N = 5 and N = 15.]
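A quick numerical check of this (a sketch; the sample sizes are illustrative): P(x) = exp(−x)θ(x) has m = 1 and σ² = 1, so ξ should acquire mean 1 and variance 1/N.

    import random

    # Sketch: sample xi = (x1 + ... + xN)/N for exponentially distributed x_j
    # and compare its variance with the Gaussian prediction sigma^2/N = 1/N.
    def sample_xi(N):
        return sum(random.expovariate(1.0) for _ in range(N)) / N

    for N in (5, 15):
        xis = [sample_xi(N) for _ in range(100_000)]
        mean = sum(xis) / len(xis)
        var = sum((x - mean) ** 2 for x in xis) / len(xis)
        print(N, mean, var, 1.0 / N)   # the variance should be close to 1/N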
2.5 Confidence levels
Probability that |a − A| ≤ kσ:

  k     Chebyshev   Gaussian
  0.5   ≥ 0         0.384
  1.0   ≥ 0         0.684
  1.5   ≥ 0.556     0.866
  2.0   ≥ 0.750     0.955
  2.5   ≥ 0.840     0.988
  3.0   ≥ 0.889     0.997
2.6 Integral estimator: bias and convergence
$$J_m = \int_0^1 dx\, f(x)^m\ , \qquad m = 1, 2, 3, \ldots\ . \tag{11}$$
The desired integral is J1. Estimate this integral using a stream of iid random
numbers xj, (j = 1, 2, 3, 4, . . .) uniform in [0, 1). Define
$$S_m = \sum_{j=1}^{N} (w_j)^m\ , \qquad w_j \equiv f(x_j)\ . \tag{12}$$
The xj are events, the wj are the weights. Then
$$\left\langle (w_j)^m \right\rangle = \int_0^1 dx\, f(x)^m = J_m\ . \tag{13}$$
The Monte Carlo estimator of the integral J1 is
$$E_1 = \frac{1}{N}\, S_1\ . \tag{14}$$
• The expectation value of E1 is
$$\langle E_1 \rangle = \frac{1}{N} \langle S_1 \rangle = \frac{1}{N} \sum_{j=1}^{N} \langle w_j \rangle = J_1\ , \tag{15}$$
so the Monte Carlo estimate is unbiased.
• The variance of E1 is
$$\sigma(E_1)^2 = \left\langle E_1^2 \right\rangle - \langle E_1 \rangle^2 = \frac{1}{N^2} \left\langle S_1^2 \right\rangle - J_1^2 = \frac{1}{N} \left(J_2 - J_1^2\right)\ . \tag{16}$$
Chebyshev: the Monte Carlo estimate converges for N → ∞ provided J2 < ∞.
• Convergence ‘only’ as 1/√N, but...
• Valid in all dimensions!
2.7 First and second order error estimators
The estimator of σ(E1)² is
$$E_2 = \frac{1}{N_2} \left(S_2 - \frac{1}{N}\, S_1^2\right) \approx \frac{1}{N^3} \left(N S_2 - S_1^2\right)\ , \tag{17}$$
where Nk = N(N − 1)(N − 2) · · · (N − k + 1) = N!/(N − k)!.
This is the correct estimator:
$$\langle E_2 \rangle = \sigma(E_1)^2\ .$$
E2 has its own uncertainty (important for confidence levels!):
$$\sigma(E_2)^2 = \frac{1}{N^3} \left(J_4 - 4 J_3 J_1 + 3 J_2^2\right) + \left(\frac{N_4}{(N_2)^2} - 1\right) \frac{1}{N^2} \left(J_2 - J_1^2\right)^2 = O\!\left(N^{-3}\right)\ .$$
The estimator for σ(E2)² is
$$E_4 = \frac{1}{N^3} \left(\frac{1}{N}\, S_4 - \frac{4}{N^2}\, S_3 S_1 + \frac{3}{N^2}\, S_2^2\right) + \left(\frac{N_4}{(N_2)^2} - 1\right) \frac{1}{N^2} \left(\frac{1}{N}\, S_2 - \frac{1}{N^2}\, S_1^2\right)^2$$
$$\approx \frac{1}{N^7} \left(N^3 S_4 - 4 N^2 S_3 S_1 - N^2 S_2^2 + 8 N S_2 S_1^2 - 4 S_1^4\right)\ ,$$
...and so on to E8, E16, E32, . . .
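In code, the estimators read as follows (a sketch using the large-N approximations of (17) and of the E4 formula above):

    # Sketch: Monte Carlo estimate E1 with its first- and second-order error
    # estimators, computed from the list of weights w_j = f(x_j).
    def mc_estimators(weights):
        N = len(weights)
        S1 = sum(weights)
        S2 = sum(w ** 2 for w in weights)
        S3 = sum(w ** 3 for w in weights)
        S4 = sum(w ** 4 for w in weights)
        E1 = S1 / N
        E2 = (N * S2 - S1 ** 2) / N ** 3
        E4 = (N ** 3 * S4 - 4 * N ** 2 * S3 * S1 - N ** 2 * S2 ** 2
              + 8 * N * S2 * S1 ** 2 - 4 * S1 ** 4) / N ** 7
        return E1, E2, E4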
2.8 Examples
$$f_\alpha(x) = (1 + \alpha)\, x^\alpha\, \theta(0 < x \le 1)\ , \qquad J_1 = 1\ . \tag{18}$$
Jm is defined for α > −1/m. Estimates for various values of α; 1 ≤ N ≤ 2^20.
E2 and E4 are multiplied by powers of N so as to approach constants.
[Plots: the estimates as functions of log2 N, for α = 1.0, α = −0.1, α = −0.3, α = −0.7, and α = −0.999.]
   α       E1        E2         ⟨E2⟩       E4         ⟨E4⟩       |1 − E1|/σ
   1       0.9990    3.178E-7   3.179E-7   7.716E-20  7.710E-20  1.858
  −0.1     1.000     1.193E-8   1.192E-8   2.209E-21  2.281E-21  1.657
  −0.3     1.000     2.114E-7   2.146E-7   1.918E-16  —          1.170
  −0.7     0.9903    1.287E-4   —          5.260E-9   —          0.8474
  −0.999   1.589E-2  6.354E-6   —          2.612E-11  —          6.203
3 Pseudo-random number generators
3.1 Digital vs. analog methods
Against ‘natural’ random numbers:
• Randomness
• Speed
• Reproducibility
Consensus nowadays: streams produced by a simple, repeatable computer algorithm. These
are not truly random, but the requirement of true randomness is usually much more than
is actually needed: in a simple integration a uniform distribution is sufficient.
Computer-generated number streams are called pseudo-random¹.

¹ From the Greek, ψεύδειν = ‘to lie’.
3.2 The ensemble of random-number algorithms
Generate streams of integers in (1, 2, 3, . . . , M), with M ‘large’.
• Many actual random number generators work internally with integers
• On output scale to (0, 1) by dividing by M.
• No rounding errors internally
• Precision of numbers is constant over the interval
• Algorithms that produce a new number in the stream using only the last one
produced:
nj+1 = f(nj) , j = 1, 2, . . .
• The algorithm f(n) is completely specified by the set of its results:
f(n) ↔ {f(1), f(2), f(3), . . . , f(M − 1), f(M)}
There are exactly M^M different algorithms in the ensemble.
• Starting value n1; n2 = f(n1), n3 = f(n2), . . .
• As soon as a number reappears, the stream starts to cycle and becomes useless.
• The length of the stream up to the first repetition is the lifetime of the algorithm.
The maximum lifetime is M; we want it as long as possible.
• Decide on a starting value n1, choose an algorithm at random
• Probability that n2 ≠ n1 is (1 − 1/M)
• Probability that n3 ≠ n1 and n3 ≠ n2 is (1 − 1/M)(1 − 2/M), and so on.
• Suppose that n1, . . . , np are all different, but np+1 is a reappearing number: the lifetime is p.
• The probability to pick an algorithm with lifetime p is
$$Q(p) = \frac{M!}{(M - p)!\, M^{p+1}}\, p = A(p) - A(p + 1)\ , \qquad A(p) = \frac{M_p}{M^p}\ ,$$
where A(p) is the probability to pick an algorithm with a lifetime ≥ p, and
Mp = M!/(M − p)! as before.
• Σp Q(p) = A(1) = 1.
• Expected lifetime:
$$\langle p \rangle = \sum_{p=1}^{M} p\, Q(p) = \sum_{p=1}^{M} A(p) = \int_0^\infty dx\, e^{-x} \sum_{p=1}^{M} \frac{x^p}{p!} \frac{M_p}{M^p} = \int_0^\infty dx\, e^{-x} \left[\left(1 + \frac{x}{M}\right)^{\!M} - 1\right] = \sqrt{\frac{\pi M}{2}} - \frac{1}{3} + O\!\left(\frac{1}{\sqrt{M}}\right)\ .$$
• Approximate form:
$$Q(p) \approx \frac{p}{M} \exp\!\left(-\frac{p^2}{2M}\right)\ , \tag{19}$$
so the probabilities are a function of p/√M: the lifetime is typically much smaller
than M.
• The probability to hit upon lifetime M by coincidence is
$$Q(M) = \frac{M!}{M^M} \approx e^{-M} \sqrt{2\pi M}\ . \tag{20}$$
• Upshot: pseudo-random number algorithms must be chosen very carefully.
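The lifetime statistics above can be checked empirically (a sketch; since the image of each fresh argument of a randomly chosen f is an independent uniform draw, f need not be stored):

    import math, random

    # Sketch: measure the lifetime of a randomly chosen algorithm on {1,...,M},
    # i.e. the number of stream values before the first repetition.
    def lifetime(M):
        seen = {1}                        # starting value n1 = 1
        while True:
            n = random.randint(1, M)      # f of a fresh argument: uniform draw
            if n in seen:
                return len(seen)
            seen.add(n)

    M = 10_000
    mean = sum(lifetime(M) for _ in range(2_000)) / 2_000
    print(mean, math.sqrt(math.pi * M / 2) - 1 / 3)   # compare with <p>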
• There are also algorithms using a larger ‘register’:
$$n_j = f(n_{j-1}, n_{j-2}, \ldots, n_{j-k})$$
– Not necessarily the full register is used.
– After nj has been generated, the register must be shifted.
The analysis is the same as before, since a set of k numbers in (1, . . . , M) is equivalent
to a single number taking values in (1, . . . , M^k): simply replace M by M^k.
Max. lifetime = M^k.
3.3 Obsolete algorithms
3.3.1 The mid-square method
$$n_{j+1} = \left\lfloor \frac{n_j^2}{10^k} \right\rfloor \bmod 10^k\ . \tag{21}$$
• Typical example of a ‘random algorithm’;
• difficult to analyze, depends crucially on the starting value,
• performs poorly for lifetimes.
[Plot: lifetimes of the mid-square algorithm for decimal 4-digit numbers, as a function of
the starting value. The longest lifetime is a disappointing 111 for starting value 6239, not
even the ‘expected’ 125 for M = 10,000.]
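A sketch of the generator and its lifetime:

    # Sketch: the mid-square algorithm (21) for k-digit decimal numbers,
    # and its lifetime (number of values before the first repetition).
    def midsquare_lifetime(seed, k=4):
        mod = 10 ** k
        seen = set()
        n = seed
        while n not in seen:
            seen.add(n)
            n = (n * n // mod) % mod
        return len(seen)

    print(midsquare_lifetime(6239))   # the record starting value quoted above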
3.3.2 The logistic map
$$x_{j+1} = 4 x_j (1 - x_j) \tag{22}$$
is supposed to produce a really chaotic sequence of values. However, writing
$$x_j = (\sin \theta_j)^2\ , \qquad 0 \le \theta_j \le \pi/2\ , \tag{23}$$
it becomes equivalent to
$$\theta_{j+1} = 2\theta_j \bmod (\pi/2)\ . \tag{24}$$
[Plot: the distribution of lifetimes of the logistic-map algorithm (22), given as a function
of the starting value. The numbers have been truncated to an accuracy of 10^−5 in the
algorithm, so that the maximum lifetime is potentially 10^5; different truncations produce
essentially the same plot, with a maximum lifetime of the order of the square root of the
maximum.]
3.4 Linear congruential generators
xn+1 = (axn + c) mod m ,
(25)
where a, c and m are predetermined integers.
• The maximum possible lifetime m is attained if
1. c is relatively prime to m;
2. every prime factor of m also divides b ≡ a − 1;
3. if m is a multiple of 4, b is also a multiple of 4.
If c = 0 is chosen, the maximal period is smaller than m, but can be made
quite large. m should be very large compared to the number of numbers
used.
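In code (a sketch; the constants are those of the 3-tuple plot further below):

    # Sketch: a linear congruential generator, eq. (25), scaled to [0, 1).
    def lcg(m=2 ** 20 + 7, a=1021, c=1, seed=512):
        x = seed
        while True:
            x = (a * x + c) % m
            yield x / m

    gen = lcg()
    print([next(gen) for _ in range(5)])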
• Approximate real-number relation:
$$x_{n+1} = a x_n + \delta - k\ , \qquad \delta = \frac{c}{m}\ , \tag{26}$$
where k is a natural number such that
$$\frac{k - \delta}{a} \le x_n < \frac{k + 1 - \delta}{a}\ ,$$
so that
$$\langle x_n \rangle \approx \frac{1}{2}\ , \qquad \left\langle x_n^2 \right\rangle \approx \frac{1}{3}\ , \tag{27}$$
$$\langle x_n x_{n+1} \rangle \approx \int_0^{(1-\delta)/a} dx\, x (a x + \delta) + \sum_{k=1}^{a-1} \int_{(k-\delta)/a}^{(k+1-\delta)/a} dx\, x (a x + \delta - k) + \int_{1-\delta/a}^{1} dx\, x (a x + \delta - a) = \frac{1}{4} + \frac{1}{12 a} \left(1 - 6 \delta (1 - \delta)\right)\ .$$
The serial correlation coefficient is given by
$$\frac{\langle x_n x_{n+1} \rangle - \langle x_n \rangle \langle x_{n+1} \rangle}{\left\langle x_n^2 \right\rangle - \langle x_n \rangle^2} \approx \frac{1 - 6 \delta (1 - \delta)}{a}\ . \tag{28}$$
The error made by approximating a discrete set of points by a continuum is of order
O(a/m). The serial correlation can be small if a is large.
• Lattice structure: the k-tuples (xn, xn+1, xn+2, . . . , xn+k) form points in a k-dimensional
unit hypercube, restricted to a number of regularly spaced parallel hyperplanes. For high
k the number of planes can actually be quite small! The so-called spectral test, the
determination of the distance between the hyperplanes, can indicate good a in a given
dimension.
[Plot: distribution of 3-tuples (xn, xn+1, xn+2) for the linear congruential generator with
m = 2^20 + 7, a = 1021, c = 1 and x0 = 512. Both m and a are prime numbers, so the
period is maximal. The plot shows the projection of the slice xn+2 < 0.3, for the first
40,000 3-tuples.]
3.5 Modern forms: RCARRY algorithm
The register consists of (xn−1, xn−2, . . . , xn−r) (which can be stored as integers in
(0, 1, 2, . . . , B − 1)) and a carry bit cn−1, associated with xn−1. The algorithm is as follows:
$$y \leftarrow x_{n-s} - x_{n-r} - c_{n-1}$$
$$\text{if } y \ge 0 \text{ then } x_n \leftarrow y\ , \quad c_n \leftarrow 0$$
$$\text{if } y < 0 \text{ then } x_n \leftarrow y + B\ , \quad c_n \leftarrow 1\ .$$
Recommended choice:
$$s = 10\ , \qquad r = 24\ , \qquad B = 2^{24}\ .$$
On most machines, the floating-point representation is exact if we take B = 2^24.
Analysis: define zn through
$$z_n = \sum_{j=0}^{r-1} B^j x_{n-r+j} - \sum_{j=0}^{s-1} B^j x_{n-s+j} + c_{n-1}\ . \tag{29}$$
A single step of the algorithm is then equivalent to
$$B z_{n+1} - z_n - m x_n = 0\ , \qquad m \equiv B^r - B^s + 1\ . \tag{30}$$
We have a linear congruential generator for the numbers z:
$$z_{n+1} = a z_n \bmod m\ , \qquad a = m - \frac{m - 1}{B}\ , \text{ so that } aB \bmod m = 1\ . \tag{31}$$
The choices recommended above rely on the fact that
$$m = 2^{576} - 2^{240} + 1 \text{ is prime}\ , \tag{32}$$
so the period is (m − 1)/48 ≈ 5 × 10^171. Explicitly,
$$a = 2^{576} - 2^{552} - 2^{240} + 2^{216} + 1\ . \tag{33}$$
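A sketch of RCARRY with the recommended parameters (the seeding with arbitrary integers below is an illustrative choice; a real implementation seeds the register more carefully):

    import random
    from collections import deque

    # Sketch: RCARRY with s = 10, r = 24, B = 2^24. The register holds
    # x_{n-r}, ..., x_{n-1} (oldest first); deque(maxlen=r) does the shifting.
    def rcarry(seed_register, s=10, r=24, B=2 ** 24):
        reg = deque(seed_register, maxlen=r)
        carry = 0
        while True:
            y = reg[-s] - reg[0] - carry      # x_{n-s} - x_{n-r} - c_{n-1}
            if y >= 0:
                carry = 0
            else:
                y += B
                carry = 1
            reg.append(y)
            yield y / B                       # scale to [0, 1)

    gen = rcarry([random.randrange(2 ** 24) for _ in range(24)])
    print([next(gen) for _ in range(3)])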
4 Algorithm testing
4.1 General strategy
• Theoretical (e.g. spectral test), but mostly address whole period, not guaranteed to be meaningful for shorter streams.
• Empirical tests, claim to look for various aspects of (non)randomness of
a given stream. Consist of computing a certain number using the given
stream, comparing with result of an equally long stream of truly random
numbers would give. Need its mean, variance, and so on to assign a confidence level to the pseudorandom result. Example: divide the unit interval
into M bins, count the occupancies of these bins, calculate the χ2
• Philosophical glitch: a stream of truly random numbers will fail at the oneσ level in about one out of three cases. So, if a given test is failed, what
does one conclude? Many proposed generators claimed that to pass all
tests performed. Conclusion should be: not a good model for truly random
numbers!
34
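A sketch of the bin test (the deviation is expressed in standard deviations of the χ² distribution, as in the plots that follow):

    import random

    # Sketch: M bins over [0,1); chi^2 = sum_i (n_i - N/M)^2 / (N/M),
    # with M - 1 degrees of freedom and variance 2(M - 1).
    def chi2_bin_test(stream, n_points=100_000, n_bins=100):
        counts = [0] * n_bins
        for _ in range(n_points):
            counts[int(next(stream) * n_bins)] += 1
        expected = n_points / n_bins
        chi2 = sum((c - expected) ** 2 / expected for c in counts)
        dof = n_bins - 1
        return chi2, chi2 / dof, (chi2 - dof) / (2 * dof) ** 0.5

    stream = iter(lambda: random.random(), None)   # plug in any generator here
    print(chi2_bin_test(stream))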
[Plot: occupancies of the 100 bins for 100,000 points generated by RCARRY, connected
by a solid line to guide the eye. The average occupancy is of course 1000, with fluctuations
of the expected O(√1000). The computed value of χ² is 85.234, or 0.86 per degree of
freedom, or 0.98 standard deviations. As far as this simple test is concerned, the stream
looks acceptably random.]
[Plot: occupancies of the 100 bins for 100,000 points generated by CORPUT, connected
by a solid line to guide the eye. The average occupancy is of course 1000, but the
fluctuations are tiny. The computed value of χ² is 0.148, or 0.0015 per degree of freedom,
off the mark by 7 standard deviations. We conclude that the stream generated by
CORPUT is much too uniform to be considered random.]
5 Discrepancies and Quasi-Monte Carlo
5.1 The notion of non-uniformity of point sets
• Monte Carlo goes as 1/√N;
• other methods (e.g. the trapezoid rule in D = 1) go as 1/N²;
• the difference lies in the (non)uniformity of the point set: the discrepancy.
5.2 Star discrepancies
• Any D-dimensional point set X = (~x1, ~x2, . . . , ~xN), random, regular, or otherwise.
• Characteristic function (‘step function’):
$$\theta(\vec y > \vec x) = \prod_{\mu=1}^{D} \theta(0 \le x^\mu \le y^\mu)$$
• Local discrepancy of the point set at ~y:
$$L^*(\vec y) = \frac{1}{N} \sum_{j=1}^{N} \theta(\vec y > \vec x_j) - \prod_{\mu=1}^{D} y^\mu\ . \tag{34}$$
[Figure: the local discrepancy for D = 2. The point ~y is at (0.55, 0.55), and the rectangle
spanned by the origin and ~y contains 3 out of the 12 points. The local discrepancy is
therefore 0.0525.]
Measures of global discrepancy:
$$L^*_\infty \equiv \sup_{\vec y} |L^*(\vec y)| \qquad \text{(Kolmogorov statistic)}$$
$$L^*_2 \equiv \int d^D\vec y\, \left(L^*(\vec y)\right)^2 \qquad \text{(Cramér–von Mises statistic)}$$
These are small if the point set is globally uniform.
Explicit formula in terms of the point set only:
$$L^*_2 = \frac{1}{N^2} \sum_{i,j=1}^{N} \prod_{\mu=1}^{D} \left(1 - \max(x_i^\mu, x_j^\mu)\right) - \frac{2}{N} \sum_{i=1}^{N} \prod_{\mu=1}^{D} \frac{1 - (x_i^\mu)^2}{2} + \left(\frac{1}{3}\right)^{\!D}\ , \tag{35}$$
where µ = 1, 2, . . . , D labels the coordinates of the points.
Expected for random points:
$$\langle L^*_2 \rangle = \frac{1}{N} \left(2^{-D} - 3^{-D}\right)\ . \tag{36}$$
For the ‘trapezoid point set’ (a regular hypercubic lattice) of M^D points:
$$L^*_2 = \left(\frac{1}{3}\right)^{\!D} \left[\left(1 + \frac{1}{2M^2}\right)^{\!D} - 2 \left(1 + \frac{1}{8M^2}\right)^{\!D} + 1\right] \approx \frac{D}{4 M^2\, 3^D}\ . \tag{37}$$
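Formula (35) translated directly into code (a sketch; O(N²D) work; for random points the result should scatter around (36)):

    from math import prod

    # Sketch: L2* of a point set (a list of D-dimensional tuples), eq. (35).
    def l2_star(points):
        N, D = len(points), len(points[0])
        term1 = sum(prod(1.0 - max(xi[mu], xj[mu]) for mu in range(D))
                    for xi in points for xj in points)
        term2 = sum(prod((1.0 - xi[mu] ** 2) / 2.0 for mu in range(D))
                    for xi in points)
        return term1 / N ** 2 - 2.0 * term2 / N + 3.0 ** (-D)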
5.3 Koksma-Hlawka inequalities
Variation of a function in one dimension: take an arbitrary partition of the unit interval,
$$P : (0 \equiv z_0 < z_1 < z_2 < z_3 < \cdots < z_{m-1} < z_m \equiv 1)\ ,$$
$$W(P) = \sum_{j=1}^{m} |f(z_j) - f(z_{j-1})|\ . \tag{38}$$
The supremum over all possible partitions is the variation:
$$\mathrm{Var}[f] = \sup_P W(P)\ . \tag{39}$$
Koksma inequality:
$$\left|\frac{1}{N} \sum_{j=1}^{N} f(x_j) - \int_0^1 dx\, f(x)\right| \le \mathrm{Var}[f] \times L^*_\infty(X)\ . \tag{40}$$
For D > 1 there is a generalization: the Koksma-Hlawka inequality.
• Finding the variation of an integrand is much harder than finding its integral.
• For bounded but discontinuous functions in more than one dimension, the variation
can be ∞.
• They do show that the notion of non-uniformity of point sets, as embodied in the
discrepancy, is relevant in establishing a more manageable way of talking about
uniformity.
5.4 Diaphonies
Another (possibly better) measure of non-uniformity:
$$T(X) = \frac{1}{N} \sum_{\vec n \ne \vec 0} \sigma^2_{\vec n} \left|\sum_{j=1}^{N} \exp(2i\pi\, \vec n \cdot \vec x_j)\right|^2 = \frac{1}{N} \sum_{j,k=1}^{N} \beta(\vec x_j - \vec x_k)\ ,$$
$$\beta(\vec x) = \sum_{\vec n \ne \vec 0} \sigma^2_{\vec n}\, \exp(2i\pi\, \vec n \cdot \vec x)\ .$$
It is invariant under translations mod 1 of the point set. The properties depend on the
strengths σ²_~n. It is useful to choose the strengths so that ⟨T⟩ = 1:
$$\sum_{\vec n \ne \vec 0} \sigma^2_{\vec n} = 1\ . \tag{41}$$
For D = 1:
$$\sigma^2_n = \frac{3}{\pi^2} \frac{1}{n^2}\ , \qquad n \ne 0\ . \tag{42}$$
The corresponding function β(x) reads
$$\beta(x) = \sum_{n \ge 1} \frac{6}{\pi^2 n^2} \cos(2\pi n x) = 1 - 6\{x\}(1 - \{x\})\ , \qquad \{x\} = x \bmod 1\ . \tag{43}$$
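For D = 1 the diaphony can be evaluated directly via the pair sum (a sketch; O(N²) work):

    # Sketch: the one-dimensional diaphony T(X) via the pair sum, with
    # beta(x) = 1 - 6{x}(1 - {x}) of eq. (43); <T> = 1 for random points.
    def diaphony_1d(xs):
        N = len(xs)
        total = 0.0
        for xj in xs:
            for xk in xs:
                frac = (xj - xk) % 1.0
                total += 1.0 - 6.0 * frac * (1.0 - frac)
        return total / N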
Generalization to D dimensions:
$$\sigma^2_{\vec n} = c_D \prod_{\mu=1}^{D} \tau(n^\mu)\ , \qquad \tau(n) = \max(1, |n|)^{-2}\ , \qquad c_D = \left[\left(1 + \frac{\pi^2}{3}\right)^{\!D} - 1\right]^{-1}\ . \tag{44}$$
Another choice (which tries to approximate rotational invariance):
$$\sigma^2_{\vec n} = \frac{1}{K(v)} \exp(-v\, \vec n^2) \quad \forall \vec n\ , \qquad K(v) = -1 + \left(\sum_{n=-\infty}^{\infty} e^{-v n^2}\right)^{\!D}\ , \tag{45}$$
with v > 0 a real parameter: the Jacobi diaphony. The β function is in this case given by
$$\beta(\vec x) = \frac{1}{K(v)} \left[-1 + \prod_{\mu=1}^{D} \phi(v, x^\mu)\right]\ , \tag{46}$$
where
$$\phi(v, x) = \sum_n \exp(-v n^2 + 2i\pi n x) = \sqrt{\frac{\pi}{v}} \sum_n \exp\!\left(-\frac{\pi^2}{v} (n + x)^2\right)\ , \tag{47}$$
$$K(v) = -1 + \phi(v, 0)^D\ . \tag{48}$$
[Plot: the function φ(v, x) as a function of x for v = 0.1, 0.3, 1. For decreasing values of v
the function (which always integrates to unity) becomes more and more peaked at {x} ≈ 0:
in the limit v → 0 we actually have φ(0, x) = (δ(x) + δ(1 − x))/2.]
5.5 Assessing point sets
The discrepancy or diaphony can be used as a test of whether a given point set ‘looks
random’. Relevant: the variance of T for random point sets,
$$\sigma(T(X))^2 = 2 \int d\vec x\, \beta(\vec x)^2 + O\!\left(\frac{1}{N}\right) \approx 2 \sum_{\vec n \ne \vec 0} \sigma^4_{\vec n}\ . \tag{49}$$
More completely:
$$\langle \exp(z\, T(X)) \rangle = \exp(\psi(z))\ , \qquad \psi(z) = \sum_{m \ge 1} \frac{(2z)^m}{2m} \sum_{\vec n \ne \vec 0} \sigma^{2m}_{\vec n}\ .$$
From this we may construct the actual T(X) distribution H(t), the probability density
for T(X) to have the value t > 0, and compare with the actual T for a given point set.
It can be proven that for large dimensionality the T(X) distribution for random point
sets approaches a Gaussian (law of large number of degrees of freedom).
6 Quasi-random number generators
6.1 Low-discrepancy point sets
The message from Koksma-Hlawka: the lower the discrepancy of the point set, the more
accurate the integral. This goes under the name of Quasi-Monte Carlo.
• Irregularity of distribution: an optimal point set of N points is not an optimal point
set of N + 1 points!
• Difference between fixed-size point sets and streams of quasi-random numbers.
• Quasi-random points are not independent of one another, so we have to consider
the error estimate again.
6.2 Finite point sets: Korobov sets
Method of good lattice points: take a D-dimensional vector ~g with integer components.
The set of N points is then defined by
$$x_j^\mu = \left(\frac{j\, g^\mu}{N}\right) \bmod 1\ , \qquad j = 1, 2, \ldots, N\ . \tag{50}$$
The discrepancy depends on ~g. Each gµ should be relatively prime to N. Example for
D = 2, based on the Fibonacci numbers:
$$F_1 = F_2 = 1\ , \qquad F_n = F_{n-1} + F_{n-2}\ , \quad n \ge 3\ , \tag{51}$$
$$\vec g = (1, F_{n-1})\ , \qquad N = F_n\ . \tag{52}$$
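In code (a sketch):

    # Sketch: the two-dimensional Fibonacci lattice of eqs. (50)-(52).
    def fibonacci_point_set(n):
        fib = [1, 1]
        while len(fib) < n:
            fib.append(fib[-1] + fib[-2])
        N, g = fib[n - 1], (1, fib[n - 2])     # N = F_n, g = (1, F_{n-1})
        return [tuple((j * gm / N) % 1.0 for gm in g) for j in range(1, N + 1)]

    points = fibonacci_point_set(15)           # 610 points, as in the plot below
    print(len(points), points[:3])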
[Plot: the Fibonacci point set for n = 15, which contains 610 points. The low value of L∗
can be ascribed to the tilting of the regular trapezoidal grid, so that no two points are
precisely aligned horizontally or vertically. The regularity of the grid, however, makes it
dangerous for integrands with a similar pattern, and can be picked out by other measures
of non-uniformity such as T.]
6.3 Infinite streams: Richtmeyer sequences
A drawback of Korobov sets is the finiteness of the number of different points. Replace
the factor ~g by an irrational ~θ. For D = 1,
$$x_j = (j\theta) \bmod 1\ , \qquad j = 1, 2, \ldots, N\ , \tag{53}$$
is equidistributed, i.e. the discrepancy goes to zero as N approaches infinity. The speed
depends on θ.
Continued fraction representation:
$$\theta = \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \cfrac{1}{a_4 + \cdots}}}}\ , \tag{54}$$
where the a’s are positive integers, the continued fraction coefficients. The set a1, a2, a3, . . .
is completely equivalent to a unique θ. If ak = ∞, θ is rational. The discrepancy is lower
if the coefficients are smaller, and θ ‘more irrational’. The ‘most irrational’ number has
aj = 1 for all j: θ = (−1 + √5)/2.
Continued fraction coefficients for some simple numbers:

  θ           a1, a2, a3, a4, . . .
  √2 − 1      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, . . .
  √3 − 1      1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, . . .
  √5 − 2      4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, . . .
  √6 − 2      2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, . . .
  √7 − 2      1, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 4, . . .
  √8 − 2      1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, . . .
  √10 − 3     6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, . . .
  √11 − 3     3, 6, 3, 6, 3, 6, 3, 6, 3, 6, 3, 6, . . .
  √12 − 3     2, 6, 2, 6, 2, 6, 2, 6, 2, 6, 2, 6, . . .
  √2 − 1/2    1, 10, 1, 1, 1, 10, 1, 1, 1, 10, 1, 1, . . .
  √3 − 3/2    4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, . . .
  π − 3       7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1, 1, 2, . . .
  π − 5/2     1, 1, 1, 3, 1, 3, 4, 73, 6, 3, 3, 2, 1, 3, . . .
  e − 2       1, 2, 1, 1, 4, 1, 1, 6, 1, 1, 8, 1, 1, 10, 1, . . .
[Plots: the behaviour of the discrepancy of one-dimensional Richtmeyer sequences over
the first 1000 points, for θ = (√5 − 1)/2, θ = √37 mod 1, θ = π mod 1, and θ = e mod 1.]
Multidimensional sequences: Richtmeyer sequences
$$x_j^\mu = (j\, \theta^\mu) \bmod 1\ , \tag{55}$$
where the θµ, (µ = 1, 2, . . . , D) are irrational numbers that are all relatively prime to one
another. Simple choice: θµ = √pµ mod 1, where pµ is the µth prime number. Below we
give some results for sets of 2000 points in two dimensions.
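A sketch of such a stream (note that in double precision the fractional part (jθ) mod 1 slowly loses accuracy as j grows):

    import math

    # Sketch: a D-dimensional Richtmeyer stream with theta^mu = sqrt(p_mu) mod 1,
    # p_mu the mu-th prime.
    PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

    def richtmeyer(D):
        thetas = [math.sqrt(p) % 1.0 for p in PRIMES[:D]]
        j = 0
        while True:
            j += 1
            yield tuple((j * t) % 1.0 for t in thetas)

    gen = richtmeyer(2)
    points = [next(gen) for _ in range(2000)]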
[Plots: 2000 two-dimensional Richtmeyer points for ~θ = (√2 mod 1, √3 mod 1) and for
~θ = (√2 mod 1, √5 mod 1).]
6.4 van der Corput sequences
In the previous section we discussed Richtmeyer sequences, and presented results for
N = 1000, for one dimension, for various θ’s. The 1000-point point set with the minimal
discrepancy is given by
$$x_j = \frac{2j - 1}{2000}\ , \qquad j = 1, 2, \ldots, 1000\ . \tag{56}$$
In the following figure we plot 12N²L∗2 for this set, computed over the first N points.
[Plot: 12N²L∗2 for the first N points of the set (56).]
The curve reaches a maximum of about 250,000 in the middle. This is due to the fact
that, for N = 500, the interval (0, 1/2) contains precisely 500 points, and the interval
(1/2, 1) precisely none. The value of 12N²L∗2 for the 500-point point set with the minimal
discrepancy would be 1, and this value would be obtained for N = 500 if we had
constructed the point set by defining xj = ((2j − 1)/1000) mod 1. The order in which
the points are generated is extremely important.
Consider a numbering system in base b: that is, given an integer b ≥ 2, we can write any
natural number n as
$$n = n_0 + n_1 b + n_2 b^2 + n_3 b^3 + \cdots + n_k b^k\ , \qquad n < b^{k+1}\ . \tag{57}$$
The van der Corput transform of n in base b is then
$$\phi_b(n) = n_0 b^{-1} + n_1 b^{-2} + n_2 b^{-3} + \cdots + n_k b^{-k-1}\ . \tag{58}$$
The most significant digit of φb(n) changes most rapidly, followed by the next-most
significant digit, and so on. For b = 2, the first few van der Corput transforms are given
here:
  n      0  1    2    3    4    5    6    7    8     9     10    11     12    13     14    15
  φ2(n)  0  1/2  1/4  3/4  1/8  5/8  3/8  7/8  1/16  9/16  5/16  13/16  3/16  11/16  7/16  15/16

These numbers have been used in the CORPUT generator mentioned before.
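The transform is conveniently computed by peeling off the base-b digits (a sketch):

    # Sketch: the van der Corput transform (58): reverse the base-b digits
    # of n around the 'decimal' point.
    def phi(n, b=2):
        x, denom = 0.0, b
        while n > 0:
            n, digit = divmod(n, b)
            x += digit / denom
            denom *= b
        return x

    print([phi(n) for n in range(8)])   # 0, 1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8
    # In D dimensions, point j is (phi(j, b1), ..., phi(j, bD)), coprime bases.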
This sequence attempts to fill the unit interval in as uniform a way as possible; after 2^k
points the interval is filled almost optimally, with the points sitting at j/2^k,
j = 0, 1, . . . , 2^k − 1. The L∗2 in that case is 1/(3N²), almost optimal¹.

¹ For point sets defined by xj = (2j − α)/(2N), j = 1, 2, . . . , N and −1 ≤ α ≤ 1, the
discrepancy is given by L∗2 = (1 + 3(1 − α)²)/(12N²).
[Plot: the ‘normalized’ extreme discrepancy NL∗∞ for the first N van der Corput numbers
with base b = 2, for N from 1 to 2^11. Note the fractal structure. In fact, if we plot only
the values for N a multiple of 2^c, we obtain exactly the same plot as that for N = 2^(11−c).]
[Plot: the ‘normalized’ extreme discrepancy NL∗∞ for the first N van der Corput numbers
with base b = 5, for N from 1 to 5^5. The fractal structure is in this case of a 5-fold nature.]
In more dimensions, define the vector sequence ~xj by choosing a vector ~b with natural
components, and defining
$$x_j^\mu = \phi_{b^\mu}(j)\ , \tag{59}$$
where the bases bµ, µ = 1, 2, . . . , D are all relatively prime.
[Plot: the van der Corput points (φ2(n), φ3(n)) for 1 ≤ n ≤ 1000.]
[Plot: the van der Corput points (φ2(n), φ4(n)) for 1 ≤ n ≤ 1000. This plot shows why
the bases of the axes should be relatively prime to one another.]
[Plot: the van der Corput points (φ17(n), φ19(n)) for 1 ≤ n ≤ 1000. The onset of
uniformity is seen to be quite slow in this case, due to the fact that (a) the primes 17
and 19 are ‘large’ for N = 1000 (1000 ∼ 2 × 19²), and (b) they are close to one another.]
[Plot: the normalized discrepancy NL∗2 as a function of N, for the two-dimensional van
der Corput sequence with ~b = (2, 3) (lower curve), and for the RCARRY algorithm (upper
curve). The straight line at 2^−2 − 3^−2 = 5/36 is the expected value of NL∗2 for truly
random points. The NL∗2 for the pseudo-random set from RCARRY shows appreciable
fluctuations around the expected value, while that for the quasi-random set from CORPUT
is steadily decreasing.]
[Plots: NL∗2 as a function of N, up to N = 1000, for D = 3, 4, 5, 6, 8, and 10.]
6.5 Niederreiter sequences

In these sequences one may use b = 2 for all coordinates: we have
$$\vec x_j = \left(\phi_2(p_1(j))\, ,\; \phi_2(p_2(j))\, ,\; \ldots\, ,\; \phi_2(p_D(j))\right)\ , \qquad j = 1, 2, 3, \ldots\ , \tag{60}$$
where the functions pµ(j) are cleverly chosen permutations. The discrepancy can go to 0
rapidly as N → ∞.
6.6 Error estimates revisited
In standard Monte Carlo, the expectation value of the squared error decreases as 1/N
under the assumption that the points are iid. For Quasi-Monte Carlo this is explicitly not
true: indeed, the various points ‘know’ about each other’s position.
On the hypercube, the combined probability density for truly random points is
$$P(\vec x_1, \vec x_2, \ldots, \vec x_N) = 1\ . \tag{61}$$
Quasi-random points are not independent; their combined probability density is
$$P(\vec x_1, \vec x_2, \ldots, \vec x_N) = 1 - \frac{1}{N}\, F_N(\vec x_1, \vec x_2, \ldots, \vec x_N)\ , \tag{62}$$
where the factor −1/N is a convention. Clearly, FN must be symmetric in all its
arguments. Moreover, let us define
$$F_k(\vec x_1, \vec x_2, \ldots, \vec x_k) = \int_0^1 d\vec x_{k+1}\, d\vec x_{k+2} \cdots d\vec x_N\; F_N(\vec x_1, \vec x_2, \ldots, \vec x_N) \tag{63}$$
for k < N. If the point set is to be of any use at all, we must have F1(~x) = 0, otherwise
the integral estimate is biased. Assume the combined probability density of the points to
be translation-invariant modulo 1:
$$F_N(\vec x_1 + \vec y, \ldots, \vec x_N + \vec y) = F_N(\vec x_1, \ldots, \vec x_N) \quad \forall \vec y\ , \tag{64}$$
so that the combined probability density of a pair of points depends only on their difference:
$$P(\vec x_1, \vec x_2) = 1 - \frac{1}{N}\, F_2(\vec x_1 - \vec x_2)\ . \tag{65}$$
The integral estimator is
$$E_1 = \frac{1}{N} \sum_{j=1}^{N} f(\vec x_j)\ , \tag{66}$$
and we still find ⟨E1⟩ = J1. The expectation value of its square is
$$\left\langle E_1^2 \right\rangle = \frac{1}{N^2} \left(N J_2 + N_2 J_1^2 - \frac{N_2}{N} \int f f F_2\right)\ , \qquad \int f f F_2 \equiv \int d\vec x_1\, d\vec x_2\, f(\vec x_1)\, f(\vec x_2)\, F_2(\vec x_1 - \vec x_2)\ .$$
The squared error is
$$\sigma(E_1)^2 = \frac{1}{N} \left(J_2 - J_1^2 - \frac{N - 1}{N} \int f f F_2\right)\ . \tag{67}$$
To actually estimate the squared error, we also have to modify E2. It turns out that the
definition that gives ⟨E2⟩ = σ(E1)² is
$$E_2 = \sum_{j,k=1}^{N} f(\vec x_j)\, f(\vec x_k) \left[\left(\frac{1}{N^2} + \frac{F_2(\vec 0)}{N^3}\right) \delta_{j,k} - \frac{1}{N\, N_2} - \frac{1}{N^3}\, F_2(\vec x_j - \vec x_k)\right]\ . \tag{68}$$
Unfortunately, in this case we do have to perform O(N²) sums. It is therefore customary,
even if not justified, to employ the standard form of E2. An alternative method is the
following. Suppose that a set of 100,000 quasi-random points, say, is used in a calculation.
We may then collect the weights in 100 groups of 1,000 each, and study the distribution
of the 100 averages in addition to their average, in order to arrive at an error estimate.
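A sketch of this batching procedure:

    # Sketch: split the weights into groups and use the spread of the
    # group averages as the error estimate.
    def batched_error(weights, n_groups=100):
        size = len(weights) // n_groups
        means = [sum(weights[i * size:(i + 1) * size]) / size
                 for i in range(n_groups)]
        grand = sum(means) / n_groups
        var = sum((m - grand) ** 2 for m in means) / (n_groups - 1)
        return grand, (var / n_groups) ** 0.5    # estimate and its error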
7 Phase space algorithms
7.1 General
One of the fundamental integrations in particle phenomenology: transition rates integrated
over phase space. The phase space integration element is
$$dV(1, \ldots, n; s) \equiv \prod_{j=1}^{n} d^4p_j\, \delta\!\left(p_j^2 - m_j^2\right) \theta(p_j^0)\;\; \delta^3\!\left(\sum_{j=1}^{n} \vec p_j\right) \delta\!\left(\sqrt{s} - \sum_{j=1}^{n} p_j^0\right)\ . \tag{69}$$
The phase space volume is
$$V(1, \ldots, n; s) = \int dV(1, \ldots, n; s)\ .$$
7.2 Hierarchical method
Based on a 2-particle split-up: insert
$$d^4Q\, du\, \delta(Q^2 - u)\, \delta^4\!\left(\sum_{j=2}^{n} p_j - Q\right)\ . \tag{70}$$
So the 1 → n process is split up into 1 → 2 followed by 1 → (n − 1):
$$V(1, \ldots, n; s) = \int du\, V(1, Q; s)\, V(2, \ldots, n; u)\ .$$
Repeat: a cascade of 1 → 2 processes. Problem: the distribution of the u’s, and the
ordering of the u’s. Old algorithm: FOWL by F. James.
7.3 RAMBO algorithm (RAndom Momenta BOoster)
Valid in the ultra-relativistic limit. The phase space element is
$$\prod_{j=1}^{n} d^4p_j\, \delta(p_j^2)\, \theta(p_j^0)\;\; \delta^3(\vec P)\, \delta\!\left(P^0 - \sqrt{s}\right)\ , \tag{71}$$
where
$$P^\mu = \sum_{j=1}^{n} p_j^\mu\ , \qquad s = P^2\ . \tag{72}$$
• Obtain four-momenta that satisfy the constraints;
• ensure that the whole phase space is covered;
• with as uniform a density as possible.
Start from
$$1 = (2\pi w^2)^{-n} \int \prod_{j=1}^{n} d^4q_j\, \delta(q_j^2)\, \theta(q_j^0)\, \exp(-q_j^0/w)\ . \tag{73}$$
The algorithm is as follows:
• k^µ = Σj q_j^µ. In general this is not at rest, nor does its square equal s.
• Perform on every qj the Lorentz boost Λ which brings k^µ to its rest frame.
• Apply a scaling transform to correct the overall invariant mass.
• The resulting vectors are denoted by pj.
Then
$$1 = (2\pi w^2)^{-n} \int \prod_{j=1}^{n} \left[d^4q_j\, \delta(q_j^2)\, \theta(q_j^0)\, e^{-q_j^0/w}\right] d^4k\, \delta^4\!\left(k - \sum_j q_j\right) dx\, \delta\!\left(x - \frac{\sqrt{k^2}}{\sqrt{s}}\right) \prod_j d^4p_j\, \delta^4\!\left(p_j - \frac{1}{x} \Lambda(q_j)\right)\ .$$
The scaling factor x runs from 0 to ∞. Owing to Lorentz invariance,
$$\delta^4\!\left(p_j - \frac{1}{x} \Lambda(q_j)\right) \delta(q_j^2) = x^2\, \delta^4\!\left(q_j - x \Lambda^{-1}(p_j)\right) \delta(p_j^2)\ . \tag{74}$$
Furthermore,
$$\delta^4\!\left(k - \sum_j q_j\right) \delta\!\left(x - \frac{\sqrt{k^2}}{\sqrt{s}}\right) = \frac{2s}{x^3}\, \delta^3(\vec P)\, \delta\!\left(P^0 - \sqrt{s}\right) \delta\!\left(k^2 - x^2 s\right)\ .$$
The formula can be written as
$$1 = N \int \prod_{j=1}^{n} d^4p_j\, \delta(p_j^2)\, \theta(p_j^0)\;\; \delta^3(\vec P)\, \delta\!\left(P^0 - \sqrt{s}\right)\ , \tag{75}$$
where
$$N = 2s\, (2\pi w^2)^{-n}\, 2\pi \int_0^\infty dx\; x^{2n-3} \int_{x\sqrt{s}}^{\infty} dk^0\, \exp(-k^0/w)\, \sqrt{(k^0)^2 - x^2 s}\ . \tag{76}$$
The volume of phase space in the ultra-relativistic limit is
$$V_{UR} = N^{-1} = \left(\frac{\pi}{2}\right)^{\!n-1} \frac{s^{n-2}}{\Gamma(n)\, \Gamma(n-1)}\ . \tag{77}$$
We see that the RAMBO algorithm is essentially the best possible: the whole of phase
space is covered uniformly. The energy scale w drops out as it should, and therefore in
practice we use w = 1.
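A compact sketch of the massless algorithm (w = 1; the boost and scaling described in the bullets above are written out explicitly):

    import math, random

    # Sketch of massless RAMBO: isotropic massless momenta q_j with energies
    # distributed as q0*exp(-q0) (i.e. eq. (73) with w = 1), boosted to the
    # rest frame of their sum and scaled so that the total is (sqrt(s), 0, 0, 0).
    def rambo(n, sqrt_s):
        qs = []
        for _ in range(n):
            c = 2.0 * random.random() - 1.0                  # cos(theta)
            phi = 2.0 * math.pi * random.random()
            q0 = random.expovariate(1.0) + random.expovariate(1.0)
            st = math.sqrt(1.0 - c * c)
            qs.append((q0, q0 * st * math.cos(phi),
                       q0 * st * math.sin(phi), q0 * c))
        K = [sum(q[mu] for q in qs) for mu in range(4)]
        M = math.sqrt(K[0] ** 2 - K[1] ** 2 - K[2] ** 2 - K[3] ** 2)
        b = [-K[i] / M for i in (1, 2, 3)]                   # boost vector
        gamma = K[0] / M
        a = 1.0 / (1.0 + gamma)
        x = sqrt_s / M                                       # scaling factor
        ps = []
        for q0, q1, q2, q3 in qs:
            bq = b[0] * q1 + b[1] * q2 + b[2] * q3
            ps.append((x * (gamma * q0 + bq),
                       x * (q1 + b[0] * (q0 + a * bq)),
                       x * (q2 + b[1] * (q0 + a * bq)),
                       x * (q3 + b[2] * (q0 + a * bq))))
        return ps

One can check that the returned four-momenta are massless and sum to (√s, 0, 0, 0) up to rounding.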
7.4 Inclusion of particle masses
• Particles may be relatively light but not massless. Assume that a ‘massless’ phase
space point has been generated according to RAMBO. Then:
• scale the three-momenta down,
• adjust the energies so as to build in the masses,
• while keeping energy conservation intact.
• Determine the unique root ξ0 of
$$F(\xi) = -\sqrt{s} + \sum_{j=1}^{n} \sqrt{\xi^2 (q_j^0)^2 + m_j^2}\ . \tag{78}$$
• Replace q_j^µ by p_j^µ:
$$\vec p_j \leftarrow \xi\, \vec q_j\ , \qquad p_j^0 \leftarrow \sqrt{\vec p_j^{\,2} + m_j^2}\ . \tag{79}$$
• Phase space is now not filled uniformly, but rather with a fluctuating density.
• This implies that we have to assign to each generated phase space point a weight
$$w(p_1, p_2, \ldots, p_n) = V_{UR} \left(\sum_{j=1}^{n} \frac{|\vec p_j|}{\sqrt{s}}\right)^{\!2n-3} \left(\prod_{j=1}^{n} \frac{|\vec p_j|}{p_j^0}\right) \left(\sum_{j=1}^{n} \frac{|\vec p_j|^2}{p_j^0\, \sqrt{s}}\right)^{\!-1}\ . \tag{80}$$
• For all masses equal, mj = m, the maximum weight is expected to be
$$w_{max} = \left(1 - \frac{n^2 m^2}{s}\right)^{\!(3n-5)/2}\ . \tag{81}$$
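A sketch of this mass-adjustment step, continuing the rambo() sketch above (bisection is an arbitrary but safe choice of root finder, since F is monotonic in ξ):

    import math

    # Sketch: solve F(xi) = 0 of eq. (78) by bisection and rescale the massless
    # momenta as in eq. (79); requires sum(masses) < sqrt(s).
    def add_masses(ps, masses, sqrt_s):
        def F(xi):
            return -sqrt_s + sum(math.sqrt(xi * xi * p[0] * p[0] + m * m)
                                 for p, m in zip(ps, masses))
        lo, hi = 0.0, 1.0     # F(0) < 0 <= F(1), since the q0 sum to sqrt(s)
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if F(mid) < 0.0 else (lo, mid)
        xi = 0.5 * (lo + hi)
        out = []
        for (q0, q1, q2, q3), m in zip(ps, masses):
            px, py, pz = xi * q1, xi * q2, xi * q3
            out.append((math.sqrt(px * px + py * py + pz * pz + m * m),
                        px, py, pz))
        return out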
[Histograms: distributions of the RAMBO weights for n = 3, 6, 10 and r = 0.3, 0.6, 0.9.]