On the computation of the reciprocal of floating point expansions
using an adapted Newton-Raphson iteration
Mioara Joldes, Valentina Popescu, Jean-Michel Muller
ASAP 2014
Motivation
Numerical problems require floating-point operations with higher precision than double
(binary64),
e.g. in dynamical systems: finding sinks in the Hénon map, iterating the Lorenz attractor,
long-term stability of the solar system.
These are usually also high-performance computing problems.
Our project: CAMPARY (CudA Multiple Precision ARithmetic librarY)
Existing multiple-precision libraries (on CPU/GPU architectures):
– multiple-digit representation: GNU MPFR, ARPREC; GPU variants: GARPREC, CUMP
– multiple-term representation: QD; GPU variant: GQD
Extending the precision using multiple-term format: FP expansions
Let x = M_x · 2^{e_x} be a precision-p floating-point (FP) number, with 1 ≤ |M_x| < 2.
Denote ulp(x) = 2^{e_x − p + 1} (Goldberg's definition).
Def.
A floating-point expansion u with n terms is the unevaluated sum u = Σ_{i=0}^{n−1} u_i of
n FP numbers u_0, ..., u_{n−1}, s.t. u_i ≠ 0 ⇒ |u_i| ≥ |u_{i+1}|.
Non-overlapping FP expansions
u is Bailey-nonoverlapping if for all 0 < i < n, we have |u_i| ≤ (1/2) ulp(u_{i−1}).
Example: π in double-double
p_0 = 11.001001000011111101101010100010001000010110100011000_2, and
p_1 = 1.0001101001100010011000110011000101000101110000000111_2 × 2^{−53}.
p_0 + p_1 ↔ a 107-bit FP approximation.
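The double-double idea above can be checked numerically. The following sketch (in Python, not part of the original slides; the helper `to_expansion` is our own illustrative name) builds a 2-term expansion of 1/3 by greedily rounding off the nearest double at each step, then verifies Bailey's non-overlap condition and the resulting ≈2p-bit accuracy:

```python
from fractions import Fraction
import math

def to_expansion(v, n):
    """Greedily extract n FP terms from the exact value v:
    each term is the double nearest to the remaining error."""
    terms = []
    r = Fraction(v)
    for _ in range(n):
        t = float(r)          # round-to-nearest double
        terms.append(t)
        r -= Fraction(t)      # exact remainder (Fraction(t) is exact)
    return terms

third = Fraction(1, 3)
p0, p1 = to_expansion(third, 2)

# Bailey non-overlap: |p1| <= 1/2 ulp(p0)
assert abs(p1) <= 0.5 * math.ulp(p0)

# The unevaluated sum p0 + p1 carries roughly 2*53 bits of 1/3
err = abs(Fraction(p0) + Fraction(p1) - third)
assert err < third * Fraction(1, 2**106)
```

Each correctly rounded term leaves a remainder of at most half an ulp, which is why the extracted terms are automatically non-overlapping.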
Extending the precision using FP expansions
Pros:
– use directly available and highly optimized native FP operations
– sufficiently simple and regular algorithms for addition and multiplication
– straightforwardly portable to highly parallel architectures, such as GPUs
Cons:
– lack of thorough error bounds
– no correct rounding
– QD supports only the 2-double and 4-double formats
Existing algorithms
Addition and multiplication:
– generalized/adapted versions of [Priest'91], [Shewchuk'97], [Bailey'01]
– based on error-free transforms: 2Sum, 2Prod, 2ProdFMA
Division is based on the classical "paper-and-pencil" long-division algorithm [Bailey'01,
Priest'91, Daumas'99].
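The error-free transforms named above are short FP sequences that return both the rounded result and its exact rounding error. A sketch (Python as illustration; the slides themselves give no code) of 2Sum (Knuth) and 2Prod (Dekker's splitting variant, which needs no FMA; with an FMA, the error is simply fma(a, b, -p)):

```python
from fractions import Fraction

def two_sum(a, b):
    """Knuth's 2Sum: returns (s, e) with s = RN(a+b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def two_prod(a, b):
    """Dekker's 2Prod via Veltkamp splitting (no FMA needed):
    returns (p, e) with p = RN(a*b) and a * b = p + e exactly."""
    C = 134217729.0            # 2**27 + 1, splitting constant for binary64
    def split(x):
        c = C * x
        hi = c - (c - x)
        return hi, x - hi
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

# Error-free: each pair holds the exact result (checked with rationals)
s, es = two_sum(1.0, 2.0**-60)
assert Fraction(s) + Fraction(es) == Fraction(1) + Fraction(2)**-60
p, ep = two_prod(0.1, 0.3)
assert Fraction(p) + Fraction(ep) == Fraction(0.1) * Fraction(0.3)
```

These pairs (result, error) are exactly the 2-term expansions that the addition and multiplication algorithms accumulate.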
Our contribution: Reciprocal of FP expansions with
an adapted Newton-Raphson iteration & explicit error bound
Newton iteration for a root α of f:
x_{n+1} = x_n − f(x_n)/f′(x_n);
when x_0 is close to α and f′(α) ≠ 0 → quadratic convergence.
Newton-Raphson iteration for the reciprocal, i.e. the root 1/a of f(x) = 1/x − a:
x_{n+1} = x_n(2 − a·x_n);
when x_0 is close to 1/a → quadratic convergence, with x_{n+1} − 1/a = −a(x_n − 1/a)².
Adapted Newton-Raphson iteration for the reciprocal of an FP expansion: all quantities
are FP expansions, and each operation is a rounded expansion operation (the products by
RdMulE, the subtraction by RdSubE):
x_{n+1} = RdMulE(x_n, RdSubE(2, RdMulE(a, x_n)))
→ quadratic convergence
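The quadratic convergence of the reciprocal iteration is easy to observe even in plain binary64, before any expansion machinery is involved. A minimal sketch (ours, not from the slides):

```python
# Quadratic convergence of x_{n+1} = x_n (2 - a x_n) toward 1/a, in plain
# binary64: each step roughly doubles the number of correct digits, and
# no division is ever performed.
a = 7.0
x = 0.1                      # crude initial guess for 1/7
for _ in range(6):
    x = x * (2.0 - a * x)    # one Newton-Raphson step

assert abs(a * x - 1.0) < 1e-15
assert abs(x - 1.0 / 7.0) < 1e-15
```

The same recurrence, run with expansion operations in place of double operations, is the basis of Algorithm 1 below; since the error squares at each step, doubling the number of expansion terms per iteration matches the precision actually achieved.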
Main Algorithm
Algorithm 1: Truncated Newton-iteration-based algorithm for the reciprocal of an FP expansion.
Input: FP expansion a = a_0 + ... + a_{2^k−1}; length of output FP expansion 2^q.
Output: FP expansion x = x_0 + ... + x_{2^q−1} s.t.
|x − 1/a| ≤ 2^{−2^q(p−3)−1} / |a|.   (1)
1: x_0 = RN(1/a_0)
2: for i ← 0 to q − 1 do
3:   v̂[0 : 2^{i+1}−1] ← RdMulE(x[0 : 2^i−1], a[0 : 2^{i+1}−1], 2^{i+1})
4:   ŵ[0 : 2^{i+1}−1] ← RdSubE(2, v̂[0 : 2^{i+1}−1], 2^{i+1})
5:   x[0 : 2^{i+1}−1] ← RdMulE(x[0 : 2^i−1], ŵ[0 : 2^{i+1}−1], 2^{i+1})
6: end for
7: return FP expansion x = x_0 + ... + x_{2^q−1}.
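The loop structure of Algorithm 1 can be sketched in a few lines. This is an idealized model, not the paper's implementation: RdMulE/RdSubE are the truncated expansion operations of the paper, and here they are emulated by computing exactly with rationals and then greedily truncating to the requested number of terms (which is at least as accurate as the real operations, so the bound (1) should still hold). The names `to_expansion`, `val`, and `reciprocal` are ours:

```python
from fractions import Fraction

def to_expansion(v, n):
    """Idealized 'round to an n-term expansion': greedily peel off nearest
    doubles from the exact value (a stand-in for RdMulE/RdSubE rounding)."""
    terms, r = [], Fraction(v)
    for _ in range(n):
        t = float(r)
        terms.append(t)
        r -= Fraction(t)
    return terms

def val(terms):
    """Exact value of an expansion, as a rational."""
    return sum(map(Fraction, terms))

def reciprocal(a, q):
    """Sketch of Algorithm 1: term-doubling Newton iteration for 1/a."""
    x = [1.0 / a[0]]                               # x0 = RN(1/a0)
    for i in range(q):
        k = 2 ** (i + 1)
        v = to_expansion(val(x) * val(a[:k]), k)   # ~ RdMulE(x, a, k)
        w = to_expansion(2 - val(v), k)            # ~ RdSubE(2, v, k)
        x = to_expansion(val(x) * val(w), k)       # ~ RdMulE(x, w, k)
    return x

a = to_expansion(Fraction(10, 3), 4)   # a 4-term input expansion
x = reciprocal(a, 2)                   # 2^2 = 4 output terms

# Check the slide's bound (1): |x*a - 1| <= 2^{-2^q (p-3) - 1}, p = 53, q = 2
rel_err = abs(val(x) * val(a) - 1)
assert rel_err <= Fraction(1, 2 ** 201)
```

Note how each pass through the loop doubles the working length k, mirroring the quadratic growth of the number of correct bits.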
Example, adapted Newton-Raphson iteration on FP expansions
General procedure: x_{n+1} = x_n(2 − a·x_n)
iter 0: x_0 = RN(1/a_0);
iter 1: [x_0, x_1] = x_0 · (2 − [a_0, a_1] · x_0);
iter 2: [x_0, x_1, x_2, x_3] = [x_0, x_1] · (2 − [a_0, a_1, a_2, a_3] · [x_0, x_1]);
...
But... when using error-free transforms, the product of an n-term expansion
[u_0, u_1, ..., u_{n−1}] by an m-term expansion [v_0, v_1, ..., v_{m−1}] can have up to
2mn terms [w_0, w_1, ..., w_{2mn−1}].
⇒ Use truncated addition/multiplication.
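The 2mn term growth is visible directly: with error-free transforms alone, every pairwise product contributes two terms. A small sketch (ours; `two_prod` is Dekker's 2Prod as above) multiplying two 2-term expansions:

```python
from fractions import Fraction

def two_prod(a, b):
    """Dekker's 2Prod: a * b = p + e exactly, with p = RN(a*b)."""
    C = 134217729.0                     # 2**27 + 1
    def split(x):
        c = C * x
        hi = c - (c - x)
        return hi, x - hi
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

# n*m pairwise products, each an exact 2-term result -> 2*n*m terms
# before any accumulation; the count keeps growing across iterations,
# which is why Algorithm 1 truncates to 2^{i+1} terms at each step.
u = [1.0, 2.0**-53]                     # n = 2 terms
v = [3.0, 3.0 * 2.0**-53]               # m = 2 terms
partial = []
for ui in u:
    for vj in v:
        partial.extend(two_prod(ui, vj))

assert len(partial) == 2 * len(u) * len(v)           # 8 terms
exact = sum(map(Fraction, u)) * sum(map(Fraction, v))
assert sum(map(Fraction, partial)) == exact          # still error-free
```

Accumulating the partial terms with 2Sum reorders and compresses them but, in the worst case, still leaves up to 2mn nonzero terms, hence the truncated operations.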
Truncations and Error Analysis
General procedure: x_{n+1} = x_n(2 − a·x_n)
iter 0: x_0 = RN(1/a_0);
iter 1: [x_0, x_1] = x_0 · (2 − [a_0, a_1] · x_0), with the truncated intermediate
expansions denoted v̂ = [v̂_0, v̂_1] (for a·x_0) and ŵ = [ŵ_0, ŵ_1] (for 2 − v̂); ...
Error analysis based on triangle inequalities.
Let η = Σ_{j=0}^{∞} 2^{(−j−1)p} = 2^{−p}/(1 − 2^{−p}) and γ_i = 2^{−(2^{i+1}−1)p} · η/(1 − η).
Writing τ_i, w_i, v_i for the exact values approximated by the truncated x_{i+1}, ŵ_i, v̂_i,
and a^{(f_i)} for a truncated to its first f_i terms:
|x_{i+1} − τ_i| ≤ γ_i |x_i · ŵ_i|,            (2a)
|w_i − ŵ_i| ≤ γ_i |w_i| ≤ γ_i |2 − v̂_i|,      (2b)
|v_i − v̂_i| ≤ γ_i |a^{(f_i)} · x_i|,          (2c)
|a − a^{(f_i)}| ≤ γ_i |a|.                    (2d)
And finally,
|x_i − a^{−1}| / |a^{−1}| ≤ 2^{−2^i(p−3)−1}.
Comparison and Results
Table: (a) Error bound values for Priest's formula [Priest'91] vs. Daumas [Daumas] vs. our
analysis (1); d = 2^q terms are computed in the quotient.

Prec, iteration | Eq. Priest | Eq. Daumas | Eq. (1)
p = 53, q = 0   | 2          | 2^−49      | 2^−51
p = 53, q = 1   | 1          | 2^−98      | 2^−101
p = 53, q = 2   | 2^−2       | 2^−195     | 2^−201
p = 53, q = 3   | 2^−6       | 2^−387     | 2^−401
p = 53, q = 4   | 2^−13      | 2^−764     | 2^−801
p = 24, q = 0   | 2          | 2^−20      | 2^−22
p = 24, q = 1   | 1          | 2^−40      | 2^−43
p = 24, q = 2   | 2^−2       | 2^−79      | 2^−85
p = 24, q = 3   | 2^−5       | 2^−155     | 2^−169
p = 24, q = 4   | 2^−12      | 2^−300     | 2^−337
Comparison and Results
Table: Timings† in MFlops/s for Alg. 1 vs. the QD implementation for reciprocal (A) and
division (B) of expansions; the numerator, denominator and quotient have respectively
d_n, d_i and d_o terms.

(A) Reciprocal
d_i, d_o | Alg. 1 | QD
1, 1     | 107    | 107
2, 2     | 62     | 70
4, 4     | 10     | 3.6
1, 2     | 62     | 86.2
2, 4     | 10.7   | 3.7
4, 2     | 61     | 86.2
1, 4     | 12.6   | 7.36
1, 8     | 2      | ∗
2, 8     | 1.7    | ∗
4, 8     | 1.4    | ∗
8, 8     | 1.3    | ∗
1, 16    | 0.3    | ∗
2, 16    | 0.27   | ∗
4, 16    | 0.22   | ∗
8, 16    | 0.19   | ∗
16, 16   | 0.17   | ∗

(B) Division
d_n, d_i, d_o | Alg. 1 | QD
2, 2, 2       | 46.3   | 70
4, 4, 4       | 6.8    | 3.6
2, 1, 2       | 46.7   | 86.2
4, 2, 4       | 7      | 3.7
2, 4, 2       | 46.1   | 86.2
4, 1, 4       | 7.7    | 7.36

† Intel(R) Core(TM) i7 CPU 3820, 3.6 GHz
∗ precision not supported
Conclusion
Use a multiple-term format for multiple-precision floating-point numbers → FP expansions
Method for computing the reciprocal/division of FP expansions based on:
– "truncated" addition and multiplication
– "adapted" Newton-Raphson iteration
Thorough error analysis and explicit error bound
Our project: CAMPARY (CudA Multiple Precision ARithmetic librarY)
Thank you!