On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration
Mioara Joldes, Valentina Popescu, Jean-Michel Muller
ASAP 2014

Motivation
Numerical problems require floating-point operations with higher precision than double (binary64), e.g. in dynamical systems: finding sinks in the Hénon map, iterating the Lorenz attractor, studying the long-term stability of the solar system. These are usually also high-performance computing problems.
Our project: CAMPARY — CudA Multiple Precision ARithmetic librarY.
Existing multiple-precision libraries (on CPU/GPU architectures):
– multiple-digit representation: GNU MPFR, ARPREC; GPU variants: GARPREC, CUMP
– multiple-term representation: QD; GPU variant: GQD
1 / 10

Extending the precision using the multiple-term format: FP expansions
Let x = M_x · 2^(e_x) be a precision-p floating-point (FP) number, with 1 ≤ |M_x| < 2. Denote ulp(x) = 2^(e_x − p + 1) (Goldberg's definition).
Def. A floating-point expansion u with n terms is the unevaluated sum u = Σ_{i=0}^{n−1} u_i of n FP numbers u_0, …, u_{n−1}, s.t. u_i ≠ 0 ⇒ |u_i| ≥ |u_{i+1}|.
Non-overlapping FP expansions: u is Bailey-nonoverlapping if for all 0 < i < n we have |u_i| ≤ (1/2) ulp(u_{i−1}).
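The two definitions above (Goldberg's ulp and Bailey non-overlapping) can be checked mechanically. A minimal C sketch, with function names of our own choosing (not library code):

```c
#include <math.h>
#include <stdbool.h>

/* Goldberg's ulp for binary64 (p = 53): if x = Mx * 2^ex with
   1 <= |Mx| < 2, then ulp(x) = 2^(ex - p + 1).  frexp normalizes the
   mantissa to [0.5, 1), so the exponent it reports is ex + 1. */
static double ulp53(double x) {
    int e;
    frexp(x, &e);
    return ldexp(1.0, e - 53);
}

/* Bailey non-overlapping test for an n-term expansion u[0..n-1]:
   |u[i]| <= (1/2) ulp(u[i-1]) for all 0 < i < n. */
static bool is_bailey_nonoverlapping(const double *u, int n) {
    for (int i = 1; i < n; i++)
        if (fabs(u[i]) > 0.5 * ulp53(u[i - 1]))
            return false;
    return true;
}
```

For instance, the two binary64 terms of the double-double π (the example slide) pass this test, while the pair {1.0, 0.5} does not.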
Example: π in double-double:
p0 = 11.001001000011111101101010100010001000010110100011000₂ and
p1 = 1.0001101001100010011000110011000101000101110000000111₂ × 2^(−53).
p0 + p1 ↔ a 107-bit FP approximation.
2 / 10

Extending the precision using FP expansions
Pros:
– use directly available and highly optimized native FP operations
– sufficiently simple and regular algorithms for addition and multiplication
– straightforwardly portable to highly parallel architectures, such as GPUs.
Cons:
– lack of thorough error bounds
– no correct rounding
– QD only supports the 2-double and 4-double formats.
Existing algorithms
Addition and multiplication: generalized/adapted versions of [Priest'91], [Shewchuk'97], [Bailey'01], based on error-free transforms: 2Sum, 2Prod, 2ProdFMA.
Division: based on the classical "paper and pencil" long-division algorithm [Bailey'01, Priest'91, Daumas'99].
3 / 10

Our contribution: reciprocal of FP expansions with an adapted Newton-Raphson iteration & explicit error bound
Newton iteration for a root α of f:
x_{n+1} = x_n − f(x_n)/f'(x_n);
when x_0 is close to α and f'(α) ≠ 0 → quadratic convergence.
Newton-Raphson iteration for the reciprocal, i.e. for the root 1/a of f(x) = 1/x − a:
x_{n+1} = x_n (2 − a x_n);
when x_0 is close to 1/a → quadratic convergence, with
x_{n+1} − 1/a = −a (x_n − 1/a)².
Adapted Newton-Raphson iteration for the reciprocal of an FP expansion: the same recurrence x_{n+1} = x_n · (2 − a · x_n), where the product a · x_n and the outer product by x_n are computed with the truncated expansion multiplication RdMulE, and the subtraction from 2 with the truncated expansion subtraction RdSubE → quadratic convergence.
4 / 10

Main Algorithm
Algorithm 1: truncated Newton iteration based algorithm for the reciprocal of an FP expansion.
Input: FP expansion a = a_0 + … + a_{2^k − 1}; length 2^q of the output FP expansion.
Output: FP expansion x = x_0 + … + x_{2^q − 1} s.t. |x − 1/a| ≤ 2^(−2^q (p−3) − 1) / |a|.   (1)
1: x_0 = RN(1/a_0)
2: for i ← 0 to q − 1 do
3:   v̂[0 : 2^(i+1) − 1] ← RdMulE(x[0 : 2^i − 1], a[0 : 2^(i+1) − 1], 2^(i+1))
4:   ŵ[0 : 2^(i+1) − 1] ← RdSubE(2, v̂[0 : 2^(i+1) − 1], 2^(i+1))
5:   x[0 : 2^(i+1) − 1] ← RdMulE(x[0 : 2^i − 1], ŵ[0 : 2^(i+1) − 1], 2^(i+1))
6: end for
7: return FP expansion x = x_0 + … + x_{2^q − 1}.
5 / 10

Example, adapted Newton-Raphson iteration on FP expansions
General procedure: x_{n+1} = x_n (2 − a x_n).
iter 0: x_0 = RN(1/a_0);
iter 1: [x_0, x_1] = x_0 · (2 − [a_0, a_1] · x_0);
iter 2: [x_0, x_1, x_2, x_3] = [x_0, x_1] · (2 − [a_0, a_1, a_2, a_3] · [x_0, x_1]);
and so on, doubling the number of terms at each iteration.
But when using error-free transforms, multiplying an n-term expansion u_0, …, u_{n−1} by an m-term expansion v_0, …, v_{m−1} produces up to 2mn terms w_0, …, w_{2mn−1}. Hence: use truncated addition/multiplication.
6 / 10

Truncations and Error Analysis
General procedure: x_{n+1} = x_n (2 − a x_n), with v̂_i the truncated product a · x_i and ŵ_i the truncated difference 2 − v̂_i.
Error analysis based on triangle inequalities.
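Algorithm 1 runs this recurrence on expansions with truncated operations; the quadratic convergence that drives it can be seen already in plain binary64. A sketch under our own naming (illustration only, not the paper's code):

```c
/* Newton-Raphson reciprocal x_{n+1} = x_n * (2 - a * x_n) in plain
   binary64.  The seed is deliberately low precision (binary32), so each
   iteration roughly doubles the number of correct bits -- the expansion
   version of Algorithm 1 doubles the number of terms instead. */
static double nr_reciprocal(double a, int iters) {
    double x = (double)(1.0f / (float)a);   /* ~24 correct bits */
    for (int n = 0; n < iters; n++)
        x = x * (2.0 - a * x);   /* error roughly squared each step */
    return x;
}
```

For a = 3 the residual |3x − 1| drops from about 1e-8 (seed) to about 1e-15 after one step, and to full binary64 accuracy after two, matching x_{n+1} − 1/a = −a (x_n − 1/a)².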
Let η = Σ_{j=0}^{∞} 2^((−j−1)p) = 2^(−p) / (1 − 2^(−p)) and γ_i = 2^(−(2^(i+1) − 1)p) · η / (1 − η). With τ_i the exact product x_i · ŵ_i and a^(i) the truncated input used at step i:
|x_{i+1} − τ_i| ≤ γ_i |x_i · ŵ_i|,   (2a)
|w_i − ŵ_i| ≤ γ_i |w_i| ≤ γ_i |2 − v̂_i|,   (2b)
|v_i − v̂_i| ≤ γ_i |a^(i) · x_i|,   (2c)
|a − a^(i)| ≤ γ_i |a|.   (2d)
And finally,
|x_i − a^(−1)| ≤ 2^(−2^i (p−3) − 1) |a^(−1)|.
7 / 10

Comparison and Results
Table (a): error bound values for Priest's formula [Priest'91] vs. Daumas [Daumas'99] vs. our analysis (1); d = 2^q terms are computed in the quotient.

Prec, iteration | Priest  | Daumas  | Eq. (1)
p = 53, q = 0   | 2       | 2^−49   | 2^−51
p = 53, q = 1   | 1       | 2^−98   | 2^−101
p = 53, q = 2   | 2^−2    | 2^−195  | 2^−201
p = 53, q = 3   | 2^−6    | 2^−387  | 2^−401
p = 53, q = 4   | 2^−13   | 2^−764  | 2^−801
p = 24, q = 0   | 2       | 2^−20   | 2^−22
p = 24, q = 1   | 1       | 2^−40   | 2^−43
p = 24, q = 2   | 2^−2    | 2^−79   | 2^−85
p = 24, q = 3   | 2^−5    | 2^−155  | 2^−169
p = 24, q = 4   | 2^−12   | 2^−300  | 2^−337
8 / 10

Comparison and Results
Table: Timings† in MFlop/s for Alg. 1 vs. the QD implementation, for the reciprocal (A) and division (B) of expansions; the numerator, denominator and quotient have d_n, d_i and d_o terms, respectively.

(A) Reciprocal
d_i, d_o | Alg. 1 | QD
1, 1     | 107    | 107
2, 2     | 62     | 70
4, 4     | 10     | 3.6
1, 2     | 62     | 86.2
2, 4     | 10.7   | 3.7
4, 2     | 61     | 86.2
1, 4     | 12.6   | 7.36
1, 8     | 2      | *
2, 8     | 1.7    | *
4, 8     | 1.4    | *
8, 8     | 1.3    | *
1, 16    | 0.3    | *
2, 16    | 0.27   | *
4, 16    | 0.22   | *
8, 16    | 0.19   | *
16, 16   | 0.17   | *

(B) Division
d_n, d_i, d_o | Alg. 1 | QD
2, 2, 2       | 46.3   | 70
4, 4, 4       | 6.8    | 3.6
2, 1, 2       | 46.7   | 86.2
4, 2, 4       | 7      | 3.7
2, 4, 2       | 46.1   | 86.2
4, 1, 4       | 7.7    | 7.36

† Intel(R) Core(TM) i7 CPU 3820, 3.6 GHz
* precision not supported
9 / 10

Conclusion
– Use a multiple-term format for multiple-precision floating-point numbers → FP expansions.
– Method for computing the reciprocal/division of FP expansions based on:
  – "truncated" addition and multiplication
  – an "adapted" Newton-Raphson iteration.
– Thorough error analysis and explicit error bound.
– CAMPARY: CudA Multiple Precision ARithmetic librarY.
Thank you!
10 / 10
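A closing check on Table (a): its Eq. (1) column is just the exponent −2^q (p − 3) − 1 from the output bound of Algorithm 1. A one-line C helper of our own for reproducing the column:

```c
/* Exponent of the error bound 2^(-2^q (p-3) - 1) from Eq. (1). */
static long bound_exponent(long p, long q) {
    return -(1L << q) * (p - 3) - 1;
}
```

E.g. p = 53, q = 4 gives −801 and p = 24, q = 2 gives −85, matching the table.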