Download slides - Lirmm

Fast Modular Reduction
Will Hasenplaugh, Gunnar Gaubatz, Vinodh Gopal
June 27, 2007

Modular Multiplication
• Modular multiplication is used in public-key cryptography
  – Diffie-Hellman and RSA
  – Prime-field elliptic curve cryptography
  – Compute $AB \bmod M$, where A, B and M are typically hundreds to thousands of bits
• We present a variant of Barrett's modular reduction algorithm that exploits Karatsuba multiplication and modular folding
• Analysis is software focused
  – We use an abstract processor to compare algorithms fairly
  – The native word size is w bits (a power of 2)
  – 1-cycle add and an m-cycle multiply
  – We present example data on an 8-bit processor with a 2-cycle multiplier
  – Atmel AVR series: representative of embedded handheld devices
  – Our algorithm is also applicable to hardware acceleration

Digital Enterprise Group

Montgomery vs. Barrett

Word-Serial Montgomery
Pro:
• Regularity
• Interleaved multiply and reduce
  – Low-complexity quotient estimation
• Right-to-left computation leads to convenient hardware pipelines
Con:
• Transformation overhead
• $n^2$ complexity

Barrett
Pro:
• No transformation overhead
• Large-digit-based computation
  – Allows sub-$n^2$ multiplication techniques
• Flexible 'off the shelf' hardware
Con:
• Quotient estimation requires a 'large digit' multiplication
• Left-to-right computation is less convenient for hardware

Barrett vs. Montgomery
Performance of $n^2$ Barrett approaches ~2/3 of Montgomery.
• Quotient estimation for Montgomery is amortized as operands grow

[Plot: speedup of Barrett vs. Montgomery (y-axis, 0.65 to 0.85) against operand size in bits (x-axis, 16 to 4096).]

Karatsuba Multiplication
Recursive multiplication algorithm with $O(n^{1.585})$ complexity. 'Schoolbook' multiplication complexity scales as $O(n^2)$, but requires fewer additions per recursion.
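To make the complexity comparison concrete, here is a minimal Python sketch (an illustration, not the authors' implementation) of recursive Karatsuba that counts base-case multiplies. The function name `karatsuba` and the `counter` argument are hypothetical. It assumes the operand length in words is a power of two and, as a simplification, treats a base-case operand that carries a few extra bits as a single word multiply.

```python
def karatsuba(a, b, words, w=8, counter=None):
    """Multiply non-negative ints a and b, each nominally `words` w-bit words.

    `words` must be a power of two. counter[0] accumulates the number of
    base-case (single-word) multiplies. Illustrative sketch only.
    """
    if counter is None:
        counter = [0]
    if words == 1:
        counter[0] += 1          # one word-sized multiply
        return a * b
    half = words // 2
    shift = half * w             # split point in bits
    mask = (1 << shift) - 1
    a1, a0 = a >> shift, a & mask
    b1, b0 = b >> shift, b & mask
    hi = karatsuba(a1, b1, half, w, counter)
    lo = karatsuba(a0, b0, half, w, counter)
    # Middle term from one multiply: (a1+a0)(b1+b0) - a1*b1 - a0*b0
    mid = karatsuba(a1 + a0, b1 + b0, half, w, counter) - hi - lo
    return (hi << (2 * shift)) + (mid << shift) + lo


counter = [0]
a, b = 0xDEADBEEFCAFEBABE, 0x0123456789ABCDEF   # 8 words of 8 bits each
product = karatsuba(a, b, 8, 8, counter)
# counter[0] is 27 (3 multiplies per level, 3 levels), versus 8 * 8 = 64
# word multiplies for schoolbook on the same operands.
```

Each recursion level triples the multiply count while halving the operand size, which is where the $\log_2 3 \approx 1.585$ exponent comes from.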
With $A = a_1 2^n + a_0$, $B = b_1 2^n + b_0$ and $N = AB$:

Schoolbook multiplication:
$$N = a_1 b_1 2^{2n} + (a_1 b_0 + a_0 b_1) 2^n + a_0 b_0$$

Karatsuba multiplication:
$$N = a_1 b_1 2^{2n} + \left[(a_1 + a_0)(b_1 + b_0) - a_1 b_1 - a_0 b_0\right] 2^n + a_0 b_0$$

Recursive Karatsuba Decomposition
Splitting A into $a_1$ and $a_0$ and forming $a_1 + a_0$ produces an 'extra' word above the half-size operand. Its value is ≤ 1 after one recursion, ≤ 2 after two and ≤ 3 after three.
• For k recursions: the 'extra' word is ≤ $\log_2 k$ bits
• Just one extra word on an 8-bit machine is sufficient to handle multiplication of numbers up to $2^{258}$ bits long. There are fewer particles in the universe than that. So, we probably won't need to rewrite this code.

Carry Handling
There is considerable overhead in the naïve implementation of Karatsuba.
• At a recursion depth of 4, ~20% of the multiplies are with sparsely populated 'extra' words.
We turn sparsely populated multiplies into branches and adds. With $N = AB$, $A = a_h 2^n + a_l$ and $B = b_h 2^n + b_l$, where $a_h$ and $b_h$ are booleans:
$$N = a_h b_h 2^{2n} + \left[a_h b_l + b_h a_l\right] 2^n + a_l b_l$$
• Compute $a_l b_l$
• Add $a_l 2^n$ if $b_h = 1$
• Add $b_l 2^n$ if $a_h = 1$
• Add $2^{2n}$ if $a_h \,\&\, b_h = 1$
Each recursion is a conveniently sized multiply, so there are no 'extra' words.

Karatsuba vs. Schoolbook Multiplication
[Plot: speedup vs. schoolbook (y-axis, 0.0 to 5.0) against operand size in bits (x-axis, 64 to 4096). Curve: Karatsuba plus carry handling, $\frac{3}{8}\left(\frac{4}{3}\right)^{\log_2 \frac{n}{w}}$.]

Barrett's Algorithm
A, B and M are n-bit numbers. We seek to find $R \equiv AB \bmod M$ using Barrett's algorithm.
$$\mu = \left\lfloor \frac{2^{2n}}{M} \right\rfloor, \qquad N = AB$$
$$R = N - \left\lfloor \frac{\left\lfloor N / 2^n \right\rfloor \mu}{2^n} \right\rfloor M, \qquad R < 3M$$
• Multiply $A \times B$ to get N
• Multiply $\lfloor N / 2^n \rfloor \times \mu$ and keep the upper part, ~$\mu N / 2^{2n}$
• Multiply by M to get ~$\mu N M / 2^{2n}$ and subtract it from N to get R
A total of 3 n-bit multiplies.

Barrett vs. Montgomery
[Plot: speedup vs. Montgomery (y-axis, 0.0 to 4.0) against operand size in bits (x-axis, 64 to 4096). Curve: Barrett plus Karatsuba plus carry handling, $\frac{1}{4}\left(\frac{4}{3}\right)^{\log_2 \frac{n}{w}}$.]

Folding
We accelerate the reduction process by partially reducing N (= AB) with an inexpensive method called folding. With $n = 2s$:
$$C \cdot 2^{3s} \equiv C \cdot \left(2^{3s} \bmod M\right) \pmod{M}$$
$$M' = 2^{3s} \bmod M$$
$$N' = \left(N \bmod 2^{3s}\right) + \left\lfloor \frac{N}{2^{3s}} \right\rfloor M'$$
$$N' < 2^{3s+1}, \qquad N' \equiv N \pmod{M}$$

Iterative Folding
We can play the same trick again. F times, in fact. The fold points are $2^{1.5n}$, $2^{1.25n}$, $2^{1.125n}$ and so on:
$$N^{(0)} = N = AB$$
$$M^{(i)} = 2^{(1 + 2^{-i}) n} \bmod M, \qquad 1 \le i \le F$$
$$N^{(i)} = \left(N^{(i-1)} \bmod 2^{(1 + 2^{-i}) n}\right) + \left\lfloor \frac{N^{(i-1)}}{2^{(1 + 2^{-i}) n}} \right\rfloor M^{(i)}$$
Barrett's quotient estimate is then applied to $N^{(F)}$, giving $R < (3 + F) M$.

Iterative Folding (F = 2)
[Plot: speedup vs. Montgomery (y-axis, 0.0 to 4.0) against operand size in bits (x-axis, 64 to 4096). Curve: Barrett plus Karatsuba plus carry handling plus folding, $\frac{2}{7}\left(\frac{4}{3}\right)^{\log_2 \frac{n}{w}}$.]

Summary
• This fast modular reduction technique is ~2x faster than Montgomery for RSA encryption with 512 to 1024-bit keys.
• As security requirements heighten, key sizes will grow to meet them and the asymptotic advantage of Karatsuba will continue to shine. We see a ~3x and ~4x advantage, respectively, for 2048 and 4096-bit keys.
• The speedup of a multiplier-bound, w-bit architecture is $\frac{9}{16}\left(\frac{4}{3}\right)^{\log_2 \frac{n}{w}}$
• Strong encryption on low-power handheld devices is challenging
  – Ex: A 16 MHz 8-bit Atmel AVR computes a 4096-bit RSA operation in almost 4 minutes with Montgomery, but we can do it in 1.
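The Barrett and folding steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `fold` and `barrett_reduce` are hypothetical, Python's built-in big integers stand in for the word-level Karatsuba multiplies, the sketch always uses $\mu = \lfloor 2^{2n} / M \rfloor$ for the quotient estimate, and the final correction is a loop rather than a fixed number of subtractions. It assumes n is divisible by $2^F$.

```python
def fold(N, M, n, i):
    """One folding step: cut N at bit (1 + 2**-i) * n and fold the top back in.

    Uses M_i = 2**cut mod M, which would be precomputed per modulus in practice.
    The result is congruent to N modulo M but shorter.
    """
    cut = n + (n >> i)
    m_i = pow(2, cut, M)
    return (N & ((1 << cut) - 1)) + (N >> cut) * m_i


def barrett_reduce(N, M, n, folds=1):
    """Reduce N modulo an n-bit modulus M with `folds` folding steps plus Barrett.

    Illustrative sketch: mu is recomputed on every call here, whereas a real
    implementation would precompute it once per modulus.
    """
    mu = (1 << (2 * n)) // M            # Barrett constant floor(2^(2n) / M)
    for i in range(1, folds + 1):
        N = fold(N, M, n, i)
    q = ((N >> n) * mu) >> n            # quotient estimate, never too large
    R = N - q * M
    while R >= M:                       # a small number of corrections
        R -= M
    return R


M = 0xF123456789ABCDEF                  # 64-bit modulus, chosen arbitrarily
A, B = 0x0FEDCBA987654321, 0xE0E0E0E0E0E0E0E0
R = barrett_reduce(A * B, M, 64, folds=2)
```

Because each fold only shortens N while preserving its residue, the result is the same for any number of folds; what changes is the size of the multiplies needed for the final quotient estimate.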
