Download slides - Lirmm

Fast Modular Reduction
Will Hasenplaugh
Gunnar Gaubatz
Vinodh Gopal
June 27, 2007
Modular Multiplication
• Modular Multiplication is used in Public Key Cryptography
– Diffie-Hellman and RSA
– Prime-field Elliptic Curve Cryptography
– Compute AB mod M where A, B and M are typically hundreds to thousands of bits long
• We present a variant of Barrett’s Modular Reduction Algorithm
which exploits Karatsuba Multiplication and Modular Folding
• Analysis is software focused
– We use an abstract processor to compare algorithms fairly
– The native word size is w bits (a power of 2)
– 1-cycle add and an m-cycle multiply
– We present example data on an 8-bit processor with a 2-cycle multiplier
– Atmel AVR series - representative of embedded handheld devices
– Our algorithm is also applicable to hardware acceleration
Digital Enterprise Group
Montgomery vs. Barrett
Word-Serial Montgomery
Pro:
• Regularity
• Interleaved Multiply and Reduce
– Low-Complexity Quotient Estimation
• Right-to-Left computation leads to convenient hardware pipelines
Con:
• Transformation Overhead
• $n^2$ complexity
Barrett
Pro:
• No Transformation Overhead
• Large Digit Based Computation
– Allows sub-$n^2$ multiplication techniques
• Flexible ‘Off the Shelf’ hardware
Con:
• Quotient Estimation requires a ‘large digit’ multiplication
• Left-to-Right computation is less convenient for hardware
Barrett vs. Montgomery
Performance of $n^2$ Barrett approaches ~2/3 that of Montgomery
• Quotient Estimation for Montgomery is amortized as operands grow
[Figure: Barrett Speedup vs. Montgomery (y-axis, 0.65 to 0.85) against Operand Size in bits (x-axis, n = 16 to 4096).]
Karatsuba Multiplication
Recursive multiplication algorithm with $O(n^{1.585})$ complexity.
‘Schoolbook’ multiplication complexity scales as $O(n^2)$, but requires fewer additions per recursion.
$A = a_1 2^n + a_0$, $B = b_1 2^n + b_0$, $N = AB$
Schoolbook Multiplication:
$N = a_1 b_1 2^{2n} + (a_1 b_0 + a_0 b_1) 2^n + a_0 b_0$
Karatsuba Multiplication:
$N = a_1 b_1 2^{2n} + [(a_1 + a_0)(b_1 + b_0) - a_1 b_1 - a_0 b_0] 2^n + a_0 b_0$
[Diagram: A and B are each split into two halves; the three products $a_1 b_1$, $a_0 b_0$ and $(a_1 + a_0)(b_1 + b_0)$ are combined by addition and subtraction to form N.]
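The Karatsuba identity on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `karatsuba`, the `bits` and `cutoff` parameters, and the use of Python's built-in `*` as a stand-in for the native w-bit multiply are all assumptions made for the example.

```python
def karatsuba(a: int, b: int, bits: int, cutoff: int = 8) -> int:
    """Multiply non-negative integers a and b, each assumed to fit in `bits` bits.

    `cutoff` (assumed >= 3) is the size at which recursion stops and a plain
    multiply is used; here it stands in for the processor's native multiply.
    """
    if bits <= cutoff:
        return a * b

    half = (bits + 1) // 2
    mask = (1 << half) - 1

    # Split: A = a1 * 2^half + a0, B = b1 * 2^half + b0
    a1, a0 = a >> half, a & mask
    b1, b0 = b >> half, b & mask

    # Three recursive multiplies instead of four.
    high = karatsuba(a1, b1, half, cutoff)   # a1 * b1
    low = karatsuba(a0, b0, half, cutoff)    # a0 * b0
    # The sums can be one bit longer than the halves, hence half + 1.
    mid = karatsuba(a1 + a0, b1 + b0, half + 1, cutoff) - high - low

    # N = a1*b1 * 2^(2*half) + [(a1+a0)(b1+b0) - a1*b1 - a0*b0] * 2^half + a0*b0
    return (high << (2 * half)) + (mid << half) + low
```

The `half + 1` in the middle call is where the one-bit growth of the sums shows up; it is the source of the 'extra' word discussed on the following slides.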
Recursive Karatsuba Decomposition
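[Diagram: A is split into $a_1$ and $a_0$, and the sum $a_1 + a_0$ is split again at each recursion; the ‘extra’ word is bounded by <= 1, <= 2 and <= 3 at successive recursion depths.]
For k recursions: ‘extra’ word is <= $\log_2 k$ bits
Just one extra word on an 8-bit machine is sufficient to handle multiplication of numbers up to 2^258 bits.
There are fewer particles in the universe than that.
So, we probably won’t need to rewrite this code.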
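The growth of the 'extra' word can be checked numerically. The sketch below assumes the labels "<= 1", "<= 2", "<= 3" bound the value held above the current half-width after one, two and three recursions, and it follows only the worst-case branch (the sum of the high and low parts of an all-ones operand). The function name `extra_word_after` is hypothetical.

```python
def extra_word_after(k: int, n: int) -> int:
    """Value carried above the current half-width after k splits of an
    all-ones n-bit operand, always following the (high + low) branch."""
    x = (1 << n) - 1   # worst case: every bit set
    m = n
    for _ in range(k):
        m //= 2
        x = (x >> m) + (x & ((1 << m) - 1))   # high part + low part
    return x >> m      # whatever spills past m bits


# One 8-bit extra word holds values up to 255, so it covers 255 recursions.
# If the smallest piece is one 8-bit word, the operand can be 8 * 2^255 bits.
max_operand_bits = 8 * 2**255
```

Under these assumptions the spill grows by one per recursion, so its bit length grows only logarithmically in k, and `max_operand_bits` equals 2^258, the figure quoted on the slide.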
Carry Handling
There is considerable overhead in the naïve implementation of Karatsuba.
• At a recursion depth of 4, ~20% of the multiplies are with sparsely populated ‘extra’ words.
We turn sparsely populated multiplies into branches and adds.
$A = a_h 2^n + a_l$, $B = b_h 2^n + b_l$, $N = AB$
$a_h$ and $b_h$ are booleans
$N = a_h b_h 2^{2n} + [a_h b_l + b_h a_l] 2^n + a_l b_l$
[Diagram: $a_l \times b_l$ gives $a_l b_l$; add $a_l$ if $b_h = 1$; add $b_l$ if $a_h = 1$; add 1 if $a_h$ & $b_h = 1$; the result is N.]
Each recursion is a conveniently-sized multiply -> No ‘extra’ words.
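The branch-and-add scheme above can be sketched as follows. This is an illustrative reading of the slide, assuming each operand is n bits plus a single boolean carry bit; the function name `mul_with_carry_bits` is hypothetical and Python's `*` stands in for the native n-bit multiply.

```python
def mul_with_carry_bits(a: int, b: int, n: int) -> int:
    """Multiply two (n+1)-bit values using one n-bit multiply plus adds.

    a = ah * 2^n + al and b = bh * 2^n + bl, where ah and bh are single bits.
    """
    mask = (1 << n) - 1
    ah, al = a >> n, a & mask
    bh, bl = b >> n, b & mask
    assert ah in (0, 1) and bh in (0, 1), "operands must fit in n + 1 bits"

    result = al * bl              # the only real multiply
    if bh:
        result += al << n         # bh * al * 2^n
    if ah:
        result += bl << n         # ah * bl * 2^n
    if ah and bh:
        result += 1 << (2 * n)    # ah * bh * 2^(2n)
    return result
```

The multiply always sees exactly n-bit inputs, which is the sense in which each recursion stays a conveniently sized multiply.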
Karatsuba vs. Schoolbook Multiplication
[Figure: Speedup vs. Schoolbook (y-axis, 0.0 to 5.0) against Operand Size in bits (x-axis, 64 to 4096). Curves: Karatsuba; Plus Carry Handling. Annotation: $\frac{3}{8} \left(\frac{4}{3}\right)^{\log_2 (n/w)}$]
Barrett’s Algorithm
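A, B and M are n-bit numbers. We seek to find R = AB mod M using Barrett’s Algorithm.
$R \equiv AB \bmod M$
$N = AB$
$\mu = \lfloor 2^{2n} / M \rfloor$
$R = N - \lfloor \lfloor N / 2^n \rfloor \mu / 2^n \rfloor M$
$R < 3M$
[Diagram: A × B gives N, with parts $N / 2^n$ and $N \bmod 2^n$; $N / 2^n$ × μ gives $\mu N / 2^n$, whose high part is ~$\mu N / 2^{2n}$; that × M gives ~$\mu N M / 2^{2n}$, which is subtracted from N to give R.]
A total of 3 n-bit multiplies.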
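The equations on this slide translate almost directly into code. The sketch below is an assumption-laden illustration rather than the authors' routine: the names `barrett_modmul` and `mu` are chosen for the example, μ is recomputed on every call (in practice it would be precomputed once per modulus), and M is assumed to satisfy 2^(n-1) <= M < 2^n.

```python
def barrett_modmul(a: int, b: int, m: int, n: int) -> int:
    """Return a * b mod m for n-bit a, b, m using Barrett's quotient estimate."""
    mu = (1 << (2 * n)) // m          # mu = floor(2^(2n) / M), precomputable

    big_n = a * b                     # multiply 1: N = AB
    q = ((big_n >> n) * mu) >> n      # multiply 2: floor(floor(N / 2^n) * mu / 2^n)
    r = big_n - q * m                 # multiply 3: R = N - q * M

    # The estimate q never exceeds the true quotient, and the slide bounds
    # R below 3M, so at most two corrective subtractions are expected.
    while r >= m:
        r -= m
    return r
```

The three multiplications marked in the comments correspond to the "3 n-bit multiplies" counted on the slide.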
Barrett vs. Montgomery
[Figure: Speedup vs. Montgomery (y-axis, 0.0 to 4.0) against Operand Size in bits (x-axis, n = 64 to 4096). Curves: Barrett; Plus Karatsuba; Plus Carry Handling. Annotation: $\frac{1}{4} \left(\frac{4}{3}\right)^{\log_2 (n/w)}$]
Folding
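We accelerate the reduction process by partially reducing N ( = AB ) with an inexpensive method called Folding:
$C \cdot 2^{3s} \equiv C \cdot (2^{3s} \bmod M) \pmod{M}$
$n = 2s$
$\mu = \lfloor 2^{3s} / M \rfloor$
$M' = 2^{3s} \bmod M$
$N' = (N \bmod 2^{3s}) + \lfloor N / 2^{3s} \rfloor M'$
$N' < 2^{3s+1}$
$N' \equiv N \pmod{M}$
[Diagram: A × B gives N; the high part $N / 2^{3s}$ is multiplied by $M' = 2^{3s} \bmod M$ to give ~$N M' / 2^{3s}$, which is added to the low part $N \bmod 2^{3s}$ to give N’.]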
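A single fold can be sketched as below. The function name `fold_once` is hypothetical, n is assumed even so that s = n / 2 is an integer, and M' is recomputed here for clarity although it depends only on M and would normally be precomputed.

```python
def fold_once(big_n: int, m: int, n: int) -> int:
    """Partially reduce a 2n-bit product N modulo an n-bit M by one fold.

    Returns N' with N' congruent to N mod M and N' < 2^(3s + 1), where n = 2s.
    """
    s = n // 2
    fold_point = 3 * s
    m_prime = (1 << fold_point) % m              # M' = 2^(3s) mod M, precomputable

    low = big_n & ((1 << fold_point) - 1)        # N mod 2^(3s)
    high = big_n >> fold_point                   # floor(N / 2^(3s)), about s bits
    return low + high * m_prime                  # one s-by-n-bit multiply
```

The result is only partially reduced: it is about 1.5n bits long instead of 2n, so the Barrett step that follows works on a shorter input.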
Iterative Folding
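We can play the same trick again. F times, in fact.
$N^{(0)} = AB$
$M^{(i)} = 2^{(1+2^{-i})n} \bmod M$
$N^{(i)} = (N^{(i-1)} \bmod 2^{(1+2^{-i})n}) + \lfloor N^{(i-1)} / 2^{(1+2^{-i})n} \rfloor M^{(i)}$
$\mu = \lfloor 2^{(1+2^{-F})n} / M \rfloor$
$R = N^{(F)} - \lfloor \lfloor N^{(F)} / 2^{n} \rfloor \mu / 2^{2^{-F} n} \rfloor M$
$R < (3 + F) M$
[Diagram: the high part $N / 2^{1.5n}$ is multiplied by $M^{(1)}$ and added to $N \bmod 2^{1.5n}$ to give $N^{(1)}$; the high part of $N^{(1)}$ above $2^{1.25n}$ is multiplied by $M^{(2)}$ and added to $N^{(1)} \bmod 2^{1.25n}$ to give $N^{(2)}$; the next fold point is $2^{1.125n}$.]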
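Repeated folding followed by a final Barrett step can be sketched as below. The fold points 2^((1 + 2^-i) n) follow the slide's 1.5n, 1.25n, 1.125n sequence. The constants in the final step (μ = floor(2^((1 + 2^-F) n) / M), a shift by n, then a shift by 2^-F n) are an assumption, chosen so that F = 0 reduces to the plain Barrett formula; the name `folded_modmul` is hypothetical, and n is assumed divisible by 2^F.

```python
def folded_modmul(a: int, b: int, m: int, n: int, folds: int) -> int:
    """Return a * b mod m using `folds` folding steps, then a Barrett step."""
    assert n % (1 << folds) == 0, "n must be divisible by 2^F"

    big_n = a * b                                  # N(0) = AB
    for i in range(1, folds + 1):
        k = n + (n >> i)                           # fold point (1 + 2^-i) * n
        m_i = (1 << k) % m                         # M(i), precomputable
        big_n = (big_n & ((1 << k) - 1)) + (big_n >> k) * m_i

    # Final Barrett step on the shortened N(F). These constants are an
    # assumption, not taken verbatim from the slide.
    k = n + (n >> folds)
    mu = (1 << k) // m                             # precomputable
    q = ((big_n >> n) * mu) >> (k - n)
    r = big_n - q * m

    # q never exceeds the true quotient, so r >= 0; a few subtractions finish.
    while r >= m:
        r -= m
    return r
```

With `folds=0` the loop is skipped and the routine is the ordinary Barrett reduction; each additional fold shortens the value fed to the quotient estimate at the cost of one narrower multiply.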
Iterative Folding ( F = 2 )
[Figure: Speedup vs. Montgomery (y-axis, 0.0 to 4.0) against Operand Size in bits (x-axis, n = 64 to 4096). Curves: Barrett; Plus Karatsuba; Plus Carry Handling; Plus Folding. Annotation: $\frac{2}{7} \left(\frac{4}{3}\right)^{\log_2 (n/w)}$]
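The annotation on this plot can be evaluated for the example processor. The sketch assumes the exponent is log2(n/w), the base-2 logarithm of the operand size in words, with w = 8 for the 8-bit processor used in the examples; `predicted_speedup` is an illustrative name, not something from the slides.

```python
import math


def predicted_speedup(n_bits: int, w_bits: int = 8, coefficient: float = 2 / 7) -> float:
    """Evaluate coefficient * (4/3)^log2(n/w), the curve annotated on the plot."""
    return coefficient * (4 / 3) ** math.log2(n_bits / w_bits)


if __name__ == "__main__":
    for n_bits in (512, 1024, 2048, 4096):
        print(n_bits, round(predicted_speedup(n_bits), 2))
```

Under these assumptions the curve gives roughly 2x at 1024 bits, just under 3x at 2048 bits and just under 4x at 4096 bits, in line with the figures quoted in the Summary.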
Summary
• This Fast Modular Reduction technique is ~2x faster than Montgomery on RSA encryption with 512- to 1024-bit keys.
• As security requirements heighten, key sizes will grow to meet them and the asymptotic advantage of Karatsuba will continue to shine. We see a ~3x and ~4x advantage, respectively, for 2048- and 4096-bit keys.
• The speedup of a multiplier-bound, w-bit architecture is $\frac{9}{16} \left(\frac{4}{3}\right)^{\log_2 (n/w)}$
• Strong encryption on low-power handheld devices is challenging
– Ex: A 16MHz 8-bit Atmel AVR computes a 4096-bit RSA in almost 4 minutes with Montgomery, but we can do it in 1.