Download Talk - Louisiana State University

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Compiler Support for Optimizing
Tensor Contraction Expressions in
Quantum Chemistry Computations*
Gerald Baumgartner, Ohio State University
Daniel Cociorva, Ohio State University
Chi-Chung Lam, Ohio State University
P. Sadayappan, Ohio State University
J. Ramanujam, Louisiana State University
*Supported
by the National Science Foundation (NSF) through the ITR program
Introduction/Motivation
• Accurate computational models of electronic structure
are very time consuming
– Time/effort to develop code, due to complexity of
models for multiple-electron interactions
– Execution time of code, even on high-end systems
• Goal of multi-disciplinary project: develop a program
synthesis tool for quantum chemists
– Reduce model development time from weeks/months
to hours/days
– Improve execution time of quantum chemistry code
• Focus on a particular computational form of great
importance in accurate modeling of electronic structure:
tensor contractions
The Computational Context
• Accurate electronic structure models often involve tensor
contractions: multi-dimensional summations over products of
large arrays, e.g.
S (a, b, i, j ) 
 A(a, c, i, k ) B(b, e, f , l )C (d , f , j, k ) D(c, d , e, l )  ...
c , d , e , f , k ,l
• Many compile-time optimization optimization issues:
– Operation minimization via algebraic transformations
– Memory minimization via loop fusion and array contraction
– Space-time trade-off: Store and re-use versus redundant re-computation of
temporary intermediate arrays quantities
– Data Locality Optimization: Tile-size selection and partial array expansion
for tiling imperfectly nested loops
– Data and Computation Partitoning: Minimize communication overhead
– Semantic information from high-level expression of computation
enables global optimizations across multiple imperfect loop nests
Realistic Quantum Chemistry Example
hbar[a,b,i,j] == sum[f[b,c]*t[i,j,a,c],{c}] -sum[f[k,c]*t[k,b]*t[i,j,a,c],{k,c}] +sum[f[a,c]*t[i,j,c,b],{c}] -sum[f[k,c]*t[k,a]*t[i,j,c,b],{k,c}] sum[f[k,j]*t[i,k,a,b],{k}] -sum[f[k,c]*t[j,c]*t[i,k,a,b],{k,c}] -sum[f[k,i]*t[j,k,b,a],{k}] -sum[f[k,c]*t[i,c]*t[j,k,b,a],{k,c}]
+sum[t[i,c]*t[j,d]*v[a,b,c,d],{c,d}] +sum[t[i,j,c,d]*v[a,b,c,d],{c,d}] +sum[t[j,c]*v[a,b,i,c],{c}] -sum[t[k,b]*v[a,k,i,j],{k}]
+sum[t[i,c]*v[b,a,j,c],{c}] -sum[t[k,a]*v[b,k,j,i],{k}] -sum[t[k,d]*t[i,j,c,b]*v[k,a,c,d],{k,c,d}] -sum[t[i,c]*t[j,k,b,d]*v[k,a,c,d],{k,c,d}] sum[t[j,c]*t[k,b]*v[k,a,c,i],{k,c}] +2*sum[t[j,k,b,c]*v[k,a,c,i],{k,c}] -sum[t[j,k,c,b]*v[k,a,c,i],{k,c}] sum[t[i,c]*t[j,d]*t[k,b]*v[k,a,d,c],{k,c,d}] +2*sum[t[k,d]*t[i,j,c,b]*v[k,a,d,c],{k,c,d}] -sum[t[k,b]*t[i,j,c,d]*v[k,a,d,c],{k,c,d}] sum[t[j,d]*t[i,k,c,b]*v[k,a,d,c],{k,c,d}] +2*sum[t[i,c]*t[j,k,b,d]*v[k,a,d,c],{k,c,d}] -sum[t[i,c]*t[j,k,d,b]*v[k,a,d,c],{k,c,d}] sum[t[j,k,b,c]*v[k,a,i,c],{k,c}] -sum[t[i,c]*t[k,b]*v[k,a,j,c],{k,c}] -sum[t[i,k,c,b]*v[k,a,j,c],{k,c}] -sum[t[i,c]*t[j,d]*t[k,a]*v[k,b,c,d],{k,c,d}]
-sum[t[k,d]*t[i,j,a,c]*v[k,b,c,d],{k,c,d}] -sum[t[k,a]*t[i,j,c,d]*v[k,b,c,d],{k,c,d}] +2*sum[t[j,d]*t[i,k,a,c]*v[k,b,c,d],{k,c,d}] sum[t[j,d]*t[i,k,c,a]*v[k,b,c,d],{k,c,d}] -sum[t[i,c]*t[j,k,d,a]*v[k,b,c,d],{k,c,d}] -sum[t[i,c]*t[k,a]*v[k,b,c,j],{k,c}]
+2*sum[t[i,k,a,c]*v[k,b,c,j],{k,c}] -sum[t[i,k,c,a]*v[k,b,c,j],{k,c}] +2*sum[t[k,d]*t[i,j,a,c]*v[k,b,d,c],{k,c,d}] sum[t[j,d]*t[i,k,a,c]*v[k,b,d,c],{k,c,d}] -sum[t[j,c]*t[k,a]*v[k,b,i,c],{k,c}] -sum[t[j,k,c,a]*v[k,b,i,c],{k,c}] -sum[t[i,k,a,c]*v[k,b,j,c],{k,c}]
+sum[t[i,c]*t[j,d]*t[k,a]*t[l,b]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[k,l,c,d],{k,l,c,d}] 2*sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[k,a]*t[l,b]*t[i,j,c,d]*v[k,l,c,d],{k,l,c,d}] 2*sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,c,d],{k,l,c,d}]
+sum[t[j,d]*t[l,b]*t[i,k,c,a]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[k,l,c,d],{k,l,c,d}]
+sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,c]*t[l,b]*t[j,k,d,a]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,c,d],{k,l,c,d}]
+4*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,c,d],{k,l,c,d}]
-2*sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,k,c,a]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,c]*t[j,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}]
+sum[t[i,j,c,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,j,c,b]*t[k,l,a,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,j,a,c]*t[k,l,b,d]*v[k,l,c,d],{k,l,c,d}]
+sum[t[j,c]*t[k,b]*t[l,a]*v[k,l,c,i],{k,l,c}] +sum[t[l,c]*t[j,k,b,a]*v[k,l,c,i],{k,l,c}] -2*sum[t[l,a]*t[j,k,b,c]*v[k,l,c,i],{k,l,c}]
+sum[t[l,a]*t[j,k,c,b]*v[k,l,c,i],{k,l,c}] -2*sum[t[k,c]*t[j,l,b,a]*v[k,l,c,i],{k,l,c}] +sum[t[k,a]*t[j,l,b,c]*v[k,l,c,i],{k,l,c}]
+sum[t[k,b]*t[j,l,c,a]*v[k,l,c,i],{k,l,c}] +sum[t[j,c]*t[l,k,a,b]*v[k,l,c,i],{k,l,c}] +sum[t[i,c]*t[k,a]*t[l,b]*v[k,l,c,j],{k,l,c}]
+sum[t[l,c]*t[i,k,a,b]*v[k,l,c,j],{k,l,c}] -2*sum[t[l,b]*t[i,k,a,c]*v[k,l,c,j],{k,l,c}] +sum[t[l,b]*t[i,k,c,a]*v[k,l,c,j],{k,l,c}]
+sum[t[i,c]*t[k,l,a,b]*v[k,l,c,j],{k,l,c}] +sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,d,c],{k,l,c,d}] +sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,d,c],{k,l,c,d}]
+sum[t[j,d]*t[l,a]*t[i,k,c,b]*v[k,l,d,c],{k,l,c,d}] -2*sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,d,c],{k,l,c,d}] -2*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}]
+sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,c,b]*t[j,l,d,a]*v[k,l,d,c],{k,l,c,d}]
+sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,d,c],{k,l,c,d}] +sum[t[k,a]*t[l,b]*v[k,l,i,j],{k,l}] +sum[t[k,l,a,b]*v[k,l,i,j],{k,l}]
+sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[l,k,c,d],{k,l,c,d}] +sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[l,k,c,d],{k,l,c,d}]
+sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[l,k,c,d],{k,l,c,d}] -2*sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[l,k,c,d],{k,l,c,d}]
+sum[t[i,c]*t[l,a]*t[j,k,d,b]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,j,c,b]*t[k,l,a,d]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,j,a,c]*t[k,l,b,d]*v[l,k,c,d],{k,l,c,d}] 2*sum[t[l,c]*t[i,k,a,b]*v[l,k,c,j],{k,l,c}] +sum[t[l,b]*t[i,k,a,c]*v[l,k,c,j],{k,l,c}] +sum[t[l,a]*t[i,k,c,b]*v[l,k,c,j],{k,l,c}] +v[a,b,i,j]
Operation Minimization
 A(a, c, i, k )B(b, e, f , l )C (d , f , j, k )D(c, d , e, l )
S (a, b, i, j ) 
c , d , e , f , k ,l
 A(a, c, i, k )C (d , f , j, k ) B(b, e, f , l )D(c, d , e, l )
S (a, b, i, j ) 
c , d , e , f , k ,l



S (a, b, i, j ) 
A(a, c, i, k )C (d , f , j , k )
B(b, e, f , l ) D(c, d , e, l ) 


c,d , f ,k
e
,
l





S (a, b, i, j ) 
A(a, c, i, k )  C (d , f , j, k )
B(b, e, f , l ) D(c, d , e, l ) 


d, f
c,k
e
,
l








 B(b, e, f , l )D(c, d , e, l )
T 2(b, c, j , k )   T 1(b, c, d , f )C (d , f , j , k )
S (a, b, i, j )   T 2(b, c, j , k ) A(a, c, i, k )
T 1(b, c, d , f ) 
2N6 Ops
e ,l
2N6 Ops
d, f
c,k
2N6 Ops
Operation Minimization
S (a, b, i, j ) 
 A(a, c, i, k ) B(b, e, f , l )C (d , f , j, k ) D(c, d , e, l )
c , d , e , f , k ,l
• Requires 4N10 operations if indices a – l have range N
• Using associative, commutative and distributive laws, constant
sub-expressions can be factored to reduce operations
• Optimal formula sequence requires only 6N6 operations
T 1(b, c, d , f )   B(b, e, f , l ) D(c, d , e, l )
2N6 Ops
e ,l
T 2(b, c, j , k )   T 1(b, c, d , f )C (d , f , j , k )
2N6 Ops
d, f
S (a, b, i, j )   T 2(b, c, j , k ) A(a, c, i, k )
c ,k
2N6 Ops
• But additional space is required for the temporary intermediates
T1 and T2; often this is much larger than input/output arrays
Loop Fusion for Memory Reduction
S=0
T 1bcdf   Bbefl Dcdel
e ,l
T 2bcjk   T 1bcdf Cdfjk
d, f
S abij   T 2bcjk Aacik
c ,k
T1 = 0; T2 = 0; S = 0
for b, c, d, e, f, l
T1bcdf += Bbefl Dcdel
for b, c, d, f, j, k
T2bcjk += T1bcdf Cdfjk
for a, b, c, i, j, k
Sabij += T2bcjk Aacik
for b, c
T1f = 0; T2f = 0
for d, f
for e, l
T1f += Bbefl Dcdel
for j, k
T2fjk += T1f Cdfjk
for a, i, j, k
Sabij += T2fjk Aacik
Formula sequence
Unfused code
Fused code
Tensor Contraction Engine
Tensor contraction expressions
Operation Minimization
Sequence of loop nests (expression tree/DAG)
Memory Minimization
(Storage required exceeds limits)
(Storage requirements within limits)
Space-Time Trade-Offs
Imperfectly nested loops
Communication Minimization
Imperfectly nested parallel loops
Data Locality Optimization
Partitioned, tiled, parallel Fortran/C code
Example to Illustrate Space-Time Trade-Offs
for a, e, c, f
for i, j
Xaecf += Tijae Tijcf
for c, e, b, k
T1cebk = f1(c, e, b, k)
for a, f, b, k
T2afbk = f2(a, f, b, k)
array
X
T1
T2
Y
E
space
V4
V3O
V3O
V4
1
time
V4O 2
Cf1V3O
Cf2V3O
V5O
V4
for c, e, a, f
for b, k
Yceaf += T1cebk T2afbk
for c, e, a, f
E += Xaecf Yceaf
a .. f: range V = 1000 .. 3000
i .. k: range O = 30 .. 100
Memory-Minimal Form
for a, f, b, k
T2afbk = f2(a, f, b, k)
for c, e
for b, k
T1bk = f1(c, e, b, k)
for a, f
for i, j
array
X
T1
T2
Y
E
space
1
VO
V3O
1
1
time
V4O 2
Cf1V3O
Cf2V3O
V5O
V4
X += Tijae Tijcf
for b, k
Y += T1bk T2afbk
E += X Y
a .. f: range V = 3000
i .. k: range O = 100
Redundant Computation to Allow Full Fusion
for a, e, c, f
for i, j
X += Tijae Tijcf
for b, k
T1 = f1(c, e, b, k)
T2 = f2(a, f, b, k)
Y += T1 T2
E += X Y
array
X
T1
T2
Y
E
space
1
1
1
1
1
time
V 4 O2
Cf1V5O
Cf2V5O
V 5O
V4
Tiling for Reducing Recomputation
for at, et, ct, ft
for a, e, c, f
for i, j
Xaecf += Tijae Tijcf
for b, k
for c, e
T1ce = f1(c, e, b, k)
for a, f
T2af = f2(a, f, b, k)
for c, e, a, f
Yceaf += T1ce T2af
for c, e, a, f
E += Xaecf Yceaf
array
X
T1
T2
Y
E
space
B4
B2
B2
B4
1
time
V 4O2
Cf1(V/B)2V3O
Cf2(V/B)2V3O
V 5O
V4
The Fusion Graph
C (i, k )   A(i, j ) B( j , k )
j
∑
j
*
A(i,j)
B(j,k)
i j
j k
Making Fused Loops Explicit
C (i, k )   A(i, j ) B( j , k )
j
i: range 10
j: range 10
k: range 100
∑
j
*
A(i,j)
B(j,k)
i j
j k
Adding Recomputation Loops
C (i, k )   A(i, j ) B( j , k )
j
i: range 10
j: range 10
k: range 100
∑
*
j
{ (<i j,k>, 3, 9000), . . . }
A(i,j)
B(j,k)
i j
i j k
Tiling Recomputation Loops
C (i, k )   A(i, j ) B( j , k )
j
i: range 10
j: range 10
k: range 100
∑
j
*
A(it,i,j)
it
i j
it
B(j,k)
j k
Space-Time Trade-Off Algorithm
• For e1(i,j,k) * e2(i,j) consider recomputation of e2 in k loop
• For each expression tree node in bottom-up traversal
– Construct set of solutions { (fusion, mem cost, rec cost), . . . }
– Prune inferior solutions
• For all solutions for the root of the expression tree
– Split recomputation loops into tiling/intra-tile loops
– Move intra-tile loops inside tiling loops where needed
– Search for tile sizes that minimize recomputation cost
Conclusions and Status
• Computational domain with exploitable structure
• Several performance optimization issues being
addressed:
–
–
–
–
–
algebraic transformations for minimizing # operations
loop fusion
loop tiling (with partial array expansion)
space-time trade-off
data partitioning
• First prototype of synthesis system by end of 2002
• Future extensions
– Sparse arrays, symmetry, common subexpressions
Related documents