Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Compiler Support for Optimizing
Tensor Contraction Expressions in
Quantum Chemistry Computations*
Gerald Baumgartner, Ohio State University
Daniel Cociorva, Ohio State University
Chi-Chung Lam, Ohio State University
P. Sadayappan, Ohio State University
J. Ramanujam, Louisiana State University
*Supported
by the National Science Foundation (NSF) through the ITR program
Introduction/Motivation
• Accurate computational models of electronic structure
are very time consuming
– Time/effort to develop code, due to complexity of
models for multiple-electron interactions
– Execution time of code, even on high-end systems
• Goal of multi-disciplinary project: develop a program
synthesis tool for quantum chemists
– Reduce model development time from weeks/months
to hours/days
– Improve execution time of quantum chemistry code
• Focus on a particular computational form of great
importance in accurate modeling of electronic structure:
tensor contractions
The Computational Context
• Accurate electronic structure models often involve tensor
contractions: multi-dimensional summations over products of
large arrays, e.g.
S (a, b, i, j )
A(a, c, i, k ) B(b, e, f , l )C (d , f , j, k ) D(c, d , e, l ) ...
c , d , e , f , k ,l
• Many compile-time optimization optimization issues:
– Operation minimization via algebraic transformations
– Memory minimization via loop fusion and array contraction
– Space-time trade-off: Store and re-use versus redundant re-computation of
temporary intermediate arrays quantities
– Data Locality Optimization: Tile-size selection and partial array expansion
for tiling imperfectly nested loops
– Data and Computation Partitoning: Minimize communication overhead
– Semantic information from high-level expression of computation
enables global optimizations across multiple imperfect loop nests
Realistic Quantum Chemistry Example
hbar[a,b,i,j] == sum[f[b,c]*t[i,j,a,c],{c}] -sum[f[k,c]*t[k,b]*t[i,j,a,c],{k,c}] +sum[f[a,c]*t[i,j,c,b],{c}] -sum[f[k,c]*t[k,a]*t[i,j,c,b],{k,c}] sum[f[k,j]*t[i,k,a,b],{k}] -sum[f[k,c]*t[j,c]*t[i,k,a,b],{k,c}] -sum[f[k,i]*t[j,k,b,a],{k}] -sum[f[k,c]*t[i,c]*t[j,k,b,a],{k,c}]
+sum[t[i,c]*t[j,d]*v[a,b,c,d],{c,d}] +sum[t[i,j,c,d]*v[a,b,c,d],{c,d}] +sum[t[j,c]*v[a,b,i,c],{c}] -sum[t[k,b]*v[a,k,i,j],{k}]
+sum[t[i,c]*v[b,a,j,c],{c}] -sum[t[k,a]*v[b,k,j,i],{k}] -sum[t[k,d]*t[i,j,c,b]*v[k,a,c,d],{k,c,d}] -sum[t[i,c]*t[j,k,b,d]*v[k,a,c,d],{k,c,d}] sum[t[j,c]*t[k,b]*v[k,a,c,i],{k,c}] +2*sum[t[j,k,b,c]*v[k,a,c,i],{k,c}] -sum[t[j,k,c,b]*v[k,a,c,i],{k,c}] sum[t[i,c]*t[j,d]*t[k,b]*v[k,a,d,c],{k,c,d}] +2*sum[t[k,d]*t[i,j,c,b]*v[k,a,d,c],{k,c,d}] -sum[t[k,b]*t[i,j,c,d]*v[k,a,d,c],{k,c,d}] sum[t[j,d]*t[i,k,c,b]*v[k,a,d,c],{k,c,d}] +2*sum[t[i,c]*t[j,k,b,d]*v[k,a,d,c],{k,c,d}] -sum[t[i,c]*t[j,k,d,b]*v[k,a,d,c],{k,c,d}] sum[t[j,k,b,c]*v[k,a,i,c],{k,c}] -sum[t[i,c]*t[k,b]*v[k,a,j,c],{k,c}] -sum[t[i,k,c,b]*v[k,a,j,c],{k,c}] -sum[t[i,c]*t[j,d]*t[k,a]*v[k,b,c,d],{k,c,d}]
-sum[t[k,d]*t[i,j,a,c]*v[k,b,c,d],{k,c,d}] -sum[t[k,a]*t[i,j,c,d]*v[k,b,c,d],{k,c,d}] +2*sum[t[j,d]*t[i,k,a,c]*v[k,b,c,d],{k,c,d}] sum[t[j,d]*t[i,k,c,a]*v[k,b,c,d],{k,c,d}] -sum[t[i,c]*t[j,k,d,a]*v[k,b,c,d],{k,c,d}] -sum[t[i,c]*t[k,a]*v[k,b,c,j],{k,c}]
+2*sum[t[i,k,a,c]*v[k,b,c,j],{k,c}] -sum[t[i,k,c,a]*v[k,b,c,j],{k,c}] +2*sum[t[k,d]*t[i,j,a,c]*v[k,b,d,c],{k,c,d}] sum[t[j,d]*t[i,k,a,c]*v[k,b,d,c],{k,c,d}] -sum[t[j,c]*t[k,a]*v[k,b,i,c],{k,c}] -sum[t[j,k,c,a]*v[k,b,i,c],{k,c}] -sum[t[i,k,a,c]*v[k,b,j,c],{k,c}]
+sum[t[i,c]*t[j,d]*t[k,a]*t[l,b]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[k,l,c,d],{k,l,c,d}] 2*sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[k,a]*t[l,b]*t[i,j,c,d]*v[k,l,c,d],{k,l,c,d}] 2*sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,c,d],{k,l,c,d}]
+sum[t[j,d]*t[l,b]*t[i,k,c,a]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[k,l,c,d],{k,l,c,d}]
+sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,c]*t[l,b]*t[j,k,d,a]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,c,d],{k,l,c,d}]
+4*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,c,d],{k,l,c,d}]
-2*sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,k,c,a]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,c]*t[j,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}]
+sum[t[i,j,c,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,j,c,b]*t[k,l,a,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,j,a,c]*t[k,l,b,d]*v[k,l,c,d],{k,l,c,d}]
+sum[t[j,c]*t[k,b]*t[l,a]*v[k,l,c,i],{k,l,c}] +sum[t[l,c]*t[j,k,b,a]*v[k,l,c,i],{k,l,c}] -2*sum[t[l,a]*t[j,k,b,c]*v[k,l,c,i],{k,l,c}]
+sum[t[l,a]*t[j,k,c,b]*v[k,l,c,i],{k,l,c}] -2*sum[t[k,c]*t[j,l,b,a]*v[k,l,c,i],{k,l,c}] +sum[t[k,a]*t[j,l,b,c]*v[k,l,c,i],{k,l,c}]
+sum[t[k,b]*t[j,l,c,a]*v[k,l,c,i],{k,l,c}] +sum[t[j,c]*t[l,k,a,b]*v[k,l,c,i],{k,l,c}] +sum[t[i,c]*t[k,a]*t[l,b]*v[k,l,c,j],{k,l,c}]
+sum[t[l,c]*t[i,k,a,b]*v[k,l,c,j],{k,l,c}] -2*sum[t[l,b]*t[i,k,a,c]*v[k,l,c,j],{k,l,c}] +sum[t[l,b]*t[i,k,c,a]*v[k,l,c,j],{k,l,c}]
+sum[t[i,c]*t[k,l,a,b]*v[k,l,c,j],{k,l,c}] +sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,d,c],{k,l,c,d}] +sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,d,c],{k,l,c,d}]
+sum[t[j,d]*t[l,a]*t[i,k,c,b]*v[k,l,d,c],{k,l,c,d}] -2*sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,d,c],{k,l,c,d}] -2*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}]
+sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,c,b]*t[j,l,d,a]*v[k,l,d,c],{k,l,c,d}]
+sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,d,c],{k,l,c,d}] +sum[t[k,a]*t[l,b]*v[k,l,i,j],{k,l}] +sum[t[k,l,a,b]*v[k,l,i,j],{k,l}]
+sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[l,k,c,d],{k,l,c,d}] +sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[l,k,c,d],{k,l,c,d}]
+sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[l,k,c,d],{k,l,c,d}] -2*sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[l,k,c,d],{k,l,c,d}]
+sum[t[i,c]*t[l,a]*t[j,k,d,b]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,j,c,b]*t[k,l,a,d]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,j,a,c]*t[k,l,b,d]*v[l,k,c,d],{k,l,c,d}] 2*sum[t[l,c]*t[i,k,a,b]*v[l,k,c,j],{k,l,c}] +sum[t[l,b]*t[i,k,a,c]*v[l,k,c,j],{k,l,c}] +sum[t[l,a]*t[i,k,c,b]*v[l,k,c,j],{k,l,c}] +v[a,b,i,j]
Operation Minimization
A(a, c, i, k )B(b, e, f , l )C (d , f , j, k )D(c, d , e, l )
S (a, b, i, j )
c , d , e , f , k ,l
A(a, c, i, k )C (d , f , j, k ) B(b, e, f , l )D(c, d , e, l )
S (a, b, i, j )
c , d , e , f , k ,l
S (a, b, i, j )
A(a, c, i, k )C (d , f , j , k )
B(b, e, f , l ) D(c, d , e, l )
c,d , f ,k
e
,
l
S (a, b, i, j )
A(a, c, i, k ) C (d , f , j, k )
B(b, e, f , l ) D(c, d , e, l )
d, f
c,k
e
,
l
B(b, e, f , l )D(c, d , e, l )
T 2(b, c, j , k ) T 1(b, c, d , f )C (d , f , j , k )
S (a, b, i, j ) T 2(b, c, j , k ) A(a, c, i, k )
T 1(b, c, d , f )
2N6 Ops
e ,l
2N6 Ops
d, f
c,k
2N6 Ops
Operation Minimization
S (a, b, i, j )
A(a, c, i, k ) B(b, e, f , l )C (d , f , j, k ) D(c, d , e, l )
c , d , e , f , k ,l
• Requires 4N10 operations if indices a – l have range N
• Using associative, commutative and distributive laws, constant
sub-expressions can be factored to reduce operations
• Optimal formula sequence requires only 6N6 operations
T 1(b, c, d , f ) B(b, e, f , l ) D(c, d , e, l )
2N6 Ops
e ,l
T 2(b, c, j , k ) T 1(b, c, d , f )C (d , f , j , k )
2N6 Ops
d, f
S (a, b, i, j ) T 2(b, c, j , k ) A(a, c, i, k )
c ,k
2N6 Ops
• But additional space is required for the temporary intermediates
T1 and T2; often this is much larger than input/output arrays
Loop Fusion for Memory Reduction
S=0
T 1bcdf Bbefl Dcdel
e ,l
T 2bcjk T 1bcdf Cdfjk
d, f
S abij T 2bcjk Aacik
c ,k
T1 = 0; T2 = 0; S = 0
for b, c, d, e, f, l
T1bcdf += Bbefl Dcdel
for b, c, d, f, j, k
T2bcjk += T1bcdf Cdfjk
for a, b, c, i, j, k
Sabij += T2bcjk Aacik
for b, c
T1f = 0; T2f = 0
for d, f
for e, l
T1f += Bbefl Dcdel
for j, k
T2fjk += T1f Cdfjk
for a, i, j, k
Sabij += T2fjk Aacik
Formula sequence
Unfused code
Fused code
Tensor Contraction Engine
Tensor contraction expressions
Operation Minimization
Sequence of loop nests (expression tree/DAG)
Memory Minimization
(Storage required exceeds limits)
(Storage requirements within limits)
Space-Time Trade-Offs
Imperfectly nested loops
Communication Minimization
Imperfectly nested parallel loops
Data Locality Optimization
Partitioned, tiled, parallel Fortran/C code
Example to Illustrate Space-Time Trade-Offs
for a, e, c, f
for i, j
Xaecf += Tijae Tijcf
for c, e, b, k
T1cebk = f1(c, e, b, k)
for a, f, b, k
T2afbk = f2(a, f, b, k)
array
X
T1
T2
Y
E
space
V4
V3O
V3O
V4
1
time
V4O 2
Cf1V3O
Cf2V3O
V5O
V4
for c, e, a, f
for b, k
Yceaf += T1cebk T2afbk
for c, e, a, f
E += Xaecf Yceaf
a .. f: range V = 1000 .. 3000
i .. k: range O = 30 .. 100
Memory-Minimal Form
for a, f, b, k
T2afbk = f2(a, f, b, k)
for c, e
for b, k
T1bk = f1(c, e, b, k)
for a, f
for i, j
array
X
T1
T2
Y
E
space
1
VO
V3O
1
1
time
V4O 2
Cf1V3O
Cf2V3O
V5O
V4
X += Tijae Tijcf
for b, k
Y += T1bk T2afbk
E += X Y
a .. f: range V = 3000
i .. k: range O = 100
Redundant Computation to Allow Full Fusion
for a, e, c, f
for i, j
X += Tijae Tijcf
for b, k
T1 = f1(c, e, b, k)
T2 = f2(a, f, b, k)
Y += T1 T2
E += X Y
array
X
T1
T2
Y
E
space
1
1
1
1
1
time
V 4 O2
Cf1V5O
Cf2V5O
V 5O
V4
Tiling for Reducing Recomputation
for at, et, ct, ft
for a, e, c, f
for i, j
Xaecf += Tijae Tijcf
for b, k
for c, e
T1ce = f1(c, e, b, k)
for a, f
T2af = f2(a, f, b, k)
for c, e, a, f
Yceaf += T1ce T2af
for c, e, a, f
E += Xaecf Yceaf
array
X
T1
T2
Y
E
space
B4
B2
B2
B4
1
time
V 4O2
Cf1(V/B)2V3O
Cf2(V/B)2V3O
V 5O
V4
The Fusion Graph
C (i, k ) A(i, j ) B( j , k )
j
∑
j
*
A(i,j)
B(j,k)
i j
j k
Making Fused Loops Explicit
C (i, k ) A(i, j ) B( j , k )
j
i: range 10
j: range 10
k: range 100
∑
j
*
A(i,j)
B(j,k)
i j
j k
Adding Recomputation Loops
C (i, k ) A(i, j ) B( j , k )
j
i: range 10
j: range 10
k: range 100
∑
*
j
{ (<i j,k>, 3, 9000), . . . }
A(i,j)
B(j,k)
i j
i j k
Tiling Recomputation Loops
C (i, k ) A(i, j ) B( j , k )
j
i: range 10
j: range 10
k: range 100
∑
j
*
A(it,i,j)
it
i j
it
B(j,k)
j k
Space-Time Trade-Off Algorithm
• For e1(i,j,k) * e2(i,j) consider recomputation of e2 in k loop
• For each expression tree node in bottom-up traversal
– Construct set of solutions { (fusion, mem cost, rec cost), . . . }
– Prune inferior solutions
• For all solutions for the root of the expression tree
– Split recomputation loops into tiling/intra-tile loops
– Move intra-tile loops inside tiling loops where needed
– Search for tile sizes that minimize recomputation cost
Conclusions and Status
• Computational domain with exploitable structure
• Several performance optimization issues being
addressed:
–
–
–
–
–
algebraic transformations for minimizing # operations
loop fusion
loop tiling (with partial array expansion)
space-time trade-off
data partitioning
• First prototype of synthesis system by end of 2002
• Future extensions
– Sparse arrays, symmetry, common subexpressions