Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Compiler Support for Optimizing Tensor Contraction Expressions in Quantum Chemistry Computations* Gerald Baumgartner, Ohio State University Daniel Cociorva, Ohio State University Chi-Chung Lam, Ohio State University P. Sadayappan, Ohio State University J. Ramanujam, Louisiana State University *Supported by the National Science Foundation (NSF) through the ITR program Introduction/Motivation • Accurate computational models of electronic structure are very time consuming – Time/effort to develop code, due to complexity of models for multiple-electron interactions – Execution time of code, even on high-end systems • Goal of multi-disciplinary project: develop a program synthesis tool for quantum chemists – Reduce model development time from weeks/months to hours/days – Improve execution time of quantum chemistry code • Focus on a particular computational form of great importance in accurate modeling of electronic structure: tensor contractions The Computational Context • Accurate electronic structure models often involve tensor contractions: multi-dimensional summations over products of large arrays, e.g. S (a, b, i, j ) A(a, c, i, k ) B(b, e, f , l )C (d , f , j, k ) D(c, d , e, l ) ... c , d , e , f , k ,l • Many compile-time optimization optimization issues: – Operation minimization via algebraic transformations – Memory minimization via loop fusion and array contraction – Space-time trade-off: Store and re-use versus redundant re-computation of temporary intermediate arrays quantities – Data Locality Optimization: Tile-size selection and partial array expansion for tiling imperfectly nested loops – Data and Computation Partitoning: Minimize communication overhead – Semantic information from high-level expression of computation enables global optimizations across multiple imperfect loop nests Realistic Quantum Chemistry Example hbar[a,b,i,j] == sum[f[b,c]*t[i,j,a,c],{c}] -sum[f[k,c]*t[k,b]*t[i,j,a,c],{k,c}] +sum[f[a,c]*t[i,j,c,b],{c}] -sum[f[k,c]*t[k,a]*t[i,j,c,b],{k,c}] sum[f[k,j]*t[i,k,a,b],{k}] -sum[f[k,c]*t[j,c]*t[i,k,a,b],{k,c}] -sum[f[k,i]*t[j,k,b,a],{k}] -sum[f[k,c]*t[i,c]*t[j,k,b,a],{k,c}] +sum[t[i,c]*t[j,d]*v[a,b,c,d],{c,d}] +sum[t[i,j,c,d]*v[a,b,c,d],{c,d}] +sum[t[j,c]*v[a,b,i,c],{c}] -sum[t[k,b]*v[a,k,i,j],{k}] +sum[t[i,c]*v[b,a,j,c],{c}] -sum[t[k,a]*v[b,k,j,i],{k}] -sum[t[k,d]*t[i,j,c,b]*v[k,a,c,d],{k,c,d}] -sum[t[i,c]*t[j,k,b,d]*v[k,a,c,d],{k,c,d}] sum[t[j,c]*t[k,b]*v[k,a,c,i],{k,c}] +2*sum[t[j,k,b,c]*v[k,a,c,i],{k,c}] -sum[t[j,k,c,b]*v[k,a,c,i],{k,c}] sum[t[i,c]*t[j,d]*t[k,b]*v[k,a,d,c],{k,c,d}] +2*sum[t[k,d]*t[i,j,c,b]*v[k,a,d,c],{k,c,d}] -sum[t[k,b]*t[i,j,c,d]*v[k,a,d,c],{k,c,d}] sum[t[j,d]*t[i,k,c,b]*v[k,a,d,c],{k,c,d}] +2*sum[t[i,c]*t[j,k,b,d]*v[k,a,d,c],{k,c,d}] -sum[t[i,c]*t[j,k,d,b]*v[k,a,d,c],{k,c,d}] sum[t[j,k,b,c]*v[k,a,i,c],{k,c}] -sum[t[i,c]*t[k,b]*v[k,a,j,c],{k,c}] -sum[t[i,k,c,b]*v[k,a,j,c],{k,c}] -sum[t[i,c]*t[j,d]*t[k,a]*v[k,b,c,d],{k,c,d}] -sum[t[k,d]*t[i,j,a,c]*v[k,b,c,d],{k,c,d}] -sum[t[k,a]*t[i,j,c,d]*v[k,b,c,d],{k,c,d}] +2*sum[t[j,d]*t[i,k,a,c]*v[k,b,c,d],{k,c,d}] sum[t[j,d]*t[i,k,c,a]*v[k,b,c,d],{k,c,d}] -sum[t[i,c]*t[j,k,d,a]*v[k,b,c,d],{k,c,d}] -sum[t[i,c]*t[k,a]*v[k,b,c,j],{k,c}] +2*sum[t[i,k,a,c]*v[k,b,c,j],{k,c}] -sum[t[i,k,c,a]*v[k,b,c,j],{k,c}] +2*sum[t[k,d]*t[i,j,a,c]*v[k,b,d,c],{k,c,d}] sum[t[j,d]*t[i,k,a,c]*v[k,b,d,c],{k,c,d}] -sum[t[j,c]*t[k,a]*v[k,b,i,c],{k,c}] -sum[t[j,k,c,a]*v[k,b,i,c],{k,c}] -sum[t[i,k,a,c]*v[k,b,j,c],{k,c}] +sum[t[i,c]*t[j,d]*t[k,a]*t[l,b]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[k,l,c,d],{k,l,c,d}] 2*sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[k,a]*t[l,b]*t[i,j,c,d]*v[k,l,c,d],{k,l,c,d}] 2*sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,c,d],{k,l,c,d}] +sum[t[j,d]*t[l,b]*t[i,k,c,a]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,c]*t[l,b]*t[j,k,d,a]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,c,d],{k,l,c,d}] +4*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,k,c,a]*t[j,l,d,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,c]*t[j,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}] +sum[t[i,j,c,d]*t[k,l,a,b]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,j,c,b]*t[k,l,a,d]*v[k,l,c,d],{k,l,c,d}] -2*sum[t[i,j,a,c]*t[k,l,b,d]*v[k,l,c,d],{k,l,c,d}] +sum[t[j,c]*t[k,b]*t[l,a]*v[k,l,c,i],{k,l,c}] +sum[t[l,c]*t[j,k,b,a]*v[k,l,c,i],{k,l,c}] -2*sum[t[l,a]*t[j,k,b,c]*v[k,l,c,i],{k,l,c}] +sum[t[l,a]*t[j,k,c,b]*v[k,l,c,i],{k,l,c}] -2*sum[t[k,c]*t[j,l,b,a]*v[k,l,c,i],{k,l,c}] +sum[t[k,a]*t[j,l,b,c]*v[k,l,c,i],{k,l,c}] +sum[t[k,b]*t[j,l,c,a]*v[k,l,c,i],{k,l,c}] +sum[t[j,c]*t[l,k,a,b]*v[k,l,c,i],{k,l,c}] +sum[t[i,c]*t[k,a]*t[l,b]*v[k,l,c,j],{k,l,c}] +sum[t[l,c]*t[i,k,a,b]*v[k,l,c,j],{k,l,c}] -2*sum[t[l,b]*t[i,k,a,c]*v[k,l,c,j],{k,l,c}] +sum[t[l,b]*t[i,k,c,a]*v[k,l,c,j],{k,l,c}] +sum[t[i,c]*t[k,l,a,b]*v[k,l,c,j],{k,l,c}] +sum[t[j,c]*t[l,d]*t[i,k,a,b]*v[k,l,d,c],{k,l,c,d}] +sum[t[j,d]*t[l,b]*t[i,k,a,c]*v[k,l,d,c],{k,l,c,d}] +sum[t[j,d]*t[l,a]*t[i,k,c,b]*v[k,l,d,c],{k,l,c,d}] -2*sum[t[i,k,c,d]*t[j,l,b,a]*v[k,l,d,c],{k,l,c,d}] -2*sum[t[i,k,a,c]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,c,a]*t[j,l,b,d]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,a,b]*t[j,l,c,d]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,c,b]*t[j,l,d,a]*v[k,l,d,c],{k,l,c,d}] +sum[t[i,k,a,c]*t[j,l,d,b]*v[k,l,d,c],{k,l,c,d}] +sum[t[k,a]*t[l,b]*v[k,l,i,j],{k,l}] +sum[t[k,l,a,b]*v[k,l,i,j],{k,l}] +sum[t[k,b]*t[l,d]*t[i,j,a,c]*v[l,k,c,d],{k,l,c,d}] +sum[t[k,a]*t[l,d]*t[i,j,c,b]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,c]*t[l,d]*t[j,k,b,a]*v[l,k,c,d],{k,l,c,d}] -2*sum[t[i,c]*t[l,a]*t[j,k,b,d]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,c]*t[l,a]*t[j,k,d,b]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,j,c,b]*t[k,l,a,d]*v[l,k,c,d],{k,l,c,d}] +sum[t[i,j,a,c]*t[k,l,b,d]*v[l,k,c,d],{k,l,c,d}] 2*sum[t[l,c]*t[i,k,a,b]*v[l,k,c,j],{k,l,c}] +sum[t[l,b]*t[i,k,a,c]*v[l,k,c,j],{k,l,c}] +sum[t[l,a]*t[i,k,c,b]*v[l,k,c,j],{k,l,c}] +v[a,b,i,j] Operation Minimization A(a, c, i, k )B(b, e, f , l )C (d , f , j, k )D(c, d , e, l ) S (a, b, i, j ) c , d , e , f , k ,l A(a, c, i, k )C (d , f , j, k ) B(b, e, f , l )D(c, d , e, l ) S (a, b, i, j ) c , d , e , f , k ,l S (a, b, i, j ) A(a, c, i, k )C (d , f , j , k ) B(b, e, f , l ) D(c, d , e, l ) c,d , f ,k e , l S (a, b, i, j ) A(a, c, i, k ) C (d , f , j, k ) B(b, e, f , l ) D(c, d , e, l ) d, f c,k e , l B(b, e, f , l )D(c, d , e, l ) T 2(b, c, j , k ) T 1(b, c, d , f )C (d , f , j , k ) S (a, b, i, j ) T 2(b, c, j , k ) A(a, c, i, k ) T 1(b, c, d , f ) 2N6 Ops e ,l 2N6 Ops d, f c,k 2N6 Ops Operation Minimization S (a, b, i, j ) A(a, c, i, k ) B(b, e, f , l )C (d , f , j, k ) D(c, d , e, l ) c , d , e , f , k ,l • Requires 4N10 operations if indices a – l have range N • Using associative, commutative and distributive laws, constant sub-expressions can be factored to reduce operations • Optimal formula sequence requires only 6N6 operations T 1(b, c, d , f ) B(b, e, f , l ) D(c, d , e, l ) 2N6 Ops e ,l T 2(b, c, j , k ) T 1(b, c, d , f )C (d , f , j , k ) 2N6 Ops d, f S (a, b, i, j ) T 2(b, c, j , k ) A(a, c, i, k ) c ,k 2N6 Ops • But additional space is required for the temporary intermediates T1 and T2; often this is much larger than input/output arrays Loop Fusion for Memory Reduction S=0 T 1bcdf Bbefl Dcdel e ,l T 2bcjk T 1bcdf Cdfjk d, f S abij T 2bcjk Aacik c ,k T1 = 0; T2 = 0; S = 0 for b, c, d, e, f, l T1bcdf += Bbefl Dcdel for b, c, d, f, j, k T2bcjk += T1bcdf Cdfjk for a, b, c, i, j, k Sabij += T2bcjk Aacik for b, c T1f = 0; T2f = 0 for d, f for e, l T1f += Bbefl Dcdel for j, k T2fjk += T1f Cdfjk for a, i, j, k Sabij += T2fjk Aacik Formula sequence Unfused code Fused code Tensor Contraction Engine Tensor contraction expressions Operation Minimization Sequence of loop nests (expression tree/DAG) Memory Minimization (Storage required exceeds limits) (Storage requirements within limits) Space-Time Trade-Offs Imperfectly nested loops Communication Minimization Imperfectly nested parallel loops Data Locality Optimization Partitioned, tiled, parallel Fortran/C code Example to Illustrate Space-Time Trade-Offs for a, e, c, f for i, j Xaecf += Tijae Tijcf for c, e, b, k T1cebk = f1(c, e, b, k) for a, f, b, k T2afbk = f2(a, f, b, k) array X T1 T2 Y E space V4 V3O V3O V4 1 time V4O 2 Cf1V3O Cf2V3O V5O V4 for c, e, a, f for b, k Yceaf += T1cebk T2afbk for c, e, a, f E += Xaecf Yceaf a .. f: range V = 1000 .. 3000 i .. k: range O = 30 .. 100 Memory-Minimal Form for a, f, b, k T2afbk = f2(a, f, b, k) for c, e for b, k T1bk = f1(c, e, b, k) for a, f for i, j array X T1 T2 Y E space 1 VO V3O 1 1 time V4O 2 Cf1V3O Cf2V3O V5O V4 X += Tijae Tijcf for b, k Y += T1bk T2afbk E += X Y a .. f: range V = 3000 i .. k: range O = 100 Redundant Computation to Allow Full Fusion for a, e, c, f for i, j X += Tijae Tijcf for b, k T1 = f1(c, e, b, k) T2 = f2(a, f, b, k) Y += T1 T2 E += X Y array X T1 T2 Y E space 1 1 1 1 1 time V 4 O2 Cf1V5O Cf2V5O V 5O V4 Tiling for Reducing Recomputation for at, et, ct, ft for a, e, c, f for i, j Xaecf += Tijae Tijcf for b, k for c, e T1ce = f1(c, e, b, k) for a, f T2af = f2(a, f, b, k) for c, e, a, f Yceaf += T1ce T2af for c, e, a, f E += Xaecf Yceaf array X T1 T2 Y E space B4 B2 B2 B4 1 time V 4O2 Cf1(V/B)2V3O Cf2(V/B)2V3O V 5O V4 The Fusion Graph C (i, k ) A(i, j ) B( j , k ) j ∑ j * A(i,j) B(j,k) i j j k Making Fused Loops Explicit C (i, k ) A(i, j ) B( j , k ) j i: range 10 j: range 10 k: range 100 ∑ j * A(i,j) B(j,k) i j j k Adding Recomputation Loops C (i, k ) A(i, j ) B( j , k ) j i: range 10 j: range 10 k: range 100 ∑ * j { (<i j,k>, 3, 9000), . . . } A(i,j) B(j,k) i j i j k Tiling Recomputation Loops C (i, k ) A(i, j ) B( j , k ) j i: range 10 j: range 10 k: range 100 ∑ j * A(it,i,j) it i j it B(j,k) j k Space-Time Trade-Off Algorithm • For e1(i,j,k) * e2(i,j) consider recomputation of e2 in k loop • For each expression tree node in bottom-up traversal – Construct set of solutions { (fusion, mem cost, rec cost), . . . } – Prune inferior solutions • For all solutions for the root of the expression tree – Split recomputation loops into tiling/intra-tile loops – Move intra-tile loops inside tiling loops where needed – Search for tile sizes that minimize recomputation cost Conclusions and Status • Computational domain with exploitable structure • Several performance optimization issues being addressed: – – – – – algebraic transformations for minimizing # operations loop fusion loop tiling (with partial array expansion) space-time trade-off data partitioning • First prototype of synthesis system by end of 2002 • Future extensions – Sparse arrays, symmetry, common subexpressions