Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping
Nathan Clark, Amir Hormati, Scott Mahlke, Sami Yehia*, Krisztián Flautner*
University of Michigan
*ARM Ltd.
University of Michigan Electrical Engineering and Computer Science

Computational Efficiency
• Low power envelope
• More useful work/transistors
• Hardware accelerators
• Niagara II encryption engine
Source: AMD Analyst Day 12/14/06

How Are Accelerators Used?
[Diagram: Program, Accel., CPU]
Control statically placed in binary

Problem With Static Control
[Diagram: Program; CPU, Accel.; CPU, Accel.; CPU]
Not forward/backward compatible

Solution: Virtualization
• Statically identify accelerated computation
• Abstract accelerator features
• Dynamically retarget binary
[Diagram: Program, Engineer/Compiler; three targets, each with Proc. and Trans., two of them with Accel.]

Liquid SIMD
• Virtualize SIMD accelerators
• Why virtualize SIMD?
  – Intel MMX to SSE2
  – ARM v6 to Neon
  – Wide vectors useful [Lin 06]

SIMD Accelerator Assumptions
[Diagram: Fetch, Decode, SIMD Exec, Scalar Exec, Retire]
• Same instruction stream
• Separate pipeline
  – memory interface

How to Virtualize
• Use scalar ISA to represent SIMD operations
  – Compatibility, low overhead
[Diagram: Program, Branch]
• Key: easy to translate

Virtualization Architecture
[Diagram: Fetch, Decode, Execute, Retire, Trans., uCode Cache, Accel.]

1. Data Parallel Operations
[Diagram: A, B, +, &, C]
  for(i = 0; i < 8; i++) {
    r1 = A[i];
    r2 = B[i];
    r3 = r1 + r2;
    r4 = r3 & constant;
    C[i] = r4;
  }

1a. What If There’s No Scalar Equivalent?
[Diagram: A, B, SADD]
  for(i = 0; i < 8; i++) {
    r1 = A[i];
    r2 = B[i];
    r3 = r1 + r2;
    cmp r3, #FF;
    r3 = movgt #FF;
    ...
  }
Idioms can always be constructed

2. Scalarizing Permutations
[Diagram: +, &]
  for(i = 0; i < 8; i++) {
    …
    r1 = r2 + r3;
    tmp[i] = r1
  }
  for(i = 0; i < 8; i++) {
    r1 = offset[i];
    r2 = tmp[r1 + i]
    r3 = r2 & const
    …
  }
  offset = {4, 4, 4, 4, -4, -4, -4, -4}

3. Scalarizing Reductions
[Diagram: +]
  for(i = 0; i < 8; i++) {
    …
    r1 = A[i];
    r2 = r2 + r1;
    …
  }

Applied to ARM Neon
• All instructions supported except…
• VTBL – indirect indexing
  [Diagram: v1 = vtbl v2, v3]
• Interleaved memory accesses
  [Diagram: Mem, v1]
• Not needed in evaluated benchmarks

Translation to SIMD
Scalar representation:
  for(i = 0; i < 8; i++) {
    r1 = A[i];
    r2 = B[i];
    r3 = r1 + r2;
    r4 = offset[i];
    C[i + r4] = r3;
  }
Translated SIMD code:
  for(i = 0; i < 8; i += 4) {
    v1 = A[i];
    v2 = B[i];
    v3 = v1 + v2;
    v4 = offset[i];
    v3 = shuffle v3;
    C[i] = v3;
  }
[Slide overlay text: v3 & constant]
• Update induction variable
• Use inverse of defined translation rules

Translator Design
[Diagram: Program, Engineer/Compiler; three targets, each with Proc. and Trans., two of them with Accel.]
Translator: efficiency, speed, flexibility

Evaluation
• Trimaran ARM
• Hand SIMDized loops
• SimpleScalar model ARM926 w/ Neon SIMD
• VHDL translator, 130nm std. cell

Liquid SIMD Issues
• Code bloat
  – <1% overhead beyond baseline
• Register pressure
  – Not a problem
• Translator cost
  – 0.2 mm² + 2KB cache
• Translation overhead

Translation Overhead
[Chart; benchmark groups: MediaBench, SPECfp, Kernels]

Summary
• Accelerators are more common and evolving
  – Costly binary migration
• SIMD virtualization using scalar ISA
  – One binary: forward/backward compatibility
  – Negligible overhead

Questions?
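
The scalar representations on the slides numbered 1, 1a, 2 and 3 (data-parallel operations, the saturating-add idiom, permutations through an offset array, and reductions) can be sketched in plain C. This is only an illustration under stated assumptions, not the authors' compiler output or translator input: the function names, the 8-bit unsigned element type, and the 4-lane width used in `add_and_mask_4wide` are all hypothetical choices made here. The 4-wide function merely emulates in scalar C what a translated loop would compute when the induction variable advances by the vector width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 8-bit unsigned elements; the slides do not state a type. */
#define LANES 4  /* Assumption: hypothetical 4-lane SIMD width. */

/* Slide 1: data-parallel operation, C[i] = (A[i] + B[i]) & constant. */
void add_and_mask(const uint8_t *a, const uint8_t *b, uint8_t *c,
                  size_t n, uint8_t constant)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t r1 = a[i];
        uint8_t r2 = b[i];
        uint8_t r3 = (uint8_t)(r1 + r2);      /* wraps modulo 256 */
        uint8_t r4 = (uint8_t)(r3 & constant);
        c[i] = r4;
    }
}

/* Same computation with the induction variable stepped by LANES;
 * the inner loop stands in for one vector instruction per lane. */
void add_and_mask_4wide(const uint8_t *a, const uint8_t *b, uint8_t *c,
                        size_t n, uint8_t constant)
{
    assert(n % LANES == 0);  /* whole vectors only in this sketch */
    for (size_t i = 0; i < n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++) {
            c[i + lane] = (uint8_t)((a[i + lane] + b[i + lane]) & constant);
        }
    }
}

/* Slide 1a: saturating add built as an add / compare / conditional-move idiom. */
void saturating_add(const uint8_t *a, const uint8_t *b, uint8_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned r3 = (unsigned)a[i] + (unsigned)b[i];
        if (r3 > 0xFF) {      /* cmp r3, #FF */
            r3 = 0xFF;        /* movgt #FF   */
        }
        c[i] = (uint8_t)r3;
    }
}

/* Slide 2: permutation expressed as a load through an offset array,
 * out[i] = tmp[i + offset[i]]. The caller must keep every index in bounds. */
void permute_by_offset(const uint8_t *tmp, const int *offset,
                       uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int r1 = offset[i];
        out[i] = tmp[(ptrdiff_t)i + r1];
    }
}

/* Slide 3: reduction, r2 = r2 + A[i] accumulated across the loop. */
unsigned reduce_sum(const uint8_t *a, size_t n)
{
    unsigned r2 = 0;
    for (size_t i = 0; i < n; i++) {
        r2 += a[i];
    }
    return r2;
}
```

With the offset array from the permutation slide, `{4, 4, 4, 4, -4, -4, -4, -4}`, `permute_by_offset` swaps the two halves of an 8-element array, which is the kind of fixed pattern a translator could plausibly map back to a single shuffle.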