Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department, UC Davis Outline • Motivation • Experimental platform: a GALS array processor • Performance analysis of GALS multiprocessors • Power analysis of GALS multiprocessors Why GALS Chip Multiprocessor • Why GALS clocking style – The challenge of globally synchronous systems – The challenge of totally asynchronous systems – GALS is a good compromise • Why chip multiprocessor – The challenge of increasing clock frequency – High performance and high energy efficiency of multiprocessor system GALS Effects • Performance penalty due to additional synchronization delay Total sync. delay clk1 clk2 clk1 module 1 sync. circuit module 2 clk2 te ts A simple GALS system Sync. delay = clk edge alignment (te) + sync. circuit (ts) • High energy efficiency due to independent clock/voltage scaling Synchronous vs. GALS comparisons Synchronous GALS (multiple clock domains) ALU Uniprocessor MEM Previous work Array processor This work MULT Outline • Motivation • Experimental platform: a GALS array processor • Performance analysis of GALS multiprocessor • Power analysis of GALS multiprocessor Our Implementation of a GALS Array and Synchronous Array • Contains multiple identical simple processors • Local oscillator and dual-clock FIFOs are key components for GALS style enhanced dual-clock FIFO IMEM IMEM osc. FIFO0 FIFO1 DMEM 3x3 array processor dualclkFIFO0 ALU MAC Control dualclkFIFO1 config Single processor in a synchronous array ALU MAC Control DMEM config Single processor in a GALS array Micrograph of the 6 x 6 GALS Array 5.68 mm Technology: 0.18 µm CMOS Max speed: 475 MHz Area (1 Proc): 0.66 mm² Fully functional Single Processor 5.65 mm [Yu, ISSCC06] Outline • Motivation • Experimental platform: a GALS array processor • Performance analysis of GALS multiprocessor • Power analysis of GALS multiprocessor Performance Penalty of GALS Uni-processor • Extra pipeline hazards result in ~10% throughput penalty compared with synchronous uni-processor Branch inst IF ID EXE MEM branch penalty WB IF ID EXE MEM WB Branch penalty of 3 cycles in a synchronous uni-processor clk1 Branch inst IF clk2 SYNC ID SYNC clk3 clk4 clk5 EXE SYNC MEM SYNC WB branch penalty IF Branch penalty of 3 cycles and 4 SYNC delays in a GALS uni-processor [Lyer, ISCA02; Semeraro, HPCA02;Talpes, ISLPED03] Application Performance of GALS and Synchronous Array • GALS array has only ~1% performance penalty • Simulation conditions: 32-word FIFO, same clock frequency, 2 synchronization registers for GALS 8-pt 8×8 Zig- merg bub. DCT DCT zag sort sort matrix 64 FFT JPEG 802. 11a/g Sync. array 41 498 168 254 444 817.5 11439 1439 87857 GALS array 41 505 168 254 444 819 11710 1443 88989 GALS perf. reduction(%) 0 1.4 0 0 0 0.1 2.3 0.3 1.3 Clock cycles for several applications The Source of Performance Penalty of GALS Array Processor • For all systems, communication delay affects system throughput only when it generates a loop • For GALS array processor, communication loop is the FIFO stall loop – Performance simulation results show that the chance of FIFO stall loop is low for many DSP applications • FIFO depth affects FIFO stalls and hence a reasonable FIFO size is required Importance of the Communication Loop Delay • One way communication does not affect system throughput • Communication loop degrades throughput – In uni-processor, it is caused by pipeline hazards – GALS system has longer communication loop delay comm. (T4) unit1 (T1) comm. (T2) unit2 (T3) One way communication: The throughput is 1/Max (T1, T2, T3) unit1 (T1) comm. (T2) unit2 (T3) Communication loop: The throughput is 1/(T1 + T2 + T3 + T4) Examples of Stall and Stall Loops in a GALS Array Processor Proc. 1 empty stall F I F O Proc. 2 Data producer proc. 1 too slow causes frequent FIFO empty stalls Proc. 1 full stall Proc. 2 Data consumer proc. 2 too slow causes frequent FIFO full stalls empty stall empty stall F I F O Proc. 1 F I F O Proc. 2 full stall Data producer proc. 1 and data consumer proc. 2 both too slow at different times cause FIFO empty and full stalls full stall Proc. 1 F I F O empty stall F I F O Proc. 2 full stall Example of multiple-link loop between two processors Performance Relative Relative Ratio Performance Performance Performance of Synchronous and GALS Array with Different FIFO Sizes Synchronous Array GALS Array GALS vs. Synchronous Array 8 DCT zig-zag bsort 64 FFT 802.11 8x8 DCT msort matrix JPEG Outline • Motivation • Experimental platform: a GALS array processor • Performance analysis of GALS multiprocessor • Power analysis of GALS multiprocessor Clock/Voltage Scaling in GALS Uni-processor • All GALS designs – Independently control clock frequencies to save power – Reduced clock frequency allows voltage reduction • Around 25% energy savings with more than 10% performance penalty clk2 clk1 IF sync circ. task1 task2 ID sync circ. clk3 clk4 EXE sync MEM circ. program with few MEM instructions program with few WB instructions clk5 sync circ. WB clk4 can be reduced clk5 can be reduced [Lyer, ISCA02; Semeraro, HPCA02;Talpes, ISLPED03] Clock/Voltage Scaling in GALS Array Processor • Similar basic idea as uni-processor – Use low clock frequency for processors with light computation load – Benefit from unbalanced processor computation loads • Both static and dynamic clock scaling methods – We study only static scaling here • The optimal processor clock frequency is determined by its – Computational load – Position • Can achieve power savings without performance reduction! Unbalanced Processor Computation Loads in Nine Applications 8-pt DCT merge-sort 64 FFT 8x8 DCT bubble-sort JPEG zig-zag matrix 802.11a/g Throughput Changes with Statically Configured Clock for 8x8 DCT 1 408 204 408 Compute load (clk cycles) 204 Relative Throughput 0.95 0.9 0.85 optimal clock scaling points 0.8 0.75 scale 1st processor 0.7 scale 2nd processor scale 3rd processor 0.65 scale 4th processor 0.6 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 Relative clock frequency 0.9 0.95 1 Relationship of Processors in an 8x8 DCT • Proc. 2 and 4 have identical computational load • Different position results in a different FIFO stall style, which causes different clock scaling behavior FIFO empty stall of 2nd proc. F I F O 1-DCT F I F O Trans. F I F O FIFO empty stall of 4th proc. 1-DCT FIFO full stall of 2nd proc. F I F O Trans. • 40% power savings without a performance penalty • From simulated clock frequency and referenced clk/voltage/pow relationship Relative Power Power of GALS Array with Static Clock/Voltage Scaling 8 DCT zig-zag JPEG msort matrix 8x8 DCT 64 FFT 802.11 bsort Summary • Compared to a synchronous array processor, the proposed GALS array processor has: – < 1% throughput reduction – ~40% energy savings • These results compare well with reported GALS uni-processors: – ~10% throughput reduction – ~25% energy savings • Source of throughput reduction in GALS system – Extra cost for communication loops – Extra cost for FIFO stall loops in GALS array processors • Energy benefit of GALS clock/voltage scaling – Unbalanced processor computation loads Acknowledgments • Funding – – – – Intel Corporation UC Micro NSF Grant No. 0430090 UCD Faculty Research Grant • Special Thanks – E. Work, T. Mohsenin, other VCL processor codesigners, R. Krishnamurthy, M. Anders, S. Mathew