Download - VLSI Computation Lab

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Performance and Power Analysis of
Globally Asynchronous Locally
Synchronous Multiprocessor Systems
Zhiyi Yu, Bevan M. Baas
VLSI Computation Lab,
ECE department, UC Davis
Outline
• Motivation
• Experimental platform: a GALS array processor
• Performance analysis of GALS multiprocessors
• Power analysis of GALS multiprocessors
Why GALS Chip Multiprocessor
• Why GALS clocking style
– The challenge of globally synchronous systems
– The challenge of totally asynchronous systems
– GALS is a good compromise
• Why chip multiprocessor
– The challenge of increasing clock frequency
– High performance and high energy efficiency of
multiprocessor system
GALS Effects
• Performance penalty due to additional
synchronization delay
Total
sync. delay
clk1
clk2
clk1
module
1
sync.
circuit
module
2
clk2
te
ts
A simple GALS system
Sync. delay = clk edge alignment (te)
+ sync. circuit (ts)
• High energy efficiency due to independent
clock/voltage scaling
Synchronous vs. GALS
comparisons
Synchronous
GALS
(multiple clock domains)
ALU
Uniprocessor
MEM
Previous work
Array
processor
This work
MULT
Outline
• Motivation
• Experimental platform: a GALS array
processor
• Performance analysis of GALS multiprocessor
• Power analysis of GALS multiprocessor
Our Implementation of a GALS
Array and Synchronous Array
• Contains multiple identical simple processors
• Local oscillator and dual-clock FIFOs are key
components for GALS style
enhanced dual-clock FIFO
IMEM
IMEM
osc.
FIFO0
FIFO1
DMEM
3x3 array
processor
dualclkFIFO0
ALU
MAC
Control
dualclkFIFO1
config
Single processor in a
synchronous array
ALU
MAC
Control
DMEM
config
Single processor in a
GALS array
Micrograph of the 6 x 6 GALS Array
5.68
mm
Technology:
0.18 µm CMOS
Max speed:
475 MHz
Area (1 Proc):
0.66 mm²
Fully functional
Single
Processor
5.65 mm
[Yu, ISSCC06]
Outline
• Motivation
• Experimental platform: a GALS array processor
• Performance analysis of GALS
multiprocessor
• Power analysis of GALS multiprocessor
Performance Penalty of
GALS Uni-processor
• Extra pipeline hazards result in ~10% throughput
penalty compared with synchronous uni-processor
Branch inst
IF
ID
EXE
MEM
branch penalty
WB
IF
ID
EXE
MEM
WB
Branch penalty of 3 cycles in a synchronous uni-processor
clk1
Branch inst
IF
clk2
SYNC
ID
SYNC
clk3
clk4
clk5
EXE
SYNC MEM SYNC
WB
branch penalty
IF
Branch penalty of 3 cycles and 4 SYNC delays in a GALS uni-processor
[Lyer, ISCA02; Semeraro, HPCA02;Talpes, ISLPED03]
Application Performance of
GALS and Synchronous Array
• GALS array has only ~1% performance penalty
• Simulation conditions: 32-word FIFO, same clock
frequency, 2 synchronization registers for GALS
8-pt 8×8 Zig- merg bub.
DCT DCT zag sort sort
matrix
64
FFT
JPEG
802.
11a/g
Sync. array
41
498
168
254
444
817.5
11439
1439
87857
GALS array
41
505
168
254
444
819
11710
1443
88989
GALS perf.
reduction(%)
0
1.4
0
0
0
0.1
2.3
0.3
1.3
Clock cycles for several applications
The Source of Performance
Penalty of GALS Array Processor
• For all systems, communication delay affects
system throughput only when it generates a loop
• For GALS array processor, communication loop
is the FIFO stall loop
– Performance simulation results show that the chance
of FIFO stall loop is low for many DSP applications
• FIFO depth affects FIFO stalls and hence a
reasonable FIFO size is required
Importance of the
Communication Loop Delay
• One way communication does not affect
system throughput
• Communication loop degrades throughput
– In uni-processor, it is caused by pipeline hazards
– GALS system has longer communication loop delay
comm.
(T4)
unit1
(T1)
comm.
(T2)
unit2
(T3)
One way communication:
The throughput is 1/Max (T1, T2, T3)
unit1
(T1)
comm.
(T2)
unit2
(T3)
Communication loop:
The throughput is 1/(T1 + T2 + T3 + T4)
Examples of Stall and Stall Loops
in a GALS Array Processor
Proc.
1
empty
stall
F
I
F
O
Proc.
2
Data producer proc. 1 too slow causes
frequent FIFO empty stalls
Proc.
1
full
stall
Proc.
2
Data consumer proc. 2 too slow causes
frequent FIFO full stalls
empty
stall
empty stall
F
I
F
O
Proc.
1
F
I
F
O
Proc.
2
full
stall
Data producer proc. 1 and data consumer
proc. 2 both too slow at different times
cause FIFO empty and full stalls
full stall
Proc.
1
F
I
F
O
empty stall
F
I
F
O
Proc.
2
full stall
Example of multiple-link loop between
two processors
Performance
Relative
Relative
Ratio
Performance Performance
Performance of Synchronous and
GALS Array with Different FIFO Sizes
Synchronous Array
GALS Array
GALS vs. Synchronous Array
8 DCT zig-zag bsort 64 FFT 802.11
8x8 DCT msort matrix JPEG
Outline
• Motivation
• Experimental platform: a GALS array processor
• Performance analysis of GALS multiprocessor
• Power analysis of GALS multiprocessor
Clock/Voltage Scaling in
GALS Uni-processor
• All GALS designs
– Independently control clock frequencies to save power
– Reduced clock frequency allows voltage reduction
• Around 25% energy savings with more than 10%
performance penalty
clk2
clk1
IF
sync
circ.
task1
task2
ID
sync
circ.
clk3
clk4
EXE
sync
MEM
circ.
program with
few MEM
instructions
program with
few WB
instructions
clk5
sync
circ.
WB
clk4 can be
reduced
clk5 can be
reduced
[Lyer, ISCA02; Semeraro, HPCA02;Talpes, ISLPED03]
Clock/Voltage Scaling in
GALS Array Processor
• Similar basic idea as uni-processor
– Use low clock frequency for processors with light
computation load
– Benefit from unbalanced processor computation loads
• Both static and dynamic clock scaling methods
– We study only static scaling here
• The optimal processor clock frequency is
determined by its
– Computational load
– Position
• Can achieve power savings without performance
reduction!
Unbalanced Processor Computation
Loads in Nine Applications
8-pt DCT
merge-sort
64 FFT
8x8 DCT
bubble-sort
JPEG
zig-zag
matrix
802.11a/g
Throughput Changes with Statically
Configured Clock for 8x8 DCT
1
408
204
408
Compute load
(clk cycles)
204
Relative Throughput
0.95
0.9
0.85
optimal clock
scaling points
0.8
0.75
scale 1st processor
0.7
scale 2nd processor
scale 3rd processor
0.65
scale 4th processor
0.6
0.5
0.55
0.6
0.65
0.7
0.75
0.8
0.85
Relative clock frequency
0.9
0.95
1
Relationship of Processors in an
8x8 DCT
• Proc. 2 and 4 have identical computational load
• Different position results in a different FIFO stall
style, which causes different clock scaling behavior
FIFO empty stall of 2nd proc.
F
I
F
O
1-DCT
F
I
F
O
Trans.
F
I
F
O
FIFO empty stall of 4th proc.
1-DCT
FIFO full stall of 2nd proc.
F
I
F
O
Trans.
• 40% power
savings without
a performance
penalty
• From simulated
clock frequency
and referenced
clk/voltage/pow
relationship
Relative Power
Power of GALS Array with
Static Clock/Voltage Scaling
8 DCT
zig-zag
JPEG
msort
matrix
8x8 DCT 64 FFT
802.11
bsort
Summary
• Compared to a synchronous array processor, the
proposed GALS array processor has:
– < 1% throughput reduction
– ~40% energy savings
• These results compare well with reported GALS
uni-processors:
– ~10% throughput reduction
– ~25% energy savings
• Source of throughput reduction in GALS system
– Extra cost for communication loops
– Extra cost for FIFO stall loops in GALS array processors
• Energy benefit of GALS clock/voltage scaling
– Unbalanced processor computation loads
Acknowledgments
• Funding
–
–
–
–
Intel Corporation
UC Micro
NSF Grant No. 0430090
UCD Faculty Research Grant
• Special Thanks
– E. Work, T. Mohsenin, other VCL processor codesigners, R. Krishnamurthy, M. Anders, S. Mathew
Related documents