Multi-Core Parallelism for Low-Power Design
Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University
http://www.eng.auburn.edu/~vagrawal
[email protected]
2/8/06
D&T Seminar
1
Power Consumption of VLSI Chips
Why is it a concern?
SIA Roadmap for Processors (1999)
Year | 1999 | 2002 | 2005 | 2008 | 2011 | 2014
Feature size (nm) | 180 | 130 | 100 | 70 | 50 | 35
Logic transistors/cm² | 6.2M | 18M | 39M | 84M | 180M | 390M
Clock (GHz) | 1.25 | 2.1 | 3.5 | 6.0 | 10.0 | 16.9
Chip size (mm²) | 340 | 430 | 520 | 620 | 750 | 900
Power supply (V) | 1.8 | 1.5 | 1.2 | 0.9 | 0.6 | 0.5
High-perf. power (W) | 90 | 130 | 160 | 170 | 175 | 183
Source: http://www.semichips.org
ISSCC, Feb. 2001, Keynote
Patrick P. Gelsinger, Senior Vice President, General Manager, Digital Enterprise Group, INTEL CORP.
“Ten years from now, microprocessors will run at 10GHz to 30GHz and be capable of processing 1 trillion operations per second -- about the same number of calculations that the world's fastest supercomputer can perform now.
“Unfortunately, if nothing changes these chips will produce as much heat, for their proportional size, as a nuclear reactor. . . .”
VLSI Chip Power Density
[Figure: power density (W/cm², log scale from 1 to 10000) versus year (1970 to 2010) for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® and P6, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle and the Sun’s surface. Source: Intel]
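Power Dissipation in CMOS Logic (0.25µ)
$P_{total}(0 \to 1) = C_L V_{DD}^2 + t_{sc} V_{DD} I_{peak} + V_{DD} I_{leakage}$
Share of the three terms, in order: 75%, 20%, 5%
[Figure: CMOS gate supplied from $V_{DD}$ and driving a load capacitance $C_L$]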
Low-Power Datapath Architecture
• Lower supply voltage
– This slows down circuit speed
– Use parallel computing to gain the speed back
• Works well when threshold voltage is also lowered.
• About 60% reduction in power obtainable.
• Reference: A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Boston: Kluwer Academic Publishers (Now Springer), 1995.
A Reference Datapath
[Figure: Input → Register → Combinational logic → Register → Output, with both registers clocked by CK; total switched capacitance $C_{ref}$]
Supply voltage = $V_{ref}$
Total capacitance switched per cycle = $C_{ref}$
Clock frequency = $f$
Power consumption: $P_{ref} = C_{ref} V_{ref}^2 f$
A Parallel Architecture
[Figure: the input feeds N copies of the datapath (Comb. Logic Copy 1 to Copy N), each with a register clocked at f/N; an N to 1 multiplexer selects their results into an output register clocked at f; a multiphase clock generator and mux control is driven by CK.]
N = degree of parallelism
A copy processes every Nth input and operates at reduced voltage.
Supply voltage: $V_N \le V_1 = V_{ref}$
Control Signals, N = 4
[Figure: timing diagram of CK and the four clock phases, Phase 1 to Phase 4]
Power
$P_N = P_{proc} + P_{overhead}$
$P_{proc} = N (C_{inreg} + C_{comb}) V_N^2 f/N + C_{outreg} V_N^2 f = (C_{inreg} + C_{comb} + C_{outreg}) V_N^2 f = C_{ref} V_N^2 f$
$P_{overhead} = C_{overhead} V_N^2 f \approx \delta C_{ref} (N - 1) V_N^2 f$
$P_N = [1 + \delta(N - 1)] C_{ref} V_N^2 f$
$\frac{P_N}{P_1} = [1 + \delta(N - 1)] \frac{V_N^2}{V_{ref}^2}$
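The power ratio on this slide can be evaluated numerically. The following is a minimal Python sketch, not part of the original slides: the function name `power_ratio` is hypothetical, and the overhead value δ = 0.1 and the 2.9 V supply in the usage lines are illustrative assumptions.

```python
def power_ratio(n, v_n, v_ref, delta):
    """Return P_N / P_1 = [1 + delta*(n - 1)] * (v_n / v_ref)**2."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (1.0 + delta * (n - 1)) * (v_n / v_ref) ** 2


# Illustrative values (assumptions): two copies at 2.9 V against a 5 V
# reference, with each extra copy adding 10% overhead capacitance.
ratio = power_ratio(n=2, v_n=2.9, v_ref=5.0, delta=0.1)
print(f"P_2 / P_1 = {ratio:.3f}")
```

With N = 1 and $V_N = V_{ref}$ the function returns 1 for any δ, as the formula requires.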
Voltage vs. Speed
Delay of a gate, $T \approx \frac{C_L V_{ref}}{I} = \frac{C_L V_{ref}}{k (W/L)(V_{ref} - V_t)^2}$
where I is saturation current
k is a technology parameter
W/L is width to length ratio of transistor
Vt is threshold voltage
[Figure: normalized gate delay, T (0.0 to 4.0) versus supply voltage for 1.2μ CMOS, marking N=1 at Vref = 5V, N=2 at V2 = 2.9V and N=3 at V3. Voltage reduction slows down as we get closer to Vt.]
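Under the delay formula on this slide, the supply voltage that makes a gate N times slower can be found in closed form. The sketch below is an illustration only: the function names are hypothetical, and it uses the idealized quadratic model with $V_t$ = 0.8 V as an assumed threshold, so its results need not coincide with the values read from the plotted 1.2μ CMOS curve (such as V2 = 2.9V).

```python
import math


def gate_delay(v, v_t, k_wl=1.0, c_l=1.0):
    """Gate delay T = C_L * V / (k * (W/L) * (V - V_t)**2), arbitrary units."""
    return c_l * v / (k_wl * (v - v_t) ** 2)


def supply_for_slowdown(n, v_ref, v_t):
    """Supply voltage V_N at which the delay is n times the delay at v_ref."""
    r = n * v_ref / (v_ref - v_t) ** 2
    # r * (V - v_t)**2 = V is a quadratic in V; the larger root lies above v_t.
    return (2 * r * v_t + 1 + math.sqrt(4 * r * v_t + 1)) / (2 * r)


for n in (1, 2, 3):
    v_n = supply_for_slowdown(n, v_ref=5.0, v_t=0.8)
    slowdown = gate_delay(v_n, 0.8) / gate_delay(5.0, 0.8)
    print(f"N = {n}: V_N = {v_n:.2f} V, delay ratio = {slowdown:.2f}")
```

Each step in N buys a smaller voltage reduction than the previous one, which is the flattening near $V_t$ that the figure describes.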
Increasing Multiprocessing
[Figure: $P_N/P_1$ (0.0 to 1.0) versus N (1 to 12) for 1.2μ CMOS, Vref = 5V, with curves for Vt=0.8V, Vt=0.4V and Vt=0V (extreme case).]
Extreme Cases: Vt = 0
Delay, $T \propto 1/V_{ref}$
For N processing elements, delay = NT → $V_N = V_{ref}/N$
$\frac{P_N}{P_1} = [1 + \delta(N - 1)] \frac{1}{N^2}$, which falls off as $1/N$ for large N
For negligible overhead, δ→0: $\frac{P_N}{P_1} \approx \frac{1}{N^2}$
For Vt > 0, power reduction is less and there will be an optimum value of N.
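The claim that an optimum N exists for Vt > 0 can be explored with a short search. This is a rough sketch under stated assumptions, not the author's method: it combines the quadratic delay model with the $P_N/P_1$ formula, and the function names, the overhead δ = 0.1 and the search limit `n_max` are all hypothetical choices.

```python
import math


def supply_for_slowdown(n, v_ref, v_t):
    """Supply voltage at which gate delay is n times the delay at v_ref."""
    r = n * v_ref / (v_ref - v_t) ** 2
    return (2 * r * v_t + 1 + math.sqrt(4 * r * v_t + 1)) / (2 * r)


def power_ratio(n, v_ref, v_t, delta):
    """P_N / P_1 = [1 + delta*(n - 1)] * (V_N / V_ref)**2."""
    v_n = supply_for_slowdown(n, v_ref, v_t)
    return (1 + delta * (n - 1)) * (v_n / v_ref) ** 2


def optimum_n(v_ref, v_t, delta, n_max=200):
    """Degree of parallelism in 1..n_max that minimizes P_N / P_1."""
    return min(range(1, n_max + 1),
               key=lambda n: power_ratio(n, v_ref, v_t, delta))


for v_t in (0.8, 0.4, 0.0):
    n_opt = optimum_n(v_ref=5.0, v_t=v_t, delta=0.1)
    print(f"Vt = {v_t} V: best N = {n_opt}, "
          f"P_N/P_1 = {power_ratio(n_opt, 5.0, v_t, 0.1):.3f}")
```

For Vt = 0 the ratio keeps decreasing, so the search simply returns `n_max`; for Vt > 0 the supply cannot drop below Vt while the overhead keeps growing, so the minimum occurs at a finite N.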
Example: Multiplier Core
• Specification:
• 200MHz Clock
• 15W dissipation @ 5V
• Low voltage operation, VDD ≥ 1.5 volts
• Relative clock rate = $(V_{DD} - 0.5)^2 / 20.25$
• Problem:
• Integrate multiplier core on a SOC
• Power budget for multiplier ~ 5W
A Multicore Design
[Figure: five multiplier cores (Multiplier Core 1 to Multiplier Core 5), each with an input register clocked at 40MHz, feed a 5 to 1 mux and an output register clocked at 200MHz; a multiphase clock generator and mux control is driven by the 200MHz CK.]
Core clock frequency = 200/N MHz, N should divide 200.
How Many Cores?
• For N cores:
• clock frequency = 200/N MHz
• Supply voltage, $V_{DDN} = 0.5 + (20.25/N)^{1/2}$ volts
• Assuming 10% overhead per core,
Power dissipation = $15\,[1 + 0.1(N - 1)]\,(V_{DDN}/5)^2$ watts
Design Tradeoffs
Number of cores N | Clock (MHz) | Core supply VDDN (Volts) | Total Power (Watts)
1 | 200 | 5.00 | 15.0
2 | 100 | 3.68 | 8.94
4 | 50 | 2.75 | 5.90
5 | 40 | 2.51 | 5.29
8 | 25 | 2.10 | 4.50
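The design tradeoff table follows from the supply-voltage and power formulas of the multiplier example. A minimal Python sketch is given here as an illustration; the function names and default arguments are assumptions, while the constants (0.5, 20.25, 15 W, 5 V, 10% overhead, 200 MHz) are taken from the slides.

```python
import math


def core_supply(n):
    """Supply voltage (V) at which each core runs at 1/n of the 200 MHz rate."""
    return 0.5 + math.sqrt(20.25 / n)


def total_power(n, overhead=0.1, p_ref=15.0, v_ref=5.0):
    """Total power (W) of n cores, with fractional overhead per added core."""
    return p_ref * (1 + overhead * (n - 1)) * (core_supply(n) / v_ref) ** 2


for n in (1, 2, 4, 5, 8):
    print(f"N = {n}: clock = {200 / n:.0f} MHz, "
          f"VDD = {core_supply(n):.2f} V, power = {total_power(n):.2f} W")
```

Evaluated without intermediate rounding, the formulas agree with the tabulated values to within a few hundredths of a volt or watt; the small differences are consistent with the table having been computed from rounded supply voltages.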
Power Reduction in Processors
• Just about everything is used.
• Hardware methods:
  • Voltage reduction for dynamic power
  • Dual-threshold devices for leakage reduction
  • Clock gating, frequency reduction
  • Sleep mode
• Architecture:
  • Instruction set
  • Hardware organization
• Software methods
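Parallel Architecture
[Figure: a single processor with input and output clocked at f, compared with two processors each clocked at f/2 that share an input and an output running at f.]
Single processor: Capacitance = C, Voltage = V, Frequency = f, Power = $CV^2f$
Two parallel processors: Capacitance = 2.2C, Voltage = 0.6V, Frequency = 0.5f, Power = $0.396\,CV^2f$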
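Pipeline Architecture
[Figure: a reference processor (Input → Register → Processor → Register → Output, clocked at f) compared with a two-stage pipeline in which the processor is split into two halves (½ Proc.) separated by an extra register, also clocked at f.]
Reference processor: Capacitance = C, Voltage = V, Frequency = f, Power = $CV^2f$
Pipelined processor: Capacitance = 1.2C, Voltage = 0.6V, Frequency = f, Power = $0.432\,CV^2f$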
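The power figures quoted for the two-processor and pipelined designs follow from scaling each factor of $CV^2f$. The short Python check below is an illustrative sketch; the function name `relative_power` is hypothetical.

```python
def relative_power(cap_factor, volt_factor, freq_factor):
    """Power relative to the reference C*V^2*f after scaling C, V and f."""
    return cap_factor * volt_factor ** 2 * freq_factor


parallel = relative_power(2.2, 0.6, 0.5)   # two processors, each at f/2
pipelined = relative_power(1.2, 0.6, 1.0)  # two half-processors, still at f
print(f"parallel:  {parallel:.3f} CV^2f")
print(f"pipelined: {pipelined:.3f} CV^2f")
```

Both designs run at 0.6V; the parallel one pays for its larger capacitance with a halved clock, so the two end up within about 10% of each other.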
Approximate Trend
Quantity | n-parallel proc. | n-stage pipeline proc.
Capacitance | nC | C
Voltage | V/n | V/n
Frequency | f/n | f
Power | $CV^2f/n^2$ | $CV^2f/n^2$
Chip area | n times | 10-20% increase
G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers, 1998.
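Both power entries in the trend table can be obtained by substituting the scaled capacitance, voltage and frequency into $P = CV^2f$. The short derivation below is added as a worked check and uses only the quantities listed in the table.

```latex
\begin{align*}
P_{\text{parallel}} &= (nC)\left(\frac{V}{n}\right)^{2}\frac{f}{n} = \frac{CV^{2}f}{n^{2}},\\
P_{\text{pipeline}} &= C\left(\frac{V}{n}\right)^{2} f = \frac{CV^{2}f}{n^{2}}.
\end{align*}
```

The two approaches reach the same power by different routes: parallelism lowers the clock of each copy, while pipelining keeps the capacitance roughly constant.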
Multicore Processors
[Figure: performance of single core and multicore processors, based on SPECint2000 and SPECfp2000 benchmarks, over the years 2000, 2004 and 2008. Source: Computer, May 2005, p. 12]
Multicore Processors
• D. Geer, “Chip Makers Turn to Multicore Processors,” Computer, vol. 38, no. 5, pp. 11-13, May 2005.
• A. Jerraya, H. Tenhunen and W. Wolf, “Multiprocessor Systems-on-Chips,” Computer, vol. 38, no. 7, pp. 36-40, July 2005; this special issue contains three more articles on multicore processors.
• S. K. Moore, “Winner: Multimedia Monster – Cell’s Nine Processors Make It a Supercomputer on a Chip,” IEEE Spectrum, vol. 43, no. 1, pp. 20-23, January 2006.
Cell - Cell Broadband Engine Architecture
Nine-processor chip: 192 Gflops
[Photo, L to R: Atsushi Kameyama, Toshiba; James Kahle, IBM; Masakazu Suzoki, Sony. © IEEE Spectrum, January 2006]
Cell’s Nine-Processor Chip
[Figure: chip photo. © IEEE Spectrum, January 2006]
Eight Identical Processors: f = 5.6GHz (max), 44.8 Gflops
?
Amdahl’s Law
[Figure: execution time on a time axis from 0 to 1, split into a serial part S and a parallel part P = 1 – S]
Speedup = $\frac{1}{S + (1 - S)/N}$
where N = number of parallel processors
Example:
S = 0.6, N = 10, Speedup = 1.56
S = 0.6, N = ∞, Speedup = 1.67
Gene Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” AFIPS Conference Proceedings, vol. 30, pp. 483-485, 1967.
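The two example speedups can be reproduced directly from Amdahl's formula. The following Python sketch is illustrative; the function name `amdahl_speedup` is hypothetical, and `math.inf` stands in for an unlimited number of processors.

```python
import math


def amdahl_speedup(serial_fraction, n):
    """Speedup = 1 / (S + (1 - S) / N), with S the serial fraction."""
    if not 0.0 < serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in (0, 1]")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


for n in (10, math.inf):
    print(f"S = 0.6, N = {n}: speedup = {amdahl_speedup(0.6, n):.2f}")
```

With S = 0.6 the speedup can never exceed 1/0.6, however many processors are added.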
Question
• Can we find a multi-processing law
– for power reduction, or
– for performance per watt?