The Rise of Dark Silicon
Nikos Hardavellas
Northwestern University, EECS
Energy is Shaping the IT Industry
#1 of Grand Challenges for Humanity in the Next 50 Years
[Smalley Institute for Nanoscale Research and Technology, Rice U.]
• A 1,000 m² datacenter draws 1.5 MW!
• Datacenter energy consumption in the US: >100 TWh in 2011 [EPA]
  – 2.5% of domestic power generation, $7.4B
• Global computing consumed ~408 TWh in 2010 [Gartner]
• Carbon footprint of the world’s data centers ≈ Czech Republic
• CO2-equivalent emissions of US datacenters ≈ airline industry (2%)
• 10% annual growth in installed computers worldwide [Gartner]
Exponential increase in energy consumption
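The dollar and percentage figures above follow from simple arithmetic. A minimal sanity check, assuming an average electricity price of about 7.4 cents/kWh and roughly 4,000 TWh of annual US generation (both my assumptions, not figures from the slide):

    # Sanity-check the slide's datacenter figures (Python)
    us_dc_twh = 100           # EPA estimate cited above (TWh/year)
    price_per_kwh = 0.074     # assumed average electricity price ($/kWh)
    us_generation_twh = 4000  # assumed total annual US generation (TWh)
    cost_b = us_dc_twh * 1e9 * price_per_kwh / 1e9   # TWh -> kWh -> $ -> $B
    share = us_dc_twh / us_generation_twh
    print(f"~${cost_b:.1f}B/year, ~{share:.1%} of US generation")
    # -> ~$7.4B/year, ~2.5% of US generation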
Application Dataset Scaling
[Chart: scaling factor (0–20) vs. year (2004–2019). Series: OS dataset scaling (Myhrvold's Law), transistor scaling (Moore's Law), and TPC dataset sizes (historic).]
Application datasets scale faster than Moore’s Law!
Datasets Grow Exponentially
• SPEC and TPC datasets grow faster than Moore’s Law
• Large Hadron Collider
  – March 2011: 1.6 PB of data produced and transferred to Tier-1
• Large Synoptic Survey Telescope
  – Produces 30 TB/night
  – Roughly equivalent to 2 Sloan Digital Sky Surveys daily
  – Sloan produced more data than the entire history of astronomy before it
• Massive data require massive computations to process them
Exponential increase in energy consumption
Exponential Growth of Core Counts
[Chart: cores per chip (log scale, 1–512) vs. year of introduction (1975–2015), with points ranging from early single-core processors (8086, 80286, 80386, 80486, Pentium, Pentium Pro, P-II, P-III, P-4, Itanium) to multicores and manycores (Power4, Pentium-D, UltraSPARC IV, Niagara, Xeon, Core 2 Duo/Quad, Turion X2, Agena, Barcelona, Toliman, Denmark, Brisbane, Core i7, Istanbul, Beckton, Magny-Cours, Cell, Sun Rock, MIT-RAW, Larrabee, TILE64, Teraflops).]
Does performance follow the same curve?
Performance Expectations vs. Reality
[Chart: relative performance (0–25) vs. year of technology introduction (2004–2019). Series: speedup under Moore's Law vs. speedup under physical constraints.]
Physical constraints limit speedup
Pin Bandwidth Scaling
[Chart: scaling factor (log scale, 1–16) vs. year (2004–2019). Series: transistor scaling (Moore's Law) vs. pin bandwidth.]
Cannot feed cores with data fast enough
Supply Voltage Scaling
[Chart: scaling factor (log scale, 0.5–16) vs. year (2004–2019). Series: transistor scaling (Moore's Law) vs. supply voltage (ITRS).]
Cannot power up all transistors simultaneously → Dark Silicon
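Why a nearly flat supply voltage produces dark silicon: dynamic power per transistor goes roughly as C·Vdd²·f, so when Vdd stops falling while transistor counts keep doubling, powering the whole chip blows past a fixed cooling budget. A minimal sketch with illustrative scaling factors (the 2x/0.7x/0.97x per-node factors and the 130 W budget are assumptions for illustration, not ITRS data):

    # Fraction of the chip that can be powered at once, node over node.
    # First-order: dynamic power ~ N_transistors * C_gate * Vdd^2 * f.
    budget_w = 130.0
    n, c, v = 1.0, 1.0, 1.0             # normalized to the first node
    for node in ["45nm", "32nm", "22nm", "16nm", "11nm"]:
        power_w = 130.0 * n * c * v**2  # chip power if everything switches
        print(f"{node}: ~{min(1.0, budget_w / power_w):.0%} of the chip can be on")
        n, c, v = n * 2.0, c * 0.7, v * 0.97   # assumed per-node scaling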
Chip Power Scaling
[Chart: watts per chip (0–250) vs. year (2004–2019). Series: max power removable with air cooling + heatsink vs. projected chip power (ITRS).]
Cooling does not scale!
Range of Operational Voltage
[Watanabe et al., ISCA’10]
Shrinking range of operational voltage hampers voltage-freq. scaling
Where Does Server Energy Go?
Many sources of power consumption:
• Server only [Fan, ISCA’07]
  – Processor chips (37%)
  – Memory (17%)
  – Peripherals (29%)
  – …
• Infrastructure (another 50%)
  – Cooling
  – Power distribution
A Study of Server Chip Scalability
• Actual server workloads today
  – Easily parallelizable (performance-scalable)
• Actual physical characteristics of processors/memory
  – ITRS projections for technology nodes
  – Modeled power/performance across nodes
• For server chips
  – Bandwidth is the near-term limiter
  → Energy is the ultimate limiter
First-Order Analytical Modeling
[Hardavellas, IEEE Micro 2011]
Physical characteristics modeled after UltraSPARC T2, ARM11
  – Area: cores + caches = 72% of die, scaled across technologies
  – Power: ITRS projections of Vdd, Vth, Cgate, Isub, Wgate, S0
    o Active: cores = f(GHz), cache = f(access rate), NoC = f(hops)
    o Leakage: f(area), f(devices), 66°C
    o Devices/ITRS: bulk planar CMOS, UTB-FD SOI, FinFETs, HP/LOP
  – Bandwidth:
    o ITRS projections on I/O pins, off-chip clock, f(miss rate, GHz)
  – Performance: CPI model based on miss rate
    o Parameters from real server workloads (DB2, Oracle, Apache)
    o Cache miss rate model (validated), Amdahl's & Myhrvold's laws
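For a flavor of what such a first-order model looks like, here is a minimal sketch in that spirit. The CPI breakdown, miss penalty, and power constants below are placeholder values chosen for illustration, not the parameters used in the paper:

    # First-order performance/power estimate for a multicore running a
    # memory-latency-bound server workload (all constants illustrative).
    def chip_gips(cores, ghz, misses_per_instr, mem_latency_ns=70.0, base_cpi=1.0):
        cpi = base_cpi + misses_per_instr * mem_latency_ns * ghz  # off-chip stalls
        return cores * ghz / cpi            # aggregate giga-instructions/sec

    def chip_power(cores, ghz, vdd, cache_mb, core_cap_nf=1.0, leak_w_per_mb=0.1):
        dynamic = cores * core_cap_nf * vdd**2 * ghz   # ~ C * V^2 * f per core
        leakage = cache_mb * leak_w_per_mb             # grows with cache area
        return dynamic + leakage

    # e.g., 64 cores, 2.7 GHz, 0.9 V, 2% off-chip miss rate, 64 MB of cache
    print(chip_gips(64, 2.7, 0.02), chip_power(64, 2.7, 0.9, 64))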
Caveats
• First-order model
  – The intent is to uncover trends relating the effects of technology-driven physical constraints to the performance of commercial workloads running on multicores
  – The intent is NOT to offer absolute numbers
• Performance model works well for workloads with low MLP
  – Database (OLTP, DSS) and web workloads are mostly memory-latency-bound
• Workloads are assumed parallel
  – Scaling server workloads is reasonable
Area vs. Power Envelope
[Chart: number of cores (log scale, 1–256) vs. cache size (1–512 MB, log scale). Curves: the 310 mm² area envelope and the 130 W power envelope.]
Good news: can fit 100s of cores. Bad news: cannot power them all
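The two envelopes in the chart can be reproduced with back-of-the-envelope constraints. A sketch with assumed per-core and per-megabyte area and power costs (illustrative values, not the ones used in the study):

    # Cores that fit under a 310 mm^2 area budget vs. a 130 W power budget,
    # as a function of cache size (per-core and per-MB costs are assumed).
    CORE_MM2, CORE_W = 2.0, 1.5
    CACHE_MM2_PER_MB, CACHE_W_PER_MB = 1.0, 0.2
    for cache_mb in (1, 4, 16, 64, 256):
        by_area  = (310 - cache_mb * CACHE_MM2_PER_MB) / CORE_MM2
        by_power = (130 - cache_mb * CACHE_W_PER_MB) / CORE_W
        print(f"{cache_mb:>3} MB cache: area fits {by_area:.0f} cores, "
              f"power allows {by_power:.0f}")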
Pack More, Slower Cores and Cheaper Cache
[Chart: number of cores (log scale, 1–256) vs. cache size (1–512 MB, log scale), with the 310 mm² area and 130 W power envelopes at voltage-frequency scaling (VFS) operating points: 1 GHz @ 0.27 V, 2.7 GHz @ 0.36 V, 4.4 GHz @ 0.45 V, 5.7 GHz @ 0.54 V, 6.9 GHz @ 0.63 V, 8 GHz @ 0.72 V, 9 GHz @ 0.81 V.]
The reality of the Power Wall: a power-performance trade-off
Pin Bandwidth Constraint
[Chart: number of cores (log scale, 1–256) vs. cache size (1–512 MB, log scale). Curves: the 310 mm² area and 130 W power envelopes at the VFS operating points above (1 GHz @ 0.27 V through 9 GHz @ 0.81 V), plus pin-bandwidth envelopes at 1 GHz and 2.7 GHz.]
Bandwidth constraint favors fewer + slower cores, more cache
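The bandwidth envelopes follow from matching per-core off-chip traffic against fixed pin bandwidth. A rough sketch of that calculation (the miss-rate curve, line size, IPC, and 64 GB/s pin bandwidth are assumed for illustration):

    # Cores supportable by pin bandwidth, vs. frequency and cache size.
    LINE_B, PIN_GBS, IPC = 64, 64.0, 0.5          # assumed constants
    def bw_limited_cores(ghz, cache_mb):
        misses_per_instr = 0.05 / cache_mb**0.5   # assumed miss-rate curve
        gbs_per_core = IPC * ghz * misses_per_instr * LINE_B
        return PIN_GBS / gbs_per_core
    for ghz in (1.0, 2.7):
        print(ghz, [round(bw_limited_cores(ghz, c)) for c in (4, 16, 64, 256)])
    # Slower cores and bigger caches push the pin-bandwidth ceiling up.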
Example of Optimization Results
[Chart: performance (GIPS, 0–600) vs. cache size (1–512 MB). Curves: area-constrained (max freq), power-constrained (max freq), bandwidth-constrained (VFS), area+power (VFS), and area+power+bandwidth (VFS). Annotations: the bandwidth constraint alone costs ~2x performance; power + bandwidth together cost ~5x.]
Jointly optimize parameters subject to constraints and SW trends
Design is first bandwidth-constrained, then power-constrained
Core Counts for Peak-Performance Designs
[Chart: number of cores (log scale, 1–1024) vs. year (2004–2019). Series: max EMB cores, embedded (EMB), and general-purpose (GPP) peak-performance designs. Physical characteristics modeled after the UltraSPARC T2 (GPP) and ARM11 (EMB).]
Designs > 120 cores impractical for general-purpose server apps
B/W and power envelopes + dataset scaling limit core counts
Short-Term Scaling Implications
Caches are getting huge
• Need cache architectures to deal with >> MB
  – Elastic Caches
    o Adapt behavior to the executing workload to minimize transfers
    o Reactive NUCA [Hardavellas, ISCA 2009][Hardavellas, IEEE Micro 2010]
    o Dynamic Directories [Das, NUTR 2010, in submission]
    o …but that's another talk…
Need to push back the bandwidth wall!!!
Mitigating Bandwidth Limitations: 3D-stacking
[Philips]
[Loh et al., ISCA’08]
[IBM]
Delivers TB/sec of bandwidth; use as a large “in-package” cache
Performance Analysis of 3D-Stacked Multicores
[Chart: performance (GIPS, 0–800) vs. cache size (1–512 MB) with 3D-stacked memory. Curves: area-constrained (max freq), power-constrained (max freq), bandwidth-constrained (VFS), area+power (VFS), and area+power+bandwidth (VFS).]
Chip becomes power-constrained
The Rise of Dark Silicon
Transistor counts increase exponentially, but we can no longer power the entire chip: voltages and cooling do not scale, and traditional hardware power-aware techniques (e.g., voltage-frequency scaling) are inadequate.
[Insets: transistor scaling (Moore's Law) vs. supply voltage (ITRS); max power under air cooling + heatsink vs. chip power (ITRS); operational voltage range from Watanabe et al., ISCA'10.]
Dark Silicon!!!
Exponentially-Large Area Left Unutilized
[Chart: die size (mm², log scale 64–512) vs. year (2004–2019). Series: maximum die size, and the die area used by peak-performance designs for DB2-TPCH, DB2-TPCC, and Apache, with an exponential trendline.]
Should we waste it?
Repurpose Dark Silicon for Specialized Cores
[Hardavellas, IEEE Micro 2011]
• Don’t waste it; harness it instead!
  – Use dark silicon to implement specialized cores
• Applications cherry-pick a few cores; the rest of the chip is powered off
• Vast unused area → many cores → likely to find good matches
[Diagram: chip floorplan as a sea of processing elements (PEs).]
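One way to read “applications cherry-pick a few cores” is as a runtime selection problem: per application (or phase), power up the specialized cores that buy the most speedup within the chip's power budget and leave everything else dark. A minimal greedy sketch; the core catalog and all numbers are invented for illustration, not the paper's mechanism:

    # Choose which specialized cores to power for one application.
    catalog = {                  # core type: (speedup on this app, watts when on)
        "h264_frontend": (20.0, 3.0),
        "crypto":        ( 8.0, 2.0),
        "regex":         ( 5.0, 1.5),
        "general_core":  ( 1.0, 5.0),
    }
    def pick_cores(budget_w):
        powered, used_w = [], 0.0
        # greedy by speedup-per-watt; the rest of the die stays dark
        for name, (speedup, watts) in sorted(catalog.items(),
                key=lambda kv: kv[1][0] / kv[1][1], reverse=True):
            if used_w + watts <= budget_w:
                powered.append(name)
                used_w += watts
        return powered, used_w
    print(pick_cores(budget_w=7.0))  # -> (['h264_frontend', 'crypto', 'regex'], 6.5)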
Overheads of General-Purpose Processors
[Chart: energy per operation (pJ, log scale 0.1–100,000) for FP MUL, INT ADD, L1d access, L2 access, a far-away cache slice, and memory, contrasting per-instruction overhead (~2000 pJ) with the useful operation (~50 pJ).]
Core specialization will minimize most overheads
ASICs ~100-700x more efficient than general-purpose cores
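Taking the slide's two figures at face value (roughly 2000 pJ of fetch/decode/control overhead around a ~50 pJ useful operation), the bookkeeping dwarfs the work, which is exactly what specialization removes:

    overhead_pj, op_pj = 2000.0, 50.0    # figures from the chart above
    print(f"~{(overhead_pj + op_pj) / op_pj:.0f}x the energy of the bare operation")
    # -> ~41x just from per-instruction overhead; specializing the datapath
    #    itself is where the rest of the 100-700x can come from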
First-Order Core Specialization Model
• Modeling of physically-constrained CMPs across technologies
• Model of specialized cores based on ASIC implementation of H.264:
  – Implementations on custom HW (ASICs), FPGAs, and multicores (CMP)
  – Wide range of computational motifs, extensively studied

  H.264 stage | Frames/sec (CMP) | Energy/frame, mJ (CMP) | Perf. gap, CMP vs. ASIC | Energy gap, CMP vs. ASIC
  IME         | 0.06             | 1179                   | 525x                    | 707x
  FME         | 0.08             | 921                    | 342x                    | 468x
  Intra       | 0.48             | 137                    | 63x                     | 157x
  CABAC       | 1.82             | 39                     | 17x                     | 261x
  (ASIC reference: 30 frames/sec; CMP: 4 cores)
[Hameed et al., ISCA 2010]
12x LOWER ENERGY compared to best conventional alternative
Specialized Multicores: Power Only Few Cores
[Chart: number of powered cores (log scale, 1–1024) vs. year (2004–2019). Series: max EMB cores, EMB + 3D memory, EMB, and GPP + 3D memory.]
Only a few cores need to run at a time for max speedup
Vast unused die area will allow many cores
The New Core Design
[analogy by A. Chien]
From fat conventional cores to a sea of specialized cores
The New Multicore
Power up only what you need
The New Multicore Node
[Diagram: a die of cores with 3D-stacked memory (cache).]
Note: can push further with more exotic technologies
(e.g., nanophotonics) …but that’s another talk
Specialized cores + 3D-die memory stacking
Design for Dark Silicon: Many Open Questions
To get 12x lower energy (12x performance for same power budget):
• Which candidates are best for off-loading to specialized cores?
• What should these cores look like?
  – Exploit commonalities to avoid core over-specialization
  – Can we classify all computations into N bins (for a small N)?
• What are the appropriate language/compiler/runtime techniques to drive execution migration?
  – Impact on scheduler?
• How to restructure software/algorithms for heterogeneity?
Thank You!
Questions?
References:
• N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Toward Dark
Silicon in Servers. IEEE Micro, Vol. 31, No. 4, July/August 2011.
• N. Hardavellas. Chip multiprocessors for server workloads. PhD thesis,
Carnegie Mellon University, Dept. of Computer Science, August 2009.
• N. Hardavellas, M. Ferdman, A. Ailamaki, and B. Falsafi. Power scaling: the
ultimate obstacle to 1K-core chips. Technical Report NWU-EECS-10-05,
Northwestern University, Evanston, IL, March 2010.
• R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, S.
Richardson, C. Kozyrakis, and M. Horowitz. Understanding sources of
inefficiency in general-purpose chips. In Proc. of ISCA, June 2010.
If you want to know more about my “other talks” come find me at the break!