Challenges for High Performance Processors
Hiroshi NAKAMURA
Research Center for Advanced Science and Technology, The University of Tokyo
France-Japan PAAP Workshop, November 2, 2007
What's the challenge?

Our primary goal: performance.

How?
- increase the number and/or operating frequency of functional units, AND
- supply the functional units with sufficient data (bandwidth)

Problems:
- Memory Wall: system performance is limited by poor memory performance
- Power Wall: power consumption is approaching the cooling limit
Memory Wall Problem

[Chart: relative performance improvement of CPU vs. memory (DRAM), 1980-2010, log scale]
- CPU: 55% / year
- DRAM: 7% / year
Example of Memory Wall:
Performance of a 2 GHz Pentium 4 on a[i] = b[i] + c[i]

[Chart: performance (MFLOPS, 0-500) vs. vector length (10 to 1,000,000) for the L1-hit, L2-hit, and cache-miss cases; even with a non-blocking cache and out-of-order issue, performance drops to about 1/6 once accesses miss the cache]
→ lack of effective memory throughput
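For concreteness, the kernel behind this measurement can be sketched in C as below; the kernel itself matches the slide, while the timing harness, initialization, and repetition counts are my own illustration, not taken from the slide.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Vector-add kernel a[i] = b[i] + c[i], timed over growing vector lengths
     * so that the working set moves from L1 to L2 to main memory. */
    static void triad(double *a, const double *b, const double *c, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }

    int main(void)
    {
        for (size_t n = 10; n <= 1000000; n *= 10) {
            double *a = malloc(n * sizeof *a);
            double *b = malloc(n * sizeof *b);
            double *c = malloc(n * sizeof *c);
            for (size_t i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

            int reps = (int)(10000000 / n) + 1;   /* repeat enough to measure */
            clock_t t0 = clock();
            for (int r = 0; r < reps; r++)
                triad(a, b, c, n);
            double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
            if (sec <= 0.0) sec = 1e-9;           /* avoid division by zero */

            /* one floating-point add per element */
            printf("n = %8zu : %8.1f MFLOPS (a[0] = %g)\n",
                   n, (double)n * reps / sec / 1e6, a[0]);

            free(a); free(b); free(c);
        }
        return 0;
    }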
Recap: Memory Wall Problem
- growing gap between processor and memory speed
  (e.g., Itanium 2 / Montecito: huge L3 cache, 12 MB x 2)
- in High Performance Computing (HPC), performance is limited by memory ability
  - long access latency of main memory
  - lack of main-memory throughput
→ making full use of wide-bandwidth local memory (on-chip memory) is indispensable
- on-chip memory space is a valuable resource
  - not large enough for HPC
  - data locality should be exploited
Does cache work well in HPC?
- works well in many cases, but it is not the best for HPC
  ✓ data placement and replacement are handled by hardware
  × unfortunate line conflicts occur even though most data accesses are regular
    (e.g., data used only once flushes out other useful data)
  × the transfer size between cache and off-chip memory is fixed
    - for consecutive data: a larger transfer size is preferable
    - for non-consecutive data: large line transfers bring in unnecessary data → waste of bandwidth
- most HPC applications exhibit regularity in data access, which is often not well exploited
SCIMA (Software Controlled Integrated Memory Architecture) [Kondo, ICCD 2000]
(joint work with Prof. Boku @ Univ. of Tsukuba and others)

- addressable SCM (Software Controllable Memory) in addition to an ordinary cache
  - occupies a part of the logical address space
  - no inclusion relation with the cache
- SCM and cache are reconfigurable at the granularity of a way

[Figure: overview of SCIMA - ALU/FPU and register file backed by on-chip SCM and cache, connected to off-chip memory (DRAM) and to the network via the NIA; the SCM is mapped into the address space]
Data Transfer Instruction
- load/store: register ↔ SCM / cache (the cache ↔ off-chip path still uses line transfers)
- page-load / page-store (new): SCM ↔ off-chip memory
  - large-granularity transfer → wider effective bandwidth by reducing latency stalls
  - block-stride transfer → avoids unnecessary data transfer → more effective utilization of on-chip memory
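To make the intended usage pattern concrete, here is a hedged C sketch of streaming a vector add through the SCM. The names scima_page_load and scima_page_store are hypothetical stand-ins for the page-load/page-store instructions (modeled with memcpy so the sketch compiles), and the chunk size is an arbitrary illustrative choice, not part of SCIMA.

    #include <stddef.h>
    #include <string.h>

    /* Stand-ins for SCIMA's page-load/page-store instructions.  On SCIMA these
     * would be single large-granularity (optionally block-stride) transfers
     * between off-chip memory and the SCM; here they are modeled with memcpy
     * so the sketch compiles.  Names and signatures are illustrative only. */
    static void scima_page_load(void *scm_dst, const void *mem_src, size_t bytes)
    { memcpy(scm_dst, mem_src, bytes); }

    static void scima_page_store(void *mem_dst, const void *scm_src, size_t bytes)
    { memcpy(mem_dst, scm_src, bytes); }

    #define N      1000000
    #define CHUNK  4096           /* elements per SCM-resident block (illustrative) */

    /* a[i] = b[i] + c[i] using the SCM as a software-managed stream buffer:
     * stream b and c through the SCM in large-granularity transfers instead of
     * relying on fixed-size cache line fills. */
    void triad_scima(double *a, const double *b, const double *c)
    {
        /* On SCIMA these buffers would live in the SCM part of the address space. */
        static double scm_a[CHUNK], scm_b[CHUNK], scm_c[CHUNK];

        for (size_t base = 0; base < N; base += CHUNK) {
            size_t n = (N - base < CHUNK) ? N - base : CHUNK;

            scima_page_load(scm_b, b + base, n * sizeof(double));
            scima_page_load(scm_c, c + base, n * sizeof(double));

            for (size_t i = 0; i < n; i++)
                scm_a[i] = scm_b[i] + scm_c[i];

            scima_page_store(a + base, scm_a, n * sizeof(double));
        }
    }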
Strategy of Software Control
- the SCM must be controlled by software
- arrays are classified into 6 groups by consecutiveness (consecutive / stride / irregular) and reusability (not-reusable / reusable):

                   not-reusable                      reusable
    consecutive    (1) use SCM as a stream buffer    (4) reserve SCM for reused data
    stride         (2) use SCM as a stream buffer    (5) reserve SCM for reused data
    irregular      (3) do not use SCM                (6) reserve SCM for reused data

- first, apply (1) and (2): allocate a small stream buffer in the SCM
- second, apply (4), (5) and (6): allocate the rest of the SCM for reused data
- prototype of a semi-automatic compiler: users specify hints on the reusability of data arrays
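Purely as an illustration of the table above (my own paraphrase in C, not the project's compiler code), the classification can be read as a small decision function:

    /* Illustrative paraphrase of the 6-group classification; this is not the
     * SCIMA compiler's actual code, just the decision table in C form. */
    enum consecutiveness { CONSECUTIVE, STRIDE, IRREGULAR };
    enum policy { STREAM_BUFFER_IN_SCM,    /* groups (1), (2) */
                  DO_NOT_USE_SCM,          /* group  (3)      */
                  RESERVE_SCM_FOR_REUSE }; /* groups (4)-(6)  */

    static enum policy classify(enum consecutiveness c, int reusable)
    {
        if (reusable)
            return RESERVE_SCM_FOR_REUSE;               /* (4), (5), (6) */
        return (c == IRREGULAR) ? DO_NOT_USE_SCM        /* (3) */
                                : STREAM_BUFFER_IN_SCM; /* (1), (2) */
    }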
Results of Memory Traffic

[Chart: memory traffic (cache misses + page-load/store), normalized to the cache model, for CG, FT, and QCD with line sizes of 32 B and 128 B, comparing Cache vs. SCIMA]

- benchmark programs: CG, FT, QCD
- assumptions
  - cache model: cache size = 64 KB (4-way), SCM size = 0 KB
  - SCIMA model: cache size = 16 KB (1-way), SCM size = 48 KB
  - total number of ways: 4; line size: 32 B or 128 B
- unnecessary memory traffic is suppressed: memory traffic decreases by 1%-61% with SCIMA thanks to full exploitation of data reusability
Results of Performance

[Chart: execution time normalized to the cache model, broken down into CPU busy time, latency stall, and throughput stall, for CG, FT, and QCD with line sizes of 32 B and 128 B, comparing Cache vs. SCIMA]

- latency stall: elapsed time due to memory latency
- throughput stall: elapsed time due to lack of throughput
- assumptions: load/store latency: 2 cycles; bus throughput: 4 B/cycle; memory latency: 40 cycles
- SCIMA is 1.3-2.5 times faster than the cache model
  - latency stall is reduced by large-granularity data transfer
  - throughput stall is reduced by suppressing unnecessary data transfer
Power Wall
- next focus: power consumption of processors
  - Is there any room for power reduction?
  - If yes, then how to reduce it?

[Chart: trends of heat density; Itanium marked at 130 W]
Observation (1) – Moore's Law –
- number of transistors: doubles every 18 months
Observation (2) – frequency –
- frequency: doubles every 3 years
- number of transistors: doubles every 18 months
→ number of switching events on a chip: 8 times every 3 years
Observation (3) – performance –
- number of switching events on a chip: 8 times every 3 years
- effective performance: 4 times every 3 years
  ("microprocessor performance improved 55% per year", from "Computer Architecture: A Quantitative Approach" by J. Hennessy and D. Patterson, Morgan Kaufmann)
→ unnecessary switching (= the chance for power reduction) doubles every 3 years, since switching grows 8x while useful performance grows only 4x over the same period
An Evidence of the Observation
- unnecessary switching = x2 / 3 years [Zyuban00] @ ISLPED'00

[Chart: access energy per instruction (nJ) vs. issue width (4, 6, 8, 10, 12), broken down into rename map table, bypass mechanism, load/store window, issue window, register file, and functional units, for committed vs. flushed instructions]

- energy per instruction increases as ILP is exploited for higher performance
  - at the functional units: no increase
  - at the issue window and register file: increase
  - energy for instructions flushed by incorrect prediction: increase
→ waste of power
Registers
- the register file consumes a lot of power
  - roughly speaking, power ∝ (num. of registers) × (num. of ports)
  - high-performance wide-issue superscalar processors → more registers, more read/write ports
- open question
  - in HPC, what is the best way to use many functional units (or accelerators), from the perspective of register file design?
    - scalar registers with SIMD operations
    - vector registers with vector operations
    - .........
- personal impression
  - vector registers are accessed in a well-organized fashion, so it is easy to reduce the number of ports by the sub-banking technique
  - but can vector operations make good use of local on-chip memory? (at least, traditional vector processors never could)
Dual Core helps ...

Rule of thumb:
  1% voltage reduction   → 3% power reduction
  1% frequency reduction → 0.66% performance loss

In the same process technology:

                  Single core (core + cache)   Dual core (2 cores + cache)
  Voltage         1                            -15%
  Frequency       1                            -15%
  Area            1                            2
  Power           1                            1
  Performance     1                            ~1.8
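A quick sanity check of the dual-core numbers, applying the rule of thumb linearly; the arithmetic below is my own and only approximates the slide's figures (it gives power of about 1.1 rather than exactly 1):

    #include <stdio.h>

    /* Apply the slide's rule of thumb linearly: each 1% voltage reduction cuts
     * power by ~3%, each 1% frequency reduction costs ~0.66% performance.
     * With voltage and frequency both lowered by 15% and two cores on the die,
     * total power stays near 1 while performance rises to roughly 1.8. */
    int main(void)
    {
        double reduction  = 0.15;                     /* -15% voltage and frequency */
        double core_power = 1.0 - 3.0  * reduction;   /* ~0.55 per core */
        double core_perf  = 1.0 - 0.66 * reduction;   /* ~0.90 per core */

        printf("dual-core power:       %.2f\n", 2.0 * core_power); /* ~1.10 */
        printf("dual-core performance: %.2f\n", 2.0 * core_perf);  /* ~1.80 */
        return 0;
    }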
Multi-Core helps more ...

[Figure: a large core (power 4, performance 2) vs. a small core (power 1, performance 1): a small core delivers 1/2 the performance at 1/4 the power; four small cores (C1-C4) sharing the cache reach performance 4 within the same power 4]

- no need for wider instruction issue
- multi-core:
  - power efficient
  - better power and thermal management
Leakage problem

How to attack the leakage problem?

[Figures from [Borkar-MICRO05] / IEEE Computer Magazine: (a) leakage paths in a CMOS gate - gate leakage through the thin SiO2 oxide and source-drain (sub-threshold) leakage through the OFF transistor; (b) projected power and power density (W, W/cm2) of a 10 mm die across the 90 nm, 65 nm, 45 nm, 32 nm, 22 nm, and 16 nm nodes, showing active vs. leakage power]
Introduction of our research
- "Innovative Power Control for Ultra Low-Power and High-Performance System LSIs"
  - 5-year project started in October 2006
  - supported by JST (Japan Science and Technology Agency) as a CREST (Core Research for Evolutional Science and Technology) program
- objective: drastic power reduction of high-performance system LSIs by innovative power control, through tight cooperation across design levels including circuit, architecture, and system software
- members:
  - Prof. H. Nakamura (U. Tokyo): architecture & compiler [leader]
  - Prof. M. Namiki (Tokyo Univ. of Agri. & Tech.): OS
  - Prof. H. Amano (Keio Univ.): architecture & F/E design
  - Prof. K. Usami (Shibaura I.T.): circuit & B/E design
How to reduce leakage: Power Gating
- focusing on power gating for reducing leakage
  - insert a power switch (PS) between VDD and GND
  - turn the PS off during sleep

[Figure: logic gates between VDD and GND; with power gating, a sleep-controlled power switch is inserted below the logic gates, creating a virtual GND]
Run-time Power Gating (RTPG)

[Figure: circuits A, B, and C, each with its own power switch driven by a sleep control circuit]

- control the power switches at run time
  - coarse grain: mobile processor by Renesas (independent power domains for the BB module, MPEG module, ...)
  - fine grain (our target): power gating within a module
Fine-grain Run-time Power Gating
- longer sleep time is preferable
  - leakage savings vs. overheads: power penalties for wakeup
- evaluation through a real chip had not been reported
- test vehicle: 32-bit x 32-bit multiplier
  - either or both operands (input data) are often less than 16 bits wide
  - the circuit portions that compute the upper bits of the product need not operate → wasted leakage power
→ by detecting zeros in the upper 16 bits of the operands, the internal multiplier array is power-gated
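A minimal C model of the gating condition described above, assuming the conservative case where both operands fit in 16 bits; the real chip performs this zero detection in hardware and drives the power switches of the upper multiplier array, so this is only an illustration of the decision logic.

    #include <stdint.h>
    #include <stdio.h>

    /* Software model of the detection: if both 32-bit operands have all-zero
     * upper halves, the portions of the 32x32 multiplier array that compute
     * the upper bits of the product produce nothing useful, so they could be
     * power-gated (put to sleep) for this operation. */
    static int upper_array_can_sleep(uint32_t a, uint32_t b)
    {
        return (a >> 16) == 0 && (b >> 16) == 0;
    }

    int main(void)
    {
        uint32_t a = 1234, b = 5678;   /* both operands fit in 16 bits */
        if (upper_array_can_sleep(a, b))
            printf("upper multiplier array could be gated: %u * %u = %llu\n",
                   a, b, (unsigned long long)a * b);
        return 0;
    }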
Test chip "Pinnacle" (real measurement)

[Chart: measured power dissipation (mW, roughly 2.0-4.0) at 25 C, 85 C, and 125 C for three sequences: Sequence 1 (no sleep, FG-RTPG not applied), Sequence 2 (domain H sleeps), Sequence 3 (domains H and M sleep)]

  Technology        STARC 90 nm CMOS
  Multiplier core
    Area            0.544 x 0.378 mm2
    # cells         15,000
  Design time       4.5 months
  Design members    3 Master students, 1 Bachelor student, 1 Faculty

- exhibits good power reduction
- current status
  - designing a pipelined microprocessor with FG-RTPG
  - compiler (instruction scheduler) to increase sleep time
Low Power Linux Scheduler based on statistical modeling
- co-optimization of system software and architecture
- objective: a process scheduler that reduces power consumption by DVFS (dynamic voltage and frequency scaling) of each process while satisfying its performance constraint
- how to find the lowest frequency that still satisfies the performance constraint?
  - it depends on hardware and program characteristics
  - the performance ratio differs from the frequency ratio
  - hard to find the answer directly
→ modeling by statistical analysis of hardware events
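To illustrate the idea only (this is not the project's actual statistical model), the sketch below assumes a hypothetical linear model, fitted offline from hardware event counters, that predicts a process's relative performance at each frequency; the scheduler then picks the lowest frequency whose prediction still meets the performance threshold. All function names, the model form, and the numbers are assumptions.

    #include <stdio.h>

    /* Hypothetical, illustration-only model: predicted relative performance at
     * frequency f (vs. the maximum frequency f_max) for a process whose
     * "memory-boundedness" m (0..1, derived from hardware event counters such
     * as cache-miss ratios) was estimated by offline statistical fitting.
     * Memory-bound phases lose little performance when the clock is lowered. */
    static double predicted_rel_perf(double f, double f_max, double m)
    {
        double cpu_part = (1.0 - m) * (f / f_max);  /* scales with frequency    */
        double mem_part = m;                        /* limited by memory, not f */
        return cpu_part + mem_part;
    }

    /* Pick the lowest available frequency whose predicted performance still
     * meets the specified threshold (e.g., 0.9 = at most 10% slowdown). */
    static double lowest_ok_frequency(const double *freqs, int n,
                                      double f_max, double m, double threshold)
    {
        double best = f_max;
        for (int i = 0; i < n; i++)
            if (predicted_rel_perf(freqs[i], f_max, m) >= threshold &&
                freqs[i] < best)
                best = freqs[i];
        return best;
    }

    int main(void)
    {
        double freqs[] = { 0.8e9, 1.0e9, 1.2e9, 1.5e9, 2.0e9 };  /* DVFS steps */
        double f = lowest_ok_frequency(freqs, 5, 2.0e9, 0.6, 0.9);
        printf("chosen frequency: %.1f GHz\n", f / 1e9);          /* 1.5 GHz */
        return 0;
    }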
Evaluation result
- platform: Pentium M 760 (max 2.00 GHz, FSB 533 MHz)

[Chart: achieved relative performance vs. specified performance threshold (0.5-1.0) for mcf, bzip2, swim, mgrid, and matrix kernels of size 50, 600, and 1000; the black dotted line marks the specified threshold]

- performance is within the threshold in all cases except mgrid
  - mgrid falls 3-7% below the threshold
- an accurate model is obtained
- a Linux scheduler using this model has been developed
Summary
- challenges for high-performance processors: Memory Wall and Power Wall
- one solution to the memory wall:
  - make good use of on-chip memory with software controllability
- solutions to the power wall:
  - many cores will relax the problem, but
  - leakage current is becoming a big problem
  - new research and approaches are required
  - our project "Innovative Power Control for Ultra Low-Power and High-Performance System LSIs" was introduced