© Brehob 2014 -- Portions © Brooks, Dutta, Mudge & Wenisch
On Power and Multi-Processors
Finishing up power issues and how those
issues have led us to multi-core processors.
Introduce multi-processor systems.
© Brehob 2014 -- Portions © Brooks, Dutta, Mudge & Wenisch
Stuff upcoming soon
• 3/24: HW4 due
• 3/26: MS2 due
• 3/28: MS2 meetings (sign up posted on 3/24)
• 4/2: Quiz
© Brehob 2014 -- Portions © Brooks, Dutta, Mudge & Wenisch
Power from last time
• Power is perhaps the performance limiter
– Can’t remove enough heat to keep performance
increasing
• Even for things with plugs, energy is an issue
– $$$$ for energy, $$$$ to remove heat.
• For things without plugs, energy is huge
– Cost of batteries
– Time between charging
© Brehob 2014 -- Portions © Brooks, Dutta, Mudge & Wenisch
What we did last time (1/5)
• Power is important
– It has become a (perhaps the) limiting factor in
performance
© Brehob 2014 -- Portions © Brooks, Dutta, Mudge & Wenisch
When last we met (2/5)
• Power is important to computer architects.
– It limits performance.
• With cooling techniques we can increase our power budget (and thus our performance).
– But these techniques get very expensive very quickly.
– Both to build and to operate active cooling devices: it costs power (and thus $) to cool, and the active cooling system itself costs $ to build.
© Brehob 2014 -- Portions © Brooks, Dutta, Mudge & Wenisch
When we last met (3/5)
• Energy is important to computer architects
– Energy is what we really pay for (10¢ per kWh)
– Energy is what limits batteries (usually listed as mAh)
• AA batteries tend to have 1000–2000 mAh or (assuming 1.5 V) 1.5–3.0 Wh*
• iPad battery is rated at about 25Wh.
– Some devices limited by energy
they can scavenge.
http://iprojectideas.blogspot.com/2010/06/human-power-harvesting.html
* Watch those units. Assuming it takes 5Wh of energy to charge a 3Wh battery,
how much does it cost for the energy to charge that battery? What about an iPad?
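Working the footnote's numbers at the 10¢/kWh rate above (the iPad's ~42 Wh of wall energy is an assumption, scaled by the same 5:3 charge-to-capacity ratio):
• AA cell: 5 Wh = 0.005 kWh → 0.005 kWh × 10¢/kWh = 0.05¢ per charge.
• iPad: 25 Wh battery → roughly 42 Wh from the wall → about 0.4¢ per charge.
Either way the energy itself costs almost nothing; the battery hardware and the time between charges are the real constraints.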
© Brehob 2014 -- Portions © Brooks, Dutta, Mudge & Wenisch
When we last met (4/5)
Power vs. Energy
What uses power in a chip?
Dynamic (Capacitive) Power Dissipation
[Figure: CMOS inverter driving a load capacitance CL, with VIN, VOUT, and the charging current I labeled.]
• Data dependent – a function of switching activity
What uses power in a chip?
Capacitive Power Dissipation
Power ~ ½ C V² A f
• Capacitance (C): function of wire length, transistor size
• Supply voltage (V): has been dropping with successive fab generations
• Activity factor (A): how often, on average, do wires switch?
• Clock frequency (f): increasing…
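As a quick sanity check on the formula (these particular numbers are assumed for illustration, not from the slides): with C = 1 nF of switched capacitance, V = 1 V, A = 0.1, and f = 1 GHz,
Power ~ ½ × (1×10⁻⁹ F) × (1 V)² × 0.1 × (1×10⁹ Hz) = 0.05 W = 50 mW.
Because V enters squared (and, as the later slides argue, f tends to track V), supply voltage is the biggest single knob in this equation.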
What uses power in a chip?
Static Power: Leakage Currents
[Figure: CMOS inverter with load capacitance CL, showing subthreshold leakage ISub and gate leakage Igate; VIN and VOUT labeled.]
Subthreshold leakage: I_Dsub ≈ k · e^(−q·VT / (a·ka·T))   (VT = threshold voltage, T = temperature)
• Subthreshold currents grow exponentially with increases in temperature and decreases in threshold voltage
• But threshold voltage scaling is key to circuit performance!
• Gate leakage is primarily dependent on gate oxide thickness and biases
• Both types of leakage are heavily dependent on stacking and input pattern
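To get a feel for that exponential (the slope figure here is a standard device-physics number, not from the slides): subthreshold current changes by roughly one decade for every 60–100 mV of threshold-voltage change at room temperature, so cutting VT by about 100 mV can raise leakage by roughly 10×. That is the tension in the second bullet: the VT reductions that make circuits fast also make them leak.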
© Brehob 2014 -- Portions © Brooks, Dutta, Mudge & Wenisch
When last we met (5/5)
• How do performance and power relate?*
– Power approximately proportional to V²f.
– Performance is approximately proportional to f
– The required voltage is approximately
proportional to the desired frequency.**
• If you accept all of these assumptions, we get
that an X increase in performance would
require an X³ increase in power.
– This is a really important fact
*These are all pretty rough rules of thumb. Consider the second one and discuss its shortcomings.
**This one in particular tends to hold only over fairly small (10-20%?) changes in V.
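Spelling the chain out (same rough assumptions as above): Power ∝ V²f and V ∝ f, so Power ∝ f³; performance ∝ f, so an X× performance increase means f (and V) scale by X, and power scales by X³. For example, 1.5× the performance would need roughly 1.5³ ≈ 3.4× the power.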
© Brehob 2014 -- Portions © Brooks, Dutta, Mudge & Wenisch
What we didn’t quite do last time
• For things without plugs, energy is huge
– Cost of batteries
– Time between charging
• (somewhat extreme) Example:
– iPad has ~42Wh (11,560mAh) which is huge
– Still run down a battery in hours
– Costs $99 for a new one (2 years of use or so)
© Brehob 2014 -- Portions © Brooks, Wenisch
E vs. EDP vs. ED²P
What uses power in a chip?
• Currently have a processor design:
  • 80 W, 1 BIPS, 1.5 V, 1 GHz
  • Want to reduce power; willing to lose some performance
• Cache Optimization:
  – IPC decreases by 10%, reduces power by 20% =>
    Final Processor: 900 MIPS, 64 W
  – Relative E = MIPS/W (higher is better) = 14/12.5 = 1.125x
• Energy is better, but is this a “better” processor?
© Brehob 2014 -- Portions © Brooks, Wenisch
Not necessarily
What uses power in a chip?
• 80 W, 1 BIPS, 1.5 V, 1 GHz
• Cache Optimization:
  – IPC decreases by 10%, reduces power by 20% =>
    Final Processor: 900 MIPS, 64 W
  – Relative E = MIPS/W (higher is better) = 14/12.5 = 1.125x
  – Relative EDP = MIPS²/W = 1.01x
  – Relative ED²P = MIPS³/W = 0.911x
• What if we just adjust frequency/voltage on the processor?
  – How to reduce power by 20%?
  – P = CV²f ∝ V³ (since f ∝ V) => drop voltage by 7% (and also freq) => 0.93 × 0.93 × 0.93 ≈ 0.8x
• So for equal power (64 W):
  – Cache Optimization = 900 MIPS
  – Simple Voltage/Frequency Scaling = 930 MIPS
(A small code sketch of this arithmetic follows below.)
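A small C sketch of the arithmetic on this slide (the helper name ratio is made up; the baseline and optimized numbers are just the slide's example wrapped in code):

#include <stdio.h>
#include <math.h>

/* Compare two design points by energy-style metrics.
 * MIPS/W   ~ 1/(energy per instruction)   (E)
 * MIPS^2/W ~ 1/(energy * delay)           (EDP)
 * MIPS^3/W ~ 1/(energy * delay^2)         (ED^2P)
 * Higher is better for all three ratios below. */
static double ratio(double mips_new, double watts_new,
                    double mips_old, double watts_old, int exponent) {
    return (pow(mips_new, exponent) / watts_new) /
           (pow(mips_old, exponent) / watts_old);
}

int main(void) {
    double base_mips = 1000, base_w = 80;  /* 1 BIPS, 80 W baseline            */
    double opt_mips  = 900,  opt_w  = 64;  /* cache opt: -10% IPC, -20% power  */

    printf("Relative E     = %.3fx\n", ratio(opt_mips, opt_w, base_mips, base_w, 1));
    printf("Relative EDP   = %.3fx\n", ratio(opt_mips, opt_w, base_mips, base_w, 2));
    printf("Relative ED^2P = %.3fx\n", ratio(opt_mips, opt_w, base_mips, base_w, 3));

    /* Alternative: plain voltage/frequency scaling.  With P ~ V^2*f and V ~ f,
     * P ~ f^3, so at 0.8x power we can keep f (and MIPS) at 0.8^(1/3) ~ 0.93x. */
    double dvfs_mips = base_mips * cbrt(0.8);
    printf("DVFS at 64 W   = %.0f MIPS\n", dvfs_mips);
    return 0;
}

Running it reproduces the slide's numbers: 1.125x, ~1.01x, ~0.911x, and roughly 930 MIPS for simple voltage/frequency scaling.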
© Brehob 2014 -- Portions © Brooks, Wenisch
So?
Performance & Power
• What we’ve concluded is that if we want to increase performance by a factor of X, we might be looking at a factor of X³ in power!
• But if you are paying attention, that’s just for voltage scaling!
• What about other techniques?
© Brehob 2014 -- Portions © Brooks, Wenisch
Other techniques?
Performance & Power
• Well, we could try to improve the amount of ILP we take advantage of.
• That probably involves making a “wider” processor (more superscalar).
• What are the costs associated with doing that?
  – How much bigger do things get?
• What do we expect the performance gains to be?
• How about circuit techniques?
  – Historically the “threshold voltage” has dropped as circuits get smaller, so power drops.
  – This has (mostly) stopped being true.
  – And it’s actually what got us in trouble to begin with!
© Brehob 2014 -- Portions © Brooks, Wenisch
So we are hosed…
Performance & Power
• I mean, if voltage scaling doesn’t work, circuit shrinking doesn’t help (much), and ILP techniques don’t clearly work…
• What’s left?
• How about we drop performance to 80% of what it was and have 2 processors?
  – How much power does that take?
  – How much performance could we get?
  – Pros/Cons?
• What if I wanted 8 processors?
  – How much performance drop is needed per processor?
(One way to work these numbers is sketched below.)
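One hedged way to work those numbers, using the same rough V ∝ f assumption as before (and assuming perfect parallel speedup, which real workloads will not give you):
• 2 cores, each at 0.8× frequency/voltage: power ≈ 2 × 0.8³ ≈ 1.02× the original, while peak throughput ≈ 2 × 0.8 = 1.6×.
• 8 cores at the original total power: each core gets 1/8 of the power, so per-core performance ≈ (1/8)^(1/3) = 0.5×, for a peak throughput of 8 × 0.5 = 4×.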
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Multiprocessors
Keeping it all working together
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Why multi-processors?
• Why multi-processors?
  – Multi-processors have been around for a long time.
  – Originally used to get the best performance for certain highly-parallel tasks.
  – But as noted, we now use them to get solid performance per unit of energy.
• So that’s it?
  – Not so much.
  – We need to make it possible/reasonable/easy to use.
  – Nothing comes for free: if we take a task and break it up so it runs on a number of processors, there is going to be a price.
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Thread-Level Parallelism

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id, amt;
if (accts[id].bal >= amt)
{
    accts[id].bal -= amt;
    spew_cash();
}

0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
6: ...

• Thread-level parallelism (TLP)
  – Collection of asynchronous tasks: not started and stopped together
  – Data shared loosely, dynamically
• Example: database/web server (each query is a thread)
  – accts is shared, can’t register allocate even if it were scalar
  – id and amt are private variables, register allocated to r1, r2
EECS 470
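For concreteness, a minimal pthreads sketch of the same two-withdrawal scenario (this code is not from the slides; names like make_withdrawal are made up, and the missing synchronization is deliberate, since the race is the point of the next slides):

#include <pthread.h>
#include <stdio.h>

#define MAX_ACCT 256
struct acct_t { int bal; };
struct acct_t accts[MAX_ACCT];          /* shared between threads */

/* One ATM withdrawal: the load of bal and the store back are NOT atomic,
 * so two threads can both see the old balance. */
static void *make_withdrawal(void *arg) {
    int amt = *(int *)arg;              /* private, register allocated */
    int id = 241;                       /* private */
    if (accts[id].bal >= amt) {
        accts[id].bal -= amt;           /* ld ... sub ... st */
        printf("spew_cash(%d)\n", amt);
    }
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    int amt = 100;
    accts[241].bal = 500;
    pthread_create(&t0, NULL, make_withdrawal, &amt);
    pthread_create(&t1, NULL, make_withdrawal, &amt);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* Correct outcome is 300; an unlucky interleaving can leave 400. */
    printf("final balance = %d\n", accts[241].bal);
    return 0;
}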
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Shared-Memory Multiprocessors
• Shared memory
  – Multiple execution contexts sharing a single address space
    • Multiple programs (MIMD)
    • Or more frequently: multiple copies of one program (SPMD)
  – Implicit (automatic) communication via loads and stores
[Figure: processors P1–P4 all attached to one Memory System.]
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
What’s the other option?
• Basically the only other option is “message passing”


We communicate via explicit messages.
So instead of just changing a variable, we’d need to call a function to
pass a specific message.
• Message passing systems are easy to build and pretty

effeicent.
But harder to code.
• Shared memory programming is basically the same as multi-
threaded programming on one processors

EECS 470
And (many) programmers already know how to do that.
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
So Why Shared Memory?
Pluses
  – For applications, looks like a multitasking uniprocessor
  – For the OS, only evolutionary extensions required
  – Easy to do communication without the OS being involved
  – Software can worry about correctness first, then performance
Minuses
  – Proper synchronization is complex
  – Communication is implicit, so it is harder to optimize
  – Hardware designers must implement it
Result
  – Traditionally bus-based Symmetric Multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever
  – And the first with multi-billion-dollar markets
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Shared-Memory Multiprocessors
• There are lots of ways to connect processors together
[Figure: processors P1–P4, each with a Cache and an Interface, and memories M1–M4 with Interfaces, all attached to an Interconnection Network.]
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Paired vs. Separate Processor/Memory?
• Separate processor/memory
  – Uniform memory access (UMA): equal latency to all memory
    + Simple software: doesn’t matter where you put data
    – Lower peak performance
  – Bus-based UMAs common: symmetric multi-processors (SMP)
• Paired processor/memory
  – Non-uniform memory access (NUMA): faster to local memory
    – More complex software: where you put data matters
    + Higher peak performance: assuming proper data placement
[Figure: top, several CPU($) blocks sharing separate Mem modules over a bus (UMA); bottom, CPU($)+Mem+R (router) pairs connected point-to-point (NUMA).]
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Shared vs. Point-to-Point Networks
• Shared network: e.g., bus (left)
  + Low latency
  – Low bandwidth: doesn’t scale beyond ~16 processors
  + Shared property simplifies cache coherence protocols (later)
• Point-to-point network: e.g., mesh or ring (right)
  – Longer latency: may need multiple “hops” to communicate
  + Higher bandwidth: scales to 1000s of processors
  – Cache coherence protocols are complex
[Figure: left, CPU($)+Mem+R nodes on a shared bus; right, CPU($)+Mem+R nodes connected point-to-point in a ring.]
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Organizing Point-To-Point Networks
• Network topology: organization of network
  – Tradeoff: performance (connectivity, latency, bandwidth) vs. cost
• Router chips
  – Networks that require separate router chips are indirect
  – Networks that use processor/memory/router packages are direct
    + Fewer components, “Glueless MP”
• Point-to-point network examples
  – Indirect tree (left)
  – Direct mesh or ring (right)
[Figure: left, an indirect tree of router chips (R) with CPU($)+Mem leaves; right, a direct mesh/ring of CPU($)+Mem+R nodes.]
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Implementation #1: Snooping Bus MP
[Figure: four CPU($) blocks and two Mem modules sharing a bus.]
• Two basic implementations
• Bus-based systems
  – Typically small: 2–8 (maybe 16) processors
  – Typically processors split from memories (UMA)
    • Sometimes multiple processors on a single chip (CMP)
    • Symmetric multiprocessors (SMPs)
    • Common; I use one every day
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Implementation #2: Scalable MP
[Figure: CPU($)+Mem+R nodes connected point-to-point.]
• General point-to-point network-based systems
  – Typically processor/memory/router blocks (NUMA)
    • Glueless MP: no need for additional “glue” chips
  – Can be arbitrarily large: 1000’s of processors
    • Massively parallel processors (MPPs)
    • In reality only the government (DoD) has MPPs…
      – Companies have much smaller systems: 32–64 processors
      – Scalable multi-processors
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Issues for Shared Memory Systems
• Two in particular
  – Cache coherence
  – Memory consistency model
• Closely related to each other
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
An Example Execution
Processor 0:                     Processor 1:
0: addi r1,accts,r3              0: addi r1,accts,r3
1: ld 0(r3),r4                   1: ld 0(r3),r4
2: blt r4,r2,6                   2: blt r4,r2,6
3: sub r4,r2,r4                  3: sub r4,r2,r4
4: st r4,0(r3)                   4: st r4,0(r3)
5: call spew_cash                5: call spew_cash
(Columns tracked on the next slides: CPU0, CPU1, Mem)
• Two $100 withdrawals from account #241 at two ATMs
  – Each transaction maps to a thread on a different processor
  – Track accts[241].bal (address is in r3)
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
No-Cache, No-Problem
Processor 0 and Processor 1 each run the sequence 0–5 (addi, ld, blt, sub, st, call spew_cash).
Mem (accts[241].bal): 500 → still 500 after P0’s load → 400 after P0’s store → still 400 after P1’s load → 300 after P1’s store
• Scenario I: processors have no caches
  – No problem
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Cache Incoherence
Processor 0 and Processor 1 each run the sequence 0–5 (addi, ld, blt, sub, st, call spew_cash).
(P0$, P1$, Mem) for accts[241].bal:
  initially:          —        —        500
  after P0’s load:    V:500    —        500
  after P0’s store:   D:400    —        500
  after P1’s load:    D:400    V:500    500   (P1 reads the stale 500 from memory)
  after P1’s store:   D:400    D:400    500   (both caches think the balance is 400)
• Scenario II: processors have write-back caches
  – Potentially 3 copies of accts[241].bal: memory, P0$, P1$
  – Can get incoherent (inconsistent)
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Hardware Cache Coherence
[Figure: a CPU with its D$ tags and D$ data; a coherence controller (CC) sits between the cache and the bus.]
• Coherence controller:
  – Examines bus traffic (addresses and data)
  – Executes the coherence protocol
    • What to do with the local copy when you see different things happening on the bus
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Snooping Cache-Coherence Protocols
• Bus provides a serialization point
• Each cache controller “snoops” all bus transactions and takes action to ensure coherence:
  – invalidate
  – update
  – supply value
  – which action depends on the state of the block and the protocol
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
Snooping Design Choices
• Controller updates the state of blocks in response to processor and snoop events, and generates bus transactions
[Figure: the Processor issues ld/st to the Cache (each block holds State, Tag, Data); the controller also watches Snoop (observed bus transactions). Often there are duplicate cache tags for snooping.]
• Snoopy protocol
  – set of states
  – state-transition diagram
  – actions
• Basic choices
  – write-through vs. write-back
  – invalidate vs. update
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
The Simple Invalidate Snooping Protocol
• Write-through, no-write-allocate cache
• Actions: PrRd, PrWr, BusRd, BusWr
[State diagram:
  Valid:   PrRd / --  and  PrWr / BusWr  (stay Valid); observed BusWr → Invalid
  Invalid: PrRd / BusRd → Valid;  PrWr / BusWr  (stay Invalid)]
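A tiny C sketch of that state machine, just to make the transitions concrete (the enum and function names are made up; this mirrors the diagram above, not any real controller):

#include <stdio.h>

typedef enum { INVALID, VALID } line_state_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_WR } event_t;
typedef enum { BUS_NONE, BUS_ISSUE_RD, BUS_ISSUE_WR } bus_action_t;

/* Write-through, no-write-allocate, invalidate-on-remote-write.
 * Returns the bus transaction (if any) this cache must issue. */
static bus_action_t next_state(line_state_t *state, event_t ev) {
    switch (ev) {
    case PR_RD:                       /* processor read                         */
        if (*state == INVALID) { *state = VALID; return BUS_ISSUE_RD; }
        return BUS_NONE;              /* hit: PrRd / --                         */
    case PR_WR:                       /* processor write: write through,        */
        return BUS_ISSUE_WR;          /* state unchanged (no-write-allocate)    */
    case BUS_WR:                      /* snooped remote write                   */
        *state = INVALID;             /* invalidate our copy                    */
        return BUS_NONE;
    case BUS_RD:                      /* snooped remote read: no effect here    */
    default:
        return BUS_NONE;
    }
}

int main(void) {
    line_state_t p1 = INVALID;
    next_state(&p1, PR_RD);           /* miss -> issues BusRd, now Valid        */
    next_state(&p1, BUS_WR);          /* another cache wrote -> Invalid again   */
    printf("P1 line state: %s\n", p1 == VALID ? "Valid" : "Invalid");
    return 0;
}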
Example time
Actions: PrRd, PrWr, BusRd, BusWr
(Table to fill in: for each access, the Processor Transaction and Cache State at Processor 1 and Processor 2, plus the Bus transaction.)
Accesses, in order: Read A, Read A, Read A, Write A, Read A, Write A, Write A
(One possible filled-in trace is given below.)
EECS 470
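One possible way to fill in the trace, assuming the accesses alternate between Processor 1 and Processor 2, starting with P1 (that assignment is an assumption, since the slide leaves the table blank):

Access        P1 transaction   P1 state   P2 transaction   P2 state   Bus
P1: Read A    PrRd             Valid      —                Invalid    BusRd
P2: Read A    —                Valid      PrRd             Valid      BusRd
P1: Read A    PrRd (hit)       Valid      —                Valid      —
P2: Write A   —                Invalid    PrWr             Valid      BusWr (P1 invalidated)
P1: Read A    PrRd             Valid      —                Valid      BusRd
P2: Write A   —                Invalid    PrWr             Valid      BusWr (P1 invalidated)
P1: Write A   PrWr             Invalid    —                Invalid    BusWr (P2 invalidated; no-write-allocate, so P1 stays Invalid)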
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
More Generally: MOESI
[Sweazey & Smith, ISCA ’86]
M - Modified (dirty)
O - Owned (dirty but shared)   WHY?
E - Exclusive (clean, unshared: only copy, not dirty)
S - Shared
I - Invalid
[Figure: the five states S, O, M, E, I grouped by three properties: validity (M, O, E, S), exclusiveness (M, E), ownership (M, O).]
Variants
  – MSI
  – MESI
  – MOSI
  – MOESI
EECS 470
© Brehob 2014 -- Portions © Falsafi, Hill, Hoe, Lipasti,
Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch
MESI example
Actions:
• PrRd, PrWr
• BRL – Bus Read Line (BusRd)
• BWL – Bus Write Line (BusWr)
• BRIL – Bus Read and Invalidate
• BIL – Bus Invalidate Line
States:
• M - Modified (dirty)
• E - Exclusive (clean, unshared: only copy, not dirty)
• S - Shared
• I - Invalid
(Table to fill in: for each access, the Processor Transaction and Cache State at Processor 1 and Processor 2, plus the Bus transaction.)
Accesses, in order: Read A, Read A, Read A, Write A, Read A, Write A, Write A
(One possible filled-in trace is given below.)
EECS 470
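One possible way to fill in this table too, again assuming the accesses alternate P1, P2, P1, … (that assignment is an assumption) and standard MESI behavior:

Access        P1 transaction   P1 state   P2 transaction   P2 state   Bus
P1: Read A    PrRd             E          —                I          BRL (no other copy, so Exclusive)
P2: Read A    —                S          PrRd             S          BRL (P1 downgrades E → S)
P1: Read A    PrRd (hit)       S          —                S          —
P2: Write A   —                I          PrWr             M          BIL (invalidate sharers)
P1: Read A    PrRd             S          —                S          BRL (P2 supplies data, M → S)
P2: Write A   —                I          PrWr             M          BIL
P1: Write A   PrWr             M          —                I          BRIL (P2 writes back and invalidates)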