Download Multicores Cnt (EH) - Department of Information Technology

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Old Trend1: Deeper pipelines/higher freq.
Exploring ILP (instruction-level parallelism)
What Thread-Level Parallelism
(TLP) can buy you
I
PC
Erik Hagersten
Uppsala University
Sweden
R
Regs
…
Data Dependence
150cycles
AVDARK
2010
Mem
2
Dept of Information Technology| www.it.uu.se
Old Trend2: Wider pipelines
Exploring more ILP
I
Thread 1
PC
AVDARK
2010
Issue
logic
I
I
I
…
Thread 1
PC
Regs
Dept of Information Technology| www.it.uu.se
Mem
3
© Erik Hagersten| user.it.uu.se/~eh
Old Trend3: Deeper memory hierarchy
Exploring access locality
More pipelines +
Deeper pipelines
Î
Need more
independent
instructions
150cycles
1GB
1GB
© Erik Hagersten| user.it.uu.se/~eh
AVDARK
2010
Issue
logic
…
I
R
B M MW
I
R
B M MW
I
R
B M MW
I
R
B M MW
Regs
2kB
2 cycles
kr
10 cycles
€
64kB
30 cycles
£
2MB
150cycles
Mem
Dept of Information Technology| www.it.uu.se
4
1GB
© Erik Hagersten| user.it.uu.se/~eh
CMP: Chip Multiprocessor (aka Multicores)
Not enough MLP?
more TLP & geographical locality
Thread 1
Issue
logic
I
R
B M MW
I
R
B M MW
Slooow Memory
C
Regs
PC
…
SEK
Thread N
Issue
logic
I
R
B M MW
I
R
B M MW
PC
B
C
CPU
+
A
L1 cache
L3
Ctrl
Regs
B
L2 Cache
SEK
€
A = B + C:
$
AVDARK
2010
AVDARK
2010
Mem
5
Dept of Information Technology| www.it.uu.se
© Erik Hagersten| user.it.uu.se/~eh
Read B
Read C
Add B & C
WriteA
Latency
0.3 - 100
0.3 - 100
0.3
0.3 - 100
6
Dept of Information Technology| www.it.uu.se
TLP Î MLP
ns
ns
ns
ns
© Erik Hagersten| user.it.uu.se/~eh
Not enough ILP
Î memory accesses from many threads (MLP)
Slooow Memory
C
C
B
B
C
C
B
B
Thread
+ 2 A
Thread 1 A
+
AVDARK
2010
Issue Logic
Sloooow memory
TLP Î MLP
Thread-Level Parallellism Î Memory-Level Parallelism (MLP)
Dept of Information Technology| www.it.uu.se
7
© Erik Hagersten| user.it.uu.se/~eh
AVDARK
2010
LD
ST
LD
ST
LD
ST
LD
ST
LD
ST
X, R2
R1, R2,
X, R2
R1, R2,
X, R2
R1, R2,
X, R2
R1, R2,
X, R2
R1, R2,
CPU
L1
R3
R3
R3
R3
R3
Dept of Information Technology| www.it.uu.se
8
© Erik Hagersten| user.it.uu.se/~eh
TLP Î MLP
SMT: Simultaneous Multithreading
”Combine TLP&ILP to find independent instr.”
Î feed one superscalar with instr. from many threads
(also used in GPUs)
Thread 1
Issue Logic
PC
…
Thread N
CPU
Issue
logic
L1
I
R
B M MW
I
R
B M MW
I
R
B M MW
I
R
B M MW
Regs 1 …
Regs N
PC
SEK
AVDARK
2010
LD
ST
LD
ST
LD
ST
LD
ST
LD
ST
X, R2
R1, R2,
X, R2
R1, R2,
X, R2
R1, R2,
X, R2
R1, R2,
X, R2
R1, R2,
R3
R3
R3
R3
R3
LD
ST
LD
ST
LD
ST
LD
ST
LD
ST
X, R2
R1, R2,
X, R2
R1, R2,
X, R2
R1, R2,
X, R2
R1, R2,
X, R2
R1, R2,
€
R3
$
R3
R3
R3
AVDARK
2010
R3
Dept of Information Technology| www.it.uu.se
9
© Erik Hagersten| user.it.uu.se/~eh
Mem
Dept of Information Technology| www.it.uu.se
10
© Erik Hagersten| user.it.uu.se/~eh
Thread-interleaved
Historical Examples:
„ Denelcor, HEP, Tera Computers [B. Smith] 1984
Each thread executes every n:th cycle in a round-robin
fashion
-- Poor single-thread performance
-- Expensive (due to early adoption)
Intel’s “Hyperthreading” (2002)
-- Poor implementation
„
AVDARK
2010
Dept of Information Technology| www.it.uu.se
11
© Erik Hagersten| user.it.uu.se/~eh
Design Issues for Multicores
Erik Hagersten
Uppsala University
Sweden
„
„
„
„
„
„
„
AVDARK
2010
Performance
Performance
Performance
Performance
…
per
per
per
per
Watt?
memory byte?
bandwidth?
$?
13
© Erik Hagersten| user.it.uu.se/~eh
“
“
„
“
“
“
L1
L1
L1
CPU
CPU
CPU
CPU
14
© Erik Hagersten| user.it.uu.se/~eh
Fewer cores but…
wide issue?
O-O-O?
narrow issue?
in-order?
have you ever heard of Amdahl?
SMT, run-ahead, execute-ahead … to cure shortcomings?
Davis, Laudon and Olukotun, PACT 2006
Amdahls Law in the Multicore Era
Mark Hill, IEEE Computer, July 2008
http://www.cs.wisc.edu/multifacet/papers/ieeecomputer08_amdahl_multicore.pdf
computing suits CMPs the best
Dept of Information Technology| www.it.uu.se
L1
Read:
Maximizing CMP Throughput with Mediocre Cores
Once the workingset is in memory, work like crazy!
(in general)
L1
Narrow: More cores but…
“
Memory: the major cost of a CMP system!
How do we utilize it the best?
AVDARK
2010
L1
Fat:
“
Memory requirement?
Sharing in cache?
Memory bandwidth requirement?
Î Capability
L1
A Few Fat or Many Narrow Cores?
Issues:
“
L1
Dept of Information Technology| www.it.uu.se
„
“
CPU
Shared
Resources
Capacity? (≈several sequential jobs)
or
Capability? (≈one parallel job)
“
CPU
AVDARK
2010
Capacity or Capability Computing?
“
CPU
L2 Cache
How large fraction of a CMP system cost is in the
CPU chip?
Should the execution (MIPS/FLOPS) be viewed as a
scarce resource?
Dept of Information Technology| www.it.uu.se
CPU
Bandwidth
Shared Bottlenecks
CMP bottlenecks/points of optimization
AVDARK
2010
15
© Erik Hagersten| user.it.uu.se/~eh
Dept of Information Technology| www.it.uu.se
16
© Erik Hagersten| user.it.uu.se/~eh
How to Hide Memory Latency
(and create MLP)
Cores vs. caches
„
Depends on your target applications…
„
Niagara’s answer: go for cores
“
“
“
“
“
“
„
Options:
„ O-O-O
„ HW prefetching
„ SMT
„ Run-ahead/Execute-ahead
In-order 5-stage pipeline
8 cores a’ 4 SMT threads each Î 32 threads,
3MB shared L2 cache (96 kB/thread)
SMT to hide memory latency
Memory bandwidth: 25 GB/s
Will this approach scale with technology?
Others: go for cache
“
2-4 cores for now
AVDARK
2010
AVDARK
2010
Dept of Information Technology| www.it.uu.se
17
© Erik Hagersten| user.it.uu.se/~eh
Dept of Information Technology| www.it.uu.se
18
© Erik Hagersten| user.it.uu.se/~eh
Fighting for shared resources
Handling shared resources
Binary
Binary
Core
¢
¢
$
1st Order MC Performance Problems
wasted
Mem
Erik Hagersten
Uppsala University
Sweden
•Additional multicore issues: ‐ Even less cache resources per application
‐ Sharing of cache resources
‐ Wasted cache usage AVDARK
2010
Dept of Information Technology| www.it.uu.se
20
© Erik Hagersten| user.it.uu.se/~eh
Andreas’ research: Using Prefetch-NT
Cache Interference in Shared Cache
„
Cache sharing strategies:
1.
2.
3.
miss
rate
AMD, Prefetch NT:
• Install in L1 cache with NT bit set
• Non-inclusive caching Î Not in L2, L3
• Upon eviction from L1, do not install in L2, L3
(if NT is not set, cacheline will get installed)
Avoid cache polution
using existing x86
ISA:
NT = Non-temporal
Fight it out!
Fair share: 50% of the cache each
Maximize throughput: who will benefit the most?
Intel Core2, Prefetch NT:
• Install in L1&L2 caches (inclusive caching)
• Put in MRU place in L2 Î replaced more easily
• Upon eviction from L1, keep in L2
Binary
Core
2
3
1
single-threaded cache profiling
A
1
B
2
¢
¢
Intel i7, Prefetch NT:
• Install in L1&L2 caches (inclusive caching)
• Put in MRU place in L2 Î replaced more easily
• Upon eviction from L1, keep in L2
$
wasted
A examples: ammp, art, …
B examples: vpr_place, vortex, …
Mem
3
$ size
tot
All, Store-NT:
• Keep cachline in a special write-buffer
•When all bytes of the cache line has been updated
write to memory while bypassing caches
•Huge penaly in not all bytes are updated
22
© Erik Hagersten| user.it.uu.se/~eh
Read:
STATSHARE: A Statistical Model for Managing Cache Share via Decay
Pavlos Petoumenos et al in MOBS workshop ISCA 2006
AVDARK
2010
AVDARK
2010
Predicting the inter-thread cache contention on a CMP
Chandra et al in HPCA 2005
21
Dept of Information Technology| www.it.uu.se
© Erik Hagersten| user.it.uu.se/~eh
Dept of Information Technology| www.it.uu.se
Example: Hints to avoid cache pollution
(non-temporal prefetches)
Example: Hints for mixed workloads
(non-temporal prefetches)
The larger cache, the better
cache misses
0,25
”streaming”
Libquantum
LBM
0,2
Miss rate
bzip
Hint:
Don’t
allocate!
2x missrate
missrate
actual/4
actual
0,15
”bigger is better”
0,1
0,05
”tiny”
cache size
64M
32M
8M
16M
4M
2M
1M
512
256
64
128
32
8
4
16
2
1
0
Cache size
áctual
3
One Instance
Four Instances
1
0
Original
AVDARK
2010
Orig
Dept of Information Technology| www.it.uu.se
23
Lim=1.7MB
Hint: lim= actual/4
AVDARK
2010
© Erik Hagersten| user.it.uu.se/~eh
In mix
In mix, patched
1,2
Performance
Throughput
Individually
40% faster
2
1
25%
0,8
0,6
0,4
0,2
0
bzip2
AMD Opteron
Dept of Information Technology| www.it.uu.se
Libquantum
24
LBM
Geom mean
© Erik Hagersten| user.it.uu.se/~eh
Looks and Smells Like an SMP (aka UMA)?
SMP system
Multicore system
Memory
Wrapping up about multicores
Interconnect
Erik Hagersten
Uppsala Universitet
L2
L2
L1
L1
CPU
„
AVDARK
2010
CPU
Trends (my guess!)
„
„
t
t
now
Bandwidth/Thread
Thr. Comm. Cost (temporal)
t
now
Dept of Information Technology| www.it.uu.se
t
27
1
1
CPU
TT
TT
TT
TT
…8…
26
© Erik Hagersten| user.it.uu.se/~eh
Are we buying…
“ CPU
Cache/Thread
now
AVDARK
2010
„
t
now
Memory/Chip
L1
What matters for multicore
performance?
Transistors/Thread
t
L2
L2
Cost of parallelism?
Cache capacity per thread?
Memory bandwidth per thread?
Cost of thread communication? …
Dept of Information Technology| www.it.uu.se
now
… 32 …
Well, how about:
„
Threads/Chip
Memory
AVDARK
2010
frequency?
“ Number of cores?
“ MIPS and FLOPS?
“ Memory bandwidth?
“ Cache capacity?
“ Memory capacity?
“ Performance/Watt?
now
© Erik Hagersten| user.it.uu.se/~eh
Dept of Information Technology| www.it.uu.se
28
© Erik Hagersten| user.it.uu.se/~eh
MC Questions for the Future
„
„
„
„
„
„
How to get parallelism?
How to get performance/data locality?
How to debug?
A case for new funky languages?
A case for automatic parallelization?
Are we buying:
“
“
“
„
„
AVDARK
2010
„
X86 Architecture
Erik Hagersten
Uppsala University
Sweden
compute power,
memory capacity, or
memory bandwidth?
Will 128 cores be mainstream in 5 years?
Will the CPU market diverge into
desktop/capacity/capability/special-purpose CPUs
again?
A non-question: will it happen?
Dept of Information Technology| www.it.uu.se
29
© Erik Hagersten| user.it.uu.se/~eh
Intel Archeology
„
„
„
„
„
„
„
„
„
„
8086 registers
(8080: 1974, 6.0 kTransistors, 2MHz, 8bit)
8086: 1978, 29 kT, 5-10MHz, 16bit (PC!)
(80186:1982 ? kT, 4-40MHz, integration! )
80286: 1982, 0.1MT, 6-25MHz, chipset (PC-AT)
80386: 1985, 0.3MT, 16-33MHz, 32 bits
80486: 1989, 1.2MT, 25-50MHz, I&D$, FPU
Pentium: 1993, 3.1 MT, 66 MHz, superscalar
Pentium Pro: 1997, 5.5 MT, 200 MHz, O-O-O, 3-way superscalar
Intel Pentium4:2001, 42 MT, 1.5 GHz, Super-pipe, L2$ on-chip
…
„
„
„
„
„
„
„
„
„
„
AVDARK
2010
AVDARK
2010
Dept of Information Technology| www.it.uu.se
31
© Erik Hagersten| user.it.uu.se/~eh
„
AX (Accumulator)
BX (Base)
CX (Count)
DX (Data)
SP (Stack ptr)
BP (Base ptr)
SI (Source index)
DI (Destination index)
CS (Code segment)
DS (Data segment)
SS (Stack segment)
Dept of Information Technology| www.it.uu.se
”General purpose” registers
”Addressing registers”
Segmented addressing
(extending the address
range)
32
© Erik Hagersten| user.it.uu.se/~eh
Complex instructions of x86
„
“
“
“
“
“
„
Micro-ops
RISC (Reduced Instruction Set Computer)
LD/ST with a limited set of address modes
ALU instructions (a minimum)
Many general purpose registers
Simplifications (e.g., read R0 returns the value 0)
Simpler ISA Î more efficient implementations
„
„
x86 CISC (Complex Instruction Set Computer)
“
“
“
“
“
ALU/Memory in the same instruction
Complicated instructions
Few specialized registers (actually accumulator architecture)
Variable instruction length
x86 was lagging in performance to RISC in the 90s
AVDARK
2010
„
„
AVDARK
2010
Dept of Information Technology| www.it.uu.se
33
© Erik Hagersten| user.it.uu.se/~eh
Dept of Information Technology| www.it.uu.se
x86-64
„
„
„
„
„
„
„
„
„
SSEn: 16 128-bit SSE ”vector” registers
Backwards compatible with x86
Intel adoptions: IA-32e, EM64T, Intel64
NOTE: IA-64 is Itanium
Dept of Information Technology| www.it.uu.se
35
34
© Erik Hagersten| user.it.uu.se/~eh
x86 Vector instructions
ISA extension to x86 (by AMD 2001)
64-bit virtual address space
64-bit GP registers x16
x86’s regs extended: rax, rbx, rcx, rdx, rbp, rsp, rsi, rdi
x86-64 also has: r8, r9, ... r15 (i.e., a total of 16 regs)
NOTE: dynamic register renaming makes the effective number
of regs higher
AVDARK
2010
Newer pipelines implements RISC-ish μ-ops.
Some complex x86 instructions expanded to
several micro-ops at runtime.
The translated μ-ops may be cached in a
trace-cache [in their predicted order] (first:
Pentium4)
Expanded to “loop cache” in Core-2
MMX: 64 bit vectors (e.g., two 32bit ops)
SSEn: 128 bit vectors(e.g., four 32 bit ops)
AVX: 256 bit vectors(e.g., eight 32 bit ops)
(in Sandy Bridge, ~Q1 2011)
MIC: ”16-way vectors”. Is this 16 x 32 bits??
AVDARK
2010
© Erik Hagersten| user.it.uu.se/~eh
Dept of Information Technology| www.it.uu.se
36
© Erik Hagersten| user.it.uu.se/~eh
How to explore SIMD: nVIDIA
Examples of vector instructions
Vector Regs
„
A:
„
SSE_MUL D, B, A
„
„
B:
„
„
L2
C:
(less than the sum of L1:s)
L1
x x x x
512 ”processors” (P)
16 P/StreamProcessor (SP)
SP is SIMD-ish (sort off)
Full DP-FP IEEE support
64kB L1 cache /SP
768kB global shared cache
„
„
„
D:
„
„
Atomic instructions
ECC correction
Debugging support
Giant chip/high power
...
E:
AVDARK
2010
AVDARK
2010
...
Dept of Information Technology| www.it.uu.se
37
© Erik Hagersten| user.it.uu.se/~eh
Dept of Information Technology| www.it.uu.se
38
© Erik Hagersten| user.it.uu.se/~eh
How to explore SIMD: Intel MIC
”more than 50 cores”
Exponential Growth
Erik Hagersten
Uppsala University
Sweden
AVDARK
2010
SIMD instructions: 16-way, how wide?
39
Dept of Information Technology| www.it.uu.se
© Erik Hagersten| user.it.uu.se/~eh
Ray Kurzweil pictures
www.KurzweilAI.net/pps/WorldHealthCongress/
AVDARK
2010
Ray Kurzweil pictures
www.KurzweilAI.net/pps/WorldHealthCongress/
AVDARK
2010
Dept of Information Technology| www.it.uu.se
41
© Erik Hagersten| user.it.uu.se/~eh
Dept of Information Technology| www.it.uu.se
Ray Kurzweil pictures
www.KurzweilAI.net/pps/WorldHealthCongress/
AVDARK
2010
42
© Erik Hagersten| user.it.uu.se/~eh
Ray Kurzweil pictures
www.KurzweilAI.net/pps/WorldHealthCongress/
AVDARK
2010
Dept of Information Technology| www.it.uu.se
43
© Erik Hagersten| user.it.uu.se/~eh
Dept of Information Technology| www.it.uu.se
44
© Erik Hagersten| user.it.uu.se/~eh
Doubling (or Halving) times
[Kurzweil 2006]
„
Dynamic RAM Memory (bits per dollar)
1.5 years
„
Average Transistor Price
1.6 years
„
Microprocessor Cost per Transistor Cycle
1.1 years
„
Total Bits Shipped
1.1 years
„
Processor Performance in MIPS
1.8 years
„
Transistors in Intel Microprocessors
2.0 years
„
Microprocessor Clock Speed
2.7 years
AVDARK
2010
Dept of Information Technology| www.it.uu.se
45
© Erik Hagersten| user.it.uu.se/~eh