The Memory Gap:
to Tolerate or to Reduce?
Jean-Luc Gaudiot
Professor
University of California, Irvine
April 2nd, 2002
Outline
The problem: the Memory Gap
 Simultaneous Multithreading
 Decoupled Architectures
 Memory Technology
 Processor-In-Memory

The Memory Latency Problem


Technological Trend: Memory latency is getting longer relative to microprocessor speed (40% per year)
Problem: Memory Latency - Conventional Memory Hierarchy Insufficient:
• Many applications have large data sets that are accessed non-contiguously.
• Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994].
Domain: benchmarks with large data sets: symbolic, signal processing and scientific programs
Some Solutions
Solutions and their limitations:

Larger Caches
— Slow
— Works well only if the working set fits in the cache and there is temporal locality

Hardware Prefetching
— Cannot be tailored for each application
— Behavior based on past and present execution-time behavior

Software Prefetching
— Ensure the overheads of prefetching do not outweigh the benefits → conservative prefetching
— Adaptive software prefetching is required to change the prefetch distance during run-time (a sketch follows below)
— Hard to insert prefetches for irregular access patterns

Multithreading
— Solves the throughput problem, not the memory latency problem
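To make the software-prefetching row concrete, here is a minimal sketch of prefetching with a tunable distance, using the GCC/Clang __builtin_prefetch intrinsic; the function and parameter names are illustrative, not from the talk. The distance must be large enough to cover the memory latency, yet conservative enough that prefetched lines are not evicted before use and the extra instructions do not outweigh the benefit.

#include <stddef.h>

/* y[i] += x[i] * h[i], prefetching `dist` elements ahead of the use. */
void scaled_sum(double *y, const double *x, const double *h,
                size_t n, size_t dist)
{
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n) {
            /* rw = 0 (read), locality = 1 (streamed data, little reuse) */
            __builtin_prefetch(&x[i + dist], 0, 1);
            __builtin_prefetch(&h[i + dist], 0, 1);
        }
        y[i] += x[i] * h[i];
    }
}

An adaptive scheme adjusts dist at run time from observed late versus useful prefetches, which is what the Slip Control Queue described later in the talk does.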
Limitation of Present Solutions

Huge cache:
• Slow and works well only if the working set fits cache
and there is some kind of locality

Prefetching
• Hardware prefetching
– Cannot be tailored for each application
– Behavior based on past and present execution-time behavior
• Software prefetching
– Ensure overheads of prefetching do not outweigh the benefits
– Hard to insert prefetches for irregular access patterns

SMT
• Enhances utilization and throughput at the thread level, but does not shorten the latency of an individual memory access
Outline
The problem: the memory gap
 Simultaneous Multithreading
 Decoupled Architectures
 Memory Technology
 Processor-In-Memory

Simultaneous Multi-Threading (SMT)
Horizontal and vertical sharing
 Hardware support of multiple threads
 Functional resources shared by multiple
threads
 Shared caches
 Highest utilization with multi-program or
parallel workload

SMT Compared to SS
[Figure: issue-slot diagrams over seven cycles. The superscalar completes 9 instructions, with many INT/MEM/FP slots lost to stalls; the SMT fills the same slots with 20 instructions drawn from Threads 1-8.]
Superscalar processors execute multiple instructions per cycle
Superscalar functional units idle due to I-fetch stalls, conditional branches, data
dependencies
SMT dispatches instructions from multiple data streams, allowing efficient execution and
latency tolerance
• Vertical sharing (TLP and block multi-threading)
• Horizontal sharing (ILP and simultaneous multiple thread instruction dispatch)
CMP Compared to SS
[Figure: issue-slot diagrams for a wide superscalar (9 instructions) and a two-processor CMP, CMP-p2, whose two narrower superscalar cores run eight threads (13 instructions).]
CMP uses thread-level parallelism to increase throughput
CMP has layout efficiency
• More functional units
• Faster clock rate
CMP hardware partition limits performance
• Smaller level-1 resources cause increased miss rates
• Execution resources not available from across partition
Wide Issue SS Inefficiencies

Architecture and software limitations
• Limited program ILP => idle functional units
• Increased waste of speculative execution

Technology issues
• Area grows as O(d³) {d = issue or dispatch width} (see the toy example after this list)
• Area grows an additional O(t·log₂(t)) {t = number of SMT threads}
• Increased wire delays (increased area, tighter spacings, thinner oxides, thinner metal)
• Increased memory access delays versus processor clock
• Larger pipeline penalties
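As a toy illustration of that O(d³) growth (a rule-of-thumb calculation, not a real area model; the program and its constants are mine), partitioning a fixed total issue width over more cores shrinks the superlinear term:

#include <stdio.h>

int main(void)
{
    const int total_issue = 8;
    for (int cores = 1; cores <= 8; cores *= 2) {
        int d = total_issue / cores;      /* issue width per core */
        int area = cores * d * d * d;     /* relative area, O(d^3) per core */
        printf("%d core(s) x %d-issue: relative area %d\n", cores, d, area);
    }
    return 0;
}

Under this assumption a single 8-issue core costs 512 units against 128 for two 4-issue cores, which is the layout-efficiency argument the CMP and POSM slides build on.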
Problems solved through:
 CMP - localizes processor resources
 SMT - efficient use of FUs, latency tolerance
 Both CMP and SMT - thread level parallelism
POSM Configurations
[Figure: block diagrams of a wide-issue SMT processor, two- and four-processor POSM configurations, and an eight-processor CMP. The multiprocessor configurations give each processor private iL1/dL1 caches and TLBs and connect them through an L2 crossbar to a shared level-2 cache and the external interfaces.]
All architectures above have eight threads.
Which configuration has the highest performance for an average workload?
Run benchmarks on various configurations and find the optimal performance point.
Superscalar, SMT, CMP,
and POSM Processors
[Figure: issue-slot diagrams for the four organizations running eight threads: superscalar (9 instructions), SMT (20 instructions), CMP-p2 (13 instructions), and POSM-p2 (33 instructions).]
CMP and SMT both have higher throughput than superscalar
Combination of CMP/SMT has highest throughput
Experiment results
[Figure: IPC versus number of threads (1-8) with equivalent functional units, for smt.p1.f2.t8.d16, posm.p2.f2.t4.d8, posm.p4.f1.t2.d4, and cmp.p8.f1.t1.d2.]
• SMT.p1 has the highest performance through vertical and horizontal sharing
• cmp.p8 has a linear increase in performance
Equivalent Silicon Area and System Clock Effects
[Figure: normalized IPC (NIPC) versus number of threads (1-8) at equivalent silicon area and clock, for smt.p1.f2.t8.d9, posm.p2.f2.t4.d6, posm.p4.f1.t2.d4, and cmp.p8.f1.t1.d2.]
• SMT.p1 throughput is limited
• SMT.p1 and POSM.p2 have equivalent single-thread performance
• POSM.p4 and CMP.p8 have the highest throughput
Synthesis
• "Comparable silicon resources" are required for processor evaluation
• POSM.p4 has 56% more throughput than wide-issue SMT.p1
• Future wide-issue processors are difficult to implement, increasing the POSM advantage
  – Smaller technology spacings have higher routing delays due to parasitic resistance and capacitance
  – The larger the processor, the larger the O(d²·t·log₂(t)) and O(d³·t) impact on area and delays
• SMT works well with deep pipelines
• The ISA and micro-architecture affect SMT overhead
  – A 4-thread x86 SMT would have 1/8th the SMT overhead
  – Layout and micro-architecture techniques reduce SMT overhead
Outline
The problem: the memory gap
 Simultaneous Multithreading
 Decoupled Architectures
 Memory Technology
 Processor-In-Memory

The HiDISC Approach
Observation:
• Software prefetching impacts compute performance
• PIMs and RAMBUS offer a high-bandwidth memory system
- useful for speculative prefetching
Approach:
• Add a processor to manage prefetching
-> hide overhead
• Compiler explicitly manages the memory hierarchy
• Prefetch distance adapts to the program runtime behavior
Decoupled Architectures
[Figure: four organizations compared. MIPS (conventional): an 8-issue Computation Processor (CP) with registers, a cache, and the 2nd-level cache and main memory. DEAP (decoupled): a 3-issue CP decoupled from a 5-issue Access Processor (AP) in front of the cache. CAPP: a 5-issue CP with a 3-issue Cache Management Processor (CMP) between the cache and the 2nd-level cache and main memory. HiDISC (new decoupled): a 2-issue CP, a 3-issue AP, and a 3-issue CMP, one processor per level of the memory hierarchy.]
DEAP: [Kurian, Hulina, & Coraor '94]
PIPE: [Goodman '85]
Other decoupled processors: ACRI, ZS-1, WA
What is HiDISC?
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality (Just-in-Time Fetch)
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero-overhead prefetches for maximal computation throughput
[Figure: the HiDISC organization. A 2-issue Computation Processor (CP) with registers, a 3-issue Access Processor (AP), and a 3-issue Cache Management Processor (CMP) are coupled through the Load Data, Store Address, Store Data, and Slip Control Queues; the L1 cache sits between the AP and the CMP, with the L2 cache and higher levels behind the CMP.]
Slip Control Queue

The Slip Control Queue (SCQ) adapts dynamically:

if (prefetch_buffer_full())
    Don't change size of SCQ;
else if ((2 * late_prefetches) > useful_prefetches)
    Increase size of SCQ;
else
    Decrease size of SCQ;

• Late prefetches = prefetched data arrived after the load had been issued
• Useful prefetches = prefetched data arrived before the load had been issued
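The same policy as runnable C, with invented state, bounds, and counter names (in HiDISC the counters would come from the prefetch hardware):

typedef struct {
    int scq_size;              /* current slip (prefetch) distance         */
    int late_prefetches;       /* data arrived after the load was issued   */
    int useful_prefetches;     /* data arrived before the load was issued  */
    int prefetch_buffer_full;  /* nonzero when the prefetch buffer is full */
} scq_state;

void adjust_scq(scq_state *s, int min_size, int max_size)
{
    if (s->prefetch_buffer_full) {
        /* buffer saturated: leave the slip distance alone */
    } else if (2 * s->late_prefetches > s->useful_prefetches) {
        if (s->scq_size < max_size)
            s->scq_size++;     /* let the prefetch stream run further ahead of the loads */
    } else {
        if (s->scq_size > min_size)
            s->scq_size--;     /* pull the prefetch stream back */
    }
}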
Decoupling Programs for HiDISC
(Discrete Convolution - Inner Loop)

Inner Loop Convolution:
for (j = 0; j < i; ++j)
    y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor Code:
while (not EOD)
    y = y + (x * h);
    send y to SDQ

Access Processor Code:
for (j = 0; j < i; ++j) {
    load (x[j]);
    load (h[i-j-1]);
    GET_SCQ;
}
send (EOD token)
send address of y[i] to SAQ

Cache Management Code:
for (j = 0; j < i; ++j) {
    prefetch (x[j]);
    prefetch (h[i-j-1]);
    PUT_SCQ;
}

SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
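Read together, the three streams cooperate roughly as follows: the cache management code runs furthest ahead, prefetching x[j] and h[i-j-1] into the cache and depositing one SCQ token per iteration; the access processor consumes those tokens as it issues the actual loads, forwarding the operands to the computation processor through the load data queue; and the computation processor simply multiplies and accumulates whatever arrives, sending each result to the SDQ for the address it queued in the SAQ.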
Benchmarks

Benchmark   Source of Benchmark                Lines of Source Code   Description                                  Data Set Size
LLL1        Livermore Loops [45]               20                     1024-element arrays, 100 iterations          24 KB
LLL2        Livermore Loops                    24                     1024-element arrays, 100 iterations          16 KB
LLL3        Livermore Loops                    18                     1024-element arrays, 100 iterations          16 KB
LLL4        Livermore Loops                    25                     1024-element arrays, 100 iterations          16 KB
LLL5        Livermore Loops                    17                     1024-element arrays, 100 iterations          24 KB
Tomcatv     SPECfp95 [68]                      190                    33x33-element matrices, 5 iterations         <64 KB
MXM         NAS kernels [5]                    113                    Unrolled matrix multiply, 2 iterations       448 KB
CHOLSKY     NAS kernels                        156                    Cholesky matrix decomposition                724 KB
VPENTA      NAS kernels                        199                    Invert three pentadiagonals simultaneously   128 KB
Qsort       Quicksort sorting algorithm [14]   58                     Quicksort                                    128 KB
Simulation Parameters

Parameter                  Value
L1 cache size              4 KB
L2 cache size              16 KB
L1 cache associativity     2
L2 cache associativity     2
L1 cache block size        32 B
L2 cache block size        32 B
Memory latency             Variable (0-200 cycles)
Memory contention time     Variable
Victim cache size          32 entries
Prefetch buffer size       8 entries
Load queue size            128
Store address queue size   128
Store data queue size      128
Total issue width          8
Simulation Results
[Figure: performance of MIPS, DEAP, CAPP, and HiDISC on LLL3, Tomcatv, Vpenta, and Cholsky as the main memory latency varies from 0 to 200 cycles.]
VLSI Layout Overhead (I)
• Goal: cost effectiveness of the HiDISC architecture
• Cache has become a major portion of the chip area
• Methodology: extrapolated a HiDISC VLSI layout based on the MIPS R10000 processor (0.35 μm, 1996)
• The space overhead of HiDISC is extrapolated to be 11.3% more than a comparable MIPS processor
• The benchmarks should be run again using these parameters and new memory architectures
VLSI Layout Overhead (II)

Component                          Original MIPS R10K (0.35 μm)   Extrapolation (0.15 μm)   HiDISC (0.15 μm)
D-Cache (32 KB)                    26 mm²                         6.5 mm²                   6.5 mm²
I-Cache (32 KB)                    28 mm²                         7 mm²                     14 mm²
TLB Part                           10 mm²                         2.5 mm²                   2.5 mm²
External Interface Unit            27 mm²                         6.8 mm²                   6.8 mm²
Instruction Fetch Unit and BTB     18 mm²                         4.5 mm²                   13.5 mm²
Instruction Decode Section         21 mm²                         5.3 mm²                   5.3 mm²
Instruction Queue                  28 mm²                         7 mm²                     0 mm²
Reorder Buffer                     17 mm²                         4.3 mm²                   0 mm²
Integer Functional Unit            20 mm²                         5 mm²                     15 mm²
FP Functional Units                24 mm²                         6 mm²                     6 mm²
Clocking & Overhead                73 mm²                         18.3 mm²                  18.3 mm²
Total Size without L2 Cache        292 mm²                        73.2 mm²                  87.9 mm²
Total Size with on-chip L2 Cache                                  129.2 mm²                 143.9 mm²
The Flexi-DISC

Fundamental characteristics:
• Dynamically reconfigurable central computational kernel (CK)
• Multiple levels of caching and processing around the CK
  – inherently highly dynamic at execution time
  – adjustable prefetching
• Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources
The Flexi-DISC

• Partitioning of the Computation Kernel
  – It can be allocated to different portions of the application or to different applications
• The CK requires separation of the next ring to feed it with data
• The variety of target applications makes the memory accesses unpredictable
• Identical processing units for the outer rings
  – Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved
Multiple HiDISC: McDISC

• Problem: All extant, large-scale multiprocessors perform poorly when faced with a tightly-coupled parallel program.
• Reason: Extant machines have a long latency when communication is needed between nodes. This long latency kills performance when executing tightly-coupled programs. (Note that multi-threading à la Tera does not help when there are dependencies.)
• The McDISC solution: Provide the network interface processor (NIP) with a programmable processor to execute not only OS code (e.g. Stanford FLASH) but also user code generated by the compiler.
• Advantage: The NIP, executing user code, fetches data before it is needed by the node processors, eliminating the network fetch latency most of the time.
• Result: Fast execution (speedup) of tightly-coupled parallel programs.
The McDISC System: Memory-Centered Distributed Instruction Set Computer
[Figure: one McDISC node. The compiler generates computation, access, cache management, and network management instructions for the Computation Processor (CP), Access Processor (AP), Cache Management Processor (CMP), and Network Interface Processor (NIP). The node also contains registers with register links to neighboring CPs, a cache and main memory, a Disc Processor (DP) with a disc cache and a disc farm (RAID, dynamic database), and Adaptive Signal and Adaptive Graphics PIMs (ASP, AGP) connected to sensor inputs (FLIR, SAR, video, ESS) and to displays and the network. Nodes are joined by a 3-D torus of pipelined rings; the applications span understanding, inference, analysis, situation awareness, targeting, and decision processes.]
Summary
• A processor for each level of the memory hierarchy
• Adaptive memory hierarchy management
• Reduces memory latency for systems with high memory bandwidths (PIMs, RAMBUS)
• 2x speedup for scientific benchmarks
• 3x speedup for matrix decomposition/substitution (Cholesky)
• 7x speedup for matrix multiply (MXM); similar results expected for ATR/SLD
Outline
The problem: the memory gap
 Simultaneous Multithreading
 Decoupled Architectures
 Memory Technology
 Processor-In-Memory

Memory Technology

New DRAM technologies
• DDR DRAM, SLDRAM and DRDRAM
• Most DRAM technologies achieve higher
bandwidth

Integrating memory and processor on a
single chip (PIM and IRAM)
• Bandwidth and memory access latency sharply
improve
New Memory Technologies (Cont.)

Rambus DRAM (RDRAM)
• A memory interleaving system integrated onto a single memory chip
• Four outstanding requests with a pipelined microarchitecture
• Operates at much higher frequencies than SDRAM

Direct Rambus DRAM (DRDRAM)
• Direct control of all row and column resources
concurrently with data transfer operations
• Current DRDRAM can achieve 1.6 Gbytes/sec
bandwidth transferring on both clock edges
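(As a rough check, assuming the usual Direct Rambus channel parameters of a 2-byte-wide data path clocked at 400 MHz: 2 bytes × 400 MHz × 2 edges ≈ 1.6 Gbytes/sec, matching the figure above.)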
Intelligent RAM (IRAM)
• Merges processor and memory technology
• All memory accesses remain within a single chip
  – Bandwidth can be as high as 100 to 200 Gbytes/sec
  – Access latency is less than 20 ns
• Good solution for data-intensive streaming applications
Vector IRAM
• Cost-effective system
  – Incorporates vector processing units and the memory system on a single chip
• Beneficial for multimedia applications with critical DSP features
• Good energy efficiency
• Attractive for future mobile computing processors
Outline
The problem: the memory gap
 Simultaneous Multithreading
 Decoupled Architectures
 Memory Technology
 Processor-In-Memory

Overview of the System

Proposed DCS (Data-intensive Computing
System) Architecture
DCS System (Cont’d)

Programming
• Different from the conventional programming model
• Applications are divided into two separate sections
– Software : Executed by the host processor
– Hardware : Executed by the CMP
• The programmer must use CMP instructions

CMP
• Several CMPs can be connected to the system bus
• Variable CMP size and configuration depending on the amount and complexity of the job it has to handle
• Variable size, function, and location of the logic inside the CMP to better handle the application

Memory, Coprocessors, I/O
CMP Architecture

CMP (Computational Memory Processor)
Architecture
• The Heart of our work
• Responsible for executing the core operations of data-intensive applications
• Attached to the system bus
• CMP instructions are encapsulated in the normal
memory operations.
• Consists of many ACME (Application-specific
Computational Memory Element) cells interconnected
amongst themselves through dedicated communication
links

CMC (Computing Memory Cluster)
• A small number of ACME cells are put together to form
a CMC
CMP Architecture
CMC Architecture
ACME Architecture

ACME (Application-specific Computational
Memory Elements) Architecture
• ACME memory, configuration cache, CE (Computing Element), FSM
• The CE is the reconfigurable computing unit and consists of many CCs (Computing Cells)
• The FSM governs the overall execution of the ACME
Inside the Computing Elements
Synchronization and Interface

Three different kinds of communications
• Host processor with CMP (eventually with each ACME)
– Done by synchronization variables (specific memory locations) located inside the memory of each ACME cell (see the host-side sketch after this list)
– Example: start and end signals for operations, CMP instructions for each ACME
• ACME to ACME
– Two different approaches
• Host mediated
– Simple
– Not practical for frequent communications
• Distributed mediated approach
– Expensive and complex
– Efficient
• CMP to CMP
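A hypothetical host-side sketch of the synchronization-variable scheme mentioned above (the structure layout, field names, and polling protocol are invented for illustration):

#include <stdint.h>

typedef struct {
    volatile uint32_t start;    /* host sets to 1 to launch the ACME job  */
    volatile uint32_t done;     /* ACME sets to 1 when the job completes  */
    volatile uint32_t command;  /* encoded CMP instruction for this ACME  */
} acme_sync;

/* Host side: write the command into the ACME's memory, raise the start
 * flag with an ordinary store, then poll the done flag. */
void host_run_acme(acme_sync *sync, uint32_t cmd)
{
    sync->command = cmd;
    sync->done    = 0;
    sync->start   = 1;
    while (!sync->done)
        ;                       /* spin on the synchronization variable */
}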
Benefits of the Paradigm

• All the benefits of being a PIM
  – Increased bandwidth and reduced latency
  – Faster computation (parallel execution among many ACMEs)
• Effective usage of the full memory bandwidth
• Efficient co-existence of software and hardware
• More parallel execution inside the ACMEs by efficiently configuring the structure with the application in mind
• Scalability
Implementation of the CMP

Projected how our CMP will be implemented…
• According to the 2000 edition of the ITRS (International Technology Roadmap for Semiconductors), in year 2008:
  – A high-end MPU with 1.381 billion transistors will be in production in 0.06 μm technology on a 427 mm² die
  – If half of the die size is allocated to memory, 8.13 Gbits of storage will be available, with 690 million transistors for logic
  – There can be 2048 ACME cells, each with 512 Kbytes of memory and 315K transistors for logic, control, and everything else inside the ACME, with the rest of the resources (36M transistors) for interconnections
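A back-of-the-envelope check of that budget (the figures on the slide are rounded, so this only shows rough consistency; the 50/50 memory/logic split is the slide's assumption):

#include <stdio.h>

int main(void)
{
    double memory_bits = 8.13e9;   /* storage on the memory half of the die */
    double logic_trans = 690e6;    /* logic transistors on the other half   */
    int    acme_cells  = 2048;

    double kb_per_cell = memory_bits / 8.0 / 1000.0 / acme_cells;
    double logic_used  = acme_cells * 315e3;

    printf("memory per ACME cell: %.0f KB (slide: 512 Kbytes)\n", kb_per_cell);
    printf("logic used by ACME cells: %.0f M of %.0f M transistors\n",
           logic_used / 1e6, logic_trans / 1e6);
    return 0;
}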
Motion Estimation of MPEG
• Finding the motion vector for each macro block in the frame
• It absorbs about 70% of the total execution time of MPEG
• Huge number of simple additions, subtractions and comparisons

Example ME execution

One ACME structure to find a motion vector
for a macro block
• Executes in pipelined fashion reusing the data
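For reference, a minimal C sketch of the full-search computation one such motion-vector search performs for an 8×8 macro block over an 8-pixel displacement window (the frame layout, bounds handling, and names are assumptions, not the ACME mapping):

#include <limits.h>
#include <stdlib.h>

/* cur/ref are full frames of the given width; (bx, by) is the top-left
 * corner of the macro block. Border clipping is omitted for brevity. */
void find_motion_vector(const unsigned char *cur, const unsigned char *ref,
                        int width, int bx, int by, int *mvx, int *mvy)
{
    int best = INT_MAX;
    for (int dy = -8; dy <= 8; dy++) {
        for (int dx = -8; dx <= 8; dx++) {
            int sad = 0;   /* sum of absolute differences */
            for (int y = 0; y < 8; y++)
                for (int x = 0; x < 8; x++)
                    sad += abs(cur[(by + y) * width + (bx + x)] -
                               ref[(by + y + dy) * width + (bx + x + dx)]);
            if (sad < best) {
                best = sad;
                *mvx = dx;
                *mvy = dy;
            }
        }
    }
}

The inner work is just the additions, subtractions, and comparisons the previous slide lists.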
Example ME execution

Performance
• For an 8×8 macro block with an 8-pixel displacement
• 276 clock cycles to find the motion vector for one macro block

Performance comparison with other
architectures