Design and Analysis of a Robust Pipelined Memory System
Hao Wang†, Haiquan (Chuck) Zhao*, Bill Lin†, and Jun (Jim) Xu*
†University of California, San Diego
*Georgia Institute of Technology
Infocom 2010, San Diego
Memory Wall
• Modern Internet routers need to manage large
amounts of packet- and flow-level data at line rates
• e.g., need to maintain per-flow records during a
monitoring period, but
– Core routers carry millions of flows, translating to hundreds of megabytes of storage
– On a 40 Gb/s OC-768 link, a new packet can arrive every 8 ns
Memory Wall
• SRAM/DRAM dilemma
• SRAM: access latency typically 5–15 ns
(fast enough for an 8 ns line rate)
• But SRAM capacity is substantially inadequate
in many cases: 4 MB is a typical maximum
(much less than the hundreds of MBs needed)
Memory Wall
• DRAM provides inexpensive bulk storage
• But random access latency is typically 50–100 ns
(much slower than the 8 ns needed for a 40 Gb/s line rate)
• Conventional wisdom is that DRAMs
are not fast enough to keep up with
ever-increasing line rates
Memory Design Wish List
• Line rate memory bandwidth (like SRAM)
• Inexpensive bulk storage (like DRAM)
• Predictable performance
• Robustness to adversarial access patterns
Main Observation
• Modern DRAMs can be fast and cheap!
– Driven by graphics, video games, and HDTV
– At commodity pricing: just $0.01/MB currently, or $20 for 2 GB!
Example: Rambus XDR Memory
• 16 internal banks
Memory Interleaving
• Performance achieved through memory interleaving
– e.g. suppose we have B = 6 DRAM banks and access
pattern is sequential
[Figure: six DRAM bank queues under a sequential access pattern; requests spread evenly, with bank 1 receiving 1, 7, 13, …; bank 2 receiving 2, 8, 14, …; and bank 6 receiving 6, 12, 18, …]
– Effective memory bandwidth is B times faster
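The bank mapping behind the diagram can be sketched in a few lines (illustrative Python, not the slides' hardware): address a maps to bank (a − 1) mod B, so a sequential pattern spreads load evenly while a stride-B pattern concentrates on one bank.

```python
B = 6  # number of DRAM banks, matching the slide's example

def bank(addr):
    """Bank index for a 1-based address under simple interleaving."""
    return (addr - 1) % B

# Sequential addresses 1..18 cycle through all six banks.
assert [bank(a) for a in range(1, 19)] == [0, 1, 2, 3, 4, 5] * 3

# The stride-6 pattern 1, 7, 13, ... always lands on bank 0.
assert all(bank(a) == 0 for a in range(1, 100, 6))
```

The second assertion previews the failure mode of the next slide: any access pattern whose stride is a multiple of B serializes on a single bank.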
Memory Interleaving
• But, suppose access pattern is as follows:
[Figure: the same six bank queues under a pathological access pattern; requests 1, 7, 13, 19, 25, … all land in bank 1's queue while the other banks sit idle]
• Memory bandwidth degrades to the worst-case DRAM latency
Memory Interleaving
• One solution is to apply pseudo-randomization of
memory locations
[Figure: the same bank queues after a pseudo-random permutation of memory addresses; the strided pattern is now spread across the banks]
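A minimal sketch of such a randomization, assuming a keyed hash as the pseudo-random permutation (the slides do not specify one): any fixed stride now spreads across banks, but identical addresses still map to the same bank, which is exactly the weakness the next slide exploits.

```python
import hashlib

B = 6  # number of DRAM banks

def randomized_bank(addr, key=b"per-boot secret"):
    """Pseudo-random bank assignment via a keyed hash (illustrative)."""
    digest = hashlib.sha256(key + addr.to_bytes(8, "little")).digest()
    return digest[0] % B

# The stride-6 pattern that defeated plain interleaving now spreads out.
hit = {randomized_bank(a) for a in range(1, 1000, 6)}
assert hit == set(range(B))

# But repeated accesses to a single address always hit the same bank.
assert len({randomized_bank(42) for _ in range(100)}) == 1
```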
Adversarial Access Patterns
• However, memory bandwidth can still degrade to the
worst-case DRAM latency even with randomization:
1. Lookups to the same global variable trigger accesses to
the same memory bank.
2. An attacker can flood packets with the same TCP/IP header,
triggering updates to the same memory location and
memory bank, regardless of the randomization function.
Outline
• Problem and Background
→Proposed Design
• Theoretical Analysis
• Evaluation
Pipelined Memory Abstraction
Emulates SRAM with Fixed Delay
[Figure: two timing diagrams over cycles 0–5 of the same stream of reads and writes (op/addr/data) to addresses a, b, c. With an ideal SRAM, each read's data out appears in the issuing cycle; under SRAM emulation, the identical stream produces the same data out exactly D cycles later, at cycles D, D+1, …, D+5]
Implications of Emulation
• Fixed pipeline delay: if a read operation is issued at
time t to an emulated SRAM, the data is available
from the memory controller at exactly time t + D (rather
than in the same cycle).
• Coherency: read operations return the same
results as an ideal SRAM system would.
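The fixed-delay contract can be captured in a toy model (illustrative Python with an arbitrary D = 3; the real controller must also keep reads coherent with operations still in flight, which is what the reservation table below provides):

```python
from collections import deque

D = 3  # pipeline delay in cycles (illustrative value)

class EmulatedSRAM:
    """Toy model: every read's result emerges exactly D cycles after issue."""
    def __init__(self):
        self.mem = {}                  # backing store (the "DRAM")
        self.pipe = deque([None] * D)  # results in flight

    def cycle(self, op=None, addr=None, data=None):
        if op == 'W':
            self.mem[addr] = data
            self.pipe.append(None)     # writes produce no output
        elif op == 'R':
            self.pipe.append(self.mem.get(addr))
        else:
            self.pipe.append(None)     # idle cycle
        return self.pipe.popleft()     # output of the op issued D cycles ago

m = EmulatedSRAM()
outs = [m.cycle('W', 0, 7), m.cycle('R', 0), m.cycle(), m.cycle(), m.cycle()]
assert outs == [None, None, None, None, 7]  # read at cycle 1 answered at 1 + D
```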
Proposed Solution: Basic Idea
• Keep an SRAM reservation table of the memory operations
and data that occurred in the last C cycles
• Avoid issuing a new DRAM operation for memory
references to the same location within C cycles
Details of Memory Architecture
[Figure: input operations pass through a random address permutation and are dispatched to B per-bank request buffers in front of the DRAM banks. A reservation table of the last C operations (op, addr, data entries linked by R-link pointers), together with an MRI table and an MRW table (both CAMs) tracking the most recent read and write to each address, produces the data out]
Merging of Operations
• Requests arrive from right to left (the operation on the right is the earlier one):
1. READ + WRITE → WRITE (the read copies data from the write)
2. WRITE + WRITE → WRITE (2nd write overwrites the 1st write)
3. READ + READ → READ (2nd read copies data from the 1st read)
4. WRITE + READ → WRITE + READ (no merging possible)
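The four merging rules above can be sketched as a small function (a hypothetical helper, not the paper's implementation). Operations are modeled as `('R',)` or `('W', value)`, and the function returns the operations that still need to reach DRAM:

```python
def merge(older, newer):
    """Apply the merging rules to two same-address operations."""
    if older[0] == 'W' and newer[0] == 'R':
        return [older]          # read is served from the pending write's data
    if older[0] == 'W' and newer[0] == 'W':
        return [newer]          # 2nd write overwrites the 1st write
    if older[0] == 'R' and newer[0] == 'R':
        return [older]          # 2nd read copies data from the 1st read
    return [older, newer]       # READ then WRITE: both must reach DRAM

assert merge(('W', 1), ('R',)) == [('W', 1)]
assert merge(('W', 1), ('W', 2)) == [('W', 2)]
assert merge(('R',), ('R',)) == [('R',)]
assert merge(('R',), ('W', 2)) == [('R',), ('W', 2)]
```

The effect is that, per address, at most one read and one write can survive merging within any C-cycle window, which is the robustness property proved next.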
Proposed Solution
• Rigorously prove that, with merging, the worst-case delay
for a memory operation is bounded by some fixed D
w.h.p.
• Provide a pipelined memory abstraction in which
operations issued at time t complete exactly D cycles
later, at time t + D (rather than in the same cycle).
• A reservation table with C > D is used both to implement
the pipeline delay and to serve as a "cache".
Outline
• Problem and Background
• Proposed Design
→Theoretical Analysis
• Evaluation
Robustness
• At most one write operation to a particular memory
address enters a request buffer every C cycles.
• At most one read operation to a particular memory
address enters a request buffer every C cycles.
• At most one read operation followed by one write
operation to a particular address enters a request buffer
every C cycles.
Theoretical Analysis
• Worst case analysis
• Convex ordering
• Large deviation theory
• Prove: with a cache of size C, the best an attacker can do
is to send repetitive requests every C+1 cycles.
Bound on Overflow Probability
• Want to bound the probability that a request buffer
overflows in n cycles:
Pr[overflow] ≤ Σ_{0 ≤ s < t ≤ n} Pr[D_{s,t}],  where D_{s,t} = { ω : X_{s,t}(ω) ≥ μτ + K }
• X_{s,t} is the number of updates to a bank during cycles
[s, t], τ = t − s, and K is the length of a request queue.
• For the total overflow probability bound, multiply by B.
Chernoff Inequality
Pr[D_{s,t}] ≤ Pr[X ≥ K + μτ]
            = Pr[e^{θX} ≥ e^{θ(K + μτ)}]
            ≤ E[e^{θX}] / e^{θ(K + μτ)}
• Since this is true for all θ > 0,
Pr[D_{s,t}] ≤ min_{θ > 0} E[e^{θX}] / e^{θ(K + μτ)}
• We want to find the update sequence that maximizes E[e^{θX}]
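The minimization over θ is easy to carry out numerically. A sketch under assumed toy parameters (binomial arrivals with made-up τ, μ, K, and p; the paper instead maximizes E[e^{θX}] over adversarial update sequences):

```python
import math

# Assumed toy parameters, not from the paper's evaluation:
tau, mu, K = 1000, 1 / 10, 50   # window length, drain rate, queue length
p = 0.12                        # per-cycle chance an update hits this bank

def chernoff_bound(threshold):
    """min over a theta grid of E[e^(theta X)] / e^(theta * threshold),
    computed in log space for X ~ Binomial(tau, p)."""
    best_log = 0.0
    for i in range(1, 2000):
        theta = i / 100.0
        log_mgf = tau * math.log(1 - p + p * math.exp(theta))
        best_log = min(best_log, log_mgf - theta * threshold)
    return math.exp(best_log)

bound = chernoff_bound(K + mu * tau)   # Pr[X >= K + mu*tau] <= bound
assert 0 < bound < 0.05
```

Working in log space avoids overflow of the moment generating function for large θ.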
Worst Case Request Patterns
• Let τ = t − s and T = ⌈(τ + 1)/C⌉; take q1 as large as possible, then
  q2 = ⌊(τ − 2T·q1) / (2T − 1)⌋,  r = τ − 2T·q1 − (2T − 1)·q2
• The worst-case pattern issues q1 + q2 + 1 requests for distinct counters a_1, …, a_{q1+q2+1}:
  – q1 requests repeat 2T times each
  – q2 requests repeat 2T − 1 times each
  – 1 request repeats r times
Outline
• Problem and Background
• Proposed Design
• Theoretical Analysis
→Evaluation
Evaluation
• Overflow probability for 16 million addresses,
µ=1/10, and B=32.
[Figure: overflow probability bound (from 10⁰ down to about 10⁻¹⁴) vs. request queue length K from 80 to 180, for C = 6000, 7000, 8000, 9000; larger C lowers the bound. Annotation: SRAM 156 KB, CAM 24 KB]
Evaluation
• Overflow probability for 16 million addresses,
µ=1/10, and C=8000.
[Figure: overflow probability bound (from 10⁰ down to about 10⁻³⁰) vs. request buffer size K from 80 to 180, for B = 32, 34, 36, 38; more banks lower the bound]
Conclusion
• Proposed a robust memory architecture that provides the
throughput of SRAM with the density of DRAM.
• Unlike conventional caching, which has unpredictable
hit/miss performance, our design guarantees w.h.p. a
pipelined memory abstraction that can accept a new
memory operation every cycle with a fixed pipeline delay.
• Used convex ordering and large deviation theory to rigorously
prove robustness under adversarial accesses.
Thank You