CS/ECE 757: Advanced Computer Architecture II
(Parallel Computer Architecture)
Scalable Multiprocessors
Case Studies (Chapter 7)
Copyright 2001 Mark D. Hill
University of Wisconsin-Madison
Slides are derived from work by
Sarita Adve (Illinois), Babak Falsafi (CMU),
Alvy Lebeck (Duke), Steve Reinhardt (Michigan),
and J. P. Singh (Princeton). Thanks!
Massively Parallel Processor (MPP) Architectures
[Figure: MPP node — processor and cache sit on the memory bus with main memory and the network interface; an I/O bridge connects the memory bus to an I/O bus holding disk controllers and disks]
• Network interface typically close to processor
– Memory bus:
» locked to specific processor architecture/bus protocol
– Registers/cache:
» only in research machines
• Time-to-market is long
– processor already available, or work closely with processor designers
• Maximize performance, even at high cost
Network of Workstations
[Figure: workstation node — processor and cache connect through the core chip set; main memory, disk controllers/disks, a graphics controller, and the network interface all hang off the I/O bus; the network interface raises interrupts to the processor]
• Network interface on I/O bus
• Standards (e.g., PCI) => longer life, faster to market
• Slow (microseconds) to access network interface
• “System Area Network” (SAN): between LAN & MPP
Transaction Interpretation
• Simplest MP: assist doesn’t interpret much, if anything
– DMA from/to buffer, interrupt or set flag on completion
• User-level messaging: get the OS out of the way
– assist does protection checks to allow direct user access to network
– may have minimal interpretation otherwise
• Virtual DMA: get the CPU out of the way (maybe)
– basic protection plus address translation: user-level bulk DMA
• Global physical address space (NUMA): everything in hardware
– complexity increases, but performance does too (if done right)
• Cache coherence: even more so
– stay tuned
Spectrum of Designs
Ordered by increasing HW support, specialization, intrusiveness, and performance (???):
• None: physical bit stream
– physical DMA: nCUBE, iPSC, . . .
• User/System
– user-level port: CM-5, *T
– user-level handler: J-Machine, Monsoon, . . .
• Remote virtual address
– processing, translation: Paragon, Meiko CS-2, Myrinet
– reflective memory (?): Memory Channel, SHRIMP
• Global physical address
– proc + memory controller: RP3, BBN, T3D, T3E
• Cache-to-cache (later)
– cache controller: Dash, KSR, Flash
Net Transactions: Physical DMA
[Figure: physical-DMA network transaction — the sending processor P issues a Cmd; a DMA channel (Addr, Length, Rdy) streams Data from memory to the destination, where DMA deposits it into memory and raises Status/interrupt]
• Physical addresses: OS must initiate transfers
– system call per message on both ends: ouch
• Sending OS copies data to kernel buffer w/ header/trailer
– can avoid copy if interface does scatter/gather
• Receiver copies packet into OS buffer, then interrupts
– user message then copied (or mapped) into user space
Conventional LAN Network Interface
[Figure: conventional LAN NIC — TX and RX descriptor rings in host memory, each entry holding Addr, Len, Status, and a Next link; the NIC controller sits on the I/O bus with its transceiver (trncv) and DMA engine, moving packets between the rings and the wire while the processor fills and drains the rings over the memory bus]
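The Addr/Len/Status/Next fields in the figure are the classic descriptor-ring layout; a minimal C sketch of one entry, with field widths and status bits as illustrative assumptions rather than any particular NIC’s layout:

```c
#include <stdint.h>

/* One TX/RX ring entry as in the figure: a packet buffer (Addr/Len),
 * Status bits for ownership/completion, and a Next link. Widths and
 * flag values are assumptions for illustration. */
struct nic_desc {
    uint64_t         addr;    /* physical address of packet buffer */
    uint32_t         len;     /* buffer length in bytes */
    uint32_t         status;  /* e.g. OWNED_BY_NIC, DONE, ERROR bits */
    struct nic_desc *next;    /* next descriptor in the ring */
};
```

The host advances through the ring filling buffers while the NIC DMAs each buffer and flips its status bits, so producer and consumer coordinate through ownership bits rather than a shared lock.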
User Level Ports
[Figure: user-level ports — the sending P stores Dest and Data straight into the NI’s user-mapped output port; the receiving P reads Status or takes an interrupt; the NI enforces the user/system split]
• map network hardware into user’s address space
– talk directly to network via loads & stores
• user-to-user communication without OS intervention: low latency
• protection: user/user & user/system
• DMA hard… CPU involvement (copying) becomes bottleneck
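A minimal sketch of such a send in C, assuming the port registers are mapped at an invented base address with an invented layout:

```c
#include <stdint.h>

/* Assumed layout of an mmap()ed NI page; the base address, register
 * offsets, and the "FIFO full" bit are invented for illustration. */
#define NI_BASE  0x40000000UL
#define NI_DEST  (*(volatile uint32_t *)(NI_BASE + 0x0))
#define NI_DATA  (*(volatile uint32_t *)(NI_BASE + 0x4))
#define NI_STAT  (*(volatile uint32_t *)(NI_BASE + 0x8))
#define NI_FULL  0x1u

/* Send one word user-to-user: just stores, no system call. */
static void send_word(uint32_t dest, uint32_t word)
{
    while (NI_STAT & NI_FULL)   /* spin while the output FIFO is full */
        ;
    NI_DEST = dest;             /* store the destination node ... */
    NI_DATA = word;             /* ... then the payload */
}
```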
Case Study: Thinking Machines CM-5
• Follow-on to CM-2
– Abandons SIMD, bit-serial processing
– Uses off-the-shelf processors/parts
– Focus on floating point
– 32 to 1024 processors
– Designed to scale to 16K processors
• Designed to be independent of specific processor node
• “Current” processor node
– 40 MHz SPARC
– 32 MB memory per node
– 4 FP vector chips per node
(C) 2003, J. E. Smith
CM-5, contd.
• Vector Unit
– Four FP processors
– Direct connect to main memory
– Each has a 1-word data path
– FP unit can do scalar or vector
– 128 MFLOPS peak; 50 MFLOPS Linpack
Interconnection Networks
• Two networks: Data and Control
• Network Interface
– Memory-mapped functions
– Store data => data
– Part of address => control
– Some addresses map to privileged functions
Interconnection Networks
• Input and output FIFO for each network
• Two data networks
• Save/restore network buffers on context switch
[Figure: CM-5 organization — processing partitions, control processors, and an I/O partition connect to the diagnostics, control, and data networks; within a node, a SPARC with FPU and cache ($) plus ctrl/SRAM share the MBUS with the NI and with DRAM banks, each behind its own DRAM controller and vector unit]
Message Management
• Typical function implementation
– Each function has two FIFOs (in and out)
– Two outgoing control registers:
» send and send_first
» send_first initiates a message
» send sends any additional data
– read send_ok to check for a successful send; else retry
– read send_space to check space prior to send
– incoming register receive_ok can be polled
– read receive_length_left for message length
– read receive for input data in FIFO order
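A minimal C sketch of this send protocol, assuming the registers above are memory-mapped at invented addresses (on the real machine, operation and length are encoded in the send_first address):

```c
#include <stdint.h>

/* Invented addresses for the memory-mapped CM-5-style registers. */
#define SEND_FIRST (*(volatile uint64_t *)0x50000000UL)
#define SEND       (*(volatile uint64_t *)0x50000008UL)
#define SEND_OK    (*(volatile uint64_t *)0x50000010UL)

/* send_first initiates the message, send pushes any additional data,
 * and send_ok reports whether the network accepted it; on failure
 * the whole message is retried. */
static void cm5_send(uint64_t first_word, const uint64_t *rest, int n)
{
    do {
        SEND_FIRST = first_word;
        for (int i = 0; i < n; i++)
            SEND = rest[i];
    } while (!SEND_OK);
}
```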
User-Level Network Ports
[Figure: net output and input ports mapped into the user’s virtual address space alongside the processor’s status, registers, and program counter]
• Appears to user as logical message queues plus status
• What happens if no user pop?
Control Network
• In general, every processor participates
– A mode bit allows a processor to abstain
• Broadcasting
– 3 functional units: broadcast, supervisor broadcast, interrupt
» only one broadcast at a time
» broadcast message is 1 to 15 32-bit words
• Combining
– Operation types:
» reduction
» parallel prefix
» parallel suffix
» router-done (big OR)
– Combine message is a 32- to 128-bit integer
– Operator types: OR, XOR, max, signed add, unsigned add
– Operation and operator are encoded in the send_first address
Control Network, contd
• Global Operations
– Big OR of 1 bit from each processor
– three independent units: one synchronous, two asynchronous
– Synchronous useful for barrier synch
– Asynchronous useful for passing error conditions, independent of other control-unit functions
• Individual instructions are not synchronized
– each processor fetches its own instructions; barrier synch via the control network is used between instruction blocks
– => support for a loose form of data parallelism
Data Network
• Architecture
– Fat tree
– Packet switched
– Bandwidth to neighbor: 20 MB/sec
– Latency in a large network: 3 to 7 microseconds
– Can support message passing or global address space
• Network interface
– One data network functional unit
– send_first gets destination address + tag
– 1 to 5 32-bit chunks of data
– tag can cause interrupt at receiver
• Addresses may be physical or relative
– physical addresses are privileged
– relative addresses are bounds-checked and translated
Data Network, contd.
• Typical message interchange:
– alternate between pushing onto the send-FIFO and receiving on the receive-FIFO
– once all messages are sent, assert this with the combine function on the control network
– alternate between receiving more messages and testing the control network
– when router_done goes active, the message-passing phase is complete
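The protocol structure in C, as a hedged sketch; every helper below is a hypothetical stand-in for a FIFO or control-network access, not a real interface:

```c
extern int  more_to_send(void);         /* local messages still queued? */
extern void try_send(void);             /* push onto send-FIFO if space */
extern void try_receive(void);          /* drain receive-FIFO if data */
extern void combine_assert_done(void);  /* join the router-done big OR */
extern int  router_done(void);          /* control network: all done? */

void message_phase(void)
{
    while (more_to_send()) {
        try_send();        /* alternate sending ... */
        try_receive();     /* ... and receiving, so FIFOs never back up */
    }
    combine_assert_done(); /* assert "I'm done" on the control network */
    while (!router_done())
        try_receive();     /* keep draining until the phase completes */
}
```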
User Level Handlers
[Figure: user-level handlers — the sending P stores Address, Dest, and Data into the NI; the message vectors execution on the receiving P to the handler address it carries; user/system split enforced at the NI]
• Hardware support to vector to address specified in message
– message ports in registers
– alternate register set for handler?
• Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)
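The dispatch step itself is small enough to sketch in C; treating the first message word as the handler address captures the general idea, while the types and calling convention here are invented:

```c
#include <stdint.h>

typedef void (*handler_t)(uint32_t *args, int nargs);

/* Vector to the address specified in the message: word 0 names the
 * handler, the remaining words are its arguments. */
void dispatch(uint32_t *msg, int len)
{
    handler_t h = (handler_t)(uintptr_t)msg[0];
    h(&msg[1], len - 1);
}
```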
J-Machine
• Each node a small message-driven processor
• HW support to queue messages and dispatch to message-handler task
J-Machine Message-Driven Processor
• MIT research project
• Targets fine-grain message passing
– very low message overheads allow:
» small messages
» small tasks
• J-Machine architecture
– 3D mesh direct interconnect
– Global address space
– up to 1 Mword of DRAM per node
– MDP single chip supports processing, memory control, and message passing
– not an off-the-shelf chip
Features/Example: Combining Tree
• Each node collects data from lower levels, accumulates the sum, and passes the sum up the tree when all inputs are done (see the sketch after this list)
• Communication
– SEND instructions send values up the tree in small messages
– On arrival, a task is created to perform the COMBINE routine
• Synchronization
– message arrival dispatches a task
– example: combine message invokes COMBINE routine
– presence bits (full/empty) on state
– value set empty; reads are blocked until it becomes full
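A hedged C sketch of one combining node under these rules; the struct layout and send_up() are illustrative, not the MDP’s actual interface:

```c
/* Per-node combining state, held in that node's memory segment. */
struct combine_node {
    int  expected;  /* number of children feeding this node */
    int  arrived;   /* children heard from so far */
    long sum;       /* running accumulation */
    int  parent;    /* node address one level up the tree */
};

extern void send_up(int parent, long value);  /* assumed SEND wrapper */

/* Task created on each combine-message arrival. */
void combine_handler(struct combine_node *n, long value)
{
    n->sum += value;
    if (++n->arrived == n->expected)
        send_up(n->parent, n->sum);  /* all inputs done: pass sum up */
}
```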
Features/Example, contd.
• Naming
– segmented memory and translation instructions
– protection via segment descriptors
– allocate a segment of memory for each combining node
Instruction Set
• Separate register sets for three execution levels:
– background: no messages
– priority 0: lower-priority messages
– priority 1: higher-priority messages
• Each register set:
– 4 GPRs
– 4 address regs.: contain segment descriptors (base + field length)
– 4 ID registers: P0 and P1 only
– instruction pointer: includes status info
• Addressing
– an address is an offset + address register
Instruction Set, contd
• Tags
– 4-bit tag per 32-bit data value
– data types, e.g. integer, boolean, symbol
– system types, e.g. IP, addr, msg
– 4 user-defined types
– full/empty types: Fut and Cfut
– interrupt on access to a Cfut value
• MDP can do operand type checking and interrupt on mismatch
– => program does not have to do software checking
• Instructions
– 17-bit format (two per word)
– three operands
– words not tagged as instructions are loaded into R0 automatically
Instruction Set, contd.
• Naming
– ENTER instruction places a translation into the TLB
– XLATE instruction does a TLB translation
• Communication
– SEND instructions
– Example:
» SEND   R0,0        ; send net address, priority 0
» SEND2  R1,R2,0     ; send two data words
» SEND2E R3,[3,A3],0 ; send two more words and end message
Instruction Set, contd.
• Task scheduling
– Message at head of queue causes MDP to create a task
– opcode loaded into IP
– length field and message head loaded into A3
– takes three cycles
– low-latency messages executed directly
– others specify a handler that locates the required method and transfers control to it
Network Architecture
• 3D mesh; up to 64K nodes
• No torus => faces free for I/O
• Bidirectional channels
– channels can be turned around on alternate cycles
– 9 bits data + 6 bits control
– => 9-bit phit
– 2 phits per flit (granularity of flow control)
– each channel: 288 Mbps
– bisection bandwidth (1024 nodes): 18.4 Gbps
• Synchronous routers
– clocked at 2× processor clock
– => 9-bit phit per 62.5 ns
– messages route at 1 cycle per hop
Dedicated Message Processing Without Specialized Hardware
[Figure: two nodes on the network — on each, the user processor P and a message processor MP share Mem and the NI; the MP runs at system level]
• Msg processor performs arbitrary output processing (at system level)
• Msg processor interprets incoming network transactions (in system)
• User processor <–> msg processor share memory
• Msg processor <–> msg processor via system network transaction
Levels of Network Transaction
[Figure: the user processor P deposits a transaction into a queue shared with the message processor MP, which drives the NI to the destination node]
• User processor stores cmd / msg / data into shared output queue
– must still check for output queue full (or grow dynamically)
• Communication assists make the transaction happen
– checking, translation, scheduling, transport, interpretation
• Avoids system call overhead
• Multiple bus crossings are the likely bottleneck
Example: Intel Paragon
[Figure: Intel Paragon — a mesh with a service network, I/O nodes, and devices; 175 MB/s duplex links with 2048 B NI queues; each node pairs a compute i860xp (50 MHz, 16 KB 4-way cache, 32 B blocks, MESI) with a second i860xp acting as message processor (MP handler, EOP/rte, variable data, send/receive DMA: sDMA/rDMA) on a 64-bit, 400 MB/s memory bus]
Dedicated MP w/specialized NI:
Meiko CS-2
• Integrate message processor into network interface
– active-messages-like capability
– dedicated threads for DMA, reply handling, simple remote memory access
– supports user-level virtual DMA
» own page table
» can take a page fault, signal OS, restart
• meanwhile, NACK the other node
• Problem: processor is slow, time-slices threads
– fundamental issue with building your own CPU
Myricom Myrinet (Berkeley NOW)
• Programmable network interface on I/O bus (Sun SBUS or PCI)
– embedded custom CPU (“Lanai”, ~40 MHz RISC CPU)
– 256 KB SRAM
– 3 DMA engines: to network, from network, to/from host memory
• Downloadable firmware executes in kernel mode
– includes source-based routing protocol
• SRAM pages can be mapped into user space
– separate pages for separate processes
– firmware can define status words, queues, etc.
» data for short messages or pointers for long ones
» firmware can do address translation too… w/OS help
– poll to check for sends from user
• Bottom line: I/O bus still the bottleneck, CPU could be faster
Shared Physical Address Space
• Implement SAS model in hardware w/o caching
– actual caching must be done by copying from remote memory to local
– programming paradigm looks more like message passing than Pthreads
» yet, low-latency & low-overhead transfers thanks to HW interpretation; high bandwidth too if done right
» result: great platform for MPI & compiled data-parallel codes
• Implementation (sketched in code below):
– “pseudo-memory” acts as memory controller for remote memory, converts accesses to network transactions (requests)
– “pseudo-CPU” on remote node receives requests, performs them on local memory, sends replies
– split-transaction or retry-capable bus required (or dual-ported memory)
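A hedged C sketch of the pseudo-memory / pseudo-CPU pair handling a remote read; net_send() and net_recv_reply() are hypothetical transport primitives, not a real API:

```c
#include <stdint.h>

enum { READ_REQ };

struct req { int type; int src_node; uint64_t addr; };
struct rep { uint64_t data; };

extern int  my_node;
extern void net_send(int node, const void *buf, int len);  /* assumed */
extern void net_recv_reply(void *buf, int len);            /* assumed */

/* Pseudo-memory on the requesting node: a bus access to a remote
 * physical address becomes a network request transaction. */
uint64_t pseudo_memory_read(int home_node, uint64_t addr)
{
    struct req r = { READ_REQ, my_node, addr };
    net_send(home_node, &r, sizeof r);

    struct rep p;
    net_recv_reply(&p, sizeof p);  /* split-transaction bus: released while waiting */
    return p.data;
}

/* Pseudo-CPU on the home node: perform the access on local memory
 * and send the reply back. */
void pseudo_cpu_serve(const struct req *r)
{
    struct rep p = { *(volatile uint64_t *)(uintptr_t)r->addr };
    net_send(r->src_node, &p, sizeof p);
}
```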
Case Study: Cray T3D
• Processing element nodes
• 3D torus interconnect
• Wormhole routing
• PE numbering
• Local memory
• Support circuitry
• Prefetch
• Messaging
• Barrier synch
• Fetch and inc.
• Block transfers
Processing Element Nodes
• Two processors per node
• Shared block-transfer engine
– DMA-like transfer of large blocks of data
• Shared network interface/router
• Synchronous 6.67 ns clock
Communication Links
• Signals:
– Data: 16 bits
– Channel control: 4 bits
» request/response, virtual-channel buffer
– Channel acknowledge: 4 bits
» virtual-channel buffer status
Routing
• Dimension order routing
– may go in either + or - direction along a dimension
• Virtual channels
– Four virtual channel buffers per physical channel
– => two request channels, two response channels
• Deadlock avoidance
– In each dimension specify a "dateline" link
– Packets that do not cross dateline use virtual channel 0
– Packets that cross dateline use virtual channel 1
Packets
• Size: 8 physical units (phits)
– 16 bits per phit
• Header:
– routing info
– destination processor
– control info
– source processor
– memory address
• Read request
– header: 6 phits
• Read response
– header: 6 phits
– body: 4 phits (1 word) or 16 phits (4 words)
Processing Nodes
• Processor: Alpha 21064
• Support circuitry:
– address interpretation
– reads and writes (local/non-local)
– data prefetch
– messaging
– barrier synch.
– fetch and incr.
– status
Address Interpretation
• T3D needs:
– 64 MB memory per node => 26 bits
– up to 2048 nodes => 11 bits
• Alpha 21064 provides:
– 43-bit virtual address
– 32-bit physical address (+2 bits for memory-mapped devices)
• => Annex registers in DTB
– external to Alpha
– map the 32-bit address onto 48 bits: node + 27-bit address
– 32 annex registers
– annex registers also contain function info, e.g. cache / non-cache accesses
– DTB modified via load-locked/store-conditional instructions
Reads and Writes
• Reads
– note: no hardware cache coherence
– may be cacheable or non-cacheable
– two types of non-cacheable:
» normal and atomic swap
» swap uses “swaperand” register
– three types of cacheable reads:
» normal, atomic swap, cached readahead
» cached readahead buffers and pre-reads 4-word blocks of data
• Writes
– Alpha uses write-through, read-allocate cache
– write counter helps with consistency: increments on write request, decrements on ack
Data Prefetch
• Architected Prefetch Queue
– 1 word wide by 16 deep
• Prefetch instruction:
– Alpha prefetch hint => T3D prefetch
• Performance
– Allows multiple outstanding read requests
– (normal 21064 reads are blocking)
Messaging
• Message queues
– 256 KB reserved space in local memory
– => 4080 message packets + 16 overflow locations
• Sending processor:
– uses Alpha PAL code
– builds message packets of 4 words, plus address of receiving node
• Receiving node:
– stores message
– interrupts processor
– processor sends an ack
– processor may execute routine at address provided by message (active messages)
– if message queue is full, a NACK is sent
– also, error messages may be generated by support circuitry
Barrier Synchronization
• For Barrier or Eureka
• Hardware implementation
– hierarchical tree
– bypasses in tree to limit its scope
– masks for barrier bits to further limit scope
– interrupting or non-interrupting
Fetch and Increment
• Special hardware registers
– 2 per node
– user accessible
– used for auto-scheduling tasks (often loop iterations), as sketched below
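A minimal sketch of loop self-scheduling on such a register; fetch_inc() stands in for the hardware register access and is an assumption, not the T3D’s actual interface:

```c
/* Atomically returns the old counter value and increments it. */
extern long fetch_inc(void);

/* Each processor runs worker(); iterations are claimed dynamically,
 * so faster processors simply grab more of them. */
void worker(long n_iters, void (*body)(long))
{
    for (long i = fetch_inc(); i < n_iters; i = fetch_inc())
        body(i);
}
```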
Block Transfer
• Special block-transfer engine
– does DMA transfers
– can gather/scatter among processing elements
– up to 64K packets
– 1 or 4 words per packet
• Types of transfers
– constant-stride read/write
– gather/scatter
• Requires a system call (for memory protection) => big overhead
Cray T3E
• T3D Post Mortem
• T3E Overview
• Global Communication
• Synchronization
• Message Passing
• Kernel Performance
T3D Post Mortem
• Very high performance interconnect
– 3D torus worthwhile
• Barrier network "overengineered"
– Barrier synch not a significant fraction of runtime
• Prefetch queue useful; should be more of them
• Block Transfer engine not very useful
– high overhead to setup
– yet another way to access memory
• DTB Annex difficult to use in general
– one entry might have sufficed
– every processor must have same mapping for physical page
T3E Overview
• Alpha 21164 processors
• Up to 2 GB per node
• Caches
– 8K L1 and 96K L2 on-chip
– supports 2 outstanding 64-byte line fills
– stream buffers to enhance caching
– only local memory is cached
– => hardware cache coherence straightforward
• 512 (user) + 128 (system) E-registers for communication/synchronization
• One router chip per processor
T3E Overview, contd.
• Clock:
– system (i.e., shell) logic at 75 MHz
– processor at some multiple (initially 300 MHz)
• 3D torus interconnect
– bidirectional links
– adaptive multiple-path routing
– links run at 300 MHz
Global Communication: E-registers
• Extend address space
• Implement “centrifuging” function
• Memory-mapped (in I/O space)
• Operations:
– loads/stores between E-registers and processor registers
– global E-register operations
» transfer data to/from global memory
» messaging
» synchronization
Global Address Translation
• E-reg block holds base and mask
– previously stored there as part of setup
• Remote memory access (memory-mapped store):
– data bits: E-reg pointer (8) + address index (50)
– address bits: command + src/dstn E-reg
Address Translation, contd.
• Translation steps (a centrifuge sketch follows below)
– address index centrifuged with mask => virtual address + virtual PE
– offset added to base => vseg + seg offset
– vseg translated => gseg + base PE
– base PE + virtual PE => logical PE
– logical PE through lookup table => physical routing tag
• GVA: gseg (6) + offset (32)
– goes out over the network to the physical PE
» at the remote node, global translation => physical address
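The centrifuge itself is easy to state in C: mask bits route address-index bits into the PE field, the rest fall into the offset. Field order and widths here are assumptions for illustration:

```c
#include <stdint.h>

/* Split an address index into a virtual PE number and a virtual
 * address offset: bits under a 1 in the mask go to the PE field,
 * bits under a 0 go to the offset, each packed in order. */
static void centrifuge(uint64_t index, uint64_t mask,
                       uint64_t *virt_pe, uint64_t *virt_addr)
{
    uint64_t pe = 0, addr = 0;
    int pe_pos = 0, addr_pos = 0;

    for (int bit = 0; bit < 64; bit++) {
        uint64_t b = (index >> bit) & 1;
        if ((mask >> bit) & 1)
            pe   |= b << pe_pos++;    /* masked bits select the PE */
        else
            addr |= b << addr_pos++;  /* remaining bits form the offset */
    }
    *virt_pe = pe;
    *virt_addr = addr;
}
```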
Global Communication: Gets and Puts
• Get: global read
– word (32 or 64 bits)
– vector (8 words, with stride)
– stride in access E-reg block
• Put: global store
• Full/empty synchronization on E-regs
• Gets/puts are pipelined
– up to 480 MB/sec transfer rate between nodes
Synchronization
• Atomic ops between E-regs and memory
– Fetch & Inc
– Fetch & Add
– Compare & Swap
– Masked Swap
• Barrier/Eureka synchronization
– 32 BSUs per processor
– accessible as memory-mapped registers
– BSUs have states and operations
– state transition diagrams
– barrier/eureka trees are embedded in the 3D torus
– use highest-priority virtual channels
Message Passing
• Message queues
– arbitrary number
– created in user or system mode
– mapped into memory space
– up to 128 MB
• Message Queue Control Word in memory
– tail pointer
– limit value
– threshold triggers interrupt
– signal bit set when interrupt is triggered
• Message send
– from a block of 8 E-regs
– send command, similar to a put
• Queue head management done in software
– a swap can manage the queue in two segments
T3E Summary
• Messaging is done well...
– within constraints of a COTS processor <<buzzword alert>>
• Relies more on high communication and memory bandwidth than on caching
– => lower perf on dynamic irregular codes
– => higher perf on memory-intensive codes with large communication
• Centrifuge: probably unique
• Barrier synch uses dedicated hardware but NOT a dedicated network
DEC Memory Channel
(Princeton SHRIMP)
[Figure: reflective-memory mapping — virtual send pages on the sender map through physical pages to physical and virtual receive pages on the receiver]
• Reflective memory
• Writes on sender appear in receiver’s memory
– send & receive regions
– page control table
• Receive region is pinned in memory
• Requires duplicate writes; really just message buffers
Performance of Distributed Memory Machines
• Microbenchmarking
• One-way latency of small (five-word) message
– echo test (sketched below)
– round-trip time divided by 2
• Shared-memory remote read
• Message-passing operations
– see text
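A minimal echo-test sketch, written here against MPI purely for concreteness (the studied machines each had their own messaging layers); one-way latency is estimated as half the averaged round-trip time:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, iters = 10000;
    char buf[40];  /* small message: ~ five 64-bit words */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {        /* ping: send, then wait for the echo */
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) { /* echo: receive, send right back */
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / iters;

    if (rank == 0)
        printf("one-way latency ~ %.2f us\n", rtt / 2 * 1e6);
    MPI_Finalize();
    return 0;
}
```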
Network Transaction Performance
[Figure 7.31: one-way network transaction times in microseconds (Os, Or, L, and gap components) for CM-5, Paragon, Meiko CS-2, NOW, and Cray T3D; scale roughly 0 to 14 microseconds]
Remote Read Performance
[Figure 7.32: remote read times in microseconds (issue, L, and gap components) for CM-5, Paragon, Meiko CS-2, NOW, and Cray T3D; scale roughly 0 to 25 microseconds]
Summary of Distributed Memory Machines
• Convergence of architectures
– everything “looks basically the same”
– processor, cache, memory, communication assist
• Communication Assist
– where is it? (I/O bus, memory bus, processor registers)
– what does it know?
» does it just move bytes, or does it perform some functions?
– is it programmable?
– does it run user code?
• Network transaction
– input & output buffering
– action on remote node