An Efficient Shared Memory Based
Virtual Communication System for
Embedded SMP Cluster
Wenxuan Yin
Institute of Computing Technology
Chinese Academy of Sciences
Joint work with Xiang Gao, Xiaojing Zhu, ICT, CAS
and Deyuan Guo, Tsinghua University
NAS 2011
Background
• Dilemma in Embedded System
– High performance
– Cost, power consumption, size, etc.
Video/media processing
Space-borne satellite
July 2011
Wenxuan Yin-NAS 2011
Background
• Why are SMP clusters popular in general-purpose computing?
– High scalability
– Good cost-performance ratio
– Convenient for MPI programming
• It can also benefit the embedded domain
– Embedded Cluster
• Embedded processor nodes
• Commodity networks
– Tradeoff: moderate performance vs. cost/power efficiency
Motivations
• Challenges posed by SMP nodes
– Two levels of communication, with a performance gap between them
• inter-node: high-speed network
• intra-node: shared memory/cache
– Memory management
• memory hierarchy: local vs. remote
• coherency maintenance
– MPI Inter-Process Communication (IPC)
• process allocation at different parallel granularities
– Mutual exclusion and synchronization
Motivations
• Opportunities in SMP nodes
– More computation capacity
– High-speed chip-to-chip interconnect fabrics
• PCI-E: ARM Cortex A9 MPCore
• Serial RapidIO: Freescale 8641D
• HyperTransport: ICT Godson-3A
– Can we use the fabrics directly to replace traditional NIC-based networks?
• get rid of NICs, switches, cables
How can this be done?
Proposed Design
Extending the Shared Memory Mechanism into Inter-Node Communications
Objectives
• Compatibility
– Software-virtualized network → TCP/IP protocol
• Efficiency
– Remote memory → logical shared memory
– Narrow the gap between the two levels
• Economy
– Compact interconnect → space and cost effective
Comparison
• Chip-to-chip interconnection changes the network topology
[Figure: a star topology of UN nodes around an Ethernet switch, compared with a mesh topology of G nodes joined by HT links into a virtual Ethernet]
UN = Uniprocessor Node
G = Godson-3A SMP
Architecture
[Figure: four Godson-3A SMP nodes (Node 0 to Node 3), each with cores P0 … P3, cache, local memory, and shared memory, connected through HT0 (LO and HI halves) to form the Shared Memory Virtual Network (SMVN); the shared parts 0 to 3 make up the shared memory pool]
Figure annotations:
• Configured into 2 parts
• HT0: for interconnection
• HT1: for IO extension (omitted here)
• Memory in each node is divided into 2 parts: local and shared
SMP Nodes
• Godson-3A CPU
– MIPS64-compatible
– 4-core superscalar
– For high performance and low power consumption
[Figure: Godson-3A block diagram with cores P0 to P3, a 6×6 X1 switch, blocks S0 to S3, a 5×4 X2 switch, memory controllers MC0 and MC1, two HT controllers with DMA controllers, PCI/LPC, and XConf]
More Details
• Cache coherency
– Directory-based cache coherency
– HT maintains coherency across the whole interconnection system, with global addressing for remote accesses
– Transparent to programmers
• Reconfigurable memory pool
– Each node can tune the size of the shared memory it contributes to the memory pool
– Extreme case: only the master node cedes its shared part
X-Y Transmission
• Built-in routing mechanism in HT
– Eliminate switches
Examples: G0 → G3 and G3 → G0
[Figure: G0, G1, G2, and G3 connected by HT links into a virtual Ethernet, with the two example routes drawn on the mesh]
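A minimal sketch of dimension-ordered (X-Y) routing for the G0 → G3 and G3 → G0 examples. The grid coordinates in `COORDS`, the function name `xy_route`, and the choice to resolve the X offset before the Y offset are illustrative assumptions; the slides do not show how the HT routing is actually configured.

```python
# Assumed placement of the four Godson-3A nodes on a 2x2 grid (hypothetical).
COORDS = {"G0": (0, 0), "G1": (1, 0), "G2": (0, 1), "G3": (1, 1)}
NODE_AT = {xy: name for name, xy in COORDS.items()}


def xy_route(src, dst):
    """Return the hop list from src to dst, moving along X first, then Y."""
    x, y = COORDS[src]
    dst_x, dst_y = COORDS[dst]
    path = [src]
    while x != dst_x:                 # resolve the X offset first
        x += 1 if dst_x > x else -1
        path.append(NODE_AT[(x, y)])
    while y != dst_y:                 # then resolve the Y offset
        y += 1 if dst_y > y else -1
        path.append(NODE_AT[(x, y)])
    return path
```

Under these assumed coordinates the two directions use different intermediate nodes: `xy_route("G0", "G3")` passes through G1, while `xy_route("G3", "G0")` passes through G2.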
SMVN Driver
• Hierarchical design
– Virtual physical layer
• Memory copy & optimization
– Virtual data link layer
• Function and hardware abstraction
• Packet encapsulation to meet the frame format expected by TCP/IP
– Driver management layer
• Treats SMVN as a common NIC-class device
• The OS queries such devices in turn to load and start them
– Splices SMVN and TCP/IP together!
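The packet encapsulation step of the virtual data link layer can be sketched as follows. This assumes the frames are standard Ethernet II frames carrying IPv4, which matches the "Ethernet Frame" labels on later slides but is not spelled out in the talk. The helper names (`node_mac`, `encapsulate`, `decapsulate`) and the MAC numbering scheme are hypothetical.

```python
import struct

ETHERTYPE_IPV4 = 0x0800              # standard EtherType value for IPv4
HEADER = struct.Struct("!6s6sH")     # destination MAC, source MAC, EtherType


def node_mac(node_id):
    """Hypothetical locally administered MAC address for an SMVN node."""
    return bytes([0x02, 0, 0, 0, 0, node_id])


def encapsulate(src_node, dst_node, ip_packet):
    """Prefix an IP packet with a 14-byte Ethernet II header."""
    header = HEADER.pack(node_mac(dst_node), node_mac(src_node), ETHERTYPE_IPV4)
    return header + ip_packet


def decapsulate(frame):
    """Split a frame into (destination node, source node, payload)."""
    dst, src, ethertype = HEADER.unpack_from(frame)
    if ethertype != ETHERTYPE_IPV4:
        raise ValueError("unexpected EtherType")
    return dst[-1], src[-1], frame[HEADER.size:]
```

Because the frame layout is the one the TCP/IP stack already expects from a NIC driver, the upper layers need no changes, which is the point of treating SMVN as a NIC-class device.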
SMVN Driver
[Figure: SMVN driver stack on two communicating nodes]
Application Layer: Socket MPI Interface
Network & Transport Layer: TCP/IP Protocol Stack (TCP/IP upper protocol)
SMVN Driver:
– Driver Management Layer
– Virtual Data Link Layer: SM Packetization Layer, SM Func. & HW Abstraction Layer
– Virtual Physical Layer: Optimized Shared Memory Pool Copy
SMVN Communication
• How to implement communication across networks?
[Figure: three SMVN clusters with subnets 192.168.1.*, 127.9.1.*, and 10.2.5.*, each containing nodes 0 to 3; one node per cluster acts as a gateway (202.38.5.1, 202.38.5.2, 202.38.5.3) to the outside network (Ethernet or others)]
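A rough sketch of the forwarding decision a node might make when several SMVN clusters are joined through gateways. The /24 prefix length, the constant `GATEWAY_NODE`, and the function `next_hop` are assumptions made for illustration; the slide only shows the subnets and the gateway addresses.

```python
import ipaddress

# Assumed /24 prefix for one of the SMVN subnets shown on the slide.
LOCAL_SMVN = ipaddress.ip_network("192.168.1.0/24")
GATEWAY_NODE = 0     # hypothetical: the node that owns the outside interface


def next_hop(dst_ip):
    """Choose how to reach a destination address from inside the cluster."""
    addr = ipaddress.ip_address(dst_ip)
    if addr in LOCAL_SMVN:
        return ("smvn", dst_ip)            # deliver directly over shared memory
    return ("gateway", GATEWAY_NODE)       # hand off to the gateway node
```

In a real system this decision would normally live in the OS routing table rather than in the driver, since SMVN presents itself as an ordinary network interface.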
Memory management
• Data structures on SMVN buffer
– Singly Linked List (SLL)
– FreeList: global, unique
– InputList: each node maintains one
– No extra memory allocation!
[Figure: the shared memory pool (L) laid out as chains of Packet units, each chain delimited by head and tail pointers]
Packets transmission
Example: Node 0 as a sender, Node 1 as a receiver
1. Initially, FreeList holds all units and InputList is NULL
2. Sending: fetch (FreeList), copy, insert (InputList), trigger an interrupt
3. Receiving: fetch (InputList), copy, insert (FreeList)
[Figure: FreeList and InputList with their head and tail pointers at each of the three steps; SEND copies an Ethernet frame into a unit and RECV copies it out]
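The fetch, copy, and insert steps can be modelled with a fixed pool of units threaded onto singly linked lists by index. This is a simplified single-threaded sketch, not the driver itself: the class names, `UNIT_SIZE`, and the index-based links are assumptions, and the spinlock and interrupt of the real system are left out.

```python
UNIT_SIZE = 1514     # assumed capacity: one maximum-size Ethernet frame
NIL = -1             # end-of-list marker


class SList:
    """Singly linked list of unit indices with head and tail pointers."""

    def __init__(self):
        self.head = NIL
        self.tail = NIL


class Pool:
    def __init__(self, n_units, n_nodes):
        # All storage is created once; nothing is allocated afterwards.
        self.data = [bytearray(UNIT_SIZE) for _ in range(n_units)]
        self.length = [0] * n_units
        self.next = [NIL] * n_units
        self.free = SList()                               # global, unique
        self.inputs = [SList() for _ in range(n_nodes)]   # one per node
        for unit in range(n_units):                       # step 1
            self.insert(self.free, unit)

    def insert(self, lst, unit):
        """Append a unit at the tail of a list."""
        self.next[unit] = NIL
        if lst.tail == NIL:
            lst.head = unit
        else:
            self.next[lst.tail] = unit
        lst.tail = unit

    def fetch(self, lst):
        """Detach and return the unit at the head of a list, or NIL."""
        unit = lst.head
        if unit == NIL:
            return NIL
        lst.head = self.next[unit]
        if lst.head == NIL:
            lst.tail = NIL
        return unit

    def send(self, dst, frame):
        """Step 2: fetch (FreeList), copy, insert (InputList)."""
        if len(frame) > UNIT_SIZE:
            raise ValueError("frame larger than one unit")
        unit = self.fetch(self.free)
        if unit == NIL:
            return False                  # pool exhausted, caller retries
        self.data[unit][:len(frame)] = frame
        self.length[unit] = len(frame)
        self.insert(self.inputs[dst], unit)
        # The real driver would trigger an interrupt on node dst here.
        return True

    def recv(self, node):
        """Step 3: fetch (InputList), copy, insert (FreeList)."""
        unit = self.fetch(self.inputs[node])
        if unit == NIL:
            return None
        frame = bytes(self.data[unit][:self.length[unit]])
        self.insert(self.free, unit)
        return frame
```

For example, after `pool = Pool(n_units=2, n_nodes=2)`, two calls to `pool.send(1, ...)` succeed, a third returns `False` until node 1 calls `pool.recv(1)` and returns a unit to FreeList.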
Optimization
• Essentially an optimization of memory operations!
• Increase the concurrency
– Pipelining effect
• Minimize the number of memory accesses
– Zero-copy scheme
• Reduce memory access time
– Instruction-level optimization
Concurrency
• Overlap SEND/RECV operations!
[Figure: serial vs. concurrent operation. In the concurrent case, SEND of an Ethernet frame on Node 0 overlaps with RECV on Node 1, each working on a different unit between head and tail, which gives a pipelining effect]
Zero-Copy
• Change the head/tail pointers
• Change which list the packets belong to
– Packets migrate between FreeList and InputList inside the shared memory pool (L)
– Data copy is needed in only one scenario: between the SMVN memory pool and the network memory pool
• Extra benefit: reduce power consumption!
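A small sketch of the zero-copy idea: a packet changes lists by relinking pointers, and its payload buffer is never copied. The classes `Packet` and `PacketList` and the function `migrate` are hypothetical names used only for this illustration.

```python
class Packet:
    """A packet unit; payload stands in for its buffer in the shared pool."""

    def __init__(self, payload):
        self.payload = payload
        self.next = None


class PacketList:
    """Singly linked list with head and tail pointers."""

    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, pkt):
        pkt.next = None
        if self.tail is None:
            self.head = pkt
        else:
            self.tail.next = pkt
        self.tail = pkt

    def pop_head(self):
        pkt = self.head
        if pkt is not None:
            self.head = pkt.next
            if self.head is None:
                self.tail = None
            pkt.next = None
        return pkt


def migrate(src, dst):
    """Move the head packet of src to the tail of dst without copying data."""
    pkt = src.pop_head()
    if pkt is not None:
        dst.append(pkt)
    return pkt
```

After `migrate(free_list, input_list)` the moved packet still refers to the same buffer object, so the cost of the move is a few pointer updates regardless of the frame size.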
Low-Level Optimization
• To accelerate memcpy
– Using cache coherency maintained by hardware
• Using cached address space
• No flush/invalidate needed from programmers
– Godson-3A double-word (64-bit) reads and writes
– Unaligned memory access
Mutual Exclusion
• Why do we need this?
– Concurrency leads to unpredictable outcomes
• Solution: spinlock
– Keeps operations on shared resources atomic
– Test-And-Set (TAS) primitive
– In Godson-3A nodes
• ll (load-linked) & sc (store-conditional) instruction pair
Simple Lock
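TAS primitive
Lock(a0)
TryAgain:
    ll    t0, 0x0(a0)      # load the lock word and link its address
    bnez  t0, TryAgain     # lock already held: spin
    nop                    # branch delay slot
    addiu t0, t0, 1        # t0 = 1 (locked)
    sc    t0, 0x0(a0)      # try to store; t0 = 1 on success, 0 on failure
    beqz  t0, TryAgain     # store failed: retry
    nop                    # branch delay slot
    jr    ra               # lock acquired, return
    nop                    # branch delay slot
Unlock(a0)
    sw    zero, 0x0(a0)    # release the lock
• ll records the address while loading
• sc checks whether the address has been modified by competing accesses
– If NO, the store succeeds
– If YES, a failure status is marked implicitly in a register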
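The ll/sc behaviour that the lock relies on can be imitated in a few lines. This is a deterministic, single-threaded model meant only to show why a competing store makes sc fail; the class `LLSCWord` and its per-CPU link set are assumptions and do not describe how Godson-3A tracks links in hardware.

```python
class LLSCWord:
    """One memory word plus the set of CPUs holding a valid link to it."""

    def __init__(self):
        self.value = 0
        self.linked = set()

    def load_linked(self, cpu):
        """ll: read the word and remember that this CPU linked it."""
        self.linked.add(cpu)
        return self.value

    def store(self, value):
        """Plain store (sw): always succeeds and breaks every link."""
        self.value = value
        self.linked.clear()

    def store_conditional(self, cpu, value):
        """sc: store and return 1 if the link is intact, else return 0."""
        if cpu not in self.linked:
            return 0
        self.store(value)
        return 1


def try_lock(word, cpu):
    """One pass through the TryAgain loop; True if the lock was taken."""
    if word.load_linked(cpu) != 0:                   # ll + bnez: already held
        return False
    return word.store_conditional(cpu, 1) == 1       # addiu + sc + beqz


def unlock(word):
    word.store(0)                                    # sw zero, 0x0(a0)
```

If two CPUs both execute `load_linked` and see 0, only the first `store_conditional` returns 1; the second finds its link broken and must go back to TryAgain.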
Synchronization
• Occurs between nodes during SMVN initialization
– The master node initializes the shared memory pool; the others must wait until the pool is available
• When the master is ready
– Broadcast ready status
– Activate a timer
• SMVN needs to restart on timeout
[Figure: G0, G1, G2, and G3 connected by HT links in the virtual Ethernet]
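The wait-for-master handshake with a timeout can be sketched with a shared flag. Here threads stand in for nodes and `threading.Event` stands in for the broadcast ready status; the names `PoolStatus`, `master_init`, and `slave_wait`, and the 64-byte unit size, are hypothetical.

```python
import threading


class PoolStatus:
    """Shared state: the pool and a flag telling whether it is usable."""

    def __init__(self):
        self.ready = threading.Event()
        self.pool = None


def master_init(status, n_units):
    """Master node: build the pool, then announce that it is ready."""
    status.pool = [bytearray(64) for _ in range(n_units)]
    status.ready.set()


def slave_wait(status, timeout_s):
    """Other nodes: wait for the ready status, give up when the timer expires."""
    if status.ready.wait(timeout_s):
        return "running"
    return "restart"
```

A node that gets `"restart"` would reinitialize SMVN, as the slide requires on timeout.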
MPI Processes
• Worker Process (WP)
– The number of WPs determines the degree of parallelism
– Does the real work
• Daemon Process (DP)
– The DP mapping determines WP allocation, which reflects the parallel granularity
• Intra-node or inter-node
– At most one DP starts on each node
– At least one DP resides in the cluster
Mapping & Allocation
• DPs are mapped into a binary tree connection
[Figure: four copies of the node grid (nodes 0 to 3), one for each of the cases DP = 1, DP = 2, DP = 3, and DP = 4]
• WPs are allocated to the nodes with DPs by a breadth-first traversal algorithm

\[
\mathrm{Node}(i) =
\begin{cases}
\lfloor n/m \rfloor + 1, & i < n \bmod m \\
\lfloor n/m \rfloor, & i \ge n \bmod m
\end{cases}
\]

– m: number of DPs, 1 ≤ m ≤ 4
– n: number of WPs, n ≥ 1
– Node(i): number of WPs on node i, 0 ≤ i ≤ 3
• More than one WP on a node: OS SMP scheduling!
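A short sketch of the WP allocation rule, assuming the slide's formula reads as floor(n/m) WPs per DP node plus one extra on the first n mod m nodes. The function names are hypothetical, and `place_round_robin` is only an assumed placement order used to cross-check the counts, not the breadth-first procedure from the talk.

```python
def wps_per_node(n, m):
    """Return [Node(0), ..., Node(m-1)] for n WPs spread over m DP nodes."""
    if n < 1 or not 1 <= m <= 4:
        raise ValueError("expected n >= 1 and 1 <= m <= 4")
    base, extra = divmod(n, m)      # floor(n/m) and n mod m
    return [base + 1 if i < extra else base for i in range(m)]


def place_round_robin(n, m):
    """Assumed placement order: WP k goes to DP node k mod m."""
    counts = [0] * m
    for k in range(n):
        counts[k % m] += 1
    return counts
```

For instance, 10 WPs over 4 DP nodes gives 3, 3, 2, 2, and the counts always sum to n, so no WP is left unplaced.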
Real Platform
• MPICH2 library ported to our real system
• Based on the socket interface supported by SMVN
[Figure: the real platform, labeled with the Godson-3A SMP nodes and the Shared Memory Virtual Network]
Performance tests
• Benchmark
– OSU Micro-Benchmarks (OMB) for MPI IPC evaluation
– We choose two metrics
• Ping-pong latency
• Unidirectional bandwidth
• Performance comparison between
– Inter-node vs. intra-node
– Cached vs. uncached
Testbed Setup
• Towards the embedded environment
– Frequency: 525MHz
– Cache size
• L1: 64KB×2 (including instruction and data)
• L2: 4MB
– Memory size
• local memory for the real-time OS kernel: 256MB
• shared memory for the SMVN buffer: 2MB
– DDR2 working at 200MHz
– HT frequency: 800MHz
Results-Latency
[Figure: latency results, with curve regions annotated as "cliffy", "smooth", and "basic latency"]
Results-Bandwidth
[Figure: bandwidth results, with annotations 32.5MB/s, 27.3MB/s, and 84%]
Observations
• Much better than the Fast Ethernet (100 Mb/s) typically used in traditional embedded clusters
– Cache is helpful! It avoids flush/invalidate in software
– Tradeoff between performance and embedded constraints
• Narrows the gap between the two levels
– In this respect even superior to some high-end systems, although our absolute performance is lower
– Introduces shared memory in both intra- and inter-node communications
– Compact mesh topology in the system
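A quick arithmetic check of the reported figures. It assumes that the 84% annotation is the ratio of the two bandwidth values on the results slide and that 1 MB means 10^6 bytes; the transcript does not say which configuration each value belongs to, so the variable names are deliberately neutral.

```python
FAST_ETHERNET_MBIT = 100      # Mb/s, from the observations slide
BW_HIGH = 32.5                # MB/s, reported on the bandwidth slide
BW_LOW = 27.3                 # MB/s, reported on the bandwidth slide

# Line rate of Fast Ethernet in MB/s (8 bits per byte).
fast_ethernet_mbyte = FAST_ETHERNET_MBIT / 8

# Ratio of the two reported bandwidths, about 0.84.
ratio = BW_LOW / BW_HIGH

# How far the lower figure exceeds the Fast Ethernet line rate.
speedup_vs_ethernet = BW_LOW / fast_ethernet_mbyte
```

Under these assumptions the lower of the two figures is still more than twice the 12.5 MB/s that Fast Ethernet can deliver at best, which is consistent with the "much better than Fast Ethernet" observation.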
Related Works
• Comparison of data transfer methods
– User/kernel level shared memory [Buntinas et al.]
– High-speed NIC based copy
• MPI communication system (shared memory)
– Nemesis [Buntinas et al.]
– High-performance and good scalability system [Chai et al.]
• RDMA system
– InfiniBand [Mamidala et al.]
– Quadrics QsNetII [Qian et al.]
Conclusion
• Proposed a novel shared memory based virtual communication system: SMVN
• Goal: provide a uniform infrastructure across the different communication levels to implement efficient MPI IPC under embedded constraints
– Adequate performance
– Compact size, low power consumption, low cost (no NICs, no switches, no cables)
• Direction: scalability for large system expansion
Thanks for your attention!
Questions?