An Efficient Shared Memory Based Virtual Communication System for Embedded SMP Cluster

Wenxuan Yin
Institute of Computing Technology, Chinese Academy of Sciences
Joint work with Xiang Gao and Xiaojing Zhu (ICT, CAS) and Deyuan Guo (Tsinghua University)
NAS 2011, July 2011

Background
• Dilemma in embedded systems
  – High performance
  – Cost, power consumption, size, etc.
[Figures: video/media processing; space-borne satellite]

Background
• Why is the SMP cluster popular in general computing?
  – High scalability
  – Good cost-performance ratio
  – Convenient for MPI programming
• It can also benefit the embedded domain
  – Embedded cluster
    • Embedded processor nodes
    • Commodity networks
  – Tradeoff: moderate performance vs. cost/power efficiency

Motivations
• Challenges posed by SMP nodes
  – Two levels of communication (performance gap!)
    • inter-node: high-speed network
    • intra-node: shared memory/cache
  – Memory management
    • memory hierarchy: local vs. remote
    • coherency maintenance
  – MPI Inter-Process Communication (IPC)
    • process allocation under different parallelism
  – Mutual exclusion and synchronization

Motivations
• Opportunities in SMP nodes
  – More computation capacity
  – High-speed chip-to-chip interconnect fabrics
    • PCI-E: ARM Cortex A9 MPCore
    • Serial RapidIO: Freescale 8641D
    • HyperTransport: ICT Godson-3A
  – Can we use the fabrics directly to replace traditional NIC-based networks?
    • Get rid of NICs, switches, and cables
• How can this be done?

Proposed Design
• Extending the shared memory mechanism into inter-node communications

Objectives
• Compatibility
  – Software-virtualized network (TCP/IP protocol)
• Efficiency
  – Remote memory as logical shared memory
  – Narrow the gap between the two levels
• Economization
  – Compact interconnect (space and cost effective)

Comparison
• Chip-to-chip interconnection changes the network topology
  – Star: uniprocessor nodes (UN) connected through an Ethernet switch
  – Mesh: Godson-3A SMP nodes (G) connected directly by HT links, forming a virtual Ethernet
[Figure: star topology vs. mesh topology. UN = Uniprocessor Node, G = Godson-3A SMP]

Architecture
[Figure: four Godson-3A SMP nodes (Node 0 to Node 3), each with cores P0 to P3, cache, and local and shared memory, joined by the Shared Memory Virtual Network (SMVN); the shared parts 0 to 3 form the shared memory pool]
• HT is configured into 2 parts
  – HT0: for interconnection
  – HT1: for IO extension (omitted here)
• Memory in each node is divided into 2 parts: local and shared

SMP Nodes
• Godson-3A CPU
  – MIPS64-compatible
  – 4-core superscalar
  – For high performance and low power consumption
[Figure: Godson-3A block diagram with cores P0 to P3, 6×6 X1 switch, S0 to S3, 5×4 X2 switch, memory controllers MC0 and MC1, HT controllers, DMA controllers, PCI/LPC, and XConf]

More Details
• Cache coherency
  – Directory-based cache coherency
  – HT maintains coherency across the whole interconnection system, with global addressing for remote accesses
  – Transparent to programmers
• Reconfigurable memory pool
  – Each node can tune the size of the shared memory it contributes to the memory pool
  – Extreme case: only the master node cedes its shared part

X-Y Transmission
• Built-in routing mechanism in HT
  – Eliminates switches
• Examples: G0 → G3, G3 → G0
[Figure: nodes G0 to G3 connected by HT links, forming a virtual Ethernet]

SMVN Driver
• Hierarchical design
  – Virtual physical layer
    • Memory copy & optimization
  – Virtual data link layer
    • Function and hardware abstraction
    • Packet encapsulation meets the frame format of TCP/IP
  – Driver management layer
    • Treats SMVN as a common NIC-class device
    • The OS queries such devices recurrently to load and start them
  – Splice SMVN and TCP/IP together!

SMVN Driver
[Figure: layered protocol stack, from top to bottom]
• Application layer: Socket / MPI interface
• Network & transport layer: TCP/IP protocol stack (TCP/IP upper protocol)
• SMVN driver
  – Driver management layer
  – Virtual data link layer: SM packetization layer; SM function & hardware abstraction layer
  – Virtual physical layer: optimized shared memory pool copy

SMVN Communication
• How to implement communication across networks?
[Figure: three SMVN clusters (192.168.1.*, 127.9.1.*, 10.2.5.*), each with nodes 0 to 3 and one gateway (202.38.5.1, 202.38.5.2, 202.38.5.3), connected to the outside network over Ethernet or others]

Memory management
• Data structures on the SMVN buffer
  – Singly Linked List (SLL)
  – FreeList: global, unique
  – InputList: each node maintains one
  – No extra memory allocation!
[Figure: packets in the shared memory pool (L) linked into lists with head and tail pointers]

Packet transmission
• Example: Node 0 as the sender, Node 1 as the receiver
  1. FreeList holds all data; InputList is NULL
  2. Sending: fetch (FreeList), copy, insert (InputList), trigger an interrupt
  3. Receiving: fetch (InputList), copy, insert (FreeList)
[Figure: Ethernet frames moving between Unit 0 (SEND) and Unit 1 (RECV) through the FreeList and InputList head/tail pointers]

Optimization
• Essentially an optimization to memory operations!
• Increase the concurrency
  – Pipelining effect
• Minimize the number of memory accesses
  – Zero-copy scheme
• Reduce memory access time
  – Instruction-level optimization

Concurrency
• Overlap SEND/RECV operations!
  – Pipelining effect
[Figure: serial vs. concurrent operation; Node 0 sends Ethernet frames while Node 1 receives them through the list head/tail pointers]

Zero-Copy
• Change the head/tail pointers
• Change which list the packets belong to
[Figure: packets migrate between FreeList and InputList inside the shared memory pool (L); the only data copy scenario is between the SMVN mem pool and the network mem pool]
• Extra benefit: reduced power consumption!

Bottom Optimization
• To accelerate memcpy
  – Use the cache coherency maintained by hardware
    • Use the cached address space
    • No flush/invalidate needed from programmers
  – Godson-3A double-word (64-bit) reads and writes
  – Unaligned memory access

Mutual Exclusion
• Why do we need this?
  – Concurrency leads to an unpredictable outcome
• Solution: spinlock
  – Keeps operations on shared resources atomic
  – Test-And-Set (TAS) primitive
  – In Godson-3A nodes
    • ll (load-linked) & sc (store-conditional) instruction pair

Simple Lock
• TAS primitive

    Lock(a0)
    TryAgain:
        ll    t0, 0x0(a0)      # load-linked: read the lock word
        bnez  t0, TryAgain     # lock already held: spin
        nop                    # branch delay slot
        addiu t0, t0, 1        # t0 = 1 (locked)
        sc    t0, 0x0(a0)      # store-conditional: t0 = 1 on success, 0 on failure
        beqz  t0, TryAgain     # store failed: retry
        nop                    # branch delay slot
        jr    ra               # lock acquired: return
        nop                    # branch delay slot

    Unlock(a0)
        sw    zero, 0x0(a0)    # clear the lock word

• ll records the address while loading
• sc checks whether the address has been modified by competing accesses
  – If NO, the store succeeds
  – If YES, a failure status is marked implicitly in a register

Synchronization
• Occurs between nodes during SMVN initialization
  – The master node initializes the shared memory pool; the others must wait until the pool is available
• When the master is ready
  – Broadcast ready status
  – Activate a timer
• SMVN needs to restart on timeout
[Figure: nodes G0 to G3 connected by HT links, forming a virtual Ethernet]

MPI Processes
• Worker Process (WP)
  – Its number decides the degree of parallelism
  – The real working process
• Daemon Process (DP)
  – Its mapping decides the WPs' allocation, which reflects the parallel granularity
    • Intra-node or inter-node
  – At most one DP starts on each node
  – At least one DP resides in the cluster

Mapping & Allocation
• DPs are mapped into a binary tree connection
[Figure: four panels showing nodes 0 to 3 for DP = 1, DP = 2, DP = 3, and DP = 4]
• WPs are allocated to nodes with DPs by a breadth-first traversal algorithm

$$\mathrm{Node}(i) = \begin{cases} \lfloor n/m \rfloor + 1, & i < n \bmod m \\ \lfloor n/m \rfloor, & i \ge n \bmod m \end{cases}$$

  – m: number of DPs, 1 ≤ m ≤ 4
  – n: number of WPs, n ≥ 1
  – Node(i): number of WPs on node i, 0 ≤ i ≤ 3
• More than 1: OS SMP scheduling!
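
The WP allocation rule on the Mapping & Allocation slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes the rule gives the first n mod m DP nodes (in breadth-first order) one extra WP on top of floor(n/m), and the function name `wps_per_node` is hypothetical.

```python
def wps_per_node(n, m):
    """Return the number of WPs on each of the m DP nodes.

    Assumption (not confirmed by the slides): nodes are indexed in
    breadth-first order, and the first (n mod m) nodes take one
    extra WP on top of floor(n / m).
    """
    if n < 1 or m < 1:
        raise ValueError("need at least one WP and one DP")
    base, extra = divmod(n, m)
    return [base + 1 if i < extra else base for i in range(m)]


if __name__ == "__main__":
    # Distribute 6 WPs over 1 to 4 DP nodes.
    for m in range(1, 5):
        print(m, wps_per_node(6, m))
```

Under this reading the per-node counts always sum to n and differ by at most one, so any node holding more than one WP relies on the OS SMP scheduler to spread its WPs across cores.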

Real Platform
• Port the MPICH2 library to our real system
• Based on the socket interface supported by SMVN
[Figure: Godson-3A SMP nodes connected by the Shared Memory Virtual Network]

Performance tests
• Benchmark
  – OMB micro-benchmarks for MPI IPC evaluation
  – We choose two metrics
    • Ping-pong latency
    • Unidirectional bandwidth
• Performance comparison between
  – Inter-node vs. intra-node
  – Cached vs. uncached

Testbed Setup
• Towards the embedded environment
  – Frequency: 525MHz
  – Cache size
    • L1: 64KB×2 (instruction and data)
    • L2: 4MB
  – Memory size
    • local, in the real-time OS kernel: 256MB
    • shared, for the SMVN buffer: 2MB
  – DDR2 working at 200MHz
  – HT frequency: 800MHz

Results-Latency
[Figure: latency plot; annotations: "cliffy", "smooth", "Basic latency"]

Results-Bandwidth
[Figure: bandwidth plot; annotations: 32.5MB/s, 27.3MB/s, 84%]

Observations
• Much better than the Fast Ethernet (100 Mb/s) typically used in traditional embedded clusters
  – Cache is helpful! Avoids flush/invalidate by software
  – Tradeoff between performance and embedded constraints
• Narrows the gap between the two levels
  – Even superior to some high-end systems, although our absolute performance is lower
  – Introduces shared memory in both intra- and inter-node communications
  – Compact mesh topology in the system

Related Works
• Comparison of data transfer methods
  – User/kernel level shared memory [Buntinas et al.]
  – High-speed NIC based copy
• MPI communication system (shared memory)
  – Nemesis [Buntinas et al.]
  – High-performance and good scalability system [Chai et al.]
• RDMA system
  – InfiniBand [Mamidala et al.]
  – Quadrics QsNetII [Qian et al.]

Conclusion
• Proposed a novel shared memory based virtual communication system: SMVN
• Goal: build a uniform infrastructure across communication levels to implement efficient MPI IPC under embedded constraints
  – Adequate performance
  – Compact size, low power consumption, low cost (no NICs, no switches, no cables)
• Direction: scalability for large system expansion

Thanks for your attention! Questions?