SCTP-based Middleware for MPI
Humaira Kamal, Brad Penoff, Alan Wagner
Department of Computer Science, University of British Columbia

What is MPI and SCTP?
- Message Passing Interface (MPI): a library that is widely used to parallelize scientific and compute-intensive programs.
- Stream Control Transmission Protocol (SCTP): a general-purpose unicast transport protocol for IP network data communications. Recently standardized by the IETF, it can be used anywhere TCP is used.

Question
Can we take advantage of SCTP features to better support parallel applications using MPI?

Communicating MPI Processes
- TCP is often used as the transport protocol for MPI.
- [Diagram: two communicating MPI processes, each layered as MPI process / MPI middleware / SCTP or TCP / IP.]

SCTP Key Features
- Reliable in-order delivery, flow control, full-duplex transfer.
- SACK is built into the protocol.
- TCP-like congestion control.
- Message oriented.
- Use of associations.
- Multihoming.
- Multiple streams within an association.

Logical View of Multiple Streams in an Association
[Diagram: endpoints X and Y each SEND and RECEIVE over streams 0, 1, and 2; X's outbound streams are Y's inbound streams and vice versa.]

Partially Ordered User Messages Sent on Different Streams
[Diagram: user messages tagged with stream numbers (SNo) are fragmented into data chunks at the SCTP layer, queued in the control chunk and data chunk queues, bundled into SCTP packets, and handed to the IP layer.]

MPI Middleware
- MPI_Send(msg, count, type, dest-rank, tag, context)
- MPI_Recv(msg, count, type, source-rank, tag, context)
- Message matching is done based on Tag, Rank and Context (TRC).
- Combinations such as blocking, non-blocking, synchronous, asynchronous, buffered, unbuffered.
- Use of wildcards for receive.

Format of MPI Message
[Diagram: an envelope holding Context, Rank, and Tag, followed by the payload.]

MPI Messages Using Same Context, Two Processes
[Diagram: Process X issues MPI_Send(Msg_1, Tag_A), MPI_Send(Msg_2, Tag_B), MPI_Send(Msg_3, Tag_A); Process Y posts MPI_Irecv(..ANY_TAG..). Msg_2 and Msg_3 carry different tags and may be delivered in either order.]
Out-of-order messages with the same tags violate MPI semantics.

MPI Middleware Message Progression Layer
[Diagram: a receive request issued by the application layer enters the receive request queue in the MPI middleware; incoming messages arriving on the SCTP-layer socket are matched against it, with unmatched arrivals held in the unexpected message queue; short messages and long messages are handled differently.]

Design and Implementation
- LAM (Local Area Multicomputer) is an open-source implementation of the MPI library.
- We redesigned LAM-MPI to use SCTP in a three-phased iterative process:
  1. Use of one-to-one style sockets.
  2. Use of multiple streams.
  3. Use of one-to-many style sockets.

Using SCTP for MPI
Striking similarities between SCTP and MPI (see the sketch below):

  SCTP               | MPI
  -------------------|---------------
  One-to-many socket | Context
  Association        | Rank / source
  Streams            | Message tags
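A minimal sketch of this mapping, assuming a standard SCTP stack such as lksctp on Linux: an MPI-style tag is mapped to an SCTP stream number and the message is sent over a one-to-many socket with sctp_sendmsg(). This is illustrative only, not LAM's actual code; the peer address, port, tag, and stream count are all made-up values.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/sctp.h>
#include <sys/socket.h>

#define NUM_STREAMS 8   /* hypothetical number of streams per association */

/* Same tag -> same stream preserves MPI's per-tag ordering; different
 * tags may land on different streams and so avoid head-of-line
 * blocking.  (LAM's real mapping covers the full tag-rank-context.) */
static uint16_t tag_to_stream(int tag) {
    return (uint16_t)((unsigned)tag % NUM_STREAMS);
}

int main(void) {
    /* One-to-many style socket: one descriptor carries associations to
     * all ranks, much as one MPI context spans many processes. */
    int sd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
    if (sd < 0) { perror("socket"); return EXIT_FAILURE; }

    /* Request NUM_STREAMS outbound streams at association setup. */
    struct sctp_initmsg init = { .sinit_num_ostreams  = NUM_STREAMS,
                                 .sinit_max_instreams = NUM_STREAMS };
    setsockopt(sd, IPPROTO_SCTP, SCTP_INITMSG, &init, sizeof(init));

    /* Example peer; real middleware would look this up per rank. */
    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port   = htons(5000) };
    inet_pton(AF_INET, "127.0.0.1", &peer.sin_addr);

    const char *msg = "Msg_1";
    int tag = 42;                          /* example MPI tag */
    if (sctp_sendmsg(sd, msg, strlen(msg),
                     (struct sockaddr *)&peer, sizeof(peer),
                     0, 0,                 /* ppid, flags */
                     tag_to_stream(tag),   /* stream number */
                     0, 0) < 0)            /* time-to-live, context */
        perror("sctp_sendmsg");
    return EXIT_SUCCESS;
}
```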
Implementation Issues
- Maintaining state information: maintain state appropriately for each request function so that it works with the one-to-many style; extend RPI initialization to map associations to ranks.
- Message demultiplexing: demultiplex each incoming message to direct it to the proper receive function.
- Concurrency and SCTP streams: consistently map the MPI tag-rank-context to SCTP streams while maintaining proper MPI semantics.
- Resource management: make the RPI more message-driven; eliminate the use of the select() system call, making the implementation more scalable; eliminate the need to maintain a large number of socket descriptors.
- Eliminating race conditions: find solutions for the race conditions introduced by the added concurrency; use a barrier after the association setup phase.
- Reliability: modify the out-of-band daemons and the request progression interface (RPI) to use a common transport protocol, allowing all components of LAM to multihome successfully.
- Support for large messages: devise a long-message protocol to handle messages larger than the socket send buffer.
- Experiments with different SCTP stacks.

Features of Design
- Head-of-line blocking.
- Multihoming and reliability.
- Security.

Head-of-Line Blocking
[Diagram: Process X sends Msg_A (Tag_A) and Msg_B (Tag_B) to Process Y, which posts two MPI_Irecv calls. Over TCP, Msg_B is blocked behind a delayed Msg_A even though the receives are independent; over SCTP the messages travel on different streams, so Msg_B is delivered while Msg_A is recovered.]

Multihoming
[Diagram: Node 0 (NIC1 at 207.10.3.20, NIC2 at 168.1.10.30) and Node 1 (NIC3 at 207.10.40.1, NIC4 at 168.1.140.10), connected through networks 207.10.x.x and 168.1.x.x.]
- Heartbeats, failover, retransmissions, user-adjustable controls. (A minimal setup sketch appears at the end of this transcript.)

Added Security
- SCTP's use of a signed cookie: a four-way handshake between P0 and P1 (INIT, INIT-ACK, COOKIE-ECHO, COOKIE-ACK).
- User data can be piggy-backed on the third and fourth legs.

Limitations
- Comprehensive CRC32c checksum; offload to the NIC is not yet commonly available.
- SCTP bundles messages together, so it might not always be able to pack a full MTU.
- The SCTP stacks are in their early stages and will improve over time.
- Performance is stack dependent (the Linux lksctp stack is much slower than the FreeBSD KAME stack).

Experiments: Loss
[Figure: total run time of LAM_SCTP versus LAM_TCP at 0%, 1%, and 2% loss rates, for an MPI program that uses multiple tags.]

Experiments: Head-of-Line Blocking
[Figure: total run time of a latency-tolerant LAM_SCTP program using the same tags versus different tags, at 1%, 2%, and 10% loss rates.]

Experiments: SCTP versus TCP
[Figure: MPBench ping-pong test under no loss; throughput normalized to LAM_TCP, plotted against message size in bytes.]

Conclusions
- SCTP is better suited to MPI:
  - it avoids unnecessary head-of-line blocking through its use of streams;
  - it increases fault tolerance in the presence of multihomed hosts;
  - it has built-in security features.
- SCTP might be key to moving MPI programs from LANs to WANs.

Thank you! More information about our work is at: http://www.cs.ubc.ca/labs/dsg/mpi-sctp/
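The multihoming setup sketch referenced above: a minimal, hypothetical example (not from the talk) of binding one SCTP endpoint to two local interfaces with sctp_bindx(), so that an association can fail over between networks. The addresses are the slide's example NICs for Node 0, and the port is an assumption; on a real host the addresses must match configured interfaces.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/sctp.h>
#include <sys/socket.h>

int main(void) {
    int sd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
    if (sd < 0) { perror("socket"); return EXIT_FAILURE; }

    /* Two local addresses, one per NIC (values taken from the slide's
     * Node 0); both entries must share the same port. */
    struct sockaddr_in addrs[2];
    memset(addrs, 0, sizeof(addrs));
    addrs[0].sin_family = AF_INET;
    addrs[0].sin_port   = htons(5000);      /* example port */
    inet_pton(AF_INET, "207.10.3.20", &addrs[0].sin_addr);
    addrs[1].sin_family = AF_INET;
    addrs[1].sin_port   = htons(5000);
    inet_pton(AF_INET, "168.1.10.30", &addrs[1].sin_addr);

    /* Bind both addresses: SCTP advertises them to the peer during
     * association setup and monitors each path with heartbeats, so a
     * failed path can be retired without breaking the association. */
    if (sctp_bindx(sd, (struct sockaddr *)addrs, 2,
                   SCTP_BINDX_ADD_ADDR) < 0) {
        perror("sctp_bindx");
        return EXIT_FAILURE;
    }
    puts("endpoint bound to both interfaces");
    return EXIT_SUCCESS;
}
```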