GridMPI: Grid Enabled MPI
Yutaka Ishikawa
University of Tokyo and AIST
http://www.gridmpi.org

Motivation
• MPI has been widely used to program parallel applications.
• Users want to run such applications over the Grid environment without any modification of the program.
• However, the performance of existing MPI implementations does not scale in the Grid environment.
• Focus on a metropolitan-area, high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency).
  – Internet bandwidth in the Grid vs. interconnect bandwidth in a cluster: 10 Gbps vs. 1 Gbps today, 100 Gbps vs. 10 Gbps in the future.
  – We have already demonstrated, using an emulated WAN environment, that the NAS Parallel Benchmark programs scale if the one-way latency is smaller than 10 ms:
    Motohiko Matsuda, Yutaka Ishikawa, and Tomohiro Kudoh, "Evaluation of MPI Implementations on Grid-connected Clusters using an Emulated WAN Environment," CCGrid2003, 2003.
[Figure: a single (monolithic) MPI application running over the Grid environment, spanning computing resource sites A and B across a wide-area network.]

Issues
• High-performance communication facilities for MPI on long and fat networks
  – TCP vs. MPI communication patterns: TCP is designed for streams and its traffic is bursty, whereas MPI applications repeat computation and communication phases and change their traffic with the communication pattern.
  – Network topology: latency and bandwidth.
• Interoperability
  – Many MPI library implementations exist, and most use their own network protocol.
• Fault tolerance and migration
  – To survive a site failure.
• Security
[Figure: TCP bandwidth (MB/s) versus time (ms) observed during one 10 MB data transfer, with the transfer repeated at two-second intervals; the silent periods result from the burst traffic, and in the slow-start phase the congestion window is reset to 1.]
[Figure: TCP bandwidth (MB/s) versus time (sec) when one-to-one communication starts at time 0 right after an all-to-all phase.]
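The first trace above was produced by repeating a 10 MB transfer with two-second pauses between transfers. The following is a minimal sketch of that measurement pattern in plain MPI; it is not the authors' benchmark code, and the message size, repeat count, and output format are illustrative only.

```c
/* Minimal sketch (not the authors' benchmark) of the traffic pattern behind
 * the bandwidth trace above: rank 0 repeatedly sends a 10 MB message to
 * rank 1, pausing two seconds between transfers.  On a long and fat network
 * each pause lets the TCP connection go idle, so the next transfer starts
 * from slow start again, which produces the bursty trace. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MSG_SIZE (10 * 1024 * 1024)   /* 10 MB per transfer */
#define REPEATS  10                   /* illustrative repeat count */

int main(int argc, char **argv)
{
    int rank;
    char *buf = calloc(MSG_SIZE, 1);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < REPEATS; i++) {
        double t0 = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("transfer %d: %.1f MB/s\n", i,
                   (MSG_SIZE / 1.0e6) / (MPI_Wtime() - t0));
        }
        sleep(2);                     /* two-second idle interval */
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Run with two processes placed at different sites (e.g., mpirun -np 2 with a suitable host file), the observed per-transfer bandwidth stays well below the link capacity because every transfer pays the slow-start cost again after the idle interval.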
[Figure: four clusters, each running a different vendor's MPI library (Vendor A, B, C, D), connected over the Internet.]

GridMPI Features
• MPI-2 implementation.
• YAMPII, developed at the University of Tokyo, is used as the core implementation.
• Intra-cluster communication by YAMPII (TCP/IP, SCore).
• Inter-cluster communication by the IMPI (Interoperable MPI) protocol, extended for the Grid:
  – MPI-2
  – New collective protocols (LAC: Latency Aware Collectives); bcast/allreduce algorithms have been developed (to appear at the Cluster 2006 conference).
• Integration of vendor MPIs: IBM Regatta MPI, MPICH2, Solaris MPI, Fujitsu MPI, (NEC SX MPI).
• Incremental checkpointing.
• High-performance TCP/IP implementation.
[Figure: software architecture — MPI API, RPIM interface, LAC layer (collectives), request interface, request layer, P2P interface; IMPI over TCP/IP between clusters across the Internet; vendor MPI, O2G, MX, and PMv2 drivers under YAMPII; Globus, SCore, and rsh/ssh for process startup.]

High-performance Communication Mechanisms in the Long and Fat Network
• Modifications of TCP behavior
  – M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "TCP Adaptation for MPI on Long-and-Fat Networks," IEEE Cluster 2005, 2005.
• Precise software pacing
  – R. Takano, T. Kudoh, Y. Kodama, M. Matsuda, H. Tezuka, and Y. Ishikawa, "Design and Evaluation of Precise Software Pacing Mechanisms for Fast Long-Distance Networks," PFLDnet 2005, 2005.
• Collective communication algorithms that take network latency and bandwidth into account
  – M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "Efficient MPI Collective Operations for Clusters in Long-and-Fast Networks," to appear at IEEE Cluster 2006.

Evaluation
• It is almost impossible to reproduce the communication performance behavior of a wide-area network in a repeatable way.
• A hardware WAN emulator, GtrcNET-1, is therefore used to examine implementations, protocols, communication algorithms, etc. under controlled conditions.
• GtrcNET-1 is developed at AIST (http://www.gtrc.aist.go.jp/gnet/):
  – Injection of delay, jitter, and errors; traffic monitoring and frame capture.
  – Four 1000Base-SX ports, one USB port for the host PC, FPGA (XC2V6000).

Experimental Environment
• Two clusters of 8 PCs (Node0–Node7 and Node8–Node15), each behind a Catalyst 3750 switch, connected through the GtrcNET-1 WAN emulator.
• Emulated link: 1 Gbps bandwidth, 0 ms to 10 ms one-way delay.
• Node configuration: Pentium 4 2.4 GHz, 512 MB DDR400 memory, Intel PRO/1000 NIC (82547EI), Linux 2.6.9-1.6 (Fedora Core 2), socket buffer size 20 MB.
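For context (my arithmetic; the slides only list the settings), the 20 MB socket buffer comfortably covers the bandwidth-delay product of the emulated link. At 1 Gbps with the maximum one-way delay of 10 ms, i.e. a 20 ms round trip,

\[
\mathrm{BDP} = 1\,\mathrm{Gbps} \times 20\,\mathrm{ms}
             = 10^{9}\,\mathrm{bit/s} \times 0.02\,\mathrm{s}
             = 2.5\,\mathrm{MB},
\]

so a single connection can keep a full window in flight at every emulated delay without being limited by TCP flow control.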
GridMPI vs. MPICH-G2 (1/4)
• FT (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes.
[Figure: relative performance of FT for GridMPI and MPICH-G2 versus one-way delay (0–12 ms).]

GridMPI vs. MPICH-G2 (2/4)
• IS (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes.
[Figure: relative performance of IS for GridMPI and MPICH-G2 versus one-way delay (0–12 ms).]

GridMPI vs. MPICH-G2 (3/4)
• LU (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes.
[Figure: relative performance of LU for GridMPI and MPICH-G2 versus one-way delay (0–12 ms).]

GridMPI vs. MPICH-G2 (4/4)
• SP, BT, MG, and CG (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes.
• No parameters were tuned in GridMPI.
[Figure: relative performance of SP, BT, MG, and CG for GridMPI and MPICH-G2 versus one-way delay (0–12 ms).]

GridMPI on Actual Network
• NAS Parallel Benchmarks run on an 8-node (2.4 GHz) cluster at Tsukuba and an 8-node (2.8 GHz) cluster at Akihabara, 16 nodes in total; each cluster is connected internally by 1 Gigabit Ethernet.
• Performance is compared with results on a single 16-node cluster at 2.4 GHz and at 2.8 GHz.
• Network: JGN2, 10 Gbps bandwidth, 1.5 ms RTT, 60 km (40 mi.) between the sites.
[Figure: relative performance of BT, CG, EP, FT, IS, LU, MG, and SP against the 2.4 GHz and 2.8 GHz baselines.]

GridMPI Now and Future
• GridMPI version 1.0 has been released.
  – Conformance tests: MPICH Test Suite 0/142 (fails/tests), Intel Test Suite 0/493 (fails/tests).
  – GridMPI is integrated into the NaReGI package.
• Extension of the IMPI specification
  – Refine the current extensions.
  – Collective communication and checkpoint algorithms could not be fixed in the specification. The current idea is to specify the mechanism for:
    » dynamic algorithm selection
    » dynamic algorithm shipment and loading
    » a virtual machine in which to implement the algorithms

Dynamic Algorithm Shipment
• A collective communication algorithm is implemented on the virtual machine.
• The code is shipped to all MPI processes.
• The MPI runtime library interprets the algorithm to perform the inter-cluster collective communication.

Concluding Remarks
• Our main concern is the metropolitan-area network: a high-bandwidth environment of 10 Gbps over 500 miles (less than 10 ms one-way latency).
• Overseas (about 100 ms latency):
  – Applications must be aware of the communication latency.
  – Data movement using MPI-IO?
• Collaborations
  – We would like to ask people who are interested in this work for collaboration.
http://www.gridmpi.org
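As a closing illustration of the idea behind the latency-aware collectives and inter-cluster algorithms discussed above, here is a generic two-level broadcast sketch in plain MPI. It is not GridMPI's LAC implementation or the IMPI extension; the function name twolevel_bcast and the cluster_id argument are introduced here for illustration, and the sketch assumes the payload originates at world rank 0.

```c
/* Generic two-level broadcast sketch (not GridMPI's actual algorithm):
 * cross the wide-area link only among per-cluster leaders, then fan out
 * inside each cluster over the low-latency local interconnect.
 * Assumes the data originates at world rank 0 and that cluster_id is
 * supplied by the caller (e.g., derived from a hostname-to-site map). */
#include <mpi.h>

void twolevel_bcast(void *buf, int count, MPI_Datatype type,
                    int cluster_id, MPI_Comm comm)
{
    int world_rank, intra_rank;
    MPI_Comm intra, leaders;

    MPI_Comm_rank(comm, &world_rank);

    /* One communicator per cluster; the lowest world rank in each cluster
     * becomes intra rank 0, i.e. the cluster leader. */
    MPI_Comm_split(comm, cluster_id, world_rank, &intra);
    MPI_Comm_rank(intra, &intra_rank);

    /* Communicator containing only the leaders (one process per cluster). */
    MPI_Comm_split(comm, intra_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leaders);

    /* Step 1: broadcast across the WAN among cluster leaders only.
     * World rank 0 is the lowest-keyed leader, so it is rank 0 here. */
    if (leaders != MPI_COMM_NULL) {
        MPI_Bcast(buf, count, type, 0, leaders);
        MPI_Comm_free(&leaders);
    }

    /* Step 2: broadcast inside each cluster from its leader. */
    MPI_Bcast(buf, count, type, 0, intra);
    MPI_Comm_free(&intra);
}
```

GridMPI's actual bcast/allreduce algorithms are not shown in the slides; this sketch only illustrates the general principle of separating the high-latency inter-cluster phase from the low-latency intra-cluster phase.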