Slide 1
  Bob Alverson, Duncan Roweth, Larry Kaplan
  Cray Inc.
  Hot Interconnects

Slide 2: Overview
  Network Interface
  Router
  Reliability, Availability, and Serviceability Features
  Software Stack
  Performance

Slide 3
  Integrated NIC and Router
  External HSS Monitoring
  Supports 2 Nodes per ASIC
  Advanced Resiliency Features
  Hardware Global Address Support
  Advanced NIC designed to efficiently support
    MPI
    One-sided MPI
    Shmem
    UPC, Coarray FORTRAN

Slide 4
  [Figure: diagram labeled with X, Y, and Z axes]

Slide 5
  Fast Memory Access (FMA): fine grain remote PUT/GET
  Block Transfer Engine (BTE): offload for long transfers
  Completion Queue (CQ): client notification
  Atomic Memory Op (AMO): fetch&add, etc.
  [Figure: NIC block diagram. Block labels recoverable from the extraction: FMA, BTE, AMO, CQ, NAT, RAT, RMT, NPT, SSID, ORB, HARB, TARB, CLM, LM, NL, LB, LB Ring, HT3 Cave, Router Tiles. Signal labels include net req, net rsp, ht p req, ht np req, ht treq p, ht treq np, ht trsp, ht irsp, ht p ireq, ht np ireq, vc0, vc1, and headers.]

Slide 6
  Single-sided
  Processor stores become remote PUT or GET
  FMA descriptors hold state to help determine destination node and memory location
  FMA PUT for short messages
    Uncached processor store to Gemini window translated directly to network packet
  FMA GET allows reverse direction data transfer of 1 to 64 bytes

Slide 7
  Driver managed
  BTE PUT for long messages
    DMA transfer to offload data movement from processor
  BTE SEND for IP traffic, etc.
    Send message to remote node
    Single receive queue for all sources
    Upper level protocol covers lost messages
  BTE GET support for simplified data transfers
    In lieu of involving remote side for PUT
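The division of labor between FMA (short, fine grain transfers) and BTE (long, offloaded transfers) can be pictured as a size-based dispatch. The following is a minimal sketch, not Cray's software: the names `Engine`, `choose_engine`, and `fma_get_chunks` are hypothetical, and the 4096-byte crossover is an assumed placeholder, since the slides give no threshold. The only figure taken from the slides is the 1 to 64 byte range of an FMA GET.

```python
from enum import Enum


class Engine(Enum):
    FMA = "FMA"  # processor-driven, fine grain PUT/GET
    BTE = "BTE"  # DMA offload for long transfers


FMA_GET_MAX_BYTES = 64   # from the slides: FMA GET moves 1 to 64 bytes
CROSSOVER_BYTES = 4096   # ASSUMPTION: hypothetical short/long threshold


def choose_engine(size_bytes, crossover=CROSSOVER_BYTES):
    """Pick FMA for short transfers and BTE for long ones."""
    if size_bytes < 1:
        raise ValueError("transfer size must be at least 1 byte")
    return Engine.FMA if size_bytes < crossover else Engine.BTE


def fma_get_chunks(size_bytes):
    """Split a GET into request sizes of at most 64 bytes each.

    ASSUMPTION: illustrates the 64-byte limit only; how real software
    breaks up a larger GET is not described in the slides.
    """
    if size_bytes < 1:
        raise ValueError("transfer size must be at least 1 byte")
    full, rest = divmod(size_bytes, FMA_GET_MAX_BYTES)
    return [FMA_GET_MAX_BYTES] * full + ([rest] if rest else [])
```

Under these assumptions, `choose_engine(8)` selects FMA, a 1 MiB transfer selects BTE, and `fma_get_chunks(150)` yields two 64-byte requests plus one 22-byte request.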
Slide 8
  Hardware remote atomic memory operations in the NIC
    Add, Compare & Swap, Logical Operations
    Executed at the node with the memory
  AMO cache for hot locations
    Up to 64 locations with AMOs in process
  Global operations support
    Barriers
    Counters
    Collectives (reductions, global sum)

Slide 9
  6x8 tile matrix
  Input queue to one of 6 subswitches
  Route to one of 8 output buffers
  Hashed routing preserves order to cachelines
  Adaptive routing

Slide 10
  Route around stalled or down links
    If a link goes down, adaptive routing mask updated in hardware to exclude it
    OS traffic uses adaptive routing only, recovers from finite loss of packets
    Quiesce and re-route to repair deterministic routes
  Congestion feedback to allow routing around bottlenecks
    Potential for improved performance on difficult traffic patterns such as transpose
  Packets reordered in receive buffer (DRAM)
    Separate notification (completion event) when all stored

Slide 11: General Network Packet Format
  24 bit flit
  Maximum size packet is 7+24+1=32 flit
    Put request of 64 bytes
  Minimum is 2 flit
    Put response
  [Figure: general packet layout, bits 23 to 0 per phit. phit 0 carries vc, destination[15:0], and flag bits; phit 1 carries payload and optional hash bits; following phits carry payload; the last phit carries CRC-16 and status bits.]
  [Figure: Network Request Packet Format, phits 0 to 6. Recoverable fields: vc, destination[15:0], address[23:6], address[37:24], addr[39:38], addr[45:40], source[15:0], MDH[11:0], ptag[7:0], SSID[7:0], packetID[11:0], cmd[5:0], mask[15:0], BTEvc, DstID, SrcID, size, plus assorted flag and reserved bits. Exact bit positions are not recoverable from the extraction.]
  [Figure: Data Payload (up to 24 phits). phit n holds data[19:0], phit n+1 holds data[41:20], phit n+2 holds data[63:42].]
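The 7+24+1=32 figure for a 64-byte Put request follows from the layouts above: seven header phits (phit 0 through phit 6), three phits per 64-bit data word (data[19:0], data[41:20], data[63:42]), and one CRC phit. The sketch below reproduces that arithmetic. The function names are hypothetical, and rounding smaller payloads up to whole 64-bit words is an assumption rather than something the slides state.

```python
import math

PHIT_BITS = 24        # slide: 24-bit unit
HEADER_PHITS = 7      # request format shows phit 0 through phit 6
CRC_PHITS = 1         # last phit carries CRC-16
WORD_BITS = 64
PHITS_PER_WORD = 3    # data[19:0], data[41:20], data[63:42]


def put_request_phits(payload_bytes):
    """Phits in a Put request carrying payload_bytes of data.

    ASSUMPTION: payloads are padded up to whole 64-bit words.
    """
    if not 1 <= payload_bytes <= 64:
        raise ValueError("payload must be 1 to 64 bytes")
    words = math.ceil(payload_bytes * 8 / WORD_BITS)
    return HEADER_PHITS + words * PHITS_PER_WORD + CRC_PHITS


def payload_efficiency(payload_bytes):
    """Fraction of transmitted bits that are user payload."""
    total_bits = put_request_phits(payload_bytes) * PHIT_BITS
    return payload_bytes * 8 / total_bits
```

For a full 64-byte payload, `put_request_phits(64)` gives 32, matching the slide, and the payload accounts for 512 of the 768 bits sent, or two thirds of the packet.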
Slide 12
  Automatic link-level retries
  HT3 support including automatic retries and improved CRC
  Most internal data structures are at least parity protected
    The longer the occupancy of data at a location, the stronger the protection
  Errors reported as precisely as possible
    Payload errors reported directly to user
    Control errors often cannot be associated with a particular transaction
    In all cases OS or HSS can be notified of the error
  Router errors included
    Reported at the point of error
    Endpoint(s) (user) see a timeout

Slide 13
  [Figure: software stack diagram. Component labels in extraction order: MPICH, MPICH2, SHMEM, PGAS, DMAPP; User level Gemini Network Interface (uGNI); MRT-size page support; Registration Cache support; Kernel level GNI (kGNI); Direct Access; Cray COW solution; Lustre Network Driver (LND); Direct Access; GART Resource Management (GRM); IOCTL or System Call; Linux Core GNI Core; IP over Gemini Fabric (IPoGIF); Gemini Hardware Abstraction Layer (GHAL). The layering between components is not recoverable from the extraction.]

Slide 14
  Latency
  Bandwidth
  Atomic operations

Slide 15
  Gemini expanded to HT3 at up to 5.2 GT/s
    Expect to sustain greater than 6 GB/s user data injection
  Network bandwidth is limited by XT packaging
    Link speed from 3.125 to 6.25 Gbit/sec
    In some cases, double wide X & Z links also offer increased bandwidth
  Gemini relies on user level threads
    MPI processing limits to 2M messages/sec per thread
    Scales beyond 10M msg/sec per NIC

Slide 16
  One way PUT in 750ns
  Waiting for Ack in only 1.1 us
  Remote GET increases to 1.4 us
  [Figure: Time (microsecs), 0.0 to 2.5, versus Size (bytes), 8 to 1024. Series: PUT, ping-pong; PUT, at source; GET.]

Slide 17
  Peak bandwidth reached with small transfers
  Multiple threads reach peak with smaller, still, transfers
  [Figure: Bandwidth (Mbytes/sec), 0 to 7000, versus Size (bytes), 8 to 64K. Series: PPN=1, PPN=2, PPN=4.]
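The message-rate and injection figures quoted above can be combined in a back-of-the-envelope model. This is a rough sketch under stated assumptions, not a measurement: it treats the quoted lower bounds (2M messages/sec per thread, 10M msg/sec per NIC, 6 GB/s injection) as hard caps and assumes throughput is simply the smaller of the rate limit and the bandwidth limit. The function names are hypothetical.

```python
import math

PER_THREAD_MSG_RATE = 2e6   # slides: 2M messages/sec per thread
NIC_MSG_RATE = 10e6         # slides: beyond 10M msg/sec per NIC
INJECTION_BW = 6e9          # slides: greater than 6 GB/s injection


def threads_to_saturate(nic_rate=NIC_MSG_RATE,
                        thread_rate=PER_THREAD_MSG_RATE):
    """Threads needed before the NIC rate, not a thread, is the limit."""
    return math.ceil(nic_rate / thread_rate)


def modeled_bandwidth(msg_bytes, threads):
    """Bytes/sec under the simple min(rate limit, bandwidth limit) model.

    ASSUMPTION: ignores protocol overhead and latency effects.
    """
    msg_rate = min(threads * PER_THREAD_MSG_RATE, NIC_MSG_RATE)
    return min(msg_rate * msg_bytes, INJECTION_BW)
```

In this model a single thread sending 64-byte messages is rate-limited to 128 MB/s, five threads are needed to reach the per-NIC message rate, and adding threads lets the bandwidth cap be reached at smaller message sizes.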
Slide 18
  Hot location reaches 100 Mupdates/sec
  Random locations (GUPS) still over 45 Mupdates/sec
  [Figure: AMO rate (millions), 0 to 120, versus Number of processes, 0 to 1024. Series: 1 AMO; 8192 AMOs.]

Slide 19
  Gemini provides low latency, and performance for fine grain operations
  Gemini has features to scale in performance and reliability to large system size
  Questions?