Transcript
The OptIPuter - Implications of Network Bandwidth on System Design
CHEP'03, University of California, San Diego, March 25, 2003
Dr. Philip M. Papadopoulos, Program Director, Grid and Cluster Computing, SDSC; Co-PI, OptIPuter

Agenda
• Some key technology trends in applications, computing, and networking
• Description and research goals of the OptIPuter
• Initial networking plans
• Key questions on system and network design
• Conclusions

Why Optical Networks Are Emerging as the 21st Century Driver for the Grid
• George Stix, Scientific American, January 2001
• Parallel lambdas provide the raw capacity to drive and change the relationship of computer and network

OptIPuter Inspiration -- Node of a 2009 PetaFLOPS Supercomputer
[Block diagram: multiple 40 GFLOPS, 10 GHz VLIW/RISC cores, each with an 8 MB second-level cache (24 bytes wide, 240 GB/s), connected through a crossbar to 16 GB of highly interleaved DRAM (640 GB/s) and a 5 Terabit/s link to a multi-lambda optical network. Updated from Steve Wallach, Supercomputing 2000 keynote.]

Global Architecture of a 2009 COTS PetaFLOPS System
[Diagram: 64 multi-die, multi-processor boxes (128 die/box, 4 CPU/die) interconnected by an all-optical switch; 10 meters of path = 50 nanoseconds of delay; I/O to LAN/WAN; systems become Grid enabled. Source: Steve Wallach, Supercomputing 2000 keynote.]

The Biomedical Informatics Research Network: a Multi-Scale Brain Imaging Federated Repository
• BIRN test-beds: multiscale mouse models of disease, human brain morphometrics, and FIRST BIRN (a 10-site project for fMRIs of schizophrenics)
• NIH plans to expand to other organs and many laboratories

GEON's Data Grid Team Has Strong Overlap with BIRN and OptIPuter
• Learning from the BIRN project, the GEON Grid:
– Heterogeneous networks, compute nodes, storage
– Deploy Grid and cluster software across GEON
– Peer-to-peer information fabric for sharing data, tools, and compute resources
• NSF ITR grant, $11.25M, 2002-2007; two science "testbeds"; broad range of geoscience data sets
Source: Chaitan Baru, SDSC, Cal-(IT)2

NSF's EarthScope Rollout Over 14 Years, Starting With Existing Broadband Stations

Data Intensive Scientific Applications Require Experimental Optical Networks
• Large data challenges in neuro and earth sciences
– Each data object is 3D and gigabytes in size
– Data are generated and stored in distributed archives
– Research is carried out on the federated repository
• Requirements
– Computing: PC clusters
– Communications: dedicated lambdas over fiber
– Data: large peer-to-peer lambda-attached storage
– Visualization: collaborative volume algorithms
• Response: the OptIPuter research project

OptIPuter Software Research
• Near-term goals:
– Build software to support applications with traditional models
– High-speed IP protocol variations (RBUDP, SABUL, ...)
– Switch control software for DWDM management and dynamic setup
– Distributed configuration management for OptIPuter systems
• Long-term goals:
– A system model that supports Grid, single-system, and multi-system views
– Architectures that can harness high-speed DWDM and exploit flexible dispersion of data and computation
– New communication abstractions and data services
– Make lambda-based communication easily usable
– Use DWDM to enable a uniform performance view of storage
Source: Andrew Chien, UCSD

Photonic Data Services & OptIPuter
6. Data-intensive applications (UCI)
5a. Storage (UCSD)
5b. Data services -- SOAP, DWTP, ... (UIC/LAC)
4. Transport -- TCP, UDP, SABUL, ... (USC, UIC)
3. IP
2. Photonic path services -- ODIN, THOR, ... (NW)
1. Physical
Source: Robert Grossman, UIC/LAC

OptIPuter is Exploring Quanta as High Performance Middleware
• Quanta is a high-performance networking toolkit / API
• Quanta uses Reliable Blast UDP (RBUDP):
– Assumes an over-provisioned or dedicated network; excellent for photonic networks, but don't try this on the commodity Internet!
– It is fast and very predictable; a prediction equation lets you estimate performance in advance
– It is most suited to transferring very large payloads
• RBUDP, SABUL, and Tsunami are all similar protocols that use UDP for bulk data transfer (see the sketch after this slide)
Source: Jason Leigh, UIC
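As published by its EVL authors, RBUDP blasts the payload over UDP at a fixed rate and uses a separate TCP control channel through which the receiver reports which segments are missing, so only those are re-sent. The sketch below illustrates the sender side of that pattern only; the chunk size, wire framing, and function names are illustrative assumptions, not Quanta's actual API.

```python
# A minimal, illustrative sketch of the Reliable Blast UDP pattern: blast
# fixed-size chunks over UDP, then learn the missing chunk indices over a TCP
# control channel and re-blast only those, repeating until nothing is missing.
# Sender side only; all constants and framing are assumptions for illustration.
import socket
import struct

CHUNK = 8192                 # assumed UDP payload size per datagram
U32 = struct.Struct("!I")    # 4-byte big-endian integer framing

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the TCP control channel."""
    buf = b""
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("control channel closed")
        buf += part
    return buf

def rbudp_send(payload: bytes, host: str, udp_port: int, tcp_port: int) -> None:
    chunks = [payload[i:i + CHUNK] for i in range(0, len(payload), CHUNK)]
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    ctl = socket.create_connection((host, tcp_port))  # reliable control channel
    ctl.sendall(U32.pack(len(chunks)))                # announce chunk count
    missing = list(range(len(chunks)))
    while missing:
        for idx in missing:                           # blast (or re-blast) data
            udp.sendto(U32.pack(idx) + chunks[idx], (host, udp_port))
        ctl.sendall(b"DONE")                          # end-of-blast marker
        n_lost = U32.unpack(_recv_exact(ctl, 4))[0]   # receiver reports losses
        missing = [U32.unpack(_recv_exact(ctl, 4))[0] for _ in range(n_lost)]
    ctl.close()
    udp.close()
```

Because the only per-chunk cost is a UDP send, throughput on a dedicated lambda is dominated by the link rate and the handful of control round trips, which is why the performance is so predictable on over-provisioned networks and so poor on a congested commodity path.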
XCP Is a New Congestion Control Scheme That Is Good for Gigabit Flows
• Better than TCP
– Almost never drops packets
– Converges to the available bandwidth very quickly, in roughly one round-trip time (a sketch of the router feedback computation follows this slide)
– Fair over large variations in flow bandwidth and RTT
• Supports existing TCP semantics
– Replaces only congestion control; reliability is unchanged
– No change to the application/network API
• Status
– To date: simulations and a SIGCOMM paper (MIT). See Dina Katabi, Mark Handley, and Charles Rohrs, "Congestion Control for High Bandwidth-Delay Product Networks," ACM SIGCOMM 2002, August 2002. http://ana.lcs.mit.edu/dina/XCP/
– Current: developing the protocol and an implementation; extending simulations (ISI)
Source: Aaron Falk, Joe Bannister, ISI/USC
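In XCP, routers compute an aggregate feedback value once per average round-trip time from their spare bandwidth and persistent queue, then hand it out to packets as explicit window adjustments, which is why flows converge in about one RTT instead of probing slowly. The toy sketch below shows only that efficiency-controller step; the alpha/beta constants are the values suggested in the Katabi et al. paper, while the function name, units, and example numbers are assumptions, not a real implementation.

```python
# Toy sketch of the XCP efficiency controller: aggregate feedback is
# proportional to spare bandwidth over one RTT, minus a term that drains any
# persistent queue.  Positive feedback tells senders to grow their windows;
# negative feedback tells them to shrink.
ALPHA = 0.4    # gain on spare bandwidth (value suggested in the XCP paper)
BETA = 0.226   # gain on persistent queue (value suggested in the XCP paper)

def aggregate_feedback(capacity_bps: float,
                       input_rate_bps: float,
                       persistent_queue_bits: float,
                       avg_rtt_s: float) -> float:
    """Bits of window adjustment to distribute over the next control interval."""
    spare = capacity_bps - input_rate_bps
    return ALPHA * avg_rtt_s * spare - BETA * persistent_queue_bits

# Example: a 1 Gb/s link carrying 600 Mb/s with a small standing queue and a
# 100 ms average RTT yields a large positive feedback (~16 Mbits), so flows
# are told to ramp up within roughly one RTT.
if __name__ == "__main__":
    phi = aggregate_feedback(1e9, 600e6, 50e3, 0.1)
    print(f"aggregate feedback this interval: {phi / 1e6:.1f} Mbits")
```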
From SuperComputers to SuperNetworks: Changing the Grid Design Point
• The TeraGrid is optimized for computing
– 1024-node IA-64 Linux cluster
– Assume 1 GigE per node = 1 Terabit/s of I/O within the Grid
– Grid optical connection: 4 x 10 Gb lambdas = 40 Gigabit/s
– The optical connections are only 4% of the bisection bandwidth
• The OptIPuter is optimized for bandwidth
– 32-node IA-64 Linux cluster
– Assume 1 GigE per processor = 32 Gigabit/s of I/O within the Grid
– Grid optical connection: 4 x 10 GigE = 40 Gigabit/s
– The optical connections are over 100% of the bisection bandwidth
– Grow the network capacity to stay close to full bisection (the arithmetic is sketched below)
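The percentages above follow directly from the figures quoted on the slide, treating the aggregate of the per-node GigE interfaces as the cluster's bisection bandwidth, as the slide does. A short worked version, where the function name is just an illustrative label:

```python
# Worked version of the bisection-bandwidth comparison above: how much of the
# cluster's aggregate NIC bandwidth the wide-area lambdas can actually carry.
def wan_fraction_of_bisection(nodes: int, nic_gbps: float, wan_gbps: float) -> float:
    """Wide-area capacity as a fraction of aggregate per-node NIC bandwidth."""
    return wan_gbps / (nodes * nic_gbps)

teragrid = wan_fraction_of_bisection(nodes=1024, nic_gbps=1.0, wan_gbps=40.0)
optiputer = wan_fraction_of_bisection(nodes=32, nic_gbps=1.0, wan_gbps=40.0)
print(f"TeraGrid:  {teragrid:.0%}")   # ~4%  -- the WAN is the bottleneck
print(f"OptIPuter: {optiputer:.0%}")  # 125% -- the WAN exceeds the endpoints
```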
Convergence of Networking Fabrics
• Today's computer room
– A router for external communications (WAN)
– An Ethernet switch for internal networking (LAN)
– Fibre Channel for internal networked storage (SAN)
• Tomorrow's Grid room
– A unified architecture of LAN/WAN/SAN switching
– More cost effective: one network element instead of many
– One sphere of scalability; all resources are Grid enabled
– Layer 3 switching and addressing throughout
Source: Steve Wallach, Chiaro Networks

Who is OptIPuter?
• Larry Smarr, UCSD CSE, PI
– Mark Ellisman, Co-PI, UCSD School of Medicine (neuroscience applications)
– Philip Papadopoulos, Co-PI, San Diego Supercomputer Center (experimental systems)
– Tom DeFanti, Co-PI, UIC (all-optical exchanges and wide-area networking)
– Jason Leigh, Co-PI, UIC (high-speed graphics systems)
• Other key institutions
– University of Southern California, ISI (network protocols, Grid software): Joe Bannister, Carl Kesselman
– UCSD/SDSC/SIO (data, middleware, computers in the arts, optics, systems, security): Chaitan Baru, Andrew Chien, Sheldon Brown, Sadik Esener, Shaya Fainman, John Orcutt, Graham Kent, Ron Graham, Greg Hidley, Sid Karin, Paul Siegel, Rozeanne Steckler
– SDSU (GIS systems): Eric Frost
– UIC (data systems, networks, visualization): Bob Grossman, Tom Moher, Alan Verlo
– UCI (data systems, real-time computing): Kane Kim, Padhraic Smyth
– Northwestern University (performance analysis, networking): Joel Mambretti, Valerie Taylor

What is OptIPuter?
• It is a large NSF ITR project funded at $13.5M from 2002 to 2007
• Fundamentally, we ask the question: what happens to the structure of machines and programs when the network becomes essentially "infinite"?
• The project is coupled tightly with key applications to keep the IT research grounded and focused
• Individual researchers are investigating software and structure from the physical (photonic) layer through network protocols, middleware, and applications
• We are building, in phases, two high-capacity networks with associated modest-sized endpoints
– The experimental apparatus allows investigation at various levels; "breaking the network" is expected
– Start small (only 4 gigabits per clustered endpoint) and grow to 400 gigabits per clustered endpoint in 2007
– UCSD is building a packet-based (traditional) network (Mod-0)
– UIC is building an all-lambda network (Mod-1)

The OptIPuter 2003 Experimental Network: Wide Array of Vendors
[Diagram: Calient lambda switches now installed at StarLight (UIC) and NetherLight. A 64x64 MEMS optical switch at StarLight and a 128x128 MEMS optical switch at NetherLight carry the data plane (8-16 GigE links) between a 16-processor cluster and an 8-processor cluster, with a "groomer" at each exchange and 10 Gbps OC-192 circuits between them; switch/routers carry the control plane. GigE = Gigabit Ethernet (Gbps connection type). Source: Maxine Brown.]

The UCSD OptIPuter Deployment: UCSD is Building Out a High-Speed Packet-Switched Network
[Campus map: Phase I (Fall 2002) connects SDSC, the SDSC Annex, JSOE (Engineering), CRCA (Arts), SOM (Medicine), and the Preuss High School collocation point to a central Chiaro router, with a production router uplink to CENIC; Phase II (2003) adds Chemistry, Physical Sciences/Keck, Sixth College, the undergraduate colleges, the Node M collocation point, and SIO Earth Sciences. Scale: about half a mile. Source: Phil Papadopoulos, SDSC; Greg Hidley, Cal-(IT)2.]

OptIPuter LambdaGrid Enabled by Chiaro Networking Router (www.calit2.net/news/2002/11-18-chiaro.html)
[Diagram: site switches for medical imaging and microscopy; chemistry, engineering, and the arts; the San Diego Supercomputer Center; and the Scripps Institution of Oceanography, all attached to a central Chiaro Enstara. Traffic patterns of interest: cluster-disk, disk-disk, viz-disk, database-cluster, and cluster-cluster. Image source: Phil Papadopoulos, SDSC.]

Nodes and Networks
• Clustered endpoints, where each node has a gigabit interface on the OptIPuter network
– Linux Red Hat 7.3, managed with the NPACI Rocks clustering toolkit
– Visualization clusters and immersive visualization theaters
– Specialized instruments such as light and electron microscopes
• Nodes are plugged into a "supercheap" Dell 5254 24-port GigE copper switch with 4 fiber uplinks (~$2K). Link aggregation is supported, giving us 4 gigabits per site today
– Target: 40 gigabits per site in 2004, off campus at 10 gigabits
– 400 gigabits per site in 2006, off campus at ?? gigabits
• The "center" of the UCSD network is serial #1 of a Chiaro Enstara router
– From our viewpoint it provides effectively unlimited capacity (6 Terabits if fully provisioned today)
– It can scale to more than 2000 gigabit endpoints today
– Packets and IP: the ability to route at wire speed across hundreds of 10 GigE interfaces or thousands of standard GigE interfaces
– "Gold plated" in terms of expense and size (we got a really good deal from Chiaro)

Chiaro's Enstara™ Summary
• Scalable capacity
– 6 Tb/s initial capacity
– GigE and OC-192 interfaces
– A "soft" forwarding plane with network processors for maximum flexibility
• Full protocol suite
– Unicast: BGP, OSPF, IS-IS
– Multicast: PIM, MBGP, MSDP
– MPLS: RSVP-TE, LDP, FRR
• Stateful Assured Routing (STAR™)
– Provides service continuity during maintenance and fault-management actions
– Stateful protocol protection extended to BGP, IS-IS, OSPF, multicast, and MPLS
• Partitions
– An abstraction permitting multiple logical classes of routers to be managed as if they were separate physical routers
– Each partition has its own CLI, SNMP, security, and routing protocol instances

Where Chiaro Sits in the Landscape
[Chart: switching fabrics plotted by port count versus switching speed. Lambda-switching technologies (MEMS, bubble, lithium niobate) operate at millisecond speeds; electrical fabrics switch packets at nanosecond speeds; the Chiaro optical phased array combines a large port count with packet switching speeds.]

The Center of the UCSD OptIPuter Network
http://132.239.26.190/view/view.shtml

Optical Phased Array: Multiple Parallel Optical Waveguides
[Diagram: an input optical fiber feeds 128 parallel GaAs waveguides (WG #1 through WG #128) that steer light to the output fibers.]

Chiaro Has a Scalable, Fully Fault-Tolerant Architecture
• Significant technical innovation
– The OPA fabric enables a large port count
– Global arbitration provides guaranteed performance
– A fault-tolerant control system provides nonstop performance
• Smart line cards
– ASICs with programmable network processors
– Software downloads for features and standards evolution
[Diagram: network-processor line cards connected electrically to a global arbiter and optically to the Chiaro OPA fabric.]

Chain of Events Leading to the Recent OptIPuter All-Hands Meeting Last Month
• The hardest thing in networking: getting fiber in the ground and terminated
– 4-pair single-mode fiber per site
– Fiber terminated on Monday; fiber "polishing" took several days
– First light / first packet Wednesday at about 6 pm
– Ran Linpack at about 6:05 pm
• Currently only two sites are connected
– A Linux cluster at each site as a baseline; sizes, capability, and architecture vary as needs (and dollars) change
– 4 x 1 GigE @ CSC
– 4 x 1 GigE @ SDSC
– Additional interfaces and sites being added next week (31 March)
• Chiaro: serial #1 production router. We have the "tiny" size
– One redundant pair of optical cores: 640 gigabits
– We have alpha/beta GigE blades from Chiaro; the OptIPuter is getting these 3-4 months ahead of schedule
– We chose multiple physically striped GigE links because of the cost at the site end and to start parallel from the beginning
– From the endpoint view, Chiaro works just like a standard router

Netperf Numbers (Data at First Light)
• Two streams deliver more than 1 gigabit/s aggregate, SDSC -> CSC
• netperf between a 2 x 1 GHz Pentium III node and a 2 x 2.2 GHz Pentium 4 node
[Chart: bandwidth in Mb/s for Stream 1, Stream 2, and the aggregate, plotted against message size in bytes from 0 to 70,000; the y-axis runs to 1800 Mb/s.]

Linpack Numbers
• 4-processor Linpack, run two ways: through the local copper GigE switch and through the Chiaro router (a quick check of the quoted peak rate follows below)
[Chart: Linpack on 4 CPUs / 4 nodes (17.6 Gflops peak for four 2.2 GHz Pentium 4s); solve time in seconds and achieved GigaFlops versus matrix dimension from 1000 to 20,000, for the local GigE and Chiaro cases.]
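The 17.6 Gflops peak quoted for the Linpack run is consistent with the usual assumption that a Pentium 4 can retire two double-precision floating-point operations per cycle; that assumption, and the variable names, are illustrative rather than taken from the slide.

```python
# Quick check of the "17.6 Gflops peak" figure for the 4-CPU Linpack run,
# assuming 2 double-precision FLOPs per cycle per CPU.
cpus = 4
clock_ghz = 2.2
flops_per_cycle = 2          # assumption about the Pentium 4 core

peak_gflops = cpus * clock_ghz * flops_per_cycle
print(f"theoretical peak: {peak_gflops:.1f} Gflops")   # 17.6 Gflops

# Linpack efficiency is then measured Gflops divided by this peak, which is
# how the GigaFlops axis on the chart relates to the 17.6 figure.
```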
Some Fundamental Questions Already Showing Up
• Lambdas and all-optical switches (e.g. 3D MEMS) are cheaper than the monster router
– Roughly 10x cheaper than current 10 GigE routers on a per-interface basis, which makes a compelling case for investigation
– However, because MEMS are mechanical, they act more like switchboards that can effectively be "rewired" only about 10 times per second
– For comparison, Chiaro "rewires" its OPA at roughly 1 MHz
• Question: where do you expose lambdas (circuits)?
– Are lambdas only a way to physically multiplex fiber?
– Do we expose lambdas to endpoints and build "lambda NICs" (and hence lambda-aware applications)?
– That implies a hybrid network from the endpoint's perspective, with both circuit-switched and packet-switched modes
– Even with massive bandwidth, larger networks will have some congestion points; circuits can create "expressways" through these choke points

Where the Rubber Meets the Road: Applications
• The OptIPuter is a concerted effort to explore the effects of enormous bandwidth on the structure of machines, protocols, and applications
• Distributed applications are generally written to conserve bandwidth and network resources
– Emerging technology moves congestion points from the core to the endpoints
– It will take some time for software engineering practice to factor in this fundamental change
– In the limit, applications should worry about only two things: the latency to a remote resource and the capacity of that resource to fulfill requests
• Things that once looked immovable (like a terabyte of storage) become practically accessible from a remote site

Summary
• The OptIPuter is a large 5-year research project funded by NSF, driven by observations of crossing technology exponentials
• It explores how to build networks with bandwidth-matched endpoints; we are building experimental apparatus that will let us test these ideas
• Key driving applications keep the IT research focused; these applications also evolve as new capability appears
• There are several key IT research topics within the OptIPuter; each is central to how the structure of computers and computing will change over the next decade