TITAN: A Next-Generation Infrastructure for Integrating Computing and Communication
David E. Culler
Computer Science Division, U.C. Berkeley
NSF Research Infrastructure Meeting, Aug 7, 1999

Project Goal
• "Develop a new type of system which harnesses breakthrough communications technology to integrate a large collection of commodity computers into a powerful resource pool that can be accessed directly through its constituent nodes or through inexpensive media stations."
– SW architecture for global operating system
– programming language support
– advanced applications
– multimedia application development

Project Components
• Computational and Storage Core
– architecture
– operating systems
– compiler, language, and library
• High Speed Networking
• Multimedia Shell
• Driving Applications
The Building is the Computer

Use what you build, learn from use, ...
• Develop Enabling Systems Technology
• Develop Driving Applications

Highly Leveraged Project
• Large industrial contribution
– HP media stations
– Sun compute stations
– Sun SMPs
– Intel media stations
– Bay Networks ATM, Ethernet
• Enabled several federal grants
– NOW
– Titanium, Castle
– Daedalus, MASH
– DLIB
• Berkeley Multimedia Research Center

Landmarks
• Top 500 Linpack performance list
• MPI, NPB performance on par with MPPs
• RSA 40-bit key challenge
• World-leading external sort (Minute Sort)
• Inktomi search engine
• NPACI resource site
• Sustains 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth
[Chart: Minute Sort capacity, gigabytes sorted vs. number of processors, comparing the NOW against the SGI Origin and SGI Power Challenge.]

Sample of 98 Degrees from Titan
• Amin Vahdat: WebOS
• Steven Lumetta: Multiprotocol Communication
• Wendy Heffner: Multicast Communication Protocols
• Doug Ghormley: Global OS
• Andrea Dusseau: Implicit Co-scheduling
• Armando Fox: TACC Proxy Architecture
• John Byers: Fast, Reliable Bulk Communication
• Elan Amir: Media Gateway
• David Bacon: Compiler Optimization
• Kristen Wright: Scalable Web Cast
• Jeanna Neefe: xFS
• Steven Gribble: Web Caching
• Ian Goldberg: Wingman
• Eshwar Balani: WebOS Security
• Paul Gautier: Scalable Search Engines

Results
• Constructed three prototypes, culminating in a 100-processor UltraSparc NOW plus extensions
– GLUnix global operating system layer
– Active Messages, providing fast, general-purpose user-level communication (a toy sketch appears after the next slides)
– xFS cluster file system
– Fast Sockets, MPI, and SVM
– Titanium and Split-C parallel languages
– ScaLAPACK libraries
• Heavily used in department and external research
=> instrumental in establishing clusters as a viable approach to large-scale computing
=> transitioned to an NPACI experimental resource
• The killer app: scalable Internet services

First HP/FDDI Prototype
• FDDI on the HP/735 graphics bus
• First fast message layer on a non-reliable network

SparcStation ATM NOW
• ATM was going to take over the world.
• Myrinet SAN emerged
• The original INKTOMI

Technological Revolution
• The "Killer Switch"
– single-chip building block for scalable networks
– high bandwidth
– low latency
– very reliable
» if it's not unplugged
=> System Area Networks
• 8 bidirectional ports of 160 MB/s each way
• < 500 ns routing delay
• Simple: just moves the bits
• Detects connectivity and deadlock

100-node Ultra/Myrinet NOW
[Photo: the 100-processor UltraSparc/Myrinet NOW cluster.]
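The Active Messages layer listed under Results is the key communication enabler: every message carries a reference to a handler that runs when the message arrives at the receiver, so the fast path needs no system call. The single-process C toy below sketches only that handler-dispatch idea; am_send, am_poll, and the in-memory queues are invented stand-ins, not the actual Berkeley NOW API, which deposited messages directly into Myrinet NIC memory from user space.

    #include <stdio.h>

    typedef void (*am_handler_t)(int src, int arg);

    struct am_msg { int src; am_handler_t h; int arg; };

    #define QLEN 64
    static struct am_msg queue[2][QLEN];   /* one inbound queue per toy "node" */
    static int head[2], tail[2];

    /* deposit a message naming the handler to run at the destination */
    static void am_send(int src, int dst, am_handler_t h, int arg) {
        queue[dst][tail[dst]++ % QLEN] = (struct am_msg){ src, h, arg };
    }

    /* drain my queue, running each message's handler on arrival */
    static void am_poll(int me) {
        while (head[me] < tail[me]) {
            struct am_msg *m = &queue[me][head[me]++ % QLEN];
            m->h(m->src, m->arg);
        }
    }

    static void reply_handler(int src, int arg) {
        printf("reply %d arrived from node %d\n", arg, src);
    }

    static void request_handler(int src, int arg) {
        am_send(1, src, reply_handler, arg + 1);  /* this toy handler runs on node 1 */
    }

    int main(void) {
        am_send(0, 1, request_handler, 41);  /* node 0 asks node 1 */
        am_poll(1);                          /* node 1 services the request */
        am_poll(0);                          /* node 0 receives the reply */
        return 0;
    }

In the real system the queues live in network interface memory and polling is driven by the communication layer, but the control flow is the same: a request handler runs at the receiver and typically answers with a reply message.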
NOW System Architecture
[Diagram: parallel and large sequential applications run over Sockets, Split-C, MPI, HPF, and SVM on a Global Layer UNIX (resource management, network RAM, distributed files, process migration); each UNIX workstation carries communication software and network interface hardware, tied together by a fast commercial switch (Myrinet).]

Software Warehouse
• Coherent software environment throughout the research program
– billions of bytes of code
• Mirrored externally
• New SWW-NT

Multi-Tier Networking Infrastructure
• Myrinet cluster interconnect
• ATM backbone
• Switched Ethernet
• Wireless

Multimedia Development Support
• Authoring tools
• Presentation capabilities
• Media stations
• Multicast support / MBone

Novel Cluster Designs
• Tertiary Disk
– very low cost massive storage
– hosts archive of the Museum of Fine Arts
• Pleiades Clusters
– functionally specialized storage and information servers
– constant backup and restore at large scale
– NOW tore apart traditional AUSPEX servers
• CLUMPS
– clusters of SMPs with multiple NICs per node

Massive Cheap Storage
• Basic unit: 2 PCs double-ending four SCSI chains
• Currently serving fine art at http://www.thinker.org/imagebase/

Information Servers
• Basic storage unit:
– Ultra 2, 300 GB RAID, 800 GB tape stacker, ATM
– scalable backup/restore
• Dedicated info servers
– web
– security
– mail, ...
• VLANs project into the department

Cluster of SMPs (CLUMPS)
• Four Sun E5000s
– 8 processors each
– 3 Myricom NICs each
• Multiprocessor, multi-NIC, multi-protocol

Novel Systems Design
• Virtual networks
– integrate communication events into the virtual memory system
• Implicit co-scheduling
– cause local schedulers to co-schedule parallel computations, using a two-phase spin-block protocol and observed round-trip times
• Cooperative caching
– access remote caches rather than local disk, and enlarge global cache coverage by simple cooperation
• Reactive scalable I/O
• Network virtual memory, Fast Sockets
• ISAAC "active" security
• Internet server architecture
• TACC proxy architecture

Fast Communication
[Chart: LogP microbenchmark parameters (latency L, send/receive overheads Os and Or, gap g), in µs, for the Ultra NOW, Paragon, Meiko, and SS10 NOW.]
• Fast communication on clusters is obtained through direct access to the network, as on MPPs
• The challenge is to make this general purpose
– the system implementation should not dictate how it can be used

Virtual Networks
• An endpoint abstracts the notion of being "attached to the network"
• A virtual network is a collection of endpoints that can name each other.
• Many processes on a node can each have many endpoints, each with its own protection domain. (A toy sketch of the endpoint idea follows.)
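One way to picture the endpoint abstraction is as a small, self-contained block of communication state that the NIC can hold or evict, much like a page of virtual memory (the subject of the next slide). The C sketch below is a toy illustration under that reading; every field and constant is invented for illustration, not taken from the actual NOW data structures.

    #include <stdint.h>
    #include <stdio.h>

    #define FRAMES 8   /* matches the "Frame 0 .. Frame 7" on the NIC slide */

    struct frame { uint8_t payload[256]; int full; };

    /* One endpoint: a named attachment point on a virtual network. */
    struct endpoint {
        uint32_t vnet_id;         /* which virtual network it belongs to        */
        uint32_t key;             /* protection tag checked on every operation  */
        struct frame tx[FRAMES];  /* transmit staging frames                    */
        struct frame rx[FRAMES];  /* receive staging frames                     */
        int resident;             /* bound to NIC memory now, or paged to host? */
    };

    int main(void) {
        /* one process may hold several endpoints, each in its own
         * protection domain, on different virtual networks */
        struct endpoint ctrl = { .vnet_id = 7,  .key = 0xBEEF, .resident = 1 };
        struct endpoint bulk = { .vnet_id = 12, .key = 0xCAFE, .resident = 0 };
        printf("ctrl: vnet %u, resident=%d\n", (unsigned)ctrl.vnet_id, ctrl.resident);
        printf("bulk: vnet %u, resident=%d\n", (unsigned)bulk.vnet_id, bulk.resident);
        return 0;
    }

The resident flag is the point of contact with the next slide: only the active subset of a large logical space of endpoints is bound to scarce NIC memory at any moment.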
How Are They Managed?
• How do you get direct hardware access for performance with a large space of logical resources?
• Just like virtual memory
– the active portion of the large logical space is bound to physical resources
[Diagram: processes 1..n in host memory each hold endpoints; NIC memory caches the active ones, backed by the network interface.]

Network Interface Support
• NIC has endpoint frames (transmit and receive, frames 0-7)
• Services active endpoints
• Signals misses to the driver
– using a system endpoint

Communication under Load
[Chart: aggregate messages/s (up to ~80,000) vs. number of virtual networks, for client/server message bursts of 1024 to 16384 msgs and continuous traffic.]
=> Use of networking resources adapts to demand.
=> VIA (or improvements on it) needs to become widespread.

Implicit Coscheduling
• Problem: parallel programs are designed to run in parallel => huge slowdowns under purely local scheduling
– gang scheduling is rigid, fault-prone, and complex
• Coordinate schedulers implicitly, using the communication already in the program
– very easy to build, robust to component failures
– inherently "service on demand", scalable
– the local service component can evolve
[Diagram: application A across nodes, under gang schedulers (GS) vs. local schedulers (LS).]

Why It Works
• Infer non-local state from local observations; react to maintain coordination
• observation => implication => action
– fast response => partner scheduled => spin
– delayed response => partner not scheduled => block (sleep)
[Diagram: jobs A and B on workstations WS 1-4 trading request/response, spinning on fast replies and blocking on slow ones.]
(A toy spin-block sketch appears below, after the CSCW slide.)

I/O Lessons from NOW Sort
• A complete system on every node is a powerful basis for data-intensive computing
– complete disk subsystem
– independent file systems
» MMAP not read, MADVISE
– full OS => threads
• Remote I/O (with fast communication) provides the same bandwidth as local I/O.
• I/O performance is very temperamental
– variations in disk speeds
– variations within a disk
– variations in processing, interrupts, messaging, ...

Reactive I/O
• Loosen data semantics
– e.g., an unordered bag of records
• Build flows from producers (e.g., disks) to consumers (e.g., summation)
• Flow data to where it can be consumed
[Diagram: static parallel aggregation pins each disk (D) to one aggregator (A); adaptive parallel aggregation feeds a distributed queue so any aggregator can drain any disk.]

Performance Scaling
[Charts: % of peak I/O rate vs. nodes, and vs. perturbed nodes, for adaptive vs. static aggregation.]
• Allows more data to go to the faster consumer

Driving Applications
• Inktomi search engine
• World-record disk-to-disk sort
• RSA 40-bit key
• IRAM simulations, turbulence, AMR, linear algebra
• Parallel image processing
• Protocol verification, Tempest, Bio, global climate, ...
• Multimedia work drove network-aware transcoding services on demand
– parallel software-only video effects
– TACC (transcoding) proxy
» Transcend
» Wingman
– MBONE media gateway

Transcend Transcoding Proxy
[Diagram: service requests enter a front end and are dispatched to service threads across physical processors, coordinated by a manager with caches and a user-profile database.]
• The application provides services to clients
• Grows/shrinks according to demand, availability, and faults

UCB CSCW Class
• Sigh... no multicast, no bandwidth, no CSCW class...
• Problem: enable heterogeneous sets of participants to seamlessly join MBone sessions.
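To make the observation/implication/action table on the "Why It Works" slide concrete, here is a toy two-phase wait in C: spin for roughly two round-trip times, and block only if the reply is late. reply_arrived(), the 2x threshold, and the 1 ms nap are illustrative placeholders; the thresholds in the actual NOW work were derived from measured round-trip and context-switch costs.

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    static int reply_arrived(void) { return 0; }   /* stand-in for an endpoint poll */

    static double micros_since(const struct timespec *t0) {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return (t.tv_sec - t0->tv_sec) * 1e6 + (t.tv_nsec - t0->tv_nsec) / 1e3;
    }

    /* Two-phase waiting: spin for about a round-trip, then block. */
    static void wait_for_reply(double rtt_us) {
        struct timespec t0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        while (!reply_arrived()) {   /* fast reply => partner scheduled: spinning pays */
            if (micros_since(&t0) > 2.0 * rtt_us) {
                /* delayed reply => partner probably descheduled: block and
                 * yield the CPU (a real system would sleep until the
                 * message interrupt, not for a fixed nap) */
                struct timespec nap = { 0, 1000000 };   /* 1 ms */
                nanosleep(&nap, NULL);
                return;
            }
        }
    }

    int main(void) {
        wait_for_reply(20.0);   /* assume a ~20 us SAN round-trip */
        printf("no reply within ~2 RTTs; blocked instead of burning the CPU\n");
        return 0;
    }

The policy needs no global coordinator: each node reacts only to what it can observe locally, which is why the slide calls the coordination implicit.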
A Solution: Media Gateways
• Software agents that enable local processing (e.g., transcoding) and forwarding of source streams
• Offer the isolation of a local rate controller for each source stream
• Controlling bandwidth allocation and format conversion for each source prevents link saturation and accommodates heterogeneity (a toy rate-control sketch closes this transcript)
[Diagram: two gateways (GW) mediating between session participants.]

A Solution: Media Gateways (cont.)
• Sigh... no multicast, no bandwidth, no MBone... AHA! An MBone media gateway!
[Diagram: the MBone reached through a media gateway (GW).]

FIAT LUX: Bringing It All Together
• Combines
– image-based modeling and rendering,
– image-based lighting,
– dynamics simulation, and
– global illumination
in a completely novel fashion to achieve unprecedented levels of scientific accuracy and realism
• Computing requirements
– 15 days' worth of time for development
– 5 days for rendering the final piece
– 4 days for rendering at HDTV resolution on 140 processors
• Storage
– 72,000 frames, 108 gigabytes of storage
– 7.2 GB after motion blur
– 500 MB JPEG
• Premiered at the SIGGRAPH 99 Electronic Theater
– http://fiatlux.berkeley.edu/
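Circling back to the media gateway slides: the per-source rate controller amounts to a bandwidth-budget check on each stream, transcoding a stream down whenever its input rate exceeds the allocation it was given on the constrained link. The C toy below is purely illustrative; the names, rates, and budgets are invented, not taken from the actual gateway code.

    #include <stdio.h>

    struct source { const char *name; double in_kbps; double budget_kbps; };

    /* forward at the source's own rate if it fits, else transcode down
     * to the budget so the sum over sources stays under the link rate */
    static double forward_rate(const struct source *s) {
        return s->in_kbps <= s->budget_kbps ? s->in_kbps : s->budget_kbps;
    }

    int main(void) {
        /* e.g., two MBone video sources entering a 128 kbps access link */
        struct source srcs[] = { { "speaker", 300.0, 100.0 },
                                 { "slides",   50.0,  28.0 } };
        double total = 0;
        for (int i = 0; i < 2; i++) {
            double out = forward_rate(&srcs[i]);
            total += out;
            printf("%s: %.0f -> %.0f kbps\n", srcs[i].name, srcs[i].in_kbps, out);
        }
        printf("link load: %.0f kbps (saturation avoided)\n", total);
        return 0;
    }

Because the budget check is applied per source at the gateway, a low-bandwidth participant is isolated from any one sender's burst, which is the heterogeneity argument the slides make.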