TITAN: A Next-Generation Infrastructure for Integrating Computing and Communication
David E. Culler
Computer Science Division
U.C. Berkeley
NSF Research Infrastructure Meeting
Aug 7, 1999
Project Goal:
• “Develop a new type of system which harnesses
breakthrough communications technology to
integrate a large collection of commodity
computers into a powerful resource pool that
can be accessed directly through its constituent
nodes or through inexpensive media stations.”
– SW architecture for global operating system
– programming language support
– advanced applications
– multimedia application development
Aug, 1999
NSF RI 99
Project Components
• Computational and Storage Core
  – architecture
  – operating systems
  – compiler, language, and library
• High Speed Networking
• Multimedia Shell
• Driving Applications
The Building is the Computer
Use what you build, learn from use,...
• Develop Enabling Systems Technology
• Develop Driving Applications
Highly Leveraged Project
• Large industrial contribution
  – HP media stations
  – Sun compute stations
  – Sun SMPs
  – Intel media stations
  – Bay Networks ATM, Ethernet
• Enabled several federal grants
  – NOW
  – Titanium, Castle
  – Daedalus, Mash
  – DLIB
• Berkeley Multimedia Research Center
Landmarks
• Top 500 Linpack Performance List
• MPI, NPB performance on par with MPPs
• RSA 40-bit Key challenge
• World Leading External Sort: sustains 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth
• Inktomi search engine
• NPACI resource site
[Figure: Minute Sort, gigabytes sorted (0-9) vs. processors (0-100); NOW compared with SGI Origin and SGI Power Challenge]
Sample of 98 Degrees from Titan
• Amin Vahdat: WebOS
• Steven Lumetta: Multiprotocol Communication
• Wendy Heffner: Multicast Communication Protocols
• Doug Ghormley: Global OS
• Andrea Dusseau: Implicit Co-scheduling
• Armando Fox: TACC Proxy Architecture
• John Byers: Fast, Reliable Bulk Communication
• Elan Amir: Media Gateway
• David Bacon: Compiler Optimization
• Kristen Wright: Scalable web cast
• Jeanna Neefe: xFS
• Steven Gribble: Web caching
• Ian Goldberg: Wingman
• Eshwar Balani: WebOS security
• Paul Gautier: Scalable Search Engines
Results
• Constructed three prototypes, culminating in a 100-processor UltraSparc NOW + three extensions
– GLUnix global operating system layer
– Active Messages providing fast, general purpose user-level
communication
– xFS cluster file system
– Fast sockets, MPI, and SVM
– Titanium and Split-C parallel languages
– ScaLapack libraries
• Heavily used in dept. and external research
=> instrumental in establishing clusters as a viable
approach to large scale computing
=> transitioned to an NPACI experimental resource
• The Killer App: Scalable Internet Services
First HP/FDDI Prototype
• FDDI on the HP/735 graphics bus
• First fast message layer on an unreliable network
SparcStation ATM NOW
• ATM was going to take over the world.
• Myrinet SAN emerged.
The original INKTOMI
Technological Revolution
• The “Killer Switch”
  – single chip building block for scalable networks
  – high bandwidth
  – low latency
  – very reliable
    » if it’s not unplugged
=> System Area Networks
• 8 bidirectional ports of 160 MB/s each way
• < 500 ns routing delay
• Simple: just moves the bits
• Detects connectivity and deadlock
100 node Ultra/Myrinet NOW
NOW System Architecture
[Diagram: NOW system architecture]
• Applications: Parallel Apps, Large Seq. Apps
• Programming layers: Sockets, Split-C, MPI, HPF, SVM
• Global Layer UNIX: Resource Management, Network RAM, Distributed Files, Process Migration
• On each node: UNIX Workstation, Comm. SW, Net Inter. HW
• Fast Commercial Switch (Myrinet)
Software Warehouse
• Coherent software environment throughout the research program
  – Billions of bytes of code
• Mirrored externally
• New SWW-NT
Multi-Tier Networking Infrastructure
• Myrinet Cluster Interconnect
• ATM backbone
• Switched Ethernet
• Wireless
Multimedia Development Support
• Authoring tools
• Presentation capabilities
• Media stations
• Multicast support / MBone
Novel Cluster Designs
• Tertiary Disk
– very low cost massive storage
– hosts archive of Museum of Fine Arts
• Pleiades Clusters
– functionally specialized storage and information servers
– constant back-up and restore at large scale
– NOW tore apart traditional AUSPEX servers
• CLUMPS
– cluster of SMPs with multiple NICs per node
Massive Cheap Storage
• Basic unit: 2 PCs double-ending four SCSI chains
Currently serving Fine Art at http://www.thinker.org/imagebase/
Information Servers
• Basic Storage Unit:
– Ultra 2, 300 GB raid, 800 GB
tape stacker, ATM
– scalable backup/restore
• Dedicated Info Servers
  – web, security, mail, …
• VLANs project into dept.
Cluster of SMPs (CLUMPS)
• Four Sun E5000s
– 8 processors
– 3 Myricom NICs
• Multiprocessor, MultiNIC, Multi-Protocol
Novel Systems Design
• Virtual networks
– integrate communication events into virtual memory system
• Implicit Co-scheduling
– cause local schedulers to co-schedule parallel computations
using a two-phase spin-block and observing round-trip
• Co-operative caching
– access remote caches, rather than local disk, and enlarge
global cache coverage by simple cooperation
• Reactive Scalable I/O
• Network virtual memory, fast sockets
• ISAAC “active” security
• Internet Server Architecture
• TACC Proxy architecture
Fast Communication
[Figure: LogP communication parameters (Os, Or, L, g) in µs, 0-16, compared across Paragon, Meiko, SS10 NOW, and Ultra NOW]
• Fast communication on clusters is obtained through direct access to the network, as on MPPs
• The challenge is to make this general purpose
  – the system implementation should not dictate how it can be used
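One way to picture general-purpose, user-level communication is Active-Message-style dispatch: each message carries the index of a handler that runs on arrival, so the messaging layer moves bits without dictating how they are interpreted. This is only an illustrative sketch, not the actual Active Messages API; all names here are invented.

```python
# Sketch of Active-Message-style dispatch (invented names, not the real API).
class Endpoint:
    def __init__(self):
        self.handlers = []      # handler table: index -> callable
        self.inbox = []         # stands in for the NIC receive queue

    def register(self, fn):
        """Install a handler; its table index travels inside each message."""
        self.handlers.append(fn)
        return len(self.handlers) - 1

    def send(self, dest, handler_id, payload):
        # The message names its own handler, so the layer stays policy-free.
        dest.inbox.append((handler_id, payload))

    def poll(self):
        """Drain the inbox, dispatching each message to its named handler."""
        while self.inbox:
            handler_id, payload = self.inbox.pop(0)
            self.handlers[handler_id](payload)

a, b = Endpoint(), Endpoint()
received = []
h = b.register(received.append)   # remote handler just records the payload
a.send(b, h, "hello")
b.poll()
```

Because the handler table lives with the receiver, new uses (RPC, bulk transfer, synchronization) are added by registering handlers, not by changing the layer.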
Virtual Networks
• An endpoint abstracts the notion of being “attached to the network”.
• A virtual network is a collection of endpoints that can name each other.
• Many processes on a node can each have many endpoints, each with its own protection domain.
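The naming-as-protection idea above can be sketched as follows: an endpoint can only address peers inside its own virtual network, so isolation falls out of the namespace. This is a toy model with invented names, not the project's implementation.

```python
# Toy model of virtual networks (invented names): endpoints in different
# virtual networks cannot name, and therefore cannot message, each other.
class VirtualNetwork:
    def __init__(self):
        self.endpoints = {}            # name -> endpoint record

    def attach(self, name):
        ep = {"net": self, "name": name, "inbox": []}
        self.endpoints[name] = ep
        return ep

def send(src, dest_name, msg):
    net = src["net"]
    if dest_name not in net.endpoints:   # protection: no cross-network names
        raise PermissionError("endpoint not in this virtual network")
    net.endpoints[dest_name]["inbox"].append(msg)

vn1, vn2 = VirtualNetwork(), VirtualNetwork()
a = vn1.attach("a")
b = vn1.attach("b")
c = vn2.attach("c")
send(a, "b", "ok")       # same virtual network: delivered
```

Here protection needs no per-message permission check beyond the name lookup, which mirrors why the abstraction composes with direct hardware access.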
How are they managed?
• How do you get direct hardware access for
performance with a large space of logical
resources?
• Just like virtual memory
  – the active portion of a large logical space is bound to physical resources
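The virtual-memory analogy can be sketched as a tiny LRU binding of logical endpoints to a fixed set of NIC frames; a miss stands in for the signal to the driver. Frame count, names, and the LRU policy are all invented for illustration.

```python
# Hedged sketch: the active subset of a large logical endpoint space is
# bound on demand to a fixed number of NIC "frames", like pages to a TLB.
from collections import OrderedDict

class NicFrames:
    def __init__(self, n_frames):
        self.frames = OrderedDict()     # endpoint id -> bound frame state
        self.n_frames = n_frames
        self.misses = 0

    def access(self, ep_id):
        if ep_id in self.frames:
            self.frames.move_to_end(ep_id)          # hit: mark recently used
        else:
            self.misses += 1                         # miss: signal the driver
            if len(self.frames) >= self.n_frames:
                self.frames.popitem(last=False)      # evict LRU endpoint
            self.frames[ep_id] = f"state-{ep_id}"    # bind endpoint to frame
        return self.frames[ep_id]

nic = NicFrames(n_frames=2)
for ep in [0, 1, 2, 0, 9, 0]:        # more endpoints than frames
    nic.access(ep)
```

With two frames and six accesses, only the final access to endpoint 0 hits; every other access either cold-misses or re-binds an evicted endpoint, which is exactly the working-set behavior the analogy predicts.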
[Diagram: host memory holds endpoint state for processes 1..n; the processor and NIC memory bind the active endpoints to the network interface]
Network Interface Support
• NIC has endpoint frames (Frame 0 … Frame 7), each with transmit and receive areas
• Services active endpoints
• Signals misses to driver
  – using a system endpoint
Communication under Load
[Figure: aggregate msgs/s (up to ~80,000) vs. number of virtual networks (1-28), for clients sending bursts of 1024-16384 msgs and continuous traffic through servers]
=> Use of networking resources adapts to demand.
=> VIA (or improvements on it) needs to become widespread
Implicit Coscheduling
[Diagram: gang scheduling (GS) runs application A simultaneously on every node, while independent local schedulers (LS) run A uncoordinated]
• Problem: parallel programs are designed to run in parallel => huge slowdowns under uncoordinated local scheduling
  – gang scheduling is rigid, fault prone, and complex
• Coordinate schedulers implicitly using the
communication in the program
– very easy to build, robust to component failures
– inherently “service on-demand”, scalable
– Local service component can evolve.
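The coordination rule can be sketched as the two-phase spin-block decision: after communicating, spin for roughly an expected round-trip; a fast reply implies the partner is scheduled (keep spinning), a late one implies it is not (block and yield). The threshold policy below is invented for illustration, not the measured one.

```python
# Illustrative two-phase spin-block rule (threshold policy is invented).
def spin_block(observed_rtt_us, expected_rtt_us, spin_factor=2.0):
    """Decide what a local scheduler should do after observing a round-trip.

    Spinning keeps the process on the CPU while its partner appears
    coscheduled; blocking releases the CPU when the partner seems absent.
    """
    spin_budget = spin_factor * expected_rtt_us
    if observed_rtt_us <= spin_budget:
        return "spin"      # fast response => partner scheduled
    return "block"         # delayed response => partner not scheduled

# A reply near the expected round-trip keeps the job spinning...
action_fast = spin_block(observed_rtt_us=30, expected_rtt_us=20)
# ...while a long delay makes it block, freeing the node for other work.
action_slow = spin_block(observed_rtt_us=500, expected_rtt_us=20)
```

Because every node applies the same local rule, coordination emerges without any global gang scheduler, which is why the scheme tolerates component failures.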
Why it works
• Infer non-local state from local observations
• React to maintain coordination
observation | implication | action
fast response | partner scheduled | spin
delayed response | partner not scheduled | block

[Diagram: workstations WS 1-4 running Jobs A and B; a request from Job A on WS 2 draws a spin, sleep, or block response on its partners depending on whether Job A is currently scheduled there]
I/O Lessons from NOW sort
• A complete system on every node is a powerful basis for data intensive computing
  – complete disk sub-system
  – independent file systems
    » MMAP not read, MADVISE
  – full OS => threads
• Remote I/O (with fast comm.) provides the same bandwidth as local I/O.
• I/O performance is very temperamental
  – variations in disk speeds
  – variations within a disk
  – variations in processing, interrupts, messaging, ...
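The "MMAP not read, MADVISE" lesson can be sketched in a few lines: map the input instead of issuing read() calls, and tell the OS the access pattern so it can prefetch aggressively. The file contents here are a stand-in for a sort input; the `MADV_SEQUENTIAL` hint is guarded because it is not available on every platform.

```python
# Minimal sketch of mmap + madvise instead of read() for a sequential scan.
import mmap
import os
import tempfile

# Create a stand-in input file of fixed-size text records.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"record\n" * 1024)
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    if hasattr(mmap, "MADV_SEQUENTIAL"):      # platform-dependent hint
        mm.madvise(mmap.MADV_SEQUENTIAL)      # one sequential pass expected
    n_records = mm[:].count(b"\n")            # scan through the mapping
    mm.close()
os.remove(path)
```

The point of the lesson is control: with a mapping plus an advice hint, the application describes its access pattern once instead of fighting the read-ahead heuristics behind read().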
Reactive I/O
• Loosen data semantics
  – ex: unordered bag of records
• Build flows from producers (e.g. disks) to consumers (e.g. summation)
• Flow data to where it can be consumed
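The flow idea above can be sketched with a shared bag of records: producers deposit records in no particular order, and each consumer pulls at its own rate, so a faster consumer automatically receives more work. The rates and record names are invented for illustration.

```python
# Sketch of reactive flows: unordered bag semantics let work follow capacity.
from collections import deque

queue = deque(f"rec{i}" for i in range(90))   # bag of records, order-free
consumed = {"fast": 0, "slow": 0}
rates = {"fast": 2, "slow": 1}                # fast consumer drains 2x faster

while queue:
    for name, rate in rates.items():
        # Any record will do (unordered-bag semantics), so each consumer
        # simply takes as many as its current speed allows.
        for _ in range(min(rate, len(queue))):
            queue.popleft()
            consumed[name] += 1
```

With ordered semantics a slow consumer would stall its assigned partition; loosening the semantics to an unordered bag is what makes this rebalancing legal.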
[Diagram: disks (D) feed aggregators (A) either through fixed pairings (static parallel aggregation) or through a distributed queue (adaptive parallel aggregation)]
Performance Scaling
[Figure: % of peak I/O rate (0-100%) vs. nodes perturbed (0-15), comparing static and adaptive aggregation]
• Allows more data to go to the faster consumer
Driving Applications
• Inktomi Search Engine
• World Record Disk-to-Disk Sort
• RSA 40-bit key
• IRAM simulations, Turbulence, AMR, Lin. Alg.
• Parallel image processing
• Protocol verification, Tempest, Bio, Global Climate, ...
• Multimedia Work Drove Network Aware Transcoding Services on Demand
  – Parallel Software-only Video Effects
  – TACC (transcoding) Proxy
    » Transcend
    » Wingman
  – MBONE media gateway
Transcend Transcoding Proxy
[Diagram: service requests enter a front-end; a manager maps service threads onto physical processors and caches, consulting a user profile database]
• Application provides services to clients
• Grows/shrinks according to demand, availability, and faults
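The grow/shrink behavior can be sketched as a sizing rule a manager might apply: target enough workers to keep the backlog per worker bounded, never exceeding the surviving nodes. The policy, parameter names, and thresholds are all invented; the slide does not specify the actual control law.

```python
# Invented sizing policy sketch for a manager that grows/shrinks a worker
# pool with demand, availability, and faults.
def target_workers(backlog, per_worker=10, alive_nodes=8, floor=1):
    """Workers needed to keep roughly `per_worker` queued requests each.

    `alive_nodes` caps growth at the nodes that have not failed; `floor`
    keeps a minimum presence so the service can absorb the next burst.
    """
    need = -(-backlog // per_worker)           # ceiling division
    return max(floor, min(need, alive_nodes))  # clamp to what exists

idle = target_workers(backlog=0)        # shrink toward the floor when idle
busy = target_workers(backlog=35)       # grow with demand
capped = target_workers(backlog=500)    # bounded by surviving nodes
```

Running such a rule periodically (and re-reading `alive_nodes` each time) is one simple way a pool could track both load and faults.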
UCB CSCW Class
“Sigh… no multicast, no bandwidth, no CSCW class...”
Problem: enable heterogeneous sets of participants to seamlessly join MBone sessions.
A Solution: Media Gateways
• Software agents that enable local processing (e.g. transcoding) and forwarding of source streams.
• Offer the isolation of a local rate-controller for each source stream.
• Controlling bandwidth allocation and format conversion for each source prevents link saturation and accommodates heterogeneity.
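A per-source rate controller of the kind described can be sketched as a token bucket at the gateway: each source stream spends tokens to forward packets, so no single source can saturate the downstream link. The parameters and the drop-versus-transcode choice are invented for illustration.

```python
# Sketch of a per-source rate controller at a media gateway (token bucket;
# all parameters invented).
class RateController:
    def __init__(self, bytes_per_tick, burst):
        self.rate = bytes_per_tick    # sustained allocation for this source
        self.burst = burst            # short-term burst allowance
        self.tokens = burst           # start with a full bucket

    def tick(self):
        """Refill the bucket once per clock tick, capped at the burst size."""
        self.tokens = min(self.burst, self.tokens + self.rate)

    def forward(self, packet_len):
        """Forward the packet only if this source is within its allocation."""
        if packet_len <= self.tokens:
            self.tokens -= packet_len
            return True
        return False   # over budget: drop, or transcode to a smaller format

rc = RateController(bytes_per_tick=100, burst=200)
sent = sum(rc.forward(150) for _ in range(3))   # only the first packet fits
rc.tick()                                       # allocation refills over time
```

Because each source has its own bucket, one aggressive stream exhausts only its own tokens; the format-conversion path then gives the gateway a gentler option than dropping.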
[Diagram: media gateways (GW) interposed between source streams and receivers]
A Solution: Media Gateways
[Diagram: a client lamenting “Sigh… no multicast, no bandwidth, no MBone...” joins the MBone through a Media GW: “AHA!”]
FIAT LUX: Bringing it all together
• Combines
  – Image Based Modeling and Rendering,
  – Image Based Lighting,
  – Dynamics Simulation, and
  – Global Illumination in a completely novel fashion to achieve unprecedented levels of scientific accuracy and realism
• Computing Requirements
  – 15 days' worth of time for development
  – 5 days for rendering the final piece
  – 4 days for rendering in HDTV resolution on 140 processors
• Storage
  – 72,000 frames, 108 gigabytes of storage
  – 7.2 GB after motion blur
  – 500 MB JPEG
• Premiered at the SIGGRAPH 99 Electronic Theater
  – http://fiatlux.berkeley.edu/