Building Network-Centric
Systems
Liviu Iftode
Before WWW, people were happy...
[Diagram: two sites, CS.umd.EDU and CS.rutgers.EDU, each running Emacs and NFS locally, connected over TCP/IP for E-mail and Telnet]
Mostly local computing
Occasional TCP/IP networking with low expectations and mostly
non-interactive traffic
 local area networks: file server (NFS)
 wide area networks (Internet): E-mail, Telnet, FTP
Networking was not a major concern for the OS
One Exception: Cluster Computing
Multicomputers
Clusters of computers
 Cost-effective solution for high-performance
distributed computing
 TCP/IP networking was the headache
 large software overheads
 Software DSM not a network-centric system :-(
The Great WWW Challenge
[Diagram: web browsing: a client fetches http://www.Bank.com over TCP/IP from the Bank.com server]
World Wide Web made access over the Internet easy
Internet became commercial
Dramatic increase of interactive traffic
WWW networking creates a network-centric system:
Internet server
 performance: service more network clients
 availability: be accessible all the time over the network
 security: protect resources against network attacks
Network-Centric Systems
Networking dominates the operating system
 Mobile Systems
 mobility-aware TCP/IP (Mobile IP, I-TCP, etc.), disconnected file
systems (Coda), adaptation-aware applications for mobility
(Odyssey), etc.
 Internet Servers
 resource allocation (Lazy Receive Processing, Resource
Containers), OS shortcuts (Scout, IO-Lite), etc.
 Pervasive/Ubiquitous Systems
 TinyOS, sensor networks (Directed Diffusion, etc.),
programmability (one.world, etc.)
 Storage Networking
 network-attached storage (NASD, etc.), peer-to-peer systems
(OceanStore, etc.), secure file systems (SFS, Farsite), etc.
Big Picture
 Research sparked by various OS-Networking
tensions
 Shift of focus from Performance to Availability
and Manageability
 Networking and Storage I/O Convergence
 Server-based and serverless systems
 TCP/IP and non-TCP/IP protocols
 Local area, wide-area, ad-hoc and
application/overlay networks
 Significant interest from industry
Outline
 TCP Servers
 Migratory-TCP and Service Continuations
 Cooperative Computing, Smart Messages and
Spatial Programming
 Federated File Systems
 Talk Highlights and Conclusions
Problem 1: TCP/IP is too Expensive
[Pie chart: breakdown of CPU time for Apache on a uniprocessor web server:
network processing 71%, user space 20%, other system calls 9%]
Traditional Send/Receive Communication
[Diagram: the sender application calls send(a); the OS copies a into send_buf
and DMAs it to the NIC. On the receiver, the NIC raises an interrupt, the OS
DMAs the packet into recv_buf and copies recv_buf into b to satisfy receive(b)]
A Closer Look
[Pie chart: detailed breakdown of the network processing share: TCP send 45%,
software interrupt processing 11%, hardware interrupt processing 8%, TCP
receive 7%, IP send 0%, IP receive 0%; plus user space 20% and other system
calls 9%]
Multiprocessor Server Performance
Does not Scale
[Graph: throughput (requests/s, up to 700) vs. offered load (300-750
connections/s) for uniprocessor and dual-processor configurations]
Apache web server 1.3.20 on 1-way and 2-way 300 MHz Pentium II SMP with clients
repeatedly accessing a static 16 KB file
TCP/IP-Application Co-Habitation
 TCP/IP “steals” compute cycles and memory from
applications
 TCP/IP executes in kernel-mode: mode switching
overhead
 TCP/IP executes asynchronously
 interrupt processing overhead
 internal synchronization on multiprocessor servers causes
execution serialization
 Cache pollution
 Hidden “Service-work”
 TCP packet retransmission
 TCP ACK processing
 ARP request service
 Extreme cases can compromise server performance
 Receive livelocks
 Denial-of-service (DoS) attacks
Two Solutions
 Replace TCP/IP with a lightweight transport protocol
 Offload some/all of the TCP from host to a dedicated
computing unit (processor, computer or “intelligent”
network interface)
 Industry: high-performance, expensive solutions
 Memory-to-Memory (M-M) communication: InfiniBand
 “Intelligent” network interface: TCP Offloading Engine (TOE)
 Cost-effective and flexible solution: TCP Servers
Memory-to-Memory(M-M) Communication
[Diagram: with TCP/IP, a send/receive traverses application, OS and NIC on
both sender and receiver; with M-M communication, a remote DMA moves data
directly between memory buffers through the NICs, bypassing the remote OS]
Memory-to-Memory Communication is Non-Intrusive
[Diagram: RDMA_Write(a,b) moves buffer a from the sender directly into buffer
b on the receiver, NIC to NIC; the sender pays a low overhead and the receiver
zero overhead, since b is updated without involving the receiving CPU]
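To make the contrast concrete, here is a minimal sketch of a one-sided RDMA
write using the InfiniBand verbs API (libibverbs). It assumes an already
connected queue pair and already registered memory regions (setup omitted);
local_buf, remote_addr and remote_rkey are illustrative parameter names.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Post a one-sided RDMA write: len bytes of the local registered buffer
     * are placed directly into remote memory. The remote CPU never takes an
     * interrupt -- this is the "zero overhead" receive path. */
    int rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                   void *local_buf, size_t len,
                   uint64_t remote_addr, uint32_t remote_rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) local_buf,
            .length = (uint32_t) len,
            .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED,  /* sender notified on completion */
        };
        wr.wr.rdma.remote_addr = remote_addr; /* address of buffer b */
        wr.wr.rdma.rkey        = remote_rkey; /* remote access key for b */
        struct ibv_send_wr *bad_wr;
        return ibv_post_send(qp, &wr, &bad_wr);
    }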
TCP Server at a Glance
 A software offloading architecture using existing hardware
 Basic idea: Dedicate one or more computing units
exclusively for TCP/IP
 Compared to TOE
track technology better: latest processors
flexible: adapt to changing load conditions
cost-effective: no extra hardware
 Isolate application computation from network processing
Eliminate network interrupts and context switches
Efficient resource allocation
Additional performance gains (zero-copy) with extended socket API
 Related work
Very preliminary offloading solutions: Piglet, CSP
Socket Direct Protocol, Zero-copy TCP
Two TCP Server Architectures
 TCP Servers for Multiprocessor Servers
[Diagram: on an SMP, one CPU runs the TCP-Server (TCP/IP) and another the
server application, communicating through shared memory]
 TCP Servers for Cluster-based Servers
[Diagram: in a cluster, a dedicated TCP-Server node runs TCP/IP and talks to
the server application node over M-M communication]
Where to Split TCP/IP Processing?
(How much to offload?)
[Diagram: the TCP/IP processing path between application processors (top) and
TCP Servers (bottom). Send side: system calls, copy_from_application_buffers,
TCP_send, IP_send, packet_scheduler, setup_DMA, packet_out. Receive side:
packet_in, interrupt_handler, software_interrupt_handler, IP_receive,
TCP_receive, copy_to_application_buffers. The offload split can be drawn at
any point along these paths]
Evaluation Testbed
 Multiprocessor Server
4-Way 550MHz Intel Pentium II system
running Apache 1.3.20 web server on Linux 2.4.9
 NIC: 3Com 996-BT Gigabit Ethernet
 Used sclients as a client program [Banga 97]
Comparative Throughput
[Bar chart: throughput (requests/sec, up to 3500) for four configurations:
uniprocessor, 4-processor SMP, SMP with 1 TCP Server, and SMP with 2 TCP
Servers]
Clients issue file requests according to a web server trace
Adaptive TCP Servers
 Static TCP Server configuration
Too few TCP Servers can lead to network
processing becoming the bottleneck
Too many TCP Servers lead to degradation in
performance of CPU intensive applications
 Dynamic TCP Server configuration
Monitor the TCP Server queue lengths and
system load
Dynamically add or remove TCP Server
processors
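A minimal sketch of what such a control loop might look like; every primitive
and threshold below is assumed for illustration, not taken from the TCP Server
implementation.

    /* Assumed primitives: */
    int    queue_len(void);              /* pending work in the TCP Server queues */
    double host_load(void);              /* application CPU utilization (0..1) */
    void   add_tcp_server_cpu(void);     /* dedicate one more CPU to TCP/IP */
    void   remove_tcp_server_cpu(void);  /* return a CPU to the application */
    void   sleep_ms(int ms);

    #define QLEN_HIGH 128                /* thresholds made up for the sketch */
    #define QLEN_LOW    8

    void adapt_tcp_servers(void)
    {
        for (;;) {
            if (queue_len() > QLEN_HIGH)
                add_tcp_server_cpu();    /* network processing is the bottleneck */
            else if (queue_len() < QLEN_LOW && host_load() > 0.9)
                remove_tcp_server_cpu(); /* CPU-intensive app needs the cycles */
            sleep_ms(100);               /* sampling period (assumed) */
        }
    }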
Next Target: Storage Networking
 Storage Networking dilemma: TCP offloading or not TCP?
 TCP/IP solution: iSCSI (SCSI over IP), which requires TCP offloading
 non-TCP/IP solutions: M-M communication (InfiniBand), DAFS (Direct Access
File System), which require new wiring or tunneling over IP-based Ethernet
networks
Future Work: TCP Servers & iSCSI
[Diagram: on an SMP, the server application CPU and a TCP-Server & iSCSI CPU
communicate through shared memory; the TCP-Server CPU runs iSCSI over TCP/IP
to reach SCSI storage]
 Use TCP-Servers to connect to SCSI storage using
iSCSI protocol over TCP/IP networks
Problem 2: TCP/IP is too Rigid
 Server vs. Service Availability
client interested in Service availability
 Adverse conditions may affect service availability
internetwork congestion or failure
servers overloaded, failed or under DoS attack
 TCP has one response
network delays => packet loss => retransmission
 TCP limits the OS solutions for service availability
early binding of service to a server
client cannot switch to another server for sustained
service after the connection is established
Service Availability through Migration
[Diagram: a client’s live connection to Server 1 migrates to Server 2]
Migratory TCP at a Glance
 Migratory TCP migrates live connections among
cooperative servers
 Migration mechanism is generic (not application specific)
lightweight (fine-grained migration) and low-latency
 Migration triggered by client or server
 Servers can be geographically distributed (different IP
addresses)
 Requires changes to the server application
 Totally transparent to the client application
 Interoperates with existing TCP
 Migration policies decoupled from migration mechanism
Basic Idea: Fine-Grained State Migration
[Diagram: the Server 1 process holds application state and connection state
for client connections C1-C6; the state of a single connection migrates to the
Server 2 process]
Migratory-TCP (Lazy) Protocol
[Diagram: message exchange among client, Server 1 and Server 2; per-connection
state is transferred lazily to Server 2 when the connection migrates]
Non-Intrusive Migration
 Migrate state without involving old-server application
(only old server OS)
 Old server exports per-connection state periodically
 Connection state and Application state can go out of
sync
 Upon migration, new server imports the last exported
state of the migrated connection
 OS uses connection state to synchronize with
application
 Non-intrusive migration with M-M communication
 uses RDMA read to extract state from the old server with
zero-overhead
 works even when the old server is overloaded or frozen
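Pulling the exported state with a one-sided read could look like the sketch
below (same caveats as the RDMA write sketch earlier: connection setup and
memory registration are omitted, and the parameter names are illustrative).

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Extract the last exported connection state from the old server with a
     * one-sided RDMA read; the old server's CPU is not involved, so this
     * works even if that server is overloaded or frozen. */
    int fetch_exported_state(struct ibv_qp *qp, struct ibv_mr *local_mr,
                             void *dst, size_t len,
                             uint64_t old_srv_addr, uint32_t old_srv_rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) dst,
            .length = (uint32_t) len,
            .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_READ,
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = old_srv_addr;  /* where the state was exported */
        wr.wr.rdma.rkey        = old_srv_rkey;
        struct ibv_send_wr *bad_wr;
        return ibv_post_send(qp, &wr, &bad_wr);
    }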
Service Continuation (SC)
[Diagram: a Service Continuation (SC) spans a front-end server process (which
owns the client socket for connection C1) and two back-end server processes
reached through pipes p1 and p2; connection state, pipe state and exported
application state together form the SC]

Front-end process:
    sc = create_cont(C1);
    p1 = pipe();
    associate(sc, p1);
    fork_exec(Process1);
    ...
    export(sc, state);

Back-end process 1:
    sc = open_cont(p1);
    ...
    export(sc, state);

Back-end process 2:
    sc = open_cont(p2);
    ...
    export(sc, state);
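The fragments above cover only the export side. Sketched as straight-line C,
with the restore path on the new server inferred from the "Non-Intrusive
Migration" slide (import_cont and the helper functions are assumed names, not
a documented API):

    typedef int sc_t;                     /* opaque SC handle (assumed type) */
    struct state { char buf[1024]; };     /* application-defined snapshot */

    /* SC API as used in the figure: */
    sc_t create_cont(int conn);
    sc_t open_cont(int pipe_fd);
    void associate(sc_t sc, int pipe_fd);
    void export(sc_t sc, struct state *st);
    /* Assumed helpers: */
    int  pipe_create(void);
    void fork_exec(const char *name);
    int  serve(int conn, struct state *st);
    sc_t import_cont(int conn, struct state *st); /* assumed: fetch last export */
    void resume(int conn, struct state *st);

    void old_server(int C1)               /* front-end on the original server */
    {
        struct state st;
        sc_t sc = create_cont(C1);        /* bind the SC to connection C1 */
        int p1 = pipe_create();
        associate(sc, p1);                /* pipe state becomes part of the SC */
        fork_exec("Process1");            /* back-end joins via open_cont(p1) */
        while (serve(C1, &st))
            export(sc, &st);              /* periodic per-connection snapshot */
    }

    void new_server(int C1)               /* after the connection migrates here */
    {
        struct state st;
        sc_t sc = import_cont(C1, &st);   /* last exported state of C1 */
        /* OS uses the connection state to resynchronize with the application */
        resume(C1, &st);
        (void) sc;
    }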
Related Work
 Process migration: Sprite [Douglis ‘91], Locus [Walker
‘83], MOSIX [Barak ‘98], etc.
 VM migration [Rosenblum ‘02, Nieh ‘02]
 Migration in web server clusters [Snoeren ‘00, Luo ‘01]
 Fault-tolerant TCP [Alvisi ‘00]
 TCP extensions for host mobility: I-TCP [Bakre ‘95],
Snoop TCP [Balakrishnan ‘95], end-to-end approaches
[Snoeren ‘00], MSOCKS [Maltz ‘98]
 SCTP (RFC 2960)
Evaluation
 Implemented SC and M-TCP in FreeBSD kernel
 Integrated SC in real Internet servers
web, media streaming, transactional DB
 Microbenchmark
impact of migration on client-perceived throughput
for a two-process server using TTCP
 Real applications
sustain web server throughput under load produced
by increasing the number of client connections
Impact of Migration on Throughput
[Graph: effective throughput (KB/s, roughly 7,300-8,000) vs. migration period
(no migration, 2 s, 5 s, 10 s) for SC sizes of 1 KB, 5 KB and 10 KB]
Web Server Throughput
[Graph: throughput (replies/s, up to 900, left axis) and migrated connections
(up to 12,000, right axis) vs. offered load (300-1,700 connections/s) for
M-Apache and unmodified Apache]
Future Research: Use SC to Build
Self-Healing Cluster-based Systems
[Diagram: service continuations SC2 and SC3 moving between nodes of a cluster]
Problem 3: Computer Systems move
Outdoors
[Images: sensors, a Linux watch, a Linux camera, a Linux car]
 Massive numbers of computers will be embedded
everywhere in the physical world
 Dynamic ad-hoc networking
 How to execute user-defined applications over these
networks?
Outdoor Distributed Computing
 Traditional distributed computing has been indoor
Target: performance and/or fault tolerance
Stable configuration, robust networking (TCP/IP or M-M)
Relatively small scale
Functionally equivalent nodes
Message passing or shared memory programming
 Outdoor Distributed Computing
Target: Collect/Disseminate distributed data and/or perform
collective tasks
Volatile nodes and links
Node equivalence determined by their physical properties
(content-based naming)
 Data migration is a poor fit
 end-to-end transfer control is expensive
 too rigid for such a dynamic network
Cooperative Computing at a Glance
 Distributed computing with execution migration
 Smart Message: carries the execution state (and
possibly the code) in addition to the payload
execution state assumed to be small (explicit migration)
code usually cached (few applications)
 Nodes “cooperate” by allowing Smart Messages
to execute on them
to use their memory to store “persistent” data (tags)
 Nodes do not provide routing
Smart Message executes on each node of its path
Application executed on target nodes (nodes of interest)
Routing executed on each node of the path (self-routing)
 During its lifetime, an application generates at least
one, possibly multiple, smart messages
Smart vs. “Dumb” Messages
[Diagram: Mary’s lunch (appetizer, entree, dessert) contrasted under data
migration vs. execution migration]
Smart Messages
[Diagram: an SM visits the nodes of interest tagged Hot, hop by hop]
SM Execution
Application:
    do {
        migrate(Hot_tag, timeout);
        Water_tag = ON;
        N = N + 1;
    } until (N == 3 or timeout);
Routing:
    migrate(tag, timeout) {
        do {
            if (NextHot_tag)
                sys_migrate(NextHot_tag, timeout);
            else {
                spawn_SM(Route_Discovery, Hot);
                block_SM(NextHot_tag, timeout);
            }
        } until (Hot_tag or timeout);
    }
Cooperative Node Architecure
[Diagram: cooperative node architecture: an arriving SM passes the admission
manager, is scheduled onto the virtual machine, and accesses the tag space and
OS & I/O before migrating onward]
 Admission control for resource security
 Non-preemptive scheduling with timeout-kill
 Tags created by SMs (limited lifetime) or I/O tags
(permanent)
global tag name space {hash(SM code), tag name}
five protection domains defined using hash(SM code), SM source
node ID, and SM starting time.
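As an illustration of that naming scheme, a global tag might be laid out like
the hypothetical C structure below (field names are invented; the slides only
specify the components):

    #include <stdint.h>
    #include <time.h>

    struct sm_tag {
        uint64_t code_hash;   /* hash(SM code) part of the global tag name */
        char     name[32];    /* tag name chosen by the application */
        uint32_t src_node;    /* SM source node ID (protection domain input) */
        time_t   started;     /* SM starting time (protection domain input) */
        time_t   lifetime;    /* SM-created tags expire; I/O tags are permanent */
    };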
Related Work
 Mobile agents (D’Agents, Ajanta)
 Active networks (ANTS, SNAP)
 Sensor networks (Diffusion, TinyOS, TAG)
 Pervasive computing (One.world)
Prototype Implementation
 8 HP iPAQs running Linux
 802.11 wireless communication
 Sun Java K Virtual Machine
 Geographic (simplified GPSR) and
On-Demand (AODV) routing
[Diagram: the SM travels from the user node through intermediate nodes to the
nodes of interest]

Completion time:
Routing algorithm     Code not cached (ms)    Code cached (ms)
Geographic (GPSR)     415.6                   126.6
On-demand (AODV)      506.6                   314.7
Self-Routing
 There is no best routing outdoors
Depends on application and node property dynamics
 Application-controlled routing
Possible with Smart Messages (execution state
carried in the message)
When migration times out, the application is upcalled
on the current node to decide what to do next
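A hypothetical sketch of the shape of that upcall; struct sm and every helper
below are illustrative, not the actual Smart Messages interface:

    enum routing_alg { GEOGRAPHIC, ON_DEMAND };

    struct sm {
        enum routing_alg alg;   /* routing algorithm currently in use */
        /* ... execution state, payload, spatial target ... */
    };

    /* Assumed predicates/actions: */
    int  near_target_region(struct sm *msg);
    void migrate_on_demand(struct sm *msg);
    void spawn_route_discovery(struct sm *msg);

    /* Invoked on the current node when sys_migrate() times out. */
    void on_migrate_timeout(struct sm *msg)
    {
        if (msg->alg == GEOGRAPHIC && near_target_region(msg)) {
            msg->alg = ON_DEMAND;        /* switch algorithms near the target,
                                            as in the simulation below */
            migrate_on_demand(msg);
        } else {
            spawn_route_discovery(msg);  /* try to learn a better next hop */
        }
    }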
Self-Routing Effectiveness (simulation)
• geographical routing to reach target regions
• on-demand routing within a region
• application decides when to switch between the two
[Figure: simulated network showing the starting node, nodes of interest and
other nodes]
Next Target: Spatial Programming
 Smart Messages: too low-level for programming
 How to describe distributed computing over dynamic
outdoor networks of embedded systems with limited
knowledge about resource number, location, etc.?
 Spatial Programming (SP) design guidelines:
space is a first-order programming concept
resources are named by their expected location and properties
(spatial reference)
reference consistency: spatial-reference-to-resource mappings
are consistent throughout the program
the program must tolerate resource dynamics
 SP can be implemented using Smart Messages (the
spatial reference mapping table carried as payload)
Spatial Programming Example
[Figure: Left Hill and Right Hill, with mobile sprinklers carrying temperature
sensors and a hot spot on the Left Hill]

Program the sprinklers to water the hottest spot of the Left Hill:

    for (i = 0; i < 10; i++)
        if ({Left_Hill:Hot}[i].temp > Max_temp) {
            Max_temp = {Left_Hill:Hot}[i].temp;
            id = i;
        }
    {Left_Hill:Hot}[id].water = ON;

{Left_Hill:Hot}[i] is the spatial reference for hot spots on the Left Hill.
Reference consistency guarantees that {Left_Hill:Hot}[id] still names the same
node when the sprinkler is turned on. What if there are fewer than 10 hot
spots?
Problem 4: Manageable Distributed File
Systems
 Most distributed file servers use TCP/IP both for
client-server and intra-server communication
 Strong file consistency, file locking and load balancing:
difficult to provide
 File servers require significant human effort to manage:
add storage, move directories, etc
 Cluster-based file servers are cost-effective
 Scalable performance requires load balancing
Load balancing may require file migration
File migration limited if file naming is location-dependent
 We need a scalable, location-independent and easy to
manage cluster-based distributed file system
Federated File System at a Glance
[Diagram: applications A1, A2 and A3 run across cluster nodes; FedFS sits
above each node’s local file system, and the nodes are joined by an M-M
interconnect]
Global file name space over a cluster of autonomous local file
systems interconnected by an M-M network
Location Independent Global File Naming
 Virtual Directory (VD): union of local directories
volatile, created on demand (dirmerge)
contains information about files including location (homes of files)
assigned dynamically to nodes (managers)
supports location independent file naming and file migration
 Directory Tables (DT): local caches of VD entries (~TLB)
[Diagram: the virtual directory /usr, containing file1 and file2, is the union
of the local directory /usr (file1) on local file system 1 and the local
directory /usr (file2) on local file system 2]
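A minimal sketch of the naming idea, with hypothetical structures and names
(the slides do not give FedFS’s actual data layout): the pathname hashes to a
fixed manager node, and only the manager’s entry changes when a file migrates,
so naming stays location independent.

    #include <stdint.h>

    #define NNODES 8            /* cluster size; matches the 8-node testbed below */

    /* Hypothetical virtual-directory entry kept by the manager. */
    struct vd_entry {
        char pathname[256];
        int  home;              /* node currently storing the file */
    };

    /* djb2 string hash over the pathname. */
    static uint32_t hash_path(const char *p)
    {
        uint32_t h = 5381;
        while (*p)
            h = h * 33 + (uint8_t) *p++;
        return h;
    }

    /* The manager node is fixed by the hash; Directory Tables cache its
     * vd_entry locally, much as a TLB caches page mappings. */
    int manager_node(const char *path)
    {
        return (int)(hash_path(path) % NNODES);
    }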
Federated DAFS
[Diagrams: three cluster file-service architectures.
 Distributed NFS over FedFS: each client runs Application / NFS Client /
TCP/IP and reaches NFS servers layered on FedFS over M-M local file systems.
 Direct Access File System (DAFS): each client runs Application / DAFS Client
and reaches a DAFS server directly over M-M.
 Federated DAFS: DAFS clients reach DAFS servers layered on FedFS, with both
client-server and intra-server communication over M-M]
Related Work
 Cluster-based File Systems
Frangipani [Thekkath ‘97], PVFS [Carns ‘00], GFS,
Archipelago [Ji ‘00], Trapeze (Duke)
 DAFS [NetApp ‘03, Magoutis ‘01,‘02,‘03]
 User-level communication in cluster-based network
servers [Carrera ‘02]
Experimental Platform
 Eight node server cluster
800 MHz PIII, 512 MB SDRAM, 9 GB 10K RPM
SCSI disk
 Client
Dual processor (300 MHz PII), 512 MB SDRAM
 Linux-2.4
 Servers and Clients equipped with Emulex cLAN
adapter (M-M network)
Workload I
 Postmark – Synthetic benchmark
Short-lived small files
Mix of metadata-intensive operations
 Postmark outline
Create a pool of files
Perform transactions – READ/WRITE paired with
CREATE/DELETE
Delete created files
 Each Postmark client performs 30,000 transactions
 Clients distribute requests to servers using a hash
function on pathnames
 Files are physically placed on the node that
receives the client’s requests
Postmark Throughput
[Graph: Postmark throughput (txns/sec, up to 30,000) vs. number of servers
(1-8) for file sizes of 2K, 4K, 8K and 16K]
Workload II
 Postmark performs only READ transactions
 no create/delete operations
 Federated DAFS does not control file placement
 client requests are not necessarily sent to the node holding the file
Postmark Read Throughput
[Graph: Postmark read throughput (txns/sec, up to 60,000) vs. number of
servers (2-8) for PostmarkRead and PostmarkRead-NoCache]
Next Target: Federated DAFS over
the Internet
[Diagram: DAFS clients access DAFS servers at multiple sites; each server runs
FedFS over an M-M local file system, and the sites are connected across the
Internet over TCP/IP]
Outline
 TCP Servers
 Migratory-TCP and Service Continuations
 Cooperative Computing, Smart Messages and
Spatial Programming
 Federated File Systems
 Talk Highlights and Conclusions
Talk Highlights
 Back to Migration
Service Continuation: service availability and self-healing clusters
Smart Messages: programming dynamic networks of embedded
systems
 Exploit Non-Intrusive M-M Communication
TCP offloading
State migration
Federated file systems
 Network and Storage I/O Convergence
TCP Servers & iSCSI
Federated File Systems & M-M
 Programmability
Smart Messages and Spatial Programming
Extended Server API: Service Continuation, TCP Servers,
Federated file system
Conclusions
 Network-Centric Systems: a very promising border-crossing systems research area
 Common issues for a large spectrum of systems and
networks
 Tremendous potential to impact industry
Acknowledgements
 UMD students: Andrzej Kochut, Chunyuan Liao, Tamer
Nadeem, Iulian Neamtiu and Jihwang Yeo.
 Rutgers students: Ashok Arumugam, Kalpana Banerjee,
Aniruddha Bohra, Cristian Borcea, Suresh Gopalakrisnan,
Deepa Iyer, Porlin Kang, Vivek Pathak, Murali Rangarajan,
Rabita Sarker, Akhilesh Saxena, Steve Smaldone, Kiran
Srinivasan, Florin Sultan and Gang Xu.
 Post-doc: Chalermek Intanagonwiwat
 Collaborations at Rutgers: EEL (Ulrich Kremer), DARK
(Ricardo Bianchini), PANIC (Rich Martin and Thu Nguyen)
 Support: NSF ITR ANI-0121416 and CAREER CCR-013366