Tapestry: Scalable and Fault-tolerant Routing and Location
Stanford Networking Seminar, October 2001
Ben Y. Zhao <[email protected]>

Challenges in the Wide-area
- Trends: exponential growth in CPU and storage; the network is expanding in reach and bandwidth
- Can applications leverage these new resources?
  - Scalability: increasing users, requests, traffic
  - Resilience: more components → correspondingly low MTBF
  - Management: intermittent resource availability → complex management schemes
- Proposal: an infrastructure that solves these issues and passes the benefits on to applications

Driving Applications
- Leverage cheap and plentiful resources: CPU cycles, storage, network bandwidth
- Global applications share distributed resources
  - Shared computation: SETI, Entropia
  - Shared storage: OceanStore, Gnutella, Scale-8
  - Shared bandwidth: application-level multicast, content distribution networks

Key: Location and Routing
- Hard problem: locating and messaging to resources and data
- Goals for a wide-area overlay infrastructure:
  - Easy to deploy
  - Scalable to millions of nodes, billions of objects
  - Available in the presence of routine faults
  - Self-configuring, adaptive to network changes
  - Localize the effects of operations and failures

Talk Outline
- Motivation
- Tapestry overview
- Fault-tolerant operation
- Deployment / evaluation
- Related / ongoing work

What is Tapestry?
- A prototype of a decentralized, scalable, fault-tolerant, adaptive location and routing infrastructure (Zhao, Kubiatowicz, Joseph et al., U.C. Berkeley)
- Network layer of OceanStore
- Routing: suffix-based hypercube, similar to Plaxton, Rajaraman, Richa (SPAA '97)
- Decentralized location: a virtual hierarchy per object, with cached location references
- Core API:
  - publishObject(ObjectID, [serverID])
  - routeMsgToObject(ObjectID)
  - routeMsgToNode(NodeID)

Routing and Location
- Namespace (nodes and objects): 160 bits → roughly 2^80 names before a name collision is expected
- Each object has its own hierarchy rooted at its root node: f(ObjectID) = RootID, via a dynamic mapping function
- Suffix routing from A to B: at the hth hop, arrive at the nearest node hop(h) such that hop(h) shares a suffix of length h digits with B
  - Example: 5324 routes to 0629 via 5324 → 2349 → 1429 → 7629 → 0629
- Object location: the root is responsible for storing the object's location; publish and search both route incrementally toward the root

Publish / Lookup
- Publish object with ObjectID:

      // route toward the "virtual root", ID = ObjectID
      for (i = 0; i < Log2(N); i += j) {    // define hierarchy
          // j = # of bits per digit (e.g., for hex digits, j = 4)
          insert entry into the nearest node matching on the last i bits;
          if (no match found)
              deterministically choose an alternative;
      }
      // the real root node is reached when no external routes remain

- Lookup object: traverse the same path toward the root as publish, but search for an entry at each node

      for (i = 0; i < Log2(N); i += j) {
          search for a cached object location;
      }
      // once found, route via IP or Tapestry to the object

Tapestry Mesh
[Figure: incremental suffix-based routing mesh; nodes labeled with IDs such as 0x79FE, 0x23FE, 0x43FE, ..., 0x1290, with edges numbered 1-4 by the routing level (suffix digit) they resolve]

Routing in Detail
- Example: octal digits, 2^12 namespace, routing from 5712 to 7510
[Figure: neighbor map for node 5712 (octal), one column per routing level; the route 5712 → 0880 → 3210 → 4510 → 7510 resolves one more suffix digit of the destination at each hop]

Object Location: Randomization and Locality
[Figure-only slide]

Fault-tolerant Location
- Minimized soft-state vs. explicit fault-recovery
- Redundant roots:
  - Object names hashed with small salts → multiple names/roots per object
  - Queries and publishing use all roots in parallel
  - P(finding a reference despite a partition) = 1 − (1/2)^n, where n = # of roots
- Soft-state periodic republish:
  - 50 million files/node, daily republish, b = 16, N = 2^160, 40 B/msg
  - Worst-case update traffic: 156 kb/s; expected traffic with 2^40 real nodes: 39 kb/s

Fault-tolerant Routing
- Strategy: detect failures via soft-state probe packets; route around a problematic hop via backup pointers
- Handling:
  - 3 forward pointers per outgoing route (2 backups)
  - Second-chance algorithm for intermittent failures
  - Upgrade backup pointers to replace failed primaries
- Protocols:
  - First Reachable Link Selection (FRLS)
  - Proactive duplicate packet routing

Summary
- Decentralized location and routing infrastructure
  - Core routing similar to PRR97
  - Distributed algorithms for object-root mapping and node insertion/deletion
  - Fault handling with redundancy, soft-state beacons, and self-repair
- Decentralized and scalable, with locality
- Analytical properties:
  - Per-node routing table size: b·Log_b(N), where N = size of the namespace
  - Find an object in Log_b(n) overlay hops, where n = # of physical nodes

Deployment Status
- Java implementation in OceanStore
  - Running static Tapestry
  - Deploying dynamic Tapestry with fault-tolerant routing
- Packet-level simulator
  - Delay measured in network hops; no cross traffic or queuing delays
  - Topologies: AS, MBone, GT-ITM, TIERS
- ns2 simulations

Evaluation Results
- Cached object pointers: efficient lookup for nearby objects, with reasonable storage overhead
- Multiple object roots: improve availability under attack; improve performance and performance stability
- Reliable packet delivery: redundant pointers approximate optimal reachability; FRLS, a simple fault-tolerant UDP protocol

First Reachable Link Selection
- Use periodic UDP packets to gauge link condition
- Packets are routed to the shortest "good" link
- Assumes IP cannot correct its routing table in time for packet delivery
[Figure: nodes A through E; IP finds no path to the destination while Tapestry delivers via an alternate overlay path]

Bayeux
- Global-scale application-level multicast (NOSSDAV 2001)
- Scalability: scales to > 10^5 nodes; self-forming member group partitions
- Fault tolerance: multicast root replication; FRLS for resilient packet delivery
- More optimizations: group ID clustering for better bandwidth utilization

Bayeux: Multicast
[Figure: multicast tree rooted at node 79FE; packets fan out along Tapestry routes toward receivers]

Bayeux: Tree Partitioning
[Figure: JOIN messages from receivers are partitioned across replicated multicast roots]

Overlay Routing Networks
- CAN: Ratnasamy et al. (ACIRI / UCB)
  - A d-dimensional coordinate space implements a distributed hash table; route to the neighbor closest to the destination coordinate
  - Pros: fast insertion/deletion; constant-sized routing state
  - Cons: unconstrained # of hops; overlay distance not proportional to physical distance
- Chord: Stoica, Morris, Karger et al. (MIT / UCB)
  - A linear namespace modeled as a circular address space; a "finger table" points to a logarithmic # of increasingly remote hosts
  - Pros: simplicity in algorithms; fast fault-recovery; Log2(N) hops and routing state
  - Cons: overlay distance not proportional to physical distance
- Pastry: Rowstron and Druschel (Microsoft / Rice)
  - Hypercube routing similar to PRR97; objects replicated to servers by name
  - Pros: fast fault-recovery; Log(N) hops and routing state
  - Cons: data replication required for fault tolerance

Ongoing Research
- Fault-tolerant routing: Reliable Overlay Networks (MIT); Fault-tolerant Overlay Routing (UCB)
- Application-level multicast: Bayeux (UCB), CAN (AT&T), Scribe and Herald (Microsoft)
- File systems: OceanStore (UCB), PAST (Microsoft / Rice), Cooperative File System (MIT)

For More Information
- Tapestry: http://www.cs.berkeley.edu/~ravenben/tapestry
- OceanStore: http://oceanstore.cs.berkeley.edu
- Related papers: http://oceanstore.cs.berkeley.edu/publications and http://www.cs.berkeley.edu/~ravenben/publications
- [email protected]
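The suffix-routing rule from the "Routing and Location" slide (at hop h, reach a node sharing an h-digit suffix with the destination) can be sketched in Python. This is a toy simulation, not Tapestry's implementation: the per-node neighbor links below are hypothetical, chosen so the walk reproduces the slide's example route from 5324 to 0629.

```python
# Toy sketch of Tapestry-style suffix routing (illustrative only).

def suffix_len(a: str, b: str) -> int:
    """Number of trailing digits two IDs have in common."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def route(source: str, dest: str, neighbors: dict) -> list:
    """Route toward dest, matching one more suffix digit at every hop."""
    path, current = [source], source
    while current != dest:
        h = suffix_len(current, dest) + 1        # next suffix length to match
        nxt = [n for n in neighbors.get(current, [])
               if suffix_len(n, dest) >= h]
        if not nxt:
            raise RuntimeError(f"{current}: no neighbor matches a {h}-digit suffix")
        current = nxt[0]   # a real mesh would pick the closest such neighbor
        path.append(current)
    return path

# Hypothetical neighbor links reproducing the slide's example route.
neighbors = {
    "5324": ["2349"],
    "2349": ["1429"],
    "1429": ["7629"],
    "7629": ["0629"],
}
print(" -> ".join(route("5324", "0629", neighbors)))
# prints 5324 -> 2349 -> 1429 -> 7629 -> 0629
```

Each hop fixes at least one more trailing digit of the destination, which is why the hop count is bounded by the number of digits in an ID.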
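The availability figure on the "Fault-tolerant Location" slide, P = 1 − (1/2)^n for n salted roots, assumes each root remains reachable through a partition independently with probability 1/2. A quick numerical check of how fast this converges:

```python
# Probability that at least one of n independent salted roots is reachable,
# under the slide's model of a 1/2 per-root survival probability.
def p_find(n: int, p_root: float = 0.5) -> float:
    return 1.0 - (1.0 - p_root) ** n

for n in (1, 2, 4, 8):
    print(n, p_find(n))
# Four salted roots already give a 0.9375 chance of finding a reference.
```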
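The Summary slide's analytical properties can be checked numerically. With the slides' parameters (hex digits, so b = 16, and a 160-bit namespace, N = 2^160), and borrowing the republish slide's 2^40 physical nodes for n, the per-node routing state and lookup hop count work out as:

```python
import math

b = 16           # digit base (hex digits, j = 4 bits per digit)
N = 2 ** 160     # namespace size
n = 2 ** 40      # physical nodes (figure from the republish-traffic estimate)

routing_entries = b * math.log(N, b)   # per-node routing table size, b*Log_b(N)
lookup_hops = math.log(n, b)           # overlay hops to locate an object, Log_b(n)

print(round(routing_entries))  # 640 entries: 40 levels of 16 entries each
print(round(lookup_hops))      # 10 hops
```

The constants are small: a 160-bit ID splits into 40 hex digits, so the routing table has 40 levels with one entry per possible digit value.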
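First Reachable Link Selection, as described on the FRLS slide (periodic probes mark links up or down; a packet goes out on the first link in priority order still believed good), might be sketched as follows. The class, method names, and three-link layout are invented for illustration; only the 1-primary-plus-2-backups structure comes from the slides.

```python
# Toy FRLS sketch (illustrative): each route entry keeps a primary link plus
# two backups, with soft-state reachability updated by periodic probes.

class RouteEntry:
    def __init__(self, links):
        self.links = list(links)                  # primary first, then backups
        self.reachable = {l: True for l in links}

    def record_probe(self, link, ok: bool):
        """Update soft state for one link from a probe result."""
        self.reachable[link] = ok

    def first_reachable(self):
        """FRLS: pick the first link, in priority order, believed to be up."""
        for link in self.links:
            if self.reachable[link]:
                return link
        return None                               # no usable next hop

entry = RouteEntry(["nodeB", "nodeC", "nodeD"])   # hypothetical next hops
entry.record_probe("nodeB", False)                # primary failed a probe
print(entry.first_reachable())                    # prints nodeC
```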