Tapestry Deployment and Fault-tolerant Routing
Ben Y. Zhao, L. Huang, S. Rhea, J. Stribling, A. D. Joseph, J. D. Kubiatowicz
Berkeley Research Retreat, January 2003
UCB Winter Retreat [email protected]

Scaling Network Applications
  Complexities of global deployment
    Network unreliability
    Lack of administrative control over components
    BGP slow convergence, redundancy unexploited
    Constrains protocol deployment: multicast, congestion ctrl.
    Management of large scale resources / components
    Locate, utilize resources despite failures

Enabling Technology: DOLR (Decentralized Object Location and Routing)
  [Figure: DOLR layer with objects labeled GUID1 and GUID2]

What is Tapestry?
  DOLR driving OceanStore global storage (Zhao, Kubiatowicz, Joseph et al. 2000)
  Network structure
    Nodes assigned bit sequence nodeIds from namespace 0 to 2^160, based on some radix (e.g. 16)
    Keys from same namespace
    Keys dynamically map to 1 unique live node: root
  Base API
    Publish / Unpublish (Object ID)
    RouteToNode (NodeId)
    RouteToObject (Object ID)

Tapestry Mesh
  [Figure: mesh of nodes 0xEF97, 0xEF32, 0xE399, 0xE530, 0xEF34, 0xEF37, 0xEF44, 0x099F, 0xE555, 0xFF37, 0xEFBA, 0xEF40, 0xEF31, 0x0999, 0xE324, 0xE932, 0x0921, with links labeled by routing level 1 to 4]

Object Location
  [Figure]

Talk Outline
  Introduction
  Architecture
    Node architecture
    Node implementation
  Deployment Evaluation
  Fault-tolerant Routing

Single Node Architecture
  [Figure: layered node architecture, top to bottom]
    Decentralized File Systems | Application-Level Multicast | Approximate Text Matching
    Application Interface / Upcall API
    Dynamic Node Management | Routing Table & Object Pointer DB | Router
    Network Link Management
    Transport Protocols

Single Node Implementation
  [Figure: node implementation stack. Labels: Applications; API calls; Upcalls; Application Programming Interface; Dynamic Tapestry; Enter/leave Tapestry; State Maint.; Node Ins/del; Core Router; route to node / obj; Patchwork; Routing Link Maintenance; Distance Map; UDP Pings; Network Stage; SEDA Event-driven Framework; Java Virtual Machine]

Deployment Status
  C simulator
    Packet level simulation
    Scales up to 10,000 nodes
  Java implementation
    50,000 semicolons of Java, 270 class files
    Deployed on local area cluster (40 nodes)
    Deployed on PlanetLab global network (~100 distributed nodes)

Talk Outline
  Introduction
  Architecture
  Deployment Evaluation
    Micro-benchmarks
    Stable network performance
    Single and parallel node insertion
  Fault-tolerant Routing

Micro-benchmark Methodology
  [Figure: Sender Control and Tapestry node connected over a LAN link to Tapestry node and Receiver Control]
  Experiment run in LAN, GBit Ethernet
  Sender sends 60,001 messages at full speed
  Measure inter-arrival time for last 50,000 msgs
    10,000 msgs: remove cold-start effects
    50,000 msgs: remove network jitter effects

Micro-benchmark Results
  [Figure: two log-scale plots against Message Size (KB): Message Processing Latency (Time / msg, ms) and Sustainable Throughput (TPut, MB/s)]
  Constant processing overhead ~50 μs
  Latency dominated by byte copying
  For 5K messages, throughput = ~10,000 msgs/sec

Large Scale Methodology
  PlanetLab global network
    101 machines at 42 institutions, in North America, Europe, Australia (~60 machines utilized)
    1.26 GHz PIII (1 GB RAM), 1.8 GHz P4 (2 GB RAM)
    North American machines (2/3) on Internet2
  Tapestry Java deployment
    6-7 nodes on each physical machine
    IBM Java JDK 1.30
    Node virtualization inside JVM and SEDA
    Scheduling between virtual nodes increases latency

Node to Node Routing
  [Figure: RDP (min, med, 90%) vs. internode RTT ping time (5 ms buckets); annotation: Median=31.5, 90th percentile=135]
  Ratio of end-to-end routing latency to shortest ping distance between nodes
  All node pairs measured, placed into buckets

Object Location
  [Figure: RDP (min, median, 90%) vs. client-to-object RTT ping time (1 ms buckets); annotation: 90th percentile=158]
  Ratio of end-to-end latency for object location to shortest ping distance between client and object location
  Each node publishes 10,000 objects, lookup on all objects

Latency to Insert Node
  [Figure: Integration Latency (ms) vs. Size of Existing Network (nodes)]
  Latency to dynamically insert a node into an existing Tapestry, as a function of the size of the existing Tapestry
  Humps due to expected filling of each routing level

Bandwidth to Insert Node
  [Figure: Control Traffic BW (KB) vs. Size of Existing Network (nodes)]
  Cost in bandwidth of dynamically inserting a node into the Tapestry, amortized for each node in network
  Per node bandwidth decreases with size of network

Parallel Insertion Latency
  [Figure: Latency to Convergence (ms) vs. Ratio of Insertion Group Size to Network Size; annotation: 90th percentile=55042]
  Latency to dynamically insert nodes in unison into an existing Tapestry of 200
  Shown as function of insertion group size / network size

Talk Outline
  Introduction
  Architecture
  Deployment Evaluation
  Fault-tolerant Routing
    Tunneling through scalable overlays
    Example using Tapestry

Adaptive and Resilient Routing
  Goals
    Reachability as a service
    Agility / adaptability in routing
    Scalable deployment
    Useful for all client endpoints

Existing Redundancy in DOLR/DHTs
  Fault-detection via soft-state beacons
    Periodically sent to each node in routing table
    Scales logarithmically with size of network
    Worst case overhead: 2^40 nodes, 160b ID, 20 hex; 1 beacon/sec, 100B each = 240 kbps
    Can minimize B/W w/ better techniques (Hakim, Shelley)
  Precomputed backup routes
    Intermediate hops in overlay path are flexible
    Keep list of backups for outgoing hops (e.g. 3 node pointers for each route entry in Tapestry)
    Maintain backups using node membership algorithms (no additional overhead)

Bootstrapping Non-overlay Endpoints
  Goal
    Allow non-overlay nodes to benefit
    Endpoints communicate via overlay proxies
  Example: legacy nodes L1, L2
    Li registers w/ nearby overlay proxy Pi
    Pi assigns Li a proxy name Di s.t. Di is the closest possible unique name to Pi (e.g. start w/ Pi, increment for each node)
    L1 and L2 exchange new proxy names
    Messages route to nodes using proxy names

Tunneling through an Overlay
  [Figure: legacy nodes L1 and L2 attached to proxies P1 and P2 on the overlay network, with proxy names D1 and D2]
  L1 registers with P1 as document D1
  L2 registers with P2 as document D2
  Traffic tunnels through overlay via proxies

Failure Avoidance in Tapestry
  [Figure]

Routing Convergence
  [Figure]

Bandwidth Overhead for Misroute
  [Figure: Increase in Latency for 1 Misroute (Secondary Route); Proportional Increase to Path Latency vs. Position of Branch (Hop); series: 20 ms, 26.66 ms, 60 ms, 80 ms, 93.33 ms]
  Status: under deployment on PlanetLab

For more information …
  Tapestry and related projects (and these slides): http://www.cs.berkeley.edu/~ravenben/tapestry
  OceanStore: http://oceanstore.cs.berkeley.edu
  Related papers:
    http://oceanstore.cs.berkeley.edu/publications
    http://www.cs.berkeley.edu/~ravenben/publications
  [email protected]

Backup Slides Follow…

The Naming Problem
  Tracking modifiable objects
    Example: email, Usenet articles, tagged audio
    Goal: verifiable names, robust to small changes
  Current approaches
    Content-based hashed naming
    Content-independent naming
  ADOLR Project: (Feng Zhou, Li Zhuang)
    Approximate names based on feature vectors
    Leverage to match / search for similar content

Approximation Extension to DOLR/DHT
  Publication using features
    Objects are described using a set of features: AO ≡ Feature Vector (FV) = {f1, f2, f3, …, fn}
    Locate AOs in DOLR ≡ find all AOs in the network with |FV* ∩ FV| ≥ Thres, 0 < Thres ≤ |FV|
  Driving application: decentralized spam filter
    Humans are the only fool-proof spam filter
    Mark spam, publish spam by text feature vector
    Incoming mail filtered by FV query on P2P overlay

Evaluation on Real Emails
  Accuracy of feature vector matching on real emails
    Spam (29631 junk emails from www.spamarchive.org): 14925 (unique), 86% of spam ≤ 5K
    Normal emails: 9589 (total) = 50% newsgroup posts, 50% personal emails
  "Similarity" test: 3440 modified copies of 39 emails
    THRES   Detected   Fail   %
    3/10    3356       84     97.56
    4/10    3172       268    92.21
  "False positive" test: 9589 (normal) × 14925 (spam) pairs
    Match FP   # pair   Probability
    2/10       4        2.79e-8
    >2/10      0        0
  Status
    Prototype implemented as Outlook plug-in
    Interfaces w/ Tapestry overlay
    http://www.cs.berkeley.edu/~zf/spamwatch

State of the Art Routing
  High dimensionality and coordinate-based P2P routing
    Tapestry, Pastry, Chord, CAN, etc…
    Sub-linear storage and # of overlay hops per route
    Properties dependent on random name distribution
    Optimized for uniform mesh style networks

Reality
  Transit-stub topology, disparate resources per node
  Result: inefficient inter-domain routing (b/w, latency)
  [Figure: P2P overlay network spanning AS-1, AS-2 and AS-3, with endpoints S and R]

Landmark Routing on P2P
  Brocade
    Exploit non-uniformity
    Minimize wide-area routing hops / bandwidth
  Secondary overlay on top of Tapestry
    Select super-nodes by admin. domain
    Super-nodes form secondary Tapestry
    Divide network into cover sets
    Advertise cover set as local objects
    Brocade routes directly into destination's local network, then resumes p2p routing

Brocade Routing
  [Figure: Original Route vs. Brocade Route from S to D across AS-1, AS-2 and AS-3 in the P2P network, with the Brocade layer above]

Overlay Routing Networks
  CAN: Ratnasamy et al. (ACIRI / UCB)
    Uses d-dimensional coordinate space to implement distributed hash table
    Route to neighbor closest to destination coordinate
    Properties: fast insertion / deletion; constant-sized routing state; unconstrained # of hops; overlay distance not prop. to physical distance
  Chord: Stoica, Morris, Karger, et al. (MIT / UCB)
    Linear namespace modeled as circular address space
    "Finger-table" points to logarithmic # of inc. remote hosts
    Properties: simplicity in algorithms; fast fault-recovery; Log2(N) hops and routing state; overlay distance not prop. to physical distance
  Pastry: Rowstron and Druschel (Microsoft / Rice)
    Hypercube routing similar to PRR97
    Objects replicated to servers by name
    Properties: fast fault-recovery; Log(N) hops and routing state; data replication required for fault-tolerance

Routing in Detail
  Example: octal digits, 2^12 namespace, 2175 → 0157
  [Figure: routing tables (entries 0 to 7) at nodes 2175, 0880, 0123, 0154 and 0157 along the route]

Publish / Lookup Details
  Publish object with ObjectID:
    // route towards "virtual root," ID=ObjectID
    For (i=0; i<Log2(N); i+=j) {   // define hierarchy
      // j is # of bits in digit size (i.e. for hex digits, j = 4)
      Insert entry into nearest node that matches on last i bits
      If no matches found, deterministically choose alternative
    }
    Found real root node, when no external routes left
  Lookup object:
    Traverse same path to root as publish, except search for entry at each node
    For (i=0; i<Log2(N); i+=j) {
      Search for cached object location
      Once found, route via IP or Tapestry to object
    }

Dynamic Insertion
  1. Build up new node's routing map
       Send messages to each hop along path from gateway to current node N' that best approximates N
       The ith hop along the path sends its ith level route table to N
       N optimizes those tables where necessary
  2. Notify via acked multicast nodes with null entries for N's ID
  3. Notified node issues republish message for relevant objects
  4. Notify local neighbors

Dynamic Insertion Example
  [Figure: mesh of nodes 0x779FE, 0x244FE, 0xA23FE, 0x6993E, 0x973FE, 0xC035E, 0x243FE, 0x4F990, 0xB555E, 0x0ABFE, 0x704FE, 0x913FE, 0x09990, 0x5239E, 0x71290, with links labeled by level 1 to 4; NEW node 0x143FE joins through Gateway 0xD73FF]

Dynamic Root Mapping
  Problem: choosing a root node for every object
    Deterministic over network changes
    Globally consistent
  Assumptions
    All nodes with same matching suffix contain same null/non-null pattern in next level of routing map
    Requires: consistent knowledge of nodes across network

PRR Solution
  Given desired ID N,
    Find set S of nodes in existing network nodes n matching most # of suffix digits with N
    Choose Si = node in S with highest valued ID
  Issues:
    Mapping must be generated statically using global knowledge
    Must be kept as hard state in order to operate in changing environment
    Mapping is not well distributed, many nodes in n get no mappings
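
The PRR root selection rule can be sketched in a few lines. This is only an illustration under stated assumptions: IDs are equal-length hex strings, the whole node set is known in one place (the "global knowledge" the slide lists as an issue), and the names `shared_suffix_len` and `prr_root` are hypothetical helpers, not part of Tapestry.

```python
from typing import Iterable


def shared_suffix_len(a: str, b: str) -> int:
    """Count how many trailing digits two equal-length ID strings share."""
    count = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        count += 1
    return count


def prr_root(desired_id: str, node_ids: Iterable[str]) -> str:
    """Pick a root for desired_id following the rule on the PRR slide."""
    nodes = list(node_ids)
    # S = nodes matching the most suffix digits with the desired ID.
    best = max(shared_suffix_len(desired_id, n) for n in nodes)
    s = [n for n in nodes if shared_suffix_len(desired_id, n) == best]
    # Choose the member of S with the highest-valued ID (hex is assumed).
    return max(s, key=lambda n: int(n, 16))


# Hypothetical 4-digit IDs, borrowed from the mesh figure for illustration.
nodes = ["EF97", "EF32", "E399", "0999", "E932"]
root = prr_root("4432", nodes)
```

Because the function scans every node, it needs the full membership list, which is the limitation the Tapestry approach tries to remove.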

Tapestry Solution
  Globally consistent distributed algorithm:
    Attempt to route to desired ID Ni
    Whenever null entry encountered, choose next "higher" non-null pointer entry
    If current node S is only non-null pointer in rest of route map, terminate route, f(N) = S
  Assumes:
    Routing maps across network are up to date
    Null/non-null properties identical at all nodes sharing same suffix

Analysis
  Globally consistent deterministic mapping
    Null entry → no node in network with suffix
    Consistent map → identical null entries across same route maps of nodes w/ same suffix
  Additional hops compared to PRR solution:
    Reduce to coupon collector problem, assuming random distribution
    With n ln(n) + cn entries, P(all coupons) = 1 - e^(-c)
    For n = b, c = b - ln(b): n ln(n) + cn = b^2, so with b^2 nodes left P(all coupons) = 1 - b/e^b
    Failure probability b/e^b = 1.8 × 10^-6 for b = 16
    # of additional hops: Log_b(b^2) = 2
  Distributed algorithm with minimal additional hops

Dynamic Mapping Border Cases
  Node vanishes undetected
    Routing proceeds on invalid link, fails
    No backup router, so proceed to surrogate routing
  Node enters network undetected; messages going to surrogate node instead
    New node checks with surrogate after all such nodes have been notified
    Route info at surrogate is moved to new node

SPAA slides follow

Network Assumption
  Nearest neighbor is hard in general metric
  Assume the following:
    Ball of radius 2r contains only a factor of c more nodes than ball of radius r.
    Also, b > c^2 [Both assumed by PRR]
  Start knowing one node; allow distance queries

Algorithm Idea
  Call a node a level i node if it matches the new node in i digits.
  The whole network is contained in forest of trees rooted at highest possible imax.
  Let list[imax] contain the root of all trees. Then, starting at imax,
    while i > 1
      list[i-1] = getChildren(list[i])
  Certainly, list[i] contains level i neighbors.
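
A toy version of the loop above, to make the data flow concrete. Everything beyond the loop itself is an assumption for illustration: digits are matched as a suffix (as in the insertion example), `links` is a hypothetical map from each node to the nodes it can name, and `get_children` simply keeps the linked nodes that share at least i-1 digits with the new node. The slides do not define getChildren, so this is one possible reading, not the actual implementation.

```python
from typing import Dict, Set


def level(node_id: str, new_id: str) -> int:
    """Number of trailing digits node_id shares with the new node's ID."""
    count = 0
    for x, y in zip(reversed(node_id), reversed(new_id)):
        if x != y:
            break
        count += 1
    return count


def get_children(parents: Set[str], i: int,
                 links: Dict[str, Set[str]], new_id: str) -> Set[str]:
    """Hypothetical getChildren: level-(i-1) nodes reachable from level-i nodes."""
    children = set(parents)  # a level-i node also matches in i-1 digits
    for parent in parents:
        for candidate in links.get(parent, set()):
            if level(candidate, new_id) >= i - 1:
                children.add(candidate)
    return children


def build_lists(roots: Set[str], i_max: int,
                links: Dict[str, Set[str]], new_id: str) -> Dict[int, Set[str]]:
    """Run the slide's loop: start at i_max and descend while i > 1."""
    lists = {i_max: set(roots)}
    i = i_max
    while i > 1:
        lists[i - 1] = get_children(lists[i], i, links, new_id)
        i -= 1
    return lists


# Hypothetical 4-digit IDs and links; the new node is 43FE.
new_id = "43FE"
links = {
    "D3FE": {"A2FE", "9990"},
    "A2FE": {"993E", "035E"},
    "993E": {"9990"},
}
lists = build_lists({"D3FE"}, 3, links, new_id)
```

In this toy run each list grows as the level drops, which is why the full algorithm later trims the list at every step.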

We Reach The Whole Network
  [Figure: mesh of nodes 0xEF97, 0xEF32, 0xE399, 0xEF34, 0xEF37, 0xEF44, 0xE530, 0xE555, 0xFF37, 0xEFBA, 0x099F, 0xEF40, 0xEF31, 0x0999, 0xE324, 0xE932, 0x0921, with links labeled by level 1 to 4]

The Real Algorithm
  Simplified version reaches ALL nodes in the network.
  But far away nodes are not likely to have close descendants
  Trim the list at each step.
  New version:
    while i > 1
      list[i-1] = getChildren(list[i])
      Trim(list[i-1])

How to Trim
  Consider circle of radius r with at least one level i node.
  Level-(i-1) node in little circle must point to a level-i node in the big circle
  Want: list[i] had radius three times list[i-1] and list[i-1] contains one level i
  [Figure: little circle of radius r inside a big circle; distance labeled <2r]

Animation
  [Animation; label: new]

True in Expectation
  Want: list[i] had radius three times list[i-1] and list[i-1] contains one level i
  Suppose list[i-1] has k elements and radius r
    Expect ball of radius 4r to contain kc^2/b
    Ball of radius 3r contains less than k nodes, so keeping k all along is enough.
  To work with high probability, k = O(log n)

Steps of Insertion
  Find node with closest matching ID (surrogate) and get preliminary neighbor table
  Find all nodes that need to put new node in routing table via multicast
  Optimize neighbor table
  If surrogate's table is hole-free, so is this one.
  w.h.p. contacted nodes in building table are the only ones that need to update their own tables
  Need: No fillable holes. Keep objects reachable

Need-to-know nodes
  • Need-to-know = a node with a hole in neighbor table filled by new node
  • If 1234 is new node, and no 123s existed, must notify 12?? nodes
  • Acknowledged multicast to all matching nodes

Acknowledged Multicast Algorithm
  Locates & contacts all nodes with a given prefix
  • Create a tree based on IDs as we go
  • Nodes send acks when all children reached
  • Starting node knows when all nodes reached
  [Figure: multicast tree from 543?? to 5430?, 5431? and 5434?, then to 54340 and 54345]
  The node then sends to any 5430?, any 5431?, any 5434?, etc. if possible
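
The acknowledged multicast can be sketched as a recursion over ever-longer prefixes. This is a simplified model under stated assumptions: a single list `all_nodes` stands in for the routing tables a real node would consult, "any" matching node is taken to be the first match, and the function name `acked_multicast` is hypothetical. The return value plays the role of the ack.

```python
from typing import List, Set

DIGITS = "0123456789ABCDEF"


def acked_multicast(node: str, prefix: str,
                    all_nodes: List[str], reached: Set[str]) -> bool:
    """Contact every node under `prefix`; True means all children acked."""
    reached.add(node)
    if len(prefix) == len(node):
        return True  # the prefix names this node alone, so it acks at once
    acks = []
    for digit in DIGITS:
        longer = prefix + digit
        if node.startswith(longer):
            child = node  # this node keeps serving its own branch
        else:
            matches = [n for n in all_nodes if n.startswith(longer)]
            if not matches:
                continue  # nobody has this prefix, nothing to contact
            child = matches[0]  # "any" node with the longer prefix
        acks.append(acked_multicast(child, longer, all_nodes, reached))
    # A node acks only once every child it contacted has acked.
    return all(acks)


# Hypothetical 5-digit IDs; 54340 starts a multicast to every 543?? node.
all_nodes = ["54340", "54345", "54301", "54312", "12345"]
reached: Set[str] = set()
done = acked_multicast("54340", "543", all_nodes, reached)
```

When `done` comes back true at the starting node, every node sharing the prefix is in `reached`, which mirrors the "starting node knows when all nodes reached" property on the slide.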