Ken Birman
Cornell University, CS5410 Fall 2008

What is a Distributed Hash Table (DHT)?
• Exactly that ☺
• A service, distributed over multiple machines, with hash table semantics
  • Put(key, value), Value = Get(key)
• Designed to work in a peer-to-peer (P2P) environment
  • No central control
  • Nodes under different administrative control
• But of course can operate in an "infrastructure" sense
More specifically
• Hash table semantics (a minimal interface sketch follows this list):
  • Put(key, value), Value = Get(key)
  • Key is a single flat string
  • Limited semantics compared to keyword search
• Put() causes value to be stored at one (or more) peer(s)
• Get() retrieves value from a peer
• Put() and Get() accomplished with unicast routed messages
  • In other words, it scales
• Other API calls to support the application, like notification when neighbors come and go
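To make the Put/Get semantics concrete, here is a minimal sketch of the client-facing interface a DHT might expose. It is illustrative only: the class name and the route_put/route_get calls are assumptions, not part of any system described in these slides.

    import hashlib

    class DHTClient:
        """Illustrative client-side view of a DHT (not a real implementation)."""

        def __init__(self, bootstrap_node):
            # A client only needs one known member to reach the whole table.
            self.bootstrap = bootstrap_node

        def _key(self, name: str) -> int:
            # Flat string keys are hashed into the DHT's numeric key space.
            return int(hashlib.sha1(name.encode()).hexdigest(), 16)

        def put(self, name: str, value: bytes) -> None:
            # Routed (unicast) to the peer(s) responsible for the key.
            self.bootstrap.route_put(self._key(name), value)   # hypothetical call

        def get(self, name: str) -> bytes:
            # Routed to a responsible peer, which returns the stored value.
            return self.bootstrap.route_get(self._key(name))   # hypothetical call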
P2P "environment"
• Nodes come and go at will (possibly quite frequently, staying only a few minutes)
• Nodes have heterogeneous capacities
  • Bandwidth, processing, and storage
• Nodes may behave badly
  • Promise to do something (store a file) and not do it (free-loaders)
  • Attack the system
Several flavors, each with variants
• Tapestry (Berkeley)
  • Based on Plaxton trees, similar to hypercube routing
  • The first* DHT
  • Complex and hard to maintain (hard to understand too!)
• CAN (ACIRI), Chord (MIT), and Pastry (Rice/MSR Cambridge)
  • Second wave of DHTs (contemporary with and independent of each other)

* Landmark Routing, 1988, used a form of DHT called Assured Destination Binding (ADB)
Basics of all DHTs
[Figure: ring overlay with node IDs 13, 33, 58, 81, 97, 111, 127]
• Goal is to build some "structured" overlay network with the following characteristics:
  • Node IDs can be mapped to the hash key space
  • Given a hash key as a "destination address", you can route through the network to a given node
  • Always route to the same node no matter where you start from
Simple example (doesn't scale)
[Figure: ring with node IDs 13, 33, 58, 81, 97, 111, 127]
• Circular number space 0 to 127
• Routing rule is to move counter-clockwise until current node ID ≥ key and last hop node ID < key (a sketch of this rule follows)
• Example: key = 42
  • Obviously you will route to node 58 no matter where you start
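As an illustration only (not any particular system's code), here is a small sketch of that responsibility rule: the node responsible for a key is the first node ID greater than or equal to the key, wrapping around the ring. The node list is taken from the example figure.

    import bisect

    NODE_IDS = [13, 33, 58, 81, 97, 111, 127]  # ring node IDs from the example figure
    ID_SPACE = 128                              # circular number space 0..127

    def responsible_node(key: int) -> int:
        """First node whose ID is >= key, wrapping around the 0..127 ring."""
        key %= ID_SPACE
        i = bisect.bisect_left(NODE_IDS, key)
        return NODE_IDS[i] if i < len(NODE_IDS) else NODE_IDS[0]

    assert responsible_node(42) == 58   # the example from the slide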
Building any DHT
[Figure: ring with node IDs 13, 24, 33, 58, 81, 97, 111, 127, shown as the newcomer is integrated step by step]
• Newcomer always starts with at least one known member
• Newcomer searches for "self" in the network
  • hash key = newcomer's node ID
  • Search results in a node in the vicinity where newcomer needs to be
• Links are added/removed to satisfy properties of network
• Objects that now hash to the new node are transferred to the new node
Insertion/lookup for any DHT
[Figure: ring with node IDs 13, 24, 33, 58, 81, 97, 111, 127; foo.htm hashes to key 93]
• Hash name of object to produce key (a sketch of this step follows)
  • Well-known way to do this
• Use key as destination address to route through the network
  • Routes to the target node
• Insert object, or retrieve object, at the target node
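For example, a common approach (assumed here; the slides do not mandate a particular hash) is to take a cryptographic hash of the object's name and reduce it into the ring's ID space:

    import hashlib

    ID_SPACE = 128  # the toy 0..127 ring from these slides; real DHTs use 2**128 or 2**160

    def object_key(name: str) -> int:
        # Hash the flat string name, then map it into the circular key space.
        digest = hashlib.sha1(name.encode()).digest()
        return int.from_bytes(digest, "big") % ID_SPACE

    # The figure maps foo.htm to 93 with its own hash function; this sketch
    # just illustrates the idea of turning a name into a routable key.
    key = object_key("foo.htm")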
Properties of most DHTs
• Memory requirements grow (something like) logarithmically with N
• Routing path length grows (something like) logarithmically with N
• Cost of adding or removing a node grows (something like) logarithmically with N
• Has caching, replication, etc…
DHT Issues
• Resilience to failures
• Load balance
  • Heterogeneity
  • Number of objects at each node
  • Routing hot spots
  • Lookup hot spots
• Locality (performance issue)
• Churn (performance and correctness issue)
• Security
We're going to look at four DHTs
• At varying levels of detail…
• CAN (Content Addressable Network)
  • ACIRI (now ICIR)
• Chord
  • MIT
• Kelips
  • Cornell
• Pastry
  • Rice/Microsoft Cambridge
Things we're going to look at
• What is the structure?
• How does routing work in the structure?
• How does it deal with node departures?
• How does it scale?
• How does it deal with locality?
• What are the security issues?
CAN structure is a Cartesian coordinate space in a D-dimensional torus
[Figure: a single node, labeled 1, owning the whole coordinate space]
CAN graphics care of Santashil PalChaudhuri, Rice Univ
Simple example in two dimensions
[Figure: the 2-D space is split among nodes 1, 2, 3, 4 as each joins]
• Note: torus wraps on "top" and "sides"
• Each node in CAN network occupies a "square" in the space
• With relatively uniform square sizes
Neighbors in CAN network
• Neighbor is a node that:
  • Overlaps in d-1 dimensions
  • Abuts along one dimension
Route to neighbors closer to target
[Figure: route from point (a,b) toward (x,y) through zones Z1, Z2, Z3, Z4, …, Zn]
• d-dimensional space
• n zones
  • Zone is space occupied by a "square" in one dimension
• Avg. route path length: (d/4)(n^(1/d))
• Number of neighbors = O(d)
• Tunable (vary d or n)
• Can factor proximity into route decision (a greedy routing sketch follows)
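To make the greedy step concrete, here is a small sketch (illustrative only; node names and coordinates are made up) of how a CAN node might pick the neighbor whose zone center is closest to the target point on a d-dimensional torus:

    from typing import List, Tuple

    Point = Tuple[float, ...]

    def torus_distance(a: Point, b: Point, size: float = 1.0) -> float:
        """Euclidean distance on a torus where each dimension wraps at `size`."""
        total = 0.0
        for x, y in zip(a, b):
            d = abs(x - y) % size
            d = min(d, size - d)        # take the shorter way around the wrap
            total += d * d
        return total ** 0.5

    def next_hop(neighbors: List[Tuple[str, Point]], target: Point) -> str:
        """Greedy CAN-style step: forward to the neighbor closest to the target."""
        return min(neighbors, key=lambda nc: torus_distance(nc[1], target))[0]

    # Hypothetical neighbors identified by name and zone-center coordinates:
    hop = next_hop([("n2", (0.75, 0.25)), ("n3", (0.25, 0.75))], target=(0.9, 0.1))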
Chord uses a circular ID space
[Figure: circular ID space with nodes N10, N32, N60, N80, N100; each key lives at its successor: K5, K10 at N10; K11, K30 at N32; K33, K40, K52 at N60; K65, K70 at N80; K100 at N100]
• Successor: node with next highest ID
Chord slides care of Robert Morris, MIT
Basic Lookup
[Figure: ring of nodes N5, N10, N20, N32, N40, N60, N80, N99, N110; the query "Where is key 50?" travels around the ring and the answer "Key 50 is at N60" comes back]
• Lookups find the ID's predecessor (a successor-walk sketch follows)
• Correct if successors are correct
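A minimal sketch of successor-only lookup, the version this slide depicts before finger tables are introduced. The Node class is illustrative, not the Chord paper's pseudocode.

    class Node:
        def __init__(self, node_id: int, ring_bits: int = 7):
            self.id = node_id
            self.space = 1 << ring_bits    # e.g. 128 for the toy examples
            self.successor: "Node" = self  # set properly once the ring is formed

        def _in_range(self, key: int) -> bool:
            # True if key lies in (self.id, successor.id] on the circle.
            lo, hi = self.id, self.successor.id
            if lo < hi:
                return lo < key <= hi
            return key > lo or key <= hi   # interval wraps past zero

        def find_successor(self, key: int) -> "Node":
            # Walk the ring one successor at a time until the key falls between
            # a node and its successor; that successor is responsible for the key.
            node = self
            while not node._in_range(key % node.space):
                node = node.successor
            return node.successor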
Successor Lists Ensure Robust Lookup
[Figure: each node on the ring keeps a list of its next three successors, e.g. N5 keeps (10, 20, 32), N60 keeps (80, 99, 110), N110 keeps (5, 10, 20)]
• Each node remembers r successors
• Lookup can skip over dead nodes to find blocks
• Periodic check of successor and predecessor links
Chord "Finger Table" Accelerates Lookups
[Figure: N80's fingers point to the nodes ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring]
• To build finger tables, a new node searches for the key values for each finger (a construction sketch follows)
• To do it efficiently, new nodes obtain the successor's finger table, and use it as a hint to optimize the search
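As a sketch under the usual Chord definition (finger i of node n points to the first node at or after n + 2^i mod 2^m), the finger targets can be computed as follows; find_successor stands for the lookup routine sketched earlier, and this is illustrative rather than the paper's code.

    def finger_targets(node_id: int, ring_bits: int = 7) -> list:
        """Key values each finger should cover: node_id + 2**i, wrapping mod 2**m."""
        space = 1 << ring_bits
        return [(node_id + (1 << i)) % space for i in range(ring_bits)]

    def build_fingers(node, find_successor) -> list:
        # Each finger entry is the live node responsible for its target key.
        # In practice a new node seeds this from its successor's table as a hint.
        return [find_successor(t) for t in finger_targets(node.id)]

    # finger_targets(80, 7) -> [81, 82, 84, 88, 96, 112, 16]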
Chord lookups take O(log N) hops
[Figure: Lookup(K19) takes a few finger hops across the ring of nodes N5, N10, N20, N32, N60, N80, N99, N110 to reach N20, the successor of key K19]
Drill down on Chord reliability
• Interested in maintaining a correct routing table (successors, predecessors, and fingers)
• Primary invariant: correctness of successor pointers
  • Fingers, while important for performance, do not have to be exactly correct for routing to work
• Algorithm is to "get closer" to the target
  • Successor nodes always do this
Maintaining successor pointers
• Periodically run "stabilize" algorithm (a sketch follows)
  • Finds successor's predecessor
  • Repair if this isn't self
• This algorithm is also run at join
• Eventually routing will repair itself
• Fix_finger also periodically run
  • For randomly selected finger
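A minimal sketch of the stabilize/notify pair, following the description above. The method names mirror the Chord paper's terminology, but this is an illustrative sketch, not a faithful implementation.

    def between(x: int, lo: int, hi: int) -> bool:
        """True if x lies strictly inside the circular interval (lo, hi)."""
        return lo < x < hi if lo < hi else x > lo or x < hi

    class ChordNode:
        def __init__(self, node_id: int):
            self.id = node_id
            self.successor: "ChordNode" = self
            self.predecessor: "ChordNode" = None

        def stabilize(self):
            # Ask the successor for its predecessor; if someone slipped in
            # between us, adopt it as the new successor, then announce ourselves.
            x = self.successor.predecessor
            if x is not None and between(x.id, self.id, self.successor.id):
                self.successor = x
            self.successor.notify(self)

        def notify(self, candidate: "ChordNode"):
            # Accept the candidate as predecessor if we have none or it is closer.
            if self.predecessor is None or between(candidate.id, self.predecessor.id, self.id):
                self.predecessor = candidate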
Initial: 25 wants to join correct ring (between 20 and 30)
[Figure: three snapshots of the ring segment 20, 25, 30]
• 25 finds successor, and tells successor (30) of itself
• 20 runs "stabilize":
  • 20 asks 30 for 30's predecessor
  • 30 returns 25
  • 20 tells 25 of itself
This time, 28 joins before 20 runs "stabilize"
[Figure: snapshots of the ring segment 20, 25, 28, 30 as the repairs proceed]
• 28 finds successor, and tells successor (30) of itself
• 20 runs "stabilize":
  • 20 asks 30 for 30's predecessor
  • 30 returns 28
  • 20 tells 28 of itself
• 25 runs "stabilize"
• 20 runs "stabilize"
Pastry also uses a circular number space
[Figure: Route(d46a1c) issued at node 65a1fc; successive hops (d13da3, d4213f, d462ba, d467c4, d471f1) share progressively longer prefixes with the target d46a1c]
• Difference is in how the "fingers" are created
• Pastry uses prefix match overlap rather than binary splitting (a prefix-matching sketch follows)
• More flexibility in neighbor selection
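A sketch of the prefix-matching idea (illustrative, not Pastry's actual routing-table code): at each hop, prefer a node whose hex ID shares a longer prefix with the key than the current node does.

    def shared_prefix_len(a: str, b: str) -> int:
        """Number of leading hex digits two IDs have in common."""
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def pick_next_hop(current: str, key: str, known_nodes: list) -> str:
        # Prefer nodes that extend the matched prefix; among candidates that do,
        # any choice works, which is the flexibility Pastry exploits (e.g. by RTT).
        best = max(known_nodes, key=lambda n: shared_prefix_len(n, key), default=current)
        return best if shared_prefix_len(best, key) > shared_prefix_len(current, key) else current

    hop = pick_next_hop("65a1fc", "d46a1c", ["d13da3", "d4213f", "d462ba"])  # -> "d462ba"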
Pastry routing table (for node 65a1fc)
[Figure: the routing table of node 65a1fc]
Pastry nodes also have a "leaf set": the set of immediate neighbors up and down the ring
• Similar to Chord's list of successors
Pastry join
• X = new node, A = bootstrap, Z = nearest node
• A finds Z for X
• In process, A, Z, and all nodes in path send state tables to X
• X settles on own table
  • Possibly after contacting other nodes
• X tells everyone who needs to know about itself
• Pastry paper doesn't give enough information to understand how concurrent joins work
  • 18th IFIP/ACM, Nov 2001
Pastry leave
• Noticed by leaf set neighbors when leaving node doesn't respond
  • Neighbors ask highest and lowest nodes in leaf set for new leaf set
• Noticed by routing neighbors when message forward fails
  • Immediately can route to another neighbor
  • Fix entry by asking another neighbor in the same "row" for its neighbor
  • If this fails, ask somebody a level up
For instance, this neighbor fails
[Figure: repairing a failed 655x entry in the routing table]
• Ask other neighbors
• Try asking some neighbor in the same row for its 655x entry
• If it doesn't have one, try asking some neighbor in the row below, etc.
CAN, Chord, Pastry differences
• CAN, Chord, and Pastry have deep similarities
• Some (important???) differences exist
• CAN nodes tend to know of multiple nodes that allow equal progress
  • Can therefore use additional criteria (RTT) to pick next hop
• Pastry allows greater choice of neighbor
  • Can thus use additional criteria (RTT) to pick neighbor
• In contrast, Chord has more determinism
  • Harder for an attacker to manipulate system?
Security issues
• In many P2P systems, members may be malicious
• If peers untrusted, all content must be signed to detect forged content
  • Requires certificate authority
  • Like we discussed in secure web services talk
  • This is not hard, so can assume at least this level of security
Security issues: Sybil attack
• Attacker pretends to be multiple nodes in the system
  • If it surrounds a node on the circle, it can potentially arrange to capture all traffic
  • Or if not this, at least cause a lot of trouble by being many nodes
• Chord requires node ID to be an SHA-1 hash of its IP address
  • But to deal with load balance issues, a Chord variant allows nodes to replicate themselves
• A central authority must hand out node IDs and certificates to go with them
  • Not P2P in the Gnutella sense
General security rules
• Check things that can be checked
  • Invariants, such as successor list in Chord
• Minimize invariants, maximize randomness
  • Hard for an attacker to exploit randomness
• Avoid any single dependencies
  • Allow multiple paths through the network
  • Allow content to be placed at multiple nodes
• But all this is expensive…
Load balancing
• Query hotspots: a given object is popular
  • Cache at neighbors of hotspot, neighbors of neighbors, etc.
  • Classic caching issues
• Routing hotspot: node is on many paths
  • Of the three, Pastry seems most likely to have this problem, because neighbor selection is more flexible (and based on proximity)
  • This doesn't seem adequately studied
Load balancing
• Heterogeneity (variance in bandwidth or node capacity)
• Poor distribution in entries due to hash function inaccuracies
• One class of solution is to allow each node to be multiple virtual nodes (see the sketch below)
  • Higher capacity nodes virtualize more often
  • But security makes this harder to do
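A small consistent-hashing sketch of the virtual-node idea (illustrative only; node names and capacities are made up): each physical node places several points on the ring, in proportion to its capacity, so higher-capacity nodes receive more keys.

    import bisect
    import hashlib

    def h(s: str) -> int:
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    class VirtualNodeRing:
        def __init__(self, capacities: dict):
            # capacities: physical node name -> number of virtual nodes to place.
            self.ring = sorted(
                (h(f"{name}#{i}"), name)
                for name, count in capacities.items()
                for i in range(count)
            )

        def owner(self, key: str) -> str:
            # The first virtual node clockwise from the key's hash owns it.
            points = [p for p, _ in self.ring]
            i = bisect.bisect(points, h(key)) % len(self.ring)
            return self.ring[i][1]

    ring = VirtualNodeRing({"big-node": 20, "small-node": 5})  # hypothetical capacities
    owner = ring.owner("foo.htm")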
Chord node virtualization
[Figure: load-balance measurements for 10K nodes and 1M objects]
• 20 virtual nodes per node has much better load balance, but each node requires ~400 neighbors!
Primary concern: churn
• Churn: nodes joining and leaving frequently
• Join or leave requires a change in some number of links
• Those changes depend on correct routing tables in other nodes
  • Cost of a change is higher if routing tables not correct
  • In Chord, ~6% of lookups fail if there are three failures per stabilization period
  • But as more changes occur, probability of incorrect routing tables increases
Control traffic load generated by churn
• Chord and Pastry appear to deal with churn differently
• Chord join involves some immediate work, but repair is done periodically
  • Extra load only due to join messages
• Pastry join and leave involves immediate repair of all affected nodes' tables
  • Routing tables repaired more quickly, but cost of each join/leave goes up with frequency of joins/leaves
  • Scales quadratically with number of changes???
  • Can result in network meltdown???
Kelips takes a different approach
• Network partitioned into √N "affinity groups"
• Hash of node ID determines which affinity group a node is in (a sketch follows)
• Each node knows:
  • One or more nodes in each group
  • All objects and nodes in own group
• But this knowledge is soft-state, spread through peer-to-peer "gossip" (epidemic multicast)!
[Figure: node 110 knows about other members of its group, e.g. 230, 30…]
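To make the partitioning concrete, here is a small sketch (names and counts are illustrative) of how a node might map itself, or a resource name, into one of roughly √N affinity groups:

    import hashlib
    import math

    def _h(s: str) -> int:
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    def affinity_group(identifier: str, n_nodes: int) -> int:
        """Map a node ID or resource name into one of ~sqrt(N) groups."""
        k = max(1, round(math.sqrt(n_nodes)))  # number of affinity groups
        return _h(identifier) % k

    # With ~10,000 nodes there are ~100 groups; a node and the resources it
    # should index land in groups chosen by this hash.
    g_node = affinity_group("node-110", n_nodes=10_000)
    g_res  = affinity_group("cnn.com", n_nodes=10_000)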
Kelips
[Figure: affinity groups 0, 1, 2, …, √N − 1, with peer membership assigned by consistent hash and a set of members per affinity group. Node 110's affinity group view:
    id    hbeat   rtt
    30    234     90ms
    230   322     30ms
Affinity group pointers connect 110 to fellow group members 30 and 230.]
Kelips
[Figure: same affinity-group picture; node 202 in group 2 is a "contact" for node 110. In addition to its affinity group view (id/hbeat/rtt: 30/234/90ms, 230/322/30ms), node 110 keeps a Contacts table:
    group   contactNode
    …       …
    2       202
Contact pointers reach into other groups. "cnn.com" maps to group 2, so 110 tells group 2 to "route" inquiries about cnn.com to it.]
Kelips
[Figure: node 110's full soft state: the affinity group view (id/hbeat/rtt), the Contacts table (group 2 → contactNode 202), and a Resource Tuples table:
    resource   info
    …          …
    cnn.com    110
A gossip protocol replicates this data cheaply.]
How it works
• Kelips is entirely gossip based!
  • Gossip about membership
  • Gossip to replicate and repair data
  • Gossip about "last heard from" time used to discard failed nodes
• Gossip "channel" uses fixed bandwidth
  • … fixed rate, packets of limited size
Gossip 101
• Suppose that I know something
• I'm sitting next to Fred, and I tell him
  • Now 2 of us "know"
• Later, he tells Mimi and I tell Anne
  • Now 4
• This is an example of a push epidemic (a tiny simulation follows below)
• Push-pull occurs if we exchange data
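A toy simulation of push gossip (illustrative; the parameters are arbitrary): in each round every node that knows the rumor tells one random peer, and the number that know it grows roughly exponentially until it saturates, after about log(N) rounds.

    import math
    import random

    def push_gossip_rounds(n: int, seed: int = 0) -> int:
        """Simulate push gossip over n nodes; return rounds until everyone knows."""
        random.seed(seed)
        infected = {0}                      # node 0 starts out knowing the rumor
        rounds = 0
        while len(infected) < n:
            targets = {random.randrange(n) for _ in infected}  # each teller picks a peer
            infected |= targets
            rounds += 1
        return rounds

    n = 1024
    print(push_gossip_rounds(n), "rounds vs log2(n) =", math.log2(n))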
Gossip scales very nicely
• Participants' loads independent of size
• Network load linear in system size
• Information spreads in log(system size) time
[Figure: % infected (0.0 to 1.0) vs. time]
Gossip in distributed systems
• We can gossip about membership
  • Need a bootstrap mechanism, but then discuss failures, new members
• Gossip to repair faults in replicated data
  • "I have 6 updates from Charlie"
• If we aren't in a hurry, gossip to replicate data too
Gossip about membership
• Start with a bootstrap protocol
  • For example, processes go to some web site and it lists a dozen nodes where the system has been stable for a long time
  • Pick one at random
• Then track "processes I've heard from recently" and "processes other people have heard from recently"
• Use push gossip to spread the word
Gossip about membership
• Until messages get full, everyone will know when everyone else last sent a message
  • With delay of log(N) gossip rounds…
• But messages will have bounded size
  • Perhaps 8K bytes
• Then use some form of "prioritization" to decide what to omit – but never send more, or larger, messages
• Thus: load has a fixed, constant upper bound except on the network itself, which usually has infinite capacity
Back to Kelips: Quick reminder
[Figure: the earlier Kelips picture again: affinity groups 0, 1, 2, …, √N − 1 formed by consistent hashing of peer membership; node 110's affinity group view (30 and 230 with heartbeats and RTTs) and its Contacts table pointing to node 202 in group 2]
How Kelips works
[Figure: Node 102's contact choices]
• Node 175 is a contact for Node 102 in some affinity group
• Hmm… Node 19 looks like a much better contact in affinity group 2
Gossip data stream
• Gossip about everything
• Heuristic to pick contacts: periodically ping contacts to check liveness, RTT… swap so-so ones for better ones.
Replication makes it robust
• Kelips should work even during disruptive episodes
  • After all, tuples are replicated to √N nodes
  • Query k nodes concurrently to overcome isolated crashes; this also reduces the risk that very recent data could be missed
• … we often overlook the importance of showing that systems work while recovering from a disruption
Chord can malfunction if the network partitions…
[Figure: a transient network partition separates the European and US nodes (IDs 0, 30, 64, 108, 123, 177, 199, 202, 241, 248, 255), which can reorganize into two disjoint rings]
… so, who cares?
• Chord lookups can fail… and it suffers from high overheads when nodes churn
  • Loads surge just when things are already disrupted… quite often, because of loads
  • And can't predict how long Chord might remain disrupted once it gets that way
• Worst case scenario: Chord can become inconsistent and stay that way
Control traffic load generated by churn
  Kelips:  None
  Chord:   O(changes)
  Pastry:  O(changes × nodes)?
Take-Aways?
• Surprisingly easy to superimpose a hash-table lookup onto a potentially huge distributed system!
• We've seen three O(log N) solutions and one O(1) solution (but Kelips needed more space)
• Sample applications?
  • Peer-to-peer file sharing
  • Amazon uses DHT for the shopping cart
  • CoDNS: A better version of DNS