UCDavis, ecs251 Spring 2007: Operating System Models
#3: Peer-to-Peer Systems
Dr. S. Felix Wu, Computer Science Department, University of California, Davis
http://www.cs.ucdavis.edu/~wu/
[email protected]
05/03/2007

The Role of the Service Provider
Centralized management of services
– DNS, Google, www.cnn.com, Blockbuster, SBC/Sprint/AT&T, cable service, Grid computing, AFS, bank transactions…
Information, computing, and network resources owned by one or very few administrative domains
– Some with an SLA (Service Level Agreement)

Interacting with the “SP”
Service providers own the information and the interactions
– Some enhance/establish the interactions

Let’s Compare …
Google, Blockbuster, CNN, MLB/NBA, LinkedIn, eBay
versus
Skype, BitTorrent, blogs, YouTube, botnets, cyber-paparazzi

Toward P2P
More participation by the end nodes (or their users)
– More decentralized computing/network resources available
– End-user controllability and interactions
– Security/robustness concerns

Service Providers in P2P
We might not like SPs, but we still cannot avoid them entirely.
– Who is going to lay the fiber and switches?
– Can we avoid DNS?
– How can we stop “cyber-bullying” and similar abuse?
– Copyright enforcement?
– Does the Internet become a junkyard?

We Will Discuss…
P2P system examples
– Unstructured, structured, incentive-based
Architectural analysis and issues
Future P2P applications, and why?

Challenge to You…
Define a new P2P-related application, service, or architecture. Justify why it is practical, useful, and will scale well.
– Example: sharing cooking recipes, or experiences and recommendations about restaurants and hotels

Napster
P2P file sharing; “unstructured”

Napster
(figure: peers and the Napster server's index)
1. File location request
2. List of peers offering the file
3. File request
4. File delivered
5. Index update

Napster
Advantages? Disadvantages?

Gnutella
Originally conceived by Justin Frankel, the 21-year-old founder of Nullsoft
March 2000: Nullsoft posts Gnutella to the web; a day later, AOL removes it at the behest of Time Warner
The Gnutella protocol, version 0.4: http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf
and version 0.6: http://rfc-gnutella.sourceforge.net/Proposals/Ultrapeer/Ultrapeers.htm
There are multiple open-source implementations at http://sourceforge.net/, including:
– Jtella
– Gnucleus
Software released under the Lesser Gnu Public License (LGPL)
The Gnutella protocol has been widely analyzed

Gnutella Protocol Messages
Broadcast messages
– Ping: initiating message (“I’m here”)
– Query: search pattern and TTL (time-to-live)
Back-propagated messages
– Pong: reply to a ping; contains information about the peer
– Query response: contains information about the computer that has the needed file
Node-to-node messages
– GET: return the requested file
– PUSH: push the file to me

Gnutella Query Example
(figures: a seven-node example network; replies A:5 and A:7 flow back toward node 2)
Steps:
• Node 2 initiates a search for file A
• It sends the query message to all of its neighbors
• Neighbors forward the message
• Nodes that have file A initiate a reply message
• The query reply message is back-propagated along the query's reverse path
• File download
This illustrates limited-scope flooding (queries) and reverse-path forwarding (replies).
Note: file transfer between two clients that are both behind firewalls is not possible; if only one client, X, is behind a firewall, Y can request that X push the file to Y.

Gnutella
Advantages? Disadvantages?

GUID: short for Globally Unique Identifier, a randomized string used to uniquely identify a host or message on the Gnutella network. GUIDs prevent duplicate messages from circulating on the network.
GWebCache: a distributed system for helping servents connect to the Gnutella network, thus solving the “bootstrapping” problem. Servents query any of several hundred GWebCache servers to find the addresses of other servents. GWebCache servers are typically web servers running a special module.
Host Catcher: Pong responses allow servents to keep track of active Gnutella hosts.
On most servents, the default port for Gnutella is 6346.

Gnutella Network Growth
(figure: number of nodes in the largest network component, in thousands, measured from November 2000 through May 2001)

“Limited-Scope Flooding”
Ripeanu reported that Gnutella traffic totals 1 Gbps (or 330 TB/month).
– Compare to 15,000 TB/month in the US Internet backbone (December 2000)
– This estimate excludes actual file transfers
Reasoning:
– QUERY and PING messages are flooded; they form more than 90% of generated traffic
– The predominant TTL is 7, and >95% of nodes are less than 7 hops away
– Measured traffic at each link is about 6 kbps
– The network has 50k nodes and 170k links

(figure: overlay nodes A–H mapped onto the underlying network — a perfect mapping)
Perfect mapping

(figure: the same overlay wired inefficiently)
Inefficient mapping: link D–E needs to support six times more traffic.

Topology Mismatch
The overlay network topology doesn’t match the underlying Internet infrastructure topology!
– 40% of all nodes are in the 10 largest Autonomous Systems (ASes)
– Only 2–4% of all TCP connections link nodes within the same AS
– Largely “random wiring”
Most Gnutella-generated traffic crosses AS borders, making the traffic more expensive.
This may cause ISPs to change their pricing schemes.

Scalability
Whenever a node receives a message (ping/query), it sends copies out on all of its other connections.
Existing mechanisms to reduce traffic:
– TTL counter
– Nodes cache information about messages they have received, so that they don't forward duplicates.
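The two traffic-reduction mechanisms above, the TTL counter and the duplicate-message cache, can be sketched in a small in-memory simulation (a hypothetical model for illustration, not any real Gnutella implementation). Returning hits up the recursive call chain models reverse-path forwarding.

```python
import uuid

class Node:
    """Toy Gnutella-style node: flood queries with a TTL, drop duplicates by GUID."""
    def __init__(self, name):
        self.name = name
        self.neighbors = []
        self.files = set()
        self.seen = set()                # GUIDs already processed (duplicate suppression)

    def query(self, filename, ttl=7):
        guid = uuid.uuid4().hex          # message GUID
        self.seen.add(guid)
        hits = []
        for n in self.neighbors:
            hits += n.receive(guid, filename, ttl, prev=self)
        return hits

    def receive(self, guid, filename, ttl, prev):
        if guid in self.seen:            # already saw this message: drop it
            return []
        self.seen.add(guid)
        hits = [self.name] if filename in self.files else []
        if ttl > 1:                      # the TTL counter limits the flood's scope
            for n in self.neighbors:
                if n is not prev:
                    hits += n.receive(guid, filename, ttl - 1, prev=self)
        return hits                      # bubbling hits back up models reverse-path forwarding
```

On a chain a–b–c–d with the file at c and d, a query from a with TTL 2 reaches only c; TTL 3 reaches both.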
Free Riding
70% of Gnutella users share no files; 90% of users answer no queries.
Those who do have files to share may limit their number of connections or their upload speed, resulting in a high download-failure rate.
If only a few individuals contribute to the public good, those few peers effectively act as centralized servers.

Anonymity
Gnutella provides some anonymity by masking the identity of the peer that generated a query.
However, IP addresses are revealed at various points in its operation: query-hit packets include the URL for each file, revealing IP addresses.

Query Expressiveness
The query format is not standardized: there is no standard format or matching semantics for the QUERY string, so its interpretation is determined entirely by each node that receives it.
– String literal vs. regular expression?
– Directory name, filename, or file contents?
Malicious users may even return files unrelated to the query.

Superpeers
Cooperative, long-lived peers, typically with significant resources, that handle a very high volume of query-resolution traffic.
Gnutella Summary
Gnutella is a self-organizing, large-scale P2P application that produces an overlay network on top of the Internet; it appears to work.
Growth is hindered by the volume of generated traffic and by inefficient resource use.
Since there is no central authority, the open-source community must commit to making any changes.
Suggested changes have been made in:
– “Peer-to-Peer Architecture Case Study: Gnutella Network,” by Matei Ripeanu
– “Improving Gnutella Protocol: Protocol Analysis and Research Proposals,” by Igor Ivkovic

Freenet
Essentially the same as Gnutella:
– Limited-scope flooding
– Reverse-path forwarding
Difference:
– Data objects (i.e., files) are also delivered via reverse-path forwarding

P2P Issues
– Scalability & load balancing
– Anonymity
– Fairness, incentives & trust
– Security and robustness
– Efficiency
– Mobility

Incentive-Driven Fairness
P2P means we all should contribute…
– Hopefully fairly, but the majority is selfish…
We need an incentive for people to contribute.

BitTorrent: “Tit for Tat”
Equivalent retaliation (game theory)
– A peer will initially cooperate, then respond in kind to its opponent's previous action: if the opponent was previously cooperative, the agent is cooperative; if not, the agent is not.
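The strategy just described fits in a couple of lines; a sketch, with True standing for "cooperate" (in BitTorrent terms, "keep uploading to this peer"):

```python
def tit_for_tat(opponent_history):
    """Cooperate first; afterwards, mirror the opponent's previous move."""
    if not opponent_history:
        return True                   # initial cooperation
    return opponent_history[-1]       # respond in kind to the last action
```

For example, `tit_for_tat([])` cooperates, while `tit_for_tat([True, False])` defects, because the opponent defected last round.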
BitTorrent
Fairness of download and upload between a pair of peers.
Every 10 seconds, a client estimates the download bandwidth from each peer.
– Based on this performance estimate, it decides whether to continue uploading to that peer.

Client and Its Peers
Client: download rate (from the peers)
Peers: upload rate (to the client)

BT Choking by the Client
By default, every peer is “choked”
– The client stops uploading to them, but the TCP connection stays open.
The client selects four peers to “unchoke”
– Those with the best upload rates that are also “interested”
– It uploads to the unchoked peers and monitors the download rate from all peers
– It “re-chokes” every 30 seconds
Optimistic unchoking
– Randomly select one choked peer to unchoke

“Interested”
A request for a piece (or its sub-pieces).

Becoming a “Seed”
A seed uses its upload rate to the peers to decide which peers to unchoke.

(Reference: the BitTorrent wiki)

BT Peer Selection
From the “tracker”:
– We receive a partial list of all active peers for the same file
– We can get another 50 peers from the tracker if we want

Piece Selection
Pieces (64 KB–1 MB) are divided into sub-pieces (16 KB)
– Piece size is a trade-off between performance and the size of the torrent file itself
– A client might request different sub-pieces of the same piece from different peers
Strict priority between the sub-pieces of a piece
Rarest first
– Exception: “random first”
– Goal: get the content out of the seed(s) as soon as possible

Rarest First
The client exchanges bitmaps with 20+ peers
– Initial messages
– “have” messages
Array of buckets
– The i-th bucket contains pieces with i known instances
– Within the same bucket, the client randomly selects one piece
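The bucket scheme above can be sketched directly, assuming we already hold each peer's bitmap as a set of piece indices (the function name and signature are illustrative, not from any BitTorrent client):

```python
import random
from collections import defaultdict

def rarest_first(needed_pieces, peer_bitmaps, rng=random):
    """Pick a random piece from the bucket with the fewest known instances."""
    counts = defaultdict(int)
    for bitmap in peer_bitmaps:            # one set of piece indices per peer
        for piece in bitmap:
            counts[piece] += 1
    buckets = defaultdict(list)            # i-th bucket: pieces with i known instances
    for piece in needed_pieces:
        if counts[piece] > 0:              # only pieces some peer can actually serve
            buckets[counts[piece]].append(piece)
    if not buckets:
        return None
    return rng.choice(buckets[min(buckets)])   # random choice within the rarest bucket
```

With peers advertising {0, 1}, {1}, and {1, 2}, pieces 0 and 2 each have one known instance while piece 1 has three, so the client picks 0 or 2 at random.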
Random First
Rare pieces are, by definition, held by one or very few peers, so an all-rarest-first client would have to fetch all of a piece's sub-pieces from very few sources.
For its first 4–5 pieces, the client therefore fetches random pieces, so that it quickly has a few pieces to upload.

BitTorrent Summary
Connect to the tracker
Connect to 20+ peers
Random-first, then rarest-first piece selection
Monitor the download rate from the peers (or the upload rate to the client)
Unchoke and optimistic unchoke

BitTorrent
Advantages? Disadvantages?

Trackerless BitTorrent
Every BT peer is a tracker!
But how would peers share and exchange information about other peers?
This is similar to Napster's index server, or to DNS.

Pure P2P
Every peer is a tracker.
Every peer is a DNS server.
Every peer is a Napster index server.
How can this be done?
– We try to remove or reduce the role of “special servers”!

Peer
What are the requirements of a peer?

Structured Peering
Peer identity and routability
Key/content assignment
– Which identity owns what? (Google Search?)
Existing answers:
– Napster: centralized index service
– Skype/Kazaa: login server & superpeers
– DNS: hierarchical DNS servers
Two problems: (1) How does a peer connect to the “ring”? (2) How do we handle failures and changes?
DHT
Distributed hash tables (DHTs)
– A decentralized lookup service with a hash-table interface
– (name, value) pairs are stored in the DHT
– Any peer can efficiently retrieve the value associated with a given name
– The mapping from names to values is distributed among the peers

A Hash Table as a Search Table
Information/content is distributed, and we need to know where it is.
Index-key examples:
– Where is this piece of music?
– What is the location of this type of content?
– What is the current IP address of this Skype user?

DHT as a Search Table
(figures: the index is partitioned across peers — how is a key mapped to the peer that holds it?)

DHT
Scalability
Peer arrivals, departures, and failures
Unstructured versus structured

DHT (Name, Value)
How can we use a DHT to avoid trackers in BitTorrent?

DHT-Based Tracker
Example: FreeBSD 5.4 CD images.
Publish the key on the class web site.
Whoever owns this hash entry is the tracker for the corresponding key!
The value is the seed's IP address; the operations are PUT and GET.

Chord
Consistent hashing
A simple key-lookup algorithm
A scalable key-lookup algorithm
Node joins and stabilization
Node failures

Chord
Given a key (data item), Chord maps the key onto a peer.
It uses consistent hashing to assign keys to peers.
This solves the problem of locating a key in a collection of distributed peers.
Chord maintains routing information as peers join and leave the system.

Issues
Load balance: a distributed hash function spreads keys evenly over the peers.
Decentralization: Chord is fully distributed; no node is more important than any other, which improves robustness.
Scalability: lookup cost grows logarithmically with the number of peers, so even very large systems are feasible.
Availability: Chord automatically adjusts its internal tables to ensure that the peer responsible for a key can always be found.

Example Application
(figure: a file system layered over block stores, with Chord running on the client and on each server)
The highest layer provides a file-like interface to the user, including user-friendly naming and authentication.
This file system maps its operations to lower-level block operations.
Block storage uses Chord to identify the node responsible for storing a block, and then talks to the block-storage server on that node.

Consistent Hashing
A consistent hash function assigns each peer and each key an m-bit identifier; SHA-1 is used as a base hash function.
A peer's identifier is defined by hashing the peer's IP address (and port).
A key identifier is produced by hashing the key (Chord doesn't define this; it depends on the application).
– ID(peer) = hash(IP, port)
– ID(key) = hash(key)

Consistent Hashing
In an m-bit identifier space, there are 2^m identifiers.
Identifiers are ordered on an identifier circle modulo 2^m; this identifier ring is called the Chord ring.
Key k is assigned to the first peer whose identifier is equal to or follows (the identifier of) k in the identifier space. This peer is the successor peer of key k, denoted successor(k).
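This assignment rule is short enough to sketch. A minimal version with a tiny m for readability (the SHA-1 truncation here is just for the demo; Chord itself keeps the full m-bit digest):

```python
import hashlib
from bisect import bisect_left

M = 3                                     # tiny identifier space for the demo: IDs 0..7

def chord_id(data: str, m: int = M) -> int:
    """m-bit identifier derived from SHA-1, as in Chord (truncated for the demo)."""
    return int.from_bytes(hashlib.sha1(data.encode()).digest(), "big") % (2 ** m)

def successor(key_id, peer_ids):
    """First peer whose identifier is equal to or follows key_id on the ring."""
    ring = sorted(peer_ids)
    i = bisect_left(ring, key_id)         # first peer ID >= key_id
    return ring[i % len(ring)]            # wrap around past 2^m - 1
```

With peers {0, 1, 3} on the 3-bit ring, this reproduces the example used later in these slides: successor(1) = 1, successor(2) = 3, and successor(6) = 0 (wrapping around).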
Consistent Hashing – Successor Peers
(figure: an identifier circle with m = 3, peers 0, 1, and 3, and keys 1, 2, and 6: successor(1) = 1, successor(2) = 3, successor(6) = 0)

Consistent Hashing – Join and Departure
When a node n joins the network, certain keys previously assigned to n's successor become assigned to n.
When node n leaves the network, all of its assigned keys are reassigned to n's successor.
(figures: node join and node departure on the example ring, with keys changing hands accordingly)

Technical Issues
???

A Simple Key Lookup
A very small amount of routing information suffices to implement consistent hashing in a distributed environment.
If each node knows only how to contact its current successor node on the identifier circle, all nodes can be visited in linear order.
A query for a given identifier is passed around the circle via these successor pointers until it encounters the node that holds the key.

A Simple Key Lookup
Pseudocode for finding a successor:

// ask node n to find the successor of id
n.find_successor(id)
  if (id ∈ (n, successor])
    return successor;
  else // forward the query around the circle
    return successor.find_successor(id);

A Simple Key Lookup
(figure: the path taken by a query from node 8 for key 54)

Successor
Each active node MUST know the IP address of its successor!
– N8 has to know that the next node on the ring is N14.
– If N14 departs, N8's successor becomes N21.
But what about a failure or crash?
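The simple-lookup pseudocode above can be made runnable; the one subtlety is that the interval test (n, successor] is circular, so it must handle wrapping past zero. A sketch:

```python
def in_interval(x, a, b):
    """True if x lies in the circular half-open interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b                # the interval wraps past zero (includes a == b: whole ring)

class ChordNode:
    """Node holding only a successor pointer: O(N) lookups, O(1) routing state."""
    def __init__(self, ident):
        self.id = ident
        self.successor = self             # a one-node ring points at itself

    def find_successor(self, key_id):
        if in_interval(key_id, self.id, self.successor.id):
            return self.successor
        return self.successor.find_successor(key_id)   # forward around the circle
```

On the ring N8, N14, N21, N32, N38, N42, N48, N51, N56 used in the slides' figure, a lookup from node 8 for key 54 walks the circle node by node and returns node 56.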
Robustness
Keep the successors R hops ahead:
– N8 => N14, N21, N32, N38 (R = 4)
– Periodically ping along the path to check liveness, and also to discover any “new members” in between

Is that good enough?

Complexity of the Search
Time/messages: O(N)
– N: number of nodes on the ring
Space: O(1)
– We only need to remember R IP addresses
Stabilization depends on the “period”.

Scalable Key Location
To accelerate lookups, Chord maintains additional routing information.
This additional information is not essential for correctness, which is achieved as long as each node knows its correct successor.

Scalable Key Location – Finger Tables
Each node n maintains a routing table with up to m entries (m is the number of bits in the identifiers), called the finger table.
The i-th entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle: s = successor(n + 2^(i-1)).
s is called the i-th finger of node n, denoted n.finger(i).

Scalable Key Location – Finger Tables
(figure: finger tables for the m = 3 ring with peers 0, 1, and 3 —
node 0: starts 1, 2, 4 → successors 1, 3, 0; holds key 6;
node 1: starts 2, 3, 5 → successors 3, 3, 0; holds key 1;
node 3: starts 4, 5, 7 → successors 0, 0, 0; holds key 2)

Finger Tables
A finger-table entry includes both the Chord identifier and the IP address (and port number) of the relevant node.
The first finger of n is the immediate successor of n on the circle.
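Under the definition above, finger tables can be computed from a global view of the ring. This helper is illustrative only — a real node fills its fingers via find_successor rather than from a global peer list:

```python
def finger_table(n, peer_ids, m):
    """The i-th entry (i = 1..m) is successor((n + 2^(i-1)) mod 2^m)."""
    ring = sorted(peer_ids)
    def succ(k):
        for p in ring:
            if p >= k:                    # first peer ID >= k
                return p
        return ring[0]                    # wrap around the ring
    return [succ((n + 2 ** (i - 1)) % 2 ** m) for i in range(1, m + 1)]
```

For the 3-bit ring with peers {0, 1, 3}, this reproduces the tables in the figure: node 0 → [1, 3, 0], node 1 → [3, 3, 0], node 3 → [0, 0, 0].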
Scalable Key Location – Example Query
(figure: the path of a query for key 54 starting at node 8, using finger tables)

Scalable Key Location – A Characteristic
Since each node has finger entries at power-of-two intervals around the identifier circle, each node can forward a query at least halfway along the remaining distance between itself and the target identifier.
From this intuition follows a theorem:
Theorem: with high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N).

Complexity of the Search
Time/messages: O(log N)
– N: number of nodes on the ring
Space: O(log N)
– We need to remember R IP addresses
– We need to remember log N fingers
Stabilization depends on the “period”.

An Example
M = 4096 (identifier size in bits), so the ring size is 2^4096.
N = 2^16 (number of nodes).
How many entries do we need in the finger table?
Recall: each node maintains a routing table with up to m entries (the number of bits in the identifiers), where the i-th entry at node n contains the first node s that succeeds n by at least 2^(i-1): s = successor(n + 2^(i-1)).

Complexity of the Search
Time/messages: O(M)
– M: number of bits in the identifier
Space: O(M)
– We need to remember R IP addresses
– We need to remember M fingers
Stabilization depends on the “period”.

Structured Peering
Peer identity and routability
– 2^M identifiers, finger-table routing
Key/content assignment
– Hashing
Dynamics/failures
– Inconsistency??

Node Joins and Stabilization
The most important thing is the successor pointer.
If the successor pointer is kept up to date — which is sufficient to guarantee correctness of lookups — then the finger table can always be verified.
Each node runs a “stabilization” protocol periodically in the background to update its successor pointer and finger table.

Node Joins and Stabilization
The “stabilization” protocol contains six functions:
– create()
– join()
– stabilize()
– notify()
– fix_fingers()
– check_predecessor()

Node Joins – join()
When node n first starts, it calls n.join(n’), where n’ is any known Chord node.
The join() function asks n’ to find the immediate successor of n.
join() does not make the rest of the network aware of n.

Node Joins – join()

// create a new Chord ring.
n.create()
  predecessor = nil;
  successor = n;

// join a Chord ring containing node n’.
n.join(n’)
  predecessor = nil;
  successor = n’.find_successor(n);

Node Joins – stabilize()
Each time node n runs stabilize(), it asks its successor for that node's predecessor p, and decides whether p should be n's successor instead.
stabilize() also notifies node n's successor of n's existence, giving the successor the chance to change its predecessor to n. The successor does this only if it knows of no closer predecessor than n.

Node Joins – stabilize()

// called periodically. verifies n's immediate
// successor, and tells the successor about n.
n.stabilize()
  x = successor.predecessor;
  if (x ∈ (n, successor))
    successor = x;
  successor.notify(n);

// n’ thinks it might be our predecessor.
n.notify(n’)
  if (predecessor is nil or n’ ∈ (predecessor, n))
    predecessor = n’;

Node Joins – Join and Stabilization
Example: node n joins between np and ns.
– n joins: predecessor = nil; n acquires ns as its successor via some n’
– n runs stabilize(): n notifies ns that n may be its new predecessor; ns acquires n as its predecessor (pred(ns) changes from np to n)
– np runs stabilize(): np asks ns for its predecessor (now n); np acquires n as its successor (succ(np) changes from ns to n); np notifies n; n acquires np as its predecessor
All predecessor and successor pointers are now correct. Fingers still need to be fixed, but old fingers will still work.

Node Joins – fix_fingers()
Each node periodically calls fix_fingers() to make sure its finger-table entries are correct.
– This is how new nodes initialize their finger tables.
– It is also how existing nodes incorporate new nodes into their finger tables.

Node Joins – fix_fingers()

// called periodically. refreshes finger table entries.
n.fix_fingers()
  next = next + 1;
  if (next > m)
    next = 1;
  finger[next] = find_successor(n + 2^(next-1));

// checks whether predecessor has failed.
n.check_predecessor()
  if (predecessor has failed)
    predecessor = nil;

Node Failures
The key step in failure recovery is maintaining correct successor pointers.
To help achieve this, each node maintains a successor list of its r nearest successors on the ring.
If node n notices that its successor has failed, it replaces it with the first live entry in the list.
Successor lists are stabilized as follows:
– Node n reconciles its list with its successor s by copying s's successor list, removing its last entry, and prepending s to it.
– If node n notices that its successor has failed, it replaces it with the first live entry in its successor list and reconciles its successor list with its new successor.
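The two list-maintenance rules above are mechanical enough to sketch directly (the function names and signatures are illustrative, not from the Chord paper's pseudocode):

```python
def reconcile(successor_id, successors_of_successor, r):
    """Copy the successor's list, drop its last entry, and prepend the successor."""
    return ([successor_id] + successors_of_successor[:-1])[:r]

def first_live(successor_list, alive):
    """Replacement for a failed successor: the first live entry in the list."""
    for node in successor_list:
        if node in alive:
            return node
    return None                           # every listed successor failed
```

For example, with r = 4, if node 8's successor is 14 and 14's list is [21, 32, 38, 42], then node 8's reconciled list is [14, 21, 32, 38]; if 14 then fails, node 8's new successor is 21.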
Chord – The Math
Every node is responsible for about K/N keys (N nodes, K keys).
When a node joins or leaves an N-node network, only O(K/N) keys change hands (and only to or from the joining or leaving node).
Lookups need O(log N) messages.
To re-establish the routing invariants and finger tables after a node joins or leaves, only O(log^2 N) messages are required.
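The K/N load-balance claim is easy to probe empirically. A sketch that hashes synthetic node and key names into a 2^16-identifier ring and counts keys per node (note the Chord paper achieves tight balance only with virtual nodes; plain hashing, as here, leaves a noticeable spread around the K/N average):

```python
import hashlib
from bisect import bisect_left

M = 16                                         # 2^16 identifiers

def ident(s):
    """16-bit identifier from SHA-1 (truncated for the demo)."""
    return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big") % (2 ** M)

node_ids = sorted({ident(f"node-{i}") for i in range(64)})   # synthetic peer names
counts = dict.fromkeys(node_ids, 0)

K = 4096
for k in range(K):
    kid = ident(f"key-{k}")
    i = bisect_left(node_ids, kid)             # successor(kid) owns the key
    counts[node_ids[i % len(node_ids)]] += 1   # wrap around the ring

avg = K / len(node_ids)                        # about K/N keys per node on average
```

Every key lands on exactly one node, so the per-node counts sum to K, and the mean load is exactly K/N.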