An Introduction to
Peer-to-Peer networks
Diganta Goswami
IIT Guwahati
Outline
• Overview of P2P overlay networks
  • Applications of overlay networks
  • Classification of overlay networks
• Structured overlay networks
• Unstructured overlay networks
• Overlay multicast networks
2
Overview of P2P overlay networks

• What are P2P systems?
  • P2P refers to applications that take advantage of resources (storage, cycles, content, human presence) available at the end systems of the Internet.
• What are overlay networks?
  • Overlay networks are networks constructed on top of another network (e.g. IP).
• What is a P2P overlay network?
  • Any overlay network constructed by Internet peers in the application layer on top of the IP network.
3
What are P2P systems?
• Multiple sites (at the edge)
• Distributed resources
• Sites are autonomous (different owners)
• Sites are both clients and servers
• Sites have equal functionality
4
Internet P2P Traffic Statistics

Between 50 and 65 percent of all download traffic is
P2P related.
Between 75 and 90 percent of all upload traffic is P2P
related.
And it seems that more people are using p2p today

So what do people download?




61.4 % video
11.3 % audio
27.2 % games/software/etc.
Source: http://torrentfreak.com/peer-to-peer-trafficstatistics/
5
P2P overlay networks properties
Efficient use of resources
 Self-organizing
 All peers organize themselves into an application
layer network on top of IP.
 Scalability
 Consumers of resources also donate resources
 Aggregate resources grow naturally with
utilization

6
P2P overlay network properties
• Reliability
  • No single point of failure
  • Redundant overlay links between the peers
  • Redundant data sources
• Ease of deployment and administration
  • The nodes are self-organized
  • No need to deploy servers to satisfy demand
  • Built-in fault tolerance, replication, and load balancing
  • No changes required in the underlying IP network
7
P2P Applications

• P2P File Sharing
  • Napster, Gnutella, Kazaa, eDonkey, BitTorrent
  • Chord, CAN, Pastry/Tapestry, Kademlia
• P2P Communications
  • Skype, Social Networking Apps
• P2P Distributed Computing
  • Seti@home
8
Popular file sharing P2P Systems
Napster, Gnutella, Kazaa, Freenet
 Large scale sharing of files.

 User
A makes files (music, video, etc.) on
their computer available to others
 User B connects to the network, searches for
files and downloads files directly from user A

Issues of copyright infringement
9
P2P/Grid Distributed Processing

• seti@home
  • Search for extraterrestrial intelligence
  • Central site collects radio telescope data
  • Data is divided into work chunks of 300 Kbytes
  • User obtains a client, which runs in the background
  • Peer sets up a TCP connection to the central computer and downloads a chunk
  • Peer does an FFT on the chunk, uploads the results, and gets a new chunk
• Not P2P communication, but it exploits peer computing power
10
Key Issues

Management
 How
to maintain the P2P system under high rate of
churn efficiently
 Application reliability is difficult to guarantee

Lookup
 How
to find out the appropriate content/resource that
a user wants

Throughput
 Content
distribution/dissemination applications
 How to copy content fast, efficiently, reliably
11
Management Issue

• A P2P network must be self-organizing.
  • Join and leave operations must be self-managed.
  • The infrastructure is untrusted and the components are unreliable.
  • The number of faulty nodes grows linearly with system size.
• Tolerance to failures and churn
  • Content replication, multiple paths
  • Leverage knowledge of the executing application
  • Load balancing
• Dealing with free riders
  • Free rider: rational or selfish users who consume more than their fair share of a public resource, or shoulder less than a fair share of the costs of its production.
12
Lookup Issue


How do you locate data/files/objects in a large
P2P system built around a dynamic set of nodes
in a scalable manner without any centralized
server or hierarchy?
Efficient routing even if the structure of the
network is unpredictable.
 Unstructured
P2P : Napster, Gnutella, Kazaa
 Structured P2P : Chord, CAN, Pastry/Tapestry,
Kademlia
13
Classification of overlay networks

• Structured overlay networks
  • Are based on Distributed Hash Tables (DHTs)
  • The overlay network assigns keys to data items and organizes its peers into a graph that maps each data key to a peer.
• Unstructured overlay networks
  • The overlay network organizes peers in a random graph, in a flat or hierarchical manner.
• Overlay multicast networks
  • The peers organize themselves into an overlay tree for multicasting.
14
Structured overlay networks

• Overlay topology construction is based on NodeIDs that are generated by using Distributed Hash Tables (DHTs).
• The overlay network assigns keys to data items and organizes its peers into a graph that maps each data key to a peer.
• This structured graph enables efficient discovery of data items using the given keys.
• It guarantees object detection in O(log n) hops.
15
Unstructured P2P overlay networks

• An unstructured system is composed of peers joining the network under some loose rules, without any prior knowledge of the topology.
• The network uses flooding or random walks as the mechanism to send queries across the overlay with a limited scope.
16
Unstructured P2P File Sharing Networks

• Centralized directory based P2P systems
• Pure P2P systems
• Hybrid P2P systems
17
Unstructured P2P File Sharing Networks

• Centralized directory based P2P systems
  • All peers are connected to a central entity
  • Peers establish connections between each other on demand to exchange user data (e.g. mp3 compressed data)
  • The central entity is necessary to provide the service
  • The central entity is some kind of index/group database
  • The central entity is a lookup/routing table
  • Examples: Napster, BitTorrent
18
Napster

• Was used primarily for file sharing
• NOT a pure P2P network => hybrid system
• Mode of operation:
  • Client sends the server a query; the server asks everyone and responds to the client
  • Client gets a list of clients from the server
  • All clients send IDs of the data they hold to the server; when a client asks for data, the server responds with specific addresses
  • The peer downloads directly from the other peer(s)
19
Centralized Network (Napster model)

[Figure: clients send a Query to the central Server and receive a Reply; the file transfer itself happens directly between clients]

• Nodes register their contents with the server
• Centralized server for searches
• File access is done on a peer-to-peer basis
– Poor scalability
– Single point of failure
20
Napster

Further services:
 Chat
program, instant messaging service, tracking
program,…

Centralized system
 Single
point of failure => limited fault tolerance
 Limited scalability (server farms with load balancing)

Query is fast and upper bound for duration can
be given
21
Gnutella
• Pure peer-to-peer
• Very simple protocol
• No routing "intelligence"
• Constrained broadcast (see the flooding sketch below)
  • Lifetime of packets limited by a TTL (typically set to 7)
  • Packets have unique IDs to detect loops
22
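A minimal sketch of the constrained broadcast just described, assuming a toy in-memory overlay; the class name Peer and the fields files, neighbors and seen_ids are illustrative, not part of the Gnutella protocol.

    import uuid

    class Peer:
        def __init__(self, name, files):
            self.name = name
            self.files = set(files)   # content this peer shares
            self.neighbors = []       # overlay neighbours (TCP links in real Gnutella)
            self.seen_ids = set()     # query ids already handled (loop detection)

        def query(self, keyword, ttl=7):
            # originate a query with a fresh unique id and flood it
            return self.handle_query(uuid.uuid4().hex, keyword, ttl)

        def handle_query(self, qid, keyword, ttl):
            if qid in self.seen_ids or ttl <= 0:
                return []                      # drop duplicates and expired packets
            self.seen_ids.add(qid)
            hits = [self.name] if any(keyword in f for f in self.files) else []
            for n in self.neighbors:           # constrained broadcast to neighbours
                hits += n.handle_query(qid, keyword, ttl - 1)
            return hits                        # hits travel back along the reverse path

    # usage sketch:
    # a, b, c = Peer("A", []), Peer("B", ["hey_jude.mp3"]), Peer("C", [])
    # a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]
    # print(a.query("hey_jude"))   # -> ['B']

Peers drop a query once its TTL reaches zero or its unique id has already been seen, which is exactly what keeps the flood bounded.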
Query flooding: Gnutella

fully distributed
 no


central server
public domain
protocol
many Gnutella clients
implementing protocol
overlay network: graph
 edge between peer X and
Y if there’s a TCP
connection
 all active peers and
edges is overlay net
 Edge is not a physical
link
 Given peer will typically
be connected with < 10
overlay neighbors
23
Gnutella: protocol
• Query messages are sent over existing TCP connections
• Peers forward Query messages
• QueryHit messages are sent back over the reverse Query path
• Scalability: limited-scope flooding
• File transfer: HTTP

[Figure: Query messages flooding across overlay edges, with QueryHit replies returning along the reverse path]
24
Gnutella: Scenario
Step 0: Join the network
Step 1: Determining who is on the network
• A "Ping" packet is used to announce your presence on the network.
• Other peers respond with a "Pong" packet.
• They also forward your Ping to other connected peers.
• A Pong packet also contains:
  • an IP address
  • port number
  • amount of data that peer is sharing
• Pong packets come back via the same route
Step 2: Searching
• A Gnutella "Query" asks other peers (usually 7) if they have the file you desire
• A Query packet might ask, "Do you have any content that matches the string 'Hey Jude'?"
• Peers check to see if they have matches and respond (if they have any matches), and forward the packet to their connected peers (usually 7) otherwise
• Continues for the TTL (how many hops a packet can go before it dies, typically 10)
Step 3: Downloading
• Peers respond with a "QueryHit" (contains contact info)
• File transfers use a direct connection using the HTTP protocol's GET method
25
Gnutella: Peer joining
1. Joining peer X must find some other peer in the Gnutella network: use a list of candidate peers
2. X sequentially attempts to make a TCP connection with peers on the list until a connection is set up with Y
3. X sends a Ping message to Y; Y forwards the Ping message.
4. All peers receiving the Ping message respond with a Pong message
5. X receives many Pong messages. It can then set up additional TCP connections
26
Gnutella - PING/PONG

[Figure: node 1 floods Ping messages to its neighbours; Pong replies (e.g. Pong 3,4,5 and Pong 6,7,8) travel back along the reverse path, so node 1 learns the set of known hosts. Query/Response works analogously.]
27
Unstructured Blind - Gnutella
Breadth-First Search (BFS)

[Figure: BFS flooding from a source node; the legend distinguishes the source, nodes that forward the query, nodes that process the query, nodes where a result is found, and nodes that forward the response]
28
Unstructured Blind - Gnutella


A node/peer connects to a set of Gnutella
neighbors
Forward queries to neighbors

Client which has the Information responds.

Flood network with TTL for termination
+ Results are complete
– Bandwidth wastage
29
Gnutella: Reachable Users (analytical estimate)

[Table: estimated number of reachable users as a function of T (TTL) and N (neighbors per query); an approximate formula is given below]
30
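The slide's table is not reproduced here; as a hedged back-of-the-envelope estimate, if each peer forwards a query to N neighbours and each forwarded copy reaches N − 1 new peers, the number of users reachable within TTL T is approximately

$\mathrm{reachable}(T, N) \approx \sum_{t=1}^{T} N\,(N-1)^{t-1}$

For example, N = 4 and T = 7 give 4(1 + 3 + 9 + ... + 3^6) = 4 · 1093 = 4372 users, ignoring overlap between neighbour sets.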
Gnutella : Search Issue

Flooding based search is extremely wasteful with
bandwidth




A large (linear) part of the network is covered irrespective of
hits found
Enormous number of redundant messages
All users do this in parallel: local load grows linearly with size
What can be done?

Controlling topology to allow for better search


Random walk, Degree-biased Random Walk
Controlling placement of objects

Replication
31
Gnutella: Random Walk

• Basic strategy
  • In a scale-free graph, high-degree nodes are easy to find by a (biased) random walk
    • A scale-free graph is a graph whose degree distribution follows a power law
  • And high-degree nodes can store an index covering a large portion of the network
• Random walk
  • Avoid visiting the last visited node
• Degree-biased random walk (see the sketch below)
  • Select the highest-degree neighbor that has not been visited
  • This first climbs to the highest-degree node, then climbs down the degree sequence
  • Provably optimal coverage
32
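A minimal sketch of a degree-biased random walk, under the assumption that the overlay is given as a dict mapping each node to its neighbour list; the function and variable names are illustrative, not from the slides.

    def degree_biased_walk(graph, start, steps):
        # graph: dict mapping node -> list of neighbours (assumed representation)
        visited = [start]
        current = start
        for _ in range(steps):
            neighbors = graph[current]
            unvisited = [n for n in neighbors if n not in visited]
            candidates = unvisited if unvisited else neighbors
            # bias: move to the neighbour with the largest degree
            current = max(candidates, key=lambda n: len(graph[n]))
            visited.append(current)
        return visited

    # usage sketch:
    # g = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
    # print(degree_biased_walk(g, "a", 3))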
Gnutella : Replication

Spread copies of objects to peers: more
popular objects can be found easier

Replication strategies




Owner replication
Path replication
Random replication
But there is still the difficulty with rare objects.
33
Random Walkers

• Improved unstructured blind search
• Similar structure to Gnutella
• Forward the query (called a walker) to a random subset of the node's neighbors
+ Reduced bandwidth requirements
– Incomplete results
34
Unstructured Informed Networks

Zero in on target based on information about the query
and the neighbors.

Intelligent routing
+ Reduces number of messages
+ Not complete, but more accurate
– COST: Must thus flood in order to get initial information
35
Informed Searches: Local Indices

Node keeps track of information available within
a radius of r hops around it.

Queries are made to neighbors just beyond the r
radius.
+ Flooding limited to bounded part of network
36
Routing Indices

For each query, calculate goodness of each
neighbor.

Calculating goodness:
 Categorize
or separate query into themes
 Rank best neighbors for a given theme based on
number of matching documents

Follows chain of neighbors that are expected to
yield the best results

Backtracking possible
37
Free riding


File sharing networks rely on users sharing data
Two types of free riding
 Downloading
but not sharing any data
 Not sharing any interesting data

On Gnutella
 15%
of users contribute 94% of content
 63% of users never responded to a query

Didn’t have “interesting” data
38
Gnutella: summary

• Hit rates are high
• High fault tolerance
• Adapts well and dynamically to changing peer populations
• High network traffic
• No estimates on the duration of queries
• No probability guarantee for successful queries
• Topology is unknown => the algorithm cannot exploit it
• Free riding is a problem
  • A significant portion of Gnutella peers are free riders
  • Free riders are distributed evenly across domains
  • Often hosts share files nobody is interested in
39
Gnutella discussion

• Search types:
  • Any possible string comparison
• Scalability
  • Search: very poor with respect to the number of messages
  • Updates: excellent, nothing to do
  • Routing information: low cost
• Autonomy:
  • Storage: no restriction, peers store the keys of their files
  • Routing: peers are the target of all kinds of requests
• Robustness
  • High, since many paths are explored
• Global knowledge
  • None required
40
Exploiting heterogeneity: KaZaA

• Each peer is either a group leader or assigned to a group leader.
  • TCP connection between a peer and its group leader.
  • TCP connections between some pairs of group leaders.
• The group leader tracks the content of all its children.

[Figure: overlay with ordinary peers, group-leader peers, and the neighboring relationships between them]
41
iMesh, Kazaa


Hybrid of centralized Napster and
decentralized Gnutella
Super-peers act as local search
hubs

Each super-peer is similar to a
Napster server for a small portion of
the network
 Super-peers are automatically
chosen by the system based on
their capacities (storage,
bandwidth, etc.) and availability
(connection time)



Users upload their list of files to a
super-peer
Super-peers periodically exchange
file lists
Queries are sent to a super-peer for
files of interest
42
Overlay Multicasting
• IP multicast has not been deployed over the Internet due to some fundamental problems in congestion control, flow control, security, group management, etc.
• For new emerging applications such as multimedia streaming, an Internet multicast service is required.
• Solution: Overlay Multicasting
  • Overlay multicasting (or application-layer multicasting) is increasingly being used to overcome the problem of non-ubiquitous deployment of IP multicast across heterogeneous networks.
43
Overlay Multicasting

Main idea

Internet peers organize themselves into an
overlay tree on top of the Internet.
 Packet replication and forwarding are
performed by peers in the application layer
by using IP unicast service.
44
Overlay Multicasting

• Overlay multicasting benefits
  • Easy deployment
    • It is self-organized
    • It is based on the IP unicast service
    • No protocol support is required from Internet routers.
  • Scalability
    • It scales with the number of multicast groups and the number of members in each group.
  • Efficient resource usage
    • Uplink resources of the Internet peers are used for multicast data distribution.
    • It is not necessary to use dedicated infrastructure and bandwidth for massive data distribution in the Internet.
45
Overlay Multicasting

Overlay multicast approaches
 DHT
based
 Tree based
 Mesh-tree based
46
Overlay Multicasting

• DHT based
  • The overlay tree is constructed on top of a DHT-based P2P routing infrastructure such as Pastry, CAN, Chord, etc.
  • Example: Scribe, in which the overlay tree is constructed on a Pastry network by using a multicast routing algorithm
47
Structured Overlay Networks / DHTs
Chord, Pastry, Tapestry, CAN, Kademlia, P-Grid, Viceroy

[Figure: nodes and values are hashed into a common identifier space (node identifiers and value identifiers), and the nodes are connected "smartly" based on these keys]
48
The Principle Of Distributed Hash Tables

• A dynamic distribution of a hash table onto a set of cooperating nodes

  Key | Value
  ----+-------------
   1  | Algorithms
   9  | Routing
  11  | DS
  12  | Peer-to-Peer
  21  | Networks
  22  | Grids

• Basic service: lookup operation
• Key resolution from any node (a toy example follows this slide)
  [Figure: four nodes A, B, C, D sharing the table; a lookup(9) is routed to node D]
• Each node has a routing table
  • Pointers to some other nodes
  • Typically, a constant or a logarithmic number of pointers
49
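A toy illustration (not from the slides) of the DHT principle above: the example key/value table is spread over four cooperating nodes, and any participant can resolve a lookup. The modulo-based assignment rule is an illustrative assumption, not the scheme used by real DHTs such as Chord or Pastry.

    # Toy illustration of the DHT principle: a hash table distributed over a set
    # of cooperating nodes. The assignment rule (key modulo number of nodes) is
    # an assumption for illustration only.

    nodes = ["node A", "node B", "node C", "node D"]
    table = {1: "Algorithms", 9: "Routing", 11: "DS",
             12: "Peer-to-Peer", 21: "Networks", 22: "Grids"}

    def responsible_node(key):
        # every participant can compute which node stores a given key
        return nodes[key % len(nodes)]

    def lookup(key):
        # basic service: resolve a key from any node to its (node, value) pair
        return responsible_node(key), table.get(key)

    # usage sketch:
    # print(lookup(9))   # -> ('node B', 'Routing')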
DHT Desirable Properties
Keys mapped evenly to all nodes in the network
Each node maintains information about only a
few other nodes
Messages can be routed to a node efficiently
Node arrival/departures only affect a few nodes
50
Chord [MIT]
• Problem addressed: efficient node localization
• Distributed lookup protocol
• Simplicity, provable performance, proven correctness
• Supports just one operation: given a key, Chord maps the key onto a node

51
The Chord algorithm – Construction of the Chord ring

• The consistent hash function assigns each node and each key an m-bit identifier using SHA-1 (Secure Hash Standard).
• m = any number big enough to make collisions improbable
• Key identifier = SHA-1(key)
• Node identifier = SHA-1(IP address)
  • Both are uniformly distributed
  • Both exist in the same ID space
52
Chord

• Consistent hashing (SHA-1) assigns each node and object an m-bit ID
• IDs are ordered in an ID circle ranging from 0 to 2^m − 1.
• New nodes assume slots in the ID circle according to their ID
• Key k is assigned to the first node whose ID ≥ k (wrapping around the circle)
  • successor(k) (sketched in code below)
53
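A minimal sketch of the identifier assignment and the successor(k) rule, using a small m for readability; truncating SHA-1 to m bits and scanning a sorted list of node ids are illustrative simplifications.

    # Sketch of Chord-style consistent hashing: SHA-1 identifiers truncated to
    # m bits, keys assigned to successor(k) = first node id >= k (with wrap-around).

    import hashlib

    M = 8  # small identifier space (2**M ids) just for illustration

    def chord_id(text):
        # m-bit identifier derived from SHA-1, as in Chord
        return int(hashlib.sha1(text.encode()).hexdigest(), 16) % (2 ** M)

    def successor(key_id, node_ids):
        # first node id >= key_id on the ring, wrapping to the smallest id
        ring = sorted(node_ids)
        for n in ring:
            if n >= key_id:
                return n
        return ring[0]

    # usage sketch:
    # nodes = [chord_id(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
    # k = chord_id("my_file.mp3")
    # print(k, "->", successor(k, nodes))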
Consistent Hashing - Successor Nodes

[Figure: an identifier circle with m = 3 (identifiers 0..7) and nodes 0, 1 and 3; key 1 is stored at successor(1) = 1, key 2 at successor(2) = 3, and key 6 at successor(6) = 0]
54
Consistent Hashing – Join and
Departure
When a node n joins the network, certain
keys previously assigned to n’s successor
now become assigned to n.
 When node n leaves the network, all of its
assigned keys are reassigned to n’s
successor.

55
Consistent Hashing – Node Join

[Figure: a node joins the example identifier circle; keys previously assigned to its successor are now assigned to the new node]
56
Consistent Hashing – Node Departure

[Figure: a node leaves the example identifier circle; all of its keys are reassigned to its successor]
57
Simple node localization

// ask node n to find the successor of id
n.find_successor(id)
  if (id ∈ (n, successor])
    return successor;
  else
    // forward the query around the circle
    return successor.find_successor(id);

=> Number of messages is linear in the number of nodes!
58
Scalable Key Location – Finger Tables

• To accelerate lookups, Chord maintains additional routing information.
• This additional information is not essential for correctness, which is achieved as long as each node knows its correct successor.
• Each node n maintains a routing table with up to m entries (m is the number of bits in the identifiers), called the finger table.
• The ith entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle.
  • s = successor(n + 2^(i-1))
  • s is called the ith finger of node n, denoted by n.finger(i) (see the sketch below)
59
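A small sketch (illustrative only) of how a finger table could be computed from a known list of node ids; a real Chord node learns these entries through the protocol rather than from global knowledge.

    # Sketch: build node n's finger table, where finger[i] = successor(n + 2**(i-1))
    # for i = 1..m, computed over a known list of node ids.

    M = 8  # identifier bits, matching the earlier sketch

    def successor(key_id, node_ids):
        ring = sorted(node_ids)
        for node in ring:
            if node >= key_id:
                return node
        return ring[0]

    def finger_table(n, node_ids):
        return [successor((n + 2 ** (i - 1)) % (2 ** M), node_ids)
                for i in range(1, M + 1)]

    # usage sketch:
    # print(finger_table(32, [8, 32, 90, 200]))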
Scalable Key Location – Finger Tables

[Figure: the 3-bit example ring with nodes 0, 1 and 3 and their finger tables; node 0 has finger starts 1, 2, 4 with successors 1, 3, 0 (and stores key 6), node 1 has starts 2, 3, 5 with successors 3, 3, 0 (key 1), and node 3 has starts 4, 5, 7 with successors 0, 0, 0 (key 2)]
60
Finger Tables

[Figure: the same example with finger intervals; node 0: starts 1, 2, 4, intervals [1,2), [2,4), [4,0), successors 1, 3, 0; node 1: starts 2, 3, 5, intervals [2,3), [3,5), [5,1), successors 3, 3, 0; node 3: starts 4, 5, 7, intervals [4,5), [5,7), [7,3), successors 0, 0, 0]
61
Chord key location


Lookup in finger
table the furthest
node that
precedes key
-> O(log n) hops
62
Scalable node localization

Finger table: finger[i] = successor(n + 2^(i-1))

[Figures: a sequence of slides stepping through the finger-table entries on the example ring]
63–72
Scalable node localization
Important characteristics of this scheme:
• Each node stores information about only a small number of nodes (m)
• Each node knows more about nodes closely following it than about nodes farther away
• A finger table generally does not contain enough information to directly determine the successor of an arbitrary key k
73
Scalable node localization

• Search the finger table for the node which most immediately precedes id
• Invoke find_successor from that node
• => Number of messages: O(log N)! (see the sketch below)
74–75
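A minimal sketch of the scalable lookup, reusing the finger-table layout from the previous snippet; this is a single-process simulation for illustration, whereas real Chord performs each step as a remote call to another peer.

    # Sketch of Chord's scalable key lookup using finger tables.

    def between(x, a, b, right_closed=False):
        # circular interval test on the identifier ring
        if a < b:
            return a < x <= b if right_closed else a < x < b
        return (x > a or x <= b) if right_closed else (x > a or x < b)

    def find_successor(n, key_id, fingers, succ):
        # fingers[n]: list where fingers[n][i-1] = successor(n + 2**(i-1))
        # succ[n]: n's immediate successor on the ring
        while not between(key_id, n, succ[n], right_closed=True):
            # jump to the closest preceding finger of key_id
            nxt = n
            for f in reversed(fingers[n]):
                if between(f, n, key_id):
                    nxt = f
                    break
            n = nxt
        return succ[n]

    # usage sketch, reusing finger_table() from the previous snippet:
    # nodes = [8, 32, 90, 200]
    # fingers = {n: finger_table(n, nodes) for n in nodes}
    # succ = {n: finger_table(n, nodes)[0] for n in nodes}
    # print(find_successor(8, 121, fingers, succ))   # -> 200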
Scalable Lookup Scheme

• Each node forwards the query at least halfway along the distance remaining to the target
• Theorem: With high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N)
76
Node Joins and Stabilizations
The most important thing is the successor
pointer.
 If the successor pointer is ensured to be
up to date, which is sufficient to guarantee
correctness of lookups, then finger table
can always be verified.
 Each node runs a “stabilization” protocol
periodically in the background to update
successor pointer and finger table.

77
Node Joins and Stabilizations

“Stabilization” protocol contains 6 functions:
 create()
 join()
 stabilize()
 notify()
 fix_fingers()
 check_predecessor()

When node n first starts, it calls n.join(n’), where
n’ is any known Chord node.
The join() function asks n’ to find the immediate
successor of n.

78
Node joins and stabilization
To ensure correct lookups, all successor
pointers must be up to date
 => stabilization protocol running
periodically in the background
 Updates finger tables and successor
pointers

79
Node joins and stabilization
Stabilization protocol (sketched in code below):
• stabilize(): n asks its successor for its predecessor p and decides whether p should be n's successor instead (this is the case if p recently joined the system).
• notify(): notifies n's successor of n's existence, so it can change its predecessor to n
• fix_fingers(): updates the finger tables
80
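A minimal sketch of join()/stabilize()/notify() on in-memory Node objects, as an illustration of the protocol described above; fix_fingers(), check_predecessor(), failure handling and remote calls are omitted.

    def between(x, a, b, right_closed=False):
        # circular interval test on the identifier ring
        if a < b:
            return a < x <= b if right_closed else a < x < b
        return (x > a or x <= b) if right_closed else (x > a or x < b)

    class Node:
        def __init__(self, node_id):
            self.id = node_id
            self.successor = self          # a lone node is its own successor
            self.predecessor = None

        def find_successor(self, key_id):
            # simple O(N) walk along successor pointers (enough for join)
            node = self
            while not between(key_id, node.id, node.successor.id, right_closed=True):
                node = node.successor
            return node.successor

        def join(self, known):
            self.predecessor = None
            self.successor = known.find_successor(self.id)

        def stabilize(self):
            # adopt our successor's predecessor if it lies between us
            p = self.successor.predecessor
            if p is not None and between(p.id, self.id, self.successor.id):
                self.successor = p
            self.successor.notify(self)

        def notify(self, n):
            # n claims to be our predecessor
            if self.predecessor is None or between(n.id, self.predecessor.id, self.id):
                self.predecessor = n

    # usage sketch: create nodes 21, 26, 32; call n26.join(n32); running
    # stabilize() on n26 and n21 a few times makes the ring pointers converge,
    # as shown on the following slides.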
Node Joins – Join and Stabilization

• n joins (between np and ns, where initially succ(np) = ns and pred(ns) = np)
  • predecessor = nil
  • n acquires ns as its successor via some n'
• n runs stabilize
  • n notifies ns that it is the new predecessor
  • ns acquires n as its predecessor: pred(ns) = n
• np runs stabilize
  • np asks ns for its predecessor (now n)
  • np acquires n as its successor: succ(np) = n
  • np notifies n
  • n acquires np as its predecessor
• All predecessor and successor pointers are now correct
• Fingers still need to be fixed, but old fingers will still work
81
Node joins and stabilization
82
Node joins and stabilization
• N26 joins the system
• N26 acquires N32 as its successor
• N26 notifies N32
• N32 acquires N26 as its predecessor
83
Node joins and stabilization
• N26 copies keys
• N21 runs stabilize() and asks its successor N32 for its predecessor, which is N26.
84
Node joins and stabilization
• N21 acquires N26 as its successor
• N21 notifies N26 of its existence
• N26 acquires N21 as its predecessor
85
Node Joins – with Finger Tables

[Figure: the 3-bit example after node 6 joins; node 6 builds its finger table (starts 7, 0, 2 with successors 0, 0, 3) and takes over key 6 from node 0, and finger-table entries of the other nodes that previously pointed at 0 are updated to point at 6 where appropriate]
86
Node Departures – with Finger Tables

[Figure: the same example after node 1 departs; its key moves to its successor, and finger-table entries that pointed at node 1 are replaced by node 3]
87
Node Failures

• The key step in failure recovery is maintaining correct successor pointers
• To help achieve this, each node maintains a successor list of its r nearest successors on the ring
• If node n notices that its successor has failed, it replaces it with the first live entry in the list
• Successor lists are stabilized as follows (see the sketch below):
  • Node n reconciles its list with its successor s by copying s's successor list, removing its last entry, and prepending s to it.
  • If node n notices that its successor has failed, it replaces it with the first live entry in its successor list and reconciles its successor list with its new successor.

88
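A small sketch of the successor-list maintenance just described, assuming each node is represented as a dict with an id, an alive flag and a successor_list of length r; the names are illustrative.

    # Sketch of successor-list maintenance (r nearest successors per node).
    # Each node is a dict: {"id": ..., "alive": True, "successor_list": [...]}

    R = 3  # length of the successor list

    def reconcile(n, s):
        # copy s's list, drop its last entry, and prepend s itself
        n["successor_list"] = ([s["id"]] + s["successor_list"][:-1])[:R]

    def repair_successor(n, nodes_by_id):
        # replace a failed successor with the first live entry in the list,
        # then reconcile with that new successor
        for sid in n["successor_list"]:
            s = nodes_by_id[sid]
            if s["alive"]:
                reconcile(n, s)
                return s
        return None  # all r successors failed (unlikely for a reasonable r)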
Handling failures: redundancy
Each node knows IP addresses of next r
nodes.
 Each key is replicated at next r nodes

89
Impact of node joins on lookups


All finger table entries
are correct =>
O(log N) lookups
Successor pointers
correct, but fingers
inaccurate =>
correct but slower
lookups
90
Impact of node joins on lookups
• Stabilization completed => no influence on performance
• Only in the negligible case that a large number of nodes joins between the target's predecessor and the target is the lookup slightly slower
• No influence on performance as long as fingers are adjusted faster than the network doubles in size

91
Failure of nodes

• Correctness relies on correct successor pointers
• What happens if N14, N21 and N32 fail simultaneously?
• How can N8 acquire N38 as its successor?
92–93
Failure of nodes
• Each node maintains a successor list of size r
• If the network is initially stable and every node fails with probability ½, then find_successor still finds the closest living successor to the query key, and the expected time to execute find_successor is O(log N)

94
Failure of nodes

[Figure: failed lookups (percent) versus failed nodes (percent); massive failures have little impact – even with 50% of the nodes failed, the failed-lookup rate is only about (1/2)^6 ≈ 1.6%]
95
Chord – simulation result
[Stoica et al. Sigcomm2001]
96
Chord discussion

• Search types
  • Only equality; exact keys need to be known
• Scalability
  • Search: O(log n)
  • Update requires a search, thus O(log n)
  • Construction: O(log² n) if a new node joins
• Robustness
  • Replication might be used by storing replicas at successor nodes
• Autonomy
  • Storage and routing: none
• Global knowledge
  • Mapping of IP addresses and data keys to a common key space
97
YAPPERS: a P2P lookup service over arbitrary topology

• Gnutella-style systems
  • Work on arbitrary topology, flood for queries
  • Robust but inefficient
  • Support for partial queries, good for popular resources
• DHT-based systems
  • Efficient lookup but expensive maintenance
  • By nature, no support for partial queries
• Solution: hybrid system
  • Operate on arbitrary topology
  • Provide DHT-like search efficiency
98
Design Goals

Impose no constraints on topology
 No

underlying structure for the overlay network
Optimize for partial lookups for popular keys
 Observation:
Many users are satisfied with partial
lookup

Contact only nodes that can contribute to the
search results
 no

blind flooding
Minimize the effect of topology changes
 Maintenance
overhead is independent of system size
99
Basic Idea:
• The keyspace is partitioned into a small number of buckets. Each bucket corresponds to a color.
• Each node is assigned a color.
  • # of buckets = # of colors
• Each node sends its <key, value> pairs to the node with the same color as the key within its Immediate Neighborhood (see the sketch below).
  • IN(N): all nodes within h hops of node N.
100
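A small sketch of the coloring idea, under the assumption that colors are obtained by hashing a node's IP (or a key) into one of C buckets and that IN(N) is computed by a depth-h BFS over an adjacency-list overlay; all names are illustrative.

    # Illustrative sketch of YAPPERS-style coloring: hash nodes (by IP) and keys
    # into C color buckets, and register a <key, value> pair at a node of the
    # key's color inside the publisher's immediate neighborhood IN(N).

    import hashlib
    from collections import deque

    C = 4  # number of buckets/colors (illustrative)

    def color(name):
        return int(hashlib.sha1(name.encode()).hexdigest(), 16) % C

    def immediate_neighborhood(graph, n, h):
        # IN(n): all nodes within h hops of n (BFS)
        seen, frontier = {n}, deque([(n, 0)])
        while frontier:
            node, dist = frontier.popleft()
            if dist == h:
                continue
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    frontier.append((nb, dist + 1))
        return seen

    def register_target(graph, publisher, key, h=2):
        # pick a node in IN(publisher) whose color matches the key's color
        candidates = [v for v in immediate_neighborhood(graph, publisher, h)
                      if color(v) == color(key)]
        return candidates[0] if candidates else None  # None -> use a backup color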
Partition Nodes
Given any overlay, first partition nodes into buckets (colors) based on a hash of their IP address
101
Partition Nodes (2)
Around each node, there is (ideally) at least one node of each color
[Figure: two example nodes X and Y and the colors present around them]
May require backup color assignments
102
Register Content
Partition the content space into buckets (colors) and register a pointer at "nearby" nodes.
[Figure: the nodes around a node Z form a small hash table; Z registers red content locally and yellow content at a nearby yellow node]
103
Searching Content
Start at a "nearby" node of the right color, then search other nodes of the same color.
[Figure: a query hops between same-colored nodes X, W, Y, U, V, Z]
104
Searching Content (2)
There is effectively a smaller overlay for each color; use a Gnutella-style flood on it.
Fan-out = degree of the nodes in the smaller overlay
105
More…

• When node X is inserting <key, value>:
  • What if multiple nodes in IN(X) have the same color?
  • What if no node in IN(X) has the same color as key k?
• Solution:
  • P1: randomly select one
  • P2: backup scheme: use the node with the next color
    • Primary color (unique) & secondary colors (zero or more)
• Problems coming with this solution:
  • No longer consistent and stable
  • But the effect is isolated within the immediate neighborhood
106
Extended Neighborhood


IN(A): Immediate Neighborhood
F(A): Frontier of Node A


All nodes that are directly connected to IN(A), but not in
IN(A)
EN(A): Extended Neighborhood
 The
union of IN(v) where v is in F(A)
 Actually EN(A) includes all nodes within 2h + 1 hops

Each node needs to maintain these three set of
nodes for query.
107
The network state information for node A (h = 2)
108
Searching with Extended
Neighborhood

Node A wants to look up a key k of color C(k), it
picks a node B with C(k) in IN(A)
 If
multiple nodes, randomly pick one
 If none, pick the backup node



B, using its EN(B), sends the request to all
nodes which are in color C(k).
The other nodes do the same thing as B.
Duplicate Message problem:
 Each
node caches the unique query identifier.
109
More on Extended
Neighborhood
All <key, value> pairs are stored among
IN(X). (h hops from node X)
 Why each node needs to keep an EN(X)?
 Advantage:

 The
forwarding node is chosen based on
local knowledge
 Completeness: a query (C(k)) message can
reach all nodes in C(k) without touching any
nodes in other colors (Not including backup
node)
110
Maintaining Topology

Edge Deletion: X-Y
 Deletion
message needs to be propagated to all
nodes that have X and Y in their EN set
 Necessary Adjustment:



Change IN, F, EN sets
Move <key, value> pairs if X/Y is in IN(A)
Edge Insertion:
 Insertion
message needs to include the neighbor info
 So other nodes can update their IN and EN sets
111
Maintaining Topology

Node Departure:
a
node X with w edges is leaving
 Just like w edge deletion
 Neighbors of X initiates the propagation

Node Arrival: X joins the network
 Ask
its new neighbors for their current
topology view
 Build its own extended neighborhood
 Insert w edges.
112
Problems with the basic design

• Fringe nodes:
  • A low-connectivity node allocates a large number of secondary colors to its high-connectivity neighbors.
• Large fan-out:
  • The forwarding fan-out degree at A is proportional to the size of F(A)
  • This is desirable for partial lookup, but not good for full lookup
113
A is overloaded by secondary
colors from B, C, D, E
114
Solutions:

Prune Fringe Nodes:
 If

the degree of a node is too small, find a proxy node.
Biased Backup Node Assignment:
X
assigns a secondary color to y only when
a * |IN(x)| > |IN(y)|

Reducing Forward Fan-out:
 Basic


idea:
try backup node,
try common nodes
115
Experiment:

• h = 2 (1 is too small, > 2 makes EN too large)
• Topology: Gnutella snapshot
• Exp 1: search efficiency
116
[Figure: distribution of colors per node]
117
[Figure: fan-out]
118
[Figure: number of colors – effect on search]
119
[Figure: number of colors – effect on fan-out]
120
Discussion



Each search only disturbs a small fraction of the
nodes in the overlay.
No restructure the overlay
Each node has only local knowledge
 scalable
 Hybrid
(unstructured and local DHT) system
121
PASTRY
122
Pastry

• Identifier space:
  • Nodes and data items are uniquely associated with m-bit ids – integers in the range 0 to 2^m − 1 – m is typically 128
  • Pastry views ids as strings of digits to the base 2^b, where b is typically chosen to be 4
  • A key is located on the node to whose node id it is numerically closest
123
Routing Goal

• Pastry routes messages to the node whose nodeId is numerically closest to the given key in less than log_{2^b}(N) steps:
  • "A heuristic ensures that among the set of nodes with the k closest nodeIds to the key, the message is likely to first reach a node near the node from which the message originates, in terms of the proximity metric"
124
Routing Information

• Pastry's node state is divided into 3 main elements
  • The routing table – similar to Chord's finger table – stores links into the id-space
  • The leaf set contains nodes which are close in the id-space
  • Nodes that are close together in terms of network locality are listed in the neighbourhood set
125
Routing Table

• A Pastry node's routing table is made up of m/b (about log_{2^b} N) rows with 2^b − 1 entries per row
• On node n, the entries in row i hold the identities of Pastry nodes whose node-ids share an i-digit prefix with n but differ in the next digit (an indexing sketch follows below)
• For example, the first row is populated with nodes that have no prefix in common with n
• When there is no node with an appropriate prefix, the corresponding entry is left empty
• The single-digit entry in each row shows the corresponding digit of the present node's id – i.e. the prefix matches the current id up to that position – the next row down or the leaf set should be examined to find a route.
126
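A small sketch of how routing-table rows and columns could be indexed by shared prefix, assuming ids are strings of base-2^b digits (here b = 2, so digits 0–3); the helpers below are illustrative, not Pastry's actual data structures.

    # Illustrative sketch of Pastry-style routing-table indexing with b = 2.
    # An entry for node x goes in row = length of the shared prefix with n,
    # column = x's next digit after that prefix.

    B = 2                      # bits per digit
    DIGITS = 8                 # id length in digits (m = B * DIGITS bits)

    def shared_prefix_len(a, b):
        # number of leading digits the two id strings have in common
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def build_routing_table(n_id, other_ids):
        # rows indexed by shared-prefix length, columns by the next digit
        table = [[None] * (2 ** B) for _ in range(DIGITS)]
        for x in other_ids:
            if x == n_id:
                continue
            row = shared_prefix_len(n_id, x)
            col = int(x[row], 2 ** B)      # next digit of x, in base 2**B
            if table[row][col] is None:    # keep the first candidate (real Pastry
                table[row][col] = x        # prefers the closest by proximity)
        return table

    # usage sketch (ids as base-4 digit strings):
    # rt = build_routing_table("10233102", ["10231331", "31203203", "10230210"])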
Routing Table

Routing tables (RT) thus built achieve an effect similar to
Chord finger table

The detail of the routing information increases with the proximity of
other nodes in the id-space

Without a large no. of nearby nodes, the last rows of the RT are
only sparsely populated – intuitively, the id-space would need to be
fully exhausted with node-ids for complete RTs on all nodes

In populating the RT, there is a choice from the set of nodes with
the appropriate id-prefix

During the routing process, network locality can be exploited by
selecting nodes which are close in terms of proximity ntk. metric
127
Leaf Set

• The routing table sorts node ids by prefix. To increase lookup efficiency, the leaf set L holds the |L| nodes numerically closest to n (|L|/2 smaller and |L|/2 larger; normally |L| = 2^b or 2 × 2^b)
• The RT and the leaf set are the two sources of information relevant for routing
• The leaf set also plays a role similar to Chord's successor list in recovering from failures of adjacent nodes
128
Neighbourhood Set

Instead of numeric closeness, the neighbourhood set M is
concerned with nodes that are close to the current node
with regard to the network proximity metric

Thus, it is not involved in routing itself but in maintaining network
locality in the routing information
129
Pastry Node State (base 4, node 10233102)

• L (leaf set): nodes that are numerically closest to the present node (2^b or 2 × 2^b entries)
• R (routing table): entries share a common prefix with 10233102, then differ in the next digit, with the rest of the nodeId arbitrary (log_{2^b}(N) rows, 2^b − 1 columns)
• M (neighbourhood set): nodes that are closest according to the proximity metric (2^b or 2 × 2^b entries)
130
Routing (notation)

• Key D arrives at nodeId A
• R_l^i: the entry in the routing table at column i and row l
• L_i: the i-th closest nodeId in the leaf set
• D_l: the value of the l-th digit in the key D
• shl(A,B): the length of the prefix shared by A and B, in digits
131
Routing

Routing is divided into two main steps:
 First,
a node checks whether the key K is within the
range of its leaf set

If it is the case, it implies that K is located in one of the
nearby nodes of the leaf set. Thus, the node forwards the
query to the leaf set node numerically closest to K. In case
this is the node itself, the routing process is finished.
132
Routing
• If K does not fall within the range of the leaf set, the query needs to be forwarded over a large distance using the routing table
• In this case, a node n tries to pass the query on to a node which shares a longer common prefix with K than n itself
  • If there is no such entry in the RT, the query is forwarded to a node which shares a prefix with K of the same length as n but which is numerically closer to K than n
133
Routing
• This scheme ensures that routing loops do not occur, because the query is routed strictly to a node with a longer common identifier prefix than the current node, or to a numerically closer node with the same prefix (see the sketch below)
134
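A condensed, illustrative sketch of that two-step decision (leaf set first, then the routing table, then the rare fallback), assuming ids are equal-length digit strings; the leaf-set coverage test and tie-breaking are simplified assumptions rather than the exact Pastry rules.

    # Illustrative Pastry-style next-hop decision: leaf set first, then a
    # routing-table entry with a longer shared prefix, else any known node
    # that is numerically closer. Ids are digit strings in base 2**B (B = 2).

    B = 2

    def shared_prefix_len(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def numeric(x):
        return int(x, 2 ** B)

    def route_next_hop(n_id, key, leaf_set, routing_table, all_known):
        if key == n_id:
            return n_id
        # 1) key covered by the leaf set -> forward to the numerically closest node
        if leaf_set:
            lo, hi = min(map(numeric, leaf_set)), max(map(numeric, leaf_set))
            if lo <= numeric(key) <= hi:
                return min(leaf_set + [n_id], key=lambda x: abs(numeric(x) - numeric(key)))
        # 2) routing-table entry sharing a longer prefix with the key
        l = shared_prefix_len(n_id, key)
        entry = routing_table[l][int(key[l], 2 ** B)]
        if entry is not None:
            return entry
        # 3) rare case: same prefix length but numerically closer to the key
        for x in all_known:
            if (shared_prefix_len(x, key) >= l and
                    abs(numeric(x) - numeric(key)) < abs(numeric(n_id) - numeric(key))):
                return x
        return n_id  # the current node is the destination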
Routing performance

Routing procedure converges, each step takes the
message to node that either:
 Shares
a longer prefix with the key than the local node
 Shares
as long a prefix with, but is numerically closer to
the key than the local node.
135
Routing performance

• Assumption: routing tables are accurate and there are no recent node failures
• There are 3 cases in the Pastry routing scheme:
  • Case 1: Forward the query (according to the RT) to a node with a longer prefix match than the current node.
    • Thus, the number of nodes with longer prefix matches is reduced by at least a factor of 2^b in each step, so the destination is reached in log_{2^b} N steps.
136
Routing performance

• Case 2: The query is routed via the leaf set (one step). This increases the number of hops by one.
137
Routing performance

• Case 3: The key is neither covered by the leaf set, nor does the RT contain an entry with a longer matching prefix than the current node
  • Consequently, the query is forwarded to a node with the same prefix length, adding an additional routing hop.
  • For a moderate leaf set size (|L| = 2 × 2^b), the probability of this case is less than 0.6%. So, it is very unlikely that more than one additional hop is incurred.
138
Routing performance

• As a result, the complexity of routing remains at O(log_{2^b} N) on average
  • Higher values of b lead to faster routing but also increase the amount of state that needs to be managed at each node
  • Thus, b is typically 4, but a Pastry implementation can choose an appropriate trade-off for a specific application
139
Join and Failure

Join



Use routing to find numerically closest node already in network
Ask state from all nodes on the route and initialize own state
Error correction

Failed leaf node: contact a leaf node on the side of the failed
node and add appropriate new neighbor

Failed table entry: contact a live entry with same prefix as failed
entry until new live entry found, if none found, keep trying with
longer prefix table entries
140
Self Organization: Node Arrival

• The new node n is assumed to know a nearby Pastry node k, based on the network proximity metric
• Now n needs to initialize its RT, leaf set and neighbourhood set.
  • Since k is assumed to be close to n, the nodes in k's neighbourhood set are reasonably good choices for n, too.
  • Thus, n copies the neighbourhood set from k.
141
Self Organization: Node Arrival

To build its RT and leaf set, n routes a special
join message via k to a key equal to n
 According
to the standard routing rules, the query is
forwarded to the node c with the numerically closest
id and hence the leaf set of c is suitable for n, so it
retrieves c’s leaf set for itself.
 The
join request triggers all nodes, which forwarded
the query towards c, to provide n with their routing
information.
142
Self Organization: Node Arrival

Node n’s RT is constructed from the routing
information of these nodes starting at row 0.
 As
this row is independent of the local node id, n can
use these entries at row zero of k’s routing table



In particular, it is assumed that n and k are close in terms of
network proximity metric
Since k stores nearby nodes in its RT, these entries are also
close to n.
In the general case of n and k not sharing a common prefix,
n cannot reuse entries from any other row in K’s RT.
143
Self Organization: Node Arrival
 The
route of the join message from n to c leads via
nodes v1, v2, … vn with increasingly longer common
prefixes of n and vi
 Thus,
row 1 from the RT of v1 is also a good choice
for the same row of the RT of n
 The
same is true for row 2 on node v2 and so on
 Based
on this information, the RT of n can be
constructed.
144
Self Organization: Node Arrival

Finally, the new node sends its node state to all
nodes in its routing data so that these nodes can
update their own routing information accordingly
 In
contrast to lazy updates in Chord, this mechanism
actively updates the state in all affected nodes when
a new node joins the system
 At this stage, the new node is fully present and
reachable in the Pastry network
145
Node Failure

• Node failure is detected when a communication attempt with another node fails. Routing requires contacting nodes from the RT and leaf set, resulting in lazy detection of failures
• During routing, the failure of a single node in the RT does not significantly delay the routing process. The local node can choose to forward the query to a different node from the same row in the RT. (Alternatively, a node could store backup nodes with each entry in the RT.)
146
Node Failure

Repairing a failed entry in the leaf set of a node is
straightforward – utilizing the leaf set of other nodes
referenced in the local leaf set.

Contacts the leaf set of the largest index on the side of
the failed node
If this node is unavailable, the local node can revert to
leaf set with smaller indices

147
Node Departure

Neighborhood node: asks other members
for their M, checks the distance of each of the
newly discovered nodes, and updates its own
neighborhood set accordingly.
148
Locality

“Route chosen for a message is likely to be
good with respect to the proximity metric”

Discussion:
 Locality
in the routing table
 Route locality
 Locating the nearest among k nodes
149
Locality in the routing table
• Node A is near X
  • A's R0 entries are close to A, A is close to X, and the triangle inequality holds => the entries taken from A are relatively near X as well.
  • Likewise, obtaining X's neighborhood set from A is appropriate.
• B's R1 entries are a reasonable choice for the R1 of X
  • Entries in each successive row are chosen from an exponentially decreasing set size.
  • The expected distance from B to any of its R1 entries is much larger than the expected distance traveled from node A to B.
• Second stage: X requests the state from each of the nodes in its routing table and neighborhood set to update its entries to closer nodes.
150
Routing locality


Each routing step moves the message closer to the
destination in the nodeId space, while traveling the least
possible distance in the proximity space.
Given that:



A message routed from A to B at distance d cannot
subsequently be routed to a node with a distance of less than d
from A
The expected distance traveled by a message during each
successive routing step is exponentially increasing
 Since a message tends to make larger and larger
strides with no possibility of returning to a node within di
of any node i encountered on the route, the message
has nowhere to go but towards its destination
151
Node Failure

• To replace the failed node at entry i in row j of its RT (R_j^i), a node contacts another node referenced in row j
• Entries in the same row j of the remote node are valid for the local node, and hence it can copy entry R_j^i from the remote node to its own RT
• In case that entry has failed as well, it can probe another node in row j for entry R_j^i
• If no live node with the appropriate nodeId prefix can be obtained in this way, the local node queries nodes from the preceding row R_{j-1}
152
Locating the nearest among k nodes

Goal:
 among
the k numerically closest nodes to a key, a
message tends to first reach a node near the client.

Problem:
 Since
Pastry routes primarily based on nodeId
prefixes, it may miss nearby nodes with a different
prefix than the key.

Solution (using a heuristic):
 Based
on estimating the density of nodeIds, it
detects when a message approaches the set of k
and then switches to numerically nearest address
based routing to locate the nearest replica.
153
Arbitrary node failures and network partitions

• A node may continue to be responsive, but behave incorrectly or even maliciously.
• Repeated queries then fail each time, since they normally take the same route.
• Solution: routing can be randomized
  • The choice among multiple nodes that satisfy the routing criteria should be made randomly
154
Content-Addressable Network
(CAN)
Proc. ACM SIGCOMM (San
Diego, CA, August 2001)
Motivation

Primary scalability issue in peer-to-peer
systems is the indexing scheme used to
locate the peer containing the desired
content
 Content-Addressable
Network (CAN) is a
scalable indexing mechanism
 Also a central issue in large scale storage
management systems
156
Basic Design

• Basic idea:
  • A virtual d-dimensional coordinate space
  • Each node owns a zone in the virtual space
  • Data is stored as (key, value) pairs
  • Hash(key) --> a point P in the virtual space
  • The (key, value) pair is stored on the node within whose zone the point P lies (see the sketch below)
157
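A minimal 2-d sketch of this idea, assuming zones are axis-aligned rectangles stored as (x0, x1, y0, y1) tuples and that hx/hy are two independent hashes of the key; the names and the global zone map are illustrative, since a real CAN node only knows its own zone and its neighbors'.

    # Illustrative 2-d CAN sketch: hash a key to a point (a, b) in the unit square
    # and store the (key, value) pair at the node whose zone contains that point.

    import hashlib

    def _h(key, salt):
        # map key -> [0, 1) using a salted SHA-1 (hx and hy below are assumptions)
        digest = hashlib.sha1((salt + key).encode()).hexdigest()
        return int(digest, 16) / 16 ** len(digest)

    def hx(key): return _h(key, "x")
    def hy(key): return _h(key, "y")

    def owner(point, zones):
        # zones: {node_name: (x0, x1, y0, y1)} covering the unit square
        x, y = point
        for node, (x0, x1, y0, y1) in zones.items():
            if x0 <= x < x1 and y0 <= y < y1:
                return node
        raise ValueError("zones do not cover the point")

    def insert(key, value, zones, storage):
        node = owner((hx(key), hy(key)), zones)
        storage.setdefault(node, {})[key] = value   # in real CAN: route to that node
        return node

    # usage sketch: four equal zones owned by nodes 1-4
    # zones = {1: (0, .5, 0, .5), 2: (.5, 1, 0, .5), 3: (0, .5, .5, 1), 4: (.5, 1, .5, 1)}
    # storage = {}
    # print(insert("K", "V", zones, storage))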
An Example of CAN

[Figures: the 2-d coordinate space is progressively split as nodes 1, 2, 3 and 4 join, each owning one zone]
158–162

An Example of CAN (cont): insert

node I::insert(K,V)
(1) a = hx(K)
    b = hy(K)
(2) route (K,V) towards the point (a,b)
(3) the node owning (a,b) stores (K,V)
163–168

An Example of CAN (cont): retrieve

node J::retrieve(K)
(1) a = hx(K)
    b = hy(K)
(2) route "retrieve(K)" to (a,b)
169
Important note:
Data stored in CAN is addressable by name (i.e. key), not by location (i.e. IP address).
170
Routing in CAN

[Figures: a message is routed greedily through the grid of zones from the node at (x,y) towards the zone containing the target point (a,b), each hop moving to the neighbor closest to the target]
171–172
Routing in CAN (cont)
Important note:
A node only maintains state for its immediate neighboring nodes.
173
Node Insertion In CAN

1) The new node discovers some node "I" already in CAN
175
2) The new node picks a random point (p,q) in the space
176
3) I routes to (p,q) and discovers node J, the current owner of that point
177
4) J's zone is split in half; the new node owns one half
178
Node Insertion In CAN (cont)
Important note:
Inserting a new node affects only a single other node and its immediate neighbors
179
Review about CAN (part2)




Requests (insert, lookup, or delete) for a key are
routed by intermediate nodes using a greedy
routing algorithm
Requires no centralized control (completely
distributed)
Small per-node state is independent of the
number of nodes in the system (scalable)
Nodes can route around failures (fault-tolerant)
180
CAN: node failures

Need to repair the space
 recover database (weak point)
 soft-state updates
 use replication, rebuild database from replicas
 repair routing
 takeover algorithm
181
CAN: takeover algorithm

Simple failures
know your neighbor’s neighbors
 when a node fails, one of its neighbors takes over its
zone


More complex failure modes

simultaneous failure of multiple adjacent nodes
 scoped flooding to discover neighbors
 hopefully, a rare event
182
CAN: node failures
Important note:
Only the failed node’s immediate neighbors
are required for recovery
183
CAN Improvements
184
Adding Dimensions
185
Multiple independent coordinate
spaces (realities)



Nodes can maintain multiple independent coordinate spaces
(realities)
For a CAN with r realities:
a single node is assigned r zones
and holds r independent
neighbor sets
 Contents of the hash table
are replicated for each reality
Example: for three realities, a
(K,V) mapping to P:(x,y,z) may
be stored at three different nodes
 (K,V) is only unavailable when
all three copies are unavailable
 Route using the neighbor on the reality closest to (x,y,z)
186
Dimensions vs. Realities




Increasing the number of dimensions
and/or realities decreases path
length and increases per-node state
More dimensions has greater effect
on path length
More realities provides
stronger fault-tolerance and
increased data availability
Authors do not quantify the different
storage requirements
 More realities requires replicating
(K,V) pairs
187
RTT Ratio & Zone Overloading


Incorporate RTT in routing metric
 Each node measures RTT to each neighbor
 Forward messages to neighbor with maximum ratio of progress
to RTT
Overload coordinate zones
 - Allow multiple nodes to share the same zone, bounded by a
threshold MAXPEERS
 Nodes maintain peer state, but not additional neighbor state
 Periodically poll neighbor for its list of peers, measure RTT to
each peer, retain lowest RTT node as neighbor
 (K,V) pairs may be divided among peer nodes or replicated
188
Multiple Hash Functions




Improve data availability by using k hash functions to
map a single key to k points in the coordinate space
Replicate (K,V) and store
at k distinct nodes
(K,V) is only unavailable
when all k replicas are
simultaneously
unavailable
Authors suggest querying
all k nodes in parallel to
reduce average lookup latency
189
Topology sensitive






Use landmarks for topologically-sensitive construction
Assume the existence of well-known machines like DNS servers
Each node measures its RTT
to each landmark
 Order each landmark in order of
increasing RTT
 For m landmarks:
m! possible orderings
Partition coordinate space
into m! equal size partitions
Nodes join CAN at random
point in the partition corresponding
to its landmark ordering
Latency Stretch is the ratio of CAN
latency to IP network latency
190
Other optimizations



Run a background load-balancing technique to offload
from densely populated bins to sparsely populated bins
(partitions of the space)
Volume balancing for more uniform partitioning
 When a JOIN is received, examine zone volume and
neighbor zone volumes
 Split zone with largest volume
 Results in 90% of nodes of equal volume
Caching and replication for “hot spot” management
191
Strengths
More resilient than flooding broadcast
networks
 Efficient at locating information
 Fault tolerant routing
 Node & Data High Availability (w/
improvement)
 Manageable routing table size & network
traffic

192
Weaknesses
Impossible to perform a fuzzy search
 Susceptible to malicious activity
 Maintain coherence of all the indexed data
(Network overhead, Efficient distribution)
 Still relatively higher routing latency
 Poor performance w/o improvement

193
Summary

• CAN
  • An Internet-scale hash table
  • A potential building block in Internet applications
• Scalability
  • O(d) per-node state
• Low-latency routing
  • Simple heuristics help a lot
• Robust
  • Decentralized, can route around trouble
194
Some Main Research Areas in P2P
Efficiency of search, queries and topologies
( Chord, CAN, YAPPER…)
 Data delivery (ZIGZAG..)
 Resource Management
 Security

195
Resource Management
Problem:
 Autonomous nature of peers: essentially selfish
peers must be given an incentive to contribute
resources.
 The scale of the system: makes it hard to get a
complete picture of what resources are available
An approach:
Use concepts from economics to construct a
resource marketplace, where peers can buy and sell
or trade resources as necessary
196
Security Problem
Problem:
- Malicious attacks: nodes in a P2P system
operate in an autonomous fashion, and any
node that speaks the system protocol may
participate in the system
An approach:
Mitigating attacks by nodes that abuse the P2P
network by exploiting the implicit trust peers
place on them.
197
Reference

• Kien A. Hua, Duc A. Tran, and Tai Do, "ZIGZAG: An Efficient Peer-to-Peer Scheme for Media Streaming", INFOCOM 2003.
• S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, "A Scalable Content-Addressable Network", in Proc. ACM SIGCOMM (San Diego, CA, August 2001).
• Mayank Bawa, Brian F. Cooper, Arturo Crespo, Neil Daswani, Prasanna Ganesan, Hector Garcia-Molina, Sepandar Kamvar, Sergio Marti, Mario Schlosser, Qi Sun, Patrick Vinograd, and Beverly Yang, "Peer-to-Peer Research at Stanford".
• Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications", ACM SIGCOMM 2001.
• Prasanna Ganesan, Qixiang Sun, and Hector Garcia-Molina, "YAPPERS: A Peer-to-Peer Lookup Service over Arbitrary Topology", INFOCOM 2003.
198