* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download What is Gnutella?
Survey
Document related concepts
Transcript
An overview of Gnutella 1 What is Gnutella? Gnutella is a protocol for distributed search • peer-to-peer comm • decentralized model • No third party lookup Two stages: 1. Join Network … later 2. Use Network, I.e discover / search other peers 2 Gnutella Jargon Servent: A Gnutella node. Each servent is both a server and a client. 2 Hops Hops: a hop is a pass through an intermediate node 1 Hop client TTL: how many hops a packet can go before it dies (default setting is 7 in Gnutella) 3 Gnutella Scenario Step 0: Join the network Step 1: Determining who is on the network • "Ping" packet is used to announce your presence on the network. • Other peers respond with a "Pong" packet. • Also forwards your Ping to other connected peers • A Pong packet also contains: • an IP address • port number • amount of data that peer is sharing • Pong packets come back via same route Step 2: Searching •Gnutella "Query" ask other peers if they have the file you desire A Query packet might ask, "Do you have any content that matches the string ‘Homer"? • Peers check to see if they have matches & respond (if they have any matches) & send packet to connected peers • Continues for TTL Step 3: Downloading • Peers respond with a “QueryHit” (contains contact info) 4 • File transfers use direct connection using HTTP protocol’s GET method • When there is a firewall a "Push" packet is used – reroutes via Push path Remarks Simple idea , but lacks scalability, since bandwidth is wasted. Sometimes, existing objects may not be located due to limited TTL. Various improved search strategies have been proposed. 5 Searching in Gnutella The topology is dynamic, I.e. constantly changing. How do we model a constantly changing topology? Usually, we begin with a static topology, and later account for the effect of churn. Modeling Static topology (measurements provide useful inputs) Random graph Power law graph (Q. Is Gnutella topology a random graph?) Search strategies Flooding Random walk / Biased random walk / multiple walker one-hop replication / two-hop replication 6 Gnutella topology Gnutella topology is a power-law graph. (Also called scale-free graph) What is a power-law graph? The number of nodes with degree k = c.k - r Contrast this with Gaussian distribution where the number of nodes with degree k = c. 2 - k. Many graphs in the nature exhibit power-law characteristics. Examples, world-wide web (the number of pages that have k in-links Is proportional to k - 2), fraction of scientic papers that receive k citations is k -3 etc. 7 # of telephone numbers from which calls were made AT&T Call Graph How many telephone numbers receive calls from k different telephone numbers? # of telephone numbers called 8 4 Gnutella network power-law link distribution proportion of nodes 10 10 10 data power-law fit t = 2.07 2 1 0 10 0 1 10 number of neighbors summer 2000, data provided by Clip2 9 5 A possible explanation Nodes join at different times. The more connections a node has, the more likely it is to acquire new connections (Rich gets richer). Popular webpages attract new pointers. Such a growth process produces power-law network 10 7 Search via random walk Existence of a path does not necessarily mean that such a path can be discovered 11 Search via Random Walk Search metrics Discovery time in hops (also called delay) Distance covered (overhead) by the walker Both should be as small as possible. For a single random walker, these are equal. For search by flooding, if delay = h but distance = d + d2 + … + dh where d = degree of a node. K random walker (fixed K) is a compromise. 12 A simple analysis of random walk Let p = Population of the object. i.e. the fraction of nodes hosting the object T = TTL (time to live) Hop count h probability 1 p 2 (1-p).p 3 (1-p)2.p T (1-p)T-1.p 13 A simple analysis of random walk Expected hop count E(h) = 1.p + 2.(1-p).p + 3(1-p)2.p + …+ T.(1-p)T-1.p = 1/p. (1-(1-p)T) - T(1-p)T With a large TTL, E(h) = 1/p, which is intuitive. With a small TTL, there is a risk that search will time out before an existing object is located. 14 K random walkers Assume they all k walkers start in unison. Probability that none could find the object after one hop = (1-p)k. Prob. that none succeeded after k hops = (1-p)kT. So the probability that at least one walker succeeded is 1-(1-p)kT. A typical assumption is that the search is abandoned as soon as at least one walker succeeds Expected overhead = 1-(1-p)T-1 - T. (1-p)T-1 . k p As k increases, the probability of success increases, the overhead increases, but the delay decreases. There is a tradeoff here. 15 Increasing search efficiency Major strategies 1. Biased walk utilizing node degree heterogeneity. 2. Utilizing structural properties like random graph, power-law graphs, or small-world properties 3. Topology adaptation for faster search 4. Introducing two layers in the graph structure using supernodes 16 Dynamic Topology Adaptation • Make high-capacity nodes have high degree (i.e., more neighbors) • Per-node level of satisfaction, S: – 0 no neighbors, 1 enough neighbors – Function of: • Node’s capacity • Neighbors’ degrees Neighbors’ capacities Their age – When S << 1, look for neighbors aggressively 17 One hop replication Each node keeps track of the indices of the files belonging to its immediate neighbors. As a result, high capacity / high degree nodes can provide useful clues to a large number of search queries. Where is 18 Biased random walk P=2/10 P=5/10 P=3/10 Each node records the degree of the neighboring nodes. Search easily gravitates towards high degree nodes that hold more clues. 19 power-law graph number of nodes found 94 67 63 54 2 6 1 Deterministic biased walk 20 9 The KaZaA approach Where is ABC? ABC Supernode Supernode Powerful nodes (supernodes) act as local index servers, and client queries are propagated to other supernodes. Two-layered. 21 Case Study: Gia (Chawathe et al. Making Gnutella-like P2P systems scalable. SIGCOMM ‘03) Three major decisions 1. Flooding replaced by (biased) random walk 2. Topology adaptation: high-capacity nodes maintain more clues via one hop replication. They should be easily reachable 3. Token based flow control to prevent overloading of high-degree nodes. 22 Topology Control Steps for a node X when Y wants to join {X has spare capacity} If # of neighbors + 1 < max then accept Y as a neighbor {when X has no spare capacity, it has to drop a neighbor} If C(Y) > C(i: i is a current neighbor of X) then accept Y as neighbor and drop any one of the existing neighbors. {Otherwise, X has to pick a Z that can be dropped in favor Y} pick the highest degree current neighbor Z If C(Y) > C(Z) or Y has fewer neighbors than Z then drop Z, accept Y else reject Y 23 {This last step ensures that we do not drop poorly connected neighbors} Simulation Results • Compare four systems – – – – FLOOD: TTL-scoped, random topologies RWRT: Random walks, random topologies SUPER: Supernode-based search GIA: search using GIA protocol suite • Metric: – Collapse point: aggregate throughput that the system can sustain. The success rate drops to 90%. 24 Collapse Point (qps/node) System Performance 1000 GIA: N=10,000 SUPER: N=10,000 10 RW RT: N=10,000 FLOOD: N=10,000 0.1 0.001 0.00001 0.01 % 0.1 % Replication Rate (percentage) 1% population of the object GIA outperforms SUPER, RWRT & FLOOD by many orders of magnitude in terms of aggregate query load 25 Factor Analysis Algorithm Collapse point Algorithm Collapse point RWRT 0.0005 GIA 7 RWRT+OHR 0.005 GIA – OHR RWRT+BIAS 0.0015 GIA – BIAS 0.004 6 RWRT+TADAPT 0.001 0.0006 GIA – TADAPT 0.2 GIA – FLWCTL 2 RWRT+FLWCTL Topology adaptation Flow control 26