Transcript
UCDavis, ecs251
Fall 2007
ecs251 Fall 2007:
Operating System Models
#4: Peer-to-Peer Systems
Dr. S. Felix Wu
Computer Science Department
University of California, Davis
http://www.cs.ucdavis.edu/~wu/
[email protected]
10/23/2007
P2P
1
UCDavis, ecs251
Fall 2007
The role of the service provider…
• Centralized management of services
– DNS, Google, www.cnn.com, Blockbuster, SBC/Sprint/AT&T, cable service, Grid computing, AFS, bank transactions…
• Information, computing, and network resources owned by one or very few administrative domains
– Some with an SLA (Service Level Agreement)
Interacting with the “SP”
• Service providers own the information and the interactions
– Some enhance/establish the interactions
Incentives for the SP
• Business model
• By law
Let’s compare …
• Centralized services: Google, Blockbuster, CNN, MLB/NBA, LinkedIn, eBay
• P2P counterparts: Skype, BitTorrent, blogs, YouTube, botnets, cyber-paparazzi
Toward P2P
• More participation of the end nodes (or their users)
– More decentralized computing/network resources available
– End-user controllability and interactions
– Security/robustness concerns
Peer’s role
• Shifting functionality/service/information from the SP to the peers
– Peer’s CPU cycles
– Peer’s Internet access
– Peer’s storage
Service Providers in P2P
• We might not like SPs, but we still cannot avoid them entirely.
– Who is going to lay the fiber and switch?
– Can we avoid DNS?
– How can we stop “cyber-bullying” and similar abuses?
– Copyright enforcement?
– Will the Internet become a junkyard?
We will discuss…
• P2P system examples
– Unstructured, structured, incentive
• Architectural analysis and issues
• Future P2P applications and why
Questions to ask
• Peer’s role (or SP’s role)
• Peer’s controllability and vulnerability
• Incentives to contribute
• Peer’s mobility and dynamics
• Scalability
Challenge to you…
• Define a new P2P-related application, service, or architecture.
• Justify why it is practical, useful, and will scale well.
– Example: sharing cooking recipes, or experiences and recommendations about restaurants and hotels
Napster
• P2P file sharing
• “Unstructured”
Napster
(Diagram: peers interact with a central Napster server that maintains the index.)
1. File location request (to the Napster server’s index)
2. List of peers offering the file
3. File request (to a peer)
4. File delivered
5. Index update
Napster
• Advantages?
• Disadvantages?
Napster: Questions to ask
• Peer’s role (or SP’s role)
• Peer’s controllability and vulnerability
• Incentives to contribute
• Peer’s mobility and dynamics
• Scalability
Gnutella
• Originally conceived by Justin Frankel, the 21-year-old founder of Nullsoft
• March 2000: Nullsoft posts Gnutella to the web
• A day later, AOL removes Gnutella at the behest of Time Warner
• The Gnutella protocol version 0.4:
http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf
and version 0.6:
http://rfc-gnutella.sourceforge.net/Proposals/Ultrapeer/Ultrapeers.htm
• There are multiple open-source implementations at http://sourceforge.net/, including:
– Jtella
– Gnucleus
• Software released under the Lesser GNU Public License (LGPL)
• The Gnutella protocol has been widely analyzed
Gnutella Protocol Messages
• Broadcast messages
– Ping: initiating message (“I’m here”)
– Query: search pattern and TTL (time-to-live)
• Back-propagated messages
– Pong: reply to a ping; contains information about the peer
– Query response: contains information about the computer that has the needed file
• Node-to-node messages
– GET: return the requested file
– PUSH: push the file to me
Steps:
• Node 2 initiates search for file A
(Diagram: an overlay of nodes 1–7.)
Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
(Diagram: the query for A propagates from node 2 to its neighbors.)
Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message
(Diagram: the query continues to spread through the overlay.)
Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message
• Nodes that have file A initiate a reply message
(Diagram: nodes 5 and 7 reply with A:5 and A:7.)
Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message
• Nodes that have file A initiate a reply message
• Query reply message is back-propagated
Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message
• Nodes that have file A initiate a reply message
• Query reply message is back-propagated
(Diagram: replies A:5 and A:7 travel the reverse path toward node 2.)
Limited-Scope Flooding / Reverse-Path Forwarding
Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message
• Nodes that have file A initiate a reply message
• Query reply message is back-propagated
• File download
• Note: file transfer between clients that are both behind firewalls is not possible; if only one client, X, is behind a firewall, Y can request that X push the file to Y
Gnutella
• Advantages?
• Disadvantages?
Gnutella: Questions to ask
• Peer’s role (or SP’s role)
• Peer’s controllability and vulnerability
• Incentives to contribute
• Peer’s mobility and dynamics
• Scalability
Gnutella vs. Napster
• GUID: short for Globally Unique Identifier, a randomized string used to uniquely identify a host or message on the Gnutella network. This prevents duplicate messages from being sent on the network.
• GWebCache: a distributed system for helping servents connect to the Gnutella network, thus solving the “bootstrapping” problem. Servents query any of several hundred GWebCache servers to find the addresses of other servents. GWebCache servers are typically web servers running a special module.
• Host catcher: Pong responses allow servents to keep track of active Gnutella hosts.
• On most servents, the default port for Gnutella is 6346.
Gnutella Network Growth
(Chart: number of nodes in the largest network component, in thousands, measured from 11/20/00 through 05/29/01; the component grows to roughly 50,000 nodes.)
“Limited-Scope Flooding”
• Ripeanu reported that Gnutella traffic totals 1 Gbps (or 330 TB/month).
– Compare to 15,000 TB/month in the US Internet backbone (December 2000)
– This estimate excludes actual file transfers
• Reasoning:
– QUERY and PING messages are flooded; they form more than 90% of generated traffic
– Predominant TTL = 7
– >95% of nodes are less than 7 hops away
– Measured traffic at each link is about 6 kbps
– Network with 50k nodes and 170k links
Perfect Mapping
(Diagram: overlay nodes A–H mapped onto the underlying physical network so that overlay links follow physical links.)
Inefficient Mapping
(Diagram: the same overlay wired without regard to the physical topology.)
• Link D–E needs to support six times higher traffic.
Topology Mismatch
• The overlay network topology doesn’t match the underlying Internet infrastructure topology!
• 40% of all nodes are in the 10 largest Autonomous Systems (AS)
• Only 2–4% of all TCP connections link nodes within the same AS
• Largely “random wiring”
• Most Gnutella-generated traffic crosses AS borders, making the traffic more expensive
• May cause ISPs to change their pricing scheme
Scalability
• Whenever a node receives a message (ping/query), it sends copies out to all of its other connections.
• Existing mechanisms to reduce traffic:
– TTL counter
– Caching information about messages received, so that duplicated messages are not forwarded
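The two mechanisms above can be sketched together. This is a minimal, illustrative model, not any real Gnutella implementation; the `Node` class and its names are assumptions for the example.

```python
# Sketch of Gnutella-style limited-scope flooding with the two
# traffic-reduction mechanisms above: a TTL counter and a cache of
# already-seen message IDs. Illustrative only.

class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []       # directly connected peers
        self.seen = set()         # message IDs already processed

    def receive(self, msg_id, ttl, sender):
        if msg_id in self.seen:   # duplicate: drop, do not re-flood
            return
        self.seen.add(msg_id)
        if ttl <= 0:              # TTL exhausted: stop here
            return
        for peer in self.neighbors:
            if peer is not sender:
                peer.receive(msg_id, ttl - 1, self)

# Tiny example: a line of 10 nodes; a query with TTL=7 started at one
# end reaches only the nodes within 7 hops.
nodes = [Node(i) for i in range(10)]
for a, b in zip(nodes, nodes[1:]):
    a.neighbors.append(b)
    b.neighbors.append(a)
nodes[0].receive("query-42", ttl=7, sender=None)
reached = [n.name for n in nodes if "query-42" in n.seen]
```

With the predominant TTL of 7, the query dies out after seven forwarding hops, which is why >95% node coverage is enough in practice.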
• 70% of Gnutella users share no files
• 90% of users answer no queries
• Those who have files to share may limit the number of connections or upload speed, resulting in a high download failure rate.
• If only a few individuals contribute to the public good, these few peers effectively act as centralized servers.
Anonymity
• Gnutella provides anonymity by masking the identity of the peer that generated a query.
• However, IP addresses are revealed at various points in its operation: HITS packets include the URL for each file, revealing the IP addresses.
Query Expressiveness
• Format of the query is not standardized
• No standard format or matching semantics for the QUERY string; its interpretation is completely determined by each node that receives it
• String literal vs. regular expression
• Directory name, filename, or file contents
• Malicious users may even return files unrelated to the query
Superpeers
• Cooperative, long-lived peers, typically with significant resources, that handle a very high volume of query-resolution traffic.
• Gnutella is a self-organizing, large-scale P2P application that produces an overlay network on top of the Internet; it appears to work
• Growth is hindered by the volume of generated traffic and inefficient resource use
• Since there is no central authority, the open-source community must commit to making any changes
• Suggested changes have been made by:
– “Peer-to-Peer Architecture Case Study: Gnutella Network,” by Matei Ripeanu
– “Improving Gnutella Protocol: Protocol Analysis and Research Proposals,” by Igor Ivkovic
Freenet before v0.7
• Essentially the same as Gnutella:
– Limited-scope flooding
– Reverse-path forwarding
• Difference:
– Data objects (i.e., files) are also delivered via “reverse-path forwarding”
• Freenet v0.7: “darknet routing”
P2P Issues
• Scalability & load balancing
• Anonymity
• Fairness, incentives & trust
• Security and robustness
• Efficiency
• Mobility
Incentive-driven Fairness
• P2P means we all should contribute…
– Hopefully fairly, but the majority is selfish…
• “Incentive for people to contribute…”
File Organization
(Diagram: a file is divided into pieces of 256 KB, numbered 1, 2, 3, 4, …; each piece is divided into 16 KB blocks; the final piece may be incomplete.)
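The layout above can be checked with a little arithmetic. This is a sketch using the slide's constants; the `layout` function is an illustrative name, not part of any BitTorrent client.

```python
# Sketch of BitTorrent's file layout: a file split into 256 KB
# pieces, each piece into 16 KB blocks; the last piece may be
# incomplete. Constants follow the slide.

PIECE_SIZE = 256 * 1024   # 256 KB
BLOCK_SIZE = 16 * 1024    # 16 KB

def layout(file_size):
    """Return (piece count, size of last piece, blocks per full piece)."""
    pieces = -(-file_size // PIECE_SIZE)          # ceiling division
    last = file_size - (pieces - 1) * PIECE_SIZE  # may be < PIECE_SIZE
    return pieces, last, PIECE_SIZE // BLOCK_SIZE

# A 1 MB + 100 KB file: 5 pieces, the last one incomplete (100 KB).
pieces, last, blocks = layout(1024 * 1024 + 100 * 1024)
```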
Initialization
(Diagram: the user fetches MYFILE.torrent from a web server via HTTP GET. The .torrent file contains the tracker URL (http://mytracker.com:6969/) and piece hashes (S3F5YHG6FEB, FG5467HGF367, F456JI9N5FF4E, …). The client “registers” with the tracker and receives a list of peers, e.g. ID1 169.237.234.1:6881, ID2 190.50.34.6:5692, …; it then connects to peers such as Peer 1, Peer 2, and Peer 40.)
Peer/Seed
(Diagram: a seed and peers 1–4 exchanging pieces.)
“On the Wire” Protocol (over TCP)
(Diagram: local and remote peer perform a handshake carrying the peer ID/infohash, then exchange BitField messages; initially each side has Interested = 0 and Choked = 1.)
Non-keepalive messages:
0 – choke
1 – unchoke
2 – interested
3 – not interested
4 – have
5 – bitfield
6 – request
7 – piece
8 – cancel
Choking
• By default, every peer is “choked”
– Stop “uploading” to them, but the TCP connection is still there.
• Select 4–6 peers to “unchoke”??
– “Re-choke” every 30 seconds
– How to decide?
• Optimistic unchoking
– What is this?
“Interested”
• A request for a piece (or its sub-pieces)
P2P
52
UCDavis, ecs251
Fall 2007
Get a piece/block!!
• Download:
– Which peer? (download from whom? does it matter?)
– Which piece?
• How about “upload”?
– Which peer?
– Which piece?
Piece Selection
• Pipelining (5 requests)
• Strict priority (incomplete pieces first)
• Rarest first
• What is the problem?
Rarest First
• Exchanging bitmaps with 20+ peers
– Initial messages
– “have” messages
• Array of buckets
– The i-th bucket contains “pieces” with i known instances
– Within the same bucket, the client randomly selects one piece.
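The bucket scheme above can be sketched as follows. This is an illustrative model under the slide's description, not code from any BitTorrent client; `pick_rarest` and its arguments are assumed names.

```python
# Sketch of rarest-first piece selection: bucket pieces by how many
# known peers hold them, then pick randomly within the lowest
# non-empty bucket. Illustrative only.
import random
from collections import defaultdict

def pick_rarest(peer_bitmaps, have_already):
    counts = defaultdict(int)
    for bitmap in peer_bitmaps:            # one set of piece indices per peer
        for piece in bitmap:
            counts[piece] += 1
    buckets = defaultdict(list)            # i-th bucket: pieces with i instances
    for piece, c in counts.items():
        if piece not in have_already:
            buckets[c].append(piece)
    if not buckets:
        return None
    rarest = min(buckets)                  # lowest known-instance count
    return random.choice(buckets[rarest])

# Piece 3 is held by only one peer, so it is chosen.
choice = pick_rarest([{1, 2}, {1, 2, 3}, {1}], have_already=set())
```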
Piece Selection
• Pipelining (5 requests)
• Strict priority
• 3 stages:
– Random first piece
– Rarest first
– Endgame mode
Piece Selection
• Piece (64 KB–1 MB) vs. sub-piece (16 KB)
– Piece size: a trade-off between performance and the size of the torrent file itself
– A client might request different sub-pieces of the same piece from different peers.
• Strict priority: sub-pieces of an incomplete piece before new pieces
• Rarest first
– Exception: “random first”
– Get the content out of the seed(s) as soon as possible
Get a piece/block!!
• Download:
– Which peer?
– Which piece?
• How about “upload”?
– Which peer?
– Which piece?
Peer Selection
• Focus on rate
• Upload to 4–6 peers
• Random unchoke
• Global rate cap only
BitTorrent: “Tit for Tat”
• Equivalent retaliation (game theory)
– A peer will “initially” cooperate, then respond in kind to an opponent’s previous action. If the opponent was previously cooperative, the agent is cooperative; if not, the agent is not.
Choking
• By default, every peer is “choked”
– Stop “uploading” to them, but the TCP connection is still there.
• Select 4–6 peers to “unchoke”??
– Best “upload rates” and “interested”
– Upload to the unchoked peers and monitor the download rate for all peers
– “Re-choke” every 30 seconds
• Optimistic unchoking (6+1)
– Randomly select a choked peer to unchoke
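The selection rule above can be sketched as follows. This is a simplified model of the policy described on the slide (best-rate interested peers plus one optimistic unchoke); the function and the tuple layout are assumptions for the example, not a real client's code.

```python
# Sketch of the unchoke round: pick the 4 best-rate "interested"
# peers, then optimistically unchoke one additional random peer.
import random

def select_unchoked(peers, n_best=4):
    """peers: list of (name, observed_rate, interested). Returns names."""
    interested = [p for p in peers if p[2]]
    best = sorted(interested, key=lambda p: p[1], reverse=True)[:n_best]
    unchoked = {name for name, _, _ in best}
    rest = [name for name, _, _ in peers if name not in unchoked]
    if rest:                                # the optimistic "+1"
        unchoked.add(random.choice(rest))
    return unchoked

peers = [("a", 50, True), ("b", 40, True), ("c", 30, True),
         ("d", 20, True), ("e", 10, True), ("f", 5, False)]
unchoked = select_unchoked(peers)
```

In a real client this would run every 30 seconds (the re-choke interval), with the optimistic slot rotating less frequently.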
BitTorrent
• Fairness of download and upload between a pair of peers
• Every 10 seconds, estimate the download bandwidth from the other peer
– Based on this performance estimate, decide whether to continue uploading to the other peer
Properties
• Bigger “%” = better chance of being unchoked
• Bigger “%” ≈ better UL and DL rates?!
Who to Unchoke?
(Diagram: a seed and peers 1–4.)
Seed Unchoking
• Old algorithm
– Unchoke the fastest peers (how?)
– Problem: the fastest peers may monopolize seeds
• New algorithm
– Periodically sort all peers according to their last unchoke time
– Prefer the most recently unchoked peers; on a tie, prefer the fastest
– (Presumably) achieves an equal spread of seed bandwidth
Attacks to BT
• ???
Attacks to BT
• Download only from the seeds
• Download only from the fastest peers
• Announcing false pieces
• Privacy (torrent, source IP addresses)
BitTorrent: Questions to ask
• Peer’s role (or SP’s role)
• Peer’s controllability and vulnerability
• Incentives to contribute
• Peer’s mobility and dynamics
• Scalability
BitTorrent
• “Tit-for-tat” incentive model within the same torrent
• Piece/peer selection and choking
• The need for a tracker and a torrent file
Client Implementations
• mainline: written in Python; right now, the only one employing the new seed-unchoking algorithm
• Azureus: the most popular, written in Java; implements a special protocol between clients (e.g., peers can exchange peer lists)
• Other popular clients: ABC, BitComet, BitLord, BitTornado, μTorrent, the Opera browser
• Various non-standard extensions:
– Retaliation mode: detect compromised/malicious peers
– Anti-snubbing: ignore a peer who ignores us
– Super-seeding: a seed masquerading as a leecher
Resources
• Basic BitTorrent mechanisms: [Cohen, P2PECON’03]
• BitTorrent specification wiki: http://wiki.theory.org/BitTorrentSpecification
• Measurement studies: [Izal et al., PAM’04], [Pouwelse et al., Delft TR 2004 and IPTPS’05], [Guo et al., IMC’05], and [Legout et al., INRIA-TR-2006]
• Theoretical analysis and modeling: [Qiu et al., SIGCOMM’04] and [Tian et al., Infocom’06]
• Simulations: [Bharambe et al., MSR-TR-2005]
• Sharing incentives and exploiting them: [Shneidman et al., PINS’04], [Jun et al., P2PECON’05], and [Liogkas et al., IPTPS’06]
Trackerless BitTorrent
• Every BT peer is a tracker!
• But how would they share and exchange information regarding other peers?
• Similar to Napster’s index server or DNS
Pure P2P
• Every peer is a tracker
• Every peer is a DNS server
• Every peer is a Napster index server
• How can this be done?
– We try to remove/reduce the role of “special servers”!
Unstructured P2P
• P2P network topology is arbitrary!
– Gnutella
– BitTorrent and Napster
Unstructured P2P
• P2P network topology is arbitrary!
– Gnutella
– BitTorrent and Napster
• The mapping [content/object → peer] is arbitrary!
– How to search for content in Gnutella?
– In BitTorrent and Napster?
Disadvantages of u-P2P
• Flooding cost
• Rare contents
– Also, the issue of “all or 1”
From u-P2P to s-P2P
• P2P network topology is formed according to the content ownership
• Unique “naming/keying” for the content object as well as the peer
Peer
• The requirements of a peer?
Structured Peering
• Peer identity and routability
P2P
80
UCDavis, ecs251
Fall 2007
Structured Peering
• Peer identity and routability
• Key/content assignment
– Which identity owns what? (Google Search?)
Structured Peering
• Peer identity and routability
• Key/content assignment
– Which identity owns what?
– Napster: centralized index service
– Skype/Kazaa: login server & super peers
– DNS: hierarchical DNS servers
• Two problems:
(1) How to connect to the “topology”?
(2) How to cope with failures/changes?
DHT
• Most s-P2P systems are DHT-based.
• Distributed hash tables (DHTs):
– A decentralized lookup service of a hash table
– (name, value) pairs stored in the DHT
– Any peer can efficiently retrieve the value associated with a given name
– The mapping from names to values is distributed among peers
HT as a Search Table (BitTorrent, Napster)
(Diagram: content-object/peer names are hashed to a “160-bit” index key.)
Information/content is distributed, and we need to know where:
• Where is this piece of music?
• Is this BT piece available?
• What is the location of this type of content?
• What is the current IP address of this Skype user?
DHT as a Search Table
(Diagram: the index-key space is partitioned; each segment of keys is owned by a peer.)
DHT Segment Ownership
(Diagram: each peer owns one segment of the index-key space.)
DHT
• Scalable
• Peer arrivals, departures, and failures
• Unstructured versus structured
DHT (Name, Value)
• How can a DHT be used to avoid trackers in BitTorrent?
DHT-based Tracker
(Diagram: for the key “FreeBSD 5.4 CD images,” the index key is published on the class web site. Whoever owns this hash entry is the tracker for the corresponding key! The entry stores the seed’s IP address; peers access it via PUT & GET.)
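The PUT/GET idea above can be sketched with a dictionary standing in for the distributed table. This is an illustrative model only; the functions and the example addresses (taken from the earlier tracker slide) are assumptions, not a real DHT API.

```python
# Sketch of a DHT-as-tracker: the owner of the hash of the torrent's
# key stores the swarm's peer list; peers PUT their address under
# that key and GET the current list. A dict stands in for the DHT.
import hashlib

dht = {}  # key hash -> list of peer addresses (the "tracker" entry)

def torrent_key(name):
    return hashlib.sha1(name.encode()).hexdigest()

def put(key, peer_addr):
    dht.setdefault(torrent_key(key), []).append(peer_addr)

def get(key):
    return dht.get(torrent_key(key), [])

put("FreeBSD 5.4 CD images", "169.237.234.1:6881")
put("FreeBSD 5.4 CD images", "190.50.34.6:5692")
swarm = get("FreeBSD 5.4 CD images")
```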
Chord
• Given a key (content object), it maps the key onto a peer: consistent hashing
• Assigns keys to peers
• Solves the problem of locating a key in a collection of distributed peers
• Maintains routing information as peers join and leave the system
Chord
• Consistent hashing
• A simple key-lookup algorithm
• A scalable key-lookup algorithm
• Node joins and stabilization
• Node failures
Consistent Hashing
• A consistent hash function assigns each peer and key an m-bit identifier (e.g., 140 bits).
• SHA-1 as a base hash function.
• A peer’s identifier is defined by hashing the peer’s IP address. (Other possibilities?)
• A content identifier is produced by hashing the key:
– ID(peer) = SHA-1(IP, port)
– ID(content) = SHA-1(related to the content object)
– Application-dependent!
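The ID assignment above can be sketched as follows: hash with SHA-1, then truncate to m bits so peers and content land on the same identifier circle. A small m is used here just to keep the numbers readable; the function name is an assumption for the example.

```python
# Sketch of Chord-style identifier assignment: SHA-1 the peer's
# (IP, port) or the content key, truncated to an m-bit circle.
import hashlib

M = 16  # identifier bits (the slide's example uses a much larger m)

def chord_id(data, m=M):
    digest = hashlib.sha1(data.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

peer_id = chord_id("169.237.234.1:6881")    # ID(peer) = SHA-1(IP, port)
content_id = chord_id("freebsd-5.4.iso")    # ID(content) = SHA-1(key)
```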
Peer, Content
• In an m-bit identifier space, there are 2^m identifiers (for both peers and content).
• Which peer handles which content?
Peer, Content
• In an m-bit identifier space, there are 2^m identifiers (for both peers and content).
• Which peer handles which content?
– We will not have 2^m peers/contents!
– Each peer might need to handle more than one content object.
– In that case, which peer has what?
Consistent Hashing
• In an m-bit identifier space, there are 2^m identifiers.
• An identifier circle modulo 2^m.
• The identifier ring is called the Chord ring.
• Content X is assigned to the first peer whose identifier is equal to or follows (the identifier of) X in the identifier space.
• This peer is the successor peer of key X, denoted successor(X).
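The successor rule above can be sketched directly. This illustrative code uses the course's m = 3 example ring with peers at 0, 1, and 3; the function name is an assumption.

```python
# Sketch of successor(X) on a Chord ring: content X goes to the first
# peer whose identifier is equal to or follows X, wrapping around
# modulo 2^m.
import bisect

def successor(x, peers, m=3):
    """peers: sorted list of peer IDs on a 2^m identifier circle."""
    i = bisect.bisect_left(peers, x % (2 ** m))
    return peers[i % len(peers)]            # wrap around past the top

peers = [0, 1, 3]
assignments = {k: successor(k, peers) for k in (1, 2, 6)}
```

This reproduces the example on the next slide: successor(1) = 1, successor(2) = 3, and successor(6) wraps around to 0.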
Successor Peers
(Diagram: an identifier circle with m = 3, identifiers 0–7; nodes at 0, 1, and 3; keys 1, 2, and 6. successor(1) = 1, successor(2) = 3, successor(6) = 0.)
Join and Departure
• When a node N joins the network, certain content previously assigned to N’s successor becomes assigned to N.
• When node N leaves the network, all of its assigned content is reassigned to N’s successor.
Join
(Diagram: a new node joins the ring; the keys between its predecessor and itself move from its successor to the new node.)
Departure
(Diagram: a node leaves the ring; all of its keys are reassigned to its successor.)
Join/Depart
• What information must be maintained?
Join/Depart
• What information must be maintained?
– Pointer to successor(s)
– The content itself (application-dependent)
Tracker gone?
(Diagram: the DHT-based tracker again: whoever owns the hash entry for the “FreeBSD 5.4 CD images” index key is the tracker for the corresponding key, storing the seed’s IP address via PUT & GET.)
How to identify the tracker?
• And its IP address, of course?
A Simple Key Lookup
• A very small amount of routing information suffices to implement consistent hashing in a distributed environment.
• If each node knows only how to contact its current successor node on the identifier circle, all nodes can be visited in linear order.
• Queries for a given identifier can be passed around the circle via these successor pointers until they encounter the node that holds the key.
A Simple Key Lookup
• Pseudocode for finding a successor:
// ask node N to find the successor of id
N.find_successor(id)
  if (id ∈ (N, successor])
    return successor;
  else
    // forward the query around the circle
    return successor.find_successor(id);
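A runnable version of the pseudocode above follows. The subtle part is the half-open interval test (N, successor], which must respect wrap-around modulo 2^m; the ring below uses the node IDs from the Chord paper's node-8/key-54 example, and the class is illustrative.

```python
# Runnable linear lookup: each node knows only its successor, and a
# query walks the ring until the key falls in (node, successor].

def in_interval(x, a, b):
    """True if x is in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b                  # interval wraps past 0

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = None

    def find_successor(self, key):
        if in_interval(key, self.id, self.successor.id):
            return self.successor
        return self.successor.find_successor(key)

# Ring with nodes 8, 14, 21, 32, 38, 42, 48, 51, 56 on a 2^6 circle.
ids = [8, 14, 21, 32, 38, 42, 48, 51, 56]
nodes = [Node(i) for i in ids]
for a, b in zip(nodes, nodes[1:] + nodes[:1]):
    a.successor = b
owner = nodes[0].find_successor(54)         # query from node 8 for key 54
```

The query from node 8 for key 54 hops through every node between N8 and N51 before landing on N56, which is exactly the O(N) behavior the next slides improve on.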
A Simple Key Lookup
• The path taken by a query from node 8 for key 54:
(Diagram: the query is forwarded node by node around the ring until it reaches the successor of key 54.)
Successor
• Each active node MUST know the IP address of its successor!
– N8 has to know that the next node on the ring is N14.
• Departure: N8 ⇒ N21 (if N14 departs, N8’s successor becomes N21)
• But what about a failure or crash?
Robustness
• Successors in R hops
– N8 ⇒ N14, N21, N32, N38 (R = 4)
– Periodic pinging along the path to check, and also to find out whether there are “new members” in between
Is that good enough?
Without Periodic Ping…??
• Triggered only by dynamics (join/depart)!
Complexity of the Search
• Time/messages: O(N)
– N: number of nodes on the ring
• Space: O(1)
– We only need to remember R IP addresses
• Stabilization depends on the “period”.
Scalable Key Location
• To accelerate lookups, Chord maintains additional routing information.
• This additional information is not essential for correctness, which is achieved as long as each node knows its correct successor.
Finger Tables
• Each node N maintains a routing table with up to m entries (where m is the number of bits in identifiers), called the finger table.
• The i-th entry in the table at node N contains the identity of the first node s that succeeds N by at least 2^(i-1) on the identifier circle.
• s = successor(N + 2^(i-1)).
• s is called the i-th finger of node N, denoted N.finger(i).
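The definition above can be sketched by building a finger table for the m = 3 example ring with peers 0, 1, and 3. Function names are illustrative assumptions.

```python
# Sketch of finger-table construction: finger[i] = successor(n + 2^(i-1))
# for i = 1..m, on a 2^m identifier circle.
import bisect

def successor(x, peers, m):
    i = bisect.bisect_left(peers, x % (2 ** m))
    return peers[i % len(peers)]

def finger_table(n, peers, m=3):
    return [successor((n + 2 ** (i - 1)) % (2 ** m), peers, m)
            for i in range(1, m + 1)]

peers = [0, 1, 3]
fingers_of_0 = finger_table(0, peers)   # starts 1, 2, 4
```

This reproduces the table on the next slide: node 0's fingers point at 1, 3, and 0, while node 3's three fingers all point back at node 0.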
Finger Tables
s = successor(n + 2^(i-1))
(Diagram: a ring with m = 3 and nodes 0, 1, and 3. Finger table of node 0: starts 1, 2, 4 with successors 1, 3, 0. Finger table of node 1: starts 2, 3, 5 with successors 3, 3, 0. Finger table of node 3: starts 4, 5, 7 with successors 0, 0, 0. Keys: 6 at node 0, 1 at node 1, 2 at node 3.)
Finger Tables
• A finger table entry includes both the Chord identifier and the IP address (and port number) of the relevant node.
• The first finger of N is the immediate successor of N on the circle.
Example query
• The path of a query for key 54 starting at node 8:
(Diagram: with finger tables, the query jumps across the ring, at least halving the remaining distance at each step.)
Scalable Key Location
• Since each node has finger entries at power-of-two intervals around the identifier circle, each node can forward a query at least halfway along the remaining distance between itself and the target identifier. From this intuition follows a theorem:
Theorem: With high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N).
Complexity of the Search
• Time/messages: O(log N)
– N: number of nodes on the ring
• Space: O(log N)
– We need to remember R IP addresses
– We need to remember log N fingers
• Stabilization depends on the “period”.
An Example
• M = 140 (identifier size), ring size is 2^140
• N = 2^16 (number of nodes)
• How many entries do we need in the finger table?
Recall: each node n maintains a routing table with up to m entries (where m is the number of bits in identifiers), called the finger table. The i-th entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle: s = successor(n + 2^(i-1)).
Complexity of the Search
• Time/messages: O(M)
– M: number of bits of the identifier
• Space: O(M)
– We need to remember R IP addresses
– We need to remember M fingers
• Stabilization depends on the “period”.
Structured Peering
• Peer identity and routability
– 2^M identifiers, finger-table routing
• Key/content assignment
– Hashing
• Dynamics/failures
– Inconsistency??
Joins and Stabilization
• The most important thing is the successor pointer.
• If the successor pointer is ensured to be up to date, which is sufficient to guarantee correctness of lookups, then the finger table can always be verified.
• Each node runs a “stabilization” protocol periodically in the background to update its successor pointer and finger table.
Node Joins – stabilize()
• Each time node N runs stabilize(), it asks its successor for its predecessor p, and decides whether p should be N’s successor instead.
• stabilize() notifies node N’s successor of N’s existence, giving the successor the chance to change its predecessor to N.
• The successor does this only if it knows of no closer predecessor than N.
Node Joins – stabilize()
// called periodically. verifies N’s immediate
// successor, and tells the successor about N.
N.stabilize()
  x = successor.predecessor;
  if (x ∈ (N, successor))
    successor = x;
  successor.notify(N);
// N’ thinks it might be our predecessor.
N.notify(N’)
  if (predecessor is nil or N’ ∈ (predecessor, N))
    predecessor = N’;
Stabilization
Example: n joins between np and ns.
• n acquires ns as its successor via some n’; n’s predecessor = nil
• n runs stabilize():
– n notifies ns that n is its new predecessor
– ns acquires n as its predecessor
• np runs stabilize():
– np asks ns for its predecessor (now n)
– np acquires n as its successor
– np notifies n
– n acquires np as its predecessor
• All predecessor and successor pointers are now correct.
• Fingers still need to be fixed, but old fingers will still work.
(Diagram: before the join, succ(np) = ns and pred(ns) = np; after, succ(np) = n, pred(n) = np, succ(n) = ns, and pred(ns) = n.)
fix_fingers()
• Each node periodically calls fix_fingers to make sure its finger table entries are correct.
• It is how new nodes initialize their finger tables.
• It is how existing nodes incorporate new nodes into their finger tables.
Node Joins – fix_fingers()
// called periodically. refreshes finger table entries.
N.fix_fingers()
  next = next + 1;
  if (next > m)
    next = 1;
  finger[next] = find_successor(N + 2^(next-1));
// called periodically. checks whether predecessor has failed.
N.check_predecessor()
  if (predecessor has failed)
    predecessor = nil;
Node Failures
• The key step in failure recovery is maintaining correct successor pointers.
• To help achieve this, each node maintains a successor list of its r nearest successors on the ring.
• If node n notices that its successor has failed, it replaces it with the first live entry in the list.
• Successor lists are stabilized as follows:
– Node n reconciles its list with its successor s by copying s’s successor list, removing its last entry, and prepending s to it.
– If node n notices that its successor has failed, it replaces it with the first live entry in its successor list and reconciles its successor list with its new successor.
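The list maintenance above can be sketched as follows. This is an illustrative model of the two rules on the slide (reconcile-with-successor and promote-first-live-entry); the function names and the node IDs (reused from the N8 robustness example) are assumptions.

```python
# Sketch of successor-list maintenance: copy the successor's list,
# drop its last entry, prepend the successor; on failure, promote
# the first live entry.

R = 3  # list length (the slide's r)

def reconcile(s_id, s_list):
    """New successor list for n, given its successor s and s's list."""
    return ([s_id] + s_list)[:R]

def handle_failure(succ_list, alive):
    """Drop dead entries until the first live successor is found."""
    live = [p for p in succ_list if p in alive]
    return (live[0], live) if live else (None, [])

succ_list = reconcile(14, [21, 32, 38])      # n = 8, successor s = 14
new_succ, new_list = handle_failure(succ_list, alive={21, 32, 38})
```

When N14 fails, N8 promotes N21 and would then reconcile with N21's list, so a single failure never partitions the ring as long as one list entry survives.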
Chord – The Math
• Every node is responsible for about K/N keys (N nodes, K keys)
• When a node joins or leaves an N-node network, only O(K/N) keys change hands (and only to and from the joining or leaving node)
• Lookups need O(log N) messages
• To re-establish routing invariants and finger tables after a node joins or leaves, only O(log² N) messages are required
Chord: Questions to ask
• Peer’s role (or SP’s role)
• Peer’s controllability and vulnerability
• Incentives to contribute
• Peer’s mobility and dynamics
• Scalability
Pros/Cons about DHT