An Introduction to
Peer-to-Peer networks
Diganta Goswami
IIT Guwahati
Outline
• Overview of P2P overlay networks
  • Applications of overlay networks
  • Classification of overlay networks
• Structured overlay networks
• Unstructured overlay networks
• Overlay multicast networks
2
Overview of P2P overlay networks

• What are P2P systems?
  • P2P refers to applications that take advantage of resources (storage, cycles, content, human presence) available at the end systems of the Internet.
• What are overlay networks?
  • Overlay networks are networks constructed on top of another network (e.g. IP).
• What is a P2P overlay network?
  • Any overlay network constructed by Internet peers in the application layer on top of the IP network.
3
What are P2P systems?
• Multiple sites (at the edge)
• Distributed resources
• Sites are autonomous (different owners)
• Sites are both clients and servers
• Sites have equal functionality
4
Internet P2P Traffic Statistics

Between 50 and 65 percent of all download traffic is
P2P related.
Between 75 and 90 percent of all upload traffic is P2P
related.
And it seems that more people are using p2p today

So what do people download?




61.4 % video
11.3 % audio
27.2 % games/software/etc.
Source: http://torrentfreak.com/peer-to-peer-trafficstatistics/
5
P2P overlay networks properties
Efficient use of resources
 Self-organizing
 All peers organize themselves into an application
layer network on top of IP.
 Scalability
 Consumers of resources also donate resources
 Aggregate resources grow naturally with
utilization

6
P2P overlay network properties
• Reliability
  • No single point of failure
  • Redundant overlay links between the peers
  • Redundant data sources
• Ease of deployment and administration
  • The nodes are self-organized
  • No need to deploy servers to satisfy demand
  • Built-in fault tolerance, replication, and load balancing
  • No changes required in the underlying IP network
7
P2P Applications

• P2P File Sharing
  • Napster, Gnutella, Kazaa, eDonkey, BitTorrent
  • Chord, CAN, Pastry/Tapestry, Kademlia
• P2P Communications
  • Skype, Social Networking Apps
• P2P Distributed Computing
  • Seti@home
8
Popular file sharing P2P Systems
Napster, Gnutella, Kazaa, Freenet
 Large scale sharing of files.

 User
A makes files (music, video, etc.) on
their computer available to others
 User B connects to the network, searches for
files and downloads files directly from user A

Issues of copyright infringement
9
P2P/Grid Distributed Processing

• seti@home
  • Search for extraterrestrial intelligence
  • Central site collects radio telescope data
  • Data is divided into work chunks of 300 Kbytes
  • User obtains a client, which runs in the background
  • Peer sets up a TCP connection to the central computer and downloads a chunk
  • Peer does an FFT on the chunk, uploads the results, and gets a new chunk
• Not P2P communication, but it exploits peer computing power
10
Key Issues

Management
 How
to maintain the P2P system under high rate of
churn efficiently
 Application reliability is difficult to guarantee

Lookup
 How
to find out the appropriate content/resource that
a user wants

Throughput
 Content
distribution/dissemination applications
 How to copy content fast, efficiently, reliably
11
Management Issue

• A P2P network must be self-organizing.
  • Join and leave operations must be self-managed.
  • The infrastructure is untrusted and the components are unreliable.
  • The number of faulty nodes grows linearly with system size.
• Tolerance to failures and churn
  • Content replication, multiple paths
  • Leverage knowledge of the executing application
  • Load balancing
• Dealing with free riders
  • Free rider: rational or selfish users who consume more than their fair share of a public resource, or shoulder less than a fair share of the costs of its production.
12
Lookup Issue


How do you locate data/files/objects in a large
P2P system built around a dynamic set of nodes
in a scalable manner without any centralized
server or hierarchy?
Efficient routing even if the structure of the
network is unpredictable.
 Unstructured
P2P : Napster, Gnutella, Kazaa
 Structured P2P : Chord, CAN, Pastry/Tapestry,
Kademlia
13
Classification of overlay networks

• Structured overlay networks
  • Are based on Distributed Hash Tables (DHTs)
  • The overlay network assigns keys to data items and organizes its peers into a graph that maps each data key to a peer.
• Unstructured overlay networks
  • The overlay network organizes peers in a random graph, in a flat or hierarchical manner.
• Overlay multicast networks
  • The peers organize themselves into an overlay tree for multicasting.
14
Structured overlay networks

• Overlay topology construction is based on NodeIDs that are generated by using Distributed Hash Tables (DHTs).
• The overlay network assigns keys to data items and organizes its peers into a graph that maps each data key to a peer.
• This structured graph enables efficient discovery of data items using the given keys.
• It guarantees object detection in O(log n) hops.
15
Unstructured P2P overlay networks

• An unstructured system is composed of peers joining the network under some loose rules, without any prior knowledge of the topology.
• The network uses flooding or random walks as the mechanism to send queries across the overlay with a limited scope.
16
Unstructured P2P File Sharing Networks

• Centralized directory based P2P systems
• Pure P2P systems
• Hybrid P2P systems
17
Unstructured P2P File Sharing Networks

• Centralized directory based P2P systems
  • All peers are connected to a central entity
  • Peers establish connections between each other on demand to exchange user data (e.g. mp3 compressed data)
  • The central entity is necessary to provide the service
  • The central entity is some kind of index/group database
  • The central entity is a lookup/routing table
  • Examples: Napster, BitTorrent
18
Napster

• Was used primarily for file sharing
• NOT a pure P2P network => hybrid system
• Mode of operation:
  • Client sends the server a query; the server asks everyone and responds to the client
  • Client gets a list of clients from the server
  • All clients send IDs of the data they hold to the server; when a client asks for data, the server responds with specific addresses
  • The peer downloads directly from the other peer(s)
19
Centralized Network (Napster model)

[Figure: clients send a Query to the central Server and receive a Reply; the file transfer itself happens directly between clients]

• Nodes register their contents with the server
• Centralized server for searches
• File access is done on a peer-to-peer basis
– Poor scalability
– Single point of failure
20
Napster

Further services:
 Chat
program, instant messaging service, tracking
program,…

Centralized system
 Single
point of failure => limited fault tolerance
 Limited scalability (server farms with load balancing)

Query is fast and upper bound for duration can
be given
21
Gnutella
• Pure peer-to-peer
• Very simple protocol
• No routing "intelligence"
• Constrained broadcast (see the flooding sketch below)
  • Lifetime of packets limited by a TTL (typically set to 7)
  • Packets have unique IDs to detect loops
22
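A minimal sketch of the constrained broadcast just described, assuming a toy in-memory overlay; the class name Peer and the fields files, neighbors and seen_ids are illustrative, not part of the Gnutella protocol.

    import uuid

    class Peer:
        def __init__(self, name, files):
            self.name = name
            self.files = set(files)   # content this peer shares
            self.neighbors = []       # overlay neighbours (TCP links in real Gnutella)
            self.seen_ids = set()     # query ids already handled (loop detection)

        def query(self, keyword, ttl=7):
            # originate a query with a fresh unique id and flood it
            return self.handle_query(uuid.uuid4().hex, keyword, ttl)

        def handle_query(self, qid, keyword, ttl):
            if qid in self.seen_ids or ttl <= 0:
                return []                      # drop duplicates and expired packets
            self.seen_ids.add(qid)
            hits = [self.name] if any(keyword in f for f in self.files) else []
            for n in self.neighbors:           # constrained broadcast to neighbours
                hits += n.handle_query(qid, keyword, ttl - 1)
            return hits                        # hits travel back along the reverse path

    # usage sketch:
    # a, b, c = Peer("A", []), Peer("B", ["hey_jude.mp3"]), Peer("C", [])
    # a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]
    # print(a.query("hey_jude"))   # -> ['B']

Peers drop a query once its TTL reaches zero or its unique id has already been seen, which is exactly what keeps the flood bounded.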
Query flooding: Gnutella

fully distributed
 no


central server
public domain
protocol
many Gnutella clients
implementing protocol
overlay network: graph
 edge between peer X and
Y if there’s a TCP
connection
 all active peers and
edges is overlay net
 Edge is not a physical
link
 Given peer will typically
be connected with < 10
overlay neighbors
23
Gnutella: protocol
• Query messages are sent over existing TCP connections
• Peers forward Query messages
• QueryHit messages are sent back over the reverse Query path
• Scalability: limited-scope flooding
• File transfer: HTTP

[Figure: Query messages flooding across overlay edges, with QueryHit replies returning along the reverse path]
24
Gnutella: Scenario
Step 0: Join the network
Step 1: Determining who is on the network
• A "Ping" packet is used to announce your presence on the network.
• Other peers respond with a "Pong" packet.
• They also forward your Ping to other connected peers.
• A Pong packet also contains:
  • an IP address
  • port number
  • amount of data that peer is sharing
• Pong packets come back via the same route
Step 2: Searching
• A Gnutella "Query" asks other peers (usually 7) if they have the file you desire
• A Query packet might ask, "Do you have any content that matches the string 'Hey Jude'?"
• Peers check to see if they have matches and respond (if they have any matches), and forward the packet to their connected peers (usually 7) otherwise
• Continues for the TTL (how many hops a packet can go before it dies, typically 10)
Step 3: Downloading
• Peers respond with a "QueryHit" (contains contact info)
• File transfers use a direct connection using the HTTP protocol's GET method
25
Gnutella: Peer joining
1. Joining peer X must find some other peer in the Gnutella network: use a list of candidate peers
2. X sequentially attempts to make a TCP connection with peers on the list until a connection is set up with Y
3. X sends a Ping message to Y; Y forwards the Ping message.
4. All peers receiving the Ping message respond with a Pong message
5. X receives many Pong messages. It can then set up additional TCP connections
26
Gnutella - PING/PONG

[Figure: node 1 floods Ping messages to its neighbours; Pong replies (e.g. Pong 3,4,5 and Pong 6,7,8) travel back along the reverse path, so node 1 learns the set of known hosts. Query/Response works analogously.]
27
Unstructured Blind - Gnutella
Breadth-First Search (BFS)

[Figure: BFS flooding from a source node; the legend distinguishes the source, nodes that forward the query, nodes that process the query, nodes where a result is found, and nodes that forward the response]
28
Unstructured Blind - Gnutella


A node/peer connects to a set of Gnutella
neighbors
Forward queries to neighbors

Client which has the Information responds.

Flood network with TTL for termination
+ Results are complete
– Bandwidth wastage
29
Gnutella: Reachable Users (analytical estimate)

[Table: estimated number of reachable users as a function of T (TTL) and N (neighbors per query); an approximate formula is given below]
30
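The slide's table is not reproduced here; as a hedged back-of-the-envelope estimate, if each peer forwards a query to N neighbours and each forwarded copy reaches N − 1 new peers, the number of users reachable within TTL T is approximately

$\mathrm{reachable}(T, N) \approx \sum_{t=1}^{T} N\,(N-1)^{t-1}$

For example, N = 4 and T = 7 give 4(1 + 3 + 9 + ... + 3^6) = 4 · 1093 = 4372 users, ignoring overlap between neighbour sets.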
Gnutella : Search Issue

Flooding based search is extremely wasteful with
bandwidth




A large (linear) part of the network is covered irrespective of
hits found
Enormous number of redundant messages
All users do this in parallel: local load grows linearly with size
What can be done?

Controlling topology to allow for better search


Random walk, Degree-biased Random Walk
Controlling placement of objects

Replication
31
Gnutella: Random Walk

• Basic strategy
  • In a scale-free graph, high-degree nodes are easy to find by a (biased) random walk
    • A scale-free graph is a graph whose degree distribution follows a power law
  • And high-degree nodes can store an index covering a large portion of the network
• Random walk
  • Avoid visiting the last visited node
• Degree-biased random walk (see the sketch below)
  • Select the highest-degree neighbor that has not been visited
  • This first climbs to the highest-degree node, then climbs down the degree sequence
  • Provably optimal coverage
32
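A minimal sketch of a degree-biased random walk, under the assumption that the overlay is given as a dict mapping each node to its neighbour list; the function and variable names are illustrative, not from the slides.

    def degree_biased_walk(graph, start, steps):
        # graph: dict mapping node -> list of neighbours (assumed representation)
        visited = [start]
        current = start
        for _ in range(steps):
            neighbors = graph[current]
            unvisited = [n for n in neighbors if n not in visited]
            candidates = unvisited if unvisited else neighbors
            # bias: move to the neighbour with the largest degree
            current = max(candidates, key=lambda n: len(graph[n]))
            visited.append(current)
        return visited

    # usage sketch:
    # g = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
    # print(degree_biased_walk(g, "a", 3))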
Gnutella : Replication

Spread copies of objects to peers: more
popular objects can be found easier

Replication strategies




Owner replication
Path replication
Random replication
But there is still the difficulty with rare objects.
33
Random Walkers

• Improved unstructured blind search
• Similar structure to Gnutella
• Forward the query (called a walker) to a random subset of the node's neighbors
+ Reduced bandwidth requirements
– Incomplete results
34
Unstructured Informed Networks

Zero in on target based on information about the query
and the neighbors.

Intelligent routing
+ Reduces number of messages
+ Not complete, but more accurate
– COST: Must thus flood in order to get initial information
35
Informed Searches: Local Indices

Node keeps track of information available within
a radius of r hops around it.

Queries are made to neighbors just beyond the r
radius.
+ Flooding limited to bounded part of network
36
Routing Indices

For each query, calculate goodness of each
neighbor.

Calculating goodness:
 Categorize
or separate query into themes
 Rank best neighbors for a given theme based on
number of matching documents

Follows chain of neighbors that are expected to
yield the best results

Backtracking possible
37
Free riding


File sharing networks rely on users sharing data
Two types of free riding
 Downloading
but not sharing any data
 Not sharing any interesting data

On Gnutella
 15%
of users contribute 94% of content
 63% of users never responded to a query

Didn’t have “interesting” data
38
Gnutella: summary

• Hit rates are high
• High fault tolerance
• Adapts well and dynamically to changing peer populations
• High network traffic
• No estimates on the duration of queries
• No probability guarantee for successful queries
• Topology is unknown => the algorithm cannot exploit it
• Free riding is a problem
  • A significant portion of Gnutella peers are free riders
  • Free riders are distributed evenly across domains
  • Often hosts share files nobody is interested in
39
Gnutella discussion

• Search types:
  • Any possible string comparison
• Scalability
  • Search: very poor with respect to the number of messages
  • Updates: excellent, nothing to do
  • Routing information: low cost
• Autonomy:
  • Storage: no restriction, peers store the keys of their files
  • Routing: peers are the target of all kinds of requests
• Robustness
  • High, since many paths are explored
• Global knowledge
  • None required
40
Exploiting heterogeneity: KaZaA

• Each peer is either a group leader or assigned to a group leader.
  • TCP connection between a peer and its group leader.
  • TCP connections between some pairs of group leaders.
• The group leader tracks the content of all its children.

[Figure: overlay with ordinary peers, group-leader peers, and the neighboring relationships between them]
41
iMesh, Kazaa


Hybrid of centralized Napster and
decentralized Gnutella
Super-peers act as local search
hubs

Each super-peer is similar to a
Napster server for a small portion of
the network
 Super-peers are automatically
chosen by the system based on
their capacities (storage,
bandwidth, etc.) and availability
(connection time)



Users upload their list of files to a
super-peer
Super-peers periodically exchange
file lists
Queries are sent to a super-peer for
files of interest
42
Overlay Multicasting
• IP multicast has not been deployed over the Internet due to some fundamental problems in congestion control, flow control, security, group management, etc.
• For new emerging applications such as multimedia streaming, an Internet multicast service is required.
• Solution: Overlay Multicasting
  • Overlay multicasting (or application-layer multicasting) is increasingly being used to overcome the problem of non-ubiquitous deployment of IP multicast across heterogeneous networks.
43
Overlay Multicasting

Main idea

Internet peers organize themselves into an
overlay tree on top of the Internet.
 Packet replication and forwarding are
performed by peers in the application layer
by using IP unicast service.
44
Overlay Multicasting

• Overlay multicasting benefits
  • Easy deployment
    • It is self-organized
    • It is based on the IP unicast service
    • No protocol support is required from Internet routers.
  • Scalability
    • It scales with the number of multicast groups and the number of members in each group.
  • Efficient resource usage
    • Uplink resources of the Internet peers are used for multicast data distribution.
    • It is not necessary to use dedicated infrastructure and bandwidth for massive data distribution in the Internet.
45
Overlay Multicasting

Overlay multicast approaches
 DHT
based
 Tree based
 Mesh-tree based
46
Overlay Multicasting

• DHT based
  • The overlay tree is constructed on top of a DHT-based P2P routing infrastructure such as Pastry, CAN, Chord, etc.
  • Example: Scribe, in which the overlay tree is constructed on a Pastry network by using a multicast routing algorithm
47
Structured Overlay Networks / DHTs
Chord, Pastry, Tapestry, CAN, Kademlia, P-Grid, Viceroy

[Figure: nodes and values are hashed into a common identifier space (node identifiers and value identifiers), and the nodes are connected "smartly" based on these keys]
48
The Principle Of Distributed Hash Tables

• A dynamic distribution of a hash table onto a set of cooperating nodes

  Key | Value
  ----+-------------
   1  | Algorithms
   9  | Routing
  11  | DS
  12  | Peer-to-Peer
  21  | Networks
  22  | Grids

• Basic service: lookup operation
• Key resolution from any node (a toy example follows this slide)
  [Figure: four nodes A, B, C, D sharing the table; a lookup(9) is routed to node D]
• Each node has a routing table
  • Pointers to some other nodes
  • Typically, a constant or a logarithmic number of pointers
49
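A toy illustration (not from the slides) of the DHT principle above: the example key/value table is spread over four cooperating nodes, and any participant can resolve a lookup. The modulo-based assignment rule is an illustrative assumption, not the scheme used by real DHTs such as Chord or Pastry.

    # Toy illustration of the DHT principle: a hash table distributed over a set
    # of cooperating nodes. The assignment rule (key modulo number of nodes) is
    # an assumption for illustration only.

    nodes = ["node A", "node B", "node C", "node D"]
    table = {1: "Algorithms", 9: "Routing", 11: "DS",
             12: "Peer-to-Peer", 21: "Networks", 22: "Grids"}

    def responsible_node(key):
        # every participant can compute which node stores a given key
        return nodes[key % len(nodes)]

    def lookup(key):
        # basic service: resolve a key from any node to its (node, value) pair
        return responsible_node(key), table.get(key)

    # usage sketch:
    # print(lookup(9))   # -> ('node B', 'Routing')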
DHT Desirable Properties
Keys mapped evenly to all nodes in the network
Each node maintains information about only a
few other nodes
Messages can be routed to a node efficiently
Node arrival/departures only affect a few nodes
50
Chord [MIT]
• Problem addressed: efficient node localization
• Distributed lookup protocol
• Simplicity, provable performance, proven correctness
• Supports just one operation: given a key, Chord maps the key onto a node

51
The Chord algorithm – Construction of the Chord ring

• The consistent hash function assigns each node and each key an m-bit identifier using SHA-1 (Secure Hash Standard).
• m = any number big enough to make collisions improbable
• Key identifier = SHA-1(key)
• Node identifier = SHA-1(IP address)
  • Both are uniformly distributed
  • Both exist in the same ID space
52
Chord

• Consistent hashing (SHA-1) assigns each node and object an m-bit ID
• IDs are ordered in an ID circle ranging from 0 to 2^m − 1.
• New nodes assume slots in the ID circle according to their ID
• Key k is assigned to the first node whose ID ≥ k (wrapping around the circle)
  • successor(k) (sketched in code below)
53
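A minimal sketch of the identifier assignment and the successor(k) rule, using a small m for readability; truncating SHA-1 to m bits and scanning a sorted list of node ids are illustrative simplifications.

    # Sketch of Chord-style consistent hashing: SHA-1 identifiers truncated to
    # m bits, keys assigned to successor(k) = first node id >= k (with wrap-around).

    import hashlib

    M = 8  # small identifier space (2**M ids) just for illustration

    def chord_id(text):
        # m-bit identifier derived from SHA-1, as in Chord
        return int(hashlib.sha1(text.encode()).hexdigest(), 16) % (2 ** M)

    def successor(key_id, node_ids):
        # first node id >= key_id on the ring, wrapping to the smallest id
        ring = sorted(node_ids)
        for n in ring:
            if n >= key_id:
                return n
        return ring[0]

    # usage sketch:
    # nodes = [chord_id(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
    # k = chord_id("my_file.mp3")
    # print(k, "->", successor(k, nodes))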
Consistent Hashing - Successor Nodes

[Figure: an identifier circle with m = 3 (identifiers 0..7) and nodes 0, 1 and 3; key 1 is stored at successor(1) = 1, key 2 at successor(2) = 3, and key 6 at successor(6) = 0]
54
Consistent Hashing – Join and
Departure
When a node n joins the network, certain
keys previously assigned to n’s successor
now become assigned to n.
 When node n leaves the network, all of its
assigned keys are reassigned to n’s
successor.

55
Consistent Hashing – Node Join

[Figure: a node joins the example identifier circle; keys previously assigned to its successor are now assigned to the new node]
56
Consistent Hashing – Node Departure

[Figure: a node leaves the example identifier circle; all of its keys are reassigned to its successor]
57
Simple node localization

// ask node n to find the successor of id
n.find_successor(id)
  if (id ∈ (n, successor])
    return successor;
  else
    // forward the query around the circle
    return successor.find_successor(id);

=> Number of messages is linear in the number of nodes!
58
Scalable Key Location – Finger Tables

• To accelerate lookups, Chord maintains additional routing information.
• This additional information is not essential for correctness, which is achieved as long as each node knows its correct successor.
• Each node n maintains a routing table with up to m entries (m is the number of bits in the identifiers), called the finger table.
• The ith entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle.
  • s = successor(n + 2^(i-1))
  • s is called the ith finger of node n, denoted by n.finger(i) (see the sketch below)
59
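A small sketch (illustrative only) of how a finger table could be computed from a known list of node ids; a real Chord node learns these entries through the protocol rather than from global knowledge.

    # Sketch: build node n's finger table, where finger[i] = successor(n + 2**(i-1))
    # for i = 1..m, computed over a known list of node ids.

    M = 8  # identifier bits, matching the earlier sketch

    def successor(key_id, node_ids):
        ring = sorted(node_ids)
        for node in ring:
            if node >= key_id:
                return node
        return ring[0]

    def finger_table(n, node_ids):
        return [successor((n + 2 ** (i - 1)) % (2 ** M), node_ids)
                for i in range(1, M + 1)]

    # usage sketch:
    # print(finger_table(32, [8, 32, 90, 200]))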
Scalable Key Location – Finger Tables

[Figure: the 3-bit example ring with nodes 0, 1 and 3 and their finger tables; node 0 has finger starts 1, 2, 4 with successors 1, 3, 0 (and stores key 6), node 1 has starts 2, 3, 5 with successors 3, 3, 0 (key 1), and node 3 has starts 4, 5, 7 with successors 0, 0, 0 (key 2)]
60
Finger Tables

[Figure: the same example with finger intervals; node 0: starts 1, 2, 4, intervals [1,2), [2,4), [4,0), successors 1, 3, 0; node 1: starts 2, 3, 5, intervals [2,3), [3,5), [5,1), successors 3, 3, 0; node 3: starts 4, 5, 7, intervals [4,5), [5,7), [7,3), successors 0, 0, 0]
61
Chord key location


Lookup in finger
table the furthest
node that
precedes key
-> O(log n) hops
62
Scalable node localization

Finger table: finger[i] = successor(n + 2^(i-1))

[Figures: a sequence of slides stepping through the finger-table entries on the example ring]
63–72
Scalable node localization
Important characteristics of this scheme:
• Each node stores information about only a small number of nodes (m)
• Each node knows more about nodes closely following it than about nodes farther away
• A finger table generally does not contain enough information to directly determine the successor of an arbitrary key k
73
Scalable node localization

• Search the finger table for the node which most immediately precedes id
• Invoke find_successor from that node
• => Number of messages: O(log N)! (see the sketch below)
74–75
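A minimal sketch of the scalable lookup, reusing the finger-table layout from the previous snippet; this is a single-process simulation for illustration, whereas real Chord performs each step as a remote call to another peer.

    # Sketch of Chord's scalable key lookup using finger tables.

    def between(x, a, b, right_closed=False):
        # circular interval test on the identifier ring
        if a < b:
            return a < x <= b if right_closed else a < x < b
        return (x > a or x <= b) if right_closed else (x > a or x < b)

    def find_successor(n, key_id, fingers, succ):
        # fingers[n]: list where fingers[n][i-1] = successor(n + 2**(i-1))
        # succ[n]: n's immediate successor on the ring
        while not between(key_id, n, succ[n], right_closed=True):
            # jump to the closest preceding finger of key_id
            nxt = n
            for f in reversed(fingers[n]):
                if between(f, n, key_id):
                    nxt = f
                    break
            n = nxt
        return succ[n]

    # usage sketch, reusing finger_table() from the previous snippet:
    # nodes = [8, 32, 90, 200]
    # fingers = {n: finger_table(n, nodes) for n in nodes}
    # succ = {n: finger_table(n, nodes)[0] for n in nodes}
    # print(find_successor(8, 121, fingers, succ))   # -> 200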
Scalable Lookup Scheme

• Each node forwards the query at least halfway along the distance remaining to the target
• Theorem: With high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N)
76
Node Joins and Stabilizations
The most important thing is the successor
pointer.
 If the successor pointer is ensured to be
up to date, which is sufficient to guarantee
correctness of lookups, then finger table
can always be verified.
 Each node runs a “stabilization” protocol
periodically in the background to update
successor pointer and finger table.

77
Node Joins and Stabilizations

“Stabilization” protocol contains 6 functions:
 create()
 join()
 stabilize()
 notify()
 fix_fingers()
 check_predecessor()

When node n first starts, it calls n.join(n’), where
n’ is any known Chord node.
The join() function asks n’ to find the immediate
successor of n.

78
Node joins and stabilization
To ensure correct lookups, all successor
pointers must be up to date
 => stabilization protocol running
periodically in the background
 Updates finger tables and successor
pointers

79
Node joins and stabilization
Stabilization protocol (sketched in code below):
• stabilize(): n asks its successor for its predecessor p and decides whether p should be n's successor instead (this is the case if p recently joined the system).
• notify(): notifies n's successor of n's existence, so it can change its predecessor to n
• fix_fingers(): updates the finger tables
80
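A minimal sketch of join()/stabilize()/notify() on in-memory Node objects, as an illustration of the protocol described above; fix_fingers(), check_predecessor(), failure handling and remote calls are omitted.

    def between(x, a, b, right_closed=False):
        # circular interval test on the identifier ring
        if a < b:
            return a < x <= b if right_closed else a < x < b
        return (x > a or x <= b) if right_closed else (x > a or x < b)

    class Node:
        def __init__(self, node_id):
            self.id = node_id
            self.successor = self          # a lone node is its own successor
            self.predecessor = None

        def find_successor(self, key_id):
            # simple O(N) walk along successor pointers (enough for join)
            node = self
            while not between(key_id, node.id, node.successor.id, right_closed=True):
                node = node.successor
            return node.successor

        def join(self, known):
            self.predecessor = None
            self.successor = known.find_successor(self.id)

        def stabilize(self):
            # adopt our successor's predecessor if it lies between us
            p = self.successor.predecessor
            if p is not None and between(p.id, self.id, self.successor.id):
                self.successor = p
            self.successor.notify(self)

        def notify(self, n):
            # n claims to be our predecessor
            if self.predecessor is None or between(n.id, self.predecessor.id, self.id):
                self.predecessor = n

    # usage sketch: create nodes 21, 26, 32; call n26.join(n32); running
    # stabilize() on n26 and n21 a few times makes the ring pointers converge,
    # as shown on the following slides.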
Node Joins – Join and Stabilization

• n joins (between np and ns, where initially succ(np) = ns and pred(ns) = np)
  • predecessor = nil
  • n acquires ns as its successor via some n'
• n runs stabilize
  • n notifies ns that it is the new predecessor
  • ns acquires n as its predecessor: pred(ns) = n
• np runs stabilize
  • np asks ns for its predecessor (now n)
  • np acquires n as its successor: succ(np) = n
  • np notifies n
  • n acquires np as its predecessor
• All predecessor and successor pointers are now correct
• Fingers still need to be fixed, but old fingers will still work
81
Node joins and stabilization
82
Node joins and stabilization
• N26 joins the system
• N26 acquires N32 as its successor
• N26 notifies N32
• N32 acquires N26 as its predecessor
83
Node joins and stabilization
• N26 copies keys
• N21 runs stabilize() and asks its successor N32 for its predecessor, which is N26.
84
Node joins and stabilization
• N21 acquires N26 as its successor
• N21 notifies N26 of its existence
• N26 acquires N21 as its predecessor
85
Node Joins – with Finger Tables

[Figure: the 3-bit example after node 6 joins; node 6 builds its finger table (starts 7, 0, 2 with successors 0, 0, 3) and takes over key 6 from node 0, and finger-table entries of the other nodes that previously pointed at 0 are updated to point at 6 where appropriate]
86
Node Departures – with Finger Tables

[Figure: the same example after node 1 departs; its key moves to its successor, and finger-table entries that pointed at node 1 are replaced by node 3]
87
Node Failures

• The key step in failure recovery is maintaining correct successor pointers
• To help achieve this, each node maintains a successor list of its r nearest successors on the ring
• If node n notices that its successor has failed, it replaces it with the first live entry in the list
• Successor lists are stabilized as follows (see the sketch below):
  • Node n reconciles its list with its successor s by copying s's successor list, removing its last entry, and prepending s to it.
  • If node n notices that its successor has failed, it replaces it with the first live entry in its successor list and reconciles its successor list with its new successor.

88
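A small sketch of the successor-list maintenance just described, assuming each node is represented as a dict with an id, an alive flag and a successor_list of length r; the names are illustrative.

    # Sketch of successor-list maintenance (r nearest successors per node).
    # Each node is a dict: {"id": ..., "alive": True, "successor_list": [...]}

    R = 3  # length of the successor list

    def reconcile(n, s):
        # copy s's list, drop its last entry, and prepend s itself
        n["successor_list"] = ([s["id"]] + s["successor_list"][:-1])[:R]

    def repair_successor(n, nodes_by_id):
        # replace a failed successor with the first live entry in the list,
        # then reconcile with that new successor
        for sid in n["successor_list"]:
            s = nodes_by_id[sid]
            if s["alive"]:
                reconcile(n, s)
                return s
        return None  # all r successors failed (unlikely for a reasonable r)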
Handling failures: redundancy
Each node knows IP addresses of next r
nodes.
 Each key is replicated at next r nodes

89
Impact of node joins on lookups


All finger table entries
are correct =>
O(log N) lookups
Successor pointers
correct, but fingers
inaccurate =>
correct but slower
lookups
90
Impact of node joins on lookups
• Stabilization completed => no influence on performance
• Only in the negligible case that a large number of nodes joins between the target's predecessor and the target is the lookup slightly slower
• No influence on performance as long as fingers are adjusted faster than the network doubles in size

91
Failure of nodes

• Correctness relies on correct successor pointers
• What happens if N14, N21 and N32 fail simultaneously?
• How can N8 acquire N38 as its successor?
92–93
Failure of nodes
• Each node maintains a successor list of size r
• If the network is initially stable and every node fails with probability ½, then find_successor still finds the closest living successor to the query key, and the expected time to execute find_successor is O(log N)

94
Failure of nodes

[Figure: failed lookups (percent) versus failed nodes (percent); massive failures have little impact – even with 50% of the nodes failed, the failed-lookup rate is only about (1/2)^6 ≈ 1.6%]
95
Chord – simulation result
[Stoica et al. Sigcomm2001]
96
Chord discussion

• Search types
  • Only equality; exact keys need to be known
• Scalability
  • Search: O(log n)
  • Update requires a search, thus O(log n)
  • Construction: O(log² n) if a new node joins
• Robustness
  • Replication might be used by storing replicas at successor nodes
• Autonomy
  • Storage and routing: none
• Global knowledge
  • Mapping of IP addresses and data keys to a common key space
97
YAPPERS: a P2P lookup service over arbitrary topology

• Gnutella-style systems
  • Work on arbitrary topology, flood for queries
  • Robust but inefficient
  • Support for partial queries, good for popular resources
• DHT-based systems
  • Efficient lookup but expensive maintenance
  • By nature, no support for partial queries
• Solution: hybrid system
  • Operate on arbitrary topology
  • Provide DHT-like search efficiency
98
Design Goals

Impose no constraints on topology
 No

underlying structure for the overlay network
Optimize for partial lookups for popular keys
 Observation:
Many users are satisfied with partial
lookup

Contact only nodes that can contribute to the
search results
 no

blind flooding
Minimize the effect of topology changes
 Maintenance
overhead is independent of system size
99
Basic Idea:
• The keyspace is partitioned into a small number of buckets. Each bucket corresponds to a color.
• Each node is assigned a color.
  • # of buckets = # of colors
• Each node sends its <key, value> pairs to the node with the same color as the key within its Immediate Neighborhood (see the sketch below).
  • IN(N): all nodes within h hops of node N.
100
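A small sketch of the coloring idea, under the assumption that colors are obtained by hashing a node's IP (or a key) into one of C buckets and that IN(N) is computed by a depth-h BFS over an adjacency-list overlay; all names are illustrative.

    # Illustrative sketch of YAPPERS-style coloring: hash nodes (by IP) and keys
    # into C color buckets, and register a <key, value> pair at a node of the
    # key's color inside the publisher's immediate neighborhood IN(N).

    import hashlib
    from collections import deque

    C = 4  # number of buckets/colors (illustrative)

    def color(name):
        return int(hashlib.sha1(name.encode()).hexdigest(), 16) % C

    def immediate_neighborhood(graph, n, h):
        # IN(n): all nodes within h hops of n (BFS)
        seen, frontier = {n}, deque([(n, 0)])
        while frontier:
            node, dist = frontier.popleft()
            if dist == h:
                continue
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    frontier.append((nb, dist + 1))
        return seen

    def register_target(graph, publisher, key, h=2):
        # pick a node in IN(publisher) whose color matches the key's color
        candidates = [v for v in immediate_neighborhood(graph, publisher, h)
                      if color(v) == color(key)]
        return candidates[0] if candidates else None  # None -> use a backup color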
Partition Nodes
Given any overlay, first partition nodes into buckets (colors) based on a hash of their IP address
101
Partition Nodes (2)
Around each node, there is (ideally) at least one node of each color
[Figure: two example nodes X and Y and the colors present around them]
May require backup color assignments
102
Register Content
Partition the content space into buckets (colors) and register a pointer at "nearby" nodes.
[Figure: the nodes around a node Z form a small hash table; Z registers red content locally and yellow content at a nearby yellow node]
103
Searching Content
Start at a "nearby" node of the right color, then search other nodes of the same color.
[Figure: a query hops between same-colored nodes X, W, Y, U, V, Z]
104
Searching Content (2)
There is effectively a smaller overlay for each color; use a Gnutella-style flood on it.
Fan-out = degree of the nodes in the smaller overlay
105
More…

• When node X is inserting <key, value>:
  • What if multiple nodes in IN(X) have the same color?
  • What if no node in IN(X) has the same color as key k?
• Solution:
  • P1: randomly select one
  • P2: backup scheme: use the node with the next color
    • Primary color (unique) & secondary colors (zero or more)
• Problems coming with this solution:
  • No longer consistent and stable
  • But the effect is isolated within the immediate neighborhood
106
Extended Neighborhood


IN(A): Immediate Neighborhood
F(A): Frontier of Node A


All nodes that are directly connected to IN(A), but not in
IN(A)
EN(A): Extended Neighborhood
 The
union of IN(v) where v is in F(A)
 Actually EN(A) includes all nodes within 2h + 1 hops

Each node needs to maintain these three set of
nodes for query.
107
The network state information for node A (h = 2)
108
Searching with Extended
Neighborhood

Node A wants to look up a key k of color C(k), it
picks a node B with C(k) in IN(A)
 If
multiple nodes, randomly pick one
 If none, pick the backup node



B, using its EN(B), sends the request to all
nodes which are in color C(k).
The other nodes do the same thing as B.
Duplicate Message problem:
 Each
node caches the unique query identifier.
109
More on Extended
Neighborhood
All <key, value> pairs are stored among
IN(X). (h hops from node X)
 Why each node needs to keep an EN(X)?
 Advantage:

 The
forwarding node is chosen based on
local knowledge
 Completeness: a query (C(k)) message can
reach all nodes in C(k) without touching any
nodes in other colors (Not including backup
node)
110
Maintaining Topology

Edge Deletion: X-Y
 Deletion
message needs to be propagated to all
nodes that have X and Y in their EN set
 Necessary Adjustment:



Change IN, F, EN sets
Move <key, value> pairs if X/Y is in IN(A)
Edge Insertion:
 Insertion
message needs to include the neighbor info
 So other nodes can update their IN and EN sets
111
Maintaining Topology

Node Departure:
a
node X with w edges is leaving
 Just like w edge deletion
 Neighbors of X initiates the propagation

Node Arrival: X joins the network
 Ask
its new neighbors for their current
topology view
 Build its own extended neighborhood
 Insert w edges.
112
Problems with the basic design

• Fringe nodes:
  • A low-connectivity node allocates a large number of secondary colors to its high-connectivity neighbors.
• Large fan-out:
  • The forwarding fan-out degree at A is proportional to the size of F(A)
  • This is desirable for partial lookup, but not good for full lookup
113
A is overloaded by secondary
colors from B, C, D, E
114
Solutions:

Prune Fringe Nodes:
 If

the degree of a node is too small, find a proxy node.
Biased Backup Node Assignment:
X
assigns a secondary color to y only when
a * |IN(x)| > |IN(y)|

Reducing Forward Fan-out:
 Basic


idea:
try backup node,
try common nodes
115
Experiment:

• h = 2 (1 is too small, > 2 makes EN too large)
• Topology: Gnutella snapshot
• Exp 1: search efficiency
116
[Figure: distribution of colors per node]
117
[Figure: fan-out]
118
[Figure: number of colors – effect on search]
119
[Figure: number of colors – effect on fan-out]
120
Discussion



Each search only disturbs a small fraction of the
nodes in the overlay.
No restructure the overlay
Each node has only local knowledge
 scalable
 Hybrid
(unstructured and local DHT) system
121
PASTRY
122
Pastry

• Identifier space:
  • Nodes and data items are uniquely associated with m-bit ids – integers in the range 0 to 2^m − 1 – m is typically 128
  • Pastry views ids as strings of digits to the base 2^b, where b is typically chosen to be 4
  • A key is located on the node to whose node id it is numerically closest
123
Routing Goal

• Pastry routes messages to the node whose nodeId is numerically closest to the given key in less than log_{2^b}(N) steps:
  • "A heuristic ensures that among the set of nodes with the k closest nodeIds to the key, the message is likely to first reach a node near the node from which the message originates, in terms of the proximity metric"
124
Routing Information

• Pastry's node state is divided into 3 main elements
  • The routing table – similar to Chord's finger table – stores links into the id-space
  • The leaf set contains nodes which are close in the id-space
  • Nodes that are close together in terms of network locality are listed in the neighbourhood set
125
Routing Table

• A Pastry node's routing table is made up of m/b (about log_{2^b} N) rows with 2^b − 1 entries per row
• On node n, the entries in row i hold the identities of Pastry nodes whose node-ids share an i-digit prefix with n but differ in the next digit (an indexing sketch follows below)
• For example, the first row is populated with nodes that have no prefix in common with n
• When there is no node with an appropriate prefix, the corresponding entry is left empty
• The single-digit entry in each row shows the corresponding digit of the present node's id – i.e. the prefix matches the current id up to that position – the next row down or the leaf set should be examined to find a route.
126
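A small sketch of how routing-table rows and columns could be indexed by shared prefix, assuming ids are strings of base-2^b digits (here b = 2, so digits 0–3); the helpers below are illustrative, not Pastry's actual data structures.

    # Illustrative sketch of Pastry-style routing-table indexing with b = 2.
    # An entry for node x goes in row = length of the shared prefix with n,
    # column = x's next digit after that prefix.

    B = 2                      # bits per digit
    DIGITS = 8                 # id length in digits (m = B * DIGITS bits)

    def shared_prefix_len(a, b):
        # number of leading digits the two id strings have in common
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def build_routing_table(n_id, other_ids):
        # rows indexed by shared-prefix length, columns by the next digit
        table = [[None] * (2 ** B) for _ in range(DIGITS)]
        for x in other_ids:
            if x == n_id:
                continue
            row = shared_prefix_len(n_id, x)
            col = int(x[row], 2 ** B)      # next digit of x, in base 2**B
            if table[row][col] is None:    # keep the first candidate (real Pastry
                table[row][col] = x        # prefers the closest by proximity)
        return table

    # usage sketch (ids as base-4 digit strings):
    # rt = build_routing_table("10233102", ["10231331", "31203203", "10230210"])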
Routing Table

Routing tables (RT) thus built achieve an effect similar to
Chord finger table

The detail of the routing information increases with the proximity of
other nodes in the id-space

Without a large no. of nearby nodes, the last rows of the RT are
only sparsely populated – intuitively, the id-space would need to be
fully exhausted with node-ids for complete RTs on all nodes

In populating the RT, there is a choice from the set of nodes with
the appropriate id-prefix

During the routing process, network locality can be exploited by
selecting nodes which are close in terms of proximity ntk. metric
127
Leaf Set

• The routing table sorts node ids by prefix. To increase lookup efficiency, the leaf set L holds the |L| nodes numerically closest to n (|L|/2 smaller and |L|/2 larger; normally |L| = 2^b or 2 × 2^b)
• The RT and the leaf set are the two sources of information relevant for routing
• The leaf set also plays a role similar to Chord's successor list in recovering from failures of adjacent nodes
128
Neighbourhood Set

Instead of numeric closeness, the neighbourhood set M is
concerned with nodes that are close to the current node
with regard to the network proximity metric

Thus, it is not involved in routing itself but in maintaining network
locality in the routing information
129
Pastry Node State (base 4, node 10233102)

• L (leaf set): nodes that are numerically closest to the present node (2^b or 2 × 2^b entries)
• R (routing table): entries share a common prefix with 10233102, then differ in the next digit, with the rest of the nodeId arbitrary (log_{2^b}(N) rows, 2^b − 1 columns)
• M (neighbourhood set): nodes that are closest according to the proximity metric (2^b or 2 × 2^b entries)
130
Routing (notation)

• Key D arrives at nodeId A
• R_l^i: the entry in the routing table at column i and row l
• L_i: the i-th closest nodeId in the leaf set
• D_l: the value of the l-th digit in the key D
• shl(A,B): the length of the prefix shared by A and B, in digits
131
Routing

Routing is divided into two main steps:
 First,
a node checks whether the key K is within the
range of its leaf set

If it is the case, it implies that K is located in one of the
nearby nodes of the leaf set. Thus, the node forwards the
query to the leaf set node numerically closest to K. In case
this is the node itself, the routing process is finished.
132
Routing
• If K does not fall within the range of the leaf set, the query needs to be forwarded over a large distance using the routing table
• In this case, a node n tries to pass the query on to a node which shares a longer common prefix with K than n itself
  • If there is no such entry in the RT, the query is forwarded to a node which shares a prefix with K of the same length as n but which is numerically closer to K than n
133
Routing
• This scheme ensures that routing loops do not occur, because the query is routed strictly to a node with a longer common identifier prefix than the current node, or to a numerically closer node with the same prefix (see the sketch below)
134
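A condensed, illustrative sketch of that two-step decision (leaf set first, then the routing table, then the rare fallback), assuming ids are equal-length digit strings; the leaf-set coverage test and tie-breaking are simplified assumptions rather than the exact Pastry rules.

    # Illustrative Pastry-style next-hop decision: leaf set first, then a
    # routing-table entry with a longer shared prefix, else any known node
    # that is numerically closer. Ids are digit strings in base 2**B (B = 2).

    B = 2

    def shared_prefix_len(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def numeric(x):
        return int(x, 2 ** B)

    def route_next_hop(n_id, key, leaf_set, routing_table, all_known):
        if key == n_id:
            return n_id
        # 1) key covered by the leaf set -> forward to the numerically closest node
        if leaf_set:
            lo, hi = min(map(numeric, leaf_set)), max(map(numeric, leaf_set))
            if lo <= numeric(key) <= hi:
                return min(leaf_set + [n_id], key=lambda x: abs(numeric(x) - numeric(key)))
        # 2) routing-table entry sharing a longer prefix with the key
        l = shared_prefix_len(n_id, key)
        entry = routing_table[l][int(key[l], 2 ** B)]
        if entry is not None:
            return entry
        # 3) rare case: same prefix length but numerically closer to the key
        for x in all_known:
            if (shared_prefix_len(x, key) >= l and
                    abs(numeric(x) - numeric(key)) < abs(numeric(n_id) - numeric(key))):
                return x
        return n_id  # the current node is the destination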
Routing performance

Routing procedure converges, each step takes the
message to node that either:
 Shares
a longer prefix with the key than the local node
 Shares
as long a prefix with, but is numerically closer to
the key than the local node.
135
Routing performance

• Assumption: routing tables are accurate and there are no recent node failures
• There are 3 cases in the Pastry routing scheme:
  • Case 1: Forward the query (according to the RT) to a node with a longer prefix match than the current node.
    • Thus, the number of nodes with longer prefix matches is reduced by at least a factor of 2^b in each step, so the destination is reached in log_{2^b} N steps.
136
Routing performance

• Case 2: The query is routed via the leaf set (one step). This increases the number of hops by one.
137
Routing performance

• Case 3: The key is neither covered by the leaf set, nor does the RT contain an entry with a longer matching prefix than the current node
  • Consequently, the query is forwarded to a node with the same prefix length, adding an additional routing hop.
  • For a moderate leaf set size (|L| = 2 × 2^b), the probability of this case is less than 0.6%. So, it is very unlikely that more than one additional hop is incurred.
138
Routing performance

• As a result, the complexity of routing remains at O(log_{2^b} N) on average
  • Higher values of b lead to faster routing but also increase the amount of state that needs to be managed at each node
  • Thus, b is typically 4, but a Pastry implementation can choose an appropriate trade-off for a specific application
139
Join and Failure

Join



Use routing to find numerically closest node already in network
Ask state from all nodes on the route and initialize own state
Error correction

Failed leaf node: contact a leaf node on the side of the failed
node and add appropriate new neighbor

Failed table entry: contact a live entry with same prefix as failed
entry until new live entry found, if none found, keep trying with
longer prefix table entries
140
Self Organization: Node Arrival

• The new node n is assumed to know a nearby Pastry node k, based on the network proximity metric
• Now n needs to initialize its RT, leaf set and neighbourhood set.
  • Since k is assumed to be close to n, the nodes in k's neighbourhood set are reasonably good choices for n, too.
  • Thus, n copies the neighbourhood set from k.
141
Self Organization: Node Arrival

To build its RT and leaf set, n routes a special
join message via k to a key equal to n
 According
to the standard routing rules, the query is
forwarded to the node c with the numerically closest
id and hence the leaf set of c is suitable for n, so it
retrieves c’s leaf set for itself.
 The
join request triggers all nodes, which forwarded
the query towards c, to provide n with their routing
information.
142
Self Organization: Node Arrival

Node n’s RT is constructed from the routing
information of these nodes starting at row 0.
 As
this row is independent of the local node id, n can
use these entries at row zero of k’s routing table



In particular, it is assumed that n and k are close in terms of
network proximity metric
Since k stores nearby nodes in its RT, these entries are also
close to n.
In the general case of n and k not sharing a common prefix,
n cannot reuse entries from any other row in K’s RT.
143
Self Organization: Node Arrival
 The
route of the join message from n to c leads via
nodes v1, v2, … vn with increasingly longer common
prefixes of n and vi
 Thus,
row 1 from the RT of v1 is also a good choice
for the same row of the RT of n
 The
same is true for row 2 on node v2 and so on
 Based
on this information, the RT of n can be
constructed.
144
Self Organization: Node Arrival

Finally, the new node sends its node state to all
nodes in its routing data so that these nodes can
update their own routing information accordingly
 In
contrast to lazy updates in Chord, this mechanism
actively updates the state in all affected nodes when
a new node joins the system
 At this stage, the new node is fully present and
reachable in the Pastry network
145
Node Failure

• Node failure is detected when a communication attempt with another node fails. Routing requires contacting nodes from the RT and leaf set, resulting in lazy detection of failures
• During routing, the failure of a single node in the RT does not significantly delay the routing process. The local node can choose to forward the query to a different node from the same row in the RT. (Alternatively, a node could store backup nodes with each entry in the RT.)
146
Node Failure

Repairing a failed entry in the leaf set of a node is
straightforward – utilizing the leaf set of other nodes
referenced in the local leaf set.

Contacts the leaf set of the largest index on the side of
the failed node
If this node is unavailable, the local node can revert to
leaf set with smaller indices

147
Node Departure

Neighborhood node: asks other members
for their M, checks the distance of each of the
newly discovered nodes, and updates its own
neighborhood set accordingly.
148
Locality

“Route chosen for a message is likely to be
good with respect to the proximity metric”

Discussion:
 Locality
in the routing table
 Route locality
 Locating the nearest among k nodes
149
Locality in the routing table
• Node A is near X
  • A's R0 entries are close to A, A is close to X, and the triangle inequality holds => the entries taken from A are relatively near X as well.
  • Likewise, obtaining X's neighborhood set from A is appropriate.
• B's R1 entries are a reasonable choice for the R1 of X
  • Entries in each successive row are chosen from an exponentially decreasing set size.
  • The expected distance from B to any of its R1 entries is much larger than the expected distance traveled from node A to B.
• Second stage: X requests the state from each of the nodes in its routing table and neighborhood set to update its entries to closer nodes.
150
Routing locality


Each routing step moves the message closer to the
destination in the nodeId space, while traveling the least
possible distance in the proximity space.
Given that:



A message routed from A to B at distance d cannot
subsequently be routed to a node with a distance of less than d
from A
The expected distance traveled by a message during each
successive routing step is exponentially increasing
 Since a message tends to make larger and larger
strides with no possibility of returning to a node within di
of any node i encountered on the route, the message
has nowhere to go but towards its destination
151
Node Failure

• To replace the failed node at entry i in row j of its RT (R_j^i), a node contacts another node referenced in row j
• Entries in the same row j of the remote node are valid for the local node, and hence it can copy entry R_j^i from the remote node to its own RT
• In case that entry has failed as well, it can probe another node in row j for entry R_j^i
• If no live node with the appropriate nodeId prefix can be obtained in this way, the local node queries nodes from the preceding row R_{j-1}
152
Locating the nearest among k nodes

Goal:
 among
the k numerically closest nodes to a key, a
message tends to first reach a node near the client.

Problem:
 Since
Pastry routes primarily based on nodeId
prefixes, it may miss nearby nodes with a different
prefix than the key.

Solution (using a heuristic):
 Based
on estimating the density of nodeIds, it
detects when a message approaches the set of k
and then switches to numerically nearest address
based routing to locate the nearest replica.
153
Arbitrary node failures and network partitions

• A node may continue to be responsive, but behave incorrectly or even maliciously.
• Repeated queries then fail each time, since they normally take the same route.
• Solution: routing can be randomized
  • The choice among multiple nodes that satisfy the routing criteria should be made randomly
154
Content-Addressable Network
(CAN)
Proc. ACM SIGCOMM (San
Diego, CA, August 2001)
Motivation

Primary scalability issue in peer-to-peer
systems is the indexing scheme used to
locate the peer containing the desired
content
 Content-Addressable
Network (CAN) is a
scalable indexing mechanism
 Also a central issue in large scale storage
management systems
156
Basic Design

• Basic idea:
  • A virtual d-dimensional coordinate space
  • Each node owns a zone in the virtual space
  • Data is stored as (key, value) pairs
  • Hash(key) --> a point P in the virtual space
  • The (key, value) pair is stored on the node within whose zone the point P lies (see the sketch below)
157
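A minimal 2-d sketch of this idea, assuming zones are axis-aligned rectangles stored as (x0, x1, y0, y1) tuples and that hx/hy are two independent hashes of the key; the names and the global zone map are illustrative, since a real CAN node only knows its own zone and its neighbors'.

    # Illustrative 2-d CAN sketch: hash a key to a point (a, b) in the unit square
    # and store the (key, value) pair at the node whose zone contains that point.

    import hashlib

    def _h(key, salt):
        # map key -> [0, 1) using a salted SHA-1 (hx and hy below are assumptions)
        digest = hashlib.sha1((salt + key).encode()).hexdigest()
        return int(digest, 16) / 16 ** len(digest)

    def hx(key): return _h(key, "x")
    def hy(key): return _h(key, "y")

    def owner(point, zones):
        # zones: {node_name: (x0, x1, y0, y1)} covering the unit square
        x, y = point
        for node, (x0, x1, y0, y1) in zones.items():
            if x0 <= x < x1 and y0 <= y < y1:
                return node
        raise ValueError("zones do not cover the point")

    def insert(key, value, zones, storage):
        node = owner((hx(key), hy(key)), zones)
        storage.setdefault(node, {})[key] = value   # in real CAN: route to that node
        return node

    # usage sketch: four equal zones owned by nodes 1-4
    # zones = {1: (0, .5, 0, .5), 2: (.5, 1, 0, .5), 3: (0, .5, .5, 1), 4: (.5, 1, .5, 1)}
    # storage = {}
    # print(insert("K", "V", zones, storage))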
An Example of CAN

[Figures: the 2-d coordinate space is progressively split as nodes 1, 2, 3 and 4 join, each owning one zone]
158–162

An Example of CAN (cont): insert

node I::insert(K,V)
(1) a = hx(K)
    b = hy(K)
(2) route (K,V) towards the point (a,b)
(3) the node owning (a,b) stores (K,V)
163–168

An Example of CAN (cont): retrieve

node J::retrieve(K)
(1) a = hx(K)
    b = hy(K)
(2) route "retrieve(K)" to (a,b)
169
Important note:
Data stored in CAN is addressable by name (i.e. key), not by location (i.e. IP address).
170
Routing in CAN

[Figures: a message is routed greedily through the grid of zones from the node at (x,y) towards the zone containing the target point (a,b), each hop moving to the neighbor closest to the target]
171–172
Routing in CAN (cont)
Important note:
A node only maintains state for its immediate neighboring nodes.
173
Node Insertion In CAN

1) The new node discovers some node "I" already in CAN
175
2) The new node picks a random point (p,q) in the space
176
3) I routes to (p,q) and discovers node J, the current owner of that point
177
4) J's zone is split in half; the new node owns one half
178
Node Insertion In CAN (cont)
Important note:
Inserting a new node affects only a single other node and its immediate neighbors
179
Review about CAN (part2)




Requests (insert, lookup, or delete) for a key are
routed by intermediate nodes using a greedy
routing algorithm
Requires no centralized control (completely
distributed)
Small per-node state is independent of the
number of nodes in the system (scalable)
Nodes can route around failures (fault-tolerant)
180
CAN: node failures

Need to repair the space
 recover database (weak point)
 soft-state updates
 use replication, rebuild database from replicas
 repair routing
 takeover algorithm
181
CAN: takeover algorithm

Simple failures
know your neighbor’s neighbors
 when a node fails, one of its neighbors takes over its
zone


More complex failure modes

simultaneous failure of multiple adjacent nodes
 scoped flooding to discover neighbors
 hopefully, a rare event
182
CAN: node failures
Important note:
Only the failed node’s immediate neighbors
are required for recovery
183
CAN Improvements
184
Adding Dimensions
185
Multiple independent coordinate
spaces (realities)



Nodes can maintain multiple independent coordinate spaces
(realities)
For a CAN with r realities:
a single node is assigned r zones
and holds r independent
neighbor sets
 Contents of the hash table
are replicated for each reality
Example: for three realities, a
(K,V) mapping to P:(x,y,z) may
be stored at three different nodes
 (K,V) is only unavailable when
all three copies are unavailable
 Route using the neighbor on the reality closest to (x,y,z)
186
Dimensions vs. Realities




Increasing the number of dimensions
and/or realities decreases path
length and increases per-node state
More dimensions has greater effect
on path length
More realities provides
stronger fault-tolerance and
increased data availability
Authors do not quantify the different
storage requirements
 More realities requires replicating
(K,V) pairs
187
RTT Ratio & Zone Overloading


Incorporate RTT in routing metric
 Each node measures RTT to each neighbor
 Forward messages to neighbor with maximum ratio of progress
to RTT
Overload coordinate zones
 - Allow multiple nodes to share the same zone, bounded by a
threshold MAXPEERS
 Nodes maintain peer state, but not additional neighbor state
 Periodically poll neighbor for its list of peers, measure RTT to
each peer, retain lowest RTT node as neighbor
 (K,V) pairs may be divided among peer nodes or replicated
188
Multiple Hash Functions




Improve data availability by using k hash functions to
map a single key to k points in the coordinate space
Replicate (K,V) and store
at k distinct nodes
(K,V) is only unavailable
when all k replicas are
simultaneously
unavailable
Authors suggest querying
all k nodes in parallel to
reduce average lookup latency
189
Topology sensitive






Use landmarks for topologically-sensitive construction
Assume the existence of well-known machines like DNS servers
Each node measures its RTT
to each landmark
 Order each landmark in order of
increasing RTT
 For m landmarks:
m! possible orderings
Partition coordinate space
into m! equal size partitions
Nodes join CAN at random
point in the partition corresponding
to its landmark ordering
Latency Stretch is the ratio of CAN
latency to IP network latency
190
Other optimizations



Run a background load-balancing technique to offload
from densely populated bins to sparsely populated bins
(partitions of the space)
Volume balancing for more uniform partitioning
 When a JOIN is received, examine zone volume and
neighbor zone volumes
 Split zone with largest volume
 Results in 90% of nodes of equal volume
Caching and replication for “hot spot” management
191
Strengths
More resilient than flooding broadcast
networks
 Efficient at locating information
 Fault tolerant routing
 Node & Data High Availability (w/
improvement)
 Manageable routing table size & network
traffic

192
Weaknesses
Impossible to perform a fuzzy search
 Susceptible to malicious activity
 Maintain coherence of all the indexed data
(Network overhead, Efficient distribution)
 Still relatively higher routing latency
 Poor performance w/o improvement

193
Summary

• CAN
  • An Internet-scale hash table
  • A potential building block in Internet applications
• Scalability
  • O(d) per-node state
• Low-latency routing
  • Simple heuristics help a lot
• Robust
  • Decentralized, can route around trouble
194
Some Main Research Areas in P2P
Efficiency of search, queries and topologies
( Chord, CAN, YAPPER…)
 Data delivery (ZIGZAG..)
 Resource Management
 Security

195
Resource Management
Problem:
 Autonomous nature of peers: essentially selfish
peers must be given an incentive to contribute
resources.
 The scale of the system: makes it hard to get a
complete picture of what resources are available
An approach:
Use concepts from economics to construct a
resource marketplace, where peers can buy and sell
or trade resources as necessary
196
Security Problem
Problem:
- Malicious attacks: nodes in a P2P system
operate in an autonomous fashion, and any
node that speaks the system protocol may
participate in the system
An approach:
Mitigating attacks by nodes that abuse the P2P
network by exploiting the implicit trust peers
place on them.
197
Reference

• Kien A. Hua, Duc A. Tran, and Tai Do, "ZIGZAG: An Efficient Peer-to-Peer Scheme for Media Streaming", INFOCOM 2003.
• S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, "A Scalable Content-Addressable Network", in Proc. ACM SIGCOMM (San Diego, CA, August 2001).
• Mayank Bawa, Brian F. Cooper, Arturo Crespo, Neil Daswani, Prasanna Ganesan, Hector Garcia-Molina, Sepandar Kamvar, Sergio Marti, Mario Schlosser, Qi Sun, Patrick Vinograd, and Beverly Yang, "Peer-to-Peer Research at Stanford".
• Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications", ACM SIGCOMM 2001.
• Prasanna Ganesan, Qixiang Sun, and Hector Garcia-Molina, "YAPPERS: A Peer-to-Peer Lookup Service over Arbitrary Topology", INFOCOM 2003.
198