Distributed Hash-based Lookup
for Peer-to-Peer Systems
Mohammed Junaid Azad 09305050
Gopal Krishnan 09305915
MTech 1, CSE
Agenda
• Peer-to-Peer Systems
• Initial Approaches to Peer-to-Peer Systems
• Their Limitations
• Distributed Hash Tables
  – CAN: Content Addressable Network
  – Chord
Peer-to-Peer Systems
• Distributed and decentralized architecture
• No centralized server (unlike client-server architecture)
• Any peer can behave as a server
Napster
• P2P file-sharing system
• A central server stores the index of all the files available on the network
• To retrieve a file, the central server is contacted to obtain the location of the desired file
• Not a completely decentralized system
• The central directory is not scalable
• Single point of failure
Gnutella
• P2P file-sharing system
• No central server stores an index of the files available on the network
• The file-location process is decentralized as well
• Requests for files are flooded on the network
• No single point of failure
• Flooding on every request is not scalable
File Systems for P2P systems
• The file system would store files and their
metadata across nodes in the P2P network
• The nodes containing blocks of files could be
located using hash based lookup
• The blocks would then be fetched from those
nodes
Scalable File Indexing Mechanism
• In any P2P system, the file-transfer process is inherently scalable
• However, the indexing scheme that maps file names to locations is crucial for scalability
• Solution: Distributed Hash Table
Distributed Hash Tables
• Traditional name and location services provide a direct mapping between
keys and values
• What are examples of values? A value can be an address, a document, or
an arbitrary data item
• Distributed hash tables such as CAN/Chord implement a distributed
service for storing and retrieving key/value pairs
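To make the idea concrete, here is a minimal sketch (not from the slides) of the put/get service a DHT exposes. The node names and the naive modulo-based placement are assumptions for illustration only; CAN and Chord replace that placement rule with zone ownership and successor lookup respectively.

```python
import hashlib

def node_for_key(key: str, nodes: list) -> str:
    """Naive placement: hash the key and pick a node modulo the node count.
    CAN/Chord replace this rule with zone ownership / successor lookup."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

class TinyDHT:
    """Illustrative key/value service spread over several nodes."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.store = {n: {} for n in nodes}   # each node holds part of the table

    def put(self, key, value):
        self.store[node_for_key(key, self.nodes)][key] = value

    def get(self, key):
        return self.store[node_for_key(key, self.nodes)].get(key)

dht = TinyDHT(["n1", "n2", "n3"])
dht.put("song.mp3", "10.0.0.7")          # hypothetical file -> location mapping
print(dht.get("song.mp3"))               # -> 10.0.0.7
```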
DNS vs. Chord/CAN
DNS
• provides a host name to IP address mapping
• relies on a set of special root servers
• names reflect administrative boundaries
• is specialized to finding named hosts or services

Chord
• can provide the same service: name = key, value = IP
• requires no special servers
• imposes no naming structure
• can also be used to find data objects that are not tied to certain machines
Example Application using Chord:
Cooperative Mirroring
[Figure: a client and two servers; the client runs File System over Block Store over Chord, and each server runs Block Store over Chord.]
• The highest layer provides a file-like interface to the user, including user-friendly naming and authentication
• This file-system layer maps operations to lower-level block operations
• Block storage uses Chord to identify the node responsible for storing a block, and then talks to the block storage server on that node
CAN
What is CAN?
• CAN is a distributed infrastructure that provides hash-table-like functionality
• CAN is composed of many individual nodes
• Each CAN node stores a chunk (zone) of the entire hash table
• A request for a particular key is routed through intermediate CAN nodes towards the node whose zone contains that key
• The design can be implemented at the application level (no kernel changes required)
Co-ordinate space in CAN
Design of CAN
• Involves a virtual d-dimensional Cartesian co-ordinate space
  – The co-ordinate space is completely logical
  – Lookup keys are hashed into this space
• The co-ordinate space is partitioned into zones among all nodes in the system
  – Every node in the system owns a distinct zone
• The distribution of zones across nodes forms an overlay network
Design of CAN (continued)
• To store a (key, value) pair, the key is mapped deterministically onto a point P in the co-ordinate space using a hash function
• The (key, value) pair is then stored at the node which owns the zone containing P
• To retrieve the entry corresponding to a key K, the same hash function is applied to map K to the point P
• The retrieval request is routed from the requestor node to the node owning the zone containing P
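As a rough sketch of this deterministic key-to-point mapping (the choice of SHA-1 and a 2-dimensional unit square are assumptions, not specified in the slides):

```python
import hashlib

def key_to_point(key: str, d: int = 2):
    """Hash `key` deterministically onto a point P in the d-dimensional unit
    cube [0, 1)^d by slicing the SHA-1 digest into d coordinates."""
    digest = hashlib.sha1(key.encode()).digest()        # 20 bytes
    chunk = len(digest) // d
    coords = []
    for i in range(d):
        part = digest[i * chunk:(i + 1) * chunk]
        coords.append(int.from_bytes(part, "big") / 2 ** (8 * chunk))
    return tuple(coords)

p = key_to_point("song.mp3")
print(p)   # the node owning the zone containing this point stores the (key, value) pair
```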
Routing in CAN
• Every CAN node holds the IP address and virtual co-ordinates of each of its neighbours
• Every message to be routed holds the destination co-ordinates
• Using its neighbours' co-ordinate sets, a node routes a message towards the neighbour with co-ordinates closest to the destination co-ordinates
• Progress: how much closer the message gets to the destination after being routed to one of the neighbours
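A minimal sketch of this greedy forwarding rule, assuming each node keeps a small table of neighbour coordinates (the table contents below are made up for illustration):

```python
import math

def closest_neighbour(neighbours, dest):
    """One greedy routing step: among the known neighbour coordinates,
    pick the neighbour closest (Euclidean distance) to the destination point."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return min(neighbours, key=lambda name: dist(neighbours[name], dest))

# Illustrative neighbour table: name -> centre of that neighbour's zone
neighbours = {"A": (0.25, 0.75), "B": (0.75, 0.25), "C": (0.75, 0.75)}
print(closest_neighbour(neighbours, dest=(0.9, 0.8)))   # -> "C"
```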
Routing in CAN (continued)
• For a d-dimensional space partitioned into n equal zones, routing path length = O(d·n^(1/d)) hops
  – With an increase in the number of nodes, routing path length grows as O(n^(1/d))
• Every node has 2d neighbours
  – With an increase in the number of nodes, per-node state does not change
[Figures: CAN co-ordinate space before a node joins and after the node joins.]
Allocation of a new node to a zone
1. First, the new node must find a node already in CAN (using bootstrap nodes)
2. The new node randomly chooses a point P in the co-ordinate space
3. It sends a JOIN request for point P via any existing CAN node
4. The request is forwarded using the CAN routing mechanism to the node D owning the zone containing P
5. D then splits its zone in half and assigns one half to the new node
6. The new neighbour information is determined for both nodes
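A small sketch of the zone split in step 5, assuming zones are axis-aligned boxes; the transfer of (key, value) pairs whose points fall in the handed-over half, and the choice of split dimension, are left out:

```python
def split_zone(zone, dim=0):
    """Split a rectangular zone {'lo': [...], 'hi': [...]} in half along `dim`;
    the old owner keeps the lower half, the joining node takes the upper half."""
    mid = (zone["lo"][dim] + zone["hi"][dim]) / 2
    lower = {"lo": list(zone["lo"]), "hi": list(zone["hi"])}
    upper = {"lo": list(zone["lo"]), "hi": list(zone["hi"])}
    lower["hi"][dim] = mid
    upper["lo"][dim] = mid
    return lower, upper

old_half, new_half = split_zone({"lo": [0.0, 0.0], "hi": [0.5, 1.0]}, dim=1)
print(old_half)   # {'lo': [0.0, 0.0], 'hi': [0.5, 0.5]}  -- kept by D
print(new_half)   # {'lo': [0.0, 0.5], 'hi': [0.5, 1.0]}  -- assigned to the new node
```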
Failure of a node
• Even if one of the neighbours fails, messages can be routed through other neighbours in that direction
• If a node leaves CAN, the zone it occupies is taken over by the remaining nodes
  – If a node leaves voluntarily, it can hand over its database to some other node
  – When a node simply becomes unreachable, the database of the failed node is lost
    • CAN depends on sources to resubmit data in order to recover lost data
CHORD
Features
• Chord is a distributed hash table implementation
• Addresses a fundamental problem in P2P
  – Efficient location of the node that stores the desired data item
  – One operation: given a key, maps it onto a node
  – Data location by associating a key with each data item
• Adapts efficiently
  – Dynamic, with frequent node arrivals and departures
  – Automatically adjusts internal tables to ensure availability
• Uses consistent hashing
  – Load balancing in assigning keys to nodes
  – Little movement of keys when nodes join and leave
Features (continued)
• Efficient routing
  – Distributed routing table
  – Maintains information about only O(log N) nodes
  – Resolves lookups via O(log N) messages
• Scalable
  – Communication cost and state maintained at each node scale logarithmically with the number of nodes
• Flexible naming
  – Flat key-space gives applications flexibility to map their own names to Chord keys
• Decentralized
Some Terminology
• Key
  – A hash key or its image under the hash function, as per context
  – An m-bit identifier, using SHA-1 as a base hash function
• Node
  – An actual node or its identifier under the hash function
  – Length m chosen so that the probability of a hash collision is low
• Chord Ring
  – The identifier circle, ordering identifiers modulo 2^m
• Successor Node
  – The first node whose identifier is equal to or follows key k in the identifier space
• Virtual Node
  – Introduced to keep the bound on keys per node close to K/N
  – Each real node runs Ω(log N) virtual nodes, each with its own identifier
Chord Ring
Consistent Hashing
• A consistent hash function is one whose key-to-node mapping changes minimally as nodes join or leave, so a total remapping is not required
• Desirable properties
  – High probability that the hash function balances load
  – Minimal disruption: only an O(1/N) fraction of the keys move when a node joins or leaves
  – Every node need not know about every other node, only a small amount of “routing” information
• m-bit identifier for each node and key
• Key k is assigned to its successor node
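A hedged sketch of this assignment rule, with m = 6 and made-up node addresses purely for readability:

```python
import hashlib

m = 6                                   # small identifier space (2**m = 64) for readability

def ident(name: str) -> int:
    """m-bit identifier: SHA-1 image of the name, reduced mod 2**m."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** m)

def successor(key_id: int, node_ids) -> int:
    """First node identifier equal to or following key_id on the identifier circle."""
    for n in sorted(node_ids):
        if n >= key_id:
            return n
    return min(node_ids)                # wrap around the ring

node_ids = [ident(f"10.0.0.{i}:4000") for i in range(1, 6)]   # made-up node addresses
key_id = ident("song.mp3")
print(sorted(node_ids), key_id, "->", successor(key_id, node_ids))
```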
Simple Key Location
Example
Scalable Key Location
• A very small amount of routing information suffices to implement
consistent hashing in a distributed environment
• Each node need only be aware of its successor node on the circle
• Queries for a given identifier can be passed around the circle via these
successor pointers
• Resolution scheme correct, BUT inefficient: it may require traversing all N
nodes!
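A minimal sketch of this successor-pointer-only resolution, using the small 3-bit example ring (nodes 0, 1, 3) that the following slides use:

```python
def in_interval(x, a, b):
    """True if x lies in the circular interval (a, b] of the identifier ring."""
    return (a < x <= b) if a < b else (x > a or x <= b)

def find_successor_linear(start, key_id, succ):
    """Simple key location: follow successor pointers (dict `succ`) until the
    key falls between a node and its successor -- O(N) hops in the worst case."""
    n = start
    while not in_interval(key_id, n, succ[n]):
        n = succ[n]
    return succ[n]

succ = {0: 1, 1: 3, 3: 0}                  # 3-bit example ring with nodes 0, 1, 3
print(find_successor_linear(0, 6, succ))   # key 6 -> node 0
```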
Acceleration of Lookups
• Lookups are accelerated by maintaining additional routing information
• Each node maintains a routing table with (at most) m entries, called the finger table (m is the number of bits in the identifiers, so the identifier space has 2^m values)
• The i-th entry in the table at node n contains the identity of the first node, s, that succeeds n by at least 2^(i-1) on the identifier circle (clarification on next slide)
• s = successor(n + 2^(i-1)) (all arithmetic mod 2^m)
• s is called the i-th finger of node n, denoted by n.finger(i).node
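A short sketch of the finger definition above, reproduced for the 3-bit example ring (nodes 0, 1, 3) shown next:

```python
def successor(x, node_ids):
    """First node identifier equal to or following x on the ring."""
    node_ids = sorted(node_ids)
    for nid in node_ids:
        if nid >= x:
            return nid
    return node_ids[0]                 # wrap around

def finger_table(n, m, node_ids):
    """i-th finger of node n = successor((n + 2**(i-1)) mod 2**m), i = 1..m."""
    return [successor((n + 2 ** (i - 1)) % 2 ** m, node_ids) for i in range(1, m + 1)]

print(finger_table(1, 3, [0, 1, 3]))   # -> [3, 3, 0], matching node 1's table below
```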
Finger Tables (1)
Example: identifier circle with m = 3, nodes 0, 1 and 3, and keys 1, 2 and 6.

Finger table of node 0 (keys: 6):
  start  interval  succ.
    1     [1,2)      1
    2     [2,4)      3
    4     [4,0)      0

Finger table of node 1 (keys: 1):
  start  interval  succ.
    2     [2,3)      3
    3     [3,5)      3
    5     [5,1)      0

Finger table of node 3 (keys: 2):
  start  interval  succ.
    4     [4,5)      0
    5     [5,7)      0
    7     [7,3)      0
Finger Tables (2) – characteristics
• Each node stores information about only a small number of other nodes, and knows more about nodes closely following it than about nodes farther away
• A node's finger table generally does not contain enough information to determine the successor of an arbitrary key k
• Repeated queries to nodes that immediately precede the given key eventually lead to the key's successor
Pseudo code
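The pseudocode on this slide is an image in the original deck. As a stand-in, here is a hedged Python sketch of the scalable lookup it presumably shows: hop to the closest preceding finger until the key falls between a node and its successor. The ring wiring below mirrors the 3-bit example finger tables above.

```python
class Node:
    def __init__(self, ident, m):
        self.id, self.m = ident, m
        self.successor = self               # fixed up once the ring is built
        self.finger = [self] * m            # finger[i] ~ successor(id + 2**i), i = 0..m-1

    def _in(self, x, a, b, incl_right=False):
        """x in the circular interval (a, b), optionally including b."""
        if a < b:
            return a < x < b or (incl_right and x == b)
        return x > a or x < b or (incl_right and x == b)

    def closest_preceding_finger(self, key_id):
        for f in reversed(self.finger):     # try the farthest finger first
            if self._in(f.id, self.id, key_id):
                return f
        return self

    def find_successor(self, key_id):
        n = self
        while not n._in(key_id, n.id, n.successor.id, incl_right=True):
            n = n.closest_preceding_finger(key_id)
        return n.successor

# Wire up the 3-bit example ring: nodes 0, 1, 3
m = 3
n0, n1, n3 = Node(0, m), Node(1, m), Node(3, m)
n0.successor, n1.successor, n3.successor = n1, n3, n0
n0.finger = [n1, n3, n0]                    # successors of 1, 2, 4
n1.finger = [n3, n3, n0]                    # successors of 2, 3, 5
n3.finger = [n0, n0, n0]                    # successors of 4, 5, 7
print(n3.find_successor(1).id)              # key 1 is stored at node 1
```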
Example
Node Joins – with Finger Tables
[Figure: finger tables and key locations after node 6 joins the example ring (m = 3; nodes 0, 1, 3; keys 1, 2, 6). Node 6's finger table has starts 7, 0, 2 with intervals [7,0), [0,2), [2,6) and successors 0, 0, 3. Changed entries: node 0's third finger and node 1's third finger become 6 (previously 0), node 3's first and second fingers become 6 (previously 0), and key 6 moves from node 0 to node 6.]
Node Departures – with Finger Tables
[Figure: finger tables and key locations after node 1 leaves the ring of nodes 0, 1, 3, 6. Node 0's first finger changes from 1 to 3, and key 1 moves from node 1 to node 3; the remaining finger-table entries are as after node 6's join.]
Source of Inconsistencies:
Concurrent Operations and Failures
• Basic “stabilization” protocol is used to keep nodes’ successor pointers up
to date, which is sufficient to guarantee correctness of lookups
• Those successor pointers can then be used to verify the finger table
entries
• Every node runs stabilize periodically to find newly joined nodes
Pseudo code
Pseudo code (continued)
Stabilization after Join
[Figure: ring segment np → n → ns, where n joins between np and ns; succ(np) changes from ns to n, and pred(ns) changes from np to n.]
• n joins
  – predecessor = nil
  – n acquires ns as its successor via some n'
  – n notifies ns of being its new predecessor
  – ns acquires n as its predecessor
• np runs stabilize
  – np asks ns for its predecessor (now n)
  – np acquires n as its successor
  – np notifies n
  – n will acquire np as its predecessor
• All predecessor and successor pointers are now correct
• Fingers still need to be fixed, but old fingers will still work
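The pseudocode slides above are images in the original deck; the following is a hedged Python sketch of the steps just listed (stabilize, notify, and a finger-refresh step commonly called fix_fingers), not the paper's exact code:

```python
import random

def between(x, a, b):
    """x strictly inside the circular interval (a, b) of the identifier ring."""
    return (a < x < b) if a < b else (x > a or x < b)

class Node:
    def __init__(self, ident, m):
        self.id, self.m = ident, m
        self.successor, self.predecessor = self, None
        self.finger = [self] * m

    def stabilize(self):
        """Ask the successor for its predecessor; adopt it if it sits in between."""
        x = self.successor.predecessor
        if x is not None and between(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, candidate):
        """candidate thinks it might be our predecessor."""
        if self.predecessor is None or between(candidate.id, self.predecessor.id, self.id):
            self.predecessor = candidate

    def fix_fingers(self, lookup):
        """Refresh one finger entry per round, using any working lookup routine."""
        i = random.randrange(self.m)
        self.finger[i] = lookup((self.id + 2 ** i) % 2 ** self.m)

# n (id 6) joins between np (id 2) and ns (id 10), m = 4
np_, n, ns = Node(2, 4), Node(6, 4), Node(10, 4)
np_.successor, ns.predecessor = ns, np_
n.successor = ns              # n acquires ns as its successor via some n'
ns.notify(n)                  # n notifies ns; ns acquires n as its predecessor
np_.stabilize()               # np learns about n and adopts it as its successor
print(np_.successor.id, ns.predecessor.id)   # -> 6 6
```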
Failure Recovery
• The key step in failure recovery is maintaining correct successor pointers
• To help achieve this, each node maintains a successor list of its r nearest successors on the ring
• If node n notices that its successor has failed, it replaces it with the first live entry in the list
• stabilize will correct finger table entries and successor-list entries pointing to the failed node
• Performance is sensitive to the frequency of node joins and leaves versus the frequency at which the stabilization protocol is invoked
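A minimal sketch of the failover rule in the bullets above; the `alive` set is a stand-in for whatever failure detection (e.g. RPC timeouts) a real node would use:

```python
def current_successor(successor_list, alive):
    """Return the first live entry of the r-entry successor list,
    skipping nodes that have failed (not in `alive`)."""
    for s in successor_list:
        if s in alive:
            return s
    raise RuntimeError("all r successors failed; ring repair needed")

succ_list = [12, 17, 23]          # r = 3 nearest successors of some node
alive = {17, 23, 40}              # node 12 has failed
print(current_successor(succ_list, alive))   # -> 17
```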
Impact of Node Joins on Lookups:
Correctness
• For a lookup before stabilization has finished,
1. Case 1: All finger table entries involved in the lookup
are reasonably current then lookup finds correct
successor in O(logN) steps
2. Case 2: Successor pointers are correct, but finger
pointers are inaccurate. This scenario yields correct
lookups but may be slower
3. Case 3: Incorrect successor pointers or keys not
migrated yet to newly joined nodes. Lookup may
fail. Option of retrying after a quick pause, during
which stabilization fixes successor pointers
Impact of Node Joins on Lookups:
Performance
• After stabilization, no effect other than increasing the value of N in O(log N)
• Before stabilization is complete
  – Possibly incorrect finger table entries
  – This does not significantly affect lookup speed, since the distance-halving property depends only on ID-space distance
  – If new nodes' IDs fall between the target's predecessor and the target, lookup speed is affected
  – Still takes O(log N) time for N new nodes
Handling Failures
• Problem: what if a node does not know who its new successor is after the failure of its old successor?
  – The new successor may lie in a gap in the finger table
  – Chord would be stuck!
• Solution: maintain a successor list of size r, containing the node's first r successors
  – If the immediate successor does not respond, substitute the next entry in the successor list
  – A modified version of the stabilize protocol maintains the successor list
  – A modified closest_preceding_node searches not only the finger table but also the successor list for the most immediate predecessor
  – If find_successor fails, retry after some time
• Voluntary node departures
  – Transfer keys to the successor before departure
  – Notify predecessor p and successor s before leaving
Theorems
• Theorem IV.3: inconsistencies in successor pointers are transient
  – If any sequence of join operations is executed interleaved with stabilizations, then at some time after the last join the successor pointers will form a cycle on all the nodes in the network
• Theorem IV.4: lookups take O(log N) time with high probability even if N nodes join a stable N-node network, once successor pointers are correct, even if finger pointers are not yet updated
• Theorem IV.6: if the network is initially stable, even if every node then fails with probability ½, the expected time to execute find_successor is O(log N)
Simulation
• The simulator implements the iterative style (the alternative is the recursive style)
• The node resolving a lookup initiates all communication, unlike the recursive style, where intermediate nodes forward the request
Optimizations
• During stabilization, a node updates its immediate successor and one other entry in its successor list or finger table
• Each of the k unique entries therefore gets refreshed once every k stabilization rounds
• The size of the successor list is 1
• Immediate notification of a predecessor change to the old predecessor, without waiting for the next stabilization round
Parameters
• Mean delay of each packet is 50 ms
• Round-trip time is 500 ms
• Number of nodes is 10^4
• Number of keys varies from 10^4 to 10^6
Load Balance
• Tests the ability of consistent hashing to allocate keys to nodes evenly
• The number of keys per node exhibits large variations that increase linearly with the number of keys
• Associating keys with virtual nodes makes the number of keys per node more uniform and significantly improves load balance
• The asymptotic value of the query path length is not affected much
• The total identifier space covered remains the same on average
• The worst-case number of queries does not change
• Not much increase in the routing state maintained
• The asymptotic number of control messages is not affected
[Figures: distribution of keys per node in the absence of virtual nodes and in the presence of virtual nodes.]
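A small simulation sketch of the effect described above, comparing one identifier per node with several virtual identifiers per node; the parameters (1,000 nodes, 100,000 keys, 16 virtual nodes) are arbitrary choices for the sketch, not the paper's settings:

```python
import hashlib
from bisect import bisect_left
from collections import Counter

def h(s):
    """Identifier in [0, 2**32) derived from SHA-1."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** 32)

def keys_per_node(num_nodes, num_keys, vnodes):
    """Assign keys to successor nodes when each node places `vnodes` points on the ring."""
    ring = sorted((h(f"node{n}/v{v}"), n) for n in range(num_nodes)
                                          for v in range(vnodes))
    ids = [point for point, _ in ring]
    load = Counter()
    for k in range(num_keys):
        i = bisect_left(ids, h(f"key{k}"))      # successor position on the ring
        load[ring[i % len(ring)][1]] += 1       # wrap around past the last point
    return load

for v in (1, 16):                               # 1 identifier vs. 16 virtual identifiers
    load = keys_per_node(1000, 100_000, v)
    print(f"{v:2d} virtual node(s): max keys on one node = {max(load.values())} (ideal ~100)")
```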
Path Length
• The number of nodes that must be visited to resolve a query, measured as the query path length
• As per Theorem IV.2, the number of nodes that must be contacted to find a successor in an N-node network is O(log N)
• Observed results:
  – The mean query path length increases logarithmically with the number of nodes
  – The observed average matches the expected query path length
Path Length Simulator Parameters
• A network with N = 2^k nodes
• Number of keys: 100 × 2^k
• k is varied between 3 and 14 and the path length is measured

[Figure: graph of query path length versus number of nodes.]
Future Work
• Resilience against network partitions
  – Detect and heal partitions
  – For every node, have a set of initial nodes
  – Maintain a long-term memory of a random set of nodes
  – Likely to include nodes from the other partition
• Handle threats to the availability of data
  – Malicious participants could present an incorrect view of the data
  – Periodic global consistency checks by each node
• Better efficiency
  – O(log N) messages per lookup is too many for some applications
  – Increase the number of fingers
References
• Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, H. Balakrishnan. In Proc. ACM SIGCOMM 2001; expanded version in IEEE/ACM Transactions on Networking, 11(1), February 2003.
• A Scalable Content-Addressable Network. S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker. In Proc. ACM SIGCOMM 2001.
• Querying the Internet with PIER. R. Huebsch, J. M. Hellerstein, N. Lanham, B. T. Loo, S. Shenker, I. Stoica. In Proc. VLDB 2003.
Thank You!
Any Questions?