Ken Birman
Cornell University, CS5410 Fall 2008

What is a Distributed Hash Table (DHT)?
• Exactly that ☺
• A service, distributed over multiple machines, with hash table semantics
  • Put(key, value), Value = Get(key)
• Designed to work in a peer-to-peer (P2P) environment
  • No central control
  • Nodes under different administrative control
• But of course can operate in an "infrastructure" sense
More specifically
• Hash table semantics (a minimal interface sketch follows this list):
  • Put(key, value), Value = Get(key)
  • Key is a single flat string
  • Limited semantics compared to keyword search
• Put() causes value to be stored at one (or more) peer(s)
• Get() retrieves value from a peer
• Put() and Get() accomplished with unicast routed messages
  • In other words, it scales
• Other API calls to support the application, like notification when neighbors come and go
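To make the Put/Get semantics concrete, here is a minimal sketch of the client-facing interface a DHT might expose. It is illustrative only: the class name and the route_put/route_get calls are assumptions, not part of any system described in these slides.

    import hashlib

    class DHTClient:
        """Illustrative client-side view of a DHT (not a real implementation)."""

        def __init__(self, bootstrap_node):
            # A client only needs one known member to reach the whole table.
            self.bootstrap = bootstrap_node

        def _key(self, name: str) -> int:
            # Flat string keys are hashed into the DHT's numeric key space.
            return int(hashlib.sha1(name.encode()).hexdigest(), 16)

        def put(self, name: str, value: bytes) -> None:
            # Routed (unicast) to the peer(s) responsible for the key.
            self.bootstrap.route_put(self._key(name), value)   # hypothetical call

        def get(self, name: str) -> bytes:
            # Routed to a responsible peer, which returns the stored value.
            return self.bootstrap.route_get(self._key(name))   # hypothetical call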
P2P "environment"
• Nodes come and go at will (possibly quite frequently, staying only a few minutes)
• Nodes have heterogeneous capacities
  • Bandwidth, processing, and storage
• Nodes may behave badly
  • Promise to do something (store a file) and not do it (free-loaders)
  • Attack the system
Several flavors, each with variants
• Tapestry (Berkeley)
  • Based on Plaxton trees, similar to hypercube routing
  • The first* DHT
  • Complex and hard to maintain (hard to understand too!)
• CAN (ACIRI), Chord (MIT), and Pastry (Rice/MSR Cambridge)
  • Second wave of DHTs (contemporary with and independent of each other)

* Landmark Routing, 1988, used a form of DHT called Assured Destination Binding (ADB)
Basics of all DHTs
[Figure: ring overlay with node IDs 13, 33, 58, 81, 97, 111, 127]
• Goal is to build some "structured" overlay network with the following characteristics:
  • Node IDs can be mapped to the hash key space
  • Given a hash key as a "destination address", you can route through the network to a given node
  • Always route to the same node no matter where you start from
Simple example (doesn't scale)
[Figure: ring with node IDs 13, 33, 58, 81, 97, 111, 127]
• Circular number space 0 to 127
• Routing rule is to move counter-clockwise until current node ID ≥ key and last hop node ID < key (a sketch of this rule follows)
• Example: key = 42
  • Obviously you will route to node 58 no matter where you start
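As an illustration only (not any particular system's code), here is a small sketch of that responsibility rule: the node responsible for a key is the first node ID greater than or equal to the key, wrapping around the ring. The node list is taken from the example figure.

    import bisect

    NODE_IDS = [13, 33, 58, 81, 97, 111, 127]  # ring node IDs from the example figure
    ID_SPACE = 128                              # circular number space 0..127

    def responsible_node(key: int) -> int:
        """First node whose ID is >= key, wrapping around the 0..127 ring."""
        key %= ID_SPACE
        i = bisect.bisect_left(NODE_IDS, key)
        return NODE_IDS[i] if i < len(NODE_IDS) else NODE_IDS[0]

    assert responsible_node(42) == 58   # the example from the slide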
Building any DHT
[Figure: ring with node IDs 13, 24, 33, 58, 81, 97, 111, 127, shown as the newcomer is integrated step by step]
• Newcomer always starts with at least one known member
• Newcomer searches for "self" in the network
  • hash key = newcomer's node ID
  • Search results in a node in the vicinity where newcomer needs to be
• Links are added/removed to satisfy properties of network
• Objects that now hash to the new node are transferred to the new node
Insertion/lookup for any DHT
[Figure: ring with node IDs 13, 24, 33, 58, 81, 97, 111, 127; foo.htm hashes to key 93]
• Hash name of object to produce key (a sketch of this step follows)
  • Well-known way to do this
• Use key as destination address to route through the network
  • Routes to the target node
• Insert object, or retrieve object, at the target node
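For example, a common approach (assumed here; the slides do not mandate a particular hash) is to take a cryptographic hash of the object's name and reduce it into the ring's ID space:

    import hashlib

    ID_SPACE = 128  # the toy 0..127 ring from these slides; real DHTs use 2**128 or 2**160

    def object_key(name: str) -> int:
        # Hash the flat string name, then map it into the circular key space.
        digest = hashlib.sha1(name.encode()).digest()
        return int.from_bytes(digest, "big") % ID_SPACE

    # The figure maps foo.htm to 93 with its own hash function; this sketch
    # just illustrates the idea of turning a name into a routable key.
    key = object_key("foo.htm")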
Properties of most DHTs
• Memory requirements grow (something like) logarithmically with N
• Routing path length grows (something like) logarithmically with N
• Cost of adding or removing a node grows (something like) logarithmically with N
• Has caching, replication, etc…
DHT Issues
• Resilience to failures
• Load balance
  • Heterogeneity
  • Number of objects at each node
  • Routing hot spots
  • Lookup hot spots
• Locality (performance issue)
• Churn (performance and correctness issue)
• Security
We're going to look at four DHTs
• At varying levels of detail…
• CAN (Content Addressable Network)
  • ACIRI (now ICIR)
• Chord
  • MIT
• Kelips
  • Cornell
• Pastry
  • Rice/Microsoft Cambridge
Things we're going to look at
• What is the structure?
• How does routing work in the structure?
• How does it deal with node departures?
• How does it scale?
• How does it deal with locality?
• What are the security issues?
CAN structure is a Cartesian coordinate space in a D-dimensional torus
[Figure: a single node, labeled 1, owning the whole coordinate space]
CAN graphics care of Santashil PalChaudhuri, Rice Univ
Simple example in two dimensions
[Figure: the 2-D space is split among nodes 1, 2, 3, 4 as each joins]
• Note: torus wraps on "top" and "sides"
• Each node in CAN network occupies a "square" in the space
• With relatively uniform square sizes
Neighbors in CAN network
• Neighbor is a node that:
  • Overlaps in d-1 dimensions
  • Abuts along one dimension
Route to neighbors closer to target
[Figure: route from point (a,b) toward (x,y) through zones Z1, Z2, Z3, Z4, …, Zn]
• d-dimensional space
• n zones
  • Zone is space occupied by a "square" in one dimension
• Avg. route path length: (d/4)(n^(1/d))
• Number of neighbors = O(d)
• Tunable (vary d or n)
• Can factor proximity into route decision (a greedy routing sketch follows)
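To make the greedy step concrete, here is a small sketch (illustrative only; node names and coordinates are made up) of how a CAN node might pick the neighbor whose zone center is closest to the target point on a d-dimensional torus:

    from typing import List, Tuple

    Point = Tuple[float, ...]

    def torus_distance(a: Point, b: Point, size: float = 1.0) -> float:
        """Euclidean distance on a torus where each dimension wraps at `size`."""
        total = 0.0
        for x, y in zip(a, b):
            d = abs(x - y) % size
            d = min(d, size - d)        # take the shorter way around the wrap
            total += d * d
        return total ** 0.5

    def next_hop(neighbors: List[Tuple[str, Point]], target: Point) -> str:
        """Greedy CAN-style step: forward to the neighbor closest to the target."""
        return min(neighbors, key=lambda nc: torus_distance(nc[1], target))[0]

    # Hypothetical neighbors identified by name and zone-center coordinates:
    hop = next_hop([("n2", (0.75, 0.25)), ("n3", (0.25, 0.75))], target=(0.9, 0.1))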
Chord uses a circular ID space
[Figure: circular ID space with nodes N10, N32, N60, N80, N100; each key lives at its successor: K5, K10 at N10; K11, K30 at N32; K33, K40, K52 at N60; K65, K70 at N80; K100 at N100]
• Successor: node with next highest ID
Chord slides care of Robert Morris, MIT
Basic Lookup
[Figure: ring of nodes N5, N10, N20, N32, N40, N60, N80, N99, N110; the query "Where is key 50?" travels around the ring and the answer "Key 50 is at N60" comes back]
• Lookups find the ID's predecessor (a successor-walk sketch follows)
• Correct if successors are correct
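A minimal sketch of successor-only lookup, the version this slide depicts before finger tables are introduced. The Node class is illustrative, not the Chord paper's pseudocode.

    class Node:
        def __init__(self, node_id: int, ring_bits: int = 7):
            self.id = node_id
            self.space = 1 << ring_bits    # e.g. 128 for the toy examples
            self.successor: "Node" = self  # set properly once the ring is formed

        def _in_range(self, key: int) -> bool:
            # True if key lies in (self.id, successor.id] on the circle.
            lo, hi = self.id, self.successor.id
            if lo < hi:
                return lo < key <= hi
            return key > lo or key <= hi   # interval wraps past zero

        def find_successor(self, key: int) -> "Node":
            # Walk the ring one successor at a time until the key falls between
            # a node and its successor; that successor is responsible for the key.
            node = self
            while not node._in_range(key % node.space):
                node = node.successor
            return node.successor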
Successor Lists Ensure Robust Lookup
[Figure: each node on the ring keeps a list of its next three successors, e.g. N5 keeps (10, 20, 32), N60 keeps (80, 99, 110), N110 keeps (5, 10, 20)]
• Each node remembers r successors
• Lookup can skip over dead nodes to find blocks
• Periodic check of successor and predecessor links
Chord "Finger Table" Accelerates Lookups
[Figure: N80's fingers point to the nodes ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring]
• To build finger tables, a new node searches for the key values for each finger (a construction sketch follows)
• To do it efficiently, new nodes obtain the successor's finger table, and use it as a hint to optimize the search
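As a sketch under the usual Chord definition (finger i of node n points to the first node at or after n + 2^i mod 2^m), the finger targets can be computed as follows; find_successor stands for the lookup routine sketched earlier, and this is illustrative rather than the paper's code.

    def finger_targets(node_id: int, ring_bits: int = 7) -> list:
        """Key values each finger should cover: node_id + 2**i, wrapping mod 2**m."""
        space = 1 << ring_bits
        return [(node_id + (1 << i)) % space for i in range(ring_bits)]

    def build_fingers(node, find_successor) -> list:
        # Each finger entry is the live node responsible for its target key.
        # In practice a new node seeds this from its successor's table as a hint.
        return [find_successor(t) for t in finger_targets(node.id)]

    # finger_targets(80, 7) -> [81, 82, 84, 88, 96, 112, 16]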
Chord lookups take O(log N) hops
[Figure: Lookup(K19) takes a few finger hops across the ring of nodes N5, N10, N20, N32, N60, N80, N99, N110 to reach N20, the successor of key K19]
Drill down on Chord reliability
• Interested in maintaining a correct routing table (successors, predecessors, and fingers)
• Primary invariant: correctness of successor pointers
  • Fingers, while important for performance, do not have to be exactly correct for routing to work
• Algorithm is to "get closer" to the target
  • Successor nodes always do this
Maintaining successor pointers
• Periodically run "stabilize" algorithm (a sketch follows)
  • Finds successor's predecessor
  • Repair if this isn't self
• This algorithm is also run at join
• Eventually routing will repair itself
• Fix_finger also periodically run
  • For randomly selected finger
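A minimal sketch of the stabilize/notify pair, following the description above. The method names mirror the Chord paper's terminology, but this is an illustrative sketch, not a faithful implementation.

    def between(x: int, lo: int, hi: int) -> bool:
        """True if x lies strictly inside the circular interval (lo, hi)."""
        return lo < x < hi if lo < hi else x > lo or x < hi

    class ChordNode:
        def __init__(self, node_id: int):
            self.id = node_id
            self.successor: "ChordNode" = self
            self.predecessor: "ChordNode" = None

        def stabilize(self):
            # Ask the successor for its predecessor; if someone slipped in
            # between us, adopt it as the new successor, then announce ourselves.
            x = self.successor.predecessor
            if x is not None and between(x.id, self.id, self.successor.id):
                self.successor = x
            self.successor.notify(self)

        def notify(self, candidate: "ChordNode"):
            # Accept the candidate as predecessor if we have none or it is closer.
            if self.predecessor is None or between(candidate.id, self.predecessor.id, self.id):
                self.predecessor = candidate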
Initial: 25 wants to join correct ring (between 20 and 30)
[Figure: three snapshots of the ring segment 20, 25, 30]
• 25 finds successor, and tells successor (30) of itself
• 20 runs "stabilize":
  • 20 asks 30 for 30's predecessor
  • 30 returns 25
  • 20 tells 25 of itself
This time, 28 joins before 20 runs "stabilize"
[Figure: snapshots of the ring segment 20, 25, 28, 30 as the repairs proceed]
• 28 finds successor, and tells successor (30) of itself
• 20 runs "stabilize":
  • 20 asks 30 for 30's predecessor
  • 30 returns 28
  • 20 tells 28 of itself
• 25 runs "stabilize"
• 20 runs "stabilize"
Pastry also uses a circular number space
[Figure: Route(d46a1c) issued at node 65a1fc; successive hops (d13da3, d4213f, d462ba, d467c4, d471f1) share progressively longer prefixes with the target d46a1c]
• Difference is in how the "fingers" are created
• Pastry uses prefix match overlap rather than binary splitting (a prefix-matching sketch follows)
• More flexibility in neighbor selection
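A sketch of the prefix-matching idea (illustrative, not Pastry's actual routing-table code): at each hop, prefer a node whose hex ID shares a longer prefix with the key than the current node does.

    def shared_prefix_len(a: str, b: str) -> int:
        """Number of leading hex digits two IDs have in common."""
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def pick_next_hop(current: str, key: str, known_nodes: list) -> str:
        # Prefer nodes that extend the matched prefix; among candidates that do,
        # any choice works, which is the flexibility Pastry exploits (e.g. by RTT).
        best = max(known_nodes, key=lambda n: shared_prefix_len(n, key), default=current)
        return best if shared_prefix_len(best, key) > shared_prefix_len(current, key) else current

    hop = pick_next_hop("65a1fc", "d46a1c", ["d13da3", "d4213f", "d462ba"])  # -> "d462ba"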
Pastry routing table (for node 65a1fc)
[Figure: the routing table of node 65a1fc]
Pastry nodes also have a "leaf set": the set of immediate neighbors up and down the ring
• Similar to Chord's list of successors
Pastry join
• X = new node, A = bootstrap, Z = nearest node
• A finds Z for X
• In process, A, Z, and all nodes in path send state tables to X
• X settles on own table
  • Possibly after contacting other nodes
• X tells everyone who needs to know about itself
• Pastry paper doesn't give enough information to understand how concurrent joins work
  • 18th IFIP/ACM, Nov 2001
Pastry leave
• Noticed by leaf set neighbors when leaving node doesn't respond
  • Neighbors ask highest and lowest nodes in leaf set for new leaf set
• Noticed by routing neighbors when message forward fails
  • Immediately can route to another neighbor
  • Fix entry by asking another neighbor in the same "row" for its neighbor
  • If this fails, ask somebody a level up
For instance, this neighbor fails
[Figure: repairing a failed 655x entry in the routing table]
• Ask other neighbors
• Try asking some neighbor in the same row for its 655x entry
• If it doesn't have one, try asking some neighbor in the row below, etc.
CAN, Chord, Pastry differences
• CAN, Chord, and Pastry have deep similarities
• Some (important???) differences exist
• CAN nodes tend to know of multiple nodes that allow equal progress
  • Can therefore use additional criteria (RTT) to pick next hop
• Pastry allows greater choice of neighbor
  • Can thus use additional criteria (RTT) to pick neighbor
• In contrast, Chord has more determinism
  • Harder for an attacker to manipulate system?
Security issues
• In many P2P systems, members may be malicious
• If peers untrusted, all content must be signed to detect forged content
  • Requires certificate authority
  • Like we discussed in secure web services talk
  • This is not hard, so can assume at least this level of security
Security issues: Sybil attack
• Attacker pretends to be multiple nodes in the system
  • If it surrounds a node on the circle, it can potentially arrange to capture all traffic
  • Or if not this, at least cause a lot of trouble by being many nodes
• Chord requires node ID to be an SHA-1 hash of its IP address
  • But to deal with load balance issues, a Chord variant allows nodes to replicate themselves
• A central authority must hand out node IDs and certificates to go with them
  • Not P2P in the Gnutella sense
General security rules
• Check things that can be checked
  • Invariants, such as successor list in Chord
• Minimize invariants, maximize randomness
  • Hard for an attacker to exploit randomness
• Avoid any single dependencies
  • Allow multiple paths through the network
  • Allow content to be placed at multiple nodes
• But all this is expensive…
Load balancing
• Query hotspots: a given object is popular
  • Cache at neighbors of hotspot, neighbors of neighbors, etc.
  • Classic caching issues
• Routing hotspot: node is on many paths
  • Of the three, Pastry seems most likely to have this problem, because neighbor selection is more flexible (and based on proximity)
  • This doesn't seem adequately studied
Load balancing
• Heterogeneity (variance in bandwidth or node capacity)
• Poor distribution in entries due to hash function inaccuracies
• One class of solution is to allow each node to be multiple virtual nodes (see the sketch below)
  • Higher capacity nodes virtualize more often
  • But security makes this harder to do
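A small consistent-hashing sketch of the virtual-node idea (illustrative only; node names and capacities are made up): each physical node places several points on the ring, in proportion to its capacity, so higher-capacity nodes receive more keys.

    import bisect
    import hashlib

    def h(s: str) -> int:
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    class VirtualNodeRing:
        def __init__(self, capacities: dict):
            # capacities: physical node name -> number of virtual nodes to place.
            self.ring = sorted(
                (h(f"{name}#{i}"), name)
                for name, count in capacities.items()
                for i in range(count)
            )

        def owner(self, key: str) -> str:
            # The first virtual node clockwise from the key's hash owns it.
            points = [p for p, _ in self.ring]
            i = bisect.bisect(points, h(key)) % len(self.ring)
            return self.ring[i][1]

    ring = VirtualNodeRing({"big-node": 20, "small-node": 5})  # hypothetical capacities
    owner = ring.owner("foo.htm")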
Chord node virtualization
[Figure: load-balance measurements for 10K nodes and 1M objects]
• 20 virtual nodes per node has much better load balance, but each node requires ~400 neighbors!
Primary concern: churn
• Churn: nodes joining and leaving frequently
• Join or leave requires a change in some number of links
• Those changes depend on correct routing tables in other nodes
  • Cost of a change is higher if routing tables not correct
  • In Chord, ~6% of lookups fail if there are three failures per stabilization period
  • But as more changes occur, probability of incorrect routing tables increases
Control traffic load generated by churn
• Chord and Pastry appear to deal with churn differently
• Chord join involves some immediate work, but repair is done periodically
  • Extra load only due to join messages
• Pastry join and leave involves immediate repair of all affected nodes' tables
  • Routing tables repaired more quickly, but cost of each join/leave goes up with frequency of joins/leaves
  • Scales quadratically with number of changes???
  • Can result in network meltdown???
Kelips takes a different approach
• Network partitioned into √N "affinity groups"
• Hash of node ID determines which affinity group a node is in (a sketch follows)
• Each node knows:
  • One or more nodes in each group
  • All objects and nodes in own group
• But this knowledge is soft-state, spread through peer-to-peer "gossip" (epidemic multicast)!
[Figure: node 110 knows about other members of its group, e.g. 230, 30…]
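To make the partitioning concrete, here is a small sketch (names and counts are illustrative) of how a node might map itself, or a resource name, into one of roughly √N affinity groups:

    import hashlib
    import math

    def _h(s: str) -> int:
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    def affinity_group(identifier: str, n_nodes: int) -> int:
        """Map a node ID or resource name into one of ~sqrt(N) groups."""
        k = max(1, round(math.sqrt(n_nodes)))  # number of affinity groups
        return _h(identifier) % k

    # With ~10,000 nodes there are ~100 groups; a node and the resources it
    # should index land in groups chosen by this hash.
    g_node = affinity_group("node-110", n_nodes=10_000)
    g_res  = affinity_group("cnn.com", n_nodes=10_000)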
Kelips
[Figure: affinity groups 0, 1, 2, …, √N − 1, with peer membership assigned by consistent hash and a set of members per affinity group. Node 110's affinity group view:
    id    hbeat   rtt
    30    234     90ms
    230   322     30ms
Affinity group pointers connect 110 to fellow group members 30 and 230.]
Kelips
[Figure: same affinity-group picture; node 202 in group 2 is a "contact" for node 110. In addition to its affinity group view (id/hbeat/rtt: 30/234/90ms, 230/322/30ms), node 110 keeps a Contacts table:
    group   contactNode
    …       …
    2       202
Contact pointers reach into other groups. "cnn.com" maps to group 2, so 110 tells group 2 to "route" inquiries about cnn.com to it.]
Kelips
[Figure: node 110's full soft state: the affinity group view (id/hbeat/rtt), the Contacts table (group 2 → contactNode 202), and a Resource Tuples table:
    resource   info
    …          …
    cnn.com    110
A gossip protocol replicates this data cheaply.]
How it works
• Kelips is entirely gossip based!
  • Gossip about membership
  • Gossip to replicate and repair data
  • Gossip about "last heard from" time used to discard failed nodes
• Gossip "channel" uses fixed bandwidth
  • … fixed rate, packets of limited size
Gossip 101
• Suppose that I know something
• I'm sitting next to Fred, and I tell him
  • Now 2 of us "know"
• Later, he tells Mimi and I tell Anne
  • Now 4
• This is an example of a push epidemic (a tiny simulation follows below)
• Push-pull occurs if we exchange data
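A toy simulation of push gossip (illustrative; the parameters are arbitrary): in each round every node that knows the rumor tells one random peer, and the number that know it grows roughly exponentially until it saturates, after about log(N) rounds.

    import math
    import random

    def push_gossip_rounds(n: int, seed: int = 0) -> int:
        """Simulate push gossip over n nodes; return rounds until everyone knows."""
        random.seed(seed)
        infected = {0}                      # node 0 starts out knowing the rumor
        rounds = 0
        while len(infected) < n:
            targets = {random.randrange(n) for _ in infected}  # each teller picks a peer
            infected |= targets
            rounds += 1
        return rounds

    n = 1024
    print(push_gossip_rounds(n), "rounds vs log2(n) =", math.log2(n))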
Gossip scales very nicely
• Participants' loads independent of size
• Network load linear in system size
• Information spreads in log(system size) time
[Figure: % infected (0.0 to 1.0) vs. time]
Gossip in distributed systems
• We can gossip about membership
  • Need a bootstrap mechanism, but then discuss failures, new members
• Gossip to repair faults in replicated data
  • "I have 6 updates from Charlie"
• If we aren't in a hurry, gossip to replicate data too
Gossip about membership
• Start with a bootstrap protocol
  • For example, processes go to some web site and it lists a dozen nodes where the system has been stable for a long time
  • Pick one at random
• Then track "processes I've heard from recently" and "processes other people have heard from recently"
• Use push gossip to spread the word
Gossip about membership
• Until messages get full, everyone will know when everyone else last sent a message
  • With delay of log(N) gossip rounds…
• But messages will have bounded size
  • Perhaps 8K bytes
• Then use some form of "prioritization" to decide what to omit – but never send more, or larger, messages
• Thus: load has a fixed, constant upper bound except on the network itself, which usually has infinite capacity
Back to Kelips: Quick reminder
[Figure: the earlier Kelips picture again: affinity groups 0, 1, 2, …, √N − 1 formed by consistent hashing of peer membership; node 110's affinity group view (30 and 230 with heartbeats and RTTs) and its Contacts table pointing to node 202 in group 2]
How Kelips works
[Figure: Node 102's contact choices]
• Node 175 is a contact for Node 102 in some affinity group
• Hmm… Node 19 looks like a much better contact in affinity group 2
Gossip data stream
• Gossip about everything
• Heuristic to pick contacts: periodically ping contacts to check liveness, RTT… swap so-so ones for better ones.
Replication makes it robust
• Kelips should work even during disruptive episodes
  • After all, tuples are replicated to √N nodes
  • Query k nodes concurrently to overcome isolated crashes; this also reduces the risk that very recent data could be missed
• … we often overlook the importance of showing that systems work while recovering from a disruption
Chord can malfunction if the network partitions…
[Figure: a transient network partition separates the European and US nodes (IDs 0, 30, 64, 108, 123, 177, 199, 202, 241, 248, 255), which can reorganize into two disjoint rings]
… so, who cares?
• Chord lookups can fail… and it suffers from high overheads when nodes churn
  • Loads surge just when things are already disrupted… quite often, because of loads
  • And can't predict how long Chord might remain disrupted once it gets that way
• Worst case scenario: Chord can become inconsistent and stay that way
Control traffic load generated by churn
  Kelips:  None
  Chord:   O(changes)
  Pastry:  O(changes × nodes)?
Take-Aways?
• Surprisingly easy to superimpose a hash-table lookup onto a potentially huge distributed system!
• We've seen three O(log N) solutions and one O(1) solution (but Kelips needed more space)
• Sample applications?
  • Peer-to-peer file sharing
  • Amazon uses DHT for the shopping cart
  • CoDNS: A better version of DNS