Tapestry Deployment and
Fault-tolerant Routing
Ben Y. Zhao
L. Huang, S. Rhea, J. Stribling,
A. D. Joseph, J. D. Kubiatowicz
Berkeley Research Retreat
January 2003
Scaling Network Applications

• Complexities of global deployment
  – Network unreliability
  – Lack of administrative control over components
  – BGP slow convergence, redundancy unexploited
  – Constrains protocol deployment: multicast, congestion control
• Management of large-scale resources / components
  – Locate and utilize resources despite failures
Enabling Technology: DOLR
(Decentralized Object Location and Routing)

[Figure: the DOLR layer routes messages to objects identified by GUIDs (GUID1, GUID2), independent of where the objects reside in the network.]
What is Tapestry?

• DOLR driving OceanStore global storage (Zhao, Kubiatowicz, Joseph et al. 2000)
• Network structure
  – Nodes are assigned bit-sequence nodeIds from the namespace 0–2^160, based on some radix (e.g., 16)
  – Keys come from the same namespace
  – Keys dynamically map to one unique live node: the root
• Base API (see the sketch below)
  – Publish / Unpublish (ObjectID)
  – RouteToNode (NodeId)
  – RouteToObject (ObjectID)
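As a concrete but purely illustrative rendering of this base API, a Java interface might look like the sketch below; the names and types are assumptions, not Tapestry's actual classes.

```java
import java.math.BigInteger;

// Illustrative 160-bit identifier drawn from the shared node/object namespace.
class GUID {
    final BigInteger id;          // 0 <= id < 2^160
    GUID(BigInteger id) { this.id = id; }
}

// Hypothetical DOLR interface mirroring the base API listed above.
interface DOLR {
    void publish(GUID objectId);                     // advertise a locally stored object
    void unpublish(GUID objectId);                   // withdraw the advertisement
    void routeToNode(GUID nodeId, byte[] msg);       // deliver msg to the named node
    void routeToObject(GUID objectId, byte[] msg);   // deliver msg to some replica of the object
}
```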
Tapestry Mesh

[Figure: a Tapestry routing mesh. Nodes are labeled with hex NodeIDs (0xEF97, 0xEF32, 0xE399, 0xE530, 0xEF34, 0xEF37, 0xEF44, 0x099F, 0xE555, 0xFF37, 0xEFBA, 0xEF40, 0xEF31, 0x0999, 0xE324, 0xE932, 0x0921); edges are neighbor links annotated with their routing level (1–4).]
Object Location
Talk Outline

• Introduction
• Architecture
  – Node architecture
  – Node implementation
• Deployment Evaluation
• Fault-tolerant Routing
Single Node Architecture

[Figure: single-node architecture. Applications (decentralized file systems, application-level multicast, approximate text matching) sit on top of the Application Interface / Upcall API. Below it are Dynamic Node Management and the Routing Table & Router with its Object Pointer DB, layered over Network Link Management and the transport protocols.]
Single Node Implementation

[Figure: single-node implementation. Applications make API calls and receive upcalls through the Dynamic Tapestry Application Programming Interface. Dynamic Tapestry (enter/leave Tapestry, state maintenance, node insert/delete) and the Core Router (route to node / object) run above Routing Link Maintenance and Patchwork, which maintain a distance map via UDP pings. All components are stages on the SEDA event-driven framework, running with a Network Stage inside the Java Virtual Machine.]
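Each box in this diagram is an event-driven stage: it consumes events from an input queue and posts events to downstream stages. A minimal Java sketch of that staged pattern follows; this illustrates the style SEDA provides, not SEDA's real API.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a staged, event-driven component: each stage owns an input queue
// and a handler loop; other stages communicate only by enqueuing events.
abstract class Stage<E> implements Runnable {
    private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();

    void enqueue(E event) { queue.add(event); }       // upstream stages post events here

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                handle(queue.take());                  // process one event at a time
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();    // exit the loop on shutdown
            }
        }
    }

    protected abstract void handle(E event);
}

// Example wiring: a "router" stage that hands finished work to a "network" stage.
class NetworkStage extends Stage<byte[]> {
    protected void handle(byte[] packet) { /* send packet on the wire */ }
}
class RouterStage extends Stage<byte[]> {
    private final NetworkStage net;
    RouterStage(NetworkStage net) { this.net = net; }
    protected void handle(byte[] msg) { net.enqueue(msg); } // route, then forward downstream
}
```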
Deployment Status

• C simulator
  – Packet-level simulation
  – Scales up to 10,000 nodes
• Java implementation
  – 50,000 semicolons of Java, 270 class files
  – Deployed on a local-area cluster (40 nodes)
  – Deployed on the PlanetLab global network (~100 distributed nodes)
Talk Outline

• Introduction
• Architecture
• Deployment Evaluation
  – Micro-benchmarks
  – Stable network performance
  – Single and parallel node insertion
• Fault-tolerant Routing
Micro-benchmark Methodology

[Figure: sender and receiver, each running a control layer over Tapestry, connected by a LAN link.]

• Experiment run on a LAN with gigabit Ethernet
• Sender sends 60,001 messages at full speed
• Measure the inter-arrival time of the last 50,000 messages (measurement sketch below)
  – Discarding the first 10,000 messages removes cold-start effects
  – Averaging over 50,000 messages removes network jitter effects
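As a rough illustration of the measurement, a receiver-side sketch in Java might look like this (assumed code, not the benchmark's actual implementation): record arrival timestamps, skip the warm-up messages, and average the remaining inter-arrival gaps.

```java
// Receiver-side inter-arrival measurement sketch.
class InterArrivalStats {
    private static final int WARMUP = 10_000;     // cold-start messages to discard
    private static final int MEASURED = 50_000;   // gaps actually averaged

    private long count = 0;
    private long lastNanos = 0;
    private long totalGapNanos = 0;

    void onMessage() {
        long now = System.nanoTime();
        if (count > WARMUP) {                     // only gaps inside the measured window
            totalGapNanos += now - lastNanos;
        }
        lastNanos = now;
        count++;
    }

    double meanGapMicros() {
        return totalGapNanos / 1_000.0 / MEASURED;
    }
}
```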
Micro-benchmark Results

[Figures: message processing latency (ms, log scale from 0.01 to 100) and sustainable throughput (MB/s, 0 to ~30) vs. message size, from 0.01 KB to 10,000 KB.]

• Constant processing overhead of roughly 50 µs per message
• Latency dominated by byte copying
• For 5 KB messages, throughput is roughly 10,000 msgs/sec
Large Scale Methodology

• PlanetLab global network
  – 101 machines at 42 institutions in North America, Europe, and Australia (~60 machines utilized)
  – 1.26 GHz PIII (1 GB RAM) and 1.8 GHz P4 (2 GB RAM) machines
  – North American machines (2/3) are on Internet2
• Tapestry Java deployment
  – 6–7 nodes on each physical machine
  – IBM Java JDK 1.30
  – Node virtualization inside the JVM and SEDA
  – Scheduling between virtual nodes increases latency
Node to Node Routing

[Figure: routing RDP (min, median, 90th percentile) vs. internode RTT ping time in 5 ms buckets, RDP from 0 to 35; annotation: median = 31.5, 90th percentile = 135.]

• RDP is the ratio of end-to-end overlay routing latency to the shortest ping distance between the two nodes
• All node pairs were measured and placed into buckets by ping time
Object Location

[Figure: object-location RDP (min, median, 90th percentile) vs. client-to-object RTT ping time in 1 ms buckets, RDP from 0 to 25; annotation: 90th percentile = 158.]

• RDP here is the ratio of end-to-end object-location latency to the shortest ping distance between the client and the object's location
• Each node publishes 10,000 objects; lookups are performed on all objects
Latency to Insert Node

[Figure: integration latency (0–2000 ms) vs. size of the existing network (0–500 nodes).]

• Latency to dynamically insert a node into an existing Tapestry, as a function of the size of the existing network
• The humps are due to the expected filling of each routing level
Bandwidth to Insert Node

[Figure: control-traffic bandwidth (0–1.4 KB) vs. size of the existing network (0–400 nodes).]

• Bandwidth cost of dynamically inserting a node into the Tapestry, amortized over each node in the network
• Per-node bandwidth decreases with the size of the network
Parallel Insertion Latency

[Figure: latency to convergence (0–20,000 ms) vs. the ratio of insertion group size to network size (0–0.3); annotation: 90th percentile = 55,042.]

• Latency for a group of nodes to dynamically insert in unison into an existing Tapestry of 200 nodes
• Shown as a function of insertion group size / network size
Talk Outline

• Introduction
• Architecture
• Deployment Evaluation
• Fault-tolerant Routing
  – Tunneling through scalable overlays
  – Example using Tapestry
Adaptive and Resilient Routing

• Goals
  – Reachability as a service
  – Agility / adaptability in routing
  – Scalable deployment
  – Useful for all client endpoints
Existing Redundancy in DOLR/DHTs

• Fault detection via soft-state beacons
  – Beacons are periodically sent to each node in the routing table
    → overhead scales logarithmically with the size of the network
  – Worst case: 2^40 nodes, 160-bit IDs → 20 hex digits; 1 beacon/sec at 100 B each ≈ 240 kbps
  – Bandwidth can be reduced further with better techniques (Hakim, Shelley)
• Precomputed backup routes (see the sketch below)
  – Intermediate hops in an overlay path are flexible
  – Keep a list of backups for each outgoing hop
    (e.g., 3 node pointers for each route entry in Tapestry)
  – Backups are maintained by the node-membership algorithms (no additional overhead)
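A sketch of a route entry with precomputed backups and beacon-driven failover (illustrative Java; the class names, the 3-second timeout, and the rotation policy are assumptions, not Tapestry's actual data structures):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical routing-table entry: one primary plus backup next hops.
class RouteEntry {
    private static final long BEACON_TIMEOUT_MS = 3_000;        // assumed soft-state timeout
    private final Deque<Neighbor> nextHops = new ArrayDeque<>(); // primary first, then backups

    void addNeighbor(Neighbor n) { nextHops.addLast(n); }        // e.g., up to 3 per entry

    // Called when a soft-state beacon arrives from a neighbor.
    void onBeacon(Neighbor n, long nowMs) { n.lastBeaconMs = nowMs; }

    // Pick the first next hop whose beacons are still fresh; demote stale ones.
    Neighbor liveNextHop(long nowMs) {
        for (int i = 0; i < nextHops.size(); i++) {
            Neighbor head = nextHops.peekFirst();
            if (nowMs - head.lastBeaconMs <= BEACON_TIMEOUT_MS) return head;
            nextHops.addLast(nextHops.pollFirst());              // rotate the stale hop to the back
        }
        return null;                                             // no live route: fall back to surrogate routing
    }
}

class Neighbor {
    final String nodeId;                                         // hex NodeID, e.g. "0xEF34"
    long lastBeaconMs;
    Neighbor(String nodeId, long nowMs) { this.nodeId = nodeId; this.lastBeaconMs = nowMs; }
}
```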
Bootstrapping Non-overlay Endpoints

• Goal
  – Allow non-overlay nodes to benefit from the overlay
  – Endpoints communicate via overlay proxies
• Example: legacy nodes L1, L2 (see the name-assignment sketch below)
  – Li registers with a nearby overlay proxy Pi
  – Pi assigns Li a proxy name Di, such that Di is the closest possible unique name to Pi
    (e.g., start with Pi and increment for each registered node)
  – L1 and L2 exchange their new proxy names
  – Messages are routed to the legacy nodes using their proxy names
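A minimal sketch of the proxy-name assignment, assuming names live in the same 160-bit namespace and "closest possible unique name" simply means incrementing from the proxy's own ID until an unused name is found:

```java
import java.math.BigInteger;
import java.util.HashSet;
import java.util.Set;

// Hypothetical proxy-side registration for legacy endpoints.
class ProxyNamer {
    private static final BigInteger NAMESPACE = BigInteger.ONE.shiftLeft(160); // 2^160 IDs
    private final BigInteger proxyId;                   // Pi's own nodeId
    private final Set<BigInteger> assigned = new HashSet<>();

    ProxyNamer(BigInteger proxyId) {
        this.proxyId = proxyId;
        assigned.add(proxyId);                          // the proxy keeps its own name
    }

    // Assign Di for a registering legacy endpoint: the nearest free name above Pi,
    // wrapping around the namespace if necessary.
    BigInteger register(String legacyEndpoint) {
        BigInteger candidate = proxyId;
        while (assigned.contains(candidate)) {
            candidate = candidate.add(BigInteger.ONE).mod(NAMESPACE);
        }
        assigned.add(candidate);
        return candidate;                               // Di, routable like any object/node ID
    }
}
```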
Tunneling through an Overlay

[Figure: legacy nodes L1 and L2 attach to overlay proxies P1 and P2, which publish them inside the overlay network under the names D1 and D2.]

• L1 registers with P1 as document D1
• L2 registers with P2 as document D2
• Traffic tunnels through the overlay via the proxies
Failure Avoidance in Tapestry
Routing Convergence
Bandwidth Overhead for Misroute

[Figure: proportional increase in path latency for one misroute onto a secondary route, vs. the position of the branch (hops 0–4), for paths with end-to-end latencies of 20 ms, 26.66 ms, 60 ms, 80 ms, and 93.33 ms; the proportional increase ranges from 0 up to about 1.8.]

• Status: under deployment on PlanetLab
For more information …
Tapestry and related projects (and these slides):
http://www.cs.berkeley.edu/~ravenben/tapestry
OceanStore:
http://oceanstore.cs.berkeley.edu
Related papers:
http://oceanstore.cs.berkeley.edu/publications
http://www.cs.berkeley.edu/~ravenben/publications
[email protected]
Backup Slides Follow…
The Naming Problem

• Tracking modifiable objects
  – Example: email, Usenet articles, tagged audio
  – Goal: verifiable names that are robust to small changes
• Current approaches
  – Content-based hashed naming
  – Content-independent naming
• ADOLR Project (Feng Zhou, Li Zhuang)
  – Approximate names based on feature vectors
  – Leveraged to match / search for similar content
Approximation Extension to DOLR/DHT

• Publication using features
  – Objects are described using a set of features:
    AO ≡ Feature Vector (FV) = {f1, f2, f3, …, fn}
  – Locating AOs in the DOLR ≡ finding all AOs in the network with |FV* ∩ FV| ≥ Thres, where 0 < Thres ≤ |FV|
• Driving application: decentralized spam filter (see the matching sketch below)
  – Humans are the only fool-proof spam filter
  – Users mark spam and publish it by its text feature vector
  – Incoming mail is filtered by an FV query on the P2P overlay
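A toy sketch of the threshold match on feature vectors (illustrative only; real ADOLR feature extraction and overlay lookup are more involved):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical approximate-match check: does a candidate object's feature
// vector overlap the query's feature vector in at least `thres` features?
class FeatureMatch {
    static boolean matches(Set<String> queryFV, Set<String> candidateFV, int thres) {
        Set<String> overlap = new HashSet<>(queryFV);
        overlap.retainAll(candidateFV);              // FV* ∩ FV
        return overlap.size() >= thres;              // |FV* ∩ FV| >= Thres
    }
}
```

Usage: with 10-feature vectors and Thres = 3, an incoming email whose vector shares 3 or more features with a published spam vector would be flagged.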
Evaluation on Real Emails

• Accuracy of feature-vector matching on real emails
  – Spam: 29,631 junk emails from www.spamarchive.org (14,925 unique; 86% of spam ≤ 5 KB)
  – Normal emails: 9,589 total = 50% newsgroup posts, 50% personal emails

"Similarity" test (3,440 modified copies of 39 emails):

  THRES   Detected   Fail   %
  3/10    3356       84     97.56
  4/10    3172       268    92.21

"False positive" test (9,589 normal × 14,925 spam pairs):

  Match    FP pairs   Probability
  2/10     4          2.79e-8
  >2/10    0          0

• Status
  – Prototype implemented as an Outlook plug-in
  – Interfaces with the Tapestry overlay
  – http://www.cs.berkeley.edu/~zf/spamwatch
State of the Art Routing

• High-dimensionality and coordinate-based P2P routing
  – Tapestry, Pastry, Chord, CAN, etc.
  – Sub-linear storage and number of overlay hops per route
  – Properties depend on a random name distribution
  – Optimized for uniform, mesh-style networks
Reality

• Transit-stub topology, disparate resources per node
• Result: inefficient inter-domain routing (bandwidth, latency)

[Figure: a source S and receiver R in a transit-stub topology (AS-1, AS-2, AS-3), connected through a P2P overlay network.]
Landmark Routing on P2P

• Brocade
  – Exploits non-uniformity
  – Minimizes wide-area routing hops / bandwidth
• Secondary overlay on top of Tapestry
  – Select super-nodes by administrative domain
  – Divide the network into cover sets
  – Super-nodes form a secondary Tapestry
  – Advertise each cover set as local objects
  – Brocade routes directly into the destination's local network, then resumes P2P routing
Brocade Routing

[Figure: a Brocade layer above the P2P network. The original route from S to D crosses AS-1, AS-2, and AS-3 hop by hop through the P2P network; the Brocade route jumps across the Brocade layer toward D's domain before resuming local routing.]
Overlay Routing Networks

• CAN: Ratnasamy et al. (ACIRI / UCB)
  – Uses a d-dimensional coordinate space to implement a distributed hash table
  – Routes to the neighbor closest to the destination coordinate
  – Properties: fast insertion / deletion; constant-sized routing state; unconstrained number of hops; overlay distance not proportional to physical distance
• Chord: Stoica, Morris, Karger, et al. (MIT / UCB)
  – Linear namespace modeled as a circular address space
  – "Finger table" points to a logarithmic number of increasingly remote hosts
  – Properties: simplicity in algorithms; fast fault-recovery; log2(N) hops and routing state; overlay distance not proportional to physical distance
• Pastry: Rowstron and Druschel (Microsoft / Rice)
  – Hypercube routing similar to PRR97
  – Objects replicated to servers by name
  – Properties: fast fault-recovery; log(N) hops and routing state; data replication required for fault-tolerance
Routing in Detail

Example: octal digits, 2^12 namespace, routing from 2175 to 0157:

  2175 → 0880 → 0123 → 0154 → 0157

Each hop looks up the next digit of the destination ID in its routing table, so every hop matches one more digit of 0157 (see the code sketch below).

[Figure: the routing tables at 2175, 0880, 0123, and 0154, each indexed by digits 0–7, with the entry used for the next hop highlighted.]
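A sketch of that digit-by-digit next-hop choice (illustrative Java; a real Tapestry routing table also carries network distances and backup pointers):

```java
// Hypothetical routing step: the table at each node is indexed by
// [level][digit]; level i holds neighbors that already share the first i
// digits with the local node.
class DigitRouter {
    private final String localId;           // e.g. "2175" (octal digits)
    private final String[][] table;         // table[level][digit] -> neighbor ID, or null

    DigitRouter(String localId, String[][] table) {
        this.localId = localId;
        this.table = table;
    }

    // Number of leading digits shared between this node and the destination.
    private int sharedPrefix(String dest) {
        int i = 0;
        while (i < localId.length() && localId.charAt(i) == dest.charAt(i)) i++;
        return i;
    }

    // Next hop: the neighbor matching one more digit of the destination, e.g.
    // 2175 -> 0880 -> 0123 -> 0154 -> 0157 for destination 0157.
    String nextHop(String dest) {
        int level = sharedPrefix(dest);
        if (level == dest.length()) return localId;   // we are the destination
        int digit = dest.charAt(level) - '0';
        return table[level][digit];                    // null would trigger surrogate routing
    }
}
```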
Publish / Lookup Details

• Publish an object with a given ObjectID (see the sketch below):
  // route toward the "virtual root", whose ID = ObjectID
  for (i = 0; i < Log2(N); i += j) {     // defines the hierarchy; j = bits per digit (for hex digits, j = 4)
      insert a location entry at the nearest node matching on the last i bits
      if no match is found, deterministically choose an alternative node
      the real root node has been found when no external routes are left
  }
• Lookup an object:
  // traverse the same path toward the root as publish, but search for an entry at each node
  for (i = 0; i < Log2(N); i += j) {
      search for a cached object-location entry
      once found, route via IP or Tapestry to the object
  }
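A compact sketch of publish and lookup over such a mesh (illustrative Java; the Mesh interface, nextHop helper, and per-node PointerDb are assumptions standing in for Tapestry's real machinery):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Per-node object-pointer DB: objectId -> set of server node IDs known here.
class PointerDb {
    private final Map<String, Set<String>> pointers = new HashMap<>();
    void add(String objectId, String serverId) {
        pointers.computeIfAbsent(objectId, k -> new HashSet<>()).add(serverId);
    }
    Set<String> lookup(String objectId) { return pointers.get(objectId); }
}

// Hypothetical walk toward the object's root, dropping (publish) or checking
// (lookup) pointers at every hop.
class PublishLookup {
    interface Mesh {                       // assumed helpers, not Tapestry's real API
        String nextHop(String fromNode, String towardId);  // returns fromNode at the root
        PointerDb dbAt(String node);
    }

    static void publish(Mesh mesh, String serverNode, String objectId) {
        String cur = serverNode;
        while (true) {
            mesh.dbAt(cur).add(objectId, serverNode);      // leave a pointer at each hop
            String next = mesh.nextHop(cur, objectId);
            if (next.equals(cur)) return;                  // reached the root
            cur = next;
        }
    }

    static Set<String> lookup(Mesh mesh, String clientNode, String objectId) {
        String cur = clientNode;
        while (true) {
            Set<String> servers = mesh.dbAt(cur).lookup(objectId);
            if (servers != null) return servers;           // found a cached pointer
            String next = mesh.nextHop(cur, objectId);
            if (next.equals(cur)) return null;             // root reached, object unknown
            cur = next;
        }
    }
}
```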
Dynamic Insertion

1. Build up the new node N's routing map (see the sketch after this list)
   – Send messages to each hop along the path from the gateway to the current node N' that best approximates N
   – The i-th hop along the path sends its i-th-level route table to N
   – N optimizes those tables where necessary
2. Notify, via acknowledged multicast, the nodes with null entries for N's ID
3. Each notified node issues republish messages for the relevant objects
4. Notify local neighbors
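A rough sketch of step 1 (illustrative Java; the Hop interface and levelTable helper are assumptions about what each hop along the path can be asked for):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bootstrap of a joining node's routing table: walk the route
// from the gateway toward the new node's own ID, taking level i of the
// i-th hop's table.
class RoutingMapBuilder {
    interface Hop {                                   // assumed remote interface
        String[] levelTable(int level);               // that hop's i-th-level route entries
        Hop nextHopToward(String id);
    }

    static List<String[]> build(Hop gateway, String newNodeId, int levels) {
        List<String[]> table = new ArrayList<>();
        Hop cur = gateway;
        for (int i = 0; i < levels; i++) {
            table.add(cur.levelTable(i));             // i-th hop contributes its level-i table
            cur = cur.nextHopToward(newNodeId);       // move one digit closer to N
        }
        return table;                                 // N then optimizes entries (e.g., by ping distance)
    }
}
```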
Dynamic Insertion Example

[Figure: dynamic insertion example. A new node 0x143FE joins through gateway 0xD73FF; existing nodes (0x779FE, 0x244FE, 0xA23FE, 0x6993E, 0x973FE, 0xC035E, 0x243FE, 0x4F990, 0xB555E, 0x0ABFE, 0x704FE, 0x913FE, 0x09990, 0x5239E, 0x71290, …) are shown with their neighbor links at routing levels 1–4.]
Dynamic Root Mapping

• Problem: choosing a root node for every object
  – Deterministic over network changes
  – Globally consistent
• Assumptions
  – All nodes with the same matching suffix contain the same null/non-null pattern in the next level of their routing maps
  – Requires consistent knowledge of nodes across the network
PRR Solution

• Given a desired ID N:
  – Find the set S of existing network nodes matching the most suffix digits with N
  – Choose Si = the node in S with the highest-valued ID
• Issues
  – The mapping must be generated statically using global knowledge
  – It must be kept as hard state in order to operate in a changing environment
  – The mapping is not well distributed: many nodes get no mappings
Tapestry Solution

• Globally consistent distributed algorithm (sketched below):
  – Attempt to route to the desired ID N
  – Whenever a null entry is encountered, choose the next "higher" non-null pointer entry
  – If the current node S holds the only non-null pointer left in the rest of the route map, terminate the route: f(N) = S
• Assumes:
  – Routing maps across the network are up to date
  – Null/non-null properties are identical at all nodes sharing the same suffix
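A sketch of that surrogate-routing rule at a single level of the route map (illustrative Java; entries is one level of the local table indexed by digit, and the local node occupies its own digit's slot):

```java
// Hypothetical surrogate-routing step: follow the desired digit if possible,
// otherwise take the next "higher" non-null entry (wrapping around); if no
// other entry exists, this node is the object's root / surrogate.
class SurrogateRouter {
    /** @return index of the entry to follow, or -1 if this node terminates the route. */
    static int chooseEntry(String[] entries, int desiredDigit, int localDigit) {
        int radix = entries.length;
        for (int step = 0; step < radix; step++) {
            int d = (desiredDigit + step) % radix;      // desired digit first, then next higher
            if (d == localDigit) continue;              // our own slot means "stay here", not a hop
            if (entries[d] != null) return d;           // found a non-null pointer to follow
        }
        return -1;                                      // only our own slot remains: f(N) = this node
    }
}
```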
Analysis

Globally consistent deterministic mapping
• Null entry ⇒ no node in the network with that suffix
  ⇒ a consistent map ⇒ identical null entries across the route maps of nodes with the same suffix
• Additional hops compared to the PRR solution:
  – Reduces to the coupon-collector problem
  – Assuming a random name distribution, with n·ln(n) + cn entries, P(all coupons) = 1 − e^(−c)
  – For n = b and c = b − ln(b): P(b^2 nodes left) = 1 − b/e^b, with b/e^b ≈ 1.8 × 10^(−6)
  – Number of additional hops ≤ log_b(b^2) = 2
• Distributed algorithm with minimal additional hops
Dynamic Mapping Border Cases

• A node vanishes undetected
  – Routing proceeds on the invalid link and fails
  – With no backup router, routing proceeds to surrogate routing
• A node enters the network undetected; messages keep going to the surrogate node instead
  – The new node checks with the surrogate after all such nodes have been notified
  – Route info held at the surrogate is moved to the new node
SPAA slides follow
Network Assumption

• Nearest neighbor is hard in a general metric space
• Assume the following:
  – A ball of radius 2r contains only a factor of c more nodes than a ball of radius r
  – Also, b > c^2
  – [Both assumed by PRR]
• Start knowing one node; allow distance queries
Algorithm Idea

• Call a node a level-i node if it matches the new node in i digits.
• The whole network is contained in a forest of trees rooted at the highest possible level i_max.
• Let list[i_max] contain the roots of all trees. Then, starting at i_max, while i > 1:
  – list[i-1] = getChildren(list[i])
  – Certainly, list[i] contains the level-i neighbors.
We Reach The Whole Network

[Figure: the same Tapestry mesh of hex NodeIDs (0xEF97, 0xEF32, 0xE399, …) with its level-1 through level-4 links; expanding getChildren level by level from the roots eventually reaches every node in the network.]
The Real Algorithm

• The simplified version reaches ALL nodes in the network.
• But far-away nodes are not likely to have close descendants
  ⇒ trim the list at each step.
• New version (sketched below): while i > 1
  – list[i-1] = getChildren(list[i])
  – Trim(list[i-1])
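A sketch of this loop (illustrative Java; getChildren, the distance query, and the keep-the-k-closest trim are assumed helpers):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical neighbor search for a joining node: walk down the levels,
// expanding each candidate's children and keeping only the k closest.
class NeighborSearch {
    interface Net {                                               // assumed helpers
        List<String> getChildren(List<String> nodes, int level);  // level-(i-1) nodes under these
        double distance(String node, String newNode);             // network distance query
    }

    static List<String> closestCandidates(Net net, String newNode,
                                          List<String> roots, int maxLevel, int k) {
        List<String> list = roots;                     // list[i_max]: roots of all trees
        for (int i = maxLevel; i > 1; i--) {
            list = net.getChildren(list, i);           // list[i-1] = getChildren(list[i])
            list = list.stream()                       // Trim(list[i-1]): keep the k closest
                       .sorted(Comparator.comparingDouble(n -> net.distance(n, newNode)))
                       .limit(k)
                       .collect(Collectors.toList());
        }
        return list;                                   // candidates for the level-1 neighbor table
    }
}
```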
How to Trim

• Consider a circle of radius r containing at least one level-i node.
• A level-(i-1) node in the little circle must point to a level-i node in the big circle (radius < 2r).
• Want: list[i] has three times the radius of list[i-1], and list[i-1] contains at least one level-i node.

[Figure: two concentric circles of radius r and < 2r illustrating the argument.]
True in Expectation

• Want: list[i] has three times the radius of list[i-1], and list[i-1] contains one level-i node.
• Suppose list[i-1] has k elements and radius r:
  – Expect the ball of radius 4r to contain kc^2/b nodes
  – The ball of radius 3r contains fewer than k nodes, so keeping k elements all along is enough
• To work with high probability, k = O(log n)
Steps of Insertion

• Find the node with the closest matching ID (the surrogate) and get a preliminary neighbor table
  – If the surrogate's table is hole-free, so is this one
• Find all nodes that need to put the new node in their routing tables, via multicast
  – w.h.p. the nodes contacted while building the table are the only ones that need to update their own tables
• Optimize the neighbor table
• Need:
  – No fillable holes
  – Keep objects reachable
Need-to-know nodes

• Need-to-know = a node with a hole in its neighbor table that is filled by the new node
• If 1234 is the new node and no 123? nodes existed, we must notify the 12?? nodes
• Acknowledged multicast reaches all matching nodes
Acknowledged Multicast Algorithm

Locates and contacts all nodes with a given prefix (sketch below):
• Create a tree based on IDs as we go
• Nodes send acks once all of their children have been reached
• The starting node knows when all nodes have been reached

[Figure: a 543?? node forwards to one node in each of 5430?, 5431?, 5434?, …, if such nodes exist; leaves such as 54340 and 54345 ack back up the tree.]
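A sketch of the acknowledged multicast (illustrative Java; the oneNodePerExtension helper is an assumption, and synchronous recursive calls stand in for remote sends and their acks):

```java
import java.util.List;

// Hypothetical acknowledged multicast: recursively fan out to one node per
// next-digit extension of the prefix, and return only when the whole subtree
// has acknowledged.
class AckedMulticast {
    interface Net {                                          // assumed helpers
        List<String> oneNodePerExtension(String prefix);     // e.g. "543" -> reps of 5430?, 5431?, ...
        void deliver(String node, byte[] msg);               // local delivery at a matching node
    }

    /** Returns once every node with the given prefix has received msg and acked. */
    static void multicast(Net net, String prefix, String hereId, byte[] msg) {
        net.deliver(hereId, msg);                            // handle the message locally
        if (prefix.length() == hereId.length()) return;      // full ID reached: leaf of the tree
        for (String child : net.oneNodePerExtension(prefix)) {
            // In a real system this is a remote call; its return is the child's ack,
            // sent only after the child's own subtree has acked. A real implementation
            // would also avoid re-delivering to this node for its own branch.
            multicast(net, prefix + child.charAt(prefix.length()), child, msg);
        }
    }
}
```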