Peer-to-Peer File Systems
Presented by: Serge Kreiker
“P2P” in the Internet

Napster: a peer-to-peer file sharing application
- Allowed Internet users to exchange files directly
- A simple idea … hugely successful
- The fastest-growing Web application: 50 million+ users in January 2001
- Shut down in February 2001
- Similar systems/startups followed in rapid succession
Napster, Gnutella, Freenet
Napster
[Diagram: a client asks the central Napster server where a file can be found; the server returns the address of a peer holding it (e.g., 128.1.2.3), and the client downloads the file directly from that peer.]
Gnutella
[Diagram: a query such as "xyz.mp3 ?" is flooded from peer to peer until a node holding the file answers; there is no central server.]
So Far
- Centralized: Napster
  - Table size – O(n)
  - Number of hops – O(1)
- Flooded queries: Gnutella
  - Table size – O(1)
  - Number of hops – O(n)
Storage management systems: challenges
- Distributed: nodes have identical capabilities and responsibilities
- Anonymity
- Storage management: spread the storage burden evenly
- Tolerate unreliable participants
- Robustness: surviving massive failures
- Resilience to DoS attacks, censorship, and other node failures
- Cache management: cache additional copies of popular files
Routing challenges
- Efficiency: O(log N) messages per lookup, where N is the total number of servers
- Scalability: O(log N) state per node
- Robustness: surviving massive failures
We are going to look at
- PAST (Rice and Microsoft Research; routing substrate: Pastry)
- CFS (MIT; routing substrate: Chord)
What is PAST?
- An archival storage and content distribution utility
- Not a general-purpose file system
- Stores multiple replicas of files
- Caches additional copies of popular files in the local file system
How it works
- Built over a self-organizing, Internet-based overlay network
- Based on the Pastry routing scheme
- Offers persistent storage services for replicated, read-only files
- Owners can insert/reclaim files
- Clients only perform lookups
PAST Nodes
- The collection of PAST nodes forms an overlay network
- Minimally, a PAST node is an access point
- Optionally, it contributes storage and participates in routing
PAST operations
- fileId = Insert(name, owner-credentials, k, file);
- file = Lookup(fileId);
- Reclaim(fileId, owner-credentials);
(A usage sketch of this API follows below.)
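To make the API shape concrete, here is a minimal sketch of the three operations as an abstract interface. The Python class, method names, and type hints are illustrative assumptions; PAST itself is not a Python library.

```python
# Minimal sketch of the PAST client API (illustrative names and types only).
from abc import ABC, abstractmethod

class PastApi(ABC):
    @abstractmethod
    def insert(self, name: str, owner_credentials: bytes, k: int, file: bytes) -> bytes:
        """Store k replicas of `file`; return the fileId."""

    @abstractmethod
    def lookup(self, file_id: bytes) -> bytes:
        """Retrieve a copy of the file from a nearby node holding a replica."""

    @abstractmethod
    def reclaim(self, file_id: bytes, owner_credentials: bytes) -> None:
        """Reclaim the file's storage; only weakly consistent with concurrent lookups."""
```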
Insertion
- The fileId is computed as the secure hash of the file name, the owner's public key, and a salt (see the sketch below)
- The file is stored on the k nodes whose nodeIds are numerically closest to the 128 most significant bits (msb) of the fileId
- How are key IDs mapped to node IDs? Use Pastry.
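As a rough illustration of these two steps, the sketch below derives a 128-bit fileId from a hash and selects the k numerically closest nodeIds. The hash function, distance metric, and helper names are assumptions for illustration, not PAST's actual code.

```python
# Sketch: derive a fileId and choose the k nodes with numerically closest nodeIds.
import hashlib

def file_id(name: str, owner_pubkey: bytes, salt: bytes) -> int:
    digest = hashlib.sha256(name.encode() + owner_pubkey + salt).digest()
    return int.from_bytes(digest[:16], "big")        # keep the 128 most significant bits

def k_closest(node_ids: list, fid: int, k: int) -> list:
    ring = 1 << 128                                   # circular 128-bit namespace
    def distance(n: int) -> int:
        d = abs(n - fid)
        return min(d, ring - d)                       # numeric distance with wrap-around
    return sorted(node_ids, key=distance)[:k]
```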
Insert contd
- The required storage is debited against the owner's storage quota
- A file certificate is returned
  - Signed with the owner's private key
  - Contains: fileId, hash of the content, replication factor, and other fields
- The file and certificate are routed via Pastry
- Each of the k replica-storing nodes attaches a store receipt
- An ack is sent back after all k nodes have accepted the file
(A sketch of the certificate and receipt structures follows below.)
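To make the metadata flow concrete, a small sketch of the two records mentioned above. The field names are illustrative assumptions; the actual PAST certificate and receipt contain additional fields.

```python
# Sketch of the insert metadata; field names are illustrative.
from dataclasses import dataclass

@dataclass
class FileCertificate:
    file_id: bytes           # secure hash of (name, owner's public key, salt)
    content_hash: bytes      # hash of the file content
    replication_factor: int  # k
    signature: bytes         # produced with the owner's private key

@dataclass
class StoreReceipt:
    file_id: bytes
    node_id: bytes           # one of the k replica-storing nodes
    signature: bytes         # lets the client check that k distinct nodes stored the file
```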
Insert file with fileId=117, k=4
[Diagram: 1. Node 200 inserts file 117; the insert message is routed from the source (200) toward the nodes 122, 124, 120, and 115. 2. Node 122 is one of the 4 nodes numerically closest to 117; node 125 was reached first because it is the node nearest to 200.]
Lookup & Reclaim
- Lookup: Pastry locates a "near" node that has a copy and retrieves it
- Reclaim: weak consistency
  - After a reclaim, a lookup is no longer guaranteed to retrieve the file
  - But it is not guaranteed that the file is no longer available
Pastry: Peer-to-peer routing
- Provides generic, scalable indexing, data location, and routing
- Inspired by Plaxton's algorithm (used in web content distribution, e.g., Akamai) and landmark hierarchy routing
Goals
- Efficiency
- Scalability
- Fault resilience
- Self-organization (completely decentralized)
Pastry: How does it work?
- Each node has a unique nodeId.
- Each message has a key.
- Both are uniformly distributed and lie in the same namespace.
- A Pastry node routes a message to the node whose nodeId is closest to the key.
- The number of routing steps is O(log N).
- Pastry takes network locality into account.
- PAST uses the fileId as the key and stores the file on the k closest nodes.
Pastry: Node ID space
- Each node is assigned a 128-bit node identifier, its nodeId.
- The nodeId is assigned randomly when the node joins the system (e.g., as the SHA-1 hash of its IP address or its public key).
- Nodes with adjacent nodeIds are therefore diverse in geography, ownership, network attachment, etc.
- nodeIds and keys are interpreted in base 2^b; b is a configuration parameter with a typical value of 4.
Pastry: Node ID space
[Diagram: the 128-bit identifier space (at most 2^128 nodes) forms a circular namespace. A nodeId is a sequence of L digits (positions 0 … L−1), each b = 128/L bits wide, i.e., a sequence of L base-2^b digits (see the digit sketch below).]
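A small sketch of this digit view, splitting a 128-bit identifier into base-2^b digits with the typical b = 4; the function name and example value are illustrative.

```python
# Sketch: interpret a 128-bit nodeId as a sequence of base-2^b digits (b = 4 here).
def to_digits(node_id: int, b: int = 4, bits: int = 128) -> list:
    levels = bits // b                                   # L = 128 / b digit positions
    mask = (1 << b) - 1
    return [(node_id >> (bits - b * (i + 1))) & mask for i in range(levels)]

digits = to_digits(0x37A0F1CE << 96)                     # arbitrary example identifier
assert len(digits) == 32 and digits[0] == 0x3            # most significant digit first
```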
Pastry: Node State (1)
- Each node maintains a routing table R, a neighborhood set M, and a leaf set L.
- The routing table is organized into log_{2^b}(N) rows with 2^b − 1 entries each.
- Each entry in row n contains the IP address of a nearby node whose nodeId matches the present node's nodeId in the first n digits but differs in digit n+1.
- The choice of b is a trade-off between the size of the routing table and the length of routes.
Pastry: Node State (2)
- Neighborhood set M: the nodeIds and IP addresses of the |M| nodes nearest to the present node according to the network proximity metric.
- Leaf set L: the |L| nodes whose nodeIds are closest to the present node, divided in two: |L|/2 with the closest larger nodeIds and |L|/2 with the closest smaller nodeIds.
- Typical values for |L| and |M| are 2^b.
- Example: nodeId = 10233102, b = 2, nodeIds are 16 bits; all numbers in base 4.
Pastry: Routing Requests
Route(my-id, key-id, message):
  if key-id is within the range of my leaf set:
    forward to the numerically closest node in the leaf set;
  else if the routing table contains a node-id that shares a longer prefix with key-id than my-id does:
    forward to that node;
  else:
    forward to a known node-id that shares a prefix with key-id at least as long as my-id's, but is numerically closer to key-id.
Routing takes O(log N) messages (a sketch of this procedure follows below).
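A minimal sketch of that decision, assuming nodeIds are equal-length tuples of base-2^b digits and the routing table is a list of rows mapping a digit to a node; these data-structure choices and helper names are illustrative, not Pastry's actual implementation.

```python
# Sketch of one Pastry routing step over simplified stand-in data structures.
def as_int(digits, b=4):
    value = 0
    for d in digits:
        value = (value << b) | d
    return value

def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(my_id, key, leaf_set, routing_table):
    dist = lambda n: abs(as_int(n) - as_int(key))
    # 1. Key within the leaf-set range: deliver to the numerically closest node.
    if leaf_set and min(leaf_set) <= key <= max(leaf_set):
        return min(leaf_set + [my_id], key=dist)
    # 2. Try a routing-table entry sharing a longer prefix with the key.
    p = shared_prefix_len(my_id, key)
    entry = routing_table[p].get(key[p]) if p < len(routing_table) else None
    if entry is not None:
        return entry
    # 3. Rare case: any known node with an equally long (or longer) shared prefix
    #    that is numerically closer to the key than this node is.
    known = leaf_set + [n for row in routing_table for n in row.values()]
    closer = [n for n in known if shared_prefix_len(n, key) >= p and dist(n) < dist(my_id)]
    return min(closer, key=dist) if closer else my_id
```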
[Routing example diagram: b = 2, 4-digit nodeIds, key = 1230. The message starts at source node 2331 and is forwarded through nodes 1331, 1211, and 1223 toward the destination node 1233, each hop matching a longer prefix of the key. Routing-table rows shown: X0: 0130, 1331, 2331, 3001; X1: 1030, 1123, 1211, 1301; X2: 1201, 1213, 1223, 1233; leaf set L: 1232, 1223, 1300, 1301.]
Pastry: Node Addition
- X: the joining node
- A: a node near X (in terms of network proximity)
- Z: the node with the nodeId numerically closest to X's
Routing table of X
- leaf-set(X) = leaf-set(Z)
- neighborhood-set(X) = neighborhood-set(A)
- routing table of X, row i = routing table of Ni, row i, where Ni is the i-th node encountered along the route from A to Z
- X notifies all nodes in leaf-set(X)
(A sketch of this state initialization follows below.)
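A rough sketch of how X assembles its initial state from A, Z, and the nodes met along the join route; the node attributes and the notify_joined call are illustrative assumptions.

```python
# Sketch: the joining node X copies state from A (network-nearby), Z (numerically
# closest), and the nodes N_1 ... N_k on the route from A to Z. Illustrative only.
def init_joining_node(x, a, z, route_nodes):
    x.leaf_set = list(z.leaf_set)                         # leaf-set(X) = leaf-set(Z)
    x.neighborhood_set = list(a.neighborhood_set)         # neighborhood-set(X) = neighborhood-set(A)
    for i, n_i in enumerate(route_nodes):
        if i < len(x.routing_table) and i < len(n_i.routing_table):
            x.routing_table[i] = dict(n_i.routing_table[i])   # row i comes from the i-th node met
    for member in x.leaf_set:
        member.notify_joined(x)                           # X announces itself to its leaf set
    return x
```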
[Diagrams: X joins the system, first stage. X sends a join message with key = X to a nearby node A; the message is routed, like an ordinary lookup, through intermediate nodes B and C to Z, the node with the nodeId numerically closest to X. Example values shown: A = 10, Z = 210, and a Lookup(216) routed through nodes N1, N2, N36.]
Pastry: Node Failures, Recovery
- Pastry relies on a soft-state protocol to deal with node failures
  - Neighboring nodes in the nodeId space periodically exchange keepalive messages
  - Nodes that are unresponsive for a period T are removed from leaf sets
  - A recovering node contacts its last known leaf set, updates its own leaf set, and notifies the members of its presence
- Randomized routing is used to deal with malicious nodes that could cause repeated query failures
Security
- Each PAST node and each user of the system holds a smartcard
- A private/public key pair is associated with each card
- Smartcards generate and verify certificates and maintain storage quotas
More on Security
- Smartcards ensure the integrity of nodeId and fileId assignments
- Store receipts prevent malicious nodes from creating fewer than k copies
- File certificates allow storage nodes and clients to verify the integrity and authenticity of stored content, and to enforce storage quotas
Storage Management
- Based on local coordination among nodes with nearby nodeIds
- Responsibilities:
  - Balance the free storage among nodes
  - Maintain the invariant that the replicas of each file are stored on the k nodes closest to its fileId
Causes for storage imbalance & solutions
- The number of files assigned to each node may vary
- The size of the inserted files may vary
- The storage capacity of PAST nodes differs
Solutions:
- Replica diversion
- File diversion
Replica diversion
- Recall: each node maintains a leaf set
  - The l nodes with nodeIds numerically closest to the given node
- If a node A cannot accommodate a copy locally, it considers replica diversion
- A chooses a node B in its leaf set and asks it to store the replica
  - A then enters a pointer to B's copy in its table and issues a store receipt
Policies for accepting a replica
- If (file size / remaining free storage) > t, reject the file
  - t is a fixed threshold
- t has different values for primary replicas (nodes among the k numerically closest) and diverted replicas (nodes in the same leaf set, but not among the k closest)
  - t(primary) > t(diverted)
(A sketch of this acceptance test follows below.)
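A small sketch of that acceptance test; the concrete threshold values below are made-up illustrations, not the values used in the PAST paper.

```python
# Sketch of the replica acceptance policy; threshold values are illustrative only.
T_PRIMARY = 0.1    # t(primary) > t(diverted): diverted replicas face a stricter test
T_DIVERTED = 0.05

def accept_replica(file_size: int, free_space: int, primary: bool) -> bool:
    t = T_PRIMARY if primary else T_DIVERTED
    # Reject files that are large relative to the node's remaining free storage.
    return free_space > 0 and (file_size / free_space) <= t
```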
File diversion
- When one of the k nodes declines to store a replica, replica diversion is attempted
- If the node chosen for the diverted replica also declines, the entire file is diverted
- A negative ack is sent; the client generates another fileId (using a different salt) and starts again
- After 3 rejections, the user is notified that the insert failed
Maintaining replicas
- Pastry uses keepalive messages and adjusts the leaf set after failures
- The same adjustment takes place when a node joins
- What happens to the copies stored by a failed node?
- What about the copies stored by a node that leaves or enters a new leaf set?
Maintaining replicas contd
- To maintain the invariant (k copies), the replicas have to be re-created in the previous cases
  - A big overhead
- Proposed solution for joins: lazy re-creation
  - First insert a pointer to the node that holds the replicas, then migrate them gradually
Caching
- The k replicas are maintained in PAST for availability
- The fetch distance is measured in overlay-network hops (which says little about distance in the real network)
- Caching is used to improve performance
Caching contd
- PAST nodes use the "unused" portion of their advertised disk space to cache files
- When storing a new primary or diverted replica, a node evicts one or more cached copies
- How it works: a file that is routed through a node by Pastry (on insert or lookup) is inserted into the local cache if its size is < c
  - c is a fraction of the current cache size
(A sketch of this caching decision follows below.)
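A minimal sketch of that decision; the fraction value and function name are illustrative assumptions, and eviction is omitted.

```python
# Sketch of PAST's cache-insertion test; `c_fraction` is an illustrative parameter.
def maybe_cache(cache: dict, current_cache_size: int, file_id: bytes,
                data: bytes, c_fraction: float = 0.05) -> bool:
    # Cache a file routed through this node (insert or lookup) only if it is
    # smaller than a fraction c of the current cache size.
    if current_cache_size <= 0 or len(data) >= c_fraction * current_cache_size:
        return False
    cache[file_id] = data   # eviction of cached copies to make room for replicas not shown
    return True
```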
Evaluation
- PAST is implemented in Java
- Network emulation using the Java VM
- 2 workloads (based on NLANR traces) for file sizes
- 4 normal distributions of node storage sizes
Key Results
STORAGE
- Replica and file diversion improve global storage utilization from 60.8% (without them) to 98%; insertion failures drop from 51% to below 5%.
- Caveat: the storage capacities used in the experiments are roughly 1000x below what might be expected in practice.
CACHING
- Routing hops with caching remain lower than without caching, even at 99% storage utilization.
- Caveat: the median file sizes are very low; caching performance will likely degrade if they are higher.
CFS: Introduction
- Peer-to-peer read-only storage system
- Decentralized architecture focusing mainly on
  - efficiency of data access
  - robustness
  - load balance
  - scalability
- Provides a distributed hash table for block storage
- Uses Chord to map keys to nodes
- Does not provide
  - anonymity
  - strong protection against malicious participants
- The focus is on providing an efficient and robust lookup and storage layer with simple algorithms.
CFS Software Structure
[Diagram: a CFS client stacks FS on DHash on Chord; each CFS server stacks DHash on Chord. The FS layer talks to DHash through a local API, while the DHash and Chord layers on different hosts communicate through an RPC API.]
CFS: Layer functionalities
- The client file system uses the DHash layer to retrieve blocks
- The server and client DHash layers use the client Chord layer to locate the servers that hold the desired blocks
- The server DHash layer is responsible for storing keyed blocks, maintaining proper levels of replication as servers come and go, and caching popular blocks
- The Chord layers interact in order to integrate looking up a block identifier with checking for cached copies of the block
- The client identifies the root block using a public key generated by the publisher
  - It uses the public key as the root block identifier to fetch the root block, and checks the validity of the block using its signature
- The file's inode key is obtained by the usual search through directory blocks; these contain the keys of the file inode blocks, which are used to fetch the inode blocks
- The inode block contains the block numbers and their corresponding keys, which are used to fetch the data blocks (see the traversal sketch below)
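A rough sketch of that root-block-to-data traversal, assuming a hypothetical `dhash.get(key)` primitive and simplified block layouts; the real CFS block formats and signature checks differ.

```python
# Sketch of the CFS fetch path: root block -> directory -> inode -> data blocks.
# `dhash.get(key)` and the block attributes are simplified stand-ins.
def fetch_file(dhash, publisher_pubkey: bytes, name: str) -> bytes:
    root = dhash.get(publisher_pubkey)              # root block is named by the public key
    if not verify_signature(root, publisher_pubkey):
        raise ValueError("invalid root block signature")
    directory = dhash.get(root.directory_key)       # content-hash key kept in the root block
    inode_key = directory.entries[name]             # directory maps file names to inode keys
    inode = dhash.get(inode_key)
    # The inode lists block numbers and their content-hash keys; fetch blocks in order.
    return b"".join(dhash.get(key) for _, key in sorted(inode.blocks))

def verify_signature(block, pubkey: bytes) -> bool:
    # Placeholder: a real client verifies the publisher's signature on the root block.
    return True
```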
CFS: Properties
- Decentralized control – no administrative relationship between servers and publishers.
- Scalability – lookup uses space and messages at most logarithmic in the number of servers.
- Availability – a client can retrieve data as long as at least one replica is reachable over the underlying network.
- Load balance – for large files, achieved by spreading blocks over a number of servers; for small files, blocks are cached at the servers involved in the lookup.
- Persistence – once data is inserted, it is available for the agreed-upon interval.
- Quotas – implemented by limiting the amount of data inserted by any particular IP address.
- Efficiency – the delay of file fetches is comparable to FTP, thanks to efficient lookup, pre-fetching, caching, and server selection.
Chord
- Consistent hashing
  - Maps a node's IP address + virtual host number to an m-bit node identifier.
  - Maps block keys into the same m-bit identifier space.
  - The node responsible for a key is the successor of the key's identifier, with wrap-around in the m-bit identifier space.
  - Consistent hashing balances the keys so that all nodes carry an equal share of the load with high probability, and keys move minimally as nodes enter and leave the network.
- For scalability, Chord uses a distributed version of consistent hashing in which nodes maintain only O(log N) state and use O(log N) messages per lookup with high probability.
(A consistent-hashing sketch follows below.)
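A minimal consistent-hashing sketch: SHA-1 produces 160-bit identifiers and a sorted list stands in for the ring. The class and helper names are illustrative; the distributed version with finger tables is sketched after the next slide.

```python
# Sketch: consistent hashing with successor placement on a circular identifier space.
import bisect
import hashlib

def ident(data: bytes) -> int:
    # 160-bit identifier (SHA-1), used for both node IDs and block keys.
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

class Ring:
    def __init__(self, nodes):
        # node identifier = hash(IP address + virtual host number)
        self.ids = sorted(ident(f"{ip}:{v}".encode()) for ip, v in nodes)

    def successor(self, key: bytes) -> int:
        # The node responsible for a key is the first node id >= the key's id,
        # wrapping around the end of the identifier space.
        k = ident(key)
        i = bisect.bisect_left(self.ids, k)
        return self.ids[i % len(self.ids)]

ring = Ring([("10.0.0.1", 0), ("10.0.0.2", 0), ("10.0.0.2", 1)])
owner = ring.successor(b"block-1234")
```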
Chord details
- Two data structures are used for performing lookups:
  - Successor list: maintains the next r successors of the node. The successor list alone can be used to traverse the nodes and find the node responsible for the data in O(N) time.
  - Finger table: the i-th entry contains the identity of the first node that succeeds n by at least 2^(i−1) on the ID circle.
- Lookup pseudocode (see the sketch below):
  - Find the id's predecessor; its successor is the node responsible for the key.
  - To find the predecessor, check whether the key lies between the node's id and its successor's id. Otherwise, using the finger table and successor list, find the node that is the closest predecessor of the id and repeat this step.
  - Since finger-table entries point to nodes at power-of-two intervals around the ID ring, each iteration of the above step reduces the distance to the predecessor by at least half.
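A compact sketch of that loop, assuming each node object exposes `id`, `successor`, and a `fingers` list; in the real protocol these steps are RPCs between servers rather than local calls.

```python
# Sketch of Chord's lookup using finger tables; nodes are simplified local objects.
def in_interval(x, a, b):
    # True if x lies in the half-open interval (a, b] on the circular id space.
    return (a < x <= b) if a < b else (x > a or x <= b)

def closest_preceding_node(node, key_id):
    # Scan fingers from farthest to nearest for a node strictly between us and the key.
    for finger in reversed(node.fingers):
        if in_interval(finger.id, node.id, key_id) and finger.id != key_id:
            return finger
    return node

def find_successor(start, key_id):
    node = start
    # Walk toward the key's predecessor: the node whose (id, successor.id] contains the key.
    while not in_interval(key_id, node.id, node.successor.id):
        nxt = closest_preceding_node(node, key_id)
        node = node.successor if nxt is node else nxt   # fall back to the successor pointer
    return node.successor                               # the successor is responsible for the key
```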
[Diagram: finger i points to the successor of n + 2^(i−1). Shown for node N80: successive fingers cover ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the identifier ring (e.g., identifier 112, whose successor is N120).]
Chord: Node join/failure
- Chord tries to preserve two invariants:
  - Each node's successor is correctly maintained.
  - For every key k, node successor(k) is responsible for k.
- To preserve these invariants, when a node n joins the network:
  - Initialize the predecessor, successors, and finger table of node n.
  - Update the existing finger tables of other nodes to reflect the addition of n.
  - Notify the higher-layer software so that state can be transferred.
- To handle concurrent operations and failures, each Chord node periodically runs a stabilization algorithm that updates finger tables and successor lists to reflect the addition/failure of nodes.
- If lookups fail during the stabilization process, the higher layer can look up again. Chord guarantees that the stabilization algorithm results in a consistent ring.
Chord: Server selection
- Added to Chord as part of the CFS implementation.
- Basic idea: reduce lookup latency by preferentially contacting nodes likely to be nearby in the underlying network.
- Latencies are measured during finger-table creation, so no extra measurements are necessary.
- This works well only when latency is roughly transitive: low latency from a to b and from b to c implies low latency between a and c.
  - Measurements suggest this holds in practice. [A case study of server selection, Master's thesis]
CFS: Node Id Authentication
- An attacker could destroy chosen data by selecting a node ID that is the successor of the data's key and then denying the existence of the data.
- To prevent this, when a new node joins the system, existing nodes check:
  - that the hash of (node IP + virtual host number) is the same as the professed node ID;
  - that the IP is not spoofed, by sending a random nonce to the claimed IP address.
- To succeed, an attacker would have to control a large number of machines in order to target the blocks of a single file (which are randomly distributed over multiple servers).
(A sketch of these checks follows below.)
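A small sketch of those two checks; `send_nonce_challenge` is a hypothetical helper standing in for the network round trip, and the exact hash input format is an assumption.

```python
# Sketch of CFS node-ID authentication at join time.
import hashlib
import os

def verify_new_node(claimed_id: bytes, ip: str, virtual_number: int,
                    send_nonce_challenge) -> bool:
    # 1. The professed node ID must equal hash(IP address + virtual host number).
    expected = hashlib.sha1(f"{ip}:{virtual_number}".encode()).digest()
    if claimed_id != expected:
        return False
    # 2. Send a random nonce to the claimed IP; only the true owner of that
    #    address will receive it and can echo it back.
    nonce = os.urandom(16)
    return send_nonce_challenge(ip, nonce) == nonce
```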
CFS: Dhash Layer
- Provides a distributed hash table for block storage.
- Reflects a key CFS design decision: split each file into blocks and randomly distribute the blocks over many servers.
  - This provides good load distribution for large files.
  - The disadvantage is that lookup cost increases, since a lookup is executed for each block; that cost is still small compared to the much higher cost of the block fetches themselves.
- Also supports pre-fetching of blocks to reduce user-perceived latency.
- Supports replication, caching, quotas, and updates of blocks.
CFS: Replication
- Replicates each block on k servers to increase availability.
- Places the replicas on the k servers that are the immediate successors of the node responsible for the key.
- These servers are easy to find from the successor list (r >= k).
- Provides fault tolerance: when the successor fails, the next server can serve the block.
- Because a node's ID is a hash of its IP address + virtual host number, successor nodes are in general not physically close to each other; this gives robustness against the failure of multiple servers located on the same network.
- The client can fetch the block from any of the k servers, and latency can be the deciding factor. This also has the side effect of spreading load across multiple servers; it works under the assumption that proximity in the underlying network is transitive.
CFS: Caching
- DHash implements caching to avoid overloading servers that hold popular data.
- Caching is based on the observation that as a lookup proceeds toward the desired key, the distance traveled across the key space with each hop decreases. This implies that, with high probability, the nodes just before the key are involved in a large number of lookups for the same block. So when the client fetches the block from the successor node, it also caches it at the servers that were involved in the lookup.
- The cache replacement policy is LRU. Blocks cached on servers far from the key are evicted sooner, since few lookups touch those servers; blocks cached on servers close to the key stay in the cache as long as they are referenced.
CFS: Implementation
- Implemented in 7,000 lines of C++ code, including 3,000 lines for Chord.
- User-level programs communicate over UDP with RPC primitives provided by the SFS toolkit.
- The Chord library maintains the successor lists and the finger tables. For multiple virtual servers on the same physical server, the routing tables are shared for efficiency.
- Each DHash instance is associated with a Chord virtual server and has its own implementation of the Chord lookup protocol to increase efficiency.
- The client FS implementation exports an ordinary Unix-like file system. The client runs on the same machine as a server, uses Unix domain sockets to communicate with the local server, and uses that server as a proxy to send queries to non-local CFS servers.
CFS: Experimental results
- Two sets of tests
- To test real-world, client-perceived performance, the first test explores performance on a subset of 12 machines of the RON testbed.
  - A 1-megabyte file is split into 8 KB blocks.
  - All machines download the file, one at a time.
  - The download speed is measured with and without server selection.
- The second test is a controlled test in which a number of servers run on the same physical machine and communicate over the local loopback interface. This test studies the robustness, scalability, load balancing, etc. of CFS.
Future Research
- Support keyword search
  - by adopting an existing centralized search engine (as Napster did), or
  - by using a distributed set of index files stored on CFS
- Improve security against malicious participants
  - Conspiring nodes can form a consistent internal ring, route all lookups to nodes inside that ring, and then deny the existence of the data
  - Content hashes help guard against block substitution
  - Future versions will add periodic routing-table consistency checks by randomly selected nodes to try to detect malicious participants
- Lazy replica copying, to reduce the overhead for hosts that join the network only for a short period of time
Conclusions
- PAST (Pastry) and CFS (Chord) are peer-to-peer routing and location schemes for storage
- The underlying ideas are largely the same in both
- CFS load management is less complex
- Questions raised at SOSP about them:
  - Is there any real application for them?
  - Who will trust these infrastructures to store his/her files?