Peer-to-Peer File Systems
Presented by: Serge Kreiker
“P2P” in the Internet

Napster: a peer-to-peer file sharing application
- Allowed Internet users to exchange files directly
- A simple idea … hugely successful
- The fastest-growing Web application: 50 million+ users in January 2001
- Shut down in February 2001
- Similar systems/startups followed in rapid succession
Napster, Gnutella, Freenet
Napster
[Diagram: a client asks the central Napster server where a file can be found; the server returns the address of a peer holding it (e.g., 128.1.2.3), and the client downloads the file directly from that peer.]
Gnutella
[Diagram: a query such as "xyz.mp3 ?" is flooded from peer to peer until a node holding the file answers; there is no central server.]
So Far
- Centralized: Napster
  - Table size – O(n)
  - Number of hops – O(1)
- Flooded queries: Gnutella
  - Table size – O(1)
  - Number of hops – O(n)
Storage management systems: challenges
- Distributed: nodes have identical capabilities and responsibilities
- Anonymity
- Storage management: spread the storage burden evenly
- Tolerate unreliable participants
- Robustness: surviving massive failures
- Resilience to DoS attacks, censorship, and other node failures
- Cache management: cache additional copies of popular files
Routing challenges
- Efficiency: O(log N) messages per lookup, where N is the total number of servers
- Scalability: O(log N) state per node
- Robustness: surviving massive failures
We are going to look at
- PAST (Rice and Microsoft Research; routing substrate: Pastry)
- CFS (MIT; routing substrate: Chord)
What is PAST?
- An archival storage and content distribution utility
- Not a general-purpose file system
- Stores multiple replicas of files
- Caches additional copies of popular files in the local file system
How it works
- Built over a self-organizing, Internet-based overlay network
- Based on the Pastry routing scheme
- Offers persistent storage services for replicated, read-only files
- Owners can insert/reclaim files
- Clients only perform lookups
PAST Nodes
- The collection of PAST nodes forms an overlay network
- Minimally, a PAST node is an access point
- Optionally, it contributes storage and participates in routing
PAST operations
- fileId = Insert(name, owner-credentials, k, file);
- file = Lookup(fileId);
- Reclaim(fileId, owner-credentials);
(A usage sketch of this API follows below.)
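To make the API shape concrete, here is a minimal sketch of the three operations as an abstract interface. The Python class, method names, and type hints are illustrative assumptions; PAST itself is not a Python library.

```python
# Minimal sketch of the PAST client API (illustrative names and types only).
from abc import ABC, abstractmethod

class PastApi(ABC):
    @abstractmethod
    def insert(self, name: str, owner_credentials: bytes, k: int, file: bytes) -> bytes:
        """Store k replicas of `file`; return the fileId."""

    @abstractmethod
    def lookup(self, file_id: bytes) -> bytes:
        """Retrieve a copy of the file from a nearby node holding a replica."""

    @abstractmethod
    def reclaim(self, file_id: bytes, owner_credentials: bytes) -> None:
        """Reclaim the file's storage; only weakly consistent with concurrent lookups."""
```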
Insertion
- The fileId is computed as the secure hash of the file name, the owner's public key, and a salt (see the sketch below)
- The file is stored on the k nodes whose nodeIds are numerically closest to the 128 most significant bits (msb) of the fileId
- How are key IDs mapped to node IDs? Use Pastry.
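As a rough illustration of these two steps, the sketch below derives a 128-bit fileId from a hash and selects the k numerically closest nodeIds. The hash function, distance metric, and helper names are assumptions for illustration, not PAST's actual code.

```python
# Sketch: derive a fileId and choose the k nodes with numerically closest nodeIds.
import hashlib

def file_id(name: str, owner_pubkey: bytes, salt: bytes) -> int:
    digest = hashlib.sha256(name.encode() + owner_pubkey + salt).digest()
    return int.from_bytes(digest[:16], "big")        # keep the 128 most significant bits

def k_closest(node_ids: list, fid: int, k: int) -> list:
    ring = 1 << 128                                   # circular 128-bit namespace
    def distance(n: int) -> int:
        d = abs(n - fid)
        return min(d, ring - d)                       # numeric distance with wrap-around
    return sorted(node_ids, key=distance)[:k]
```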
Insert contd
- The required storage is debited against the owner's storage quota
- A file certificate is returned
  - Signed with the owner's private key
  - Contains: fileId, hash of the content, replication factor, and other fields
- The file and certificate are routed via Pastry
- Each of the k replica-storing nodes attaches a store receipt
- An ack is sent back after all k nodes have accepted the file
(A sketch of the certificate and receipt structures follows below.)
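To make the metadata flow concrete, a small sketch of the two records mentioned above. The field names are illustrative assumptions; the actual PAST certificate and receipt contain additional fields.

```python
# Sketch of the insert metadata; field names are illustrative.
from dataclasses import dataclass

@dataclass
class FileCertificate:
    file_id: bytes           # secure hash of (name, owner's public key, salt)
    content_hash: bytes      # hash of the file content
    replication_factor: int  # k
    signature: bytes         # produced with the owner's private key

@dataclass
class StoreReceipt:
    file_id: bytes
    node_id: bytes           # one of the k replica-storing nodes
    signature: bytes         # lets the client check that k distinct nodes stored the file
```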
Insert file with fileId=117, k=4
[Diagram: 1. Node 200 inserts file 117; the insert message is routed from the source (200) toward the nodes 122, 124, 120, and 115. 2. Node 122 is one of the 4 nodes numerically closest to 117; node 125 was reached first because it is the node nearest to 200.]
Lookup & Reclaim
- Lookup: Pastry locates a "near" node that has a copy and retrieves it
- Reclaim: weak consistency
  - After a reclaim, a lookup is no longer guaranteed to retrieve the file
  - But it is not guaranteed that the file is no longer available
Pastry: Peer-to-peer routing
- Provides generic, scalable indexing, data location, and routing
- Inspired by Plaxton's algorithm (used in web content distribution, e.g., Akamai) and landmark hierarchy routing
Goals
- Efficiency
- Scalability
- Fault resilience
- Self-organization (completely decentralized)
Pastry: How does it work?
- Each node has a unique nodeId.
- Each message has a key.
- Both are uniformly distributed and lie in the same namespace.
- A Pastry node routes a message to the node whose nodeId is closest to the key.
- The number of routing steps is O(log N).
- Pastry takes network locality into account.
- PAST uses the fileId as the key and stores the file on the k closest nodes.
Pastry: Node ID space
- Each node is assigned a 128-bit node identifier, its nodeId.
- The nodeId is assigned randomly when the node joins the system (e.g., as the SHA-1 hash of its IP address or its public key).
- Nodes with adjacent nodeIds are therefore diverse in geography, ownership, network attachment, etc.
- nodeIds and keys are interpreted in base 2^b; b is a configuration parameter with a typical value of 4.
Pastry: Node ID space
[Diagram: the 128-bit identifier space (at most 2^128 nodes) forms a circular namespace. A nodeId is a sequence of L digits (positions 0 … L−1), each b = 128/L bits wide, i.e., a sequence of L base-2^b digits (see the digit sketch below).]
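A small sketch of this digit view, splitting a 128-bit identifier into base-2^b digits with the typical b = 4; the function name and example value are illustrative.

```python
# Sketch: interpret a 128-bit nodeId as a sequence of base-2^b digits (b = 4 here).
def to_digits(node_id: int, b: int = 4, bits: int = 128) -> list:
    levels = bits // b                                   # L = 128 / b digit positions
    mask = (1 << b) - 1
    return [(node_id >> (bits - b * (i + 1))) & mask for i in range(levels)]

digits = to_digits(0x37A0F1CE << 96)                     # arbitrary example identifier
assert len(digits) == 32 and digits[0] == 0x3            # most significant digit first
```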
Pastry: Node State (1)
- Each node maintains a routing table R, a neighborhood set M, and a leaf set L.
- The routing table is organized into log_{2^b}(N) rows with 2^b − 1 entries each.
- Each entry in row n contains the IP address of a nearby node whose nodeId matches the present node's nodeId in the first n digits but differs in digit n+1.
- The choice of b is a trade-off between the size of the routing table and the length of routes.
Pastry: Node State (2)
- Neighborhood set M: the nodeIds and IP addresses of the |M| nodes nearest to the present node according to the network proximity metric.
- Leaf set L: the |L| nodes whose nodeIds are closest to the present node, divided in two: |L|/2 with the closest larger nodeIds and |L|/2 with the closest smaller nodeIds.
- Typical values for |L| and |M| are 2^b.
- Example: nodeId = 10233102, b = 2, nodeIds are 16 bits; all numbers in base 4.
Pastry: Routing Requests
Route(my-id, key-id, message):
  if key-id is within the range of my leaf set:
    forward to the numerically closest node in the leaf set;
  else if the routing table contains a node-id that shares a longer prefix with key-id than my-id does:
    forward to that node;
  else:
    forward to a known node-id that shares a prefix with key-id at least as long as my-id's, but is numerically closer to key-id.
Routing takes O(log N) messages (a sketch of this procedure follows below).
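A minimal sketch of that decision, assuming nodeIds are equal-length tuples of base-2^b digits and the routing table is a list of rows mapping a digit to a node; these data-structure choices and helper names are illustrative, not Pastry's actual implementation.

```python
# Sketch of one Pastry routing step over simplified stand-in data structures.
def as_int(digits, b=4):
    value = 0
    for d in digits:
        value = (value << b) | d
    return value

def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(my_id, key, leaf_set, routing_table):
    dist = lambda n: abs(as_int(n) - as_int(key))
    # 1. Key within the leaf-set range: deliver to the numerically closest node.
    if leaf_set and min(leaf_set) <= key <= max(leaf_set):
        return min(leaf_set + [my_id], key=dist)
    # 2. Try a routing-table entry sharing a longer prefix with the key.
    p = shared_prefix_len(my_id, key)
    entry = routing_table[p].get(key[p]) if p < len(routing_table) else None
    if entry is not None:
        return entry
    # 3. Rare case: any known node with an equally long (or longer) shared prefix
    #    that is numerically closer to the key than this node is.
    known = leaf_set + [n for row in routing_table for n in row.values()]
    closer = [n for n in known if shared_prefix_len(n, key) >= p and dist(n) < dist(my_id)]
    return min(closer, key=dist) if closer else my_id
```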
[Routing example diagram: b = 2, 4-digit nodeIds, key = 1230. The message starts at source node 2331 and is forwarded through nodes 1331, 1211, and 1223 toward the destination node 1233, each hop matching a longer prefix of the key. Routing-table rows shown: X0: 0130, 1331, 2331, 3001; X1: 1030, 1123, 1211, 1301; X2: 1201, 1213, 1223, 1233; leaf set L: 1232, 1223, 1300, 1301.]
Pastry: Node Addition
- X: the joining node
- A: a node near X (in terms of network proximity)
- Z: the node with the nodeId numerically closest to X's
Routing table of X
- leaf-set(X) = leaf-set(Z)
- neighborhood-set(X) = neighborhood-set(A)
- routing table of X, row i = routing table of Ni, row i, where Ni is the i-th node encountered along the route from A to Z
- X notifies all nodes in leaf-set(X)
(A sketch of this state initialization follows below.)
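A rough sketch of how X assembles its initial state from A, Z, and the nodes met along the join route; the node attributes and the notify_joined call are illustrative assumptions.

```python
# Sketch: the joining node X copies state from A (network-nearby), Z (numerically
# closest), and the nodes N_1 ... N_k on the route from A to Z. Illustrative only.
def init_joining_node(x, a, z, route_nodes):
    x.leaf_set = list(z.leaf_set)                         # leaf-set(X) = leaf-set(Z)
    x.neighborhood_set = list(a.neighborhood_set)         # neighborhood-set(X) = neighborhood-set(A)
    for i, n_i in enumerate(route_nodes):
        if i < len(x.routing_table) and i < len(n_i.routing_table):
            x.routing_table[i] = dict(n_i.routing_table[i])   # row i comes from the i-th node met
    for member in x.leaf_set:
        member.notify_joined(x)                           # X announces itself to its leaf set
    return x
```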
[Diagrams: X joins the system, first stage. X sends a join message with key = X to a nearby node A; the message is routed, like an ordinary lookup, through intermediate nodes B and C to Z, the node with the nodeId numerically closest to X. Example values shown: A = 10, Z = 210, and a Lookup(216) routed through nodes N1, N2, N36.]
Pastry: Node Failures, Recovery
- Pastry relies on a soft-state protocol to deal with node failures
  - Neighboring nodes in the nodeId space periodically exchange keepalive messages
  - Nodes that are unresponsive for a period T are removed from leaf sets
  - A recovering node contacts its last known leaf set, updates its own leaf set, and notifies the members of its presence
- Randomized routing is used to deal with malicious nodes that could cause repeated query failures
Security
- Each PAST node and each user of the system holds a smartcard
- A private/public key pair is associated with each card
- Smartcards generate and verify certificates and maintain storage quotas
More on Security
- Smartcards ensure the integrity of nodeId and fileId assignments
- Store receipts prevent malicious nodes from creating fewer than k copies
- File certificates allow storage nodes and clients to verify the integrity and authenticity of stored content, and to enforce storage quotas
Storage Management
- Based on local coordination among nodes with nearby nodeIds
- Responsibilities:
  - Balance the free storage among nodes
  - Maintain the invariant that the replicas of each file are stored on the k nodes closest to its fileId
Causes for storage imbalance & solutions
- The number of files assigned to each node may vary
- The size of the inserted files may vary
- The storage capacity of PAST nodes differs
Solutions:
- Replica diversion
- File diversion
Replica diversion
- Recall: each node maintains a leaf set
  - The l nodes with nodeIds numerically closest to the given node
- If a node A cannot accommodate a copy locally, it considers replica diversion
- A chooses a node B in its leaf set and asks it to store the replica
  - A then enters a pointer to B's copy in its table and issues a store receipt
Policies for accepting a replica
- If (file size / remaining free storage) > t, reject the file
  - t is a fixed threshold
- t has different values for primary replicas (nodes among the k numerically closest) and diverted replicas (nodes in the same leaf set, but not among the k closest)
  - t(primary) > t(diverted)
(A sketch of this acceptance test follows below.)
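A small sketch of that acceptance test; the concrete threshold values below are made-up illustrations, not the values used in the PAST paper.

```python
# Sketch of the replica acceptance policy; threshold values are illustrative only.
T_PRIMARY = 0.1    # t(primary) > t(diverted): diverted replicas face a stricter test
T_DIVERTED = 0.05

def accept_replica(file_size: int, free_space: int, primary: bool) -> bool:
    t = T_PRIMARY if primary else T_DIVERTED
    # Reject files that are large relative to the node's remaining free storage.
    return free_space > 0 and (file_size / free_space) <= t
```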
File diversion
- When one of the k nodes declines to store a replica, replica diversion is attempted
- If the node chosen for the diverted replica also declines, the entire file is diverted
- A negative ack is sent; the client generates another fileId (using a different salt) and starts again
- After 3 rejections, the user is notified that the insert failed
Maintaining replicas
- Pastry uses keepalive messages and adjusts the leaf set after failures
- The same adjustment takes place when a node joins
- What happens to the copies stored by a failed node?
- What about the copies stored by a node that leaves or enters a new leaf set?
Maintaining replicas contd
- To maintain the invariant (k copies), the replicas have to be re-created in the previous cases
  - A big overhead
- Proposed solution for joins: lazy re-creation
  - First insert a pointer to the node that holds the replicas, then migrate them gradually
Caching
- The k replicas are maintained in PAST for availability
- The fetch distance is measured in overlay-network hops (which says little about distance in the real network)
- Caching is used to improve performance
Caching contd
- PAST nodes use the "unused" portion of their advertised disk space to cache files
- When storing a new primary or diverted replica, a node evicts one or more cached copies
- How it works: a file that is routed through a node by Pastry (on insert or lookup) is inserted into the local cache if its size is < c
  - c is a fraction of the current cache size
(A sketch of this caching decision follows below.)
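A minimal sketch of that decision; the fraction value and function name are illustrative assumptions, and eviction is omitted.

```python
# Sketch of PAST's cache-insertion test; `c_fraction` is an illustrative parameter.
def maybe_cache(cache: dict, current_cache_size: int, file_id: bytes,
                data: bytes, c_fraction: float = 0.05) -> bool:
    # Cache a file routed through this node (insert or lookup) only if it is
    # smaller than a fraction c of the current cache size.
    if current_cache_size <= 0 or len(data) >= c_fraction * current_cache_size:
        return False
    cache[file_id] = data   # eviction of cached copies to make room for replicas not shown
    return True
```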
Evaluation
- PAST is implemented in Java
- Network emulation using the Java VM
- 2 workloads (based on NLANR traces) for file sizes
- 4 normal distributions of node storage sizes
Key Results
STORAGE
- Replica and file diversion improve global storage utilization from 60.8% (without them) to 98%; insertion failures drop from 51% to below 5%.
- Caveat: the storage capacities used in the experiments are roughly 1000x below what might be expected in practice.
CACHING
- Routing hops with caching remain lower than without caching, even at 99% storage utilization.
- Caveat: the median file sizes are very low; caching performance will likely degrade if they are higher.
CFS: Introduction
- Peer-to-peer read-only storage system
- Decentralized architecture focusing mainly on
  - efficiency of data access
  - robustness
  - load balance
  - scalability
- Provides a distributed hash table for block storage
- Uses Chord to map keys to nodes
- Does not provide
  - anonymity
  - strong protection against malicious participants
- The focus is on providing an efficient and robust lookup and storage layer with simple algorithms.
CFS Software Structure
[Diagram: a CFS client stacks FS on DHash on Chord; each CFS server stacks DHash on Chord. The FS layer talks to DHash through a local API, while the DHash and Chord layers on different hosts communicate through an RPC API.]
CFS: Layer functionalities
- The client file system uses the DHash layer to retrieve blocks
- The server and client DHash layers use the client Chord layer to locate the servers that hold the desired blocks
- The server DHash layer is responsible for storing keyed blocks, maintaining proper levels of replication as servers come and go, and caching popular blocks
- The Chord layers interact in order to integrate looking up a block identifier with checking for cached copies of the block
- The client identifies the root block using a public key generated by the publisher
  - It uses the public key as the root block identifier to fetch the root block, and checks the validity of the block using its signature
- The file's inode key is obtained by the usual search through directory blocks; these contain the keys of the file inode blocks, which are used to fetch the inode blocks
- The inode block contains the block numbers and their corresponding keys, which are used to fetch the data blocks (see the traversal sketch below)
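A rough sketch of that root-block-to-data traversal, assuming a hypothetical `dhash.get(key)` primitive and simplified block layouts; the real CFS block formats and signature checks differ.

```python
# Sketch of the CFS fetch path: root block -> directory -> inode -> data blocks.
# `dhash.get(key)` and the block attributes are simplified stand-ins.
def fetch_file(dhash, publisher_pubkey: bytes, name: str) -> bytes:
    root = dhash.get(publisher_pubkey)              # root block is named by the public key
    if not verify_signature(root, publisher_pubkey):
        raise ValueError("invalid root block signature")
    directory = dhash.get(root.directory_key)       # content-hash key kept in the root block
    inode_key = directory.entries[name]             # directory maps file names to inode keys
    inode = dhash.get(inode_key)
    # The inode lists block numbers and their content-hash keys; fetch blocks in order.
    return b"".join(dhash.get(key) for _, key in sorted(inode.blocks))

def verify_signature(block, pubkey: bytes) -> bool:
    # Placeholder: a real client verifies the publisher's signature on the root block.
    return True
```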
CFS: Properties
- Decentralized control – no administrative relationship between servers and publishers.
- Scalability – lookup uses space and messages at most logarithmic in the number of servers.
- Availability – a client can retrieve data as long as at least one replica is reachable over the underlying network.
- Load balance – for large files, achieved by spreading blocks over a number of servers; for small files, blocks are cached at the servers involved in the lookup.
- Persistence – once data is inserted, it is available for the agreed-upon interval.
- Quotas – implemented by limiting the amount of data inserted by any particular IP address.
- Efficiency – the delay of file fetches is comparable to FTP, thanks to efficient lookup, pre-fetching, caching, and server selection.
Chord
- Consistent hashing
  - Maps a node's IP address + virtual host number to an m-bit node identifier.
  - Maps block keys into the same m-bit identifier space.
  - The node responsible for a key is the successor of the key's identifier, with wrap-around in the m-bit identifier space.
  - Consistent hashing balances the keys so that all nodes carry an equal share of the load with high probability, and keys move minimally as nodes enter and leave the network.
- For scalability, Chord uses a distributed version of consistent hashing in which nodes maintain only O(log N) state and use O(log N) messages per lookup with high probability.
(A consistent-hashing sketch follows below.)
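A minimal consistent-hashing sketch: SHA-1 produces 160-bit identifiers and a sorted list stands in for the ring. The class and helper names are illustrative; the distributed version with finger tables is sketched after the next slide.

```python
# Sketch: consistent hashing with successor placement on a circular identifier space.
import bisect
import hashlib

def ident(data: bytes) -> int:
    # 160-bit identifier (SHA-1), used for both node IDs and block keys.
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

class Ring:
    def __init__(self, nodes):
        # node identifier = hash(IP address + virtual host number)
        self.ids = sorted(ident(f"{ip}:{v}".encode()) for ip, v in nodes)

    def successor(self, key: bytes) -> int:
        # The node responsible for a key is the first node id >= the key's id,
        # wrapping around the end of the identifier space.
        k = ident(key)
        i = bisect.bisect_left(self.ids, k)
        return self.ids[i % len(self.ids)]

ring = Ring([("10.0.0.1", 0), ("10.0.0.2", 0), ("10.0.0.2", 1)])
owner = ring.successor(b"block-1234")
```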
Chord details
- Two data structures are used for performing lookups:
  - Successor list: maintains the next r successors of the node. The successor list alone can be used to traverse the nodes and find the node responsible for the data in O(N) time.
  - Finger table: the i-th entry contains the identity of the first node that succeeds n by at least 2^(i−1) on the ID circle.
- Lookup pseudocode (see the sketch below):
  - Find the id's predecessor; its successor is the node responsible for the key.
  - To find the predecessor, check whether the key lies between the node's id and its successor's id. Otherwise, using the finger table and successor list, find the node that is the closest predecessor of the id and repeat this step.
  - Since finger-table entries point to nodes at power-of-two intervals around the ID ring, each iteration of the above step reduces the distance to the predecessor by at least half.
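A compact sketch of that loop, assuming each node object exposes `id`, `successor`, and a `fingers` list; in the real protocol these steps are RPCs between servers rather than local calls.

```python
# Sketch of Chord's lookup using finger tables; nodes are simplified local objects.
def in_interval(x, a, b):
    # True if x lies in the half-open interval (a, b] on the circular id space.
    return (a < x <= b) if a < b else (x > a or x <= b)

def closest_preceding_node(node, key_id):
    # Scan fingers from farthest to nearest for a node strictly between us and the key.
    for finger in reversed(node.fingers):
        if in_interval(finger.id, node.id, key_id) and finger.id != key_id:
            return finger
    return node

def find_successor(start, key_id):
    node = start
    # Walk toward the key's predecessor: the node whose (id, successor.id] contains the key.
    while not in_interval(key_id, node.id, node.successor.id):
        nxt = closest_preceding_node(node, key_id)
        node = node.successor if nxt is node else nxt   # fall back to the successor pointer
    return node.successor                               # the successor is responsible for the key
```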
[Diagram: finger i points to the successor of n + 2^(i−1). Shown for node N80: successive fingers cover ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the identifier ring (e.g., identifier 112, whose successor is N120).]
Chord: Node join/failure
- Chord tries to preserve two invariants:
  - Each node's successor is correctly maintained.
  - For every key k, node successor(k) is responsible for k.
- To preserve these invariants, when a node n joins the network:
  - Initialize the predecessor, successors, and finger table of node n.
  - Update the existing finger tables of other nodes to reflect the addition of n.
  - Notify the higher-layer software so that state can be transferred.
- To handle concurrent operations and failures, each Chord node periodically runs a stabilization algorithm that updates finger tables and successor lists to reflect the addition/failure of nodes.
- If lookups fail during the stabilization process, the higher layer can look up again. Chord guarantees that the stabilization algorithm results in a consistent ring.
Chord: Server selection
- Added to Chord as part of the CFS implementation.
- Basic idea: reduce lookup latency by preferentially contacting nodes likely to be nearby in the underlying network.
- Latencies are measured during finger-table creation, so no extra measurements are necessary.
- This works well only when latency is roughly transitive: low latency from a to b and from b to c implies low latency between a and c.
  - Measurements suggest this holds in practice. [A case study of server selection, Master's thesis]
CFS: Node Id Authentication
- An attacker could destroy chosen data by selecting a node ID that is the successor of the data's key and then denying the existence of the data.
- To prevent this, when a new node joins the system, existing nodes check:
  - that the hash of (node IP + virtual host number) is the same as the professed node ID;
  - that the IP is not spoofed, by sending a random nonce to the claimed IP address.
- To succeed, an attacker would have to control a large number of machines in order to target the blocks of a single file (which are randomly distributed over multiple servers).
(A sketch of these checks follows below.)
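A small sketch of those two checks; `send_nonce_challenge` is a hypothetical helper standing in for the network round trip, and the exact hash input format is an assumption.

```python
# Sketch of CFS node-ID authentication at join time.
import hashlib
import os

def verify_new_node(claimed_id: bytes, ip: str, virtual_number: int,
                    send_nonce_challenge) -> bool:
    # 1. The professed node ID must equal hash(IP address + virtual host number).
    expected = hashlib.sha1(f"{ip}:{virtual_number}".encode()).digest()
    if claimed_id != expected:
        return False
    # 2. Send a random nonce to the claimed IP; only the true owner of that
    #    address will receive it and can echo it back.
    nonce = os.urandom(16)
    return send_nonce_challenge(ip, nonce) == nonce
```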
CFS: Dhash Layer
- Provides a distributed hash table for block storage.
- Reflects a key CFS design decision: split each file into blocks and randomly distribute the blocks over many servers.
  - This provides good load distribution for large files.
  - The disadvantage is that lookup cost increases, since a lookup is executed for each block; that cost is still small compared to the much higher cost of the block fetches themselves.
- Also supports pre-fetching of blocks to reduce user-perceived latency.
- Supports replication, caching, quotas, and updates of blocks.
CFS: Replication
- Replicates each block on k servers to increase availability.
- Places the replicas on the k servers that are the immediate successors of the node responsible for the key.
- These servers are easy to find from the successor list (r >= k).
- Provides fault tolerance: when the successor fails, the next server can serve the block.
- Because a node's ID is a hash of its IP address + virtual host number, successor nodes are in general not physically close to each other; this gives robustness against the failure of multiple servers located on the same network.
- The client can fetch the block from any of the k servers, and latency can be the deciding factor. This also has the side effect of spreading load across multiple servers; it works under the assumption that proximity in the underlying network is transitive.
CFS: Caching
- DHash implements caching to avoid overloading servers that hold popular data.
- Caching is based on the observation that as a lookup proceeds toward the desired key, the distance traveled across the key space with each hop decreases. This implies that, with high probability, the nodes just before the key are involved in a large number of lookups for the same block. So when the client fetches the block from the successor node, it also caches it at the servers that were involved in the lookup.
- The cache replacement policy is LRU. Blocks cached on servers far from the key are evicted sooner, since few lookups touch those servers; blocks cached on servers close to the key stay in the cache as long as they are referenced.
CFS: Implementation
- Implemented in 7,000 lines of C++ code, including 3,000 lines for Chord.
- User-level programs communicate over UDP with RPC primitives provided by the SFS toolkit.
- The Chord library maintains the successor lists and the finger tables. For multiple virtual servers on the same physical server, the routing tables are shared for efficiency.
- Each DHash instance is associated with a Chord virtual server and has its own implementation of the Chord lookup protocol to increase efficiency.
- The client FS implementation exports an ordinary Unix-like file system. The client runs on the same machine as a server, uses Unix domain sockets to communicate with the local server, and uses that server as a proxy to send queries to non-local CFS servers.
CFS: Experimental results
- Two sets of tests
- To test real-world, client-perceived performance, the first test explores performance on a subset of 12 machines of the RON testbed.
  - A 1-megabyte file is split into 8 KB blocks.
  - All machines download the file, one at a time.
  - The download speed is measured with and without server selection.
- The second test is a controlled test in which a number of servers run on the same physical machine and communicate over the local loopback interface. This test studies the robustness, scalability, load balancing, etc. of CFS.
Future Research
- Support keyword search
  - by adopting an existing centralized search engine (as Napster did), or
  - by using a distributed set of index files stored on CFS
- Improve security against malicious participants
  - Conspiring nodes can form a consistent internal ring, route all lookups to nodes inside that ring, and then deny the existence of the data
  - Content hashes help guard against block substitution
  - Future versions will add periodic routing-table consistency checks by randomly selected nodes to try to detect malicious participants
- Lazy replica copying, to reduce the overhead for hosts that join the network only for a short period of time
Conclusions
- PAST (Pastry) and CFS (Chord) are peer-to-peer routing and location schemes for storage
- The underlying ideas are largely the same in both
- CFS load management is less complex
- Questions raised at SOSP about them:
  - Is there any real application for them?
  - Who will trust these infrastructures to store his/her files?