Download node - Open Learning Environment

Varia • Assignment? • Nächsten Dienstag 4 Stunden (8:30-12:30) Lab, beide Labgruppen zusammen Distributed Systems 20. Peer-to-Peer Systems Simon Razniewski Faculty of Computer Science Free University of Bozen-Bolzano A.Y. 2016/2017 Outline 1. 2. 3. 4. Distributed System Architectures P2P - Overview Unstructured P2P - Gnutella Structured P2P – Chord 1. (Distributed) System Architectures Reminder • In general Distributed Systems (DS) are designed to share resources. • What is a resource? – e.g., programs, data, CPUs • Advantages of DS – – – – – Fault tolerance Cost reduction (e.g., by pooling resources) Scalability Load Balancing … • Disadvantages? Architectures • DS are software architectures. – It is important to organize the various software modules. – This organization can happen according to different architectural styles. • • • • Layered architectures Object-based architectures Data-centric architectures Event-based architectures Architectural Styles Layered architectures • Example: TCP/IP protocol stack Architectural Styles Object-based architectures • Components are connected in a RPC style • Example: – Client-Server architecture, Java RMI Architectural Styles data-centric architectures • Processes communicate via a shared repository • Example: – Distributed File Systems Architectural Styles Event-based architectures • Processes communicate through a mechanism of event propagation. • Publish/Subscribe systems – Processes publish events – Only those processes that subscribed will be notified  Publishers and subscribers don’t need to explicitly refer each other (MOM) 2. P2P - Overview Peer-to-Peer: History • ICQ – 1996: first P2P instant messaging tool – Server maintains the status of the users, communication happens P2P (till 2011, also Skype worked that way) • Napster – 1999 first file sharing system for music – 2001 shutdown over legal issues • Gnutella – 2000 first truly decentralized P2P filesharing system • Bittorrent – Developed 2001, most used system today • Bitcoin – Distributed virtual currency, released 2009 Peer-to-Peer Systems Motivation • A huge number of nodes participating in the network: – Have demands towards the use of resources – Classical client/server systems: Server are the bottlenecks – But: Clients have a lot of compute power by themselves – E.g., Bitcoin network in 2013: 8x the compute power of the top 500 supercomputers • Main questions: – How can we utilize the power of the clients? – How can we achieve fairness? Peer-to-Peer Systems Overlay Networks • P2P networks form an overlay network on top of the Internet (TCP/IP network) • TCP/IP networks form an overlay network politically and technically over the underlying telecom • Both: – introduce their own addressing scheme (e.g. peer IDs, IP addresses) – emphasize redundancy • Complexity of P2P systems lies in the design of the overlay network + the message exchange protocol • Super-peer based (e.g. Napster, (Bittorrent)) • Unstructured (e.g. Gnutella) • Structured (e.g. Chord) Peer-to-Peer Overlay Networks Client Server vs. P2P content discovery Index Overlay Network Client Network Network Provider Provider Provider Provider Node Node • Clients interacts with Servers after finding useful information provided by Indexing services • No intelligence in the network Node • Search in the network • “Content”-based • Intelligence in the network Peer-to-Peer Features • Resources (location, sharing) – Relevant resources located at (private) nodes (peers) • uncontrolled, voluntary offers – Resources of peers are heterogeneous • bandwidth, CPU power, storage space, … • quality depends on device / connectivity – Distributed resource location • widely spread: requires proper mechanism to find and use Peer-to-Peer Features • Networking – High churn rate (ratio of peers joining/leaving a service over time) • peers are online for an unpredictable limited time • heterogeneous connection types • often operating behind firewalls or NAT gateways • Self-organizing system – Relevant mechanisms performed by peers in the network • Resource search, allocation and scheduling • No central control Peer-to-Peer vs Grid Computing • A computing paradigm to interconnect the existing data processing centers to Virtual Organizations and operate on data in a distributed way. • Commonly in commercial applications • Still some centralized components Peer-to-Peer vs Cloud Computing • A computing paradigm to access distributed pools of resources. – Resources: storage, bandwidth, CPU • Resource providers: typically companies • Controlled environment – No malicious users – No (very little) churns – Homogeneous devices • Parts of the architecture are centralized • Different goal – Software as a service (e.g. Google docs) – Platform as a service (e.g. Google app engine, Windows Azure) – Infrastructure as a service (e.g. Amazon EC2) Peer-to-Peer Popular systems Peer-to-Peer Traffic http://www.exfo.com/PageFiles/37706/Robert%20Fitts_Cisco%20Graph.jpg (Since 2012, reported slight decrease vs. streaming) Structured P2P architectures • The overlay is constructed by a deterministic procedure. • Distributed Hash Tables (DHTs) are commonly used to organize data. • DHTs: – Each data item is assigned a key in a key space. – Each node is assigned a key in the same way. – Idea: map the key of a data item to a unique node key. The mapping is done by using a distance metric Structured P2P architectures • When looking for a piece of data, it is important to return the id of the node responsible for that piece of data. • This is achieved by performing routing of the request from the node that originated the request to the node responsible for the key associated to the data one is looking for. Structured P2P architectures - Chord • Nodes are logically organized in a ring. • Node ids can be obtained e.g., by hashing the IP address of the node. • A piece of data having key k is assigned to a node having id>k. – This node is called successor of k denoted by succ(k). Structured P2P architectures- Chord N63 The distance to peer nodes increases exponentially. N2 N6 N56 N11 N50 put ( 28, A ) performed on N48 N42 N42 Key 28 … N33 N33 is the target node N27 N17 Data in this area is stored into Node 27 Unstructured P2P architectures • The network looks like a random graph. • Each node maintains a list of neighbors. – How to construct this list ? • e.g., by contacting popular nodes. • Each node has a partial view of the whole network. • When a piece of data has to be found the network will be flooded. – Each node contacts its neighbors and so forth. Unstructured P2P architectures • Centralized P2P – e.g., Napster Unstructured P2P architectures • Fully decentralized – e.g., Gnutella 0.2 Super-peer-based P2P architectures • Some nodes will act as servers for a subset of peers. – These nodes are called super-peers (SPs). • SPs are organized in a P2P way. • Problems: – Leader election (i.e., which nodes are good candidates for being SPs ?) Super-peer-based P2P architectures • Combine centralized and decentralized aspects – e.g., KaZaa Overlay Networks Types of queries • Lookup – Key-value as in the case of hash tables • Given a key, returns a single values/file • Match – Given a sequence of keywords • Return all documents matching the terms • Structured networks have 100% probability of success for Lookup • Unstructured networks do not have the same guarantee Overlay Networks Classification Homogeneous/fully decentralized 4. No guarantee of success Heterogeneous/ super-peer-based 4. No guarantee of success 3. Unstructured P2P Unstructured Overlay Networks Types • Homogeneous (all nodes are assumed equal) • Heterogeneous (nodes have various roles) – Decentralized File Sharing with Distributed Servers – Decentralized file Sharing with Super Nodes • Flat: all nodes are in one overlay • Hierarchical: more than one overlay exists Unstructured Overlay Networks Centralized Networks • Central index maintaining information about – shared object (e.g., file name) – location (IP address where the object is located) • Normal peers maintain the objects – each peer maintains its own objects – decentralized storage – file transfer between peers • Issues: – Central point of failure – Unbalanced cost: the server is a bottleneck Unstructured Overlay Networks Centralized Networks SERVER Node G d1, B Node A Node D Node E Node B Node C d1 transfer of d1 Node F Unstructured Overlay Networks Centralized Networks • Advantages? – Search complexity O(1) – Simple and fast • Disadvantages? – No intrinsic scalability – Single point of failure/attack – Complex queries difficult – A single server cannot handle a large number of peers Unstructured Overlay Networks Distributed Networks • All peers play the same role • Search is carried out by a cooperation among peers • Each peer has a local view of the network • Main motivation – Provide robustness – No single point of failure – Scalability Unstructured Overlay Networks Distributed Networks - Challenges • How to join the network ? – No central index • Knowing at least 1 node that already is in the network – Constructing the local view of the network • How to perform search ? – Flooding – Random walk – .. • How to deliver contents? – Peer to Peer communication Unstructured Overlay Networks Distributed Networks • PROs? – Each peer can fail/leave the network – Scalable • CONs? – Slow and expensive search – No guarantee of completeness (i.e., finding ALL the objects that satisfy a request) Unstructured Overlay Networks Distributed Networks - Search • Flooding (Breadth First Search) – Send the message to all neighbor peers apart from the peer that delivered the message – Keep track of the messages processed to avoid cycles Query Unstructured Overlay Networks Distributed Networks - Flooding Source node Message Count Length of the path 5 Target node 2615 113 1 Unstructured Overlay Networks Distributed Networks - Search • Expanding Ring – Successive floods with increasing TTL • Start with small TTLs, if no success, then increase it – A.k.a. iterative deepening (e.g. games) • Properties – Improved performance • If content follow a Zipf distribution – Message overhead is high Unstructured Overlay Networks Distributed Networks - Search • Random walk – Forward the query to a randomly selected neighbor • Message overhead is reduced significantly • Increased latency • Multiple random walks – Reduce latency – Generate more load • Termination mechanism – TTL based Unstructured Overlay Networks Distributed Networks – Random walk Assume k=2, each message is sent to only 2 neighbors Target node Source node Gnutella Protocol 0.4 Unstructured Overlay Networks Gnutella protocol • Messages have a GUID (globally unique identifier) to avoid loops • TTL max. equal to 7 • Phases: – Connecting • PING • PONG – Querying • QUERY • QUERY HIT – Transfer • HTTP GET or PUSH (in case of firewall) Unstructured Overlay Networks Gnutella protocol – Phase 1 • To connect to a Gnutella network, peer must initially know – (at least) one member node of the network and connect to it • This/these first member node(s) must be found by other means – find first member using other medium (Web, chat, ...) – nowadays host caches are usually used • Share further neighbours to get more connections Unstructured Overlay Networks Gnutella algorithm – Phase 2 • A node that receives a QUERY message – increases the HOP count field of the message and • IF(HOP <= TTL && ! QUERY.GUID already received) THEN forwards it to all nodes except the one the peer received it from – Nodes also checks whether they can answer and in case • Send a QUERY HIT message to the requestor – QUERY HIT messages contain • The IP address of the sender • The message is routed through the same path it arrived Unstructured Overlay Networks Gnutella algorithm – Phase 3 • A peer sets up a HTTP connection – actual data transfer is not part of the Gnutella protocol – HTTP GET is used • Special case: peer with the file located behind a firewall/NAT gateway the downloading peer – cannot initiate a TCP/HTTP connection • can instead send the PUSH message asking the other peer to initiate a TCP/HTTP connection to it and then transfer (push) the file via it • does not work if both peers are behind firewalls Unstructured Overlay Networks Gnutella algorithm – Scalability • A TTL (Time To Live) of 4 hops for the PING messages leads to a known topology of roughly 8000 nodes – TTL in the original Gnutella client was 7 (not 4) • Gnutella in its original version (V 0.4) suffers from a range of scalability issues due to – fully decentralized approach – flooding of messages • Measured message frequency (Portman et al.): • Gnutella 0.6: Superpeers Unstructured Overlay Networks Gnutella algorithm – free riders • Study results (since e.g. Adar/Hubermann 2000): – 70% of the Gnutella users share no files – 90% do not answer to queries • Some solutions – incentives for sharing: peers only accept connections / forward messages from peers that share contents – Micro-payment (e.g., Mojo Nation) Bittorrent • BitTorrent is a system for downloading files. • It does not provide all the functionalities of a typical P2P system (e.g., search). • To share a content a user create a .torrent file containing: – Metadata about the content to be shared – Information about the tracker, that is, the node that coordinates the distribution Bittorrent - Features • The user that provides the file chops it into pieces of size between 64K and 1 MB. • When a client is looking for a file it locates the .torrent file pointing to the tracker. • The tracker tells which other peers (called leechers) are downloading the file. Bittorrent - Functionality • Replicas of the downloaded parts are created. – More leechers downloading more replicas. – As soon as the leecher has completed a piece, it can share it with others – Share rarest first • All the pieces are put together to assemble the file. • The main objective of BitTorrrent is to guarantee cooperation. – Avoiding free riding (i.e., nodes that only download and do not provide contents) – Tracker hands addresses of pieces only to peers that share received pieces (tit-for-tat) Bittorrent - Illustration (click) Bittorrent - Remarks • Tracker normally does not host the file • Torrent file hosts: Same applies • From thepiratebay.org: No torrent files are saved at the server. That means no copyrighted and/or illegal material are stored by us. It is therefore not possible to hold the people behind The Pirate Bay responsible for the material that is being spread using the site. • Today: Trackerless torrents (see structured P2P next) Legal and economical aspect of filesharing What is legal? What is illegal? Should filesharing be forbidden? What should the government do? What should producers of digital content do? Assignment 3 • Many nice solutions! • Why no security for ports needed? – Security at system level (firewall) – No exposure to remote code (in contrast to remote objects) • Concurrency is hidden, nevertheless an issue! – E.g., what if two clients register at the same time? • Structure of documentation – Title, subheadings, paragraphs, boldfacing, enumerations, bullets, screenshots, tables of text cases/differences, … – Not just monolithic text blocks 4. Structured P2P Networks Chord Distributed Hash Tables • Hash tables allow very quick access to large sets of objects – How fast? • Can we distribute them? • Why would we want to? Standard hashing and why it does not work System architectures Distributed Hash Tables • Distributed Hash Tables (DHTs): – Each data item is assigned a key in a key space. – Each node is assigned a key in the same way. – Idea: map the key of a data item to a unique node key by using a distance metric Structured P2P architectures Chord • Developed in 2001 at MIT • Efficient node localization – provable performance, proven correctness • Chord is a distributed lookup protocol – Distributed version of traditional hash tables • Support of just one operation: – given a key, Chord maps the key onto a node • More concretely – Chord provides the IP address of the node responsible for the sought key Structured P2P architectures Chord – consistent hashing • The Hash function assigns each node and key an m-bit identifier using a base hash function such as SHA-1 (Secure Hash Standard) – ID(node) = hash(IP, Port) – ID(key) = hash(key) • Ensures that when an n-th node joins (or leaves) the network, only an O(1/N) fraction of the keys are moved to a different location Structured P2P architectures Chord – consistent hashing • In an m-bit space there are 2m identifiers • How to build the ring? – Identifiers are ordered in an identifier circle modulo 2m – The circle is called the Chord • The key k is assigned to the node whose identifiers is equal to or follows k in the identifier space • This node is called the successor of k Structured P2P architectures Chord – consistent hashing identifier node 6 1 0 successor(6) = 0 6 identifier circle 6 5 2 2 3 4 key successor(1) = 1 1 7 X 2 successor(2) = 3 Structured P2P architectures Chord – node join and departure • When a node n joins the network – certain keys assigned to n’s successor become assigned to n • When node n leaves the network – all of its assigned keys are reassigned to n’s successor. Structured P2P architectures Chord – node join (6) keys 5 7 Certain keys assigned to succ(6) (i.e., 0) now are assigned to 6 keys 1 0 1 7 keys 6 2 5 3 4 keys 2 Structured P2P architectures Chord – node departure (1) keys 7 Certain keys are assigned to succ(1) (i.e., 3) when 1 leaves keys 1 0 1 7 keys 6 6 2 5 3 4 keys 2 Structured P2P architectures Chord – (simple) lookup • A very small amount of routing information suffices to implement consistent hashing in a distributed environment • If each node knows only how to contact its current successor node on the identifier circle, all node can be visited in linear order. • Queries for a given identifier could be passed around the circle via these successor pointers until they encounter the node that contains the key. – Is this efficent ? Structured P2P architectures Chord – (simple) lookup • Pseudo code for finding the successor: // ask node n to find the successor of id n.find_successor(id) if (id  [n, successor]) return successor; else // forward the query around the circle return successor.find_successor(id); Structured P2P architectures Chord – (simple) lookup • The path taken by a query from node 1 for key 7: keys 7 lookup (7) 0 1 7 6 2 5 3 4 Structured P2P architectures Chord – efficient lookup • To accelerate lookups, each Chord node maintains additional routing information • Each node maintains a routing table with (at most) m entries (where N=2m) called the finger table • The ith entry in the table at node n will be: FTn[i]=successor(n+2i-1) – The distance from n increases exponentially Structured P2P architectures Chord – finger table finger table start For. 0+20 0+21 0+22 1 2 4 1 6 succ. 1 3 0 finger table For. start 0 7 keys 6 0 1+2 1+21 1+22 2 3 5 succ. keys 1 3 3 0 2 5 3 4 finger table For. start 0 3+2 3+21 3+22 4 5 7 succ. 0 0 0 keys 2 Structured P2P architectures Chord – finger table • Each node stores information about: – Its direct predecessor – A small number of successor nodes – Knows more about nodes closely following it than about nodes farther away • A node’s finger table generally does not contain enough information to determine the successor of an arbitrary key k • Repetitive queries to nodes that immediately precede the given key will lead to the key’s successor eventually Structured P2P architectures Chord – lookup with finger table • How to perform lookup by exploiting the finger table? – Search in finger table for the highest node q that precedes id q=MAX(FT≤id) – Invoke find_successor from that node q => Number of messages O(log N)! Structured P2P architectures Chord – finger table 63 The distance to peer nodes increases exponentially. finger table distance succ. N11+1 N17 N11+2 N17 N11+4 N17 N11+8 N27 N11+16 N27 N11+32 N48 2 6 56 11 50 finger table Distance succ. N42+1 N48 N48 N42+2 N48is responsible N42+4 The node that N50 28 is N11 N42+8 for the key N42+16 N63 N42+32 N11 get( 28) performed on 48 42 N42 Key 28 27 33 N33 is the target node 17 finger table distance succ. N27+1 N33 N27+2 N33 N27+4 N33 N27+8 N42 N27+16 N48 N27+32 N63 79 Structured P2P architectures Chord – maintenance • Basic “stabilization” protocol is used to keep nodes’ successor pointers up to date – this is sufficient to guarantee correctness of lookups • Those successor pointers can then be used to verify the finger table entries • Every node runs stabilize periodically to find newly joined nodes Structured P2P architectures Chord – stabilization • “Stabilization” protocol contains 6 functions: – create() – join() – stabilize() – notify() – fix_fingers() – check_predecessor() Structured P2P architectures Chord – join • When node n first starts, it calls n.join(n’), where n’ is any known Chord node. • The join() function asks n’ to find the immediate successor of n. • join() does not make the rest of the network aware of n. Structured P2P architectures Chord – stabilization // create a new Chord ring. n.create() predecessor = null; successor = n; // join a Chord ring containing node n’. n.join(n’) predecessor = null; successor = n’.find_successor(n); Structured P2P architectures Chord – stabilization • Each time node n runs stabilize() – it asks its successor for its predecessor p, – it decides whether p should be n’s successor instead. • stabilize() notifies node n’s successor of n’s existence, giving the successor the chance to change its predecessor to n. • The successor does this only if it knows of no closer predecessor than n. Structured P2P architectures Chord – stabilization // called periodically. verifies n’s immediate // successor, and tells the successor about n. n.stabilize() x = successor.predecessor; if (x < n.successor) successor = x; successor.notify(n); // n’ thinks it might be our predecessor. n.notify(n’) if (predecessor is null or n’>predecessor) predecessor = n’; Structured P2P architectures Chord – join and stabilization  np succ(np) = ns succ(np) = n n nil  predecessor = nil  n acquires ns as successor via some n’ n runs stabilize  n notifies ns being the new predecessor  ns acquires n as its predecessor  np runs stabilize  pred(ns) = n pred(ns) = np ns n joins     np asks ns for its predecessor (now n) np acquires n as its successor np notifies n n will acquire np as its predecessor  all predecessor and successor pointers are now correct  fingers still need to be fixed, but old fingers will still work Structured P2P architectures Chord – fix finger • This is how new nodes initialize their finger tables • Each node also periodically calls fix fingers to make sure its finger table entries are up-todate (crashes) Structured P2P architectures Chord – fix finger // called periodically, refreshes finger table entries. n.fix_fingers() for (i=0..m-1) finger[i] = find_successor(n + 2i); Structured P2P architectures Chord – node failures • Key step in failure recovery is maintaining correct successor pointers – Fingers are “only” shortcuts • To help achieve this, each node maintains a successor-list of its r nearest successors on the ring • If node n notices that its successor has failed, it replaces it with the first live entry in the list Structured P2P architectures Chord – experimental results • Latency grows slowly with the total number of nodes • Path length for lookups is about ½ log2N (212 nodes) • Chord is robust in the face of multiple node failures • Original paper: Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-topeer lookup service for internet applications. SIGCOMM 2001 Bittorrent – Trackerless torrents • Keys are filehashes, values are sets of IP addresses • Replication among neighbours • Protocol for building the overlay and routing: Kademlia – – – – Very similar to Chord Finger tables also contain exponentially growing entries But: Distance calculated based on XOR (unintuitive) Advantages • Easier to calculate • Symmetric distance function  Neighbourhood relation is symmetric  Easier communication on join and leave • Further reading: – http://stackoverflow.com/questions/1332107/how-does-dht-intorrents-work – http://tutorials.jenkov.com/p2p/peer-routing-table.html Learned today • Peer-to-Peer – Structured (Chord) • • • • • Logarithmic lookup via finger tables 100% success for lookup Node join and leave How to do lookup Similar implementations underly Bittorrent (“Trackerless torrents”) – Unstructured • Centralized (Napster, Bittorrent (?)) • Decentralized (Gnutella) – Only flooding/random walk possible – Sometimes with super peers

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download node - Open Learning Environment