Download Elena Digor - Computer Networks and Distributed Systems

Kademlia Measurements
Elena Digor
May 18, 2009

1 Abstract

The constantly growing popularity of peer-to-peer (p2p) systems has raised interest in studying their topology and dynamics. One of the most widely used approaches is to create snapshots of the network at specific points in time. The snapshots can be obtained by running distributed crawlers on the system of interest.

We are interested in studying one of the most widely deployed p2p networks, namely KAD. Up to now, there is no open source crawler available for this network. In this report we give an approach and some up-to-date results on creating a crawler for the KAD system.

2 Introduction

Nowadays, p2p systems are widely deployed. Their importance is growing fast, especially in file sharing and grid computing applications. This is explained by the fact that, in comparison to centralized file systems, p2p additionally offers fault tolerance, availability, scalability and performance improvements.

At the end of 2008, p2p accounted for more than 60% of all internet traffic [ipo]. Among the most active p2p systems we can name eDonkey [en], BitTorrent [Bit], KaZaA [KaZ] and Gnutella [refe], each one alone counting more than 1 mln users at peak times. Even though these projects are prone to legal issues triggered by their file-sharing capabilities, p2p systems have become a very interesting area of research.

For one reason, their large-scale distributed structure and redundant storage ensure the fault tolerance and robust nature of p2p systems in comparison to client-server systems.

The basic defining feature of a pure p2p system is that it is characterized by direct access between peer computers, and not through a centralized server. According to Androutsellis-Theotokis et al. [ea02], the questions to be asked as "the litmus test" for p2p are:

• Does this p2p treat variable connectivity and temporal network address as the norm?
• Does this p2p give the nodes at the edge of the network significant autonomy?

Answering these questions requires enough information about the topology and dynamics of the system. The dynamics are not easy to follow, as the nodes come and go fast, and the system constantly readjusts itself. As a consequence, there is a need for a tool which can take "snapshots" of the desired network.

One widely used tool for capturing p2p "snapshots" is a crawler. It is system specific, and the accuracy of its captured snapshots is affected by both the duration of a single crawl and the ratio of unreachable peers. Determining the accuracy of captured snapshots of a p2p system is fundamentally difficult because a perfect reference snapshot for comparison is not available. Having more than one crawler implementation, run over different networks, can help to make a better comparison of the results.

Besides keeping in mind the p2p system's specifications, the design of the crawler has to trade off the duration of the crawl against the completeness of the captured snapshot. The choice between the two depends on the goal: studying the churn of the network (requires only a list of participating peers and their behavior at a specific time) or studying the overlay topology (requires all the edges of the overlay, i.e. the crawler should directly contact every peer).

Considering its widespread usage and availability [ipo, MS07b, PM02, M.S07c], we decided to choose the KAD network for our p2p analysis purposes. In this report we first give known related work and a general overview of KAD, then present a design for our crawler, show the challenges and solutions met while understanding KAD, report our current results, and end with future work and a summarizing conclusion.

3 Related Work

The increasing popularity of p2p systems has raised the interest of many researchers in the actual behavior of p2p systems when deployed on the internet [MS07a, M.S07c, MS07b, SR05, RB03]. Not many crawlers, however, are implemented for the KAD network, and there is no known open-source p2p crawler available.

One of the reasons KAD was not exhaustively studied is that it is still pretty new. Moreover, there is no official protocol specification, and KAD undergoes changes pretty often (almost with every new version of aMule/eMule).

Most of the available crawlers [? MS07b, M.S07c] crawl only a subset of the total KAD ID space, namely a zone that contains all KAD peers whose KAD IDs agree in the high order k bits. For example [? ] has studied 10 bit and 12 bit zones. [MS07b, M.S07c] have extensively crawled 8 bit zones for a period of 6 months. The observations so far were that:

• the lifetime of a significant fraction of the peers observed can be as short as one single session (especially in China).
• Geographically, KAD is most widely used in Europe; as for a single country, however, China holds the trophy (25% of all peers seen at any point in time).

Even though some statistical results were published [MS07b, M.S07c], there is no known open source crawler available for KAD networks, or p2p in general.

4 Kad background

KAD is an implementation of Kademlia [PM02], a peer-to-peer DHT routing protocol. Nowadays, it is widely deployed in such clients as Overnet [reff], eMule [refc], aMule [refa] and, since recently, BitTorrent [Bit]. The advantage of this protocol is that it works pretty fast, and returns search results much faster than other p2p networks.

During the last years, different p2p applications came up with new or modified Kademlia-based overlay networks. Often the difference between these networks is in their predefined opcodes. However, the main principles are still the same.

Each client is assigned a random and unique hashed value of 128 bits (called KAD ID) when the application is started for the first time. The ID stays unchanged (even if the client changes its IP and/or port) until the user deletes the application or its preference files.

In the same 128 bit space, KAD generates, via the MD4 cryptographic hash function, the 128 bit HASH value of the shared files. These values are stored on nodes which are as close as possible to the file HASH (according to the XOR distance between the KAD ID and the file HASH).

Clients over the KAD network communicate via UDP messages [Bru06]. This makes the whole messaging process fast, without the need to establish a connection between the peers. One reason why we wouldn't want to establish connections is that p2p is very dynamic: peers join and leave the network often, so TCP connections would mostly be wasted traffic. TCP connections, however, are still used for the upload and download of files between the peers [Bru06].

4.1 Routing

Routing in KAD is based on prefix matching [MS07b], i.e. on XOR distance. Each KAD client has its own routing table, which is actually an unbalanced tree that stores the known peers based on their XOR distance to the current peer. To visualize, we can imagine the following tree:

Fig. 0: Routing

The distance from the client to itself is "0" ("a" XOR "a" = 0), and can be seen in our tree from Fig. 0 as the "black" spot. The peer which has the same common prefix, and only differs in the last digit (i.e. an XOR operation of the two gives 1), is in the closest subtree. This allows a peer to store more peers which are close to it than ones which are farther away.

All the HASH objects are saved as leaves of this kind of tree. The leaves are actually K-buckets, which means they can hold up to K peers (in our Fig. 0 example, K=2, i.e. each leaf can hold information about at most 2 peers). KAD in the aMule project takes K=10.

Therefore routing is done by simply forwarding parallel messages (3 in aMule) to the "closest" nodes towards the target object.

4.2 Publishing

Once a peer wants to publish a file, it first generates a (128 bit) hash value based on the name of the file. Afterwards, 10 nodes which are in the tolerance zone with respect to this hash value save the information about the file's hash value and its host's address. We say that a node is in the tolerance zone of a target if its XOR with the target is less than 8 bits.

In order to keep only "available" information on the network, keys are periodically republished. Each publisher also asks at specific intervals about the availability of the file it is holding. In addition, after 24 hours all the information about the files is automatically deleted from the publisher.

5 Settings and Algorithm

As there is little or no official documentation for the KAD protocol, we decided to use the aMule open-source code [refb] (written in C++ under the GNU license [refd]) for creating our own crawler. With this in mind, a fair amount of time was dedicated to understanding the class dependencies and separation of concerns. The latter needed special care. The reason is that aMule is actually based on 2 protocols: ed2k [en] and Kademlia (more precisely KAD) [PM02], the last one being embedded much later into the project [refa].

Even though there is a tendency in the current code to separate the KAD part from ed2k, KAD still depends on parts of the ed2k implementation: for example, the real downloading process and the socket implementations still start in the ed2k section. Because of these dependencies, understanding and modifying the Kad protocol becomes more challenging than expected.

Our crawler is interested in capturing information from the incoming packets, while also injecting new packets for getting extra information from known peers.

Therefore, the principal changes for the understanding and measurement of the KAD protocol have been made exclusively in the files of the KAD protocol.

Let's first take a look at what happens when the aMule client is started and accepts incoming KAD packets.
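As a side note to the routing and publishing description in Section 4, the XOR metric can be made concrete with a few lines of Python. This is only a minimal sketch and not code taken from aMule: the function names and the use of plain integers for 128 bit IDs are assumptions made for illustration, and the default k=10 simply mirrors the bucket size quoted above.

```python
from typing import Iterable, List

KAD_ID_BITS = 128  # KAD IDs and file hashes share one 128 bit space


def xor_distance(a: int, b: int) -> int:
    """XOR distance between two IDs; an ID is at distance 0 from itself."""
    return a ^ b


def shared_prefix_bits(a: int, b: int, bits: int = KAD_ID_BITS) -> int:
    """Number of leading bits on which two IDs of the given width agree."""
    return bits - (a ^ b).bit_length()


def closest_peers(target: int, peer_ids: Iterable[int], k: int = 10) -> List[int]:
    """Return up to k known IDs ordered by increasing XOR distance to target."""
    return sorted(peer_ids, key=lambda peer: xor_distance(peer, target))[:k]


if __name__ == "__main__":
    # Hypothetical 4 bit IDs, chosen small so the result is easy to check by hand.
    known = [0b0001, 0b1001, 0b1100]
    print(closest_peers(0b1000, known, k=2))
```

A longer shared prefix always means a smaller XOR distance, which is why a peer that differs only in the last bit ends up in the closest subtree of the routing tree.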
Fig. 1

As we can see, before getting to the Kademlia specific files, the packet passes through several other components. What actually happens is that each incoming packet is first decrypted and checked for the "protocol OP code" in the header, and only after that is it sent for further processing to the corresponding component (see footnote 1). Therefore all the incoming (and already decrypted) KAD packets can be tracked directly in the KAD UDP handler.

Footnote 1: KADEMLIA OP protocol is either \0xe4 or \0xe5 [refb]

Let's look closer at the "kademlia" specific component. We have noticed five sections of the main functionality:

• main file - Kademlia.cpp
A main class for Kademlia components. All the KAD specific classes are merged here. A time handler, in the background, triggers different processes for contacting/deleting peers etc.

• routing table
Contains classes to handle the RoutingZone. It has the built-in tree structure. It handles the inserting, removing, splitting and merging of the nodes and leaves of the tree (the latter containing buckets of at most K contacts).

• network interface
All outgoing and incoming messages have to pass the KademliaUDPListener. It reads the incoming packets and redirects them to the target object. The outgoing messages are also created here and transferred to the ClientUDPSocket, which sends them to the target peer.

• search object
The two files SearchManager and Search handle the search object. The manager is responsible for the whole lifecycle of the search (create, start, update, stop and delete). Furthermore it allocates a search object to an incoming response.

• index
It takes care of all the references. The incoming metadata, location information and notes are managed here (this consists of calculating the load, serialisation, and finding the reference for an incoming search).

As we are interested in churn dynamics, the idea is to have more than one crawler gathering data from the network. Therefore, we've decided to dump the information in a MySQL database. The reason is that the database offers good "synchronization" capabilities. As a result, all the crawlers can dump and update gathered data in a centralized manner, without worrying about race conditions. However, the most important aspect is that the analysis of the results is easier and faster with an optimized database than with a regular file.

Even though with these settings we might suffer from bottleneck issues, the advantages of fast retrieval and update of the abundant incoming data motivate us to prefer this approach over others.

With the above observations in mind, we came up with the following design for our database table:

Field       Type         Null  Key  Default              Description
KAD ID      VARCHAR(8)   NO    PRI  NULL                 unique 128 bits hash value randomly chosen per client
IP          VARCHAR(16)  NO         NULL                 The IP address of the peer
UDP PORT    VARCHAR(2)   NO         NULL                 The UDP port of the peer
TYPE        tinyint(x)   NO         3                    Quality of the peer
START TIME  timestamp    NO         CUR TIMEST           First time peer was added to the database
END TIME    timestamp    NO         0000-00-00 00:00:00  Last time the entry was updated
DEAD        int(11)      YES        NULL                 number of deletions of the peer from the routing table
STATIC      tinyint(x)   NO         1                    0 - if IP has changed; 1 - if IP has never changed
PACKETS     int(11)      YES        NULL                 number of incoming packets from this peer

Table 1

Most of this information can easily be taken from the incoming packets. The "STATIC" field can be updated based on known data from the table. Consider a *currently analyzed* peer. If it is already present in the database (the check is done based on the KAD ID) and its old IP (from the database) differs from the newly seen IP, then STATIC gets value 0; otherwise STATIC keeps its value. If the peer is not yet present in the database, then it is added, and its "STATIC" field is initialized to 1.
By gathering the above mentioned data for the encountered contacts in parallel, over a longer period of time, we will be able to observe the dynamic behavior of the KAD network. Moreover, by analyzing the table we can make assumptions about how prone KAD is to eclipse attacks [ea04]. One reason to become suspicious would be an enormous number of "inactive" peers in our database (i.e. ones encountered only once in a lifetime). Another benefit of the gathered results will be the collection of new "snapshots" of p2p networks which can be compared to snapshots taken by others.

Considering our observations up to now, the crawler will be built on the following basis:

• Whenever there is an incoming response or request packet from a peer, the corresponding information in the database will be updated.
• In order to get more incoming messages from other peers with a random list (up to 20) of contacts known to them, we will send "BOOTSTRAP REQ" to randomly chosen peers from our routing table.

The incoming packets, once decrypted (done before arriving at "KademliaUDPListener.cpp"), have a predefined form. Their structure depends on the type of the message, and they are partly described in "src/include/protocol/kad/UDP.h". We will take special care of only a few types of incoming messages, namely:

• KADEMLIA RES - OP: 0x28 - especially interesting as it can hold up to 11 extra peers
• KADEMLIA BOOTSTRAP RES - OP: 0x08 - especially interesting as it can return up to 20 peers.

The other responses will be analyzed as well, but they won't contribute significantly to populating our table.

6 Results

In order to analyze KAD better for implementing the desired crawler, we had the following setup:

1. install aMule 2.2.4 from sources (installed on OS X and Linux)
2. set up the previously described database
3. set up aMule to run on Kademlia only
4. run a packet sniffer to catch and analyze the incoming traffic (we used Wireshark 0.99 for OS X)

The motivation to use a "packet sniffer" was to see first how the client behaves on a localhost.

First results were fairly disappointing. aMule managed to connect to the Kademlia network after several minutes of trying, but was seeing only a small part of the whole network (aMule's statistics showed only 200 users online, at most). Wireshark's statistics on the UDP protocol showed a lot of outgoing traffic on UDP, but almost no incoming traffic.

After researching the issue in more detail, we found out that aMule doesn't perform quite well behind a firewall (imposed by the router, in my case). Looking through Wireshark's snapshot, I could see that no user was contacting my aMule client unless my client was contacting the user first. From the source code we could explain the behavior as follows.

Since KAD starts up with a given list of nodes (around 160 users), it can only contact those nodes. However, considering the dynamics of p2p networks, the list of nodes can easily become outdated, and therefore none or very few of the listed contacts are actually available to respond to our KADEMLIA REQ. As a result the node list gets exhausted easily, and our Kademlia client is not really connected to the network.

Why doesn't our firewalled KAD client become a node in other tables, and therefore become more available to the network? To answer this question let's look at what happens when a peer receives a "KADEMLIA REQ" or "KADEMLIA RES" from us for the first time:

Fig. 2

As we can easily see, whenever a peer is triggered by a first message from another peer which is not in its routing list, it sends that peer a "KADEMLIA FIREWALLED REQ".
In case the requested peer is behind a firewall, it just ignores the request, and after several seconds the inquirer concludes that the requested node is behind a firewall and doesn't even try to add it to its routing table. In case it receives a confirmation that the peer is not behind a firewall, the inquirer might consider adding this node to its routing table (as discussed in [M.S07c]).

Therefore, clients which are behind a firewall cannot be seen by anyone except those which were specifically contacted by the firewalled peer. This explains the initial behavior of our firewalled client, which was hardly connected to the network.

To confirm this, we can look at the following 2 graphs (generated by Wireshark after a 1 hour run) of a client which was not behind a firewall.

Fig. 3 Beginning of snapshot

Fig. 4 End of snapshot

As we can clearly see, there is a big boost of incoming and then outgoing messages at the beginning of the snapshot. Most of the outgoing packets are actually the responses to the incoming requests. The Kad client connects very quickly to the network, and aMule's statistics show around mln of users online.

Therefore, we can deduce from the above results that, in order to have a more accurate crawler, we first of all need to run it in a non-firewalled environment.

We have implemented the "peer information grabbing" from the incoming packets, and dumped it into the database. We achieved this by extending the code of aMule 2.2.4 and using the mysql++ connector for the communication between our client and the database. We used the MySQL table discussed above.

7 Outlook

With the first steps done, the next thing to implement is the "force bootstrapping" messages to all known peers, in order to gain more information about the active users of the network. Once that is finished, we can affirm that we have a primitive crawler, which can already be used for data collection.

A big improvement will be to take notice of the related work, and try to implement a crawler which crawls only 8-bit zones instead of the whole network (statistics have shown [MS07b] that the results in 8-bit zones are the same as for the entire network; the improvement comes from a much faster crawler and more efficient data management. The reason we didn't start with this approach was the time constraint.)

Of course the full usage of the crawler will be taken into account once it is run in parallel on more than one machine.

8 Conclusions

Every simulation or analysis study of a peer-to-peer system relies on some model of churn. Towards this end, researchers and developers require an accurate model of churn in order to draw appropriate conclusions about peer-to-peer systems. However, accurately characterizing a network's dynamics requires fine-grained and unbiased information about the arrival and departure of peers, which is challenging to acquire in practice, primarily due to the large size and highly dynamic nature of these systems.

As a result our project was aimed at capturing and analyzing a part of the p2p network on our own. We have presented and partly (at the time of writing this report) implemented a personal crawler to analyze a p2p network, namely KAD. In order to do so, we analyzed the open source code of aMule, which implements a version of the KAD protocol. The technical issues we faced during this research helped us understand particular issues about the KAD protocol that have not been specifically documented in any of the known literature.

Moreover we have set up a MySQL database which can gather the information of interest from more than one simultaneously running modified client. So far it gathers only the information from the incoming KAD packets, but, by the end of the project, we plan to extend it by making the client specifically bootstrap new messages into the network. As a result the routing tables will be populated much faster with information about "active clients" in the network.

In the end we presented future aims and possible developments of this project.

References

[Bit] BitTorrent. Bittorrent. URL: http://www.bittorrent.com/.
[Bru06] Rene Brunner. A performance evaluation of the kad-protocol. November 2006.
[ea02] A. Adya et al. A survey of peer-to-peer file sharing technologies. Athens Univ. of Economics and Business White Paper (WHP-2002-03), 2002.
[ea04] Atul Singh et al. Defending against eclipse attacks on overlay networks. ACM, 2004.
[en] eDonkey network. edonkey. URL: http://www.edonkey.co.nr/.
[ipo] ipoque. Internet study 2008/2009. URL: http://www.ipoque.com/resources/internet-studies/internet-study-2008_2009.
[KaZ] KaZaA. Kazaa. URL: http://www.kazaa.com/.
[MS07a] Ernst W. Biersack, Moritz Steiner, Taoufik En-Najjary. Exploiting kad: Possible uses and misuses. ACM SIGCOMM Computer Communication Review, 2007.
[MS07b] Ernst W. Biersack, Moritz Steiner, Taoufik En-Najjary. A global view of kad. ACM, 2007.
[M.S07c] E.W. Biersack, M. Steiner, T. En-Najjary. Analyzing peer behavior in kad. Institut Eurecom, 2007.
[PM02] D. Mazieres, P. Maymounkov. Kademlia: A peer-to-peer information system based on the xor metric. Proceedings of the 1st International Workshop on Peer-to-Peer Systems, pages 53-65, 2002.
[RB03] Geoffrey M. Voelker, Ranjita Bhagwan, Stefan Savage. Understanding availability. Proceedings of the 2nd International Workshop on Peer-to-Peer Systems, 2003.
[refa] amule. URL: amule.org.
[refb] amule 2.2.4 sources. URL: http://www.amule.org/files/files.php?cat=39.
[refc] emule project. URL: www.emule-project.net/.
[refd] The gnu project. URL: www.gnu.org/copyleft/gpl.html.
[refe] The gnutella protocol specification v0.4. URL: www9.limewire.com/developer/gnutella_protocol_0.4.pdf.
[reff] Overnet. URL: overnet.org.
[SR05] Daniel Stutzbach and Reza Rejaie. Evaluating the accuracy of captured snapshots by peer-to-peer crawlers. Springer-Verlag, Berlin Heidelberg, pages 353-357, 2005.
