Kademlia Measurements
Elena Digor
May 18, 2009
1 Abstract
The constantly growing popularity of peer-to-peer systems has raised interest in studying their topology and dynamics. One of the most widely used approaches is to create snapshots of the network at specific points in time. The snapshots can be captured by running distributed crawlers on the system of interest.

We are interested in studying one of the most widely deployed p2p networks, namely KAD. Up to now, there is no open source crawler available for this network. In this report we present an approach to creating a crawler for the KAD system, together with some up-to-date results.

2 Introduction
Nowadays, p2p systems are widely deployed. Their importance is growing fast, especially in file sharing and grid computing applications. This is explained by the fact that, in comparison to centralized file systems, p2p additionally offers fault tolerance, availability, scalability and performance improvements.

At the end of 2008, p2p accounted for more than 60% of all internet traffic [ipo]. Among the most active p2p systems we can name eDonkey [en], BitTorrent [Bit], KaZaA [KaZ] and Gnutella [refe], each one alone accounting for more than 1 million users at peak times. Even though these projects are prone to legal issues triggered by their file-sharing capabilities, p2p systems have become a very interesting area of research.

For one reason, their large-scale distributed structure and redundant storage ensure the fault tolerance and resilient nature of p2p systems in comparison to client-server systems.

The basic defining feature of a pure p2p system is direct access between peer computers, rather than access through a centralized server. According to Androutsellis-Theotokis et al. [ea02], the questions to be asked as "the litmus test" for p2p are:
• Does this p2p treat variable connectivity and temporary network addresses as the norm?
• Does this p2p give the nodes at the edge of the network significant autonomy?

Answering these questions requires enough information about the topology and dynamics of the system. The dynamics are not easy to follow, as nodes come and go fast and the system constantly readjusts itself. As a consequence, there is a need for a tool which can take "snapshots" of the desired network.

One widely used tool for capturing "p2p snapshots" is a crawler. It is system specific, and the accuracy of its captured snapshots is affected both by the duration of a single crawl and by the ratio of unreachable peers. Determining the accuracy of captured snapshots of a p2p system is fundamentally difficult, because a perfect reference snapshot for comparison is not available. Having more than one crawler implementation, run over different networks, can help to make a better comparison of the results.

Besides keeping the p2p system's specifications in mind, the design of a crawler has to make a tradeoff between the duration of the crawl and the completeness of the captured snapshot. The choice between the two is a preference between studying the churn of the network (which requires only a list of participating peers and their behavior at specific times) and studying the overlay topology (which requires all the edges of the overlay, i.e. the crawler should directly contact every peer).

Considering its widespread usage and availability [ipo, MS07b, PM02, M.S07c], we decided to choose the KAD network for our p2p analysis. In this report we first present known related work and a general overview of KAD, then present a design for our crawler, show the challenges and solutions met while understanding KAD, mention our current results, and end with future work and a summarizing conclusion.

3 Related Work
The increasing popularity of p2p systems has raised the interest of many researchers in observing the actual behavior of p2p systems when deployed on the internet [MS07a, M.S07c, MS07b, SR05, RB03]. Not many crawlers, however, have been implemented for the KAD network, and there is no known open-source p2p crawler available.

One of the reasons KAD has not been exhaustively studied is that it is still fairly new. Moreover, there is no official protocol specification, and KAD changes often (with almost every new version of aMule/eMule).

Most of the available crawlers [?, MS07b, M.S07c] crawl only a subset of the total KAD ID space, namely a zone that contains all KAD peers whose KAD IDs agree in the high order k bits. For example, [?] has studied 10-bit and 12-bit zones, and [MS07b, M.S07c] have extensively crawled 8-bit zones for a period of 6 months. The observations so far were that
• the lifetime of a significant fraction of the peers observed can be as short as one single session (especially in China).
• geographically, KAD is most widely used in Europe; as a single country, however, China holds the trophy (25% of all peers seen at any point in time).

Even though some statistical results have been published [MS07b, M.S07c], there is no known open source crawler available for KAD networks, or for p2p in general.

4 Kad background
KAD is an implementation of Kademlia [PM02], a peer-to-peer DHT routing protocol. Nowadays, it is widely deployed in clients such as Overnet [reff], eMule [refc], aMule [refa] and, more recently, BitTorrent [Bit]. The advantage of this protocol is that it is fast and returns search results much faster than other p2p networks.

During the last years, different p2p applications came up with new or modified Kademlia-based overlay networks. Often the difference between these networks is in their predefined opcodes. However, the main principles are still the same.

Each client is assigned a random and unique hashed value of 128 bits (called the KAD ID) when the application is started for the first time. The ID stays unchanged (even if the client changes its IP and/or port) until the user deletes the application or its preference files.

In the same 128-bit space, KAD generates, via the MD4 cryptographic hash function, the 128-bit HASH values of the shared files. These values are stored on nodes which are as close as possible to them (according to the XOR distance between the KAD ID and the file HASH).

Clients in the KAD network communicate via UDP messages [Bru06]. This makes the whole messaging process fast, with no need to establish a connection between the peers. One reason to avoid connections is that p2p is very dynamic: peers join and leave the network often, so TCP connections would largely be wasted traffic. TCP connections, however, are still used for the upload and download of files between the peers.

4.1 Routing
Routing in KAD is based on prefix matching [MS07b], i.e. on XOR distance. As a result, each KAD client has its own routing table, which is actually an unbalanced tree that stores the known peers based on their XOR distance to the current peer. To visualize, we can imagine the following:
Fig.0 : Routing
The distance from a client to itself is "0" ("a" XOR "a" = 0), shown in the tree of fig.0 as the "black" spot. A peer which shares the whole prefix and differs only in the last bit (i.e. an XOR of the two IDs gives 1) is in the closest subtree. This allows a peer to store more of the peers which are close to it than of those which are farther away.
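The XOR metric described above can be illustrated with a minimal sketch. It assumes KAD IDs are handled as plain Python integers; the function names (`xor_distance`, `common_prefix_len`, `closest`) are hypothetical and are not taken from the aMule sources.

```python
ID_BITS = 128  # KAD IDs are 128-bit values


def xor_distance(a: int, b: int) -> int:
    """Distance between two IDs: their bitwise XOR, read as an integer."""
    return a ^ b


def common_prefix_len(a: int, b: int, bits: int = ID_BITS) -> int:
    """Number of leading bits two IDs share (longer prefix = closer)."""
    return bits - (a ^ b).bit_length()


def closest(target: int, peers: list, count: int) -> list:
    """Return the `count` known peers with the smallest XOR distance to target."""
    return sorted(peers, key=lambda p: xor_distance(p, target))[:count]


# Small 4-bit example: two IDs that differ only in the last bit.
a, b = 0b1010, 0b1011
print(xor_distance(a, a))               # distance of an ID to itself
print(xor_distance(a, b))               # IDs differing in the last bit
print(common_prefix_len(a, b, bits=4))  # shared leading bits
```

Sorting by XOR distance, as `closest` does, is only meant to show the ordering the routing tree induces; a real client keeps this ordering implicitly in its tree of buckets.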
All the HASH objects are saved as leaves of this kind of tree. The leaves are actually K-buckets, which means they can store up to k peers (in our fig.0 example K=2, i.e. each leaf can hold information about at most 2 peers). KAD in the aMule project uses k=10.

Therefore routing is done by simply forwarding a number of parallel messages (3 in aMule) to the "closest" nodes towards the target object.

4.2 Publishing
Once a peer wants to publish a file, it first generates a (128-bit) hash value based on the name of the file. Afterwards, 10 nodes which are in the tolerance zone with respect to this hash value save the information about the file's hash value and its host's address. We say that a node is in the tolerance zone of a target if its XOR with the target is less than 8 bits.

In order to keep only "available" information on the network, keys are periodically republished. Each publisher also asks, at specific intervals, about the availability of the file it is holding. In addition, after 24 hours all the information about the files is automatically deleted from the pub-

5 Settings and Algorithm
As there is little or no official documentation for the KAD protocol, we decided to use the aMule open-source code [refb] (written in C++ under the GNU license [refd]) for creating our own crawler. With this in mind, a fair amount of time was dedicated to understanding the class dependencies and the separation of concerns. The latter needed special care. The reason is that aMule is actually based on 2 protocols: ed2k [en] and Kademlia (more precisely, KAD) [PM02], the latter being embedded much later into the project [refa].

Even though there is a tendency in the current code to separate the KAD part from ed2k, KAD still depends on parts of the ed2k implementation: for example, the actual downloading process and the socket implementations still start in the ed2k section. Because of these dependencies, understanding and modifying the Kad protocol becomes more challenging than expected.

Our crawler is interested in capturing information from the incoming packets, while also injecting new packets to get extra information from known peers.

Therefore, the principal changes for understanding and measuring the KAD protocol are made exclusively in the KAD files.

Let's first take a look at what happens when the aMule client is started and accepts incoming KAD packets.
As we can see, before getting to the Kademlia-specific files, the packet passes through several other components. What actually happens is that each incoming packet is first decrypted and checked for the "protocol OP code" in the header, and only after that is it sent for processing to the corresponding component (the OP protocol code is either 0xe4 or 0xe5 [refb]). Therefore all the incoming (and already decrypted) KAD packets can be tracked directly in the KAD UDP listener.

Let's look closer at the "kademlia" specific component. We have identified five sections of main functionality:
• main file - Kademlia.cpp
A main class for the Kademlia components. All the KAD-specific classes are merged here. A time handler, in the background, triggers different processes for contacting/deleting peers.
• routing table
Contains classes to handle the RoutingZone. It has the built-in tree structure. It handles the inserting, removing, splitting and merging of the nodes and leaves of the tree (the leaves containing buckets of at most K contacts).
• network interface
All outgoing and incoming messages have to pass through the KademliaUDPListener. It reads the incoming packets and redirects them to the target object. The outgoing messages are also created here and transferred to the ClientUDPSocket, which sends them to the target peer.
• search object
The two files SearchManager and Search handle the search object. The manager is responsible for the whole lifecycle of the search (create, start, update, stop and delete). Furthermore, it allocates a search object to an incoming response.
• index
It takes care of all the references. The incoming metadata, location information and notes are managed here (i.e. it consists of calculating the load, the serialisation, and finding the reference for an incoming [...]).

As we are interested in churn dynamics, the idea is to have more than one crawler gathering data from the network. Therefore, we decided to dump the information into a MySQL database. The reason is that the database offers good "synchronization" capabilities. As a result, all the crawlers can dump and update the gathered data in a centralized manner, without worrying about race conditions. The most important aspect, however, is that the analysis of the results is easier and faster with an optimized database than with a regular file.

Even though with these settings we might suffer from bottleneck issues, the advantage of fast retrieval and update of the abundant incoming data motivates us to prefer this approach over others.

Having in mind the observations from above, we came up with the following design for our database table:

• unique 128-bit hash value randomly chosen per client (the KAD ID)
• the IP address of the peer
• the UDP port of the peer
• quality of the peer
• first time the peer was added to the database
• last time the entry was [...]
• number of deletions of the peer from the routing table
• STATIC: 0 if the IP has changed, 1 if the IP has never changed
• number of incoming packets from this peer
Table 1
Most of this information can easily be extracted from the incoming packets. The "STATIC" field can be updated based on data already known from the table. Let's look at a *currently analyzed* peer. If it is already present in the database (the check is done based on the KAD ID) and its old IP (from the database) differs from the newly seen IP, then STATIC gets the value 0; otherwise the value remains unchanged. If the peer is not yet present in the database, it is added, and its "STATIC" field is initialized to 1.
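The update rule for the "STATIC" field can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the MySQL database, and the table and column names (`peers`, `kad_id`, `last_seen`, `packets`, ...) are hypothetical, since the report does not give the actual schema. The quality and deletion-count fields are left out for brevity.

```python
import sqlite3
import time


def open_db(path=":memory:"):
    """Create the (hypothetical) peers table if it does not exist yet."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS peers (
               kad_id     TEXT PRIMARY KEY,            -- 128-bit KAD ID, hex encoded
               ip         TEXT NOT NULL,
               udp_port   INTEGER NOT NULL,
               first_seen REAL NOT NULL,
               last_seen  REAL NOT NULL,
               static     INTEGER NOT NULL DEFAULT 1,  -- 1 until the IP changes
               packets    INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return db


def record_packet(db, kad_id, ip, udp_port, now=None):
    """Insert or update a peer for one incoming packet."""
    now = time.time() if now is None else now
    row = db.execute("SELECT ip FROM peers WHERE kad_id = ?", (kad_id,)).fetchone()
    if row is None:
        # Unknown peer: add it with STATIC initialized to 1.
        db.execute(
            "INSERT INTO peers (kad_id, ip, udp_port, first_seen, last_seen,"
            " static, packets) VALUES (?, ?, ?, ?, ?, 1, 1)",
            (kad_id, ip, udp_port, now, now),
        )
    else:
        # Known peer: STATIC drops to 0 once the stored IP differs from the new one
        # and never goes back to 1.
        ip_changed = int(row[0] != ip)
        db.execute(
            "UPDATE peers SET ip = ?, udp_port = ?, last_seen = ?,"
            " packets = packets + 1,"
            " static = CASE WHEN ? = 1 THEN 0 ELSE static END"
            " WHERE kad_id = ?",
            (ip, udp_port, now, ip_changed, kad_id),
        )
    db.commit()
```

With several crawlers writing concurrently, the select-then-update pair would need to run inside one transaction (or be replaced by a single upsert statement) to avoid lost updates; the sketch ignores this.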
By gathering, in parallel, the above mentioned data for the encountered contacts over a longer period of time, we will be able to observe the dynamic behavior of the KAD network. Moreover, by analyzing the table we can make assumptions about how prone KAD is to eclipse attacks [ea04]. One reason to become suspicious would be an enormous number of "inactive" peers in our database (i.e. ones encountered only once in a lifetime).

Another benefit of the gathered results will be the collection of new "snapshots" of p2p networks, which can be compared to snapshots taken by others.

Considering our observations up to now, the crawler will be built on the following basis:
• Whenever there is an incoming response or request packet from a peer, the corresponding information in the database will be updated.
• In order to get more incoming messages from other peers, each carrying a random list (up to 20) of contacts known to them, we will send "BOOTSTRAP REQ" messages to randomly chosen peers from our routing table.

The incoming packets, once decrypted (which is done before they arrive at "KademliaUDPListener.cpp"), have a predefined form. Their structure depends on the type of the message, and they are partly described in "src/include/protocol/kad/UDP.h". We will take special care of only a few types of incoming messages, namely:
• KADEMLIA RES - OP: 0x28 - especially interesting as it can hold up to 11 extra peers
• the response to "BOOTSTRAP REQ" - OP: 0x08 - especially interesting as it can return up to 20 peers.

The other responses will be analyzed as well, but they won't contribute significantly to populating our table.

In order to analyze KAD better for implementing the desired crawler, we had the following setup:
1. install aMule 2.2.4 from sources (installed on OS X and Linux)
2. set up the previously described database
3. set up aMule to run on Kademlia only
4. run a packet sniffer to catch and analyze the incoming traffic (we used Wireshark 0.99)

The motivation for using a "packet sniffer" was to see first how the client behaves on a local host.

The first results were fairly disappointing. aMule managed to connect to the Kademlia network after several minutes of trying, but was seeing only a small part of the whole network (aMule's statistics were showing only 200 users online, at most). Wireshark's statistics on the UDP protocol were showing a lot of outgoing UDP traffic, but almost no incoming traffic.

After researching the issue in more detail, we found out that aMule doesn't perform quite well behind a firewall (imposed by the router, in our case). Looking through Wireshark's snapshot, we could see that no user was contacting our aMule client unless our client contacted that user first. Based on the source code, we can explain the behavior as follows. Since KAD starts up with a given list of nodes (around 160 users), it can only contact those nodes. However, considering the dynamics of p2p networks, the list of nodes can easily become outdated, and therefore none or very few of the listed contacts are actually available to respond to our KADEMLIA REQ. As a result, the node list gets exhausted easily, and our Kademlia client is not really connected to the network.

Why doesn't our firewalled KAD client become a node in other routing tables, and therefore become more available to the network? To answer this question, let's look at what happens when a peer receives a "KADEMLIA REQ" or "KADEMLIA RES" from us for the first time:

Fig. 2
As we can easily see, whenever a peer receives a first message from another peer which is not in its routing table, it sends that peer a "KADEMLIA FIREWALLED REQ". If the requested peer is behind a firewall, it just ignores the request, and after several seconds the inquirer concludes that the requested node is firewalled and doesn't even try to add it to its routing table. If the inquirer receives a confirmation that the peer is not behind a firewall, it might consider adding this node to its routing table (as discussed in [M.S07c]). Therefore, clients which are behind a firewall cannot be seen by anyone except those peers which were specifically contacted by the firewalled peer. This explains the initial behavior of our firewalled client, which was hardly connected to the network.
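The decision logic just described can be summarized in a short sketch. It is a simplified model, not aMule code: the class and method names are hypothetical, peers are represented by plain strings, the 5 second timeout merely stands in for "several seconds", and a confirmed peer is always added although the text only says it "might" be.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Simplified model of how a peer reacts to first contact (hypothetical names)."""
    routing_table: set = field(default_factory=set)
    pending: dict = field(default_factory=dict)  # sender -> time the check was sent

    def on_first_message(self, sender, now):
        """Unknown sender: start a firewall check instead of adding it right away."""
        if sender in self.routing_table or sender in self.pending:
            return None
        self.pending[sender] = now
        return ("KADEMLIA FIREWALLED REQ", sender)

    def on_firewall_confirmation(self, sender):
        """Sender answered the check, so it is reachable and may be added."""
        if self.pending.pop(sender, None) is not None:
            self.routing_table.add(sender)

    def expire(self, now, timeout=5.0):
        """No answer within the timeout: treat the sender as firewalled and drop it."""
        for sender, sent_at in list(self.pending.items()):
            if now - sent_at >= timeout:
                del self.pending[sender]


peer = Peer()
peer.on_first_message("reachable", now=0.0)
peer.on_first_message("firewalled", now=0.0)
peer.on_firewall_confirmation("reachable")  # only this one answers
peer.expire(now=10.0)
print(sorted(peer.routing_table))
```

The firewalled sender never enters the routing table, which matches the observation that our firewalled client was seen only by the peers it had contacted itself.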
To confirm this, we can look at the following 2 graphs (generated by Wireshark after a 1 hour run) of a client which was not behind a firewall.
Fig. 3 Beginning of snapshot
Fig. 4 End of snapshot
As we can clearly see, there is a big boost of incoming and then outgoing messages at the beginning of the snapshot. Most of the outgoing packets are actually the responses to the incoming requests. The Kad client connects very quickly to the network, and the aMule statistics show around a million users online.

Therefore, we can deduce from the above results that, in order to have a more accurate crawler, we first of all need to run it in a non-firewalled environment.

We have implemented the "peer information grabbing" from the incoming packets, and dump the information into the database. We achieved this by extending the code of aMule 2.2.4 and using the mysql++ connector for the communication between our client and the database. We used the MySQL table discussed above.

7 Outlook
With the first steps done, the next things to implement are the "force bootstrapping" messages to all known peers, in order to gain more information about the active users of the network. Once that is finished, we will have a primitive crawler which can already be used for data collection.

[...] it gathers only the information from [...] packets, but, by the end of the [...] extend it more, by making the client bootstrap new messages into [...] the routing tables will be information about [...]

[...] the related work, and try to implement a crawler [...] network (statistics have shown [MS07b] that the results in 8-bit zones are the same as for the entire network; the improvement, however, comes from a much faster crawler and more efficient data management). The reason we didn't start with this approach was the time constraint.

Of course, the crawler will only be used to its full extent once it is run in parallel on more than one machine.

8 Conclusion
Every simulation or analysis study of a peer-to-peer system relies on some model of churn. To this end, researchers and developers require an accurate model of churn in order to draw appropriate conclusions about peer-to-peer systems. However, accurately characterizing a network's dynamics requires fine-grained and unbiased information about the arrival and departure of peers, which is challenging to acquire in practice, primarily due to the large size and highly dynamic nature of these systems.

As a result, our project was aimed at capturing and analyzing a part of the p2p network on our own. We have presented and partly (at the time of writing this report) implemented a personal crawler to analyze a p2p network, namely KAD. In order to do so, we analyzed the open source code of aMule, which implements a version of the KAD protocol. The technical issues we faced during this research helped us understand particular aspects of the KAD protocol that have not been specifically documented in any of the known literature.

Moreover, we have set up a MySQL database [...] can gather the information of interest from [...] one simultaneously running modified [...]

In the end we presented future aims and possible developments of this project.
