ZHOU Mo, et al. Sci China Inf Sci April 2010 Vol. 53 No. 4 771
know the contact information first. From this we can see the importance of the P2P crawler.
The crawler in [11] is similar to the single crawler described in this paper in that it can perform
some lightweight crawling. We will compare the algorithm used in [11] with ours.
Ref. [12] developed a crawler framework and applied the crawler to the eMule Kad overlay network,
proposing several improvements to Kad. In this paper, we focus on the crawlers themselves and on
ways of crawling the overlay network quickly. Our approach crawls iteratively, so the result
is not restricted to a single peer's routing table as in [12].
Ref. [13] compared structured and unstructured overlay networks and showed that geographic
location has little influence on the measurement results; this finding helps us reduce
the number of experiments.
3 The design and analysis of the crawler
This section proposes a basic crawler framework that makes it possible to crawl a known peer's
routing table multiple times. It is the foundation for optimizing the crawl: by checking the ratio
between the results of a single crawl and those of the whole crawl, we can drop crawling
results of little value. The algorithm provided in this framework can also exclude bogus information
from malicious peers while maintaining the crawling speed.
Each peer in the DHT has a routing table containing other peers' contact information. Peers use their
routing tables to locate the peer with a specified ID in the overlay network, and after the routing procedure
they can search for or distribute information with the peers in the result set. The
routing procedure is thus the foundation of the overlay network's functionality. The crawlers send
ordinary find-peer requests to peers they already know, as if the crawlers were trying to route
to a peer with a specified ID. To gather as much information as possible, we may need to send
find-peer requests to every known peer more than once. In each crawling action, we choose an ID and
then send a find-peer request to a known peer, asking it for peers whose IDs are close to the chosen
ID. To gather most of a known peer's routing table, we need to choose a
set of request IDs that matches the structure of the routing table.
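The iterative crawling loop described above can be sketched as follows. The helper names `send_find_node` (issue one find-peer request and return the contacts it yields) and `query_ids_for` (produce the set of request IDs for one peer) are hypothetical placeholders for the network layer, not names from the paper:

```python
from collections import deque

def crawl(bootstrap_peers, query_ids_for, send_find_node):
    """Iteratively crawl a DHT overlay (sketch; network I/O abstracted away).

    query_ids_for(peer)          -> iterable of target IDs for that peer
    send_find_node(peer, target) -> list of contacts returned by the peer
    """
    known = set(bootstrap_peers)      # every contact discovered so far
    pending = deque(bootstrap_peers)  # peers not yet queried
    while pending:
        peer = pending.popleft()
        for target in query_ids_for(peer):
            for contact in send_find_node(peer, target):
                if contact not in known:
                    known.add(contact)
                    pending.append(contact)  # newly learned peers are crawled too
    return known
```

Because every newly learned contact is queued for querying in turn, the crawl is not confined to the bootstrap peers' routing tables, matching the iterative behavior described above.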
3.1 Basic design of crawlers for Kademlia
In the Kademlia DHT, the logical distance between two peers is the XOR of their logical IDs, and
Kademlia stores the routing table as a binary tree. Every leaf node in the binary
tree is called a bucket, which stores a small set of peers' information. The internal nodes of the
tree contain no peer information, and each leaf node can hold only information of
a limited number of peers. When the routing table is initialized, it contains only a root node,
which is itself a leaf. As new peers keep joining the routing table, this
sole root node splits once the number of peers exceeds the limit. After the split, there are a
new left child leaf node and a new right child leaf node, and the peers in the former root node are
redistributed into these two new leaves: peers whose IDs have a prefix of 0 go into the
left child node, and the others into the right child node. As the height of the binary tree
increases, the leaf nodes contain peer IDs with longer common prefixes. So every bucket covers only
a limited ID range, and only the bucket whose prefix matches the host's own ID
will split further; all other buckets simply ignore new peers once they would need to split to hold
them. When a peer receives a request to search for an ID, it finds the proper bucket in the
binary tree of its routing table by examining the logical distance to the desired ID, and chooses a set of
peers from that bucket as the result.
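The splitting rule above can be illustrated with a toy routing table. This is a simplified sketch, not the paper's implementation: it uses 8-bit IDs and a bucket capacity of 2, whereas real Kademlia uses 160-bit IDs and a larger bucket size:

```python
K = 2          # bucket capacity (toy value; real Kademlia uses a larger k)
ID_BITS = 8    # toy ID length; Kademlia IDs are 160 bits

class Node:
    """A node of the routing-table binary tree; leaves act as buckets."""
    def __init__(self, prefix=0, depth=0):
        self.prefix, self.depth = prefix, depth  # bits shared by this subtree
        self.peers = []                          # contacts; meaningful at leaves only
        self.left = self.right = None

def bit(i, d):
    """The d-th most significant bit of an ID."""
    return (i >> (ID_BITS - 1 - d)) & 1

def covers(node, some_id):
    """Does some_id fall into this node's ID range?"""
    return (some_id >> (ID_BITS - node.depth)) == node.prefix

def insert(node, peer_id, host_id):
    while node.left is not None:                 # descend to the leaf bucket
        node = node.right if bit(peer_id, node.depth) else node.left
    if peer_id in node.peers:
        return
    if len(node.peers) < K:
        node.peers.append(peer_id)
    elif covers(node, host_id):                  # only the bucket near the host splits
        node.left = Node(node.prefix << 1, node.depth + 1)         # prefix bit 0
        node.right = Node((node.prefix << 1) | 1, node.depth + 1)  # prefix bit 1
        for p in node.peers:                     # redistribute old contacts
            child = node.right if bit(p, node.depth) else node.left
            child.peers.append(p)
        node.peers = []
        insert(node, peer_id, host_id)           # retry; may split again
    # else: a full bucket far from the host ignores the new peer
```

Note that a full bucket whose range does not cover the host's own ID silently drops new peers, which is exactly why a crawler cannot recover a peer's neighborhood from a single request.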
We can now apply this to the design of the crawlers. If the crawler chooses an ID at a proper
logical distance from the peer being queried, we can expect the peers' information in the corresponding
bucket to be returned. Likewise, a suitable set of IDs will make the queried peer return most of the
information in every bucket. We can introduce a parameter n for the size of the set of IDs sent to a known peer.
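One way to construct such a set, sketched below under the same toy 8-bit ID assumption as above, is to flip successive prefix bits of the queried peer's ID: each flipped bit yields a target whose XOR distance falls in a different bucket's range. This is an illustrative construction, not necessarily the exact ID-selection scheme of the paper:

```python
ID_BITS = 8   # toy ID length; real Kademlia IDs are 160 bits

def query_ids(peer_id, n):
    """Build up to n target IDs, one per bucket of the queried peer.

    Flipping the i-th most significant bit of peer_id gives an ID whose XOR
    distance to peer_id lies in [2^(ID_BITS-1-i), 2^(ID_BITS-i)), so each
    request addresses a different bucket of a fully split routing table.
    """
    return [peer_id ^ (1 << (ID_BITS - 1 - i)) for i in range(min(n, ID_BITS))]
```

Sending these n targets to one peer probes its n closest-to-the-root buckets, which is the role the parameter n plays above.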