ZHOU Mo, et al. Sci China Inf Sci April 2010 Vol. 53 No. 4 771
know the contact information first. From this we can see the importance of the P2P crawler.
The crawler in [11] is similar to the single crawler described in this paper in that it can perform
some lightweight crawling. We will compare the algorithm used in [11] with ours.
Ref. [12] developed a crawler framework and applied the crawler to the eMule Kad overlay network,
proposing several improvements to Kad. In this paper, we focus on the crawlers themselves and on
ways of crawling the overlay network quickly. Our approach crawls iteratively, so the result
is not restricted to a single peer's routing table as in [12].
Ref. [13] compared structured and unstructured overlay networks and showed that geographic
location has little influence on the measurement results; this finding helps us reduce
the number of experiments.
3 The design and analysis of the crawler
This section proposes a basic crawler framework that makes it possible to crawl a known peer's
routing table multiple times. It is the foundation for optimizing the crawl: by checking the ratio
between the results of a single crawl and those of the whole crawl, we can drop crawling
results of little value. The algorithm provided in this framework can also exclude bogus information
from malicious peers while maintaining the crawling speed.
Each peer in the DHT has a routing table containing other peers' contact information. Peers use their
routing tables to locate the peer with a specified ID in the overlay network, and after the routing procedure
they can search for or distribute information with the peers in the result set. The
routing procedure is thus the foundation of the overlay network's functionality. The crawlers send
ordinary find-peer requests to peers they already know, as if the crawlers were trying to route
to a peer with a specified ID. To gather as much information as possible, we may need to send
find-peer requests to every known peer more than once. In each crawling action, we choose an ID and
then send a find-peer request to a known peer, asking it for peers whose IDs are close to the chosen
ID. To gather most of a known peer's routing table, we need to choose a
set of request IDs that matches the structure of the routing table.
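The iterative crawling loop described above can be sketched as follows. The helper names `send_find_node` (issue one find-peer request and return the contacts it yields) and `query_ids_for` (produce the set of request IDs for one peer) are hypothetical placeholders for the network layer, not names from the paper:

```python
from collections import deque

def crawl(bootstrap_peers, query_ids_for, send_find_node):
    """Iteratively crawl a DHT overlay (sketch; network I/O abstracted away).

    query_ids_for(peer)          -> iterable of target IDs for that peer
    send_find_node(peer, target) -> list of contacts returned by the peer
    """
    known = set(bootstrap_peers)      # every contact discovered so far
    pending = deque(bootstrap_peers)  # peers not yet queried
    while pending:
        peer = pending.popleft()
        for target in query_ids_for(peer):
            for contact in send_find_node(peer, target):
                if contact not in known:
                    known.add(contact)
                    pending.append(contact)  # newly learned peers are crawled too
    return known
```

Because every newly learned contact is queued for querying in turn, the crawl is not confined to the bootstrap peers' routing tables, matching the iterative behavior described above.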
3.1 Basic design of crawlers for Kademlia
In the Kademlia DHT, the logical distance between two peers is the XOR of their logical IDs, and
Kademlia stores the routing table as a binary tree. Every leaf node in the binary
tree is called a bucket, which stores a small set of peers' information. The internal nodes of the
tree contain no peer information, and each leaf node can hold only information of
a limited number of peers. When the routing table is initialized, it contains only a root node,
which is itself a leaf. As new peers keep joining the routing table, this
sole root node splits once the number of peers exceeds the limit. After the split, there are a
new left child leaf node and a new right child leaf node, and the peers in the former root node are
redistributed into these two new leaves: peers whose IDs have a prefix of 0 go into the
left child node, and the others into the right child node. As the height of the binary tree
increases, the leaf nodes contain peer IDs with longer common prefixes. So every bucket covers only
a limited ID range, and only the bucket whose prefix matches the host's own ID
will split further; all other buckets simply ignore new peers once they would need to split to hold
them. When a peer receives a request to search for an ID, it finds the proper bucket in the
binary tree of its routing table by examining the logical distance to the desired ID, and chooses a set of
peers from that bucket as the result.
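The splitting rule above can be illustrated with a toy routing table. This is a simplified sketch, not the paper's implementation: it uses 8-bit IDs and a bucket capacity of 2, whereas real Kademlia uses 160-bit IDs and a larger bucket size:

```python
K = 2          # bucket capacity (toy value; real Kademlia uses a larger k)
ID_BITS = 8    # toy ID length; Kademlia IDs are 160 bits

class Node:
    """A node of the routing-table binary tree; leaves act as buckets."""
    def __init__(self, prefix=0, depth=0):
        self.prefix, self.depth = prefix, depth  # bits shared by this subtree
        self.peers = []                          # contacts; meaningful at leaves only
        self.left = self.right = None

def bit(i, d):
    """The d-th most significant bit of an ID."""
    return (i >> (ID_BITS - 1 - d)) & 1

def covers(node, some_id):
    """Does some_id fall into this node's ID range?"""
    return (some_id >> (ID_BITS - node.depth)) == node.prefix

def insert(node, peer_id, host_id):
    while node.left is not None:                 # descend to the leaf bucket
        node = node.right if bit(peer_id, node.depth) else node.left
    if peer_id in node.peers:
        return
    if len(node.peers) < K:
        node.peers.append(peer_id)
    elif covers(node, host_id):                  # only the bucket near the host splits
        node.left = Node(node.prefix << 1, node.depth + 1)         # prefix bit 0
        node.right = Node((node.prefix << 1) | 1, node.depth + 1)  # prefix bit 1
        for p in node.peers:                     # redistribute old contacts
            child = node.right if bit(p, node.depth) else node.left
            child.peers.append(p)
        node.peers = []
        insert(node, peer_id, host_id)           # retry; may split again
    # else: a full bucket far from the host ignores the new peer
```

Note that a full bucket whose range does not cover the host's own ID silently drops new peers, which is exactly why a crawler cannot recover a peer's neighborhood from a single request.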
We can now apply this to the design of the crawlers. If the crawler chooses an ID at a proper
logical distance from the peer being queried, we can expect the peers' information in the corresponding
bucket to be returned. Likewise, a suitable set of IDs will make the queried peer return most of the
information in every bucket. We can introduce a parameter n for the size of the set of IDs sent to a known peer.
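One way to construct such a set, sketched below under the same toy 8-bit ID assumption as above, is to flip successive prefix bits of the queried peer's ID: each flipped bit yields a target whose XOR distance falls in a different bucket's range. This is an illustrative construction, not necessarily the exact ID-selection scheme of the paper:

```python
ID_BITS = 8   # toy ID length; real Kademlia IDs are 160 bits

def query_ids(peer_id, n):
    """Build up to n target IDs, one per bucket of the queried peer.

    Flipping the i-th most significant bit of peer_id gives an ID whose XOR
    distance to peer_id lies in [2^(ID_BITS-1-i), 2^(ID_BITS-i)), so each
    request addresses a different bucket of a fully split routing table.
    """
    return [peer_id ^ (1 << (ID_BITS - 1 - i)) for i in range(min(n, ID_BITS))]
```

Sending these n targets to one peer probes its n closest-to-the-root buckets, which is the role the parameter n plays above.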