Board Forum Crawling: A Web Crawling Method for Web Forum
Yan Guo, Kui Li, Kai Zhang, Gang Zhang
Software Division, ICT, CAS
guoy@ict.ac.cn
Abstract
We present Board Forum Crawling, a new method for
crawling Web forums. The method exploits the
organizational characteristics of Web forum sites and
simulates how a human visits a Web forum. It
starts crawling from the homepage, then
enters each board of the site, and then crawls all the
posts of the site directly. Board Forum Crawling can
crawl most of the meaningful information of a Web
forum site efficiently and simply. We experimentally
evaluated the effectiveness of the method on real Web
forum sites by comparing it with traditional
breadth-first crawling. We also used the method in a
real project, in which 12000 Web forum sites have been
crawled successfully. These results show the
effectiveness of our method.
1. Introduction
Web crawlers (also called Web spiders or robots)
are programs used to download documents from the
Internet. Traditional breadth-first crawling, referred
to as TBFC in this paper, is widely used in all
kinds of cases. However, different types of Web
sites (e.g. news sites and forum sites) differ so
much in their organizational structures that
TBFC cannot crawl all of them efficiently. We believe
that different crawling methods should be developed to
fit different types of Web sites. In this work, we deal
with crawling Web forums.
Web forum sites have become valuable deposits of
information, and crawling Web forums has become
increasingly important. Generally,
for a Web forum site, the goal of crawling is to
download all posts in the site. Web crawling is a well-
studied research problem. Some issues (e.g. crawling
the Hidden Web [1] and user-centric crawling [2]) have
been active research topics. However, to the best of
our knowledge, there has been little research on
crawlers designed specifically for Web forums.
When a Web forum is crawled by TBFC, Spider
Traps and noisy links are the main obstacles to precise
and efficient crawling. These problems are mainly
caused by conflicts between the organizational
characteristics of Web forum sites and the
characteristics of TBFC.
TBFC works as follows: first it follows all
links in the homepage to download all pages linked from
the homepage, and then follows all links in those
downloaded pages to download all pages they link to,
and so on, until the downloaded pages link to no more
new pages in the site.
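The breadth-first procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the crawler used in the paper: the function name `breadth_first_crawl`, the `get_links` callback, the `max_pages` safety limit, and the in-memory `TOY_SITE` link graph are all assumptions introduced here so that the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable, List


def breadth_first_crawl(
    homepage: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 1000,  # hypothetical safety limit, not part of the paper
) -> List[str]:
    """Return URLs in the order a breadth-first crawler would download them."""
    seen = {homepage}          # duplicate check based on the URL string only
    queue = deque([homepage])  # FIFO queue gives level-by-level traversal
    downloaded: List[str] = []
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        downloaded.append(url)  # stands in for the real page download
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded


# Hypothetical toy forum: each key is a page, each value lists its out-links.
TOY_SITE = {
    "/": ["/board1", "/board2", "/print?p=1"],
    "/board1": ["/post?id=1", "/post?id=1&sid=abc"],
    "/board2": ["/post?id=2", "/"],
}


def toy_links(url: str) -> List[str]:
    """Look up the out-links of a page in the toy site."""
    return TOY_SITE.get(url, [])


order = breadth_first_crawl("/", toy_links)
print(order)
```

In this toy run the crawler visits the homepage, then every page linked from it, then the pages one level deeper. Because the duplicate check compares URL strings only, `/post?id=1` and `/post?id=1&sid=abc` are both downloaded even if a real server would return the same post for both.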
Most Web forum sites share the following
characteristics, which explain why Spider Traps and
noisy links are encountered:
(1) Most Web forum sites are designed as
dynamic sites. Most of the information contained in
forum sites is organized in databases. When
two requests for the same piece of content
in the database are forwarded to the Web server, the
server returns to the client two dynamic Web pages
that have the same content but different URLs.
Here we call these two dynamic pages
redundancy pages, and the links leading to redundancy
pages duplicate links. So when a forum
site is crawled by TBFC, many
redundancy pages are downloaded and many
duplicate links wait to be crawled. As a result, the
links to be crawled far outnumber the reasonable
links of the site, just as if the site had infinitely
many links to crawl, which is known as a Spider Trap.
Although the redundancy pages have the same content,
they have different URLs, so the crawler cannot
eliminate the duplicate links by URL checking and still
downloads all of them.
(2) A Web forum site contains many noisy links,
such as functional links that let users “print” a page,
and advertisement links. The pages linked by
noisy links contain almost no useful information. As a
result, crawling the noisy links not only wastes the
resources of TBFC, but also lowers the quality of the
downloaded pages.
(3) Links in a Web forum site are organized in
levels. To find a post in a board, we have to start from
the homepage, then enter a board, and then
find the post. Note that in a Web forum
site, the most useful information is contained in the
deep levels. Since TBFC starts crawling from the
Proceedings of the 2006 IEEE/WIC/ACM International Conference
on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06)
0-7695-2747-7/06 $20.00 © 2006