Write a Java program using the WebCollector framework that crawls the main content of https://www.xiangha.com/caipu/95247360.html
Below is a program written in Java with the WebCollector framework that crawls the main content of https://www.xiangha.com/caipu/95247360.html:
```java
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import org.jsoup.nodes.Document;

public class XianghaCrawler extends BreadthCrawler {

    public XianghaCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        // Seed URL: the recipe page to crawl
        addSeed("https://www.xiangha.com/caipu/95247360.html");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        // Only process pages that were fetched successfully
        if (page.code() == 200) {
            // With autoParse enabled, page.doc() returns the Jsoup-parsed Document
            Document doc = page.doc();
            // These CSS selectors are based on the page's current structure and
            // may need adjusting if the site layout changes
            String title = doc.select("h1.title").text();
            String ingredients = doc.select("div.ings").text();
            String steps = doc.select("div.steps > ol > li").text();
            System.out.println(page.url() + "\n" + title + "\n" + ingredients + "\n" + steps);
        }
    }

    public static void main(String[] args) throws Exception {
        String crawlPath = "xiangha_crawler";  // directory for crawl state (Berkeley DB)
        XianghaCrawler xianghaCrawler = new XianghaCrawler(crawlPath, true);
        xianghaCrawler.setThreads(2);          // number of crawler threads
        xianghaCrawler.start(2);               // crawl depth
    }
}
```
The program uses the WebCollector crawler framework to fetch the target page. In the visit method, the HTML document is parsed with Jsoup, the main content (title, ingredients, and steps) is extracted, and the result is printed to the console; you can redirect the output to a file if needed. Note that Java 1.8 or later is required, and the WebCollector and Jsoup dependencies must be added to the project.
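As an example of saving to a file instead, the println call in visit could be replaced by something along these lines (a minimal sketch; the output path recipe.txt is just an illustrative name):
```java
// Inside visit(Page page, CrawlDatums next), replacing System.out.println:
String content = title + "\n" + ingredients + "\n" + steps + "\n";
try {
    // Append each crawled page's content to recipe.txt (hypothetical path)
    java.nio.file.Files.write(
            java.nio.file.Paths.get("recipe.txt"),
            content.getBytes(java.nio.charset.StandardCharsets.UTF_8),
            java.nio.file.StandardOpenOption.CREATE,
            java.nio.file.StandardOpenOption.APPEND);
} catch (java.io.IOException e) {
    e.printStackTrace();
}
```
For the dependencies, WebCollector is published on Maven Central as cn.edu.hfut.dmic.webcollector:WebCollector (e.g. version 2.73-alpha) and Jsoup as org.jsoup:jsoup.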