Scraping BOSS Zhipin job postings with WebMagic
Date: 2023-09-04 09:06:46
1. Add the webmagic and jsoup dependencies to the project's pom.xml
```xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
```
2. Create a class that implements the PageProcessor interface
```java
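import java.util.List;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class BossProcessor implements PageProcessor {

    // URL pattern of a job detail page; the dots are escaped so they match literally.
    private static final String DETAIL_URL = "https://www\\.zhipin\\.com/job_detail/.*";

    // Retry a failed request up to 3 times and sleep 1000 ms between requests.
    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        if (page.getUrl().regex(DETAIL_URL).match()) {
            // Job detail page: extract the title, salary, and company name.
            page.putField("title", page.getHtml().xpath("//h1/text()").toString());
            page.putField("salary", page.getHtml().xpath("//span[@class='salary']/text()").toString());
            page.putField("company", page.getHtml().xpath("//div[@class='job-detail']/div[@class='detail-company']/div[@class='company-text']/h3/a/text()").toString());
        } else {
            // Job list page: collect the links to detail pages and queue them for crawling.
            List<String> urls = page.getHtml().links().regex(DETAIL_URL).all();
            page.addTargetRequests(urls);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}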
```
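The branch in `process` depends entirely on whether the page URL matches the job detail pattern. The following is a minimal standalone sketch of that routing rule using only `java.util.regex`, so it can be tried without WebMagic or network access. The class name `UrlRouter` and the sample detail URL are hypothetical and only serve as illustration.

```java
import java.util.regex.Pattern;

/** Standalone sketch of the URL routing rule; it does not depend on WebMagic. */
class UrlRouter {

    // Job detail pattern with the dots escaped so they match literally.
    static final Pattern DETAIL_PAGE =
            Pattern.compile("https://www\\.zhipin\\.com/job_detail/.*");

    /** Returns true when the whole URL matches the job detail pattern. */
    static boolean isDetailPage(String url) {
        return DETAIL_PAGE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            // Entry URL used by the spider (a list page).
            "https://www.zhipin.com/c101010100/?query=java&page=1&ka=page-1",
            // Made-up detail URL, for illustration only.
            "https://www.zhipin.com/job_detail/example.html"
        };
        for (String url : samples) {
            System.out.println((isDetailPage(url) ? "detail: " : "list:   ") + url);
        }
    }
}
```

Any URL that does not match is treated as a list page, which is why the entry URL falls into the `else` branch and only contributes links.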
3. Create a spider class and configure its entry URL; the crawl rules come from BossProcessor
```java
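import us.codecraft.webmagic.Spider;

public class BossSpider {
    public static void main(String[] args) {
        Spider.create(new BossProcessor())
                // Entry point: page 1 of the "java" search results for city code c101010100.
                .addUrl("https://www.zhipin.com/c101010100/?query=java&page=1&ka=page-1")
                // Run the crawl in the current thread.
                .run();
    }
}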
```
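The entry URL above covers only the first page of results. If more list pages are wanted as seeds, one option is to generate them from the same URL pattern. The sketch below assumes that later pages differ only in the `page` and `ka=page-N` parameters, which the source does not confirm; `SeedUrls` and `listPages` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds list-page URLs following the entry URL's pattern. */
class SeedUrls {

    /**
     * Builds URLs for pages 1..pages.
     * Assumptions: only the page and ka parameters change between pages,
     * and query is already URL-safe (no encoding is applied here).
     */
    static List<String> listPages(String cityCode, String query, int pages) {
        List<String> urls = new ArrayList<>();
        for (int p = 1; p <= pages; p++) {
            urls.add("https://www.zhipin.com/" + cityCode + "/?query=" + query
                    + "&page=" + p + "&ka=page-" + p);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Print the first three list-page URLs for the same city code and query.
        for (String url : listPages("c101010100", "java", 3)) {
            System.out.println(url);
        }
    }
}
```

Since `addUrl` accepts a variable number of strings, the result could be passed to the spider as `addUrl(urls.toArray(new String[0]))`.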
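4. Run the program to scrape job posting data from BOSS Zhipin. Because no Pipeline is configured, WebMagic prints the extracted fields to the console.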