Scraping Zhihu Topic Comment Data
Date: 2023-12-26 17:05:54
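To scrape comment data for a Zhihu topic, you can follow these steps:
1. Create a crawler project with the Scrapy framework.
2. Define a Spider in the project that crawls every question under the Zhihu topic.
3. Define a callback in the Spider that parses each question page and extracts the comment data.
4. Use a tool such as XPath or BeautifulSoup to extract the comment data from the page.
5. Store the extracted comment data in a MongoDB database.
Below is a simple example for scraping Zhihu topic comment data. It requests each question's data through Zhihu's JSON API and yields the results as items; storing them in MongoDB is not shown: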
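Step 5 could be handled by a Scrapy item pipeline. The following is a minimal sketch, not a definitive implementation: it assumes the `pymongo` driver is installed, and the setting names `MONGO_URI` and `MONGO_DATABASE`, the default values, and the collection name `zhihu_answers` are all hypothetical choices made for illustration.

```python
class MongoPipeline:
    """Sketch of an item pipeline that writes each scraped item to MongoDB."""

    def __init__(self, mongo_uri, mongo_db, collection_name="zhihu_answers"):
        # All three values are illustrative assumptions, not from the original post.
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection details from the project's settings.py
        # (MONGO_URI and MONGO_DATABASE are hypothetical custom settings).
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "zhihu"),
        )

    def open_spider(self, spider):
        # Imported here so the module can be loaded without the driver installed.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Store a plain-dict copy of the item and pass the item on unchanged.
        self.collection.insert_one(dict(item))
        return item
```

To enable a pipeline like this, it would be registered in the project's `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.MongoPipeline": 300}`, where the module path `myproject.pipelines` is a placeholder for your own project.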
```python
import json

from scrapy import Request
from scrapy.spiders import Spider


class ZhihuSpider(Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/topic/19552832/hot']

    def parse(self, response):
        # Parse the topic page and collect the link to every question
        question_links = response.xpath('//a[@class="question_link"]/@href')
        for link in question_links:
            # hrefs may be relative, so resolve them against the page URL
            yield Request(response.urljoin(link.get()), callback=self.parse_question)

    def parse_question(self, response):
        # Take the question ID from the last path segment of the URL
        question_id = response.url.rstrip('/').split('/')[-1]
        api_url = f'https://www.zhihu.com/api/v4/questions/{question_id}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Clabel_info%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&limit=20&offset=0&platform=desktop&sort_by=default'
        yield Request(api_url, callback=self.parse_comments)

    def parse_comments(self, response):
        # Parse the API's JSON response. Note that this endpoint is the
        # question's "answers" endpoint, and only the first page (limit=20) is read.
        comments = json.loads(response.body)['data']
        for comment in comments:
            yield {
                'author': comment['author']['name'],
                'content': comment['content'],
                'voteup_count': comment['voteup_count'],
            }
```
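The spider above requests only `offset=0` with `limit=20`, so it reads at most one page per question. The sketch below shows one way the request URL might be built for later pages, using only the standard library. The helper names `extract_question_id` and `build_answers_url` are hypothetical, the `/question/<id>` URL shape is an assumption about how question links look, and the shortened `include` default is an untested assumption that the API accepts a subset of the fields listed in the original URL.

```python
import re
from urllib.parse import urlencode

# Endpoint taken from the example spider; everything else here is illustrative.
API_TEMPLATE = "https://www.zhihu.com/api/v4/questions/{question_id}/answers"


def extract_question_id(url):
    """Return the numeric ID that follows /question/ in a URL, or None."""
    match = re.search(r"/question/(\d+)", url)
    return match.group(1) if match else None


def build_answers_url(question_id, offset=0, limit=20,
                      include="data[*].content,voteup_count,comment_count"):
    """Build the answers API URL for one page of results."""
    query = urlencode({
        "include": include,
        "limit": limit,
        "offset": offset,
        "platform": "desktop",
        "sort_by": "default",
    })
    return f"{API_TEMPLATE.format(question_id=question_id)}?{query}"


if __name__ == "__main__":
    qid = extract_question_id("https://www.zhihu.com/question/12345")
    # Print the URLs for the first three pages of 20 results each.
    for page in range(3):
        print(build_answers_url(qid, offset=page * 20))
```

In the spider, the parsing callback could yield a further `Request` built this way with `offset` increased by `limit`, stopping once the returned `data` list is empty. Treat that stopping rule as an assumption to check against the actual API responses.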