Scraping Douban Comments with a Python Crawler

Date: 2023-07-14 22:56:51 Views: 82

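Scraping Douban comments generally breaks down into the following steps:

1. Send a request to fetch the page source

You can use Python's requests library to send a GET request and retrieve the page source:

```
import requests

url = 'https://movie.douban.com/subject/1292052/comments?start=0&limit=20&sort=new_score&status=P'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
html = response.text
```

2. Parse the page source

You can use Python's BeautifulSoup library to parse the page source:

```
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# Collect every <span class="short"> element on the page
comments = soup.find_all('span', class_='short')
```

3. Extract the data

Once the page is parsed, you can pull out the data you need from specific HTML tags, for example the comment text:

```
for comment in comments:
    content = comment.get_text().strip()
    print(content)
```

4. Scrape multiple pages

To scrape more than one page of comments, change the start parameter to step through the pages. The loop below requests start = 0, 20, 40, 60 and 80, that is, five pages of 20 comments each:

```
import time

for i in range(0, 100, 20):
    url = 'https://movie.douban.com/subject/1292052/comments?start={}&limit=20&sort=new_score&status=P'.format(i)
    response = requests.get(url, headers=headers, timeout=10)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    comments = soup.find_all('span', class_='short')
    for comment in comments:
        content = comment.get_text().strip()
        print(content)
    time.sleep(2)  # pause between pages to avoid sending requests too frequently
```

Those are the basic steps for scraping Douban comments with a Python crawler. When crawling, follow the rules in the site's robots.txt and do not send requests too frequently, so that you avoid putting unnecessary load on the site.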
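The robots.txt advice can be made concrete with the standard library alone. The following is a minimal sketch, not a definitive implementation: the helper names `build_page_urls` and `allowed_urls` are hypothetical, and `SAMPLE_ROBOTS` is a made-up rule set for illustration, not Douban's actual robots.txt. It builds the paginated comment URLs from the `start` parameter and keeps only those that a given set of robots rules permits.

```python
from urllib.parse import urlencode
from urllib.robotparser import RobotFileParser

BASE_URL = "https://movie.douban.com/subject/1292052/comments"

# Hypothetical rules for illustration only; NOT Douban's real robots.txt.
SAMPLE_ROBOTS = ["User-agent: *", "Disallow: /private/"]


def build_page_urls(total, page_size=20):
    """Return comment-page URLs for start = 0, page_size, 2 * page_size, ..."""
    urls = []
    for start in range(0, total, page_size):
        query = urlencode({
            "start": start,
            "limit": page_size,
            "sort": "new_score",
            "status": "P",
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls


def allowed_urls(urls, robots_lines, user_agent="*"):
    """Keep only the URLs that the given robots.txt lines allow for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from memory, no network access
    return [u for u in urls if parser.can_fetch(user_agent, u)]


if __name__ == "__main__":
    for page_url in allowed_urls(build_page_urls(100), SAMPLE_ROBOTS):
        print(page_url)
```

In a real crawler you would presumably load the site's live rules with `RobotFileParser.set_url(...)` followed by `read()` instead of a hard-coded list, and sleep between requests to the URLs that pass the check.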
