Scraping Douban Movie Comments with a Python Crawler

Date: 2023-11-18 15:02:54

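The steps for scraping Douban movie comments with a Python crawler are as follows:

1. Import the required libraries, including requests, BeautifulSoup, and re.
2. Use requests to send an HTTP request and fetch the HTML source of the Douban movie page.
3. Use BeautifulSoup to parse the HTML source and extract the movie comments.
4. Use a regular expression to clean the extracted comments, removing HTML tags and other unwanted content.
5. Store the cleaned comments in a local file or a database.

Below is a simple example that scrapes the first 10 pages of comments for a given movie:

```python
import re

import requests
from bs4 import BeautifulSoup

# Set a request header to mimic a browser visit
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Compile the tag-stripping pattern once, outside the loops
pattern = re.compile(r'<[^>]+>', re.S)

# Fetch the first 10 pages of comments for the given movie
movie_id = '1292052'  # ID of the movie "The Shawshank Redemption"
for i in range(10):
    # Each page holds 20 comments, so page i starts at offset i*20
    url = f'https://movie.douban.com/subject/{movie_id}/comments?start={i*20}&limit=20&status=P&sort=new_score'
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    comments = soup.find_all('span', class_='short')
    for comment in comments:
        # Use the regular expression to remove HTML tags and other unwanted content
        comment_text = pattern.sub('', str(comment)).strip()
        print(comment_text)
```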
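Step 5 mentions storing the comments, but the example above only prints them. The following is a minimal sketch of one way to write them to a local CSV file using only the standard library. The function name `save_comments`, the file name `comments.csv`, and the `page`/`comment` column names are assumptions made for illustration; they do not come from the original answer.

```python
import csv
from pathlib import Path


def save_comments(comments, path):
    """Write (page, text) pairs to a UTF-8 CSV file.

    Hypothetical helper: blank comments are skipped, and the number of
    rows actually written (excluding the header) is returned.
    """
    rows = [(page, text.strip()) for page, text in comments if text.strip()]
    # newline="" lets the csv module handle line endings and quoted newlines
    with Path(path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["page", "comment"])  # assumed column names
        writer.writerows(rows)
    return len(rows)
```

To use it with the scraping loop, one could collect `(i + 1, comment_text)` pairs into a list instead of printing them, then call `save_comments(collected, "comments.csv")` once after the loop finishes. Writing once at the end keeps the file handling separate from the network requests; a database such as SQLite would be a reasonable alternative if the comments need to be queried later.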