Code for Scraping Dianping Review Data
Date: 2023-08-28 15:08:05
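Dianping's anti-scraping measures are fairly strict, so a few techniques are needed to avoid being blocked. Below is an example of scraping Dianping reviews with Python: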
```python
import time

import requests
from bs4 import BeautifulSoup

USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
)


def get_comments(soup):
    """Extract reviewer name, star rating, and review text from a parsed review page."""
    comments = []
    comment_list = soup.find('div', class_='comment-list')
    if comment_list is None:
        return comments
    for comment in comment_list.find_all('div', class_='comment'):
        name = comment.find('div', class_='user-info').find('a').text.strip()
        # The star rating is the number of 'sml-str' spans inside the rating element.
        star = len(comment.find('span', class_='sml-rank-stars').find_all('span', class_='sml-str'))
        content = comment.find('div', class_='comment-txt').text.strip()
        comments.append({'name': name, 'star': star, 'content': content})
    return comments


if __name__ == '__main__':
    headers = {
        'User-Agent': USER_AGENT,
        'Cookie': 'xxxxxxx',  # replace with the Cookie value copied from your browser
    }

    # Scrape up to 10 pages of reviews.
    comments = []
    for i in range(1, 11):
        print('Scraping review page %d...' % i)
        url = 'https://www.dianping.com/shop/5343507/review_all/p%d' % i
        response = requests.get(url, headers=headers, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Stop when the page has no 'content' block, i.e. there is nothing left to parse.
        if not soup.find_all('div', class_='content'):
            break
        # Parse the page that was just fetched.
        comments += get_comments(soup)
        time.sleep(2)  # pause between pages

    # Save the reviews to a file.
    with open('comments.txt', 'w', encoding='utf-8') as f:
        for comment in comments:
            f.write('Name: %s, Rating: %d, Review: %s\n'
                    % (comment['name'], comment['star'], comment['content']))
```
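Note that the code above requires you to obtain the cookies manually and fill them into `headers`; otherwise the site will block the requests. To further reduce the risk of being blocked, keep a reasonable interval between requests; here the script pauses for 2 seconds after each page of reviews.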
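As a minimal sketch of those two points, the helpers below turn a Cookie string copied from the browser's developer tools into a dict, and compute a randomized pause so that requests are not sent at a perfectly fixed rhythm. The names `parse_cookie_string` and `polite_delay`, the sample cookie names and values, and the 2 to 4 second range are illustrative assumptions, not part of the original code.

```python
import random
import time


def parse_cookie_string(raw):
    """Convert 'k1=v1; k2=v2' (as copied from a browser) into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        if '=' not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split('=', 1)  # split once: values may contain '='
        cookies[name.strip()] = value.strip()
    return cookies


def polite_delay(base=2.0, jitter=2.0):
    """Return a pause length in seconds between base and base + jitter."""
    return base + random.uniform(0, jitter)


if __name__ == '__main__':
    # Hypothetical cookie names and values, for illustration only.
    cookies = parse_cookie_string('session_id=abc123; token=x=y; ')
    print(cookies)  # {'session_id': 'abc123', 'token': 'x=y'}
    # Short pause here only to keep the demo fast.
    time.sleep(polite_delay(base=0.1, jitter=0.1))
```

The resulting dict could be passed as `requests.get(url, headers=headers, cookies=cookies)` in place of a hand-written `Cookie` header, and `time.sleep(polite_delay())` could stand in for the fixed `time.sleep(2)`. Whether a randomized interval actually helps against this site's checks is not something the original states.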