```python
import requests
from bs4 import BeautifulSoup

url = "https://movie.douban.com/subject/30391186/comments?sort=new_score&status=P"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

comments = []
for comment in soup.select(".comment-item"):
    content = comment.select_one(".comment-content").get_text().strip()  # comment text
    author = comment.select_one(".comment-info a").get_text().strip()    # author
    likes = comment.select_one(".votes").get_text().strip()              # upvote count
    time = comment.select_one(".comment-time").get_text().strip()        # timestamp
    location = comment.select_one(".comment-info").contents[-1].strip()  # location
    comments.append({
        "content": content,
        "author": author,
        "likes": likes,
        "time": time,
        "location": location
    })
print(comments)
```
How do I set request headers for this?
Sure. You can set request headers by passing a `headers` dict to the request, adding fields such as "User-Agent":
```python
import requests
from bs4 import BeautifulSoup, Tag

url = "https://movie.douban.com/subject/30391186/comments?sort=new_score&status=P"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

comments = []
for comment in soup.select(".comment-item"):
    content = comment.select_one(".comment-content").get_text().strip()  # comment text
    author = comment.select_one(".comment-info a").get_text().strip()    # author
    likes = comment.select_one(".votes").get_text().strip()              # upvote count
    time = comment.select_one(".comment-time").get_text().strip()        # timestamp
    # the location comes from the last child of .comment-info, which may be
    # a bare text node or a nested tag, so handle both cases
    location_node = comment.select_one(".comment-info").contents[-1]
    location = (location_node.get_text() if isinstance(location_node, Tag)
                else str(location_node)).strip()
    comments.append({
        "content": content,
        "author": author,
        "likes": likes,
        "time": time,
        "location": location
    })
print(comments)
```
In this example, we add a "User-Agent" header whose value mimics a Chrome browser's User-Agent string. This makes the request look like it comes from a regular browser, which helps get past basic anti-scraping checks.
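If you plan to fetch more than one page, it is convenient to set the headers once on a `requests.Session` so every request carries them. Below is a minimal sketch of that pattern; the `start`/`limit` pagination parameters and the 20-comments-per-page size are assumptions about Douban's comment listing URL, and the delay value is arbitrary:

```python
import time

import requests
from bs4 import BeautifulSoup

BASE = "https://movie.douban.com/subject/30391186/comments"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.3"
}

# a Session applies the headers to every request and reuses the connection
session = requests.Session()
session.headers.update(headers)

all_comments = []
for start in range(0, 60, 20):  # assumed: pages of 20 comments via the `start` parameter
    params = {"start": start, "limit": 20, "sort": "new_score", "status": "P"}
    response = session.get(BASE, params=params, timeout=10)
    response.raise_for_status()  # stop early on 403/429 etc.
    soup = BeautifulSoup(response.text, "html.parser")
    for comment in soup.select(".comment-item"):
        all_comments.append(comment.select_one(".comment-content").get_text().strip())
    time.sleep(2)  # be polite between page requests

print(len(all_comments), "comments fetched")
```

Setting the headers on the session also means any later request made through it (for example, to a detail page) carries the same User-Agent without repeating the dict.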