```python
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = "https://movie.douban.com/subject/30391186/comments?sort=new_score&status=P"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
comments = []
for comment in soup.select(".comment-item"):
    content = comment.select_one(".comment-content").get_text().strip()
    author = comment.select_one(".comment-info a").get_text().strip()
    likes = comment.select_one(".votes").get_text().strip()
    time = comment.select_one(".comment-time").get_text().strip()
    location = comment.select_one(".comment-location").contents[-1].strip()
    comments.append({
        "content": content,
        "author": author,
        "likes": likes,
        "time": time,
        "location": location
    })
pprint(comments)
```
I want to scrape the first five pages of data.
Time: 2024-04-03 16:36:13  Views: 18
Sure, you can use the following code to scrape the first five pages:
```python
import requests
from bs4 import BeautifulSoup
from pprint import pprint

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}

# Create the list once, outside the loop, so results from earlier
# pages are kept rather than discarded on each iteration.
comments = []
for page in range(5):
    # Douban paginates with a `start` offset: 0, 20, 40, 60, 80 for the first five pages.
    url = "https://movie.douban.com/subject/30391186/comments?start={}&limit=20&sort=new_score&status=P".format(page * 20)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    for comment in soup.select(".comment-item"):
        content = comment.select_one(".comment-content").get_text().strip()
        author = comment.select_one(".comment-info a").get_text().strip()
        likes = comment.select_one(".votes").get_text().strip()
        # Named comment_time to avoid shadowing the stdlib `time` module.
        comment_time = comment.select_one(".comment-time").get_text().strip()
        location = comment.select_one(".comment-location").contents[-1].strip()
        comments.append({
            "content": content,
            "author": author,
            "likes": likes,
            "time": comment_time,
            "location": location
        })

pprint(comments)  # print the comments collected from all five pages
```
In this code, we use a loop to scrape the first five pages. On each iteration we build a new URL, passing the page offset (`page * 20`) as the `start` query parameter, then send a GET request with the requests library. We then parse the HTML response with BeautifulSoup, store each comment's fields in a dict, and finally print the collected results with pprint.
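One caveat: `select_one()` returns `None` when a selector matches nothing (for example, some comments may lack a `.comment-location` node), and calling `.get_text()` on `None` raises `AttributeError`. A minimal, hypothetical `safe_text` helper (not part of the original code) can guard against that:

```python
# A None-safe extraction helper for BeautifulSoup lookups.
# select_one() returns None when a selector matches nothing, so
# chaining .get_text() directly can crash on missing nodes.
def safe_text(node, default=""):
    """Return the stripped text of a node, or `default` if the node is missing."""
    return node.get_text().strip() if node is not None else default
```

With this helper, a field lookup becomes e.g. `likes = safe_text(comment.select_one(".votes"))`, which simply yields an empty string for comments missing that node. Adding a short `time.sleep(1)` between page requests is also advisable, since Douban may rate-limit rapid scraping.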