
Scraping all short comments from https://movie.douban.com/subject/3552028/

Time: 2023-10-10 21:15:33 Views: 93

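Here is one way to scrape Douban movie short comments in Python, using the requests and BeautifulSoup libraries.

First, send a request to the page and fetch its HTML source. A browser-like User-Agent header is added here as a precaution, since the site may reject the default one sent by requests:

```python
import requests

url = "https://movie.douban.com/subject/3552028/comments?start=0&limit=20&status=P&sort=new_score"
# Assumption: a generic browser-like User-Agent; the site may reject the requests default.
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
html = response.text
```

Next, parse the HTML with BeautifulSoup so that the information we need can be extracted:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# Each short comment sits in a <span class="short"> tag.
comments = soup.find_all("span", class_="short")
for c in comments:
    # get_text() still works when the span contains nested tags,
    # whereas .string returns None in that case.
    print(c.get_text(strip=True))
```

This code uses find_all to locate every span tag whose class is "short"; these tags hold the short comment text. It then iterates over the tags and reads the text of each one with get_text(). The string attribute also works for a span that contains only plain text, but it returns None as soon as the tag has more than one child.

Douban paginates short comments, showing at most 20 per page. To scrape all of them, set the start parameter in the URL and fetch the pages one by one. The complete code is as follows:

```python
import requests
from bs4 import BeautifulSoup

url = "https://movie.douban.com/subject/3552028/comments?start={}&limit=20&status=P&sort=new_score"
# Assumption: a generic browser-like User-Agent; the site may reject the requests default.
headers = {"User-Agent": "Mozilla/5.0"}
start = 0
while True:
    response = requests.get(url.format(start), headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    comments = soup.find_all("span", class_="short")
    # An empty page means there are no more comments to fetch.
    if len(comments) == 0:
        break
    for c in comments:
        print(c.get_text(strip=True))
    # Move on to the next page of 20 comments.
    start += 20
```

This code uses a while loop that increases start by 20 on each iteration and stops once a page returns zero short comments. Note that Douban requires you to be logged in to view the full set of short comments, so if you are not logged in you may not be able to retrieve all of them.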
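The paging loop described above can also be split into a parsing step and a fetching step, which makes it possible to check the logic against canned HTML before touching the real site. The following is only a minimal sketch under stated assumptions: the names `extract_comments`, `iter_comments`, `fetch_with_requests`, and the `max_pages` and `delay` parameters are hypothetical and not part of the original answer, and the page structure is assumed to match the `<span class="short">` markup used above.

```python
import time
from typing import Callable, Iterator, List

from bs4 import BeautifulSoup

BASE_URL = (
    "https://movie.douban.com/subject/3552028/comments"
    "?start={}&limit=20&status=P&sort=new_score"
)


def extract_comments(html: str) -> List[str]:
    """Return the text of every <span class="short"> found in one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text(strip=True) for span in soup.find_all("span", class_="short")]


def iter_comments(
    fetch: Callable[[str], str],
    page_size: int = 20,
    max_pages: int = 50,   # hypothetical safety cap so the loop always ends
    delay: float = 0.0,    # hypothetical pause between pages, in seconds
) -> Iterator[str]:
    """Yield comments page by page until a page is empty or max_pages is reached."""
    for page in range(max_pages):
        html = fetch(BASE_URL.format(page * page_size))
        comments = extract_comments(html)
        if not comments:
            break
        yield from comments
        time.sleep(delay)


def fetch_with_requests(url: str) -> str:
    """One possible fetcher; the User-Agent value is an assumption."""
    import requests  # imported here so the parsing code can be tested offline

    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    return response.text
```

To run it against the site, something like `for text in iter_comments(fetch_with_requests, delay=1.0): print(text)` should work. Passing the fetcher in as an argument is a design choice that lets the same loop run against saved HTML files or a stub function.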
