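Code for scraping the Douban Movie Top 250 chart: title, release year, rating, number of ratings, and related fields (data collection and preprocessing)
Time: 2024-10-10 09:06:10 Views: 53
Scraping the Douban Movie Top 250 chart means fetching the page source with an HTTP request and then parsing the HTML to extract the required fields. The basic Python example in this answer uses the requests library to send the HTTP request and the BeautifulSoup library to parse the HTML.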
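The fields extracted from the HTML arrive as raw strings, so the preprocessing half of the task is largely a matter of converting them to typed values. The following is a minimal sketch using only the standard library. It assumes the year appears inside a text fragment such as `1994 / 美国 / 犯罪 剧情` and the vote count inside a string such as `3012345人评价`; both formats and the helper names `parse_year`, `parse_vote_count`, and `parse_rating` are illustrative assumptions, not part of any library.

```python
import re
from typing import Optional


def parse_year(text: str) -> Optional[int]:
    """Return the first standalone four-digit number in text, or None."""
    match = re.search(r"(?<!\d)(\d{4})(?!\d)", text)
    return int(match.group(1)) if match else None


def parse_vote_count(text: str) -> Optional[int]:
    """Return the first integer in text, ignoring thousands separators."""
    match = re.search(r"\d+", text.replace(",", ""))
    return int(match.group()) if match else None


def parse_rating(text: str) -> Optional[float]:
    """Return the rating as a float, or None if the text is not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None
```

Helpers like these can be applied to the strings the scraper extracts before the rows are loaded into a DataFrame, so that missing or malformed values become `None` and can be dropped in one step.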
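```python
import re
import time

import requests
from bs4 import BeautifulSoup
import pandas as pd

BASE_URL = "https://movie.douban.com/top250"
# Douban tends to reject requests that lack a browser-like User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

data = []
# The chart is paginated, 25 films per page: ?start=0, 25, ..., 225.
for start in range(0, 250, 25):
    response = requests.get(BASE_URL, params={"start": start},
                            headers=HEADERS, timeout=10)
    # Check whether the request succeeded
    if response.status_code != 200:
        print(f"Request failed, status code: {response.status_code}")
        break
    # Parse the HTML content
    soup = BeautifulSoup(response.text, "lxml")
    # Each film sits in a <div class="item">; the page does not use a <table>.
    # These selectors assume the page layout at the time of writing,
    # so verify them against the live HTML.
    for item in soup.find_all("div", class_="item"):
        title_tag = item.find("span", class_="title")        # title
        rating_tag = item.find("span", class_="rating_num")  # rating
        info_tag = item.select_one("div.bd p")               # holds the release year
        star_tag = item.find("div", class_="star")           # holds the number of ratings
        info_text = info_tag.get_text() if info_tag else ""
        star_text = star_tag.get_text() if star_tag else ""
        year_match = re.search(r"\d{4}", info_text)
        votes_match = re.search(r"(\d+)人评价", star_text.replace(",", ""))
        data.append({
            "title": title_tag.get_text(strip=True) if title_tag else None,
            "year": int(year_match.group()) if year_match else None,
            "rating": float(rating_tag.get_text(strip=True)) if rating_tag else None,
            "vote_count": int(votes_match.group(1)) if votes_match else None,
        })
    time.sleep(1)  # pause between pages to avoid hammering the server

# Convert the data to a pandas DataFrame and drop rows with missing values
df = pd.DataFrame(data)
df = df.dropna()
# Print or save the data
print(df)
```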