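Write a complete Python program that: 1. scrapes the Douban TOP250 movie data; 2. finds the 10 countries with the most films (shown as a bar chart); 3. finds the 10 years with the most films (shown as a bar chart)
Time: 2024-09-10 19:20:12 Views: 57
To complete this task you need Python plus three third-party libraries: `requests` for HTTP requests, `BeautifulSoup` for parsing the HTML pages, and `matplotlib` for drawing the bar charts. Here is a simple example: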
```python
import re
import time
from collections import Counter

import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

# Country names scraped from Douban are in Chinese, so matplotlib needs a CJK font.
# 'SimHei' is an assumption; substitute any CJK-capable font installed locally.
plt.rcParams['font.sans-serif'] = ['SimHei']

BASE_URL = 'https://movie.douban.com/top250'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}


def parse_movie(item):
    """Extract title, year, and country from one <div class="item"> element.

    Assumes the list-page layout in which the last line of the first <p> under
    <div class="bd"> reads "year / country / genre". Returns None otherwise.
    """
    title_tag = item.find('span', class_='title')
    body = item.find('div', class_='bd')
    info = body.find('p') if body else None
    if title_tag is None or info is None:
        return None
    lines = [line.strip() for line in info.get_text().splitlines() if line.strip()]
    if not lines:
        return None
    parts = [part.strip() for part in lines[-1].split('/')]
    if len(parts) < 3:
        return None
    year_match = re.search(r'\d{4}', parts[0])
    countries = parts[-2].split()
    if year_match is None or not countries:
        return None
    return {
        'title': title_tag.get_text(strip=True),
        'year': year_match.group(),
        # Co-productions list several countries; count the first one listed.
        'country': countries[0],
    }


# Scrape the Douban TOP250 movie data
def fetch_douban_top250():
    all_movies = []
    # The list is paginated 25 films per page through the `start` query parameter.
    for start in range(0, 250, 25):
        params = {'start': start, 'filter': ''}
        response = requests.get(BASE_URL, headers=HEADERS, params=params, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        for item in soup.find_all('div', class_='item'):
            movie = parse_movie(item)
            if movie is not None:
                all_movies.append(movie)
        time.sleep(1)  # pause briefly between pages
    return all_movies


# Draw a bar chart of the 10 most common values
def plot_top10(values, xlabel, title):
    top10 = Counter(values).most_common(10)
    labels = [label for label, _ in top10]
    counts = [count for _, count in top10]
    plt.figure()
    plt.bar(labels, counts)
    plt.xlabel(xlabel)
    plt.ylabel('Number of films')
    plt.title(title)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()


# Count films per country and plot
def plot_countries(all_movies):
    countries = [movie['country'] for movie in all_movies]
    plot_top10(countries, 'Country', 'Top 10 countries by number of films')


# Count films per year and plot
def plot_years(all_movies):
    years = [movie['year'] for movie in all_movies]
    plot_top10(years, 'Year', 'Top 10 years by number of films')


# Main function
def main():
    all_movies = fetch_douban_top250()
    plot_countries(all_movies)
    plot_years(all_movies)


if __name__ == '__main__':
    main()
```
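The parsing and counting steps can be tried offline, without sending any request to the site. The following is a minimal sketch, assuming each entry's info line has the form "year / country / genre" with non-breaking spaces around the slashes. The helper `parse_info_line` and the sample lines are hypothetical, made up for illustration rather than scraped.

```python
import re
from collections import Counter


def parse_info_line(line):
    """Return (year, country) from a 'year / country / genre' line, or None."""
    parts = [part.strip() for part in line.split('/')]
    if len(parts) < 3:
        return None
    # The first field may carry extra text, so keep only the first 4-digit run.
    year_match = re.search(r'\d{4}', parts[0])
    # The country field is second to last; co-productions are space separated.
    countries = parts[-2].split()
    if year_match is None or not countries:
        return None
    return year_match.group(), countries[0]


# Illustrative sample lines (assumed format), not scraped data.
sample_lines = [
    '1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情',
    '1994\xa0/\xa0美国 法国\xa0/\xa0剧情 爱情',
    '1993\xa0/\xa0中国大陆 中国香港\xa0/\xa0剧情',
    '1961(中国大陆) / 1964(中国大陆) / 中国大陆 / 动画',
    'malformed line',
]

parsed = [result for result in map(parse_info_line, sample_lines) if result]
year_counts = Counter(year for year, _ in parsed)
country_counts = Counter(country for _, country in parsed)

top_years = year_counts.most_common(10)
top_countries = country_counts.most_common(10)
print(top_years)
print(top_countries)
```

`Counter.most_common(10)` returns the pairs already sorted by count, so the labels and heights for `plt.bar` can be read straight from the result.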
Before running the code above, make sure the required libraries are installed. You can install them with pip:
```bash
pip install requests beautifulsoup4 matplotlib
```
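The code above scrapes basic information for the Douban TOP250 movies, counts the 10 countries and the 10 years with the most films, and shows each result as a bar chart. Note that the target site's anti-scraping measures may affect the crawler. Comply with the site's robots protocol and use the script only within legal limits.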
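One way to check a site's robots rules before crawling is the standard library's `urllib.robotparser`. The sketch below runs against a hypothetical robots.txt string so that it works offline; the rules, the `my-crawler` user agent, the `example.com` URLs, and the helper `is_allowed` are all assumptions for illustration and do not reflect any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; not the real file of any site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Read the crawl delay (in seconds) declared for this user agent, if any.
delay_parser = RobotFileParser()
delay_parser.parse(SAMPLE_ROBOTS_TXT.splitlines())
crawl_delay = delay_parser.crawl_delay('my-crawler')

print(is_allowed(SAMPLE_ROBOTS_TXT, 'my-crawler', 'https://example.com/top250'))
print(is_allowed(SAMPLE_ROBOTS_TXT, 'my-crawler', 'https://example.com/private/data'))
print(crawl_delay)
```

Against a live site, the same parser can load the real file with `set_url(...)` followed by `read()` in place of `parse(...)`, and the reported crawl delay can be used as the pause between requests.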