Making a word cloud of Douban TOP250 directors with matplotlib
Time: 2023-10-30 07:05:26 Views: 161
Sure, here are the implementation steps:
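1. Decide which data to scrape
We need to scrape each film's information from the Douban Movie TOP250 list, including the director's name, the film title, and the rating. Only the director's name is used for the word cloud.
2. Scrape the data
Using the requests and BeautifulSoup libraries, we can fetch the film entries from the Douban Movie TOP250 list. The list is split across ten pages of 25 films each, selected by the `start` query parameter.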
```
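import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# A single request returns only 25 films, so step through the offsets 0, 25, ..., 225.
movies = []
for start in range(0, 250, 25):
    response = requests.get(url, headers=headers, params={'start': start}, timeout=10)
    response.raise_for_status()  # stop early if the site rejects the request
    soup = BeautifulSoup(response.text, 'html.parser')
    movies.extend(soup.find_all('div', class_='info'))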
```
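When requesting several pages in a row, it is usually worth pausing between requests. The following is a minimal sketch of that idea, not part of the original answer: `fetch_pages`, `fake_fetch`, and the `delay` parameter are hypothetical names, and the fetcher is passed in as a function so the loop can be tried without network access. It assumes the list is paged in steps of 25 through a `start` offset.

```python
import time


def fetch_pages(fetch, offsets, delay=1.0):
    """Call fetch(offset) for every offset, sleeping `delay` seconds between calls."""
    pages = []
    for i, offset in enumerate(offsets):
        if i > 0:
            time.sleep(delay)  # pause before every request except the first
        pages.append(fetch(offset))
    return pages


def fake_fetch(offset):
    """Stand-in for a real HTTP request; returns a dummy page body."""
    return f'<html>page starting at {offset}</html>'


# Offsets 0, 25, ..., 225 cover ten pages of 25 films each.
pages = fetch_pages(fake_fetch, range(0, 250, 25), delay=0)
print(len(pages))
```

In real use, `fetch` could be a small function that wraps `requests.get(url, headers=headers, params={'start': offset}, timeout=10)` and returns `response.text`.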
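3. Extract the director
We can use a regular expression or BeautifulSoup to extract the director's name from each film's entry. The director appears on the first line of the description paragraph, after the label `导演:`.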
```
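import re

directors = []
for movie in movies:
    # First line: "导演: <name>   主演: <cast>"; the second line holds year, country, and genre.
    first_line = movie.find('div', class_='bd').p.text.strip().split('\n')[0]
    # Capture the text after "导演:" up to the run of spaces before "主演" or a trailing "...".
    match = re.search(r'导演[::]\s*(.+?)(?:\s{2,}|主演|\.\.\.|$)', first_line)
    if match:
        directors.append(match.group(1).strip())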
```
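The extraction step can be tried offline on a few strings before running it against the live pages. This is a rough sketch under assumptions: the sample lines are hand-written to imitate the page's layout (label `导演:`, the name, a run of non-breaking spaces, then `主演:`), and `extract_director` and `DIRECTOR_RE` are hypothetical names introduced here.

```python
import re

# Hand-written samples imitating the first line of a film's description (assumed layout).
SAMPLES = [
    '导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: ...',
    '导演: 宫崎骏 Hayao Miyazaki...',
]

# Lazily capture the name until two or more whitespace characters, "主演", "...", or end of line.
DIRECTOR_RE = re.compile(r'导演[::]\s*(.+?)(?:\s{2,}|主演|\.\.\.|$)')


def extract_director(line):
    """Return the director field of a description line, or None if the label is absent."""
    match = DIRECTOR_RE.search(line)
    return match.group(1).strip() if match else None


for sample in SAMPLES:
    print(extract_director(sample))
```

Returning `None` for lines without the label makes it easy to skip entries whose layout differs from what the pattern expects.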
4. Count how often each director appears
Use the Counter class from Python's collections module to count how many times each director appears.
```
from collections import Counter
director_count = Counter(directors)
```
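To see what the counting step produces, here is a small illustration on a made-up list. The placeholder names and their counts are not real data from the site.

```python
from collections import Counter

# Made-up placeholder list; the real list comes from the extraction step.
sample_directors = ['Director A', 'Director B', 'Director A', 'Director C', 'Director A', 'Director B']

sample_count = Counter(sample_directors)

# most_common(n) returns the n most frequent items as (name, count) pairs, highest first.
top_two = sample_count.most_common(2)
print(top_two)
```

A `Counter` is a `dict` subclass, so it can be passed anywhere a mapping from word to frequency is expected.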
5. Generate the word cloud
Use Python's wordcloud library to generate the word cloud, and matplotlib to display it.
```
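from wordcloud import WordCloud
import matplotlib.pyplot as plt

# font_path must point to a font that contains Chinese glyphs; otherwise the names render as empty boxes.
wordcloud = WordCloud(font_path='simhei.ttf', background_color='white', max_words=200, width=1200, height=800).generate_from_frequencies(director_count)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # hide the axes so only the cloud is shown
plt.show()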
```
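A common stumbling block is that `simhei.ttf` is not present on every machine. One possible way to handle this is to probe a few likely locations and use the first that exists. The sketch below is only an illustration: `first_existing` is a hypothetical helper, and the candidate paths are guesses for Windows, macOS, and Linux that may not match a given system.

```python
import os

# Guessed locations of fonts with Chinese glyphs; adjust for the actual system.
CANDIDATE_FONTS = [
    'simhei.ttf',
    'C:/Windows/Fonts/simhei.ttf',
    '/System/Library/Fonts/PingFang.ttc',
    '/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc',
]


def first_existing(paths, exists=os.path.isfile):
    """Return the first path for which exists(path) is true, or None if there is none."""
    for path in paths:
        if exists(path):
            return path
    return None


font_path = first_existing(CANDIDATE_FONTS)
print(font_path or 'No candidate font found; install a CJK font or set the path by hand.')
```

The result can then be passed as `font_path=` when constructing `WordCloud`.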
The complete code is as follows:
```
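import re
from collections import Counter

import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
from wordcloud import WordCloud

url = 'https://movie.douban.com/top250'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Fetch all ten pages; the `start` query parameter selects the offset (0, 25, ..., 225).
movies = []
for start in range(0, 250, 25):
    response = requests.get(url, headers=headers, params={'start': start}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    movies.extend(soup.find_all('div', class_='info'))

# Extract the director from the first line of each film's description.
directors = []
for movie in movies:
    first_line = movie.find('div', class_='bd').p.text.strip().split('\n')[0]
    match = re.search(r'导演[::]\s*(.+?)(?:\s{2,}|主演|\.\.\.|$)', first_line)
    if match:
        directors.append(match.group(1).strip())

# Count how many films each director has in the list.
director_count = Counter(directors)

# font_path must point to a font that contains Chinese glyphs.
wordcloud = WordCloud(font_path='simhei.ttf', background_color='white', max_words=200, width=1200, height=800).generate_from_frequencies(director_count)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()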
```