Scraping Douban movie reviews with Python and generating a word cloud
Posted: 2023-12-20 12:31:58
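The steps below scrape the short reviews for a Douban movie with Python and turn them into a word cloud:
1. Import the required libraries and modules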
```python
import requests
from bs4 import BeautifulSoup
import jieba
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np
```
2. Fetch the page source and parse it
```python
url = 'https://movie.douban.com/subject/26363254/comments?status=P'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early if the request was blocked or failed
soup = BeautifulSoup(response.text, 'html.parser')
```
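The URL in step 2 returns only a single page of short reviews. As a minimal sketch of how more pages might be requested, the helper below builds a list of page URLs. It assumes that the comments page accepts `start` and `limit` query parameters for pagination, which the original post does not state; `build_comment_urls`, `pages`, and `page_size` are hypothetical names introduced here for illustration.

```python
from urllib.parse import urlencode


def build_comment_urls(subject_id, pages=3, page_size=20):
    """Build one comments-page URL per page for a Douban subject.

    Assumption: the site paginates with `start` (offset) and `limit`
    (page size) query parameters. This is not confirmed by the post.
    """
    base = f'https://movie.douban.com/subject/{subject_id}/comments'
    urls = []
    for page in range(pages):
        query = urlencode({
            'start': page * page_size,  # offset of the first comment on this page
            'limit': page_size,         # number of comments per page
            'status': 'P',              # same filter as the URL in step 2
        })
        urls.append(f'{base}?{query}')
    return urls


comment_urls = build_comment_urls('26363254', pages=3)
for comment_url in comment_urls:
    print(comment_url)
```

Each URL could then be fetched and parsed the same way as in step 2, ideally with a short pause between requests.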
3. Extract the comment text and segment it into words
```python
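# Each short review sits in a <span class="short"> element
comments = soup.find_all('span', class_='short')
# Join with newlines so words from adjacent comments do not run together
comment_text = '\n'.join(comment.text.strip() for comment in comments)
# jieba.cut returns a generator of tokens, so it can be iterated only once
words = jieba.cut(comment_text)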
```
4. Count word frequencies and generate the word cloud
```python
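word_counts = {}
for word in words:
    # Skip single-character tokens (mostly particles, punctuation, and whitespace)
    if len(word.strip()) <= 1:
        continue
    word_counts[word] = word_counts.get(word, 0) + 1

# font_path must point to a font with Chinese glyphs, otherwise the words render as boxes;
# msyh.ttc is Microsoft YaHei, which ships with Windows
wordcloud = WordCloud(font_path='msyh.ttc', background_color='white', max_words=200, max_font_size=100, width=800, height=600)
wordcloud.generate_from_frequencies(word_counts)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()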
```
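The counting loop in step 4 can also be written with `collections.Counter`, and common but uninformative words can be dropped at the same time. The following is a minimal sketch under stated assumptions: `count_words`, `STOPWORDS`, and the sample token list are hypothetical and only stand in for the output of `jieba.cut`; a real stop-word list would be much longer.

```python
from collections import Counter

# Hypothetical stop-word set for illustration only
STOPWORDS = {'电影', '一个', '没有'}


def count_words(tokens, stopwords=STOPWORDS, min_length=2):
    """Count tokens, skipping tokens shorter than min_length and stop words."""
    counts = Counter()
    for token in tokens:
        token = token.strip()
        if len(token) < min_length or token in stopwords:
            continue
        counts[token] += 1
    return counts


# Stand-in for the tokens produced by jieba.cut(comment_text)
sample_tokens = ['这部', '电影', '的', '剧情', '很', '精彩', '，', '剧情', '紧凑']
sample_counts = count_words(sample_tokens)
print(sample_counts.most_common(3))
```

Because `Counter` is a `dict` subclass, the result can be passed to `generate_from_frequencies` in the same way as `word_counts`.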
5. Generate a word cloud shaped and colored by an image
```python
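# Load the mask image; pure white areas are treated as masked out and stay empty
mask = np.array(Image.open('movie.png'))
# Take word colors from the corresponding pixels of the image
image_colors = ImageColorGenerator(mask)
# width and height are omitted: when a mask is given, the canvas takes the shape of the mask
wordcloud = WordCloud(font_path='msyh.ttc', background_color='white', max_words=200, max_font_size=100, mask=mask)
wordcloud.generate_from_frequencies(word_counts)
plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis('off')
plt.show()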
```