Scraping Douban movie reviews with Python to build a word cloud

Posted: 2023-12-20 12:31:58 Views: 45

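The steps below scrape the short reviews of a Douban movie with Python and render them as a word cloud:

1. Import the required libraries and modules

```python
import requests
from bs4 import BeautifulSoup
import jieba
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np
```

2. Fetch the page source and parse it

```python
url = 'https://movie.douban.com/subject/26363254/comments?status=P'
# Send a browser-like User-Agent; the default one used by requests may be rejected
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on a 4xx/5xx response
soup = BeautifulSoup(response.text, 'html.parser')
```

3. Extract the review text and segment it into words

```python
comments = soup.find_all('span', class_='short')
# Join with newlines so that words from adjacent reviews are not merged
comment_text = '\n'.join(comment.text for comment in comments)
words = jieba.cut(comment_text)  # returns a generator of tokens
```

4. Count word frequencies and generate the word cloud

```python
word_counts = {}
for word in words:
    word = word.strip()
    if len(word) < 2:  # skip whitespace and single-character tokens
        continue
    word_counts[word] = word_counts.get(word, 0) + 1

# font_path must point to a font with Chinese glyphs;
# msyh.ttc is Microsoft YaHei, which ships with Windows
wordcloud = WordCloud(font_path='msyh.ttc', background_color='white', max_words=200, max_font_size=100, width=800, height=600)
wordcloud.generate_from_frequencies(word_counts)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```

5. Generate a word cloud shaped and colored by an image

```python
# movie.png is a local image; its pure-white areas are treated as masked out.
# Converting to RGB gives ImageColorGenerator the 3-channel array it expects.
mask = np.array(Image.open('movie.png').convert('RGB'))
image_colors = ImageColorGenerator(mask)
# width and height are ignored when a mask is given, so they are omitted here
wordcloud = WordCloud(font_path='msyh.ttc', background_color='white', max_words=200, max_font_size=100, mask=mask)
wordcloud.generate_from_frequencies(word_counts)
plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis('off')
plt.show()
```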
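
The counting loop in step 4 could also be written with `collections.Counter`, which makes it easy to drop common filler words before drawing the cloud. The following is a minimal sketch, not part of the original answer: the function name `count_words`, the `min_len` parameter, and the tiny `STOP_WORDS` set are all hypothetical and chosen only for illustration. It uses the standard library alone, so it can be tried on any list of tokens without installing `jieba`.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical stop-word set for illustration only; a real list would be much longer.
STOP_WORDS = frozenset({"一个", "这个", "没有"})


def count_words(tokens: Iterable[str],
                stop_words: frozenset = STOP_WORDS,
                min_len: int = 2) -> Dict[str, int]:
    """Count tokens, skipping blanks, short tokens, and stop words."""
    cleaned = (token.strip() for token in tokens)
    kept = (t for t in cleaned if len(t) >= min_len and t not in stop_words)
    return dict(Counter(kept))


# Example with a hand-written token list standing in for the output of jieba.cut()
sample_tokens = ["电影", "的", "电影", "一个", "好看", "  ", "好看", "电影"]
sample_counts = count_words(sample_tokens)
print(sample_counts)
```

The returned dictionary has the same shape as `word_counts`, so under these assumptions it could be passed to `generate_from_frequencies` unchanged, for example `count_words(jieba.cut(comment_text))`.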