Scrape 100 news titles from http://xc.hfut.edu.cn/1955/list1.htm (pagination required), segment them with the jieba module, and draw a word cloud using an image as the background.
Date: 2023-12-16 11:04:14
Sure, here are the code and steps.
First, install the jieba and wordcloud modules:
```python
!pip install jieba
!pip install wordcloud
```
Next, scrape the news titles and segment them:
```python
import requests
from bs4 import BeautifulSoup
import jieba

url = "http://xc.hfut.edu.cn/1955/list{}.htm"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}

news_list = []
for i in range(1, 6):  # scrape the first 5 pages (assuming ~20 titles per page, ~100 in total)
    res = requests.get(url.format(i), headers=headers)
    res.encoding = res.apparent_encoding  # avoid mojibake on Chinese pages
    soup = BeautifulSoup(res.text, "html.parser")
    news = soup.find_all("span", {"class": "news_title"})
    for n in news:
        if n.a and n.a.string:  # skip entries without a link or text
            news_list.append(n.a.string.strip())

# Segment the news titles with jieba
words_list = []
for news in news_list:
    words_list.extend(jieba.cut(news))
```
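jieba emits many single-character function words, so it can help to filter stopwords and inspect the most frequent tokens before plotting. A minimal standard-library sketch (the sample tokens below are hypothetical stand-ins for the `words_list` built above):

```python
from collections import Counter

# Hypothetical token stream, standing in for the jieba output in words_list
words_list = ["合肥", "工业", "大学", "的", "宣传", "部", "举办", "了", "活动", "宣传"]
stop_words = {"的", "了", "是", "在"}

# Drop stopwords and whitespace-only tokens before counting
filtered = [w for w in words_list if w.strip() and w not in stop_words]
top = Counter(filtered).most_common(3)
print(top)  # "宣传" appears twice; every other token appears once
```

Checking the top tokens this way makes it easy to spot noise words that should be added to the stopword list.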
Then use the wordcloud module to generate the word cloud:
```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

# Load the background image to use as the mask
image = Image.open("background.jpg")
graph = np.array(image)

# Stopwords: common Chinese function words
stop_words = {"的", "是", "在", "了", "和", "与", "也", "还", "有", "就", "等", "中", "及", "对"}

# Build the word cloud (font_path must point to a font with Chinese glyphs, e.g. msyh.ttc)
wc = WordCloud(font_path="msyh.ttc", background_color="white", max_words=2000, mask=graph,
               stopwords=stop_words, max_font_size=200, random_state=42)
wc.generate_from_text(" ".join(words_list))

# Display the word cloud
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```
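As an alternative to joining the tokens with spaces, WordCloud also accepts a word-to-frequency mapping via `generate_from_frequencies`, which skips its internal tokenization entirely. A sketch of building that mapping with the standard library (the sample tokens are hypothetical):

```python
from collections import Counter

# Hypothetical segmented titles, standing in for words_list above
words_list = ["校园", "新闻", "讲座", "校园", "活动", "校园"]
freq = dict(Counter(words_list))

# WordCloud can consume the mapping directly, e.g.:
# wc.generate_from_frequencies(freq)
print(freq["校园"])  # → 3
```

This route gives full control over filtering and weighting before the cloud is drawn.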
A background image is required; you can download any image you like and use it as the background. The resulting word cloud looks like this:
![word cloud](https://i.loli.net/2021/06/18/Tr1eq8VXm7D4K3j.png)
Hope this code helps!