Give me a complete Python crawler script that scrapes CSDN pages and visualizes the data
Time: 2024-09-07 17:04:58 Views: 43
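In Python, libraries such as BeautifulSoup, requests, and matplotlib can be used to scrape a web page and visualize the results. Below is a basic example that assumes we want to scrape CSDN blog post titles and count word frequencies: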
```python
import re
from collections import Counter

import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup

# Fetch the CSDN blog home page
url = 'https://blog.csdn.net/'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')

# Extract article titles (the 'title' class depends on the page's current markup)
titles = [title.get_text(strip=True) for title in soup.find_all('a', class_='title')]

# Normalize the titles: lowercase and collapse whitespace
clean_titles = [' '.join(title.lower().split()) for title in titles]

# Extract words with a regular expression and flatten them into one list
words = [word for title in clean_titles for word in re.findall(r'\b\w+\b', title)]

# Count word frequencies
word_freq = Counter(words)
if not word_freq:
    raise SystemExit('No titles found; check the selector against the page markup.')

# Visualize the data, most frequent words first
plt.figure(figsize=(10, 5))
word_counts = word_freq.most_common()
x_labels, y_values = zip(*word_counts)
plt.bar(x_labels, y_values)
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Word Frequency of CSDN Blog Titles')
plt.show()
```
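The word-counting step can be tried on its own, without a network request. The following is a minimal sketch using only the standard library; the function name `count_title_words`, the `top_n` parameter, and the sample titles are all hypothetical and chosen for illustration, not taken from CSDN.

```python
import re
from collections import Counter


def count_title_words(titles, top_n=10):
    """Return the top_n most frequent words across titles as (word, count) pairs."""
    counts = Counter()
    for title in titles:
        # \w+ matches runs of letters, digits, and underscores (Unicode-aware in Python 3)
        counts.update(re.findall(r"\w+", title.lower()))
    return counts.most_common(top_n)


# Hypothetical sample titles, used here in place of a live request
sample_titles = [
    "Python crawler tutorial",
    "Python data visualization with matplotlib",
    "A crawler for blog titles",
]
top_words = count_title_words(sample_titles, top_n=3)
print(top_words)
```

Note that a pattern like `\w+` treats an unbroken run of Chinese characters as a single "word", so for Chinese titles a word-segmentation step would probably be needed before counting. Limiting the chart to the top few words, as `top_n` does here, also tends to keep the bar labels readable.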