Code for Crawling News Data and Performing Text Clustering Analysis
Below is a simple Python code example for crawling news data and performing text clustering analysis on it:
```python
import requests
from bs4 import BeautifulSoup
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt


# Crawl news data: collect article titles and body paragraphs from a listing page
def get_news(url):
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    news = soup.select('h2 > a')  # headline links on the listing page
    content = []
    for n in news:
        href = n['href']
        title = n.text
        news_res = requests.get(href)
        news_res.encoding = 'utf-8'
        news_soup = BeautifulSoup(news_res.text, 'html.parser')
        article = news_soup.select('.article-content-inner > p')
        content.append(title)
        for p in article:
            content.append(p.text)
    return content


# Text preprocessing: segment with jieba, drop stopwords and single-character tokens
def preprocess(content):
    stopwords = [line.strip() for line in open('stopwords.txt', 'r', encoding='utf-8').readlines()]
    corpus = []
    for c in content:
        words = jieba.cut(c)
        words = [w for w in words if w not in stopwords and len(w) > 1]
        corpus.append(' '.join(words))
    return corpus


# Text clustering: TF-IDF features followed by K-means
def cluster_analysis(corpus):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    kmeans = KMeans(n_clusters=5)
    kmeans.fit(X)
    labels = kmeans.labels_
    return labels


# Visualization: plot each document's cluster label against its index
def visualization(corpus, labels):
    plt.rcParams['font.sans-serif'] = ['SimHei']   # allow Chinese characters in the plot
    plt.rcParams['axes.unicode_minus'] = False
    plt.scatter(labels, range(len(corpus)))
    plt.show()


# Main entry point
if __name__ == '__main__':
    url = 'http://news.sina.com.cn/china/'
    content = get_news(url)
    corpus = preprocess(content)
    labels = cluster_analysis(corpus)
    visualization(corpus, labels)
```
Note that the code above is for reference only; in practice it needs to be adapted and refined to your specific requirements. Pay particular attention to details such as text preprocessing and the parameter settings of the clustering algorithm (for example, the number of clusters); one possible way to tune and visualize the result is sketched below.
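As a minimal sketch of such tuning, the snippet below picks `n_clusters` by silhouette score and replaces the label-versus-index scatter with a 2D TruncatedSVD projection of the TF-IDF matrix. It assumes `corpus` is the list of space-joined segmented texts returned by `preprocess()` above; the candidate range `range(2, 10)`, `n_init=10`, and `random_state=42` are illustrative choices, not values from the original code.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt


def choose_k(X, k_range=range(2, 10)):
    """Return the cluster count in k_range with the highest silhouette score."""
    best_k, best_score = 2, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def plot_clusters_2d(X, labels):
    """Project the TF-IDF matrix to 2D with TruncatedSVD and color points by cluster."""
    coords = TruncatedSVD(n_components=2, random_state=42).fit_transform(X)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='tab10', s=20)
    plt.xlabel('SVD component 1')
    plt.ylabel('SVD component 2')
    plt.show()


# Example usage, assuming `corpus` comes from preprocess() above:
# X = TfidfVectorizer().fit_transform(corpus)
# k = choose_k(X)
# labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
# plot_clusters_2d(X, labels)
```

TruncatedSVD is used instead of PCA because the TF-IDF matrix is sparse, which PCA in scikit-learn does not accept directly.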