k-means text clustering algorithm code
Posted: 2024-05-22 22:08:36
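k-means is a clustering algorithm commonly applied to text. The idea is to partition a collection of documents into k clusters so that the documents within a cluster have similar features. In practice, k data points are first chosen at random as the initial centroids. Each data point is then assigned to the cluster of its nearest centroid, and each cluster's centroid is recomputed as the mean of its members. The assignment and update steps repeat until convergence, that is, until the centroids (and therefore the assignments) stop changing.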
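The four steps just described (random initialization, assignment, centroid update, repeat until convergence) can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn implements it: the function name `kmeans_sketch`, the `max_iter` and `seed` parameters, and the toy 2-D points are all assumptions made for this example, and it works on a small dense array rather than on sparse TF-IDF features.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain k-means on a dense (n_samples, n_features) array (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated groups of 2-D points (made-up data for illustration).
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

The first three points end up sharing one label and the last three the other; which group is numbered 0 depends on the random initialization.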
The following Python code implements k-means text clustering with scikit-learn:
```
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def kmeans_cluster(docs, num_clusters):
    # Extract text features: TF-IDF vectors with English stop words removed
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(docs)
    # Cluster; n_init is set explicitly because its default differs between scikit-learn versions
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    kmeans.fit(X)
    # Group the document indices by their assigned cluster label
    clusters = [[] for _ in range(num_clusters)]
    for i, label in enumerate(kmeans.labels_):
        clusters[label].append(i)
    # Print the documents in each cluster
    for i, cluster in enumerate(clusters):
        print('Cluster %d:' % i)
        for doc_index in cluster:
            print('\t%s' % docs[doc_index])


# Example
docs = ['This is a sample sentence.',
        'Another example sentence.',
        'I love programming.',
        'Python is a great programming language.',
        'Java is also a popular programming language.']
kmeans_cluster(docs, 2)
```
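Running it produces output like the following (which group is numbered 0 or 1 can vary with the scikit-learn version):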
```
Cluster 0:
	I love programming.
	Python is a great programming language.
	Java is also a popular programming language.
Cluster 1:
	This is a sample sentence.
	Another example sentence.
```
The code above extracts text features with TfidfVectorizer and then clusters the documents with the KMeans algorithm.