Text clustering with k-means: source code implementation
Date: 2023-07-03 15:04:11
Below is example Python code that uses the scikit-learn library to perform k-means text clustering:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Document collection
corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

# Convert the documents to TF-IDF vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Cluster the vectors with k-means
k = 2
kmeans = KMeans(n_clusters=k, random_state=0).fit(X)

# Print the documents assigned to each cluster
for i in range(k):
    print("Cluster", i + 1, ":")
    for j in range(len(corpus)):
        if kmeans.labels_[j] == i:
            print(corpus[j])
    print("\n")  # blank lines between clusters
```
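Running it produces output like the following (cluster numbering and the exact assignments can vary with the scikit-learn version):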
```
Cluster 1 :
this is the first document
is this the first document
Cluster 2 :
this document is the second document
and this is the third one
```
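To summarize, the example converts the document collection into TF-IDF vectors with `TfidfVectorizer`, clusters those vectors with `KMeans`, and prints the documents in each cluster. The number of clusters is fixed at k = 2 in this example.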
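The value k = 2 is chosen by hand in the example. One common way to compare candidate values is the silhouette score, sketched below under some assumptions: the candidate range `(2, 3)` and the variable names `scores` and `best_k` are illustrative, and `n_init=10` is set explicitly to avoid version-dependent defaults. With only four documents the scores are purely illustrative. Note that `silhouette_score` requires the number of clusters to be at least 2 and at most the number of samples minus 1.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

X = TfidfVectorizer().fit_transform(corpus)

# Fit k-means for each candidate k and record the mean silhouette score
scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print("k =", k, "silhouette =", round(score, 3))
print("best k by silhouette:", best_k)
```

The silhouette score is only one heuristic. On a real corpus it is worth checking the selected k against the cluster contents themselves.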