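Implement the following in Python:
1. Data preprocessing: segment each document with Jieba; use TF-IDF to extract the key feature words of each document; use the TF-IDF values of those feature words as the document's feature vector.
2. From the preprocessed documents (each represented by a feature vector), randomly select 40 documents from each category, giving 40×5=200 documents in total.
3. Run the K-means algorithm on the 200 selected documents to partition them into 5 clusters, each cluster representing one category.
4. Draw a scatter plot of the clustering result and compute the FMI (Fowlkes-Mallows index).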
Date: 2023-08-24 16:05:08
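The following Python code implements each step of the workflow above. Each snippet uses small placeholder data to show one step in isolation.
Data preprocessing: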
```
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

# Segment a document with Jieba and join the tokens with spaces
def cut_words(text):
    words = jieba.cut(text)
    return " ".join(words)

# TF-IDF feature extraction: returns the vocabulary and the document-term matrix
def get_tfidf(corpus):
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(corpus)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    return vectorizer.get_feature_names_out(), tfidf.toarray()

corpus = ['这是一篇文档', '这是另一篇文档']
corpus_cut = [cut_words(text) for text in corpus]
feature_names, tfidf = get_tfidf(corpus_cut)
```
Randomly selecting 40 documents from each category:
```
import random

# Randomly select up to num documents from each category
def random_sample(docs, labels, num):
    sample_docs, sample_labels = [], []
    for label in sorted(set(labels)):
        docs_label = [doc for doc, lab in zip(docs, labels) if lab == label]
        # Cap at the category size; random.sample raises ValueError otherwise
        picked = random.sample(docs_label, min(num, len(docs_label)))
        sample_docs.extend(picked)
        sample_labels.extend([label] * len(picked))
    return sample_docs, sample_labels

docs = ['文档1', '文档2', '文档3', '文档4', '文档5', '文档6', '文档7', '文档8', '文档9', '文档10']
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
# For the task described above, num would be 40 with five categories
sample_docs, sample_labels = random_sample(docs, labels, 2)
```
K-means clustering:
```
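from sklearn.cluster import KMeans

# K-means clustering: returns the cluster label assigned to each sample
def kmeans_clustering(features, n_clusters):
    # n_init and random_state are set explicitly so that runs are reproducible
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    kmeans.fit(features)
    return kmeans.labels_

features = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
labels = kmeans_clustering(features, 2)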
```
Computing the FMI:
```
from sklearn.metrics import fowlkes_mallows_score

# Compute the FMI between the true categories and the predicted clusters
def fmi_score(labels_true, labels_pred):
    return fowlkes_mallows_score(labels_true, labels_pred)

labels_true = [0, 1, 2, 0, 1]
labels_pred = [1, 0, 2, 1, 0]
fmi = fmi_score(labels_true, labels_pred)
```
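To make clear what the score measures, the FMI can also be computed by hand from pair counts: FMI = TP / sqrt((TP + FP) × (TP + FN)), where TP counts pairs of samples placed together in both labelings, FP counts pairs together only in the prediction, and FN counts pairs together only in the ground truth. The helper name `fmi_from_pairs` below is illustrative, and the sketch is meant as a cross-check rather than a replacement for `fowlkes_mallows_score`.

```python
from itertools import combinations
from math import sqrt

def fmi_from_pairs(labels_true, labels_pred):
    """Fowlkes-Mallows index computed by counting sample pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # together in both labelings
        elif same_pred:
            fp += 1  # together only in the predicted clustering
        elif same_true:
            fn += 1  # together only in the true categories
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))

# The cluster ids differ from the category ids, but the grouping is identical
print(fmi_from_pairs([0, 1, 2, 0, 1], [1, 0, 2, 1, 0]))  # prints 1.0
```

Because the index depends only on which samples are grouped together, the numbering of the clusters does not affect the score, so K-means cluster ids need not be matched to category ids before scoring.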
Drawing a scatter plot of the clustering result:
```
import matplotlib.pyplot as plt

# Draw a scatter plot of the clustering result, colored by cluster label
def plot_clusters(features, labels):
    plt.scatter([f[0] for f in features], [f[1] for f in features], c=labels)
    plt.show()

features = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
labels = [0, 1, 0, 1, 1]
plot_clusters(features, labels)
```
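The snippets above each run on placeholder data, and `plot_clusters` draws only the first two feature dimensions, which says little about high-dimensional TF-IDF vectors. One possible way to connect the steps is sketched below: vectorize, cluster, score, then project to 2D with PCA before plotting. The use of PCA is an assumption (the original task does not name a projection method), and the tiny pre-segmented corpus with 2 categories of 3 documents stands in for the real 5×40 sample.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import fowlkes_mallows_score

# Hypothetical corpus, already segmented (Jieba output joined with spaces)
docs = [
    "股票 市场 上涨", "股票 基金 市场", "基金 上涨 股票",
    "球队 比赛 获胜", "比赛 球员 球队", "球员 获胜 比赛",
]
labels_true = [0, 0, 0, 1, 1, 1]

# Step 1: TF-IDF feature vectors (token_pattern keeps one-character tokens)
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf = vectorizer.fit_transform(docs).toarray()

# Step 3: K-means with one cluster per category (5 for the real task)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels_pred = kmeans.fit_predict(tfidf)

# Step 4a: agreement between clusters and true categories
fmi = fowlkes_mallows_score(labels_true, labels_pred)

# Step 4b: project to two dimensions so the clusters can be drawn
points_2d = PCA(n_components=2, random_state=0).fit_transform(tfidf)
```

`points_2d` and `labels_pred` can then be passed to `plot_clusters` in place of the placeholder `features` and `labels`.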