Using Python, build text classification models on a txt corpus with bag-of-words, TF-IDF, LDA, and word vectors, then evaluate the models and analyze the results
Posted: 2023-07-10 11:13:34
Sure, here is some sample code. Note that it is for reference only; in practice you will need to adapt and improve it for your specific data.
First, prepare a text classification dataset: a set of labeled text samples and their corresponding class labels. The examples below use the 20 Newsgroups dataset (20news-18828), which you can download and unpack.
Next, you can build text classification models with different feature extraction methods and algorithms. Here are some examples:
1. Text classification with a bag-of-words model
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
# Load the dataset (one document per line, labels aligned line by line)
data_path = '20news-18828'
with open(data_path + '/data.txt', 'r') as f:
    texts = f.readlines()
with open(data_path + '/target.txt', 'r') as f:
    labels = [line.strip() for line in f]
# Feature extraction: raw term counts (bag of words)
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(texts)
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, labels, test_size=0.2, random_state=42)
# Train the classifier
clf = MultinomialNB()
clf.fit(x_train, y_train)
# Evaluate the model
y_pred = clf.predict(x_test)
acc = accuracy_score(y_test, y_pred)
pre = precision_score(y_test, y_pred, average='macro')
rec = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
print('Accuracy:', acc)
print('Precision:', pre)
print('Recall:', rec)
print('F1 score:', f1)
```
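Accuracy and macro-averaged scores summarize overall performance, but for analyzing the results a per-class breakdown is usually more informative. Below is a minimal self-contained sketch of that analysis step, using a tiny made-up corpus in place of `data.txt`/`target.txt` (the documents and the `sports`/`tech` labels are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

# Tiny hypothetical corpus standing in for data.txt / target.txt
train_texts = [
    "the team won the football match",
    "a great goal in the soccer game",
    "the striker scored a late goal",
    "new laptop cpu benchmark released",
    "the software update fixes bugs",
    "install the python library via pip",
]
train_labels = ["sports", "sports", "sports", "tech", "tech", "tech"]
test_texts = ["the team scored a goal", "update the software library"]
test_labels = ["sports", "tech"]

vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(train_texts)
x_test = vectorizer.transform(test_texts)  # transform only: reuse the training vocabulary

clf = MultinomialNB().fit(x_train, train_labels)
y_pred = clf.predict(x_test)

# Per-class precision/recall/F1, plus the confusion matrix to see which
# classes get mixed up with each other
print(classification_report(test_labels, y_pred, zero_division=0))
print(confusion_matrix(test_labels, y_pred, labels=["sports", "tech"]))
```

Note that the vectorizer is fitted on the training texts only and merely applied to the test texts; fitting it on all texts before splitting would leak vocabulary statistics from the test set.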
2. Text classification with TF-IDF
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
# Load the dataset (one document per line, labels aligned line by line)
data_path = '20news-18828'
with open(data_path + '/data.txt', 'r') as f:
    texts = f.readlines()
with open(data_path + '/target.txt', 'r') as f:
    labels = [line.strip() for line in f]
# Feature extraction: TF-IDF-weighted term counts
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(texts)
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, labels, test_size=0.2, random_state=42)
# Train the classifier
clf = MultinomialNB()
clf.fit(x_train, y_train)
# Evaluate the model
y_pred = clf.predict(x_test)
acc = accuracy_score(y_test, y_pred)
pre = precision_score(y_test, y_pred, average='macro')
rec = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
print('Accuracy:', acc)
print('Precision:', pre)
print('Recall:', rec)
print('F1 score:', f1)
```
3. Text classification with LDA topic features
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
# Load the dataset (one document per line, labels aligned line by line)
data_path = '20news-18828'
with open(data_path + '/data.txt', 'r') as f:
    texts = f.readlines()
with open(data_path + '/target.txt', 'r') as f:
    labels = [line.strip() for line in f]
# Feature extraction: term counts as input to LDA
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(texts)
# Reduce each document to a 10-dimensional topic distribution
# (random_state makes the decomposition reproducible)
lda = LatentDirichletAllocation(n_components=10, random_state=42)
x_lda = lda.fit_transform(x)
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_lda, labels, test_size=0.2, random_state=42)
# Topic proportions are non-negative, so MultinomialNB still applies
clf = MultinomialNB()
clf.fit(x_train, y_train)
# Evaluate the model
y_pred = clf.predict(x_test)
acc = accuracy_score(y_test, y_pred)
pre = precision_score(y_test, y_pred, average='macro')
rec = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
print('Accuracy:', acc)
print('Precision:', pre)
print('Recall:', rec)
print('F1 score:', f1)
```
4. Text classification with word vectors (Word2Vec)
```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
# Load the dataset (one document per line, labels aligned line by line)
data_path = '20news-18828'
with open(data_path + '/data.txt', 'r') as f:
    texts = f.readlines()
with open(data_path + '/target.txt', 'r') as f:
    labels = [line.strip() for line in f]
# Train Word2Vec on the tokenized corpus (gensim 4.x uses vector_size, not size)
sentences = [text.strip().split() for text in texts]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
# Represent each document as the mean of its word vectors
x = []
for sentence in sentences:
    vecs = [model.wv[word] for word in sentence if word in model.wv]
    x.append(np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size))
x = np.array(x)
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, labels, test_size=0.2, random_state=42)
# Averaged word vectors contain negative values, so MultinomialNB does not apply;
# use a classifier that accepts real-valued features, e.g. logistic regression
clf = LogisticRegression(max_iter=1000)
clf.fit(x_train, y_train)
# Evaluate the model
y_pred = clf.predict(x_test)
acc = accuracy_score(y_test, y_pred)
pre = precision_score(y_test, y_pred, average='macro')
rec = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
print('Accuracy:', acc)
print('Precision:', pre)
print('Recall:', rec)
print('F1 score:', f1)
```
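Finally, to compare the representations on an equal footing, cross-validating full pipelines is more reliable than a single train/test split, because the vectorizer is re-fitted inside each fold and no test data leaks into the vocabulary. A hedged sketch comparing bag-of-words and TF-IDF on a small made-up corpus (the same pattern extends to the LDA and word-vector features):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus; real comparisons need far more data
texts = [
    "the team won the football match",
    "a great goal in the soccer game",
    "the striker scored a late goal",
    "our club lost the away game",
    "fans cheered the winning team",
    "the coach praised the defence",
    "new laptop cpu benchmark released",
    "the software update fixes bugs",
    "install the python library via pip",
    "the compiler reported a type error",
    "restart the server after the upgrade",
    "the database query was slow",
]
labels = ["sports"] * 6 + ["tech"] * 6

pipelines = {
    "bow": make_pipeline(CountVectorizer(), MultinomialNB()),
    "tfidf": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}
results = {}
for name, pipe in pipelines.items():
    # Each fold re-fits the whole pipeline, vectorizer included
    scores = cross_val_score(pipe, texts, labels, cv=3)
    results[name] = scores.mean()
    print(name, results[name])
```

The resulting mean accuracies give a like-for-like basis for the analysis: differences between the representations are then attributable to the features rather than to a lucky split.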
In short, building and evaluating a text classification model involves choosing a feature extraction method, selecting an algorithm, and tuning parameters. In practice, you should make these choices for your specific data and compare the metrics across representations to see which one suits your corpus best.