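Given training and test sets of sensitive-word txt files, where the training set and the test set each contain one txt file of non-sensitive words and one txt file of sensitive words, with one word per line, use Python and a CNN to train a model that performs well on the test set. Provide the code, the F1 score calculation, and the AUC-ROC curve plot.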
Posted: 2024-12-22 20:12:25
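To train a CNN (convolutional neural network) for sensitive-word detection in Python, first install the required libraries, such as `tensorflow`, `numpy`, `pandas`, `scikit-learn`, and `matplotlib`. A brief outline of the steps follows: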
1. **Data preprocessing**:
- Import the required libraries and read the txt files.
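```python
import numpy as np

def load_words(file_path):
    # One word per line; skip blank lines.
    with open(file_path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

def load_split(normal_path, sensitive_path):
    # Label non-sensitive words 0 and sensitive words 1.
    normal, sensitive = load_words(normal_path), load_words(sensitive_path)
    return normal + sensitive, np.array([0] * len(normal) + [1] * len(sensitive))

# File names are placeholders; substitute your own paths.
X_train, y_train = load_split('train_normal.txt', 'train_sensitive.txt')
X_test, y_test = load_split('test_normal.txt', 'test_sensitive.txt')
```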
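To make the expected file layout concrete, the sketch below writes two tiny placeholder files (one word per line) to a temporary directory and reads them back with labels. The file names, the sample words, and the helper `read_labeled_words` are illustrative assumptions, not part of the original dataset.

```python
import os
import tempfile

def read_labeled_words(normal_path, sensitive_path):
    """Return (words, labels): label 0 for non-sensitive, 1 for sensitive."""
    words, labels = [], []
    for path, label in ((normal_path, 0), (sensitive_path, 1)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                word = line.strip()
                if word:  # ignore blank lines
                    words.append(word)
                    labels.append(label)
    return words, labels

# Build two toy files in a temporary directory (hypothetical contents).
tmp_dir = tempfile.mkdtemp()
normal_file = os.path.join(tmp_dir, "normal.txt")
sensitive_file = os.path.join(tmp_dir, "sensitive.txt")
with open(normal_file, "w", encoding="utf-8") as f:
    f.write("apple\nweather\n\nmusic\n")
with open(sensitive_file, "w", encoding="utf-8") as f:
    f.write("badword1\nbadword2\n")

demo_words, demo_labels = read_labeled_words(normal_file, sensitive_file)
print(demo_words)
print(demo_labels)
```

Keeping the words and labels in two parallel lists makes it easy to check that every word received exactly one label before moving on to vectorization.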
2. **Text vectorization**:
Use the Keras `Tokenizer` to convert the text into integer sequences.
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Each sample is a single word, so tokenize at the character level;
# a word-level tokenizer would drop every test word not seen in training.
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad (or truncate) every sequence to the longest training sequence.
max_len = max(len(seq) for seq in X_train_seq)
X_train_padded = pad_sequences(X_train_seq, maxlen=max_len)
X_test_padded = pad_sequences(X_test_seq, maxlen=max_len)
```
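The vectorization step can be illustrated without any deep learning library. The following is a minimal pure-Python sketch of character-level indexing and left padding; the helper names (`build_char_index`, `encode`, `pad_left`) and the toy words are assumptions made for illustration. Unlike the Keras `Tokenizer`, which assigns ids by frequency, this sketch assigns ids in order of first appearance.

```python
def build_char_index(words):
    """Map each character to an integer id starting at 1 (0 is reserved for padding)."""
    index = {}
    for word in words:
        for ch in word:
            if ch not in index:
                index[ch] = len(index) + 1
    return index

def encode(word, index):
    """Convert a word to a list of ids, dropping characters unseen in training."""
    return [index[ch] for ch in word if ch in index]

def pad_left(seq, max_len):
    """Keep the last max_len ids and left-pad with zeros (max_len must be >= 1)."""
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq

toy_words = ["abc", "bd"]
char_index = build_char_index(toy_words)
toy_max_len = max(len(w) for w in toy_words)
toy_encoded = [pad_left(encode(w, char_index), toy_max_len) for w in toy_words]
print(char_index)
print(toy_encoded)
```

Padding and truncating on the left mirrors the default behavior of `pad_sequences`, so every word ends up as a fixed-length row that can be stacked into one input matrix.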
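3. **Build the CNN model**:
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

# tokenizer.num_words is None unless set explicitly, so derive the
# vocabulary size from word_index (+1 because index 0 is the padding value).
vocab_size = len(tokenizer.word_index) + 1

model = Sequential([
    Embedding(vocab_size, 16),
    # padding='same' keeps the layer valid even when max_len < kernel_size.
    Conv1D(32, kernel_size=3, padding='same', activation='relu'),
    GlobalMaxPooling1D(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')  # binary classification with sigmoid output
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```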
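To show what the convolution and pooling layers compute, here is a small pure-Python sketch of a single 1D filter with no padding, followed by ReLU and global max pooling. The function names and all numbers are made up for illustration; a real `Conv1D` layer learns many such filters in parallel.

```python
def conv1d_valid(seq, kernel, bias=0.0):
    """Slide one filter over seq and apply ReLU.

    seq:    list of T vectors, each of dimension D (e.g. character embeddings)
    kernel: list of K vectors, each of dimension D
    Returns a list of T - K + 1 activations.
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        total = bias
        for offset in range(k):
            for x, w in zip(seq[start + offset], kernel[offset]):
                total += x * w
        out.append(max(0.0, total))  # ReLU
    return out

def global_max_pool(activations):
    """Reduce the feature map to its single strongest response."""
    return max(activations)

toy_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # 4 positions, 2 dims
toy_kernel = [[1.0, -1.0], [0.5, 0.5]]                      # window of 2 positions
feature_map = conv1d_valid(toy_seq, toy_kernel)
pooled = global_max_pool(feature_map)
print(feature_map)  # [1.5, 0.0, 0.0]
print(pooled)       # 1.5
```

Global max pooling keeps only the strongest match of each filter, which is why the model can flag a sensitive character pattern regardless of where it appears in the word.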
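4. **Train the model**:
```python
# The test set is passed as validation_data only to monitor progress per epoch;
# avoid tuning hyperparameters against it.
history = model.fit(X_train_padded, y_train, validation_data=(X_test_padded, y_test), epochs=10)
```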
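5. **Evaluate performance**:
```python
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, roc_auc_score, roc_curve

# Predicted probabilities, flattened from shape (n, 1) to (n,).
y_prob = model.predict(X_test_padded).ravel()
# Threshold at 0.5 to obtain hard labels for the F1 score.
y_pred = (y_prob >= 0.5).astype(int)
f1 = f1_score(y_test, y_pred)
print(f'F1 score: {f1:.4f}')

# AUC and the ROC curve use the probabilities, not the hard labels.
auc = roc_auc_score(y_test, y_prob)
fpr, tpr, _ = roc_curve(y_test, y_prob)

plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label=f'ROC curve (area = {auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')  # chance line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
```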
This assumes that `y_test` already holds binary labels (0 or 1), which is required for computing the F1 score and plotting the AUC-ROC curve.
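For reference, both metrics can be computed by hand on a toy example. The sketch below is an illustration under stated assumptions: the helper names and the four toy samples are hypothetical, and the AUC is computed with the pairwise-ranking definition (the fraction of positive/negative pairs in which the positive sample receives the higher score, ties counting as half).

```python
def f1_from_labels(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc_from_scores(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

toy_true = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_pred = [1 if s >= 0.5 else 0 for s in toy_scores]  # threshold at 0.5

toy_f1 = f1_from_labels(toy_true, toy_pred)
toy_auc = auc_from_scores(toy_true, toy_scores)
print(round(toy_f1, 4))  # 0.6667
print(toy_auc)           # 0.75
```

The example highlights the difference between the two metrics: F1 depends on the chosen threshold (here 0.5), while AUC depends only on how the scores rank positives against negatives.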