News Classification with Deep Learning
Posted: 2023-06-22 12:18:29
Sure, here is a code example for deep-learning-based news classification:
```python
import pandas as pd
import numpy as np
import re
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense
from keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the data
data = pd.read_csv('news.csv')

# Text preprocessing
def preprocess(text):
    # Remove punctuation and digits
    text = re.sub('[^a-zA-Z]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    return text

data['text'] = data['text'].apply(preprocess)
X = data['text']
y = pd.get_dummies(data['category']).values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build the vocabulary
max_words = 10000
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)
# Convert texts to integer sequences and pad them to a fixed length
max_len = 500
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)
# Build the model
embedding_dim = 100
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=max_len))
model.add(Conv1D(128, 5, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(64, activation='relu'))
model.add(Dense(y_train.shape[1], activation='softmax'))  # one output unit per category
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model
batch_size = 32
epochs = 10
early_stopping = EarlyStopping(monitor='val_loss', patience=2)
model.fit(X_train_pad, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1, callbacks=[early_stopping])
# Predict on the test set
y_pred = model.predict(X_test_pad)
y_pred = np.argmax(y_pred, axis=1)
y_true = np.argmax(y_test, axis=1)
# Evaluate
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy: {:.2f}".format(accuracy))
```
Here, `news.csv` is a data file with two columns, `text` and `category`, where `text` is the news article text and `category` is its class label. The `preprocess` function cleans each text by removing punctuation and digits and converting it to lowercase. `Tokenizer` maps words to vocabulary indices, and `pad_sequences` pads each sequence to a fixed length. The model stacks an embedding layer, a 1D convolution with global max pooling, and fully connected layers; it is compiled with `categorical_crossentropy` as the loss and `accuracy` as the metric. Finally, the model's accuracy is evaluated on the held-out test set.
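Since the `Tokenizer`/`pad_sequences` step is easy to misread, here is a standard-library-only sketch (using a made-up sample sentence) of what that pipeline does conceptually; the real Keras classes additionally handle vocabulary limits, out-of-vocabulary words, and batching:

```python
import re

# Same cleaning rule as the preprocess function above
def preprocess(text):
    # Replace every character that is not an ASCII letter with a space, then lowercase
    text = re.sub('[^a-zA-Z]', ' ', text)
    return text.lower()

sample = "Stocks rose 3% on Tuesday!"  # made-up example sentence
words = preprocess(sample).split()     # -> ['stocks', 'rose', 'on', 'tuesday']

# A minimal word index, analogous to what Tokenizer.fit_on_texts builds
# (index 0 is reserved for padding, so real words start at 1)
word_index = {w: i + 1 for i, w in enumerate(dict.fromkeys(words))}
sequence = [word_index[w] for w in words]  # -> [1, 2, 3, 4]

# pad_sequences pads on the left by default (padding='pre'), up to maxlen
max_len = 6
padded = [0] * (max_len - len(sequence)) + sequence  # -> [0, 0, 1, 2, 3, 4]
print(padded)
```

Because of the left padding, the most recent words sit at the end of every input vector, which is the layout the embedding layer in the model above expects when `maxlen` is fixed.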