Help me build a complete example of LSTM-based Chinese text classification with TensorFlow 2.0
Posted: 2023-05-22 16:00:58
Sure, I'd be happy to help. Below is an example of LSTM-based Chinese text classification built on TensorFlow 2.0.
First, import the required libraries:
```python
import tensorflow as tf
import numpy as np
import pandas as pd
import jieba
from sklearn.model_selection import train_test_split
```
Next, load the dataset. Here we use the THUCNews dataset, which contains news articles in 14 categories. You can download it here: [THUCNews](http://thuctc.thunlp.org/message)
```python
# Load the dataset
def load_data():
    # Read the CSV file
    df = pd.read_csv('thucnews.csv')
    # Tokenize with jieba, joining tokens with spaces
    df['content'] = df['content'].apply(lambda x: ' '.join(jieba.cut(x)))
    # Map labels to integer ids (sorted so the mapping is stable across runs)
    label2id = {label: idx for idx, label in enumerate(sorted(set(df['label'])))}
    df['label'] = df['label'].apply(lambda x: label2id[x])
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        df['content'], df['label'], test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test
```
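The label-to-id mapping step can be sketched in isolation. The `labels` list below is a made-up stand-in for the dataset's label column; the point is that sorting the unique labels makes the id assignment deterministic, whereas iterating a bare `set` can change order between runs:

```python
# Hypothetical label column, standing in for df['label']
labels = ['体育', '财经', '体育', '科技', '财经']

# Sort the unique labels so the id assignment is deterministic
label2id = {label: idx for idx, label in enumerate(sorted(set(labels)))}

# Map every label to its integer id
ids = [label2id[x] for x in labels]
print(label2id)  # {'体育': 0, '科技': 1, '财经': 2}
print(ids)       # [0, 2, 0, 1, 2]
```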
Next, preprocess the text. We use TensorFlow's TextVectorization layer (under `tf.keras.layers.experimental.preprocessing` in early TF 2.x; promoted to `tf.keras.layers.TextVectorization` in later releases) to convert the text into integer sequences.
```python
# Preprocess the text
def preprocess(X_train, X_test):
    # Define the TextVectorization layer
    vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(
        max_tokens=5000, output_sequence_length=500)
    # Build the vocabulary from the training set only
    vectorizer.adapt(X_train.to_numpy())
    # Convert the training and test sets to integer sequences
    X_train = vectorizer(X_train.to_numpy())
    X_test = vectorizer(X_test.to_numpy())
    return X_train, X_test
```
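Roughly, what `TextVectorization` does can be mimicked in plain Python: build a capped vocabulary from the whitespace-tokenized training text, then map each document to a fixed-length id sequence, padding with 0 and sending unseen words to an out-of-vocabulary (OOV) id. This is a simplified sketch, not TensorFlow's exact algorithm, though like the real layer it reserves id 0 for padding and id 1 for OOV and orders the vocabulary by frequency:

```python
from collections import Counter

def fit_vocab(texts, max_tokens=10):
    # Count word frequencies across the corpus
    counts = Counter(w for t in texts for w in t.split())
    # Most frequent words first; ids 0 (padding) and 1 (OOV) are reserved
    vocab = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i + 2 for i, w in enumerate(vocab)}

def vectorize(text, word2id, seq_len=6):
    ids = [word2id.get(w, 1) for w in text.split()]  # 1 = OOV
    ids = ids[:seq_len]                              # truncate long texts
    return ids + [0] * (seq_len - len(ids))          # pad short texts with 0

word2id = fit_vocab(['a b a c', 'a c d'])
print(vectorize('a c x', word2id))  # [2, 3, 1, 0, 0, 0] -- 'x' maps to OOV
```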
Next, define the LSTM model. We use a simple architecture: an embedding layer, an LSTM layer, and a fully connected output layer.
```python
# Define the LSTM model
def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=5000, output_dim=64),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(14, activation='softmax')
    ])
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model
```
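The final Dense layer's softmax and the `sparse_categorical_crossentropy` loss amount to the following computation, sketched here in NumPy for a single example with made-up logits (a 3-class toy case rather than the model's 14):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def sparse_categorical_crossentropy(logits, label):
    # Negative log-probability assigned to the true class
    probs = softmax(logits)
    return -np.log(probs[label])

logits = np.array([2.0, 0.5, 0.1])  # made-up class scores
probs = softmax(logits)
print(probs)                        # sums to 1, class 0 most likely
print(sparse_categorical_crossentropy(logits, label=0))
```

Note that with a sparse loss the labels stay as integer ids, exactly as `load_data` produces them; no one-hot encoding is needed.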
Finally, train the model and evaluate its performance.
```python
# Train the model
def train_model(X_train, y_train):
    model = build_model()
    model.fit(X_train, y_train, epochs=10, batch_size=64)
    return model

# Evaluate the model
def evaluate_model(model, X_test, y_test):
    loss, accuracy = model.evaluate(X_test, y_test)
    print('Test accuracy:', accuracy)
```
The complete code:
```python
import tensorflow as tf
import numpy as np
import pandas as pd
import jieba
from sklearn.model_selection import train_test_split
# Load the dataset
def load_data():
    # Read the CSV file
    df = pd.read_csv('thucnews.csv')
    # Tokenize with jieba, joining tokens with spaces
    df['content'] = df['content'].apply(lambda x: ' '.join(jieba.cut(x)))
    # Map labels to integer ids (sorted so the mapping is stable across runs)
    label2id = {label: idx for idx, label in enumerate(sorted(set(df['label'])))}
    df['label'] = df['label'].apply(lambda x: label2id[x])
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        df['content'], df['label'], test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test

# Preprocess the text
def preprocess(X_train, X_test):
    # Define the TextVectorization layer
    vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(
        max_tokens=5000, output_sequence_length=500)
    # Build the vocabulary from the training set only
    vectorizer.adapt(X_train.to_numpy())
    # Convert the training and test sets to integer sequences
    X_train = vectorizer(X_train.to_numpy())
    X_test = vectorizer(X_test.to_numpy())
    return X_train, X_test

# Define the LSTM model
def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=5000, output_dim=64),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(14, activation='softmax')
    ])
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

# Train the model
def train_model(X_train, y_train):
    model = build_model()
    model.fit(X_train, y_train, epochs=10, batch_size=64)
    return model

# Evaluate the model
def evaluate_model(model, X_test, y_test):
    loss, accuracy = model.evaluate(X_test, y_test)
    print('Test accuracy:', accuracy)

# Load the dataset
X_train, X_test, y_train, y_test = load_data()
# Preprocess the text
X_train, X_test = preprocess(X_train, X_test)
# Train the model
model = train_model(X_train, y_train)
# Evaluate the model
evaluate_model(model, X_test, y_test)
```
I hope this example helps!