Text classification with TensorFlow on https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/waimai_10k/waimai_10k.csv
Posted: 2023-10-03 22:07:54
Here is an example of text classification with TensorFlow.
First, install the required dependencies:
```python
!pip install tensorflow pandas scikit-learn jieba
```
Next, load the dataset and preprocess it:
```python
import pandas as pd
import jieba
from sklearn.model_selection import train_test_split

# Load the dataset (columns: label, review; label is already 0/1, 1 = positive)
df = pd.read_csv("https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/waimai_10k/waimai_10k.csv")

# Segment the Chinese reviews with jieba, joining tokens with spaces
df["cut_words"] = df["review"].apply(lambda x: " ".join(jieba.cut(x)))

# Split into training and test sets (80/20)
train_df, test_df = train_test_split(df[["cut_words", "label"]], test_size=0.2, random_state=42)
```
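`train_test_split` shuffles the rows with the given seed and then cuts them 80/20. For intuition, here is a minimal stdlib sketch of that behavior (the `split_80_20` helper is hypothetical, not part of scikit-learn):

```python
import random

def split_80_20(rows, seed=42):
    """Shuffle rows with a fixed seed, then split 80/20 --
    a simplified stand-in for sklearn's train_test_split."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]

train, test = split_80_20(range(10))
print(len(train), len(test))  # 8 2
```

Fixing `random_state` (here, the seed) makes the split reproducible across runs.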
Next, define the model:
```python
import tensorflow as tf

# Build the vectorization layer and fit its vocabulary on the training text
# (TextVectorization must be adapted before it can map tokens to ids)
vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=10000, output_sequence_length=256)
vectorize_layer.adapt(train_df["cut_words"].values)

# Define the model
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,
    tf.keras.layers.Embedding(10000, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile the model
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```
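Conceptually, the vectorization layer maps each space-separated token to an integer id from its learned vocabulary and pads or truncates the result to a fixed length. A simplified pure-Python stand-in (the `vectorize` function is illustrative, not the Keras API):

```python
def vectorize(text, vocab, seq_len):
    """Map whitespace-separated tokens to integer ids and pad/truncate
    to seq_len. Index 0 is padding; 1 is the out-of-vocabulary token,
    mirroring TextVectorization's reserved indices."""
    ids = [vocab.get(tok, 1) for tok in text.split()]
    ids = ids[:seq_len]
    return ids + [0] * (seq_len - len(ids))

vocab = {"味道": 2, "不错": 3, "送餐": 4, "慢": 5}
print(vectorize("味道 不错 但是 送餐 慢", vocab, 8))  # [2, 3, 1, 4, 5, 0, 0, 0]
```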
This is a simple text classification model: the vectorization layer maps tokens to integer ids, the embedding layer turns ids into dense vectors, global average pooling collapses the sequence into a single vector, and two dense layers produce a sigmoid probability for the positive class.
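The global average pooling step simply averages the embedding vectors across the sequence dimension, so variable-length reviews collapse into one fixed-size vector. A minimal sketch of that operation (the `global_average_pool` helper is illustrative, not the Keras layer):

```python
def global_average_pool(embeddings):
    """Average a list of equal-length embedding vectors element-wise,
    mirroring what GlobalAveragePooling1D does over the time dimension."""
    dim = len(embeddings[0])
    return [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dim)]

print(global_average_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```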
Finally, train and evaluate the model:
```python
# Train the model
history = model.fit(train_df["cut_words"].values, train_df["label"].values,
                    validation_split=0.2, epochs=5)
# Evaluate on the test set
model.evaluate(test_df["cut_words"].values, test_df["label"].values)
```
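At prediction time, the final `Dense(1, activation="sigmoid")` layer squashes a raw score into (0, 1), and scores at or above 0.5 are read as the positive class. A minimal sketch of that thresholding, using only the math stdlib:

```python
import math

def sigmoid(x):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# A positive raw score maps above 0.5 -> positive review;
# a negative raw score maps below 0.5 -> negative review.
print(sigmoid(1.2) >= 0.5)   # True
print(sigmoid(-1.2) >= 0.5)  # False
```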
The complete code:
```python
import pandas as pd
import jieba
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Load the dataset (columns: label, review; label is already 0/1, 1 = positive)
df = pd.read_csv("https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/waimai_10k/waimai_10k.csv")

# Segment the Chinese reviews with jieba, joining tokens with spaces
df["cut_words"] = df["review"].apply(lambda x: " ".join(jieba.cut(x)))

# Split into training and test sets (80/20)
train_df, test_df = train_test_split(df[["cut_words", "label"]], test_size=0.2, random_state=42)

# Build the vectorization layer and fit its vocabulary on the training text
vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=10000, output_sequence_length=256)
vectorize_layer.adapt(train_df["cut_words"].values)

# Define the model
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,
    tf.keras.layers.Embedding(10000, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile the model
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train the model
history = model.fit(train_df["cut_words"].values, train_df["label"].values,
                    validation_split=0.2, epochs=5)

# Evaluate on the test set
model.evaluate(test_df["cut_words"].values, test_df["label"].values)
```