PyTorch code for applying local word vectors to a simple sentiment classification task on a local dataset, with accuracy computation
Posted: 2024-02-13 20:05:34
Sure. Below is PyTorch code for a simple sentiment classification task that loads local word vectors and a dataset. The torchtext AG_NEWS dataset stands in for the local dataset; replace its loading step with your own `(label, text)` pairs as needed:
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator, Vectors

# Define the model
class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

# Load the dataset (AG_NEWS as a stand-in; substitute your local dataset here)
tokenizer = get_tokenizer('basic_english')
train_dataset = list(AG_NEWS(root='./.data', split='train'))
test_dataset = list(AG_NEWS(root='./.data', split='test'))

# Load the local word vectors
vectors = Vectors(name='path/to/your/embedding_file.txt')

# Build the vocabulary from the training texts
def yield_tokens(data):
    for _, text in data:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

# Hyperparameters
VOCAB_SIZE = len(vocab)
EMBED_DIM = vectors.dim  # must match the pretrained vectors' dimension
NUM_CLASS = len(set(label for label, _ in train_dataset))
BATCH_SIZE = 16
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Build the model, optimizer, and loss
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)
# Initialize the embedding layer with the pretrained local vectors
model.embedding.weight.data.copy_(vectors.get_vecs_by_tokens(vocab.get_itos()))
optimizer = optim.SGD(model.parameters(), lr=4.0)
criterion = nn.CrossEntropyLoss().to(device)

# Collate a batch into the flat-token/offsets form expected by EmbeddingBag
def collate_batch(batch):
    labels, token_ids, offsets = [], [], [0]
    for label, text in batch:
        labels.append(label - 1)  # AG_NEWS labels are 1..4; shift to 0..3
        ids = torch.tensor(vocab(tokenizer(text)), dtype=torch.int64)
        token_ids.append(ids)
        offsets.append(ids.size(0))
    labels = torch.tensor(labels, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    token_ids = torch.cat(token_ids)
    return labels.to(device), token_ids.to(device), offsets.to(device)

train_iter = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
valid_iter = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)

# Training and testing functions
def train_func(loader):
    model.train()
    total_loss, total_acc, total = 0.0, 0, 0
    for labels, text, offsets in loader:
        optimizer.zero_grad()
        output = model(text, offsets)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        total_acc += (output.argmax(1) == labels).sum().item()
        total += labels.size(0)
    return total_loss / len(loader), total_acc / total

def test(loader):
    model.eval()
    total_loss, total_acc, total = 0.0, 0, 0
    with torch.no_grad():
        for labels, text, offsets in loader:
            output = model(text, offsets)
            total_loss += criterion(output, labels).item()
            total_acc += (output.argmax(1) == labels).sum().item()
            total += labels.size(0)
    return total_loss / len(loader), total_acc / total

# Start training
N_EPOCHS = 5
for epoch in range(N_EPOCHS):
    train_loss, train_acc = train_func(train_iter)
    valid_loss, valid_acc = test(valid_iter)
    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
```
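A note on the embedding file: torchtext's `Vectors` reads a plain-text format with one token per line followed by its float components (the GloVe-style text format). As a rough illustration of that format, a minimal manual loader might look like the sketch below (the function name is illustrative, not part of any library):

```python
import torch

# Each line of the file holds a token followed by its vector components,
# e.g. "apple 0.1 -0.2 0.3" (GloVe-style text format).
def load_embeddings(path):
    stoi, rows = {}, []
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            stoi[parts[0]] = len(rows)
            rows.append([float(x) for x in parts[1:]])
    return stoi, torch.tensor(rows)
```

`stoi` maps each token to its row index and the tensor holds one vector per row, mirroring the `stoi`/`vectors` attributes that `Vectors` exposes.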
Notes on the code:
1. The `TextSentiment` model uses an `EmbeddingBag` layer and a `Linear` layer: `EmbeddingBag` pools each text's token embeddings into a single vector, and `Linear` maps that vector to class scores.
2. The `AG_NEWS` dataset from `torchtext.datasets` serves as the training and test sets; a local dataset can be swapped in by supplying your own `(label, text)` pairs.
3. The local word vectors are loaded with `Vectors`, the vocabulary is built with `build_vocab_from_iterator`, and the embedding layer is initialized from the pretrained vectors (so `EMBED_DIM` must equal the vectors' dimension).
4. The hyperparameters include vocabulary size, embedding dimension, number of classes, batch size, and device.
5. The model and optimizer are defined with a cross-entropy loss.
6. `collate_batch` flattens each batch's token ids into one tensor plus per-sample offsets, the input form `EmbeddingBag` expects; `train_func` and `test` train and evaluate the model, respectively.
7. The main loop batches both datasets with `DataLoader`, calls `train_func` and `test`, and prints each epoch's loss and accuracy.
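As a supplement to point 6, the offsets convention used by `EmbeddingBag` can be seen in a small standalone sketch with toy sizes (not tied to the training code above):

```python
import torch
import torch.nn as nn

# EmbeddingBag pools the embeddings of each sequence in a flat token tensor;
# offsets marks where each sequence starts, and the default pooling is "mean".
torch.manual_seed(0)
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4)  # toy vocab of 10 tokens

# Two sequences, [1, 2, 3] and [4, 5], flattened into one tensor
text = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])  # sequence starts at positions 0 and 3

pooled = bag(text, offsets)  # shape (2, 4): one pooled vector per sequence
manual = bag.weight[torch.tensor([1, 2, 3])].mean(dim=0)  # mean of first sequence
print(torch.allclose(pooled[0], manual))  # prints: True
```

This is why `collate_batch` returns a flat token tensor plus cumulative offsets instead of a padded 2D batch.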