Writing natural-language classification code with a CNN in PyTorch
Below is example code that implements text classification with a CNN in PyTorch:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
class CNNTextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # One 2D convolution per filter size; each kernel spans the full embedding width
        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(fs, embedding_dim))
            for fs in filter_sizes
        ])
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        embedded = self.embedding(text)   # [batch_size, sent_len, emb_dim]
        embedded = embedded.unsqueeze(1)  # [batch_size, 1, sent_len, emb_dim]
        conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
        # [(batch_size, n_filters, sent_len - filter_sizes[n] + 1), ...] * len(filter_sizes)
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        # [(batch_size, n_filters), ...] * len(filter_sizes)
        cat = self.dropout(torch.cat(pooled, dim=1))  # [batch_size, len(filter_sizes) * n_filters]
        return self.fc(cat)  # [batch_size, output_dim]
```
This model uses a convolutional neural network (CNN) to process text. `nn.Embedding` first maps the input sequence of token IDs to word vectors, and the result is then reshaped into a 4D tensor of shape (batch_size, 1, sent_len, emb_dim), where `sent_len` is the length of the input sequence and `emb_dim` is the dimensionality of the word vectors.
`nn.Conv2d` is then applied to this 4D tensor to produce a set of convolutional feature maps. Because `nn.Conv2d` expects input of shape (batch_size, channels, height, width), `unsqueeze(1)` inserts a channel dimension of size 1; each kernel of shape (filter_size, emb_dim) then slides over the (sent_len, emb_dim) plane, i.e. over the embeddings of `filter_size` consecutive tokens at a time. `nn.ModuleList` is a PyTorch container that registers the parameters of all the convolutions with the model.
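As a quick sanity check of these shapes, here is a minimal sketch of one convolution branch (the batch size, sentence length, and filter size below are illustrative assumptions, not values from the model above):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 1, 20, 100)                    # [batch, 1, sent_len, emb_dim]
conv = nn.Conv2d(in_channels=1, out_channels=100,
                 kernel_size=(3, 100))            # filter_size=3, emb_dim=100
print(conv(x).shape)                              # torch.Size([8, 100, 18, 1])
```

The trailing dimension is 1 because the kernel spans the full embedding width, which is what makes the `.squeeze(3)` in `forward` valid.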
Next, `F.relu` applies a non-linearity to each feature map and `F.max_pool1d` max-pools each map over the time dimension, keeping only the strongest response per filter. The pooled features from all filter sizes are concatenated and passed through a fully connected layer to produce the class scores. `nn.Dropout` randomly zeroes a fraction of the activations during training, which helps prevent overfitting.
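Putting it all together, a short smoke test of the full forward pass, using the class defined above with assumed hyperparameters and random token IDs:

```python
# Quick smoke test: assumed vocabulary size and batch shape, no training
demo_model = CNNTextClassifier(vocab_size=5000, embedding_dim=100,
                               n_filters=100, filter_sizes=[3, 4, 5],
                               output_dim=1, dropout=0.5)
dummy = torch.randint(0, 5000, (8, 20))  # batch of 8 sequences of 20 token IDs
print(demo_model(dummy).shape)           # torch.Size([8, 1])
```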
Below is example code for training the model:
```python
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        text, labels = batch
        optimizer.zero_grad()
        predictions = model(text).squeeze(1)
        loss = criterion(predictions, labels)
        acc = binary_accuracy(predictions, labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text, labels = batch
            predictions = model(text).squeeze(1)
            loss = criterion(predictions, labels)
            acc = binary_accuracy(predictions, labels)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def binary_accuracy(preds, y):
    # Threshold the sigmoid of the logits at 0.5 and compare with the labels
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc

N_EPOCHS = 10
best_valid_loss = float('inf')

# len(TEXT.vocab) assumes a torchtext-style vocabulary; train_iterator and
# valid_iterator are assumed to yield (text, labels) batches of numericalized text
model = CNNTextClassifier(vocab_size=len(TEXT.vocab),
                          embedding_dim=100,
                          n_filters=100,
                          filter_sizes=[3, 4, 5],
                          output_dim=1,
                          dropout=0.5)
optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
```
During training we use `nn.BCEWithLogitsLoss` as the loss function, a standard choice for binary classification; it combines a sigmoid with binary cross-entropy in a numerically stable way, which is why the model outputs raw logits. The `binary_accuracy` helper computes classification accuracy by passing the logits through a sigmoid and rounding to 0 or 1.
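A quick way to see what `nn.BCEWithLogitsLoss` does (the logits and labels here are made-up values for illustration):

```python
import torch
import torch.nn as nn

logits = torch.tensor([0.8, -1.2])   # made-up raw model outputs
labels = torch.tensor([1.0, 0.0])
print(nn.BCEWithLogitsLoss()(logits, labels))       # applies sigmoid internally
print(nn.BCELoss()(torch.sigmoid(logits), labels))  # same value, computed in two steps
```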
After each epoch, the model's loss and accuracy are computed on the validation set, and the parameters of the best-performing model (lowest validation loss) are saved to `tut3-model.pt`.
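To use the saved checkpoint for prediction, something like the following sketch works; `predict_sentiment` is a hypothetical helper, and `TEXT.vocab.stoi` is assumed to be the token-to-index mapping from the same torchtext pipeline used for training:

```python
model.load_state_dict(torch.load('tut3-model.pt'))
model.eval()

def predict_sentiment(model, tokens, min_len=5):
    # Pad so the sequence is at least as long as the largest filter size
    tokens = tokens + ['<pad>'] * max(0, min_len - len(tokens))
    indices = [TEXT.vocab.stoi[t] for t in tokens]       # assumed torchtext vocab
    tensor = torch.LongTensor(indices).unsqueeze(0)      # [1, sent_len]
    with torch.no_grad():
        return torch.sigmoid(model(tensor)).item()

print(predict_sentiment(model, ['this', 'film', 'is', 'great']))  # probability in [0, 1]
```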