COVID-19 Similar Question Pair Classification Dataset: Code
Deciding whether two COVID-19-related questions are similar is a natural language processing task that requires a machine learning model and a labeled dataset. Below is example code for training and evaluating a BERT-based similar question pair classifier.
1. Data preparation
First, prepare training and test data. The dataset can come from a public source or be built by hand. This example uses LCQMC, a Chinese sentence-pair matching corpus released by Harbin Institute of Technology (Shenzhen); it is general-domain, but the same pipeline applies unchanged to a COVID-19 question-pair dataset with the same column layout. The dataset can be downloaded from: https://github.com/PaddlePaddle/ERNIE/blob/develop/doc/sentence_pair_similarity/lcqmc/lcqmc.zip
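The exact file layout depends on the copy you download. As a rough sketch, assuming the splits arrive as headerless tab-separated files (train.tsv, dev.tsv, test.tsv) with columns sentence1, sentence2, and label, they can be converted to the CSV layout the code below expects:
```python
import pandas as pd

# Assumed layout: headerless TSV with sentence1 <TAB> sentence2 <TAB> label,
# where label is 1 for similar pairs and 0 otherwise. Adjust the file names
# and column order to match your actual download.
for split in ['train', 'dev', 'test']:
    df = pd.read_csv(f'{split}.tsv', sep='\t',
                     names=['sentence1', 'sentence2', 'label'])
    df.to_csv(f'{split}.csv', index=False)
```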
2. Model construction
We use BertModel and BertTokenizer from the transformers library to build the BERT model.
```python
import torch
from transformers import BertTokenizer, BertModel

# Load the pretrained Chinese BERT encoder and its tokenizer.
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
bert = BertModel.from_pretrained('bert-base-chinese')
```
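To see what the tokenizer produces for a sentence pair, you can run a quick check (the example sentences here are made up for illustration):
```python
# BERT encodes a pair as [CLS] s1 [SEP] s2 [SEP]; token_type_ids marks
# which segment each token belongs to (0 for s1, 1 for s2).
encoded = tokenizer('新冠病毒会人传人吗', '新冠肺炎是否人传人',
                    return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(encoded['input_ids'][0]))
print(encoded['token_type_ids'][0])  # zeros, then ones
```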
The following code preprocesses the dataset, converting the text into the input format BERT expects.
```python
import pandas as pd

df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

def preprocess(df):
    """Tokenize sentence pairs into BERT inputs and collect labels."""
    sentences1 = df['sentence1'].tolist()
    sentences2 = df['sentence2'].tolist()
    labels = df['label'].tolist()
    # Passing two lists makes the tokenizer build [CLS] s1 [SEP] s2 [SEP]
    # pairs, padded and truncated to a fixed maximum length.
    inputs = tokenizer(sentences1, sentences2,
                       padding=True, truncation=True,
                       max_length=128, return_tensors='pt')
    labels = torch.tensor(labels)
    return inputs, labels

train_inputs, train_labels = preprocess(df_train)
test_inputs, test_labels = preprocess(df_test)
```
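An optional sanity check that the tensors line up before building the model:
```python
# All input tensors and the label tensor should share the same first
# dimension (the number of sentence pairs).
print(train_inputs['input_ids'].shape)   # (num_pairs, seq_len <= 128)
print(train_labels.shape)                # (num_pairs,)
assert train_inputs['input_ids'].shape[0] == train_labels.shape[0]
```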
Next, we define a sentence-pair classifier on top of the BERT model.
```python
import torch.nn as nn

class SentencePairClassifier(nn.Module):
    def __init__(self, bert):
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(0.1)
        # bert-base hidden size is 768; two output classes (similar / not).
        self.linear = nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        # pooler_output is the [CLS] representation after BERT's pooling layer.
        pooled_output = outputs.pooler_output
        pooled_output = self.dropout(pooled_output)
        logits = self.linear(pooled_output)
        return logits

# Wrap the pretrained encoder in the classification head.
model = SentencePairClassifier(bert)
```
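Before training, a small smoke test confirms the classifier returns one pair of logits per example:
```python
# Run two examples through the untrained classifier (still on CPU).
with torch.no_grad():
    logits = model(train_inputs['input_ids'][:2],
                   train_inputs['attention_mask'][:2],
                   train_inputs['token_type_ids'][:2])
print(logits.shape)  # expected: torch.Size([2, 2])
```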
3. Model training
We train the model using PyTorch's Adam optimizer and the cross-entropy loss.
```python
from torch.utils.data import DataLoader, TensorDataset

batch_size = 32
train_dataset = TensorDataset(train_inputs['input_ids'],
                              train_inputs['attention_mask'],
                              train_inputs['token_type_ids'],
                              train_labels)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

num_epochs = 10  # 2 to 4 epochs is often enough when fine-tuning BERT
for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    train_acc = 0
    for input_ids, attention_mask, token_type_ids, labels in train_loader:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        token_type_ids = token_type_ids.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask, token_type_ids)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        train_acc += (outputs.argmax(1) == labels).sum().item()
    # Average the loss over batches and the accuracy over examples.
    train_loss /= len(train_loader)
    train_acc /= len(train_dataset)
    print('Epoch {}/{}, Loss: {:.4f}, Accuracy: {:.4f}'.format(
        epoch + 1, num_epochs, train_loss, train_acc))
```
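To avoid retraining before every evaluation or deployment, you may want to persist the fine-tuned weights; the file name below is arbitrary:
```python
# Save only the state dict (weights), the usual PyTorch convention.
torch.save(model.state_dict(), 'sentence_pair_classifier.pt')
# Later: model.load_state_dict(torch.load('sentence_pair_classifier.pt'))
```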
4. Model evaluation
We evaluate the model's performance on the test set.
```python
test_dataset = TensorDataset(test_inputs['input_ids'],
                             test_inputs['attention_mask'],
                             test_inputs['token_type_ids'],
                             test_labels)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

model.eval()
test_loss = 0
test_acc = 0
with torch.no_grad():  # no gradients needed for evaluation
    for input_ids, attention_mask, token_type_ids, labels in test_loader:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        token_type_ids = token_type_ids.to(device)
        labels = labels.to(device)
        outputs = model(input_ids, attention_mask, token_type_ids)
        loss = loss_fn(outputs, labels)
        test_loss += loss.item()
        test_acc += (outputs.argmax(1) == labels).sum().item()
test_loss /= len(test_loader)
test_acc /= len(test_dataset)
print('Test Loss: {:.4f}, Test Accuracy: {:.4f}'.format(test_loss, test_acc))
```
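For completeness, here is a minimal inference sketch, not part of the pipeline above; the helper name and example sentences are illustrative:
```python
def predict_similarity(sentence1, sentence2):
    """Return 1 if the model judges the two questions similar, else 0."""
    model.eval()
    inputs = tokenizer(sentence1, sentence2, truncation=True,
                       max_length=128, return_tensors='pt')
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(inputs['input_ids'], inputs['attention_mask'],
                       inputs['token_type_ids'])
    return logits.argmax(1).item()

print(predict_similarity('新冠病毒会人传人吗', '新冠肺炎是否人传人'))
```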
With that, we have trained and evaluated a BERT-based similar question pair classifier.