BERT fine-tuning code for a binary classification task
Below is a code example that fine-tunes BERT for a binary classification task using PyTorch:
```python
import numpy as np
import torch
from torch.optim import AdamW  # PyTorch's AdamW (the transformers copy is deprecated)
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup

# Basic hyperparameters
MAX_LEN = 128
BATCH_SIZE = 32
EPOCHS = 4
LEARNING_RATE = 2e-5

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the data
train_texts = [...]   # training texts (list of str)
train_labels = [...]  # training labels (list of 0/1)
test_texts = [...]    # test texts
test_labels = [...]   # test labels

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Encode the training and test sets
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=MAX_LEN)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=MAX_LEN)

# Convert the encodings and labels to PyTorch tensors
train_labels = torch.tensor(train_labels)
test_labels = torch.tensor(test_labels)
train_dataset = TensorDataset(torch.tensor(train_encodings['input_ids']),
                              torch.tensor(train_encodings['attention_mask']),
                              train_labels)
test_dataset = TensorDataset(torch.tensor(test_encodings['input_ids']),
                             torch.tensor(test_encodings['attention_mask']),
                             test_labels)

# Create the data loaders
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=BATCH_SIZE)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=BATCH_SIZE)

# Load the BERT model with a binary classification head and move it to the device
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.to(device)

# Optimizer and learning-rate scheduler
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, eps=1e-8)
total_steps = len(train_dataloader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

def accuracy(logits, labels):
    """Return the number of correct predictions in a batch."""
    preds = np.argmax(logits, axis=1)
    return np.sum(preds == labels)

# Fine-tune the BERT model
for epoch in range(EPOCHS):
    model.train()
    for step, batch in enumerate(train_dataloader):
        # Move the batch to the target device
        batch = tuple(t.to(device) for t in batch)
        inputs = {'input_ids': batch[0], 'attention_mask': batch[1], 'labels': batch[2]}
        outputs = model(**inputs)
        loss = outputs[0]
        loss.backward()
        # Clip gradients to avoid exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

    # Evaluate on the test set at the end of each epoch
    model.eval()
    test_accuracy, nb_test_examples = 0, 0
    for batch in test_dataloader:
        batch = tuple(t.to(device) for t in batch)
        inputs = {'input_ids': batch[0], 'attention_mask': batch[1], 'labels': batch[2]}
        with torch.no_grad():
            outputs = model(**inputs)
        logits = outputs[1].detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        test_accuracy += accuracy(logits, label_ids)
        nb_test_examples += inputs['input_ids'].size(0)
    test_accuracy = test_accuracy / nb_test_examples
    print('Epoch: {}, Test Accuracy: {:.4f}'.format(epoch, test_accuracy))
```
Here, `train_texts` and `train_labels` are the texts and labels of the training set, and `test_texts` and `test_labels` are those of the test set. The `tokenizer` converts raw text into BERT's input format, `BertForSequenceClassification` is a BERT model with a classification head on top, `AdamW` is the optimizer used to update the model, and `get_linear_schedule_with_warmup` provides a linear learning-rate schedule with warmup. During training, gradient clipping is applied to avoid exploding gradients, and at the end of each epoch the model is evaluated on the test set.
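To use the fine-tuned model on new text, you can reuse the same `tokenizer` and `model` from the code above. The following is a minimal sketch; the input sentence and the save directory `./bert-binary-classifier` are illustrative placeholders, not part of the original example:

```python
import torch

# Example sentence (placeholder; substitute your own text)
sentence = "This movie was surprisingly good."

# Tokenize a single sentence and move the tensors to the same device as the model
inputs = tokenizer(sentence, truncation=True, max_length=MAX_LEN, return_tensors='pt')
inputs = {k: v.to(device) for k, v in inputs.items()}

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, 2)
    probs = torch.softmax(logits, dim=-1)      # class probabilities
    pred = torch.argmax(probs, dim=-1).item()  # predicted label: 0 or 1

print(f'Predicted label: {pred}, probabilities: {probs.squeeze().tolist()}')

# Optionally save the fine-tuned model and tokenizer for later reuse
# (the directory name is an arbitrary example)
model.save_pretrained('./bert-binary-classifier')
tokenizer.save_pretrained('./bert-binary-classifier')
```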