Please write detailed Python code that uses GPT-2 for text classification and validates the model's accuracy
Sure. Below is detailed Python code that uses GPT-2 for text classification and evaluates the model's accuracy.
First, we need to install the required Python libraries and the model. Here we use Hugging Face's transformers library together with the pretrained GPT-2 model.
```python
!pip install transformers torch scikit-learn
```
```python
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification
import torch
from torch.optim import AdamW  # the AdamW exported by transformers is deprecated
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score
```
Next, we define a dataset class. Here we use a simple example text-classification dataset with two classes, "positive" and "negative".
```python
class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        text = str(self.texts[item])
        label = self.labels[item]
        # Tokenize, pad/truncate to max_len, and return PyTorch tensors.
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'text': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }
```
Next, we load the training and test data and wrap them in the dataset class.
```python
train_texts = ['This is a positive text.', 'This is a negative text.', 'Another positive text.', 'Another negative text.']
train_labels = [1, 0, 1, 0]
test_texts = ['This is a positive test text.', 'This is a negative test text.']
test_labels = [1, 0]
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT-2 has no pad token by default, so reuse the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token

max_len = 32
train_dataset = TextClassificationDataset(train_texts, train_labels, tokenizer, max_len)
test_dataset = TextClassificationDataset(test_texts, test_labels, tokenizer, max_len)
```
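Before training, it can be useful to inspect a single item from the dataset to confirm the encoding behaves as expected. This is a small optional sanity check, not part of the original walkthrough; the shapes in the comments assume max_len = 32 as set above.
```python
# Optional sanity check: inspect one encoded training example.
sample = train_dataset[0]
print(sample['text'])                 # original string
print(sample['input_ids'].shape)      # torch.Size([32]) -> padded to max_len
print(sample['attention_mask'].shape) # torch.Size([32])
print(sample['label'])                # tensor(1)
```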
Next, we define the model and the training loop, evaluating accuracy on the test set after each epoch.
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

model = GPT2ForSequenceClassification.from_pretrained('gpt2', num_labels=2)
# GPT2ForSequenceClassification uses the last non-padding token of each
# sequence for classification, so it must know the pad token id.
model.config.pad_token_id = tokenizer.pad_token_id
model.to(device)

train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=2, shuffle=False)

num_epochs = 5
optimizer = AdamW(model.parameters(), lr=1e-5)

for epoch in range(num_epochs):
    # Training phase
    model.train()
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Evaluation phase: compute accuracy on the test set
    model.eval()
    predictions = []
    targets = []
    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            predictions.extend(torch.argmax(logits, dim=-1).tolist())
            targets.extend(labels.tolist())

    accuracy = accuracy_score(targets, predictions)
    print(f'Epoch {epoch + 1}/{num_epochs}, Accuracy: {accuracy:.2f}')
```
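After training, you will usually want to save the fine-tuned model and classify new text with it. The snippet below is a minimal sketch building on the model, tokenizer, device, and max_len defined above; the output directory name gpt2-text-classifier and the predict helper are illustrative choices, not part of the original code.
```python
# Save the fine-tuned model and tokenizer (directory name is arbitrary).
output_dir = 'gpt2-text-classifier'
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Classify a new piece of text with the trained model.
def predict(text):
    encoding = tokenizer(
        text,
        padding='max_length',
        truncation=True,
        max_length=max_len,
        return_tensors='pt',
    )
    with torch.no_grad():
        outputs = model(
            encoding['input_ids'].to(device),
            attention_mask=encoding['attention_mask'].to(device),
        )
    return torch.argmax(outputs.logits, dim=-1).item()

print(predict('What a wonderful day!'))  # 1 -> positive, 0 -> negative
```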
That is the complete Python code for fine-tuning GPT-2 on a text-classification task and validating the model's accuracy. Note that the example dataset here is tiny, so the reported accuracy only demonstrates the workflow; meaningful results require a larger labeled dataset.