How to prepare and load CSV data for BERT training in PyTorch?
Date: 2024-10-18 15:18:20
Training BERT or another Transformer-based model in PyTorch typically involves the following steps to prepare and load CSV data:
1. Import the required libraries:
```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import BertTokenizerFast, BertForSequenceClassification
```
2. Define a custom Dataset class:
```python
class CsvDataset(Dataset):
    def __init__(self, csv_path, tokenizer, max_len=512):
        self.data = pd.read_csv(csv_path)
        self.labels = self.data['label_column'].tolist()
        self.texts = self.data['text_column'].tolist()
        # Tokenize, pad, and truncate once up front, rather than
        # re-encoding the entire file on every __getitem__ call.
        self.encodings = tokenizer(
            self.texts,
            padding='max_length',
            truncation=True,
            max_length=max_len,
            return_tensors='pt'
        )

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item
```
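The class above assumes a CSV file with columns named `text_column` and `label_column`. A minimal sketch of what such a file looks like (the file name and rows here are made up for illustration):

```python
import csv
import pandas as pd

# Hypothetical two-column file matching the column names the
# CsvDataset class reads ('text_column' and 'label_column').
rows = [
    ("这部电影很好看", 1),
    ("剧情太拖沓了", 0),
]
with open("sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text_column", "label_column"])
    writer.writerows(rows)

df = pd.read_csv("sample.csv")
print(df.shape)  # (2, 2)
```

Rename the columns in the Dataset class (or in the CSV) so the two sides agree.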
3. Initialize the tokenizer and model:
```python
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=num_classes)
```
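`num_labels` must equal the number of distinct classes in your data. If the CSV's label column holds strings rather than integers, map them to class ids first; a minimal sketch (the label values here are made up for illustration):

```python
# Hypothetical string labels as they might appear in a CSV column.
raw_labels = ["positive", "negative", "neutral", "positive"]

# Build a stable string-to-id mapping from the observed label set.
label2id = {label: i for i, label in enumerate(sorted(set(raw_labels)))}
id2label = {i: label for label, i in label2id.items()}
label_ids = [label2id[label] for label in raw_labels]
num_classes = len(label2id)

print(num_classes)  # 3
print(label_ids)    # [2, 0, 1, 2]
```

The resulting `label_ids` are what the Dataset should convert to tensors, and `num_classes` is what gets passed as `num_labels`.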
4. Create the DataLoader:
```python
dataloader = DataLoader(
    CsvDataset(csv_file, tokenizer, max_len=max_len),
    batch_size=batch_size,
    shuffle=True,   # shuffle for training; set False for evaluation
    drop_last=True
)
```
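How `batch_size` and `drop_last` interact can be seen with a toy dataset, no BERT required: 10 samples with `batch_size=4` yield two full batches, and the trailing partial batch of 2 is discarded because `drop_last=True`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 toy samples; DataLoader semantics are the same as for CsvDataset.
toy = TensorDataset(torch.arange(10))
loader = DataLoader(toy, batch_size=4, shuffle=False, drop_last=True)

batch_sizes = [batch[0].shape[0] for batch in loader]
print(batch_sizes)  # [4, 4]
```

With `drop_last=False` the loader would instead yield batches of 4, 4, and 2.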
5. Run the training loop:
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = AdamW(model.parameters(), lr=learning_rate)  # torch.optim.AdamW

model.train()
for epoch in range(num_epochs):
    for batch in dataloader:
        inputs = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**inputs)  # the 'labels' key makes the model return a loss
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```
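The backward/step/zero_grad cycle in the loop above can be exercised on a toy problem without downloading BERT. A minimal stand-in that fits y = 2x with a single linear layer, using the same `torch.optim.AdamW` optimizer:

```python
import torch
from torch.optim import AdamW

torch.manual_seed(0)
x = torch.linspace(-1, 1, 32).unsqueeze(1)
y = 2.0 * x

model = torch.nn.Linear(1, 1)
optimizer = AdamW(model.parameters(), lr=0.05)

losses = []
for _ in range(100):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()       # accumulate gradients
    optimizer.step()      # update parameters
    optimizer.zero_grad() # clear gradients for the next step
    losses.append(loss.item())
```

The loss decreases over the 100 steps, confirming the update cycle is wired correctly; the BERT loop follows the same pattern with `outputs.loss` in place of the MSE loss.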
Remember to adjust the relevant parameters for your project, such as `num_classes`, `batch_size`, `max_len`, `num_epochs`, and `learning_rate`.