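Implement a seq2seq machine translation model with the PyTorch deep learning framework. The dataset consists of 10,000 Chinese-English sentence pairs, and the output is the model's BLEU score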
Time: 2024-12-23 07:16:21
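To implement a Seq2Seq machine translation model in PyTorch, you first need a dataset of 10,000 Chinese-English sentence pairs. Preprocessing typically involves tokenizing each sentence, building a vocabulary per language, and encoding the tokens as sequences of integer ids. After that, you can proceed as follows: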
1. Import the required libraries:
```python
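import sacrebleu
import torch
import torch.nn as nn
# Field and BucketIterator belong to the legacy torchtext API: torchtext.data up to 0.8,
# torchtext.legacy.data in 0.9-0.11, removed in 0.12.
from torchtext.data import Field, BucketIterator
from torchtext.datasets import Multi30k
from transformers import EncoderDecoderModel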
```
2. Define the data fields and load the dataset:
```python
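from torchtext.data import TabularDataset  # legacy torchtext API, like Field
# batch_first=True yields (batch, seq_len) tensors, which the slicing in the training loop assumes.
src_field = Field(tokenize='spacy', tokenizer_language='zh_core_web_sm', init_token='<sos>', eos_token='<eos>', batch_first=True)
trg_field = Field(tokenize='spacy', tokenizer_language='en_core_web_sm', init_token='<sos>', eos_token='<eos>', batch_first=True)
# Multi30k is a German-English corpus, so load the Chinese-English pairs from your own TSV files (hypothetical names).
train_data, valid_data, test_data = TabularDataset.splits(path='data', train='train.tsv', validation='valid.tsv', test='test.tsv', format='tsv', fields=[('src', src_field), ('trg', trg_field)])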
```
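If the 10,000 pairs are stored in a single tab-separated file, they have to be split into training, validation, and test sets before loading. The following is a minimal standard-library sketch of that step. The helper names `parse_tsv_lines` and `split_pairs`, the 80/10/10 ratio, the fixed seed, and the generated toy lines are all assumptions for illustration.

```python
import random

def parse_tsv_lines(lines):
    """Parse 'chinese<TAB>english' lines, skipping blank or malformed ones."""
    pairs = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and parts[0] and parts[1]:
            pairs.append((parts[0], parts[1]))
    return pairs

def split_pairs(pairs, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (source, target) pairs and split them into train/valid/test lists."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the split reproducible
    n_valid = int(len(pairs) * valid_frac)
    n_test = int(len(pairs) * test_frac)
    valid = pairs[:n_valid]
    test = pairs[n_valid:n_valid + n_test]
    train = pairs[n_valid + n_test:]
    return train, valid, test

# Toy stand-in for the contents of the real file.
demo_lines = [f"句子{i}\tsentence {i}\n" for i in range(10000)]
train, valid, test = split_pairs(parse_tsv_lines(demo_lines))
print(len(train), len(valid), len(test))
```

Each split can then be written to its own file so that the test sentences never influence the vocabulary or the training loss.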
3. Build the vocabularies and create the iterators:
```python
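src_field.build_vocab(train_data, min_freq=2)
trg_field.build_vocab(train_data, min_freq=2)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# sort_key lets BucketIterator batch sentences of similar length together, which reduces padding.
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=16, sort_key=lambda ex: len(ex.src), device=device)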
```
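To make the preprocessing concrete, here is a small pure-Python sketch of what vocabulary building, numericalization, and padding amount to. It is not torchtext's implementation. The function names and the toy sentences are assumptions, and the order of the special tokens simply mirrors the common `<unk>`, `<pad>`, `<sos>`, `<eos>` layout.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]

def build_vocab(token_lists, min_freq=2):
    """Map tokens seen at least min_freq times to ids; special tokens come first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by descending frequency, then alphabetically, for a deterministic order.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = SPECIALS + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi):
    """Wrap a sentence in <sos>/<eos> and replace out-of-vocabulary tokens with <unk>."""
    unk = stoi["<unk>"]
    return [stoi["<sos>"]] + [stoi.get(t, unk) for t in tokens] + [stoi["<eos>"]]

def pad_batch(sequences, pad_id):
    """Right-pad every id sequence to the length of the longest one."""
    width = max(len(s) for s in sequences)
    return [s + [pad_id] * (width - len(s)) for s in sequences]

sentences = [["i", "like", "tea"], ["i", "like", "green", "tea"], ["hello"]]
stoi, itos = build_vocab(sentences, min_freq=2)
batch = pad_batch([numericalize(s, stoi) for s in sentences], stoi["<pad>"])
print(itos)
print(batch)
```

With `min_freq=2`, the words that occur only once ("green" and "hello") fall back to `<unk>`, which is the same trade-off the `min_freq` argument controls in step 3.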
4. Initialize the BERT encoder and decoder, then train the model:
```python
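from transformers import BertConfig, EncoderDecoderConfig

# Pretrained BERT checkpoints expect BertTokenizer ids, which do not match the torchtext
# vocabularies built above. This sketch therefore sizes a small BERT-style encoder-decoder
# to those vocabularies and trains it from scratch (layer sizes are illustrative assumptions).
def small_bert(vocab):
    return BertConfig(vocab_size=len(vocab), pad_token_id=vocab.stoi['<pad>'], hidden_size=256,
                      num_hidden_layers=2, num_attention_heads=4, intermediate_size=512)

config = EncoderDecoderConfig.from_encoder_decoder_configs(small_bert(src_field.vocab), small_bert(trg_field.vocab))
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = EncoderDecoderModel(config=config).to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
src_pad = src_field.vocab.stoi['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index=trg_field.vocab.stoi['<pad>'])
num_epochs = 10  # assumed value; tune it on the validation set

# Training loop (tensors are batch-first: (batch, seq_len))
for epoch in range(num_epochs):
    for batch in train_iterator:
        src, trg = batch.src, batch.trg
        optimizer.zero_grad()
        # Teacher forcing: feed trg[:, :-1] to the decoder and predict trg[:, 1:].
        logits = model(input_ids=src, attention_mask=(src != src_pad).long(),
                       decoder_input_ids=trg[:, :-1]).logits
        loss = criterion(logits.reshape(-1, logits.shape[-1]), trg[:, 1:].reshape(-1))
        loss.backward()
        optimizer.step()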
```
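For a corpus of only 10,000 pairs, a compact recurrent encoder-decoder written directly in PyTorch is a common alternative to a BERT-based model. The following is a minimal sketch of that idea with teacher forcing, not the method used above. The class name `Seq2Seq`, the layer sizes, the pad id of 1, and the random toy batch are assumptions, and the model omits attention and sequence packing for brevity.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """GRU encoder-decoder without attention (hypothetical minimal model)."""
    def __init__(self, src_vocab, trg_vocab, emb=32, hid=64, pad_id=1):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb, padding_idx=pad_id)
        self.trg_emb = nn.Embedding(trg_vocab, emb, padding_idx=pad_id)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, trg_vocab)

    def forward(self, src, trg_in):
        # The final encoder state summarizes the source sentence.
        _, state = self.encoder(self.src_emb(src))
        # Teacher forcing: the decoder reads the gold target prefix trg_in.
        dec_out, _ = self.decoder(self.trg_emb(trg_in), state)
        return self.out(dec_out)  # (batch, trg_len - 1, trg_vocab)

torch.manual_seed(0)
PAD = 1
model = Seq2Seq(src_vocab=50, trg_vocab=40, pad_id=PAD)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

# Random toy batch standing in for real (batch, seq_len) id tensors.
src = torch.randint(4, 50, (8, 7))
trg = torch.randint(4, 40, (8, 6))

losses = []
for step in range(20):
    optimizer.zero_grad()
    logits = model(src, trg[:, :-1])            # decoder inputs: all but the last token
    loss = criterion(logits.reshape(-1, logits.size(-1)),
                     trg[:, 1:].reshape(-1))    # labels: all but the first token
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The one-token shift between decoder inputs and labels is the same in both models: at every position the decoder sees the correct previous tokens and is trained to predict the next one.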
5. Compute the BLEU score on the test set:
```python
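import sacrebleu

def evaluate(model, iterator, field):
    """Return the corpus-level BLEU score of the model's translations over the iterator."""
    model.eval()
    stoi, itos = field.vocab.stoi, field.vocab.itos
    special = {stoi['<sos>'], stoi['<eos>'], stoi['<pad>']}

    def to_text(ids):
        return ' '.join(itos[i] for i in ids if i not in special)

    hypotheses, references = [], []
    with torch.no_grad():
        for batch in iterator:
            # torchtext assigns '<pad>' the same index in both vocabularies, so it also masks the source.
            output = model.generate(batch.src, attention_mask=(batch.src != stoi['<pad>']).long(),
                                    decoder_start_token_id=stoi['<sos>'], eos_token_id=stoi['<eos>'],
                                    pad_token_id=stoi['<pad>'], max_length=64)
            hypotheses += [to_text(seq) for seq in output.tolist()]
            references += [to_text(seq) for seq in batch.trg.tolist()]
    # BLEU is a corpus-level metric: score all sentences at once instead of averaging per-batch scores.
    # sacrebleu takes a list of hypothesis strings and a list of reference streams.
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

test_bleu = evaluate(model, test_iterator, trg_field)
print(f"Test BLEU Score: {test_bleu:.2f}")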
```
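To see what the reported number measures, the sketch below computes an unsmoothed corpus BLEU by hand: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is a simplified illustration and not sacrebleu's implementation. It assumes pre-tokenized input and one reference per hypothesis, and the function names are hypothetical, so its scores can differ from sacrebleu's.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

reference = "the cat sat on the mat".split()
exact = corpus_bleu([reference], [reference])
close = corpus_bleu(["the cat sat on a mat".split()], [reference])
print(f"exact match: {exact:.2f}, one word changed: {close:.2f}")
```

Because the counts are accumulated over the whole test set before the precisions are computed, the result is a single corpus-level score rather than an average of sentence scores.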