
WEIGHTED TRANSFORMER NETWORK FOR
MACHINE TRANSLATION
Karim Ahmed, Nitish Shirish Keskar & Richard Socher
Salesforce Research
Palo Alto, CA 94103, USA
{karim.ahmed,nkeskar,rsocher}@salesforce.com
ABSTRACT
State-of-the-art results on neural machine translation are often obtained with attentional
sequence-to-sequence models that rely on some form of convolution or recurrence.
Vaswani et al. (2017) propose a new architecture that avoids recurrence and con-
volution completely. Instead, it uses only self-attention and feed-forward layers.
While the proposed architecture achieves state-of-the-art results on several ma-
chine translation tasks, it requires a large number of parameters and training iter-
ations to converge. We propose Weighted Transformer, a Transformer with mod-
ified attention layers, that not only outperforms the baseline network in BLEU
score but also converges 15-40% faster. Specifically, we replace the multi-head
attention by multiple self-attention branches that the model learns to combine dur-
ing the training process. Our model improves the state-of-the-art performance by
0.5 BLEU points on the WMT 2014 English-to-German translation task and by
0.4 on the English-to-French translation task.
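As a high-level sketch of this modification (illustrative only; not the paper's exact parameterization), each attention layer evaluates M self-attention branches and mixes their outputs with learned scalar weights:

    output = \sum_{i=1}^{M} \alpha_i \, \mathrm{branch}_i(x),

where the \alpha_i are trained jointly with the rest of the network (assumed here to be normalized so that \sum_i \alpha_i = 1).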
1 INTRODUCTION
Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs) (Hochreiter
& Schmidhuber, 1997), form an important building block for many tasks that require modeling of
sequential data. RNNs have been successfully employed for several such tasks including language
modeling (Melis et al., 2017; Merity et al., 2017), speech recognition (Xiong et al., 2017; Graves
et al., 2013), and machine translation (Wu et al., 2016; Bahdanau et al., 2014). RNNs make output
predictions at each time step by computing a hidden state vector h_t based on the current input token
and the previous states. This sequential computation underlies their ability to map arbitrary input-
output sequence pairs. However, because their auto-regressive nature requires previous hidden
states to be computed before the current time step, they cannot benefit from parallelization.
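As an illustrative sketch (written here in the vanilla Elman form; gated cells such as the LSTM share the same sequential dependence), the recurrence is

    h_t = \tanh(W_x x_t + W_h h_{t-1} + b),

where x_t is the input representation at step t. Because h_t depends on h_{t-1}, the hidden states must be computed one step at a time rather than in parallel across the sequence.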
Variants of recurrent networks that use strided convolutions eschew the traditional time-step based
computation (Kaiser & Bengio, 2016; Lei & Zhang, 2017; Bradbury et al., 2016; Gehring et al.,
2016; 2017; Kalchbrenner et al., 2016). However, in these models, dependencies between distant
positions remain difficult to learn (Hochreiter et al., 2001; Hochreiter,
1998). Attention mechanisms, often used in conjunction with recurrent models, have become an in-
tegral part of complex sequential tasks because they facilitate learning of such dependencies (Luong
et al., 2015; Bahdanau et al., 2014; Parikh et al., 2016; Paulus et al., 2017; Kim et al., 2017).
In Vaswani et al. (2017), the authors introduce the Transformer network, a novel architecture that
avoids the recurrence equation and maps the input sequences into hidden states solely using atten-
tion. Specifically, the authors use positional encodings in conjunction with a multi-head attention
mechanism. This allows for increased parallel computation and reduces time to convergence. The
authors report results for neural machine translation showing that the Transformer network achieves
state-of-the-art performance on the WMT 2014 English-to-German and English-to-French tasks
while being orders-of-magnitude faster than prior approaches.
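Concretely, the scaled dot-product attention of Vaswani et al. (2017) computes, for query, key, and value matrices Q, K, and V,

    Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) V,

where d_k is the key dimension. Multi-head attention applies this operation several times in parallel with separate learned projections and concatenates the results, so every position can attend to every other position within a single layer.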
Transformer networks still require a large number of parameters to achieve state-of-the-art perfor-
mance. In the case of the newstest2013 English-to-German translation task, the base model required
65M parameters, and the large model required 213M parameters. We propose a variant of the Trans-
former network which we call Weighted Transformer that uses self-attention branches in lieu of