Though SeqGAN (Yu et al., 2017) proposed using rollout to obtain a reward for each generated word, the variance of these rewards is typically too high to be useful in practice, and the computational cost of rollout can also be prohibitive. Below we describe how to use the proposed guider network to define intermediate rewards, leading to a definition of a feature-matching reward.
Feature-Matching Rewards
We first define an intermediate reward for generating a particular word. The idea is to match the ground-truth features from the CNN encoder in Figure 1 with those generated from the guider network. Equation (8) indicates that the further the generated feature is from the true feature, the smaller the reward should be. To this end, for each time $t$, we define the intermediate reward for generating the current word as
$$ r^{g}_{t} = \frac{1}{2c}\sum_{i=1}^{c}\Big( D_{\cos}\big(f_t, \hat{f}_t\big) + D_{\cos}\big(f_t - f_{t-i},\, \hat{f}_t - f_{t-i}\big) \Big)\,, $$
where $\hat{f}_t = G_{\psi}(s^{G}_{t-c-1}, f_{t-c})$ is the predicted feature. Intuitively, $f_t - f_{t-i}$ measures the difference between the generated sentences in feature space; the reward is high if it matches the predicted feature transition $\hat{f}_t - f_{t-i}$ from the guider network.
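To make the computation concrete, below is a minimal PyTorch sketch of this intermediate reward, assuming the per-step CNN features $f_1,\dots,f_t$ and the guider prediction $\hat{f}_t$ are already available as tensors and that $D_{\cos}$ denotes cosine similarity; the function and variable names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def feature_matching_reward(f, f_hat_t, t, c):
    """Sketch of r^g_t = 1/(2c) * sum_{i=1..c} [cos(f_t, f_hat_t)
                                               + cos(f_t - f_{t-i}, f_hat_t - f_{t-i})].

    f:        tensor of shape [T, d] (or list of [d] tensors) of CNN features
    f_hat_t:  tensor of shape [d], guider prediction G_psi(s^G_{t-c-1}, f_{t-c})
    Assumes t >= c so that f[t - i] is always defined.
    """
    f_t = f[t]
    reward = 0.0
    for i in range(1, c + 1):
        sim_abs = F.cosine_similarity(f_t, f_hat_t, dim=-1)           # D_cos(f_t, f̂_t)
        sim_delta = F.cosine_similarity(f_t - f[t - i],                # actual feature transition
                                        f_hat_t - f[t - i], dim=-1)    # vs. predicted transition
        reward = reward + sim_abs + sim_delta
    return reward / (2 * c)
```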
At the last step of text generation, i.e., $t = T$, the corresponding reward measures the quality of the whole generated sentence, and is thus called the final reward. The final reward is defined differently from the intermediate reward, as discussed below for both the unconditional- and conditional-generation cases.
Note that a token generated at time $t$ will influence not only the rewards received at that time but also the rewards at subsequent time steps. Thus we propose to define the cumulative reward $\sum_{i=t}^{T}\gamma^{i} r^{g}_{i}$, with $\gamma$ a discount factor, as a feature-matching reward. Intuitively, this encourages the generator to focus on achieving higher long-term rewards. Finally, in order to apply policy gradient to update the generator, we combine the feature-matching reward with the problem-specific final reward, to form a Q-value reward specified below.
Similar to SeqGAN, the final reward is defined as the output of a discriminator that evaluates the quality of the whole generated sentence, i.e., the smaller the output, the less likely the generation is a true sentence. As a result, we combine the adversarial reward $r^{f} \in [0, 1]$ given by the discriminator (Yu et al., 2017) with the guider-matching rewards to define a Q-value reward as $Q_t = \big(\sum_{i=t}^{T}\gamma^{i} r^{g}_{i}\big) \times r^{f}$.

Algorithm 1 Model-based Imitation Learning for Text Generation
Require: generator policy $\pi_{\phi}$; guider network $G_{\psi}$; a sequence dataset $\{X_{1\ldots T}\}$ generated by some expert policy.
1: Initialize $G_{\psi}$, $D_{\theta}$ with random weights.
2: while Imitation Learning phase do
3:     Update generator $\pi_{\phi}$ and guider $G_{\psi}$ with the MLE loss.
4: end while
5: while Reinforcement Learning phase do
6:     Generate a sequence $Y_{1\ldots T} \sim \pi_{\phi}$.
7:     Compute $Q_t$ and update $\pi_{\phi}$.
8: end while
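Returning to the Q-value reward defined above, the sketch below computes $Q_t$ for every step of a sampled sequence from the per-step feature-matching rewards and the discriminator score. The discount value and tensor names are assumptions for illustration only.

```python
import torch

def q_value_rewards(r_g, r_f, gamma=0.95):
    """Sketch of Q_t = (sum_{i=t}^{T} gamma^i * r^g_i) * r^f for all t.

    r_g:   tensor of shape [T] with intermediate rewards r^g_1..r^g_T
    r_f:   scalar adversarial reward in [0, 1] from the discriminator
    Note: indexing here is zero-based, so gamma^i runs over i = 0..T-1.
    """
    T = r_g.shape[0]
    discounts = gamma ** torch.arange(T, dtype=r_g.dtype)   # gamma^i
    discounted = discounts * r_g                             # gamma^i * r^g_i
    # Reverse cumulative sum gives sum_{i=t}^{T} gamma^i * r^g_i for every t.
    tail_sums = torch.flip(torch.cumsum(torch.flip(discounted, [0]), dim=0), [0])
    return tail_sums * r_f
```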
Generator Optimization
The generator is initialized by pre-training on sentences with an autoencoder structure, based on MLE training. After that, the Q-value reward $Q_t$ is used as the reward at each time $t$, with standard policy-gradient optimization methods used to update the generator. Specifically, the policy gradient is
$$ \nabla_{\phi} J = \mathbb{E}_{(s_{t-1}, y_t)\sim \rho_{\pi}}\big[ Q_t \, \nabla_{\phi} \log p(y_t \mid s_{t-1}; \phi, \varphi) \big]\,, $$
$$ \nabla_{\varphi} J = \mathbb{E}_{(s_{t-1}, y_t)\sim \rho_{\pi}}\big[ Q_t \, \nabla_{\varphi} \log p(y_t \mid s_{t-1}; \phi, \varphi) \big]\,, $$
where $p(y_t \mid s_{t-1}; \phi, \varphi)$ is the probability of generating $y_t$ given $s_{t-1}$ in the generator. Algorithm 1 describes the proposed model-based imitation learning framework for text generation.
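For reference, here is a minimal sketch of one such policy-gradient update, assuming the generator has already produced a sequence and exposed the per-step log-probabilities $\log p(y_t \mid s_{t-1}; \phi, \varphi)$ as a differentiable tensor; the function and variable names are illustrative.

```python
import torch

def policy_gradient_step(log_probs, q_values, optimizer):
    """REINFORCE-style update: ascend E[Q_t * grad log p(y_t | s_{t-1})].

    log_probs: tensor [T] of log p(y_t | s_{t-1}; phi, varphi) for a sampled sequence
    q_values:  tensor [T] of Q-value rewards Q_t (treated as constants)
    optimizer: optimizer over the generator parameters (phi, varphi)
    """
    loss = -(q_values.detach() * log_probs).sum()   # negate: the optimizer minimizes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Detaching `q_values` reflects that $Q_t$ acts as a fixed reward signal in the policy gradient, so gradients flow only through the log-probabilities.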
Model-based or Model-free
Text generation seeks to generate the next word (action) given the current (sub-)sentence (state). The generator is regarded as an agent that learns a policy to predict the next word given its current state. In previous work (Ranzato et al., 2016), a metric reward is given and the generator is trained solely to maximize this metric reward by trial and error; this is model-free learning. In the proposed method, the guider network models the environment dynamics and is trained by minimizing a cosine-distance loss between its predictions and the ground-truth features on real text. For generator training, the generator maximizes a reward determined by both the metric and the guider network, and thus performs model-free learning with model-based boosting (Gu et al., 2016). A model-predictive-control scheme is also incorporated in our method, where the guider network helps select the next word at each time step.
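As a small illustration of the guider objective described above, the sketch below minimizes a cosine-distance loss between the guider's predicted feature and the ground-truth CNN feature on real text; the exact loss form and tensor names are assumptions, since this section only specifies that the training is cosine-similarity based.

```python
import torch
import torch.nn.functional as F

def guider_loss(f_hat, f_true):
    """Cosine-distance loss 1 - cos(f_hat, f_true), averaged over the batch.

    f_hat:  guider prediction G_psi(s^G, f), shape [batch, d]
    f_true: ground-truth CNN feature from real text, shape [batch, d]
    """
    return (1.0 - F.cosine_similarity(f_hat, f_true, dim=-1)).mean()
```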