it contains not only a huge range of dictionary words, but also many character
sequences that would not be included in text corpora traditionally used for
language modelling: for example, foreign words (including characters from non-Latin scripts such as Arabic and Chinese), indented XML tags used to define meta-data, website addresses, and markup used to indicate page formatting such as headings and bullet points. An extract from the Hutter Prize dataset is shown
in Figs. 3 and 4.
The first 96M bytes in the data were evenly split into sequences of 100 bytes
and used to train the network, with the remaining 4M used for validation.
The data contains a total of 205 one-byte Unicode symbols. The total number
of characters is much higher, since many characters (especially those from non-
Latin languages) are defined as multi-symbol sequences. In keeping with the
principle of modelling the smallest meaningful units in the data, the network
predicted a single byte at a time, and therefore had input and output layers of size 205.
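For concreteness, this data preparation could be sketched as follows in Python. This is an illustrative reconstruction rather than the original implementation; the file path and helper names are assumptions.

```python
import numpy as np

SEQ_LEN = 100                    # each training sequence is 100 bytes
TRAIN_BYTES = 96_000_000         # first 96M bytes used for training
VALID_BYTES = 4_000_000          # remaining 4M bytes used for validation

# Load the raw Hutter Prize data (enwik8) as a flat byte array; the path is illustrative.
data = np.fromfile("enwik8", dtype=np.uint8)
train = data[:TRAIN_BYTES]
valid = data[TRAIN_BYTES:TRAIN_BYTES + VALID_BYTES]

# The data contains 205 distinct one-byte symbols; map each observed byte value
# to a dense index so the network can use input and output layers of size 205.
symbols = np.unique(train)                     # ~205 distinct byte values
lut = np.full(256, -1, dtype=np.int64)
lut[symbols] = np.arange(len(symbols))

def to_sequences(raw):
    """Drop the ragged tail and reshape the byte stream into 100-byte rows."""
    n = (len(raw) // SEQ_LEN) * SEQ_LEN
    return lut[raw[:n]].reshape(-1, SEQ_LEN)

train_seqs = to_sequences(train)   # shape (960000, 100), in original order
valid_seqs = to_sequences(valid)
```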
Wikipedia contains long-range regularities, such as the topic of an article, which can span many thousands of words. To make it possible for the network to capture these, its internal state (that is, the output activations $h_t$ of the hidden layers, and the activations $c_t$ of the LSTM cells within the layers) was only reset
every 100 sequences. Furthermore, the order of the sequences was not shuffled
during training, as it usually is for neural networks. The network was therefore
able to access information from up to 10K characters in the past when making
predictions. The error terms were only backpropagated to the start of each 100-byte sequence, meaning that the gradient calculation was approximate. This
form of truncated backpropagation has been considered before for RNN lan-
guage modelling [23], and found to speed up training (by reducing the sequence
length and hence increasing the frequency of stochastic weight updates) without
affecting the network’s ability to learn long-range dependencies.
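A minimal sketch of this stateful, unshuffled training scheme is given below, assuming a PyTorch-style model that returns its output logits together with its LSTM state. The framework and interface are assumptions; `model`, `loss_fn` and `optimiser` are taken to be configured as in the sketch after the next paragraph. The state is carried across consecutive 100-byte sequences but detached at each boundary, so the forward pass retains long-range context while the gradients are truncated.

```python
RESET_EVERY = 100           # reset the LSTM state every 100 sequences (10K bytes)

def train_epoch(model, sequences, loss_fn, optimiser):
    """sequences: LongTensor of shape (num_sequences, 100), kept in original order."""
    state = None
    for i, seq in enumerate(sequences):
        if i % RESET_EVERY == 0:
            state = None                     # forget everything older than 10K bytes

        inputs, targets = seq[:-1], seq[1:]  # predict each byte from its predecessors
        logits, state = model(inputs.unsqueeze(0), state)

        # Detach the state so gradients are truncated at the sequence boundary,
        # while the forward activations still carry long-range context.
        state = tuple(s.detach() for s in state)

        loss = loss_fn(logits.squeeze(0), targets)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
```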
A much larger network was used for this data than for the Penn data (reflecting the greater size and complexity of the training set), with seven hidden layers of 700 LSTM cells, giving approximately 21.3M weights. The network was trained with stochastic gradient descent, using a learning rate of 0.0001 and a momentum of 0.9. It took four training epochs to converge. The LSTM derivatives were clipped in the range [−1, 1].
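The stated architecture and optimiser settings could be configured roughly as follows. This is again a PyTorch sketch under assumptions: a plain stacked LSTM with an input embedding stands in for the original network, so the weight count and internal connectivity differ, and clipping the parameter gradients only approximates clipping the derivatives inside the LSTM cells.

```python
import torch
from torch import nn

NUM_SYMBOLS = 205     # one-byte Unicode symbols in the data
HIDDEN_SIZE = 700     # LSTM cells per hidden layer
NUM_LAYERS = 7        # hidden layers

class ByteLSTM(nn.Module):
    """Deep byte-level LSTM language model (a simplified stand-in for the paper's network)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_SYMBOLS, HIDDEN_SIZE)
        self.lstm = nn.LSTM(HIDDEN_SIZE, HIDDEN_SIZE, NUM_LAYERS, batch_first=True)
        self.out = nn.Linear(HIDDEN_SIZE, NUM_SYMBOLS)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

model = ByteLSTM()
optimiser = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# In the training loop, between loss.backward() and optimiser.step(), the
# gradients can be clipped to [-1, 1]:
#     torch.nn.utils.clip_grad_value_(model.parameters(), 1.0)
# The paper clips the LSTM derivatives inside the cells, so this is only a
# rough approximation of that scheme.
```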
As with the Penn data, we tested the network on the validation data with
and without dynamic evaluation (where the weights are updated as the data
is predicted). As can be seen from Table 2, performance was much better with dynamic evaluation. This is probably because of the long-range coherence of Wikipedia data; for example, certain words are much more frequent in some
articles than others, and being able to adapt to this during evaluation is ad-
vantageous. It may seem surprising that the dynamic results on the validation
set were substantially better than on the training set. However, this is easily explained by two factors: firstly, the network underfit the training data, and secondly, some portions of the data are much more difficult than others (for
example, plain text is harder to predict than XML tags).
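Dynamic evaluation can be sketched as follows: each sequence is scored with the current weights before those weights are updated on that same sequence, so the reported loss never uses information from bytes that have not yet been predicted. Whether the original setup reused the training optimiser and carried the LSTM state throughout evaluation is not stated here; both are assumptions in this sketch.

```python
import math

def dynamic_evaluation(model, sequences, loss_fn, optimiser):
    """Evaluate while adapting: score each sequence with the current weights,
    then update the weights on that same sequence before moving on."""
    total_loss, total_bytes = 0.0, 0
    state = None
    for seq in sequences:
        inputs, targets = seq[:-1], seq[1:]
        logits, state = model(inputs.unsqueeze(0), state)
        state = tuple(s.detach() for s in state)

        loss = loss_fn(logits.squeeze(0), targets)
        total_loss += loss.item() * targets.numel()   # record loss BEFORE updating
        total_bytes += targets.numel()

        optimiser.zero_grad()
        loss.backward()
        optimiser.step()

    # Average negative log-likelihood, converted from nats to bits per byte.
    return total_loss / total_bytes / math.log(2)
```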
To put the results in context, the current winner of the Hutter Prize (a