Deep Learning: The Resurgence of Neural Network Structure and Big-Data-Driven Performance Gains


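"The Resurgence of Structure in Deep Neural Networks", submitted by Petar Veličković in January 2019, is an important work in the field of deep learning. It examines the rise of deep neural networks (DNNs) in machine learning and their impact on several key problem domains. Its central observation is that deep learning lets machines learn complex feature representations automatically from raw, unpreprocessed inputs. This breakthrough displaced traditional feature engineering, in which features are designed by hand and hard-coded.

Previously, many machine learning tasks in computer vision, natural language processing (NLP), reinforcement learning (RL) and generative modelling relied on hand-designed feature extraction steps, which typically required the knowledge and experience of domain experts. With the rise of deep learning, and in particular of convolutional neural networks (CNNs) and recurrent neural networks (RNNs), these tasks can be tackled directly on grid-structured data such as text and images. The models are trained on large amounts of labelled data (often called "big data") and achieve unprecedented performance gains.

Deep neural networks owe these results mainly to their very large parameter space, which gives them strong representational capacity. This also brings a challenge: the risk of overfitting, where a model performs well on its training data but generalises poorly to new data. Addressing it requires appropriate regularisation strategies, such as Dropout, data augmentation and sensible architecture design.

The work may also discuss other important aspects of deep learning, such as training methods (backpropagation, optimisation algorithms) and architectural improvements (residual connections, attention mechanisms and so on) that further improve performance. The interpretability and theoretical understanding of deep learning may also be a focus of the discussion; these questions remain at the research frontier, but the resurgence of deep learning has undoubtedly advanced the whole field of AI. Overall, the work analyses in depth how deep learning has changed the way complex tasks are handled, and stresses the key role played by data, structure and model choice. It summarises existing successes and may also offer new ideas for future research directions.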

Figure 1.1: The notMNIST dataset of glyphs corresponding to characters A–J. For humans, the associated classification task (predicting the character depicted in the glyph) is trivial to perform, but near-impossible to conceptualise algorithmically.

The notMNIST dataset, designed as a step-up in complexity from the handwritten digit recognition of MNIST [103], is concerned with predicting the character depicted in a given input glyph (represented as a 28 × 28 grayscale image; i.e., a matrix of pixel intensities). For humans, this task is often trivial; the majority of the glyphs shown in Figure 1.1 may quickly be recognised as the character A. However, formulating a rigorous set of classification rules turns out to be quite difficult. This difficulty stems primarily from the variability of real-world signals: the large number of ways (fonts) to effectively encode the “substance” of an A is not readily amenable to a rule-based system that looks at individual pixels. Even simple transformations, such as shifts, scales and rotations, may completely fool such a system without compromising human performance.
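The brittleness of pixel-level rules can be sketched in a few lines. The example below is a hypothetical toy, not taken from the dissertation: it assumes a 5 × 5 binary glyph and a "rule" that demands an exact pixel match against a stored template. Shifting the glyph by a single pixel keeps every stroke intact, yet the rule no longer fires.

```python
# Hypothetical toy glyph: '#' marks an ink pixel, '.' marks background.
TEMPLATE_A = [
    "..#..",
    ".#.#.",
    ".###.",
    ".#.#.",
    ".#.#.",
]

def to_grid(rows):
    """Convert the text drawing into a matrix of 0/1 pixel intensities."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def shift_right(grid, k=1):
    """Shift every row k pixels to the right, padding with background."""
    width = len(grid[0])
    return [[0] * k + row[: width - k] for row in grid]

def rule_based_is_a(grid, template):
    """A brittle rule: every pixel must match the stored template exactly."""
    return grid == template

template = to_grid(TEMPLATE_A)
glyph = to_grid(TEMPLATE_A)
shifted = shift_right(glyph, 1)

print(rule_based_is_a(glyph, template))    # True
print(rule_based_is_a(shifted, template))  # False: same shape, moved one pixel
```

A human reader would still call the shifted glyph an A. The rule fails because it is tied to absolute pixel positions and not to the shape itself.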

Unlike such a rule-based classifier, humans have been exposed to a significant number of A characters during their lifetimes, and have therefore learnt the complex features that make up an A, without ever needing to rigorously define a set of rules for doing so. Machine learning, at its core, aims to capture exactly this: systems capable of generalising from past experiences. It has, therefore, become the dominant approach to artificial intelligence in recent years, especially through deep neural networks (commonly known as deep learning [101] in this context).

Footnote: It should be noted, of course, that modern methods based on deep neural networks are not immune to being fooled by perturbed inputs either; in this space, such perturbations are often referred to as adversarial attacks [168]. These, however, often produce fooling inputs that substantially escape the true data distribution.

Figure 1.2: Left: An input image (of the tabby cat class). A deep neural network (VGG-16 [160]) pre-trained on ImageNet is capable of extracting meaningful representations in both its shallow (middle) and deep (right) layers. The shallower layers typically extract rudimentary features (such as separating the foreground from the background) while, at the deeper stages of processing, the network highlights distinct “cat-like” properties, such as its eyes, ears and fur.

1.1 The successes and pitfalls of deep learning

Deep neural networks (DNNs) tackle the problem from a representation learning [11] perspective; they are composed of a stack of feature-extracting layers, with the first layer processing raw inputs, and each subsequent layer receiving the output of the previous layer. As such, DNNs build up a gradually more complex representation of the input and eliminate the need for hand-crafted feature extraction. Given a sufficiently large training set of input/output examples, the network iteratively adjusts its parameters (typically using a first-order method [143] such as gradient descent) in order to minimise the error of the network’s predictions on those examples compared to the ground-truth output. As a consequence, the network gradually acquires better representations of the data as well, with the “shallower” layers specialising to detect simple input features (such as edges and contours in an image) and the “deeper” layers detecting more complex cues. Figure 1.2 provides a visualisation of this phenomenon in the case of image classification on the ImageNet dataset [144] (where it is typically easiest to conceptualise).
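The training loop described above (a stack of layers whose parameters are adjusted by gradient descent to reduce prediction error) can be sketched with the standard library alone. Everything specific here is an assumption made for illustration: the toy target y = |x|, the four tanh hidden units, the hand-picked initial values, the learning rate and the step count. It is a minimal sketch, not the setup used in the dissertation.

```python
import math

# Toy regression data (an assumption for illustration): learn y = |x|.
XS = [-1.0, -0.5, 0.0, 0.5, 1.0]
YS = [abs(x) for x in XS]

def init_params():
    """Hand-picked starting values, so the run is fully deterministic."""
    return {
        "w": [0.5, -0.5, 0.3, -0.3],  # layer 1 weights (input -> hidden)
        "b": [0.1, -0.1, 0.2, -0.2],  # layer 1 biases
        "v": [0.1, -0.1, 0.1, -0.1],  # layer 2 weights (hidden -> output)
        "c": 0.0,                     # layer 2 bias
    }

def forward(p, x):
    """Layer 1 extracts tanh features from the raw input; layer 2 combines them."""
    hidden = [math.tanh(w * x + b) for w, b in zip(p["w"], p["b"])]
    out = sum(v * h for v, h in zip(p["v"], hidden)) + p["c"]
    return hidden, out

def mean_squared_error(p):
    return sum((forward(p, x)[1] - y) ** 2 for x, y in zip(XS, YS)) / len(XS)

def gradient_step(p, lr=0.02):
    """One full-batch gradient descent update, with gradients derived by hand."""
    n, k = len(XS), len(p["w"])
    gw, gb, gv, gc = [0.0] * k, [0.0] * k, [0.0] * k, 0.0
    for x, y in zip(XS, YS):
        hidden, out = forward(p, x)
        d_out = 2.0 * (out - y) / n                    # dLoss/d(out)
        gc += d_out
        for j, h in enumerate(hidden):
            gv[j] += d_out * h
            d_pre = d_out * p["v"][j] * (1.0 - h * h)  # back through tanh
            gw[j] += d_pre * x
            gb[j] += d_pre
    for j in range(k):
        p["w"][j] -= lr * gw[j]
        p["b"][j] -= lr * gb[j]
        p["v"][j] -= lr * gv[j]
    p["c"] -= lr * gc

params = init_params()
initial_loss = mean_squared_error(params)
for _ in range(3000):
    gradient_step(params)
final_loss = mean_squared_error(params)

print(f"loss before training: {initial_loss:.4f}")
print(f"loss after training:  {final_loss:.4f}")
```

No feature is designed by hand here: the hidden units start as arbitrary functions of the input and are shaped only by the error signal, which is the point the paragraph makes about representation learning.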

This method takes a direction in stark contrast to symbolic approaches: minimal assumptions about the specifics of the task are given, relying on the model to automatically infer them through observing the training data. In conjunction with specialised architectures that use parameter sharing to exploit simple redundancies and invariances in grid-structured data, such as convolutional neural networks (CNNs) [102] for images and recurrent neural networks (RNNs) [41, 74, 82] for sequences, these techniques currently hold the state-of-the-art results on many challenging tasks of interest. Their resurgence in the 21st century may be primarily attributed to the ability to harness large quantities of training data (the “big data” phenomenon) and an upturn in specialised computer architectures for parallel processing (primarily graphics processing units (GPUs), with fully dedicated architectures such as the tensor processing unit (TPU) [83] being proposed recently).
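To make parameter sharing concrete, the sketch below implements a one-dimensional convolution (strictly, a cross-correlation) in plain Python. The signal, the kernel and the helper names are illustrative assumptions. The same three weights are reused at every position, so shifting the input simply shifts the output, and the layer needs far fewer parameters than a dense layer between the same input and output sizes.

```python
def conv1d(signal, kernel):
    """Valid cross-correlation: the same kernel weights are reused at every position."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

signal = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0] + signal[:-1]          # the same pattern, moved one step right
kernel = [1, 0, -1]                  # three shared parameters

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)

print(out)          # [-1, -2, 0, 2, 1, 0]
print(out_shifted)  # [0, -1, -2, 0, 2, 1]

# Parameter counts for mapping 8 inputs to 6 outputs (biases ignored).
conv_params = len(kernel)
dense_params = len(signal) * len(out)
print(conv_params, dense_params)  # 3 48
```

The shifted output is the original output moved by one position, which is the kind of simple invariance in grid-structured data that the text refers to.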

On tasks with a large dataset and simple structural invariances, deep neural networks typically dominate. The scope of potential applications is extremely varied, with state-of-the-art (and often superhuman) performance being attained across areas such as computer vision [98], speech recognition [70], natural language processing [166], reinforcement learning [120], game-playing [158] and generative modelling [54]. What is particularly impressive is that the same kind of architecture may often be applied in wildly different scenarios, achieving competitive performance on all of them. This allowed, for the first time, a seamless connection between previously disjoint disciplines of computer science. Historically, tackling each problem in these domains required extensive input from experts to create appropriate hand-crafted features. Deep neural networks allow for automatic inference of useful features, purely through observing large quantities of training data, therefore often eliminating the domain expert from the pipeline.

While the successes of deep neural networks are indisputable and certainly correspond to the most promising direction of artificial intelligence in recent times, they are far from a catch-all solution. Even if we assume that the networks do not have to perform well in the presence of adversarial inputs [55] (a phenomenon already known to leave most deep architectures particularly vulnerable), issues arise as soon as we step out of the “big data” environment and/or are forced to deal with more general kinds of input structure (such as graph-structured data or data lying on manifolds).

As modern-day neural networks often have parameter counts in the millions, they will tend to easily overfit to their inputs on a small training set, constructing representations that memorise their training data rather than generalise to unseen inputs (which is the essential property of successful machine learning systems!). Furthermore, applying neural networks to complex-structured inputs makes it substantially less trivial to share parameters in ways that are simultaneously meaningful (do not lose representational power) and scalable (computationally efficient, and ideally deployable on GPUs). The most common response to such scenarios is to gather more data (nowadays possible even outside the industrial environment, e.g. by leveraging tools such as Amazon Mechanical Turk [18] to effectively harness human annotators) and to discard the additional structure.

Footnote: While tangential to the aims of the work presented in this dissertation, I acknowledge defence against adversarial examples as a very important research direction.

Both of these paradigms work only to an extent: in many cases, discarding the structure can cause severe losses in performance (as is the case with many node classification tasks on graphs [92, 187]). Similarly, gathering more data is not always appropriate, affordable or even feasible:

• In some cases, we may not know upfront what the task of interest is, and therefore not be able to determine appropriate labels (ground-truth outputs) for the data;

• For some tasks, data can be reliably collected only by trained professionals; a simple example of this is medical image segmentation, where every pixel of the input needs to be carefully labelled by an expert. In this setting, techniques such as active learning [46] may be applied to carefully choose which examples are the most important to be labelled;

• If we are aiming to detect an extremely rare event, such as diagnosing a rare genetic disease, any training dataset is fundamentally limited by the number of examples possessing this characteristic, and extending it further is impossible.
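The active learning idea mentioned in the second bullet can be illustrated with uncertainty sampling, one common heuristic and not necessarily the technique of [46]. The sketch assumes a model has already produced class probabilities for a pool of unlabelled examples; the example names and probability values are hypothetical. The examples with the highest predictive entropy are sent to the expert first.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_for_labelling(predictions, budget):
    """Return the `budget` examples the model is least certain about."""
    ranked = sorted(
        predictions,
        key=lambda name: entropy(predictions[name]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical model outputs for four unlabelled examples (two classes).
predictions = {
    "scan_001": [0.98, 0.02],
    "scan_002": [0.55, 0.45],
    "scan_003": [0.80, 0.20],
    "scan_004": [0.50, 0.50],
}

chosen = select_for_labelling(predictions, budget=2)
print(chosen)  # ['scan_004', 'scan_002']
```

With a labelling budget of two, the expert's time goes to the two examples whose predictions are closest to a coin flip, and the confidently classified ones are left unlabelled.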

1.2 Reintroducing structural inductive biases

The contributions I will present in this dissertation are primarily concerned with alleviating the issues outlined above in the setting where there is additional structure in the data which may be exploited. A common way in which additional knowledge about data may be exploited is by applying an appropriate inductive bias to the model.

Typically, given a particular (machine) learning setup, one may find a space of possible solutions (e.g. parameter assignments) to the learning problem that exhibit equally “good” performance [57]. Inductive biases [118], broadly speaking, encourage the learning algorithm to prioritise solutions with certain properties. While there are many ways to encode such biases (e.g. explicit regularisation objectives [115] or choices of prior distributions in a Bayesian model [60]), I have restricted my attention specifically to methods that incorporate structural assumptions directly into the learning architecture or algorithm. This can be seen as a “meet-in-the-middle” approach, combining aspects of classical symbolic artificial intelligence with modern deep architectures for representation learning.
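A small worked example may help show how an inductive bias selects among equally good solutions. It uses an explicit regularisation objective (an L2 penalty), which is one of the alternatives named above and not the structural approach the dissertation pursues. The problem, starting point and hyperparameters are assumptions chosen for illustration: a single training example with inputs (1, 1) and target 2, so that any weights with w1 + w2 = 2 fit the data perfectly.

```python
def train(w, weight_decay, lr=0.1, steps=2000):
    """Gradient descent on (w1 + w2 - 2)^2 + weight_decay * (w1^2 + w2^2)."""
    w1, w2 = w
    for _ in range(steps):
        residual = w1 + w2 - 2.0
        g1 = 2.0 * residual + 2.0 * weight_decay * w1
        g2 = 2.0 * residual + 2.0 * weight_decay * w2
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w1, w2

start = (3.0, -1.0)  # already fits the data exactly: 3 + (-1) = 2

plain = train(start, weight_decay=0.0)   # no bias: stays where it started
biased = train(start, weight_decay=0.1)  # L2 bias: prefers small, balanced weights

print(plain)
print(biased)
```

Without the penalty, gradient descent has no reason to leave the lopsided starting solution. With it, the weights are pulled towards the balanced point w1 = w2 = 2 / 2.1, at the cost of a slightly imperfect fit. The preference for small weights comes entirely from the added objective, not from the data.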

Footnote: With the caveat that data augmentation through generative models such as variational autoencoders (VAEs) [89] or generative adversarial networks (GANs) [54] could provide a way to artificially expand datasets of rare events [199], although substantial further research is necessary in order to verify the viability of such examples in safety-critical domains such as biomedicine.

By directly encoding the structural inductive biases present in data, I have recovered models that are more data-efficient, enabling a leap in predictive power, especially on smaller training datasets. While the specific research questions I have tackled through my contributions will be thoroughly outlined in the next section, I would like to highlight that this is by no means an isolated effort, and in fact represents a major recent push within the machine learning community, as the number of “low-hanging fruit” tasks for deep neural networks has dwindled over the past years. For reasons of space, I will only highlight a few such directions and architectures here:

• Nontrivial structural inductive biases are introduced in the context of multimodal deep learning [124, 163], wherein the assumption that different parts of the input should be processed by different feature extractors is encoded. Most work in this space focusses on “late fusion”: separately extracting features from each modality, and then driving decision-making using the combination of these features [2, 85], with recent work allowing for generic cross-modal interaction to occur at an earlier stage through feature-wise transformations [130].

• A substantial recent research direction involves generalising CNNs to operate on more general input structures (of which grids are a special case), such as graphs [49], manifolds [121] and point clouds [183]. Here, convolutions are generalised to more generic operations, often preserving appropriate locality, invariance and parameter sharing properties.

• Relational inductive biases [7] explicitly encode that a neural network should be mindful of relations present between objects in the input, and have recently been explored in the context of physics interactions [6], visual question answering [151], multi-agent interactions [75], and relational memory modules [150]. It should be noted that all of the above approaches assume a complete graph of relations; the tangential problem of relational inference, which aims to explicitly recover these relations and use them for inference, has also seen attention recently [90].

• Substantial work exists in introducing structural inductive biases in deep reinforcement learning, where they typically substantially improve the few-shot performance of the learnt agents. These approaches span reintroducing symbolic methods in deep RL algorithms [47], explicitly handling relations between “objects” present in the RL state [194], directly expressing nontrivial structure of the agent [149, 181], and allowing for information flow across tasks for continual learning [145, 146, 155]. In addition, structural biases have been successfully introduced in the context of imitation learning [37] and programmable agents [34].
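The idea of generalising convolutions to graphs while preserving locality and parameter sharing, raised in the second bullet above, can be sketched as follows. This is one simple choice (mean aggregation over neighbours with two shared scalar weights) and is not claimed to be the operator of [49] or of the dissertation. The graph, the features and the weight values are illustrative assumptions.

```python
def graph_layer(features, neighbours, w_self, w_neigh):
    """Each node mixes its own feature with the mean of its neighbours' features.

    The same two weights are shared by every node, mirroring how a CNN
    reuses one kernel at every grid position.
    """
    out = {}
    for node, x in features.items():
        nbrs = neighbours[node]
        mean_nbr = sum(features[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
        out[node] = w_self * x + w_neigh * mean_nbr
    return out

# A three-node path graph a - b - c with one scalar feature per node.
features = {"a": 1.0, "b": 2.0, "c": 4.0}
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

out = graph_layer(features, neighbours, w_self=0.5, w_neigh=1.0)
print(out)  # {'a': 2.5, 'b': 3.5, 'c': 4.0}

# Relabelling the nodes relabels the outputs in the same way.
rename = {"a": "z", "b": "y", "c": "x"}
features_r = {rename[n]: x for n, x in features.items()}
neighbours_r = {rename[n]: [rename[m] for m in nbrs] for n, nbrs in neighbours.items()}
out_r = graph_layer(features_r, neighbours_r, w_self=0.5, w_neigh=1.0)
print(all(out_r[rename[n]] == out[n] for n in out))  # True
```

Each output depends only on a node and its immediate neighbours (locality), the weights do not depend on which node is being processed (parameter sharing), and the result does not depend on how the nodes are named.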

As a few additional assorted topics where structural inductive biases have been deployed, I
