深度学习基石：Ian Goodfellow等大师作品详解

需积分: 50 133 浏览量更新于2023-05-21 收藏 31.22MB PDF 举报

《深度学习》（Adaptive Computation and Machine Learning系列）是由Ian Goodfellow、Yoshua Bengio和Aaron Courville合著的一本权威著作，专为深度学习领域的学习者设计。该书旨在介绍深度学习的基本概念和数学基础，以及其在人工智能领域的重要地位。以下是书中部分章节的主要知识点概览： 1. **简介**： - **目标读者**：书中面向希望深入了解深度学习理论和技术的读者，包括研究人员、工程师以及对AI感兴趣的初学者。 - **历史趋势**：作者回顾了深度学习的发展历程，强调了其在神经网络和机器学习领域中的关键角色，特别是自20世纪80年代以来的突破。 2. **应用数学与机器学习基础**： - **线性代数**： - 学习了基础概念，如标量、向量、矩阵和张量的定义与运算。 - **矩阵乘法和向量相乘**：介绍了基本的矩阵和向量操作，这是深度学习模型中不可或缺的计算工具。 - **特性和逆矩阵**：探讨了矩阵的特性，以及如何求解逆矩阵，这对于理解神经网络的权重更新至关重要。 - **线性依赖与span**：阐述了向量空间中线性独立和可表示性的概念，有助于理解神经网络中特征的提取。 - **范数和矩阵规范**：讨论了衡量向量或矩阵大小的方法，这在优化算法中起着重要作用。 - **特征值分解**：解释了矩阵特征值和特征向量的重要性，它们在神经网络中与权重参数的学习相关。 - **奇异值分解（SVD）**：这是一种重要的矩阵分解技术，常用于降维和数据压缩，对深度学习模型的训练和理解非常有帮助。 - ** Moore-Penrose伪逆**：深入浅出地讲解了解决线性系统不完全问题的手段，对处理神经网络的欠定问题有指导意义。 - **迹和行列式**：这两个概念对于矩阵的性质和计算有深入的分析，例如在计算神经元激活函数的导数时会用到。 - **实例：主成分分析（PCA）**：展示了这一常用的数据预处理方法如何在深度学习中用于特征提取和降维。 3. **概率与信息论**： - **概率的重要性**：阐述了概率在建模不确定性、决策制定和机器学习中的核心地位。 - **随机变量**：介绍了随机变量的概念，它是深度学习模型中随机过程的基础。 - **概率分布**：涵盖了各种分布类型，如伯努利、高斯、马尔科夫链等，这些是构建模型和理解数据分布的基础。 - **概率和条件概率**：讨论了概率的基本运算规则，以及在处理模型中变量之间的依赖关系时的条件概率。 - **贝叶斯定理和链式法则**：这些都是深度学习中推断和反向传播的关键原理。 - **独立性和条件独立性**：这些概念对于理解模型的复杂性以及结构化预测任务至关重要。 - **期望、方差和协方差**：用于度量随机变量的集中趋势和离散程度，对理解模型性能和优化过程至关重要。通过以上内容，本书不仅引导读者掌握深度学习所需的数学工具，还揭示了如何将这些理论应用于实际问题解决，使读者能够在这个快速发展的领域中取得成功。

CONTENTS

Functions

f f: A B→ The function with domain and rangeA B

f g f g◦ Composition of the functions and

f( ; )x θ

A function of

parametrized by

. Sometimes

we just write

(

) and ignore the argument

lighten notation.

log x xNatural logarithm of

σ x( ) Logistic sigmoid,

1 + exp( )−x

ζ x x( ) log(1 + exp(Softplus, ))

|| ||x

norm of x

|| ||x L

norm of x

Positive part of , i.e.,x max(0 ), x

condition

is 1 if the condition is true, 0 otherwise

Sometimes we use a function

whose argument is a scalar, but apply it to a vector,

matrix, or tensor:

(

), or

(

). This means to apply

to the array

element-wise. For example, if

(

), then C

i,j,k

) for all valid values

of , and .i j k

Datasets and distributions

data

The data generating distribution

ˆp

data

The empirical distribution deﬁned by the training

set

X A set of training examples

( )i

The -th example (input) from a dataseti

( )i

or y

( )i

The target associated with

( )i

for supervised learn-

ing

The

m n×

matrix with input example

( )i

in row

i,:

xiv

CHAPTER 1. INTRODUCTION

concepts are built on top of each other, the graph is deep, with many layers. For

this reason, we call this approach to AI deep learning.

Many of the early successes of AI took place in relatively sterile and formal

environments and did not require computers to have much knowledge about

the world. For example, IBM’s Deep Blue chess-playing system defeated world

champion Garry Kasparov in 1997 ( , ). Chess is of course a very simpleHsu 2002

world, containing only sixty-four locations and thirty-two pieces that can move

in only rigidly circumscribed ways. Devising a successful chess strategy is a

tremendous accomplishment, but the challenge is not due to the diﬃculty of

describing the set of chess pieces and allowable moves to the computer. Chess

can be completely described by a very brief list of completely formal rules, easily

provided ahead of time by the programmer.

Ironically, abstract and formal tasks that are among the most diﬃcult mental

undertakings for a human being are among the easiest for a computer. Computers

have long been able to defeat even the best human chess player, but are only

recently matching some of the abilities of average human beings to recognize objects

or speech. A person’s everyday life requires an immense amount of knowledge

about the world. Much of this knowledge is subjective and intuitive, and therefore

diﬃcult to articulate in a formal way. Computers need to capture this same

knowledge in order to behave in an intelligent way. One of the key challenges in

artiﬁcial intelligence is how to get this informal knowledge into a computer.

Several artiﬁcial intelligence projects have sought to hard-code knowledge about

the world in formal languages. A computer can reason about statements in these

formal languages automatically using logical inference rules. This is known as the

knowledge base approach to artiﬁcial intelligence. None of these projects has led to

a major success. One of the most famous such projects is Cyc ( ,Lenat and Guha

1989). Cyc is an inference engine and a database of statements in a language

called CycL. These statements are entered by a staﬀ of human supervisors. It is an

unwieldy process. People struggle to devise formal rules with enough complexity

to accurately describe the world. For example, Cyc failed to understand a story

about a person named Fred shaving in the morning ( , ). Its inferenceLinde 1992

engine detected an inconsistency in the story: it knew that people do not have

electrical parts, but because Fred was holding an electric razor, it believed the

entity “FredWhileShaving” contained electrical parts. It therefore asked whether

Fred was still a person while he was shaving.

The diﬃculties faced by systems relying on hard-coded knowledge suggest that

AI systems need the ability to acquire their own knowledge, by extracting patterns

from raw data. This capability is known as machine learning. The introduction

CHAPTER 1. INTRODUCTION

of machine learning allowed computers to tackle problems involving knowledge

of the real world and make decisions that appear subjective. A simple machine

learning algorithm called logistic regression can determine whether to recommend

cesarean delivery (Mor-Yosef 1990et al., ). A simple machine learning algorithm

called can separate legitimate e-mail from spam e-mail.naive Bayes

The performance of these simple machine learning algorithms depends heavily

on the representation of the data they are given. For example, when logistic

regression is used to recommend cesarean delivery, the AI system does not examine

the patient directly. Instead, the doctor tells the system several pieces of relevant

information, such as the presence or absence of a uterine scar. Each piece of

information included in the representation of the patient is known as a feature.

Logistic regression learns how each of these features of the patient correlates with

various outcomes. However, it cannot inﬂuence the way that the features are

deﬁned in any way. If logistic regression was given an MRI scan of the patient,

rather than the doctor’s formalized report, it would not be able to make useful

predictions. Individual pixels in an MRI scan have negligible correlation with any

complications that might occur during delivery.

This dependence on representations is a general phenomenon that appears

throughout computer science and even daily life. In computer science, opera-

tions such as searching a collection of data can proceed exponentially faster if

the collection is structured and indexed intelligently. People can easily perform

arithmetic on Arabic numerals, but ﬁnd arithmetic on Roman numerals much

more time-consuming. It is not surprising that the choice of representation has an

enormous eﬀect on the performance of machine learning algorithms. For a simple

visual example, see Fig. .1.1

Many artiﬁcial intelligence tasks can be solved by designing the right set of

features to extract for that task, then providing these features to a simple machine

learning algorithm. For example, a useful feature for speaker identiﬁcation from

sound is an estimate of the size of speaker’s vocal tract. It therefore gives a strong

clue as to whether the speaker is a man, woman, or child.

However, for many tasks, it is diﬃcult to know what features should be extracted.

For example, suppose that we would like to write a program to detect cars in

photographs. We know that cars have wheels, so we might like to use the presence

of a wheel as a feature. Unfortunately, it is diﬃcult to describe exactly what a

wheel looks like in terms of pixel values. A wheel has a simple geometric shape but

its image may be complicated by shadows falling on the wheel, the sun glaring oﬀ

the metal parts of the wheel, the fender of the car or an object in the foreground

obscuring part of the wheel, and so on.

CHAPTER 1. INTRODUCTION













Figure 1.1: Example of diﬀerent representations: suppose we want to separate two

categories of data by drawing a line between them in a scatterplot. In the plot on the left,

we represent some data using Cartesian coordinates, and the task is impossible. In the plot

on the right, we represent the data with polar coordinates and the task becomes simple to

solve with a vertical line. (Figure produced in collaboration with David Warde-Farley)

One solution to this problem is to use machine learning to discover not only

the mapping from representation to output but also the representation itself.

This approach is known as representation learning. Learned representations often

result in much better performance than can be obtained with hand-designed

representations. They also allow AI systems to rapidly adapt to new tasks, with

minimal human intervention. A representation learning algorithm can discover a

good set of features for a simple task in minutes, or a complex task in hours to

months. Manually designing features for a complex task requires a great deal of

human time and eﬀort; it can take decades for an entire community of researchers.

The quintessential example of a representation learning algorithm is the au-

toencoder encoder. An autoencoder is the combination of an function that converts

the input data into a diﬀerent representation, and a decoder function that converts

the new representation back into the original format. Autoencoders are trained to

preserve as much information as possible when an input is run through the encoder

and then the decoder, but are also trained to make the new representation have

various nice properties. Diﬀerent kinds of autoencoders aim to achieve diﬀerent

kinds of properties.

When designing features or algorithms for learning features, our goal is usually

to separate the that explain the observed data. In this context,factors of variation

we use the word “factors” simply to refer to separate sources of inﬂuence; the factors

are usually not combined by multiplication. Such factors are often not quantities

剩余802页未读，继续阅读

不连续改变

粉丝: 0
资源: 1

深度学习基石：Ian Goodfellow等大师作品详解

Deep learning_ adaptive computation and machine learning-The MIT Press (2016)

Deep Learning (Adaptive Computation and Machine Learning series) 最新中文版

Deep Learning (Adaptive Computation and Machine Learning series) 英文版

deep learning(Adaptive Computation and Machine Learning series)

【Essentials of Deep Learning for Time Series Forecasting】: Tips and Advanced Applications of RNN

Optimization of Machine Learning Using MATLAB Genetic Algorithms: Strategies of Integration and ...

5 Key Tips for Cross-Validation: Unleash More Accurate Machine Learning Models

Real-Time Machine Learning Model Update Strategies: 3 Tips to Keep Your Model Ahead

Time Series Data Preprocessing: Experts Teach Standardization and Normalization Techniques

MATLAB Genetic Algorithm Adaptive Mechanism: Unveiling the Core of Intelligent Adjustment Strategies

最新资源