
The Matrix Calculus You Need For Deep Learning

Terence Parr and Jeremy Howard

July 3, 2018

(We teach in University of San Francisco’s MS in Data Science program and have other nefarious

projects underway. You might know Terence as the creator of the ANTLR parser generator. For

more material, see Jeremy’s fast.ai courses and University of San Francisco’s Data Institute in-

person version of the deep learning course.)

HTML version (The PDF and HTML were generated from markup using bookish)

Abstract

This paper is an attempt to explain all the matrix calculus you need in order to understand

the training of deep neural networks. We assume no math knowledge beyond what you learned

in calculus 1, and provide links to help you refresh the necessary math where needed. Note that

you do not need to understand this material before you start learning to train and use deep

learning in practice; rather, this material is for those who are already familiar with the basics of

neural networks, and wish to deepen their understanding of the underlying math. Don’t worry

if you get stuck at some point along the way—just go back and reread the previous section, and

try writing down and working through some examples. And if you’re still stuck, we’re happy

to answer your questions in the Theory category at forums.fast.ai. Note: There is a reference

section at the end of the paper summarizing all the key matrix calculus rules and terminology

discussed here.


arXiv:1802.01528v3 [cs.LG] 2 Jul 2018

Contents

1 Introduction
2 Review: Scalar derivative rules
3 Introduction to vector calculus and partial derivatives
4 Matrix calculus
  4.1 Generalization of the Jacobian
  4.2 Derivatives of vector element-wise binary operators
  4.3 Derivatives involving scalar expansion
  4.4 Vector sum reduction
  4.5 The Chain Rules
    4.5.1 Single-variable chain rule
    4.5.2 Single-variable total-derivative chain rule
    4.5.3 Vector chain rule
5 The gradient of neuron activation
6 The gradient of the neural network loss function
  6.1 The gradient with respect to the weights
  6.2 The derivative with respect to the bias
7 Summary
8 Matrix Calculus Reference
  8.1 Gradients and Jacobians
  8.2 Element-wise operations on vectors
  8.3 Scalar expansion
  8.4 Vector reductions
  8.5 Chain rules
9 Notation
10 Resources

1 Introduction

Most of us last saw calculus in school, but derivatives are a critical part of machine learning,

particularly deep neural networks, which are trained by optimizing a loss function. Pick up a

machine learning paper or the documentation of a library such as PyTorch and calculus comes

screeching back into your life like distant relatives around the holidays. And it’s not just any old

scalar calculus that pops up—you need differential matrix calculus, the shotgun wedding of linear

algebra and multivariate calculus.

Well... maybe need isn’t the right word; Jeremy’s courses show how to become a world-class deep learning practitioner with only a minimal level of scalar calculus, thanks to leveraging the automatic differentiation built into modern deep learning libraries. But if you really want to understand what’s going on under the hood of these libraries, and grok academic papers discussing the latest advances in model training techniques, you’ll need to understand certain bits of the field of matrix calculus.

For example, the activation of a single computation unit in a neural network is typically calculated using the dot product (from linear algebra) of an edge weight vector $\mathbf{w}$ with an input vector $\mathbf{x}$ plus a scalar bias (threshold): $z(\mathbf{x}) = \sum_i^n w_i x_i + b = \mathbf{w} \cdot \mathbf{x} + b$. Function $z(\mathbf{x})$ is called the unit’s affine function and is followed by a rectified linear unit, which clips negative values to zero: $\max(0, z(\mathbf{x}))$. Such a computational unit is sometimes referred to as an “artificial neuron” and looks like:

[Figure: diagram of a single artificial neuron]
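As a concrete sketch (ours, not the paper’s), this unit can be written in a few lines of plain Python; the weight, input, and bias values below are made up for illustration:

```python
def unit_activation(w, x, b):
    """Affine function z(x) = w·x + b followed by a rectified linear unit."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # dot product plus bias
    return max(0.0, z)                            # clip negative values to zero

w = [0.5, -1.0, 2.0]   # hypothetical edge weights
x = [1.0, 0.5, 1.0]    # hypothetical input vector
b = 0.1
activation = unit_activation(w, x, b)  # z = 2.1 > 0, so the ReLU passes it through
```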

Neural networks consist of many of these units, organized into multiple collections of neurons called layers. The activation of one layer’s units becomes the input to the next layer’s units. The activation of the unit or units in the final layer is called the network output.

Training this neuron means choosing weights $\mathbf{w}$ and bias $b$ so that we get the desired output for all $N$ inputs $\mathbf{x}$. To do that, we minimize a loss function that compares the network’s final $activation(\mathbf{x})$ with $target(\mathbf{x})$ (the desired output for $\mathbf{x}$) over all input vectors $\mathbf{x}$. To minimize the loss, we use some variation on gradient descent, such as plain stochastic gradient descent (SGD), SGD with momentum, or Adam. All of those require the partial derivative (the gradient) of $activation(\mathbf{x})$ with respect to the model parameters $\mathbf{w}$ and $b$. Our goal is to gradually tweak $\mathbf{w}$ and $b$ so that the overall loss function keeps getting smaller across all $\mathbf{x}$ inputs.
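All of these optimizers build on the same core update, the plain gradient-descent step; here is a hypothetical sketch of ours (the gradient values are invented, not computed):

```python
def gradient_descent_step(params, grads, lr=0.1):
    """One plain gradient-descent update: p <- p - lr * dLoss/dp."""
    return [p - lr * g for p, g in zip(params, grads)]

w = [0.5, -1.0, 2.0]
grad_w = [0.2, -0.4, 0.0]             # hypothetical partials of the loss w.r.t. w
w = gradient_descent_step(w, grad_w)  # each weight moves against its gradient
```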

If we’re careful, we can derive the gradient by differentiating the scalar version of a common loss function (mean squared error):

$$\frac{1}{N} \sum_x \big(target(\mathbf{x}) - activation(\mathbf{x})\big)^2 \;=\; \frac{1}{N} \sum_x \Big(target(\mathbf{x}) - \max\big(0, \sum_i^{|x|} w_i x_i + b\big)\Big)^2$$
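A minimal sketch of this scalar loss for a single neuron, with made-up training data (nothing below comes from the paper beyond the formula itself):

```python
def neuron(w, x, b):
    """max(0, w·x + b): the affine function followed by the ReLU."""
    return max(0.0, sum(wi * xi for wi, xi in zip(w, x)) + b)

def mse_loss(w, b, inputs, targets):
    """Mean squared error of the neuron's output over all N inputs."""
    n = len(inputs)
    return sum((t - neuron(w, x, b)) ** 2 for x, t in zip(inputs, targets)) / n

inputs = [[1.0, 2.0], [0.5, -1.0]]  # hypothetical input vectors
targets = [1.0, 0.0]                # their desired outputs
loss = mse_loss([0.3, 0.2], 0.1, inputs, targets)
```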

But this is just one neuron, and neural networks must train the weights and biases of all neurons

in all layers simultaneously. Because there are multiple inputs and (potentially) multiple network

outputs, we really need general rules for the derivative of a function with respect to a vector and

even rules for the derivative of a vector-valued function with respect to a vector.

This article walks through the derivation of some important rules for computing partial derivatives with respect to vectors, particularly those useful for training neural networks. This field is known as matrix calculus, and the good news is, we only need a small subset of that field, which we introduce here. While there is a lot of online material on multivariate calculus and linear algebra, the two subjects are typically taught as separate undergraduate courses, so most material treats them in isolation. The pages that do discuss matrix calculus are often really just lists of rules with minimal explanation, or are just pieces of the story. They also tend to be quite obscure to all but a narrow audience of mathematicians, thanks to their use of dense notation and minimal discussion of foundational concepts. (See the annotated list of resources at the end.)

In contrast, we’re going to rederive and rediscover some key matrix calculus rules in an effort to explain them. It turns out that matrix calculus is really not that hard! There aren’t dozens of new rules to learn; just a couple of key concepts. Our hope is that this short paper will get you started quickly in the world of matrix calculus as it relates to training neural networks. We’re assuming you’re already familiar with the basics of neural network architecture and training. If you’re not, head over to Jeremy’s course and complete part 1 of that, then we’ll see you back here when you’re done. (Note that, unlike many more academic approaches, we strongly suggest first learning to train and use neural networks in practice and then studying the underlying math. The math will be much more understandable with the context in place; besides, it’s not necessary to grok all this calculus to become an effective practitioner.)

A note on notation: Jeremy’s course exclusively uses code, instead of math notation, to explain

concepts since unfamiliar functions in code are easy to search for and experiment with. In this

paper, we do the opposite: there is a lot of math notation because one of the goals of this paper is

to help you understand the notation that you’ll see in deep learning papers and books. At the end

of the paper, you’ll find a brief table of the notation used, including a word or phrase you can use

to search for more details.

2 Review: Scalar derivative rules

Hopefully you remember some of these main scalar derivative rules. If your memory is a bit fuzzy on this, have a look at the Khan Academy video on scalar derivative rules.

| Rule | $f(x)$ | Scalar derivative notation with respect to $x$ | Example |
|------|--------|------------------------------------------------|---------|
| Constant | $c$ | $0$ | $\frac{d}{dx}99 = 0$ |
| Multiplication by constant | $cf$ | $c\frac{df}{dx}$ | $\frac{d}{dx}3x = 3$ |
| Power Rule | $x^n$ | $nx^{n-1}$ | $\frac{d}{dx}x^3 = 3x^2$ |
| Sum Rule | $f + g$ | $\frac{df}{dx} + \frac{dg}{dx}$ | $\frac{d}{dx}(x^2 + 3x) = 2x + 3$ |
| Difference Rule | $f - g$ | $\frac{df}{dx} - \frac{dg}{dx}$ | $\frac{d}{dx}(x^2 - 3x) = 2x - 3$ |
| Product Rule | $fg$ | $f\frac{dg}{dx} + \frac{df}{dx}g$ | $\frac{d}{dx}x^2x = x^2 + x \cdot 2x = 3x^2$ |
| Chain Rule | $f(g(x))$ | $\frac{df(u)}{du}\frac{du}{dx}$, let $u = g(x)$ | $\frac{d}{dx}\ln(x^2) = \frac{1}{x^2}2x = \frac{2}{x}$ |

There are other rules for trigonometry, exponentials, etc., which you can find in the Khan Academy differential calculus course.
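The rules in the table above can be spot-checked numerically with a central finite-difference approximation (this check is ours, not the paper’s):

```python
import math

def deriv(f, x, h=1e-6):
    """Central finite-difference approximation of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0
power = deriv(lambda t: t ** 3, x)                 # power rule: 3x^2 = 12
difference = deriv(lambda t: t ** 2 - 3 * t, x)    # difference rule: 2x - 3 = 1
chain = deriv(lambda t: math.log(t ** 2), x)       # chain rule: 2/x = 1
```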

When a function has a single parameter, $f(x)$, you’ll often see $f'$ and $f'(x)$ used as shorthands for $\frac{d}{dx}f(x)$. We recommend against this notation as it does not make clear the variable we’re taking the derivative with respect to.

You can think of $\frac{d}{dx}$ as an operator that maps a function of one parameter to another function. That means that $\frac{d}{dx}f(x)$ maps $f(x)$ to its derivative with respect to $x$, which is the same thing as $\frac{df(x)}{dx}$. Also, if $y = f(x)$, then $\frac{dy}{dx} = \frac{df(x)}{dx} = \frac{d}{dx}f(x)$. Thinking of the derivative as an operator helps to simplify complicated derivatives because the operator is distributive and lets us pull out constants. For example, in the following equation, we can pull out the constant 9 and distribute the derivative operator across the elements within the parentheses.

$$\frac{d}{dx} 9(x + x^2) = 9 \frac{d}{dx}(x + x^2) = 9\left(\frac{d}{dx} x + \frac{d}{dx} x^2\right) = 9(1 + 2x) = 9 + 18x$$

That procedure reduced the derivative of $9(x + x^2)$ to a bit of arithmetic and the derivatives of $x$ and $x^2$, which are much easier to solve than the original derivative.
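The worked answer $9 + 18x$ can also be confirmed numerically, reusing the same finite-difference idea (a quick check of ours):

```python
def deriv(f, x, h=1e-6):
    """Central finite-difference approximation of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# compare the numeric derivative of 9(x + x^2) with 9 + 18x at a few points
checks = []
for x in (0.0, 1.5, -2.0):
    numeric = deriv(lambda t: 9 * (t + t ** 2), x)
    closed_form = 9 + 18 * x
    checks.append(abs(numeric - closed_form))
```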

3 Introduction to vector calculus and partial derivatives

Neural network layers are not single functions of a single parameter, $f(x)$. So, let’s move on to functions of multiple parameters such as $f(x, y)$. For example, what is the derivative of $xy$ (i.e., the multiplication of $x$ and $y$)? In other words, how does the product $xy$ change when we wiggle the variables? Well, it depends on whether we are changing $x$ or $y$. We compute derivatives with respect to one variable (parameter) at a time, giving us two different partial derivatives for this two-parameter function (one for $x$ and one for $y$). Instead of using operator $\frac{d}{dx}$, the partial derivative operator is $\frac{\partial}{\partial x}$ (a stylized $d$ and not the Greek letter $\delta$). So, $\frac{\partial}{\partial x}xy$ and $\frac{\partial}{\partial y}xy$ are the partial derivatives of $xy$; often, these are just called the partials. For functions of a single parameter, operator $\frac{\partial}{\partial x}$ is equivalent to $\frac{d}{dx}$ (for sufficiently smooth functions). However, it’s better to use $\frac{d}{dx}$ to make it clear you’re referring to a scalar derivative.

The partial derivative with respect to $x$ is just the usual scalar derivative, simply treating any other variable in the equation as a constant. Consider function $f(x, y) = 3x^2y$. The partial derivative
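Treating the other variable as a constant, the scalar rules above give $\frac{\partial f}{\partial x} = 6xy$ and $\frac{\partial f}{\partial y} = 3x^2$ for $f(x, y) = 3x^2y$; a finite-difference sketch of ours confirms this:

```python
def partial(f, args, i, h=1e-6):
    """Central finite difference of f with respect to its i-th argument."""
    up = list(args); up[i] += h
    dn = list(args); dn[i] -= h
    return (f(*up) - f(*dn)) / (2 * h)

f = lambda x, y: 3 * x ** 2 * y
x, y = 2.0, 5.0
df_dx = partial(f, (x, y), 0)  # should be close to 6*x*y = 60
df_dy = partial(f, (x, y), 1)  # should be close to 3*x**2 = 12
```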
