
separated by the same feature, just that each has something unique about it that allows you to
break it into roughly equal, easily identifiable groups. One of the most popular clustering
algorithms is K-means, which is a specific instance of a more powerful technique called the E-M
algorithm.
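To make the alternation concrete, here is a minimal NumPy sketch of K-means (the toy data and the simple first-k initialization are ours for illustration): the E-step assigns each point to its nearest centroid, and the M-step moves each centroid to the mean of its assigned points.

```python
import numpy as np

def kmeans(points, k, iterations=10):
    # For simplicity, initialize centroids to the first k points
    centroids = points[:k].astype(float).copy()
    for _ in range(iterations):
        # E-step: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# Two small, well-separated groups of 2-D points
data = np.array([[0.0, 0.0], [10.0, 10.0],
                 [0.5, 0.0], [9.5, 10.0],
                 [0.0, 0.5], [10.0, 9.5]])
centroids, labels = kmeans(data, k=2)
```

After a few iterations the assignments stop changing, which is how K-means typically detects convergence in practice.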
Dimensionality reduction is about manipulating the data to view it under a much simpler
perspective. It is the ML equivalent of the phrase, “Keep it simple, stupid.” For example, by
getting rid of redundant features, we can explain the same data in a lower-dimensional space
and see which features really matter. This simplification also helps in data visualization or
preprocessing for performance efficiency. One of the earliest algorithms is Principal Component
Analysis (PCA), and some newer ones include autoencoders, which we’ll cover in chapter 7.
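Here is a short sketch of the idea behind PCA using NumPy's singular value decomposition (the made-up data has three features that all vary along a single direction, so one component captures everything):

```python
import numpy as np

def pca(data, n_components):
    """Project `data` (n x d) onto its top n_components principal axes."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance explained
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# 3-D points whose features are redundant copies of one underlying signal
t = np.arange(5, dtype=float)
points = np.column_stack([t, 2 * t, 3 * t])
reduced = pca(points, n_components=1)  # one dimension suffices here
```

Because the three features are perfectly correlated, the single-component projection preserves all of the variance in the original data.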
1.4.3 Reinforcement Learning
Supervised and unsupervised learning seem to suggest that the existence of a teacher is all
or nothing. But there is a well-studied branch of machine learning where the environment acts
as a teacher, providing hints rather than definite answers. The learning system receives
feedback on its actions, with no concrete promise that it's progressing toward its goal,
which might be to solve a maze or accomplish some other explicit objective.
Exploration vs. Exploitation is the heart of reinforcement learning
Imagine playing a video game that you've never seen before. You click buttons on a controller and discover that a
particular combination of presses gradually increases your score. Brilliant! Now you repeatedly exploit this finding in
hopes of beating the high score. In the back of your mind, though, you wonder whether there's a better combination
of button presses that you're missing out on. Should you exploit your current best strategy, or risk exploring new options?
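The dilemma in the sidebar can be sketched with a simple epsilon-greedy strategy (the button combinations and their hidden payoff rates below are made up for illustration): most of the time the agent exploits its best-known action, but with small probability it explores a random one.

```python
import random

def play(actions, rounds=10000, epsilon=0.1, seed=0):
    """actions maps name -> hidden win probability; the learner
    only observes win/loss outcomes, never the probabilities."""
    rng = random.Random(seed)
    counts = {a: 0 for a in actions}
    values = {a: 0.0 for a in actions}  # running average reward per action
    for _ in range(rounds):
        if rng.random() < epsilon:
            choice = rng.choice(list(actions))    # explore a random action
        else:
            choice = max(values, key=values.get)  # exploit the best so far
        reward = 1.0 if rng.random() < actions[choice] else 0.0
        counts[choice] += 1
        values[choice] += (reward - values[choice]) / counts[choice]
    return values

# Two hypothetical button combos with different hidden payoff rates
estimates = play({"combo_a": 0.3, "combo_b": 0.7})
```

Even if the agent latches onto the worse combo early, the occasional exploration eventually reveals the better one, and exploitation switches over.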
Unlike supervised learning, where training data is conveniently labeled by a “teacher,”
reinforcement learning trains on information gathered by observing how the environment
reacts to actions. In other words, reinforcement learning is a type of machine learning that
interacts with the environment to learn which combination of actions yields the most favorable
results. Since we're already anthropomorphizing our algorithm by using the words
“environment” and “action,” scholars typically refer to the system as an autonomous “agent.”
Therefore, this type of machine learning lends itself naturally to the domain of robotics.
To reason about agents in the environment, we introduce two new concepts: states and
actions. The status of the world frozen at some particular time is called a state. An agent may
perform one of many actions to change the current state. To drive an agent to perform actions,
each state yields a corresponding reward. An agent eventually discovers the expected total
reward of each state, called the value of a state.
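The states, actions, rewards, and values above can be sketched with a few lines of Python (the two-state toy world and its transitions are invented for illustration; the update rule is the standard value-iteration idea, with a discount factor that weighs future rewards less than immediate ones):

```python
def state_values(rewards, transitions, discount=0.9, iterations=100):
    """Estimate each state's value: the expected total discounted reward
    from that state, assuming the agent always picks the best action.
    transitions[s] maps action name -> resulting state."""
    values = {s: 0.0 for s in rewards}
    for _ in range(iterations):
        values = {
            s: rewards[s]
               + discount * max(values[nxt] for nxt in transitions[s].values())
            for s in rewards
        }
    return values

# Tiny hypothetical world: from "start" the agent can stay put or move
# to "goal", which is the only state that yields a reward
rewards = {"start": 0.0, "goal": 1.0}
transitions = {
    "start": {"stay": "start", "go": "goal"},
    "goal": {"stay": "goal"},
}
values = state_values(rewards, transitions)
```

With a discount of 0.9, the value of "goal" converges to 1 / (1 - 0.9) = 10, and "start" converges to 0.9 times that, so the agent learns that moving toward the goal is the better action.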
Like any other machine learning system, performance improves with more data. In this
case, the data is a history of previous experiences. In reinforcement learning, we do not know
the final cost or reward of a series of actions until it’s executed. These situations render
traditional supervised learning ineffective, because we do not know exactly which action in the