1.5 OUTLINE OF THE BOOK
Chapters 2–10 deal with supervised pattern recognition and Chapters 11–16 deal
with the unsupervised case. Semi-supervised learning is introduced in Chapter 10.
The goal of each chapter is to start with the basics, definitions, and approaches,
and to move progressively to more advanced issues and recent techniques. To what
extent the various topics covered in the book will be presented in a first course
on pattern recognition depends very much on the course’s focus, on the students’
background, and, of course, on the lecturer. In the following outline of the
chapters, we give our view of the topics that we cover in a first course on pattern
recognition. No doubt, other views do exist and may be better suited to different
audiences. At the
end of each chapter, a number of problems and computer exercises are provided.
Chapter 2 is focused on Bayesian classification and techniques for estimating
unknown probability density functions. In a first course on pattern recognition,
the sections related to Bayesian inference, the maximum entropy method, and the
expectation maximization (EM) algorithm are omitted. Special focus is put on
Bayesian classification, the minimum distance (Euclidean and Mahalanobis) and
nearest neighbor classifiers, and the naive Bayes classifier. Bayesian networks
are briefly introduced.
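As an indication of the level at which students can exercise these ideas, the
following MATLAB fragment is a minimal sketch of a minimum Mahalanobis distance
classifier for two classes with known means and a common covariance matrix; all
numerical values are illustrative assumptions and are not taken from the text.

% Minimal sketch: minimum Mahalanobis distance classifier for two
% classes with known mean vectors and a common covariance matrix.
% All numerical values below are illustrative only.
m1 = [0; 0];  m2 = [3; 3];        % class mean vectors
S  = [1.1 0.3; 0.3 1.9];          % common covariance matrix
x  = [1.0; 2.2];                  % feature vector to be classified
d1 = (x - m1)' / S * (x - m1);    % squared Mahalanobis distance to class 1
d2 = (x - m2)' / S * (x - m2);    % squared Mahalanobis distance to class 2
if d1 < d2
    label = 1;                    % assign x to the nearer class
else
    label = 2;
end

Setting S to the identity matrix reduces this to the minimum Euclidean distance
classifier.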
Chapter 3 deals with the design of linear classifiers. The sections dealing with the
probability estimation property of the mean square solution, as well as the
bias-variance dilemma, are only briefly mentioned in our first course. The basic philosophy
underlying the support vector machines can also be explained, although a deeper
treatment requires mathematical tools (summarized in Appendix C) with which most
students are not yet familiar in a first course. Instead, emphasis is put
on the linear separability issue, the perceptron algorithm, and the mean square
and least squares solutions. After all, these topics have a much broader scope
and applicability. Support vector machines are briefly introduced. The geometric
interpretation offers students a better understanding of the SVM theory.
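To indicate the flavor of the corresponding computer exercises, the following is
a minimal MATLAB sketch of the perceptron algorithm in its reward-and-punishment
form; the toy data set, learning rate, and epoch limit are illustrative assumptions.

% Minimal perceptron sketch on a linearly separable toy data set.
% Data, learning rate, and epoch limit are illustrative only.
X = [0 0; 0 1; 2 2; 3 2];             % one feature vector per row
y = [-1; -1; 1; 1];                   % class labels in {-1, +1}
w = zeros(2, 1); b = 0; rho = 0.5;    % weight vector, bias, learning rate
for epoch = 1:100
    errors = 0;
    for i = 1:size(X, 1)
        if y(i) * (X(i, :) * w + b) <= 0    % sample on the wrong side
            w = w + rho * y(i) * X(i, :)';  % move the hyperplane toward it
            b = b + rho * y(i);
            errors = errors + 1;
        end
    end
    if errors == 0, break, end            % converged: all samples separated
end

For linearly separable data the loop is guaranteed to terminate; otherwise it
runs until the epoch limit is reached.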
Chapter 4 deals with the design of nonlinear classifiers. The section dealing with
exact classification is bypassed in a first course. The proof of the backpropagation
algorithm is usually very boring for most students, and we bypass its details.
A description of its rationale is given, and the students experiment with it using
MATLAB. The issues related to cost functions are bypassed. Pruning is discussed
with an emphasis on generalization issues. Emphasis is also given to Cover’s theorem
and radial basis function (RBF) networks. The nonlinear support vector machines,
decision trees, and combining classifiers are only briefly touched upon, via a
discussion of the basic philosophy behind their rationale.
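Since the students experiment with backpropagation in MATLAB, the following
minimal sketch shows what a single training step boils down to for a network
with one hidden layer, sigmoid activations, and a squared-error cost; the layer
sizes, initialization, and learning rate are illustrative assumptions.

% Minimal sketch: one backpropagation step for a one-hidden-layer
% network with sigmoid activations and squared-error cost.
% Sizes, initialization, and learning rate are illustrative only.
x = [0.5; -1.2];  t = 1;                    % input vector and target output
W1 = 0.1 * randn(3, 2); b1 = zeros(3, 1);   % hidden layer (3 units)
W2 = 0.1 * randn(1, 3); b2 = 0;             % output layer (1 unit)
sig = @(a) 1 ./ (1 + exp(-a));              % logistic activation
z    = sig(W1 * x + b1);                    % forward pass: hidden outputs
yhat = sig(W2 * z + b2);                    % forward pass: network output
d2 = (yhat - t) * yhat * (1 - yhat);        % output delta: error times slope
d1 = (W2' * d2) .* z .* (1 - z);            % hidden deltas, propagated back
rho = 0.5;                                  % learning rate
W2 = W2 - rho * d2 * z';  b2 = b2 - rho * d2;   % gradient-descent updates
W1 = W1 - rho * d1 * x';  b1 = b1 - rho * d1;

In practice the step is repeated over the training set until the cost function
levels off.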
Chapter 5 deals with the feature selection stage, and we have made an effort
to present most of the well-known techniques. In a first course we put emphasis
on the t-test. This is because hypothesis testing also has a broad horizon, and at
the same time it is easy for the students to apply it in computer exercises. Then,
depending on time constraints, divergence, the Bhattacharyya distance, and scatter
matrices are presented and commented on, although their more detailed treatment