Graph Data Management and Mining: From Recognition to Application


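"Graph data management and mining is a key topic in modern information technology. It covers the organization and analysis of data in complex network structures and the extraction of valuable information from them. Pattern recognition plays a central role in this field: it defines the ability to classify or categorize observations of the real world. Human evolution shows that this ability is key to survival and adaptation, and it has led to advanced neural and cognitive systems that have helped us solve pattern recognition tasks over millions of years. Pattern recognition is not merely the assignment of observed objects to predefined categories; it builds and applies mathematical models in order to understand the regularities and latent relationships in data. In graph data mining, graph classification is one important task: by analyzing graph structure it identifies features shared by different nodes or subgraphs and assigns them to the corresponding classes. This approach usually relies on vector space embedding, a technique that maps graph data into a low-dimensional Euclidean space so that similarities can be analyzed and computed. In addition, titles in the series on machine perception and artificial intelligence, such as "Graph-Theoretic Techniques for Web Content Mining", "Computational Intelligence in Software Quality Assurance", and "The Dissimilarity Representation for Pattern Recognition: Foundations and Applications", focus on methods and applications in graph data mining. These works explore how to use graph edit distance and kernel mach...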

December 28, 2009 9:59 Classiﬁcation and Clustering clustering

Introduction and Basic Concepts

The real power of human thinking is based on recognizing patterns.

Ray Kurzweil

1.1 Pattern Recognition

Pattern recognition describes the act of determining to which category, referred to as class, a given pattern belongs and taking an action according to the class of the recognized pattern. The notion of a pattern thereby describes an observation in the real world. Because pattern recognition has been essential for our survival, evolution has led to highly sophisticated neural and cognitive systems in humans for solving pattern recognition tasks over tens of millions of years [1]. In summary, recognizing patterns is one of the most crucial capabilities of human beings.

Each individual is faced with a huge number of different pattern recognition problems in everyday life [2]. Examples of such tasks include the recognition of letters in a book, the face of a friend in a crowd, a spoken word embedded in noise, the chart of a presentation, the proper key to the locked door, the smell of coffee in the cafeteria, the importance of a certain message in the mail folder, and many more. These simple examples illustrate the essence of pattern recognition. In the world there exist classes of patterns we distinguish according to certain knowledge that we have learned before [3].

Graph Classification and Clustering Based on Vector Space Embedding

Most pattern recognition tasks encountered by humans can be solved intuitively without explicitly defining a certain method or specifying an exact algorithm. Yet, formulating a pattern recognition problem in an algorithmic way provides us with the possibility to delegate the task to a machine. This can be particularly interesting for very complex as well as for cumbersome tasks in both science and industry. Examples are the prediction of the properties of a certain molecule based on its structure, which is known to be very difficult, or the reading of handwritten payment orders, which might become quite tedious when their quantity reaches several hundreds. Such examples have evoked a growing interest in adequate modeling of the human pattern recognition ability, which in turn led to the establishment of the research area of pattern recognition and related fields, such as machine learning, data mining, and artificial intelligence [4]. The ultimate goal of pattern recognition as a scientific discipline is to develop methods that mimic the human capacity of perception and intelligence. More precisely, pattern recognition as a computer science discipline aims at defining mathematical foundations, models and methods that automate the process of recognizing patterns of diverse nature.

However, it soon turned out that many of the most interesting problems in pattern recognition and related fields are extremely complex, often making it difficult, or even impossible, to specify an explicit programmed solution. For instance, we are not able to write an analytical program to recognize, say, a face in a photo [5]. In order to overcome this problem, pattern recognition commonly employs the so-called learning methodology. In contrast to the theory-driven approach, where precise specifications of the algorithm are required in order to solve the task analytically, in this approach the machine is meant to learn the concept of a class by itself, identify objects, and discriminate between them.

Typically, a machine is fed with training data, coming from a certain problem domain, whereon it tries to detect significant rules in order to solve the given pattern recognition task [5]. Based on this training set of samples and particularly the inferred rules, the machine becomes able to make predictions about new, i.e. unseen, data. In other words, the machine acquires generalization power by learning. This approach is highly inspired by the human ability to recognize, for instance, what a dog is, given just a few examples of dogs. Thus, the basic idea of the learning methodology is that a few examples are sufficient to extract important knowledge about their respective class [4]. Consequently, employing this approach requires computer scientists to provide mathematical foundations to a machine allowing it to learn from examples.

Pattern recognition and related fields have become an immensely important discipline in computer science. After decades of research, reliable and accurate pattern recognition by machines is now possible in many formerly very difficult problem domains. Prominent examples are mail sorting [6, 7], e-mail filtering [8, 9], text categorization [10–12], handwritten text recognition [13–15], web retrieval [16, 17], writer verification [18, 19], person identification by fingerprints [20–23], gene detection [24, 25], activity predictions for molecular compounds [26, 27], and others. However, the indispensable necessity of further research in automated pattern recognition systems becomes obvious when we face new applications, challenges, and problems, as for instance the search for important information in the huge amount of data which is nowadays available, or the complete understanding of highly complex data which has been made accessible just recently. Therefore, the major role of pattern recognition will definitely be strengthened in the next decades in science, engineering, and industry.

1.2 Learning Methodology

The key task in pattern recognition is the analysis and the classification of patterns [28]. As discussed above, the learning paradigm is usually employed in pattern recognition. The learning paradigm states that a machine tries to infer classification and analysis rules from a sample set of training data. In pattern recognition several learning approaches are distinguished. This section goes into the taxonomy of supervised, unsupervised, and the recently emerged semi-supervised learning. All of these learning methodologies have in common that they incorporate important information captured in training samples into a mathematical model.

Supervised Learning. In the supervised learning approach each training sample has an associated class label, i.e. each training sample belongs to one and only one class from a finite set of classes. A class contains similar objects, whereas objects from different classes are dissimilar. The key task in supervised learning is classification. Classification refers to the process of assigning an unknown input object to one out of a given set of classes. Hence, supervised learning aims at capturing the relevant criteria from the training samples for the discrimination of different classes. Typical classification problems can be found in biometric person identification [29], optical character recognition [30], medical diagnosis [31], and many other domains.

Formally, in the supervised learning approach, we are dealing with a pattern space $X$ and a space of class labels $\Omega$. All patterns $x \in X$ are potential candidates to be recognized, and $X$ can be any kind of space (e.g. the real vector space $\mathbb{R}^n$, or a finite or infinite set of symbolic data structures¹). For binary classification problems the space of class labels is usually defined as $\Omega = \{-1, +1\}$. If the training data is labeled as belonging to one of $k$ classes, the space of class labels $\Omega = \{\omega_1, \ldots, \omega_k\}$ consists of a finite set of discrete symbols, representing the $k$ classes under consideration. This task is then referred to as multiclass classification. Given a set of $N$ labeled training samples $\{(x_i, \omega_i)\}_{i=1,\ldots,N} \subset X \times \Omega$, the aim is to derive a prediction function $f : X \to \Omega$, assigning patterns $x \in X$ to classes $\omega_i \in \Omega$, i.e. classifying the patterns from $X$. The prediction function $f$ is commonly referred to as classifier. Hence, supervised learning employs some algorithmic procedures in order to define a powerful and accurate prediction function $f$.²
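
The formal setting above can be made concrete with a small sketch. The toy training set, the nearest-neighbor rule, and all names below are assumptions chosen for illustration and are not taken from the book; the sketch only shows what a prediction function $f : X \to \Omega$ looks like for $X = \mathbb{R}^2$ and $\Omega = \{-1, +1\}$.

```python
import math

# Hypothetical toy training set {(x_i, omega_i)}: patterns in R^2, labels in {-1, +1}.
training_set = [
    ((0.0, 0.0), -1),
    ((0.5, 1.0), -1),
    ((1.0, 0.0), -1),
    ((4.0, 4.0), +1),
    ((5.0, 4.5), +1),
    ((4.5, 5.0), +1),
]


def f(x):
    """Prediction function f: X -> Omega, here realized as a nearest-neighbor rule."""
    # Pick the training sample whose pattern is closest to x (Euclidean distance)
    # and return the class label attached to it.
    _, nearest_label = min(training_set, key=lambda sample: math.dist(x, sample[0]))
    return nearest_label


print(f((0.2, 0.3)))  # pattern close to the first group
print(f((4.8, 4.1)))  # pattern close to the second group
```

Any other rule that maps a pattern to a label would serve equally well as $f$; the nearest-neighbor rule is used here only because it needs no training step beyond storing the samples.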

Obviously, an overly complex classifier system $f : X \to \Omega$ may allow perfect classification of all training samples $\{x_i\}_{i=1,\ldots,N}$. Such a system, however, might perform poorly on unseen data $x \in X \setminus \{x_i\}_{i=1,\ldots,N}$. In this particular case, which is referred to as overfitting, the classifier is too strongly adapted to the training set. Conversely, underfitting occurs when the classifier is unable to model the class boundaries with a sufficient degree of precision. In the best case, a classifier integrates the trade-off between underfitting and overfitting in its training algorithm. Consequently, the overall aim is to derive a classifier from the training samples $\{x_i\}_{i=1,\ldots,N}$ that is able to correctly classify a majority of the unseen patterns $x$ coming from the same pattern space $X$. This ability of a classifier is generally referred to as generalization power. The underlying assumption for generalization is that the training samples $\{x_i\}_{i=1,\ldots,N}$ are sufficiently representative for the whole pattern space $X$.
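
A small experiment can illustrate the overfitting effect described above. The sketch below is an assumption-laden toy example, not the book's method: it uses a k-nearest-neighbor classifier on a hypothetical one-dimensional data set in which one training sample carries a noisy label. With k = 1 the classifier reproduces every training label, including the noisy one, while k = 3 gives up perfect training accuracy in exchange for smoother class boundaries.

```python
from collections import Counter

# Hypothetical 1-D training set {(x_i, omega_i)}; the sample at 1.5 has a noisy label.
train = [(0.0, -1), (1.0, -1), (2.0, -1), (1.5, +1), (4.0, +1), (5.0, +1), (6.0, +1)]
# Hypothetical unseen patterns together with their true classes.
unseen = [(0.5, -1), (1.4, -1), (1.6, -1), (4.5, +1)]


def knn_classify(x, k):
    """Majority vote among the k training samples closest to x."""
    neighbors = sorted(train, key=lambda sample: abs(x - sample[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def accuracy(samples, k):
    """Fraction of samples whose predicted class equals the given label."""
    hits = sum(knn_classify(x, k) == label for x, label in samples)
    return hits / len(samples)


for k in (1, 3):
    print(f"k={k}: training accuracy {accuracy(train, k):.2f}, "
          f"unseen accuracy {accuracy(unseen, k):.2f}")
```

On this toy data the k = 1 classifier is perfect on the training set but misclassifies the unseen patterns near the noisy sample, whereas k = 3 classifies all four unseen patterns correctly. The numbers depend entirely on the hand-picked data and should not be read as a general result.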

¹ We will revisit the problem of adequate pattern spaces in the next section.

² The supervised learning approach can be formulated in a more general way to include other recognition tasks than classification, such as regression. Regression refers to the case of supervised pattern recognition in which rather than a class $\omega_i \in \Omega$, an unknown real-valued feature $y \in \mathbb{R}$ has to be predicted. In this case, the training sample consists of pairs $\{(x_i, y_i)\}_{i=1,\ldots,N} \subset X \times \mathbb{R}$. However, in this book considerations in supervised learning are restricted to pattern classification problems.

Unsupervised Learning. In unsupervised learning, as opposed to supervised learning, there is no labeled training set whereon the class concept is learned. In this case the important information needs to be extracted from the patterns without the information provided by the class label [5]. Metaphorically speaking, in this learning approach no teacher is available defining which class a certain pattern belongs to, but only the patterns themselves. More concretely, in the case of unsupervised learning, the overall problem is to partition a given collection of unlabeled patterns $\{x_i\}_{i=1,\ldots,N}$ into $k$ meaningful groups $C_1, \ldots, C_k$. These groups are commonly referred to as clusters, and the process of finding a natural division of the data into homogeneous groups is referred to as clustering. The clustering algorithm, or clusterer, is a function mapping each pattern $\{x_i\}_{i=1,\ldots,N}$ to a cluster $C_j$. Note that there are also fuzzy clustering algorithms available, allowing a pattern to be assigned to several clusters at a time. Yet, in the present book only hard clusterings, i.e. clusterings where patterns are assigned to exactly one cluster, are considered.
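
As a minimal sketch of a hard clustering, the following code partitions a handful of unlabeled patterns into k = 2 groups with a plain k-means procedure. The choice of k-means, the toy data, and the way the initial centers are picked are assumptions made for illustration; the text above does not prescribe a particular clustering algorithm.

```python
import math


def k_means(patterns, initial_centers, iterations=10):
    """Hard clustering: returns a list with exactly one cluster index per pattern."""
    centers = list(initial_centers)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each pattern is mapped to the cluster with the nearest center.
        assignment = [
            min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
            for x in patterns
        ]
        # Update step: each center moves to the mean of the patterns assigned to it.
        for j in range(len(centers)):
            members = [x for x, c in zip(patterns, assignment) if c == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment


# Hypothetical unlabeled patterns in R^2 forming two visibly separated groups.
patterns = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
assignment = k_means(patterns, initial_centers=[patterns[0], patterns[3]])

# Collect the groups C_1, ..., C_k (indexed from 0 here) from the assignment.
clusters = {j: [x for x, c in zip(patterns, assignment) if c == j] for j in set(assignment)}
print(assignment)
print(clusters)
```

Each pattern receives exactly one cluster index, which is what distinguishes this hard clustering from a fuzzy one.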

Clustering is particularly suitable for the exploration of interrelationships among individual patterns [32, 33]. That is, clustering algorithms are mainly used as a data exploration and data analysis tool. The risk in using clustering methods is that rather than finding a natural structure in the underlying data, we are imposing an arbitrary and artificial structure [3]. For instance, for many of the clustering algorithms the number of clusters $k$ to be found in the data set has to be set by the user in advance. Moreover, given a particular set of patterns, different clustering algorithms, or even the same algorithm randomly initialized, might lead to completely different clusters. An open question is in which scenarios to employ a clustering approach at all [34].
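
The dependence on initialization can be reproduced with a few lines of code. The sketch below is an illustration under stated assumptions: it uses plain k-means with k = 2 on four hypothetical patterns, and it replaces random initialization with two hand-picked sets of initial centers so that the outcome is deterministic.

```python
import math


def k_means(patterns, centers, iterations=10):
    """Plain k-means; returns one cluster index per pattern (hard clustering)."""
    centers = list(centers)
    labels = []
    for _ in range(iterations):
        # Assign every pattern to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
            for x in patterns
        ]
        # Recompute every center as the mean of its members.
        for j in range(len(centers)):
            members = [x for x, c in zip(patterns, labels) if c == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


# Four hypothetical patterns on the corners of a 4 x 1 rectangle.
patterns = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Same algorithm, same data, same k, two different initializations.
left_right = k_means(patterns, centers=[(0.0, 0.0), (4.0, 0.0)])
bottom_top = k_means(patterns, centers=[(0.0, 0.0), (0.0, 1.0)])
print(left_right)
print(bottom_top)
```

The first run groups the patterns into a left and a right pair, the second into a bottom and a top pair. Both partitions are stable under further iterations, so the algorithm alone cannot tell which one reflects the natural structure of the data.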

An answer can be found in the concept of a cluster. Although both concepts, class and cluster, seem to be quite similar, their subtle difference is crucial. In contrast to the concept of a class label, the assignment of a pattern to a certain cluster is not intrinsic. Changing a single feature of a pattern, or changing the distance measurement between individual patterns, might change the partitioning of the data, and therefore the patterns' cluster membership. Conversely, in a supervised learning task the class membership of the patterns of the labeled training set never changes. Hence, the objective of clustering is not primarily the classification of the data, but an evaluation and exploration of the underlying distance measurement, the representation formalism, and the distribution of the patterns.
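
That cluster membership is not intrinsic can be seen even for a single pattern. In the sketch below, the two cluster centers, the pattern, and the choice of Euclidean versus Manhattan distance are all hypothetical; the point is only that the same pattern is assigned to a different cluster once the distance measure changes.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.dist(a, b)


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def nearest_cluster(x, centers, distance):
    """Index of the cluster center closest to x under the given distance."""
    return min(range(len(centers)), key=lambda j: distance(x, centers[j]))


# Hypothetical cluster centers and one pattern to be assigned.
centers = [(3.0, 3.0), (5.0, 0.0)]
x = (0.0, 0.0)

print(nearest_cluster(x, centers, euclidean))  # center 0 is closer (about 4.24 vs 5.0)
print(nearest_cluster(x, centers, manhattan))  # center 1 is closer (5.0 vs 6.0)
```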

Semi-supervised Learning. Semi-supervised learning is halfway between supervised and unsupervised learning [35]. As the name of this approach indicates, both labeled and unlabeled data are provided to the learning algorithm. An important requirement for semi-supervised learning
