rule, on the other hand, was originally set up only for theoretical
reasons and to facilitate a comparison with the other self-
organizing models. Moreover, stepwise learning cannot be used
with general metrics, but we will see that batch learning also
solves this problem.
More detailed descriptions of the SOM algorithms will be given
below.
Several commercial software packages, as well as plenty of
freeware for the SOM, are available. This author strongly encourages
the use of well-justified public-domain software packages. For
instance, there exist two freeware packages developed by us,
namely, SOM_PAK (Kohonen, Hynninen, Kangas, & Laaksonen,
1996; SOM_PAK Team, 1990) and the SOM Toolbox (SOM Toolbox
Team, 1999; Vesanto, Alhoniemi, Himberg, Kiviluoto, & Parviainen,
1999; Vesanto, Himberg, Alhoniemi, & Parhankangas, 1999), both
downloadable from the Internet. Both packages contain auxiliary
analytical procedures, and especially the SOM Toolbox, which
makes use of MATLAB functions, is provided with versatile
graphics tools.
Unlike in most biologically inspired map models, the topo-
graphic order in the SOM can always be materialized globally over
the whole map.
The spatial order in the display facilitates a convenient and
quick visual inspection of the similarity relationships of the input
data as well as their clustering tendency, and comes in handy in the
verification and validation of data samples. Moreover, with proper
calibration of the models, the clustering and classification of the
data become explicit.
The rest of this article concentrates on the SOM principles and
applications. The SOM has been used extensively as a visualization
tool in exploratory data analysis. It has had plenty of practical
applications ranging from industrial process control and finance
analyses to the management of very large document collections.
New, promising applications exist in bioinformatics. The largest
applications so far have been in the management and retrieval of
textual documents, of which this paper contains two examples.
Many versions of the SOM algorithms have been suggested
over the years. They are too numerous to be reviewed here; cf.
the extensive bibliographies (Kaski, Kangas, & Kohonen, 1998;
Oja, Kaski, & Kohonen, 2003; Pöllä, Honkela, & Kohonen, 2009).
See also the Discussion in Section 7.
3.2. Calibration of the SOM
If the input items fall in a finite number of classes, the different
models can be made to correspond to these classes and to
be provided with corresponding symbolic labels. This kind of
calibration of the models can be made in two ways: 1. If the
number of input items is sufficiently large, one can first study the
distribution of matches that all of the input data items make with
the various models. A particular model is then labeled according to
the class that occurs in the majority of the input samples that match
this model. In the case of a tie, one may carry out, e.g., a majority voting
over a larger neighborhood of the model. 2. If only a smaller
number of input data items is available, so that the above majority
voting makes no sense (e.g., there are too many ties, or there are no
hits at some of the models), one can apply the so-called k-nearest-
neighbors (kNN) method. For each model, those k input data items
that are closest to it (in the metric applied in the construction of
the SOM) are searched, and a majority voting over them is carried
out to determine the most probable classification of the node. In
the case of a tie, the value of k is increased until the tie is resolved.
Usually k is selected to be on the order of half a dozen to a hundred,
depending on the number of input data items and the size of the
SOM array.
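For illustration, the two labeling schemes can be sketched in Python/NumPy as follows. This is only a minimal sketch, not code from any of the packages mentioned above; the names `models` (the array of model vectors), `data` (the matrix of input samples), and `labels` (their class symbols) are hypothetical, and Euclidean matching is assumed.

```python
import numpy as np
from collections import Counter

def calibrate_by_majority(models, data, labels):
    """Label each model with the class occurring in the majority of
    the input samples whose best match it is; nodes without hits
    keep the label None."""
    # Distance from every sample to every model; best-matching model per sample.
    d = np.linalg.norm(data[:, None, :] - models[None, :, :], axis=2)
    bmu = np.argmin(d, axis=1)
    node_labels = [None] * len(models)
    for i in range(len(models)):
        hits = [labels[j] for j in np.flatnonzero(bmu == i)]
        if hits:
            node_labels[i] = Counter(hits).most_common(1)[0][0]
    return node_labels

def calibrate_by_knn(models, data, labels, k=10):
    """Label each model by a majority vote over the k input samples
    closest to it; in the case of a tie, k is increased until the
    tie is resolved (or the data are exhausted)."""
    node_labels = []
    for m in models:
        order = np.argsort(np.linalg.norm(data - m, axis=1))
        kk = min(k, len(data))
        while True:
            votes = Counter(labels[j] for j in order[:kk]).most_common()
            if len(votes) < 2 or votes[0][1] > votes[1][1] or kk == len(data):
                break
            kk += 1  # tie: take one more neighbor into the vote
        node_labels.append(votes[0][0])
    return node_labels
```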
When a new, unknown input item is compared with all of the
models, it will be identified with the best-matching model. The
classification of the input item is then understood as that of the
best-matching model.
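Continuing the same hypothetical sketch, the identification of a new item with its best-matching model, and hence its classification, reduces to a single nearest-model search:

```python
def classify(models, node_labels, x):
    """Identify x with its best-matching model and return that model's label."""
    i = int(np.argmin(np.linalg.norm(models - x, axis=1)))
    return node_labels[i]

# Usage, e.g.:
# node_labels = calibrate_by_majority(models, data, labels)
# predicted = classify(models, node_labels, new_item)
```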
3.3. On ‘‘matching by similarity’’
There exist many versions of the SOM, which apply different
definitions of ‘‘similarity’’. This property first deserves a short
discussion. ‘‘Similarity’’ and ‘‘distance’’ are usually opposite concepts.
The cognitive meaning of similarity is a very vague one. For
instance, one may talk of the similarity of two persons or two
historical eras, although such a comparison is usually based on a
subjective opinion.
If the same comparison is to be implemented automatically,
it can only be based on some very restricted analytical, say,
statistical attributes. The situation is much clearer if we deal with
concrete objects in science or technology, since we can then base
the definition of dissimilarity on basic mathematical concepts of,
say, distance measures between attribute vectors. The statistical
figures are usually also expressed as real vectors, consisting of
numerical results or other statistical indicators. Various kinds
of spectra and other transformations can also be regarded as
multidimensional vectors of their components.
The first problem in trying to compare such vectors is usually
different scaling of their elements. For metric comparison, a simple
remedy is to normalize the scales so that either the variances of the
variables in the different dimensions, or their maxima and minima,
respectively, become the same. After that, some standard distance
measure, such as the Euclidean or, more generally, the Minkowski
distance, can be tried, the choice depending on the nature
of the data. It has turned out that the Euclidean distance, with
normalization, is already applicable to most practical studies, since
the SOM is able to reveal even complex interdependencies of the
variables in its display.
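As a sketch of this preprocessing step (again illustrative only; `X` is a hypothetical data matrix with one variable per column), the two normalizations and the Minkowski distance could be written as:

```python
import numpy as np

def normalize_columns(X, method="variance"):
    """Rescale each variable (column of X) so that either the variances
    or the min-max ranges become the same in all dimensions."""
    if method == "variance":
        s = X.std(axis=0)
        s[s == 0.0] = 1.0                     # guard against constant columns
        return (X - X.mean(axis=0)) / s
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)   # scale to [0, 1]

def minkowski(a, b, p=2.0):
    """Minkowski distance between vectors a and b; p = 2 gives
    the Euclidean distance."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))
```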
A natural measure of the similarity of vectorial items is in
general some inner product. In the SOM research, the dot product
is commonly used. This measure also complies better with the
biological neural models than the Euclidean distance. However,
the model vectors m_i, for their comparison with the input x, must
be kept normalized to constant length all the time. If the vector
dimensionality is high, and also the input vectors are normalized
to constant length, the difference between SOMs based on the
Euclidean distances and the dot products is insignificant. (For the
construction of Euclidean and dot-product SOMs, cf. Sections 4.1
and 4.5, respectively.) On the other hand, if there are plenty of
zero elements in the vectors, the computation of dot products is
correspondingly faster. This property can be utilized effectively
especially in the fast computation of document maps discussed at
the end of this article.
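A minimal sketch of such dot-product matching, under the assumptions stated above (unit-length model vectors; hypothetical names throughout), could look like this; the sparse variant shows why zero elements make the computation faster:

```python
import numpy as np

def unit_normalize(V):
    """Scale every row of V to unit Euclidean length."""
    n = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.where(n > 0.0, n, 1.0)

def bmu_dot(models, x):
    """Best match by the largest dot product; with unit-length models
    (and a unit-length x) this agrees closely with the Euclidean match."""
    return int(np.argmax(models @ x))

def bmu_dot_sparse(models, nz_idx, nz_val):
    """The same match when x is given only by the indices and values of
    its nonzero elements; the zeros contribute nothing to the products."""
    return int(np.argmax(models[:, nz_idx] @ nz_val))
```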
Before proceeding further, it will be necessary to emphasize a
basic fact. An image, often given as a set of pixels or other structural
elements, will usually not be applicable, as such, as an input vector.
The natural variations in the images, such as translations, rotations,
variations of size, etc., as well as variations due to different lighting
conditions are usually so wide that a direct comparison of the
objects on the basis of their appearances does not make any
sense. Instead, the classification of natural items should be based
on the extraction and classification of their characteristic features
which must be as invariant as possible. Features of this type may
consist of color spectrograms, expansions of the images in Fourier
transforms, wavelets, principal components, or eigenvectors of
some image operators, etc. If one can describe the input objects by
a restricted set of invariant features, the dimensionality of the input
representations and the computing load are reduced drastically.
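As one illustrative example of such an invariant feature (an assumption of this sketch, not a prescription of the paper): the magnitude of the two-dimensional discrete Fourier transform is unchanged by circular translations of an image, so a small low-frequency block of it yields a compact, translation-invariant input vector.

```python
import numpy as np

def translation_invariant_features(image, side=8):
    """The magnitude of the 2-D DFT does not change under circular
    translations of the image; a small central (low-frequency) block
    of it gives a compact, translation-invariant feature vector."""
    f = np.fft.fftshift(np.abs(np.fft.fft2(image)))   # low frequencies centered
    cy, cx = f.shape[0] // 2, f.shape[1] // 2
    r = side // 2
    v = f[cy - r: cy + r, cx - r: cx + r].ravel()     # side*side features
    n = np.linalg.norm(v)
    return v / n if n > 0.0 else v                    # unit length for matching
```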
A special kind of dissimilarity or distance measure is applied
in an SOM that is called the Adaptive-Subspace SOM (ASSOM), cf.
Kohonen (1995, 1996, 2001) and Kohonen, Kaski, and Lappalainen
(1997). In it, certain elementary systems are associated with the
nodes, and these systems develop into specific filters that respond
invariantly to some class (e.g., translation-invariant, rotation-
invariant, or scale-invariant) of local features. Their parameters