would nonlinearly decrease as the shortest path connecting them increases. Therefore, it is reasonable to expect that the similarity decreases at an exponential rate as the shortest path increases, and f_1 is defined by:

f_1(l) = e^{-\alpha l}   (2)

where \alpha is a real constant between 0 and 1. From (2) we can see that as the path length decreases to zero, the similarity monotonically increases toward 1, while as the path length increases to infinity, the similarity monotonically decreases to 0. However, the shortest path alone may not be accurate enough for semantic similarity calculation, so the shortest path length method must be revised by adding more information from the hierarchical semantic structure of WordNet. It is intuitive that concepts at higher levels of the hierarchy carry more general information, while concepts at lower levels have more concrete semantics. Thus, the depth of a concept in the hierarchy should be taken into account. The depth h of the subsumer is derived by calculating the shortest path length from the subsumer to the root concept of the ontology. According to this observation, the depth function for similarity is defined by:
f_2(h) = \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}   (3)

where \beta > 0 is a smoothing factor. f_2 can also be regarded as an extension of Shepard's law [39], which claims that exponential-decay functions are a universal law of stimulus generalization in psychological science.
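For illustration, a minimal Python sketch of (2) and (3) follows; the function names are ours, the parameter values are those adopted later in this section (\alpha = 0.08, \beta = 0.60), and the multiplicative combination of the two factors is an assumption taken from Li et al. [21] rather than stated here.

import math

ALPHA = 0.08  # alpha in Eq. (2), the value adopted later in this section
BETA = 0.60   # beta in Eq. (3), the value adopted later in this section

def f1(path_length):
    # Eq. (2): the similarity contribution decays exponentially with the shortest path length l.
    return math.exp(-ALPHA * path_length)

def f2(subsumer_depth):
    # Eq. (3): tanh-shaped depth factor that approaches 1 as the subsumer depth h grows.
    e_pos = math.exp(BETA * subsumer_depth)
    e_neg = math.exp(-BETA * subsumer_depth)
    return (e_pos - e_neg) / (e_pos + e_neg)

def concept_similarity(path_length, subsumer_depth):
    # Assumption: the two factors are combined multiplicatively, sim(c1, c2) = f1(l) * f2(h),
    # as in Li et al. [21]; this section defines f1 and f2 but does not show the combination.
    return f1(path_length) * f2(subsumer_depth)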
So far we have obtained the semantic similarity between two concepts based on the thesaurus method. However, the common corpus-based (or information-based) method [32] is rather difficult to apply: the required probabilities cannot be obtained from the semantic nets alone, but must be estimated with the help of a large corpus [21]. The Brown Corpus [10] is the first modern, computer-readable general corpus. However, the scope of such a corpus is restricted for the various specific data sets encountered in practice. Moreover, it takes a long time to calculate the probability of encountering an instance of a concept in a large corpus. In the next section we will propose a new corpus-based semantic similarity measure.
Since a word can be expressed by different concepts, the semantic similarity between words is represented by the maximum similarity between the concepts signified by those words. Assuming word w_1 is represented by a concepts (c_{1,1}, c_{1,2}, ..., c_{1,a}) and word w_2 is represented by b concepts (c_{2,1}, c_{2,2}, ..., c_{2,b}), the semantic similarity between these two words is assessed by:
sim(w_1, w_2) = \max\{ sim(c_1, c_2) \}, \quad c_1 \in \{c_{1,1}, c_{1,2}, \ldots, c_{1,a}\}, \; c_2 \in \{c_{2,1}, c_{2,2}, \ldots, c_{2,b}\}   (4)
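A short sketch of (4): given the two concept sets and a concept-level similarity function (for instance concept_similarity above), the word similarity is the maximum over all concept pairs; the handling of words without associated concepts is our assumption.

def word_similarity(concepts_w1, concepts_w2, concept_sim):
    # Eq. (4): word-level similarity is the maximum similarity over all concept
    # pairs (c1, c2), with c1 drawn from w1's concepts and c2 from w2's concepts.
    if not concepts_w1 or not concepts_w2:
        return 0.0  # assumption: a word with no associated concepts contributes zero
    return max(concept_sim(c1, c2)
               for c1 in concepts_w1
               for c2 in concepts_w2)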
Hence, the semantic similarity between these two documents is defined by:
sim_{ONTO}(d_1, d_2) = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} sim(w_{1,i}, w_{2,j}) \right) / (mn)   (5)
where m and n are the numbers of WordNet lexicon words included in documents d_1 and d_2, respectively. In light of the experimental results given by Li et al. [21], \alpha in (2) and \beta in (3) are set to 0.08 and 0.60, respectively.
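A minimal sketch of (5), assuming words_d1 and words_d2 are the lists of WordNet lexicon words found in d_1 and d_2, so that m and n are simply their lengths.

def document_similarity(words_d1, words_d2, word_sim):
    # Eq. (5): average of sim(w_1i, w_2j) over all m*n word pairs of the two documents.
    m, n = len(words_d1), len(words_d2)
    if m == 0 or n == 0:
        return 0.0  # assumption: a document with no lexicon words yields zero similarity
    total = sum(word_sim(w1, w2) for w1 in words_d1 for w2 in words_d2)
    return total / (m * n)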
However, although such a thesaurus-based method is effective and can provide the semantic similarity between two individual words, relying on this semantic similarity alone imposes some restrictions in practice. For example, a document in a specialized domain does not necessarily contain WordNet lexicon words, or after stemming some well-formed words are broken into incomplete forms that are not included in the WordNet lexicon. Hence, some important concepts would be lost, and applying WordNet alone for semantic similarity calculation may not be accurate enough. We therefore combine the thesaurus-based ontology with a new semantic space model (SSM) to calculate the semantic similarity between pairs of documents. In the next section the SSM is proposed to reveal the associated semantic relationships between documents.
3. Semantic similarity calculation based on SSM
In this part we propose and demonstrate a semantic space model (SSM) whose full-dimensional form exactly reproduces the original vector space model under cosine and Euclidean distance similarity calculation, while an appropriately reduced space can capture the true semantic relationships between documents. The SSM is an automatic approach that addresses the above problems by using statistically derived conceptual indices instead of individual words. It utilizes singular value decomposition (SVD) [27,43] to decompose the large term-by-document matrix into a set of k orthogonal factors.
3.1. Proof that SSM simulates VSM
We use the document-by-term matrix D(n \times m) to represent the original corpus matrix, assuming there are m terms in an n-document data set. The transpose of D is then the term-by-document matrix A(m \times n):

D = A^T   (6)
The singular value decomposition of A is defined as

A = U \Sigma V^T   (7)
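To illustrate (6) and (7), a NumPy sketch with a toy corpus matrix follows; the matrix entries and the choice of k are placeholders, and the rank-k truncation anticipates the reduced space used by the SSM.

import numpy as np

# Toy document-by-term matrix D (n = 3 documents, m = 4 terms); the entries are placeholders.
D = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [2.0, 1.0, 0.0, 1.0]])

A = D.T                                             # Eq. (6): A = D^T, the term-by-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)    # Eq. (7): A = U Sigma V^T

k = 2                                               # number of orthogonal factors retained
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # rank-k approximation of A

# The full decomposition reconstructs A exactly.
assert np.allclose(A, U @ np.diag(s) @ Vt)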