International Journal of Network Security, Vol.17, No.5, PP.637-642, Sept. 2015 637
A Component Histogram Map Based Text
Similarity Detection Algorithm
Huajun Huang, Shuang Pang, Qiong Deng, and Jiaohua Qin
(Corresponding author: Huajun Huang)
College of Computer and Information Engineering, Central South University of Forestry and Technology
Changsha 410004, China
(Email: hhj0906@163.com)
(Received Apr. 10, 2015; revised and accepted May 16 & May 24, 2015)
Abstract
Conventional text similarity detection usually uses word frequency vectors to represent texts, but such vectors are high-dimensional and sparse. In this research, a new text similarity detection algorithm using a component histogram map (CHM-TSD) is proposed. This method is based on the mathematical expression of Chinese characters, with which Chinese characters can be split into components. The occurrence frequency of each component in a text is then counted to build the component histogram map (CHM), which serves as the text's characteristic vector. Four distance formulas are compared to determine which performs best in text similarity detection. The experimental results indicate that CHM-TSD achieves better precision, recall and F1 than the cosine theorem and the Jaccard coefficient.
Keywords: Component histogram map, distance calcula-
tion, text similarity detection
1 Introduction
As a branch of natural language processing, text similarity detection is increasingly important for information security. It has been used in many fields such as information retrieval (IR), duplicate detection, and data clustering and classification [3]. In general, there are two approaches to text similarity detection: one based on semantic similarity, and the other non-semantic. Semantic similarity detection is usually based on dictionary computation, using resources like HowNet [13] and WordNet [4]. Huang proposed a method that combined an external dictionary with TF-IDF to compute text similarity [5]. Some researchers also use a large-scale corpus for semantic similarity detection [7], but this is uncommon because of its disadvantages. Non-semantic similarity detection mostly uses two methods: word frequency statistics and string comparison. The most commonly used word frequency statistics method is the VSM [11, 12]; the text similarity can then be computed through the cosine theorem [14] or the Jaccard coefficient [10]. On the other hand, Shingling [15] and the maximum string matching algorithm [6] are often used for string comparison. All of the methods above perform well in certain situations, but they also have shortcomings. For example, the dictionary-based semantic method depends too heavily on the person and the knowledge base to express the sense of a word exactly, while word frequency statistics produce vectors that are very high-dimensional and sparse [8].
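The two word-frequency baselines mentioned above can be sketched briefly. This is a minimal illustration, not the paper's implementation: the texts, tokenization by whitespace, and the helper names are assumptions for the example.

```python
from collections import Counter
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two word-frequency vectors.
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard_similarity(a, b):
    # Ratio of shared distinct words to all distinct words.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

t1 = Counter("the cat sat on the mat".split())
t2 = Counter("the cat lay on the rug".split())
print(cosine_similarity(t1, t2))   # 0.75
print(jaccard_similarity(t1, t2))  # 3/7
```

Note that both measures operate over the full vocabulary of the corpus, which is exactly where the high-dimensional, sparse representation arises.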
To address these problems, a new Chinese text similarity detection method is proposed. This method uses the CHM (component histogram map) to avoid the high-dimensional and sparse problem. Its basis is the mathematical expression of Chinese characters [9], which is used to split Chinese characters into components; the components are then taken as the research object. Because components combine with each other to compose Chinese characters, they are mutually correlative. The CHM is built from each component's occurrence frequency, and the distance between a text and a duplicate text is then calculated with the Bhattacharyya formula. The results show that CHM-TSD performs better than the cosine theorem and the Jaccard coefficient.
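The pipeline just described (split characters into components, build a normalized component histogram, compare histograms with the Bhattacharyya formula) can be sketched as follows. The component table below is a toy stand-in: the actual decomposition comes from the mathematical expression of Chinese characters [9], which is not reproduced here.

```python
import math

# Hypothetical component table; the real method derives components from the
# mathematical expression of Chinese characters, not from this toy mapping.
COMPONENTS = {
    "好": ["女", "子"],
    "妈": ["女", "马"],
    "骂": ["口", "口", "马"],
}

def component_histogram(text):
    # Count each component's occurrence frequency, normalized to sum to 1.
    counts = {}
    for ch in text:
        for comp in COMPONENTS.get(ch, [ch]):
            counts[comp] = counts.get(comp, 0) + 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def bhattacharyya_distance(p, q):
    # Bhattacharyya coefficient BC = sum_i sqrt(p_i * q_i);
    # distance = -ln(BC), 0 for identical distributions.
    bc = sum(math.sqrt(p[c] * q.get(c, 0.0)) for c in p)
    return -math.log(bc) if bc > 0 else float("inf")

h1 = component_histogram("好妈")
h2 = component_histogram("妈骂")
print(bhattacharyya_distance(h1, h2))  # small positive distance
```

Because the component inventory is far smaller than a word vocabulary, the resulting feature vectors are compact and dense rather than high-dimensional and sparse.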
2 Related Theories
In the process of text duplicate detection, text feature representation and similarity detection are two very important steps [12]. VSM is the most common method for text feature representation. Assuming $d_i$ is the $i$-th text and $W_{i,j}$ is the weight of the $j$-th word of $d_i$, the $i$-th text can be represented as $\vec{d}_i = (W_{i,1}, W_{i,2}, \cdots, W_{i,n})$, so all the texts in the experiment compose a vector space $D = (\vec{d}_1, \vec{d}_2, \cdots, \vec{d}_n)$. The similarity of each pair of texts can be computed as the distance between two vectors through cosine