partition a document into paragraphs and further partition each
paragraph into sentences for HTML documents (see Section 3.1), building two vocabularies of different sizes (see Section 3.2), and constructing the multilevel representation (see Section 3.3).
3.1. Document segmentation
We propose a hierarchical multilevel representation of documents that contain text content only. To extract the multilevel structure, a document is segmented into paragraphs, which are further segmented into sentences. We consider only HTML documents in this paper and developed a Java platform to implement this segmentation. In an HTML document, the HTML tags can be used to easily identify paragraphs, which are then partitioned into sentences at period marks. Before document segmentation, we first filter out the formatted text that appears within the HTML tags; this text is not counted in the word counts or document features.
The overall document partitioning process can be summarized
as follows:
1. Partition the document into blocks using the HTML tags "<p>", "<br/>", "<li>", "</td>", etc.
2. Merge subsequent blocks to form a new paragraph until the total number of words in the merged blocks exceeds a paragraph threshold (set at 50 in this paper). A new block is merged with the previous paragraph if the total number of words in the paragraph exceeds the minimum threshold (set at 30).
3. Partition each generated paragraph into sentences using the period mark ".".
For HTML documents, it is noted that there is no rule for the minimum or maximum number of words in a paragraph. However, the use of word-count thresholds still enables us to flexibly control the number of paragraphs in each document, and ensures that blocks containing only a few words (e.g., titles) are attached to the real paragraph blocks. In this way, we build a hierarchical multilevel structure (or tree structure) that describes the semantic information from a global data view down to a local data view.
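A minimal sketch of this segmentation procedure is given below (an illustrative Python implementation using the tag set and word-count thresholds stated above; the function and variable names are hypothetical, and the authors' Java platform is not reproduced):

import re

# Illustrative sketch of the block segmentation and threshold-based merging
# described above; tag set and thresholds follow the text, all names are hypothetical.
BLOCK_TAGS = re.compile(r"</?(?:p|br|li|td)\b[^>]*>", re.IGNORECASE)
OTHER_TAGS = re.compile(r"<[^>]+>")

PARAGRAPH_THRESHOLD = 50  # close a paragraph once it exceeds this many words
MIN_PARAGRAPH = 30        # shorter leftover blocks (e.g. titles) join the previous paragraph

def word_count(text):
    return len(text.split())

def segment(html):
    # 1. Split into blocks at paragraph-breaking tags and strip the remaining markup.
    blocks = [OTHER_TAGS.sub(" ", b).strip() for b in BLOCK_TAGS.split(html)]
    blocks = [b for b in blocks if b]

    # 2. Merge consecutive blocks until the running word count exceeds the threshold.
    paragraphs, current = [], []
    for block in blocks:
        current.append(block)
        if word_count(" ".join(current)) > PARAGRAPH_THRESHOLD:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        leftover = " ".join(current)
        if paragraphs and word_count(leftover) < MIN_PARAGRAPH:
            paragraphs[-1] += " " + leftover   # attach a short tail to the previous paragraph
        else:
            paragraphs.append(leftover)

    # 3. Split each paragraph into sentences at periods.
    return [[s.strip() for s in p.split(".") if s.strip()] for p in paragraphs]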
Thus the document contents are structured in a 'document-paragraphs-sentences' hierarchy. This is a simple way to generate a hierarchical structure. It could be further improved by a finer segmentation, such as 'document-sections-pages-paragraphs-sentences', but this would require a more complex segmentation algorithm.
3.2. Vocabulary construction
The main text contents are first separated from the HTML tags. We then extract words from all the documents in a dataset and apply stemming to each word. Stems are used as basic features instead of the original words; thus 'program', 'programs', and 'programming' are all treated as the same word. We remove the stop words (a set of common words such as 'a', 'the', 'are', etc.) and store the stemmed words together with their term frequency $f_t$ (the frequency of a word over all documents) and document frequency $f_d$ (the number of documents in which a word appears). In order to form a histogram vector for each document, we need to construct the word vocabulary to which each histogram vector refers. Based on the stored term frequency $f_t$ and document frequency $f_d$, we use a simple term-weighting measure, similar to tf-idf, to calculate the weight of each word:

$W_t = \sqrt{f_t} \cdot \mathrm{idf}$,    (1)

where the inverse document frequency is $\mathrm{idf} = \log_2(N/f_d)$ and $N$ is the total number of documents in the dataset. It is noted that this term-weighting measure can be replaced by other feature selection criteria [20].
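For instance (an illustrative calculation, not taken from the paper's experiments), a word occurring $f_t = 100$ times over a collection of $N = 1000$ documents and appearing in $f_d = 50$ of them receives the weight $W_t = \sqrt{100} \cdot \log_2(1000/50) = 10 \times \log_2 20 \approx 43.2$.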
The words are then sorted in descending order according to their weights. Here, we construct two vocabularies, denoted $V_1$ and $V_2$, respectively. $V_1$ is used to form the histogram vectors of documents and paragraphs, whereas $V_2$ is used to form the signatures of sentences (see Section 3.3). The first $N_1$ words are selected to construct the vocabulary $V_1$ and the first $N_2$ words are selected to construct the vocabulary $V_2$. The vocabulary $V_1$ is mainly used for DR (see Section 4), and the vocabulary $V_2$ is used for further sentence sorting (see Section 5). The vocabulary size $N_2$ is supposed to be much larger than $N_1$. According to our empirical study [3,4], using all the words in the dataset to construct the vocabulary $V_1$ does not necessarily improve the DR accuracy, because some words may be noisy features for some topics. Most document modeling approaches [5-10], however, use all the words to form the basic histogram vectors for DR. Efficient feature selection for DR is still an open problem, which we leave to other researchers. We also conducted detailed experiments to evaluate the performance for different choices of the vocabulary sizes (see Sections 6.4.3 and 6.4.4).
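As a concrete illustration, the vocabulary construction can be sketched as follows (an illustrative Python implementation, not the authors' code; the stemmer and stop-word list are assumed to be supplied, and the sizes $N_1$ and $N_2$ are free parameters):

import math
from collections import Counter

def build_vocabularies(documents, stem, stop_words, n1, n2):
    # documents: list of raw text strings; stem: a stemming function (assumed given).
    # Returns the vocabularies V1 (size n1) and V2 (size n2), with n2 >> n1.
    N = len(documents)
    term_freq = Counter()   # f_t: frequency of a stem over all documents
    doc_freq = Counter()    # f_d: number of documents containing the stem

    for text in documents:
        stems = [stem(w) for w in text.lower().split() if w not in stop_words]
        term_freq.update(stems)
        doc_freq.update(set(stems))

    # Weight each stem as in Eq. (1): W_t = sqrt(f_t) * log2(N / f_d).
    weights = {t: math.sqrt(term_freq[t]) * math.log2(N / doc_freq[t])
               for t in term_freq}

    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:n1], ranked[:n2]   # V1 for histograms, V2 for sentence signatures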
3.3. Multilevel representation
After the vocabulary construction, we use the document segmentation procedure (see Section 3.1) to partition each document in the dataset and generate the multilevel representation in the form of the 'document-paragraphs-sentences' structure. The top level contains the histogram of the whole document, and the second level contains the histograms of its paragraphs. Each element of a histogram indicates the number of times the corresponding word of the vocabulary $V_1$ appears in the document or paragraph. The third level, used for sentences, differs from the upper two levels: instead of histograms, it stores the index numbers of the words of the vocabulary $V_2$ that appear in each sentence, indicating the presence or absence of these words. This architecture has two advantages: it saves storage space (computational efficiency) and improves the detection accuracy (accuracy efficiency) because it examines the document more locally.
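The resulting three-level structure can be pictured as a small data structure (an illustrative sketch with hypothetical field names): histograms over $V_1$ at the document and paragraph levels, and lists of $V_2$ word indices at the sentence level.

from dataclasses import dataclass
from typing import List

@dataclass
class MultilevelDocument:
    doc_histogram: List[int]                 # level 1: word counts over vocabulary V1
    paragraph_histograms: List[List[int]]    # level 2: one V1 histogram per paragraph
    sentence_indices: List[List[List[int]]]  # level 3: per paragraph, per sentence, indices into V2

def histogram(words, vocab_index):
    h = [0] * len(vocab_index)
    for w in words:
        if w in vocab_index:
            h[vocab_index[w]] += 1
    return h

def build_representation(paragraphs, V1, V2):
    # paragraphs: list of paragraphs, each a list of sentences, each a list of stems.
    idx1 = {w: i for i, w in enumerate(V1)}
    idx2 = {w: i for i, w in enumerate(V2)}
    doc_words = [w for p in paragraphs for s in p for w in s]
    return MultilevelDocument(
        doc_histogram=histogram(doc_words, idx1),
        paragraph_histograms=[histogram([w for s in p for w in s], idx1) for p in paragraphs],
        sentence_indices=[[sorted({idx2[w] for w in s if w in idx2}) for s in p]
                          for p in paragraphs],
    )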
Since the document level and paragraph level are mainly used for DR (see Section 4), we apply PCA, a well-known dimensionality reduction tool, to the word histogram vectors of the whole document and the segmented paragraphs. Here, PCA is employed to project higher dimensional data into a lower dimensional latent semantic space without losing much statistical information.
We first normalize the histogram vector $H^d_i = [h^d_t]$ ($t = 1, 2, \ldots, N_1$) of the $i$th document:

$h^d_t = \dfrac{n_t}{\sum_{t=1}^{N_1} n_t} \log_2 \dfrac{N}{f^D_t}$,    (2)
where $n_t$ is the frequency of the $t$th word in the vocabulary, and $f^D_t$ is the document frequency of the $t$th word.
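A minimal sketch of this normalization is given below (illustrative only; numpy is used for brevity):

import numpy as np

def normalize_histogram(counts, doc_freq, num_docs):
    # counts: raw word counts n_t over vocabulary V1 for one document;
    # doc_freq: document frequencies f^D_t of the same words; num_docs: N.
    counts = np.asarray(counts, dtype=float)
    doc_freq = np.asarray(doc_freq, dtype=float)
    # Eq. (2): h_t = (n_t / sum_t n_t) * log2(N / f^D_t)
    return (counts / counts.sum()) * np.log2(num_docs / doc_freq)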
We then use the normalized histograms to construct the PCA projection matrix $B$. To reduce the computational burden, we apply PCA only at the document level. We have used the MATLAB tool [42] to compute the projection matrix. The compressed histogram vector $F^d_i = [f^d_u]$ ($u = 1, 2, \ldots, N_F$) of the $i$th document is calculated as