Local Text Reuse Detection
Jangwon Seo
jangwon@cs.umass.edu
W. Bruce Croft
croft@cs.umass.edu
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts, Amherst
Amherst, MA 01003
ABSTRACT
Text reuse occurs in many different types of documents and
for many different reasons. One form of reuse, duplicate or
near-duplicate docu ments, has been a focus of researchers
because of its importance in Web search. Local text reuse
occurs when sentences, facts or passages, rather than whole
documents, are reused and modified. Detecting this type of
reuse can be the basis of new tools for text analysis. In this
paper, we introduce a new approach to detecting local text
reuse and compare it to other approaches. This comparison
involves a study of the amount and type of reuse that oc-
curs in real documents, including TREC newswire and blog
collections.
Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Indexing meth-
ods
General Terms
Algorithms, Measurement, Experimentation
Keywords
Text reuse, fingerprinting, information flow
1. INTRODUCTION
Text reuse and duplication can occur for many reasons.
Web collections, for example, contain many duplicate or
near-duplicate versions of documents because the same in-
formation is stored in many different locations. Local text
reuse, on the other hand, occurs when people borrow or pla-
giarize sentences, facts, or passages from various sources.
The text that is reused may be modified and may be only a
small part of the document that is being created.
Near-duplicate document detection has been a major fo-
cus of researchers because of the need for these techniques in
Web search engines. These search engines handle enormous
collections with a great number of duplicate documents. The
duplicate documents make the system less efficient in that
they consume considerable system resources. Further, users
typically do not want to see redundant documents in search
results. Many efficient and effective algorithms for near-
duplicate document detection have been described in the
literature [1, 4, 5, 6].
The obvious application involving local text reuse is pla-
giarism detection, but being able to detect local reuse would
be a powerful new tool for other applications involv-
ing text analysis. For example, Metzler et al. [15] discussed
tracking information flow, which is the history of statements
and “facts” that are found in a text database such as news.
This application was motivated by intelligence analysis, but
could potentially be used by anyone who is interested in ver-
ifying the sources and “provenance” of information they are
reading on the Web or in blogs.
Local text reuse detection requires different algorithms
from those developed for near-duplicate document de-
tection. The reason is that, in the case of local
text reuse, only a small part (or parts) of a document may
have been taken from other sources. For example, state-of-
the-art near-duplicate detection algorithms such as locality
sensitive hashing [5] assume a transitive relation between doc-
uments. That is, if a document A is a near-duplicate of
document B, which is a near-duplicate of document C, then
document A should be a near-duplicate of document C. A
text reuse relationship based on parts of documents, how-
ever, violates this assumption, as shown in Figure 1.
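As a concrete illustration of this point (a minimal sketch of our own,
not an algorithm from this paper), the following Python fragment
compares three toy documents using word 3-gram shingles and Jaccard
resemblance; the example texts and the 0.2 near-duplicate threshold
are assumptions chosen only to make the non-transitivity visible.

def shingles(text, k=3):
    # Word k-grams (shingles) of a text; hashing these yields fingerprints.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b):
    # Jaccard resemblance between two shingle sets.
    return len(a & b) / len(a | b) if (a | b) else 0.0

# B reuses one passage from A and another passage that C also contains.
doc_a = "the senate passed the budget bill on tuesday after a long debate"
doc_b = ("the senate passed the budget bill on tuesday "
         "while the house delayed its vote until next week")
doc_c = "local officials said the house delayed its vote until next week"

sa, sb, sc = shingles(doc_a), shingles(doc_b), shingles(doc_c)
threshold = 0.2  # illustrative near-duplicate threshold (an assumption)

for name, (x, y) in [("A-B", (sa, sb)), ("B-C", (sb, sc)), ("A-C", (sa, sc))]:
    r = resemblance(x, y)
    print(name, round(r, 2), "near-duplicate" if r >= threshold else "distinct")

Here A-B and B-C exceed the threshold because document B shares a
different passage with each, while A-C is zero; a method that relies
on transitivity of the near-duplicate relation would therefore either
miss the A-B and B-C relationships or wrongly group A with C.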
In this paper, we focus on algorithms for detecting local
text reuse based on parts of documents. In Section 2, we dis-
cuss the related literature. In Section 3, we expand on the
idea of local text reuse by introducing categories of reuse.
These categories are the basis of our experimental evalua-
tion. In Section 4, we introduce a novel algorithm for local
text reuse detection called DCT fingerprinting. This algo-
rithm is evaluated for efficiency and effectiveness in Section
5. In Section 6, the local reuse detection algorithm is used
to measure the amount and type of text reuse that occurs
in TREC news and blog collections.
2. RELATED WORK
Broadly speaking, there have been two approaches to text
reuse detection. One approach uses document fingerprints
generated by hashing subsequences of words in documents. This
approach is known to work well for copy detection. Shiv-
akumar and Garcia-Molina [19, 20] and Broder [3] intro-
duced efficient frameworks. Since handling many finger-
prints is too expensive, various selection algorithms for fin-