Local Text Reuse Detection
Jangwon Seo
jangwon@cs.umass.edu
W. Bruce Croft
croft@cs.umass.edu
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts, Amherst
Amherst, MA 01003
ABSTRACT
Text reuse occurs in many different types of documents and
for many different reasons. One form of reuse, duplicate or
near-duplicate docu ments, has been a focus of researchers
because of its importance in Web search. Local text reuse
occurs when sentences, facts or passages, rather than whole
documents, are reused and modified. Detecting this type of
reuse can be the basis of new tools for text analysis. In this
paper, we introduce a new approach to detecting local text
reuse and compare it to other approaches. This comparison
involves a study of the amount and type of reuse that oc-
curs in real documents, including TREC newswire and blog
collections.
Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Indexing meth-
ods
General Terms
Algorithms, Measurement, Experimentation
Keywords
Text reuse, fingerprinting, information flow
1. INTRODUCTION
Text reuse and duplication can occur for many reasons.
Web collections, for example, contain many duplicate or
near-duplicate versions of documents because the same in-
formation is stored in many different locations. Local text
reuse, on the other hand, occurs when people borrow or pla-
giarize sentences, facts, or passages from various sources.
The text that is reused may be modified and may be only a
small part of the document that is being created.
Near-duplicate document detection has been a major fo-
cus of researchers because of the need for these techniques in
Web search engines. These search engines handle enormous
collections with a great number of duplicate documents. The
duplicate documents make the system less efficient in that
they consume considerable system resources. Further, users
typically do not want to see redundant documents in search
results. Many efficient and effective algorithms for near-
duplicate document detection have been described in the
literature [1, 4, 5, 6].
The obvious application involving local text reuse is pla-
giarism detection, but being able to detect local reuse would
be a powerful new tool for other applications involv-
ing text analysis. For example, Metzler et al. [15] discussed
tracking information flow, which is the history of statements
and “facts” that are found in a text database such as news.
This application was motivated by intelligence analysis, but
could potentially be used by anyone who is interested in ver-
ifying the sources and “provenance” of information they are
reading on the Web or in blogs.
Local text reuse detection requires different algorithms
from those developed for near-duplicate document de-
tection. The reason is that, in the case of local
text reuse, only a small part (or parts) of a document may
have been taken from other sources. For example, state-of-
the-art near-duplicate detection algorithms such as locality
sensitive hashing [5] assume a transitive relation between doc-
uments. That is, if a document A is a near-duplicate of
document B, which is a near-duplicate of document C, then
document A should be a near-duplicate of document C. A
text reuse relationship based on parts of documents, how-
ever, violates this assumption, as shown in Figure 1.
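As a concrete illustration of this point (a minimal sketch of our own,
not an algorithm from this paper), the following Python fragment
compares three toy documents using word 3-gram shingles and Jaccard
resemblance; the example texts and the 0.2 near-duplicate threshold
are assumptions chosen only to make the non-transitivity visible.

def shingles(text, k=3):
    # Word k-grams (shingles) of a text; hashing these yields fingerprints.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b):
    # Jaccard resemblance between two shingle sets.
    return len(a & b) / len(a | b) if (a | b) else 0.0

# B reuses one passage from A and another passage that C also contains.
doc_a = "the senate passed the budget bill on tuesday after a long debate"
doc_b = ("the senate passed the budget bill on tuesday "
         "while the house delayed its vote until next week")
doc_c = "local officials said the house delayed its vote until next week"

sa, sb, sc = shingles(doc_a), shingles(doc_b), shingles(doc_c)
threshold = 0.2  # illustrative near-duplicate threshold (an assumption)

for name, (x, y) in [("A-B", (sa, sb)), ("B-C", (sb, sc)), ("A-C", (sa, sc))]:
    r = resemblance(x, y)
    print(name, round(r, 2), "near-duplicate" if r >= threshold else "distinct")

Here A-B and B-C exceed the threshold because document B shares a
different passage with each, while A-C is zero; a method that relies
on transitivity of the near-duplicate relation would therefore either
miss the A-B and B-C relationships or wrongly group A with C.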
In this paper, we focus on algorithms for detecting local
text reuse based on parts of documents. In Section 2, we dis-
cuss the related literature. In Section 3, we expand on the
idea of local text reuse by introducing categories of reuse.
These categories are the basis of our experimental evalua-
tion. In Section 4, we introduce a novel algorithm for local
text reuse detection called DCT fingerprinting. This algo-
rithm is evaluated for efficiency and effectiveness in Section
5. In Section 6, the local reuse detection algorithm is used
to measure the amount and type of text reuse that occurs
in TREC news and blog collections.
2. RELATED WORK
Broadly speaking, there have been two approaches to text
reuse detection. One approach uses document fingerprints
generated by hashing subsequences of words in documents. This
approach is known to work well for copy detection. Shiv-
akumar and Garcia-Molina [19, 20] and Broder [3] intro-
duced efficient frameworks. Since handling many finger-
prints is too expensive, various selection algorithms for fin-