Document Summarization Based on Data Reconstruction
Zhanying He and Chun Chen and Jiajun Bu and Can Wang and Lijun Zhang
Zhejiang Provincial Key Laboratory of Service Robot, College of Computer Science,
Zhejiang University, Hangzhou 310027, China.
{hezhanying, chenc, bjj, wcan, zljzju}@zju.edu.cn
Deng Cai and Xiaofei He
State Key Lab of CAD&CG, College of Computer Science,
Zhejiang University, Hangzhou 310058, China.
{dengcai, xiaofeihe}@cad.zju.edu.cn
Abstract
Document summarization is of great value to many
real world applications, such as snippets generation for
search results and news headlines generation. Tradition-
ally, document summarization is implemented by ex-
tracting sentences that cover the main topics of a doc-
ument with a minimum redundancy. In this paper, we
take a different perspective from data reconstruction and
propose a novel framework named Document Summa-
rization based on Data Reconstruction (DSDR). Specif-
ically, our approach generates a summary which consist
of those sentences that can best reconstruct the original
document. To model the relationship among sentences,
we introduce two objective functions: (1) linear recon-
struction, which approximates the document by linear
combinations of the selected sentences; (2) nonnega-
tive linear reconstruction, which allows only additive,
not subtractive, linear combinations. In this framework,
the reconstruction error becomes a natural criterion for
measuring the quality of the summary. For each objec-
tive function, we develop an efficient algorithm to solve
the corresponding optimization problem. Extensive ex-
periments on summarization benchmark data sets DUC
2006 and DUC 2007 demonstrate the effectiveness of
our proposed approach.
Introduction
With the explosive growth of the Internet, people are over-
whelmed by a large number of accessible documents. Sum-
marization can represent the document with a short piece
of text covering the main topics, and help users sift through
the Internet, catch the most relevant document, and filter out
redundant information. So document summarization has be-
come one of the most important research topics in the natural
language processing and information retrieval communities.
In recent years, automatic summarization has been ap-
plied broadly in varied domains. For example, search en-
gines can provide users with snippets as the previews of
the document contents (Turpin et al. 2007; Huang, Liu, and
Chen 2008; Cai et al. 2004; He et al. 2007). News sites usu-
ally describe hot news topics in concise headlines to facili-
tate browsing. Both the snippets and headlines are specific
forms of document summary in practical applications.
Copyright
c
2012, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Most of the existing generic summarization approaches
use a ranking model to select sentences from a candidate set
(Brin and Page 1998; Kleinberg 1999; Wan and Yang 2007).
These methods suffer from a severe problem that top ranked
sentences usually share much redundant information. Al-
though there are some methods (Conroy and O’leary 2001;
Park et al. 2007; Shen et al. 2007) trying to reduce the redun-
dancy, selecting sentences which have both good coverage
and minimum redundancy is a non-trivial task.
In this paper, we propose a novel summarization method
from the perspective of data reconstruction. As far as we
know, our approach is the first to treat the document sum-
marization as a data reconstruction problem. We argue that
a good summary should consist of those sentences that can
best reconstruct the original document. Therefore, the re-
construction error becomes a natural criterion for measur-
ing the quality of summary. We propose a novel framework
called Document Summarization based on Data Reconstruc-
tion (DSDR) which finds the summary sentences by mini-
mizing the reconstruction error. DSDR firstly learns a recon-
struction function for each candidate sentence of an input
document and then obtains the error formula by that func-
tion. Finally it obtains an optimal summary by minimizing
the reconstruction error. From the geometric interpretation,
DSDR tends to select sentences that span the intrinsic sub-
space of candidate sentence space so that it is able to cover
the core information of the document.
To model the relationship among sentences, we discuss
two kinds of reconstruction. The first one is linear recon-
struction, which approximates the document by linear com-
binations of the selected sentences. Optimizing the corre-
sponding objective function is achieved through a greedy
method which extracts sentences sequentially. The second
one is non-negative linear reconstruction, which allows only
additive, not subtractive, combinations among the selected
sentences. Previous studies have shown that there is psycho-
logical and physiological evidence for parts-based represen-
tation in the human brain (Palmer 1977; Wachsmuth, Oram,
and Perrett 1994; Cai et al. 2011). Naturally, a document
summary should consist of the parts of sentences. With the
nonnegative constraints, our method leads to parts-based re-
construction so that no redundant information needs to be
subtracted from the combination. We formulate the nonneg-
ative linear reconstruction as a convex optimization problem
Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence