Document Summarization from a Data Reconstruction Perspective

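"Document summarization is valuable in many practical applications, such as generating snippets for search results and creating news headlines. Traditional methods extract the sentences that cover a document's main topics with minimum redundancy. The paper instead proposes a new framework, Document Summarization based on Data Reconstruction (DSDR), in which the goal is to find the sentences that best reconstruct the original document. To model the relationship among sentences, the paper introduces two objective functions, linear reconstruction and nonnegative linear reconstruction, and solves the corresponding optimization problems to minimize the reconstruction error."

This paper presents DSDR, a document summarization method built on data reconstruction. Where traditional methods extract key sentences that cover the main topics, DSDR selects the set of sentences from which the content of the original document can be recovered as closely as possible, so the reconstruction error itself measures summary quality. The first objective function, linear reconstruction, approximates the document by linear combinations of the selected sentences and is optimized with a greedy method that extracts sentences sequentially. The second, nonnegative linear reconstruction, allows only additive combinations and is formulated as a convex optimization problem.

The method applies directly to search result snippets and news headlines, where a more accurate and concise summary helps users find information faster. It also offers a new perspective for research in natural language processing and information retrieval. The paper evaluates the approach on the DUC 2006 and DUC 2007 summarization benchmarks.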

Document Summarization Based on Data Reconstruction

Zhanying He and Chun Chen and Jiajun Bu and Can Wang and Lijun Zhang

Zhejiang Provincial Key Laboratory of Service Robot, College of Computer Science,

Zhejiang University, Hangzhou 310027, China.

{hezhanying, chenc, bjj, wcan, zljzju}@zju.edu.cn

Deng Cai and Xiaofei He

State Key Lab of CAD&CG, College of Computer Science,

Zhejiang University, Hangzhou 310058, China.

{dengcai, xiaofeihe}@cad.zju.edu.cn

Abstract

Document summarization is of great value to many real world applications, such as snippets generation for search results and news headlines generation. Traditionally, document summarization is implemented by extracting sentences that cover the main topics of a document with a minimum redundancy. In this paper, we take a different perspective from data reconstruction and propose a novel framework named Document Summarization based on Data Reconstruction (DSDR). Specifically, our approach generates a summary which consists of those sentences that can best reconstruct the original document. To model the relationship among sentences, we introduce two objective functions: (1) linear reconstruction, which approximates the document by linear combinations of the selected sentences; (2) nonnegative linear reconstruction, which allows only additive, not subtractive, linear combinations. In this framework, the reconstruction error becomes a natural criterion for measuring the quality of the summary. For each objective function, we develop an efficient algorithm to solve the corresponding optimization problem. Extensive experiments on summarization benchmark data sets DUC 2006 and DUC 2007 demonstrate the effectiveness of our proposed approach.

Introduction

With the explosive growth of the Internet, people are overwhelmed by a large number of accessible documents. Summarization can represent the document with a short piece of text covering the main topics, and help users sift through the Internet, catch the most relevant document, and filter out redundant information. So document summarization has become one of the most important research topics in the natural language processing and information retrieval communities.

In recent years, automatic summarization has been applied broadly in varied domains. For example, search engines can provide users with snippets as the previews of the document contents (Turpin et al. 2007; Huang, Liu, and Chen 2008; Cai et al. 2004; He et al. 2007). News sites usually describe hot news topics in concise headlines to facilitate browsing. Both the snippets and headlines are specific forms of document summary in practical applications.

Copyright 2012, Association for the Advancement of Artificial Intelligence

Most of the existing generic summarization approaches use a ranking model to select sentences from a candidate set (Brin and Page 1998; Kleinberg 1999; Wan and Yang 2007). These methods suffer from a severe problem: the top ranked sentences usually share much redundant information. Although some methods (Conroy and O'Leary 2001; Park et al. 2007; Shen et al. 2007) try to reduce the redundancy, selecting sentences which have both good coverage and minimum redundancy is a non-trivial task.

In this paper, we propose a novel summarization method from the perspective of data reconstruction. As far as we know, our approach is the first to treat document summarization as a data reconstruction problem. We argue that a good summary should consist of those sentences that can best reconstruct the original document. Therefore, the reconstruction error becomes a natural criterion for measuring the quality of a summary. We propose a novel framework called Document Summarization based on Data Reconstruction (DSDR) which finds the summary sentences by minimizing the reconstruction error. DSDR first learns a reconstruction function for each candidate sentence of an input document and then obtains the error formula from that function. Finally, it obtains an optimal summary by minimizing the reconstruction error. Geometrically, DSDR tends to select sentences that span the intrinsic subspace of the candidate sentence space, so that it is able to cover the core information of the document.
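To make the criterion concrete, the following is a minimal sketch of how a reconstruction error could be computed for one candidate summary. It is an illustration under stated assumptions, not the paper's exact formulation: sentences are assumed to be rows of a term-weight matrix V, each sentence is rebuilt as a linear combination of the selected rows, and a ridge penalty lam (a hypothetical parameter) keeps the least-squares problem well posed. The function name reconstruction_error and the toy data are likewise hypothetical.

```python
import numpy as np

def reconstruction_error(V, selected, lam=0.1):
    """Error of rebuilding every row of V from the rows listed in `selected`.

    For each sentence v the coefficients a minimize
        ||v - X^T a||^2 + lam * ||a||^2,
    where X holds the selected sentences (assumed ridge-regularized form).
    """
    V = np.asarray(V, dtype=float)
    X = V[list(selected)]                     # (m, d) candidate summary sentences
    G = X @ X.T + lam * np.eye(len(X))        # (m, m) regularized Gram matrix
    A = np.linalg.solve(G, X @ V.T).T         # (n, m) one coefficient row per sentence
    residual = V - A @ X                      # part of the document the summary misses
    return float(np.sum(residual ** 2) + lam * np.sum(A ** 2))

# Toy document: 4 sentences over 3 terms; sentences 0 and 1 are duplicates.
V = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1]], dtype=float)

print("redundant pair [0, 1]:", reconstruction_error(V, [0, 1]))
print("diverse pair   [0, 2]:", reconstruction_error(V, [0, 2]))
```

On this toy input the pair of duplicate sentences yields a larger error than the pair covering two different topics, which matches the intuition that redundant sentences span less of the sentence space.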

To model the relationship among sentences, we discuss two kinds of reconstruction. The first one is linear reconstruction, which approximates the document by linear combinations of the selected sentences. Optimizing the corresponding objective function is achieved through a greedy method which extracts sentences sequentially. The second one is non-negative linear reconstruction, which allows only additive, not subtractive, combinations among the selected sentences. Previous studies have shown that there is psychological and physiological evidence for parts-based representation in the human brain (Palmer 1977; Wachsmuth, Oram, and Perrett 1994; Cai et al. 2011). Naturally, a document summary should consist of the parts of sentences. With the nonnegative constraints, our method leads to parts-based reconstruction so that no redundant information needs to be subtracted from the combination. We formulate the nonnegative linear reconstruction as a convex optimization problem

Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence
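The two kinds of reconstruction described in the introduction can be illustrated with a short NumPy sketch. This is not the authors' algorithm, whose details lie beyond this excerpt. The greedy part is a naive forward selection that recomputes a ridge-regularized error for every candidate, and the nonnegative part swaps in a standard multiplicative update for nonnegative least squares with the summary sentences held fixed. All names (linear_error, greedy_select, nonnegative_coefficients), the penalty lam, and the toy matrix are assumptions made for illustration, and the sentence vectors are assumed to be nonnegative term weights.

```python
import numpy as np

def linear_error(V, selected, lam=0.1):
    """Ridge-regularized error of rebuilding all rows of V from the selected rows."""
    X = V[list(selected)]
    G = X @ X.T + lam * np.eye(len(X))
    A = np.linalg.solve(G, X @ V.T).T
    return float(np.sum((V - A @ X) ** 2) + lam * np.sum(A ** 2))

def greedy_select(V, m, lam=0.1):
    """Pick m sentence indices sequentially; each step adds the sentence
    whose inclusion gives the lowest reconstruction error (naive sketch)."""
    V = np.asarray(V, dtype=float)
    selected = []
    for _ in range(m):
        remaining = [i for i in range(len(V)) if i not in selected]
        best = min(remaining, key=lambda i: linear_error(V, selected + [i], lam))
        selected.append(best)
    return selected

def nonnegative_coefficients(V, selected, iters=200, eps=1e-12):
    """Additive-only coefficients A >= 0 with V ~ A @ X for fixed summary rows X.

    Uses a multiplicative update, which keeps A nonnegative when V and X
    are nonnegative; this is a swapped-in technique, not the paper's solver.
    """
    V = np.asarray(V, dtype=float)
    X = V[list(selected)]
    A = np.ones((len(V), len(X)))
    VXt = V @ X.T                              # (n, m) numerator, fixed
    XXt = X @ X.T                              # (m, m) Gram matrix of the summary
    for _ in range(iters):
        A *= VXt / (A @ XXt + eps)             # elementwise update, never changes sign
    return A

# Toy document: 4 sentences over 3 terms; sentences 0 and 1 are duplicates.
V = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1]], dtype=float)

summary = greedy_select(V, m=2)
A = nonnegative_coefficients(V, summary)
print("selected sentences:", summary)
print("nonnegative coefficients:\n", np.round(A, 3))
```

Because the coefficients can never become negative, a sentence in the summary can only add content to the reconstruction, which is the additive, parts-based behavior the text motivates.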
