search for information paragraphs closely related to the input
question. Subsequently, the critique model evaluates these paragraphs, determining their relevance and the level of support they provide, and assessing their impact on response generation. Finally, the generator model constructs responses based on this information and evaluates the quality of those responses through critique marks.
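The retrieve-critique-generate loop can be pictured with the minimal sketch below; `retrieve`, `critique`, and `generate` are hypothetical stand-ins for the three models, and the score fields are illustrative rather than the method's actual interface.

```python
# Minimal sketch of the retrieve-critique-generate pipeline described above.
# `retrieve`, `critique`, and `generate` are hypothetical stand-ins for the
# three models; the score fields are illustrative, not the method's real API.
def answer(question: str, k: int = 5) -> str:
    passages = retrieve(question, top_k=k)                  # retrieval model
    # Critique model: keep passages judged relevant and well supported.
    scored = [(p, critique(question, p)) for p in passages]
    kept = [p for p, s in scored if s["relevant"] and s["supported"]]
    # Generator model: draft a response per kept passage, then use the
    # critique marks to select the highest-quality response.
    drafts = [generate(question, p) for p in kept]
    return max(drafts, key=lambda d: critique(question, d)["quality"])
```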
Recently, some methods have been proposed to implement RAG without modifying the language model architecture, an approach particularly suitable for scenarios where language models
are accessed through APIs. REPLUG [128] illustrates this
methodology by treating the language model as a "black box," utilizing Contriever to seamlessly incorporate relevant
external documents into the query. REPLUG LSR, a variant
with LM-Supervised Retrieval, further refines this process by training the retriever under the language model's own supervision, aiming to retrieve documents that lower the model's perplexity and thereby improve its performance. In-
Context RALM [129] uses the BM25 algorithm for document
retrieval and predictive reranking to select pertinent documents
for input integration.
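To illustrate this black-box pattern, the sketch below prepends each retrieved document to the query and ensembles the per-document output distributions, weighted by retrieval score in the spirit of REPLUG; `retrieve` and `lm_next_token_probs` are hypothetical API-style helpers, not part of any released code.

```python
import numpy as np

# Sketch of black-box RAG in the spirit of REPLUG: the frozen LM is reached
# only through an API-style helper, so no architecture changes are required.
# `retrieve` and `lm_next_token_probs` are hypothetical stand-ins.
def ensembled_next_token_probs(query: str, k: int = 4) -> np.ndarray:
    docs, scores = retrieve(query, top_k=k)          # documents + similarity scores
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over retrieval scores
    mixture = 0.0
    for doc, w in zip(docs, weights):
        prompt = f"{doc}\n\n{query}"                 # prepend document to the query
        mixture = mixture + w * lm_next_token_probs(prompt)  # black-box LM call
    return mixture                                   # ensembled output distribution
```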
In contemporary multimodal application research, integrat-
ing retrieved content into inputs has proven highly effective in
enhancing the performance of various tasks. This strategy is
applicable across several key fields, including code generation,
audio generation, and Knowledge Base Question Answering
(KBQA).
For the text-to-code task, APICoder [130] and DocPrompting [41] demonstrate how effectively integrating retrieved information into language models can improve the accuracy and relevance of generated code. In the automatic program repair task, CEDAR [131] and InferFix [132] utilize retrieved code snippets to aid the repair process, combining them with the original input to enhance the model's understanding and application of repair strategies. For the code completion task, ReACC [133] employs a prompting mechanism, leveraging retrieved code snippets as part of the new input to increase the accuracy and efficiency of code completion.
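The retrieve-then-prompt pattern that ReACC exemplifies can be sketched as follows; `search_code` and `code_lm` are hypothetical helpers, and the simple concatenation template is an assumption.

```python
# Sketch of retrieval-augmented code completion in the spirit of ReACC:
# similar code is retrieved and placed before the unfinished code to form
# the new model input. `search_code` and `code_lm` are hypothetical helpers.
def complete(unfinished_code: str, k: int = 2) -> str:
    snippets = search_code(unfinished_code, top_k=k)    # retrieve similar code
    prompt = "\n\n".join(snippets + [unfinished_code])  # retrieved context first
    return code_lm(prompt)                              # LM completes the code
```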
In the audio generation field, Make-An-Audio [43] leverages retrieval to construct captions for language-free audio, thereby mitigating data sparsity in text-to-audio training.
Recent research in KBQA has demonstrated the significant benefits of combining retrieval with language models. Uni-Parser [134], RNG-KBQA [122], and ECBRF [135] effectively improve the performance and accuracy of QA systems by merging queries and retrieved information into prompts. BLLM augmentation [136] represents an innovative attempt at zero-shot KBQA using black-box large language models. By directly integrating retrieved information into the model input, without any additional training samples, this method demonstrates the great potential of combining retrieval and language models to improve a model's ability to generalize to unseen questions.
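For intuition, a zero-shot prompt of the kind these methods construct might linearize retrieved knowledge-base facts ahead of the question; the template and the `retrieve_triples` and `llm` helpers below are illustrative assumptions, not any paper's released interface.

```python
# Illustrative zero-shot KBQA prompt: retrieved KB triples are linearized and
# merged with the question, with no additional training of the black-box LLM.
# `retrieve_triples` and `llm` are hypothetical helpers.
def kbqa(question: str) -> str:
    triples = retrieve_triples(question, top_k=10)   # e.g. (head, relation, tail)
    facts = "\n".join(f"{h} | {r} | {t}" for h, r, t in triples)
    prompt = f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)
```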
In the scientific domain, ChatOrthopedist [137] aims to support shared decision-making for adolescents with idiopathic scoliosis. By integrating retrieved information into the model's prompts, this approach improves the practical effectiveness and informational accuracy of large language models.
In the task of image generation, RetrieveGAN [44] enhances
the relevance and accuracy of generated images by integrating
retrieved information, including selected image patches and
their corresponding bounding boxes, into the input stage of
the generator. IC-GAN [138] modulates the specific conditions
and details of the generated images by concatenating noise
vectors with instance features.
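The concatenation step that IC-GAN describes can be pictured with the toy sketch below; the generator architecture, dimensions, and stand-in instance feature are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Toy sketch of IC-GAN-style conditioning: a noise vector is concatenated with
# the feature vector of a retrieved instance before entering the generator.
# The generator, dimensions, and stand-in instance feature are all assumptions.
noise_dim, feat_dim, batch = 128, 512, 16
generator = nn.Sequential(
    nn.Linear(noise_dim + feat_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 3 * 32 * 32), nn.Tanh(),
)
noise = torch.randn(batch, noise_dim)                      # z ~ N(0, I)
instance_feat = torch.randn(feat_dim)                      # retrieved-instance feature
z = torch.cat([noise, instance_feat.expand(batch, -1)], dim=-1)
fake = generator(z).view(batch, 3, 32, 32)                 # conditioned samples
```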
In the field of 3D generation, RetDream [49] initially
utilizes CLIP to retrieve relevant 3D assets, effectively merging
the retrieved content with the user input during the input phase.
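Since RetDream relies on CLIP for retrieval, the sketch below shows the underlying text-to-asset matching using Hugging Face's CLIP implementation; the asset embeddings are random placeholders standing in for a real encoded asset library.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# Sketch of CLIP-based asset retrieval as used by RetDream: the user prompt is
# embedded with CLIP's text encoder and matched against precomputed asset
# embeddings by cosine similarity. The embeddings here are random stand-ins
# for real 3D-asset renderings encoded with CLIP's image tower.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

asset_embeds = torch.randn(1000, 512)                        # placeholder library
asset_embeds = asset_embeds / asset_embeds.norm(dim=-1, keepdim=True)

inputs = tok(["a wooden rocking chair"], return_tensors="pt", padding=True)
with torch.no_grad():
    query = model.get_text_features(**inputs)                # (1, 512)
query = query / query.norm(dim=-1, keepdim=True)

top5 = (asset_embeds @ query.T).squeeze(-1).topk(5).indices  # best-matching assets
```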
2) Latent Representation-based RAG: In the framework
of Latent Representation-based RAG, the generative models
interact with latent representations of retrieved objects, thereby
enhancing the model’s comprehension abilities and the quality
of the content generated.
The FiD [34] technique leverages both BM25 and DPR
for sourcing supportive paragraphs. It concatenates each re-
trieved paragraph and its title with the query, processing
them individually through the encoder. By fusing information from the multiple retrieved paragraphs in the decoder, rather than jointly encoding all of them in the encoder, FiD reduces computational complexity and makes efficient use of the relevant information when generating answers. The application of Fusion-
in-Decoder methodologies transcends the realm of textual
content processing, demonstrating substantial potential and
adaptability in processing code, structured knowledge, and
diverse multimodal datasets. Specifically within the code-
related domain, technologies such as EDITSUM [139], BASH-
EXPLAINER [140], and RetrieveNEdit [141] adopt the FiD
approach, facilitating integration through encoder-processed
fusion. Re2Com [142] and RACE [143], among other methods, also feature the design of multiple encoders for different
types of inputs. In the field of Knowledge Base Question
Answering (KBQA), the FiD method has been widely adopted,
demonstrating significant effectiveness. UniK-QA [144], DE-
CAF [145], SKP [146], KD-CoT [147], and ReSKGC [148]
have effectively enhanced the performance of QA systems
through the application of Fusion-in-Decoder technology. This
illustrates that by integrating RAG for KBQA, the efficiency
and accuracy of QA systems can be significantly improved.
In the field of Science, RetMol [54] employs the Fusion-
in-Decoder strategy, integrating information at the decoder
stage to enhance the relevance and quality of the generated
molecular structures.
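To make decoder-side fusion concrete, the sketch below mimics FiD with a vanilla T5 from Hugging Face: each (question, passage) pair is encoded independently, the encoder states are concatenated, and the decoder attends over the fused sequence. This is a simplified sketch rather than FiD's released implementation, which additionally handles batching and padding.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

# Simplified FiD-style sketch with a vanilla T5 (not FiD's released code):
# passages are encoded independently, then fused in the decoder.
tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "Who wrote Hamlet?"
passages = ["Hamlet is a tragedy written by William Shakespeare.",
            "Shakespeare was an English playwright and poet."]

# Encode each (question, passage) pair separately: encoder cost grows
# linearly with the number of passages.
states = []
for p in passages:
    ids = tok(f"question: {question} context: {p}", return_tensors="pt").input_ids
    states.append(model.encoder(input_ids=ids).last_hidden_state)

# Fusion-in-Decoder: concatenate encoder states along the sequence axis so
# the decoder's cross-attention reads all passages jointly.
fused = BaseModelOutput(last_hidden_state=torch.cat(states, dim=1))
out = model.generate(encoder_outputs=fused, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))
```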
Retro [35] pioneers the integration of retrieved text via "chunked cross-attention," a novel mechanism that segments the input sequence into discrete chunks. Each chunk executes cross-attention operations independently, thereby mitigating the computational burden. This technique enables the model to selectively retrieve and assimilate distinct documents for different sequence segments, fostering dynamic retrieval throughout the generation process; it enhances the model's adaptability and enriches the contextual backdrop of the generated content.
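The chunking idea can be captured in a toy sketch; `nn.MultiheadAttention` stands in for Retro's bespoke layer, and the causal shift Retro applies between chunks is omitted for brevity.

```python
import torch
import torch.nn as nn

# Toy sketch of Retro-style chunked cross-attention: the sequence is split
# into chunks and each chunk attends only to its own retrieved neighbours,
# keeping the cost linear in the number of chunks.
class ChunkedCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, chunk_size: int):
        super().__init__()
        self.chunk_size = chunk_size
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, neighbours: torch.Tensor):
        # hidden:     (batch, seq_len, d_model) intermediate decoder states
        # neighbours: (batch, n_chunks, retrieved_len, d_model) encoded
        #             retrieval results, one set per chunk
        b, seq_len, d = hidden.shape
        n_chunks = seq_len // self.chunk_size
        chunks = hidden.view(b * n_chunks, self.chunk_size, d)
        neigh = neighbours.reshape(b * n_chunks, -1, d)
        out, _ = self.attn(query=chunks, key=neigh, value=neigh)
        return out.view(b, seq_len, d)

# Example: 2 chunks of 64 tokens, each with 64 retrieved tokens of context.
cca = ChunkedCrossAttention(d_model=256, n_heads=4, chunk_size=64)
h = torch.randn(1, 128, 256)
nb = torch.randn(1, 2, 64, 256)
print(cca(h, nb).shape)  # torch.Size([1, 128, 256])
```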
In the domain of image generation, cross-attention mechanisms have been widely adopted within RAG frameworks. Methods such as Re-imagen [149], KNN-Diffusion [150], RDM [151], and LAION-RDM & ImageNet-RDM [152] utilize cross-