search for information paragraphs closely related to the input
question. Subsequently, the critique model evaluates these paragraphs, determining their relevance and the level of support they provide, and assessing their impact on response generation. Finally, the generator model constructs responses based on this information and evaluates the quality of those responses through critique marks.
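The retrieve-critique-generate loop can be pictured with the minimal sketch below; `retrieve`, `critique`, and `generate` are hypothetical stand-ins for the three models, and the score fields are illustrative rather than the method's actual interface.

```python
# Minimal sketch of the retrieve-critique-generate pipeline described above.
# `retrieve`, `critique`, and `generate` are hypothetical stand-ins for the
# three models; the score fields are illustrative, not the method's real API.
def answer(question: str, k: int = 5) -> str:
    passages = retrieve(question, top_k=k)                  # retrieval model
    # Critique model: keep passages judged relevant and well supported.
    scored = [(p, critique(question, p)) for p in passages]
    kept = [p for p, s in scored if s["relevant"] and s["supported"]]
    # Generator model: draft a response per kept passage, then use the
    # critique marks to select the highest-quality response.
    drafts = [generate(question, p) for p in kept]
    return max(drafts, key=lambda d: critique(question, d)["quality"])
```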
Recently, some methods have been proposed to implement RAG without modifying the language model architecture, an approach particularly suitable for scenarios where language models
are accessed through APIs. REPLUG [128] illustrates this
methodology by treating the language model as a "black box," utilizing Contriever to seamlessly incorporate relevant
external documents into the query. REPLUG LSR, a variant
with LM-Supervised Retrieval, further refines this process by training the retriever under the language model's own supervision, aiming to retrieve documents that lower the model's perplexity and thereby improve its performance. In-
Context RALM [129] uses the BM25 algorithm for document
retrieval and predictive reranking to select pertinent documents
for input integration.
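To illustrate this black-box pattern, the sketch below prepends each retrieved document to the query and ensembles the per-document output distributions, weighted by retrieval score in the spirit of REPLUG; `retrieve` and `lm_next_token_probs` are hypothetical API-style helpers, not part of any released code.

```python
import numpy as np

# Sketch of black-box RAG in the spirit of REPLUG: the frozen LM is reached
# only through an API-style helper, so no architecture changes are required.
# `retrieve` and `lm_next_token_probs` are hypothetical stand-ins.
def ensembled_next_token_probs(query: str, k: int = 4) -> np.ndarray:
    docs, scores = retrieve(query, top_k=k)          # documents + similarity scores
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over retrieval scores
    mixture = 0.0
    for doc, w in zip(docs, weights):
        prompt = f"{doc}\n\n{query}"                 # prepend document to the query
        mixture = mixture + w * lm_next_token_probs(prompt)  # black-box LM call
    return mixture                                   # ensembled output distribution
```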
In contemporary multimodal application research, integrat-
ing retrieved content into inputs has proven highly effective in
enhancing the performance of various tasks. This strategy is
applicable across several key fields, including code generation,
audio generation, and Knowledge Base Question Answering
(KBQA).
For the text-to-code task, APICoder [130] and DocPrompting [41] demonstrate how effectively integrating retrieved information into language models can improve the accuracy and relevance of generated code. In the automatic program repair task, CEDAR [131] and InferFix [132] utilize retrieved code snippets to aid the repair process, combining them with the original input to enhance the model's understanding and application of repair strategies. For the code completion task, ReACC [133] employs a prompting mechanism, leveraging retrieved code snippets as part of the new input to increase the accuracy and efficiency of code completion.
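The retrieve-then-prompt pattern that ReACC exemplifies can be sketched as follows; `search_code` and `code_lm` are hypothetical helpers, and the simple concatenation template is an assumption.

```python
# Sketch of retrieval-augmented code completion in the spirit of ReACC:
# similar code is retrieved and placed before the unfinished code to form
# the new model input. `search_code` and `code_lm` are hypothetical helpers.
def complete(unfinished_code: str, k: int = 2) -> str:
    snippets = search_code(unfinished_code, top_k=k)    # retrieve similar code
    prompt = "\n\n".join(snippets + [unfinished_code])  # retrieved context first
    return code_lm(prompt)                              # LM completes the code
```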
In the audio generation field, Make-An-Audio [43] leverages retrieval to construct captions for language-free audio, thereby mitigating data sparsity in text-to-audio training.
Recent research in KBQA has demonstrated the significant benefits of combining retrieval with language models. Uni-Parser [134], RNG-KBQA [122], and ECBRF [135] effectively improve the performance and accuracy of QA systems by merging queries and retrieved information into prompts. BLLM augmentation [136] represents an innovative attempt at zero-shot KBQA using black-box large language models. By directly integrating retrieved information into the model input, without any additional training samples, this method demonstrates the great potential of combining retrieval and language models to improve a model's ability to generalize to unseen questions.
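For intuition, a zero-shot prompt of the kind these methods construct might linearize retrieved knowledge-base facts ahead of the question; the template and the `retrieve_triples` and `llm` helpers below are illustrative assumptions, not any paper's released interface.

```python
# Illustrative zero-shot KBQA prompt: retrieved KB triples are linearized and
# merged with the question, with no additional training of the black-box LLM.
# `retrieve_triples` and `llm` are hypothetical helpers.
def kbqa(question: str) -> str:
    triples = retrieve_triples(question, top_k=10)   # e.g. (head, relation, tail)
    facts = "\n".join(f"{h} | {r} | {t}" for h, r, t in triples)
    prompt = f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)
```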
In the scientific domain, ChatOrthopedist [137] aims to support shared decision-making for adolescents with idiopathic scoliosis. By integrating retrieved information into the model's prompts, this approach improves the practical effectiveness and informational accuracy of large language models.
In the task of image generation, RetrieveGAN [44] enhances
the relevance and accuracy of generated images by integrating
retrieved information, including selected image patches and
their corresponding bounding boxes, into the input stage of
the generator. IC-GAN [138] modulates the specific conditions
and details of the generated images by concatenating noise
vectors with instance features.
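The concatenation step that IC-GAN describes can be pictured with the toy sketch below; the generator architecture, dimensions, and stand-in instance feature are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Toy sketch of IC-GAN-style conditioning: a noise vector is concatenated with
# the feature vector of a retrieved instance before entering the generator.
# The generator, dimensions, and stand-in instance feature are all assumptions.
noise_dim, feat_dim, batch = 128, 512, 16
generator = nn.Sequential(
    nn.Linear(noise_dim + feat_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 3 * 32 * 32), nn.Tanh(),
)
noise = torch.randn(batch, noise_dim)                      # z ~ N(0, I)
instance_feat = torch.randn(feat_dim)                      # retrieved-instance feature
z = torch.cat([noise, instance_feat.expand(batch, -1)], dim=-1)
fake = generator(z).view(batch, 3, 32, 32)                 # conditioned samples
```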
In the field of 3D generation, RetDream [49] initially
utilizes CLIP to retrieve relevant 3D assets, effectively merging
the retrieved content with the user input during the input phase.
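Since RetDream relies on CLIP for retrieval, the sketch below shows the underlying text-to-asset matching using Hugging Face's CLIP implementation; the asset embeddings are random placeholders standing in for a real encoded asset library.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# Sketch of CLIP-based asset retrieval as used by RetDream: the user prompt is
# embedded with CLIP's text encoder and matched against precomputed asset
# embeddings by cosine similarity. The embeddings here are random stand-ins
# for real 3D-asset renderings encoded with CLIP's image tower.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

asset_embeds = torch.randn(1000, 512)                        # placeholder library
asset_embeds = asset_embeds / asset_embeds.norm(dim=-1, keepdim=True)

inputs = tok(["a wooden rocking chair"], return_tensors="pt", padding=True)
with torch.no_grad():
    query = model.get_text_features(**inputs)                # (1, 512)
query = query / query.norm(dim=-1, keepdim=True)

top5 = (asset_embeds @ query.T).squeeze(-1).topk(5).indices  # best-matching assets
```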
2) Latent Representation-based RAG: In the framework
of Latent Representation-based RAG, the generative models
interact with latent representations of retrieved objects, thereby
enhancing the model’s comprehension abilities and the quality
of the content generated.
The FiD [34] technique leverages both BM25 and DPR
for sourcing supportive paragraphs. It concatenates each re-
trieved paragraph and its title with the query, processing
them individually through the encoder. By fusing information from the multiple retrieved paragraphs in the decoder, rather than jointly encoding all of them in the encoder, FiD reduces computational complexity and makes efficient use of the relevant information when generating answers. The application of Fusion-
in-Decoder methodologies transcends the realm of textual
content processing, demonstrating substantial potential and
adaptability in processing code, structured knowledge, and
diverse multimodal datasets. Specifically within the code-
related domain, technologies such as EDITSUM [139], BASH-
EXPLAINER [140], and RetrieveNEdit [141] adopt the FiD
approach, facilitating integration through encoder-processed
fusion. Re2Com [142] and RACE [143], among other methods, also feature the design of multiple encoders for different
types of inputs. In the field of Knowledge Base Question
Answering (KBQA), the FiD method has been widely adopted,
demonstrating significant effectiveness. UniK-QA [144], DE-
CAF [145], SKP [146], KD-CoT [147], and ReSKGC [148]
have effectively enhanced the performance of QA systems
through the application of Fusion-in-Decoder technology. This
illustrates that by integrating RAG for KBQA, the efficiency
and accuracy of QA systems can be significantly improved.
In the field of Science, RetMol [54] employs the Fusion-
in-Decoder strategy, integrating information at the decoder
stage to enhance the relevance and quality of the generated
molecular structures.
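To make decoder-side fusion concrete, the sketch below mimics FiD with a vanilla T5 from Hugging Face: each (question, passage) pair is encoded independently, the encoder states are concatenated, and the decoder attends over the fused sequence. This is a simplified sketch rather than FiD's released implementation, which additionally handles batching and padding.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

# Simplified FiD-style sketch with a vanilla T5 (not FiD's released code):
# passages are encoded independently, then fused in the decoder.
tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "Who wrote Hamlet?"
passages = ["Hamlet is a tragedy written by William Shakespeare.",
            "Shakespeare was an English playwright and poet."]

# Encode each (question, passage) pair separately: encoder cost grows
# linearly with the number of passages.
states = []
for p in passages:
    ids = tok(f"question: {question} context: {p}", return_tensors="pt").input_ids
    states.append(model.encoder(input_ids=ids).last_hidden_state)

# Fusion-in-Decoder: concatenate encoder states along the sequence axis so
# the decoder's cross-attention reads all passages jointly.
fused = BaseModelOutput(last_hidden_state=torch.cat(states, dim=1))
out = model.generate(encoder_outputs=fused, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))
```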
Retro [35] pioneers the integration of retrieved text via "chunked cross-attention," a novel mechanism that segments the input sequence into discrete chunks. Each chunk executes cross-attention operations independently, thereby mitigating the computational burden. This technique enables the model to selectively retrieve and assimilate distinct documents for different sequence segments, fostering dynamic retrieval throughout the generation process; it enhances the model's adaptability and enriches the contextual backdrop of the generated content.
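The chunking idea can be captured in a toy sketch; `nn.MultiheadAttention` stands in for Retro's bespoke layer, and the causal shift Retro applies between chunks is omitted for brevity.

```python
import torch
import torch.nn as nn

# Toy sketch of Retro-style chunked cross-attention: the sequence is split
# into chunks and each chunk attends only to its own retrieved neighbours,
# keeping the cost linear in the number of chunks.
class ChunkedCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, chunk_size: int):
        super().__init__()
        self.chunk_size = chunk_size
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, neighbours: torch.Tensor):
        # hidden:     (batch, seq_len, d_model) intermediate decoder states
        # neighbours: (batch, n_chunks, retrieved_len, d_model) encoded
        #             retrieval results, one set per chunk
        b, seq_len, d = hidden.shape
        n_chunks = seq_len // self.chunk_size
        chunks = hidden.view(b * n_chunks, self.chunk_size, d)
        neigh = neighbours.reshape(b * n_chunks, -1, d)
        out, _ = self.attn(query=chunks, key=neigh, value=neigh)
        return out.view(b, seq_len, d)

# Example: 2 chunks of 64 tokens, each with 64 retrieved tokens of context.
cca = ChunkedCrossAttention(d_model=256, n_heads=4, chunk_size=64)
h = torch.randn(1, 128, 256)
nb = torch.randn(1, 2, 64, 256)
print(cca(h, nb).shape)  # torch.Size([1, 128, 256])
```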
In the domain of image generation, cross-attention mechanisms have been widely adopted within RAG frameworks. Methods such as Re-imagen [149], KNN-Diffusion [150], RDM [151], and LAION-RDM & ImageNet-RDM [152] utilize cross-