Augmenting LLMs With Knowledge - A Survey On Hallucination Prevention

— student project —
Abstract—Large pre-trained language models have demonstrated their proficiency in storing factual knowledge within their parameters and achieving remarkable results when fine-tuned for downstream natural language processing tasks. Nonetheless, their capacity to access and manipulate knowledge with precision remains constrained, resulting in performance disparities on knowledge-intensive tasks when compared to task-specific architectures. Additionally, the challenges of providing provenance for model decisions and maintaining up-to-date world knowledge persist as open research frontiers. To address these limitations, the integration of pre-trained models with differentiable access mechanisms to explicit non-parametric memory emerges as a promising solution. This survey delves into the realm of language models (LMs) augmented with the ability to tap into external knowledge sources, including external knowledge bases and search engines. While adhering to the standard objective of predicting missing tokens, these augmented LMs leverage diverse, possibly non-parametric external modules to augment their contextual processing capabilities, departing from the conventional language modeling paradigm. Through an exploration of current advancements in augmenting large language models with knowledge, this work concludes that this emerging research direction holds the potential to address prevalent issues in traditional LMs, such as hallucinations, un-grounded responses, and scalability challenges.

I. INTRODUCTION

Large Language Models (LLMs) have brought about remarkable advancements in Natural Language Processing (NLP) and are now integral to various widely-used products, including Copilot [1], Google's search engine, and, more recently, ChatGPT, a chatbot built upon GPT-3 [2]. These models, characterized by their memorization capabilities as well as their compositional prowess, have demonstrated unprecedented performance in tasks ranging from language understanding to text generation, paving the way for more sophisticated human-computer interactions.

However, LLMs are not without their limitations. They often produce seemingly plausible yet incorrect predictions, a phenomenon known as hallucination [3], leading to avoidable errors in various contexts. Furthermore, many of the groundbreaking capabilities of LLMs appear to scale with the model's size in terms of trainable parameters. While recent efforts have produced smaller LLMs with retained capabilities [4], the practicality of training and maintaining large models remains a challenge, with continual learning for such models posing an ongoing research question [5].

These limitations are rooted in a fundamental issue with LLMs: they are primarily trained for statistical language modeling, relying on a single parametric model and a relatively limited context, typically the preceding n tokens. Despite advancements in hardware and software, most models still employ relatively small context sizes compared to the expansive context required for accurate language modeling in all scenarios. Consequently, achieving the necessary scale to store knowledge beyond the immediate context has become a necessity.

In response, a growing research trend has emerged, moving away from the traditional statistical language modeling paradigm. One approach addresses the limited context size of LLMs by enhancing its relevance through the incorporation of information extracted from external documents [6] [7]. By equipping language models with modules that retrieve relevant documents from databases based on the context, it becomes possible to replicate certain capabilities of larger LLMs while using fewer parameters [8] [9].

Moreover, in this evolving landscape, pioneering models [10] [11] that leverage structured knowledge stand out. These models combine knowledge graphs with a corpus of supporting documents, which can be jointly processed by graph convolutional networks (GCNs). By harnessing graph-based representations, these structured-knowledge augmented models excel in generating precise responses to open-domain questions. This innovative use of structured knowledge marks a significant advancement in enhancing language models, demonstrating the diverse strategies researchers are adopting to address the limitations of contemporary LLMs.
It is worth noting that such approaches transform the resulting models into non-parametric ones, as they can now effectively query external data sources.

Another strategy involves enabling LLMs to leverage external tools [12], such as search engines [13] [14] [12], allowing them to augment the current context with crucial missing information not contained within the model's weights. Although most of these efforts aim to address individual shortcomings of LLMs, it is evident that a more comprehensive integration of knowledge tools has the potential to significantly enhance the capabilities of these models.

In light of these recent developments in NLP, there is a pressing need for a comprehensive taxonomy of augmented language models and for clear definitions of the technical terminology used, which sometimes carries varying interpretations and intentions.

II. BACKGROUND

As we delve into the intricacies of augmenting Large Language Models (LLMs) with external knowledge, it is imperative to establish a foundational understanding of the key concepts that underpin this transformative field. Knowledge augmentation strategies, such as harnessing knowledge graphs, employing beam search techniques, leveraging triplestore databases, and integrating sequence-to-sequence models, form the bedrock upon which advanced language models now stand. In this section, we embark on a comprehensive exploration of these pivotal concepts, unraveling their significance, methodologies, and interconnectedness. By elucidating these fundamental building blocks, we pave the way for a deeper understanding of how contemporary LLMs harness external knowledge.

A. Generative Language Models

Generative language models are trained to produce new text, given an input sequence of tokens. They do this by learning the statistical relationships between words and phrases in a large corpus of text. When given a prompt, a generative model will try to produce text that is consistent with the statistical patterns it has learned.

Some of the most popular generative models in natural language processing include autoregressive models [15], variational autoencoders (VAEs) [16] and generative adversarial networks (GANs) [17]. In this literature survey, we will mostly explore Transformers and autoregressive models, along with another type of generative language model, sequence-to-sequence models.

B. Autoregressive Models

An autoregressive model [15] is a type of neural network used for generating sequences of data, where each element in the sequence is predicted one at a time based on the previously generated elements. In other words, the model generates data by conditioning its predictions on the data it has generated so far. Autoregressive models are typically used for tasks like text generation, time series forecasting, and speech synthesis.

One of the most well-known autoregressive models in NLP is the GPT (Generative Pre-trained Transformer) series, such as GPT-2 [18] and GPT-3 [2]. These models generate text by predicting the next word in a sentence based on the preceding words. They use self-attention [19] mechanisms to capture dependencies between words at different positions in the sequence, making them capable of generating coherent and contextually relevant text.
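To make the autoregressive factorization concrete, the minimal sketch below decodes one token at a time, each conditioned on everything generated so far. The scoring function `model_logits_fn` is a stand-in for a GPT-style decoder and is not taken from any of the surveyed systems; the demo at the end uses a fake model purely to exercise the loop.

```python
import numpy as np

def softmax(logits):
    # Convert raw scores into a probability distribution over the vocabulary.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def generate(model_logits_fn, prompt_ids, eos_id, max_new_tokens=20):
    """Greedy autoregressive decoding: each new token is conditioned on the
    tokens generated so far. model_logits_fn(token_ids) returns one
    unnormalized score per vocabulary entry (a placeholder for a real model)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(model_logits_fn(ids))
        next_id = int(np.argmax(probs))      # pick the most likely next token
        ids.append(next_id)
        if next_id == eos_id:                # stop at end-of-sequence
            break
    return ids

# Tiny demo: a fake "model" that prefers token 3 until the sequence is long
# enough, then emits <EOS> (id 0).
demo = generate(lambda ids: np.eye(5)[3] if len(ids) < 4 else np.eye(5)[0],
                prompt_ids=[1, 2], eos_id=0)
print(demo)   # [1, 2, 3, 3, 0]
```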
C. Sequence-to-sequence Models

A sequence-to-sequence (seq2seq) model [20] predicts the probability of a token being the next token in a given sequence of words.

It consists of an encoder and a decoder. The encoder reads the input sequence, one step at a time, and produces a fixed-dimensional vector representation of the entire sequence. This vector is called the context vector, and it is a representation of all the meaningful information of the input sequence. The context vector is then passed to the decoder, which generates an output sequence.

Sequence-to-sequence models are typically trained using a maximum likelihood objective, which means that they are trained to produce the output sequence that is most likely to follow the input sequence. In summary, seq2seq models are designed for tasks involving the transformation of one sequence into another, often with different lengths and structures. They are typically applied to tasks such as machine translation, text summarization, and question answering, where the relationship between the input and output sequences is not purely linear or where the lengths of the input and output sequences can vary significantly.

From this point onwards, we will refer to sequence-to-sequence models simply as seq2seq.

D. Transformers

The Transformer architecture [19] marked a groundbreaking advancement in the field of NLP. Since its inception, Transformers have become the backbone of various state-of-the-art language models, underpinning many of the recent developments in the realm of augmented language models.

At its core, the Transformer architecture revolutionized sequence-to-sequence modeling through the introduction of the attention mechanism. Unlike earlier recurrent neural networks (RNNs) [21] [22] and convolutional neural networks (CNNs) [23], Transformers rely on self-attention mechanisms to capture dependencies between elements in a sequence, making them highly parallelizable and efficient for processing long-range dependencies.

The architecture consists of two main components: the encoder and the decoder. The encoder processes the input sequence, while the decoder generates the output sequence. Each component comprises multiple layers, with each layer containing a multi-head self-attention mechanism and feed-forward neural networks. These self-attention mechanisms enable Transformers to capture contextual information efficiently, making them ideal for tasks that involve understanding and generating sequences of data.

In the context of language modeling, Transformers can be adapted to function as decoder-only models. In decoder-only Transformers, the encoder component, which is used for encoding input sequences, is removed. These models retain the core Transformer architecture but focus exclusively on generating sequences of tokens, making them particularly suitable for autoregressive language modeling tasks.

Decoder-only Transformers operate in an autoregressive manner. They generate sequences one token at a time, with each token's prediction conditioned on the previously generated tokens. This autoregressive approach allows them to produce coherent and contextually relevant text. Decoder-only Transformers have been instrumental in various text generation tasks, including machine translation, text summarization, and text completion.

Since the introduction of the Transformer architecture, numerous variants and extensions have emerged, each tailored to address specific challenges in NLP. These variants include models such as BERT (Bidirectional Encoder Representations from Transformers) [24], GPT (Generative Pre-trained Transformer) [18] [2], and T5 (Text-to-Text Transfer Transformer) [25], among others. Many of these models have laid the foundation for augmenting language models with external knowledge, a topic of great interest in recent NLP research.

E. Beam Search

Beam Search is a heuristic search algorithm that explores a graph, G, by expanding only the K (beam width) most promising nodes at each step. Beam Search simulates the behavior of Breadth-First Search (BFS). More specifically, it uses BFS to create a search tree. At each level of the tree, it checks all the successors of the current level and keeps only the top K ones, while pruning the others. The process repeats until K leaves are found. Beam Search then returns the leaf that maximizes some given score function.

In the context of NLP, when using a generative model, Beam Search is utilized to find the sequence y = (y_1, ..., y_n) that is most likely to come after an input sequence x. In mathematical notation, the probability to maximize is:

p(y|x) = p(y_n | x, y_{1:n-1}) · p(y_{1:n-1} | x)
       = p(y_n | x, y_{1:n-1}) · p(y_{n-1} | x, y_{1:n-2}) · ... · p(y_1 | x)    (1)

Instead of choosing only the output token with the highest probability each time, beam search keeps the top K tokens with the highest probability and explores the generated sequences recursively until an <EOS> (end-of-sequence) token is reached. Then, it returns the sequence y (out of the K candidate sequences) that maximizes p(y|x).
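The pruning procedure can be sketched as follows, consistent with Eq. (1). Here `next_token_logprobs` is a placeholder for a conditional language model rather than any specific system surveyed below, and the demo uses a hand-written transition table just to exercise the search.

```python
import heapq
import math

def beam_search(next_token_logprobs, bos_id, eos_id, beam_width=3, max_len=30):
    """Keep only the K most promising partial sequences at each step and
    return the finished sequence that maximizes log p(y|x).
    next_token_logprobs(seq) -> list of (token_id, log_prob) pairs."""
    beams = [([bos_id], 0.0)]          # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, logp in next_token_logprobs(seq):
                candidates.append((seq + [tok], score + logp))
        # Prune: keep only the beam_width highest-scoring candidates.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
        still_open = []
        for seq, score in beams:
            (finished if seq[-1] == eos_id else still_open).append((seq, score))
        beams = still_open
        if not beams:                   # every kept sequence reached <EOS>
            break
    return max(finished + beams, key=lambda c: c[1])[0]

# Tiny demo over a 3-token vocabulary {0: <EOS>, 1, 2}; the next-token
# distribution depends only on the previous token.
table = {-1: [(1, math.log(0.4)), (2, math.log(0.6))],
          1: [(2, math.log(0.9)), (0, math.log(0.1))],
          2: [(0, math.log(0.7)), (1, math.log(0.3))]}
print(beam_search(lambda seq: table[seq[-1]], bos_id=-1, eos_id=0, beam_width=2))
```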
In the following sections, we will explore some concepts that are pivotal to the understanding of state-of-the-art augmentation of LLMs.

F. Text Corpus

A text corpus, D, is a set of documents d_1, ..., d_|D|, where each document is a sequence of words: d_i = (w_1, ..., w_|d_i|). Specifically, in the context of this paper, a document is essentially a sentence, and an article is a collection of documents.

As we will see later on in this survey, text corpora are considered an unstructured knowledge base and are usually organized in vector databases.

G. Vector Database

In a vector database, a document can correspond to one vector or to many vectors, depending on the specific implementation of the database. A single vector captures the overall meaning of the document; this is often done by averaging the vectors of the words in the document. In other cases, a document may be represented by a vector for each word in the document, which is useful when it is important to be able to track the individual words in the document.

When a language model retrieves information from a vector database, it essentially has access to knowledge that is not stored in its parameters (weights). Therefore, a vector database is a form of non-parametric memory for LLMs.

H. Dense Vector Index

Indexing in a vector database is the process of organizing the vectors in the database in a way that makes it efficient to search for and retrieve similar vectors (vectors with a high inner product). This is accomplished by creating a data structure that maps each vector to a set of other vectors that are similar to it.

Maximum Inner Product Search (MIPS) is a specific type of vector search that aims to find the vector in the database with the highest inner product with a given query vector. MIPS is used in a variety of applications, such as recommendation systems, machine learning, and image retrieval.

FAISS [26] is a popular open-source library for efficient similarity search and clustering of dense vectors. FAISS contains a variety of algorithms for MIPS, as well as other types of vector search, and is used by many companies and organizations, including Google, Facebook, and Microsoft.
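As a concrete illustration of MIPS with FAISS [26], the minimal sketch below indexes a toy set of random "document" vectors and retrieves the five with the highest inner product to a query vector. The random embeddings stand in for real document encodings.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 64
rng = np.random.default_rng(0)

# Toy "document" embeddings standing in for encoded sentences.
doc_vectors = rng.standard_normal((1000, dim)).astype("float32")

# IndexFlatIP scores by inner product, i.e., exact MIPS.
index = faiss.IndexFlatIP(dim)
index.add(doc_vectors)

# Embed a query (here: another random vector) and fetch the top-5 documents.
query = rng.standard_normal((1, dim)).astype("float32")
scores, doc_ids = index.search(query, 5)
print(doc_ids[0], scores[0])
```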
I. Triplestore Knowledge Bases

A triplestore knowledge base is a database that consists of subject-predicate-object triples. An example of such a triple is: (Subject: Albert Einstein, Predicate: was born in, Object: Ulm, Germany). Triples are a great form of representing factual knowledge because they capture the nature of the relationship between a subject and an object and can be easily processed by LLMs. One can visualize this knowledge base as a graph whose vertices are the various subjects and objects (entities) and whose edges are the predicates connecting these entities. Each edge has a type (e.g., "was born in") that describes the kind of relation between the connected entities. Triplestore knowledge bases with more than one type of relation are called heterogeneous.

Triplestores are an excellent example of what we call structured knowledge bases. They can be merged with unstructured knowledge bases through a set of entity links (v, d_p), connecting entity v with the word at position p in document d.
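A minimal sketch of how such a heterogeneous triplestore and its entity links into a text corpus could be represented; the entities, documents, and word positions are illustrative only, not drawn from any of the surveyed datasets.

```python
from collections import defaultdict

# Subject-predicate-object triples form the structured knowledge base.
triples = [
    ("Albert Einstein", "was born in", "Ulm, Germany"),
    ("Albert Einstein", "developed", "the theory of relativity"),
]

# Unstructured side: a tiny corpus where each document is a sentence.
documents = {"d1": "Albert Einstein was born in Ulm .".split()}

# Entity links (v, d_p): entity v occurs at word position p of document d.
entity_links = [("Albert Einstein", ("d1", 0)), ("Ulm, Germany", ("d1", 5))]

# View the triplestore as a graph: entities are vertices and predicates are
# typed edges; more than one edge type makes the graph heterogeneous.
graph = defaultdict(list)
for subj, pred, obj in triples:
    graph[subj].append((pred, obj))
print(graph["Albert Einstein"])
```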
J. Graph Convolutional Networks

Graph convolutional networks (GCNs) are a type of neural network that can be used to learn representations of nodes in a structured knowledge base, such as a graph. GCNs are particularly well-suited for node classification tasks, where the goal is to predict the label of each node in the graph (e.g., whether the node contains an answer to a given question or not).

GCNs work by iteratively aggregating information from the neighbors of each node. At each layer, the GCN collects the embeddings of all of a node's neighbors, averages them, and then applies a linear transformation and a nonlinear activation function. The output of this layer is then used as the input to the next layer.

The more layers the GCN has, the more multi-hop reasoning the model will be able to perform, because it will gather information from neighbors that are farther away. This makes GCNs well-suited for tasks where the labels of nodes depend on the labels of their neighbors, such as social network analysis and fraud detection.

Here is a high-level overview of how a GCN works for node classification (a code sketch of one such layer follows the list):
1) Initialize the embeddings of all nodes in the graph.
2) For each node in the graph:
   a) Collect the embeddings of all of the node's neighbors.
   b) Average the embeddings of the node's neighbors.
   c) Apply a linear transformation and a nonlinear activation function to the average embedding.
   d) The output of this function is the new embedding for the node.
3) Repeat step 2 for a fixed number of layers.
4) The final embedding of each node is used as the input to a classifier to predict the node's label.
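A numpy sketch of steps 1-4 under toy assumptions (random embeddings and a hand-written adjacency list); it is only meant to make the neighbor-averaging update concrete.

```python
import numpy as np

def gcn_layer(node_embeddings, neighbors, weight):
    """One GCN layer: average each node's neighbor embeddings, apply a
    linear transformation, then a nonlinearity (ReLU)."""
    new_embeddings = np.zeros((node_embeddings.shape[0], weight.shape[1]))
    for v, neigh in neighbors.items():
        avg = node_embeddings[neigh].mean(axis=0)            # steps 2a-2b
        new_embeddings[v] = np.maximum(avg @ weight, 0.0)    # steps 2c-2d
    return new_embeddings

# Toy graph with 4 nodes; stacking layers widens the receptive field
# (multi-hop reasoning), as noted above.
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))                  # step 1: initial embeddings
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
for weight in (rng.standard_normal((8, 8)), rng.standard_normal((8, 8))):
    h = gcn_layer(h, neighbors, weight)          # step 3: repeat for L layers
# step 4: h would now feed a per-node classifier (answer / not-answer).
print(h.shape)
```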
K. Relational Graph Convolutional Networks

One problem that arises when the knowledge-base graph is heterogeneous is that we then want to take into consideration the type of relation that a node has with its neighbors before we average their embeddings.

A relational GCN [27] is similar to a regular GCN, but it uses a separate weight matrix for each type of relation. Therefore, when using a relational GCN, we aggregate the embeddings from all neighbors connected through a specific relation and pass the averaged embedding through a separate GCN layer (weight matrix) for each relation.
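The per-relation aggregation can be sketched as follows; the relation names, toy graph, and random weight matrices are stand-ins, not the R-GCN reference implementation.

```python
import numpy as np

def rgcn_layer(node_embeddings, typed_neighbors, relation_weights):
    """Relational GCN layer: neighbors are averaged per relation type and
    each average goes through that relation's own weight matrix."""
    out = np.zeros_like(node_embeddings)
    for v, by_relation in typed_neighbors.items():
        for relation, neigh in by_relation.items():
            avg = node_embeddings[neigh].mean(axis=0)
            out[v] += avg @ relation_weights[relation]
    return np.maximum(out, 0.0)

rng = np.random.default_rng(1)
h = rng.standard_normal((3, 8))
typed_neighbors = {0: {"was born in": [1], "works at": [2]}, 1: {}, 2: {}}
relation_weights = {r: rng.standard_normal((8, 8))
                    for r in ("was born in", "works at")}
print(rgcn_layer(h, typed_neighbors, relation_weights).shape)
```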
III. KNOWLEDGE BASE AUGMENTED GENERATION

Language models have the ability to store knowledge in their parameters. Alternatively, knowledge in the form of natural language can be offloaded completely from the LM by retrieving it from an external knowledge base. Memory augmentation strategies help the language model avoid producing non-factual information and reduce the number of parameters required to achieve performance comparable to significantly larger LMs. Based on their structure, knowledge bases can be either unstructured (text-based) or structured (graph-based). In this literature survey, we are going to explore work from both worlds.

Fig. 1: Overview of knowledge augmentation of language models from the paper by Izacard et al. [7]. The input query (light yellow), along with a number of retrieved relevant documents (light blue), passes through the generative seq2seq model to produce an output response.

A. Retrieval-Augmented Generation (RAG)

RAG [6] uses both parametric and non-parametric memory to generate more accurate and informative responses to an input query. Specifically, the RAG architecture entails:
• a generator: a BART-large [28] sequence-to-sequence language model, pre-trained on a massive dataset of text and code (parametric memory).
• a knowledge base: a dense vector index of the Wikipedia database (non-parametric memory). All documents in the knowledge base are encoded as vectors using a BERT-base [24] document encoder, BERT_d.
• a retriever: a component responsible for retrieving the documents of the knowledge base that are most relevant to the input query. It follows the DPR (dense passage retrieval) architecture [29] and consists of the document encoder, BERT_d, and a query encoder, BERT_q. The retriever
  – calculates the embedding of the input query, using the BERT_q encoder;
  – conducts Maximum Inner Product Search (MIPS) in the indexed knowledge base to find the K documents most similar to the input query.

According to the authors of RAG, training and fine-tuning the parameters of the BERT_d encoder is extremely computationally expensive, and not very effective accuracy-wise. Specifically, if they were to train the parameters of BERT_d, then for each training iteration the embeddings of every document in the knowledge base would have to be updated as well, so that they stay in sync with the new BERT_d encoder.
Therefore, they use a completely pre-trained BERT_d encoder, and during the fine-tuning stage they only fine-tune the parameters of the query encoder, BERT_q.

One interesting aspect of RAG is how it implements the fusion of knowledge from all retrieved documents to produce a final response. In both proposed versions of RAG, RAG-token and RAG-sequence, fusion is performed right after the decoder.

Specifically, RAG-token:
• for each retrieved document z, calculates the probability for each token y_i in the vocabulary to be the next token in the sequence:

  p_θ(y_i | x, z, y_{1:i-1})    (2)

• sums the probabilities over all retrieved documents (marginalization):

  p'_θ(y_i | x, y_{1:i-1}) = Σ_z p_η(z | x) · p_θ(y_i | x, z, y_{1:i-1})    (3)

• runs Beam Search to find the K most likely next tokens;
• chooses the token y_i with the highest transition probability.

The RAG-sequence model is easier to grasp. It takes into account only one retrieved document per sequence that it generates. Specifically, for each retrieved document, it conducts Beam Search to generate K sequences. Then, it simply returns the sequence with the highest probability.
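A small numpy sketch of the marginalization in Eq. (3). The retrieval scores and per-document next-token distributions are random stand-ins rather than outputs of the actual DPR retriever or BART generator.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_docs = 10, 3

# p_eta(z | x): how relevant each retrieved document is to the query.
doc_scores = rng.random(n_docs)
doc_probs = doc_scores / doc_scores.sum()

# p_theta(y_i | x, z, y_{1:i-1}): next-token distribution given each document.
per_doc_logits = rng.standard_normal((n_docs, vocab_size))
per_doc_token_probs = np.exp(per_doc_logits)
per_doc_token_probs /= per_doc_token_probs.sum(axis=1, keepdims=True)

# Eq. (3): marginalize over documents to obtain a single next-token
# distribution, which beam search then consumes as in the bullets above.
marginal = doc_probs @ per_doc_token_probs
top_k = np.argsort(marginal)[::-1][:5]
print(top_k, marginal[top_k])
```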
B. REALM [30]

REALM was the first method that managed to pre-train the retriever and the generator jointly. The authors of REALM propose three stages of training for the given architecture:
• initialization
• pre-training
• fine-tuning

One significant challenge that REALM faced was the fact that, at the beginning of training, the query and document encoders, Embed_input and Embed_doc respectively, contain completely random parameters. Hence, the retrieved documents, z, will likely be unrelated to the input query, x. As a result, the generator learns to ignore the retrieved documents. Once this occurs, the retriever no longer receives a meaningful gradient during training and cannot improve, creating a vicious cycle that does not result in an accurate end model.

To avoid this cold-start problem, the authors warm-start (initialization) the retriever (Embed_input + Embed_doc) using a training objective known as the Inverse Cloze Task (ICT) [31], where, given a sentence, the model is trained to retrieve the document that the sentence came from.

In the case of the generator, the authors warm-start it with BERT pre-training [24] and use the uncased BERT-base model (12 layers, 768 hidden units, 12 attention heads).

After the initialization stage, REALM proposes an unsupervised pre-training method. During each pre-training iteration, REALM:
1) randomly selects sentences from the text corpus and masks specific tokens from each sentence;
2) receives a masked query, q, as input; an example of such a query would be: "The [MASK] at the top of the pyramid";
3) outputs its token prediction (the correct answer is "pyramidion");
4) back-propagates through the parameters θ of the retriever p_θ(z|x) and ϕ of the generator p_ϕ(y|z, x) (joint pre-training of the models).

During pre-training, both the Embed_doc and the Embed_input components of the retriever are updated. Because the parameters of Embed_doc are updated during pre-training, in order for the document embeddings in the Wikipedia knowledge base to stay in sync with the updated retriever, after each back-propagation step REALM needs to:
1) re-compute the document embeddings;
2) re-calculate the document index (in order to perform MIPS).

This is a computationally expensive task, especially for really large databases such as Wikipedia. Therefore, REALM was designed such that the embedding updates happen every 100 back-propagation steps, as an asynchronous process.

The supervised fine-tuning method that the authors used in order to evaluate REALM on Open-domain Question Answering (Open-QA) goes as follows:
1) they collect question-answer tuples, such as ("What's the angle of an equilateral triangle?", "60 degrees");
2) REALM receives the question as input;
3) it outputs its prediction;
4) similar to the pre-training phase, REALM back-propagates through the parameters θ of the retriever p_θ(z|x) and ϕ of the generator, but this time Embed_doc stays untouched. Therefore, fine-tuning is much less computationally expensive.
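The interplay between parameter updates and the stale document index can be summarized with the toy loop below. The "gradient updates" are faked with random noise and the brute-force re-embedding stands in for rebuilding the MIPS index, so this only illustrates the refresh schedule, not REALM's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((500, 32)).astype("float32")   # raw doc features
doc_proj = rng.standard_normal((32, 32))                     # Embed_doc params
query_proj = rng.standard_normal((32, 32))                   # Embed_input params

def embed_docs():
    # Re-embedding the whole corpus (and rebuilding the index) is the costly part.
    return corpus @ doc_proj

doc_index = embed_docs()              # stands in for the MIPS index
REFRESH_EVERY = 100

for step in range(1000):
    # ...masked-LM loss and backprop would update doc_proj and query_proj;
    # here the update is faked with small random noise.
    doc_proj += 0.001 * rng.standard_normal(doc_proj.shape)
    query_proj += 0.001 * rng.standard_normal(query_proj.shape)
    if step % REFRESH_EVERY == 0:
        # Document embeddings are now stale: rebuild the index (REALM does
        # this asynchronously so training does not stall).
        doc_index = embed_docs()

# Fine-tuning: doc_proj (and hence doc_index) stays frozen; only query_proj
# and the generator receive gradients, which makes it much cheaper.
```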
C. Fusion-in-Decoder (FiD)

FiD [7] employs an idea similar to RAG, but simpler. Their main difference lies in the way they perform the fusion of the retrieved knowledge.

Similar to RAG, in FiD we have two main models:
• the retriever, which has access to a knowledge base where documents are represented as dense BERT-base vectors and retrieves the most relevant documents by running Maximum Inner Product Search (MIPS) using the FAISS library [26];
• the generator, which is a sequence-to-sequence model that receives the input query concatenated with a retrieved passage and is trained to produce an answer. For their experiments, the authors used a pre-trained T5 [25] seq2seq model.

In FiD, fusion of the knowledge in the retrieved documents is performed right before the decoder. Specifically, similar to RAG, they concatenate the input query with each retrieved passage and feed each concatenation separately (and in parallel) to the encoder. After that, all the produced encoded vectors are concatenated together (fusion) and passed as a single input sequence to the decoder, which performs attention across all retrieved documents (cross-attention).

Fig. 2: Overview of the Fusion-in-Decoder (FiD) [7] technique. The input question gets concatenated with each relevant passage and all concatenations get encoded in parallel. The embeddings that are produced are concatenated together (fusion) and are passed as input to the decoder.
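The fusion step can be sketched in terms of tensor shapes: each (question + passage) pair is encoded independently, then the encoder outputs are concatenated along the sequence axis before being handed to the decoder, whose cross-attention then spans all passages at once. The encoder here is a random stand-in, not T5.

```python
import numpy as np

rng = np.random.default_rng(0)
n_passages, passage_len, d_model = 4, 16, 32

def encode(token_embeddings):
    # Stand-in for the seq2seq encoder (e.g., T5's encoder in FiD).
    return np.tanh(token_embeddings @ rng.standard_normal((d_model, d_model)))

# Each retrieved passage is concatenated with the question and encoded
# independently (in parallel in the real system).
pairs = rng.standard_normal((n_passages, passage_len, d_model))
encoded = np.stack([encode(p) for p in pairs])                 # (4, 16, 32)

# Fusion: concatenate along the sequence axis so the decoder's
# cross-attention sees all passages in one pass.
fused = encoded.reshape(1, n_passages * passage_len, d_model)  # (1, 64, 32)
print(fused.shape)
```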
GRAFT-Net consists of the following stages:
1) the question sub-graph (G_q) retrieval stage: this is a characteristic of early fusion, the process of combining information from the triplestore knowledge base and text early in the model, i.e., before a graph neural network is used;
2) the answer selection stage, where GRAFT-Net uses a Graph Convolutional Network (GCN) variant [34] [35] [27] to do binary classification (answer, not-answer) on the nodes of G_q.

The question sub-graph G_q is essentially a copy of the entire knowledge-base graph in which the nodes and edges that are irrelevant to a given question, q, are pruned. In addition, the question sub-graph contains text documents as well, but only the ones that are likely to contain the answer to question q.

The retrieval of the question sub-graph G_q happens in two parallel pipelines:
1) Knowledge Base Retrieval
2) Text Retrieval

During the knowledge base retrieval, a sub-graph of the triplestore knowledge base is retrieved. Specifically, GRAFT-Net:
1) retrieves a set of seed entities, S_q, that are relevant to the question q;
2) runs the Personalized PageRank (PPR) method [36] around these seeds to identify other entities which might be an answer to the question. During PPR, weights are assigned to the edges around the seed entities. Each edge weight is essentially the cosine similarity between:
   • the question vector, v(q): the average of all word vectors in the question;
   • the relation vector, v(r): the average of all word vectors in the relation corresponding to that edge;
3) retains the top E entities v_1, ..., v_E by PPR score, along with any edges between them, and adds them to the question sub-graph, G_q.

During the text retrieval phase, GRAFT-Net retrieves documents (sentences) relevant to the question, q, from the Wikipedia database. The text retrieval phase entails the steps described below. GRAFT-Net:
1) retrieves the top 5 most relevant Wikipedia articles (collections of documents) by using a weighted bag-of-words model [37];
2) populates a Lucene index [38] (which facilitates data search in a large corpus of text) with sentences from these articles, and retrieves the top-ranking ones: d_1, ..., d_D.

The final question graph G_q consists of:
• V_q: all retrieved entities and documents;
• E_q: all relations between the retrieved entities and all entity links between entities and documents.

Because the vertices of the graph can be either entities or documents, the graph is considered heterogeneous.
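The edge-weighting step described above can be sketched as follows; the word vectors are random stand-ins for pre-trained embeddings, and the question and relation strings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
word_vectors = {w: rng.standard_normal(50) for w in
                "who founded the company that makes android was founded by".split()}

def avg_vector(words):
    # v(q) / v(r): the average of the word vectors of a phrase.
    return np.mean([word_vectors[w] for w in words], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

question = "who founded the company that makes android".split()
relation = "was founded by".split()   # the relation label on a candidate edge

# Edge weight used to bias Personalized PageRank around the seed entities.
edge_weight = cosine(avg_vector(question), avg_vector(relation))
print(edge_weight)
```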
G. PullNet [11]

PullNet builds upon the advancements made by GRAFT-Net and uses the text corpus to supplement information extracted from the triplestore knowledge base in order to answer multi-hop questions. The subjects and objects in the triples contain links to relevant documents in the text corpus, and PullNet uses these links to produce more factually grounded answers.

Like GRAFT-Net, PullNet has an initial phase where it retrieves a question sub-graph G_q. However, PullNet learns how to construct the sub-graph, rather than using an ad-hoc subgraph-building strategy. More specifically, PullNet relies on a small set of retrieval operations, each of which expands a graph node by retrieving new information from the knowledge base or the corpus. It learns when and where to apply these "pull" operations with another graph CNN classifier. The "pull" classifier is weakly supervised, using question-answer pairs.

The end result is a learned iterative process for sub-graph construction, which begins with a small sub-graph containing only the question text and the entities it contains, and gradually expands the sub-graph to contain information from the knowledge base and corpus that is likely to be useful. The process is especially effective for multi-hop questions.

IV. SEARCH-ENGINE AUGMENTED GENERATION

Augmenting large language models with search engines represents the next step in the evolution of AI-driven natural language processing. Search engines empower models with a gateway to an expansive universe of knowledge that far surpasses what external knowledge bases can access. By harnessing the prowess of search engines, these models gain the ability to tap into the vast and ever-expanding repository of information on the World Wide Web. This dynamic access not only provides a wealth of information but also ensures that text generation remains current and up-to-date with the latest developments, a feat that external knowledge bases often struggle to achieve as they require continuous updates.

However, it is crucial to acknowledge that this newfound access to the open web through search engines carries potential risks. The information landscape of the internet is diverse, encompassing both valuable knowledge and, regrettably, harmful or malicious content. When integrated with augmented large language models, there exists the possibility of inadvertently exposing the model to inappropriate or unsafe content. This introduces concerns regarding the reliability and safety of the generated responses, as the model may unintentionally incorporate harmful information into its outputs.
As we will see in the following sections, the use of search-engine-based queries has the benefit that these queries are inherently designed to be understood by humans, enhancing both the interpretability of the model's responses and their potential for continuous improvement through direct annotation or feedback. However, to harness the immense potential of this symbiotic fusion of AI-driven language models and the vast knowledge landscape facilitated by search engines, it is imperative to develop robust safeguards and mechanisms to mitigate the risks associated with accessing potentially harmful or malicious content. This will ensure that the augmentation of language models with search engines not only broadens their horizons but also maintains the integrity and safety of their outputs, ushering in a new era of responsible and informed natural language understanding and interaction.

A. Internet-Augmented Dialogue Generation (IADG)

The previously described FAISS-based approaches, such as RAG (III-A) and FiD (III-C), can take advantage of many existing methods developed for QA and dialogue tasks, as we saw, but they have several disadvantages. First, they may be difficult to update with real-time web documents. On top of that, there may be a limit to the number of documents that can be stored in local FAISS deployments. Finally, such methods do not take advantage of the high-quality ranking that has been finely tuned in Internet search engines over decades of use. Thus, the authors of this paper by Facebook AI Research consider using Internet search engines directly for knowledge retrieval.

IADG [13] consists of two main components:
• a search query generator: an encoder-decoder Transformer that takes the dialogue context as input and generates a search query. This is given to the black-box search engine API, and N documents are returned.
• a FiD-style generator: an encoder-decoder model that encodes each document individually (along with the dialogue context), concatenates the encodings before they enter the decoder, and finally generates the next response.

Each of these components can be trained separately, given supervised data for both tasks. The query generator requires (context, search query) pairs, and the response generator requires (context, response) pairs.

The search engine is a black box in this system (similar to LaMDA), and could potentially be swapped out for any method. In IADG, the authors use the Bing Search API [39] in their experiments to generate a list of URLs for each query. Then, they use these URLs as keys to find their page content.
they use these URLs as keys to find their page content.
system can receive “How old is Rafael Nadal?” as input, and
B. SeeKeR output [“Rafael Nadal / Age / 35”].
SeeKeR [14] (Search-engine → Knowledge → Response) The information retrieval system is also capable of returning
introduces an innovative approach that employs a single lan- snippets of content from the open web, with their correspond-
guage model to tackle three distinct modular tasks consecu- ing URLs. The TS tries an input string on all of its tools, and
tively: searching for information, generating knowledge, and produces a final output list of strings by concatenating the
crafting a final response. In this research endeavor, SeeKeR output lists from every tool in the following order: calculator,
explores a modular framework that builds upon the founda- translator, and information retrieval system. A tool will return
tions of IADG [13] while amalgamating the most effective an empty list of results if it can’t parse the input (e.g., the
elements from various existing solutions. calculator cannot parse “How old is Rafael Nadal?”), and
The SeeKeR model adheres to the foundational architecture therefore does not contribute to the final output list.
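The Toolset contract (single string in, concatenated list of strings out, empty list on parse failure, fixed tool order) can be sketched as below. The tool bodies are toy stand-ins, not Google's implementation.

```python
def calculator(text):
    try:
        # Only plain arithmetic such as "351 + 12" is accepted here.
        return [str(eval(text, {"__builtins__": {}}, {}))]
    except Exception:
        return []            # cannot parse -> contributes nothing

def translator(text):
    return []                # toy stand-in: never fires in this sketch

def information_retrieval(text):
    # Toy stand-in for the retrieval system / web snippets.
    facts = {"How old is Rafael Nadal?": ["Rafael Nadal / Age / 35"]}
    return facts.get(text, [])

def toolset(text):
    """Single string in, list of strings out, in the fixed tool order:
    calculator, then translator, then information retrieval."""
    return calculator(text) + translator(text) + information_retrieval(text)

print(toolset("How old is Rafael Nadal?"))   # ['Rafael Nadal / Age / 35']
print(toolset("351 + 12"))                   # ['363']
```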
It is essential to note that the LaMDA paper gives only little information on how the information retrieval system works, apart from the fact that it entails a database but can also provide web snippets along with their URLs.

LaMDA entails two main sub-models that follow the decoder-only Transformer architecture:
1) LaMDA-Base: a regular generative model that is pre-trained on a large dataset. LaMDA-Base is the first model to receive a query from the user. It then generates a response that is checked and refined by LaMDA-Research.
2) LaMDA-Research: a generative model that usually receives the output of LaMDA-Base as input and is fine-tuned to choose the recipient of its output (the TS or the user). In general, LaMDA-Research queries the TS in a loop, until it has sufficient information to generate a final response to the user.
V. LIMITATIONS AND DISCUSSION

Augmented large language models grapple with a set of recurring challenges. These issues encompass occasional inconsistencies, contradictions, factual inaccuracies, potential repetition, and a limited depth of reasoning, among others [40] [41].

Furthermore, concerns emerge regarding the generation of content imbued with toxic language and bias, especially in specific contexts and topics [42] [43]. Another noteworthy concern is the influence of internet-sourced documents on model outputs, potentially leading to the retrieval of undesirable content.

Many research experiments lean on externally developed search engines, which offer advantages in terms of optimization and reliability. However, building one's own retrieval system, as is often the case in question-answering (QA) and language modeling (LM) research, necessitates starting from scratch. While search engines are adept at crawling and indexing the latest news and documents, this process demands significant engineering effort, which is vital for various applications. Conversely, methods in the literature that use their own retrieval setups often rely on fixed document databases, which become outdated over time. Additionally, search engines are designed for human interaction, using natural language queries with limited context. In contrast, machine-generated queries, as exemplified by models like RAG [6], can potentially encode more context or adopt vector-encoded queries, albeit at the cost of human interpretability. A benefit of search-engine-based queries is their human readability, offering both interpretability and the potential for improvement through direct annotation or feedback.

Language models employing augmentation address the challenge of hallucination but do not guarantee factual grounding. Instances of conflicting retrievals can lead to mixed responses. To enhance reliability, the introduction of trust mechanisms, assigning different weights to retrievals, is a potential avenue. Another concern is the generation of generic responses that may overlook the incorporated knowledge.

In this survey, we have highlighted these common challenges and limitations faced by augmented large language models, shedding light on the evolving landscape of language generation and the pressing need for innovative solutions.

VI. CONCLUSION

In this literature survey, we have explored a multitude of works in which Language Models (LMs) have been enriched with external knowledge, enabling them to generate more contextually grounded and up-to-date responses. Throughout these studies, LMs have demonstrated their capacity to enhance context by incorporating relevant information, thereby fostering the production of informative answers to various questions. This augmentation often involves the integration of non-parametric modules, marking a departure from the conventional language modeling paradigm and categorizing these models as augmented language models.

However, it is essential to acknowledge certain limitations within this paradigm shift. While LMs augmented with external knowledge exhibit reduced hallucination, they do not offer an ironclad guarantee of factual grounding. Instances arise where conflicting retrievals result in mixed answers, underscoring the need for continued refinement in this domain. Moreover, the limited exploration of the interplay between reasoning augmentation and knowledge integration in current research highlights a promising avenue for future endeavors.

As we reflect on the landscape of augmented language models, it becomes evident that this field holds immense promise and excitement. It represents a vital step towards ushering in the next generation of deep learning systems that can engage in complex and meaningful human-machine interactions while minimizing the parameter footprint. The journey towards fully realizing the potential of augmented LMs is ongoing, with opportunities for further innovation and investigation awaiting those who seek to shape the future of this dynamic field.
Overview of the surveyed augmented language models (ALMs):

Year  ALM        Source of Knowledge   Retriever                                 Generator
2018  GRAFT-Net  Graph + Text          Personalized PageRank + DrQA              GCNN
2019  PullNet    Graph + Text          Pull                                      GCNN
2020  RAG        Text                  BERT                                      seq2seq
2020  REALM      Text                  BERT                                      seq2seq
2021  FiD        Text                  BERT                                      seq2seq
2021  IADG       Internet              seq2seq + Search Engine                   Encoder-Decoder Transformer
2022  LaMDA      Internet              Black-Box Information Retrieval System    Decoder-only Transformer
2022  Atlas      Text                  Contriever                                seq2seq
2022  RETRO      Text                  BERT                                      Encoder-Decoder Transformer
2022  SeeKeR     Text                  Encoder-Decoder Transformer               Encoder-Decoder Transformer

REFERENCES

[1] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, "Evaluating large language models trained on code," 2021.
[2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," 2020.
[3] S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston, "Neural text generation with unlikelihood training," 2019.
[4] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, "Training compute-optimal large language models," 2022.
[5] T. Scialom, T. Chakrabarty, and S. Muresan, "Fine-tuned language models are continual learners," 2022.
[6] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
[7] G. Izacard and E. Grave, "Leveraging passage retrieval with generative models for open domain question answering," 2021.
[8] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche, J.-B. Lespiau, B. Damoc, A. Clark, D. de Las Casas, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang, L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Paganini, G. Irving, O. Vinyals, S. Osindero, K. Simonyan, J. W. Rae, E. Elsen, and L. Sifre, "Improving language models by retrieving from trillions of tokens," 2022.
[9] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, "Atlas: Few-shot learning with retrieval augmented language models," 2022.
[10] H. Sun, B. Dhingra, M. Zaheer, K. Mazaitis, R. Salakhutdinov, and W. Cohen, "Open domain question answering using early fusion of knowledge bases and text," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, pp. 4231–4242. [Online]. Available: https://aclanthology.org/D18-1455
[11] H. Sun, T. Bedrax-Weiss, and W. W. Cohen, "PullNet: Open domain question answering with iterative retrieval on knowledge bases and text," 2019.
[12] R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, V. Zhao, Y. Zhou, C.-C. Chang, I. Krivokon, W. Rusch, M. Pickett, P. Srinivasan, L. Man, K. Meier-Hellstern, M. R. Morris, T. Doshi, R. D. Santos, T. Duke, J. Soraker, B. Zevenbergen, V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson, A. Molina, E. Hoffman-John, J. Lee, L. Aroyo, R. Rajakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fenton, A. Cohen, R. Bernstein, R. Kurzweil, B. Aguera-Arcas, C. Cui, M. Croak, E. Chi, and Q. Le, "LaMDA: Language models for dialog applications," 2022.
[13] M. Komeili, K. Shuster, and J. Weston, "Internet-augmented dialogue generation," 2021.
[14] K. Shuster, M. Komeili, L. Adolphs, S. Roller, A. Szlam, and J. Weston, "Language models that seek for knowledge: Modular search and generation for dialogue and prompt completion," 2022.
[15] Y. Bengio, R. Ducharme, and P. Vincent, "A neural probabilistic language model," in Advances in Neural Information Processing Systems, T. Leen, T. Dietterich, and V. Tresp, Eds., vol. 13. MIT Press, 2000. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2000/file/728f206c2a01bf572b5940d7d9a8fa4c-Paper.pdf
[16] D. P. Kingma and M. Welling, "An introduction to variational autoencoders," Foundations and Trends in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019. [Online]. Available: https://doi.org/10.1561%2F2200000056
[17] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," 2014.
[18] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2023.
[20] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," 2014.
[21] D. E. Rumelhart, G. E. Hinton, R. J. Williams et al., "Learning internal representations by error propagation," 1985.
[22] M. I. Jordan, "Serial order: A parallel distributed processing approach," in Advances in Psychology. Elsevier, 1997, vol. 121, pp. 471–495.
[23] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, "Handwritten digit recognition with a back-propagation network," Advances in Neural Information Processing Systems, vol. 2, 1989.
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2019.
[25] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," 2020.
[26] J. Johnson, M. Douze, and H. Jégou, "Billion-scale similarity search with GPUs," IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2021.
[27] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, "Modeling relational data with graph convolutional networks," in The Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings 15. Springer, 2018, pp. 593–607.
[28] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," 2019.
[29] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih, "Dense passage retrieval for open-domain question answering," 2020.
[30] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang, "REALM: Retrieval-augmented language model pre-training," 2020.
[31] K. Lee, M.-W. Chang, and K. Toutanova, "Latent retrieval for weakly supervised open domain question answering," 2019.
[32] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, and E. Grave, "Unsupervised dense information retrieval with contrastive learning," 2022.
[33] R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, and S. Kumar, "Accelerating large-scale inference with anisotropic vector quantization," 2020.
[34] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," 2017.
[35] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, "Gated graph sequence neural networks," 2017.
[36] T. H. Haveliwala, "Topic-sensitive PageRank," in Proceedings of the 11th International Conference on World Wide Web, ser. WWW '02. New York, NY, USA: Association for Computing Machinery, 2002, pp. 517–526. [Online]. Available: https://doi.org/10.1145/511446.511513
[37] D. Chen, A. Fisch, J. Weston, and A. Bordes, "Reading Wikipedia to answer open-domain questions," 2017.
[38] Apache Software Foundation. (2011) Apache Lucene - scoring. Accessed: 20 October 2011. [Online]. Available: http://lucene.apache.org/java/3_4_0/scoring.html
[39] Microsoft, "Bing web search API," 2023. [Online]. Available: https://www.microsoft.com/en-us/bing/apis/bing-web-search-api
[40] S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, K. Shuster, E. M. Smith, Y.-L. Boureau, and J. Weston, "Recipes for building an open-domain chatbot," 2020.
[41] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, "Training language models to follow instructions with human feedback," 2022.
[42] J. Xu, D. Ju, M. Li, Y.-L. Boureau, J. Weston, and E. Dinan, "Recipes for safety in open-domain chatbots," 2021.
[43] E. Dinan, A. Fan, A. Williams, J. Urbanek, D. Kiela, and J. Weston, "Queens are powerful too: Mitigating gender bias in dialogue generation," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 8173–8188. [Online]. Available: https://aclanthology.org/2020.emnlp-main.656