
2024 International Conference on Future Technologies for Smart Society (ICFTSS)

Comparative Analysis of RAG, Fine-Tuning, and Prompt Engineering in Chatbot Development

Harshit Kumar Chaubey, Computer Science and Engineering, IIIT Naya Raipur, Raipur, India ([email protected])
Gaurav Tripathi, Electronics and Communication Engineering, IIIT Naya Raipur, Raipur, India ([email protected])
Rajnish Ranjan, Data Science and Applications, IIT Madras, Chennai, India ([email protected])
Srinivasa K. Gopalaiyengar, Data Science and Artificial Intelligence, IIIT Naya Raipur, Raipur, India ([email protected])

979-8-3503-7384-4/24/$31.00 ©2024 IEEE | DOI: 10.1109/ICFTSS61109.2024.10691338

Abstract—This paper examines the integration and comparative effectiveness of Retriever-Augmented Generation (RAG), fine-tuning, and prompt engineering in the development of advanced chatbots. By employing domain-specific fine-tuning, the study addresses contextual misunderstandings and inaccuracies prevalent in base Large Language Models (LLMs). RAG enhances chatbot functionality by incorporating real-time data retrieval, ensuring relevance in dynamically changing environments. Prompt engineering is utilized to refine input prompts, thereby optimizing the accuracy of responses. Employing the "openassistant-guanaco" dataset from Hugging Face, this research assesses the performance improvements offered by these methodologies, both quantitatively and qualitatively. The fine-tuned model outperforms the other methods with an accuracy of 87.8% and a BLEU score of 0.81, proving its effectiveness in generating the most relevant responses. In contrast, while the RAG with LLM approach shows promising results with a reasonable accuracy of 84.5%, the prompt engineering method, though slightly less effective with an accuracy of 83.2%, still maintains competitive performance. This study highlights the unique and combined strengths of these technologies, contributing valuable insights into their synergistic potential for enhancing chatbot interactions.

Keywords—Chatbots, Large Language Models (LLM), Retriever-Augmented Generation (RAG), Fine-tuning, Prompt Engineering.

I. INTRODUCTION

This paper investigates and contrasts the effects that prompt engineering, fine-tuning, and Retriever-Augmented Generation (RAG) have on chatbot performance. Each method is applied separately, and the outcomes are compared. These artificial intelligence methods and strategies are implemented to enhance chatbot performance and user engagement. The aim was to find the most effective strategy to use with an LLM-based chatbot. Through thorough implementation and testing on a specific set of LLM chatbots, we identify the unique strengths and shortcomings of each strategy. We assess the chatbots' performance using multiple measures in our experimental setup to decide which strategy is most suitable. This comparative research will create an accurate representation of the capabilities of each technology, and will not only deepen our understanding of the individual AI methodologies but also guide future implementations in the field of LLM-based chatbots.

Recent research has shown that LLMs are well suited to building chatbots and conversational AI using methods such as prompt engineering, fine-tuning, and RAG. One example is a death-doula chatbot: fine-tuning it and combining it with RAG gave the best outcomes for security, flexibility, and control, according to a study that analysed the different methods used to create LLM-based chatbot assistants [1]. Using a dataset gathered from hospital brochures, another research project presented an approach for developing and evaluating question-answer pairs to gauge the quality of RAG systems and demonstrated how effective they are in medical settings [2].

Additionally, studies have looked into how businesses may use the LangChain architecture to integrate generative AI services, with an emphasis on RAG and fine-tuning for effective information management and retrieval [3]. Compared with traditional intent-based systems, cognitive assistants that use LLMs were found to provide better user experiences and task completion rates [4], indicating that LLMs have a great deal of potential for knowledge-intensive activities.

Prompt engineering has been used in the aviation industry to automate the classification of Standard Operating Procedure (SOP) steps, resulting in improved accuracy and time savings [5]. Prompt-RAG, an innovative method that enhances LLM performance in particular domains without using vector embeddings, has been introduced and has demonstrated superior results in terms of response relevance and informativeness [6]. Finally, despite initial results not demonstrating appreciable gains, reinforcement learning from human feedback (RLHF) was investigated for improving LLMs in psychology, underscoring the necessity for additional study [7]. Together, these studies highlight the revolutionary potential of combining prompt engineering, fine-tuning, and RAG to create sophisticated, context-aware chatbots for a range of applications.


II. PROPOSED WORK

A. Dataset

Fig. 2 offers a preview of the dataset, which consists of conversations between humans and a chatbot. The conversations in this dataset show a variety of interaction patterns, which are crucial for training and assessing chatbot effectiveness. The sample demonstrates typical user interactions with the chatbot as well as the kinds of responses that the system generates.

We made use of Hugging Face's "openassistant-guanaco" dataset, a subset of the Open Assistant dataset. With 9,846 samples in total, this dataset includes the conversation pathways with the highest ratings. The data is a good resource for training and assessing conversational AI models, since it contains a variety of conversational exchanges covering a wide range of themes and circumstances. By choosing only the top-rated answers, the dataset guarantees high-quality interactions and offers a solid base for enhancing chatbot performance.
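For concreteness, the following minimal sketch loads this dataset; it assumes the Hugging Face datasets library, with the repository name taken from reference [8].

# Sketch: load the "openassistant-guanaco" subset from the Hugging Face Hub.
# Assumes the `datasets` library; repository name from reference [8].
from datasets import load_dataset

dataset = load_dataset("timdettmers/openassistant-guanaco")

print(dataset)                       # available splits and sample counts
print(dataset["train"][0]["text"])   # one human/assistant conversation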
B. RAG

We applied the LangChain framework to build a RAG system that combines a retrieval mechanism with a pre-trained LLM to enhance the model's access to current data. To load and preprocess the dataset, LangChain's RecursiveCharacterTextSplitter was used to separate the pages into manageable chunks. Using the fine-tuned SFR-Embedding-Mistral model, we transformed the text into numerical embeddings, which we then saved in a vector store (Milvus).

To ensure that the LLM could access real-time data from the vector store and deliver accurate results, a retriever interface was developed to obtain relevant documents depending on user queries. The retriever was integrated with the LLM using the RetrievalQA class from LangChain.
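A minimal sketch of this pipeline follows, assuming LangChain's classic (pre-0.2) module layout, a Milvus server running on localhost, and source documents already loaded into docs; the embedding-model identifier and chunking parameters are illustrative assumptions rather than the paper's exact configuration.

# Sketch of the RAG pipeline described above (LangChain classic API).
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Milvus
from langchain.chains import RetrievalQA

# 1) Split the source documents into manageable chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)  # `docs` prepared beforehand

# 2) Embed the chunks and store them in Milvus. The model name is an
#    assumption based on the SFR-Embedding-Mistral model named above.
embeddings = HuggingFaceEmbeddings(model_name="Salesforce/SFR-Embedding-Mistral")
store = Milvus.from_documents(
    chunks, embeddings,
    connection_args={"host": "localhost", "port": "19530"},
)

# 3) Expose the vector store as a retriever and wire it to the LLM.
qa = RetrievalQA.from_chain_type(
    llm=llm,  # any LangChain-compatible LLM prepared beforehand
    retriever=store.as_retriever(),
)
print(qa.run("What topics does the dataset cover?"))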
C. Fine-Tuning

To enhance domain-specific performance, we chose pre-trained LLMs, LLaMA2 and Falcon. The fine-tuning process involved data preparation, where the dataset was formatted into input-output pairs for supervised learning. We utilized the Hugging Face transformers library to fine-tune the model on the conversational dataset, adjusting the model's weights to improve performance on domain-specific tasks. The fine-tuned model was then evaluated using metrics like accuracy, relevance, and coherence of the generated responses.
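A rough sketch of this procedure with the transformers Trainer is shown below; the model identifier, sequence length, and hyperparameters are illustrative assumptions rather than the paper's reported settings (the LLaMA2 checkpoint is also gated and requires access approval).

# Sketch: supervised fine-tuning of a causal LLM on the conversational
# dataset with Hugging Face transformers. Hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; gated access
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Format the conversations into token sequences for supervised learning.
data = load_dataset("timdettmers/openassistant-guanaco")
tokenized = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="guanaco-ft", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
    # Causal-LM collator: labels are the input tokens themselves (mlm=False).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # adjusts the model's weights on the domain data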
D. Prompt Engineering

We developed specific prompts designed to elicit high-quality responses from the LLM. The implementation included creating templates using LangChain's PromptTemplate class to manage different prompt templates for various queries. Conversation history was integrated using ConversationBufferWindowMemory to maintain context and improve response relevance. Prompts were iteratively tested and refined to optimize the chatbot's performance.
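The sketch below illustrates this setup, again assuming LangChain's classic API; the template wording and the window size k are illustrative assumptions.

# Sketch: prompt template plus windowed conversation memory (LangChain).
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate

template = """You are a helpful, precise assistant. Use the conversation
history to stay consistent, and answer the question accurately.

History:
{history}
User: {input}
Assistant:"""

chain = ConversationChain(
    llm=llm,  # any LangChain-compatible LLM prepared beforehand
    prompt=PromptTemplate(input_variables=["history", "input"],
                          template=template),
    # Keep only the last k exchanges in context to bound prompt length.
    memory=ConversationBufferWindowMemory(k=4),
)
print(chain.predict(input="What is retrieval-augmented generation?"))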
E. Experimental Setup

The performance of RAG, fine-tuning, and prompt engineering was assessed separately using standardized metrics such as BLEU score, F1 score, response time, and user satisfaction surveys. Additionally, the combined use of fine-tuning with RAG and prompt engineering was evaluated to determine any synergistic effects on chatbot performance.

1) Evaluation Metrics: For comparison, we used the following metrics (a short computational sketch follows the list):

• Accuracy: Measures the percentage of correct responses from the LLM out of all responses provided. It directly reflects how often the model provides the right answer or solution.

• BLEU Score: Evaluates how closely the LLM's output matches human-written reference texts by comparing overlapping n-grams (word sequences). Higher scores indicate better alignment with human expectations; the metric is commonly used in translation tasks.

• Perplexity: Indicates how well the LLM predicts a sequence of words. Lower perplexity means the model is better at understanding and generating coherent text.

• Human Evaluation: Involves expert assessment of the LLM's output for quality, relevance, and coherence. Experts provide subjective judgments to gauge how well the model performs in practical scenarios.

This structured experimental setup allows us to comprehensively assess and compare the performance of RAG, fine-tuning, and prompt engineering in enhancing chatbot capabilities.
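As noted above, here is a short sketch of how two of the automatic metrics can be computed; it uses NLTK for BLEU and GPT-2 as a small stand-in model for perplexity, and is illustrative rather than the paper's exact evaluation code.

# Sketch: computing BLEU and perplexity for a model response.
import math

import torch
from nltk.translate.bleu_score import sentence_bleu
from transformers import AutoModelForCausalLM, AutoTokenizer

# BLEU: n-gram overlap between a candidate response and a human reference.
reference = "the fine-tuned model answers the question accurately".split()
candidate = "the fine-tuned model responds to the question accurately".split()
# Bigram BLEU keeps this short example from degenerating to zero.
print("BLEU:", sentence_bleu([reference], candidate, weights=(0.5, 0.5)))

# Perplexity: exponential of the average cross-entropy the LM assigns to
# a text; lower values mean the text is more predictable to the model.
tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
lm = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tok("The chatbot retrieved the latest data.", return_tensors="pt").input_ids
with torch.no_grad():
    loss = lm(ids, labels=ids).loss  # mean token-level cross-entropy
print("Perplexity:", math.exp(loss.item()))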

III. EXPERIMENT ANALYSIS AND RESULTS

A. Experiment

Three methods have been used in this paper to evaluate chatbot performance. First, we set up a RAG system from scratch (RAG with LLM) with the goal of assessing its effect on the LLM's performance. By including real-time data retrieval, this method enhances the LLM by allowing the model to retrieve and make use of the most recent information available. During interactions, this method attempts to enhance the relevance and accuracy of the chatbot's responses by offering the most recent context.

Second, we used the "openassistant-guanaco" dataset to fine-tune a pre-trained LLM (the Fine-Tuned Model) in order to customize the model for particular conversational scenarios. By adjusting the model's weights in light of the new training data, the LLM is able to produce replies that are more pertinent and accurate while remaining specific to the dataset. Because of this process, the chatbot performs better overall and becomes more adept at handling domain-specific queries.

Third, we created tailored prompts with the goal of getting the LLM to provide good and accurate answers (Prompt Engineering). We can influence the pre-trained model to generate more accurate and contextually relevant responses by designing precise and powerful prompts. By utilizing the advantages of the current architecture, prompt engineering enhances the user experience and optimizes interaction quality without requiring a significant amount of retraining. For training and assessment, we made use of the Hugging Face "openassistant-guanaco" dataset.

Fig. 1. Proposed Framework for Chatbot Construction. This framework outlines three distinct approaches: (A) Combining RAG with LLM, (B) Fine-tuning
a LLM, and (C) Enhancing a LLM with Prompt Engineering.

TABLE I. PERFORMANCE COMPARISON OF RAG, FINE-TUNING, AND PROMPT ENGINEERING APPROACHES.

Approach             Accuracy (%)   BLEU score   Perplexity   HES
RAG with LLM         84.5           0.76         11.50        8.5
Fine-Tuned Model     87.8           0.81         10.3         8.9
Prompt Engineering   83.2           0.74         12.0         8.1
Fig. 2. Training and Evaluation Data for Chatbot Development [8]

The following metrics are used for the comparison: Accuracy, which measures the percentage of the LLM's responses that are entirely accurate; BLEU Score, which measures the extent to which the text produced by the LLM matches human-written references; Perplexity, an indicator of the LLM's understanding of the task; and Human Evaluation, an expert assessment by our professor of the calibre, significance, and coherence of the LLM's output. This well-organized experimental setting allows us to thoroughly evaluate and contrast the effectiveness of RAG, fine-tuning, and prompt engineering in augmenting chatbot skills.

B. Results

In this study, we made use of the "openassistant-guanaco" dataset from Hugging Face to evaluate the performance of three approaches on an LLM: RAG, fine-tuning, and prompt engineering. We assessed every method based on human evaluation, BLEU score, accuracy, and perplexity. An overview of each approach's performance can be found in Table I.

Based on the results in Table I, the fine-tuned model demonstrated the highest performance across all metrics, with an accuracy of 87.8%, a BLEU score of 0.81, the lowest perplexity at 10.3, and a human evaluation score (HES) of 8.9. These results indicate that it is the most effective approach for generating relevant and coherent responses tailored to the dataset, with a strong understanding of tasks and minimal uncertainty in its predictions.

The RAG approach also performed well, with an accuracy of 84.5%, a BLEU score of 0.76, and a HES of 8.5, though its perplexity of 11.50 suggests some room for improvement in handling unexpected queries. Prompt engineering, while offering a more cost-effective and flexible alternative, had the lowest accuracy at 83.2%, a BLEU score of 0.74, the highest perplexity at 12.0, and a HES of 8.1, indicating it is less effective but still delivers a sufficient user experience.

For different organizations with varying resources and datasets related to their product or service, the choice of approach to build a chatbot should consider these performance metrics. Organizations with extensive resources and specialized datasets may benefit more from fine-tuning, achieving the highest accuracy and relevance. In contrast, those with limited resources may find prompt engineering a viable, cost-effective option, while RAG offers a middle ground, especially for real-time data retrieval needs.

IV. CONCLUSION AND FUTURE WORK

In this study, the integration of Retriever-Augmented Generation (RAG), fine-tuning, and prompt engineering significantly enhanced an LLM chatbot's performance, showcasing tailored and contextually accurate interactions. The fine-tuned model excelled in delivering coherent responses, as evidenced by high accuracy and BLEU scores, whereas the RAG approach effectively augmented real-time data retrieval, albeit with slightly higher perplexity. Prompt engineering, though scoring lower in some metrics, offered quick adaptability and cost-efficiency, underlining its value in rapid deployment settings.


Future research could explore merging fine-tuning and RAG to harness both detailed language comprehension and extensive information access. Advancements in prompt engineering could also enhance response nuance by incorporating adaptive prompts that react to conversational contexts. Expanding these methodologies to diverse datasets and languages would test their scalability and adaptability. Additionally, addressing ethical considerations and bias in AI systems is crucial for ensuring fairness. Optimizing real-time performance without sacrificing accuracy could further broaden the practical applications of sophisticated chatbot technologies. These directions could significantly propel forward the capabilities and implementation of conversational AI in varied real-world scenarios.

REFERENCES

[1] Borek, Cecylia. "Comparative evaluation of LLM-based approaches to chatbot creation." (2024).
[2] Torres, Juan José González, et al. "Automated Question-Answer Generation for Evaluating RAG-based Chatbots." Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024, 2024.
[3] Jeong, Cheonsu. "Generative AI service implementation using LLM application architecture: based on RAG model and LangChain framework." Journal of Intelligence and Information Systems 29.4 (2023): 129-164.
[4] Freire, Samuel Kernan, Chaofan Wang, and Evangelos Niforatos. "Chatbots in Knowledge-Intensive Contexts: Comparing Intent and LLM-Based Systems." arXiv preprint arXiv:2402.04955 (2024).
[5] Bashatah, Jomana, and Lance Sherry. "Prompt Engineering to Classify Components of Standard Operating Procedure Steps Using Large Language Model (LLM)-Based Chatbots."
[6] Kang, Bongsu, et al. "Prompt-RAG: Pioneering Vector Embedding-Free Retrieval-Augmented Generation in Niche Domains, Exemplified by Korean Medicine." arXiv preprint arXiv:2401.11246 (2024).
[7] Bill, Desirée, and Theodor Eriksson. "Fine-tuning an LLM using reinforcement learning from human feedback for a therapy chatbot application." (2023).
[8] Datasets, Hugging Face (2023, September 26). https://huggingface.co/datasets/timdettmers/openassistant-guanaco
