Build A Chatbot On Your CSV Data With LangChain and OpenAI
The code
Now let's get practical! We'll build our chatbot on CSV data with very little Python code.
Disclaimer: This code is a simplified version of the chatbot I created and is not optimized to reduce OpenAI API costs. For a more performant and optimized chatbot, feel free to check out my GitHub project yvann-hub/ChatBot-CSV or just test the app at chatbot-csv.com.
• First, we’ll install the necessary libraries:
pip install streamlit streamlit_chat langchain openai faiss-cpu tiktoken
• We ask the user to enter their OpenAI API key and upload the CSV file on which the chatbot will be based.
• To test the chatbot at a lower cost, you can use this lightweight CSV file: fishfry-locations.csv
import streamlit as st

user_api_key = st.sidebar.text_input(
    label="#### Your OpenAI API key",
    placeholder="Paste your OpenAI API key, sk-",
    type="password")

uploaded_file = st.sidebar.file_uploader("upload", type="csv")
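• Note that the extract never wires user_api_key into the OpenAI calls that follow. A minimal way to do this (an assumption on my part, not shown in the original) is to export it as the OPENAI_API_KEY environment variable, which LangChain's OpenAI wrappers read by default:

import os

# Assumption: let LangChain's OpenAI classes pick up the key from the environment
if user_api_key:
    os.environ["OPENAI_API_KEY"] = user_api_key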
• If a CSV file is uploaded by the user, we load it using the CSVLoader class from LangChain.
import tempfile
from langchain.document_loaders.csv_loader import CSVLoader

if uploaded_file:
    # use tempfile because CSVLoader only accepts a file_path
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        tmp_file.write(uploaded_file.getvalue())
        tmp_file_path = tmp_file.name

    loader = CSVLoader(file_path=tmp_file_path, encoding="utf-8")
    data = loader.load()
• The LangChain CSVLoader class splits the CSV file into one document per row. This can be seen by displaying the content of data:
st.write(data)
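• For reference, each element of data is a LangChain Document whose page_content lists the row's columns as key: value pairs, with the source file and row number in the metadata. Roughly (the column names here are only illustrative, not taken from the sample file):

Document(page_content="venue_name: ...\nvenue_address: ...",
         metadata={"source": "/tmp/tmp...", "row": 0})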
• Splitting the CSV file into rows lets us feed it to our vectorstore (FAISS) using OpenAI embeddings.
• Embeddings transform the chunks produced by CSVLoader into vectors; together these vectors form an index based on the content of each row of the given file.
• In practice, when the user makes a query, a similarity search is performed in the vectorstore, and the best-matching row(s) are returned to the LLM, which rephrases the content it found into a formatted response for the user.
• I recommend deepening your understanding of vectorstores and embeddings for better comprehension.
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(data, embeddings)
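• The conversational_chat function below relies on a chain object that the extract does not define. Here is a minimal sketch of how it can be built with ConversationalRetrievalChain (the model name and temperature are my assumptions, not taken from the original):

from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

# Assumption: gpt-3.5-turbo at temperature 0 for deterministic answers;
# the FAISS vectorstore from above is exposed as a retriever
chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(temperature=0.0, model_name="gpt-3.5-turbo"),
    retriever=vectorstore.as_retriever())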
• This function passes the user's question and the conversation history to ConversationalRetrievalChain to generate the chatbot's response.
• st.session_state['history'] stores the user's conversation history while they are on the Streamlit site.
If you want to add improvements to this chatbot, you can check my GitHub.
def conversational_chat(query):
    result = chain({"question": query,
                    "chat_history": st.session_state['history']})
    st.session_state['history'].append((query, result["answer"]))
    return result["answer"]
• We initialize the chatbot session by creating st.session_state['history'] and the first messages displayed in the chat.
• ['generated'] holds the chatbot's responses.
• ['past'] holds the messages provided by the user.
• Containers are not essential, but they improve the UI by placing the user's question area below the chat messages; a sketch of the container and form setup follows the code below.
if 'history' not in st.session_state:
    st.session_state['history'] = []

if 'generated' not in st.session_state:
    st.session_state['generated'] = ["Hello! Ask me anything about " +
                                     uploaded_file.name]

if 'past' not in st.session_state:
    st.session_state['past'] = ["Hey!"]

# user_input and output come from the input form and submit handler
# sketched below; each exchange is appended to the session state there
st.session_state['past'].append(user_input)
st.session_state['generated'].append(output)
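• The extract does not show where user_input and output come from. Here is a minimal sketch of the containers and input form, assuming a standard Streamlit form (the widget labels and keys are illustrative):

# container for the chat history
response_container = st.container()
# container for the user's text input
container = st.container()

with container:
    with st.form(key='my_form', clear_on_submit=True):
        user_input = st.text_input("Query:",
                                   placeholder="Talk about your CSV data here",
                                   key='input')
        submit_button = st.form_submit_button(label='Send')

    if submit_button and user_input:
        output = conversational_chat(user_input)
        # the two append calls shown above run here, once per exchange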
• This last part displays the user's and the chatbot's messages on the Streamlit site using the streamlit_chat module.
from streamlit_chat import message

if st.session_state['generated']:
    with response_container:
        for i in range(len(st.session_state['generated'])):
            message(st.session_state["past"][i], is_user=True,
                    key=str(i) + '_user', avatar_style="big-smile")
            message(st.session_state["generated"][i], key=str(i),
                    avatar_style="thumbs")
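• To try everything out, save the full script (for example as app.py, the file name is just a suggestion) and launch it with Streamlit:

streamlit run app.py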