Building a Basic PDF Summarizer LLM Application with LangChain
Last Updated: 18 Jun, 2025
A PDF summarizer is a specialized tool, built with LangChain, that analyzes the content of PDF documents and provides users with concise, relevant summaries. The application integrates:
- Hugging Face models for advanced natural language understanding
- FAISS for efficient vector-based search
- Streamlit to deliver an interactive user interface.
The conversational summarizer is capable of processing one or more uploaded PDFs and answering follow-up questions about their content.
How the Application Works (Step-by-Step)
- User uploads one or more PDF files.
- Text is extracted from each PDF and split into chunks.
- Chunks are embedded using Hugging Face Sentence Transformers and stored in a FAISS vector database.
- When the user asks a question, it is embedded and compared to all chunk embeddings to find the most relevant context (a minimal sketch of this comparison follows this list).
- The relevant chunks are provided as context to the LLM which generates an answer.
- The conversation history is maintained and both user questions and AI answers are displayed in the chat interface.
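To make the retrieval step concrete, here is a minimal, self-contained sketch of the similarity comparison using made-up vectors. This is illustrative only and not part of the app; in the actual application, FAISS performs this nearest-neighbour search internally and far more efficiently.
Python
import numpy as np

# Illustrative only: 5 fake chunk embeddings and one fake question embedding
# (all-MiniLM-L6-v2 produces 384-dimensional vectors)
chunk_vectors = np.random.rand(5, 384)
question_vector = np.random.rand(384)

# Cosine similarity between the question and every chunk
sims = chunk_vectors @ question_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(question_vector)
)
best = int(np.argmax(sims))  # index of the most relevant chunk
print(best, sims[best])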
Workflow of the PDF Summarization Tool
Implementation of PDF Summarizer
1. Importing Required Libraries
We will import os, streamlit, dotenv, PyPDF2 and langchain.
Python
import os
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_community.llms import HuggingFaceHub
2. Loading Environment Variables
Now load the .env file, which contains your Hugging Face API token.
To learn how to obtain a Hugging Face API key, refer to: How to Access HuggingFace API key?
Python
load_dotenv()
HF_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
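For reference, the .env file sits next to the script and contains a single line like the following (the token value shown is a placeholder, not a real key):
HUGGINGFACEHUB_API_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx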
3. PDF Reading
We will create a Python function get_pdf_text which takes a list of PDF documents (pdf_docs) as input. It iterates through each PDF and then through each page within that PDF, extracting the text from every page and concatenating it into a single string. This consolidates the text content of multiple PDF files into one place.
Python
def get_pdf_text(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            extracted_text = page.extract_text()
            if extracted_text:  # skip pages with no extractable text
                text += extracted_text + "\n"
    return text
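As a quick standalone sanity check, the function can be exercised on a local file. The file name sample.pdf is a placeholder chosen purely for illustration:
Python
# Hypothetical standalone test; "sample.pdf" is a placeholder file name
with open("sample.pdf", "rb") as f:
    raw_text = get_pdf_text([f])
print(raw_text[:500])  # preview the first 500 extracted characters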
4. Text Chunking
- The function get_text_chunks splits a long string of text into smaller, manageable chunks.
- It uses CharacterTextSplitter to break the text, splitting primarily on newlines (\n), so that each chunk is no larger than 1000 characters. Consecutive chunks overlap by 200 characters to maintain context.
- It returns a list of these smaller text chunks which is ready for further processing.
Python
def get_text_chunks(text):
    # Split on newlines into ~1000-character chunks with 200-character overlap
    text_splitter = CharacterTextSplitter(
        separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len
    )
    return text_splitter.split_text(text)
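A small illustration of the chunking behaviour on invented text (the sample text is made up and not part of the app):
Python
# Invented sample text, purely for illustration
sample_text = "\n".join(
    f"Line {i}: some filler sentence about the document." for i in range(200)
)
chunks = get_text_chunks(sample_text)
print(f"{len(chunks)} chunks; first chunk length: {len(chunks[0])} characters")
# Consecutive chunks repeat up to ~200 characters of text to preserve context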
5. Embedding Generation and Vector Store Creation
1. def get_vectorstore(text_chunks):
- text_chunks is expected to be a list of strings.
- Each string is a chunk or piece of text that will be embedded (converted to vectors).
2. embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
- This line initializes a sentence embedding model from Hugging Face.
- We are using "sentence-transformers/all-MiniLM-L6-v2" model which is a lightweight and fast transformer that converts sentences into dense vector representations (embeddings).
- These embeddings are useful for comparing text for similarity or clustering, searching, etc.
3. return FAISS.from_texts(texts=text_chunks, embedding=embeddings)
- FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors.
- FAISS.from_texts() embeds each chunk of text using the embeddings model and stores the resulting vectors in an index that supports fast similarity search.
Python
def get_vectorstore(text_chunks):
    # Embed each chunk with a lightweight sentence-transformer model
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    # Build an in-memory FAISS index over the embedded chunks
    return FAISS.from_texts(texts=text_chunks, embedding=embeddings)
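Once built, the vector store can also be queried directly, which is essentially what the retriever does under the hood. A minimal sketch, assuming vectorstore was created by get_vectorstore above (the query string and k value are arbitrary):
Python
# Hypothetical direct query against the FAISS index
docs = vectorstore.similarity_search("What is the main topic of the document?", k=3)
for doc in docs:
    print(doc.page_content[:100])  # preview each matching chunk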
6. Conversation Chain Setup (LLM, Retriever, Memory)
- Loads a Language Model: It initializes the Falcon 7B Instruct model from Hugging Face to generate conversational responses.
- Keeps Track of Chat History: It sets up a memory buffer (ConversationBufferMemory) to remember past messages and maintain coherent conversations.
- Creates a Conversational Retrieval System: It combines the language model, memory and a vector store retriever to answer questions based on stored documents and past interactions.
Python
def get_conversation_chain(vectorstore):
    # Falcon 7B Instruct hosted on the Hugging Face Hub
    llm = HuggingFaceHub(
        repo_id="tiiuae/falcon-7b-instruct",
        model_kwargs={"temperature": 0.7, "max_new_tokens": 512},
        huggingfacehub_api_token=HF_TOKEN
    )
    # Buffer memory keeps the full chat history between turns
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    return ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
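Outside of Streamlit, the chain can be exercised directly. A minimal sketch, assuming vectorstore already exists; the question is illustrative:
Python
chain = get_conversation_chain(vectorstore)
response = chain({"question": "Summarize the document in two sentences."})
print(response["answer"])             # the LLM's reply
print(len(response["chat_history"]))  # history grows with each turn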
7. Handling User Input
The function handle_userinput drives the chat experience in the Streamlit app:
- Readiness Check: It first confirms that the AI conversation model is loaded. If not, it prompts the user to upload a PDF.
- Get AI Response: It sends the user's question to the AI model and receives its response including the updated chat history.
- Display Conversation: Finally it updates and presents the entire conversation history in the app, clearly labeling who said what.
If no PDFs have been processed yet, the function shows a Streamlit error message. Otherwise, the LLM processes the question, retrieves the relevant context, generates an answer and displays it in the Streamlit interface.
Python
def handle_userinput(user_question):
    # Readiness check: the chain only exists after PDFs have been processed
    if "conversation" not in st.session_state or not st.session_state.conversation:
        st.error("Please upload and process a PDF first.")
        return
    response = st.session_state.conversation({'question': user_question})
    st.session_state.chat_history = response['chat_history']
    st.subheader("Chat History")
    for i, message in enumerate(st.session_state.chat_history):
        # Messages alternate: even indices are the user, odd are the AI
        speaker = "You:" if i % 2 == 0 else "AI:"
        st.write(f"**{speaker}** {message.content}")
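The even/odd speaker labelling works because ConversationBufferMemory with return_messages=True stores alternating human and AI messages. A tiny standalone check of that behaviour:
Python
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
memory.save_context({"input": "Hi"}, {"output": "Hello! How can I help?"})
history = memory.load_memory_variables({})["chat_history"]
print(history)  # [HumanMessage(content='Hi'), AIMessage(content='Hello! How can I help?')]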
8. User Interface (Streamlit App Logic)
This main function launches a Streamlit app for chatting with PDFs:
- Sets up the app: It configures the page and initializes session variables for the AI conversation and chat history.
- Handles PDF processing: A sidebar lets users upload PDFs. Clicking "Process" extracts text, chunks it, creates searchable embeddings, and sets up the AI conversation.
- Manages user interaction: An input box allows users to ask questions, which are then processed by the AI, and the conversation is displayed.
Python
def main():
    st.set_page_config(page_title="Brainstorm with PDFs", page_icon=None)
    st.title("Brainstorm with Your PDFs")
    # Initialize session state on first run
    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = []
    with st.sidebar:
        st.subheader("Upload Your Documents")
        pdf_docs = st.file_uploader("Upload PDFs", accept_multiple_files=True, type=["pdf"])
        if st.button("Process"):
            # Extract -> chunk -> embed -> build the conversational chain
            raw_text = get_pdf_text(pdf_docs)
            text_chunks = get_text_chunks(raw_text)
            vectorstore = get_vectorstore(text_chunks)
            if vectorstore:
                st.session_state.conversation = get_conversation_chain(vectorstore)
                st.success("Processing complete! You can now chat with your PDFs.")
    user_question = st.chat_input("Ask a question about your PDFs:")
    if user_question:
        handle_userinput(user_question)

if __name__ == '__main__':
    main()
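Assuming the script is saved as app.py (the file name is arbitrary), the app can be launched from the terminal with:
streamlit run app.py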
Output:
We can see that our PDF Summarizer is working and can be easily hosted on Streamlit for better interaction.