Building a Basic PDF Summarizer LLM Application with LangChain
Last Updated: 18 Jun, 2025
A PDF summarizer is a specialized tool, built with LangChain, that analyzes the content of PDF documents and provides users with concise, relevant summaries. The application integrates:
- Hugging Face models for advanced natural language understanding
- FAISS for efficient vector-based search
- Streamlit to deliver an interactive user interface.
The conversational summarizer is capable of processing one or more uploaded PDFs and answering follow-up questions about their content.
How the Application Works (Step-by-Step)
- User uploads one or more PDF files.
- Text is extracted from each PDF and split into chunks.
- Chunks are embedded using Hugging Face Sentence Transformers and stored in a FAISS vector database.
- When the user asks a question, it is embedded and compared to all chunk embeddings to find the most relevant context (a minimal sketch of this comparison follows this list).
- The relevant chunks are provided as context to the LLM which generates an answer.
- The conversation history is maintained and both user questions and AI answers are displayed in the chat interface.
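To make the retrieval step concrete, here is a minimal, self-contained sketch of the similarity comparison using made-up vectors. This is illustrative only and not part of the app; in the actual application, FAISS performs this nearest-neighbour search internally and far more efficiently.
Python
import numpy as np

# Illustrative only: 5 fake chunk embeddings and one fake question embedding
# (all-MiniLM-L6-v2 produces 384-dimensional vectors)
chunk_vectors = np.random.rand(5, 384)
question_vector = np.random.rand(384)

# Cosine similarity between the question and every chunk
sims = chunk_vectors @ question_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(question_vector)
)
best = int(np.argmax(sims))  # index of the most relevant chunk
print(best, sims[best])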
Workflow of the PDF Summarization Tool
Implementation of PDF Summarizer
1. Importing Required Libraries
We will import os, streamlit, dotenv, PyPDF2 and langchain.
Python
import os
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_community.llms import HuggingFaceHub
2. Loading Environment Variables
Now load the .env file, which contains your Hugging Face API token.
To learn how to obtain a Hugging Face API key, refer to: How to Access HuggingFace API key?
Python
load_dotenv()
HF_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
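For reference, the .env file sits next to the script and contains a single line like the following (the token value shown is a placeholder, not a real key):
HUGGINGFACEHUB_API_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx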
3. PDF Reading
We will create a Python function get_pdf_text which takes a list of PDF documents (pdf_docs) as input. It iterates through each PDF and then through each page within that PDF, extracting the text from every page and concatenating it into a single string. This consolidates the text content of multiple PDF files into one place.
Python
def get_pdf_text(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            extracted_text = page.extract_text()
            if extracted_text:  # skip pages with no extractable text
                text += extracted_text + "\n"
    return text
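As a quick standalone sanity check, the function can be exercised on a local file. The file name sample.pdf is a placeholder chosen purely for illustration:
Python
# Hypothetical standalone test; "sample.pdf" is a placeholder file name
with open("sample.pdf", "rb") as f:
    raw_text = get_pdf_text([f])
print(raw_text[:500])  # preview the first 500 extracted characters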
4. Text Chunking
- The function get_text_chunks splits a long string of text into smaller, manageable chunks.
- It uses CharacterTextSplitter to break the text, splitting primarily on newlines (\n), so that each chunk is no larger than 1000 characters. Consecutive chunks overlap by 200 characters to maintain context.
- It returns a list of these smaller text chunks which is ready for further processing.
Python
def get_text_chunks(text):
    # Split on newlines into ~1000-character chunks with 200-character overlap
    text_splitter = CharacterTextSplitter(
        separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len
    )
    return text_splitter.split_text(text)
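A small illustration of the chunking behaviour on invented text (the sample text is made up and not part of the app):
Python
# Invented sample text, purely for illustration
sample_text = "\n".join(
    f"Line {i}: some filler sentence about the document." for i in range(200)
)
chunks = get_text_chunks(sample_text)
print(f"{len(chunks)} chunks; first chunk length: {len(chunks[0])} characters")
# Consecutive chunks repeat up to ~200 characters of text to preserve context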
5. Embedding Generation and Vector Store Creation
1. def get_vectorstore(text_chunks):
- text_chunks is expected to be a list of strings.
- Each string is a chunk or piece of text that will be embedded (converted to vectors).
2. embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
- This line initializes a sentence embedding model from Hugging Face.
- We are using "sentence-transformers/all-MiniLM-L6-v2" model which is a lightweight and fast transformer that converts sentences into dense vector representations (embeddings).
- These embeddings are useful for comparing text for similarity or clustering, searching, etc.
3. return FAISS.from_texts(texts=text_chunks, embedding=embeddings)
- FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors.
- FAISS.from_texts() embeds each chunk of text using the embeddings model and stores the resulting vectors in an index that supports fast similarity search.
Python
def get_vectorstore(text_chunks):
    # Embed each chunk with a lightweight sentence-transformer model
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    # Build an in-memory FAISS index over the embedded chunks
    return FAISS.from_texts(texts=text_chunks, embedding=embeddings)
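Once built, the vector store can also be queried directly, which is essentially what the retriever does under the hood. A minimal sketch, assuming vectorstore was created by get_vectorstore above (the query string and k value are arbitrary):
Python
# Hypothetical direct query against the FAISS index
docs = vectorstore.similarity_search("What is the main topic of the document?", k=3)
for doc in docs:
    print(doc.page_content[:100])  # preview each matching chunk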
6. Conversation Chain Setup (LLM, Retriever, Memory)
- Loads a Language Model: It initializes the Falcon 7B Instruct model from Hugging Face to generate conversational responses.
- Keeps Track of Chat History: It sets up a memory buffer (ConversationBufferMemory) to remember past messages and maintain coherent conversations.
- Creates a Conversational Retrieval System: It combines the language model, memory and a vector store retriever to answer questions based on stored documents and past interactions.
Python
def get_conversation_chain(vectorstore):
    # Falcon 7B Instruct hosted on the Hugging Face Hub
    llm = HuggingFaceHub(
        repo_id="tiiuae/falcon-7b-instruct",
        model_kwargs={"temperature": 0.7, "max_new_tokens": 512},
        huggingfacehub_api_token=HF_TOKEN
    )
    # Buffer memory keeps the full chat history between turns
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    return ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
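Outside of Streamlit, the chain can be exercised directly. A minimal sketch, assuming vectorstore already exists; the question is illustrative:
Python
chain = get_conversation_chain(vectorstore)
response = chain({"question": "Summarize the document in two sentences."})
print(response["answer"])             # the LLM's reply
print(len(response["chat_history"]))  # history grows with each turn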
7. Handling User Input
The function handle_userinput drives the chat experience in the Streamlit app:
- Readiness Check: It first confirms that the AI conversation model is loaded. If not, it prompts the user to upload a PDF.
- Get AI Response: It sends the user's question to the AI model and receives its response including the updated chat history.
- Display Conversation: Finally it updates and presents the entire conversation history in the app, clearly labeling who said what.
If no PDFs have been processed yet, the function shows a Streamlit error message. Otherwise, the LLM processes the question, retrieves the relevant context, generates an answer and displays it in the Streamlit interface.
Python
def handle_userinput(user_question):
    # Readiness check: the chain only exists after PDFs have been processed
    if "conversation" not in st.session_state or not st.session_state.conversation:
        st.error("Please upload and process a PDF first.")
        return
    response = st.session_state.conversation({'question': user_question})
    st.session_state.chat_history = response['chat_history']
    st.subheader("Chat History")
    for i, message in enumerate(st.session_state.chat_history):
        # Messages alternate: even indices are the user, odd are the AI
        speaker = "You:" if i % 2 == 0 else "AI:"
        st.write(f"**{speaker}** {message.content}")
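The even/odd speaker labelling works because ConversationBufferMemory with return_messages=True stores alternating human and AI messages. A tiny standalone check of that behaviour:
Python
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
memory.save_context({"input": "Hi"}, {"output": "Hello! How can I help?"})
history = memory.load_memory_variables({})["chat_history"]
print(history)  # [HumanMessage(content='Hi'), AIMessage(content='Hello! How can I help?')]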
8. User Interface (Streamlit App Logic)
This main function launches a Streamlit app for chatting with PDFs:
- Sets up the app: It configures the page and initializes session variables for the AI conversation and chat history.
- Handles PDF processing: A sidebar lets users upload PDFs. Clicking "Process" extracts text, chunks it, creates searchable embeddings, and sets up the AI conversation.
- Manages user interaction: An input box allows users to ask questions, which are then processed by the AI, and the conversation is displayed.
Python
def main():
    st.set_page_config(page_title="Brainstorm with PDFs", page_icon=None)
    st.title("Brainstorm with Your PDFs")
    # Initialize session state on first run
    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = []
    with st.sidebar:
        st.subheader("Upload Your Documents")
        pdf_docs = st.file_uploader("Upload PDFs", accept_multiple_files=True, type=["pdf"])
        if st.button("Process"):
            # Extract -> chunk -> embed -> build the conversational chain
            raw_text = get_pdf_text(pdf_docs)
            text_chunks = get_text_chunks(raw_text)
            vectorstore = get_vectorstore(text_chunks)
            if vectorstore:
                st.session_state.conversation = get_conversation_chain(vectorstore)
                st.success("Processing complete! You can now chat with your PDFs.")
    user_question = st.chat_input("Ask a question about your PDFs:")
    if user_question:
        handle_userinput(user_question)

if __name__ == '__main__':
    main()
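Assuming the script is saved as app.py (the file name is arbitrary), the app can be launched from the terminal with:
streamlit run app.py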
Output:
We can see that our PDF Summarizer is working and can be easily hosted on Streamlit for better interaction.