
Introduction
With the rise of Large Language Models (LLMs), interacting with documents has become more intuitive than ever. Imagine having a chatbot that can read and summarize PDFs for you! In this tutorial, we’ll build a Chat with PDF application using Python and LangChain.
This guide will walk you through the steps of extracting text from PDFs, leveraging LLMs for natural language processing, and setting up an interactive chatbot.
Prerequisites
Before getting started, ensure you have the following:
- Python 3.8+
- OpenAI API Key
- Required dependencies installed (`pip install -r requirements.txt`)
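LangChain's OpenAI integrations read your key from the `OPENAI_API_KEY` environment variable, so export it before running anything (the `sk-...` value is a placeholder for your own key):

```bash
export OPENAI_API_KEY="sk-..."
```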
Step 1: Install Dependencies
First, install the necessary libraries:
```bash
pip install langchain openai pypdf streamlit faiss-cpu
```
Step 2: Extract Text from PDFs
We'll use pypdf (the maintained successor to PyPDF2, and the package we installed above) to extract text from the uploaded PDF files.

```python
from pypdf import PdfReader

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # extract_text() can return None for pages without a text layer,
        # so fall back to an empty string to keep join() happy
        text = "".join(page.extract_text() or "" for page in reader.pages)
    return text
```
Step 3: Chunk and Embed Text
Since LLMs have token limits, we split the extracted text into chunks and generate embeddings.
```python
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

def create_embeddings(text):
    # Split the document into overlapping chunks, embed each one,
    # and index the embeddings in a FAISS vector store
    text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    texts = text_splitter.split_text(text)
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_texts(texts, embeddings)
    return vectorstore
```
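To see what `chunk_size` and `chunk_overlap` mean concretely, here is a simplified pure-Python splitter. This is only a sketch of the sliding-window idea, not CharacterTextSplitter's actual separator-aware algorithm:

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Slice text into fixed-size windows; consecutive windows
    share chunk_overlap characters so context isn't cut mid-thought."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 1200 characters with step 450 -> windows starting at 0, 450, 900
chunks = split_with_overlap("".join(str(i % 10) for i in range(1200)))
```

The overlap means the last 50 characters of each chunk reappear at the start of the next, which helps retrieval when an answer straddles a chunk boundary.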
Step 4: Implement Chatbot Logic
Now, we integrate the chatbot functionality using LangChain and OpenAI.
```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

def chat_with_pdf(vectorstore, query):
    # Retrieve the most relevant chunks and let the LLM answer from them
    qa = RetrievalQA.from_chain_type(
        llm=OpenAI(),
        chain_type="stuff",
        retriever=vectorstore.as_retriever(),
    )
    response = qa.run(query)
    return response
```
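The `chain_type="stuff"` strategy simply "stuffs" every retrieved chunk into a single prompt alongside the question. Conceptually it works like this hand-rolled sketch (an illustration of the idea, not LangChain's actual prompt template):

```python
def build_stuff_prompt(chunks, question):
    # Concatenate all retrieved chunks into one context block,
    # then append the user's question
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_stuff_prompt(
    ["PDFs are documents.", "LLMs answer questions."],
    "What do LLMs do?",
)
```

This is why chunking matters: "stuff" only works when the retrieved chunks fit inside the model's context window.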
Step 5: Create a Streamlit UI
To make the chatbot user-friendly, we’ll use Streamlit for the interface.
```python
import streamlit as st

def main():
    st.title("Chat with PDF")
    pdf_file = st.file_uploader("Upload your PDF", type=["pdf"])
    if pdf_file:
        # Persist the upload to disk so our extraction helper can read it
        with open("uploaded.pdf", "wb") as f:
            f.write(pdf_file.getbuffer())
        text = extract_text_from_pdf("uploaded.pdf")
        vectorstore = create_embeddings(text)
        query = st.text_input("Ask a question about the document:")
        if query:
            response = chat_with_pdf(vectorstore, query)
            st.write(response)

if __name__ == "__main__":
    main()
```
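One caveat: Streamlit reruns the whole script on every interaction, so the code above re-extracts and re-embeds the PDF for each question. A common fix is to cache the vector store keyed by the file's content hash. Here is a framework-agnostic sketch (in the real app you might hold the cache in `st.session_state` or decorate the builder with `st.cache_resource`; `build_fn` is a stand-in for the extract-then-embed pipeline above):

```python
import hashlib

_vectorstore_cache = {}

def get_vectorstore(pdf_bytes, build_fn):
    """Build the vector store once per unique PDF content; reuse it after."""
    key = hashlib.sha256(pdf_bytes).hexdigest()
    if key not in _vectorstore_cache:
        _vectorstore_cache[key] = build_fn(pdf_bytes)
    return _vectorstore_cache[key]
```

With this in place, asking ten questions about the same upload triggers only one embedding pass instead of ten.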
Step 6: Run the Application
Save the script as `app.py` and run:

```bash
streamlit run app.py
```
Now, upload a PDF and start chatting with your document!
Conclusion
With just a few lines of code, we built a Chat with PDF application using LLMs. This can be expanded with additional features like multiple document support, memory-based conversations, and improved UI.