
Improve Efficiency of Document Processing Pipeline in LangChain


I'm working on a project that uses LangChain to load, split, and embed a large number of PDFs. My goal is to process all the documents efficiently, but the code seems to get stuck on this line: Chroma.from_documents(documents, embedding, persist_directory="chroma_db") (it runs for 30+ minutes without completing). I'd appreciate any insights or suggestions to speed things up or diagnose the issue.
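To rule out the Ollama server itself, the first check I ran was a single embedding round-trip, timed on its own (a minimal sketch using the same model as the full pipeline below; if even this one call is slow, the bottleneck is the embedding server rather than Chroma):

from langchain_community.embeddings import OllamaEmbeddings
import time

# Sanity check: time one embedding round-trip to the local Ollama server.
embedding = OllamaEmbeddings(model="nomic-embed-text:latest")
start = time.time()
vector = embedding.embed_query("hello world")
print(f"One embedding took {time.time() - start:.2f}s (dimension {len(vector)})")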

Here's a simplified version of my code. It loads PDFs, splits them into smaller chunks, creates embeddings with OllamaEmbeddings, and stores the embeddings in a Chroma vector database. I've added multithreading to the document-splitting step to try to speed it up. I've tested with the following Ollama models: llama2:latest, mistral-large:latest, mistral-nemo:latest, gemma2:latest, nomic-embed-text:latest, mxbai-embed-large:latest, llama3.1:8b-instruct-q8_0, and llama3.2:latest.

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
import time
from concurrent.futures import ThreadPoolExecutor

# Define the directory where PDFs are stored
pdf_directory = "data"
documents = []

# Load PDFs
for filename in os.listdir(pdf_directory):
    if filename.endswith(".pdf"):
        file_path = os.path.join(pdf_directory, filename)
        loader = PyMuPDFLoader(file_path)
        docs = loader.load()
        documents.extend(docs)

start_time = time.time()
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

def split_doc(doc):
    return text_splitter.split_documents([doc])

with ThreadPoolExecutor() as executor:
    split_docs = list(executor.map(split_doc, documents))

documents = [doc for sublist in split_docs for doc in sublist]
print(f"Splitting documents took {time.time() - start_time:.2f} seconds")

# Initialize embeddings with Ollama
start_time = time.time()
embedding = OllamaEmbeddings(model="nomic-embed-text:latest")
print(f"Initializing embeddings took {time.time() - start_time:.2f} seconds")

# Embed chunks and store them in Chroma (the embedding calls happen inside from_documents)
start_time = time.time()
vector_db = Chroma.from_documents(documents, embedding, persist_directory="chroma_db")
vector_db.persist()
print(f"Embedding and persisting took {time.time() - start_time:.2f} seconds")

Issues Observed

  1. Long Runtime: The code runs for over 30 minutes without completing; the embedding and vector-storage stages appear to be the slow part.
  2. No Visible Errors: Nothing is thrown; the code just takes an unexpectedly long time to complete (the batched sketch below is how I'm trying to narrow this down).
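To get more visibility than the single Chroma.from_documents call gives me, I'm also trying a batched variant that adds chunks in slices and prints per-batch timings (a sketch reusing the documents list and embedding object from the code above; batch_size is an arbitrary choice):

import time
from langchain_community.vectorstores import Chroma

# Start from an empty collection, then add chunks in slices so each
# batch's embed-and-insert time is visible while it runs.
vector_db = Chroma(embedding_function=embedding, persist_directory="chroma_db")

batch_size = 64  # arbitrary; tune as needed
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    start = time.time()
    vector_db.add_documents(batch)
    print(f"Batch {i // batch_size}: {len(batch)} chunks in {time.time() - start:.2f}s")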

System Details

  • LangChain Version: latest

  • Environment: running locally (with ample memory and CPU resources)

  • Python Version: 3.11

  • Relevant Packages: langchain, PyMuPDFLoader, OllamaEmbeddings, Chroma,
    RecursiveCharacterTextSplitter

  • Graphics/Displays: Apple M2 Max (Chipset Model: Apple M2 Max, Type:
    GPU, Bus: Built-In, Total Number of Cores: 30)

Questions

  1. How can I optimize this pipeline for faster performance? Any recommendations for more efficient loaders, splitters, or database options would be greatly appreciated.
  2. Is there a way to diagnose which stage of the pipeline is causing the bottleneck? I added some timing code (plus the cProfile attempt shown below) but would appreciate more advanced profiling tips.
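For reference, the cProfile attempt mentioned in question 2 looks like this (a sketch: the [:50] slice is just a small sample so the profile finishes quickly, and it writes to a separate chroma_db_profile directory so it doesn't touch the main store):

import cProfile
import pstats

# Profile the slow call on a small sample, sorted by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
Chroma.from_documents(documents[:50], embedding, persist_directory="chroma_db_profile")
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)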


