
Improve Efficiency of Document Processing Pipeline in LangChain


I'm working on a project that uses LangChain to load, split, and embed a large number of PDFs. My goal is to process all the documents efficiently, but the code seems to get stuck on this line: Chroma.from_documents(documents, embedding, persist_directory="chroma_db") (it runs for 30+ minutes without completing). I'd appreciate any insights or suggestions to speed things up or diagnose the issue.
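To rule out the Ollama server itself, the first check I ran was a single embedding round-trip, timed on its own (a minimal sketch using the same model as the full pipeline below; if even this one call is slow, the bottleneck is the embedding server rather than Chroma):

from langchain_community.embeddings import OllamaEmbeddings
import time

# Sanity check: time one embedding round-trip to the local Ollama server.
embedding = OllamaEmbeddings(model="nomic-embed-text:latest")
start = time.time()
vector = embedding.embed_query("hello world")
print(f"One embedding took {time.time() - start:.2f}s (dimension {len(vector)})")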

Here's a simplified version of my code. It loads PDFs, splits them into smaller chunks, creates embeddings with OllamaEmbeddings, and stores the embeddings in a Chroma vector database. I've added multithreading to the document-splitting step to try to speed it up. I've tested with the following Ollama models: llama2:latest, mistral-large:latest, mistral-nemo:latest, gemma2:latest, nomic-embed-text:latest, mxbai-embed-large:latest, llama3.1:8b-instruct-q8_0, and llama3.2:latest.

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
import time
from concurrent.futures import ThreadPoolExecutor

# Define the directory where PDFs are stored
pdf_directory = "data"
documents = []

# Load PDFs
for filename in os.listdir(pdf_directory):
    if filename.endswith(".pdf"):
        file_path = os.path.join(pdf_directory, filename)
        loader = PyMuPDFLoader(file_path)
        docs = loader.load()
        documents.extend(docs)

start_time = time.time()
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

def split_doc(doc):
    return text_splitter.split_documents([doc])

with ThreadPoolExecutor() as executor:
    split_docs = list(executor.map(split_doc, documents))

documents = [doc for sublist in split_docs for doc in sublist]
print(f"Splitting documents took {time.time() - start_time:.2f} seconds")

# Initialize embeddings with Ollama
start_time = time.time()
embedding = OllamaEmbeddings(model="nomic-embed-text:latest")
print(f"Initializing embeddings took {time.time() - start_time:.2f} seconds")

# Embed chunks and store them in Chroma (the embedding calls happen inside from_documents)
start_time = time.time()
vector_db = Chroma.from_documents(documents, embedding, persist_directory="chroma_db")
vector_db.persist()
print(f"Embedding and persisting took {time.time() - start_time:.2f} seconds")

Issues Observed

  1. Long Runtime: The code runs for over 30 minutes without completing; the embedding and vector-storage stages appear to be the slow part.
  2. No Visible Errors: Nothing is thrown; the code just takes an unexpectedly long time to complete (the batched sketch below is how I'm trying to narrow this down).
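To get more visibility than the single Chroma.from_documents call gives me, I'm also trying a batched variant that adds chunks in slices and prints per-batch timings (a sketch reusing the documents list and embedding object from the code above; batch_size is an arbitrary choice):

import time
from langchain_community.vectorstores import Chroma

# Start from an empty collection, then add chunks in slices so each
# batch's embed-and-insert time is visible while it runs.
vector_db = Chroma(embedding_function=embedding, persist_directory="chroma_db")

batch_size = 64  # arbitrary; tune as needed
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    start = time.time()
    vector_db.add_documents(batch)
    print(f"Batch {i // batch_size}: {len(batch)} chunks in {time.time() - start:.2f}s")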

System Details

  • LangChain Version: latest

  • Environment: running locally (with ample memory and CPU resources)

  • Python Version: 3.11

  • Relevant Packages: langchain, PyMuPDFLoader, OllamaEmbeddings, Chroma,
    RecursiveCharacterTextSplitter

  • Graphics/Displays: Apple M2 Max (Chipset Model: Apple M2 Max, Type:
    GPU, Bus: Built-In, Total Number of Cores: 30)

Questions

  1. How can I optimize this pipeline for faster performance? Any recommendations for more efficient loaders, splitters, or database options would be greatly appreciated.
  2. Is there a way to diagnose which stage of the pipeline is causing the bottleneck? I added some timing code (plus the cProfile attempt shown below) but would appreciate more advanced profiling tips.
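For reference, the cProfile attempt mentioned in question 2 looks like this (a sketch: the [:50] slice is just a small sample so the profile finishes quickly, and it writes to a separate chroma_db_profile directory so it doesn't touch the main store):

import cProfile
import pstats

# Profile the slow call on a small sample, sorted by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
Chroma.from_documents(documents[:50], embedding, persist_directory="chroma_db_profile")
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)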


