I’m working on a project using LangChain to load, split, and embed a large number of PDFs. My goal is to process all the documents efficiently, but the code seems to get stuck on this line: Chroma.from_documents(documents, embedding, persist_directory="chroma_db") (it runs for 30+ minutes without completing). I’d appreciate any insights or suggestions to speed things up or diagnose the issue.
Here’s a simplified version of my code. It loads PDFs, splits them into smaller chunks, creates embeddings with OllamaEmbeddings, and stores the embeddings in a Chroma vector database. I’ve added multithreading to the document-splitting step to optimize performance. I’ve tested it with the llama2:latest, mistral-large:latest, mistral-nemo:latest, gemma2:latest, nomic-embed-text:latest, mxbai-embed-large:latest, llama3.1:8b-instruct-q8_0, and llama3.2:latest models.
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
import time
from concurrent.futures import ThreadPoolExecutor

# Define the directory where PDFs are stored
pdf_directory = "data"
documents = []

# Load PDFs
for filename in os.listdir(pdf_directory):
    if filename.endswith(".pdf"):
        file_path = os.path.join(pdf_directory, filename)
        loader = PyMuPDFLoader(file_path)
        docs = loader.load()
        documents.extend(docs)

start_time = time.time()

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

def split_doc(doc):
    return text_splitter.split_documents([doc])

with ThreadPoolExecutor() as executor:
    split_docs = list(executor.map(split_doc, documents))

documents = [doc for sublist in split_docs for doc in sublist]
print(f"Splitting documents took {time.time() - start_time:.2f} seconds")

# Initialize embeddings with Ollama
start_time = time.time()
embedding = OllamaEmbeddings(model="nomic-embed-text:latest")
print(f"Initializing embeddings took {time.time() - start_time:.2f} seconds")

# Store embeddings in Chroma (this is the step that never finishes)
start_time = time.time()
vector_db = Chroma.from_documents(documents, embedding, persist_directory="chroma_db")
print(f"Embedding and storing documents took {time.time() - start_time:.2f} seconds")
vector_db.persist()
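To get a rough sense of where the time goes, one idea I had is to call the embedding model directly on a small sample of chunks and extrapolate, bypassing Chroma entirely. This is an untested sketch that reuses documents and embedding from the script above; the sample size of 20 is arbitrary:

# Time the raw embedding call on a small sample and extrapolate
sample = [doc.page_content for doc in documents[:20]]
start_time = time.time()
vectors = embedding.embed_documents(sample)
elapsed = time.time() - start_time
per_chunk = elapsed / len(sample)
print(f"Embedded {len(sample)} chunks in {elapsed:.2f} seconds ({per_chunk:.2f} s/chunk)")
print(f"Rough estimate for all {len(documents)} chunks: {per_chunk * len(documents) / 60:.1f} minutes")

If the per-chunk time is already high here, the bottleneck would seem to be the Ollama embedding calls rather than Chroma itself.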
Issues Observed
- Long Runtime: The code runs for over 30 minutes without completing, especially during the embedding and vector storage stages.
- No Visible Errors: No errors are thrown; the code just takes an unexpectedly long time to process.
System Details
- LangChain Version: Latest version
- Environment: Running locally (with ample memory and CPU resources)
- Python Version: 3.11
- Relevant Packages: langchain, PyMuPDFLoader, OllamaEmbeddings, Chroma, RecursiveCharacterTextSplitter
- Graphics/Displays: Apple M2 Max (Chipset Model: Apple M2 Max, Type: GPU, Bus: Built-In, Total Number of Cores: 30)
Questions
- How can I optimize this pipeline for faster performance? Any recommendations for more efficient loaders, splitters, or database options would be greatly appreciated.
- Is there a way to diagnose which stage of the pipeline is causing the bottleneck? I added some timing code but would appreciate more advanced profiling tips.
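For example, would inserting into Chroma in smaller batches with per-batch timing be a reasonable way to make progress visible, instead of one opaque from_documents call? Here is a rough sketch of what I have in mind, reusing documents and embedding from above; the batch size of 200 is just a guess:

# Add documents in batches so the duration of each step is visible
batch_size = 200
vector_db = Chroma(embedding_function=embedding, persist_directory="chroma_db")
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    batch_start = time.time()
    vector_db.add_documents(batch)  # embeds and stores only this batch
    print(f"Batch {i // batch_size + 1}: {len(batch)} chunks in {time.time() - batch_start:.2f} seconds")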