RAG pipeline help request

Thread starter: ela (Guest)
I'm a bit new to the whole RAG pipeline thing and find myself a bit lost in the endless possibilities for building one. My goal is to create a script that transforms about 60 anatomical PDFs into a vector store database and uses it to answer questions about body parts, returning references to the pages of the PDFs the information was taken from.

My script so far looks like this because it is the only way I have managed to make it work:

Code:
import os

import faiss
import nest_asyncio
from dotenv import load_dotenv
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.vector_stores.faiss import FaissVectorStore

nest_asyncio.apply()
load_dotenv()

llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])
Settings.callback_manager = callback_manager

save_dir = "./documents/vector_store"

d = 1536  # dimensionality of OpenAI's text-embedding-ada-002 embeddings (llama-index's default embed model)
faiss_index = faiss.IndexFlatL2(d)
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build the index from the PDFs on the first run; later runs reuse the persisted index
if not os.path.exists(save_dir):
    print("Saving vector store to disk ...")
    documents = SimpleDirectoryReader("./documents/test/").load_data()
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
    )
    index.storage_context.persist(persist_dir=save_dir)
    vector_query_engine = index.as_query_engine(similarity_top_k=3)
else:
    print("Loading vector store from disk...")
    vector_store = FaissVectorStore.from_persist_dir(save_dir)
    storage_context = StorageContext.from_defaults(
        vector_store=vector_store, persist_dir=save_dir
    )
    index = load_index_from_storage(storage_context=storage_context)
    vector_query_engine = index.as_query_engine(similarity_top_k=3)

response = vector_query_engine.query(
    "What is the diaphragm and what position does it occupy in the body?"
)

print(response)
for i, node in enumerate(response.source_nodes):
    metadata = node.node.metadata
    text_chunk = node.node.text
    page_label = metadata.get("page_label", "N/A")
    file_name = metadata.get("file_name", "N/A")
    print(f"Reference nr: {i+1}, Page: {page_label}, Document: {file_name}")
    print(f"Text Chunk: {text_chunk}\n")

And this is the (beginning of the) output:

Code:
Trace: query
    |_CBEventType.QUERY -> 2.734167 seconds
      |_CBEventType.RETRIEVE -> 0.417225 seconds
        |_CBEventType.EMBEDDING -> 0.417225 seconds
      |_CBEventType.SYNTHESIZE -> 2.316942 seconds
        |_CBEventType.TEMPLATING -> 0.0 seconds
        |_CBEventType.LLM -> 2.30051 seconds
**********
A diaphragm is a dome-shaped muscle that separates the thoracic cavity from the abdominal cavity. It is positioned below the lungs and heart, and above the liver, stomach, and other abdominal organs. The diaphragm is connected to the thoracic aorta, which supplies blood to the chest wall and thoracic organs, and the inferior vena cava, which returns blood from the lower body to the heart.

Reference nr: 1, Page: 317, Document: random_pdf.pdf
Text Chunk: even during sleep, and must have a constant flow of blood to supply oxygen and remove waste products. For this reason there are four vessels that bring blood to the circle of Willis. From this anastomosis, several paired arteries (the cerebral arteries) extend into the brain itself.
The thoracic aorta and its branches supply the chest wall and the organs within the thoracic cavity. These vessels are listed in Table 13–1.
The abdominal aorta gives rise to arteries that supply the abdominal wall and organs and to the common iliac arteries, which continue into the legs. Notice in Fig. 13–3 that the common iliac artery becomes the external iliac artery, which becomes the femoral artery, which becomes the popliteal artery; the same vessel has different names based on location. These vessels are also listed in Table 13–1 (see Box 13–3: Pulse Sites).
The systemic veins drain blood from organs or parts of the body and often parallel their correspond- The Vascular System 299
Figure 13–5. Arteries and veins of the head and neck shown in right lateral view. Veins are labeled on the left. Arteries are labeled on the right.

I have two questions:

  • On a more theoretical level: I thought a RAG pipeline needed (in a very simplified fashion) 1) embedding of the chunks, 2) retrieval based on similarity, and 3) rephrasing of the answer by an LLM. However, this script works fairly well while apparently skipping both 1 and 3, so am I missing the point, or does llama-index abstract away a lot of the implementation? (My current mental model is written out in the first sketch below.)
  • On a practical level: how do I improve on this? The script works, in that it usually outputs reasonable answers, but the text in "source_nodes" is sometimes very unsatisfactory in terms of its relevance. (One change I was considering is the chunking sketch at the end of the post.)
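
For context on the first question, this is how I currently picture what llama-index does implicitly when nothing is configured. I have not verified the exact defaults, so the model names below are my assumption rather than something taken from my script:

Code:
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Step 1 (embedding): used both when the index is built and when the query is
# embedded for retrieval; ada-002 vectors have 1536 dimensions, which would
# explain why faiss.IndexFlatL2(1536) works with the defaults.
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

# Step 3 (rephrasing): the LLM used in the SYNTHESIZE step of the trace to
# turn the retrieved chunks into the final answer.
Settings.llm = OpenAI(model="gpt-3.5-turbo")

If that mental model is right, then steps 1 and 3 are not skipped at all, just hidden behind the defaults, but I would appreciate confirmation.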

Any help/guidance or resources would be super appreciated!
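
For the second question, the chunking change I mentioned above is the one concrete thing I was considering: controlling how the documents are split instead of relying on the defaults, and retrieving a few more candidates. The chunk size and overlap here are just guesses on my part, and the index would have to be rebuilt for the new chunking to take effect:

Code:
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

# Smaller, overlapping chunks so a retrieved node is more likely to cover a
# single topic instead of spanning several unrelated paragraphs.
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)

# "index" is the VectorStoreIndex from the script above; retrieve more
# candidates and let the synthesizer decide what is actually relevant.
vector_query_engine = index.as_query_engine(similarity_top_k=5)

Is that the right knob to turn, or is there a better way to make the retrieved chunks more relevant?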