Azure search index with RAG vector search on own data. Chat completions doesn't return chunk_id as expected

  • Thread starter: esteebie (Guest)

When I index with a skillset that splits the text and then creates embeddings, I can search and retrieve a chat completion, along with relevant citations, in a single API call to https://xxxxxxxxxxxxxx.openai.azure...at/completions?api-version=2024-03-01-preview.

The problem is that, unlike a keyword search on the original text documents, the chunk_id field that comes back is always 0, so I cannot process the citations meaningfully for the user.

I think this is because the split skill creates a new record in the index for every chunk, so the vector search returns the entire chunk rather than a portion of it.

There must be a way around this. If, for example, you could specify which fields are returned with each citation, it would be easy to derive a page number for each reference, since Azure automatically creates the document ID as parentDocumentId_pages01 etc.
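To illustrate the idea, here is a minimal Python sketch of deriving a chunk index from such a document ID, assuming the ID were actually available on each citation. The key shape (`<parent id>_pages<digits>`, optionally with an underscore before the digits) is inferred from the example above, and `split_chunk_key` is a hypothetical helper, not part of any Azure SDK.

```python
import re
from typing import Optional, Tuple

# Assumed key shape: "<parent id>_pages<digits>", with an optional
# underscore before the digits. Adjust the pattern to the real index keys.
_CHUNK_KEY = re.compile(r"^(?P<parent>.+)_pages_?(?P<index>\d+)$")


def split_chunk_key(key: str) -> Optional[Tuple[str, int]]:
    """Return (parent document id, chunk index), or None if the key does not match."""
    match = _CHUNK_KEY.match(key)
    if match is None:
        return None
    return match.group("parent"), int(match.group("index"))
```

Whether the numeric suffix is zero-based or one-based, and whether a chunk corresponds to a page at all, depends on how the split skill is configured, so the returned index may need an offset before it is shown to the user.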

Any help much appreciated and thanks for reading.

Case 1: Keyword search on non-chunked document in index:

Code:
"message": {
    "role": "assistant",
    "content": "zzzzz [doc2].",
    "end_turn": true,
    "context": {
        "citations": [
            {
                "content": "xxxxxxxxxx",
                "title": "Process Hierarchy - 22-11-2018 11-01.pdf",
                "url":"https://xxxxxxxxx.sharepoint.com/xxxxxxxxx/Process%20Hierarchy%20-%2022-11-2018%2011-01.pdf",
                "filepath": "/xxxxxxxxx/Process Hierarchy - 22-11-2018 11-01.pdf",
                "chunk_id": "31"
            }
        ]
    }
}

Case 2: Vector search on an index containing chunks of the document split by the skillset:

Code:
{
    "id": "zzzzzzzzzzzzzz",
    "model": "gpt-35-turbo",
    "created": 1718180381,
    "object": "extensions.chat.completion",
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {
                "role": "assistant",
                "content": "yyyyyyyyyyyyyyyyyyy",
                "end_turn": true,
                "context": {
                    "citations": [
                        {
                            "content": "xxxxxxxxxxxxxxx",
                            "title": "doc1 Comms 2022.pdf",
                            "url": "https://xxxxxxxxxx.blob.core.windows.net/xxxx/doc1%20%20Comms%202022.pdf",
                            "filepath": null,
                            "chunk_id": "0"
                        },
                        {
                            "content": "yyyyyyyyyyyyyyyy",
                            "title": "doc1 Comms 2022.pdf",
                            "url": "https://xxxxxxxxxx.blob.core.windows.net/xxxx/doc1%20%20Comms%202022.pdf",
                            "filepath": null,
                            "chunk_id": "0"
                        }
                    ]
                }
            }
        }
    ]
}

In the second example the same file is returned twice because two different chunks were matched, yet both chunks are labelled '0'.

The desired outcome would be either for chunk_id to differ between the two citations, or to somehow concatenate the array index from the split skill's output with the document title, so the titles become 'doc1 Comms 2022.pdf - pt 1' and 'doc1 Comms 2022.pdf - pt 2'.
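As a rough client-side approximation of the second outcome, the citations could be relabelled after the response arrives. The sketch below is only an illustration: `label_duplicate_titles` is a hypothetical helper, and the part number it assigns reflects the order of the citations in the response, not the chunk's position in the split skill output, because the response does not carry that position.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List


def label_duplicate_titles(citations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Append ' - pt N' to titles that occur more than once in the citation list."""
    totals = Counter(c.get("title") for c in citations)
    seen: Dict[Any, int] = defaultdict(int)
    labelled = []
    for citation in citations:
        title = citation.get("title")
        copy = dict(citation)  # leave the original response untouched
        if title is not None and totals[title] > 1:
            seen[title] += 1
            copy["title"] = f"{title} - pt {seen[title]}"
        labelled.append(copy)
    return labelled
```

Applied to the Case 2 response, this would pass `response["choices"][0]["message"]["context"]["citations"]` to the helper. It makes the two citations distinguishable, but it still cannot point the user to a specific page.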

The only option seems to be a custom web skill backed by an Azure Function, but that seems far too over-engineered for something so simple.

A select statement on the index does not work, because the completions API always sends back the same fields, unless this can be configured somehow.

Here's the API call to https://xxxxxx/openai/deployments/yyyyy/chat/completions?api-version=2024-03-01-preview:

Code:
{
    "data_sources": [
        {
            "type": "azure_search",
            "parameters": {
                "endpoint": "https://xxxxx.search.windows.net",
                "authentication":{"type":"api_key","key": "yyyyyyy"},
                "index_name": "xxxxxx",
                "topNDocuments":5,
                "query_type": "vectorSimpleHybrid",
                "vectorFilterMode": "preFilter",
                "filter":"(search.ismatch('@{variables('region')}', 'region')) and (search.ismatch('@{variables('channel')}', 'channel'))",
                "embeddingEndpoint":"xxxxx/openai/deployments/yyyyyyy/embeddings?api-version=2024-02-15-preview",
                "embeddingKey":"xxxxxxx"
            }
        }
    ],
    "messages": [
        {
            "role": "system",
            "content": "You are..."
        },
        {
            "role": "user",
            "content": "@{triggerBody()?['text']}"
        }
    ],
    "temperature": 1.2,
    "top_p": 0.5,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "max_tokens": 2000,
    "stop": null
}
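
One avenue that may be worth checking is the `fields_mapping` object that the Azure OpenAI On Your Data reference lists among the `azure_search` parameters. The Python sketch below builds a data source entry that maps the citation's `filepath` to the index field holding the per-chunk key, so that each citation might carry a distinguishing value. This is untested against the setup above: the index field names (`chunk_id`, `title`, `chunk`) and the helper `build_search_data_source` are assumptions for illustration, and whether the mapping is honoured for this API version and query type would need to be verified.

```python
from typing import Any, Dict


def build_search_data_source(
    endpoint: str,
    index_name: str,
    api_key: str,
    chunk_key_field: str = "chunk_id",  # assumed name of the per-chunk key field
    title_field: str = "title",         # assumed name of the title field
    content_field: str = "chunk",       # assumed name of the chunk text field
) -> Dict[str, Any]:
    """Build one entry for the request's data_sources list with an explicit fields_mapping."""
    return {
        "type": "azure_search",
        "parameters": {
            "endpoint": endpoint,
            "authentication": {"type": "api_key", "key": api_key},
            "index_name": index_name,
            "fields_mapping": {
                "title_field": title_field,
                # Surface the chunk key through the citation's filepath value.
                "filepath_field": chunk_key_field,
                "content_fields": [content_field],
            },
        },
    }
```

The remaining parameters from the request above (filter, query type, embedding settings) would be merged into the same `parameters` dictionary.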