How do I read the original PDF file from the indexer's data source in a custom WebApiSkill after enabling "Allow Skillset to read file data"?

  • Thread starter: Mike B

I see the following under my indexer settings:

[Screenshot of the indexer setting: https://i.sstatic.net/68WP1fBM.png]

When hovering over it I read the following:

True means the original file data obtained from your blob data source is preserved. This allows passing the original file to a custom skill, or to the Document Extraction skill.

How do I read the original PDF file from the associated blob data source in a custom WebApiSkill?

Code:
file_data_base64 = value.get('data', {}).get('file_data', '')
...
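
For illustration, here is a minimal sketch of decoding the file from one request record. It assumes the skill input is named `file_data` and that the value arrives as an object whose `data` key holds the base64-encoded file. Both the payload shape and the helper name `read_file_bytes` are assumptions for this sketch, not something confirmed by the post.

```python
import base64

# Assumed shape of one record in the request body. The "$type" and "data"
# keys are an assumption about how the file is passed to the skill.
sample_record = {
    "recordId": "0",
    "data": {
        "file_data": {
            "$type": "file",
            "data": base64.b64encode(b"%PDF-1.4 sample bytes").decode("ascii"),
        }
    },
}


def read_file_bytes(record):
    """Return the decoded file bytes for one record, or None if absent."""
    file_data = record.get("data", {}).get("file_data") or {}
    if not isinstance(file_data, dict):
        # The input was mapped to something other than a file object.
        return None
    encoded = file_data.get("data")
    if not encoded:
        return None
    return base64.b64decode(encoded)


pdf_bytes = read_file_bytes(sample_record)
print(pdf_bytes[:5])  # b'%PDF-'
```

Returning `None` instead of raising lets the caller decide how to report a record that has no file attached.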

EDIT

I enabled "Allow Skillset to read file data" in the indexer. My full setup:

  • WebApiSkill inputs

Code:
inputs=[
    InputFieldMappingEntry(name="file_data", source="/document/file_data")
],

  • WebApiSkill input reading

Code:
import azure.functions as func
import datetime
import json
import logging
import base64
import fitz
from io import BytesIO

app = func.FunctionApp()
logging.basicConfig(level=logging.INFO)


@app.route(route="CustomSplitSkill", auth_level=func.AuthLevel.FUNCTION)
def CustomSplitSkill(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    try:
        req_body = req.get_json()
        logging.info('Request body parsed successfully.')
    except ValueError as e:  # bind the exception so the log line below can use it
        logging.error(f"Invalid input: {e}")
        return func.HttpResponse("Invalid input", status_code=400)

    # 'values' expected top-level key in the request body
    response_body = {"values": []}
    for value in req_body.get('values', []):
        recordId = value.get('recordId')
        # default to a dict so the chained .get() does not fail on a missing or empty file_data
        file_data_base64 = (value.get('data', {}).get('file_data') or {}).get('data', '')
        if not file_data_base64:
            logging.error("No file_data found in the request.")
            return func.HttpResponse("Invalid input: No file_data found", status_code=400)

        try:
            file_data = base64.b64decode(file_data_base64)

            try:
                pdf_document = fitz.open(stream=BytesIO(file_data), filetype='pdf')
            except fitz.FileDataError as e:
                logging.error(f"Failed to open PDF document: {e}")
                return func.HttpResponse("Failed to open PDF document", status_code=400)
            except Exception as e:
                logging.error(f"An unexpected error occurred while opening the PDF document: {e}")
                return func.HttpResponse("An unexpected error occurred", status_code=500)
            
            if pdf_document.page_count == 0:
                logging.error("No pages found in the PDF document.")
                return func.HttpResponse("Invalid PDF: No pages found", status_code=400)

            extracted_text = ""
            for page_num in range(pdf_document.page_count):
                page = pdf_document.load_page(page_num)
                extracted_text += page.get_text()

            combined_list = [{'textItems': ['text1', 'text2'], 'numberItems': [0, 1]}]  # i deleted the chunking and associated page extraction for simplicity

            response_record = {
                "recordId": recordId,
                "data": {
                    "subdata": combined_list
                }
            }
            response_body['values'].append(response_record)
        except Exception as e:
            logging.error(f"Error processing file_data: {e}")
            return func.HttpResponse("Error processing file_data", status_code=500)

    logging.info('Function executed successfully.')
    return func.HttpResponse(json.dumps(response_body), mimetype="application/json")
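
The function above returns a bare HTTP 400 or 500 for the whole batch as soon as one record fails. The custom Web API skill response format also allows problems to be reported per record through `errors` and `warnings` arrays, which tends to surface the message in the indexer's execution history. Below is a minimal sketch of that pattern, kept independent of Azure Functions so it can be run locally. The names `process_records`, `handle_record`, and `demo_handler` are hypothetical.

```python
import json


def process_records(req_body, handle_record):
    """Apply handle_record to each input record, collecting per-record errors."""
    results = []
    for value in req_body.get("values", []):
        record = {
            "recordId": value.get("recordId"),
            "data": {},
            "errors": [],
            "warnings": [],
        }
        try:
            record["data"] = handle_record(value.get("data", {}))
        except Exception as exc:
            # Report the failure on this record only; other records still succeed.
            record["errors"].append({"message": f"{type(exc).__name__}: {exc}"})
        results.append(record)
    return {"values": results}


def demo_handler(data):
    """Stand-in for the PDF processing step (hypothetical)."""
    if "file_data" not in data:
        raise ValueError("No file_data found")
    return {"subdata": [{"textItems": ["text1"], "numberItems": [0]}]}


body = {
    "values": [
        {"recordId": "1", "data": {"file_data": {"data": "..."}}},
        {"recordId": "2", "data": {}},
    ]
}
response = process_records(body, demo_handler)
print(json.dumps(response, indent=2))
```

In the HTTP handler, the result would be serialized with `json.dumps` and returned with status 200, leaving non-200 responses for requests that cannot be parsed at all.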

The error:

Code:
Message:
Could not execute skill because the Web Api request failed.

Details:
Web Api response status: 'NotFound', Web Api response details: ''

Because my skillset uses projections, I cannot debug this properly: debug sessions are not supported with projections. The function's logging does not show the specific error either, despite the error handling and checks.
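
One observation that may help narrow this down: none of the branches in the function above return a 404, so the 'NotFound' status probably comes from the Functions host before the handler runs, which would also be consistent with the empty logs. That is an inference, not something the post confirms. A possible cause is a skill `uri` whose path does not match the registered route. Another, equally unconfirmed, is the function not being registered at all, for example because an import such as `fitz` fails at startup. The sketch below checks a skill URI against the route in the question. It assumes the default `api` route prefix (which `host.json` can change), and the host name and helper `check_skill_uri` are hypothetical.

```python
from urllib.parse import urlsplit


def check_skill_uri(uri, route="CustomSplitSkill", route_prefix="api"):
    """Return a list of problems found in a WebApiSkill uri (empty if none)."""
    parts = urlsplit(uri)
    problems = []
    if parts.scheme != "https":
        problems.append(f"scheme is {parts.scheme!r}, expected 'https'")
    # Build the path the Functions host is assumed to expose for this route.
    expected_path = f"/{route_prefix}/{route}" if route_prefix else f"/{route}"
    if parts.path.rstrip("/") != expected_path:
        problems.append(f"path is {parts.path!r}, expected {expected_path!r}")
    return problems


# Hypothetical URIs for illustration only.
print(check_skill_uri("https://example-app.azurewebsites.net/api/CustomSplitSkill?code=KEY"))
print(check_skill_uri("https://example-app.azurewebsites.net/CustomSplitSkill?code=KEY"))
```

The first call prints an empty list; the second reports the missing `api` prefix. Posting a small hand-built request to the same URI from a terminal is another quick way to see whether the 404 also occurs outside the indexer.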