OiO.lk Community platform!


Trying to extract information from PDF files in Google Colab, but most of the information from the first file is repeated for all the others

  • Thread starter: Victor Brandao (Guest)

This is the code:

Code:
import re
from io import BytesIO

import PyPDF2

for file in files.get('files', []):
    # ... (Get file content as before)

    # Extract data from the PDF
    pdf_reader = PyPDF2.PdfReader(BytesIO(file_content))
    page = pdf_reader.pages[0]  # Assuming you want to extract from the first page

    # 1. File Name
    file_name = file['name']
    print(f"File: {file_name}")

    # 2. Process Number (dots are escaped so they only match literal dots)
    process_number = None
    process_number_match = None
    process_number_match = re.search(r"(\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4})", page.extract_text())
    if process_number_match:
        process_number = process_number_match.group(1)
        print(f"Process Number: {process_number}")
    else:
        print("erro, número do processo não encontrado")  # "error, process number not found"

    # 3. Name
    name = None  # Reset the name variable
    name_match = None
    name_match = re.search(r"(AUTOR|INTERESSADOS|INTERESSADO|INTERESSADA):\s+([A-Z\s]+)", page.extract_text())
    if name_match:
        name = name_match.group(2)
        print(f"Name: {name}")
    else:
        print("error, nome não encontrado")  # "error, name not found"

    # 4. Keywords (Portuguese terms searched for in the PDF text)
    found_keywords = []  # Reset the found_keywords list
    keywords = ["audiência", "subsídios", "cumprimento"]
    for keyword in keywords:
        if keyword in page.extract_text():
            found_keywords.append(keyword)
    if found_keywords:
        print(f"Keywords Found: {', '.join(found_keywords)}")
    else:
        print("erro, pedido não encontrado")  # "error, request not found"

It will keep printing this:

Code:
Keywords Found: cumprimento
File: 33-00737.015338.pdf
Process Number: (number1)
Name:(name1)
 
S
Keywords Found: cumprimento
File: 32-00737.012571.pdf
Process Number: (number1)
Name:(name1)
 
S
Keywords Found: cumprimento
File: 31-00737.012592.pdf
Process Number: (number1)
Name:(name1)
 
S
Keywords Found: cumprimento
File: 30-00737.010470.pdf
Process Number: (number1)
Name:(name1)
 
S
Keywords Found: cumprimento
File: 29-00737.007060.pdf
Process Number: (number1)
Name:(name1)

The file name is updated on each iteration, so the loop is visiting the correct files. But it keeps repeating the other strings. I tried resetting them with = None, but that didn't work.
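
The changing file name only shows that the loop visits each file entry. It does not by itself show that `file_content` holds new bytes on each pass, since that step is elided in the code above. One way to test that assumption, as a minimal sketch: fingerprint the bytes for each file and compare. The helper `content_fingerprint` and the sample byte strings are hypothetical and only for illustration.

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    """Return a short SHA-256 fingerprint of raw file bytes."""
    return hashlib.sha256(data).hexdigest()[:12]


# Hypothetical stand-ins for the bytes obtained on two loop passes.
first = b"%PDF-1.4 first document"
second = b"%PDF-1.4 second document"

# Different bytes give different fingerprints; identical bytes give the same one.
print(content_fingerprint(first) == content_fingerprint(second))
print(content_fingerprint(first) == content_fingerprint(first))
```

Printing `content_fingerprint(file_content)` next to the file name inside the loop would show the same value every time if the same bytes are being parsed for every file.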

I tried using:

Code:
    # 3. Name
    name = None  # Reset the name variable
    name_match = None
    name_match = re.search(r"(AUTOR|INTERESSADOS|INTERESSADO|INTERESSADA):\s+([A-Z\s]+)", page.extract_text())
    if name_match:
        name = name_match.group(2)
        print(f"Name: {name}")
    else:
        print("error, nome não encontrado")

I was expecting to print the name for each document. Instead I got the name right for the first document and it got repeated for all the others.
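
To rule out values leaking from one iteration to the next, the extraction can be moved into a function that depends only on the text passed in. The following is a sketch under assumptions, not a confirmed fix: `extract_fields` and the sample text are hypothetical, the name pattern is narrowed to capitals and spaces so it stops at the end of the line, and accented capitals would need to be added to the character class.

```python
import re

PROCESS_RE = re.compile(r"\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4}")
NAME_RE = re.compile(r"(?:AUTOR|INTERESSADOS|INTERESSADO|INTERESSADA):\s+([A-Z ]+)")
KEYWORDS = ["audiência", "subsídios", "cumprimento"]


def extract_fields(text: str) -> dict:
    """Extract process number, name, and keywords from one page's text."""
    process = PROCESS_RE.search(text)
    name = NAME_RE.search(text)
    return {
        "process_number": process.group(0) if process else None,
        "name": name.group(1).strip() if name else None,
        "keywords": [k for k in KEYWORDS if k in text],
    }


# Hypothetical sample text standing in for page.extract_text().
sample = (
    "Processo 1234567-89.2023.4.01.3400\n"
    "AUTOR: MARIA DA SILVA\n"
    "Pedido de cumprimento"
)
print(extract_fields(sample))
```

Because the function has no state of its own, calling `extract_fields(page.extract_text())` inside the loop and still getting the same result for every file would point to the input text being the same each time, rather than to the variables not being reset.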