Removing a PDF watermark with PyMuPDF, or reassembling multiple images

I have been using PyMuPDF to remove watermarks from some PDF documents.

However, some documents prove to be more difficult than others.

In most cases, the watermark is just PDF text overlaid on top of an image (the actual PDF content). If I extract that image and place it on a new PDF page, I get the original page without the watermark.

However, there are some cases where, although the watermark is still just text overlaid on top of an image, the actual PDF page is broken down into multiple images. I can extract those images, but then I have trouble reassembling them into the original page.

I am looking for an alternative way to remove the watermark, or a way to reassemble the images so that they look like the original PDF page.
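
For the first option, one direction that might be worth trying is to delete the text layer in place instead of extracting the images. The sketch below is untested against the problem files and rests on two assumptions: every text block on the page belongs to the watermark, and the watermark is real, extractable PDF text. The function names `text_block_rects` and `strip_text_layer` are made up for illustration. It marks each text block with a transparent redaction and applies the redactions while telling PyMuPDF to leave images alone.

```python
def text_block_rects(page_dict):
    """Return the bbox of every text block in a page.get_text("dict") result."""
    return [
        tuple(block["bbox"])
        for block in page_dict.get("blocks", [])
        if block.get("type") == 0  # 0 = text block, 1 = image block
    ]


def strip_text_layer(src_path, dst_path):
    """Remove all text from every page while leaving the images untouched.

    Assumption: all text on the page is watermark text.
    """
    import fitz  # PyMuPDF, imported lazily so text_block_rects() has no dependency

    doc = fitz.open(src_path)
    for page in doc:
        for bbox in text_block_rects(page.get_text("dict")):
            # fill=False keeps the redacted area transparent instead of white
            page.add_redact_annot(fitz.Rect(bbox), fill=False)
        # PDF_REDACT_IMAGE_NONE: do not blank out or remove overlapping images
        page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
    doc.save(dst_path)
    doc.close()
```

A call would look like `strip_text_layer("input.pdf", "output.pdf")`, where both paths are placeholders. If the pages also contain legitimate text, the rectangles would need to be filtered first, for example by matching the watermark string.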

My code currently looks something like this:


    import base64
    import json
    import os

    import fitz  # PyMuPDF

    # Open the original PDF file
    doc = fitz.open(os.path.join(input_folder, single_filename))

    # Initialize a new PDF to hold the extracted images
    pdf_output = fitz.open()

    # Iterate through the pages in the document
    for page_num in range(doc.page_count):
        page = doc.load_page(page_num)
        output = json.loads(page.get_text("json"))
        blocks = output.get("blocks", [])

        # Only the first block is used; this assumes it is the full-page image
        if blocks and "image" in blocks[0]:
            # Decode the Base64 string into raw image bytes
            image_data = base64.b64decode(blocks[0]["image"])
            img_pix = fitz.Pixmap(image_data)

            # Create a new page with the dimensions of the image
            pdf_page = pdf_output.new_page(width=img_pix.width, height=img_pix.height)

            # Insert the image so that it fills the new page
            pdf_page.insert_image(pdf_page.rect, pixmap=img_pix)

    # Save once, after all pages have been processed (an empty PDF cannot be saved)
    if pdf_output.page_count:
        pdf_output.save("without_watermark/" + single_filename)
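
For the multi-image case, one possible approach is to keep each image at the position it had on the original page instead of sizing the new page to a single image. The following is a minimal sketch, not a verified solution: it assumes unrotated pages, assumes the image blocks reported by `page.get_text("dict")` cover the whole page content, and ignores masks or transparency. The names `image_blocks` and `rebuild_from_images` are hypothetical.

```python
def image_blocks(page_dict):
    """Return (bbox, image_bytes) for every image block in a get_text("dict") result."""
    return [
        (tuple(block["bbox"]), block["image"])
        for block in page_dict.get("blocks", [])
        if block.get("type") == 1  # 1 = image block, 0 = text block
    ]


def rebuild_from_images(src_path, dst_path):
    """Copy only the image blocks of each page into a new PDF, at their original positions."""
    import fitz  # PyMuPDF, imported lazily so image_blocks() has no dependency

    src = fitz.open(src_path)
    out = fitz.open()
    for page in src:
        # The new page has the size of the source page, not of any single image
        new_page = out.new_page(width=page.rect.width, height=page.rect.height)
        for bbox, data in image_blocks(page.get_text("dict")):
            rect = fitz.Rect(bbox)
            if rect.is_empty:
                continue  # skip degenerate boxes, which insert_image rejects
            # keep_proportion=False stretches each tile to fill its bbox exactly
            new_page.insert_image(rect, stream=data, keep_proportion=False)
    out.save(dst_path)
    out.close()
    src.close()
```

In "dict" mode the `image` value is already raw bytes, so no Base64 decoding is needed. Because every tile is drawn into its own `bbox`, the watermark text is simply never copied to the new page.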
