I have been using PyMuPDF to remove watermarks from some PDF documents.
However, some documents prove to be more difficult than others.
In most cases, the watermark is just PDF text overlaid on top of an image (the actual PDF content). If I extract that image and place it on a new PDF page, I get the original page without the watermark.
However, there are some cases where, although the watermark is still just text overlaid on top of an image, the actual PDF page is broken down into multiple images. I can extract those images, but then I have trouble reassembling them into the original page.
I am looking for either an alternative way to remove the watermark or a way to reassemble the images so that the result looks like the original PDF page.
My code currently looks something like this:
import base64
import json
import os

import fitz  # PyMuPDF

# input_folder and single_filename are defined elsewhere in my script
# Open the original PDF file
doc = fitz.open(os.path.join(input_folder, single_filename))
# Initialize a new PDF to hold the images
pdf_output = fitz.open()
# Iterate through the pages in the document
for page_num in range(doc.page_count):
    page = doc.load_page(page_num)
    output = json.loads(page.get_text("json"))
    blocks = output.get("blocks", [])
    # Only the first block is used, and only if it is an image block
    if blocks and "image" in blocks[0]:
        # Decode the Base64 string into raw image bytes
        image_data = base64.b64decode(blocks[0]["image"])
        img_pix = fitz.Pixmap(image_data)
        # Create a new page with the dimensions of the image
        pdf_page = pdf_output.new_page(width=img_pix.width, height=img_pix.height)
        # Insert the image into the new page
        pdf_page.insert_image(pdf_page.rect, pixmap=img_pix)
# Save once after the loop; PyMuPDF cannot save a document with zero pages
if pdf_output.page_count > 0:
    pdf_output.save(os.path.join("without_watermark", single_filename))
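For the pages that are split into several images, one direction that might work is to stop taking only the first block and instead place every image block back at its own bounding box. This is a minimal sketch, not a tested solution. The helper names `image_blocks` and `rebuild_page` are made up for illustration. It assumes that `page.get_text("dict")` reports each image block with `type == 1`, a `bbox`, and the raw bytes under `image`, and that the pages are unrotated so the `bbox` can be reused directly as the target rectangle. Images that rely on a separate transparency mask may not come out identical.

```python
def image_blocks(page_dict):
    """Return (bbox, image_bytes) for each image block of a get_text("dict") result."""
    return [
        (tuple(block["bbox"]), block["image"])
        for block in page_dict.get("blocks", [])
        if block.get("type") == 1  # 1 = image block, 0 = text block (the watermark)
    ]


def rebuild_page(src_page, out_doc):
    """Copy only the image blocks of src_page onto a new page of out_doc.

    src_page is expected to be a PyMuPDF Page and out_doc a PyMuPDF Document.
    Returns the number of images that were placed.
    """
    rect = src_page.rect
    # Keep the original page size so the bounding boxes still line up
    new_page = out_doc.new_page(width=rect.width, height=rect.height)
    placed = 0
    for bbox, data in image_blocks(src_page.get_text("dict")):
        # Fill the original bounding box exactly, even if the image was stretched
        new_page.insert_image(bbox, stream=data, keep_proportion=False)
        placed += 1
    return placed
```

With PyMuPDF installed, this would be called as `rebuild_page(page, pdf_output)` for each `page` in `doc`, followed by a single `pdf_output.save(...)` at the end. Text blocks are never copied, so the watermark text should be left behind, provided it really is text and not part of one of the images.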