How to extract all the images from a PDF page while ignoring the header and footer logo images? #136
Unanswered
thangarajdeivasikamani asked this question in Q&A
Hello Team,
Consider a PDF datasheet such as https://www.st.com/resource/en/datasheet/stm32f205rb.pdf.
I used the code below to extract the images, but I am not getting the images that are actually present in the PDF.

    import pymupdf  # import the bindings

    fname = "stm32f103c8.pdf"  # input PDF file name
    doc = pymupdf.open(fname)  # open document

    # iterate over the pages
    for page in doc:
        img_number = 0  # for enumerating images per page
        # iterate over the image blocks
        for block in page.get_text("dict")["blocks"]:
            # skip if no image block
            if block["type"] != 1:
                continue
            # build filename, like 'img17-3.jpg'
            name = f"img{page.number}-{img_number}.{block['ext']}"
            with open(name, "wb") as out:
                out.write(block["image"])  # write the binary content
            img_number += 1  # increase image counter

Sometimes only the repeated footer logo and the side logo are detected as images and extracted. The actual images are not extracted at all.

I also tried the script below, but it does not extract the required images either:
https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/extract-images/extract-from-xref.py
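
To make the goal concrete, here is a rough sketch of the kind of filtering I have in mind. It is not a tested solution for this datasheet. It assumes that a logo is an image whose xref shows up on at least half of the pages, or whose placement lies entirely within the top or bottom 10% of the page. The helper names `in_margin`, `repeated_xrefs` and `extract_images`, and both thresholds, are made up for illustration.

```python
from collections import Counter


def in_margin(rect, page_height, margin=0.1):
    """True if (x0, y0, x1, y1) lies entirely in the top or bottom band.

    The 10% band is an assumed threshold, not a PyMuPDF default.
    """
    x0, y0, x1, y1 = rect
    band = page_height * margin
    return y1 <= band or y0 >= page_height - band


def repeated_xrefs(xrefs_per_page, min_share=0.5):
    """Return xrefs used on at least min_share of the pages (likely logos)."""
    n = len(xrefs_per_page)
    if n < 2:
        return set()  # repetition is meaningless for a single page
    counts = Counter(x for xrefs in xrefs_per_page for x in set(xrefs))
    return {x for x, c in counts.items() if c / n >= min_share}


def extract_images(fname):
    """Save embedded images that are neither repeated nor margin-only."""
    import pymupdf  # imported here so the helpers above work without it

    doc = pymupdf.open(fname)
    # first element of each get_images() entry is the image xref
    per_page = [[img[0] for img in page.get_images(full=True)] for page in doc]
    logos = repeated_xrefs(per_page)
    saved = []
    for page in doc:
        for xref in set(per_page[page.number]) - logos:
            rects = page.get_image_rects(xref)  # where the image is drawn
            # skip images not shown on this page or placed only in the margins
            if not rects or all(in_margin(tuple(r), page.rect.height) for r in rects):
                continue
            info = doc.extract_image(xref)
            name = f"img{page.number}-{xref}.{info['ext']}"
            with open(name, "wb") as out:
                out.write(info["image"])
            saved.append(name)
    return saved
```

Calling `extract_images("stm32f103c8.pdf")` would write the remaining images to the current directory and return their file names. If the missing figures turn out to be vector drawings rather than embedded raster images (I have not verified this for the datasheet), neither this sketch nor the code above will find them. In that case something like `page.get_drawings()` to locate them and `page.get_pixmap(clip=...)` to render the region would probably be needed instead.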