Getting bounding boxes with different dpi settings #4252

zhuwenfei-wintech · 2025-01-26T07:06:35Z

Currently, the bboxes I get when calling get_text are calculated at dpi=72, so I have to rescale them myself if I need a higher dpi.
Can I get bounding boxes at a chosen dpi, similarly to get_pixmap?
Consider this scenario: to get better performance from a neural network at inference time, I need a higher-resolution image. I then want to combine the bboxes returned by the neural network with the ones extracted directly from the PDF for further processing.
If this cannot be done now, please consider this a feature request. It might be even better if the dpi could be set when opening the PDF.

JorjMcKie · 2025-01-26T08:01:38Z

Maintainer

You are referring to coordinate computation of page content inside the rendered image?

This is no problem at all, whatever resolution you choose when creating the pixmap: just take the image dimensions, which are represented by Pixmap.irect (an IRect object), and compute this matrix: matrix = page.rect.torect(pix.irect).

If you then take any Point or Rect defined for the page (e.g. bounding boxes), then point * matrix or rect * matrix gives the point / rectangle coordinates in the image.
This works the same the other way round, when you want to compute the original page coordinates of a location identified within the image: just use the matrix inverse ~matrix, or again use the Rect/IRect method .torect().
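The mapping described above can be sketched in plain Python, without PyMuPDF, to show the arithmetic that a rectangle-to-rectangle matrix performs. This is only an illustration: the helper names (`torect_matrix`, `apply`, `invert`), the US Letter page size, and the 144 dpi image size are assumptions made for the example, and rectangles are plain `(x0, y0, x1, y1)` tuples rather than PyMuPDF objects.

```python
# Hypothetical helpers that mirror the arithmetic behind
# matrix = page.rect.torect(pix.irect), rect * matrix and ~matrix.
# A scale-plus-translate mapping is stored as (sx, sy, tx, ty).

def torect_matrix(src, dst):
    """Return the mapping that takes rectangle src onto rectangle dst."""
    sx = (dst[2] - dst[0]) / (src[2] - src[0])
    sy = (dst[3] - dst[1]) / (src[3] - src[1])
    tx = dst[0] - src[0] * sx
    ty = dst[1] - src[1] * sy
    return sx, sy, tx, ty

def apply(m, rect):
    """Transform a rectangle (x0, y0, x1, y1) with mapping m."""
    sx, sy, tx, ty = m
    x0, y0, x1, y1 = rect
    return (x0 * sx + tx, y0 * sy + ty, x1 * sx + tx, y1 * sy + ty)

def invert(m):
    """Return the inverse mapping (image coordinates back to page)."""
    sx, sy, tx, ty = m
    return (1 / sx, 1 / sy, -tx / sx, -ty / sy)

page_rect = (0.0, 0.0, 612.0, 792.0)   # assumed page size in points
image_rect = (0, 0, 1224, 1584)        # assumed pixmap size at 144 dpi

m = torect_matrix(page_rect, image_rect)

text_bbox = (72.0, 100.0, 200.5, 112.25)    # a bbox in page coordinates
image_bbox = apply(m, text_bbox)            # the same bbox in pixels
page_bbox = apply(invert(m), image_bbox)    # and back to page coordinates

print(image_bbox)
print(page_bbox)
```

With PyMuPDF itself, the three steps correspond to the expressions quoted in the reply: `page.rect.torect(pix.irect)`, `rect * matrix`, and `~matrix`.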

2 replies

zhuwenfei-wintech Jan 26, 2025
Author

I see what you are suggesting. What I am asking is whether this could be built into a function like get_text, taking a dpi or matrix argument, possibly for better speed. What do you think?

JorjMcKie Jan 26, 2025
Maintainer

No, performance would not benefit from incorporating this - rather the contrary. The standard user unit size in PDF is 72 points per inch in 99.99999% of all cases, and any deviations from this will be taken care of by the base library MuPDF. So page content coordinates will be correct in any case.

From the perspective of text extraction, your use case is peripheral. For the sake of brevity, my above explanation was somewhat imprecise: the bounding box coordinates in the image should be integers, so the correct and complete computation is (rect * matrix).irect. For points there is no similar transformation available.
Given all that, we don't want to bloat text extraction code with this sort of thing.
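To illustrate the integer step mentioned above, here is a minimal plain-Python sketch of scaling a page bbox to pixels and then rounding it outward to whole pixels. The helper `to_irect` and the 300 dpi value are assumptions made for the example. The sketch assumes simple outward rounding (floor for the top-left corner, ceil for the bottom-right); PyMuPDF's own `.irect` may handle borderline values slightly differently.

```python
import math

def to_irect(rect):
    """Round a float rectangle outward to the enclosing integer one.

    Hypothetical stand-in for the .irect step: the top-left corner is
    rounded down and the bottom-right corner is rounded up.
    """
    x0, y0, x1, y1 = rect
    return (math.floor(x0), math.floor(y0), math.ceil(x1), math.ceil(y1))

dpi = 300                                  # assumed rendering resolution
text_bbox = (72.0, 100.0, 200.5, 112.25)   # bbox in page points (72 per inch)

# Scale from points to pixels, then snap to whole pixels.
scaled = tuple(v * dpi / 72 for v in text_bbox)
pixel_bbox = to_irect(scaled)

print(pixel_bbox)
```

Rounding outward keeps the whole text inside the pixel box, which matters when the box is later compared with boxes predicted on the rendered image.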

Answer selected by zhuwenfei-wintech
