Unable to consistently extract field labels from PDFs #3950

Open
Rutvik-Trivedi opened this issue Oct 16, 2024 · 7 comments

Comments

@Rutvik-Trivedi

Description of the bug

For my use case, I am trying to extract widget.field_label from PDF files. I tried this with two PDFs: the field labels are extracted correctly from one, but not from the other. In case it helps, I used Master PDF Editor to add the field labels to the PDFs.

This is the PDF from which I can extract the field labels of all widgets:
working sample.pdf

This is the PDF from which I cannot extract the field labels, even after adding them:
not working sample.pdf

Is this a PDF- or editor-level nuance, or a bug?

How to reproduce the bug

The reproduction of the problem should be fairly simple:

import fitz
doc = fitz.Document("working sample.pdf")  # Or "not working sample.pdf"
for page in doc:
    for widget in page.widgets():
        print(widget.field_label)

PDF files:
working sample.pdf
not working sample.pdf

For working sample.pdf, I get the following output:

{{ firstName }}
{{ lastName }}
{{ address.street }}
{{ address.apt }}
{{ address.zipcode }}
{{ address.city }}
{{ spirit }}
{{ today }}
{{ evil | check }}
{{ language.french | X }}
{{ language.esperento | X }}
{{ language.latin | X }}
{{ sig | paste }}

This is correct and expected: it covers all the available field labels.

For not working sample.pdf, I get the following output:

""
None
None
None

But the expected output for not working sample.pdf should be (not necessarily in the same order):

{{ named_insured }}
{{ insurance_line }}
{{ policy_period_start_date }}
{{ policy_period_end_date }}

These are all the available field labels in the PDF.

PyMuPDF version

1.24.1

Operating system

Linux

Python version

3.10

@JorjMcKie
Collaborator

In this case, the field label is not stored with the field itself, but with its so-called Parent. The current code looks at the field's Parent only for field_name, while it should also do so for field_label.
The fix is trivial and should be available in an upcoming release.
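
To check where a label is actually stored, one option is to look at the raw PDF object of a widget and of its Parent. The following is a minimal sketch, not part of PyMuPDF: the helper name `widget_and_parent_sources` is hypothetical, and it assumes the widget's `/Parent` entry, if present, is an indirect reference. It relies only on `Document.xref_get_key` and `Document.xref_object`.

```python
def widget_and_parent_sources(doc, widget_xref):
    """Return [(xref, object source), ...] for a widget and, if present, its Parent.

    `doc` is expected to behave like a PyMuPDF Document (hypothetical helper).
    """
    sources = [(widget_xref, doc.xref_object(widget_xref))]
    # xref_get_key returns a (type, value) pair; an indirect reference
    # comes back as ("xref", "<number> 0 R").
    kind, value = doc.xref_get_key(widget_xref, "Parent")
    if kind == "xref":
        parent_xref = int(value.split()[0])
        sources.append((parent_xref, doc.xref_object(parent_xref)))
    return sources


def dump_widgets(path):
    """Print the widget and Parent objects for every widget in the file at `path`."""
    import fitz  # PyMuPDF; imported lazily so the helper above stays dependency-free

    doc = fitz.open(path)
    for page in doc:
        for widget in page.widgets():
            for xref, source in widget_and_parent_sources(doc, widget.xref):
                print(f"--- xref {xref} ---")
                print(source)
```

Calling `dump_widgets("not working sample.pdf")` should show whether the `/TU` entry (the PDF key behind field_label) sits on the widget object or on its Parent.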

JorjMcKie added a commit that referenced this issue Oct 16, 2024
Access field label as an **inheritable** dictionary value.
Addresses #3950.
@JorjMcKie JorjMcKie added the fix developed release schedule to be determined label Oct 16, 2024
@Rutvik-Trivedi
Author

Rutvik-Trivedi commented Oct 16, 2024

In this case, the field label is not stored with the field itself, but with its so-called Parent. The current code looks at the field's Parent only for field_name, while it should also do so for field_label. The fix is trivial and should be available in an upcoming release.

Thanks @JorjMcKie. Would it be possible to know an approximate timeline for the stable release of this new version?

@julian-smith-artifex-com
Collaborator

Thanks @JorjMcKie. Would it be possible to know an approximate timeline for the stable release of this new version?

There's a small chance that we will make a new release this week, but it's more likely to be next week.

@julian-smith-artifex-com
Collaborator

Fixed in 1.24.12.

@Rutvik-Trivedi
Author

Rutvik-Trivedi commented Oct 22, 2024

@julian-smith-artifex-com @JorjMcKie thanks for the quick release. I tried running the script again with the latest version (1.24.12) locally. It does work better now, but it still is missing the very first field label from the PDF. When I run this code again on not working sample.pdf, I get only three field label names, while there are four in the PDF.

# Installed beforehand with: pip install --upgrade pymupdf  (gives version 1.24.12; other system details are the same as at the start of the issue)
import fitz
doc = fitz.Document("not working sample.pdf")
for page in doc:
    for widget in page.widgets():
        print(widget.field_label)

I get the following output:

<empty string as the first output ("")>
{{ policy_period_start_date }}
{{ policy_period_end_date }}
{{ insurance_line }}

But the expected output should be

{{ named_insured }}   # This comes as an empty string in the actual output
{{ policy_period_start_date }}
{{ policy_period_end_date }}
{{ insurance_line }}

Is this fixable, or is it due to some PDF-level nuance?
If it is the former, is there anything I can change in the source code locally to try out a quick fix?
If it is the latter, what do I need to consider while editing a PDF so that the field labels are extracted properly?
Thanks

@julian-smith-artifex-com julian-smith-artifex-com removed Fixed in next release fix developed release schedule to be determined labels Oct 22, 2024
julian-smith-artifex-com added a commit that referenced this issue Oct 22, 2024
…n of field_label.

Also recurse to parent if node's string value is empty string. This appears to
be what Adobe does.

Addresses #3950.
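
Until a release containing this commit is available, the lookup described above can be approximated in user code. This is a minimal sketch under stated assumptions, not the actual PyMuPDF implementation: the helper name `inherited_label` is hypothetical, the label is assumed to live under the `/TU` key, and `/Parent` is assumed to be an indirect reference. It walks up the Parent chain and treats an empty string the same as a missing value.

```python
def inherited_label(doc, xref):
    """Return the first non-empty /TU string found on the object or its ancestors.

    `doc` is expected to behave like a PyMuPDF Document (hypothetical helper).
    Returns None if no non-empty label is found.
    """
    seen = set()  # guards against malformed files with circular /Parent links
    while xref not in seen:
        seen.add(xref)
        kind, value = doc.xref_get_key(xref, "TU")
        if kind == "string" and value:
            return value
        # Missing or empty label: continue with the parent, if there is one.
        kind, value = doc.xref_get_key(xref, "Parent")
        if kind != "xref":
            return None
        xref = int(value.split()[0])  # value looks like "<number> 0 R"
    return None


def print_labels(path):
    """Print the (possibly inherited) label of every widget in the file at `path`."""
    import fitz  # PyMuPDF; imported lazily so the helper above stays dependency-free

    doc = fitz.open(path)
    for page in doc:
        for widget in page.widgets():
            print(inherited_label(doc, widget.xref))
```

Whether this recovers the missing first label of not working sample.pdf depends on where that file stores the value, so the result should be checked against the labels shown in the editor.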
@Rutvik-Trivedi
Author

@julian-smith-artifex-com thanks again for the newest fix. Could you please provide me with an estimated time-frame for the next release? Thanks

4 participants
@JorjMcKie @Rutvik-Trivedi @julian-smith-artifex-com and others