doc.save with linear=True PDF fails linearization check with 'overflow reading bit stream' #4263

Open
EddieOner opened this issue Jan 31, 2025 · 3 comments
Labels
upstream bug bug outside this package

Comments

@EddieOner

EddieOner commented Jan 31, 2025

Linearization Error with linear=True Parameters

Problem Description

When a PDF is saved with PyMuPDF using linear=True for web optimization, the output is not linearized correctly and fails integrity checks. The failures were first observed in PDF proxies generated on AWS Lambda running PyMuPDF (Python):
doc.save(output_file_path, garbage=4, clean=True, deflate=True, linear=True)

Further investigation with qpdf showed that the issue is also reproducible on a local machine with the PyMuPDF CLI.

Steps to Reproduce

  1. Start with a non-linearized PDF and run the PyMuPDF CLI:
pymupdf clean -linear input.pdf output.pdf
  2. Check the result with qpdf:
qpdf --check output.pdf
checking output.pdf
PDF Version: 1.7
File is not encrypted
File is linearized
WARNING: output.pdf: error encountered while checking linearization data: overflow reading bit stream: wanted = 32; available = 16
qpdf: operation succeeded with warnings
  3. Observe the linearization warning in the output.
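
For anyone who wants to run the check in step 2 over many files, here is a minimal Python sketch that wraps qpdf --check and picks out the linearization warning. The function names parse_qpdf_check and run_qpdf_check are made up for this example, and it assumes qpdf is on the PATH and prints lines in the same shape as the output above.

```python
import subprocess


def parse_qpdf_check(output: str) -> dict:
    """Summarize the text printed by `qpdf --check` (format as shown in this issue)."""
    lines = [line.strip() for line in output.splitlines()]
    return {
        # qpdf prints this exact line when it finds a linearization dictionary
        "linearized": "File is linearized" in lines,
        # collect every warning line, e.g. the 'overflow reading bit stream' one
        "warnings": [line for line in lines if line.startswith("WARNING:")],
    }


def run_qpdf_check(pdf_path: str) -> dict:
    """Run `qpdf --check` on one file; assumes the qpdf binary is installed."""
    proc = subprocess.run(
        ["qpdf", "--check", pdf_path],
        capture_output=True,
        text=True,
    )
    # Warnings may go to stderr, so both streams are scanned.
    summary = parse_qpdf_check(proc.stdout + "\n" + proc.stderr)
    # qpdf exits with 0 when clean, 2 on errors, 3 when there were only warnings.
    summary["exit_code"] = proc.returncode
    return summary
```

With the output from step 2, parse_qpdf_check would report linearized as True together with one warning, which is exactly the combination that signals this bug: the file claims to be linearized but its linearization data cannot be read.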

Expected vs Actual Behavior

Expected: PDF should save with deflate compression and linearized structure without errors.
Actual: File produces validation errors indicating invalid linearization structure.
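
Note that a file can carry the linearization marker and still be invalid, which is what happens here: qpdf prints "File is linearized" and then fails on the linearization data. As a rough illustration (not a substitute for qpdf --check), the sketch below only tests whether the /Linearized key appears in the first 1024 bytes, which is where the PDF specification requires the linearization parameter dictionary to sit. The function name has_linearization_marker is hypothetical.

```python
def has_linearization_marker(data: bytes) -> bool:
    """Return True if the first 1024 bytes contain the /Linearized key.

    This says nothing about whether the hint tables are valid; the files
    in this issue pass this test and still fail `qpdf --check`.
    """
    return b"/Linearized" in data[:1024]
```

A file that passes this test but produces a warning from qpdf --check matches the "Actual" behavior described above.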

User Impact

If the linearization isn't working correctly:

User Experience: The PDF will still download, but users won't be able to view the first page immediately. Instead, they might have to wait for the entire file to download before they can start reading.

Page Order: The pages might not load in the intended order, which can be confusing and disrupt the reading experience.

Environment Information

  • pymupdf version: 1.25.2
  • qpdf version: 11.9.1
  • OS: macOS 15.3

Additional Context

This seems to happen with almost any PDF file.

Twitter 4_linerarized.pdf
Twitter 4.pdf

How to reproduce the bug

pip install pymupdf
brew install qpdf

PyMuPDF version

1.25.2

Operating system

macOS

Python version

3.11

@JorjMcKie
Collaborator

This is an upstream problem (MuPDF). We will create a report in their issue system.
You can reproduce the issue without PyMuPDF using this MuPDF CLI command:

mutool clean -lggggsz Twitter.4.pdf

Afterwards, running qpdf --check on the generated output file out.pdf shows the same problem.

@JorjMcKie JorjMcKie added the upstream bug bug outside this package label Feb 1, 2025
@JorjMcKie
Collaborator

Here is the link to MuPDF's issue: https://bugs.ghostscript.com/show_bug.cgi?id=708278

@EddieOner
Author

EddieOner commented Feb 3, 2025

@JorjMcKie I added more detail on how this affects end users (see User Impact in the issue description) :)

@EddieOner EddieOner reopened this Feb 3, 2025