Some PDF documents cannot be parsed #57

Open
1 task done
tiamjiakun opened this issue Jul 30, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@tiamjiakun

Initial Checks

  • I confirm that I'm on the latest version

Description

Screenshots: iShot_2024-07-30_16 55 01, iShot_2024-07-30_16 54 33

Attachments: [example1.pdf](https://github.com/user-attachments/files/16424947/example1.pdf), [example2.pdf](https://github.com/user-attachments/files/16424951/example2.pdf)

Example Code

import openparse
from openparse import DocumentParser
from IPython.display import display

pdf_path = "/Users/tjk/Desktop/ceshi_pdf/example1.pdf"
parser = DocumentParser(
    table_args={"parsing_algorithm": "pymupdf"}
)
parsed_content = parser.parse(pdf_path)

Python, open-parse & OS Version

               python_version: 3.8.18
             operating_system: Darwin
                   os_version: 23.0.0
           open-parse version: 0.5.7
                 install path: /Users/tjk/miniconda3/envs/pytorch/lib/python3.8/site-packages/openparse
               python version: 3.8.18 (default, Sep 11 2023, 08:17:16)  [Clang 14.0.6 ]
                     platform: macOS-14.0-arm64-arm-64bit
             related packages: tokenizers-0.19.1 PyMuPDF-1.24.9 torchvision-0.18.1 transformers-4.43.1 torch-2.3.1 pydantic-2.8.2

tiamjiakun added the bug label Jul 30, 2024

tiamjiakun commented Aug 12, 2024

waiting

@jaredmcqueen

+1 on this. pdfminer struggles with a large number of the documents I'm testing with. pymupdf, on the other hand, opens anything I throw at it flawlessly. Setting ocr=True switches text extraction to pymupdf, but that path also carries additional logic geared toward OCR.
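
The workaround described above can be sketched as a small fallback wrapper: try the default parse first, and retry with ocr=True only if it raises. This is a minimal sketch, not part of open-parse. `parse_with_fallback` is a hypothetical helper name, and it assumes the parse callable accepts an `ocr` keyword argument as the comment above suggests.

```python
from typing import Any, Callable


def parse_with_fallback(parse: Callable[..., Any], pdf_path: str) -> Any:
    """Call parse(pdf_path); if that raises, retry once with ocr=True.

    `parse` is any callable with an optional `ocr` keyword argument
    (assumed here to match DocumentParser.parse in open-parse).
    """
    try:
        # Default path: per the thread, this is the pdfminer-based extraction.
        return parse(pdf_path)
    except Exception as first_error:
        try:
            # Fallback path: per the thread, ocr=True switches to pymupdf.
            return parse(pdf_path, ocr=True)
        except Exception as second_error:
            # Keep the original failure attached for easier debugging.
            raise second_error from first_error
```

With open-parse installed, the call would presumably look like `parse_with_fallback(DocumentParser().parse, "example1.pdf")`. Catching the broad `Exception` is a deliberate simplification, since the thread does not say which exception the failing documents raise.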

jaredmcqueen commented Sep 8, 2024

This seems to be a pdfminer issue:
pdfminer/pdfminer.six#1004
NixOS/nixpkgs#339919

There's a fix now, but you'll have to wait until it gets released, which could be a while.
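
Since the fix depends on a future pdfminer.six release, it can help to check which versions are actually installed in the environment. The sketch below uses only the standard library. `installed_versions` is a hypothetical helper, and the distribution names passed to it are assumptions based on the package list in the issue. The thread does not say which release will carry the fix, so the script only reports versions and does not compare them against anything.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions taken from the issue's package list.
    report = installed_versions(["pdfminer.six", "PyMuPDF", "openparse"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```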
