Some PDF documents cannot be parsed #57

tiamjiakun · 2024-07-30T08:58:11Z

Initial Checks

I confirm that I'm on the latest version

Description

[example1.pdf](https://github.com/user-attachments/files/16424947/example1.pdf) [example2.pdf](https://github.com/user-attachments/files/16424951/example2.pdf)

Example Code

from openparse import DocumentParser

# Path to one of the attached PDFs that fails to parse.
pdf_path = "/Users/tjk/Desktop/ceshi_pdf/example1.pdf"

# Use the pymupdf algorithm for table parsing.
parser = DocumentParser(
    table_args={"parsing_algorithm": "pymupdf"},
)
parsed_content = parser.parse(pdf_path)

Python, open-parse & OS Version

python_version: 3.8.18
             operating_system: Darwin
                   os_version: 23.0.0
           open-parse version: 0.5.7
                 install path: /Users/tjk/miniconda3/envs/pytorch/lib/python3.8/site-packages/openparse
               python version: 3.8.18 (default, Sep 11 2023, 08:17:16)  [Clang 14.0.6 ]
                     platform: macOS-14.0-arm64-arm-64bit
             related packages: tokenizers-0.19.1 PyMuPDF-1.24.9 torchvision-0.18.1 transformers-4.43.1 torch-2.3.1 pydantic-2.8.2

tiamjiakun · 2024-08-12T02:12:21Z

waiting

jaredmcqueen · 2024-09-08T03:37:46Z

+1 on this. pdfminer struggles with a large number of the documents I'm testing with. pymupdf, on the other hand, opens anything I throw at it flawlessly. Setting ocr=True switches parsing over to pymupdf, but it also brings in additional OCR-specific logic.
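A minimal sketch of the workaround described above: try the default parse first, then retry with ocr=True. It assumes the pdfminer failure surfaces as a Python exception (the issue does not say how it fails) and that `parse` accepts an `ocr` keyword as mentioned in this comment. `first_success` and `parse_pdf` are hypothetical helper names, not part of open-parse.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_success(attempts: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Run each labelled attempt in order; return (label, result) of the first that works."""
    errors = []
    for label, attempt in attempts:
        try:
            return label, attempt()
        except Exception as exc:  # broad on purpose: parser failures vary by document
            errors.append(f"{label}: {exc!r}")
    raise RuntimeError("all attempts failed: " + "; ".join(errors))


def parse_pdf(path: str):
    """Hypothetical wrapper: default parse first, then the ocr=True path."""
    # Imported lazily so first_success stays usable without open-parse installed.
    from openparse import DocumentParser

    parser = DocumentParser()
    return first_success([
        ("default (pdfminer)", lambda: parser.parse(path)),
        # Assumption taken from this thread: ocr=True makes open-parse use pymupdf.
        ("ocr=True (pymupdf)", lambda: parser.parse(path, ocr=True)),
    ])
```

Calling `parse_pdf("example1.pdf")` would return a label naming the attempt that worked together with the parsed document. Note that the OCR path may produce different output from the default one, so the label is worth logging.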

jaredmcqueen · 2024-09-08T03:56:02Z

seems to be pdfminer:
pdfminer/pdfminer.six#1004
NixOS/nixpkgs#339919

there's a fix now but you'll have to wait until it gets released, which could be a while.
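To tell whether a new pdfminer release has reached your environment, you can print the installed versions and compare them with the upstream release notes. This is only a sketch: `installed_version` is a hypothetical helper, and the distribution names below are assumed to match what pip reports. The thread does not say which release will carry the fix, so no version number is checked here.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them to match `pip list`.
for name in ("pdfminer.six", "PyMuPDF", "openparse"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```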

tiamjiakun added the "bug" label (Something isn't working) on Jul 30, 2024
