pypdf_table_extraction (camelot) and gmft? #174

Open
conjuncts opened this issue Oct 12, 2024 · 4 comments

@conjuncts

conjuncts commented Oct 12, 2024

Hello,

Thank you so much for continuing the development of camelot! I'm glad to see that camelot continues to be maintained.

I also maintain a PDF table extraction library, gmft. My goal is to wrap multiple PDF table extraction options behind one consistent format. I think camelot is a great option for its high throughput and its focus on non-deep-learning detection. Consequently, I would love to support integration between camelot and gmft. I normally try to bridge the gap myself, but given the complexity and differing approaches of the two libraries, I think this will require some mutual cooperation. Please let me know what you think!

Challenges:

  • My goal with gmft is to be pdf parser agnostic. I currently support pymupdf and pypdfium2. Meanwhile, pypdf_table_extraction uses solely pypdf.
  • camelot has a different set of dependencies (ghostscript, poppler, sqlite3)

Edit: this is not a bug, oops!

conjuncts added the bug (Something isn't working) label Oct 12, 2024
@stefan6419846

Could you please elaborate why you consider the mentioned aspects challenges? pypdf_table_extraction is a library which you can call by its public APIs - IMHO it should not really matter how it is implemented internally and/or which external packages it depends on.

bosd removed the bug (Something isn't working) label Oct 12, 2024
@conjuncts
Author

Sure. This is how my library is structured:

Documents: PyMuPDFDocument and PyPDFium2Document
Detectors: TATRDetector and Img2TableDetector
Structure analysis: TATRFormatter, hopefully CamelotFormatter.

Basically, it's the abstraction. From my perspective, pdf handling (ghostscript, poppler, pdfminer.six) can be abstracted into these features:

  • get word text content and bboxes
  • get an image of the page

The pdf handling can be encapsulated into that common interface. After that point, things become interchangeable, and the table structure algorithm does not need to know the pdf handler method. I think it's nice to separate the pdf handling logic from the table recognition logic.
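A minimal sketch of what such a common interface might look like. The names `Word`, `PageLike`, `InMemoryPage`, and `words_in_region` are hypothetical and are not taken from gmft or camelot; the toy backend stands in for a real wrapper around pymupdf or pypdfium2.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class Word:
    """A word with its bounding box (hypothetical type)."""
    text: str
    bbox: BBox


class PageLike(Protocol):
    """Hypothetical minimal page interface: words with bboxes, plus an image."""

    def get_words(self) -> Iterable[Word]: ...

    def get_image(self, dpi: int = 72) -> object: ...


class InMemoryPage:
    """Toy backend for illustration; a real one would wrap a PDF parser."""

    def __init__(self, words: List[Word], width: float, height: float):
        self._words = words
        self.width = width
        self.height = height

    def get_words(self) -> List[Word]:
        return list(self._words)

    def get_image(self, dpi: int = 72) -> Tuple[int, int]:
        # Placeholder: return pixel dimensions instead of a real bitmap.
        scale = dpi / 72
        return (round(self.width * scale), round(self.height * scale))


def words_in_region(page: PageLike, region: BBox) -> List[str]:
    """Table-side code: depends only on PageLike, not on any PDF parser."""
    rx0, ry0, rx1, ry1 = region
    return [
        w.text
        for w in page.get_words()
        if w.bbox[0] >= rx0 and w.bbox[1] >= ry0
        and w.bbox[2] <= rx1 and w.bbox[3] <= ry1
    ]
```

Because `words_in_region` only sees `PageLike`, swapping the backend would not require touching the table-side code.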

The challenge is that for camelot to be a detector or a formatter in my library, it would need to work with pymupdf or pypdfium2, even though camelot internally uses pdfminer.six/ghostscript/poppler.

I was worried that camelot would be tightly coupled to pdfminer.six, but on a closer look at lattice.py, it actually seems surprisingly doable. Focusing solely on getting lattice.py to work with pypdfium2:

  • backend can be swapped out
  • only thing that relies on pdfminer.six is text_in_bbox_per_axis() from utils.py
  • the data comes from prepare_page_parse() in BaseParser
  • the text bboxes come from get_text_objects() in utils.py
  • this gives LTChar, LTImage, LTTextLineHorizontal, LTTextLineVertical
  • could probably swap that with text objects from pymupdf or pypdfium2
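One detail to watch when swapping in text objects from another parser is the coordinate convention. pdfminer.six reports boxes in PDF space (origin at the bottom-left), while PyMuPDF's default coordinates put the origin at the top-left with y increasing downward. A sketch of the conversion, assuming an unrotated page; the helper name `top_left_to_bottom_left` is hypothetical:

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def top_left_to_bottom_left(bbox: BBox, page_height: float) -> BBox:
    """Convert (x0, top, x1, bottom) with a top-left origin into
    (x0, y0, x1, y1) with a bottom-left origin.

    Assumes an unrotated page; rotation would need extra handling.
    """
    x0, top, x1, bottom = bbox
    return (x0, page_height - bottom, x1, page_height - top)
```

For example, on a 792 pt tall page, a box spanning 20 to 32 pt from the top ends up spanning 760 to 772 pt from the bottom.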

The fact that LTObjects are stored internally (as horizontal_text, vertical_text, all_textlines) does make it messy, but it might still be doable:

  • initialize an LTTextLineHorizontal based on text and bbox
  • or, try to mock these fields:
  • get_text(), bbox, x0, x1, y0, y1, height, width, matrix, more?
  • make sure that the mocked LTObject works with every function that reads a textline (e.g. text_in_bbox, textlines_overlapping_bbox)
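A rough sketch of the mocking route, under the assumption that downstream code only reads the fields listed above. `MockTextLine` and `lines_inside` are hypothetical names; `lines_inside` is a simplified stand-in for a bbox filter and is not camelot's `text_in_bbox`. The `matrix` field is left out here.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]


@dataclass
class MockTextLine:
    """Duck-typed stand-in for a pdfminer.six text line (hypothetical)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def bbox(self) -> BBox:
        return (self.x0, self.y0, self.x1, self.y1)

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    def get_text(self) -> str:
        return self.text


def lines_inside(bbox: BBox, textlines: List[MockTextLine]) -> List[MockTextLine]:
    """Keep text lines whose center point falls inside bbox.

    Simplified illustration only; the real filters may use other rules.
    """
    x0, y0, x1, y1 = bbox
    return [
        t for t in textlines
        if x0 <= (t.x0 + t.x1) / 2 <= x1 and y0 <= (t.y0 + t.y1) / 2 <= y1
    ]
```

Whether this is enough would depend on which attributes each textline-reading function actually touches, so each of them would need to be checked against the mock.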

I guess camelot having external dependencies isn't really an issue - that's true. What remains difficult is adapting camelot to support an entirely different pdf parser. Beyond just integration into my library, I think there are also conceptual advantages to this abstraction. So I might send in a pull request, but obviously it will be very messy.

@bosd
Collaborator

bosd commented Oct 13, 2024

> The challenge is that for camelot to be a detector or a formatter in my library, it would need to work with pymupdf or pypdfium2

Currently I'm working on the pypdfium2 integration.
Soon I'll have a PR ready.

@bosd
Collaborator

bosd commented Nov 10, 2024

The PR for pdfium is in #230 and has been merged.
So it should now be possible to integrate with gmft.
