pypdf_table_extraction (camelot) and gmft? #174

conjuncts · 2024-10-12T03:13:04Z

Hello,

Thank you so much for continuing the development of camelot! I'm glad to see that camelot continues to be maintained.

I happen to also maintain a pdf extraction library, gmft. My goal is to encapsulate multiple pdf table extraction options into one consistent format. I think camelot is a great option for its high throughput and its focus on non-deep detection. Consequently, I would love to support integration between camelot and gmft. I normally try to bridge the gap myself, but given the complexity and differing approaches of the two libraries, I think this will require some mutual cooperation. Please let me know what you think!

Challenges:

- My goal with gmft is to be pdf parser agnostic. I currently support pymupdf and pypdfium2. Meanwhile, pypdf_table_extraction uses solely pypdf.
  - (unless "Drop Poppler and switch to pdfium2 Backend?" #89 goes through.)
- camelot has a different set of dependencies (ghostscript, poppler, sqlite3)

Edit: this is not a bug, oops!

stefan6419846 · 2024-10-12T08:48:49Z

Could you please elaborate why you consider the mentioned aspects challenges? pypdf_table_extraction is a library which you can call by its public APIs - IMHO it should not really matter how it is implemented internally and/or which external packages it depends on.

conjuncts · 2024-10-12T20:39:25Z

Sure. This is how my library is structured:

- Documents: PyMuPDFDocument and PyPDFium2Document
- Detectors: TATRDetector and Img2TableDetector
- Structure analysis: TATRFormatter, hopefully CamelotFormatter.

Basically, it's the abstraction. From my perspective, pdf handling (ghostscript, poppler, pdfminer.six) can be abstracted into these features:

- get word text content and bboxes
- get an image of the page

The pdf handling can be encapsulated into that common interface. After that point, things become interchangeable, and the table structure algorithm does not need to know the pdf handler method. I think it's nice to separate the pdf handling logic from the table recognition logic.
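To make the idea concrete, here is a minimal sketch of what such a common interface could look like. The names `Word`, `PageLike`, `InMemoryPage` and `words_in_region` are hypothetical and invented for illustration; they are not gmft's or camelot's actual API. The toy page stands in for a real pymupdf or pypdfium2 wrapper.

```python
from dataclasses import dataclass
from typing import Iterator, List, Protocol, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class Word:
    """A piece of text plus its bounding box (hypothetical type)."""
    text: str
    bbox: BBox


class PageLike(Protocol):
    """The two features a pdf handler would need to expose (hypothetical)."""

    def words(self) -> Iterator[Word]: ...

    def render(self, dpi: int = 72) -> bytes: ...


class InMemoryPage:
    """Toy page backed by a list, standing in for a real parser wrapper."""

    def __init__(self, words: List[Word], image: bytes = b"") -> None:
        self._words = list(words)
        self._image = image

    def words(self) -> Iterator[Word]:
        return iter(self._words)

    def render(self, dpi: int = 72) -> bytes:
        # A real wrapper would rasterize the page; here we return stored bytes.
        return self._image


def words_in_region(page: PageLike, region: BBox) -> List[str]:
    """Table-side logic that only sees PageLike, never the underlying parser."""
    x0, y0, x1, y1 = region
    return [
        w.text
        for w in page.words()
        if w.bbox[0] >= x0 and w.bbox[1] >= y0
        and w.bbox[2] <= x1 and w.bbox[3] <= y1
    ]
```

Because `words_in_region` depends only on the protocol, any backend that can produce words with bboxes and a page image could be swapped in without touching it.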

The challenge is that for camelot to be a detector or a formatter in my library, it would need to work with pymupdf or pypdfium2, even though camelot internally uses pdfminer.six/ghostscript/poppler.

I was worried that camelot would be tightly coupled to pdfminer.six, but on a closer look at lattice.py, it actually seems surprisingly doable. Solely focusing on getting lattice.py to work with pypdfium2:

- backend can be swapped out
- only thing that relies on pdfminer.six is text_in_bbox_per_axis() from utils.py
  - the data comes from prepare_page_parse() in BaseParser
  - the text bboxes come from get_text_objects() from utils.py
    - gives LTChar, LTImage, LTTextLineHorizontal, LTTextLineVertical
    - could probably swap that with text objects from pymupdf or pypdfium2

The fact that LTObjects are stored internally (as horizontal_text, vertical_text, all_textlines) does make it messy, but might still be doable:

- initialize an LTTextLineHorizontal based on text and bbox
- or, try to mock these fields:
  - get_text(), bbox, x0, x1, y0, y1, height, width, matrix, more?
  - make sure that the mocked LTObject works on every function that reads a textline (i.e. text_in_bbox, textlines_overlapping_bbox)
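A rough sketch of the mocking option, assuming duck typing is enough for the functions that read textlines. `MockTextLine` and `lines_centered_in_bbox` are hypothetical names; the filter is a simplified stand-in, not camelot's actual text_in_bbox. The `matrix` field is left out because I have not checked which code paths read it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MockTextLine:
    """Duck-typed stand-in for a pdfminer.six text line (hypothetical)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def bbox(self) -> Tuple[float, float, float, float]:
        return (self.x0, self.y0, self.x1, self.y1)

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    def get_text(self) -> str:
        return self.text


def lines_centered_in_bbox(
    bbox: Tuple[float, float, float, float], lines: List[MockTextLine]
) -> List[MockTextLine]:
    """Keep the lines whose center point falls inside bbox (simplified filter)."""
    bx0, by0, bx1, by1 = bbox
    kept = []
    for line in lines:
        cx = (line.x0 + line.x1) / 2
        cy = (line.y0 + line.y1) / 2
        if bx0 <= cx <= bx1 and by0 <= cy <= by1:
            kept.append(line)
    return kept
```

The text and coordinates could come from pymupdf or pypdfium2; whether an object like this survives every textline-reading function in camelot would still need to be tested.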

I guess camelot having external dependencies isn't really an issue - that's true. What remains difficult is adapting camelot to support an entirely different pdf parser. Beyond just integration into my library, I think there are also conceptual advantages to this abstraction. So I might send in a pull request, but obviously it will be very messy.

bosd · 2024-10-13T09:06:14Z

> The challenge is that for camelot to be a detector or a formatter in my library, it would need to work with pymupdf or pypdfium2

Currently I'm working on the pypdfium2 integration.
Soon I'll have a PR ready.

bosd · 2024-11-10T00:11:54Z

The PR for pdfium is in #230 and has been merged.
So it should be possible to integrate with gmft.

conjuncts added the "bug" label on Oct 12, 2024

bosd removed the "bug" label on Oct 12, 2024
