pypdf_table_extraction (camelot) and gmft? #174
Comments
Could you please elaborate why you consider the mentioned aspects challenges?
Sure. This is how my library is structured: Documents: Basically, it's the abstraction. From my perspective, pdf handling (ghostscript, poppler, pdfminer.six) can be abstracted into these features:
The pdf handling can be encapsulated behind that common interface. After that point, backends become interchangeable, and the table structure algorithm does not need to know which pdf handler is in use. I think it's nice to separate the pdf handling logic from the table recognition logic.
The challenge is that for camelot to be a detector or a formatter in my library, it would need to work with pymupdf or pypdfium2, even though camelot internally uses pdfminer.six/ghostscript/poppler. I was worried that camelot would be tightly coupled to pdfminer.six, but on a closer look at lattice.py, it actually seems surprisingly doable. Focusing solely on getting lattice.py to work with pypdfium2:
The fact that LTObjects are stored internally (as
I guess camelot having external dependencies isn't really an issue - that's true. What remains difficult is adapting camelot to support an entirely different pdf parser. Beyond just integration into my library, I think there are also conceptual advantages to this abstraction. So I might send in a pull request, but obviously it will be very messy.
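To make the abstraction concrete, here is a minimal sketch of what a common page interface could look like. Everything in it is an assumption for illustration: `PageLike`, `InMemoryPage`, and `group_into_rows` are hypothetical names, not the API of gmft or camelot, and the row-grouping step is a toy stand-in for a real table structure algorithm. The point is only that the algorithm sees words and bounding boxes, never the pdf backend itself.

```python
from dataclasses import dataclass
from typing import Iterator, List, Protocol, Tuple

# (x0, y0, x1, y1, text): a word and its bounding box in page coordinates.
Word = Tuple[float, float, float, float, str]


class PageLike(Protocol):
    """Hypothetical common interface that any pdf backend would implement."""

    width: float
    height: float

    def words(self) -> Iterator[Word]:
        ...


@dataclass
class InMemoryPage:
    """Stand-in backend for illustration only.

    A real backend would wrap pypdfium2, pymupdf, or pdfminer.six
    and produce the same (x0, y0, x1, y1, text) tuples.
    """

    width: float
    height: float
    _words: List[Word]

    def words(self) -> Iterator[Word]:
        return iter(self._words)


def group_into_rows(page: PageLike, tolerance: float = 2.0) -> List[List[str]]:
    """Toy, backend-agnostic structure step: bucket words into rows by y0."""
    rows: List[Tuple[float, List[Word]]] = []
    # Visit words top to bottom, then left to right.
    for word in sorted(page.words(), key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1][0] - word[1]) <= tolerance:
            rows[-1][1].append(word)  # close enough to the current row
        else:
            rows.append((word[1], [word]))  # start a new row
    # Within each row, order words left to right and keep only the text.
    return [[w[4] for w in sorted(ws, key=lambda w: w[0])] for _, ws in rows]
```

Because `group_into_rows` depends only on `PageLike`, swapping the pdf parser would mean writing one new adapter class rather than touching the recognition logic.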
I'm currently working on the pypdfium2 integration.
The PR for pdfium is in #230 and has been merged.
Hello,
Thank you so much for continuing the development of camelot! I'm glad to see that camelot continues to be maintained.
I also happen to maintain a pdf extraction library, gmft. My goal is to encapsulate multiple pdf table extraction options into one consistent format. I think camelot is a great option for its high throughput and its focus on non-deep detection. Consequently, I would love to support integration between camelot and gmft. I normally try to bridge the gap myself, but given the complexities and differing approaches of the two libraries, I think it would require some sort of mutual cooperation. Please let me know what you think!
Challenges:
Edit: this is not a bug, oops!