Extending PDF to Text Capability #6377
Unanswered
demongolem-biz2
asked this question in
Questions
Replies: 1 comment
-
Hello, @demongolem-biz2! Things might be clearer and more modular. We are working in this direction for the upcoming Haystack 2.0. (My impression is that I suggest the following approach:
-
As part of a pipeline, I would like to convert PDF files to text (most of my inputs are PDF files). There are a number of things I would like to remove as part of a cleaning process. For one, my documents have headers and footers on each page, which should be removed because they create extra noise. Other things are quite standard: captions on figures, which would otherwise bleed into the rest of the text; mathematical equations, which often do not survive conversion; and so on.
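To make the header and footer case concrete, here is a minimal sketch in plain Python, independent of Haystack. It assumes the converted text separates pages with form feed characters, as many PDF-to-text tools do, and that a header or footer is the first or last line of a page. The function name `strip_repeated_margins` and the `min_fraction` threshold are hypothetical choices for illustration only.

```python
from collections import Counter


def strip_repeated_margins(text: str, min_fraction: float = 0.6) -> str:
    """Drop first/last page lines that recur on most pages.

    Assumption: pages are separated by form feed characters.
    """
    pages = [page.strip("\n").split("\n") for page in text.split("\f")]
    # Ignore pages that contain only whitespace.
    pages = [lines for lines in pages if any(line.strip() for line in lines)]
    if len(pages) < 2:
        return text  # Nothing to compare against.

    def normalise(line: str) -> str:
        # Mask digits so "Page 3" and "Page 4" count as the same line.
        return "".join("#" if ch.isdigit() else ch for ch in line.strip())

    first = Counter(normalise(lines[0]) for lines in pages)
    last = Counter(normalise(lines[-1]) for lines in pages)
    threshold = min_fraction * len(pages)

    cleaned = []
    for lines in pages:
        head = normalise(lines[0])
        if head and first[head] >= threshold:
            lines = lines[1:]  # Repeated header.
        if lines:
            tail = normalise(lines[-1])
            if tail and last[tail] >= threshold:
                lines = lines[:-1]  # Repeated footer.
        cleaned.append("\n".join(lines))
    return "\f".join(cleaned)
```

This only handles single-line headers and footers; multi-line running heads would need the same comparison applied to the first and last few lines of each page.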
I see that Haystack provides two mechanisms to convert PDFs, and I don't quite understand why both exist:
The latter has a clean_func parameter which I can see is very useful. The former has a parameter to suppress tables which could be useful in some situations.
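As an illustration of the kind of function that could be passed as clean_func, here is a hedged sketch. It assumes clean_func accepts a callable that takes the extracted text as a string and returns the cleaned string; check the Haystack documentation for the exact contract. The name `remove_caption_lines` and the caption pattern are hypothetical, and the pattern only catches captions that fit on one line.

```python
import re

# Hypothetical pattern: a line that starts with "Figure 3:", "Fig. 3." or "Table 2:".
CAPTION_RE = re.compile(r"^\s*(?:Figure|Fig\.|Table)\s+\d+\s*[:.].*$", re.IGNORECASE)


def remove_caption_lines(text: str) -> str:
    """Return the text without lines that look like figure or table captions."""
    kept = [line for line in text.split("\n") if not CAPTION_RE.match(line)]
    return "\n".join(kept)
```

Requiring a colon or period after the number keeps ordinary sentences such as "Figure 2 shows the trend." in the text, at the cost of missing captions that use other punctuation.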
What is the best way to clean the extracted text? Does either mechanism excel? I would guess the second, because it is more customizable. Is there a way to drill down to a deeper level within the Haystack codebase? Would a tool external to Haystack perhaps meet my needs better?