Categorise table of contents as a new category #126

Open
tskvivekmani opened this issue Oct 4, 2024 · 1 comment · May be fixed by #132


Context

We are building a RAG-based solution that involves ingesting a large number of PDF files. We integrated docling, and it does a great job at PDF parsing, especially table extraction.

Expectation

Docling already strips the headers and footers from each file, which is exactly what we needed as part of cleanup.

Feature requirement

Many files in our knowledge base contain table-of-contents (ToC) pages. Ingesting them creates a lot of noise during retrieval. When we discussed this with @cau-git, he mentioned that ToC pages could be assigned a new category of their own instead of being classified as tables.
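
To illustrate the kind of filtering this request is about, here is a minimal sketch of a heuristic post-filter that could be applied to text chunks before indexing. It is not part of docling's API: the names `looks_like_toc` and `drop_toc_chunks`, the regular expression, and the 0.6 threshold are all assumptions chosen for illustration. It assumes ToC entries end in a page number preceded by dot leaders, a tab, or wide spacing.

```python
import re

# Hypothetical heuristic, not a docling feature: a line looks like a ToC entry
# if it ends with a page number preceded by dot leaders, a tab, or wide spacing.
TOC_LINE = re.compile(r"^.+?(\.{3,}|\s{2,}|\t)\s*\d{1,4}$")


def looks_like_toc(lines, threshold=0.6):
    """Return True if at least `threshold` of the non-empty lines look like ToC entries."""
    candidates = [ln.strip() for ln in lines if ln.strip()]
    if not candidates:
        return False
    hits = sum(1 for ln in candidates if TOC_LINE.match(ln))
    return hits / len(candidates) >= threshold


def drop_toc_chunks(chunks, threshold=0.6):
    """Keep only the text chunks that do not look like table-of-contents pages."""
    return [c for c in chunks if not looks_like_toc(c.splitlines(), threshold)]
```

A heuristic like this is fragile (it will miss ToC layouts without leaders and may misfire on numeric tables), which is why a dedicated ToC category produced by the parser itself would be the more reliable way to exclude these pages.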

@cau-git cau-git linked a pull request Oct 8, 2024 that will close this issue